1 Introduction

Sinon is a Java tool that extracts textual information from Web sites. In other words, it can scrape any kind of text (HTML included) available on the Internet or in a filesystem. An XML configuration file written by the user tells Sinon what steps it must execute to reach the desired information. Page downloads and cookie management are transparent to the user. The extracted information and the extraction status are made available through the Sinon API.

The Sinon API does not provide any data manipulation or storage services. It only extracts data and notifies the listener classes specified in the XML configuration file of data extraction events. The Sinon API is described in section 4.

Note: Sinon 0.8 targeted the download and handling of textual documents (HTML, text, XML, etc.). Beginning with version 0.9, Sinon handles binary documents as well as textual ones. Consequently, where this document says "page", read "document" instead, as the latter covers both types of Web content.

1.1 Usage

Sinon can be used in two ways:

  • Command-line:
    java -cp [classpath] -jar sinon-0.9.jar <collector file> [catalog name]
  • Embedded in your application:
    1. Use some CollectorDAO implementation to load a Collector instance from the filesystem or other source.
    2. Create a CollectorExecutor instance and then call one of its execute methods.

1.2 License

Sinon is open-source software created and maintained by Eteg Internet Ltda. It is available under the Apache License 2.0, which can be found at http://www.apache.org/licenses/LICENSE-2.0.html.

1.3 Dependencies

Sinon has the following dependencies:

2 Basic concepts

2.1 Configuration

A collector configuration (or just configuration) is a set of information created by the user to tell Sinon how to locate and extract the desired data. It comprises a page request configuration (HTTP headers sent to the server, think time, connection timeout, and number of attempts to download a page), a catalog listener configuration, and catalogs.

Configurations are represented in Sinon by Collector instances and are typically stored in XML files. Instances of XMLCollectorDAO read Sinon XML files and return Collector instances. The Sinon 0.9 configuration DTD is http://sinon.sourceforge.net/dtd/sinon-0.8.dtd (the DTD has not changed between versions 0.8 and 0.9) and its tags are described in section 3. By default, Sinon does not validate XML input and ignores unknown tags and attributes.

2.2 Listeners

Listeners are objects, specified by the configuration, that are notified when certain events occur. There are three types: catalog listeners (catalog started, catalog finished) (section 4.2), page listeners (page start, page downloaded, page processed, page finished) (section 4.5), and error listeners (section 4.8).

2.3 Catalog

Each catalog represents a path to some desired information and has one or more pages and one or more start pages. Different catalogs inside the same configuration represent different data to extract.

2.4 Start page

A start page represents one way to start a catalog. It names the initial page and defines a set of variables.

2.5 Page

A page represents information about how to extract the desired data from a given Web page, what page listener will be notified and what actions to take.

2.6 Variable

Extracted data and other values are stored in variables (also called properties). They can be simple, holding only a string, or multivalued (list variables), holding a list of strings. Variables have two purposes: to hold extracted data and to construct strings, especially URLs.

Almost all tag and attribute values can be constructed using simple variables (multivalued ones cannot be used in this case). Every occurrence of ${someVariableName} is replaced by the current value of the variable someVariableName. For example:

<page url="http://www.example.br/${pageId}.html" ...

If the value of pageId is sinon, the URL used by that page is http://www.example.br/sinon.html
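
The substitution rule can be pictured with a minimal Java sketch. It is not Sinon code: the class and method names are hypothetical, and replacing an unknown variable with an empty string is an assumption made only for this illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the Sinon API: expands ${name} references. */
final class VariableInterpolation {

    // Matches ${someVariableName}; group 1 captures the variable name.
    private static final Pattern REFERENCE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} with its value. Unknown names become "" (an assumption). */
    static String interpolate(String template, Map<String, String> variables) {
        Matcher matcher = REFERENCE.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = variables.getOrDefault(matcher.group(1), "");
            // quoteReplacement stops '$' and '\' in a value from being read as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = Map.of("pageId", "sinon");
        // prints "http://www.example.br/sinon.html"
        System.out.println(interpolate("http://www.example.br/${pageId}.html", variables));
    }
}
```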

Variables are set using the data-extraction and property tags.

2.7 Think time

Think time is a pause between two consecutive page requests to the same server. Its main purpose is to emulate the behavior of a human user on a website. It is provided by classes that implement the ThinkTime interface (section 4.10).

2.8 How Sinon executes collectors

The algorithm used by Sinon to execute collectors is:

  • For every catalog in configuration
    • Remove all variables
    • Notify the catalog listener that the current catalog was started
    • Set the catalog global variables
    • For every start page of the catalog
      • Set the start page variables
      • Process page
    • Notify the catalog listener that the current catalog was finished

The page processing algorithm is:

  • Notify the listener that the page state was changed to "before download"
  • Download page
  • Notify the listener that the page state was changed to "before processing"
  • Execute page extraction steps
  • Notify the listener that the page state was changed to "after processing"
  • Execute page actions
  • Notify the listener that the page state was changed to "finished"
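
The two algorithms above can be pictured as an ordered sequence of events. The following Java sketch only simulates that order for a single catalog and is not Sinon code: the class, the method names, and the catalog and page identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical simulation of the event order described above; not Sinon code. */
final class ExecutionOrderSketch {

    /** Records the events produced while "executing" one catalog with the given start pages. */
    static List<String> runCatalog(String catalogId, List<String> startPages) {
        List<String> events = new ArrayList<>();
        events.add("remove all variables");
        events.add("catalog started: " + catalogId);
        events.add("set catalog global variables");
        for (String startPage : startPages) {
            events.add("set start page variables: " + startPage);
            processPage(startPage, events);
        }
        events.add("catalog finished: " + catalogId);
        return events;
    }

    /** Appends the four page states, interleaved with the work done between them. */
    static void processPage(String pageId, List<String> events) {
        events.add(pageId + ": before download");
        events.add(pageId + ": download");
        events.add(pageId + ": before processing");
        events.add(pageId + ": extraction steps");
        events.add(pageId + ": after processing");
        events.add(pageId + ": actions");
        events.add(pageId + ": finished");
    }

    public static void main(String[] args) {
        for (String event : runCatalog("news", Arrays.asList("home", "archive"))) {
            System.out.println(event);
        }
    }
}
```

A page listener sees only the four state notifications; the download, extraction, and action lines are included here just to show what happens between them.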

3 Tag reference

3.1 collector

Set of information needed for the data extraction of a given site.

Attributes:

Name Type Definition Mandatory?
id String Defines the collector name. Should be equal to the name of the file without the .xml suffix. Yes

Child elements: <config> (section 3.2), <on-catalog-state-change> (section 3.8), <catalog> (section 3.9).

3.2 config

Collector configuration.

No attributes.

Child elements: <headers> (section 3.3), <thinkTime> (section 3.5), <timeout> (section 3.6), <retries> (section 3.7).

3.3 headers

Set of HTTP headers sent to the server during a request.

Attributes: none.

Child element: <header> (section 3.4).

3.4 header

An HTTP header (name-value pair) sent to the server during a request.

Attributes:

Name Type Definition Mandatory?
name String Name of the HTTP header. Yes
value String Value of the HTTP header. Yes

Child elements: none.

Example:

<headers>
    <header name="Accept" value="image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"/>
    <header name="Accept-Language" value="pt-BR"/>
    <header name="User-Agent" value="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"/>
    <header name="Connection" value="Keep-Alive"/>
</headers>

3.5 thinkTime

Think time (time between requests to the same site) configuration. The fully qualified name of a class implementing the eteg.sinon.executor.ThinkTime interface must be specified. Parameters can be passed to the ThinkTime implementation through nested property tags, each one with name and value attributes.

Sinon provides eteg.sinon.util.ConstantThinkTime, a ThinkTime implementation that defines a constant think time specified by the interval parameter.

Attributes:

Name Type Definition Mandatory?
class String Think time implementation fully qualified class name. Yes

Child element: <property> (section 3.13).

Example: a constant think time of 10 seconds (10000 milliseconds).

<thinkTime class="eteg.sinon.util.ConstantThinkTime">
    <property name="interval" value="10000"/>
</thinkTime>

3.6 timeout

Maximum connection timeout in milliseconds. Its recommended value is at least 10000.

Attributes:

Name Type Definition Mandatory?
value Integer Maximum connection timeout in milliseconds. Yes

Child elements: none.

Example: a connection timeout of 3 minutes (180000 milliseconds).

<timeout value="180000"/>

3.7 retries

Maximum number of page download retries.

Attributes:

Name Type Definition Mandatory?
value Integer Maximum number of page download retries. Yes

Child elements: none.

Example:

<retries value="5"/>

3.8 on-catalog-state-change

Fully qualified name of the class that will receive notification of catalog processing events (catalog processing start and catalog processing finish).

Attributes:

Name Type Definition Mandatory?
class String Catalog listener fully qualified class name. Yes

Child elements: none.

3.9 catalog

For a definition of catalog, see section 2.3.

Attributes:

Name Type Definition Mandatory?
id String Catalog identifier. Yes

Child elements: <properties> (section 3.12), <start-pages> (section 3.10), <global-data-extractions> (section 3.24), <page> (section 3.14), <parameter-set> (section 3.21).

3.10 start-pages

Set of start pages.

Attributes: none.

Child elements: <start-page> (section 3.11).

3.11 start-page

Declares a start page. For a definition of start page, see section 2.4.

Attributes:

Name Type Definition Mandatory?
refid String Identifier of the <page> used as start page. Yes

Child elements: <property> (section 3.13).

3.12 properties

Set of properties (variables) that are set just before the catalog that owns them is processed.

Child elements: <property> (section 3.13).

3.13 property

When enclosed in a <catalog>, <start-page>, or <page-config> tag, defines a variable and its value. For more information about variables, see section 2.6. When enclosed in a <thinkTime>, defines a configuration parameter for the ThinkTime implementation used. For more information about think time, see section 2.7.

Attributes:

Name Type Definition Mandatory?
name String Property name. Yes
value String Property value. Interpreted attribute if enclosed in a <catalog>, <start-page>, or <page-config> tag. Yes

Child elements: none.

3.14 page

Declares a page. For a definition of page, see section 2.5.

Attributes:

Name Type Definition Mandatory?
id String Identifier of this page. Inside a catalog, each page must have a different identifier. Yes
url String URL of this page. Interpreted attribute. Yes
method String Method used to send HTTP parameters to a page. The valid values are GET and POST. No. Default: GET
follow-redirect boolean If true, HTTP redirections are followed automatically. No. Default: true

Child elements: <data-extractions> (section 3.17), <page-config> (section 3.16), <action> (section 3.20), <save> (section 3.15).

3.15 save

Writes the downloaded page to a file in the local filesystem. It does not handle download of non-textual documents.

Attributes:

Name Type Definition Mandatory?
id String File to which the downloaded page will be written. It can be any string accepted by the constructor http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#File(java.lang.String). Interpreted attribute. Yes

Child elements: none.

3.16 page-config

Configuration of a page. Actually, it is just a convenient way to pass parameters to dynamic pages. Pages that receive parameters via POST must use this tag.

Attributes: none.

Child elements: <parameter-set> (section 3.21).

3.17 data-extractions

Set of data extractions of a given page.

Attributes: none.

Child elements: <data-extraction> (section 3.18), <do-data-extraction> (section 3.19), <loop> (section 3.25), <jump> (section 3.26).

3.18 data-extraction

Declares a data extraction. The extracted data is the string found after the <from> string and before the <to> string.

Attributes:

Name Type Definition Mandatory?
id String Name of the variable that will receive the extracted data. Yes
trim boolean Tells if the extracted data must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim(). No. Default: true
list boolean Tells if the variable defined by this data extraction is multivalued. See section 2.6. No. Default: true
allowDuplicates boolean Tells if duplicate values are accepted. If not, any duplicate values are ignored. This attribute is ignored if list is false. No. Default: true
failOnError boolean Tells if the failure of a data extraction step must be considered an error. No. Default: true
valueOnError String Defines what value the data extraction variable will receive if failOnError is true and some data extraction step fails. No. Default: empty string
resetPosition boolean Tells if the page current position must be reset to 0 before the execution of this data extraction. No. Default: false

Child elements: <jump> (section 3.26), <from> (section 3.27), <to> (section 3.28), <loop> (section 3.25).
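
The from/to rule can be pictured with a minimal Java sketch. It is not the Sinon implementation: the class and method names are hypothetical, and returning null on failure and moving the current position to just after the <to> string are assumptions made for the illustration.

```java
/** Hypothetical sketch of from/to extraction; not the Sinon implementation. */
final class BetweenExtractor {

    private final String text;
    private int position = 0; // current position in the page text

    BetweenExtractor(String text) {
        this.text = text;
    }

    /**
     * Returns the text after the next occurrence of {@code from} and before the
     * following occurrence of {@code to}, or null if either string is not found.
     * On success the current position moves to just after {@code to} (an assumption).
     */
    String extract(String from, String to, boolean trim, boolean resetPosition) {
        if (resetPosition) {
            position = 0;
        }
        int fromIndex = text.indexOf(from, position);
        if (fromIndex < 0) {
            return null;
        }
        int start = fromIndex + from.length();
        int end = text.indexOf(to, start);
        if (end < 0) {
            return null;
        }
        position = end + to.length();
        String value = text.substring(start, end);
        return trim ? value.trim() : value;
    }

    public static void main(String[] args) {
        BetweenExtractor page = new BetweenExtractor(
                "<td class=\"name\"> Ana </td><td class=\"name\"> Rui </td>");
        System.out.println(page.extract("<td class=\"name\">", "</td>", true, false)); // prints "Ana"
        System.out.println(page.extract("<td class=\"name\">", "</td>", true, false)); // prints "Rui"
    }
}
```

Because the position advances after each extraction, repeating the same extraction walks through the page, which is what makes list variables and loops possible.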

3.19 do-data-extraction

Executes a given global data extraction (one that is declared inside a <global-data-extractions>).

Attributes:

Name Type Definition Mandatory?
refid String Data extraction identifier. Yes

Child elements: none.

3.20 action

Declares an action to be executed after a page is processed.

Attributes:

Name Type Definition Mandatory?
pageId String Identifier of the page that will be executed by this action. Yes
condition String Defines that this action must be executed only if some condition is satisfied. Currently, only the notEmpty condition is available. It is satisfied if the variable referred to by the propertyId attribute is not an empty string. No
propertyId String Name of the variable to be tested by the action condition. Yes if the action has some condition.
url String URL to be downloaded and processed. If this attribute is not specified, the value of the url attribute of the <page> tag is used. No
foreach String Executes this action for each value of the list variable specified by this attribute. No
variable String Defines the name of the variable that will receive the value of the current foreach value. This variable is set before the action is executed. Ignored if this action is not a foreach one. No
index String Defines the name of the variable that will receive the current foreach index. The first index is 0. No

Child elements: <page-config> (section 3.16).
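
The foreach, variable, and index attributes can be pictured with a minimal Java sketch. It is not Sinon code: the class, the method, the URL template, and the variable names are hypothetical, and the sketch only shows which URL each iteration would end up with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a foreach action; not the Sinon implementation. */
final class ForeachActionSketch {

    /**
     * For each value of a list variable, sets the variables named by
     * {@code variable} and {@code index} and expands them in the URL template.
     */
    static List<String> expandUrls(String urlTemplate, List<String> values,
                                   String variable, String index) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) { // the first index is 0
            String url = urlTemplate
                    .replace("${" + variable + "}", values.get(i))
                    .replace("${" + index + "}", String.valueOf(i));
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b");
        for (String url : expandUrls("http://www.example.br/item/${id}?n=${i}", ids, "id", "i")) {
            System.out.println(url);
        }
    }
}
```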

3.21 parameter-set

Set of parameters passed to a dynamic page.

Attributes:

Name Type Definition Mandatory?
id String Identifier of this parameter set. Yes if enclosed in a <catalog> tag, no if enclosed in a <page-config> tag.

Child elements: <parameter> (section 3.22), <include-parameter-set> (section 3.23).

3.22 parameter

A parameter (name-value pair) sent to a dynamic page. Name and value are automatically encoded.

Attributes:

Name Type Definition Mandatory?
name String Parameter name. Yes.
value String Parameter value. Interpreted attribute. Yes, but can be empty.
encoding String Character set used to encode the parameter name and value. No. Default: ISO_8859_1.

Child elements: none.
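
The encoding of names and values can be pictured with the standard java.net.URLEncoder class. The sketch below is not Sinon code, and whether Sinon uses URLEncoder internally is an assumption. The class name and the sample parameters are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of parameter encoding; not the Sinon implementation. */
final class ParameterEncodingSketch {

    /** Builds a name=value&name=value string, encoding names and values with the given charset. */
    static String encode(Map<String, String> parameters, String encoding) {
        StringBuilder query = new StringBuilder();
        try {
            for (Map.Entry<String, String> parameter : parameters.entrySet()) {
                if (query.length() > 0) {
                    query.append('&');
                }
                query.append(URLEncoder.encode(parameter.getKey(), encoding))
                     .append('=')
                     .append(URLEncoder.encode(parameter.getValue(), encoding));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new LinkedHashMap<>();
        parameters.put("q", "S\u00e3o Paulo"); // "São Paulo"
        parameters.put("page", "");            // a value may be empty
        System.out.println(encode(parameters, "ISO-8859-1")); // prints "q=S%E3o+Paulo&page="
    }
}
```

The same value encoded with UTF-8 would produce different bytes for the accented character, which is why the encoding attribute matters for sites that expect a specific character set.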

3.23 include-parameter-set

Includes all the parameters of some given parameter set in this one. If some parameter was already set and the included parameter set has a parameter with the same name, it is overwritten.

Attributes:

Name Type Definition Mandatory?
refid String Parameter set identifier. Yes.

Child elements: none.
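
The overwrite rule behaves like merging one map into another. The following Java sketch is only an illustration with hypothetical names, not Sinon code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of include-parameter-set semantics; not the Sinon implementation. */
final class IncludeParameterSetSketch {

    /** Returns the current parameters with the included set merged in; included values win. */
    static Map<String, String> include(Map<String, String> current, Map<String, String> included) {
        Map<String, String> merged = new LinkedHashMap<>(current);
        merged.putAll(included); // same-named parameters are overwritten
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> current = new LinkedHashMap<>();
        current.put("lang", "pt");
        current.put("page", "1");
        Map<String, String> included = new LinkedHashMap<>();
        included.put("page", "2");
        included.put("sort", "date");
        System.out.println(include(current, included)); // prints "{lang=pt, page=2, sort=date}"
    }
}
```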

3.24 global-data-extractions

This tag declares data extractions that can be used in more than one page within a catalog. These global data extractions are executed with the <do-data-extraction> tag (section 3.19).

Attributes: none.

Child element: <data-extraction> (section 3.18).

3.25 loop

Declares a loop. The extraction steps enclosed by this tag are executed until the loop termination condition is satisfied.

Attributes:

Name Type Definition Mandatory?
condition notFails or before Loop stop condition. The notFails condition executes the loop until some extraction step fails. The before condition executes the loop until the page current position is before the first occurrence of the string defined by the value attribute. Yes
value String String used as reference by the before loop condition. Yes if condition is before, no otherwise.

Child elements: <data-extraction> (section 3.18), <jump> (section 3.26).
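
A loop with the notFails condition around a list data extraction can be pictured with the following Java sketch. It is not Sinon code: the names are hypothetical, and the way the position advances after each extraction is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a notFails loop over one data extraction; not Sinon code. */
final class NotFailsLoopSketch {

    /** Repeats a from/to extraction until it fails, collecting the values found. */
    static List<String> extractAll(String text, String from, String to, boolean allowDuplicates) {
        if (from.isEmpty() && to.isEmpty()) {
            throw new IllegalArgumentException("from and to cannot both be empty");
        }
        List<String> values = new ArrayList<>();
        int position = 0;
        while (true) {
            int fromIndex = text.indexOf(from, position);
            if (fromIndex < 0) {
                break; // the <from> step failed: the loop ends
            }
            int start = fromIndex + from.length();
            int end = text.indexOf(to, start);
            if (end < 0) {
                break; // the <to> step failed: the loop ends
            }
            String value = text.substring(start, end).trim();
            if (allowDuplicates || !values.contains(value)) {
                values.add(value);
            }
            position = end + to.length();
        }
        return values;
    }

    public static void main(String[] args) {
        String page = "<li>a</li><li>b</li><li>a</li>";
        System.out.println(extractAll(page, "<li>", "</li>", true));  // prints "[a, b, a]"
        System.out.println(extractAll(page, "<li>", "</li>", false)); // prints "[a, b]"
    }
}
```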

3.26 jump

Declares a jump, a positioning step that moves the page current position to just after the first occurrence of the reference string. This string is enclosed in this tag and is interpreted.

Attributes:

Name Type Definition Mandatory?
trim boolean Tells if the reference string must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim(). No. Default: true
direction forward or backwards Defines the search direction. No. Default: forward
failIfFound boolean If false, this positioning step fails if the reference string is not found. If true, this positioning step fails if the reference string is found. No. Default: false

Child element: <newline> (section 3.29).

Examples:

<jump>Name:</jump>
<jump><![CDATA[<td class="name">]]></jump>
<jump>&lt;td class="name"&gt;</jump>

The last two examples have the same meaning but are written in two different ways: using a CDATA section and escaping the < (less than) and > (greater than) characters.
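
The effect of the direction and failIfFound attributes can be pictured with a minimal Java sketch. It is not the Sinon implementation. The names are hypothetical, and two details are assumptions: a backwards search only considers occurrences that end at or before the current position, and a successful failIfFound jump leaves the position unchanged.

```java
/** Hypothetical sketch of a jump step; not the Sinon implementation. */
final class JumpSketch {

    /**
     * Returns the new current position, or -1 if the step fails.
     * With failIfFound the step succeeds only when the reference string is absent.
     */
    static int jump(String text, int position, String reference,
                    boolean forward, boolean failIfFound) {
        int index = forward
                ? text.indexOf(reference, position)
                // Assumption: search backwards among occurrences ending at or before the position.
                : text.lastIndexOf(reference, position - reference.length());
        boolean found = index >= 0;
        if (failIfFound) {
            return found ? -1 : position; // assumption: position unchanged on success
        }
        return found ? index + reference.length() : -1;
    }

    public static void main(String[] args) {
        String text = "Name: Ana | Age: 30";
        System.out.println(jump(text, 0, "Name:", true, false));            // prints "5"
        System.out.println(jump(text, text.length(), "|", false, false));   // prints "11"
        System.out.println(jump(text, 0, "Phone:", true, false));           // prints "-1"
    }
}
```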

3.27 from

Works exactly like a <jump>, but it also marks the beginning of the extracted data as the first position after the occurrence of the reference string.

Attributes:

Name Type Definition Mandatory?
trim boolean Tells if the reference string must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim(). No. Default: true
direction forward or backwards Defines the search direction. No. Default: forward

Child element: <newline> (section 3.29).

3.28 to

Works exactly like a <jump>, but it also marks the end of the extracted data as the first position before the occurrence of the reference string.

Attributes:

Name Type Definition Mandatory?
trim boolean Tells if the reference string must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim(). No. Default: true
direction forward or backwards Defines the search direction. No. Default: forward

Child element: <newline> (section 3.29).

3.29 newline

Used in <jump>, <from> and <to> to represent a newline character in a more convenient way. There are very few cases in which the use of <newline> is needed.

Attributes: none.

Child elements: none.

Example: the desired data, a name, found between a newline and a | character.

<data-extraction id="name" list="true" trim="true">
    <from trim="false"><newline/></from>
    <to>|</to>
</data-extraction>

4 Sinon API

This section describes only the classes, interfaces and methods used to create an application using Sinon. The complete Sinon API documentation is available in the Sinon Javadoc.

4.1 Class CollectorExecutor

Instances of eteg.sinon.executor.CollectorExecutor execute collectors. The following methods are the most useful for implementing listeners. The complete CollectorExecutor API is available in its Javadoc.

CollectorExecutor(Collector collector) Single constructor of this class. Receives the Collector instance to be executed.
void execute() Executes the collector passed as argument to the constructor.
void execute(Catalog[] catalogs) Executes the collector passed as argument to the constructor, but only the catalogs in the catalogs array.
String[] getPropertyNames() Returns the names of all variables set.
Object getPropertyValue(String propertyName) Returns the value of the variable whose name is propertyName. If the given variable is simple, this method returns a String. Otherwise, a List containing Strings is returned. Changes made to this list change the variable value. If there is no such variable, null is returned. This method raises an IllegalArgumentException if propertyName is null.
void setProperty(String propertyId, String value) throws SinonException Sets the value of a simple property. If this property already exists, the old value is replaced by the new. Otherwise, the property is created with the given value. This method throws SinonException if the given property is multivalued.
long getStartTime() Returns the Unix timestamp at which the collector was started or -1 if it was not started yet.
long getFinishTime() Returns the Unix timestamp at which the collector has finished or -1 if it has not finished yet.
Log getLog() Returns the org.apache.commons.logging.Log instance (from the Jakarta Commons Logging package) used to create an execution log.
String getCurrentText() Returns the last downloaded page in a String or null if none was downloaded yet.
byte[] getCurrentBytes() Returns the last downloaded document as a byte array, or null if none was downloaded yet. Useful when a listener wants to handle binary documents downloaded by Sinon.
void stop() Stops this collector execution.
boolean isStopped() Returns true if this collector was stopped, false otherwise.
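
Because getPropertyValue returns either a String or a List, a listener usually has to check the type of the value it receives. The following Java sketch shows one way to normalize such a value. The helper class is hypothetical and not part of the Sinon API. It works on a plain Object so that it compiles without Sinon.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for listener code; not part of the Sinon API. */
final class PropertyValues {

    /**
     * Normalizes a property value to a list of strings:
     * null becomes an empty list, a String becomes a one-element list,
     * and a List is copied element by element.
     */
    static List<String> asList(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof String) {
            return Collections.singletonList((String) value);
        }
        if (value instanceof List) {
            List<String> copy = new ArrayList<>();
            for (Object item : (List<?>) value) {
                copy.add(String.valueOf(item));
            }
            return copy;
        }
        throw new IllegalArgumentException("Unexpected property value type: " + value.getClass());
    }

    public static void main(String[] args) {
        System.out.println(asList("sinon"));
        System.out.println(asList(new ArrayList<>(List.of("a", "b"))));
        System.out.println(asList(null));
    }
}
```

Copying the list is a deliberate choice in this sketch: since changes made to the returned list change the variable value, a copy lets listener code sort or filter the values without side effects.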

4.2 Interface CatalogStateListener

To receive catalog processing events, an implementation of the eteg.sinon.listener.CatalogStateListener interface must be written. The following methods must be implemented:

Method Description
public void onStateChange(Catalog catalog, CatalogState state, CollectorExecutor executor) Method invoked when a catalog state change occurs.
public static CatalogStateListener getInstance() Method that returns the CatalogStateListener instance that will receive the catalog state change events.

See also Catalog (section 4.3), CatalogState (section 4.4), CollectorExecutor (section 4.1).

4.3 Class Catalog

eteg.sinon.core.Catalog instances represent catalogs in Sinon. The only method needed to implement an application that uses Sinon is String getId(). It returns the catalog id attribute.

4.4 Class CatalogState

eteg.sinon.core.CatalogState is an enumeration class that represents the possible catalog processing states: before processing and after processing. It has some methods:

Method Description
boolean isBeforeProcessing() Returns true if this catalog state is before processing and false otherwise.
boolean isAfterProcessing() Returns true if this catalog state is after processing and false otherwise.
boolean equals(Object other) Returns true if other is a CatalogState instance and if other represents the same catalog state as this object.
String getName() Returns the name of this state.

4.5 Interface PageStateListener

To receive page processing events, an implementation of the eteg.sinon.listener.PageStateListener interface must be written. The following methods must be implemented:

Method Description
public void onStateChange(Page page, PageState state, CatalogStateListener catalogStateListener, CollectorExecutor executor) Method invoked when a page state change occurs.
public static PageStateListener getInstance() Method that returns the PageStateListener instance that will receive the page state change events.

See also Page (section 4.7), PageState (section 4.6), CatalogStateListener (section 4.2), CollectorExecutor (section 4.1).

4.6 Class PageState

eteg.sinon.executor.PageState is an enumeration class that represents the possible page processing states: before download, before processing (page download done but no processing yet), after processing (page download and data extractions done, but no page action executed), and finished (page download, data extractions, and actions done). It has some methods:

Method Description
boolean isBeforeDownload() Returns true if this page state is before download and false otherwise.
boolean isBeforeProcessing() Returns true if this page state is before processing and false otherwise.
boolean isAfterProcessing() Returns true if this page state is after processing and false otherwise.
boolean isFinished() Returns true if this page state is finished and false otherwise.
boolean equals(Object other) Returns true if other is a PageState instance and if other represents the same page state as this object.
String getStateName() Returns the name of this state.

4.7 Class Page

eteg.sinon.core.Page instances represent pages in Sinon. The only method needed to implement an application that uses Sinon is String getId(). It returns the page id attribute.

4.8 Interface ErrorListener

To receive error events and react to them, an implementation of the eteg.sinon.listener.ErrorListener interface must be written. If no ErrorListener is used, any error causes Sinon to stop the execution of the collector. The following methods must be implemented:

Method Description
public ErrorResponse onError(SinonException exception) Handles an error that occurred during the execution of a collector and returns what action to take.
public static ErrorListener getInstance() Method that returns the ErrorListener instance that will receive the error events and handle them.

See also ErrorResponse (section 4.9).

4.9 Class ErrorResponse

Enumeration class that represents the possible error responses: stop (stop execution when an error occurs), retry (retry the action that caused the error), and resume next (ignore the error and go ahead). It has some methods:

Method Description
boolean isStop() Returns true if this error response is stop and false otherwise.
boolean isRetry() Returns true if this error response is retry and false otherwise.
boolean isResumeNext() Returns true if this error response is resume next and false otherwise.
boolean equals(Object other) Returns true if other is an ErrorResponse instance and if other represents the same error response as this object.
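
One way to picture the three responses is as branches in the loop that drives a failing step. The following Java sketch is not Sinon code: the enum is a local stand-in for ErrorResponse, the names are hypothetical, and the retry limit is an assumption made so that the loop always ends.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of how the three error responses differ; not Sinon code. */
final class ErrorResponseSketch {

    /** Local stand-in for the three responses described above. */
    enum Response { STOP, RETRY, RESUME_NEXT }

    /**
     * Runs a step that returns true on success. On failure the response decides
     * what happens next. Returns "done", "skipped" or "stopped".
     */
    static String run(BooleanSupplier step, Response onError, int maxRetries) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (step.getAsBoolean()) {
                return "done";
            }
            if (onError == Response.STOP) {
                return "stopped"; // give up on the whole execution
            }
            if (onError == Response.RESUME_NEXT) {
                return "skipped"; // ignore the error and go ahead
            }
            if (attempts > maxRetries) {
                return "stopped"; // assumption: retries are bounded
            }
            // RETRY: loop again and repeat the step that failed.
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A step that fails twice and succeeds on the third attempt.
        System.out.println(run(() -> ++calls[0] >= 3, Response.RETRY, 5)); // prints "done"
        System.out.println(run(() -> false, Response.RESUME_NEXT, 5));     // prints "skipped"
    }
}
```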

4.10 Interface ThinkTime

Think time implementations must implement the eteg.sinon.executor.ThinkTime interface and have a constructor without parameters. An instance of the ThinkTime implementation used is created before the processing of catalogs and then used for the whole collector execution.

Sinon provides eteg.sinon.util.ConstantThinkTime, a think time implementation whose nextThinkTime method always returns the same value. It is defined, in milliseconds, by the interval parameter.

The following methods must be implemented:

Method Description
long nextThinkTime() Method invoked before every page download. It must return the think time in milliseconds.
void setConfiguration(Properties properties) Method invoked immediately after the instance is created. Used for configuration purposes. properties contains the name-value pairs contained in <property> tags enclosed in the <thinkTime> tag.
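
As an illustration, a custom think time could pause for a random interval instead of a constant one. The following Java sketch is hypothetical: RandomThinkTime and its min and max parameters are not provided by Sinon, and the ThinkTime interface declared at the end is a local stand-in that mirrors the two methods listed above so that the sketch compiles without Sinon. A real implementation would implement eteg.sinon.executor.ThinkTime instead.

```java
import java.util.Properties;
import java.util.Random;

/** Hypothetical implementation: a random pause between "min" and "max" milliseconds. */
final class RandomThinkTime implements ThinkTime {

    private final Random random = new Random();
    private long min = 1000; // assumed defaults, used when a parameter is missing
    private long max = 5000;

    /** Constructor without parameters, as required for think time implementations. */
    public RandomThinkTime() {
    }

    @Override
    public void setConfiguration(Properties properties) {
        min = Long.parseLong(properties.getProperty("min", "1000"));
        max = Long.parseLong(properties.getProperty("max", "5000"));
        if (min < 0 || max < min) {
            throw new IllegalArgumentException("Expected 0 <= min <= max");
        }
    }

    @Override
    public long nextThinkTime() {
        // nextDouble() is in [0, 1), so the result is in [min, max].
        return min + (long) (random.nextDouble() * (max - min + 1));
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("min", "2000");
        properties.setProperty("max", "4000");
        RandomThinkTime thinkTime = new RandomThinkTime();
        thinkTime.setConfiguration(properties);
        System.out.println(thinkTime.nextThinkTime());
    }
}

/** Local stand-in mirroring the two documented methods; replace with Sinon's interface. */
interface ThinkTime {
    long nextThinkTime();

    void setConfiguration(Properties properties);
}
```

In a configuration, the min and max values would be passed through <property> tags nested in the <thinkTime> tag, in the same way as the interval parameter of ConstantThinkTime.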