Sinon is a Java tool that extracts textual information from Web sites. In other words, it can be used to scrape any kind of text (HTML included) available on the Internet or in a filesystem. An XML configuration file written by the user tells Sinon which steps it must execute to reach the desired information. Page downloads and cookie management are transparent to the user. The extracted information and the extraction status are made available through the Sinon API.
The Sinon API does not provide any data manipulation or storage services. It only extracts data and notifies the listener classes specified in the XML configuration file of data extraction events. The Sinon API is described in section 4.
Note: Sinon 0.8 targeted the download and handling of textual documents (HTML, text, XML, etc.). Beginning with version 0.9, Sinon handles binary documents as well as textual ones. Consequently, where this document says "page", read "document" instead, as the latter covers both types of Web content.
Sinon can be used in two ways:
From the command line:
java -cp [classpath] -jar sinon-0.9.jar <collector file> [catalog name]
From Java code: use a CollectorDAO implementation to load a Collector instance from the filesystem or another source, then create a CollectorExecutor instance and call one of its execute methods.
Sinon is open-source software created and maintained by Eteg Internet Ltda. It is available under the Apache License 2.0, which can be found at http://www.apache.org/licenses/LICENSE-2.0.html.
Sinon has the following dependencies:
A collector configuration (or just configuration) is a set of information created by the user to instruct Sinon how to locate and extract his/her desired data. It is comprised of a page request configuration (HTTP headers sent to the server, think time, connection timeout, and number of attempts to download a page), catalog listener configuration, and catalogs.
Configurations are represented in Sinon by Collector instances and are typically stored in XML files. XMLCollectorDAO instances read Sinon XML files and return Collector instances.
The Sinon 0.9 configuration DTD is http://sinon.sourceforge.net/dtd/sinon-0.8.dtd (the DTD has not changed between versions 0.8 and 0.9), and its tags are described in section 3. By default, Sinon does not validate XML input and ignores unknown tags and attributes.
Listeners are objects, specified by the configuration, that are notified when certain events occur. There are three types: catalog listeners (catalog started, catalog finished) (section 4.2), page listeners (page start, page downloaded, page processed, page finished) (section 4.5), and error listeners (section 4.8).
Each catalog represents a path to some desired information and has one or more pages and one or more start pages. Different catalogs within the same configuration represent different data to extract.
Start pages represent different ways to start a catalog. Each one includes the name of the initial page and a set of variables.
A page represents information about how to extract the desired data from a given Web page, what page listener will be notified and what actions to take.
Extracted data and other values are stored in variables (also called properties). They can be simple, holding only a string, or multivalued (list variables), holding a list of strings. Variables have two purposes: to hold extracted data and to construct strings, especially URLs.
Almost all tag and attribute values can be constructed from simple variables (multivalued ones cannot be used in this case). Every occurrence of ${someVariableName} is replaced by the current value of the variable someVariableName. For example:
<page url="http://www.example.br/${pageId}.html" ...
If the value of pageId is sinon, the URL used by that page is http://www.example.br/sinon.html
Variables are set using the
data-extraction
and
property
tags.
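The substitution rule described above can be sketched in a few lines of Java. This is only an illustration of the ${name} replacement, not Sinon's actual code: the class name VariableInterpolator and the method interpolate are hypothetical, and expanding an unset variable to an empty string is an assumption, since this document does not say what Sinon does in that case.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Sinon API.
final class VariableInterpolator {
    // Matches ${name}; the variable name is captured in group 1.
    private static final Pattern REFERENCE = Pattern.compile("\\$\\{([^}]+)\\}");

    private VariableInterpolator() {}

    /** Replaces every ${name} in template with the value of the simple variable name. */
    static String interpolate(String template, Map<String, String> variables) {
        Matcher matcher = REFERENCE.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Assumption: an unset variable expands to an empty string.
            String value = variables.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in the value literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

With pageId set to sinon, interpolating http://www.example.br/${pageId}.html yields http://www.example.br/sinon.html, matching the example above.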
Think time is a pause between two consecutive
page requests to the same server. Its main purpose is to
emulate the behavior of a human user in a website.
It is implemented by classes that implement
the
ThinkTime
interface
(section 4.10).
The algorithm used by Sinon to execute collectors is:
The page processing algorithm is:
Set of information needed for the data extraction of a given site.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | Defines the collector name. Should be equal to the name of the file without the .xml suffix. | Yes |
Child elements: <config> (section 3.2), <on-catalog-state-change> (section 3.8), <catalog> (section 3.9).
Collector configuration.
No attributes.
Child elements:
<headers>
(section 3.3),
<thinkTime>
(section 3.5),
<timeout>
(section 3.6),
<retries>
(section 3.7).
Set of HTTP headers sent to the server during a request.
Attributes: none.
Child element:
<header>
(section 3.4).
An HTTP header (name-value pair) sent to the server during a request.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
name | String | Name of the HTTP header. | Yes |
value | String | Value of the HTTP header. | Yes |
Child elements: none.
Example:
<headers>
  <header name="Accept" value="image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"/>
  <header name="Accept-Language" value="pt-BR"/>
  <header name="User-Agent" value="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"/>
  <header name="Connection" value="Keep-Alive"/>
</headers>
Think time (time between requests to the same site) configuration. The fully qualified name of a class implementing the eteg.sinon.executor.ThinkTime interface must be specified. Parameters can be passed to the ThinkTime implementation through nested property tags, each one with name and value attributes. Sinon provides eteg.sinon.util.ConstantThinkTime, a ThinkTime implementation that uses a constant think time specified by the interval parameter.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
class | String | Think time implementation fully qualified class name. | Yes |
Child element:
<property>
(section 3.13).
Example: a constant think time of 10 seconds (10000 milliseconds).
<thinkTime class="eteg.sinon.util.ConstantThinkTime">
  <property name="interval" value="10000"/>
</thinkTime>
Maximum connection timeout in milliseconds. Its recommended value is at least 10000.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
value | Integer | Maximum connection timeout in milliseconds. | Yes |
Child elements: none.
Example: a connection timeout of 3 minutes (180000 milliseconds).
<timeout value="180000"/>
Maximum number of page download retries.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
value | Integer | Maximum number of page download retries. | Yes |
Child elements: none.
Example:
<retries value="5"/>
Fully qualified name of the class that will receive notification of catalog processing events (catalog processing start and catalog processing finish).
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
class | String | Catalog listener fully qualified class name. | Yes |
Child elements: none.
For a definition of catalog, see section 2.3.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | Catalog identifier. | Yes |
Child elements:
<properties>
(section 3.12),
<start-pages>
(section 3.10),
<global-data-extractions>
(section 3.24),
<page>
(section 3.14),
<parameter-set>
(section 3.21).
Declares a start page. For a definition of start page, see section 2.4.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
refid | String | Identifier of the <page> used as start page. | Yes |
Child elements:
<property>
(section 3.13).
Set of properties (variables) that are set just before the catalog that owns them is processed.
Child elements: <property> (section 3.13).
When enclosed in a
<catalog>
,
<start-page>
,
or
<page-config>
tag,
defines a variable and its value.
For more information about variables, see
section 2.6.
When enclosed in a
<thinkTime>
,
defines a configuration parameter for the
ThinkTime
implementation used.
For more information about think time,
see
section 2.7.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
name | String | Property name. | Yes |
value | String | Property value. Interpreted attribute if enclosed in a <catalog>, <start-page>, or <page-config> tag. | Yes |
Child elements: none.
Declares a page. For a definition of page, see section 2.5.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | Identifier of this page. Inside a catalog, each page must have a different identifier. | Yes |
url | String | URL of this page. Interpreted attribute. | Yes |
method | String | Method used to send HTTP parameters to a page. The valid values are GET and POST. | No. Default: GET |
follow-redirect | boolean | If true, HTTP redirections are followed automatically. | No. Default: true |
Child elements:
<data-extractions>
(section 3.17),
<page-config>
(section 3.16),
<action>
(section 3.20),
<save>
(section 3.15).
Writes the downloaded page to a file in the local filesystem. It does not handle download of non-textual documents.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | File to which the downloaded page will be written. It can be any string accepted by the constructor http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#File(java.lang.String). Interpreted attribute. | Yes |
Child elements: none.
Configuration of a page. Actually, it is just a convenient way to pass parameters to dynamic pages. Pages that receive parameters via POST must use this tag.
Attributes: none.
Child elements:
<parameter-set>
(section 3.21).
Set of data extractions of a given page.
Attributes: none.
Child elements:
<data-extraction>
(section 3.18),
<do-data-extraction>
(section 3.19),
<loop>
(section 3.25),
<jump>
(section 3.26).
Declares a data extraction.
The extracted data is the string found after the
<from>
string and before the
<to>
string.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | Name of the variable that will receive the extracted data. | Yes |
trim | boolean | Tells if the extracted data must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim() | No. Default: true |
list | boolean | Tells if the variable defined by this data extraction is multivalued. See section 2.6. | No. Default: true |
allowDuplicates | boolean | Tells if duplicate values are accepted. If not, any duplicate values are ignored. This attribute is ignored if list is false. | No. Default: true |
failOnError | boolean | Tells if the failure of any data extraction step must be considered an error. | No. Default: true |
valueOnError | String | Defines what value the data extraction variable will receive if failOnError is true and some data extraction step fails. | No. Default: empty string |
resetPosition | boolean | Tells if the page current position must be reset to 0 before the execution of this data extraction. | No. Default: false |
Child elements:
<jump>
(section 3.26),
<from>
(section 3.27),
<to>
(section 3.28),
<loop>
(section 3.25).
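To make the from/to rule concrete, here is a minimal Java sketch of extracting the text found after one reference string and before another. It is not Sinon's implementation: the class BetweenExtractor and its extract method are hypothetical names, only forward searches are covered, and returning null on failure is an assumption made for the sketch.

```java
// Hypothetical helper, not part of the Sinon API.
final class BetweenExtractor {
    private BetweenExtractor() {}

    /**
     * Returns the text after the first occurrence of from (searching forward
     * from position start) and before the next occurrence of to.
     * Returns null if either reference string is not found (an assumption).
     */
    static String extract(String page, int start, String from, String to, boolean trim) {
        int fromIndex = page.indexOf(from, start);
        if (fromIndex < 0) {
            return null; // the <from> step failed
        }
        int begin = fromIndex + from.length();
        int end = page.indexOf(to, begin);
        if (end < 0) {
            return null; // the <to> step failed
        }
        String data = page.substring(begin, end);
        // String.trim() is the method the trim attribute refers to.
        return trim ? data.trim() : data;
    }
}
```

For instance, with from set to <td class="name"> and to set to </td>, a page fragment containing <td class="name"> Ada </td> would produce Ada when trim is true.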
Executes a given global data extraction (one that is
declared inside a <global-data-extractions>
).
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
refid | String | Data extraction identifier. | Yes |
Child elements: none.
Declares an action to be executed after a page is processed.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
pageId | String | Identifier of the page that will be executed by this action. | Yes |
condition | String | Defines that this action must be executed only if some condition is satisfied. Currently, only the notEmpty condition is available. It is satisfied if the variable referred to by the propertyId attribute is not an empty string. | No |
propertyId | String | Name of the variable to be tested by the action condition. | Yes if the action has a condition. |
url | String | URL to be downloaded and processed. If this attribute is not specified, the value of the url attribute of the <page> tag is used. | No |
foreach | String | Executes this action for each value of the list variable specified by this attribute. | No |
variable | String | Defines the name of the variable that will receive the current foreach value. This variable is set before the action is executed. Ignored if this action is not a foreach one. | No |
index | String | Defines the name of the variable that will receive the current foreach index. The first index is 0. | No |
Child elements:
<page-config>
(section 3.16).
Set of parameters passed to a dynamic page.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | Identifier of this parameter set. | Yes if enclosed in a <catalog> tag, no if enclosed in a <page-config> tag. |
Child elements:
<parameter>
(section 3.22),
<include-parameter-set>
(section 3.23).
A parameter (name-value pair) sent to a dynamic page. Name and value are automatically encoded.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
id | String | Parameter name. | Yes. |
value | String | Parameter value. Interpreted attribute. | Yes, but can be empty. |
encoding | String | Character set used to encode the parameter name and value. | No. Default: ISO_8859_1. |
Child elements: none.
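The automatic encoding of parameter names and values can be illustrated with the standard java.net.URLEncoder class. This is a sketch of the general technique, not necessarily the call Sinon makes internally. The class ParameterEncoder is a hypothetical name, and the charset is written as ISO-8859-1, the canonical Java name for the default listed above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the Sinon API.
final class ParameterEncoder {
    private ParameterEncoder() {}

    /** Encodes one name-value pair as name=value using the given character set. */
    static String encode(String name, String value, String charset) {
        try {
            return URLEncoder.encode(name, charset) + "=" + URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            // Rethrown unchecked so callers are not forced to handle a bad charset name.
            throw new IllegalArgumentException("Unsupported character set: " + charset, e);
        }
    }
}
```

The character set matters for non-ASCII values: the letter é is encoded as %E9 in ISO-8859-1 but as %C3%A9 in UTF-8, so the encoding attribute should match what the target server expects.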
Includes all the parameters of some given parameter set in this one. If some parameter was already set and the included parameter set has a parameter with the same name, it is overwritten.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
refid | String | Parameter set identifier. | Yes. |
Child elements: none.
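The overwrite rule for included parameter sets behaves like merging two maps where the included values win. The sketch below models that rule with java.util.Map. The class ParameterSets and its include method are hypothetical names used only for illustration, and representing a parameter set as a map is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Sinon API.
final class ParameterSets {
    private ParameterSets() {}

    /**
     * Returns a new parameter set holding the parameters of current plus
     * those of included. A parameter present in both takes the included value.
     */
    static Map<String, String> include(Map<String, String> current, Map<String, String> included) {
        Map<String, String> merged = new LinkedHashMap<>(current);
        merged.putAll(included); // same-name parameters are overwritten
        return merged;
    }
}
```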
This tag declares data extractions that can be used in more than one page within a catalog. These global data extractions are invoked with the <do-data-extraction> tag (section 3.19).
Attributes: none.
Child element: <data-extraction> (section 3.18).
Declares a loop. The extraction steps enclosed by this tag are executed until the loop termination condition is satisfied.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
condition | notFails or before | Loop stop condition. The notFails condition executes the loop until some extraction step fails. The before condition executes the loop while the page current position is before the first occurrence of the string defined by the value attribute. | Yes |
value | String | String used as reference by the before loop condition. | Yes if condition is before, no otherwise. |
Child elements:
<data-extraction>
(section 3.18),
<jump>
(section 3.26).
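A loop with the notFails condition around a single data extraction can be pictured as the following Java sketch, which collects values into a list until a from or to search fails. It is an approximation for illustration only: LoopExtractor and extractAll are hypothetical names, values are always trimmed, and moving the current position to just after the to string on each pass is an assumption about how the position advances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Sinon API.
final class LoopExtractor {
    private LoopExtractor() {}

    /** Repeats a from/to extraction until one of the two searches fails. */
    static List<String> extractAll(String page, String from, String to) {
        List<String> values = new ArrayList<>();
        int position = 0; // the page current position
        while (true) {
            int fromIndex = page.indexOf(from, position);
            if (fromIndex < 0) {
                break; // extraction step failed: the notFails loop stops
            }
            int begin = fromIndex + from.length();
            int end = page.indexOf(to, begin);
            if (end < 0) {
                break; // same stop rule for the closing reference string
            }
            values.add(page.substring(begin, end).trim());
            // Assumption: the next pass starts right after the to string.
            position = end + to.length();
        }
        return values;
    }
}
```

Applied to a page such as <li>a</li><li> b </li> with from set to <li> and to set to </li>, the sketch collects the two values a and b and then stops because no further <li> is found.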
Declares a jump, a positioning step that moves the page current position to just after the first occurrence of the reference string. This string is enclosed in this tag and is interpreted.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
trim | boolean | Tells if the reference string must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim() | No. Default: true |
direction | forward or backwards | Defines the search direction. | No. Default: forward |
failIfFound | boolean | If false, this positioning step fails if the reference string is not found. If true, this positioning step fails if the reference string is found. | No. Default: false |
Child element:
<newline>
(section 3.29).
Examples:
<jump>Name:</jump>
<jump><![CDATA[<td class="name">]]></jump>
<jump>&lt;td class="name"&gt;</jump>
The last two examples have the same meaning but are written in two different ways: using a CDATA section and escaping the < (less than) and > (greater than) characters.
Works exactly like a
<jump>
,
but it marks the position of the beginning of the
extracted data as the first one after the
string reference occurrence.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
trim | boolean | Tells if the reference string must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim() | No. Default: true |
direction | forward or backwards | Defines the search direction. | No. Default: forward |
Child element:
<newline>
(section 3.29).
Works exactly like a
<jump>
,
but it marks the position of the end of the
extracted data as the first one before the
string reference occurrence.
Attributes:
Name | Type | Definition | Mandatory? |
---|---|---|---|
trim | boolean | Tells if the reference string must be trimmed (leading and trailing whitespace removed). The trimming is done using http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#trim() | No. Default: true |
direction | forward or backwards | Defines the search direction. | No. Default: forward |
Child element:
<newline>
(section 3.29).
Used in <jump>, <from>, and <to> to represent a newline character in a more convenient way. There are very few cases in which the use of <newline> is needed.
Attributes: none.
Child elements: none.
Example: the desired data, a name, found between a newline and a | character.
<data-extraction id="name" list="true" trim="true">
  <from trim="false"><newline/></from>
  <to>|</to>
</data-extraction>
This section describes only the classes, interfaces, and methods needed to create an application using Sinon. The complete Sinon API documentation is available as Javadoc.
Instances of eteg.sinon.executor.CollectorExecutor execute collectors. The following methods are the most useful for implementing listeners. The complete CollectorExecutor API is documented in its Javadoc.
Method | Description |
---|---|
CollectorExecutor(Collector collector) | Single constructor of this class. Receives the Collector instance to be executed. |
void execute() | Executes the collector passed as argument to the constructor. |
void execute(Catalog[] catalogs) | Executes the collector passed as argument to the constructor, but only the catalogs in the catalogs array. |
String[] getPropertyNames() | Returns the names of all variables set. |
Object getPropertyValue(String propertyName) | Returns the value of the variable whose name is propertyName. If the given variable is simple, this method returns a String. Otherwise, a List containing Strings is returned. Changes made to this list change the variable value. If there is no such variable, null is returned. This method throws an IllegalArgumentException if propertyName is null. |
void setProperty(String propertyId, String value) throws SinonException | Sets the value of a simple property. If this property already exists, the old value is replaced by the new one. Otherwise, the property is created with the given value. This method throws SinonException if the given property is multivalued. |
long getStartTime() | Returns the Unix timestamp at which the collector was started, or -1 if it has not started yet. |
long getFinishTime() | Returns the Unix timestamp at which the collector finished, or -1 if it has not finished yet. |
Log getLog() | Returns the org.apache.commons.logging.Log instance (from the Jakarta Commons Logging package) used to create an execution log. |
String getCurrentText() | Returns the last downloaded page as a String, or null if none has been downloaded yet. |
byte[] getCurrentBytes() | Returns the last downloaded document as a byte array, or null if none has been downloaded yet. Useful when a listener wants to handle binary documents downloaded by Sinon. |
void stop() | Stops the execution of this collector. |
boolean isStopped() | Returns true if this collector was stopped, false otherwise. |
In order to receive catalog processing events, an eteg.sinon.listener.CatalogStateListener interface implementation must be written. The following methods must be implemented:
Method | Description |
---|---|
public void onStateChange(Catalog catalog, CatalogState state, CollectorExecutor executor) | Method invoked when a catalog state change occurs. |
public static CatalogStateListener getInstance() | Method that returns the CatalogStateListener instance that will receive the catalog state change events. |
See also
Catalog
(section 2.3),
CatalogState
(section 4.4),
CollectorExecutor
(section 4.1).
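The listener contract above (an onStateChange method plus a static getInstance factory) can be sketched as follows. To keep the example compilable on its own, the Sinon types are replaced by small local stand-in interfaces that declare only the methods used here. In a real application the listener would import the Sinon classes instead, and it would need to be a public class so that Sinon can load it by name. The class SummaryCatalogListener and its lines method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins for the Sinon types named in this section, reduced to the
// methods used below. They are assumptions made so the sketch compiles alone.
interface Catalog { String getId(); }
interface CatalogState { boolean isAfterProcessing(); }
interface CollectorExecutor {
    String[] getPropertyNames();
    Object getPropertyValue(String propertyName);
}
interface CatalogStateListener {
    void onStateChange(Catalog catalog, CatalogState state, CollectorExecutor executor);
}

// Hypothetical listener that records every variable once a catalog finishes.
final class SummaryCatalogListener implements CatalogStateListener {
    private static final SummaryCatalogListener INSTANCE = new SummaryCatalogListener();
    private final List<String> lines = new ArrayList<>();

    private SummaryCatalogListener() {}

    // Static factory required by the listener contract described above.
    public static CatalogStateListener getInstance() {
        return INSTANCE;
    }

    @Override
    public void onStateChange(Catalog catalog, CatalogState state, CollectorExecutor executor) {
        if (!state.isAfterProcessing()) {
            return; // ignore the "before processing" notification
        }
        for (String name : executor.getPropertyNames()) {
            lines.add(catalog.getId() + "." + name + "=" + executor.getPropertyValue(name));
        }
    }

    /** Returns the recorded catalogId.variable=value lines. */
    List<String> lines() {
        return lines;
    }
}
```

Because Sinon does not store extracted data itself, a listener like this is the natural place to copy variable values into whatever storage the application uses.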
eteg.sinon.core.Catalog
instances represent
catalogs in Sinon. The only method
needed to implement an application that uses Sinon is
String getId()
. It returns the catalog
id
attribute.
eteg.sinon.core.CatalogState
is an enumeration class
that represents the possible catalog
processing states:
before processing
and
after processing
.
It has some methods:
Method | Description |
---|---|
boolean isBeforeProcessing() | Returns true if this catalog state is before processing and false otherwise. |
boolean isAfterProcessing() | Returns true if this catalog state is after processing and false otherwise. |
boolean equals(Object other) | Returns true if other is a CatalogState instance and if other represents the same catalog state as this object. |
String getName() | Returns the name of this state. |
In order to receive page processing events, an eteg.sinon.listener.PageStateListener interface implementation must be written. The following methods must be implemented:
Method | Description |
---|---|
public void onStateChange(Page page, PageState state, CatalogStateListener catalogStateListener, CollectorExecutor executor) | Method invoked when a page state change occurs. |
public static PageStateListener getInstance() | Method that returns the PageStateListener instance that will receive the page state change events. |
See also
Page
(section 4.7),
PageState
(section 4.6),
CatalogStateListener
(section 4.2),
CollectorExecutor
(section 4.1).
eteg.sinon.executor.PageState
is an enumeration
class that represents the possible page
processing states:
before download
, before processing
(page download done but no processing yet),
after processing
(page download and data extractions done, but
no page action executed), and finished
(page download, data extractions, and actions done).
It has some methods:
Method | Description |
---|---|
boolean isBeforeDownload() | Returns true if this page state is before download and false otherwise. |
boolean isBeforeProcessing() | Returns true if this page state is before processing and false otherwise. |
boolean isAfterProcessing() | Returns true if this page state is after processing and false otherwise. |
boolean isFinished() | Returns true if this page state is finished and false otherwise. |
boolean equals(Object other) | Returns true if other is a PageState instance and if other represents the same page state as this object. |
String getStateName() | Returns the name of this state. |
eteg.sinon.core.Page
instances
represent pages in Sinon. The only method
needed to implement an application that uses Sinon is
String getId()
. It returns the page
id
attribute.
In order to receive error events and react to them, an eteg.sinon.listener.ErrorListener interface implementation must be written. If no ErrorListener is used, any error causes Sinon to stop the execution of the collector. The following methods must be implemented:
Method | Description |
---|---|
public ErrorResponse onError(SinonException exception) | Handles an error that occurred during the execution of a collector and returns what action to take. |
public static ErrorListener getInstance() | Method that returns the ErrorListener instance that will receive the error events and handle them. |
See also
ErrorResponse
(section 4.9).
Enumeration class that represents the possible error responses: stop (stop execution when an error occurs), retry (retry the action that caused the error), and resume next (ignore the error and go ahead). It has some methods:
Method | Description |
---|---|
boolean isStop() | Returns true if this error response is stop and false otherwise. |
boolean isRetry() | Returns true if this error response is retry and false otherwise. |
boolean isResumeNext() | Returns true if this error response is resume next and false otherwise. |
boolean equals(Object other) | Returns true if other is an ErrorResponse instance and if other represents the same error response as this object. |
Think time implementations must implement the eteg.sinon.executor.ThinkTime interface and have a constructor without parameters. An instance of the ThinkTime implementation is created before the processing of catalogs and then used for the whole collector execution. Sinon provides eteg.sinon.util.ConstantThinkTime, a think time implementation whose nextThinkTime method always returns the same value. It is defined, in milliseconds, by the interval parameter.
The following methods must be implemented:
Method | Description |
---|---|
long nextThinkTime() | Method invoked before every page download. It must return the think time in milliseconds. |
void setConfiguration(Properties properties) | Method invoked immediately after the instance is created. Used for configuration purposes. properties contains the name-value pairs defined by the <property> tags enclosed in the <thinkTime> tag. |
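As an illustration of a custom think time, the sketch below returns a random pause between two bounds, which may resemble a human reader more closely than a constant interval. It is not shipped with Sinon. The class RandomThinkTime and its min and max parameters are hypothetical, and the ThinkTime interface declared here is a local stand-in that mirrors the two methods listed above so the example compiles on its own.

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;

// Local stand-in for eteg.sinon.executor.ThinkTime, mirroring the two
// methods described in this section. A real implementation would import
// the Sinon interface instead.
interface ThinkTime {
    long nextThinkTime();
    void setConfiguration(Properties properties);
}

// Hypothetical implementation: a random think time between min and max.
final class RandomThinkTime implements ThinkTime {
    private long min = 1000L; // default lower bound in milliseconds (assumption)
    private long max = 5000L; // default upper bound in milliseconds (assumption)

    // Constructor without parameters, as required of think time implementations.
    public RandomThinkTime() {}

    @Override
    public void setConfiguration(Properties properties) {
        long newMin = Long.parseLong(properties.getProperty("min", "1000"));
        long newMax = Long.parseLong(properties.getProperty("max", "5000"));
        if (newMin < 0 || newMax < newMin) {
            throw new IllegalArgumentException("Expected 0 <= min <= max");
        }
        min = newMin;
        max = newMax;
    }

    @Override
    public long nextThinkTime() {
        // nextLong's upper bound is exclusive, so add 1 to make max inclusive.
        return ThreadLocalRandom.current().nextLong(min, max + 1);
    }
}
```

Under these assumptions, such a class would be configured with a <thinkTime> tag whose class attribute names it and whose nested <property> tags supply min and max.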