Pagescraper configuration file
Configuring PageScraper
The configuration file is a JSON array of JSON objects that follow a certain schema.
Each object represents a single value to scrape from every URL that is run through the executable. We call each of these objects a selection. Every selection is consulted during scraping and generates a single cell of output: the scraped value appears under the column bearing the selection's name, in the row of the URL it was scraped from.
Each selection has the following key-value pairs:
| Name | Type | Description |
|---|---|---|
| `name` | String | The name of the selection. This is used in the output header row. |
| `css_selector` | String | A CSS selector used to obtain a list of DOM elements to process. |
| `xpath_selector` | String | An XPath selector used to obtain a list of DOM elements to process. |
| | Array of String | Tests used to include or exclude found DOM elements for further processing. |
| `sub_selection` | Object of type selection | A full JSON object of this same type to process nested DOM elements. |
| `to_value` | String | A Node To Value expression used to convert found DOM elements to strings for output. |
| | Array of String | Filter expressions used to transform found node values before output. |
| | String | An aggregate expression used to consolidate multiple found DOM elements into a single output value. |
| | String | A default value for output that is used if there are no DOM elements found for this selection. |
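As a rough illustration of the schema above, the following Go sketch decodes a config into a struct. The JSON keys `name`, `css_selector`, `xpath_selector`, `sub_selection` and `to_value` come from this document; the keys `tests`, `filters`, `aggregate` and `default`, as well as the `Selection` type and `ParseConfig` function, are assumptions made for the example and may not match what pagescraper actually uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Selection mirrors one object in the configuration array.
// ASSUMPTION: the keys "tests", "filters", "aggregate" and "default"
// are guesses; only the other key names appear in the documentation.
type Selection struct {
	Name          string     `json:"name"`
	CSSSelector   string     `json:"css_selector,omitempty"`
	XPathSelector string     `json:"xpath_selector,omitempty"`
	Tests         []string   `json:"tests,omitempty"`
	SubSelection  *Selection `json:"sub_selection,omitempty"`
	ToValue       string     `json:"to_value,omitempty"`
	Filters       []string   `json:"filters,omitempty"`
	Aggregate     string     `json:"aggregate,omitempty"`
	Default       string     `json:"default,omitempty"`
}

// ParseConfig decodes a JSON array of selections.
func ParseConfig(data []byte) ([]Selection, error) {
	var sels []Selection
	if err := json.Unmarshal(data, &sels); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	return sels, nil
}

func main() {
	// Hypothetical config with a single selection.
	cfg := []byte(`[{"name": "Links", "css_selector": "a", "aggregate": "count"}]`)
	sels, err := ParseConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(sels[0].Name, sels[0].CSSSelector, sels[0].Aggregate)
}
```

The nested `*Selection` pointer reflects that a sub selection is a full object of the same type and may be absent.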
General Selection Operation
Each selection is provided with the root DOM node for each URL as input. A selection's job is to reduce that root node to a single string of output that will show up in the pagescraper executable's output.
The `name` of the selection is used in the output. Each name is the name of a column in the output header row. If sub selections are provided for a selection, those sub-selection names are ignored, i.e. only top-level selection names in config are used in the output.
The `css_selector` and `xpath_selector` values are used to find all nodes from each page that match each selector. If both are provided, then the `css_selector` is used and the `xpath_selector` is ignored.
All nodes found by one of the above selectors are then passed through every test, in the order that the tests are defined. Only those nodes that pass all tests continue on to the next step.
Here there are two possible paths for nodes. If a `sub_selection` is supplied, then this process recurses with the found nodes and the outputs from the sub-selection nodes are passed to the aggregate step. If no `sub_selection` is supplied, then the current nodes continue on to the step below.
Each node that passes all tests from above is then "converted" into a string value using the `to_value` expression. If no `to_value` expression is provided for a selection, then `text` is used as a default. The nodes that pass the tests are kept in one array and their converted strings in another. Together, the found nodes and their associated values form a unit. This unit is finally passed to aggregation and default processing to obtain a final output value.
Each found node is passed to the aggregator for the selection. If no aggregator is supplied, then `concat:` is used as a default. All of the nodes and their values are passed to the aggregator, which produces a single string value.
If the above single string value is empty, then the selection's default value is used as output. If the default value for a selection is not defined, then an empty string is used.
This final value is what shows up in the output of the executable.
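The flow described above can be sketched in Go roughly as follows. This is not pagescraper's actual code: `Node` is a hypothetical stand-in for a DOM element, `Reduce` is an invented name, and sub-selection recursion is left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for a DOM element; only its text is modeled.
type Node struct {
	Text string
}

// Reduce sketches the per-selection flow: test, convert, aggregate, default.
func Reduce(nodes []Node, tests []func(Node) bool, toValue func(Node) string,
	aggregate func(values []string) string, def string) string {
	var values []string
	for _, n := range nodes {
		passed := true
		for _, test := range tests { // tests run in the order they are defined
			if !test(n) {
				passed = false
				break
			}
		}
		if passed {
			values = append(values, toValue(n))
		}
	}
	out := aggregate(values)
	if out == "" {
		return def // the default applies only when the aggregate is empty
	}
	return out
}

func main() {
	nodes := []Node{{"Sign In"}, {"Help"}, {"Sign Out"}}
	hasSign := func(n Node) bool { return strings.Contains(n.Text, "Sign") }
	text := func(n Node) string { return n.Text }
	join := func(vs []string) string { return strings.Join(vs, "|") }

	fmt.Println(Reduce(nodes, []func(Node) bool{hasSign}, text, join, "none")) // Sign In|Sign Out
	fmt.Println(Reduce(nil, nil, text, join, "none"))                          // none
}
```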
As an example, here is a config that results in the number of `a` tags in the DOM, a `TRUE` or `FALSE` value indicating whether at least one `.btn` element is found in the DOM, and a pipe-delimited list of all script sources.
CODE
|
CSS Selectors
CSS Selectors are parsed and evaluated using this library.
CSS Selectors have higher precedence than XPath Selectors, so if both are defined in a selection, then only the css selector will be used.
All nodes in a DOM tree that match the css selector will pass through the algorithm described above.
XPath Selectors
XPath Selectors are parsed and evaluated using this library.
It is important to note that XPath selectors are used to find nodes within a DOM tree. XPath selectors that do not result in nodes are allowed in config but produce no output. For instance, the selector `a@href` selects the `href` attribute of an `a` tag, which is a string rather than an actual DOM element, so the resulting output for that selection would be the empty string.
XPath Selectors have a lower precedence than css selectors, so if both are defined in a selection, then only the css selector will be used.
All nodes in a DOM tree that match the xpath selector will pass through the algorithm described above.
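The precedence rule between the two selector types can be illustrated with a small Go sketch. The function name `chooseSelector` and the `"css"` / `"xpath"` labels are made up for this example.

```go
package main

import "fmt"

// chooseSelector returns which selector kind applies and its expression.
// The CSS selector wins whenever it is set; XPath is used only as a fallback.
func chooseSelector(css, xpath string) (kind, expr string) {
	switch {
	case css != "":
		return "css", css
	case xpath != "":
		return "xpath", xpath
	default:
		return "", ""
	}
}

func main() {
	kind, expr := chooseSelector("a.btn", "//a")
	fmt.Println(kind, expr) // css a.btn

	kind, expr = chooseSelector("", "//script")
	fmt.Println(kind, expr) // xpath //script
}
```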
Tests
Tests are used to reduce an input list of matched nodes to only those nodes that pass each test.
Test expressions are NameArgumentExpressions.
Following are the available tests.
Contains Text
The test name is `contains_text`. It has a single required argument that is the substring (case-sensitive) to look for in an element's text.
E.g. `contains_text:Sign In` passes elements whose display text contains `Sign In`.
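A minimal Go sketch of how such a test might behave is shown below. It operates on plain strings rather than DOM elements, and the helper names `containsText` and `filterByText` are assumptions for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// containsText reports whether text includes substr (case-sensitive).
func containsText(text, substr string) bool {
	return strings.Contains(text, substr)
}

// filterByText keeps only the element texts that pass the test.
func filterByText(texts []string, substr string) []string {
	var kept []string
	for _, text := range texts {
		if containsText(text, substr) {
			kept = append(kept, text)
		}
	}
	return kept
}

func main() {
	texts := []string{"Sign In", "sign in", "Please Sign In here"}
	// "sign in" is dropped because the match is case-sensitive.
	fmt.Println(filterByText(texts, "Sign In"))
}
```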
Sub Selections
Sub Selections are a type of selection that is used to generate output from a nested element from one already matched.
For instance, the outer selection could match a modal and do some tests on the modal. A subselection would then be used to get the output of a single button on the modal.
Node To Value
Node To Value expressions are used to convert DOM elements into strings for more filtering and output.
Node to Value expressions are NameArgumentExpressions.
Following are the available `to_value` options.
Inner HTML
The `to_value` name is `inner_html`. There are no arguments. It returns the rendered inner HTML of found elements.
E.g. `inner_html`.
Outer HTML
The `to_value` name is `outer_html`. There are no arguments. It returns the rendered outer HTML of found elements.
E.g. `outer_html`.
Text
The `to_value` name is `text`. There are no arguments. It returns the text of found elements. This differs from `inner_html` and `outer_html` in that only text seen on the page is returned instead of a rendered HTML fragment.
E.g. `text`.
Attribute
The `to_value` name is `attribute`. There is one argument that is the name of the attribute to return. It returns the value of an attribute of found elements.
E.g. `attribute:href` of
CODE
|
results in `https://www.google.com`.
Filters
Filters are used to transform found node values into different values that become output.
Filter expressions are NameArgumentExpressions.
Following are the available `filter` options.
After
The filter name is `after`. There is one argument that is the string to search for. It returns the substring that exists after the argument. If the argument is not found in the input value, then the original input value is returned.
E.g. `after:Bar` of `FooBarBaz` returns `Baz`.
E.g. `after:DNE` of `One two three` returns `One two three`.
Before
The filter name is `before`. There is one argument that is the string to search for. It returns the substring that exists before the argument. If the argument is not found in the input value, then the original input value is returned.
E.g. `before:Bar` of `FooBarBaz` returns `Foo`.
E.g. `before:DNE` of `One two three` returns `One two three`.
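The two filters can be sketched in Go with `strings.Cut`. The function names `After` and `Before` are chosen for the example. Splitting around the first occurrence of the argument is an assumption, since this document does not say which occurrence is used when the argument appears more than once.

```go
package main

import (
	"fmt"
	"strings"
)

// After returns the part of input following arg, or input unchanged
// when arg does not occur. ASSUMPTION: the first occurrence is used.
func After(input, arg string) string {
	_, after, found := strings.Cut(input, arg)
	if !found {
		return input
	}
	return after
}

// Before returns the part of input preceding arg, or input unchanged
// when arg does not occur. ASSUMPTION: the first occurrence is used.
func Before(input, arg string) string {
	before, _, found := strings.Cut(input, arg)
	if !found {
		return input
	}
	return before
}

func main() {
	fmt.Println(After("FooBarBaz", "Bar"))      // Baz
	fmt.Println(After("One two three", "DNE"))  // One two three
	fmt.Println(Before("FooBarBaz", "Bar"))     // Foo
	fmt.Println(Before("One two three", "DNE")) // One two three
}
```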
Aggregates
Aggregates are used to reduce a list of found and tested nodes and their associated values into a single string value for output.
Aggregate expressions are NameArgumentExpressions.
Following are the available `aggregate` options.
Value
The aggregate name is `value`. There is one argument that is the value to return if nodes exist. If the number of input nodes is zero, then the empty string is returned; otherwise, the argument is returned. You may omit the argument, in which case the empty string is always returned.
E.g. `value:True` with actual nodes returns `True`.
Count
The aggregate name is `count`. There are no arguments. It returns the number of input nodes.
E.g. `count` with five input nodes returns `5`.
E.g. `count` with no input nodes returns `0`.
Concatenate
The aggregate name is `concat`. There is one argument that is the separator placed between the input values. It returns `strings.Join(inputValues, argument)` in Go terms.
E.g. `concat:~` with the values ["one", "two", "three"] returns `one~two~three`.
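The three aggregates can be sketched in Go as small functions. The names `Value`, `Count` and `Concat` and their signatures are assumptions for illustration; only the use of `strings.Join` for `concat` is stated in this document.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Value returns arg when at least one node exists, otherwise "".
func Value(nodeCount int, arg string) string {
	if nodeCount == 0 {
		return ""
	}
	return arg
}

// Count returns the number of input nodes as a string.
func Count(nodeCount int) string {
	return strconv.Itoa(nodeCount)
}

// Concat joins the input values with sep between them.
func Concat(values []string, sep string) string {
	return strings.Join(values, sep)
}

func main() {
	values := []string{"one", "two", "three"}
	fmt.Println(Value(len(values), "True")) // True
	fmt.Println(Count(len(values)))         // 3
	fmt.Println(Concat(values, "~"))        // one~two~three
	fmt.Println(Concat(values, ""))         // onetwothree
}
```

Because `strings.Join` only places the separator between elements, an empty separator yields plain concatenation.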
Default Values
Default values are optional strings that become the final output for a selection if all prior steps in the selection algorithm result in an empty string.
Name and Args Expressions
Name and argument expressions are string expressions that are used in config to indicate to pagescraper what filters, tests, etc. to use.
These expressions are of the following form:
Begin with a name that does not contain a colon (`:`).
Optionally followed by an argument list.
The argument list begins with a colon and is followed by a properly formatted CSV record. Each value in the CSV record is an individual argument.
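One way to parse expressions of this form is sketched below in Go, using the standard `encoding/csv` package for the argument list. `ParseExpr` is a hypothetical helper, not pagescraper's own parser. Treating an empty argument list (as in `concat:`) as having no arguments is an assumption.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// ParseExpr splits "name:arg1,arg2" into its name and arguments.
func ParseExpr(expr string) (string, []string, error) {
	// The name is everything before the first colon.
	name, rest, hasArgs := strings.Cut(expr, ":")
	if name == "" {
		return "", nil, fmt.Errorf("expression %q has no name", expr)
	}
	// ASSUMPTION: a missing or empty argument list means no arguments.
	if !hasArgs || rest == "" {
		return name, nil, nil
	}
	// The argument list is a single CSV record.
	args, err := csv.NewReader(strings.NewReader(rest)).Read()
	if err != nil {
		return "", nil, fmt.Errorf("expression %q: %w", expr, err)
	}
	return name, args, nil
}

func main() {
	for _, expr := range []string{"text", "contains_text:Sign In", `concat:","`} {
		name, args, err := ParseExpr(expr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s %q\n", name, args)
	}
}
```

Relying on CSV quoting means an argument that itself contains a comma can be written in double quotes, as in the last expression.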