Pagescraper configuration file

Configuring PageScraper

The configuration file is a JSON array of JSON objects, each of which follows the schema described below.

Each object represents a single value to scrape from every URL that is run through the executable. We call these objects selections. Each selection is consulted during scraping and generates a single cell of output: the scraped value appears under the column bearing the selection's name, in the row of the URL it was scraped from.

Each selection has the following key-value pairs:

Name	Type	Description
`name`	String	The name of the selection. This is used in output in the header row.
`css_selector`	String	A CSS selector used to obtain a list of DOM elements to process.
`xpath_selector`	String	An XPath selector used to obtain a list of DOM elements to process.
`tests`	Array of String	Tests used to include or exclude found DOM elements for further processing.
`sub_selection`	Object of type selection	A full JSON object of this same type to process nested DOM elements.
`to_value`	String	A Node To Value expression used to convert found DOM elements to strings for output.
`filters`	Array of String	Filters used to transform converted node values into the values that become output.
`aggregate`	String	An aggregate expression used to consolidate multiple found DOM elements into a single output value.
`default`	String	A default value for output that is used if there are no DOM elements found for this selection.
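
To make the schema above concrete, here is a minimal Go sketch of a struct that the config could be decoded into. The type name Selection, the function parseConfig, and the sample config values are hypothetical illustrations; only the JSON key names come from the table, and this is not necessarily how pagescraper itself models its config.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Selection is a hypothetical Go mirror of one config object.
// Only the JSON key names are taken from the table above.
type Selection struct {
	Name          string     `json:"name"`
	CSSSelector   string     `json:"css_selector,omitempty"`
	XPathSelector string     `json:"xpath_selector,omitempty"`
	Tests         []string   `json:"tests,omitempty"`
	SubSelection  *Selection `json:"sub_selection,omitempty"`
	ToValue       string     `json:"to_value,omitempty"`
	Filters       []string   `json:"filters,omitempty"`
	Aggregate     string     `json:"aggregate,omitempty"`
	Default       string     `json:"default,omitempty"`
}

// parseConfig decodes the body of a config file (a JSON array) into selections.
func parseConfig(data []byte) ([]Selection, error) {
	var sels []Selection
	if err := json.Unmarshal(data, &sels); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	return sels, nil
}

func main() {
	// Illustrative config with one selection that uses several keys at once.
	cfg := []byte(`[
	  {
	    "name": "sign_in_links",
	    "css_selector": "a",
	    "tests": ["contains_text:Sign In"],
	    "to_value": "attribute:href",
	    "aggregate": "concat:|",
	    "default": "none"
	  }
	]`)
	sels, err := parseConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(sels[0].Name, sels[0].Tests, sels[0].Aggregate)
}
```

Keys that are left out of the config simply decode to empty values, which matches the optional nature of most of the keys.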

General Selection Operation

Each selection is provided with the root DOM node for each URL as input. A selection's job is to reduce that root node to a single string of output that will show up in the pagescraper executable's output.

The name of the selection is used in the output: each name is the name of a column in the output header row. If a selection has sub selections, the sub-selection names are ignored, i.e. only top-level selection names in the config appear in the output.

The css_selector and xpath_selector values are used to find all nodes from each page that match each selector. If both are provided, then the css_selector is used and the xpath_selector is ignored.

All nodes found by the selector above are then passed through every test, in the order in which the tests are defined. Only those nodes that pass all tests continue on to the next step.

Here there are two possible paths for nodes. If a sub_selection is supplied, then this process recurses with the found nodes and the outputs from the sub-selection nodes are passed to the aggregate step. If no sub_selection is supplied, then the current nodes continue on to the step below.

Each node that passes all tests from above is then "converted" into a string value using the to_value expression. If no to_value expression is provided for a selection, then text is used as a default. All nodes that pass tests are kept in an array, and each "converted string" is kept in a second array, so that each found node and its associated value are stored together as a unit. This unit is finally passed to aggregation and default processing to obtain a final output value.

All of the nodes and their values are passed to the selection's aggregator, which produces a single string value. If no aggregator is supplied, then concat: is used as a default.

If the above single string value is empty, then the selection's default value is used as output. If the default value for a selection is not defined, then an empty string is used.

This final value is what shows up in the output of the executable.

As an example, here is a config that results in the number of a tags in the DOM, a TRUE or FALSE value determining whether or not at least one .btn element is found in the DOM, and a pipe-delimited list of all script sources.

[
  {
    "name": "number_of_a_tags",
    "css_selector": "a",
    "aggregate": "count"
  },
  {
    "name": "dom_contains_class_btn",
    "css_selector": ".btn",
    "default": "FALSE",
    "aggregate": "value:TRUE"
  },
  {
    "name": "pipe_delim_sources",
    "xpath_selector": "//head/script",
    "to_value": "attribute:src",
    "aggregate": "concat:|"
  }
]
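
The fallbacks in the algorithm above (css_selector wins over xpath_selector, text is the default to_value, concat: is the default aggregate, and default replaces an empty result) can be sketched in Go as follows. The helper names pickSelector and orDefault are hypothetical and are not taken from pagescraper's source; this is only an illustration of the documented rules.

```go
package main

import "fmt"

// pickSelector mirrors the documented precedence: when both selectors are
// set, the CSS selector is used and the XPath selector is ignored.
func pickSelector(css, xpath string) (kind, expr string) {
	if css != "" {
		return "css", css
	}
	if xpath != "" {
		return "xpath", xpath
	}
	return "", ""
}

// orDefault returns fallback when v is the empty string, otherwise v.
func orDefault(v, fallback string) string {
	if v == "" {
		return fallback
	}
	return v
}

func main() {
	kind, expr := pickSelector("a", "//a")
	fmt.Println(kind, expr) // css a

	fmt.Println(orDefault("", "text"))    // to_value falls back to text
	fmt.Println(orDefault("", "concat:")) // aggregate falls back to concat:

	// dom_contains_class_btn from the config above: with no .btn elements,
	// value:TRUE aggregates to "", so the default FALSE becomes the output.
	fmt.Println(orDefault("", "FALSE"))
	fmt.Println(orDefault("TRUE", "FALSE"))
}
```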

CSS Selectors

CSS Selectors are parsed and evaluated using this library.

CSS Selectors have higher precedence than XPath Selectors, so if both are defined in a selection, then only the css selector will be used.

All nodes in a DOM tree that match the css selector will pass through the algorithm described above.

XPath Selectors

XPath Selectors are parsed and evaluated using this library.

It is important to note that XPath selectors are used to find nodes within a DOM tree. XPath selectors that do not result in DOM elements are allowed in config but produce no output. For instance, the selector //a/@href selects the href attribute of a tags, which is a string rather than an actual DOM element, so the resulting output for that selection would be the empty string.

XPath Selectors have a lower precedence than css selectors, so if both are defined in a selection, then only the css selector will be used.

All nodes in a DOM tree that match the xpath selector will pass through the algorithm described above.

Tests

Tests are used to reduce an input list of matched nodes to only those nodes that pass each test.

Test expressions are NameArgumentExpressions.

Following are the available tests.

Contains Text

The test name is contains_text. It has a single required argument that is the substring (case-sensitive) to look for in an element's text.

E.g. contains_text:Sign In passes elements whose display text contains Sign In.
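
As a rough illustration of the contains_text semantics, the Go sketch below applies a case-sensitive substring check to a list of strings that stand in for the text of matched elements. The function names containsText and keepMatching are hypothetical, and working on plain strings instead of DOM nodes is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// containsText reports whether text contains substr, case-sensitively.
func containsText(text, substr string) bool {
	return strings.Contains(text, substr)
}

// keepMatching returns, in input order, the texts that pass the test.
// Each string stands in for the text of one matched element.
func keepMatching(texts []string, substr string) []string {
	var kept []string
	for _, t := range texts {
		if containsText(t, substr) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	texts := []string{"Sign In", "sign in", "Please Sign In here", "Register"}
	// "sign in" is dropped because the match is case-sensitive.
	fmt.Println(keepMatching(texts, "Sign In"))
}
```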

Sub Selections

A sub selection is a selection that generates output from elements nested inside an element already matched by its parent selection.

For instance, the outer selection could match a modal and run some tests on it. A sub selection would then be used to get the output of a single button on the modal.

Node To Value

Node to value expressions are used to convert DOM elements into strings for further filtering and output.

Node to Value expressions are NameArgumentExpressions.

Following are the available to_value options.

Inner HTML

The to_value name is inner_html. There are no arguments. It returns the rendered inner html of found elements.

E.g. inner_html.

Outer HTML

The to_value name is outer_html. There are no arguments. It returns the rendered outer html of found elements.

E.g. outer_html.

Text

The to_value name is text. There are no arguments. It returns the text of found elements. This differs from inner_html and outer_html in that only text seen on the page is returned instead of a rendered html fragment.

E.g. text.

Attribute

The to_value name is attribute. There is one argument that is the name of the attribute to return. It returns the value of an attribute of found elements.

E.g. attribute:href of

<a href="https://www.google.com">Go to Google</a>

results in https://www.google.com.

Filters

Filters are used to transform found node values into different values that become output.

Filter expressions are NameArgumentExpressions.

Following are the available filter options.

After

The filter name is after. There is one argument that is the string to search for. It returns the substring that exists after the argument. If the argument is not found in the input value, then the original input value is returned.

E.g. after:Bar of FooBarBaz returns Baz.

E.g. after:DNE of One two three returns One two three.

Before

The filter name is before. There is one argument that is the string to search for. It returns the substring that exists before the argument. If the argument is not found in the input value, then the original input value is returned.

E.g. before:Bar of FooBarBaz returns Foo.

E.g. before:DNE of One two three returns One two three.
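
The after and before filters can be sketched in Go as below, reproducing the examples given for each. The function names are hypothetical, and splitting at the first occurrence of the argument is an assumption: the behavior when the argument occurs more than once is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// after returns the part of input following the first occurrence of arg,
// or input unchanged when arg is not found.
// Assumption: the first occurrence is the one that counts.
func after(input, arg string) string {
	i := strings.Index(input, arg)
	if i < 0 {
		return input
	}
	return input[i+len(arg):]
}

// before returns the part of input preceding the first occurrence of arg,
// or input unchanged when arg is not found.
func before(input, arg string) string {
	i := strings.Index(input, arg)
	if i < 0 {
		return input
	}
	return input[:i]
}

func main() {
	fmt.Println(after("FooBarBaz", "Bar"))      // Baz
	fmt.Println(after("One two three", "DNE"))  // One two three
	fmt.Println(before("FooBarBaz", "Bar"))     // Foo
	fmt.Println(before("One two three", "DNE")) // One two three
}
```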

Aggregates

Aggregates are used to reduce a list of found and tested nodes and their associated values into a single string value for output.

Aggregate expressions are NameArgumentExpressions.

Following are the available aggregate options.

Value

The aggregate name is value. There is one argument that is the value to return if nodes exist. If the number of input nodes is zero, then the empty string is returned; otherwise, the argument is returned. The argument may be omitted, in which case the empty string is always returned.

E.g. value:True with actual nodes returns True.

Count

The aggregate name is count. There are no arguments. It returns the number of input nodes.

E.g. count with five input nodes returns 5.

E.g. count with no input nodes returns 0.

Concatenate

The aggregate name is concat. There is one argument that is the string around which to concatenate all input values. It returns strings.Join(inputValues, argument) in Go terms.

E.g. concat:~ with the values ["one", "two", "three"] returns one~two~three.
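
The three aggregates can be sketched in Go as follows. The function names aggValue, aggCount, and aggConcat are hypothetical, and the input is simplified to the list of converted string values, one per node; this is an illustration of the documented behavior, not pagescraper's implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// aggValue returns its argument when at least one input exists.
// With no inputs, or with no argument, it returns the empty string.
func aggValue(values []string, args ...string) string {
	if len(values) == 0 || len(args) == 0 {
		return ""
	}
	return args[0]
}

// aggCount returns the number of inputs as a decimal string.
func aggCount(values []string) string {
	return strconv.Itoa(len(values))
}

// aggConcat joins the input values with sep between them.
func aggConcat(values []string, sep string) string {
	return strings.Join(values, sep)
}

func main() {
	srcs := []string{"a.js", "b.js"} // illustrative script sources

	fmt.Println(aggValue(srcs, "TRUE")) // TRUE
	fmt.Println(aggCount(srcs))         // 2
	fmt.Println(aggCount(nil))          // 0
	fmt.Println(aggConcat(srcs, "|"))   // a.js|b.js
}
```

Because value returns the empty string when there are no nodes, pairing it with a default is what turns it into a TRUE or FALSE style flag.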

Default Values

Default values are optional strings that become the final output for a selection if all prior steps in the selection algorithm result in an empty string.

Name and Args Expressions

Name and argument expressions are string expressions that are used in config to indicate to pagescraper what filters, tests, etc. to use.

These expressions are of the following form:

Begin with a name that does not contain a colon :.

Optionally followed by an argument list.

The argument list begins with a colon and is followed by a properly formatted CSV record. Each value in the CSV record is an individual argument.
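
The form described above can be parsed with Go's standard encoding/csv package, as in the sketch below. The function name parseExpr is hypothetical, and treating a missing or empty argument list (as in count or concat:) as zero arguments is an assumption rather than documented pagescraper behavior.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseExpr splits an expression into its name and its arguments.
// The name is everything before the first colon; the remainder is read
// as a single CSV record.
// Assumption: a missing or empty argument list yields zero arguments.
func parseExpr(expr string) (name string, args []string, err error) {
	name, rest, found := strings.Cut(expr, ":")
	if !found || rest == "" {
		return name, nil, nil
	}
	args, err = csv.NewReader(strings.NewReader(rest)).Read()
	if err != nil {
		return "", nil, fmt.Errorf("parse arguments of %q: %w", name, err)
	}
	return name, args, nil
}

func main() {
	for _, expr := range []string{
		"count",
		"contains_text:Sign In",
		"concat:|",
		`after:"a,b"`, // CSV quoting keeps the comma inside one argument
	} {
		name, args, err := parseExpr(expr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s %q\n", name, args)
	}
}
```

Splitting on the first colon only means later colons stay in the arguments, so an expression such as after:https:// passes https:// as its single argument.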