PageScraper App

PageScraper Apps pull status codes and redirect URLs for a list of URLs. This is useful for updating keyword URLs so that the URLs in accounts are not broken and do not pass through unnecessary redirects.

Configuration

The following are the configuration options for a PageScraper App and what they mean.

Relation Name

A relation is a table or view in a database. In the Datawarehouse, a relation is either a datasource's table or the published view of a report.

Currently, Redshift relations are supported as the source of data.

This value is required.

URL Column

This is the name of the column that holds the URLs you want to scrape. Only a single column can be specified. If there are multiple columns with URLs, create a separate App for each additional column.

The index of the column can also be used to specify which column contains the URLs. For instance, for the columns (campaign_id, url) you could specify either "url" or "1", where 1 is the zero-based index of the desired column.

When Alli Marketplace queries Redshift, Redshift returns the column names in lowercase. Alli Marketplace compares the column name returned by Redshift with the configuration input using an exact match, so it is recommended to enter the column name in lowercase in the configuration.
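
The name-or-index lookup described above can be sketched as follows. This is only an illustration, not Alli Marketplace's actual code: the function name `resolve_url_column` and its error messages are hypothetical, and it assumes a value made up only of digits is treated as an index.

```python
from typing import Sequence


def resolve_url_column(columns: Sequence[str], setting: str) -> int:
    """Return the zero-based position of the URL column.

    `setting` is either a column name (exact, case-sensitive match) or a
    zero-based index written as digits. Hypothetical helper for illustration.
    """
    setting = setting.strip()
    # Assumption: an all-digit value is interpreted as an index, not a name.
    if setting.isascii() and setting.isdigit():
        index = int(setting)
        if index >= len(columns):
            raise ValueError(f"column index {index} is out of range")
        return index
    # Exact match: "URL" will not match a column that Redshift returned as "url".
    if setting in columns:
        return list(columns).index(setting)
    raise ValueError(f"column {setting!r} not found in {list(columns)}")


if __name__ == "__main__":
    columns = ["campaign_id", "url"]  # names as returned in lowercase
    print(resolve_url_column(columns, "url"))
    print(resolve_url_column(columns, "1"))
```

With an exact match, an uppercase entry such as "URL" would raise an error in this sketch, which is why the lowercase form is the safer choice.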

Alli Marketplace requires that each URL contains a valid scheme. For example, "https://www.google.com" is valid while "www.google.com" is not.
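
To check a list of URLs before running the App, a scheme test along the following lines may help. It is a minimal sketch: the helper name `has_valid_scheme` is hypothetical, and the set of accepted schemes is an assumption that should be adjusted to whatever the App actually accepts.

```python
from urllib.parse import urlsplit

# Assumption: these are the accepted schemes; the App's real list may differ.
ASSUMED_SCHEMES = ("http", "https")


def has_valid_scheme(url: str, allowed=ASSUMED_SCHEMES) -> bool:
    """Return True when the URL has an allowed scheme and a host part."""
    parts = urlsplit(url.strip())
    # A value like "www.google.com" parses with an empty scheme and host.
    return parts.scheme.lower() in allowed and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in ("https://www.google.com", "www.google.com"):
        print(candidate, has_valid_scheme(candidate))
```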

User Agent

A user agent is a string value sent in requests over the web in order for services to determine what kind of software is making the request. For instance, Google Chrome running on Linux, and the Safari mobile browser are going to use different User Agents so the servers handling those requests know that information.

User Agents can be used for security filtering or for detecting robots. In most cases you will not need to change this option.

This option is provided in case your clients require bots to identify themselves with a specific User Agent. Leaving it blank uses a default value that will rarely need to change.
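
As an illustration of what the option controls, the sketch below builds an HTTP request with a custom User-Agent header using Python's standard library. The default string and the function name `build_request` are made up for this example; the App's real default value is not documented here.

```python
from urllib.request import Request

# Hypothetical default; the App's actual default value is not shown here.
DEFAULT_USER_AGENT = "example-pagescraper/1.0"


def build_request(url: str, user_agent: str = "") -> Request:
    """Build a request, falling back to the default when the option is blank."""
    agent = user_agent.strip() or DEFAULT_USER_AGENT
    return Request(url, headers={"User-Agent": agent})


if __name__ == "__main__":
    request = build_request("https://example.com", "client-approved-bot/2.0")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```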

Rate Limit

The rate limit is an optional field to specify the minimum amount of time between requests. This is provided to ensure Alli Marketplace does not overload sites that have requested a lower rate.

See here for specifications of the duration. Feel free to reach out to the dev team for questions.
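
The idea of a minimum time between requests can be sketched as below. This is an assumed illustration of the general technique, not the App's implementation: the class name `MinIntervalLimiter` is hypothetical, and the interval is given in plain seconds rather than in the duration format referenced above.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between consecutive calls."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    limiter = MinIntervalLimiter(0.2)
    for url in ("https://example.com/a", "https://example.com/b"):
        limiter.wait()  # the request for `url` would be sent after this returns
        print("requesting", url)
```

The first call returns immediately; later calls sleep only for the part of the interval that has not already elapsed.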

Scrape Selections File

This is a JSON file whose top-level value is an array and whose contents conform to the PageScraper configuration used by the App. Here is the reference for the configuration file.

If you don't have access to the link above, a copy of it is shown here.

Output

The output of a PageScraper execution shows all errors that occurred while scraping. Under normal operation there should be no error output, even when status codes indicate that a page failed.

A single page failure does not cause the whole Execution to fail.

You must examine the rest of the output to identify page failures.
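
One way to review downloaded results for page failures is sketched below. The row layout is an assumption: the field names `url` and `status_code` are hypothetical and should be replaced with the actual column names in the output.

```python
from typing import Iterable, List, Mapping


def classify_status(status_code: int) -> str:
    """Group an HTTP status code into a coarse category."""
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if 400 <= status_code < 600:
        return "failure"
    return "other"


def find_page_failures(rows: Iterable[Mapping]) -> List[Mapping]:
    """Return the rows whose status code indicates a client or server error."""
    # Assumption: each row has a "status_code" field holding an integer-like value.
    return [row for row in rows if classify_status(int(row["status_code"])) == "failure"]


if __name__ == "__main__":
    sample_rows = [  # made-up data for illustration only
        {"url": "https://example.com/", "status_code": 200},
        {"url": "https://example.com/old", "status_code": 301},
        {"url": "https://example.com/missing", "status_code": 404},
    ]
    for row in find_page_failures(sample_rows):
        print(row["url"], row["status_code"])
```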

The output includes an S3 URL that you can use to pull the data back into Alli Data. It also includes a link that you can copy and paste into a browser to download the results.