Skip to main content
Skip table of contents

URL Scraper Setup

The URL Scrape tool uses the “PageScraper” app in Marketplace to crawl all of the URLs your client has live in PMG managed media to look for broken pages and redirecting pages.

If you have issues please file a service desk ticket

1. Alli Data Setup

In order to set this up for you client, you will need to set up up to three data reports within Alli Data, one each for social, display, and search - depending on which channels PMG manages for your client. Go to the “Explorer” tab and click the blue “New Data Report” button. (Alternatively, you can click on Reports tab and New Data Report as well.)

Name the report url_scraper_{your media channel)

ex.) url_scraper_search, url_scraper_display, url_scraper_social

Change the report type from “report” to “redshift”

You will need to create a SQL query in the field provided.

SQL Query Requirements

1. In order to build out the SQL query, you will need to include the following and make sure that the column names match the labels in red:

Unique URLs (without any tracking or tagging parameters) - landingpageurl

Account ID - accountid

Campaign Name & ID - campaign, campaignid

Creative Name & ID - creative, creativeid (this name is used for consistency, this will be whatever level the URL lives in in the engine)

2. Include a filter where impressions are greater than 0

3. Once you have each individual channel’s datasources set up, you will need to combine all three channels in a master view and create a final Alli Data Report. Name this report url_scraper_master.

In order to combine the three, plug them into the following query, replacing {client} with your client:

CODE
WITH display AS(SELECT 'display' AS channel,'display' AS accountid,campaign,campaignid,site,creative,creativeid,landingpageurl FROM {client}.url_scraper_display),

search AS(SELECT 'search' AS channel,accountid,campaign,campaignid,'search' AS site,lowest_level_entity AS creative,lowest_level_entity_id AS creativeid,url as landingpageurl FROM {client}.url_scraper_search),

social AS(SELECT 'social' AS channel,account_id,campaign_name,campaign_id,'Facebook' AS site,creative_name,creative_id,landingpageurl FROM {client}.url_scraper_social)

select * from display

union

select * from search

union

select * from social

*Make sure to hit save and publish after you have created each report.

2. Setting up the Alli Marketplace Page Scraper App

  1. Go to Marketplace and in the browse tab search for Page Scraper 2.0

  2. Reference this documentation for setup: Pagescraper quick start guide

  3. Schedule a cadence on how often you’d like the pallet to run.

DO NOT RUN THE APP IF YOU DON’T HAVE CLIENT APPROVAL. SEE BELOW

Sample Client Email

An example is below but to see more info see our sample email here Client Scrape Request Sample

Make sure to send this to the client before running the pallet. This email asks for permission and lets the client know that PMG will be scraping their websites so that they aren’t alarmed.

CODE
{Client},

 

We're hoping to onboard Gitlab to a tool that will regularly scrape all URLs that are being used in live media. The intention to catch any URLs that are broken or redirecting to other pages to ensure the best advertising experience for the user and prevent wasteful ad spend. This will help the team catch and fix these issues as soon as possible. To achieve this we would need to scrape your site regularly to identify pages with issues. We recognize that your site may have restrictions on how and when we should perform these scrapes. Are you okay with us moving forward with this tool, and if so could you answer the below questions so that we don't cause any issues on the site.




Do you have any requirements on what days we can scrape the site?

Do you have any requirements on what times of the day we can scrape the site?

Do we need to apply a rate limit to our scraper for your site?

The bot will have the user agent "pmg.com-urlscraper" in case you need to whitelist or identify traffic from these scrapes.

Are there any other restrictions or requirements we should be aware of when scraping your site.

 

Thanks,

After getting confirmation you can run the app.

3. Setup Amazon s3 Datasource

Once you run the pallet, you will need to create an Advanced Amazon s3 datasource in Alli Data.

Additional Configuration Step

Request new set of AWS Access Keys

You will need to submit a dev access ticket requesting a new set of AWS Access Key ID and Access Secret ID to fulfill this setup. These creds will need read and write access to forklift-prod-outputs. See below for proper request details.

Fill out Region, bucket name, and directory

Once creds are obtained, fill out region, bucket name, and directory accordingly.

To retrieve the file name, go to the Marketplace pallet’s execution you ran and enter the .csv file that it returns. See example highlighted below:

Define Your Data Step

download csv from the page scraper

In the Define Your Data section, you will need to enter the highlighted line below into a new window in your browser. It will download the csv file.

Drop the csv

Drag the csv file into the Define Your Data section and makes sure it looks like the image below. Afterwards, hit load data and your s3 datasource will be set up.

4. Setting up the Alli Alert

You will need to set up 1 Custom alert for each individual channel (social, display, and search).

Set up the Report view

  1. Create a new Redshift data report under the Explorer tab in Alli Data like you did before choosing custom SQL.

  2. Your SQL query will need to join the s3 datasource’s output to the channel’s URL datasource you created earlier. Example SQL query below.

  3. CODE
    WITH s3 AS
      (SELECT *
       FROM {client}.url_scraper_s3
       WHERE (final_status != 200
         OR status != 200
         OR landingpageurl != final_landingpageurl)
         AND landingpageurl NOT ILIKE '%utm%' ), 
         
    {media channel} AS 
       ({Your SQL codeFOR the channel's URL datasource})
    
    SELECT *
    FROM s3
    LEFT JOIN {media channel} on s3.landingpageurl = {mediachannel}.landingpageurl

4. Enter the SQL in the field in the new data report and hit save and publish. You’ll use this report view name in the next step.

Set up the Alert

1. In Alli Marketplace, go to browse, find Alli Custom Alert App and hit install (the plus).

2. Enter the name of the channel-specific data report you just created in the view_name field. Fill out the remainder of the fields according to your media channel and media channel representatives.

Check the instructions in the Alli Custom Alert App for what to put in each field.

7. Schedule a cadence for the alerts and hit save & run.

JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.