Tutorial: Migrating Redshift SQL/Objects to a BigQuery Table
Get Started
Within the Marketplace, find the Redshift to BigQuery application in the browse section. Install the application; during setup you will need to configure various fields, which are documented below and within the app details.
Best Practices
Either REDSHIFT_SQL or REDSHIFT_OBJ should be used; if both are provided, the app will error out.

This app is best used in the following scenarios:

- At the end of an analytical workflow where you need to move the final result to BigQuery
- One-off data load requests

PARTITION_COL SHOULD be used for all data copies that include a date column. This allows BigQuery to prune unneeded partitions at query time, as shown in the sketch below.
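To see the benefit, consider a query that filters on the partition column. A minimal sketch using the google-cloud-bigquery client, assuming PARTITION_COL was set to date when the table was loaded; the project, dataset, and table names reuse the examples from the Inputs table below and are not prescriptive.

```python
# Minimal sketch: querying a date-partitioned table, assuming PARTITION_COL
# was set to `date` when the table was loaded. Names reuse this tutorial's
# examples and are not prescriptive.
from google.cloud import bigquery

client = bigquery.Client(project="pmg-datawarehouse")

# Because the table is partitioned on `date`, this filter lets BigQuery
# scan only partitions from 2019-01-01 onward instead of the whole table.
query = """
    SELECT *
    FROM `pmg-datawarehouse.surveymonkey.omnichannel_complete`
    WHERE date >= '2019-01-01'
"""
for row in client.query(query).result():
    print(row)
```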
Inputs
Input | Description | Required | Example |
---|---|---|---|
REDSHIFT_SQL | The SQL that generates the result set to migrate to BigQuery | N | select * from gap_core.search_account where date >='2019-01-01' |
REDSHIFT_OBJ | The table or view to migrate to BigQuery | N | surveymonkey.omnichannel_complete |
BIGQUERY_PROJECT | The BigQuery project in which to create the new table | Y, defaults to pmg-datawarehouse | pmg-datawarehouse |
BIGQUERY_DATASET | The BigQuery dataset in which to create the new table | Y | surveymonkey |
BIGQUERY_TBL | The BigQuery table name | Y | omnichannel_complete |
PARTITION_COL | The column that should partition the data | N, SHOULD be used in most instances | date |
GOOGLECREDS | The GOOGLE_APPLICATION_CREDENTIALS file | Y, defaults to pmg-datawarehouse creds | |
BIGQUERY_SCHEMA_FILE | The target schema file | N | |
CLUSTER_COLS | Comma-delimited list of columns to cluster by. For best practices on when to cluster, see the BigQuery documentation: https://cloud.google.com/bigquery/docs/clustered-tables | N | |
DROP_BIGQUERY_TBL_FLAG | Drops the target table before attempting to load it | Y, defaults to FALSE | |
REDSHIFT_HOSTNAME | The Redshift host to connect to | N | redshift.data.alliplatform.com |
REDSHIFT_DATABASE | The Redshift database | N | datawarehouse |
REDSHIFT_USERNAME | The Redshift user that has access to the data | N | stephen.frazier@pmg.com |

If you provide a value for REDSHIFT_HOSTNAME, the other Redshift template variables must also be populated, and you will need to add an environment variable named REDSHIFT_PASSWORD containing the password for the corresponding REDSHIFT_USERNAME (see the connection sketch below).
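For illustration, here is a hedged sketch of how these Redshift inputs fit together when pulling from a custom cluster. It assumes a psycopg2-style connection and pandas, which may differ from the app's actual driver; the values mirror the examples above, and port 5439 is Redshift's default.

```python
# Sketch only: how the documented Redshift inputs map onto a connection.
# The app's actual driver is an assumption; psycopg2 is used for illustration.
import os

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="redshift.data.alliplatform.com",    # REDSHIFT_HOSTNAME
    dbname="datawarehouse",                    # REDSHIFT_DATABASE
    user="stephen.frazier@pmg.com",            # REDSHIFT_USERNAME
    password=os.environ["REDSHIFT_PASSWORD"],  # required environment variable
    port=5439,                                 # Redshift's default port
)

# Provide either REDSHIFT_SQL or REDSHIFT_OBJ, never both.
df = pd.read_sql("select * from surveymonkey.omnichannel_complete", conn)
```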
Features Change Log
- Added the ability to pull data from any Redshift cluster
- Added a Cluster by Column(s) option to help speed up BigQuery query performance
- Added a Drop Table option
- Auto-generate BigQuery schema (only if no schema is provided by the user)
Table Mapping of Pandas Dataframe dtype to BigQuery Data Type:
Pandas dtype | BigQuery Data Type |
---|---|
datetime | TIMESTAMP |
string | STRING |
integer | INT64 |
number | FLOAT64 |
any | STRING |
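A sketch of what that mapping might look like in code, using pandas type checks and google-cloud-bigquery's SchemaField; the function name and structure are illustrative, not the app's actual internals.

```python
# Illustrative sketch of the dtype mapping above; not the app's internals.
import pandas as pd
from google.cloud import bigquery

def generate_schema(df: pd.DataFrame) -> list[bigquery.SchemaField]:
    schema = []
    for name, dtype in df.dtypes.items():
        if pd.api.types.is_datetime64_any_dtype(dtype):
            bq_type = "TIMESTAMP"
        elif pd.api.types.is_integer_dtype(dtype):
            bq_type = "INT64"
        elif pd.api.types.is_float_dtype(dtype):
            bq_type = "FLOAT64"
        else:
            bq_type = "STRING"  # strings and any other dtype fall back to STRING
        schema.append(bigquery.SchemaField(name, bq_type))
    return schema
```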
Assumptions
- Data will always load into BigQuery as truncate/insert (see the sketch below)
- Data being migrated is less than 7 million rows
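To illustrate the truncate/insert behavior, here is a hedged sketch of a load that replaces the target table on every run, with the partition and cluster options from the inputs above; the DataFrame and column names are made up for the example.

```python
# Sketch of the truncate/insert assumption: WRITE_TRUNCATE replaces the
# target table on every run. DataFrame and column names are illustrative.
import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {"date": pd.to_datetime(["2019-01-01", "2019-01-02"]), "channel": ["email", "web"]}
)

client = bigquery.Client(project="pmg-datawarehouse")
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # truncate/insert
    time_partitioning=bigquery.TimePartitioning(field="date"),   # PARTITION_COL
    clustering_fields=["channel"],                               # CLUSTER_COLS
)
client.load_table_from_dataframe(
    df, "pmg-datawarehouse.surveymonkey.omnichannel_complete", job_config=job_config
).result()  # wait for the load to complete
```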