Tutorial: Migrating Redshift SQL/Objects to a BigQuery Table
Get Started
Within the Marketplace, find the Redshift to BigQuery application in the browse section. Install the application; during setup you will need to configure various fields, which are documented below and within the app details.
Best Practices
Either REDSHIFT_SQL or REDSHIFT_OBJ should be used; if both are provided, the app will error out.

This app is best used in the following scenarios:

- At the end of an analytical workflow where you need to move the final result to BigQuery
- One-off data load requests

PARTITION_COL SHOULD be used for all data copies that include a date column. This allows BigQuery to prune unneeded partitions at query time, as shown in the sketch below.
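To see the benefit, consider a query that filters on the partition column. A minimal sketch using the google-cloud-bigquery client, assuming PARTITION_COL was set to date when the table was loaded; the project, dataset, and table names reuse the examples from the Inputs table below and are not prescriptive.

```python
# Minimal sketch: querying a date-partitioned table, assuming PARTITION_COL
# was set to `date` when the table was loaded. Names reuse this tutorial's
# examples and are not prescriptive.
from google.cloud import bigquery

client = bigquery.Client(project="pmg-datawarehouse")

# Because the table is partitioned on `date`, this filter lets BigQuery
# scan only partitions from 2019-01-01 onward instead of the whole table.
query = """
    SELECT *
    FROM `pmg-datawarehouse.surveymonkey.omnichannel_complete`
    WHERE date >= '2019-01-01'
"""
for row in client.query(query).result():
    print(row)
```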
Inputs
Input | Description | Required | Example |
---|---|---|---|
REDSHIFT_SQL | The SQL that generates the result set to migrate to BigQuery | N | select * from gap_core.search_account where date >='2019-01-01' |
REDSHIFT_OBJ | The table or view to migrate to BigQuery | N | surveymonkey.omnichannel_complete |
BIGQUERY_PROJECT | The BigQuery project in which to create the new table | Y, defaults to pmg-datawarehouse | pmg-datawarehouse |
BIGQUERY_DATASET | The BigQuery dataset in which to create the new table | Y | surveymonkey |
BIGQUERY_TBL | The BigQuery table name | Y | omnichannel_complete |
PARTITION_COL | The column that should partition the data | N, SHOULD be used in most instances | date |
GOOGLECREDS | The GOOGLE_APPLICATION_CREDENTIALS file | Y, defaults to pmg-datawarehouse creds | |
BIGQUERY_SCHEMA_FILE | The target schema file | N | |
CLUSTER_COLS | Comma-delimited list of columns to cluster by. For best practices on when to cluster, see the BigQuery documentation: https://cloud.google.com/bigquery/docs/clustered-tables | N | |
DROP_BIGQUERY_TBL_FLAG | Drops the target table before attempting to load it | Y, defaults to FALSE | |
REDSHIFT_HOSTNAME | The Redshift host to connect to | N | redshift.data.alliplatform.com |
REDSHIFT_DATABASE | The Redshift database | N | datawarehouse |
REDSHIFT_USERNAME | The Redshift user that has access to the data | N | stephen.frazier@pmg.com |

If you provide a value for REDSHIFT_HOSTNAME, the other Redshift template variables must also be populated, and you will need to add an environment variable named REDSHIFT_PASSWORD containing the password for the corresponding REDSHIFT_USERNAME (see the connection sketch below).
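For illustration, here is a hedged sketch of how these Redshift inputs fit together when pulling from a custom cluster. It assumes a psycopg2-style connection and pandas, which may differ from the app's actual driver; the values mirror the examples above, and port 5439 is Redshift's default.

```python
# Sketch only: how the documented Redshift inputs map onto a connection.
# The app's actual driver is an assumption; psycopg2 is used for illustration.
import os

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="redshift.data.alliplatform.com",    # REDSHIFT_HOSTNAME
    dbname="datawarehouse",                    # REDSHIFT_DATABASE
    user="stephen.frazier@pmg.com",            # REDSHIFT_USERNAME
    password=os.environ["REDSHIFT_PASSWORD"],  # required environment variable
    port=5439,                                 # Redshift's default port
)

# Provide either REDSHIFT_SQL or REDSHIFT_OBJ, never both.
df = pd.read_sql("select * from surveymonkey.omnichannel_complete", conn)
```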
Features Change Log
- Added the ability to pull data from any Redshift cluster
- Added a Cluster by Column(s) option to help speed up BigQuery query performance
- Added a Drop Table option
- Auto-generate BigQuery schema (only if no schema is provided by the user)
Table Mapping of Pandas Dataframe dtype to BigQuery Data Type:
Pandas dtype | BigQuery Data Type |
---|---|
datetime | TIMESTAMP |
string | STRING |
integer | INT64 |
number | FLOAT64 |
any | STRING |
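A sketch of what that mapping might look like in code, using pandas type checks and google-cloud-bigquery's SchemaField; the function name and structure are illustrative, not the app's actual internals.

```python
# Illustrative sketch of the dtype mapping above; not the app's internals.
import pandas as pd
from google.cloud import bigquery

def generate_schema(df: pd.DataFrame) -> list[bigquery.SchemaField]:
    schema = []
    for name, dtype in df.dtypes.items():
        if pd.api.types.is_datetime64_any_dtype(dtype):
            bq_type = "TIMESTAMP"
        elif pd.api.types.is_integer_dtype(dtype):
            bq_type = "INT64"
        elif pd.api.types.is_float_dtype(dtype):
            bq_type = "FLOAT64"
        else:
            bq_type = "STRING"  # strings and any other dtype fall back to STRING
        schema.append(bigquery.SchemaField(name, bq_type))
    return schema
```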
Assumptions
- Data will always load into BigQuery as truncate/insert (see the sketch below)
- Data being migrated is less than 7 million rows
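To illustrate the truncate/insert behavior, here is a hedged sketch of a load that replaces the target table on every run, with the partition and cluster options from the inputs above; the DataFrame and column names are made up for the example.

```python
# Sketch of the truncate/insert assumption: WRITE_TRUNCATE replaces the
# target table on every run. DataFrame and column names are illustrative.
import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {"date": pd.to_datetime(["2019-01-01", "2019-01-02"]), "channel": ["email", "web"]}
)

client = bigquery.Client(project="pmg-datawarehouse")
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # truncate/insert
    time_partitioning=bigquery.TimePartitioning(field="date"),   # PARTITION_COL
    clustering_fields=["channel"],                               # CLUSTER_COLS
)
client.load_table_from_dataframe(
    df, "pmg-datawarehouse.surveymonkey.omnichannel_complete", job_config=job_config
).result()  # wait for the load to complete
```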