How to copy data from Redshift to Hive? How to unload from Redshift in Parquet format? Explained with sample code.


There may be business scenarios where you need to copy daily data from Redshift to Hive. For better compression, it is preferable to keep the data in Parquet format. Before December 2019, Redshift did not have the capability to unload data in Parquet format.

Since December 2019, you can unload the result of an Amazon Redshift query to your Amazon S3 data lake as Apache Parquet, an efficient open columnar storage format for analytics. The great advantage is that Parquet is up to 2x faster to unload and consumes up to 6x less storage in Amazon S3 compared to text formats. During the unload you can specify one or more partition columns so that the unloaded data is automatically partitioned into folders in your Amazon S3 bucket.

UNLOAD ('select * from hrm.my_freshers_in_data where as_on_date = ''2021-08-01''')
TO 's3://sem-freshers-in-us-east-1-daily/hrm/base/hrm.my_freshers_in_data'
IAM_ROLE 'arn:aws:iam::100988878888:role/FRESHERS_IN_Redshift_S3_Role'
PARQUET ALLOWOVERWRITE PARTITION BY (as_on_date);
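Because PARTITION BY writes the files into Hive-style folders (for example as_on_date=2021-08-01/), the Hive side only needs an external table pointing at the same S3 location, plus a step to register the new partition. Below is a minimal sketch; the column list is an assumption for illustration and should be replaced with your actual table schema.

CREATE EXTERNAL TABLE IF NOT EXISTS hrm.my_freshers_in_data (
    employee_id   BIGINT,
    employee_name STRING
)
PARTITIONED BY (as_on_date STRING)
STORED AS PARQUET
LOCATION 's3://sem-freshers-in-us-east-1-daily/hrm/base/hrm.my_freshers_in_data/';

-- Register the partition written by the daily unload
ALTER TABLE hrm.my_freshers_in_data ADD IF NOT EXISTS PARTITION (as_on_date='2021-08-01');

-- Or let Hive discover every partition folder under the table location
MSCK REPAIR TABLE hrm.my_freshers_in_data;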

PARQUET with ENCRYPTED is supported only with server-side encryption using an AWS Key Management Service key (SSE-KMS). By default, each row group is compressed using SNAPPY compression. During unload you cannot use PARQUET with DELIMITER, FIXEDWIDTH, ADDQUOTES, ESCAPE, NULL AS, HEADER, GZIP, BZIP2, or ZSTD.
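If the unloaded files need to be encrypted, ENCRYPTED can be combined with a KMS key in the same statement, since SSE-KMS is the only encryption mode allowed with PARQUET. A minimal sketch follows; the KMS key ID is a placeholder value.

UNLOAD ('select * from hrm.my_freshers_in_data where as_on_date = ''2021-08-01''')
TO 's3://sem-freshers-in-us-east-1-daily/hrm/base/hrm.my_freshers_in_data'
IAM_ROLE 'arn:aws:iam::100988878888:role/FRESHERS_IN_Redshift_S3_Role'
-- placeholder KMS key ID; replace with the key configured for your bucket
KMS_KEY_ID '1234abcd-12ab-34cd-56ef-1234567890ab'
ENCRYPTED
PARQUET ALLOWOVERWRITE PARTITION BY (as_on_date);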

