PySpark : Reading parquet file stored on Amazon S3 using PySpark


To read a Parquet file stored on Amazon S3 using PySpark, you can use the following code:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Read S3 Parquet file") \
    .getOrCreate()

# Set S3 credentials if necessary (these apply to DataFrame reads in the current session)
spark.conf.set("fs.s3a.access.key", "ACCESS_KEY")
spark.conf.set("fs.s3a.secret.key", "SECRET_KEY")

# Read the Parquet file from S3
df = spark.read.parquet("s3a://freshers_bkt/training/view_country/parquet_file")

# Show the data
df.show()
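
Note that the s3a connector is not bundled with every Spark distribution. If the read fails with an error such as "No FileSystem for scheme: s3a" or a missing S3AFileSystem class, you may need to pull in the hadoop-aws package. A minimal sketch is shown below; the version 3.3.4 is only an assumption and should be matched to the Hadoop version your Spark build ships with.

from pyspark.sql import SparkSession

# The hadoop-aws version here is an assumption; pick the one that matches
# the Hadoop version bundled with your Spark distribution.
spark = SparkSession.builder \
    .appName("Read S3 Parquet file") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()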

If your instance already has S3 access configured (for example through an IAM instance profile), you can remove the spark.conf.set lines and read directly with spark.read.parquet, as sketched below. Just make sure you keep the s3a:// scheme in the path.
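
For instance, on an EC2 or EMR node with an IAM role attached, the s3a connector can resolve credentials from the instance metadata, so no keys need to appear in the code at all. A minimal sketch, reusing the same example path:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Parquet file") \
    .getOrCreate()

# No explicit keys: credentials are resolved from the instance profile /
# default AWS credential chain configured on the machine.
df = spark.read.parquet("s3a://freshers_bkt/training/view_country/parquet_file")
df.show()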

In this code, you first create a SparkSession. Then, you set the S3 credentials if necessary using the spark.conf.set() method. Finally, you read the Parquet file from S3 by calling spark.read.parquet() with the S3 path of the file as an argument. Once the file has been read, you can use df.show() to display the data.
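
If you prefer to wire the credentials in when the session is created rather than afterwards, they can also be passed to the builder as Hadoop options. The sketch below assumes the same ACCESS_KEY and SECRET_KEY placeholders as above, and finishes with printSchema() and count() as quick sanity checks on the loaded DataFrame.

from pyspark.sql import SparkSession

# Options prefixed with spark.hadoop. are copied into the Hadoop
# configuration when the SparkContext is created, so this is a
# convenient place to supply S3 credentials.
spark = SparkSession.builder \
    .appName("Read S3 Parquet file") \
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY") \
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY") \
    .getOrCreate()

df = spark.read.parquet("s3a://freshers_bkt/training/view_country/parquet_file")
df.printSchema()   # inspect the column names and types
print(df.count())  # quick row count to confirm the data loaded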

Important Spark URLs to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page