To read a Parquet file in Spark, you can use the spark.read.parquet()
method, which returns a DataFrame. Here is an example of how you can use this method to read a Parquet file and display the contents:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadParquet").getOrCreate()
# Read the Parquet file
df = spark.read.parquet("path/to/file.parquet")
# Show the contents of the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
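Before stopping the SparkSession, you can verify what Spark actually read. A quick sketch; the column names passed to select are hypothetical:
# Inspect the schema Spark read from the Parquet file footer
df.printSchema()
# Count the rows across all files that were read
print(df.count())
# Project a couple of columns (column names are hypothetical)
df.select("name", "age").show(5)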
You can also read Parquet files from an HDFS directory:
df = spark.read.format("parquet").load("hdfs://path/to/directory")
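On a real cluster the HDFS URI usually includes the NameNode host and port. A sketch in which the host, port, and path are all placeholders:
# Fully qualified HDFS URI (host, port, and path are hypothetical)
df = spark.read.parquet("hdfs://namenode:8020/data/events")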
You can also filter rows while reading a Parquet file, using the where method:
df = spark.read.parquet("freshers_path/to/freshers_in.parquet").where("column_name = 'value'")
In addition to reading a single Parquet file, you can also read a directory containing multiple Parquet files by specifying the directory path instead of a file path, like this:
df = spark.read.parquet("freshers_path/to/directory")
You can also call the schema method to specify the schema of the Parquet file explicitly:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("name", StringType()),
StructField("age", IntegerType())
])
df = spark.read.schema(schema).parquet("freshers_path/to/file.parquet")
When you provide the schema, Spark skips inferring it from the Parquet file footers, which saves time when a directory contains many files and also lets you enforce the column types you expect.
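If the files in a directory were written with slightly different but compatible schemas, Spark's standard mergeSchema read option asks it to reconcile them; this is slower, since every file's footer must be examined. A minimal sketch with a placeholder path:
# Merge compatible schemas across all Parquet files in the directory
# (slower: Spark must read every file's footer)
df = spark.read.option("mergeSchema", "true").parquet("freshers_path/to/directory")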