AWS Glue: How to read a sample CSV file with PySpark


Reading a sample CSV file using PySpark

Here we assume that your CSV data already sits in an AWS S3 bucket. The next step is to crawl the data in that S3 bucket. Once the crawl is done, you will find that the crawler has created a metadata table for your CSV data.
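
Because the crawler registers the table in the Glue Data Catalog, you could also read it as a Glue DynamicFrame instead of pointing Spark directly at the S3 path. The sketch below assumes the crawler created the table as final_year inside a database named freshers_db; both names are hypothetical, so replace them with the ones your crawler actually produced.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the crawled table from the Glue Data Catalog as a DynamicFrame
# (database and table_name below are hypothetical placeholders)
students_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="freshers_db",
    table_name="final_year"
)

# Convert to a Spark DataFrame if you prefer the DataFrame API
students_dyf.toDF().printSchema()

The example below instead reads the CSV file straight from S3 with Spark's DataFrame reader: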

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Glue context and get its Spark session
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read the CSV file from S3 with Spark's built-in CSV reader,
# using the first row as the header and inferring column types
freshers_data = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('s3://freshers_in_datasets/training/students/final_year.csv')

freshers_data.printSchema()

Result

root
|-- Freshers def: string (nullable = true)
|-- student Id: string (nullable = true)
|-- student Name: string (nullable = true)
|-- student Street Address: string (nullable = true)
|-- student City: string (nullable = true)
|-- student State: string (nullable = true)
|-- student Zip Code: integer (nullable = true)
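
Once the schema looks right, the DataFrame can be transformed and written back to S3 like any other Spark DataFrame. A minimal sketch follows: it selects a few of the columns shown above, filters on student State, and writes the result as Parquet. The filter value "Kerala" and the output path are placeholders, not part of the original example.

from pyspark.sql.functions import col

# Keep a handful of columns and filter on state
# (column names are taken from the schema printed above)
kerala_students = freshers_data.select(
    "student Id", "student Name", "student City", "student State"
).filter(col("student State") == "Kerala")

# Write the filtered rows back to S3 as Parquet
# (the output path below is a hypothetical placeholder)
kerala_students.write.mode("overwrite").parquet(
    "s3://freshers_in_datasets/training/students/output/kerala_students/"
)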

