AWS Glue: Example of how to read a sample CSV file with PySpark


Here we assume that your CSV data is already in an AWS S3 bucket. The next step is to crawl the data in that S3 bucket with an AWS Glue crawler. Once the crawl is done, you will find that the crawler has created a metadata table for your CSV data in the Glue Data Catalog. You can read the data either through that catalog table, as sketched next, or directly from S3, as the job further below does.
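
A minimal sketch of reading through the Data Catalog table the crawler created. The database name glue_training_db and table name final_year are placeholders; replace them with the names your crawler actually registered.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the table that the crawler registered in the Glue Data Catalog.
# "glue_training_db" and "final_year" are placeholder names - use the
# database and table your crawler created.
students_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="glue_training_db",
    table_name="final_year")

# Convert the DynamicFrame to a Spark DataFrame for regular PySpark operations.
students_dyf.toDF().printSchema()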

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the GlueContext and get the SparkSession it wraps.
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read the CSV from S3; on Spark 2.x and later, format("csv") is equivalent
# to the legacy "com.databricks.spark.csv" package name used here.
freshers_data = spark.read.format("com.databricks.spark.csv").option(
    "header", "true").option(
    "inferSchema", "true").load(
    's3://freshers_in_datasets/training/students/final_year.csv')
freshers_data.printSchema()

Result

root
|-- Freshers def: string (nullable = true)
|-- student Id: string (nullable = true)
|-- student Name: string (nullable = true)
|-- student Street Address: string (nullable = true)
|-- student City: string (nullable = true)
|-- student State: string (nullable = true)
|-- student Zip Code: integer (nullable = true)
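
With the schema inferred, the DataFrame can be queried with standard PySpark operations. A small usage sketch based on the columns shown above (the filter is only an illustration):

# Select a few of the crawled columns; names containing spaces can be
# passed as plain strings to select().
freshers_data.select("student Name", "student City", "student State").show(5)

# Illustrative filter: count rows that have a Zip Code populated.
print(freshers_data.filter(freshers_data["student Zip Code"].isNotNull()).count())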

