Reading a sample CSV file using PySpark
Here, assume that your CSV data is in an AWS S3 bucket. The next step is to crawl the data in that bucket with an AWS Glue crawler. Once the crawl is done, you will find that the crawler has created a metadata table for your CSV data.
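The metadata table created by the crawler can itself be used as the source for a read. The following is a minimal sketch of that route, not necessarily how the job in this post is set up: the helper name load_from_catalog and its database and table_name arguments are assumptions made for illustration, and the real values depend on how your crawler was configured.

```python
def load_from_catalog(glue_context, database, table_name):
    """Read a crawler-created catalog table and return a Spark DataFrame.

    glue_context is expected to be an awsglue.context.GlueContext.
    database and table_name are hypothetical placeholders for the names
    your crawler registered in the Glue Data Catalog.
    """
    # Ask Glue for a DynamicFrame backed by the catalog table
    dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )
    # Convert the DynamicFrame to a regular Spark DataFrame
    return dynamic_frame.toDF()
```

Inside a Glue job this could be called with the job's GlueContext and whatever database and table names the crawler produced. Reading the file by its S3 path does not need the catalog table at all.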
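import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Glue context and take its underlying Spark session
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read the CSV file directly from its S3 path: the first row is the header,
# and Spark infers the column types
freshers_data = (
    spark.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3://freshers_in_datasets/training/students/final_year.csv")
)
freshers_data.printSchema()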
root
 |-- Freshers def: string (nullable = true)
 |-- student Id: string (nullable = true)
 |-- student Name: string (nullable = true)
 |-- student Street Address: string (nullable = true)
 |-- student City: string (nullable = true)
 |-- student State: string (nullable = true)
 |-- student Zip Code: integer (nullable = true)
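The column names in this schema contain spaces and mixed case, which can be awkward to reference in Spark SQL. One possible cleanup, sketched here as an optional extra rather than a required step, is to convert them to snake_case. The helper normalize_column_names is a hypothetical name introduced for this example; the column list is copied from the schema output above.

```python
import re


def normalize_column_names(names):
    """Return lowercase, underscore-separated versions of the column names."""
    # Trim each name, replace every run of non-word characters
    # (spaces, punctuation) with a single underscore, then lowercase it
    return [re.sub(r"\W+", "_", name.strip()).lower() for name in names]


# Column names as printed by printSchema() above
columns = [
    "Freshers def",
    "student Id",
    "student Name",
    "student Street Address",
    "student City",
    "student State",
    "student Zip Code",
]

clean_columns = normalize_column_names(columns)
print(clean_columns)
```

In the Glue job, the renamed columns could then be applied with freshers_data.toDF(*normalize_column_names(freshers_data.columns)), which returns a new DataFrame and leaves the data itself unchanged.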