PySpark: How to create an RDD from a List and from AWS S3

user July 1, 2021

In this article you will learn what an RDD is, how to create an RDD from a Python list, what parallelize does, and how to create an RDD from a file in AWS S3.

RDD: An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of elements, partitioned across the nodes of a cluster.

Parallelize: A parallelized collection is created by calling the SparkContext parallelize method on a collection in the driver program. When parallelize is called, the elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

# Converting a list to an RDD
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Freshers_in").getOrCreate()
sample_data = ["INDIA", "USA", "CANADA", "INDIA", "USA", "JAPAN", "UK", "UAE", "INDIA"]
# type(sample_data) => <class 'list'>
rdd = spark.sparkContext.parallelize(sample_data)
# Converted to an RDD: type(rdd) => <class 'pyspark.rdd.RDD'>
# collect() returns all elements of the RDD to the driver as a Python list
rdd.collect()

# Reading a text file from S3 into an RDD (one element per line)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Freshers_in").getOrCreate()
rdd_data = spark.sparkContext.textFile('s3://sem-freshers-in-spark_training/training/sample_txt.txt')
# Created an RDD from an external source: type(rdd_data) => <class 'pyspark.rdd.RDD'>
# collect() brings every line to the driver, so use it only for small files
rdd_data.collect()


Author: user
