A common task when dealing with CSV files is to remove the header row from the source before doing data analysis. In PySpark this can be done as below.
Source code (PySpark with Python 3.6 and Spark 3; also compatible with Spark 2.2+ and Python 2.7)
from pyspark import SparkContext
import csv

sc = SparkContext()

# Read the CSV file as an RDD of text lines
readFile = sc.textFile("D:\\Users\\speedika\\PycharmProjects\\sparkprojects\\sample_csv_01.csv")

# Parse the lines of each partition with the csv module
readCSV = readFile.mapPartitions(lambda x: csv.reader(x))

# Pair every row with its index (the header gets index 0)
file_with_indx = readCSV.zipWithIndex()
for data_with_idx in file_with_indx.collect():
    print(data_with_idx)

# Keep only rows whose index is greater than 0, then drop the index
rmHeader = file_with_indx.filter(lambda x: x[1] > 0).map(lambda x: x[0])
for cleanse_data in rmHeader.collect():
    print(cleanse_data)
Code Explanation
file_with_indx = readCSV.zipWithIndex()
The zipWithIndex() transformation pairs each element of the RDD with its index. Each row in the CSV is assigned an index starting from 0, so the header row gets index 0.
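As a quick standalone illustration (a minimal sketch, separate from the example above), zipWithIndex() turns an RDD of values into an RDD of (value, index) pairs:

from pyspark import SparkContext

sc = SparkContext()
letters = sc.parallelize(["a", "b", "c"])
# Each element is paired with its position in the RDD
print(letters.zipWithIndex().collect())
# Output: [('a', 0), ('b', 1), ('c', 2)]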
rmHeader = file_with_indx.filter(lambda x : x[1] > 0).map(lambda x : x[0])
This keeps only the rows whose index is greater than 0, which drops the header row (index 0), and then maps each pair back to the row data. The same pattern can be used to skip the first 'n' rows, as shown in the sketch below.
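For instance, to skip the first two rows rather than just the header, only the filter condition changes (a sketch; the names n and skip_n_rows are introduced here for illustration):

n = 2  # hypothetical: number of leading rows to skip
skip_n_rows = file_with_indx.filter(lambda x: x[1] >= n).map(lambda x: x[0])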
Note: the print statements are only used here to show the output at each step.
Sample data

Name,Country,Phone
TOM,USA,343-098-292
JACK,CHINA,783-098-232
CHARLIE,INDIA,873-984-123
SUSAN,JAPAN,898-231-987
MIKE,UK,987-989-121

Result

['TOM', 'USA', '343-098-292']
['JACK', 'CHINA', '783-098-232']
['CHARLIE', 'INDIA', '873-984-123']
['SUSAN', 'JAPAN', '898-231-987']
['MIKE', 'UK', '987-989-121']
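As an aside, if all you need is to discard the header and the DataFrame API is an option, Spark's built-in CSV reader can consume the header row directly. A minimal sketch, reusing the same file path as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# header=True tells the reader to treat the first row as column names
df = spark.read.csv("D:\\Users\\speedika\\PycharmProjects\\sparkprojects\\sample_csv_01.csv", header=True)
df.show()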
Reference documentation: zipWithIndex()