pyspark.sql.functions.arrays_zip In PySpark, the arrays_zip function can be used to combine two or more arrays…
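The element-wise pairing that arrays_zip performs can be modelled in plain Python. This is only a sketch, not PySpark code: the helper name `arrays_zip_model` is hypothetical, tuples stand in for Spark structs, and it assumes Spark's documented behaviour of filling missing positions with null (None here) when the input arrays differ in length.

```python
from itertools import zip_longest


def arrays_zip_model(*arrays):
    """Plain-Python model: the N-th output tuple holds the N-th element of each input."""
    # zip_longest pads the shorter inputs with None, standing in for Spark's null padding.
    return [tuple(items) for items in zip_longest(*arrays, fillvalue=None)]


pairs = arrays_zip_model([1, 2, 3], ["a", "b"])
print(pairs)  # [(1, 'a'), (2, 'b'), (3, None)]
```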
Tag: Big Data
How to find the difference between two arrays in PySpark (array_except)
array_except In PySpark, array_except returns an array of the elements in one column but not in another column…
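As a rough illustration of that semantics, here is a plain-Python model rather than PySpark code. The name `array_except_model` is hypothetical. The sketch assumes the result contains no duplicates, as the Spark documentation states, and keeping first-occurrence order is a choice made by this model.

```python
def array_except_model(left, right):
    """Plain-Python model: elements of `left` that are absent from `right`, without duplicates."""
    excluded = set(right)
    seen = set()
    result = []
    for item in left:
        # Skip anything present in `right`, and anything already emitted.
        if item not in excluded and item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(array_except_model([1, 1, 2, 3, 4], [2, 4, 5]))  # [1, 3]
```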
How to convert Array elements to Rows in PySpark? PySpark – Explode Example code.
Function: pyspark.sql.functions.explode To convert the Array of Array columns to rows in PySpark, we use the “explode” function. Explode returns…
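What explode does to a table can be sketched in plain Python, with a list of dicts standing in for a DataFrame. This is a model, not PySpark code: `explode_model` and the sample column names are hypothetical. It assumes two documented behaviours: the exploded column is named `col` by default, and rows whose array is empty or null produce no output rows.

```python
def explode_model(rows, array_key):
    """Plain-Python model: emit one output row per element of the array column."""
    out = []
    for row in rows:
        # A null (None) or empty array contributes no rows at all.
        for element in (row[array_key] or []):
            new_row = {k: v for k, v in row.items() if k != array_key}
            new_row["col"] = element  # default output column name used by explode
            out.append(new_row)
    return out


table = [
    {"name": "a", "values": [1, 2]},
    {"name": "b", "values": []},
    {"name": "c", "values": None},
]
print(explode_model(table, "values"))
# [{'name': 'a', 'col': 1}, {'name': 'a', 'col': 2}]
```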
How to find whether an array contains a given value or values using PySpark (PySpark search in array)
array_contains You can find a specific value or values in an array using the Spark SQL function array_contains. array_contains(array, value) will return true if…
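The lookup can be modelled in plain Python as follows. This is a sketch, not PySpark code, and `array_contains_model` is a hypothetical name. It assumes the documented behaviour that a null array yields null, and it does not attempt to model null elements inside the array.

```python
def array_contains_model(array, value):
    """Plain-Python model: None for a null array, otherwise whether `value` is present."""
    if array is None:
        return None  # a null array yields null rather than false
    return value in array


print(array_contains_model(["a", "b", "c"], "b"))  # True
print(array_contains_model(["a", "b", "c"], "z"))  # False
print(array_contains_model(None, "a"))             # None
```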
How to remove duplicate values from an array in PySpark
This blog will show you how to remove the duplicates in a column with array elements. Consider the example below…
What are the Python libraries provided by AWS Glue Version 2.0?
The default Python libraries available in AWS Glue version 2.0 are as follows: boto3==1.12.4 botocore==1.15.4 certifi==2019.11.28 chardet==3.0.4 cycler==0.10.0 Cython==0.29.15 docutils==0.15.2…
AWS Glue: Example of how to read a sample CSV file with PySpark
Reading a sample CSV file using PySpark Here, assume that you have your CSV data in an AWS S3 bucket. The…
How to rename columns of a Spark DataFrame having a complex schema with AWS Glue – PySpark: pyspark rename columns
pyspark rename columns There can be multiple reasons to rename the columns of a Spark DataFrame. Even though withColumnRenamed can be…
PySpark – How to read a text file as an RDD using Spark 3 and display the result in Windows 10
Here we will see how to read a sample text file as an RDD using Spark Environment and version which we…
What is the problem in having lots of small files in HDFS? What is the remediation plan?
In the Hadoop ecosystem, we store files under folders in HDFS; most of the time the folder name we are…
Explain distributed cache in Hadoop?
Distributed cache is a facility provided by the Hadoop MapReduce framework to access small files needed by an application during its…