Category: spark
How can I see the full column values in a Spark DataFrame?
When we do a dataframe.show(), we can see that some of the column values get truncated. Here we…
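A minimal sketch of the fix (the DataFrame and column names are invented for this example): passing truncate=False to show() prints the values in full.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # a throwaway DataFrame with a value longer than the default 20-character display width
    df = spark.createDataFrame([("a" * 50, 1)], ["long_text", "id"])

    df.show()                # long_text is cut off at 20 characters
    df.show(truncate=False)  # prints the full column values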
What is the difference between repartition() and coalesce()?
The repartition algorithm performs a full shuffle and creates new partitions with data that's distributed evenly. The repartition algorithm makes…
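A small sketch of the difference (partition counts chosen arbitrarily for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 100)

    # repartition() triggers a full shuffle and spreads rows evenly across 8 new partitions
    evenly = df.repartition(8)

    # coalesce() only merges existing partitions to avoid a full shuffle,
    # so the resulting partitions may be unevenly sized
    merged = df.coalesce(2)

    print(evenly.rdd.getNumPartitions())  # 8
    print(merged.rdd.getNumPartitions())  # 2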
Converts a column containing a StructType, ArrayType or a MapType into a JSON string – PySpark (to_json)
You can convert a column containing a StructType, ArrayType or a MapType into a JSON string using the to_json function. pyspark.sql.functions.to_json…
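A minimal sketch of the call (column names invented for the example): build a struct from the columns, then serialize it.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct, to_json

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "alice")], ["id", "name"])

    # pack id and name into a StructType column, then serialize it to a JSON string
    df.select(to_json(struct("id", "name")).alias("json")).show(truncate=False)
    # -> {"id":1,"name":"alice"}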
How to get a JSON object from a JSON string based on the JSON path specified – get_json_object – PySpark
get_json_object extracts a JSON object from a JSON string based on the JSON path specified and returns…
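A quick sketch (the JSON payload and path are made up for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import get_json_object

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('{"user": {"name": "alice", "age": 30}}',)], ["raw"])

    # '$.user.name' is the JSON path; the matched value is returned as a string
    df.select(get_json_object("raw", "$.user.name").alias("name")).show()
    # -> alice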
How to round the given value to scale decimal places using HALF_EVEN rounding in Spark – PySpark
The bround function returns the expression rounded using the HALF_EVEN rounding mode. That means bround will round the given value…
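A minimal sketch showing the tie-breaking behaviour (sample values invented):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import bround

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(2.5,), (3.5,)], ["value"])

    # HALF_EVEN ("banker's") rounding sends ties to the nearest even number,
    # so at scale 0: 2.5 -> 2.0 and 3.5 -> 4.0
    df.select("value", bround("value", 0).alias("rounded")).show()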
How to replace a value with another value in a column in a PySpark DataFrame?
In PySpark we can replace a value in one column or multiple columns, or multiple values in a column, to…
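A short sketch of both forms (the sample data is invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("NY", 1), ("CA", 2)], ["state", "id"])

    # replace a single value in one column
    df.replace("NY", "New York", subset=["state"]).show()

    # replace multiple values at once by passing a dict
    df.replace({"NY": "New York", "CA": "California"}, subset=["state"]).show()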
How to drop nulls in a DataFrame: PySpark
For most data cleansing, the first thing you may need to do is drop the nulls in the…
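A minimal sketch of the common na.drop variants (sample rows invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (None, "b"), (None, None)], ["id", "tag"])

    df.na.drop().show()               # drop rows containing any null
    df.na.drop(how="all").show()      # drop rows only when every column is null
    df.na.drop(subset=["id"]).show()  # drop rows where the id column is null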
How to replace null values for all columns or for each column separately in Spark – PySpark (na.fill)
Spark API: pyspark.sql.DataFrameNaFunctions.fill. Syntax: fill(value, subset=None). value: "value" can only be an int, long, float, string, bool or…
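A short sketch of both usages (sample data invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(None, "a"), (2, None)], ["id", "tag"])

    # a single value only fills columns whose type matches it
    df.na.fill(0).show()

    # a dict gives each column its own replacement value
    df.na.fill({"id": -1, "tag": "unknown"}).show()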
How to create an array containing a column repeated count times – PySpark
For repeating a column value k times into an array in PySpark we can use the library function below. Library: pyspark.sql.functions.array_repeat. array_repeat is a…
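A minimal sketch (the value and count are invented for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array_repeat

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("x",)], ["value"])

    # build an array containing the value column repeated 3 times: ["x", "x", "x"]
    df.select(array_repeat("value", 3).alias("repeated")).show()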
How to run a Spark job in Airflow hosted on a different server using BashOperator?
In this article we will discuss how we can trigger a PySpark job running on AWS EMR from…
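As a rough sketch of one way this can work (the host name, S3 path, and DAG id below are placeholders, and the article may use a different setup): the BashOperator SSHes into the EMR master node and runs spark-submit there.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # placeholder host and job location; substitute your EMR master node and script
    SPARK_SUBMIT_CMD = (
        "ssh -o StrictHostKeyChecking=no hadoop@emr-master.example.com "
        "'spark-submit --deploy-mode cluster s3://my-bucket/jobs/my_job.py'"
    )

    with DAG(
        dag_id="trigger_spark_on_emr",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_submit_via_ssh",
            bash_command=SPARK_SUBMIT_CMD,
        )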