from_json: If you have a JSON object in a column and need to do any transformation…
Tag: Big Data
Convert a column containing a StructType, ArrayType, or MapType into a JSON string – PySpark (to_json)
You can convert a column containing a StructType, ArrayType, or MapType into a JSON string using the to_json function. pyspark.sql.functions.to_json…
How to round the given value to scale decimal places using HALF_EVEN rounding in Spark – PySpark
bround function: The bround function returns expr rounded using the HALF_EVEN rounding mode. That means bround will round the given value…
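To make the HALF_EVEN rounding mode concrete, here is a minimal sketch in plain Python using the standard decimal module. It illustrates the rounding mode only and does not call Spark's bround; the helper name half_even_round is hypothetical and exists just for this example.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def half_even_round(value: str, scale: int = 0) -> Decimal:
    """Round a decimal string to `scale` places using HALF_EVEN.

    Hypothetical helper for illustration; this is not Spark's bround.
    """
    # Build the quantum for the requested scale, e.g. scale=2 -> 0.01.
    quantum = Decimal(1).scaleb(-scale)
    # On an exact tie, HALF_EVEN picks the neighbour whose last digit is even.
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(half_even_round("2.5"))       # 2   (tie, goes to the even neighbour 2)
print(half_even_round("3.5"))       # 4   (tie, goes to the even neighbour 4)
print(half_even_round("2.125", 2))  # 2.12
print(half_even_round("2.135", 2))  # 2.14
```

Values are passed as strings so that the ties are exact; binary floats such as 2.135 are often not stored exactly, which can hide the tie-breaking behaviour.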
What are the optimization techniques that you can apply in Apache Hive?
1. Partitioning: Partitioning works by dividing the data into smaller segments. These are created using logical grouping based on…
How to replace a value with another value in a column of a PySpark DataFrame?
In PySpark, we can replace a value in one column or multiple columns, or multiple values in a column, to…
How to drop nulls in a DataFrame: PySpark
For most data cleansing, the first thing that you may need to do is drop the nulls in the…
How to replace null values in Spark for all columns or for each column separately – PySpark (na.fill)
Spark API: pyspark.sql.DataFrameNaFunctions.fill. Syntax: fill(value, subset=None). value: “value” can only be int, long, float, string, bool or…
How to create an array containing a column repeated count times – PySpark
To repeat array elements k times in PySpark, we can use the function below. Function: pyspark.sql.functions.array_repeat. array_repeat is a…
How to run a Spark job in Airflow hosted on a different server using BashOperator?
In this article, we will discuss how we can trigger a PySpark job running on an AWS EMR from…
How to create a UDF in PySpark? What are the different ways you can call a PySpark UDF? (With example)
PySpark UDF: To develop a reusable function in Spark, one can use a PySpark UDF. A PySpark UDF is…
How to convert a MapType to multiple columns based on key using PySpark?
Use case: Converting a map to multiple columns. There can be raw data of MapType with multiple key-value pair….