PySpark is the Python API for Apache Spark. It allows developers to interface with RDDs…
Tag: PySpark
PySpark : What is a map-side join and how to perform a map-side join in PySpark
A map-side join is a method of joining two datasets in PySpark in which one dataset is broadcast to all executors, and…
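The idea behind a map-side (broadcast) join can be sketched in plain Python, without a Spark cluster. All names and data below are illustrative, not taken from the post: the small table is copied to every partition of the large table, so each partition joins locally and the large dataset is never shuffled.

```python
# "Broadcast" side: a small lookup table copied to every partition.
small_table = {"US": "United States", "IN": "India"}

# Large side, already split into partitions (as Spark would hold it).
large_table_partitions = [
    [("order-1", "US"), ("order-2", "IN")],
    [("order-3", "US")],
]

joined = []
for partition in large_table_partitions:
    # Each partition joins against its own full copy of small_table,
    # so no rows of the large table move between partitions.
    joined.extend((order, code, small_table[code]) for order, code in partition)
```

In PySpark itself the same effect comes from joining against a broadcast DataFrame; the sketch above only shows why the join needs no shuffle.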
Installing Apache Spark standalone on Linux
Installing Spark on a Linux machine can be done in a few steps. The following is a detailed guide on…
How to use an if condition in Spark SQL, explained with an example
In PySpark, you can use the IF function within a SQL query to conditionally return a value based on a…
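Spark SQL's `IF(cond, a, b)` returns `a` when the condition holds and `b` otherwise. The semantics can be mirrored in plain Python; the column names and threshold below are illustrative, not from the post:

```python
def sql_if(condition, true_value, false_value):
    """Mimics Spark SQL's IF(cond, a, b): a when cond is true, else b."""
    return true_value if condition else false_value

# Roughly what a query like
#   SELECT name, IF(salary > 3000, 'high', 'low') AS band FROM employees
# computes per row (names and values here are made up for illustration):
employees = [("Ann", 4000), ("Bob", 2500)]
bands = [(name, sql_if(salary > 3000, "high", "low"))
         for name, salary in employees]
```

A `CASE WHEN … THEN … ELSE … END` expression in Spark SQL expresses the same branching and generalizes to more than two outcomes.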
What is GC (Garbage Collection) time in the Spark UI?
In the Spark UI, GC (Garbage Collection) time refers to the amount of time spent by the JVM (Java Virtual…
PySpark : How do I read a Parquet file in Spark
To read a Parquet file in Spark, you can use the spark.read.parquet() method, which returns a DataFrame. Here is an…
Learn how to connect Hive with Apache Spark.
HiveContext is a Spark SQL module that allows you to work with Hive data in Spark. It provides a way…
PySpark : Connecting to and updating a PostgreSQL table in Spark SQL
Apache Spark is an open-source, distributed computing system that can process large amounts of data quickly. Spark SQL is a…
Kafka streaming with PySpark – things you need to know – with an example
To use Kafka streaming with PySpark, you will need to have a good understanding of the following concepts: Kafka: Kafka…
How do you break a lineage in Apache Spark? Why do we need to break a lineage in Apache Spark?
In Apache Spark, a lineage refers to the series of RDD (Resilient Distributed Dataset) operations that are performed on a…
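Why a long lineage is expensive, and how materializing an intermediate result "breaks" it, can be sketched in plain Python. This is an analogy for Spark's `rdd.checkpoint()` and caching, not Spark code; all names are illustrative:

```python
calls = {"source": 0}

def source():
    calls["source"] += 1          # count how often the base data is recomputed
    return range(5)

def pipeline():
    # A deferred chain of transformations: every "action" that calls this
    # re-runs the whole chain starting from source(), like replaying a lineage.
    return [x * 2 for x in source()]

pipeline()                        # first action recomputes from the source
pipeline()                        # second action recomputes it all again

materialized = pipeline()         # materialize once, truncating the chain
downstream_a = materialized       # later work reuses the saved result:
downstream_b = materialized       # the source is not touched again
```

After materializing, `calls["source"]` stops growing, which is the point of breaking a lineage: downstream stages restart from the saved data instead of replaying every earlier transformation.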
When should you not use Apache Spark? Explained with reasons.
There are a few situations where it may not be appropriate to use Apache Spark, which is a powerful open-source…