133. What are the Operations That Affect Partitioning in Apache Spark?
All the operations that result in a partitioner being set on the output RDD: cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner), flatMapValues() (if the parent has a partitioner), and filter() (if the parent has a partitioner). All other operations produce a result with no partitioner.
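A minimal PySpark sketch of this behavior (assuming a local SparkContext; the data and app name are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "partitioner-demo")  # hypothetical local context

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(pairs.partitioner)    # None: no partitioner yet

    reduced = pairs.reduceByKey(lambda x, y: x + y, numPartitions=4)
    print(reduced.partitioner)  # set by reduceByKey (hash-based)

    kept = reduced.mapValues(lambda v: v + 1)
    print(kept.partitioner)     # mapValues() keeps the parent's partitioner

    lost = reduced.map(lambda kv: kv)
    print(lost.partitioner)     # None: map() may change keys, so it is dropped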
134. What are the Custom Partitioners in Apache Spark?
To implement a custom partitioner, you need to subclass the org.apache.spark.Partitioner class and implement three methods:
numPartitions: Int, which returns the number of partitions you will create.
getPartition(key: Any): Int, which returns the partition ID (0 to numPartitions - 1) for a given key.
equals(), the standard Java equality method. This is important to implement because Spark will need to test your Partitioner object against other instances of itself when it decides whether two of your RDDs are partitioned the same way!
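Note that the three methods above belong to the Scala/Java API. PySpark does not expose a Partitioner subclass to extend; the analogous effect comes from passing your own hash function to partitionBy(). A minimal sketch (the domain-based partitioner is a hypothetical example):

    from pyspark import SparkContext

    sc = SparkContext("local", "custom-partitioner-demo")

    # Hypothetical partitioner: send all URLs from the same domain to the
    # same partition by hashing only the domain part of the key.
    # A toy deterministic hash is used to avoid Python's per-process
    # string-hash randomization.
    def domain_hash(url):
        domain = url.split("/")[2] if "://" in url else url
        return sum(ord(c) for c in domain)

    pages = sc.parallelize([
        ("http://example.com/a", 1),
        ("http://example.com/b", 2),
        ("http://other.org/c", 3),
    ])

    # partitionBy applies domain_hash to each key, then takes the result
    # modulo the partition count to choose a partition ID.
    by_domain = pages.partitionBy(2, partitionFunc=domain_hash)
    print(by_domain.glom().map(len).collect())  # records per partition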
135. Explain addPyFile(path) in Apache Spark?
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
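A short usage sketch (helpers.py is a hypothetical local module that defines a clean() function):

    from pyspark import SparkContext

    sc = SparkContext("local", "addpyfile-demo")

    # Ship the dependency to every executor before it is needed.
    sc.addPyFile("helpers.py")

    def use_helper(record):
        import helpers               # importable on executors once shipped
        return helpers.clean(record)

    cleaned = sc.parallelize(["a ", " b"]).map(use_helper).collect()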
136. What is applicationId in Apache Spark?
A unique identifier for the Spark application. Its format depends on the scheduler implementation.
In the case of a local Spark app, it looks something like ‘local-1433865536131’.
In the case of YARN, it looks something like ‘application_1433865536131_34483’.
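In PySpark it is exposed as a property on the SparkContext:

    from pyspark import SparkContext

    sc = SparkContext("local", "appid-demo")
    print(sc.applicationId)   # e.g. local-1433865536131 in local mode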
137. What is binaryFiles(path, minPartitions=None) in Apache Spark?
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
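A minimal sketch (the data/images directory is a hypothetical input of small binary files):

    from pyspark import SparkContext

    sc = SparkContext("local", "binaryfiles-demo")

    # Each element is (file path, file contents as a bytes object).
    files = sc.binaryFiles("data/images")

    sizes = files.mapValues(len).collect()
    for path, nbytes in sizes:
        print(path, nbytes)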
138. What is binaryRecords(path, recordLength) in Apache Spark?
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer) and that the number of bytes per record is constant.
Parameters:
path – Directory to the input data files
recordLength – The length at which to split the records
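For example, assuming a hypothetical file data/points.bin holding fixed-size records of two little-endian 32-bit integers (8 bytes per record), each record can be decoded with Python's struct module:

    import struct

    from pyspark import SparkContext

    sc = SparkContext("local", "binaryrecords-demo")

    # Every element is one 8-byte record, delivered as a bytes object.
    records = sc.binaryRecords("data/points.bin", recordLength=8)

    points = records.map(lambda rec: struct.unpack("<ii", rec))
    print(points.take(3))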
139. What is broadcast(value) in Apache Spark?
Broadcast a read-only variable to the cluster, returning a Broadcast (pyspark.broadcast.Broadcast) object for reading it in distributed functions. The variable will be sent to each node only once.
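A minimal sketch of the round trip (the lookup table is a hypothetical example):

    from pyspark import SparkContext

    sc = SparkContext("local", "broadcast-demo")

    lookup = sc.broadcast({"a": 1, "b": 2})  # shipped to workers once

    rdd = sc.parallelize(["a", "b", "a"])
    # Inside a distributed function, read the shared value via .value
    total = rdd.map(lambda k: lookup.value.get(k, 0)).sum()
    print(total)   # 4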