Apache Spark interview questions

PySpark @ Freshers.in

133. What are the Operations That Affect Partitioning in Apache Spark ?
These are the operations that result in a partitioner being set on the output RDD: cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), and, when the parent RDD has a partitioner, mapValues(), flatMapValues(), and filter(). All other operations produce a result with no partitioner.

134. What are the Custom Partitioners in Apache Spark ?
To implement a custom partitioner, you need to subclass the org.apache.spark.Partitioner class and implement three methods:
numPartitions: Int, which returns the number of partitions you will create.
getPartition(key: Any): Int, which returns the partition ID (0 to numPartitions - 1) for a given key.
equals(), the standard Java equality method. This is important to implement because Spark will need to test your Partitioner object against other instances of itself when it decides whether two of your RDDs are partitioned the same way!
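The three methods above belong to the Scala API; PySpark instead takes a plain partitioning function via rdd.partitionBy(numPartitions, partitionFunc). Still, the contract can be sketched in Python (the class name and hashing scheme here are illustrative, not a real Spark API):

```python
class SketchPartitioner:
    """Illustrative Python sketch of the Scala Partitioner contract."""

    def __init__(self, num_partitions):
        self._num = num_partitions

    @property
    def numPartitions(self):
        # Number of partitions this partitioner creates.
        return self._num

    def getPartition(self, key):
        # Map a key to a partition ID in [0, numPartitions - 1].
        # Python's % always yields a non-negative result here.
        return hash(key) % self._num

    def __eq__(self, other):
        # Spark compares partitioners to decide whether two RDDs are
        # already co-partitioned, which lets a join avoid a shuffle.
        return (isinstance(other, SketchPartitioner)
                and other._num == self._num)


p = SketchPartitioner(4)
pid = p.getPartition("freshers.in")
ok = 0 <= pid < p.numPartitions
print(ok)                          # True
print(p == SketchPartitioner(4))   # True
```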

135. Explain addPyFile(path) in Apache Spark ?
Adds a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

136. What is applicationId in Apache Spark ?
A unique identifier for the Spark application. Its format depends on the scheduler implementation.
In the case of a local Spark app: something like 'local-1433865536131'.
In the case of YARN: something like 'application_1433865536131_34483'.

137. What are binaryFiles(path, minPartitions=None) in Apache Spark ?
Reads a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as byte arrays. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.

138. What is binaryRecords(path, recordLength) in Apache Spark ?
Load data from a flat binary file, assuming each record is a set of numbers with the specified
numerical format (see ByteBuffer), and the number of bytes per record is constant.
Parameters: path – Directory to the input data files
recordLength – The length at which to split the records

139. What is broadcast(value) in Apache Spark ?
Broadcasts a read-only variable to the cluster, returning a Broadcast object (pyspark.broadcast.Broadcast) for reading it in distributed functions. The variable is sent to each node only once.
