Apache Pig interview questions

user March 21, 2021

51. What are the different ways of executing a Pig script?
Grunt Shell: Pig’s interactive shell, where Pig Latin statements are entered and run one at a time.
Script File: Write the Pig Latin statements in a script file and run the file in batch mode with the pig command.
Embedded Script: Embed Pig Latin statements in a host program, for example in Java through the PigServer class, and run them from there. Separately, when the built-in operators and functions lack some functionality, we can write User Defined Functions (UDFs) in languages such as Java, Python, or Ruby, register them in the Pig Latin script file, and then execute that script file.

52. What is a bag in Pig Latin?
A bag is one of the data types in the Pig data model. It is an unordered collection of tuples that may contain duplicates. Bags are used to hold collections of tuples, for example the tuples collected under each key while grouping. A bag does not have to fit into memory: when it grows too large, Pig spills part of it to local disk and keeps only the rest in memory, so the practical size limit of a bag is the available local disk space. We represent bags with curly braces, “{}”.
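The two defining properties of a bag (no ordering, duplicates allowed) can be illustrated with a small Python model. This is only a sketch of the semantics, not Pig code: `make_bag` is a hypothetical helper, `collections.Counter` is used as a stand-in for a multiset, and spilling to disk is not modelled.

```python
from collections import Counter

def make_bag(tuples):
    """Model a bag as a multiset: order is ignored, duplicates are counted."""
    return Counter(tuples)

# The same three tuples, supplied in a different order.
bag_a = make_bag([
    ("Metallica", "Los Angeles"),
    ("Linkin Park", "California"),
    ("Metallica", "Los Angeles"),
])
bag_b = make_bag([
    ("Linkin Park", "California"),
    ("Metallica", "Los Angeles"),
    ("Metallica", "Los Angeles"),
])

same_bag = bag_a == bag_b        # order of insertion does not matter
bag_size = sum(bag_a.values())   # duplicates count towards the size
distinct_size = len(bag_a)       # number of distinct tuples

print(same_bag, bag_size, distinct_size)
```

A set would not work as the model here, because it would silently drop the duplicate tuple.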

53. What do you understand by an inner bag and outer bag in Pig?
An outer bag is simply a relation, that is, a bag of tuples. Relations in Pig are similar to tables in a relational database, except that the tuples are not required to have the same number or type of fields. For example:
{(Linkin Park, California), (Metallica, Los Angeles), (Megadeth, Los Angeles)}
An inner bag is a bag that appears as a field inside a tuple, as in the result of grouping the relation above by its second field. For example:
(Los Angeles, {(Metallica, Los Angeles), (Megadeth, Los Angeles)})
(California, {(Linkin Park, California)})
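How inner bags arise from an outer bag can be sketched in Python using the example data above. This is a model of the grouping semantics under simplifying assumptions, not Pig Latin: `group_by` is a hypothetical helper, and a Python list stands in for each bag.

```python
from collections import defaultdict

# Outer bag (relation): tuples of (band, place), as in the example above.
bands = [
    ("Linkin Park", "California"),
    ("Metallica", "Los Angeles"),
    ("Megadeth", "Los Angeles"),
]

def group_by(relation, key_index):
    """Return one (key, inner_bag) tuple per distinct value of the key field."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[key_index]].append(t)
    # The second field of every result tuple is itself a bag: an inner bag.
    return [(key, inner_bag) for key, inner_bag in groups.items()]

grouped = group_by(bands, 1)
for key, inner_bag in grouped:
    print(key, inner_bag)
```

The list `grouped` plays the role of the new outer bag, and each `inner_bag` is nested inside one of its tuples.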

54. How does Apache Pig deal with schema and schema-less data?
If the schema gives only the field name, the data type of the field defaults to bytearray.
If you assign a name to a field, you can refer to it both by the field name and by positional notation ($0, $1, and so on).
If an operation combines relations (JOIN, COGROUP, etc.) and any of those relations has no schema, the resulting relation has a null schema.
If the schema is null, Pig treats each field as a bytearray and determines the actual data type at runtime from how the field is used.
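The rules above can be mimicked with a short Python sketch. It is an analogy rather than Pig behaviour: `field` is a hypothetical helper, the row and field names are made-up illustration data, and Python `bytes` stands in for Pig's bytearray type.

```python
def field(tup, ref, schema=None):
    """Resolve a field by position ('$0', '$1', ...) or by name if a schema exists."""
    if ref.startswith("$"):
        return tup[int(ref[1:])]
    if schema is None:
        raise KeyError(f"no schema: cannot resolve field name {ref!r}")
    return tup[schema.index(ref)]

row = (b"alice", b"30")      # untyped fields modelled as raw bytes
schema = ["name", "age"]     # field names only, no declared types

by_name = field(row, "name", schema)   # a named field can be read by name...
by_pos = field(row, "$0", schema)      # ...or by position
no_schema = field(row, "$1")           # without a schema only positions work

# The concrete type is decided only when the value is used, here in arithmetic.
age_plus_one = int(no_schema) + 1

print(by_name, by_pos, age_plus_one)
```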

55. List the relational operators in Pig.
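COGROUP: Groups two or more relations by a common key. Each output tuple holds the key followed by one bag per input relation, containing the tuples of that relation that have the key. Unlike JOIN, the matching tuples are not combined into flat rows.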
CROSS: Computes the cross product (Cartesian product) of two or more relations.
DISTINCT: Removes duplicate tuples in a relation.
FILTER: Selects the tuples of a relation that satisfy a condition.
FOREACH: Iterates over the tuples of a relation and generates a transformed tuple for each one.
GROUP: Groups the data in one or more relations.
JOIN: Joins two or more relations (inner or outer join).
LIMIT: Limits the number of output tuples.
LOAD: Loads data from the file system.
ORDER: Sorts a relation based on one or more fields.
SPLIT: Partitions a relation into two or more relations.
STORE: Stores data in the file system.
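UNION: Merges the contents of two or more relations. Unlike SQL, Pig does not enforce that the input relations have the same schema or number of fields, although matching schemas are needed for the result to have a usable schema. UNION does not preserve tuple order and does not remove duplicates.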
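Of the operators listed, COGROUP and JOIN are the two most often confused, so a minimal Python model of their semantics may help. It is a sketch only: `cogroup` and `inner_join` are hypothetical helpers, lists stand in for bags, and the input data is invented for illustration.

```python
def cogroup(left, left_key, right, right_key):
    """Per key, return (key, bag_of_left_tuples, bag_of_right_tuples)."""
    groups = {}
    for t in left:
        groups.setdefault(t[left_key], ([], []))[0].append(t)
    for t in right:
        groups.setdefault(t[right_key], ([], []))[1].append(t)
    return [(key, l_bag, r_bag) for key, (l_bag, r_bag) in groups.items()]

def inner_join(left, left_key, right, right_key):
    """Inner join modelled as a cogroup whose two bags are flattened per key."""
    return [
        l + r
        for _, l_bag, r_bag in cogroup(left, left_key, right, right_key)
        for l in l_bag
        for r in r_bag
    ]

a = [(1, "x"), (2, "y"), (2, "z")]
b = [(2, "p"), (3, "q")]

cogrouped = cogroup(a, 0, b, 0)   # keys 1 and 3 appear with one empty bag each
joined = inner_join(a, 0, b, 0)   # only key 2 survives, as flat tuples

print(cogrouped)
print(joined)
```

The cogroup result keeps every key and keeps the tuples of each input in separate bags, while the join drops keys that are missing from either side and concatenates the matching tuples.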