31. What Is Pig Useful For?
Pig Latin use cases tend to fall into three separate categories: traditional extract transform load (ETL) data pipelines, research on raw data, and iterative processing.
For research on raw data, some users prefer Pig Latin. Because Pig can operate in situations where the schema is unknown, incomplete, or inconsistent, and because it can easily manage nested data, it appeals to researchers who want to work on data before it has been cleaned.
32. What does Pig need to know to run on a cluster?
The only thing Pig needs to know to run on your cluster is the location of your cluster's NameNode and JobTracker. The NameNode is the manager of HDFS, and the JobTracker coordinates MapReduce jobs.
33. Where are the Hadoop locations specified?
Hadoop locations are found in the hadoop-site.xml file. In Hadoop 0.20 and later, they are split across three separate files: core-site.xml, hdfs-site.xml, and mapred-site.xml.
To point Pig at your Hadoop configuration directory, set PIG_CLASSPATH, for example:
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -mkdir /user/username
pig -e fs -ls will list your home directory in HDFS.
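The entries Pig looks for in those files are roughly as follows (a sketch, assuming the Hadoop 0.20+ file layout; hostnames and ports are placeholders):

```xml
<!-- core-site.xml: tells Pig where the NameNode lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>

<!-- mapred-site.xml: tells Pig where the JobTracker lives -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>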
34. What are the complex types in Pig?
Pig has three complex data types: maps, tuples, and bags.
A map in Pig is a chararray-to-data-element mapping, where the element can be any Pig type, including a complex type. The chararray is called the key and is used as an index to find the element, referred to as the value. Because Pig does not know the type of the value, it assumes it is a bytearray. If you do not cast the value, Pig will make a best guess based on how you use the value in your script; if the value is of a type other than bytearray, Pig will figure that out at runtime and handle it. Map constants are formed using brackets to delimit the map, a hash between each key and value, and a comma between key-value pairs. For example, ['name'#'bob','age'#55] creates a map with two keys, "name" and "age".
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields.
A bag is an unordered collection of tuples. Because it has no order, it is not possible to
reference tuples in a bag by position.
A bag is the one type in Pig that is not required to fit into memory. As you will see later, because bags are used to store collections when grouping, bags can become quite large.
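To make the three complex types concrete, here is a small sketch (relation and file names are hypothetical) in which grouping produces a bag of tuples:

```pig
-- each input record becomes a tuple (name, gpa)
students = LOAD 'students.txt' AS (name:chararray, gpa:double);
-- GROUP collects all tuples with the same name into a bag;
-- the bag is unordered and need not fit in memory
grouped = GROUP students BY name;
-- the schema of 'grouped' is (group:chararray, students:{(name, gpa)})
```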
35. What makes it easier to program in Apache Pig than in Hadoop MapReduce?
The initial step of a Pig Latin program is to load the data from HDFS. The data is then run through a series of business transformations (these transformations are internally converted to MapReduce tasks, so developers do not have to write Java code for the business logic). Finally, the results are stored in a file or presented on an interface.
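That load-transform-store flow can be sketched as a short ETL-style script (paths and field names are hypothetical):

```pig
-- 1. Load raw data from HDFS
logs = LOAD '/data/weblogs' AS (user:chararray, url:chararray, bytes:long);
-- 2. Transform: filter and aggregate (Pig compiles these to MapReduce jobs)
big = FILTER logs BY bytes > 1024;
by_user = GROUP big BY user;
totals = FOREACH by_user GENERATE group AS user, SUM(big.bytes) AS total_bytes;
-- 3. Store the results back into HDFS
STORE totals INTO '/output/user_totals';
```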