For Spark please visit (1) Spark Interview Questions (2) Spark Examples (3) PySpark Blogs
1. What is AWS Glue ?
AWS Glue is a cloud service that prepares data for analysis through automated extract, transform and load (ETL) processes. Glue also supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud. AWS Glue is a fully-managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics.
2. When will I use AWS Glue for Streaming ?
AWS Glue is recommended for Streaming when your use cases are primarily ETL and when you want to run jobs on a serverless Apache Spark-based platform.
3. How to launch the Spark history server ?
We can launch the Spark history server using a AWS CloudFormation template that hosts the server on an EC2 instance, or launch locally using Docker.
4. How does a glue crawler determine when to create partitions ?
When an AWS Glue crawler scans Amazon S3 path and if detects multiple folders in a bucket, it determines the root of a table in the folder structure and which folders are partitions of a table. The name of the table is based on the Amazon S3 prefix or folder name. You provide an Include path that points to the folder level to crawl. When the majority of schemas at a folder level are similar, the crawler creates partitions of a table instead of two separate tables.
5. When do I use a Glue Classifier ?
You use classifiers when you crawl a data store to define metadata tables in the AWS Glue Data Catalog. You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether the data is recognized. If the classifier can’t recognize the data or is not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can recognize the data.