AWS Glue interview questions

user January 6, 2021

6. How do you import data from an existing Apache Hive Metastore into the AWS Glue Data Catalog?
Run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.

7. What are time-based schedules for jobs and crawlers?
You can define a time-based schedule for your crawlers and jobs in AWS Glue. Schedules are written as cron expressions, the time is specified in Coordinated Universal Time (UTC), and the minimum precision for a schedule is 5 minutes.
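As a rough illustration of such a schedule, the sketch below builds a daily cron expression and applies it to a crawler. The helper names daily_cron and schedule_crawler and the crawler name are hypothetical; the client is assumed to be a boto3 Glue client, and update_crawler_schedule is the boto3 call this sketch relies on.

```python
def daily_cron(hour: int, minute: int) -> str:
    """Build a Glue cron expression that fires once a day at hour:minute UTC."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    # Field order: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"


def schedule_crawler(glue_client, crawler_name: str, hour: int, minute: int) -> str:
    """Attach a daily schedule to an existing crawler and return the expression used."""
    expression = daily_cron(hour, minute)
    glue_client.update_crawler_schedule(CrawlerName=crawler_name, Schedule=expression)
    return expression


# Example (needs boto3 and AWS credentials; "my-crawler" is a made-up name):
#   import boto3
#   schedule_crawler(boto3.client("glue"), "my-crawler", hour=12, minute=15)
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real account.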

8. What happens when a crawler runs?
When a crawler runs, it takes the following actions to interrogate a data store:
Classifies data to determine the format, schema, and associated properties of the raw data – You can configure the results of classification by creating a custom classifier.
Groups data into tables or partitions – Data is grouped based on crawler heuristics.
Writes metadata to the Data Catalog – You can configure how the crawler adds, updates, and deletes tables and partitions.
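The configuration points mentioned above (custom classifiers and the add/update/delete behavior) can be sketched as the arguments to boto3's create_crawler call. This is only an illustration: the function name crawler_definition and every value passed to it are hypothetical, and the chosen update and delete behaviors are just one possible combination.

```python
def crawler_definition(name, role_arn, database, s3_path, classifiers=()):
    """Assemble keyword arguments for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The data store the crawler will interrogate.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Names of custom classifiers to try before the built-in ones.
        "Classifiers": list(classifiers),
        # How the crawler treats changed or deleted objects in the Data Catalog.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


# Example (needs boto3 and AWS credentials; all names are made up):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_definition(
#       "sales-crawler", "arn:aws:iam::123456789012:role/GlueRole",
#       "sales_db", "s3://example-bucket/sales/"))
#   glue.start_crawler(Name="sales-crawler")
```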

9. What are development endpoints?
A development endpoint is an environment where a developer can remotely debug extract, transform, and load (ETL) scripts. The Development Endpoints API is the part of the AWS Glue API related to testing with a custom DevEndpoint.

10. Is it possible to trigger an AWS Glue crawler when new files are uploaded to an S3 bucket, given that the crawler is “pointed” at that bucket?
No, there is currently no direct way to invoke an AWS Glue crawler in response to an upload to an S3 bucket. S3 event notifications can only be sent to:
SNS
SQS
Lambda
A common workaround is to send the notification to a Lambda function that starts the crawler through the AWS Glue API.
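Since Lambda is one of the supported notification targets, one indirect route is a small Lambda function that starts the crawler itself. The sketch below assumes the function is subscribed to the bucket's object-created events and that its role is allowed to call glue:StartCrawler; the CRAWLER_NAME environment variable, its default value and the helper name start_crawler_if_idle are hypothetical.

```python
import os

# Hypothetical setting: the crawler to start, read from the Lambda environment.
CRAWLER_NAME = os.environ.get("CRAWLER_NAME", "my-crawler")


def start_crawler_if_idle(glue_client, crawler_name):
    """Start the crawler; return False if a run is already in progress."""
    try:
        glue_client.start_crawler(Name=crawler_name)
        return True
    except glue_client.exceptions.CrawlerRunningException:
        # Several uploads can arrive while one crawl is still running.
        return False


def lambda_handler(event, context):
    import boto3  # available in the Lambda Python runtime

    # Object keys from the S3 notification, kept only for logging.
    keys = [record["s3"]["object"]["key"] for record in event.get("Records", [])]
    started = start_crawler_if_idle(boto3.client("glue"), CRAWLER_NAME)
    return {"keys": keys, "crawler_started": started}
```

Because every upload triggers the function, a burst of new files produces many invocations; catching the "already running" error keeps those extra calls harmless.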
