1. What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately. Athena works directly with data stored in S3. It uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet, and Avro.
2. How can I submit my queries in Amazon Athena?
You can submit your queries through the Athena console, the Athena APIs, or the Athena preview JDBC driver, which works with off-the-shelf query and result visualization tools such as SQL Workbench.
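As a rough sketch of the API route, the Python below assembles the arguments for the StartQueryExecution operation and hands them to a client object. It assumes a boto3-style Athena client (for example, one created with `boto3.client("athena")`) is passed in by the caller. The helper names `build_query_request` and `submit_query`, the database name, the table name, and the bucket are hypothetical, and nothing is sent to AWS unless you supply a real client.

```python
def build_query_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit_query(client, sql, database, output_location):
    """Submit a query through a boto3-style Athena client and return its ID."""
    request = build_query_request(sql, database, output_location)
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical values, for illustration only.
request = build_query_request(
    "SELECT * FROM web_logs LIMIT 10",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
```

Keeping the request builder separate from the call makes it possible to inspect or log the request before anything is submitted.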
3. How does machine learning in Athena relate to other AWS services?
Athena SQL queries can invoke ML models deployed on Amazon SageMaker. You can specify the Amazon S3 location where you want to store the results of these Athena SQL queries.
Creating tables, data formats and partitions
4. What is a SerDe? What role do SerDes play in Amazon Athena?
SerDe stands for Serializer/Deserializer. SerDes are libraries that tell Hive how to interpret data formats. Hive DDL statements require you to specify a SerDe, so that the system knows how to interpret the data that you’re pointing to. Amazon Athena uses SerDes to interpret the data read from Amazon S3. The concept of SerDes in Athena is the same as the concept used in Hive. Amazon Athena supports the following SerDes:
Apache Web Logs: “org.apache.hadoop.hive.serde2.RegexSerDe”
CSV: “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”
TSV: “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”
Custom Delimiters: “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”
Parquet: “org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe”
ORC: “org.apache.hadoop.hive.ql.io.orc.OrcSerde”
JSON: “org.apache.hive.hcatalog.data.JsonSerDe” or “org.openx.data.jsonserde.JsonSerDe”
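To illustrate where a SerDe name goes in a Hive DDL statement, here is a minimal Python sketch that builds a CREATE EXTERNAL TABLE statement for JSON data using one of the JSON SerDes listed above. The helper `json_table_ddl`, the table name, the columns, and the S3 path are hypothetical. The sketch only produces the statement text and does not run it.

```python
# SerDe class name taken from the JSON entry in the list above.
JSON_SERDE = "org.openx.data.jsonserde.JsonSerDe"


def json_table_ddl(table, columns, location):
    """Return Hive DDL for an external JSON table stored at `location`."""
    column_list = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_list}\n)\n"
        f"ROW FORMAT SERDE '{JSON_SERDE}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition, for illustration only.
ddl = json_table_ddl(
    "events",
    [("event_id", "string"), ("created_at", "timestamp")],
    "s3://example-bucket/events/",
)
print(ddl)
```

The ROW FORMAT SERDE clause is the part that tells Athena which library to use when reading the files under the given location.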
5. Does Amazon Athena support data partitioning?
Yes. Amazon Athena allows you to partition your data on any column. Partitions allow you to limit the amount of data each query scans, leading to cost savings and faster performance. You can specify your partitioning scheme using the PARTITIONED BY clause in the CREATE TABLE statement.
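The PARTITIONED BY clause can be sketched as follows. This Python snippet builds a CREATE TABLE statement with partition columns and a query that filters on them, which is what lets Athena limit the data scanned. The helper `partitioned_table_ddl`, the table, the column names, and the S3 path are hypothetical, and the JSON SerDe is just one possible choice of format.

```python
def partitioned_table_ddl(table, columns, partition_columns, location):
    """Return a CREATE TABLE statement with a PARTITIONED BY clause."""
    cols = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    parts = ", ".join(f"{name} {col_type}" for name, col_type in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        f"PARTITIONED BY ({parts})\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table partitioned by year and month.
ddl = partitioned_table_ddl(
    "access_logs",
    [("request_id", "string"), ("status", "int")],
    [("year", "string"), ("month", "string")],
    "s3://example-bucket/access-logs/",
)

# Filtering on the partition columns restricts the scan to matching partitions.
query = (
    "SELECT status, count(*) AS requests FROM access_logs "
    "WHERE year = '2024' AND month = '01' GROUP BY status"
)
print(ddl)
```

Partition columns are declared only in the PARTITIONED BY clause, not in the main column list.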