MapType in PySpark is a data type used to represent a value that maps keys…
Author: user
PySpark – groupby with aggregation (count, sum, mean, min, max)
pyspark.sql.DataFrame.groupBy: the PySpark groupBy function groups the DataFrame by the specified columns so that aggregations (count, sum, mean, min, max) can be run on them…
PySpark filter: How to filter data in PySpark – multiple options explained
pyspark.sql.DataFrame.filter: the PySpark filter function is used to filter the data in a Spark DataFrame, in short, used for cleansing…
Amazon CloudFront quick reference and cheat sheet
1. CloudFront gives developers an easy and cost-effective way to distribute content with low latency and high data transfer speeds….
Amazon Aurora quick reference and cheat sheet.
1. Aurora is an AWS proprietary database. 2. Aurora is a fully managed service. 3. Aurora has high performance and…
Amazon Athena quick reference and cheat sheet
1. Amazon Athena is an interactive query service to analyze data in Amazon S3 using standard SQL. 2. Athena is…
Python throws NameError: name '__file__' is not defined – Solution
Executing os.path.dirname(os.path.realpath(__file__)) in the Python interactive shell raises the error NameError: name '__file__' is not defined. This is…
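The excerpt above is cut off before the article's own fix, so the following is only a minimal sketch of one common workaround, not necessarily the one the article gives. It assumes that falling back to the current working directory is acceptable when `__file__` is missing; the helper name `script_dir` is hypothetical.

```python
import os


def script_dir(namespace):
    """Return the directory of the running script.

    `namespace` is a module namespace such as globals(). In the
    interactive shell it has no "__file__" entry, so the current
    working directory is returned instead (an assumed fallback).
    """
    file_path = namespace.get("__file__")
    if file_path is None:
        # Interactive session: there is no script file to resolve.
        return os.getcwd()
    return os.path.dirname(os.path.realpath(file_path))


# Works both when run as a script and when pasted into the shell.
base_dir = script_dir(globals())
```

Looking the name up in `globals()` avoids the NameError entirely, because a missing dictionary key returns None instead of raising.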
Amazon API Gateway quick reference and cheat sheet
1. Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and securing REST, HTTP, and WebSocket APIs…
How to drop multiple partitions in Hive by specifying a condition
Hive partitions are a good and easy way to organize Hive tables by dividing them into different parts…
How to delete a partition's data as well from a Hive external table on a DROP command?
As you know, external tables are tables whose data Hive does not manage. So when…
How to convert a Hive managed table to an external table without recreating it?
In Hive, managed tables (internal tables) are Hive-owned tables, and the tables' data is managed and controlled by…