To connect PySpark to Google BigQuery, you will need to have the Google Cloud SDK…
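A minimal sketch of what that connection looks like, assuming the spark-bigquery connector; the project, dataset, table, and connector version below are placeholders, not values from the article:

```python
# Sketch: reading a BigQuery table into PySpark via the spark-bigquery
# connector. All names and the connector version are example values.

def bq_table_ref(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified table reference the connector expects."""
    return f"{project}.{dataset}.{table}"

def build_session():
    # Deferred import: requires pyspark and the connector jar.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder.appName("bq-demo")
        # Pick a connector version matching your Spark/Scala build.
        .config(
            "spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
        )
        .getOrCreate()
    )

def read_bigquery(spark, project, dataset, table):
    """Load one BigQuery table as a DataFrame."""
    return (
        spark.read.format("bigquery")
        .option("table", bq_table_ref(project, dataset, table))
        .load()
    )
```

Authentication (e.g. a service-account key via `GOOGLE_APPLICATION_CREDENTIALS`) still comes from the Google Cloud SDK setup the excerpt mentions.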
Google Cloud – GCP: all products and what each service is used for.
This is the list of Google Cloud products / services. COMPUTE: Cloud Functions – event-driven serverless functions; App Engine – managed app…
How to run a Spark job on a different server from Airflow using BashOperator?
In this article we will discuss how we can trigger a PySpark job running on AWS EMR from…
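One common shape for this, sketched below under assumptions (the EMR host, key file, and script path are placeholders): the BashOperator simply runs `spark-submit` over ssh on the remote master node.

```python
# Sketch: an Airflow DAG that triggers spark-submit on a remote host
# (e.g. an EMR master node) over ssh via BashOperator.

def build_spark_submit_cmd(host: str, key_file: str, app_path: str,
                           deploy_mode: str = "cluster") -> str:
    """Compose the ssh + spark-submit command BashOperator will run."""
    return (
        f"ssh -i {key_file} hadoop@{host} "
        f"'spark-submit --deploy-mode {deploy_mode} {app_path}'"
    )

def build_dag():
    # Deferred imports so the command builder stays usable without Airflow.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="remote_spark_job",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_submit_on_emr",
            bash_command=build_spark_submit_cmd(
                "emr-master.example.com",   # placeholder host
                "/keys/emr.pem",            # placeholder key file
                "s3://my-bucket/jobs/etl.py",
            ),
        )
    return dag
```

The ssh approach requires the Airflow worker to hold a key authorized on the EMR master; SSHOperator or EMR-specific operators are alternatives.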
How to merge multiple PDF files using Python?
Use case: if you have multiple files, for example chapter-wise question papers, and you need to have…
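A short sketch of the merge, assuming the `pypdf` library (the article's library choice is not visible in the excerpt); the natural-sort helper keeps chapter-wise files in numeric order:

```python
import re

def chapter_sort_key(filename: str):
    """Natural sort key so 'chapter10.pdf' sorts after 'chapter2.pdf'."""
    return [int(p) if p.isdigit() else p.lower()
            for p in re.split(r"(\d+)", filename)]

def merge_pdfs(paths, output_path):
    # Deferred import: requires `pip install pypdf`.
    from pypdf import PdfWriter
    writer = PdfWriter()
    for path in sorted(paths, key=chapter_sort_key):
        writer.append(path)  # appends all pages of this file
    with open(output_path, "wb") as f:
        writer.write(f)
```

For example, `merge_pdfs(["chapter2.pdf", "chapter10.pdf", "chapter1.pdf"], "book.pdf")` writes the chapters in 1, 2, 10 order.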
Over the Wall – 0/1 knapsack (smallest number of boxes required to build two towers such that each of them has at least a given height)
Ramu and Jithin want to watch the grand finale, but unfortunately, they could not get tickets to the match. However,…
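The full problem statement is truncated here, but the title tags it as 0/1 knapsack; a minimal sketch of the underlying DP, with an illustrative instance (not from the article):

```python
def knapsack(weights, values, capacity):
    """Classic 0/1 knapsack: maximum value using each item at most once."""
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacity downwards so each item is counted at most once.
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]
```

For the two-tower variant the same table is typically built over one tower's height, then scanned for splits where both towers clear the required height.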
How to create a UDF in PySpark? What are the different ways you can call a PySpark UDF? (with example)
PySpark UDF: in order to develop a reusable function in Spark, one can use a PySpark UDF. A PySpark UDF is…
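A sketch of the three usual ways to call one, assuming a hypothetical string-cleaning function (the article's own example is not visible in the excerpt):

```python
# Sketch: three common ways to use a PySpark UDF. The plain Python
# function below is what every variant wraps.

def initcap_word(s):
    """Pure Python logic to be wrapped as a UDF."""
    return s.strip().title() if s else None

def demo(spark):
    # Deferred imports so initcap_word stays testable without Spark.
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    df = spark.createDataFrame([(" alice ",), (None,)], ["name"])

    # 1) udf() wrapper, used with the DataFrame API
    initcap_udf = udf(initcap_word, StringType())
    df1 = df.withColumn("clean", initcap_udf(col("name")))

    # 2) decorator form
    @udf(returnType=StringType())
    def initcap_dec(s):
        return initcap_word(s)
    df2 = df.withColumn("clean", initcap_dec(col("name")))

    # 3) registered for use inside Spark SQL
    spark.udf.register("initcap_py", initcap_word, StringType())
    df.createOrReplaceTempView("people")
    df3 = spark.sql("SELECT initcap_py(name) AS clean FROM people")
    return df1, df2, df3
```

Keeping the logic in a plain function (variant 1 and 3 reuse it) also makes the UDF unit-testable without a SparkSession.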
How to convert MapType to multiple columns based on key using PySpark?
Use case: converting a map to multiple columns. There can be raw data with a MapType column holding multiple key-value pairs…
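One way this is commonly done, sketched under assumptions (the `props` column name is an example): collect the distinct keys, then select each key as its own column.

```python
# Sketch: expanding a MapType column into one column per key.

def map_key_exprs(map_col, keys):
    """SQL select expressions pulling each key out of a map column."""
    return [f"{map_col}['{k}'] AS {k}" for k in keys]

def expand_map(df, map_col):
    # Deferred imports: requires a running SparkSession.
    from pyspark.sql.functions import explode, map_keys, col
    # Collect the distinct keys present in the data (costs one extra job).
    keys = [r[0] for r in
            df.select(explode(map_keys(col(map_col)))).distinct().collect()]
    return df.selectExpr("*", *map_key_exprs(map_col, keys)).drop(map_col)
```

Rows missing a key get NULL in that column, since map lookup on an absent key returns NULL in Spark SQL.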
How to create an Airflow DAG (scheduler) to execute a Redshift query?
Use case: we have a Redshift query (an insert SQL) to load data from another table on a daily…
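A sketch of such a DAG, assuming the generic `SQLExecuteQueryOperator` with a Redshift connection; the table names and connection id are placeholders:

```python
# Sketch: a daily Airflow DAG running an insert SQL against Redshift.

def build_insert_sql(target: str, source: str, run_date: str) -> str:
    """Daily insert statement; run_date is Airflow's templated {{ ds }}."""
    return (
        f"INSERT INTO {target} "
        f"SELECT * FROM {source} WHERE event_date = '{run_date}';"
    )

def build_dag():
    # Deferred imports keep build_insert_sql usable without Airflow.
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    with DAG(
        dag_id="redshift_daily_insert",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        SQLExecuteQueryOperator(
            task_id="daily_insert",
            conn_id="redshift_default",  # placeholder Redshift connection id
            sql=build_insert_sql("reporting.daily_sales",
                                 "staging.sales", "{{ ds }}"),
        )
    return dag
```

Passing `{{ ds }}` lets Airflow template the execution date in, so backfills insert the correct day's rows.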
Explain how you can implement dynamic partitioning in Hive (automatically creating partitions based on column values)
Dynamic partitioning in Hive: dynamic partitioning is a practical method for loading data from a…
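The two session settings and the insert shape are the core of it; a sketch issuing them through `spark.sql()` (table and column names are examples):

```python
# Sketch: Hive dynamic-partition settings plus the insert statement,
# run through spark.sql().

DYNAMIC_PARTITION_SETTINGS = [
    "SET hive.exec.dynamic.partition = true",
    # nonstrict mode allows ALL partition columns to be dynamic
    "SET hive.exec.dynamic.partition.mode = nonstrict",
]

def dynamic_insert_sql(target, source, cols, partition_col):
    """Hive takes the partition value from the LAST select column(s)."""
    select_list = ", ".join(cols + [partition_col])
    return (
        f"INSERT INTO TABLE {target} PARTITION ({partition_col}) "
        f"SELECT {select_list} FROM {source}"
    )

def run_dynamic_insert(spark, target, source, cols, partition_col):
    for stmt in DYNAMIC_PARTITION_SETTINGS:
        spark.sql(stmt)
    spark.sql(dynamic_insert_sql(target, source, cols, partition_col))
```

Hive then creates one partition directory per distinct value of the partition column found in the select output.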
How to insert from a non-partitioned table into a partitioned table in Hive?
You can insert data from a non-partitioned table into a partitioned table; in short, if you want to have…
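Besides the dynamic route, the static form fixes the partition value in the statement itself; a sketch with placeholder table and column names:

```python
# Sketch: copying rows from a non-partitioned staging table into one
# static partition of a partitioned Hive table.

def static_partition_insert(target, source, cols, part_col, part_val):
    """Static partitioning: the partition value is fixed in the
    PARTITION clause, so it is filtered on, not selected."""
    select_list = ", ".join(cols)
    return (
        f"INSERT INTO TABLE {target} PARTITION ({part_col}='{part_val}') "
        f"SELECT {select_list} FROM {source} "
        f"WHERE {part_col} = '{part_val}'"
    )
```

Note the select list excludes the partition column: in a static insert Hive supplies it from the PARTITION clause, and listing it again is an error.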
How to create an AWS Glue table where partitions have different columns?
AWS Glue is a serverless data integration service. There can be a condition where you can expect a new column in…
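One way to handle the drift, sketched under assumptions (the boto3 calls are real Glue APIs, but the database/table names and the widen-the-schema strategy are this sketch's, not necessarily the article's): keep the table schema as the union of all columns seen.

```python
# Sketch: widening a Glue table's schema to the union of columns
# observed across partitions.

def merge_columns(existing, new):
    """Union of Glue column dicts ({'Name': ..., 'Type': ...}),
    preserving the order in which columns first appear."""
    seen = {c["Name"] for c in existing}
    merged = list(existing)
    for col in new:
        if col["Name"] not in seen:
            merged.append(col)
            seen.add(col["Name"])
    return merged

def widen_table_schema(database, table_name, new_columns):
    # Deferred import: requires boto3 and AWS credentials.
    import boto3
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    sd = table["StorageDescriptor"]
    sd["Columns"] = merge_columns(sd["Columns"], new_columns)
    # update_table rejects the read-only fields get_table returns,
    # so rebuild a minimal TableInput.
    glue.update_table(
        DatabaseName=database,
        TableInput={
            "Name": table_name,
            "StorageDescriptor": sd,
            "PartitionKeys": table.get("PartitionKeys", []),
        },
    )
```

Query engines such as Athena then read partitions missing the newer columns as NULL for those columns (for self-describing formats like Parquet).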