Tag: big_data_interview

PySpark @ Freshers.in

Converting numerical strings from one base to another within DataFrames : conv

The conv function in PySpark simplifies the process of converting numerical strings from one base to another within DataFrames. With…

PySpark @ Freshers.in

Loading JSON schema from a JSON string in PySpark

We want to load the JSON schema from a JSON string. In PySpark, you can do this by parsing the…


Copying files from Hadoop’s HDFS (Hadoop Distributed File System) to your local machine

To copy files from Hadoop’s HDFS (Hadoop Distributed File System) to your local machine, you can use the hadoop fs…

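A sketch of the `hadoop fs -copyToLocal` invocation (equivalently `-get`), driven from Python. The paths are hypothetical, and the actual copy is left commented out since it requires a Hadoop client on the machine:

```python
import subprocess

hdfs_path = "/user/data/sample.csv"   # hypothetical HDFS source path
local_path = "/tmp/sample.csv"        # hypothetical local destination

# hadoop fs -copyToLocal <hdfs_path> <local_path> copies the file to the local FS
cmd = ["hadoop", "fs", "-copyToLocal", hdfs_path, local_path]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment on a machine with a configured Hadoop client
```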
PySpark @ Freshers.in

Optimizing PySpark queries with Adaptive Query Execution (AQE) – Example included

Spark 3+ brought numerous enhancements and features, and one of the notable ones is Adaptive Query Execution (AQE). AQE is…

AWS Glue @ Freshers.in

Navigating job dependencies in AWS Glue – Managing ETL workflows

AWS Glue manages dependencies between jobs using triggers. Triggers can start jobs based on the completion status of other jobs,…

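A sketch of a conditional Glue trigger that starts one job when another succeeds, expressed as a `create_trigger` request. The job and trigger names are hypothetical, and the boto3 call is left commented out since it needs AWS credentials:

```python
# Request body for glue.create_trigger: start job_b once job_a reports SUCCEEDED
trigger_request = {
    "Name": "run_job_b_after_job_a",       # hypothetical trigger name
    "Type": "CONDITIONAL",
    "Predicate": {
        "Logical": "AND",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "job_a",            # hypothetical upstream job
            "State": "SUCCEEDED",
        }],
    },
    "Actions": [{"JobName": "job_b"}],     # hypothetical downstream job
    "StartOnCreation": True,
}

# import boto3
# boto3.client("glue").create_trigger(**trigger_request)
```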
PySpark @ Freshers.in

Spark repartition() vs coalesce() – A complete guide

In PySpark, managing data across different partitions is crucial for optimizing performance, especially for large-scale data processing tasks. Two methods…
