Tag: Big Data

PySpark @ Freshers.in

PySpark : Dropping duplicate rows in PySpark – A Comprehensive Guide with examples

PySpark provides several methods to remove duplicate rows from a DataFrame. In this article, we will go over the steps…

PySpark @ Freshers.in

PySpark : Replacing null values in a PySpark DataFrame column with 0 or any value you wish

To replace null values in a PySpark DataFrame column with a numeric value (e.g., 0), you can…

PySpark @ Freshers.in

PySpark : unix_timestamp function – A comprehensive guide

One of the key functionalities of PySpark is the ability to transform data into the desired format. In some cases,…

PySpark @ Freshers.in

PySpark : Reading parquet file stored on Amazon S3 using PySpark

To read a Parquet file stored on Amazon S3 using PySpark, you can use the following code: from pyspark.sql import…

Google DataFlow @ Freshers.in

Google Dataflow : Handling Late Data in Google Dataflow

Handling late-arriving data is a common challenge when working with streaming data processing systems like Google Dataflow. Late data refers…

Google DataFlow @ Freshers.in

Google Dataflow : An overview and the programming languages supported by Google Dataflow

Google Dataflow is a cloud-based data processing service that allows developers to easily and efficiently process large volumes of data…

Hive @ Freshers.in

Hive : Hive Table Properties : How are Hive Table Properties used?

One of the key features of Hive is the ability to define table properties, which can be used to control…

Hive @ Freshers.in

Hive : Implementation of UDF in Hive using Python. A Comprehensive Guide

A User-Defined Function (UDF) in Hive is a function that is defined by the user and can be used in…

Hive @ Freshers.in

Hive : Hive Metastore and its importance

The Hive Metastore is an important component of the Apache Hive data warehouse software. It acts as a central repository…

Hive @ Freshers.in

Hive : Hive Optimizers: A Comprehensive Guide

Hive is a data warehousing tool that provides a SQL-like interface for querying large datasets stored in Hadoop Distributed File…