Tag: big_data_interview
Hive : How to update the access time of a file or directory in the Hive data warehouse [Touch]
Among the many functions Hive provides, one essential operation is “TOUCH.” In this article, we will explore the purpose of…
PySpark : Identifying Data Skewness and Partition Row Counts in PySpark
Data skewness is a common issue in large-scale data processing. It happens when data is not evenly distributed across…
Hive : Understanding Array Aggregation in Apache Hive
Apache Hive offers many inbuilt functions to process data, among which collect_list() and collect_set() are commonly used to perform array aggregation….
Hive : Creating and Utilizing 64-bit Hash Values in Apache Hive
Apache Hive provides several inbuilt functions to process the data. One of these is the hash() function, which calculates a…
Hive : How can we return the average of non-NULL records in Hive ?
The function you need in Apache Hive is the avg() function. It is an aggregate function that returns…
Hive : How to Delete Old Apache Hive Logs, Free Up Space, and Boost Cluster Performance
Apache Hive logs are a critical component for debugging and performance optimization. However, over time, these logs can occupy significant…
Hive : How to Kill a Running Query in Apache Hive
There may be times when a running query needs to be terminated due to excessive resource usage, incorrect syntax, or…
Hive : Seeing Long Running Queries in Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data query and analysis….
PySpark : from_utc_timestamp Function: A Detailed Guide
The from_utc_timestamp function in PySpark is a highly useful function that allows users to convert UTC time to a specified…
PySpark : Fixing ‘TypeError: an integer is required (got type bytes)’ Error in PySpark with Spark 2.4.4
Apache Spark is an open-source distributed general-purpose cluster-computing framework. PySpark is the Python library for Spark, and it provides an…