Introduction to 64-bit Hashing
A hash function is a function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash codes, hash values, or simply hashes.
When we say a hash value is a “signed 64-bit” value, it means the hash function outputs a 64-bit integer that can represent both positive and negative numbers. In computing, a 64-bit integer can represent a vast range of numbers, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
A 64-bit hash function can be useful in a variety of scenarios, particularly when working with large data sets. It can be used for quickly comparing complex data structures, indexing data, and checking data integrity.
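To make this concrete before turning to PySpark, here is a minimal pure-Python sketch of a 64-bit hash: the well-known FNV-1a algorithm, which maps a byte string of any length to a 64-bit value, plus a helper that reinterprets the unsigned result as a signed 64-bit integer (the way a Spark long would hold the same bits).

```python
# FNV-1a: a simple, well-known non-cryptographic 64-bit hash.
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """Map bytes of arbitrary length to an unsigned 64-bit integer."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep only 64 bits
    return h

def to_signed_64(h: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    return h - (1 << 64) if h >= (1 << 63) else h

print(fnv1a_64(b"a"))                # unsigned 64-bit value
print(to_signed_64(fnv1a_64(b"a")))  # same bits, signed interpretation
```

Any input, however long, lands in the signed range quoted above; different inputs can still land on the same value, which is the collision problem discussed later.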
Use of 64-bit Hashing in PySpark
PySpark’s built-in hash() function returns a 32-bit integer, and since Spark 3.0 there is also xxhash64(), which produces a 64-bit hash directly. To illustrate how to plug in a custom hash, we will use the MurmurHash3 implementation from Python’s mmh3 library: its hash64() function computes the 128-bit hash and returns it as a pair of signed 64-bit integers, of which we keep the first. You can install the library using pip:
pip install mmh3
Here is an example of how to generate a 64-bit hash value in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
import mmh3
# Create a Spark session
spark = SparkSession.builder.appName("freshers.in Learning for 64-bit Hashing in PySpark").getOrCreate()

# Create sample data
data = [("Sachin",), ("Ramesh",), ("Babu",)]
df = spark.createDataFrame(data, ["Name"])

# Function to generate a 64-bit hash: mmh3.hash64() returns a pair of
# signed 64-bit integers, and we keep the first one
def hash_64(value):
    return mmh3.hash64(value.encode('utf-8'))[0]

# Create a UDF for the 64-bit hash function, returning a LongType
hash_64_udf = udf(hash_64, LongType())

# Apply the UDF to the DataFrame
df_hashed = df.withColumn("Name_hashed", hash_64_udf(df['Name']))

# Show the DataFrame
df_hashed.show()
In this example, we create a Spark session and a DataFrame df with a single column “Name”. Then, we define the function hash_64 to generate a 64-bit hash of an input string. After that, we create a user-defined function (UDF) hash_64_udf using PySpark SQL functions. Finally, we apply this UDF to the column “Name” in the DataFrame df and create a new DataFrame df_hashed with the 64-bit hashed values of the names.
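If adding the mmh3 dependency is undesirable, a similar 64-bit hash can be built from Python’s standard library alone. The sketch below (the function name is illustrative, not part of any library) derives a signed 64-bit integer from an 8-byte BLAKE2b digest; the values differ from MurmurHash3’s, but the UDF pattern is identical.

```python
import hashlib

def blake2b_64(value: str) -> int:
    """Signed 64-bit hash derived from an 8-byte BLAKE2b digest (stdlib only)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)

# Would be registered exactly like hash_64 above:
#   blake2b_64_udf = udf(blake2b_64, LongType())
print(blake2b_64("Sachin"))
```

BLAKE2b is deterministic, so this is safe to persist across runs, unlike Python’s built-in hash() for strings, which is randomized per process.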
Advantages and Drawbacks of 64-bit Hashing
Advantages:
- Large Range: A 64-bit hash value has a very large range of possible values, which can help reduce hash collisions (different inputs producing the same hash output).
- Fast Comparison and Lookup: Hashing can turn time-consuming operations such as string comparison into a simple integer comparison, which can significantly speed up certain operations like data lookups.
- Data Integrity Checks: Hash values can provide a quick way to check if data has been altered.
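The “fast comparison” point can be seen with CPython’s own built-in hash(), which returns a signed 64-bit value on 64-bit builds (note it is randomized per process for strings, so it must not be persisted): comparing precomputed integer hashes is a cheap pre-check before a full equality comparison.

```python
# Precompute hashes once, then compare integers instead of long strings.
records = ["alpha" * 1000, "beta" * 1000, "alpha" * 1000]
hashes = [hash(r) for r in records]

def probably_equal(i: int, j: int) -> bool:
    """Cheap integer pre-check; a full comparison is still needed on a hash
    match, because distinct inputs can collide."""
    return hashes[i] == hashes[j] and records[i] == records[j]

print(probably_equal(0, 2))  # True: same content
print(probably_equal(0, 1))  # False: different content
```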
Drawbacks:
- Collisions: While the possibility is reduced, hash collisions can still occur where different inputs produce the same hash output.
- Not for Security: Non-cryptographic hashes such as MurmurHash3 are not designed to withstand deliberate attacks; an adversary can construct inputs that collide on purpose. For security-sensitive uses, choose a cryptographic hash such as SHA-256.
- Data Loss: Hashing is a one-way function. Once data is hashed, it cannot be converted back to the original input.
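The collision drawback is easy to demonstrate by shrinking the hash: masking a hash down to 8 bits leaves only 256 possible outputs, so by the pigeonhole principle any 257 distinct inputs must contain a collision. The same effect exists at 64 bits, just astronomically rarer. A sketch using the standard library’s hashlib for determinism:

```python
import hashlib

def tiny_hash(value: str, bits: int = 8) -> int:
    """A deliberately small hash: the low `bits` bits of an 8-byte BLAKE2b digest."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

# With only 256 buckets, 257 distinct inputs guarantee a collision.
seen = {}
collision = None
for i in range(257):
    key = f"user-{i}"
    h = tiny_hash(key)
    if h in seen:
        collision = (seen[h], key)
        break
    seen[h] = key

print(collision)  # two distinct inputs sharing the same 8-bit hash
```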