One popular method of ensuring data integrity is the Cyclic Redundancy Check (CRC), which detects accidental changes to raw data. In this tutorial, we explore PySpark's crc32 function, which computes the CRC32 checksum of a binary column and returns the value as a bigint, making it easy to verify that data has not changed.
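Before turning to PySpark, a quick standalone illustration may help. The sketch below uses Python's standard-library zlib.crc32 (not part of PySpark; included here only for illustration) to show that even a tiny change to the input produces a completely different checksum:
import zlib

original = b"Sample data for CRC32 check"
tampered = b"Sample data for CRC32 check!"   # one extra character

# zlib.crc32 returns an unsigned 32-bit integer, the same range of values
# that Spark's crc32 reports as a bigint
print(zlib.crc32(original))
print(zlib.crc32(tampered))   # differs from the value above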
Start by creating a PySpark DataFrame containing binary data. For simplicity, we'll use string data, which PySpark automatically converts to binary when the crc32 function is applied.
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Start a Spark session for the demonstration
spark = SparkSession.builder \
    .appName("CRC32 Demonstration") \
    .getOrCreate()

# Sample rows; the strings are converted to binary when crc32 is applied
data = [
    Row(file_data="Sample data for CRC32 check"),
    Row(file_data="Another row of data"),
    Row(file_data="Yet another row of important data")
]

df = spark.createDataFrame(data)
df.show(truncate=False)
Output
+---------------------------------+
|file_data |
+---------------------------------+
|Sample data for CRC32 check |
|Another row of data |
|Yet another row of important data|
+---------------------------------+
Computing CRC32 with the crc32 Function
PySpark SQL provides the crc32 function for calculating the CRC32 of binary columns. Use it together with select as follows:
from pyspark.sql.functions import crc32

# Compute the checksum of 'file_data' and add it as a new column
df_with_crc32 = df.select("*", crc32("file_data").alias("crc32_value"))
df_with_crc32.show(truncate=False)
This computes the CRC32 checksum of the ‘file_data’ column and adds a new column named ‘crc32_value’ to the DataFrame containing the resulting values.
Output
+---------------------------------+-----------+
|file_data |crc32_value|
+---------------------------------+-----------+
|Sample data for CRC32 check |3874211970 |
|Another row of data |1113011489 |
|Yet another row of important data|2499421737 |
+---------------------------------+-----------+
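With the checksums stored alongside the data, a later integrity check only needs to recompute them and compare. The following is a minimal sketch assuming the df_with_crc32 DataFrame from above is still available; in practice you would re-read the data and the previously stored checksums from your own storage:
from pyspark.sql.functions import crc32, col

# Recompute the checksum from the current contents of 'file_data'
verified = df_with_crc32.withColumn("recomputed_crc32", crc32("file_data"))

# Rows where the stored and recomputed checksums differ indicate that
# the data changed after the original checksum was recorded
corrupted = verified.filter(col("crc32_value") != col("recomputed_crc32"))
corrupted.show(truncate=False)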