PySpark: Fixing the ‘TypeError: an integer is required (got type bytes)’ Error with Spark 2.4.4


Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. PySpark is its Python API, and it provides an easy-to-use interface for Spark programming. After installing Spark 2.4.4, however, you might run into the error TypeError: an integer is required (got type bytes) as soon as you try to use PySpark.

This issue is a Python version compatibility problem: it appears when PySpark 2.4.4 runs under Python 3.8 or later, which the Spark 2.4.x line does not support. Fortunately, there are straightforward ways to address it. This article walks through them so that you can run your PySpark applications smoothly.
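As a quick pre-flight check, the version combination can be tested before any Spark code runs. The sketch below is only illustrative: `is_affected` is a hypothetical helper, and it assumes the incompatibility starts with Python 3.8 and no longer applies from PySpark 3.0 onward.

```python
import sys


def is_affected(python_version, pyspark_version):
    """Return True for the version pair assumed to trigger the TypeError.

    Assumption: Python >= 3.8 combined with PySpark < 3.0 is the
    problematic combination.
    """
    py_major_minor = tuple(python_version[:2])
    spark_major = int(pyspark_version.split(".")[0])
    return py_major_minor >= (3, 8) and spark_major < 3


if __name__ == "__main__":
    # Check the running interpreter against the PySpark version in question.
    print(is_affected(sys.version_info, "2.4.4"))
```

Running this with the interpreter you intend to use for PySpark tells you whether the rest of this article applies to your setup.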

Let’s assume we’re trying to run the following simple PySpark code that reads a CSV file and displays its content:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CSV Reader").getOrCreate()
data = spark.read.csv('sample.csv', inferSchema=True, header=True)
data.show()

Or, with hardcoded values:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = SparkSession.builder.appName("DataFrame Creator").getOrCreate()
data = [("John", 1), ("Doe", 2)]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("ID", IntegerType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()

With an incompatible Python version, either script fails with this error message:

TypeError: an integer is required (got type bytes)

How to resolve

First, if PySpark was installed with pip, try upgrading it. Releases from PySpark 3.0 onward support Python 3.8:

pip install --upgrade pyspark
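To confirm which PySpark release pip actually installed, the version can be read from the package metadata without importing pyspark (the import itself may be what fails). This is a minimal sketch: `installed_version` is a hypothetical helper, and it relies on `importlib.metadata`, which is available from Python 3.8.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


# Prints the version string, or None if pyspark is not installed via pip.
print(installed_version("pyspark"))
```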

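The issue is a compatibility problem between Python 3.8 or later and PySpark 2.4.4. Python 3.8 added a new parameter to the constructor of code objects (types.CodeType), and the copy of cloudpickle bundled with PySpark 2.4.4 still calls it with the older argument list. A bytes value therefore ends up in a position where an integer is expected, which raises this TypeError.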
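The interpreter change behind this can be observed directly. The sketch below assumes the failure comes from code objects gaining a positional-only argument count in Python 3.8; it only inspects a code object and does not touch PySpark. The function `add` is an arbitrary example.

```python
import sys


def add(a, b=2):
    return a + b


code = add.__code__

# From Python 3.8 on, code objects carry a co_posonlyargcount field, and the
# types.CodeType constructor takes a matching extra argument. Callers that
# still pass the pre-3.8 positional argument list shift every later argument
# by one slot.
has_posonly_field = hasattr(code, "co_posonlyargcount")
print(sys.version_info[:2], has_posonly_field)
```

If the second value printed is True, the interpreter uses the newer code-object layout that PySpark 2.4.4 was not written for.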

A quick fix is to run PySpark 2.4.4 under an earlier Python version, such as 3.7 or 3.6. If you don’t want to downgrade Python, you can also try applying a patch to PySpark’s codebase; verify the result on your own installation before relying on it.

The patch involves modifying the pyspark/serializers.py file in your PySpark directory:

1. Open the pyspark/serializers.py file in a text editor. The exact path depends on your PySpark installation.

2. Find the following function definition (around line 377):

def _read_with_length(stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        return None
    return stream.read(length)

3. Replace the return stream.read(length) line with the following code:

    result = stream.read(length)
    if length and not result:
        raise EOFError
    return result

4. Save and close the file.
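For step 1, the location of the file can be found without importing pyspark, which matters because the import itself may fail. This is a sketch under assumptions: `locate_package_file` is a hypothetical helper, and it expects PySpark to be installed as a regular package in the active Python environment.

```python
import importlib.util
import os


def locate_package_file(package, filename):
    """Return the path of `filename` inside a top-level package, or None.

    find_spec() on a top-level name locates the package without executing
    its __init__.py, so a broken import does not get in the way.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for directory in spec.submodule_search_locations:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Prints the full path, or None if pyspark is not installed here.
print(locate_package_file("pyspark", "serializers.py"))
```

If Spark was unpacked from a downloaded archive instead, the file usually lives under the python/pyspark directory of that Spark installation.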

This patch raises EOFError when a non-zero length was announced but the stream returned no data, so a prematurely closed stream is reported explicitly instead of being passed on as an empty result.
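The behaviour of the patched function can be tried on in-memory streams. The sketch below is standalone and does not import PySpark: `read_int` and `END_OF_DATA_SECTION` are simplified stand-ins for the helpers the snippet above refers to, and the framing (a 4-byte big-endian length prefix, with -1 as the end marker) is an assumption made for the demonstration.

```python
import io
import struct

# Stand-in for SpecialLengths.END_OF_DATA_SECTION (assumed value).
END_OF_DATA_SECTION = -1


def read_int(stream):
    """Read a 4-byte big-endian signed integer (simplified stand-in)."""
    raw = stream.read(4)
    if len(raw) < 4:
        raise EOFError
    return struct.unpack("!i", raw)[0]


def read_with_length(stream):
    """Patched variant: read a length-prefixed payload from the stream."""
    length = read_int(stream)
    if length == END_OF_DATA_SECTION:
        return None
    result = stream.read(length)
    # Raises only when data was expected but nothing at all could be read.
    if length and not result:
        raise EOFError
    return result


complete = io.BytesIO(struct.pack("!i", 3) + b"abc")
print(read_with_length(complete))  # b'abc'
```

Note that the check only catches a completely empty read; a payload that arrives shorter than the announced length is still returned as-is.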

Now run your PySpark code again. If the TypeError still appears, the Python and PySpark versions are still mismatched, and changing one of them as described above is the more reliable fix.

Author: user
