PySpark LongType and ShortType: Handling Integer Data

In this comprehensive guide, we’ll dive into two essential PySpark integer data types: LongType and ShortType. You’ll discover their applications, use cases, and how to leverage them effectively in your PySpark projects.

Understanding Integer Data Types

Integer data types in PySpark are crucial for handling numerical values, such as counts, IDs, timestamps, and more. They allow you to represent and manipulate integer data efficiently, catering to a wide range of data analysis and processing needs.
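PySpark's integer types are fixed-width signed integers: ByteType (1 byte), ShortType (2 bytes), IntegerType (4 bytes), and LongType (8 bytes). As a quick reference, the range each type can hold follows directly from its width; this small pure-Python sketch (no Spark session needed) computes those ranges:

```python
# Fixed-width signed integer ranges backing PySpark's integer types:
# ByteType -> 1 byte, ShortType -> 2 bytes, IntegerType -> 4 bytes, LongType -> 8 bytes.
def signed_range(n_bytes):
    """Return the (min, max) values representable by a signed integer of n_bytes."""
    bits = 8 * n_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for name, size in [("ByteType", 1), ("ShortType", 2),
                   ("IntegerType", 4), ("LongType", 8)]:
    lo, hi = signed_range(size)
    print(f"{name}: {lo} to {hi}")
```

Choosing the narrowest type that safely fits your data keeps storage and memory usage down, while LongType gives you headroom for very large values.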

1. LongType: Handling Large Integer Values

The LongType data type in PySpark represents 64-bit signed integers, covering values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. It is commonly used for numeric IDs, epoch timestamps, or any integer data too large for IntegerType.

Example: Storing Timestamps

Let’s consider a scenario where you want to store timestamps for events in a PySpark dataframe:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Initialize SparkSession
spark = SparkSession.builder.appName("LongType at Freshers.in Learning Example").getOrCreate()
# Create a sample dataframe
data = [("Event 1", 1643145600),  # 2022-01-25 21:20:00 UTC
        ("Event 2", 1674681600),  # 2023-01-25 21:20:00 UTC
        ("Event 3", 1706217600)]  # 2024-01-25 21:20:00 UTC
schema = StructType([StructField("EventName", StringType(), True),
                     StructField("Timestamp", LongType(), True)])
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show()
Output
+---------+----------+
|EventName| Timestamp|
+---------+----------+
|  Event 1|1643145600|
|  Event 2|1674681600|
|  Event 3|1706217600|
+---------+----------+

In this example, we use LongType to store timestamps for events.
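Because the Timestamp column stores plain epoch seconds, you can decode the values with Python's standard library to see exactly which UTC instants they represent; a quick check that needs no Spark session:

```python
from datetime import datetime, timezone

# Decode the epoch-second values stored in the LongType column above.
events = [("Event 1", 1643145600), ("Event 2", 1674681600), ("Event 3", 1706217600)]
for name, ts in events:
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(name, dt.isoformat())  # e.g. Event 1 2022-01-25T21:20:00+00:00
```

Within Spark itself, you could similarly convert such a column with `pyspark.sql.functions.from_unixtime`, which turns epoch seconds into a formatted timestamp string.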

2. ShortType: Handling Small Integer Values

The ShortType data type in PySpark represents 16-bit signed integers, covering values from -32,768 to 32,767. It is particularly useful when you want to reduce storage and memory usage for integer data that doesn't need the range of IntegerType or LongType.

Example: Storing Product IDs

Suppose you want to store product IDs in a PySpark dataframe:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ShortType

# Initialize SparkSession
spark = SparkSession.builder.appName("ShortType @ Freshers.in Learning Example").getOrCreate()

# Create a sample dataframe
data = [("Product 1", 101),
        ("Product 2", 202),
        ("Product 3", 303),
        ("Product 4", 404),
        ("Product 5", 505)]

schema = StructType([StructField("ProductName", StringType(), True),
                     StructField("ProductID", ShortType(), True)])

df = spark.createDataFrame(data, schema)

# Show the dataframe
df.show()
Output
+-----------+---------+
|ProductName|ProductID|
+-----------+---------+
|  Product 1|      101|
|  Product 2|      202|
|  Product 3|      303|
|  Product 4|      404|
|  Product 5|      505|
+-----------+---------+
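Since ShortType is a 16-bit signed integer, any value outside -32,768 to 32,767 will be rejected when Spark verifies the rows. As a minimal sketch (plain Python, mirroring the range check Spark applies), you can validate IDs before building the dataframe:

```python
SHORT_MIN, SHORT_MAX = -32768, 32767  # range of a 16-bit signed integer (ShortType)

def fits_short(value):
    """Return True if value can be stored in a ShortType column without overflow."""
    return SHORT_MIN <= value <= SHORT_MAX

product_ids = [101, 202, 303, 404, 505]
print(all(fits_short(pid) for pid in product_ids))  # all sample IDs fit: True
print(fits_short(40000))                            # too large for ShortType: False
```

If your IDs may ever exceed this range, prefer IntegerType or LongType up front rather than migrating the schema later.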