In this article, we will explore PySpark’s LongType and ShortType data types, their properties, and how to work with them.
PySpark is the Python API for Apache Spark, a framework for processing large-scale datasets. Its pyspark.sql.types module provides the data types used to describe the columns of a DataFrame. Two of its integer types are LongType and ShortType.
LongType is a data type in PySpark that represents signed 64-bit integer values. A LongType column can hold values from -9223372036854775808 to 9223372036854775807. This data type is useful when working with large numbers, such as epoch timestamps or IDs.
ShortType is a data type in PySpark that represents signed 16-bit integer values. A ShortType column can hold values from -32768 to 32767. This data type is useful when working with small integers, such as small counts or indices.
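The two ranges above follow directly from the bit widths: a signed n-bit integer spans -2^(n-1) to 2^(n-1) - 1. Here is a minimal sketch in plain Python that reproduces the bounds. The helper name signed_range is made up for illustration and is not part of PySpark.

```python
def signed_range(bits):
    """Return (min, max) for a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# 64 bits corresponds to LongType, 16 bits to ShortType
long_min, long_max = signed_range(64)
short_min, short_max = signed_range(16)

print(long_min, long_max)
print(short_min, short_max)
```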
To use LongType or ShortType in PySpark, we need to import the LongType and ShortType classes from the pyspark.sql.types module. Here is an example:
    from pyspark.sql.types import LongType, ShortType

    # create a LongType instance (describes a 64-bit integer column)
    long_var = LongType()

    # create a ShortType instance (describes a 16-bit integer column)
    short_var = ShortType()
LongType and ShortType can now be used to define the schema of a DataFrame. Here is an example:
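    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, ShortType

    # get a SparkSession (already available as "spark" in the pyspark shell)
    spark = SparkSession.builder.appName("long-short-example").getOrCreate()

    # define the schema of the DataFrame
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("count", ShortType(), True),
    ])

    # create the DataFrame; every value must be an integer that fits its declared type
    data = [(1, 100), (2, 200), (3, 300)]
    df = spark.createDataFrame(data, schema)

    # show the DataFrame
    df.show()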
In the above example, we define the schema of the DataFrame using the StructType and StructField classes. The StructField class takes three arguments: the name of the field, the data type, and a boolean value that indicates whether the field can be null or not. We then create the DataFrame using the spark.createDataFrame() method and pass in the data and schema as arguments.
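Because the schema fixes the type of each field, the values in the data must fit the declared type's range. One way to catch problems early is to check the rows in plain Python before passing them to spark.createDataFrame(). The sketch below is only an illustration: the helper out_of_range_rows and the sample rows are hypothetical and not part of PySpark.

```python
SHORT_MIN, SHORT_MAX = -32768, 32767  # ShortType range (signed 16-bit)

def out_of_range_rows(rows, column_index, low=SHORT_MIN, high=SHORT_MAX):
    """Return the rows whose value at column_index lies outside [low, high].

    None is skipped because a nullable field may hold null.
    """
    return [
        row for row in rows
        if row[column_index] is not None
        and not (low <= row[column_index] <= high)
    ]

rows = [(1, 100), (2, 40000), (3, None)]  # 40000 does not fit in 16 bits
bad = out_of_range_rows(rows, column_index=1)
print(bad)  # [(2, 40000)]
```

Passing different low and high bounds makes the same check usable for a LongType field.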
We can perform various operations on LongType and ShortType columns in PySpark, such as arithmetic operations and comparisons. Here is an example:
    # create a DataFrame with a LongType column and a ShortType column
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("count", ShortType(), True),
    ])
    data = [(1, 100), (2, 200), (3, 300)]
    df = spark.createDataFrame(data, schema)

    # perform arithmetic operations
    df.withColumn("count_double", df["count"] * 2).show()
    df.withColumn("count_sum", df["count"] + 100).show()

    # perform comparisons
    df.filter(df["count"] > 200).show()
    df.filter(df["count"] == 200).show()
In the above example, we create a DataFrame with the integer columns id and count and perform arithmetic operations and comparisons on them. The withColumn() method adds a new column to the DataFrame that is the result of an arithmetic operation on the existing columns. The filter() method filters the rows of the DataFrame based on a comparison condition. Keep in mind that Python integers are inferred as LongType when createDataFrame() receives only column names, so a ShortType column needs an explicit schema or a cast.
LongType and ShortType are useful data types in PySpark that represent large and small integers, respectively. They can be used to define the schema of a DataFrame and to take part in arithmetic operations and comparisons.