Understanding Series.astype(dtype)
The Series.astype(dtype) method in Pandas-on-Spark casts a series to a specified data type (dtype). This is useful in data processing tasks where data types need to be made consistent or transformed before further analysis.
Syntax:
Series.astype(dtype)
Where:
dtype: The data type to which the series will be cast.
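The dtype argument can usually be spelled in more than one way. The sketch below casts the same string data to float using a Python type, a string alias, and a NumPy dtype. The helper name cast_all_ways is hypothetical, and plain pandas stands in for the input so the snippet runs without a Spark session; the assumption is that a pandas-on-Spark series (pyspark.pandas.Series) accepts the same three spellings.

```python
import numpy as np
import pandas as pd


def cast_all_ways(series):
    """Cast `series` to float using three spellings of the same dtype.

    `cast_all_ways` is a hypothetical helper written for illustration only.
    """
    return {
        "python type": series.astype(float),
        "string alias": series.astype("float64"),
        "numpy dtype": series.astype(np.float64),
    }


# Plain pandas is used here as a stand-in (assumption: a
# pyspark.pandas.Series can be passed to the helper in the same way).
results = cast_all_ways(pd.Series(["10.5", "20.7"], name="numbers"))

for label, casted in results.items():
    print(label, casted.dtype)
```

All three calls should report the same resulting dtype, so the choice between them is mostly a matter of readability.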
Examples:
Let’s dive into some examples to understand how Series.astype(dtype) works in practice.
Casting Series to Numeric Data Type
Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the float data type.
# Importing necessary libraries
from pyspark.sql import SparkSession
import pandas as pd
import pyspark.pandas as ps
# Creating a SparkSession
spark = SparkSession.builder \
    .appName("Pandas-on-Spark @ Freshers.in") \
    .getOrCreate()
# Creating a pandas DataFrame
data = {'numbers': ['10.5', '20.7', '30.9', '40.2']}
pdf = pd.DataFrame(data)
# Converting the pandas DataFrame to a pandas-on-Spark DataFrame
# (a plain Spark DataFrame has no astype and does not support column assignment)
psdf = ps.from_pandas(pdf)
# Converting the 'numbers' column to float data type
psdf['numbers'] = psdf['numbers'].astype(float)
# Displaying the result
print(psdf['numbers'])
Output:
0    10.5
1    20.7
2    30.9
3    40.2
Name: numbers, dtype: float64
Casting Series to Categorical Data Type
Suppose we have a Pandas-on-Spark series containing categorical data, and we want to convert it to the category data type.
# pandas-on-Spark API (pandas is already imported as pd in the first example)
import pyspark.pandas as ps
# Creating a pandas DataFrame with categorical data
data = {'categories': ['A', 'B', 'C', 'A', 'B', 'C']}
pdf = pd.DataFrame(data)
# Converting the pandas DataFrame to a pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)
# Converting the 'categories' column to category data type
psdf['categories'] = psdf['categories'].astype('category')
# Displaying the resulting dtype and values
print(psdf['categories'].dtype)
print(psdf['categories'].to_list())
Output:
category
['A', 'B', 'C', 'A', 'B', 'C']
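Once a series has the category dtype, its distinct labels and the integer codes behind them can be inspected through the .cat accessor. The following is a minimal sketch: the helper name describe_categories is hypothetical, and plain pandas stands in for the input so the snippet runs without a Spark session. It assumes a pandas-on-Spark series exposes the same .cat.categories and .cat.codes attributes.

```python
import pandas as pd


def describe_categories(series):
    """Return (labels, codes) for `series` after casting it to category.

    `describe_categories` is a hypothetical helper for illustration only.
    """
    categorical = series.astype("category")
    labels = list(categorical.cat.categories)  # distinct category labels
    codes = categorical.cat.codes.tolist()     # integer code of each row
    return labels, codes


# Plain pandas stand-in (assumption: a pyspark.pandas.Series exposes
# the same .cat accessor and can be passed in the same way).
labels, codes = describe_categories(
    pd.Series(["A", "B", "C", "A", "B", "C"], name="categories")
)
print(labels)
print(codes)
```

Each code is the position of the row's label within the list of categories, which is why a categorical column can be stored more compactly than repeated strings.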
Casting Series to Integer Data Type
Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the integer data type.
# pandas-on-Spark API (pandas is already imported as pd in the first example)
import pyspark.pandas as ps
# Creating a pandas DataFrame with numerical data in string format
data = {'numbers': ['10', '20', '30', '40']}
pdf = pd.DataFrame(data)
# Converting the pandas DataFrame to a pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)
# Converting the 'numbers' column to integer data type
psdf['numbers'] = psdf['numbers'].astype(int)
# Displaying the result
print(psdf['numbers'])
Output:
0    10
1    20
2    30
3    40
Name: numbers, dtype: int64