Categorical data consists of values drawn from a limited set of distinct groups, or categories. Unlike continuous data, which can take any value within a range, a categorical variable takes one of a finite number of values, such as gender, color, or rating.
Importance of Categorical Data in Pandas
Storing such data with the category dtype in Pandas can make processing more efficient. Pandas keeps each distinct value once and represents every row with a small integer code, which reduces memory usage and can speed up operations like grouping and sorting. The benefit is greatest for large datasets with many repeating values.
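To make the memory claim concrete, here is a minimal sketch that compares the same values stored as plain Python strings and as a categorical series. The data, the row count, and the variable names are made up for illustration, and the exact byte counts will vary with the pandas version and platform.

```python
import pandas as pd

# Hypothetical data: 100,000 rows drawn from only four distinct labels
values = ["Engineering", "Medicine", "Arts", "Law"] * 25_000

as_object = pd.Series(values, dtype="object")   # one Python string per row
as_category = as_object.astype("category")      # integer codes plus a small lookup table

# deep=True counts the memory of the string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object dtype:  ", object_bytes, "bytes")
print("category dtype:", category_bytes, "bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, so it is worth measuring on your own data before converting.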
Creating a Categorical Series
Let’s start by creating a Pandas Series with categorical data.
Example:
import pandas as pd
# Sample data
names = ['Sachin', 'Manju', 'Ram', 'Raju', 'David', 'Wilson']
categories = ['Engineering', 'Medicine', 'Arts', 'Engineering', 'Law', 'Medicine']
# Creating a categorical series
category_series = pd.Series(categories, dtype="category", index=names)
In this example, each person is assigned a field, such as ‘Engineering’ or ‘Medicine’, and the distinct fields become the categories of the series.
Exploring the Categorical Series
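Once a categorical series is created, you can explore its properties through the .cat accessor: categories holds the distinct values, and codes holds, for each element, the integer position of its category.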
# Displaying categories
print("Categories:", category_series.cat.categories)
# Displaying codes
print("Codes:", category_series.cat.codes)
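The relationship between codes and categories can be sketched as follows. This rebuilds the series from the example so that it runs on its own; the name decoded is introduced here only for illustration. When categories are inferred from string data like this, pandas sorts them, so the codes are positions in that sorted list.

```python
import pandas as pd

names = ['Sachin', 'Manju', 'Ram', 'Raju', 'David', 'Wilson']
categories = ['Engineering', 'Medicine', 'Arts', 'Engineering', 'Law', 'Medicine']
category_series = pd.Series(categories, dtype="category", index=names)

cats = category_series.cat.categories   # distinct values, sorted
codes = category_series.cat.codes       # one integer per element

print(list(cats))      # ['Arts', 'Engineering', 'Law', 'Medicine']
print(codes.tolist())  # [1, 3, 0, 1, 2, 3]

# Looking up each code in the categories recovers the original values
decoded = cats.take(codes.to_numpy())
print(list(decoded) == categories)  # True
```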
Advantages of Categorical Data
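- Memory Efficiency: Each distinct value is stored once and rows hold small integer codes, so columns with many repeated values use less memory. This is advantageous for large datasets.
- Performance Improvement: Operations like sorting and grouping can run faster because they work on the integer codes rather than on the original values.
- Clearer Analysis: An explicit set of categories makes some types of analysis and visualization more straightforward and meaningful.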
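As one illustration of the last point, a categorical series remembers its full set of categories, so counts can report a category even when no row uses it. The sketch below is an assumption-laden example: the data and the extra 'Science' category are hypothetical and chosen only to show the effect.

```python
import pandas as pd

# Hypothetical data with one declared category ('Science') that never occurs
fields = pd.Series(
    pd.Categorical(
        ['Engineering', 'Medicine', 'Arts', 'Engineering'],
        categories=['Arts', 'Engineering', 'Medicine', 'Science'],
    )
)

# For categorical data, value_counts reports every category, including unused ones
counts = fields.value_counts()
print(counts)
```

With a plain string series, 'Science' would simply be missing from the result, which is easy to overlook in a report or chart.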
Modifying Categories
Pandas allows you to add, remove, or rename categories in a categorical series.
Example of Modifying Categories:
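# Adding a new category
# (these methods return a new Series; the inplace argument was removed in pandas 2.0)
category_series = category_series.cat.add_categories('Science')
# Removing a category (values that belonged to 'Law' become NaN)
category_series = category_series.cat.remove_categories('Law')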
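Renaming is mentioned above but not shown. The following is a minimal sketch that rebuilds the example series so it runs on its own; the new label 'Fine Arts' and the variable names renamed and removed are illustrative choices, not part of the original example. It also shows what happens to existing values when their category is removed.

```python
import pandas as pd

names = ['Sachin', 'Manju', 'Ram', 'Raju', 'David', 'Wilson']
categories = ['Engineering', 'Medicine', 'Arts', 'Engineering', 'Law', 'Medicine']
category_series = pd.Series(categories, dtype="category", index=names)

# Renaming a category: pass a mapping from old label to new label
renamed = category_series.cat.rename_categories({'Arts': 'Fine Arts'})
print(renamed['Ram'])

# Removing a category turns the values that used it into missing values
removed = category_series.cat.remove_categories('Law')
print(removed['David'])

# Both calls return new objects; the original series is unchanged
print(category_series['David'])
```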