Linear regression is a foundational tool in data science and machine learning, offering a simple yet powerful way to predict outcomes and understand relationships between variables. Python, a leading programming language in these fields, provides the scikit-learn library, an efficient tool for implementing linear regression models. This article will guide you through the steps of using scikit-learn to create a linear regression model.
Understanding Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables as a linear function of those variables. It’s commonly used for forecasting, time series modeling, and estimating how strongly variables are related. A fitted relationship on its own shows association and does not establish causation.
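To make the "linear approach" concrete, the sketch below fits a line by ordinary least squares using NumPy alone. The data is synthetic and the generating rule (intercept 2.0, slopes 3.0 and -1.0) is an assumption chosen purely for illustration. scikit-learn's LinearRegression solves the same kind of least-squares problem.

```python
import numpy as np

# Synthetic data from a known linear rule (assumed for illustration):
# y = 2.0 + 3.0 * x1 - 1.0 * x2, with no noise added
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated as a coefficient
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: find coefficients minimising the squared residuals
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slopes = coeffs[0], coeffs[1:]

print("Intercept:", intercept)
print("Slopes:", slopes)
```

Because the data is noise-free, the recovered coefficients should match the generating rule up to floating-point error.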
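Setting Up the Environment
To get started, ensure you have Python installed, along with scikit-learn and pandas, which the examples below use to load the data. NumPy is installed automatically as a scikit-learn dependency. If you haven’t installed them yet, you can do so using pip:
pip install scikit-learn pandas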
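A quick way to confirm the installation worked is to import the packages and print their versions. This is only a sanity check, and the versions printed will depend on your environment.

```python
import numpy as np
import pandas as pd
import sklearn

# Each package exposes its version string as __version__
print("scikit-learn:", sklearn.__version__)
print("pandas:", pd.__version__)
print("NumPy:", np.__version__)
```

If any import fails with ModuleNotFoundError, the corresponding package is not installed in the active Python environment.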
Importing Necessary Libraries:
Begin by importing the required libraries:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Preparing Your Data:
Load your dataset and separate the independent variables (features) from the dependent variable (target). The example below uses a small CSV file loaded into a pandas DataFrame.
Sample Dataset (your_dataset.csv)
Here’s an example of what the dataset (your_dataset.csv) might look like:
feature1,feature2,target
1.2,3.4,10.5
2.3,4.5,12.7
3.4,1.2,14.1
4.5,2.3,18.3
5.6,3.4,20.5
df = pd.read_csv('your_dataset.csv')
X = df[['feature1', 'feature2']] # Independent variables
y = df['target'] # Dependent variable
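If you would like to try the loading step without creating a file first, one option is to read the same sample rows from an in-memory string. The sketch below assumes the five sample rows shown above and uses io.StringIO as a stand-in for the CSV file. It also checks the shapes and looks for missing values, which LinearRegression does not accept.

```python
from io import StringIO

import pandas as pd

# The sample rows from the article, held in memory instead of a file
csv_text = """feature1,feature2,target
1.2,3.4,10.5
2.3,4.5,12.7
3.4,1.2,14.1
4.5,2.3,18.3
5.6,3.4,20.5
"""

df = pd.read_csv(StringIO(csv_text))
X = df[["feature1", "feature2"]]  # Independent variables (2-D DataFrame)
y = df["target"]                  # Dependent variable (1-D Series)

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Missing values:", int(df.isna().sum().sum()))
```

Note that X is selected with double brackets so it stays two-dimensional, which is the shape scikit-learn estimators expect for features.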
Splitting the Dataset:
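Split your data into training and testing sets so the model can be evaluated on data it did not see during training. With test_size=0.2, 20% of the rows are held out for testing, and a fixed random_state makes the split reproducible: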
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
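It can help to check how many rows end up in each set. The sketch below rebuilds the five-row sample as a DataFrame (an assumption made so the example runs on its own) and shows that the same random_state yields the same split on repeated calls.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The five sample rows from the article, built directly as a DataFrame
df = pd.DataFrame({
    "feature1": [1.2, 2.3, 3.4, 4.5, 5.6],
    "feature2": [3.4, 4.5, 1.2, 2.3, 3.4],
    "target": [10.5, 12.7, 14.1, 18.3, 20.5],
})
X = df[["feature1", "feature2"]]
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training rows:", len(X_train))
print("Test rows:", len(X_test))

# Repeating the call with the same random_state reproduces the split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print("Same test rows:", list(X_test.index) == list(X_test_again.index))
```

With only five rows, a 20% test size leaves a single test row, so a larger dataset is advisable for a meaningful evaluation.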
Creating and Training the Model:
Initialize the Linear Regression model and fit it to your training data:
model = LinearRegression()
model.fit(X_train, y_train)
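After fitting, the learned parameters are available on the model as coef_ (one slope per feature) and intercept_. The sketch below uses synthetic, noise-free training data generated from an assumed rule (intercept 1.5, slopes 2.0 and 0.5) so the fitted values can be compared against known ones. The column names mirror those used in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data from an assumed rule:
# target = 1.5 + 2.0 * feature1 + 0.5 * feature2
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "feature1": rng.uniform(0, 10, 50),
    "feature2": rng.uniform(0, 10, 50),
})
y_train = 1.5 + 2.0 * X_train["feature1"] + 0.5 * X_train["feature2"]

model = LinearRegression()
model.fit(X_train, y_train)

# One coefficient per feature, in column order, plus a single intercept
for name, coef in zip(X_train.columns, model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model.intercept_:.3f}")
```

Each coefficient can be read as the expected change in the target for a one-unit increase in that feature, holding the other feature fixed.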
Making Predictions and Evaluating the Model:
Use the trained model to make predictions on the test set and evaluate its performance:
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
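The same predict method also works for observations that were never part of the dataset. The sketch below fits a model on the five sample rows and predicts the target for one hypothetical new observation. The values 3.0 and 2.0 are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit on the five sample rows from the article
train = pd.DataFrame({
    "feature1": [1.2, 2.3, 3.4, 4.5, 5.6],
    "feature2": [3.4, 4.5, 1.2, 2.3, 3.4],
    "target": [10.5, 12.7, 14.1, 18.3, 20.5],
})
model = LinearRegression()
model.fit(train[["feature1", "feature2"]], train["target"])

# A hypothetical new observation; column names must match the training data
new_obs = pd.DataFrame({"feature1": [3.0], "feature2": [2.0]})
prediction = model.predict(new_obs)

print("Predicted target:", prediction[0])
```

Passing a DataFrame with the same column names as the training features keeps the input consistent with what the model saw during fit.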
Python Script for Linear Regression
The Python script to apply linear regression on this dataset using scikit-learn is as follows:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset
df = pd.read_csv('your_dataset.csv')
# Prepare the data
X = df[['feature1', 'feature2']]
y = df['target']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
# Output the Mean Squared Error
print("Mean Squared Error:", mse)
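Running this script on the sample dataset gives a Mean Squared Error of approximately 1.0. This value quantifies the average squared difference between the predicted values and the actual values in the test set. Because the five-row sample leaves only one row for testing, the figure here reflects a single prediction.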
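To see what the metric computes, the sketch below calculates the Mean Squared Error by hand and compares it with scikit-learn's mean_squared_error. The two small arrays are invented for illustration and are not related to the sample dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2; squared they are 1, 0 and 4; their mean is 5/3
manual_mse = np.mean((y_true - y_hat) ** 2)
library_mse = mean_squared_error(y_true, y_hat)

# The square root (RMSE) is expressed in the same units as the target
rmse = np.sqrt(manual_mse)

print("Manual MSE:", manual_mse)
print("Library MSE:", library_mse)
print("RMSE:", rmse)
```

Since the errors are squared, large misses weigh more heavily than small ones, and the RMSE is often easier to interpret because it shares the target's units.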