Understanding Logistic Regression

Logistic Regression is one of the most fundamental algorithms for binary classification in machine learning. Despite its name, it is a linear classification model rather than a regression model. In this post, we’ll look at how Logistic Regression works, its formula, and a step-by-step example in Python using scikit-learn.

What is Logistic Regression?

Logistic Regression is used when the dependent variable is categorical. For binary classification, this means the dependent variable takes the value 0 or 1. The aim is to find the best-fitting model that describes the relationship between the dependent variable and a set of independent variables.

The Logistic Function

The core idea behind logistic regression is to model the relationship between the features and the probability of a particular outcome. The logistic function, also known as the sigmoid function, maps any real number into the open interval (0, 1):

\[\sigma(t) = \frac{1}{1 + e^{-t}}\]

In logistic regression, the input \(t\) is a linear combination of the input features, and the sigmoid maps that score to a probability. The model then classifies an input by comparing this probability against a threshold, commonly 0.5.
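To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid. The helper name `sigmoid`, the sample scores, and the 0.5 cutoff are assumptions chosen for illustration, not something a particular library prescribes.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative scores: strongly negative, neutral, strongly positive
scores = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(scores)

# Assumed convention: predict class 1 when the probability is at least 0.5
labels = (probs >= 0.5).astype(int)

print(probs)
print(labels)
```

A score of 0 lands exactly on a probability of 0.5, which is why the sign of the linear score decides the predicted class under this cutoff.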

Mathematical Formulation

In logistic regression, we predict the probability that an instance belongs to class 1 using the hypothesis:

\[h(x) = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}\]

where:

  • \(w\) is the weight vector,
  • \(x\) is the input feature vector,
  • \(b\) is the bias term.
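The hypothesis can be evaluated by hand for a single input. The sketch below uses made-up values for `w`, `b`, and `x` (they are hypothetical, not fitted parameters), and the function name `hypothesis` is likewise just an illustrative choice.

```python
import numpy as np

def hypothesis(x, w, b):
    """Return h(x) = sigma(w^T x + b), the estimated probability of class 1."""
    z = np.dot(w, x) + b              # linear score w^T x + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid of the score

# Hypothetical parameters and input, chosen only for illustration
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

p = hypothesis(x, w, b)
print(f"Linear score: {np.dot(w, x) + b:.2f}")
print(f"P(class 1 | x): {p:.3f}")
```

Here the linear score is 0.5 * 2.0 - 1.0 * 1.0 + 0.25 = 0.25, so the probability sits slightly above 0.5.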

Implementing Logistic Regression in Python

To demonstrate logistic regression, we can use Python’s scikit-learn library, a comprehensive and powerful tool for machine learning.

First, let’s prepare the environment by importing necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Loading the Dataset

For this illustration, we will use the popular Iris dataset available in scikit-learn. The dataset contains three classes (setosa, versicolor, and virginica), but we’ll keep only the first two, labeled 0 and 1, to get a binary classification problem.

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Take only two classes for binary classification
X = X[y < 2]
y = y[y < 2]
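A quick sanity check after filtering can catch indexing mistakes early. The sketch below repeats the loading step so it runs on its own, then inspects the shape and class counts; the variable name `mask` is an addition for readability.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Boolean mask selecting only classes 0 and 1
mask = iris.target < 2
X = iris.data[mask]
y = iris.target[mask]

print(X.shape)                 # (samples, features)
print(np.bincount(y))          # number of samples per class
print(iris.target_names[:2])   # names of the two classes kept
```

Building the mask once and applying it to both arrays avoids the ordering pitfall of filtering `y` before `X`.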

Splitting the Dataset

We will split the data into training and testing sets to evaluate the performance of our model.

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
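It can help to confirm what the split produced. The following self-contained sketch prints the sizes of the two sets and, as an optional variation that the walkthrough itself does not use, shows the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

# Same split as in the walkthrough: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # class counts in the test set

# Optional variation: stratify on y to preserve the 50/50 class balance
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_test_s))
```

With a small test set, an unstratified split can end up noticeably unbalanced, which makes accuracy harder to interpret.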

Training the Logistic Regression Model

Once we have our data prepared, training a Logistic Regression model is straightforward.

# Initialize the Logistic Regression model
log_reg = LogisticRegression(solver='liblinear')

# Train the model
log_reg.fit(X_train, y_train)
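The fitted model stores the weight vector and bias from the hypothesis in its `coef_` and `intercept_` attributes. As a rough check that the formula and the library agree, the sketch below recomputes the class 1 probabilities by hand and compares them with `predict_proba`. It repeats the earlier setup so it can run on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(X_train, y_train)

w = log_reg.coef_[0]        # learned weight vector, one entry per feature
b = log_reg.intercept_[0]   # learned bias term

# Apply h(x) = sigma(w^T x + b) to every test row
manual_proba = 1.0 / (1.0 + np.exp(-(X_test @ w + b)))
sklearn_proba = log_reg.predict_proba(X_test)[:, 1]

print("Weights:", w)
print("Bias:", b)
print("Manual and library probabilities match:",
      np.allclose(manual_proba, sklearn_proba))
```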

Making Predictions and Evaluating the Model

After training, we can predict and evaluate our model using accuracy as a metric.

# Make predictions on the test data
y_pred = log_reg.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
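Accuracy is a single number, so it hides which kind of mistakes the model makes. One common complement is a confusion matrix, sketched below with scikit-learn's `confusion_matrix`. The block repeats the earlier steps so it is runnable in isolation, and passing `labels=[0, 1]` is a choice made here to fix the row and column order.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_test, y_pred)

print(cm)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
```

The diagonal entries count correct predictions, so their sum divided by the total equals the accuracy.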

Visualizing Decision Boundary

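Finally, let’s visualize the decision boundary in two dimensions. The model above was trained on all four features, so it cannot predict on a two-column grid. For the plot, we fit a second model on the first two features only:

def plot_decision_boundary(X, y, model):
    # Grid covering the range of both features, with a margin of 1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    # Predict a class for every grid point; the model must expect two features
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Logistic Regression Decision Boundary')
    plt.show()

# Fit a separate model on the first two features so it matches the 2D grid
X_train_2d = X_train[:, :2]
log_reg_2d = LogisticRegression(solver='liblinear')
log_reg_2d.fit(X_train_2d, y_train)

# Visualize
plot_decision_boundary(X_train_2d, y_train, log_reg_2d)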
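For a two-feature model, the boundary is the straight line where the linear score is zero, i.e. \(w_1 x_1 + w_2 x_2 + b = 0\). The sketch below derives that line from a fitted model's coefficients and checks that points on it get a score of zero. For brevity it fits on all 100 samples of the two classes rather than the training split, and the names `model_2d` and `boundary_points` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2
X2 = iris.data[mask][:, :2]   # first two features only
y = iris.target[mask]

model_2d = LogisticRegression(solver='liblinear').fit(X2, y)
w1, w2 = model_2d.coef_[0]
b = model_2d.intercept_[0]

# Solve w1*x1 + w2*x2 + b = 0 for x2 (assumes w2 is nonzero)
x1 = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2
boundary_points = np.column_stack([x1, x2])

# The decision function is the linear score, so it should vanish on the line
scores = model_2d.decision_function(boundary_points)
print(np.round(scores, 6))
```

Points on one side of this line get a probability above 0.5 and are assigned class 1; points on the other side are assigned class 0.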

Conclusion

Logistic Regression remains an essential tool for classification problems due to its simplicity and ease of interpretation. While it may not perform as well as more complex models in certain circumstances, understanding its workings is crucial as it serves as the foundation for many advanced algorithms in machine learning. Whether you’re a budding data scientist or a seasoned engineer looking to refresh your knowledge, gaining a deep understanding of logistic regression is invaluable.

In future posts, we’ll explore how to extend this approach to multiclass classification and discuss regularization techniques that reduce overfitting when the model has high variance.