Machine learning models often rely on libraries like sklearn, but understanding the math behind them is essential. In this post, let’s build ML algorithms from scratch in Python, without using sklearn.

Simple Linear Regression

Simple linear regression is a method for modeling the relationship between two variables using a straight line. The equation is:

y = m * x + b

Where:

  • m is the slope (how much y changes for a unit change in x)
  • b is the intercept (value of y when x is 0)
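
To make this concrete, here’s the equation as a one-line Python function, with made-up values for m and b:

def line(x, m=2.0, b=1.0):
    # y = m * x + b: with m = 2 and b = 1, x = 3 gives y = 7
    return m * x + b

print(line(3))  # 7.0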

Mathematical Intuition

The goal of linear regression is to find the best-fitting line for the given data points by minimizing the sum of squared errors. Equivalently, we look for the values of m and b that minimize the mean squared error:

mse = (1/n) * sum((y_true - y_predicted) ** 2)

Using calculus, we can derive closed-form formulas for m and b; this is the least squares method:

m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
b = mean_y - m * mean_x

These formulas guarantee that the line minimizes the total squared error between the predicted and actual values.
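
Before wrapping this in a class, here’s a minimal sketch of these formulas applied to a small made-up dataset (the numbers are arbitrary, chosen only for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])  # roughly y = 2x + 1, with a little noise

x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(m, b)  # approximately 1.94 and 1.15 -- close to the true 2 and 1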

Implementing in Python

Here’s the complete code:

import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.m = 0  # slope of the line
        self.b = 0  # y-intercept of the line

    def fit(self, X, Y):
        # calculate the means of X and Y
        X_mean, Y_mean = np.mean(X), np.mean(Y)
        
        # calculate the slope (m) using the least squares method
        numerator = np.sum((X - X_mean) * (Y - Y_mean))
        denominator = np.sum((X - X_mean) ** 2)
        self.m = numerator / denominator
        
        # calculate the intercept (b)
        self.b = Y_mean - self.m * X_mean

    def predict(self, X):
        # apply the linear equation to predict y values
        return self.m * X + self.b

    def mean_squared_error(self, Y_true, Y_pred):
        # calculate the mean squared error for performance evaluation
        return np.mean((Y_true - Y_pred) ** 2)

# example usage
X = np.array([1, 2, 3, 4, 5])  # input features
Y = np.array([2, 4, 5, 4, 5])  # target values

model = SimpleLinearRegression()
model.fit(X, Y)  # train the model
predictions = model.predict(X)  # generate predictions

print(f"slope (m): {model.m:.3f}, intercept (b): {model.b:.3f}")
print(f"mean squared error: {model.mean_squared_error(Y, predictions):.3f}")

Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression that allows us to model the relationship between two or more features (independent variables) and a target variable (dependent variable) using a linear equation.

The equation for multiple linear regression is:

y = m1 * x1 + m2 * x2 + ... + mn * xn + b

Where:

  • m1, m2, ..., mn are the coefficients (slopes) for each feature (x1, x2, ..., xn).
  • b is the intercept, the value of y when all x values are 0.
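
In code, a prediction is just a weighted sum of the features plus the intercept. Here’s a tiny sketch with made-up coefficients for three features:

import numpy as np

m = np.array([0.5, -1.2, 2.0])  # made-up slopes for three features
b = 0.7                         # made-up intercept
x = np.array([1.0, 2.0, 3.0])   # one sample with three feature values

y = np.dot(m, x) + b  # 0.5*1 - 1.2*2 + 2.0*3 + 0.7 = 4.8
print(y)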

Mathematical Intuition

The goal of multiple linear regression is to find the best-fitting hyperplane (a line in 2D, a plane in 3D, and a hyperplane in higher dimensions) that minimizes the error between the predicted and actual values. The error is usually minimized using the Least Squares Method.

The Matrix Formulation

Let’s define the elements in matrix form:

  • X is the matrix of input features, shaped (N, n), where N is the number of data points and n is the number of features (we write N rather than m here so it doesn’t clash with the coefficient vector m below).
  • y is the vector of target values, shaped (N, 1).
  • m is the vector of coefficients (slopes), shaped (n, 1).
  • b is the intercept.

To estimate the coefficients, we can use the following formula, derived from the Least Squares Method (the normal equation):

m = (X^T * X)^(-1) * X^T * y

Where:

  • X^T is the transpose of matrix X (shaped (n, N)).
  • X^T * X is a square matrix (shaped (n, n)).
  • (X^T * X)^(-1) is the inverse of that matrix (shaped (n, n)).
  • X^T * y is the matrix-vector product of X^T and the target y (shaped (n, 1)).

In practice (and in the implementation below), we prepend a column of ones to X before applying the normal equation, so the intercept b is estimated jointly with the slopes. Equivalently, once the slopes are known, the intercept satisfies:

b = mean(y) - sum(m * mean(X))

Where mean(X) is the vector of means for each feature in X and the sum runs over the features.
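
As a quick sanity check of the normal equation, here’s a sketch that applies it to a small randomly generated dataset and compares the result against NumPy’s built-in least-squares solver (np.linalg.lstsq):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))             # 20 samples, 2 random features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0  # true slopes [3, -2], intercept 5

X_b = np.c_[np.ones(len(X)), X]          # prepend the column of ones
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta)        # approximately [5, 3, -2]
print(theta_lstsq)  # matches the normal-equation solution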

Implementing in Python

Let’s implement Multiple Linear Regression from scratch in Python:

import numpy as np

class MultipleLinearRegression:
    def __init__(self):
        self.m = None  # coefficients (slopes)
        self.b = None  # intercept

    def fit(self, X, y):
        # Add a column of ones to X to account for the intercept (b)
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        
        # Apply the normal equation to compute all coefficients at once:
        # m = (X_b^T * X_b)^(-1) * X_b^T * y
        # np.linalg.pinv (the pseudo-inverse) is used instead of np.linalg.inv
        # so that perfectly collinear features don't make the fit crash
        self.m = np.linalg.pinv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

        # Extract the intercept from the coefficients
        self.b = self.m[0]
        self.m = self.m[1:]

    def predict(self, X):
        # Add a column of ones to X for intercept (b)
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        
        # Calculate predictions
        return X_b.dot(np.r_[self.b, self.m])

    def mean_squared_error(self, y_true, y_pred):
        # Calculate mean squared error (MSE) for performance evaluation
        return np.mean((y_true - y_pred) ** 2)

# Example usage
# Independent variables (features)
X = np.array([[1, 2], [2, 4], [3, 5], [4, 4], [5, 5]])
# Dependent variable (target), generated as y = x1 + 2*x2 + 1
y = np.array([6, 11, 14, 13, 16])

# Instantiate the model
model = MultipleLinearRegression()
model.fit(X, y)  # Train the model
predictions = model.predict(X)  # Generate predictions

# Output model coefficients and performance metrics
print(f"Intercept (b): {model.b:.3f}")
print(f"Coefficients (m): {model.m}")
print(f"Mean Squared Error: {model.mean_squared_error(y, predictions):.3f}")

Explanation:

  1. Data Preprocessing:
    • We start by adding a column of ones to the feature matrix X to account for the intercept b. This transforms X into X_b.
  2. Model Fitting:
    • The model is fitted by solving the normal equation on X_b: m = (X_b^T * X_b)^(-1) * X_b^T * y. This yields all coefficients of the linear model, including the intercept.
  3. Prediction:
    • To make predictions, we multiply X_b (the feature matrix with the ones column) by the full coefficient vector [b, m1, ..., mn].
  4. Evaluation:
    • The mean squared error (MSE) is used to evaluate the model’s performance.