Building My Own Regression Library: A Journey from Scikit-Learn to Self-Sufficiency
Author: Maaz Waheed
Repository: 42Wor/Regression
🚀 The First Step: Why Build My Own Library?
In the world of machine learning, we often reach for powerful libraries like scikit-learn without understanding what happens under the hood. Today, I'm taking a different approach — building my own regression library from scratch. This isn't just about creating another tool; it's about understanding the mathematics, the algorithms, and the engineering decisions that make machine learning work.
📦 My First Creation: MyLinearRegression
Let me introduce you to my custom Linear Regression class — built with NumPy and designed for transparency:
class MyLinearRegression:
    def __init__(self):
        self.coef_ = None       # Feature coefficients
        self.intercept_ = None  # Bias term

    def fit(self, X, y): ...       # Solve for parameters via the normal equation
    def predict(self, X): ...      # Apply learned coefficients to new data
    def score(self, X, y): ...     # R^2 on held-out data
    def save(self, filename): ...  # Serialize the fitted model to disk

    @classmethod
    def load(cls, filename): ...   # Restore a previously saved model
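The method bodies are elided above. For save and load, one plausible approach (a sketch, not necessarily the repository's actual implementation) is pickle-based persistence of just the learned parameters; these bodies would slot into the class:

```python
import pickle

# Sketch of save/load bodies for the class above.
# Assumes pickle; the real library may use a different format.
def save(self, filename):
    with open(filename, "wb") as f:
        pickle.dump({"coef_": self.coef_, "intercept_": self.intercept_}, f)

@classmethod
def load(cls, filename):
    with open(filename, "rb") as f:
        params = pickle.load(f)
    model = cls()
    model.coef_ = params["coef_"]
    model.intercept_ = params["intercept_"]
    return model
```

Storing only the parameter dictionary, rather than pickling the whole object, keeps saved files loadable even if the class definition changes later.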
🔧 The Normal Equation: Mathematics Made Simple
The heart of my implementation is the normal equation:
$$
\theta = (X^T X)^{-1} X^T y
$$

Where:
- $X$ is the feature matrix with an added intercept column
- $y$ is the target vector
- $\theta$ contains both the intercept and the coefficients

This closed-form solution gives us the optimal parameters without iterative optimization. I use np.linalg.pinv() (the Moore-Penrose pseudoinverse) for numerical stability, so the solution is well defined even when $X^T X$ is singular.
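Concretely, the fit and predict steps boil down to a few lines of NumPy. Here is a simplified, self-contained sketch of the idea (illustrative only; the class name below is mine, not the library's exact code):

```python
import numpy as np

class NormalEquationSketch:
    """Simplified sketch of normal-equation fitting, not the library's exact code."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()
        # Prepend a column of ones so theta[0] becomes the intercept
        X_b = np.hstack([np.ones((X.shape[0], 1)), X])
        # pinv is the Moore-Penrose pseudoinverse: it returns the
        # least-squares solution even when X^T X is singular
        theta = np.linalg.pinv(X_b) @ y
        self.intercept_, self.coef_ = theta[0], theta[1:]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_
```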
🛠️ Building a Custom train_test_split
Scikit-learn's train_test_split is convenient, but I wanted to understand it better, so I built my own:
def my_train_test_split(*arrays, test_size=None, train_size=None,
                        random_state=None, shuffle=True):
    # ... (implementation details)
    # Key features:
    # - Handles multiple arrays simultaneously
    # - Supports pandas DataFrames without conversion
    # - Flexible train/test size specification
    # - Reproducible splits with random_state
    ...
What makes my implementation special:
- Pandas-aware: Works directly with DataFrames without converting to NumPy
- Flexible sizing: Accepts both integers and floats for train/test sizes
- Error handling: Comprehensive validation for edge cases
- Reproducibility: Consistent splits with random_state
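To make the mechanics concrete, here is a condensed sketch of the core splitting logic (illustrative only; the name split_sketch is mine, and I omit train_size handling and the fuller validation the library performs):

```python
import numpy as np

def split_sketch(*arrays, test_size=0.25, random_state=None, shuffle=True):
    """Condensed sketch of the core train/test split logic."""
    n = len(arrays[0])
    # A float test_size is a fraction of the data; an int is an absolute count
    n_test = round(n * test_size) if isinstance(test_size, float) else test_size
    indices = np.arange(n)
    if shuffle:
        rng = np.random.default_rng(random_state)  # reproducible with a fixed seed
        rng.shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    out = []
    for a in arrays:
        # .iloc keeps pandas DataFrames/Series intact; plain indexing covers NumPy
        indexer = a.iloc if hasattr(a, "iloc") else a
        out.extend([indexer[train_idx], indexer[test_idx]])
    return out
```

Indexing through .iloc when it exists is what makes the split pandas-aware: DataFrames come back as DataFrames, with their column names preserved.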
🏠 Testing with Real Data: House Price Prediction
I tested my library on the House Price India dataset:
from Regression import MyLinearRegression, my_train_test_split
import pandas as pd
# Load and prepare data
d = pd.read_csv("data/House Price India.csv")
x = d[['Date', 'number of bedrooms', 'number of bathrooms', ...]] # 21 features
y = d[["Price"]]
# Split data using my custom function
x_train, x_test, y_train, y_test = my_train_test_split(
x, y, test_size=0.2, random_state=42
)
# Train model
mlr = MyLinearRegression()
mlr.fit(x_train, y_train)
# Evaluate
print("Linear Regression R^2 score:", mlr.score(x_test, y_test))
# Predict
new_data = [[42491, 3, 2, 1500, 2000, 1, 0, 0, 3, 7, 1500, 0,
2000, 0, 122004, 52.9, -114.5, 1500, 2000, 2, 10]]
predicted_price = mlr.predict(new_data)
print("Predicted Price:", predicted_price[0])
📊 Results and Insights
My implementation achieved an R² score of 0.702 on the test set — explaining approximately 70.2% of the variance in house prices. Not bad for a first attempt!
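For reference, the R² (coefficient of determination) that score reports is the standard one:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A score of 0.702 therefore means the model's squared prediction error is about 29.8% of what you'd get by always predicting the mean price.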
Key takeaways from this project:
- Understanding beats convenience: Building from scratch revealed nuances I'd miss using pre-built libraries
- Mathematics is foundational: The normal equation is elegant and efficient for linear regression
- Data handling matters: Proper array manipulation and dimension handling are crucial
- Serialization is important: Models need to be saved and loaded for real-world use
🚀 What's Next for My Regression Library?
This is just the beginning! The project is evolving into OmniRegress, a comprehensive Python & Rust library for all types of regression analysis. Here's my roadmap:
- Add gradient descent optimization (stochastic, mini-batch, adaptive learning rates; see the sketch after this list)
- Implement regularization (Ridge, Lasso, ElasticNet)
- Create polynomial regression with feature engineering
- Add model diagnostics (residual plots, influence statistics)
- Build ensemble methods (Random Forest, Gradient Boosting for regression)
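As a preview of the first roadmap item, batch gradient descent for linear regression could look roughly like this (a sketch assuming a fixed learning rate and mean-squared-error loss; none of this is in the library yet):

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Sketch of batch gradient descent for linear regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
    theta = np.zeros(X_b.shape[1])
    m = len(y)
    for _ in range(n_iters):
        # Gradient of the MSE loss (1/m) * ||X_b @ theta - y||^2
        gradient = (2.0 / m) * X_b.T @ (X_b @ theta - y)
        theta -= lr * gradient
    return theta[0], theta[1:]  # intercept, coefficients
```

One practical caveat: gradient descent is sensitive to feature scale, so standardizing the inputs first matters far more here than it does for the normal equation.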
💡 Lessons Learned
- Start simple: Linear regression is the perfect starting point
- Test thoroughly: Edge cases matter (empty arrays, single samples, etc.)
- Document as you go: Clear docstrings and comments are invaluable
- Compare with established libraries: Use scikit-learn as a benchmark (see the snippet below)
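For example, continuing the house-price example from earlier (and assuming scikit-learn is installed), the benchmark is a two-liner:

```python
# Reuses x_train, x_test, y_train, y_test, and mlr from the example above
from sklearn.linear_model import LinearRegression

sk_model = LinearRegression().fit(x_train, y_train)
print("scikit-learn R^2:      ", sk_model.score(x_test, y_test))
print("MyLinearRegression R^2:", mlr.score(x_test, y_test))
```

Both solve the same least-squares problem, so the two scores should agree to several decimal places; a gap signals a bug.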
🤝 Join Me on This Journey
This is more than just code — it's a learning expedition. I'm documenting every step, every challenge, and every breakthrough. Whether you're a machine learning beginner or an experienced practitioner looking to deepen your understanding, I invite you to:
- Star the repository to follow along
- Fork and contribute improvements
- Share your insights in the issues section
- Build your own version and compare approaches
Remember: The goal is to replace scikit-learn with OmniRegress.
"We don't just use machine learning — we understand it, we build it, we master it."
🔗 GitHub Repository: 42Wor/OmniRegress
What should I build next in my regression library? Share your suggestions in the comments!