Building My Own Regression Library: A Journey from Scikit-Learn to Self-Sufficiency

By maaz.waheed · December 7, 2025 · 4 min read
in Regression
Tags: #math #linear-regression #ml-algorithms #ml-code #maaz-waheed #python

Building My Own Regression Library: A Journey into Machine Learning Fundamentals

Author: Maaz Waheed
Repository: 42Wor/Regression

🚀 The First Step: Why Build My Own Library?

In the world of machine learning, we often reach for powerful libraries like scikit-learn without understanding what happens under the hood. Today, I'm taking a different approach — building my own regression library from scratch. This isn't just about creating another tool; it's about understanding the mathematics, the algorithms, and the engineering decisions that make machine learning work.

📦 My First Creation: MyLinearRegression

Let me introduce you to my custom Linear Regression class — built with NumPy and designed for transparency. The method bodies below are a condensed sketch of the full implementation in the repository (the npz-based save/load in particular is illustrative):


import numpy as np

class MyLinearRegression:
    def __init__(self):
        self.coef_ = None       # Feature coefficients
        self.intercept_ = None  # Bias term

    def fit(self, X, y):
        # Normal equation via pinv, stable even when X^T X is singular
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1)
        Xb = np.c_[np.ones(len(X)), X]  # prepend intercept column
        theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
        self.intercept_, self.coef_ = theta[0], theta[1:]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_

    def score(self, X, y):
        # Coefficient of determination: R^2 = 1 - SS_res / SS_tot
        y = np.asarray(y, dtype=float).reshape(-1)
        res = y - self.predict(X)
        return 1 - (res @ res) / np.sum((y - y.mean()) ** 2)

    def save(self, filename):
        np.savez(filename, coef=self.coef_, intercept=self.intercept_)

    @classmethod
    def load(cls, filename):
        data = np.load(filename)
        model = cls()
        model.coef_, model.intercept_ = data["coef"], float(data["intercept"])
        return model
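
A quick usage sketch on toy data (hypothetical values, just to exercise the API):

import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1

model = MyLinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # ~1.0 and [~2.0]
print(model.score(X, y))              # 1.0 on perfectly linear data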

🔧 The Normal Equation: Mathematics Made Simple

The heart of my implementation is the normal equation:

\[
\theta = (X^T X)^{-1} X^T y
\]

Where:

  • \(X\) is the feature matrix with an added intercept column
  • \(y\) is the target vector
  • \(\theta\) contains both the intercept and the coefficients

This closed-form solution gives us the optimal parameters without iterative optimization. I use np.linalg.pinv() (pseudoinverse) for numerical stability, even when \(X^T X\) is singular.
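
As a quick sanity check, here is a minimal sketch (synthetic data, not from the repository) showing that the pseudoinverse route recovers known parameters in the noiseless case:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]  # true intercept 4, coefficients [2, -3]

Xb = np.c_[np.ones(len(X)), X]  # add the intercept column
theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
print(theta)  # approximately [ 4.  2. -3.]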

🛠️ Building a Custom train_test_split

Scikit-learn's train_test_split is convenient, but I wanted to understand it better, so I built my own:

def my_train_test_split(*arrays, test_size=None, train_size=None,
                        random_state=None, shuffle=True):
    # ... (implementation details)
    
    # Key features:
    # - Handles multiple arrays simultaneously
    # - Supports pandas DataFrames without conversion
    # - Flexible train/test size specification
    # - Reproducible splits with random_state

What makes my implementation special:

  1. Pandas-aware: Works directly with DataFrames without converting to NumPy
  2. Flexible sizing: Accepts both integers and floats for train/test sizes
  3. Error handling: Comprehensive validation for edge cases
  4. Reproducibility: Consistent splits with random_state
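
Here is a minimal sketch of how such a function can satisfy all four points (an illustrative version, not the repository's exact code):

import numpy as np
import pandas as pd

def my_train_test_split(*arrays, test_size=None, train_size=None,
                        random_state=None, shuffle=True):
    n = len(arrays[0])
    if test_size is None and train_size is None:
        test_size = 0.25  # sensible default split
    # Flexible sizing: accept a fraction (float) or an absolute count (int)
    if test_size is not None:
        n_test = int(np.ceil(test_size * n)) if isinstance(test_size, float) else int(test_size)
    else:
        n_train = int(np.floor(train_size * n)) if isinstance(train_size, float) else int(train_size)
        n_test = n - n_train
    if not 0 < n_test < n:
        raise ValueError("test size must leave at least one sample on each side")
    indices = np.arange(n)
    if shuffle:
        # Reproducibility: a seeded generator gives consistent shuffles
        np.random.default_rng(random_state).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Pandas-aware: iloc preserves DataFrame/Series types; other inputs go through NumPy
    out = []
    for a in arrays:
        if isinstance(a, (pd.DataFrame, pd.Series)):
            out += [a.iloc[train_idx], a.iloc[test_idx]]
        else:
            out += [np.asarray(a)[train_idx], np.asarray(a)[test_idx]]
    return out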

🏠 Testing with Real Data: House Price Prediction

I tested my library on the House Price India dataset:

from Regression import MyLinearRegression, my_train_test_split
import pandas as pd

# Load and prepare data
d = pd.read_csv("data/House Price India.csv")
x = d[['Date', 'number of bedrooms', 'number of bathrooms', ...]]  # 21 features
y = d[["Price"]]

# Split data using my custom function
x_train, x_test, y_train, y_test = my_train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Train model
mlr = MyLinearRegression()
mlr.fit(x_train, y_train)

# Evaluate
print("Linear Regression R^2 score:", mlr.score(x_test, y_test))

# Predict
new_data = [[42491, 3, 2, 1500, 2000, 1, 0, 0, 3, 7, 1500, 0, 
             2000, 0, 122004, 52.9, -114.5, 1500, 2000, 2, 10]]
predicted_price = mlr.predict(new_data)
print("Predicted Price:", predicted_price[0])

📊 Results and Insights

My implementation achieved an R² score of 0.702 on the test set — explaining approximately 70.2% of the variance in house prices. Not bad for a first attempt!

Key takeaways from this project:

  1. Understanding beats convenience: Building from scratch revealed nuances I'd miss using pre-built libraries
  2. Mathematics is foundational: The normal equation is elegant and efficient for linear regression
  3. Data handling matters: Proper array manipulation and dimension handling is crucial
  4. Serialization is important: Models need to be saved and loaded for real-world use (a short sketch follows this list)
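
For that last point, a quick persistence sketch, assuming the npz-based save/load from the class sketch above (the repository's format may differ):

# Persist the trained model and reload it later
mlr.save("house_price_model.npz")

restored = MyLinearRegression.load("house_price_model.npz")
print(restored.predict(x_test[:5]))  # should match the original model's predictions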

🚀 What's Next for My Regression Library?

This is just the beginning! The project is growing into OmniRegress, a comprehensive Python & Rust library for all types of regression analysis. Here's my roadmap:

  1. Add gradient descent optimization (stochastic, mini-batch, adaptive learning rates)
  2. Implement regularization (Ridge, Lasso, ElasticNet; see the Ridge sketch after this list)
  3. Create polynomial regression with feature engineering
  4. Add model diagnostics (residual plots, influence statistics)
  5. Build ensemble methods (Random Forest, Gradient Boosting for regression)
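
Ridge is a natural first step because it keeps a closed form: add \(\alpha I\) to \(X^T X\) before inverting. A minimal sketch (ridge_fit is a hypothetical helper, not yet part of the library):

import numpy as np

def ridge_fit(X, y, alpha=1.0):
    # Closed-form Ridge: theta = (X^T X + alpha * I)^(-1) X^T y
    Xb = np.c_[np.ones(len(X)), np.asarray(X, dtype=float)]
    I = np.eye(Xb.shape[1])
    I[0, 0] = 0.0  # leave the intercept unpenalized
    y = np.asarray(y, dtype=float).reshape(-1)
    return np.linalg.solve(Xb.T @ Xb + alpha * I, Xb.T @ y)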

💡 Lessons Learned

  • Start simple: Linear regression is the perfect starting point
  • Test thoroughly: Edge cases matter (empty arrays, single samples, etc.)
  • Document as you go: Clear docstrings and comments are invaluable
  • Compare with established libraries: Use scikit-learn as a benchmark (see the sketch below)
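
For that benchmarking point, a minimal sketch reusing the variables from the house-price example above (scikit-learn assumed installed):

from sklearn.linear_model import LinearRegression

sk = LinearRegression().fit(x_train, y_train)
print("scikit-learn R^2:", sk.score(x_test, y_test))
print("MyLinearRegression R^2:", mlr.score(x_test, y_test))
# Both minimize the same least-squares objective, so the scores should agree closely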

🤝 Join Me on This Journey

This is more than just code — it's a learning expedition. I'm documenting every step, every challenge, and every breakthrough. Whether you're a machine learning beginner or an experienced practitioner looking to deepen your understanding, I invite you to:

  1. Star the repository to follow along
  2. Fork and contribute improvements
  3. Share your insights in the issues section
  4. Build your own version and compare approaches

Remember: the goal is to replace scikit-learn with OmniRegress.


"We don't just use machine learning — we understand it, we build it, we master it."

🔗 GitHub Repository: 42Wor/OmniRegress


What should I build next in my regression library? Share your suggestions in the comments!
