Building My Own Regression Library: A Journey from Scikit-Learn to Self-Sufficiency
Author: Maaz Waheed
Repository: 42Wor/Regression
🚀 The First Step: Why Build My Own Library?
In the world of machine learning, we often reach for powerful libraries like scikit-learn without understanding what happens under the hood. Today, I'm taking a different approach — building my own regression library from scratch. This isn't just about creating another tool; it's about understanding the mathematics, the algorithms, and the engineering decisions that make machine learning work.
📦 My First Creation: MyLinearRegression
Let me introduce you to my custom Linear Regression class — built with NumPy and designed for transparency:
class MyLinearRegression:
    def __init__(self):
        self.coef_ = None       # Feature coefficients
        self.intercept_ = None  # Bias term

    def fit(self, X, y): ...       # Solve for parameters via the normal equation
    def predict(self, X): ...      # Apply learned coefficients to new data
    def score(self, X, y): ...     # R^2 on held-out data
    def save(self, filename): ...  # Serialize the fitted model to disk

    @classmethod
    def load(cls, filename): ...   # Restore a previously saved model
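The method bodies are elided above. For save and load, one plausible approach (a sketch, not necessarily the repository's actual implementation) is pickle-based persistence of just the learned parameters; these bodies would slot into the class:

```python
import pickle

# Sketch of save/load bodies for the class above.
# Assumes pickle; the real library may use a different format.
def save(self, filename):
    with open(filename, "wb") as f:
        pickle.dump({"coef_": self.coef_, "intercept_": self.intercept_}, f)

@classmethod
def load(cls, filename):
    with open(filename, "rb") as f:
        params = pickle.load(f)
    model = cls()
    model.coef_ = params["coef_"]
    model.intercept_ = params["intercept_"]
    return model
```

Storing only the parameter dictionary, rather than pickling the whole object, keeps saved files loadable even if the class definition changes later.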
🔧 The Normal Equation: Mathematics Made Simple
The heart of my implementation is the normal equation:
$$
\theta = (X^T X)^{-1} X^T y
$$

Where:
- $X$ is the feature matrix with an added intercept column
- $y$ is the target vector
- $\theta$ contains both the intercept and the coefficients

This closed-form solution gives us the optimal parameters without iterative optimization. I use np.linalg.pinv() (the Moore-Penrose pseudoinverse) for numerical stability, so the solution is well defined even when $X^T X$ is singular.
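Concretely, the fit and predict steps boil down to a few lines of NumPy. Here is a simplified, self-contained sketch of the idea (illustrative only; the class name below is mine, not the library's exact code):

```python
import numpy as np

class NormalEquationSketch:
    """Simplified sketch of normal-equation fitting, not the library's exact code."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()
        # Prepend a column of ones so theta[0] becomes the intercept
        X_b = np.hstack([np.ones((X.shape[0], 1)), X])
        # pinv is the Moore-Penrose pseudoinverse: it returns the
        # least-squares solution even when X^T X is singular
        theta = np.linalg.pinv(X_b) @ y
        self.intercept_, self.coef_ = theta[0], theta[1:]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_
```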
🛠️ Building a Custom train_test_split
Scikit-learn's train_test_split is convenient, but I wanted to understand it better, so I built my own:
def my_train_test_split(*arrays, test_size=None, train_size=None,
                        random_state=None, shuffle=True):
    # ... (implementation details)
    # Key features:
    # - Handles multiple arrays simultaneously
    # - Supports pandas DataFrames without conversion
    # - Flexible train/test size specification
    # - Reproducible splits with random_state
    ...
What makes my implementation special:
- Pandas-aware: Works directly with DataFrames without converting to NumPy
- Flexible sizing: Accepts both integers and floats for train/test sizes
- Error handling: Comprehensive validation for edge cases
- Reproducibility: Consistent splits with random_state
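To make the mechanics concrete, here is a condensed sketch of the core splitting logic (illustrative only; the name split_sketch is mine, and I omit train_size handling and the fuller validation the library performs):

```python
import numpy as np

def split_sketch(*arrays, test_size=0.25, random_state=None, shuffle=True):
    """Condensed sketch of the core train/test split logic."""
    n = len(arrays[0])
    # A float test_size is a fraction of the data; an int is an absolute count
    n_test = round(n * test_size) if isinstance(test_size, float) else test_size
    indices = np.arange(n)
    if shuffle:
        rng = np.random.default_rng(random_state)  # reproducible with a fixed seed
        rng.shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    out = []
    for a in arrays:
        # .iloc keeps pandas DataFrames/Series intact; plain indexing covers NumPy
        indexer = a.iloc if hasattr(a, "iloc") else a
        out.extend([indexer[train_idx], indexer[test_idx]])
    return out
```

Indexing through .iloc when it exists is what makes the split pandas-aware: DataFrames come back as DataFrames, with their column names preserved.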
🏠 Testing with Real Data: House Price Prediction
I tested my library on the House Price India dataset:
from Regression import MyLinearRegression, my_train_test_split
import pandas as pd
# Load and prepare data
d = pd.read_csv("data/House Price India.csv")
x = d[['Date', 'number of bedrooms', 'number of bathrooms', ...]] # 21 features
y = d[["Price"]]
# Split data using my custom function
x_train, x_test, y_train, y_test = my_train_test_split(
x, y, test_size=0.2, random_state=42
)
# Train model
mlr = MyLinearRegression()
mlr.fit(x_train, y_train)
# Evaluate
print("Linear Regression R^2 score:", mlr.score(x_test, y_test))
# Predict
new_data = [[42491, 3, 2, 1500, 2000, 1, 0, 0, 3, 7, 1500, 0,
2000, 0, 122004, 52.9, -114.5, 1500, 2000, 2, 10]]
predicted_price = mlr.predict(new_data)
print("Predicted Price:", predicted_price[0])
📊 Results and Insights
My implementation achieved an R² score of 0.702 on the test set — explaining approximately 70.2% of the variance in house prices. Not bad for a first attempt!
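For reference, the R² (coefficient of determination) that score reports is the standard one:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A score of 0.702 therefore means the model's squared prediction error is about 29.8% of what you'd get by always predicting the mean price.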
Key takeaways from this project:
- Understanding beats convenience: Building from scratch revealed nuances I'd miss using pre-built libraries
- Mathematics is foundational: The normal equation is elegant and efficient for linear regression
- Data handling matters: Proper array manipulation and dimension handling are crucial
- Serialization is important: Models need to be saved and loaded for real-world use
🚀 What's Next for My Regression Library?
This is just the beginning! The project is evolving into OmniRegress, a comprehensive Python & Rust library for all types of regression analysis. Here's my roadmap:
- Add gradient descent optimization (stochastic, mini-batch, adaptive learning rates; see the sketch after this list)
- Implement regularization (Ridge, Lasso, ElasticNet)
- Create polynomial regression with feature engineering
- Add model diagnostics (residual plots, influence statistics)
- Build ensemble methods (Random Forest, Gradient Boosting for regression)
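As a preview of the first roadmap item, batch gradient descent for linear regression could look roughly like this (a sketch assuming a fixed learning rate and mean-squared-error loss; none of this is in the library yet):

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Sketch of batch gradient descent for linear regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
    theta = np.zeros(X_b.shape[1])
    m = len(y)
    for _ in range(n_iters):
        # Gradient of the MSE loss (1/m) * ||X_b @ theta - y||^2
        gradient = (2.0 / m) * X_b.T @ (X_b @ theta - y)
        theta -= lr * gradient
    return theta[0], theta[1:]  # intercept, coefficients
```

One practical caveat: gradient descent is sensitive to feature scale, so standardizing the inputs first matters far more here than it does for the normal equation.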
💡 Lessons Learned
- Start simple: Linear regression is the perfect starting point
- Test thoroughly: Edge cases matter (empty arrays, single samples, etc.)
- Document as you go: Clear docstrings and comments are invaluable
- Compare with established libraries: Use scikit-learn as a benchmark (see the snippet below)
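For example, continuing the house-price example from earlier (and assuming scikit-learn is installed), the benchmark is a two-liner:

```python
# Reuses x_train, x_test, y_train, y_test, and mlr from the example above
from sklearn.linear_model import LinearRegression

sk_model = LinearRegression().fit(x_train, y_train)
print("scikit-learn R^2:      ", sk_model.score(x_test, y_test))
print("MyLinearRegression R^2:", mlr.score(x_test, y_test))
```

Both solve the same least-squares problem, so the two scores should agree to several decimal places; a gap signals a bug.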
🤝 Join Me on This Journey
This is more than just code — it's a learning expedition. I'm documenting every step, every challenge, and every breakthrough. Whether you're a machine learning beginner or an experienced practitioner looking to deepen your understanding, I invite you to:
- Star the repository to follow along
- Fork and contribute improvements
- Share your insights in the issues section
- Build your own version and compare approaches
Remember: The goal is to replace scikit-learn with OmniRegress.
"We don't just use machine learning — we understand it, we build it, we master it."
🔗 GitHub Repository: 42Wor/OmniRegress
What should I build next in my regression library? Share your suggestions in the comments!