Welcome to Scikit-Learn Mastery
Master machine learning with Scikit-Learn - one of the most widely used and beginner-friendly ML libraries in Python. This comprehensive platform will help you become a Scikit-Learn expert through structured learning, real-world projects, and hands-on code examples.
What You'll Learn
🔧 Data Preprocessing
Scaling, encoding, feature selection, and data transformation techniques
🎯 Classification
Logistic Regression, SVM, Random Forest, and more
📈 Regression
Linear, Ridge, Lasso, and ensemble regression models
🔀 Clustering
K-Means, Hierarchical, DBSCAN, and clustering evaluation
📊 Dimensionality Reduction
PCA, t-SNE, and feature selection methods
⚙️ Model Optimization
Cross-validation, Grid Search, and hyperparameter tuning
Why Scikit-Learn?
- Simple API: Consistent interface across all algorithms (see the short sketch after this list)
- Comprehensive: Algorithms and tools for classification, regression, clustering, dimensionality reduction, preprocessing, and model selection
- Production-Ready: Used by companies worldwide
- Well-Documented: Extensive documentation and examples
- Fast: Performance-critical routines implemented in Cython/C
- Integrates Well: Works seamlessly with NumPy, Pandas, Matplotlib
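The "Simple API" point deserves a quick illustration: every estimator exposes the same fit/predict (or fit/transform) methods, so swapping algorithms is usually a one-line change. A minimal sketch on synthetic data (the dataset and the two models here are chosen purely for illustration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Toy data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# Every estimator follows the same fit/predict pattern
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=50)):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]))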
The Complete ML Pipeline
Understanding the machine learning pipeline is crucial. This is the typical order and workflow to follow for most ML projects:
Pipeline Workflow
1. Load Data
2. Explore Data
3. Clean Data
4. Feature Engineering
5. Split Data
6. Scale/Normalize
7. Train Model
8. Evaluate
9. Tune Hyperparameters
10. Deploy
Step-by-Step Explanation
1. Load Data
Start by loading your dataset using Pandas or Scikit-Learn's built-in datasets.
from sklearn.datasets import load_iris
import pandas as pd
# Load built-in dataset
iris = load_iris()
X = iris.data
y = iris.target
# Or load from CSV
df = pd.read_csv('data.csv')
2. Explore Data (EDA)
Understand your data before building models. Check shape, missing values, distributions, and correlations.
import pandas as pd
# Check data shape and info
print(df.shape)
print(df.info())
print(df.describe())
print(df.isnull().sum()) # Missing values
3. Clean Data
Handle missing values, remove duplicates, and fix inconsistencies.
# Handle missing values (pick one strategy)
df.fillna(df.mean(numeric_only=True), inplace=True)  # Fill numeric columns with the column mean
# df.dropna(inplace=True)                            # ...or drop rows with missing values instead
# Remove duplicates
df.drop_duplicates(inplace=True)
# Remove outliers (example)
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['column'] >= Q1 - 1.5*IQR) & (df['column'] <= Q3 + 1.5*IQR)]
4. Feature Engineering
Create new features, encode categorical variables, and select relevant features.
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
# Encode categorical variables (LabelEncoder shown for brevity; OneHotEncoder is usually preferred for features)
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])
# Create new features
df['feature_new'] = df['col1'] * df['col2']
# Select best features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
5. Split Data
Divide data into training and testing sets so you can measure how well the model generalizes to data it has never seen.
from sklearn.model_selection import train_test_split
# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape}")
print(f"Testing set: {X_test.shape}")
6. Scale/Normalize Features
Standardize features to have similar scales. This is crucial for many algorithms.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# MinMaxScaler: range [0, 1]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
7. Train Model
Choose an algorithm and train it on your training data.
from sklearn.ensemble import RandomForestClassifier
# Create and train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
8. Evaluate Model
Measure model performance using appropriate metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
9. Tune Hyperparameters
Use Grid Search or Random Search to find optimal hyperparameters.
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy'
)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
10. Deploy Model
Save your model and deploy it for predictions on new data.
import joblib
# Save the model and the fitted scaler (you need both at prediction time)
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Load them later
loaded_model = joblib.load('model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
# Make predictions on new data, scaled with the same fitted scaler
new_data_scaled = loaded_scaler.transform(new_data)  # new_data: your incoming samples
new_predictions = loaded_model.predict(new_data_scaled)
💡 Pipeline Best Practices:
- Always fit scalers on training data only, then transform test data
- Use cross-validation to get reliable performance estimates
- Never leak test data into training (fit transformers on train data)
- Document your pipeline for reproducibility
- Use Scikit-Learn's Pipeline class to automate this workflow
Using Scikit-Learn Pipeline Class
Automate the entire pipeline with a single object:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")
Data Preprocessing & Feature Engineering
Data preprocessing is the foundation of successful machine learning. Garbage in, garbage out!
Scaling & Normalization
Different algorithms require different scaling approaches:
StandardScaler (Z-score normalization)
Transforms data to have mean=0 and standard deviation=1. Best for distance- and gradient-based algorithms such as SVM, KNN, and linear models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Formula: (x - mean) / std
MinMaxScaler (Min-Max normalization)
Scales features to a fixed range [0, 1]. Good for neural networks and when you need bounded values.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
# Formula: (x - min) / (max - min)
RobustScaler
Uses median and interquartile range. Best when you have outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
Encoding Categorical Variables
LabelEncoder
Converts categorical labels to integers. Use for target variable.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y) # ['cat', 'dog', 'cat'] → [0, 1, 0]
OneHotEncoder
Creates binary columns for each category. Use for features.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X[['category']])
Feature Selection
Select only the most important features to improve model performance and reduce overfitting.
SelectKBest
Select the k best features based on statistical tests.
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get selected feature indices
selected_features = selector.get_support(indices=True)
VarianceThreshold
Remove features with low variance (they don't vary much).
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
Recursive Feature Elimination (RFE)
Recursively removes features and builds a model to rank them.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
rfe = RFE(model, n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
Handling Missing Values
from sklearn.impute import SimpleImputer
# Strategy: mean, median, most_frequent, constant
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# For categorical data
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)
🎯 Preprocessing Tips:
- Always fit transformers on training data only
- Apply the same transformation to test data
- Handle missing values before scaling
- Remove or handle outliers appropriately
- Use Pipeline (with ColumnTransformer) to automate preprocessing - see the sketch below
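Tying these tips together, here is a hedged sketch of a preprocessing pipeline that imputes and scales numeric columns and imputes and one-hot encodes categorical ones. The tiny DataFrame and its column names are made up purely for illustration:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Toy data for illustration only
df = pd.DataFrame({
    'age': [25, 32, None],
    'income': [50000, None, 42000],
    'city': ['Oslo', 'Bergen', None]
})
numeric_features = ['age', 'income']
categorical_features = ['city']
preprocessor = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])
X_preprocessed = preprocessor.fit_transform(df)
print(X_preprocessed.shape)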
Classification Algorithms
Classification predicts discrete categories. Learn the most important algorithms:
Logistic Regression
Despite its name, it's a classification algorithm. Best for binary classification and interpretability.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Get probabilities
y_proba = model.predict_proba(X_test)
Support Vector Machine (SVM)
Powerful for both linear and non-linear classification. Works well with high-dimensional data.
from sklearn.svm import SVC
# Linear kernel
model = SVC(kernel='linear', C=1.0, random_state=42)
# RBF kernel (non-linear)
model = SVC(kernel='rbf', gamma='scale', C=1.0, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Decision Trees
Interpretable, handles non-linear relationships, but prone to overfitting.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
max_depth=10,
min_samples_split=5,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest
Ensemble of decision trees. More robust and better generalization than single trees.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42,
n_jobs=-1 # Use all processors
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Feature importance
importances = model.feature_importances_
Gradient Boosting
A very strong ensemble method; gradient-boosted trees frequently win Kaggle competitions on tabular data.
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
K-Nearest Neighbors (KNN)
Simple but effective. Classifies based on nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Naive Bayes
Fast and effective for text classification and spam detection.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
| Algorithm | Speed | Accuracy | Interpretability | Best For |
|---|---|---|---|---|
| Logistic Regression | Very Fast | Good | Excellent | Binary classification, baseline |
| SVM | Slow | Excellent | Poor | High-dimensional data |
| Decision Tree | Fast | Good | Excellent | Interpretability |
| Random Forest | Fast | Excellent | Good | General purpose |
| Gradient Boosting | Slow | Excellent | Poor | Competitions, best accuracy |
| KNN | Slow | Good | Good | Small datasets |
| Naive Bayes | Very Fast | Good | Good | Text classification |
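Because every classifier shares the same interface, you can compare several of them on your own (preprocessed) data in a few lines. A hedged sketch; the iris dataset and the 5-fold setting are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (RBF)': SVC(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(),
}
# Cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")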
Regression Algorithms
Regression predicts continuous values. Master these essential algorithms:
Linear Regression
The foundation of regression. Simple, fast, and interpretable.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Get coefficients
coefficients = model.coef_
intercept = model.intercept_
Ridge Regression (L2 Regularization)
Adds penalty to prevent overfitting. Good for multicollinearity.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # Higher alpha = more regularization
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Lasso Regression (L1 Regularization)
Can shrink coefficients to zero, performing feature selection.
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
ElasticNet (L1 + L2)
Combines Ridge and Lasso. Best of both worlds.
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest Regression
Ensemble method for regression. Handles non-linear relationships.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(
n_estimators=100,
max_depth=10,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Gradient Boosting Regression
One of the strongest off-the-shelf methods for regression on tabular data.
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=5
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Support Vector Regression (SVR)
SVM for regression tasks.
from sklearn.svm import SVR
model = SVR(kernel='rbf', C=100, gamma='scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Regression Metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R² Score: {r2:.4f}")
Clustering Algorithms
Unsupervised learning to group similar data points. No labels needed!
K-Means Clustering
Most popular clustering algorithm. Partitions data into k clusters.
from sklearn.cluster import KMeans
# Determine optimal k using elbow method
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Train final model
model = KMeans(n_clusters=3, random_state=42)
clusters = model.fit_predict(X)
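To read off the "elbow" from the loop above, plot inertia against k and look for the point where the curve flattens. A minimal plotting sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()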
Hierarchical Clustering
Creates a tree of clusters. Good for understanding cluster relationships.
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(
n_clusters=3,
linkage='ward' # ward, complete, average, single
)
clusters = model.fit_predict(X)
DBSCAN
Density-based clustering. Finds clusters of arbitrary shape and identifies outliers.
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
clusters = model.fit_predict(X)
# -1 indicates outliers
outliers = (clusters == -1).sum()
Gaussian Mixture Models
Probabilistic clustering. Each point has probability of belonging to each cluster.
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3, random_state=42)
clusters = model.fit_predict(X)
# Get probabilities
probabilities = model.predict_proba(X)
Clustering Evaluation
from sklearn.metrics import silhouette_score, davies_bouldin_score
# Silhouette Score (higher is better, range: -1 to 1)
silhouette = silhouette_score(X, clusters)
# Davies-Bouldin Index (lower is better)
db_index = davies_bouldin_score(X, clusters)
print(f"Silhouette Score: {silhouette:.4f}")
print(f"Davies-Bouldin Index: {db_index:.4f}")
Dimensionality Reduction
Reduce number of features while preserving important information. Speeds up training and improves visualization.
Principal Component Analysis (PCA)
Most popular dimensionality reduction technique. Finds principal components that explain variance.
from sklearn.decomposition import PCA
# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Check explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")
# Determine the number of components needed to keep 95% of the variance
import numpy as np
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumsum >= 0.95) + 1  # smallest number of components reaching 95% variance
t-SNE
Great for visualization. Preserves local structure of data.
from sklearn.manifold import TSNE
# Reduce to 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_reduced = tsne.fit_transform(X)
# Note: t-SNE is slow for large datasets
UMAP
Faster than t-SNE, preserves both local and global structure.
# Install: pip install umap-learn
from umap import UMAP
umap = UMAP(n_components=2, random_state=42)
X_reduced = umap.fit_transform(X)
Feature Selection vs Dimensionality Reduction
| Aspect | Feature Selection | Dimensionality Reduction |
|---|---|---|
| Interpretability | High (original features) | Low (new components) |
| Speed | Fast | Slower |
| Information Loss | Some features removed | Compressed information |
| Use Case | When features are interpretable | When you need visualization |
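The interpretability difference is easy to see in code. A small sketch on the iris data (chosen only for illustration): feature selection returns named original features, while PCA returns components that mix all of them:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
data = load_iris()
X, y = data.data, data.target
# Feature selection keeps original, nameable features
selector = SelectKBest(f_classif, k=2).fit(X, y)
print([data.feature_names[i] for i in selector.get_support(indices=True)])
# PCA produces new components; each is a mix of all original features
pca = PCA(n_components=2).fit(X)
print(pca.components_.round(2))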
Model Selection & Evaluation
Choose the right model and evaluate it properly to avoid overfitting and get reliable performance estimates.
Train-Test Split
Divide data into training and testing sets. Typical split: 80-20 or 70-30.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # For classification, maintain class distribution
)
Cross-Validation
More reliable than single train-test split. Divides data into k folds.
from sklearn.model_selection import cross_val_score, KFold
# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
Grid Search
Systematically search for best hyperparameters.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
Random Search
Faster than Grid Search for large parameter spaces.
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
n_iter=20,
cv=5,
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
Classification Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report, roc_auc_score, roc_curve
)
# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
# Detailed report
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# ROC-AUC (for binary classification)
roc_auc = roc_auc_score(y_test, y_proba[:, 1])
When to Use Which Metric?
| Metric | When to Use | Formula |
|---|---|---|
| Accuracy | Balanced classes | (TP+TN)/(TP+TN+FP+FN) |
| Precision | Minimize false positives | TP/(TP+FP) |
| Recall | Minimize false negatives | TP/(TP+FN) |
| F1-Score | Balance precision & recall | 2*(Precision*Recall)/(Precision+Recall) |
| ROC-AUC | Imbalanced classes | Area under ROC curve |
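A tiny worked example, using made-up counts, to connect these formulas to code:
# Suppose a binary classifier produced TP=40, FP=10, FN=20, TN=30 (made-up numbers)
TP, FP, FN, TN = 40, 10, 20, 30
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.70
precision = TP / (TP + FP)                          # 0.80
recall = TP / (TP + FN)                             # ~0.67
f1 = 2 * precision * recall / (precision + recall)  # ~0.73
print(accuracy, precision, recall, f1)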
Ensemble Methods
Combine multiple models to get better predictions. "Wisdom of the crowd" principle.
Voting Classifier
Combines predictions from multiple classifiers using majority voting.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# Create individual models
lr = LogisticRegression(random_state=42)
svm = SVC(probability=True, random_state=42)
rf = RandomForestClassifier(random_state=42)
# Create voting classifier
voting_clf = VotingClassifier(
estimators=[('lr', lr), ('svm', svm), ('rf', rf)],
voting='soft' # soft: average probabilities, hard: majority vote
)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
Bagging
Bootstrap Aggregating. Trains multiple models on random subsets of data.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bagging_clf = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=10,
random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred = bagging_clf.predict(X_test)
Boosting - AdaBoost
Sequentially trains models, focusing on misclassified samples.
from sklearn.ensemble import AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(
n_estimators=50,
learning_rate=1.0,
random_state=42
)
adaboost_clf.fit(X_train, y_train)
y_pred = adaboost_clf.predict(X_test)
Gradient Boosting
Builds trees sequentially to correct errors of previous trees.
from sklearn.ensemble import GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)
Stacking
Uses predictions from multiple models as input to a meta-model.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# Base models
base_models = [
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('svm', SVC(probability=True, random_state=42))
]
# Meta-model
meta_model = LogisticRegression(random_state=42)
# Stacking classifier
stacking_clf = StackingClassifier(
estimators=base_models,
final_estimator=meta_model,
cv=5
)
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
🎯 Ensemble Tips:
- Combine diverse models for best results
- Voting works best with different algorithm types
- Boosting reduces bias, Bagging reduces variance (see the comparison sketch after this list)
- Stacking can achieve state-of-the-art results
- Always validate ensemble on held-out test set
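To see the bias/variance point in practice, a hedged comparison sketch of a single tree, bagged trees, and AdaBoost; the breast-cancer dataset and 5-fold cross-validation are illustrative choices:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
X, y = load_breast_cancer(return_X_y=True)
models = {
    'Single tree': DecisionTreeClassifier(random_state=42),
    'Bagging (50 trees)': BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42),
    'AdaBoost (50 stumps)': AdaBoostClassifier(n_estimators=50, random_state=42),
}
# Compare cross-validated accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")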
Comprehensive Evaluation Metrics
Understand how to properly evaluate your models with the right metrics.
Classification Metrics
Confusion Matrix
Shows True Positives, True Negatives, False Positives, False Negatives.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Precision, Recall, F1-Score
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, support = precision_recall_fscore_support(
y_test, y_pred, average='weighted'
)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
ROC-AUC Curve
Plots True Positive Rate vs False Positive Rate. AUC = Area Under Curve.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC Score: {roc_auc:.4f}")
Regression Metrics
Mean Squared Error (MSE)
Average of squared differences. Penalizes large errors more.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
Mean Absolute Error (MAE)
Average of absolute differences. More interpretable than MSE.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.4f}")
R² Score
Proportion of variance explained. A perfect model scores 1, a model no better than predicting the mean scores 0, and worse models can score below 0.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
Clustering Metrics
Silhouette Score
Measures how similar points are to their own cluster. Range: -1 to 1.
from sklearn.metrics import silhouette_score
silhouette = silhouette_score(X, clusters)
print(f"Silhouette Score: {silhouette:.4f}")
Davies-Bouldin Index
Average similarity between each cluster and its most similar cluster. Lower is better.
from sklearn.metrics import davies_bouldin_score
db_index = davies_bouldin_score(X, clusters)
print(f"Davies-Bouldin Index: {db_index:.4f}")
Real-World Projects
Apply your knowledge to real datasets. These projects will solidify your understanding.
Project 1: Iris Flower Classification
Classify iris flowers into three species using their measurements.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
Project 2: Handwritten Digit Recognition
Recognize handwritten digits from 0-9 using scikit-learn's built-in 8x8 digits dataset (a small, MNIST-style dataset).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load data
digits = load_digits()
X, y = digits.data, digits.target
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train SVM
model = SVC(kernel='rbf', gamma='scale')
model.fit(X_train_scaled, y_train)
# Evaluate
accuracy = model.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.4f}")
Project 3: Customer Segmentation
Segment customers using clustering for targeted marketing.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate customer data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find optimal k
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, clusters)
    silhouette_scores.append(score)
# Train final model
optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
customer_segments = kmeans.fit_predict(X_scaled)
print(f"Optimal clusters: {optimal_k}")
print(f"Silhouette Score: {max(silhouette_scores):.4f}")
Project 4: House Price Prediction
Predict house prices using regression.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load housing data
housing = fetch_openml(name='house_prices', as_frame=True)
X, y = housing.data, housing.target
# Keep numeric features only (this dataset also contains categorical columns) and fill missing values
X = X.select_dtypes(include='number')
X = X.fillna(X.mean())
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
12-Week Scikit-Learn Mastery Roadmap
Follow this structured path to become a Scikit-Learn expert in 12 weeks.
1. Week 1-2: Foundations
Learn NumPy, Pandas basics. Understand ML concepts: supervised vs unsupervised, overfitting, train-test split.
2. Week 3: Data Preprocessing
Master scaling, encoding, feature selection. Practice with real datasets.
3. Week 4-5: Classification Basics
Learn Logistic Regression, SVM, Decision Trees. Understand classification metrics.
4. Week 6: Regression
Master Linear, Ridge, Lasso regression. Learn regression metrics: MSE, RMSE, R².
5. Week 7: Ensemble Methods
Learn Random Forest, Gradient Boosting, Voting, Stacking. These are state-of-the-art!
6. Week 8: Model Selection & Evaluation
Master cross-validation, Grid Search, hyperparameter tuning. Avoid overfitting.
7. Week 9: Clustering & Dimensionality Reduction
Learn K-Means, DBSCAN, PCA, t-SNE. Understand unsupervised learning.
8. Week 10: Advanced Topics
Feature engineering, pipeline automation, handling imbalanced data, anomaly detection.
9. Week 11: Real Projects
Build 3-4 complete projects from scratch. Apply everything you learned.
10. Week 12: Kaggle & Portfolio
Participate in Kaggle competitions. Build portfolio projects. Share on GitHub.
Daily Study Schedule
📅 Recommended Daily Routine:
- 30 minutes: Watch tutorial or read documentation
- 60 minutes: Code along with examples
- 30 minutes: Practice on your own dataset
- 30 minutes: Review and take notes
Learning Shortcuts & Pro Tips
Start Simple
Begin with Logistic Regression and Linear Regression before complex models
Use Pipelines
Automate preprocessing with Pipeline class to avoid data leakage
Cross-Validate
Always use cross-validation for reliable performance estimates
Ensemble Everything
Combine models for better results. Voting and Stacking are powerful
Hyperparameter Tune
Use GridSearchCV to find optimal parameters systematically
Kaggle Practice
Participate in competitions to learn from others and build portfolio
Resources for Continued Learning
- Official Documentation: scikit-learn.org - Most comprehensive resource
- Kaggle: kaggle.com - Datasets and competitions
- GitHub: Search for scikit-learn projects and examples
- YouTube: Scikit-learn tutorials and ML courses
- Books: "Hands-On Machine Learning" by Aurélien Géron
- Courses: Andrew Ng's ML course, Fast.ai
Common Mistakes to Avoid
⚠️ Don't Make These Mistakes:
- Fitting the scaler on the entire dataset before splitting (causes data leakage - see the sketch after this list)
- Not splitting data before evaluation
- Using accuracy for imbalanced datasets
- Tuning hyperparameters on test set
- Not handling missing values properly
- Ignoring feature scaling for distance-based algorithms
- Not using cross-validation
- Overfitting to training data
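To make the leakage mistakes concrete, a minimal right-vs-wrong sketch (the iris data is used purely for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
# Wrong: the scaler sees the test rows before evaluation (data leakage)
X_leaky = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, test_size=0.2, random_state=42)
# Right: split first, then fit the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # same statistics applied to the test set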