Welcome to Scikit-Learn Mastery
Master machine learning with Scikit-Learn - one of the most widely used and beginner-friendly ML libraries in Python. This comprehensive platform will help you become a Scikit-Learn expert through structured learning, real-world projects, and hands-on code examples.
What You'll Learn
🔧 Data Preprocessing
Scaling, encoding, feature selection, and data transformation techniques
🎯 Classification
Logistic Regression, SVM, Random Forest, and more
📈 Regression
Linear, Ridge, Lasso, and ensemble regression models
🔀 Clustering
K-Means, Hierarchical, DBSCAN, and clustering evaluation
📊 Dimensionality Reduction
PCA, t-SNE, and feature selection methods
⚙️ Model Optimization
Cross-validation, Grid Search, and hyperparameter tuning
Why Scikit-Learn?
- Simple API: Consistent interface across all algorithms (see the short sketch after this list)
- Comprehensive: Algorithms and tools for classification, regression, clustering, dimensionality reduction, preprocessing, and model selection
- Production-Ready: Used by companies worldwide
- Well-Documented: Extensive documentation and examples
- Fast: Performance-critical routines implemented in Cython/C
- Integrates Well: Works seamlessly with NumPy, Pandas, Matplotlib
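The "Simple API" point deserves a quick illustration: every estimator exposes the same fit/predict (or fit/transform) methods, so swapping algorithms is usually a one-line change. A minimal sketch on synthetic data (the dataset and the two models here are chosen purely for illustration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Toy data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# Every estimator follows the same fit/predict pattern
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=50)):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]))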
The Complete ML Pipeline
Understanding the machine learning pipeline is crucial. This is the typical order and workflow to follow for most ML projects:
Pipeline Workflow
1. Load Data
2. Explore Data
3. Clean Data
4. Feature Engineering
5. Split Data
6. Scale/Normalize
7. Train Model
8. Evaluate
9. Tune Hyperparameters
10. Deploy
Step-by-Step Explanation
1. Load Data
Start by loading your dataset using Pandas or Scikit-Learn's built-in datasets.
from sklearn.datasets import load_iris
import pandas as pd
# Load built-in dataset
iris = load_iris()
X = iris.data
y = iris.target
# Or load from CSV
df = pd.read_csv('data.csv')
2. Explore Data (EDA)
Understand your data before building models. Check shape, missing values, distributions, and correlations.
import pandas as pd
# Check data shape and info
print(df.shape)
print(df.info())
print(df.describe())
print(df.isnull().sum()) # Missing values
3. Clean Data
Handle missing values, remove duplicates, and fix inconsistencies.
# Handle missing values (pick one strategy)
df.fillna(df.mean(numeric_only=True), inplace=True)  # Fill numeric columns with the column mean
# df.dropna(inplace=True)                            # ...or drop rows with missing values instead
# Remove duplicates
df.drop_duplicates(inplace=True)
# Remove outliers (example)
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['column'] >= Q1 - 1.5*IQR) & (df['column'] <= Q3 + 1.5*IQR)]
4. Feature Engineering
Create new features, encode categorical variables, and select relevant features.
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
# Encode categorical variables (LabelEncoder shown for brevity; OneHotEncoder is usually preferred for features)
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])
# Create new features
df['feature_new'] = df['col1'] * df['col2']
# Select best features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
5. Split Data
Divide data into training and testing sets so you can measure how well the model generalizes to data it has never seen.
from sklearn.model_selection import train_test_split
# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape}")
print(f"Testing set: {X_test.shape}")
6. Scale/Normalize Features
Standardize features to have similar scales. This is crucial for many algorithms.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# MinMaxScaler: range [0, 1]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
7. Train Model
Choose an algorithm and train it on your training data.
from sklearn.ensemble import RandomForestClassifier
# Create and train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
8. Evaluate Model
Measure model performance using appropriate metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
9. Tune Hyperparameters
Use Grid Search or Random Search to find optimal hyperparameters.
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy'
)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
10. Deploy Model
Save your model and deploy it for predictions on new data.
import joblib
# Save the model and the fitted scaler (you need both at prediction time)
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Load them later
loaded_model = joblib.load('model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
# Make predictions on new data, scaled with the same fitted scaler
new_data_scaled = loaded_scaler.transform(new_data)  # new_data: your incoming samples
new_predictions = loaded_model.predict(new_data_scaled)
💡 Pipeline Best Practices:
- Always fit scalers on training data only, then transform test data
- Use cross-validation to get reliable performance estimates
- Never leak test data into training (fit transformers on train data)
- Document your pipeline for reproducibility
- Use Scikit-Learn's Pipeline class to automate this workflow
Using Scikit-Learn Pipeline Class
Automate the entire pipeline with a single object:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")
Data Preprocessing & Feature Engineering
Data preprocessing is the foundation of successful machine learning. Garbage in, garbage out!
Scaling & Normalization
Different algorithms require different scaling approaches:
StandardScaler (Z-score normalization)
Transforms data to have mean=0 and standard deviation=1. Best for distance- and gradient-based algorithms such as SVM, KNN, and linear models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Formula: (x - mean) / std
MinMaxScaler (Min-Max normalization)
Scales features to a fixed range [0, 1]. Good for neural networks and when you need bounded values.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
# Formula: (x - min) / (max - min)
RobustScaler
Uses median and interquartile range. Best when you have outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
Encoding Categorical Variables
LabelEncoder
Converts categorical labels to integers. Use for target variable.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y) # ['cat', 'dog', 'cat'] → [0, 1, 0]
OneHotEncoder
Creates binary columns for each category. Use for features.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X[['category']])
Feature Selection
Select only the most important features to improve model performance and reduce overfitting.
SelectKBest
Select the k best features based on statistical tests.
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get selected feature indices
selected_features = selector.get_support(indices=True)
VarianceThreshold
Remove features with low variance (they don't vary much).
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
Recursive Feature Elimination (RFE)
Recursively removes features and builds a model to rank them.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
rfe = RFE(model, n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
Handling Missing Values
from sklearn.impute import SimpleImputer
# Strategy: mean, median, most_frequent, constant
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# For categorical data
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)
🎯 Preprocessing Tips:
- Always fit transformers on training data only
- Apply the same transformation to test data
- Handle missing values before scaling
- Remove or handle outliers appropriately
- Use Pipeline (with ColumnTransformer) to automate preprocessing - see the sketch below
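Tying these tips together, here is a hedged sketch of a preprocessing pipeline that imputes and scales numeric columns and imputes and one-hot encodes categorical ones. The tiny DataFrame and its column names are made up purely for illustration:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Toy data for illustration only
df = pd.DataFrame({
    'age': [25, 32, None],
    'income': [50000, None, 42000],
    'city': ['Oslo', 'Bergen', None]
})
numeric_features = ['age', 'income']
categorical_features = ['city']
preprocessor = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])
X_preprocessed = preprocessor.fit_transform(df)
print(X_preprocessed.shape)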
Classification Algorithms
Classification predicts discrete categories. Learn the most important algorithms:
Logistic Regression
Despite its name, it's a classification algorithm. Best for binary classification and interpretability.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Get probabilities
y_proba = model.predict_proba(X_test)
Support Vector Machine (SVM)
Powerful for both linear and non-linear classification. Works well with high-dimensional data.
from sklearn.svm import SVC
# Linear kernel
model = SVC(kernel='linear', C=1.0, random_state=42)
# RBF kernel (non-linear)
model = SVC(kernel='rbf', gamma='scale', C=1.0, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Decision Trees
Interpretable, handles non-linear relationships, but prone to overfitting.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
max_depth=10,
min_samples_split=5,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest
Ensemble of decision trees. More robust and better generalization than single trees.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42,
n_jobs=-1 # Use all processors
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Feature importance
importances = model.feature_importances_
Gradient Boosting
A very strong ensemble method; gradient-boosted trees frequently win Kaggle competitions on tabular data.
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
K-Nearest Neighbors (KNN)
Simple but effective. Classifies based on nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Naive Bayes
Fast and effective for text classification and spam detection.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
| Algorithm | Speed | Accuracy | Interpretability | Best For |
|---|---|---|---|---|
| Logistic Regression | Very Fast | Good | Excellent | Binary classification, baseline |
| SVM | Slow | Excellent | Poor | High-dimensional data |
| Decision Tree | Fast | Good | Excellent | Interpretability |
| Random Forest | Fast | Excellent | Good | General purpose |
| Gradient Boosting | Slow | Excellent | Poor | Competitions, best accuracy |
| KNN | Slow | Good | Good | Small datasets |
| Naive Bayes | Very Fast | Good | Good | Text classification |
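Because every classifier shares the same interface, you can compare several of them on your own (preprocessed) data in a few lines. A hedged sketch; the iris dataset and the 5-fold setting are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (RBF)': SVC(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(),
}
# Cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")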
Regression Algorithms
Regression predicts continuous values. Master these essential algorithms:
Linear Regression
The foundation of regression. Simple, fast, and interpretable.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Get coefficients
coefficients = model.coef_
intercept = model.intercept_
Ridge Regression (L2 Regularization)
Adds penalty to prevent overfitting. Good for multicollinearity.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # Higher alpha = more regularization
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Lasso Regression (L1 Regularization)
Can shrink coefficients to zero, performing feature selection.
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
ElasticNet (L1 + L2)
Combines Ridge and Lasso. Best of both worlds.
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest Regression
Ensemble method for regression. Handles non-linear relationships.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(
n_estimators=100,
max_depth=10,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Gradient Boosting Regression
One of the strongest off-the-shelf methods for regression on tabular data.
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=5
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Support Vector Regression (SVR)
SVM for regression tasks.
from sklearn.svm import SVR
model = SVR(kernel='rbf', C=100, gamma='scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Regression Metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R² Score: {r2:.4f}")
Clustering Algorithms
Unsupervised learning to group similar data points. No labels needed!
K-Means Clustering
Most popular clustering algorithm. Partitions data into k clusters.
from sklearn.cluster import KMeans
# Determine optimal k using elbow method
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Train final model
model = KMeans(n_clusters=3, random_state=42)
clusters = model.fit_predict(X)
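To read off the "elbow" from the loop above, plot inertia against k and look for the point where the curve flattens. A minimal plotting sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()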
Hierarchical Clustering
Creates a tree of clusters. Good for understanding cluster relationships.
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(
n_clusters=3,
linkage='ward' # ward, complete, average, single
)
clusters = model.fit_predict(X)
DBSCAN
Density-based clustering. Finds clusters of arbitrary shape and identifies outliers.
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
clusters = model.fit_predict(X)
# -1 indicates outliers
outliers = (clusters == -1).sum()
Gaussian Mixture Models
Probabilistic clustering. Each point has probability of belonging to each cluster.
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3, random_state=42)
clusters = model.fit_predict(X)
# Get probabilities
probabilities = model.predict_proba(X)
Clustering Evaluation
from sklearn.metrics import silhouette_score, davies_bouldin_score
# Silhouette Score (higher is better, range: -1 to 1)
silhouette = silhouette_score(X, clusters)
# Davies-Bouldin Index (lower is better)
db_index = davies_bouldin_score(X, clusters)
print(f"Silhouette Score: {silhouette:.4f}")
print(f"Davies-Bouldin Index: {db_index:.4f}")
Dimensionality Reduction
Reduce number of features while preserving important information. Speeds up training and improves visualization.
Principal Component Analysis (PCA)
Most popular dimensionality reduction technique. Finds principal components that explain variance.
from sklearn.decomposition import PCA
# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Check explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")
# Determine the number of components needed to keep 95% of the variance
import numpy as np
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumsum >= 0.95) + 1  # smallest number of components reaching 95% variance
t-SNE
Great for visualization. Preserves local structure of data.
from sklearn.manifold import TSNE
# Reduce to 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_reduced = tsne.fit_transform(X)
# Note: t-SNE is slow for large datasets
UMAP
Faster than t-SNE, preserves both local and global structure.
# Install: pip install umap-learn
from umap import UMAP
umap = UMAP(n_components=2, random_state=42)
X_reduced = umap.fit_transform(X)
Feature Selection vs Dimensionality Reduction
| Aspect | Feature Selection | Dimensionality Reduction |
|---|---|---|
| Interpretability | High (original features) | Low (new components) |
| Speed | Fast | Slower |
| Information Loss | Some features removed | Compressed information |
| Use Case | When features are interpretable | When you need visualization |
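The interpretability difference is easy to see in code. A small sketch on the iris data (chosen only for illustration): feature selection returns named original features, while PCA returns components that mix all of them:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
data = load_iris()
X, y = data.data, data.target
# Feature selection keeps original, nameable features
selector = SelectKBest(f_classif, k=2).fit(X, y)
print([data.feature_names[i] for i in selector.get_support(indices=True)])
# PCA produces new components; each is a mix of all original features
pca = PCA(n_components=2).fit(X)
print(pca.components_.round(2))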
Model Selection & Evaluation
Choose the right model and evaluate it properly to avoid overfitting and get reliable performance estimates.
Train-Test Split
Divide data into training and testing sets. Typical split: 80-20 or 70-30.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # For classification, maintain class distribution
)
Cross-Validation
More reliable than single train-test split. Divides data into k folds.
from sklearn.model_selection import cross_val_score, KFold
# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
Grid Search
Systematically search for best hyperparameters.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
Random Search
Faster than Grid Search for large parameter spaces.
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
n_iter=20,
cv=5,
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
Classification Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report, roc_auc_score, roc_curve
)
# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
# Detailed report
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# ROC-AUC (for binary classification)
roc_auc = roc_auc_score(y_test, y_proba[:, 1])
When to Use Which Metric?
| Metric | When to Use | Formula |
|---|---|---|
| Accuracy | Balanced classes | (TP+TN)/(TP+TN+FP+FN) |
| Precision | Minimize false positives | TP/(TP+FP) |
| Recall | Minimize false negatives | TP/(TP+FN) |
| F1-Score | Balance precision & recall | 2*(Precision*Recall)/(Precision+Recall) |
| ROC-AUC | Imbalanced classes | Area under ROC curve |
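A tiny worked example, using made-up counts, to connect these formulas to code:
# Suppose a binary classifier produced TP=40, FP=10, FN=20, TN=30 (made-up numbers)
TP, FP, FN, TN = 40, 10, 20, 30
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.70
precision = TP / (TP + FP)                          # 0.80
recall = TP / (TP + FN)                             # ~0.67
f1 = 2 * precision * recall / (precision + recall)  # ~0.73
print(accuracy, precision, recall, f1)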
Ensemble Methods
Combine multiple models to get better predictions. "Wisdom of the crowd" principle.
Voting Classifier
Combines predictions from multiple classifiers using majority voting.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# Create individual models
lr = LogisticRegression(random_state=42)
svm = SVC(probability=True, random_state=42)
rf = RandomForestClassifier(random_state=42)
# Create voting classifier
voting_clf = VotingClassifier(
estimators=[('lr', lr), ('svm', svm), ('rf', rf)],
voting='soft' # soft: average probabilities, hard: majority vote
)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
Bagging
Bootstrap Aggregating. Trains multiple models on random subsets of data.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bagging_clf = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=10,
random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred = bagging_clf.predict(X_test)
Boosting - AdaBoost
Sequentially trains models, focusing on misclassified samples.
from sklearn.ensemble import AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(
n_estimators=50,
learning_rate=1.0,
random_state=42
)
adaboost_clf.fit(X_train, y_train)
y_pred = adaboost_clf.predict(X_test)
Gradient Boosting
Builds trees sequentially to correct errors of previous trees.
from sklearn.ensemble import GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)
Stacking
Uses predictions from multiple models as input to a meta-model.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# Base models
base_models = [
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('svm', SVC(probability=True, random_state=42))
]
# Meta-model
meta_model = LogisticRegression(random_state=42)
# Stacking classifier
stacking_clf = StackingClassifier(
estimators=base_models,
final_estimator=meta_model,
cv=5
)
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
🎯 Ensemble Tips:
- Combine diverse models for best results
- Voting works best with different algorithm types
- Boosting reduces bias, Bagging reduces variance (see the comparison sketch after this list)
- Stacking can achieve state-of-the-art results
- Always validate ensemble on held-out test set
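To see the bias/variance point in practice, a hedged comparison sketch of a single tree, bagged trees, and AdaBoost; the breast-cancer dataset and 5-fold cross-validation are illustrative choices:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
X, y = load_breast_cancer(return_X_y=True)
models = {
    'Single tree': DecisionTreeClassifier(random_state=42),
    'Bagging (50 trees)': BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42),
    'AdaBoost (50 stumps)': AdaBoostClassifier(n_estimators=50, random_state=42),
}
# Compare cross-validated accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")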
Comprehensive Evaluation Metrics
Understand how to properly evaluate your models with the right metrics.
Classification Metrics
Confusion Matrix
Shows True Positives, True Negatives, False Positives, False Negatives.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Precision, Recall, F1-Score
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, support = precision_recall_fscore_support(
y_test, y_pred, average='weighted'
)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
ROC-AUC Curve
Plots True Positive Rate vs False Positive Rate. AUC = Area Under Curve.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC Score: {roc_auc:.4f}")
Regression Metrics
Mean Squared Error (MSE)
Average of squared differences. Penalizes large errors more.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
Mean Absolute Error (MAE)
Average of absolute differences. More interpretable than MSE.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.4f}")
R² Score
Proportion of variance explained. A perfect model scores 1, a model no better than predicting the mean scores 0, and worse models can score below 0.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
Clustering Metrics
Silhouette Score
Measures how similar points are to their own cluster. Range: -1 to 1.
from sklearn.metrics import silhouette_score
silhouette = silhouette_score(X, clusters)
print(f"Silhouette Score: {silhouette:.4f}")
Davies-Bouldin Index
Average similarity between each cluster and its most similar cluster. Lower is better.
from sklearn.metrics import davies_bouldin_score
db_index = davies_bouldin_score(X, clusters)
print(f"Davies-Bouldin Index: {db_index:.4f}")
Real-World Projects
Apply your knowledge to real datasets. These projects will solidify your understanding.
Project 1: Iris Flower Classification
Classify iris flowers into three species using their measurements.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
Project 2: Handwritten Digit Recognition
Recognize handwritten digits from 0-9 using scikit-learn's built-in 8x8 digits dataset (a small, MNIST-style dataset).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load data
digits = load_digits()
X, y = digits.data, digits.target
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train SVM
model = SVC(kernel='rbf', gamma='scale')
model.fit(X_train_scaled, y_train)
# Evaluate
accuracy = model.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.4f}")
Project 3: Customer Segmentation
Segment customers using clustering for targeted marketing.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate customer data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find optimal k
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, clusters)
    silhouette_scores.append(score)
# Train final model
optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
customer_segments = kmeans.fit_predict(X_scaled)
print(f"Optimal clusters: {optimal_k}")
print(f"Silhouette Score: {max(silhouette_scores):.4f}")
Project 4: House Price Prediction
Predict house prices using regression.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load housing data
housing = fetch_openml(name='house_prices', as_frame=True)
X, y = housing.data, housing.target
# Keep numeric features only (this dataset also contains categorical columns) and fill missing values
X = X.select_dtypes(include='number')
X = X.fillna(X.mean())
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
12-Week Scikit-Learn Mastery Roadmap
Follow this structured path to become a Scikit-Learn expert in 12 weeks.
1. Week 1-2: Foundations
Learn NumPy, Pandas basics. Understand ML concepts: supervised vs unsupervised, overfitting, train-test split.
2. Week 3: Data Preprocessing
Master scaling, encoding, feature selection. Practice with real datasets.
3. Week 4-5: Classification Basics
Learn Logistic Regression, SVM, Decision Trees. Understand classification metrics.
4. Week 6: Regression
Master Linear, Ridge, Lasso regression. Learn regression metrics: MSE, RMSE, R².
5. Week 7: Ensemble Methods
Learn Random Forest, Gradient Boosting, Voting, Stacking. These are state-of-the-art!
6. Week 8: Model Selection & Evaluation
Master cross-validation, Grid Search, hyperparameter tuning. Avoid overfitting.
7. Week 9: Clustering & Dimensionality Reduction
Learn K-Means, DBSCAN, PCA, t-SNE. Understand unsupervised learning.
8. Week 10: Advanced Topics
Feature engineering, pipeline automation, handling imbalanced data, anomaly detection.
9. Week 11: Real Projects
Build 3-4 complete projects from scratch. Apply everything you learned.
10. Week 12: Kaggle & Portfolio
Participate in Kaggle competitions. Build portfolio projects. Share on GitHub.
Daily Study Schedule
📅 Recommended Daily Routine:
- 30 minutes: Watch tutorial or read documentation
- 60 minutes: Code along with examples
- 30 minutes: Practice on your own dataset
- 30 minutes: Review and take notes
Learning Shortcuts & Pro Tips
Start Simple
Begin with Logistic Regression and Linear Regression before complex models
Use Pipelines
Automate preprocessing with Pipeline class to avoid data leakage
Cross-Validate
Always use cross-validation for reliable performance estimates
Ensemble Everything
Combine models for better results. Voting and Stacking are powerful
Hyperparameter Tune
Use GridSearchCV to find optimal parameters systematically
Kaggle Practice
Participate in competitions to learn from others and build portfolio
Resources for Continued Learning
- Official Documentation: scikit-learn.org - Most comprehensive resource
- Kaggle: kaggle.com - Datasets and competitions
- GitHub: Search for scikit-learn projects and examples
- YouTube: Scikit-learn tutorials and ML courses
- Books: "Hands-On Machine Learning" by Aurélien Géron
- Courses: Andrew Ng's ML course, Fast.ai
Common Mistakes to Avoid
⚠️ Don't Make These Mistakes:
- Fitting the scaler on the entire dataset before splitting (causes data leakage - see the sketch after this list)
- Not splitting data before evaluation
- Using accuracy for imbalanced datasets
- Tuning hyperparameters on test set
- Not handling missing values properly
- Ignoring feature scaling for distance-based algorithms
- Not using cross-validation
- Overfitting to training data
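To make the leakage mistakes concrete, a minimal right-vs-wrong sketch (the iris data is used purely for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
# Wrong: the scaler sees the test rows before evaluation (data leakage)
X_leaky = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, test_size=0.2, random_state=42)
# Right: split first, then fit the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # same statistics applied to the test set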