Machine Learning Fundamentals: Part 3 - Unsupervised Learning and Clustering

December 15, 2023
9 min read
Jinu Nyachhyon

Welcome to Part 3 of our Machine Learning Fundamentals series! After covering supervised learning algorithms, we now turn to unsupervised learning - the art of finding hidden patterns in data without labeled examples.

What is Unsupervised Learning?

Unsupervised learning is like being a detective - you have clues (data) but no clear answers (labels). The algorithm must discover hidden structures, patterns, and relationships in the data on its own. This makes it both challenging and exciting, as you never know what insights you might uncover.

Types of Unsupervised Learning

1. Clustering

Groups similar data points together.

  • Goal: Find natural groupings in data
  • Examples: Customer segmentation, gene sequencing, image segmentation

2. Dimensionality Reduction

Reduces the number of features while preserving important information.

  • Goal: Simplify data while retaining key patterns
  • Examples: Data visualization, noise reduction, feature extraction

3. Association Rule Learning

Finds relationships between different variables.

  • Goal: Discover rules that describe relationships
  • Examples: Market basket analysis, recommendation systems

4. Anomaly Detection

Identifies unusual or outlier data points.

  • Goal: Find data points that don't fit normal patterns
  • Examples: Fraud detection, network security, quality control

Clustering Algorithms

K-Means Clustering

K-Means is one of the most widely used clustering algorithms. It partitions the data into k clusters, each represented by the mean (centroid) of its points.

How it Works

  1. Choose the number of clusters (k)
  2. Randomly initialize k cluster centers (centroids)
  3. Assign each data point to the nearest centroid
  4. Update centroids to the mean of assigned points
  5. Repeat steps 3-4 until convergence

Mathematical Foundation

The algorithm minimizes the within-cluster sum of squares (WCSS):

$$WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:

  • $C_i$ is the i-th cluster
  • $\mu_i$ is the centroid of cluster i
  • $||x - \mu_i||^2$ is the squared Euclidean distance between point x and its centroid
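To make the algorithm concrete, here is a minimal sketch using scikit-learn's `KMeans` on synthetic data; the blob-generation parameters, the choice of k=3, and the variable names are illustrative assumptions rather than anything tied to a specific dataset. The fitted model's `inertia_` attribute is the WCSS objective defined above.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2-D data with three natural groupings (illustrative parameters)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Steps 1-5 above (initialization, assignment, centroid updates, convergence)
# all happen inside fit_predict()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("WCSS (inertia):", kmeans.inertia_)   # the objective minimized above
```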

Advantages

  • Simple and fast
  • Works well with spherical clusters
  • Scales well to large datasets
  • Guaranteed to converge (though possibly to a local optimum)

Disadvantages

  • Requires choosing k in advance
  • Sensitive to initialization
  • Assumes spherical clusters
  • Sensitive to outliers

Choosing the Right k

  • Elbow Method: Plot WCSS vs k, look for the "elbow"
  • Silhouette Analysis: Measure how similar points are to their own cluster vs other clusters
  • Gap Statistic: Compare WCSS to random data
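As a sketch of the first two approaches, the loop below records the WCSS and the silhouette score for a range of candidate k values on synthetic data; the range 2-10 and the data itself are arbitrary illustrations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss, silhouettes = [], []
for k in range(2, 11):                      # candidate values of k (illustrative range)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                # plot against k and look for the "elbow"
    silhouettes.append(silhouette_score(X, km.labels_))

best_k = range(2, 11)[silhouettes.index(max(silhouettes))]
print("k with the highest silhouette score:", best_k)
```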

Hierarchical Clustering

Hierarchical clustering creates a tree-like structure of clusters.

Types

  1. Agglomerative (Bottom-up): Start with individual points, merge similar clusters
  2. Divisive (Top-down): Start with all points, recursively split clusters

Linkage Criteria

  • Single Linkage: Minimum distance between clusters
  • Complete Linkage: Maximum distance between clusters
  • Average Linkage: Average distance between all pairs
  • Ward Linkage: Minimizes within-cluster variance
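Here is a brief sketch of agglomerative clustering with Ward linkage, using scikit-learn for the flat clustering and SciPy for the dendrogram. Both APIs are standard, but the synthetic data and the choice of 3 clusters are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Flat clustering with Ward linkage (minimizes within-cluster variance)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# Full merge tree for visual inspection
plt.figure(figsize=(8, 4))
dendrogram(linkage(X, method="ward"))
plt.title("Agglomerative clustering dendrogram (Ward linkage)")
plt.show()
```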

Advantages

  • No need to specify number of clusters beforehand
  • Creates a hierarchy of clusters
  • Deterministic results
  • Works with any distance measure

Disadvantages

  • Computationally expensive: O(n³) time in the general case
  • Sensitive to noise and outliers
  • Difficult to handle large datasets
  • Hard to undo previous steps

DBSCAN (Density-Based Spatial Clustering)

DBSCAN groups together points in high-density areas and marks points in low-density areas as outliers.

Key Parameters

  • eps (ε): Maximum distance between two points to be neighbors
  • min_samples: Minimum number of points to form a dense region

Point Types

  • Core Points: Have at least min_samples neighbors within eps distance
  • Border Points: Within eps distance of a core point but not core themselves
  • Noise Points: Neither core nor border points (outliers)
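A minimal DBSCAN sketch follows; the `eps` and `min_samples` values are placeholders that would normally be tuned (for example with a k-distance plot), and the two-moons data is chosen only because K-Means handles that shape poorly. Points labeled -1 are the noise points described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrarily shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                         # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))
```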

Advantages

  • Automatically determines number of clusters
  • Can find arbitrarily shaped clusters
  • Robust to outliers
  • Identifies noise points

Disadvantages

  • Sensitive to hyperparameters
  • Struggles with varying densities
  • High-dimensional data challenges
  • Memory intensive for large datasets

Dimensionality Reduction

Principal Component Analysis (PCA)

PCA finds the directions (principal components) that capture the most variance in the data.

How it Works

  1. Standardize the data
  2. Compute the covariance matrix
  3. Calculate eigenvalues and eigenvectors
  4. Sort by eigenvalue magnitude
  5. Select top k eigenvectors as principal components
  6. Transform data to new coordinate system

Mathematical Foundation

PCA finds the projection that maximizes variance:

$$\max_w \text{Var}(X \cdot w) \quad \text{subject to} \quad ||w||^2 = 1$$
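The steps above are bundled into scikit-learn's `PCA`. The sketch below standardizes the Iris features first (PCA is scale-sensitive) and reports how much variance each component explains; the choice of 2 components is purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data

# Step 1: standardize so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-6: covariance, eigendecomposition, and projection happen inside fit_transform
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)    # (150, 2)
```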

Advantages

  • Reduces dimensionality while preserving variance
  • Removes correlation between features
  • Helps with visualization
  • Reduces computational complexity

Disadvantages

  • Linear transformation only
  • Components may not be interpretable
  • Requires scaling of features
  • May lose important information

Choosing Number of Components

  • Explained Variance Ratio: Keep components that explain 80-95% of variance
  • Scree Plot: Look for the "elbow" in eigenvalue plot
  • Cross-Validation: Use downstream task performance
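For the explained-variance approach, scikit-learn lets you pass a fraction as `n_components`, which keeps just enough components to reach that share of the variance; the 0.95 threshold below is one common but arbitrary choice, and the digits dataset is used only as an example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_digits().data)

# Keep the smallest number of components explaining >= 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Cumulative variance explained:", np.cumsum(pca.explained_variance_ratio_)[-1])
```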

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is excellent for visualizing high-dimensional data in 2D or 3D.

How it Works

  1. Calculate pairwise similarities in high-dimensional space
  2. Initialize random low-dimensional representation
  3. Calculate pairwise similarities in low-dimensional space
  4. Minimize divergence between high and low-dimensional similarities
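A short t-SNE sketch for visualization; perplexity is the main knob, and 30 (the library default) is written out explicitly only for clarity. The digits dataset is an arbitrary example of high-dimensional input.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# Embed 64-dimensional digit images into 2-D for plotting
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(digits.data)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=8)
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```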

Advantages

  • Excellent for visualization
  • Preserves local structure well
  • Can reveal clusters and patterns
  • Non-linear dimensionality reduction

Disadvantages

  • Computationally expensive
  • Non-deterministic results
  • Hyperparameter sensitive
  • Cannot embed new, unseen data points without re-running the algorithm

Association Rule Learning

Market Basket Analysis

Finds relationships between items frequently bought together.

Key Metrics

  • Support: Frequency of itemset in dataset
  • Confidence: Likelihood of consequent given antecedent
  • Lift: How much more likely consequent is given antecedent
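To ground the three metrics, the toy computation below counts them directly over a handful of hypothetical baskets; the item names, basket contents, and the chosen rule are made up purely for illustration.

```python
# Hypothetical transactions for illustration only
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk", "eggs"},
    {"milk"},
]

antecedent, consequent = {"bread", "butter"}, {"milk"}
n = len(baskets)

support_a  = sum(antecedent <= b for b in baskets) / n                 # P(A)
support_c  = sum(consequent <= b for b in baskets) / n                 # P(C)
support_ac = sum((antecedent | consequent) <= b for b in baskets) / n  # P(A and C)

confidence = support_ac / support_a      # P(C | A)
lift       = confidence / support_c      # how much A boosts C over its base rate

print(f"support={support_ac:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```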

Example Rules

  • {Bread, Butter} → {Milk} (Support: 0.3, Confidence: 0.8, Lift: 1.2)
  • {Diapers} → {Beer} (Famous example from retail analytics)

Applications

  • Recommendation systems
  • Cross-selling strategies
  • Website navigation analysis
  • Protein sequence analysis in bioinformatics

Anomaly Detection

Statistical Methods

Z-Score Method

Identifies points that are many standard deviations away from the mean:

$$z = \frac{x - \mu}{\sigma}$$

Points with |z| > threshold (typically 2 or 3) are considered anomalies.
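A minimal z-score filter with NumPy; the threshold of 3 is the conventional choice, and the data (a normal sample with two injected outliers) is synthetic for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), [8.0, -7.5]])  # two injected outliers

z = (data - data.mean()) / data.std()
anomalies = data[np.abs(z) > 3]             # |z| > 3 flagged as anomalous

print("Anomalies found:", anomalies)
```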

Interquartile Range (IQR) Method

Uses quartiles to identify outliers:

  • Q1: 25th percentile
  • Q3: 75th percentile
  • IQR = Q3 - Q1
  • Outliers: < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
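The same idea with the IQR rule, again on a toy array chosen only to contain one obvious outlier:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])  # toy values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("IQR bounds:", (lower, upper), "outliers:", outliers)
```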

Machine Learning Methods

Isolation Forest

Isolates anomalies by randomly selecting features and split values. Anomalies are easier to isolate and require fewer splits.

One-Class SVM

Learns a boundary around normal data points. Points outside this boundary are considered anomalies.

Local Outlier Factor (LOF)

Compares local density of a point to the local densities of its neighbors. Points with significantly lower density are outliers.
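Here is a brief sketch of two of these model-based detectors in scikit-learn; the `contamination` value is a placeholder encoding the expected share of anomalies, and the data (a dense normal cloud plus a few scattered points) is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # dense "normal" cloud
               rng.uniform(-6, 6, (10, 2))])    # a few scattered outliers

iso = IsolationForest(contamination=0.05, random_state=42)
iso_labels = iso.fit_predict(X)                 # -1 = anomaly, 1 = normal

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)                 # same -1 / 1 convention

print("Isolation Forest anomalies:", int((iso_labels == -1).sum()))
print("LOF anomalies:", int((lof_labels == -1).sum()))
```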

Evaluation Metrics for Unsupervised Learning

Clustering Evaluation

Internal Metrics (No ground truth needed)

  • Silhouette Score: Measures how similar points are to their cluster vs other clusters
  • Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance
  • Davies-Bouldin Index: Average similarity of each cluster to its most similar cluster (lower is better)

External Metrics (Ground truth available)

  • Adjusted Rand Index (ARI): Measures similarity to true clustering
  • Normalized Mutual Information (NMI): Information shared between clusterings
  • Homogeneity and Completeness: Cluster purity and coverage
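All of these scores are available in `sklearn.metrics`. The sketch below computes one internal and two external metrics for a K-Means fit on labeled synthetic data, where the generated labels stand in for ground truth; the dataset and parameters are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, adjusted_rand_score,
                             normalized_mutual_info_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette (internal):", silhouette_score(X, labels))
print("ARI (external):       ", adjusted_rand_score(y_true, labels))
print("NMI (external):       ", normalized_mutual_info_score(y_true, labels))
```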

Dimensionality Reduction Evaluation

  • Explained Variance Ratio: Proportion of variance preserved
  • Reconstruction Error: Difference between original and reconstructed data
  • Downstream Task Performance: How well reduced data works for specific tasks

Real-World Applications

Customer Segmentation

  • Objective: Group customers with similar behaviors
  • Techniques: K-Means, Hierarchical Clustering
  • Features: Purchase history, demographics, website behavior
  • Business Value: Targeted marketing, personalized recommendations

Fraud Detection

  • Objective: Identify suspicious transactions
  • Techniques: Isolation Forest, One-Class SVM, DBSCAN
  • Features: Transaction amount, frequency, location, time
  • Business Value: Reduce financial losses, improve security

Gene Expression Analysis

  • Objective: Identify gene groups with similar expression patterns
  • Techniques: Hierarchical Clustering, PCA
  • Features: Gene expression levels across conditions
  • Scientific Value: Understand biological processes, drug discovery

Recommendation Systems

  • Objective: Suggest relevant items to users
  • Techniques: Association Rules, Clustering, Matrix Factorization
  • Features: User ratings, item features, interaction history
  • Business Value: Increase sales, improve user experience

Best Practices and Tips

Data Preprocessing

  1. Scale Features: Use StandardScaler or MinMaxScaler
  2. Handle Missing Values: Imputation or removal
  3. Remove Outliers: Consider impact on clustering
  4. Feature Selection: Remove irrelevant or redundant features
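As a small sketch of the scaling and imputation steps, the pipeline below fills missing values and standardizes features before clustering; the tiny toy matrix and the median strategy are illustrative assumptions, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[1.0, 200.0],
              [2.0, np.nan],      # missing value to be imputed
              [1.5, 180.0],
              [8.0, 900.0]])

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),   # handle missing values
    StandardScaler(),                   # put features on the same scale
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print("Cluster labels:", labels)
```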

Algorithm Selection

  1. Start Simple: Begin with K-Means or PCA
  2. Consider Data Shape: Spherical vs arbitrary cluster shapes
  3. Evaluate Multiple Algorithms: Compare results
  4. Domain Knowledge: Use business understanding to guide choices

Hyperparameter Tuning

  1. Grid Search: Systematic parameter exploration
  2. Cross-Validation: Use appropriate validation strategies
  3. Multiple Metrics: Don't rely on single evaluation metric
  4. Stability: Check if results are consistent across runs

Interpretation and Validation

  1. Visualize Results: Use plots to understand patterns
  2. Business Validation: Do results make business sense?
  3. Statistical Significance: Test if patterns are meaningful
  4. Actionability: Can insights be translated to actions?

What's Next?

In Part 4 of our series, we'll cover model evaluation and validation:

  • Cross-validation techniques
  • Bias-variance tradeoff
  • Overfitting and underfitting
  • Performance metrics for different problem types
  • Model selection strategies
  • Statistical significance testing

We'll also discuss how to properly evaluate and compare different machine learning models.

Key Takeaways

  1. Unsupervised learning discovers hidden patterns without labeled data
  2. Clustering groups similar data points for segmentation and analysis
  3. Dimensionality reduction simplifies data while preserving important information
  4. Evaluation is challenging without ground truth labels
  5. Domain knowledge is crucial for interpreting results
  6. Multiple techniques often needed to fully understand data
  7. Preprocessing is critical for good results

Unsupervised learning opens up exciting possibilities for data exploration and discovery. While more challenging than supervised learning, it can reveal insights that weren't obvious from the start and help you understand your data in new ways.