Machine Learning Fundamentals: Part 3 - Unsupervised Learning and Clustering

December 15, 2023
9 min read
Jinu Nyachhyon

Welcome to Part 3 of our Machine Learning Fundamentals series! After covering supervised learning algorithms, we now turn to unsupervised learning - the art of finding hidden patterns in data without labeled examples.

What is Unsupervised Learning?

Unsupervised learning is like being a detective - you have clues (data) but no clear answers (labels). The algorithm must discover hidden structures, patterns, and relationships in the data on its own. This makes it both challenging and exciting, as you never know what insights you might uncover.

Types of Unsupervised Learning

1. Clustering

Groups similar data points together.

  • Goal: Find natural groupings in data
  • Examples: Customer segmentation, gene sequencing, image segmentation

2. Dimensionality Reduction

Reduces the number of features while preserving important information.

  • Goal: Simplify data while retaining key patterns
  • Examples: Data visualization, noise reduction, feature extraction

3. Association Rule Learning

Finds relationships between different variables.

  • Goal: Discover rules that describe relationships
  • Examples: Market basket analysis, recommendation systems

4. Anomaly Detection

Identifies unusual or outlier data points.

  • Goal: Find data points that don't fit normal patterns
  • Examples: Fraud detection, network security, quality control

Clustering Algorithms

K-Means Clustering

K-Means is one of the most widely used clustering algorithms. It partitions the data into k clusters, each represented by the mean (centroid) of its points.

How it Works

  1. Choose the number of clusters (k)
  2. Randomly initialize k cluster centers (centroids)
  3. Assign each data point to the nearest centroid
  4. Update centroids to the mean of assigned points
  5. Repeat steps 3-4 until convergence

Mathematical Foundation

The algorithm minimizes the within-cluster sum of squares (WCSS):

$$WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:

  • $C_i$ is the i-th cluster
  • $\mu_i$ is the centroid of cluster i
  • $||x - \mu_i||^2$ is the squared Euclidean distance between point x and its centroid
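To make the algorithm concrete, here is a minimal sketch using scikit-learn's `KMeans` on synthetic data; the blob-generation parameters, the choice of k=3, and the variable names are illustrative assumptions rather than anything tied to a specific dataset. The fitted model's `inertia_` attribute is the WCSS objective defined above.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2-D data with three natural groupings (illustrative parameters)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Steps 1-5 above (initialization, assignment, centroid updates, convergence)
# all happen inside fit_predict()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("WCSS (inertia):", kmeans.inertia_)   # the objective minimized above
```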

Advantages

  • Simple and fast
  • Works well with spherical clusters
  • Scales well to large datasets
  • Guaranteed to converge (though possibly to a local optimum)

Disadvantages

  • Requires choosing k in advance
  • Sensitive to initialization
  • Assumes spherical clusters
  • Sensitive to outliers

Choosing the Right k

  • Elbow Method: Plot WCSS vs k, look for the "elbow"
  • Silhouette Analysis: Measure how similar points are to their own cluster vs other clusters
  • Gap Statistic: Compare WCSS to random data
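As a sketch of the first two approaches, the loop below records the WCSS and the silhouette score for a range of candidate k values on synthetic data; the range 2-10 and the data itself are arbitrary illustrations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss, silhouettes = [], []
for k in range(2, 11):                      # candidate values of k (illustrative range)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                # plot against k and look for the "elbow"
    silhouettes.append(silhouette_score(X, km.labels_))

best_k = range(2, 11)[silhouettes.index(max(silhouettes))]
print("k with the highest silhouette score:", best_k)
```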

Hierarchical Clustering

Hierarchical clustering creates a tree-like structure of clusters.

Types

  1. Agglomerative (Bottom-up): Start with individual points, merge similar clusters
  2. Divisive (Top-down): Start with all points, recursively split clusters

Linkage Criteria

  • Single Linkage: Minimum distance between clusters
  • Complete Linkage: Maximum distance between clusters
  • Average Linkage: Average distance between all pairs
  • Ward Linkage: Minimizes within-cluster variance
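Here is a brief sketch of agglomerative clustering with Ward linkage, using scikit-learn for the flat clustering and SciPy for the dendrogram. Both APIs are standard, but the synthetic data and the choice of 3 clusters are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Flat clustering with Ward linkage (minimizes within-cluster variance)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# Full merge tree for visual inspection
plt.figure(figsize=(8, 4))
dendrogram(linkage(X, method="ward"))
plt.title("Agglomerative clustering dendrogram (Ward linkage)")
plt.show()
```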

Advantages

  • No need to specify number of clusters beforehand
  • Creates a hierarchy of clusters
  • Deterministic results
  • Works with any distance measure

Disadvantages

  • Computationally expensive: O(n³) time in the general case
  • Sensitive to noise and outliers
  • Difficult to handle large datasets
  • Hard to undo previous steps

DBSCAN (Density-Based Spatial Clustering)

DBSCAN groups together points in high-density areas and marks points in low-density areas as outliers.

Key Parameters

  • eps (ε): Maximum distance between two points to be neighbors
  • min_samples: Minimum number of points to form a dense region

Point Types

  • Core Points: Have at least min_samples neighbors within eps distance
  • Border Points: Within eps distance of a core point but not core themselves
  • Noise Points: Neither core nor border points (outliers)
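A minimal DBSCAN sketch follows; the `eps` and `min_samples` values are placeholders that would normally be tuned (for example with a k-distance plot), and the two-moons data is chosen only because K-Means handles that shape poorly. Points labeled -1 are the noise points described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrarily shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                         # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))
```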

Advantages

  • Automatically determines number of clusters
  • Can find arbitrarily shaped clusters
  • Robust to outliers
  • Identifies noise points

Disadvantages

  • Sensitive to hyperparameters
  • Struggles with varying densities
  • High-dimensional data challenges
  • Memory intensive for large datasets

Dimensionality Reduction

Principal Component Analysis (PCA)

PCA finds the directions (principal components) that capture the most variance in the data.

How it Works

  1. Standardize the data
  2. Compute the covariance matrix
  3. Calculate eigenvalues and eigenvectors
  4. Sort by eigenvalue magnitude
  5. Select top k eigenvectors as principal components
  6. Transform data to new coordinate system

Mathematical Foundation

PCA finds the projection that maximizes variance:

$$\max_w \text{Var}(X \cdot w) \quad \text{subject to} \quad ||w||^2 = 1$$
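The steps above are bundled into scikit-learn's `PCA`. The sketch below standardizes the Iris features first (PCA is scale-sensitive) and reports how much variance each component explains; the choice of 2 components is purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data

# Step 1: standardize so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-6: covariance, eigendecomposition, and projection happen inside fit_transform
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)    # (150, 2)
```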

Advantages

  • Reduces dimensionality while preserving variance
  • Removes correlation between features
  • Helps with visualization
  • Reduces computational complexity

Disadvantages

  • Linear transformation only
  • Components may not be interpretable
  • Requires scaling of features
  • May lose important information

Choosing Number of Components

  • Explained Variance Ratio: Keep components that explain 80-95% of variance
  • Scree Plot: Look for the "elbow" in eigenvalue plot
  • Cross-Validation: Use downstream task performance
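For the explained-variance approach, scikit-learn lets you pass a fraction as `n_components`, which keeps just enough components to reach that share of the variance; the 0.95 threshold below is one common but arbitrary choice, and the digits dataset is used only as an example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_digits().data)

# Keep the smallest number of components explaining >= 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Cumulative variance explained:", np.cumsum(pca.explained_variance_ratio_)[-1])
```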

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is excellent for visualizing high-dimensional data in 2D or 3D.

How it Works

  1. Calculate pairwise similarities in high-dimensional space
  2. Initialize random low-dimensional representation
  3. Calculate pairwise similarities in low-dimensional space
  4. Minimize divergence between high and low-dimensional similarities
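A short t-SNE sketch for visualization; perplexity is the main knob, and 30 (the library default) is written out explicitly only for clarity. The digits dataset is an arbitrary example of high-dimensional input.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# Embed 64-dimensional digit images into 2-D for plotting
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(digits.data)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=8)
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```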

Advantages

  • Excellent for visualization
  • Preserves local structure well
  • Can reveal clusters and patterns
  • Non-linear dimensionality reduction

Disadvantages

  • Computationally expensive
  • Non-deterministic results
  • Hyperparameter sensitive
  • Cannot embed new, unseen data points without re-running the algorithm

Association Rule Learning

Market Basket Analysis

Finds relationships between items frequently bought together.

Key Metrics

  • Support: Frequency of itemset in dataset
  • Confidence: Likelihood of consequent given antecedent
  • Lift: How much more likely consequent is given antecedent
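To ground the three metrics, the toy computation below counts them directly over a handful of hypothetical baskets; the item names, basket contents, and the chosen rule are made up purely for illustration.

```python
# Hypothetical transactions for illustration only
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk", "eggs"},
    {"milk"},
]

antecedent, consequent = {"bread", "butter"}, {"milk"}
n = len(baskets)

support_a  = sum(antecedent <= b for b in baskets) / n                 # P(A)
support_c  = sum(consequent <= b for b in baskets) / n                 # P(C)
support_ac = sum((antecedent | consequent) <= b for b in baskets) / n  # P(A and C)

confidence = support_ac / support_a      # P(C | A)
lift       = confidence / support_c      # how much A boosts C over its base rate

print(f"support={support_ac:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```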

Example Rules

  • {Bread, Butter} → {Milk} (Support: 0.3, Confidence: 0.8, Lift: 1.2)
  • {Diapers} → {Beer} (Famous example from retail analytics)

Applications

  • Recommendation systems
  • Cross-selling strategies
  • Website navigation analysis
  • Protein sequence analysis in bioinformatics

Anomaly Detection

Statistical Methods

Z-Score Method

Identifies points that are many standard deviations away from the mean:

$$z = \frac{x - \mu}{\sigma}$$

Points with |z| > threshold (typically 2 or 3) are considered anomalies.
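A minimal z-score filter with NumPy; the threshold of 3 is the conventional choice, and the data (a normal sample with two injected outliers) is synthetic for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), [8.0, -7.5]])  # two injected outliers

z = (data - data.mean()) / data.std()
anomalies = data[np.abs(z) > 3]             # |z| > 3 flagged as anomalous

print("Anomalies found:", anomalies)
```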

Interquartile Range (IQR) Method

Uses quartiles to identify outliers:

  • Q1: 25th percentile
  • Q3: 75th percentile
  • IQR = Q3 - Q1
  • Outliers: < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
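The same idea with the IQR rule, again on a toy array chosen only to contain one obvious outlier:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])  # toy values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("IQR bounds:", (lower, upper), "outliers:", outliers)
```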

Machine Learning Methods

Isolation Forest

Isolates anomalies by randomly selecting features and split values. Anomalies are easier to isolate and require fewer splits.

One-Class SVM

Learns a boundary around normal data points. Points outside this boundary are considered anomalies.

Local Outlier Factor (LOF)

Compares local density of a point to the local densities of its neighbors. Points with significantly lower density are outliers.
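Here is a brief sketch of two of these model-based detectors in scikit-learn; the `contamination` value is a placeholder encoding the expected share of anomalies, and the data (a dense normal cloud plus a few scattered points) is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # dense "normal" cloud
               rng.uniform(-6, 6, (10, 2))])    # a few scattered outliers

iso = IsolationForest(contamination=0.05, random_state=42)
iso_labels = iso.fit_predict(X)                 # -1 = anomaly, 1 = normal

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)                 # same -1 / 1 convention

print("Isolation Forest anomalies:", int((iso_labels == -1).sum()))
print("LOF anomalies:", int((lof_labels == -1).sum()))
```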

Evaluation Metrics for Unsupervised Learning

Clustering Evaluation

Internal Metrics (No ground truth needed)

  • Silhouette Score: Measures how similar points are to their cluster vs other clusters
  • Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance
  • Davies-Bouldin Index: Average similarity of each cluster to its most similar cluster (lower is better)

External Metrics (Ground truth available)

  • Adjusted Rand Index (ARI): Measures similarity to true clustering
  • Normalized Mutual Information (NMI): Information shared between clusterings
  • Homogeneity and Completeness: Cluster purity and coverage
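All of these scores are available in `sklearn.metrics`. The sketch below computes one internal and two external metrics for a K-Means fit on labeled synthetic data, where the generated labels stand in for ground truth; the dataset and parameters are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, adjusted_rand_score,
                             normalized_mutual_info_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette (internal):", silhouette_score(X, labels))
print("ARI (external):       ", adjusted_rand_score(y_true, labels))
print("NMI (external):       ", normalized_mutual_info_score(y_true, labels))
```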

Dimensionality Reduction Evaluation

  • Explained Variance Ratio: Proportion of variance preserved
  • Reconstruction Error: Difference between original and reconstructed data
  • Downstream Task Performance: How well reduced data works for specific tasks

Real-World Applications

Customer Segmentation

  • Objective: Group customers with similar behaviors
  • Techniques: K-Means, Hierarchical Clustering
  • Features: Purchase history, demographics, website behavior
  • Business Value: Targeted marketing, personalized recommendations

Fraud Detection

  • Objective: Identify suspicious transactions
  • Techniques: Isolation Forest, One-Class SVM, DBSCAN
  • Features: Transaction amount, frequency, location, time
  • Business Value: Reduce financial losses, improve security

Gene Expression Analysis

  • Objective: Identify gene groups with similar expression patterns
  • Techniques: Hierarchical Clustering, PCA
  • Features: Gene expression levels across conditions
  • Scientific Value: Understand biological processes, drug discovery

Recommendation Systems

  • Objective: Suggest relevant items to users
  • Techniques: Association Rules, Clustering, Matrix Factorization
  • Features: User ratings, item features, interaction history
  • Business Value: Increase sales, improve user experience

Best Practices and Tips

Data Preprocessing

  1. Scale Features: Use StandardScaler or MinMaxScaler
  2. Handle Missing Values: Imputation or removal
  3. Remove Outliers: Consider impact on clustering
  4. Feature Selection: Remove irrelevant or redundant features
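As a small sketch of the scaling and imputation steps, the pipeline below fills missing values and standardizes features before clustering; the tiny toy matrix and the median strategy are illustrative assumptions, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[1.0, 200.0],
              [2.0, np.nan],      # missing value to be imputed
              [1.5, 180.0],
              [8.0, 900.0]])

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),   # handle missing values
    StandardScaler(),                   # put features on the same scale
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print("Cluster labels:", labels)
```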

Algorithm Selection

  1. Start Simple: Begin with K-Means or PCA
  2. Consider Data Shape: Spherical vs arbitrary cluster shapes
  3. Evaluate Multiple Algorithms: Compare results
  4. Domain Knowledge: Use business understanding to guide choices

Hyperparameter Tuning

  1. Grid Search: Systematic parameter exploration
  2. Cross-Validation: Use appropriate validation strategies
  3. Multiple Metrics: Don't rely on single evaluation metric
  4. Stability: Check if results are consistent across runs

Interpretation and Validation

  1. Visualize Results: Use plots to understand patterns
  2. Business Validation: Do results make business sense?
  3. Statistical Significance: Test if patterns are meaningful
  4. Actionability: Can insights be translated to actions?

What's Next?

In Part 4 of our series, we'll cover model evaluation and validation:

  • Cross-validation techniques
  • Bias-variance tradeoff
  • Overfitting and underfitting
  • Performance metrics for different problem types
  • Model selection strategies
  • Statistical significance testing

We'll also discuss how to properly evaluate and compare different machine learning models.

Key Takeaways

  1. Unsupervised learning discovers hidden patterns without labeled data
  2. Clustering groups similar data points for segmentation and analysis
  3. Dimensionality reduction simplifies data while preserving important information
  4. Evaluation is challenging without ground truth labels
  5. Domain knowledge is crucial for interpreting results
  6. Multiple techniques often needed to fully understand data
  7. Preprocessing is critical for good results

Unsupervised learning opens up exciting possibilities for data exploration and discovery. While more challenging than supervised learning, it can reveal insights that weren't obvious from the start and help you understand your data in new ways.