Fundamentals of Machine Learning

PD Dr. Ullrich Köthe, WS 2017/18

This lecture belongs to the Master in Physics (specialisation Computational Physics, code "MVSpec") and the Master of Applied Informatics (code "IFML") programs, but is also open to students of Scientific Computing and anyone else interested.

Solid basic knowledge of linear algebra, analysis (multi-dimensional differentiation and integration), and probability theory is required.

Summary:

Machine learning is one of the most promising approaches to address difficult decision and regression problems under uncertainty. The general idea is very simple: Instead of modeling a solution explicitly, a domain expert provides example data that demonstrate the desired behavior on representative problem instances. A suitable machine learning algorithm is then trained on these examples to reproduce the expert's solutions as well as possible and generalize them to new, unseen data. The last two decades have seen tremendous progress towards ever more powerful algorithms, and the course will cover the fundamental ideas from this field.

Dates:

Lecture:
  • Wednesdays 14:15-15:45: Hörsaal Mathematikon (only 18. Oct); Hörsaal COS, INF 360 (starting 25. Oct)
  • Fridays 11:15-12:45: Hörsaal Mathematikon; 12:30-14:00: Hörsaal Chemie, INF 252 East
Tutorials:
  • Wednesdays 12:30-14:00: Hörsaal Chemie, INF 252 West
  • Fridays 11:00-12:15: Hörsaal Mathematikon, INF 205

Please register for the lecture via MÜSLI.

Homework Assignments:

Please download homework assignments and upload your solutions via Moodle.

Textbooks:

General
  • Script for the lecture (edited by Manuel Haussmann)
  • Trevor Hastie, Robert Tibshirani, Jerome Friedman: "The Elements of Statistical Learning" (2nd edition), 745 pages, Springer, 2009 (recommended, online version)
  • Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani: "An Introduction to Statistical Learning", 426 pages, Springer, 2013 (example-based, less mathematical version of "The Elements of Statistical Learning", online version)
  • Richard O. Duda, Peter E. Hart, David G. Stork: "Pattern Classification" (2nd edition), 680 pages, Wiley, 2000 (online version)
  • David Barber: "Bayesian Reasoning and Machine Learning", 720 pages, Cambridge University Press, 2012 (online version)
  • Christopher M. Bishop: "Pattern Recognition and Machine Learning", 738 pages, Springer, 2006
  • Kevin P. Murphy: "Machine Learning - A Probabilistic Perspective", 1105 pages, The MIT Press, 2012
Special Topics
  • Charles L. Lawson, Richard J. Hanson: "Solving Least Squares Problems", 350 pages, Society for Industrial and Applied Mathematics, 1987 (best introduction to ordinary and non-negative least-squares, QR decomposition etc.)
  • Sabine Van Huffel, Joos Vandewalle: "The Total Least Squares Problem: Computational Aspects and Analysis", 288 pages, Society for Industrial and Applied Mathematics, 1991 (total least squares)
  • George A. F. Seber, Christopher J. Wild: "Nonlinear Regression", 792 pages, Wiley, 2003
  • Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai: "Robust Statistics: Theory and Methods", 403 pages, Wiley, 2006 (good introduction to robust loss functions and robust regression)
  • Bernhard Schölkopf, Alexander Smola: "Learning with Kernels", 648 pages, The MIT Press, 2001 (everything about support vector machines)

Contents (29 lessons):

Lesson 1: Introduction
  • What is "machine learning" and why do we care?
  • Learning problems (classification, regression, forecasting,...)
  • Types of training data (supervised, unsupervised, weakly supervised)
  • Modeling via joint probabilities and their factorization into tractable form
  • Sources of uncertainty
Lesson 2: Classification Basics
cf. Duda/Hart/Stork: sections 2.2 to 2.4
  • Problem statement, confusion matrix
  • Upper bound on the error: pure guessing
  • Lower bound on the error: Bayesian decision rule
  • Discriminative vs. generative models
  • Example calculations of Bayesian error rates
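As a numerical illustration of the Bayes error discussed in this lesson, a minimal Python sketch (the two Gaussian class likelihoods, their parameters, and the equal priors are illustrative assumptions, not an example from the lecture):

    # Bayes error for two classes with univariate Gaussian likelihoods.
    # The minimum-error rule assigns x to the class with the larger posterior
    # p(y|x) ~ p(x|y) p(y); the Bayes error integrates the smaller of the two
    # weighted likelihoods over all x.
    import numpy as np
    from scipy.stats import norm

    prior = np.array([0.5, 0.5])          # assumed equal priors
    like = [norm(loc=-1.0, scale=1.0),    # p(x | y=0), assumed
            norm(loc=+1.0, scale=1.0)]    # p(x | y=1), assumed

    x = np.linspace(-10, 10, 20001)
    weighted = np.vstack([prior[k] * like[k].pdf(x) for k in (0, 1)])
    bayes_error = np.trapz(weighted.min(axis=0), x)
    print(f"Bayes error rate: {bayes_error:.4f}")   # ~0.1587 for these parameters
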
Lessons 3 and 4: Nearest-Neighbor Classification
cf. Duda/Hart/Stork: section 4.5
  • Nearest neighbor decision rule
  • Empirical error analysis via cross validation
  • Asymptotic error analysis
  • Finite sample error analysis (with Mathematica software demo: see notebook PDF from the lecture)
  • Drawbacks of NN classification and how to address them
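A minimal NumPy sketch of the nearest-neighbor decision rule combined with a k-fold cross-validation estimate of its error (the toy data set, the number of folds, and the random seeds are illustrative assumptions):

    # Nearest-neighbor classification with a k-fold cross-validation
    # estimate of its error rate (brute-force distances, NumPy only).
    import numpy as np

    def nn_predict(X_train, y_train, X_test):
        # squared Euclidean distances between every test and training point
        d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
        return y_train[d2.argmin(axis=1)]   # label of the nearest neighbor

    def cv_error(X, y, n_folds=5, seed=0):
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), n_folds)
        errs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            y_hat = nn_predict(X[train], y[train], X[test])
            errs.append(np.mean(y_hat != y[test]))
        return np.mean(errs)

    # toy data: two Gaussian blobs (assumed for illustration)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
    y = np.repeat([0, 1], 100)
    print("estimated NN error:", cv_error(X, y))
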
Lessons 5 and 6: Quadratic and Linear Discriminant Analysis
cf. Murphy: section 4.2; Duda/Hart/Stork: Sections 2.5 and 2.6 (QDA); Hastie/Tibshirani/Friedman: section 4.3; Bishop: section 4.1 (LDA)
  • Nearest mean classifier
  • Spherical vs. elliptic clusters
  • QDA training: fitting of one Gaussian likelihood per class using the maximum likelihood method
  • QDA prediction: nearest mean classification using a Mahalanobis distance
  • LDA training and prediction: simplification of QDA when all clusters have the same shape
  • Relation of LDA to least-squares regression
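A compact sketch of QDA along the lines listed above: maximum-likelihood fitting of one Gaussian per class and prediction via the quadratic discriminant, which contains the Mahalanobis distance (NumPy only; the interface and variable names are assumptions, not the lecture's notation):

    # QDA: fit mean, covariance, and prior per class by maximum likelihood,
    # then classify by the largest log-posterior (up to a common constant).
    import numpy as np

    def qda_fit(X, y):
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            Sigma = np.cov(Xc, rowvar=False, bias=True)   # ML estimate (1/N)
            params[c] = (mu, Sigma, len(Xc) / len(X))
        return params

    def qda_predict(params, X):
        scores = []
        for c, (mu, Sigma, prior) in params.items():
            diff = X - mu
            maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
            logdet = np.linalg.slogdet(Sigma)[1]
            scores.append(-0.5 * maha - 0.5 * logdet + np.log(prior))
        classes = np.array(list(params.keys()))
        return classes[np.argmax(np.vstack(scores), axis=0)]

    # usage: params = qda_fit(X_train, y_train); y_hat = qda_predict(params, X_test)
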
Lesson 7: Logistic Regression
cf. Hastie/Tibshirani/Friedman: section 4.4; Bishop: section 4.3;
Elkan: "Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training" (PDF)
  • Derivation of the logistic function as the posterior of LDA
  • Logistic regression as maximum likelihood optimization of this posterior
  • Relationship between the generative LDA model and the discriminative LR model
  • Learning of LR model using stochastic gradient descent
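A minimal sketch of logistic-regression training by stochastic gradient descent on the negative log-likelihood of the posterior (the learning rate, the number of epochs, and the 0/1 label coding are illustrative assumptions):

    # Logistic regression: maximize the log-likelihood of the posterior
    # p(y=1|x) = sigma(w^T x + b) using stochastic gradient descent.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logreg(X, y, lr=0.1, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1]); b = 0.0
        for _ in range(epochs):
            for i in rng.permutation(len(X)):          # one sample per step
                p = sigmoid(X[i] @ w + b)
                grad = p - y[i]                        # d(-log lik)/d(score)
                w -= lr * grad * X[i]
                b -= lr * grad
        return w, b

    # usage: y must be coded as 0/1; predict with sigmoid(X @ w + b) > 0.5
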
Lessons 8 and 9: Naive Bayes Classifier and Density Tree
cf. Barber: chapter 10 (Naive Bayes);
Ram and Gray: "Density estimation trees" (PDF)
  • Motivation of generative non-parametric methods: the 'generative vs. discriminative' × 'parametric vs. non-parametric' matrix and where the methods covered in the lecture so far fit in
  • The 1-D case: histograms
  • Histogram learning: least-squares optimization of bin responses, choice of bin size
  • The failure of multi-dimensional histograms for large d
  • The naive Bayes classifier
  • Density trees
  • Density forests
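A small sketch of a naive Bayes classifier whose per-feature factors are 1-D histograms, as covered in this lesson (the bin count and the epsilon used for empty bins are assumptions; density trees and forests are not shown):

    # Naive Bayes: model p(x|y) as a product of per-feature 1-D histograms,
    # then classify by the largest log p(y) + sum_d log p(x_d | y).
    import numpy as np

    def nb_fit(X, y, n_bins=10):
        edges = [np.histogram_bin_edges(X[:, d], bins=n_bins) for d in range(X.shape[1])]
        model = {}
        for c in np.unique(y):
            Xc = X[y == c]
            dens = []
            for d in range(X.shape[1]):
                h, _ = np.histogram(Xc[:, d], bins=edges[d], density=True)
                dens.append(h + 1e-12)                # avoid log(0) in empty bins
            model[c] = (dens, len(Xc) / len(X))
        return model, edges

    def nb_predict(model, edges, X):
        classes = list(model.keys())
        scores = np.zeros((len(classes), len(X)))
        for k, c in enumerate(classes):
            dens, prior = model[c]
            scores[k] = np.log(prior)
            for d in range(X.shape[1]):
                bins = np.clip(np.digitize(X[:, d], edges[d][1:-1]), 0, len(dens[d]) - 1)
                scores[k] += np.log(dens[d][bins])
        return np.array(classes)[scores.argmax(axis=0)]

    # usage: model, edges = nb_fit(X_train, y_train); y_hat = nb_predict(model, edges, X_test)
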
Lessons 10 and 11: Regression, Ordinary Least Squares, Weighted and Total Least Squares (covered in three lectures)
cf. Hastie/Tibshirani/Friedman: section 3.2; Bishop: section 3.1; Murphy: section 7.3; Lawson/Hanson;
Paige and Saunders: "LSQR: An algorithm for sparse linear equations and sparse least squares" (PDF)
Van Huffel/Vandewalle; P. de Groen: "An Introduction to Total Least Squares" (PDF)
  • Definition of the regression problem
  • The zoo of regression methods
  • Ordinary Least Squares:
    • derivation by maximum likelihood
    • basic properties, data centralization
    • pseudo-inverse
    • solution via Cholesky factorization and singular value decomposition (SVD)
    • matrix condition number and its consequences for least-squares solution via Cholesky and SVD
    • the standard algorithm: QR decomposition
    • an iterative algorithm for big sparse matrices: LSQR
    • application: tomographic reconstruction
  • Weighted Least Squares, iteratively re-weighted least squares (IRLS)
  • Total Least Squares, the errors-in-variables model and its solution via SVD
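A brief sketch contrasting ordinary least squares solved via QR decomposition with total least squares solved via an SVD of the augmented matrix [X y] (the toy errors-in-variables data are an assumption; LSQR and weighted least squares are not shown):

    # Ordinary least squares via QR: solve R w = Q^T y for the design matrix X.
    # Total least squares via SVD of the augmented matrix [X  y]: the solution
    # comes from the right singular vector of the smallest singular value.
    import numpy as np

    def ols_qr(X, y):
        Q, R = np.linalg.qr(X)
        return np.linalg.solve(R, Q.T @ y)

    def tls_svd(X, y):
        Z = np.column_stack([X, y])
        _, _, Vt = np.linalg.svd(Z)
        v = Vt[-1]                       # singular vector of the smallest sigma
        return -v[:-1] / v[-1]           # w such that [w, -1] is proportional to v

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0])
    X = rng.normal(size=(200, 2))
    y = X @ w_true + 0.1 * rng.normal(size=200)
    X_noisy = X + 0.1 * rng.normal(size=X.shape)   # errors-in-variables setting
    print("OLS:", ols_qr(X_noisy, y), " TLS:", tls_svd(X_noisy, y))
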
Lessons 12 to 14: Regularized/Constrained Least Squares, Kernel Ridge Regression
cf. Bishop: section 3.2 (bias-variance trade-off), section 6.3 (kernel regression);
Hastie/Tibshirani/Friedman: section 3.4 (ridge regression and LASSO), section 3.8 (LARS algorithm);
Murphy: section 7.5 (ridge regression), sections 13.3 and 13.4 (LASSO), sections 14.4 and 14.7.4 (kernel-based regression);
Saunders, Gammerman and Vovk: "Ridge Regression Learning Algorithm in Dual Variables", 1998 (PDF)
Tropp and Gilbert: "Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit" (PDF)
  • the bias-variance trade-off: derivation and application to ordinary least squares
  • motivation for variable shrinkage
  • ridge regression: problem formulation and solution via QR decomposition and SVD
  • Augmented feature spaces and the kernel trick
    • Mapping data to higher dimensions
    • Example: fitting a circle by minimizing algebraic loss
    • The dual problem of ridge regression
    • Kernelization of the dual
    • Mercer's condition, polynomial and Gaussian kernels
    • Kernel ridge regression and (Nadaraya/Watson) kernel regression
  • sparse regression: L0 and L1 penalties, the LASSO problem
  • geometric interpretation of the L1 penalty
  • orthogonal matching pursuit
  • the LARS algorithm
  • application: topic detection in documents
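A minimal sketch of kernel ridge regression with a Gaussian kernel, solving the dual system (K + lambda*I) alpha = y and predicting through the kernel trick (the kernel width, regularization strength, and toy data are assumptions; LASSO and LARS are not shown):

    # Kernel ridge regression: fit dual coefficients alpha on a kernel matrix K
    # and predict new points through the kernel trick, never forming the
    # augmented feature space explicitly.
    import numpy as np

    def gauss_kernel(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)

    def krr_fit(X, y, lam=1e-2, gamma=1.0):
        K = gauss_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

    def krr_predict(X_train, alpha, X_new, gamma=1.0):
        return gauss_kernel(X_new, X_train, gamma) @ alpha

    # toy 1-D regression problem (assumed)
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (100, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
    alpha = krr_fit(X, y)
    print(krr_predict(X, alpha, np.array([[0.0], [1.5]])))
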
Lesson 15: Non-linear Least Squares
cf. Seber/Wild: chapter 14 (Levenberg-Marquardt algorithm); Hastie/Tibshirani/Friedman: section 9.2 (regression trees)
  • Parametric non-linear least squares: the Levenberg-Marquardt algorithm
  • Nonparametric non-linear least squares: regression trees and forests
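A bare-bones sketch of a Levenberg-Marquardt iteration for parametric non-linear least squares (the exponential model, the damping schedule, and the toy data are illustrative assumptions; regression trees are not shown):

    # Levenberg-Marquardt: damped Gauss-Newton steps
    # (J^T J + mu I) delta = J^T r, with mu decreased after accepted steps
    # and increased after rejected ones.
    import numpy as np

    def model(theta, x):                 # assumed model: a * exp(b * x)
        a, b = theta
        return a * np.exp(b * x)

    def jacobian(theta, x):
        a, b = theta
        return np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])

    def levenberg_marquardt(x, y, theta0, mu=1e-2, n_iter=50):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            r = y - model(theta, x)
            J = jacobian(theta, x)
            delta = np.linalg.solve(J.T @ J + mu * np.eye(len(theta)), J.T @ r)
            if ((y - model(theta + delta, x)) ** 2).sum() < (r ** 2).sum():
                theta, mu = theta + delta, mu * 0.5   # accept step, damp less
            else:
                mu *= 2.0                             # reject step, damp more
        return theta

    rng = np.random.default_rng(0)
    x = np.linspace(0, 2, 50)
    y = 2.0 * np.exp(1.3 * x) + 0.1 * rng.normal(size=50)
    print(levenberg_marquardt(x, y, theta0=[1.0, 1.0]))
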
Lesson 16: Robust Regression
cf. Maronna/Martin/Yohai: chapters 2 and 4; Fischler and Bolles: "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", 1981 (PDF)
  • Robust regression basics
  • Contaminated distributions and the accuracy/robustness trade-off
  • Robust error norms (absolute deviation, Huber loss, bi-weight loss) and their relation to the squared error
  • Robust regression with Huber loss, iteratively reweighted least squares
  • Least absolute deviation regression, reduction to linear programming
  • RANSAC algorithm
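A small sketch of robust linear regression with the Huber loss solved by iteratively reweighted least squares (the threshold, toy data, and outlier fraction are assumptions; RANSAC is not shown):

    # IRLS for Huber-loss regression: repeatedly solve a weighted least-squares
    # problem in which residuals beyond the threshold delta are down-weighted.
    import numpy as np

    def huber_irls(X, y, delta=1.0, n_iter=50):
        w = np.linalg.lstsq(X, y, rcond=None)[0]      # ordinary LS as a start
        for _ in range(n_iter):
            r = y - X @ w
            weights = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
            sw = np.sqrt(weights)
            w = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
        return w

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
    y = X @ np.array([1.0, 3.0]) + 0.1 * rng.normal(size=100)
    y[:10] += 5.0                                     # gross outliers
    print("robust fit:", huber_irls(X, y))
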
Lesson 17: Unsupervised Learning: Linear Dimension Reduction
cf. Hastie/Tibshirani/Friedman: section 14.5.1; Barber: sections 15.2 and 15.3; Bishop: section 12.1; Murphy: sections 12.2 and 12.3
  • Motivation and scope of unsupervised learning
  • Principal Component Analysis (PCA)
    • Model for linear dimension reduction
    • Three ways to derive PCA:
      (i) minimize the squared difference between the kernel matrices in the original and reduced feature space,
      (ii) find uncorrelated features that best explain the data variability,
      (iii) minimize the reconstruction error from the reduced to the original space
    • The PCA algorithm
    • Mapping new (out-of-training) instances to the reduced feature space
    • Whitening
    • Application: eigenfaces
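A short sketch of the PCA algorithm via an SVD of the centered data, including whitening and the projection of out-of-training instances (the toy data and the number of components are assumptions):

    # PCA: center the data, take the top right singular vectors as principal
    # directions, project (and optionally whiten) new points with them.
    import numpy as np

    def pca_fit(X, n_components):
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        components = Vt[:n_components]                    # principal directions
        var = (s[:n_components] ** 2) / len(X)            # explained variances
        return mean, components, var

    def pca_transform(X_new, mean, components, var=None):
        Z = (X_new - mean) @ components.T                 # reduced coordinates
        if var is not None:                               # whitening: unit variance
            Z = Z / np.sqrt(var)
        return Z

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    mean, comps, var = pca_fit(X, n_components=1)
    print(pca_transform(X[:3], mean, comps, var))
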
Lesson 18: Unsupervised Learning: Non-Linear Dimension Reduction
cf. Hastie/Tibshirani/Friedman: section 14.5.4; Barber: section 15.7; Bishop: section 12.3; Murphy: section 14.4.4 (Kernel-PCA)
Roweis and Saul: "Nonlinear dimensionality reduction by locally linear embedding", 2000 (PDF); Hastie/Tibshirani/Friedman: section 14.9 (LLE)
  • Overview of variations of PCA (alternative objective functions, kernel trick, robustness)
  • Kernel-PCA (kPCA)
    • Linear dimension reduction in an augmented feature space
    • Expression of the PCA algorithm in terms of an arbitrary kernel matrix
    • Centering of the kernel matrix
    • How to map out-of-training instances
  • Local Linear Embedding (LLE)
    • Piecewise linear dimension reduction as an example for non-linear dimension reduction
    • Local reconstruction from nearest neighbors
    • Neighborhood-preserving embedding into a low dimensional space
    • How to map out-of-training instances
  • Application: concentric circles and swiss roll datasets
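A compact sketch of kernel PCA with a Gaussian kernel, including the centering of the kernel matrix and the mapping of out-of-training instances (the kernel width and the scaling convention are assumptions; LLE is not shown):

    # Kernel PCA: eigen-decompose the doubly centered kernel matrix and
    # project points using the eigenvectors scaled by 1/sqrt(eigenvalue).
    import numpy as np

    def gauss_kernel(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)

    def kpca_fit(X, n_components, gamma=1.0):
        n = len(X)
        K = gauss_kernel(X, X, gamma)
        one = np.full((n, n), 1.0 / n)
        Kc = K - one @ K - K @ one + one @ K @ one        # center in feature space
        lam, A = np.linalg.eigh(Kc)                       # ascending eigenvalues
        lam, A = lam[::-1][:n_components], A[:, ::-1][:, :n_components]
        return X, K, A / np.sqrt(lam), gamma              # scaled dual coefficients

    def kpca_transform(model, X_new):
        X, K, A, gamma = model
        n = len(X)
        k = gauss_kernel(X_new, X, gamma)
        one_m = np.full((len(X_new), n), 1.0 / n)
        one = np.full((n, n), 1.0 / n)
        kc = k - one_m @ K - k @ one + one_m @ K @ one    # center the new rows
        return kc @ A

    # usage: model = kpca_fit(X_train, 2); Z_new = kpca_transform(model, X_new)
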
Lesson 19: Unsupervised Learning: Alternative Linear Dimension Reduction Methods
cf. Lee and Seung: "Algorithms for Non-negative Matrix Factorization", 2001 (PDF); Hastie/Tibshirani/Friedman: section 14.6 (NMF)
Hyvärinen and Oja: "Independent Component Analysis: A Tutorial", 1999 (PDF); Hastie/Tibshirani/Friedman: section 14.7; Murphy: section 12.6 (ICA)
  • Non-Negative Matrix Factorization (NMF)
    • Motivating example: topic representation by word histograms
    • ML estimation on the basis of Poisson distributions
    • Multiplicative update rules
    • Initialization methods: random, k-means, random Acol
    • Application: sparse parts-based face representation, comparison with eigenfaces
    • How to map out-of-training instances via non-negative LASSO
  • Independent Component Analysis (ICA)
    • Motivating example: cocktail party problem
    • Derivation of the objective via entropy and KL divergence
    • Simplification by whitening of the input data
    • FastICA algorithm: Newton iterations towards an approximated objective
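A minimal sketch of NMF with the Lee/Seung multiplicative update rules; note that this variant minimizes the squared (Frobenius) error rather than the Poisson/KL objective listed above, and uses random initialization (the rank, iteration count, and toy data are assumptions; ICA is not shown):

    # NMF with multiplicative updates for || V - W H ||_F^2:
    #   H <- H * (W^T V) / (W^T W H),   W <- W * (V H^T) / (W H H^T)
    # Both factors stay non-negative because each update multiplies by a
    # non-negative ratio.
    import numpy as np

    def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.uniform(0.1, 1.0, (n, rank))          # random non-negative init
        H = rng.uniform(0.1, 1.0, (rank, m))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    rng = np.random.default_rng(1)
    V = np.abs(rng.normal(size=(30, 20)))             # toy non-negative data
    W, H = nmf(V, rank=5)
    print("reconstruction error:", np.linalg.norm(V - W @ H))
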
Lessons 20 and 21: Unsupervised Learning: Clustering
cf. Hastie/Tibshirani/Friedman: section 14.3; Duda/Hart/Stork: chapter 10; Barber: section 20.3; Murphy: chapter 11, and sections 14.4, 25.5; Bishop: chapter 9
  • Problem definition and overview, cluster representatives (mean, medoid)
  • Hierarchical clustering
    • Generic algorithm
    • Dendrograms
    • Cluster distances (single linkage, complete linkage, average linkage, representative linkage), relationship to minimum spanning tree
    • Pros and cons of the different choices
  • k-Means clustering
    • Minimization of within-cluster distance, maximization of between-cluster distance
    • Special case of Euclidean distance
    • Alternating optimization of representatives and cluster membership
    • Initializations (random vs. k-means++)
    • Variant: k-medoids
    • Heuristics to determine the number of clusters (square root, elbow, jump, cross-validation)
    • Application: fitting lines
  • Expectation-Maximization (EM) algorithm
    • Soft cluster assignments
    • (Gaussian) mixture models
    • Maximum likelihood optimization of a mixture model
    • Alternating optimization of cluster membership and component models
    • High-level interpretation of the EM strategy
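A compact sketch of k-means with random initialization and alternating optimization of assignments and cluster means (k, the toy data, and the stopping rule are assumptions; k-means++, k-medoids, and EM are not shown):

    # k-means: alternate between assigning each point to its nearest mean
    # and recomputing each mean as the centroid of its assigned points.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # random init
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)                           # assignment step
            new_centers = centers.copy()
            for j in range(k):                                   # update step
                if np.any(labels == j):
                    new_centers[j] = X[labels == j].mean(axis=0)
            if np.allclose(new_centers, centers):                # converged
                break
            centers = new_centers
        return centers, labels

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in ((0, 0), (3, 0), (0, 3))])
    centers, labels = kmeans(X, k=3)
    print(centers)
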
Lesson 22: Model Optimism and Model Selection
cf. Borra and Ciaccio: "Estimators of extra-sample error for non-parametric methods. A comparison based on extensive simulations", 2008 (PDF)
Hastie/Tibshirani/Friedman: sections 7.4 (covariance penalty), 3.4 (model degrees-of-freedom)
Cavanaugh: "Unifying the derivations for the Akaike and corrected Akaike information criteria", 1997 (PDF)
Burnham, Anderson, and Huyvaert: "AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons", 2011 (PDF)
  • Definitions of test error and optimism (out-sample, in-sample, expected)
  • Empirical estimation of the test error
    • Cross-validation and its variants (repeated, reversed, corrected CV)
    • Bootstrap sampling and the out-of-bag error
  • Covariance penalty
    • Analytical derivation for linear models
    • Generalization to Cp criterion for ridge regression and LASSO
    • Empirical estimation via Rademacher sampling, permutation sampling and the bootstrap
  • The Akaike criterion
    • KL divergence and the Minimum Description Length principle
    • Derivation and interpretation of AIC and AICc
    • Akaike weights and ensemble methods
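A small sketch of AIC-based model selection among polynomial regressions, using the Gaussian-noise form AIC = n log(RSS/n) + 2k up to an additive constant (the candidate degrees and toy data are assumptions; AICc, covariance penalties, and cross-validation variants are not shown):

    # Model selection with AIC: fit polynomials of increasing degree and
    # pick the one with the smallest AIC = n*log(RSS/n) + 2k, which trades
    # goodness of fit against the number of parameters k.
    import numpy as np

    def aic_poly(x, y, degree):
        coeffs = np.polyfit(x, y, degree)
        rss = ((y - np.polyval(coeffs, x)) ** 2).sum()
        k = degree + 2                    # polynomial coefficients + noise variance
        n = len(x)
        return n * np.log(rss / n) + 2 * k

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 50)
    y = 1.0 + 2.0 * x - 1.5 * x ** 3 + 0.2 * rng.normal(size=50)  # cubic truth
    for d in range(1, 8):
        print(f"degree {d}: AIC = {aic_poly(x, y, d):.1f}")
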
Lesson 23: Risk and Loss
cf. Duda/Hart/Stork: chapter 2; Murphy: section 5.7
  • Definition of risk and loss
  • Bayes risk, minimum-risk decision rule
  • Effect of unbalanced risk on posterior and likelihood ratios
  • Receiver operating characteristic, ROC curve, area-under-curve (AUC)
  • Precision/recall curve, F1 measure
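A brief NumPy sketch that computes a ROC curve and its AUC from classifier scores by sweeping the decision threshold (the toy scores and labels are assumptions):

    # ROC curve: sweep the decision threshold over the sorted scores and
    # record true-positive and false-positive rates; AUC is the area under it.
    import numpy as np

    def roc_curve(scores, labels):
        order = np.argsort(-scores)                     # descending by score
        labels = labels[order]
        tpr = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
        fpr = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
        return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

    def auc(fpr, tpr):
        return np.trapz(tpr, fpr)                       # trapezoidal area

    rng = np.random.default_rng(0)
    labels = np.repeat([1, 0], 100)
    scores = np.concatenate([rng.normal(1, 1, 100), rng.normal(0, 1, 100)])
    fpr, tpr = roc_curve(scores, labels)
    print("AUC:", auc(fpr, tpr))
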
Lessons 24 and 25: Classification: Support Vector Machines (SVM)
cf. Schölkopf/Smola; Hastie/Tibshirani/Friedman: sections 12.2 and 12.3; Bishop: chapter 7; Murphy: section 14.5;
Barber: section 17.5 (basic treatment); Duda/Hart/Stork: section 5.11 (basic treatment);
Hsieh et al.: "A dual coordinate descent method for large-scale linear SVM", 2008 (PDF)
Platt: "Fast Training of Support Vector Machines using Sequential Minimal Optimization", 1998 (PDF)
  • Margin maximization on linearly separable data: intuitive motivation, structured risk minimization, Vapnik-Chervonenkis theory, VC dimension
  • Slack variables and the soft-margin formulation for non-separable data
  • The SVM optimization problem
  • SVM loss functions (hinge loss, squared hinge loss, logistic loss, ...)
  • The dual optimization problem
  • Karush-Kuhn-Tucker (KKT) conditions
  • Sparsity of the support vector set
  • Algorithms: dual coordinate descent for linear SVM (LIBLINEAR), sequential minimal optimization for kernel SVM (LIBSVM), convergence in terms of KKT conditions
  • Example: XOR problem
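A minimal sketch of a linear soft-margin SVM trained by stochastic (sub)gradient descent on the regularized hinge loss; this is a deliberate simplification of the dual coordinate descent (LIBLINEAR) and SMO (LIBSVM) solvers named above (the step-size schedule, regularization constant, and toy data are assumptions):

    # Linear SVM via SGD on lambda/2 ||w||^2 + mean(max(0, 1 - y*(w^T x + b))),
    # i.e. the soft-margin primal with the hinge loss; labels are in {-1, +1}.
    import numpy as np

    def svm_sgd(X, y, lam=0.01, epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1]); b = 0.0
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                t += 1
                eta = 1.0 / (lam * t)                     # decreasing step size
                margin = y[i] * (X[i] @ w + b)
                w *= (1 - eta * lam)                      # shrinkage from the L2 term
                if margin < 1:                            # inside margin: hinge active
                    w += eta * y[i] * X[i]
                    b += eta * y[i]
        return w, b

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 0.7, (100, 2)), rng.normal(+1, 0.7, (100, 2))])
    y = np.repeat([-1, 1], 100)
    w, b = svm_sgd(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
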
Lessons 26 and 27: Classification: Ensemble Methods
cf. Breiman: "Random Forests", 2001 (PDF); Hastie/Tibshirani/Friedman: chapter 15 (random forests);
Hastie/Tibshirani/Friedman: sections 10.1 to 10.5; Duda/Hart/Stork: section 9.5.2; Bishop: section 14.3; Murphy: section 16.4 (boosting)
Viola/Jones: "Rapid object detection using a boosted cascade of simple features", 2001 (PDF)
  • Constructing a strong classifier by combining weak classifiers: independent learning of uncorrelated classifiers, conditional learning of a classifier chain
  • Random Forests
    • Why does an ensemble of uncorrelated classifiers improve performance? How does correlation reduce ensemble performance?
    • Decision trees: CART (minimize Gini impurity), C4.5 (minimize entropy) and their training algorithm
    • Diverse tree training via bootstrapping and random feature selection
    • The out-of-bag error estimate
    • Example: XOR problem
  • Boosting
    • Classifiers that correct the errors of their predecessors: reweighting (AdaBoost), gradient boosting, cascaded classification
    • Boosting as an additive method minimizing exponential loss
    • The AdaBoost algorithm
    • Decision stumps
    • Cascaded classification in highly unbalanced datasets (Viola/Jones)
    • Example: face detection
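A compact sketch of AdaBoost with decision stumps as weak learners, here realized as depth-1 scikit-learn trees (the use of scikit-learn, the number of rounds, and the XOR-like toy data are assumptions; random forests and cascades are not shown):

    # AdaBoost: train depth-1 trees on reweighted data; misclassified samples
    # get larger weights, and each stump enters the ensemble with weight
    # alpha = 0.5 * log((1 - err) / err). Labels are in {-1, +1}.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=50):
        n = len(X)
        w = np.full(n, 1.0 / n)
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)               # up-weight mistakes
            w /= w.sum()
            stumps.append(stump); alphas.append(alpha)
        return stumps, np.array(alphas)

    def predict(stumps, alphas, X):
        votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(votes)

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (300, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # XOR-like problem
    stumps, alphas = adaboost(X, y)
    print("training accuracy:", np.mean(predict(stumps, alphas, X) == y))
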
Lesson 28: Preview of the Lecture "Advanced Machine Learning"
  • Gaussian Process Classifiers
  • Neural Networks
  • Latent Variable Models (Hidden Markov Models, Probabilistic Latent Semantic Analysis)
  • Matrix Decomposition, Dictionary Learning
  • Graphical Models (Bayesian Networks, Markov Random Fields)
  • Structured Learning
  • Feature Selection, Feature Learning (Convolutional Neural Networks, Kernel Approximation)
  • Weakly Supervised Learning (Multiple Instance Learning, Semi-Supervised Learning, Active Learning)
  • Reinforcement Learning