Fundamentals of Machine Learning

PD Dr. Ullrich Köthe, WS 2017/18

This lecture belongs to the Master in Physics (specialisation Computational Physics, code "MVSpec") and the Master of Applied Informatics (code "IFML") programs, but is also open to students of Scientific Computing and anyone else interested.

Solid basic knowledge of linear algebra, analysis (multi-dimensional differentiation and integration), and probability theory is required.

Summary:

Machine learning is one of the most promising approaches to address difficult decision and regression problems under uncertainty. The general idea is very simple: Instead of modeling a solution explicitly, a domain expert provides example data that demonstrate the desired behavior on representative problem instances. A suitable machine learning algorithm is then trained on these examples to reproduce the expert's solutions as well as possible and generalize them to new, unseen data. The last two decades have seen tremendous progress towards ever more powerful algorithms, and the course will cover the fundamental ideas from this field.

Dates:

Lecture:
  • Wednesdays 14:15-15:45: Hörsaal Mathematikon (only 18. Oct); Hörsaal COS, INF 360 (starting 25. Oct)
  • Fridays 11:15-12:45: Hörsaal Mathematikon; 12:30-14:00: Hörsaal Chemie, INF 252 East
Tutorials:
  • Wednesdays 12:30-14:00: Hörsaal Chemie, INF 252 West
  • Fridays 11:00-12:15: Hörsaal Mathematikon, INF 205

Please register for the lecture via MÜSLI.

Homework Assignments:

Please download homework assignments and upload your solutions via Moodle.

Textbooks:

General
  • Script for the lecture (edited by Manuel Haussmann)
  • Trevor Hastie, Robert Tibshirani, Jerome Friedman: "The Elements of Statistical Learning" (2nd edition), 745 pages, Springer, 2009 (recommended, online version)
  • Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani: "An Introduction to Statistical Learning", 426 pages, Springer, 2013 (example-based, less mathematical version of "The Elements of Statistical Learning", online version)
  • Richard O. Duda, Peter E. Hart, David G. Stork: "Pattern Classification" (2nd edition), 680 pages, Wiley, 2000 (online version)
  • David Barber: "Bayesian Reasoning and Machine Learning", 720 pages, Cambridge University Press, 2012 (online version)
  • Christopher M. Bishop: "Pattern Recognition and Machine Learning", 738 pages, Springer, 2006
  • Kevin P. Murphy: "Machine Learning - A Probabilistic Perspective", 1105 pages, The MIT Press, 2012
Special Topics
  • Charles L. Lawson, Richard J. Hanson: "Solving Least Squares Problems", 350 pages, Society for Industrial and Applied Mathematics, 1987 (best introduction to ordinary and non-negative least-squares, QR decomposition etc.)
  • Sabine Van Huffel, Joos Vandewalle: "The Total Least Squares Problem: Computational Aspects and Analysis", 288 pages, Society for Industrial and Applied Mathematics, 1991 (total least squares)
  • George A. F. Seber, Christopher J. Wild: "Nonlinear Regression", 792 pages, Wiley, 2003
  • Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai: "Robust Statistics: Theory and Methods", 403 pages, Wiley, 2006 (good introduction to robust loss functions and robust regression)
  • Bernhard Schölkopf, Alexander Smola: "Learning with Kernels", 648 pages, The MIT Press, 2001 (everything about support vector machines)

Contents (29 lessons):

Lesson 1: Introduction
  • What is "machine learning" and why do we care?
  • Learning problems (classification, regression, forecasting,...)
  • Types of training data (supervised, unsupervised, weakly supervised)
  • Modeling via joint probabilities and their factorization into tractable form
  • Sources of uncertainty
Lesson 2: Classification Basics
cf. Duda/Hart/Stork: sections 2.2 to 2.4
  • Problem statement, confusion matrix
  • Upper bound on the error: pure guessing
  • Lower bound on the error: Bayesian decision rule
  • Discriminative vs. generative models
  • Example calculations of Bayesian error rates
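As a numerical illustration of the Bayes error discussed in this lesson, a minimal Python sketch (the two Gaussian class likelihoods, their parameters, and the equal priors are illustrative assumptions, not an example from the lecture):

    # Bayes error for two classes with univariate Gaussian likelihoods.
    # The minimum-error rule assigns x to the class with the larger posterior
    # p(y|x) ~ p(x|y) p(y); the Bayes error integrates the smaller of the two
    # weighted likelihoods over all x.
    import numpy as np
    from scipy.stats import norm

    prior = np.array([0.5, 0.5])          # assumed equal priors
    like = [norm(loc=-1.0, scale=1.0),    # p(x | y=0), assumed
            norm(loc=+1.0, scale=1.0)]    # p(x | y=1), assumed

    x = np.linspace(-10, 10, 20001)
    weighted = np.vstack([prior[k] * like[k].pdf(x) for k in (0, 1)])
    bayes_error = np.trapz(weighted.min(axis=0), x)
    print(f"Bayes error rate: {bayes_error:.4f}")   # ~0.1587 for these parameters
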
Lessons 3 and 4: Nearest-Neighbor Classification
cf. Duda/Hart/Stork: section 4.5
  • Nearest neighbor decision rule
  • Empirical error analysis via cross validation
  • Asymptotic error analysis
  • Finite sample error analysis (with Mathematica software demo: see notebook PDF from the lecture)
  • Drawbacks of NN classification and how to address them
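A minimal NumPy sketch of the nearest-neighbor decision rule combined with a k-fold cross-validation estimate of its error (the toy data set, the number of folds, and the random seeds are illustrative assumptions):

    # Nearest-neighbor classification with a k-fold cross-validation
    # estimate of its error rate (brute-force distances, NumPy only).
    import numpy as np

    def nn_predict(X_train, y_train, X_test):
        # squared Euclidean distances between every test and training point
        d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
        return y_train[d2.argmin(axis=1)]   # label of the nearest neighbor

    def cv_error(X, y, n_folds=5, seed=0):
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), n_folds)
        errs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            y_hat = nn_predict(X[train], y[train], X[test])
            errs.append(np.mean(y_hat != y[test]))
        return np.mean(errs)

    # toy data: two Gaussian blobs (assumed for illustration)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
    y = np.repeat([0, 1], 100)
    print("estimated NN error:", cv_error(X, y))
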
Lessons 5 and 6: Quadratic and Linear Discriminant Analysis
cf. Murphy: section 4.2; Duda/Hart/Stork: Sections 2.5 and 2.6 (QDA); Hastie/Tibshirani/Friedman: section 4.3; Bishop: section 4.1 (LDA)
  • Nearest mean classifier
  • Spherical vs. elliptic clusters
  • QDA training: fitting of one Gaussian likelihood per class using the maximum likelihood method
  • QDA prediction: nearest mean classification using a Mahalanobis distance
  • LDA training and prediction: simplification of QDA when all clusters have the same shape
  • Relation of LDA to least-squares regression
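A compact sketch of QDA along the lines listed above: maximum-likelihood fitting of one Gaussian per class and prediction via the quadratic discriminant, which contains the Mahalanobis distance (NumPy only; the interface and variable names are assumptions, not the lecture's notation):

    # QDA: fit mean, covariance, and prior per class by maximum likelihood,
    # then classify by the largest log-posterior (up to a common constant).
    import numpy as np

    def qda_fit(X, y):
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            Sigma = np.cov(Xc, rowvar=False, bias=True)   # ML estimate (1/N)
            params[c] = (mu, Sigma, len(Xc) / len(X))
        return params

    def qda_predict(params, X):
        scores = []
        for c, (mu, Sigma, prior) in params.items():
            diff = X - mu
            maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
            logdet = np.linalg.slogdet(Sigma)[1]
            scores.append(-0.5 * maha - 0.5 * logdet + np.log(prior))
        classes = np.array(list(params.keys()))
        return classes[np.argmax(np.vstack(scores), axis=0)]

    # usage: params = qda_fit(X_train, y_train); y_hat = qda_predict(params, X_test)
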
Lesson 7: Logistic Regression
cf. Hastie/Tibshirani/Friedman: section 4.4; Bishop: section 4.3;
Elkan: "Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training" (PDF)
  • Derivation of the logistic function as the posterior of LDA
  • Logistic regression as maximum likelihood optimization of this posterior
  • Relationship between the generative LDA model and the discriminative LR model
  • Learning of LR model using stochastic gradient descent
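A minimal sketch of logistic-regression training by stochastic gradient descent on the negative log-likelihood of the posterior (the learning rate, the number of epochs, and the 0/1 label coding are illustrative assumptions):

    # Logistic regression: maximize the log-likelihood of the posterior
    # p(y=1|x) = sigma(w^T x + b) using stochastic gradient descent.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logreg(X, y, lr=0.1, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1]); b = 0.0
        for _ in range(epochs):
            for i in rng.permutation(len(X)):          # one sample per step
                p = sigmoid(X[i] @ w + b)
                grad = p - y[i]                        # d(-log lik)/d(score)
                w -= lr * grad * X[i]
                b -= lr * grad
        return w, b

    # usage: y must be coded as 0/1; predict with sigmoid(X @ w + b) > 0.5
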
Lessons 8 and 9: Naive Bayes Classifier and Density Tree
cf. Barber: chapter 10 (Naive Bayes);
Ram and Gray: "Density estimation trees" (PDF)
  • Motivation of generative non-parametric methods: the 'generative vs. discriminative' × 'parametric vs. non-parametric' matrix and where the methods covered in the lecture so far fit in
  • The 1-D case: histograms
  • Histogram learning: least-squares optimization of bin responses, choice of bin size
  • The failure of multi-dimensional histograms for large d
  • The naive Bayes classifier
  • Density trees
  • Density forests
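A small sketch of a naive Bayes classifier whose per-feature factors are 1-D histograms, as covered in this lesson (the bin count and the epsilon used for empty bins are assumptions; density trees and forests are not shown):

    # Naive Bayes: model p(x|y) as a product of per-feature 1-D histograms,
    # then classify by the largest log p(y) + sum_d log p(x_d | y).
    import numpy as np

    def nb_fit(X, y, n_bins=10):
        edges = [np.histogram_bin_edges(X[:, d], bins=n_bins) for d in range(X.shape[1])]
        model = {}
        for c in np.unique(y):
            Xc = X[y == c]
            dens = []
            for d in range(X.shape[1]):
                h, _ = np.histogram(Xc[:, d], bins=edges[d], density=True)
                dens.append(h + 1e-12)                # avoid log(0) in empty bins
            model[c] = (dens, len(Xc) / len(X))
        return model, edges

    def nb_predict(model, edges, X):
        classes = list(model.keys())
        scores = np.zeros((len(classes), len(X)))
        for k, c in enumerate(classes):
            dens, prior = model[c]
            scores[k] = np.log(prior)
            for d in range(X.shape[1]):
                bins = np.clip(np.digitize(X[:, d], edges[d][1:-1]), 0, len(dens[d]) - 1)
                scores[k] += np.log(dens[d][bins])
        return np.array(classes)[scores.argmax(axis=0)]

    # usage: model, edges = nb_fit(X_train, y_train); y_hat = nb_predict(model, edges, X_test)
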
Lessons 10 and 11: Regression, Ordinary Least Squares, Weighted and Total Least Squares (covered in three lectures)
cf. Hastie/Tibshirani/Friedman: section 3.2; Bishop: section 3.1; Murphy: section 7.3; Lawson/Hanson;
Paige and Saunders: "LSQR: An algorithm for sparse linear equations and sparse least squares" (PDF)
Van Huffel/Vandewalle; P. de Groen: "An Introduction to Total Least Squares" (PDF)
  • Definition of the regression problem
  • The zoo of regression methods
  • Ordinary Least Squares:
    • derivation by maximum likelihood
    • basic properties, data centralization
    • pseudo-inverse
    • solution via Cholesky factorization and singular value decomposition (SVD)
    • matrix condition number and its consequences for least-squares solution via Cholesky and SVD
    • the standard algorithm: QR decomposition
    • an iterative algorithm for big sparse matrices: LSQR
    • application: tomographic reconstruction
  • Weighted Least Squares, iteratively re-weighted least squares (IRLS)
  • Total Least Squares, the errors-in-variables model and its solution via SVD
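A brief sketch contrasting ordinary least squares solved via QR decomposition with total least squares solved via an SVD of the augmented matrix [X y] (the toy errors-in-variables data are an assumption; LSQR and weighted least squares are not shown):

    # Ordinary least squares via QR: solve R w = Q^T y for the design matrix X.
    # Total least squares via SVD of the augmented matrix [X  y]: the solution
    # comes from the right singular vector of the smallest singular value.
    import numpy as np

    def ols_qr(X, y):
        Q, R = np.linalg.qr(X)
        return np.linalg.solve(R, Q.T @ y)

    def tls_svd(X, y):
        Z = np.column_stack([X, y])
        _, _, Vt = np.linalg.svd(Z)
        v = Vt[-1]                       # singular vector of the smallest sigma
        return -v[:-1] / v[-1]           # w such that [w, -1] is proportional to v

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0])
    X = rng.normal(size=(200, 2))
    y = X @ w_true + 0.1 * rng.normal(size=200)
    X_noisy = X + 0.1 * rng.normal(size=X.shape)   # errors-in-variables setting
    print("OLS:", ols_qr(X_noisy, y), " TLS:", tls_svd(X_noisy, y))
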
Lessons 12 to 14: Regularized/Constrained Least Squares, Kernel Ridge Regression
cf. Bishop: section 3.2 (bias-variance trade-off), section 6.3 (kernel regression);
Hastie/Tibshirani/Friedman: section 3.4 (ridge regression and LASSO), section 3.8 (LARS algorithm);
Murphy: section 7.5 (ridge regression), sections 13.3 and 13.4 (LASSO), sections 14.4 and 14.7.4 (kernel-based regression);
Saunders, Gammerman and Vovk: "Ridge Regression Learning Algorithm in Dual Variables", 1998 (PDF)
Tropp and Gilbert: "Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit" (PDF)
  • the bias-variance trade-off: derivation and application to ordinary least squares
  • motivation for variable shrinkage
  • ridge regression: problem formulation and solution via QR decomposition and SVD
  • Augmented feature spaces and the kernel trick
    • Mapping data to higher dimensions
    • Example: fitting a circle by minimizing algebraic loss
    • The dual problem of ridge regression
    • Kernelization of the dual
    • Mercer's condition, polynomial and Gaussian kernels
    • Kernel ridge regression and (Nadaraya/Watson) kernel regression
  • sparse regression: L0 and L1 penalties, the LASSO problem
  • geometric interpretation of the L1 penalty
  • orthogonal matching pursuit
  • the LARS algorithm
  • application: topic detection in documents
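A minimal sketch of kernel ridge regression with a Gaussian kernel, solving the dual system (K + lambda*I) alpha = y and predicting through the kernel trick (the kernel width, regularization strength, and toy data are assumptions; LASSO and LARS are not shown):

    # Kernel ridge regression: fit dual coefficients alpha on a kernel matrix K
    # and predict new points through the kernel trick, never forming the
    # augmented feature space explicitly.
    import numpy as np

    def gauss_kernel(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)

    def krr_fit(X, y, lam=1e-2, gamma=1.0):
        K = gauss_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

    def krr_predict(X_train, alpha, X_new, gamma=1.0):
        return gauss_kernel(X_new, X_train, gamma) @ alpha

    # toy 1-D regression problem (assumed)
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (100, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
    alpha = krr_fit(X, y)
    print(krr_predict(X, alpha, np.array([[0.0], [1.5]])))
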
Lesson 15: Non-linear Least Squares
cf. Seber/Wild: chapter 14 (Levenberg-Marquardt algorithm); Hastie/Tibshirani/Friedman: section 9.2 (regression trees)
  • Parametric non-linear least squares: the Levenberg-Marquardt algorithm
  • Nonparametric non-linear least squares: regression trees and forests
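A bare-bones sketch of a Levenberg-Marquardt iteration for parametric non-linear least squares (the exponential model, the damping schedule, and the toy data are illustrative assumptions; regression trees are not shown):

    # Levenberg-Marquardt: damped Gauss-Newton steps
    # (J^T J + mu I) delta = J^T r, with mu decreased after accepted steps
    # and increased after rejected ones.
    import numpy as np

    def model(theta, x):                 # assumed model: a * exp(b * x)
        a, b = theta
        return a * np.exp(b * x)

    def jacobian(theta, x):
        a, b = theta
        return np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])

    def levenberg_marquardt(x, y, theta0, mu=1e-2, n_iter=50):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            r = y - model(theta, x)
            J = jacobian(theta, x)
            delta = np.linalg.solve(J.T @ J + mu * np.eye(len(theta)), J.T @ r)
            if ((y - model(theta + delta, x)) ** 2).sum() < (r ** 2).sum():
                theta, mu = theta + delta, mu * 0.5   # accept step, damp less
            else:
                mu *= 2.0                             # reject step, damp more
        return theta

    rng = np.random.default_rng(0)
    x = np.linspace(0, 2, 50)
    y = 2.0 * np.exp(1.3 * x) + 0.1 * rng.normal(size=50)
    print(levenberg_marquardt(x, y, theta0=[1.0, 1.0]))
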
Lesson 16: Robust Regression
cf. Maronna/Martin/Yohai: chapters 2 and 4; Fischler and Bolles: "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", 1981 (PDF)
  • Robust regression basics
  • Contaminated distributions and the accuracy/robustness trade-off
  • Robust error norms (absolute deviation, Huber loss, bi-weight loss) and their relation to the squared error
  • Robust regression with Huber loss, iteratively reweighted least squares
  • Least absolute deviation regression, reduction to linear programming
  • RANSAC algorithm
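A small sketch of robust linear regression with the Huber loss solved by iteratively reweighted least squares (the threshold, toy data, and outlier fraction are assumptions; RANSAC is not shown):

    # IRLS for Huber-loss regression: repeatedly solve a weighted least-squares
    # problem in which residuals beyond the threshold delta are down-weighted.
    import numpy as np

    def huber_irls(X, y, delta=1.0, n_iter=50):
        w = np.linalg.lstsq(X, y, rcond=None)[0]      # ordinary LS as a start
        for _ in range(n_iter):
            r = y - X @ w
            weights = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
            sw = np.sqrt(weights)
            w = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
        return w

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
    y = X @ np.array([1.0, 3.0]) + 0.1 * rng.normal(size=100)
    y[:10] += 5.0                                     # gross outliers
    print("robust fit:", huber_irls(X, y))
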
Lesson 17: Unsupervised Learning: Linear Dimension Reduction
cf. Hastie/Tibshirani/Friedman: section 14.5.1; Barber: sections 15.2 and 15.3; Bishop: section 12.1; Murphy: sections 12.2 and 12.3
  • Motivation and scope of unsupervised learning
  • Principal Component Analysis (PCA)
    • Model for linear dimension reduction
    • Three ways to derive PCA:
      (i) minimize the squared difference between the kernel matrices in the original and reduced feature space,
      (ii) find uncorrelated features that best explain the data variability,
      (iii) minimize the reconstruction error from the reduced to the original space
    • The PCA algorithm
    • Mapping new (out-of-training) instances to the reduced feature space
    • Whitening
    • Application: eigenfaces
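A short sketch of the PCA algorithm via an SVD of the centered data, including whitening and the projection of out-of-training instances (the toy data and the number of components are assumptions):

    # PCA: center the data, take the top right singular vectors as principal
    # directions, project (and optionally whiten) new points with them.
    import numpy as np

    def pca_fit(X, n_components):
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        components = Vt[:n_components]                    # principal directions
        var = (s[:n_components] ** 2) / len(X)            # explained variances
        return mean, components, var

    def pca_transform(X_new, mean, components, var=None):
        Z = (X_new - mean) @ components.T                 # reduced coordinates
        if var is not None:                               # whitening: unit variance
            Z = Z / np.sqrt(var)
        return Z

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    mean, comps, var = pca_fit(X, n_components=1)
    print(pca_transform(X[:3], mean, comps, var))
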
Lesson 18: Unsupervised Learning: Non-Linear Dimension Reduction
cf. Hastie/Tibshirani/Friedman: section 14.5.4; Barber: section 15.7; Bishop: section 12.3; Murphy: section 14.4.4 (Kernel-PCA)
Roweis and Saul: "Nonlinear dimensionality reduction by locally linear embedding", 2000 (PDF); Hastie/Tibshirani/Friedman: section 14.9 (LLE)
  • Overview of variations of PCA (alternative objective functions, kernel trick, robustness)
  • Kernel-PCA (kPCA)
    • Linear dimension reduction in an augmented feature space
    • Expression of the PCA algorithm in terms of an arbitrary kernel matrix
    • Centering of the kernel matrix
    • How to map out-of-training instances
  • Local Linear Embedding (LLE)
    • Piecewise linear dimension reduction as an example for non-linear dimension reduction
    • Local reconstruction from nearest neighbors
    • Neighborhood-preserving embedding into a low dimensional space
    • How to map out-of-training instances
  • Application: concentric circles and swiss roll datasets
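A compact sketch of kernel PCA with a Gaussian kernel, including the centering of the kernel matrix and the mapping of out-of-training instances (the kernel width and the scaling convention are assumptions; LLE is not shown):

    # Kernel PCA: eigen-decompose the doubly centered kernel matrix and
    # project points using the eigenvectors scaled by 1/sqrt(eigenvalue).
    import numpy as np

    def gauss_kernel(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)

    def kpca_fit(X, n_components, gamma=1.0):
        n = len(X)
        K = gauss_kernel(X, X, gamma)
        one = np.full((n, n), 1.0 / n)
        Kc = K - one @ K - K @ one + one @ K @ one        # center in feature space
        lam, A = np.linalg.eigh(Kc)                       # ascending eigenvalues
        lam, A = lam[::-1][:n_components], A[:, ::-1][:, :n_components]
        return X, K, A / np.sqrt(lam), gamma              # scaled dual coefficients

    def kpca_transform(model, X_new):
        X, K, A, gamma = model
        n = len(X)
        k = gauss_kernel(X_new, X, gamma)
        one_m = np.full((len(X_new), n), 1.0 / n)
        one = np.full((n, n), 1.0 / n)
        kc = k - one_m @ K - k @ one + one_m @ K @ one    # center the new rows
        return kc @ A

    # usage: model = kpca_fit(X_train, 2); Z_new = kpca_transform(model, X_new)
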
Lesson 19: Unsupervised Learning: Alternative Linear Dimension Reduction Methods
cf. Lee and Seung: "Algorithms for Non-negative Matrix Factorization", 2001 (PDF); Hastie/Tibshirani/Friedman: section 14.6 (NMF)
Hyvärinen and Oja: "Independent Component Analysis: A Tutorial", 1999 (PDF); Hastie/Tibshirani/Friedman: section 14.7; Murphy: section 12.6 (ICA)
  • Non-Negative Matrix Factorization (NMF)
    • Motivating example: topic representation by word histograms
    • ML estimation on the basis of Poisson distributions
    • Multiplicative update rules
    • Initialization methods: random, k-means, random Acol
    • Application: sparse parts-based face representation, comparison with eigenfaces
    • How to map out-of-training instances via non-negative LASSO
  • Independent Component Analysis (ICA)
    • Motivating example: cocktail party problem
    • Derivation of the objective via entropy and KL divergence
    • Simplification by whitening of the input data
    • FastICA algorithm: Newton iterations towards an approximated objective
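A minimal sketch of NMF with the Lee/Seung multiplicative update rules; note that this variant minimizes the squared (Frobenius) error rather than the Poisson/KL objective listed above, and uses random initialization (the rank, iteration count, and toy data are assumptions; ICA is not shown):

    # NMF with multiplicative updates for || V - W H ||_F^2:
    #   H <- H * (W^T V) / (W^T W H),   W <- W * (V H^T) / (W H H^T)
    # Both factors stay non-negative because each update multiplies by a
    # non-negative ratio.
    import numpy as np

    def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.uniform(0.1, 1.0, (n, rank))          # random non-negative init
        H = rng.uniform(0.1, 1.0, (rank, m))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    rng = np.random.default_rng(1)
    V = np.abs(rng.normal(size=(30, 20)))             # toy non-negative data
    W, H = nmf(V, rank=5)
    print("reconstruction error:", np.linalg.norm(V - W @ H))
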
Lessons 20 and 21: Unsupervised Learning: Clustering
cf. Hastie/Tibshirani/Friedman: section 14.3; Duda/Hart/Stork: chapter 10; Barber: section 20.3; Murphy: chapter 11, and sections 14.4, 25.5; Bishop: chapter 9
  • Problem definition and overview, cluster representatives (mean, medoid)
  • Hierarchical clustering
    • Generic algorithm
    • Dendrograms
    • Cluster distances (single linkage, complete linkage, average linkage, representative linkage), relationship to minimum spanning tree
    • Pros and cons of the different choices
  • k-Means clustering
    • Minimization of within-cluster distance, maximization of between-cluster distance
    • Special case of Euclidean distance
    • Alternating optimization of representatives and cluster membership
    • Initializations (random vs. k-means++)
    • Variant: k-medoids
    • Heuristics to determine the number of clusters (square root, elbow, jump, cross-validation)
    • Application: fitting lines
  • Expectation-Maximization (EM) algorithm
    • Soft cluster assignments
    • (Gaussian) mixture models
    • Maximum likelihood optimization of a mixture model
    • Alternating optimization of cluster membership and component models
    • High-level interpretation of the EM strategy
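A compact sketch of k-means with random initialization and alternating optimization of assignments and cluster means (k, the toy data, and the stopping rule are assumptions; k-means++, k-medoids, and EM are not shown):

    # k-means: alternate between assigning each point to its nearest mean
    # and recomputing each mean as the centroid of its assigned points.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # random init
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)                           # assignment step
            new_centers = centers.copy()
            for j in range(k):                                   # update step
                if np.any(labels == j):
                    new_centers[j] = X[labels == j].mean(axis=0)
            if np.allclose(new_centers, centers):                # converged
                break
            centers = new_centers
        return centers, labels

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in ((0, 0), (3, 0), (0, 3))])
    centers, labels = kmeans(X, k=3)
    print(centers)
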
Lesson 22: Model Optimism and Model Selection
cf. Borra and Ciaccio: "Estimators of extra-sample error for non-parametric methods. A comparison based on extensive simulations", 2008 (PDF)
Hastie/Tibshirani/Friedman: sections 7.4 (covariance penalty), 3.4 (model degrees-of-freedom)
Cavanaugh: "Unifying the derivations for the Akaike and corrected Akaike information criteria", 1997 (PDF)
Burnham, Anderson, and Huyvaert: "AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons", 2011 (PDF)
  • Definitions of test error and optimism (out-sample, in-sample, expected)
  • Empirical estimation of the test error
    • Cross-validation and its variants (repeated, reversed, corrected CV)
    • Bootstrap sampling and the out-of-bag error
  • Covariance penalty
    • Analytical derivation for linear models
    • Generalization to Cp criterion for ridge regression and LASSO
    • Empirical estimation via Rademacher sampling, permutation sampling and the bootstrap
  • The Akaike criterion
    • KL divergence and the Minimum Description Length principle
    • Derivation and interpretation of AIC and AICc
    • Akaike weights and ensemble methods
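A small sketch of AIC-based model selection among polynomial regressions, using the Gaussian-noise form AIC = n log(RSS/n) + 2k up to an additive constant (the candidate degrees and toy data are assumptions; AICc, covariance penalties, and cross-validation variants are not shown):

    # Model selection with AIC: fit polynomials of increasing degree and
    # pick the one with the smallest AIC = n*log(RSS/n) + 2k, which trades
    # goodness of fit against the number of parameters k.
    import numpy as np

    def aic_poly(x, y, degree):
        coeffs = np.polyfit(x, y, degree)
        rss = ((y - np.polyval(coeffs, x)) ** 2).sum()
        k = degree + 2                    # polynomial coefficients + noise variance
        n = len(x)
        return n * np.log(rss / n) + 2 * k

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 50)
    y = 1.0 + 2.0 * x - 1.5 * x ** 3 + 0.2 * rng.normal(size=50)  # cubic truth
    for d in range(1, 8):
        print(f"degree {d}: AIC = {aic_poly(x, y, d):.1f}")
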
Lesson 23: Risk and Loss
cf. Duda/Hart/Stork: chapter 2; Murphy: section 5.7
  • Definition of risk and loss
  • Bayes risk, minimum-risk decision rule
  • Effect of unbalanced risk on posterior and likelihood ratios
  • Receiver operating characteristic, ROC curve, area-under-curve (AUC)
  • Precision/recall curve, F1 measure
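A brief NumPy sketch that computes a ROC curve and its AUC from classifier scores by sweeping the decision threshold (the toy scores and labels are assumptions):

    # ROC curve: sweep the decision threshold over the sorted scores and
    # record true-positive and false-positive rates; AUC is the area under it.
    import numpy as np

    def roc_curve(scores, labels):
        order = np.argsort(-scores)                     # descending by score
        labels = labels[order]
        tpr = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
        fpr = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
        return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

    def auc(fpr, tpr):
        return np.trapz(tpr, fpr)                       # trapezoidal area

    rng = np.random.default_rng(0)
    labels = np.repeat([1, 0], 100)
    scores = np.concatenate([rng.normal(1, 1, 100), rng.normal(0, 1, 100)])
    fpr, tpr = roc_curve(scores, labels)
    print("AUC:", auc(fpr, tpr))
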
Lessons 24 and 25: Classification: Support Vector Machines (SVM)
cf. Schölkopf/Smola; Hastie/Tibshirani/Friedman: sections 12.2 and 12.3; Bishop: chapter 7; Murphy: section 14.5;
Barber: section 17.5 (basic treatment); Duda/Hart/Stork: section 5.11 (basic treatment);
Hsieh et al.: "A dual coordinate descent method for large-scale linear SVM", 2008 (PDF)
Platt: "Fast Training of Support Vector Machines using Sequential Minimal Optimization", 1998 (PDF)
  • Margin maximization on linearly separable data: intuitive motivation, structured risk minimization, Vapnik-Chervonenkis theory, VC dimension
  • Slack variables and the soft-margin formulation for non-separable data
  • The SVM optimization problem
  • SVM loss functions (hinge loss, squared hinge loss, logistic loss, ...)
  • The dual optimization problem
  • Karush-Kuhn-Tucker (KKT) conditions
  • Sparsity of the support vector set
  • Algorithms: dual coordinate descent for linear SVM (LIBLINEAR), sequential minimal optimization for kernel SVM (LIBSVM), convergence in terms of KKT conditions
  • Example: XOR problem
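A minimal sketch of a linear soft-margin SVM trained by stochastic (sub)gradient descent on the regularized hinge loss; this is a deliberate simplification of the dual coordinate descent (LIBLINEAR) and SMO (LIBSVM) solvers named above (the step-size schedule, regularization constant, and toy data are assumptions):

    # Linear SVM via SGD on lambda/2 ||w||^2 + mean(max(0, 1 - y*(w^T x + b))),
    # i.e. the soft-margin primal with the hinge loss; labels are in {-1, +1}.
    import numpy as np

    def svm_sgd(X, y, lam=0.01, epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1]); b = 0.0
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                t += 1
                eta = 1.0 / (lam * t)                     # decreasing step size
                margin = y[i] * (X[i] @ w + b)
                w *= (1 - eta * lam)                      # shrinkage from the L2 term
                if margin < 1:                            # inside margin: hinge active
                    w += eta * y[i] * X[i]
                    b += eta * y[i]
        return w, b

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 0.7, (100, 2)), rng.normal(+1, 0.7, (100, 2))])
    y = np.repeat([-1, 1], 100)
    w, b = svm_sgd(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
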
Lessons 26 and 27: Classification: Ensemble Methods
cf. Breiman: "Random Forests", 2001 (PDF); Hastie/Tibshirani/Friedman: chapter 15 (random forests);
Hastie/Tibshirani/Friedman: sections 10.1 to 10.5; Duda/Hart/Stork: section 9.5.2; Bishop: section 14.3; Murphy: section 16.4 (boosting)
Viola/Jones: "Rapid object detection using a boosted cascade of simple features", 2001 (PDF)
  • Constructing a strong classifier by combining weak classifiers: independent learning of uncorrelated classifiers, conditional learning of a classifier chain
  • Random Forests
    • Why does an ensemble of uncorrelated classifiers improve performance? How does correlation reduce ensemble performance?
    • Decision trees: CART (minimize Gini impurity), C4.5 (minimize entropy) and their training algorithm
    • Diverse tree training via bootstrapping and random feature selection
    • The out-of-bag error estimate
    • Example: XOR problem
  • Boosting
    • Classifiers that correct the errors of their predecessors: reweighting (AdaBoost), gradient boosting, cascaded classification
    • Boosting as an additive method minimizing exponential loss
    • The AdaBoost algorithm
    • Decision stumps
    • Cascaded classification in highly unbalanced datasets (Viola/Jones)
    • Example: face detection
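A compact sketch of AdaBoost with decision stumps as weak learners, here realized as depth-1 scikit-learn trees (the use of scikit-learn, the number of rounds, and the XOR-like toy data are assumptions; random forests and cascades are not shown):

    # AdaBoost: train depth-1 trees on reweighted data; misclassified samples
    # get larger weights, and each stump enters the ensemble with weight
    # alpha = 0.5 * log((1 - err) / err). Labels are in {-1, +1}.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=50):
        n = len(X)
        w = np.full(n, 1.0 / n)
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)               # up-weight mistakes
            w /= w.sum()
            stumps.append(stump); alphas.append(alpha)
        return stumps, np.array(alphas)

    def predict(stumps, alphas, X):
        votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(votes)

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (300, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # XOR-like problem
    stumps, alphas = adaboost(X, y)
    print("training accuracy:", np.mean(predict(stumps, alphas, X) == y))
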
Lesson 28: Preview of the Lecture "Advanced Machine Learning"
  • Gaussian Process Classifiers
  • Neural Networks
  • Latent Variable Models (Hidden Markov Models, Probabilistic Latent Semantic Analysis)
  • Matrix Decomposition, Dictionary Learning
  • Graphical Models (Bayesian Networks, Markov Random Fields)
  • Structured Learning
  • Feature Selection, Feature Learning (Convolutional Neural Networks, Kernel Approximation)
  • Weakly Supervised Learning (Multiple Instance Learning, Semi-Supervised Learning, Active Learning)
  • Reinforcement Learning