Advanced Machine Learning

Prof. Dr. Ullrich Köthe, SS 2021

Machine learning is one of the most promising approaches to address difficult decision and regression problems under uncertainty. The general idea is very simple: Instead of modeling a solution explicitly, a domain expert provides example data that demonstrate the desired behavior on representative problem instances. A suitable machine learning algorithm is then trained on these examples to reproduce the expert's solutions as well as possible and to generalize them to new, unseen data. The last two decades have seen tremendous progress towards ever more powerful algorithms. This course builds upon the introductory material of last semester's Fundamentals of Machine Learning course and covers the most important advanced concepts and methods.

The lecture belongs to the Master of Applied Informatics program, but is also recommended for students pursuing a Master of Physics (specialisation Computational Physics) or a Master in Scientific Computing, and for anyone interested.

Solid knowledge of linear algebra, analysis (multi-dimensional differentiation and integration) and probability theory is required. Participants should be familiar with the fundamental concepts from last semester's "Fundamentals of Machine Learning" course or an equivalent (for example, you can prepare yourself with Fred Hamprecht's Pattern Recognition Video Lecture).

Dates:

Lecture	Wednesdays	11:15-12:45	TBA (possibly online)
Lecture	Fridays	11:15-12:45	TBA (possibly online)
Tutorials	online
Please sign up for the lecture via Muesli.

Homework assignments and other course material will be published on Moodle (not yet active).

Recommended Textbooks:

Christopher M. Bishop: "Pattern Recognition and Machine Learning", 738 pages, Springer, 2006
David Barber: "Bayesian Reasoning and Machine Learning", 720 pages, Cambridge University Press, 2012 (online version)
Kevin P. Murphy: "Machine Learning -- A Probabilistic Perspective", 1105 pages, The MIT Press, 2012
Trevor Hastie, Robert Tibshirani, Jerome Friedman: "The Elements of Statistical Learning" (2nd edition), 745 pages, Springer, 2009 (online version)
Richard O. Duda, Peter E. Hart, David G. Stork: "Pattern Classification" (2nd edition), 680 pages, Wiley, 2000 (online version)

Special Topics

Ian Goodfellow, Yoshua Bengio and Aaron Courville: "Deep Learning", 801 pages, The MIT Press, 2016 (online version)
Carl Edward Rasmussen and Christopher Williams: "Gaussian Processes for Machine Learning", 266 pages, The MIT Press, 2006 (online version)
Judea Pearl: "Causality" (2nd edition), 484 pages, Cambridge University Press, 2009
Daphne Koller, Nir Friedman: "Probabilistic Graphical Models", 1280 pages, The MIT Press, 2009 (advanced methods)

Approximate Content:

(subject to yearly change)

Lessons 1 and 2: Recapitulation
Textbooks: Duda/Hart/Stork: sections 2.5, 2.6 and chapter 5; Hastie/Tibshirani/Friedman: chapter 4; Bishop: chapter 4

Machine learning basics: classification vs. regression, supervised vs. unsupervised
Mathematical notation of the lecture
Types of classifiers: decision rules, discriminative models, generative models
A linear generative model: linear discriminant analysis (LDA)
A linear decision rule: perceptron (perceptron loss and gradient descent)
An improved linear decision rule: linear support vector machines (max-margin decisions, hinge loss, the SVM optimization problem)

Lessons 3 and 4: Logistic Regression
Bottou (2012): "Stochastic Gradient Descent Tricks" (PDF)
Minka (2003): "A comparison of numerical optimizers for logistic regression" (PDF)

A linear discriminative model: logistic regression (LR)
- Logistic loss and the regularized LR optimization problem
- Gradient descent and stochastic gradient descent algorithms for LR
- Iterated reweighted least-squares (IRLS)
- The dual optimization problem of LR and its solution by dual coordinate ascent
- Why is stochastic gradient descent fast on big data?
- Variants of stochastic gradient descent: momentum, mini-batches, learning rate control etc.
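
As a rough illustration of the stochastic gradient descent approach to regularized logistic regression listed above, the following minimal sketch uses only the Python standard library. The function names, the toy data, and the hyper-parameter values (`lr`, `l2`, `epochs`) are assumptions made for this example and are not taken from the lecture.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logistic_regression(X, y, lr=0.1, l2=0.01, epochs=500, seed=0):
    """Fit weights w and bias b for labels y in {0, 1}.

    Minimizes the mean logistic loss plus (l2 / 2) * ||w||^2, visiting the
    instances in a freshly shuffled order in every epoch.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # derivative of the logistic loss w.r.t. the score
            w = [wj - lr * (err * xj + l2 * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err   # the bias is not regularized
    return w, b

def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical toy data: two well separated clusters in 2D
X = [[0.0, 0.2], [0.3, 0.1], [0.2, 0.4], [1.0, 0.9], [0.8, 1.1], [1.2, 0.7]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic_regression(X, y)
```

Each update touches a single instance, so the cost per step does not depend on the size of the training set. Mini-batches or momentum would change only the inner loop.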

Lessons 5 to 11: Neural Networks
Textbooks: Goodfellow, Bengio, Courville: Part II; Bishop: chapter 5; Hastie/Tibshirani/Friedman: chapter 11; Murphy: section 16.5; Duda/Hart/Stork: chapter 6
Nielsen (2014): "Neural Networks and Deep Learning" (great online book)
"Theano Tutorials" (great introduction to neural networks using the Theano software framework)
He et al. (2015): "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (PReLU activation function, PDF)
Clevert et al. (2016): "Fast and Accurate Deep Network Learning by Exponential Linear Units" (ELU activation function, PDF)
Ronneberger et al. (2015): "U-Net: Convolutional Networks for Biomedical Image Segmentation" (PDF)
Ruder (2016): "An overview of gradient descent optimization algorithms" (online version)
Wilson and Martinez (2003): "The general inefficiency of batch training for gradient descent learning" (PDF)
Ioffe and Szegedy (2015): "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (PDF)
Srivastava et al. (2014): "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (PDF)
Gao and Zhou (2014): "Dropout Rademacher Complexity of Deep Neural Networks" (PDF)

Motivation: non-linearly combine linear models in parallel and in series to overcome the limitations of single linear models
History: perceptron, backpropagation, unsupervised pre-training, dropout, piecewise linear activation functions
Definition of single neurons and activation functions
Multi-layer architecture, hidden neurons, bias neurons, weights, activation functions, notation
Theoretical results on performance: VC dimension, universal approximation theorem
Training by Backpropagation:
- Detailed derivation of the backpropagation algorithm
- Loss functions (quadratic, logistic, cross-entropy) and their gradients
Modern activation functions (ReLU, PReLU, ELU) and their gradients
Convolutional Neural Networks (CNNs), U-Nets
Training Tricks:
- Mini-batch stochastic gradient descent, momentum, AdaDelta, Adam
- Hyper-parameter selection, termination criterion
- Weight initialization, weight decay
- Batch normalization
- Dropout and the resulting reduction in Rademacher complexity
- Data augmentation
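
The activation functions named above (ReLU, PReLU, ELU) and their derivatives can be written down in a few lines. This is a sketch using the usual textbook definitions. The default values `a=0.25` and `alpha=1.0` are common choices assumed here for illustration, and the derivative at exactly zero follows an arbitrary convention.

```python
import math

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Convention: the derivative at x == 0 is taken to be 0
    return 1.0 if x > 0 else 0.0

def prelu(x, a=0.25):
    # Like ReLU, but with a learnable slope a for negative inputs
    return x if x > 0 else a * x

def prelu_grad(x, a=0.25):
    # Derivative with respect to the input x
    return 1.0 if x > 0 else a

def prelu_grad_a(x, a=0.25):
    # Derivative with respect to the slope parameter a
    return 0.0 if x > 0 else x

def elu(x, alpha=1.0):
    # Smoothly saturates at -alpha for large negative inputs
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)
```

All three functions are the identity for positive inputs, so their gradient there is 1. They differ only in how they treat negative inputs.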

Lessons 12 to 14: Gaussian Processes
Textbooks: Rasmussen and Williams; Barber: chapter 19; Bishop: chapter 6 (specifically, section 6.4); Murphy: chapter 15
Unser (1999): "Splines: A Perfect Fit for Signal and Image Processing" (PDF1, PDF2)
Snoek et al. (2012): "Practical Bayesian Optimization of Machine Learning Algorithms" (PDF)

Approaches to learning when the iid assumption is violated:
- Approximately recover the iid assumption in an augmented feature space (define new features by filters over the neighborhood of each instance)
- Graphical models (explicitly model more complex independence assumptions => next chapter)
- Gaussian processes (learn the full joint probability as a multivariate Gaussian)
Definition of Gaussian processes (Gaussian distribution over a function space, prior assumption modeling via a kernel Hilbert space norm, finite multi-dimensional marginals)
Derivation of the fundamental interpolation equation and its variance via the conditional distribution of new data, given training data
Modeling uncertainty in the training data, Gaussian process regression, relation of GPs to kernel ridge regression
Example: linear interpolation
Kernel functions:
- General rules: Mercer's condition, constructing new kernels from existing ones
- Radial basis functions for scattered data (squared exponential, gamma-exponential, Matérn, inverse quadrics, thin-plate splines, Wendland splines)
- Tensor product kernels for data on a grid (squared exponential, B-splines, cardinal functions), implementation of grid-based GP interpolation by filtering
Application to hyper-parameter optimization (Snoek et al. 2012)
Application to classification (mapping a latent function through a sigmoid, optimization via the Laplace approximation and Newton's method)
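
For orientation, the interpolation equation and its variance mentioned above take the following standard form for a zero-mean Gaussian process. The notation is an assumption of this sketch and may differ from the lecture: $K_{ij} = k(x_i, x_j)$ is the kernel matrix of the training inputs, $(\mathbf{k}_*)_i = k(x_i, x_*)$ relates them to a new input $x_*$, and $\sigma^2$ is the noise variance of the training targets $\mathbf{y}$.

```latex
\begin{aligned}
\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix}
  &\sim \mathcal{N}\!\left(\mathbf{0},\;
  \begin{pmatrix} K + \sigma^2 I & \mathbf{k}_* \\
                  \mathbf{k}_*^\top & k(x_*, x_*) \end{pmatrix}\right), \\
\mathbb{E}[f_* \mid \mathbf{y}]
  &= \mathbf{k}_*^\top \left(K + \sigma^2 I\right)^{-1} \mathbf{y}, \\
\operatorname{Var}[f_* \mid \mathbf{y}]
  &= k(x_*, x_*) - \mathbf{k}_*^\top \left(K + \sigma^2 I\right)^{-1} \mathbf{k}_* .
\end{aligned}
```

Setting $\sigma^2 = 0$ gives noise-free interpolation. For $\sigma^2 > 0$ the posterior mean has the same form as the kernel ridge regression predictor with regularization strength $\sigma^2$.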

Lessons 15 and 16: Bayesian Network Introduction
Textbooks: Pearl: chapter 1; Barber: section 3.3; Bishop: sections 8.1 and 8.2; Murphy: chapter 10; Koller and Friedman: chapter 3
Peters (2015): "Causality - Lecture Notes" (PDF)

Pitfalls of data dependencies: Alice's boys and Simpson's paradox (Berkeley admission example)
Decompositions of joint probabilities according to Bayes rule or in terms of Gibbs distributions
Graphical models, fundamental tasks in graphical models (marginalization, MAP solution, decision making)
Directed graphical models, Bayesian networks
Pearl's basic network construction algorithm, decomposition of a joint probability into parent-child factors
Fundamental configurations (causal chain, common cause, common effect) and their independence properties
Berkson's paradox ("explaining away" effect), burglary alarm example
Markov properties, d-separation, faithfulness
Efficient algorithm to check for d-separation in a given DAG

Lessons 17 and 18: Inference in Bayesian Networks
Textbooks: Barber: sections 5.1 and 5.2; Bishop: section 8.4; Murphy: chapter 20; Koller and Friedman: chapters 8 to 11
Kschischang, Frey, Loeliger (2001): "Factor Graphs and the Sum-Product Algorithm" (PDF)
Aji and McEliece (2000): "The generalized distributive law" (PDF)

Basic inference algorithm to compute marginals: variable elimination
Principles and examples (burglary alarm, the naive Bayes classifier)
Exponential complexity of variable elimination, complexity reduction by means of the distributive law
Example: chain graphical model, reduction of inference to matrix exponentiation
Belief propagation: factor graphs, message passing, sum-product algorithm
Alice's boys reconsidered
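
To illustrate how the distributive law reduces the cost of variable elimination on a chain graphical model, the sketch below computes the marginal of the last variable once by repeated vector-matrix products and once by summing the full joint distribution. The function names and the two-state example are hypothetical, and every variable is assumed to have the same number of states.

```python
from itertools import product

def chain_marginal(prior, transitions):
    """Marginal of the last variable in a chain x_0 -> x_1 -> ... -> x_T.

    prior[i] = P(x_0 = i); transitions[t][i][j] = P(x_{t+1} = j | x_t = i).
    Each step eliminates one variable with a vector-matrix product, so the
    cost is O(T * K^2) for K states instead of O(K^T).
    """
    msg = list(prior)
    for A in transitions:
        msg = [sum(msg[i] * A[i][j] for i in range(len(msg)))
               for j in range(len(A[0]))]
    return msg

def brute_force_marginal(prior, transitions):
    """Same marginal, obtained by enumerating all joint configurations."""
    K = len(prior)
    T = len(transitions)
    out = [0.0] * K
    for states in product(range(K), repeat=T + 1):
        p = prior[states[0]]
        for t in range(T):
            p *= transitions[t][states[t]][states[t + 1]]
        out[states[-1]] += p
    return out

# Hypothetical example: the same transition matrix applied five times
prior = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
fast = chain_marginal(prior, [A] * 5)
slow = brute_force_marginal(prior, [A] * 5)
```

When all transition matrices are identical, as here, the result is the prior multiplied by a power of that matrix.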

Lessons 19 to 21: Temporal Bayesian Networks
Textbooks: Barber: sections 11.2 (EM algorithm), 23.2 and 23.3 (HMMs); Bishop: section 13.2; Murphy: chapter 17 (HMMs) and section 11.4 (EM algorithm)
Welch (2003): "Hidden Markov Models and the Baum–Welch Algorithm" (PDF)

Markov chains, probabilistic finite state machines
PageRank algorithm, the Google matrix
Hidden Markov models
Belief propagation in HMMs, the forward-backward algorithm
Algebraic requirements of belief propagation, sum-product semiring, min-sum semiring, min-sum algorithm
Finding the MAP configuration of an HMM using the min-sum algorithm, the Viterbi algorithm
Showing the equivalence between belief propagation and explicit recursion: forward filtering, predictor-corrector method
Supervised and unsupervised learning of an HMM
Baum-Welch algorithm, expectation maximization
Initialization of the Baum-Welch algorithm: random initialization vs. Viterbi counting
The relationship between observations, marginals and MAP solutions in HMMs
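
The link between the min-sum algorithm and the Viterbi algorithm can be made concrete with a small sketch. Working with negative log-probabilities turns products into sums and maximization into minimization. The function name, the parameter layout, and the example HMM below are assumptions made for illustration.

```python
import math

def viterbi_min_sum(init, trans, emit, observations):
    """MAP state sequence of an HMM via the min-sum algorithm.

    init[i] = P(z_0 = i), trans[i][j] = P(z_{t+1} = j | z_t = i),
    emit[i][o] = P(x_t = o | z_t = i).
    Returns the best path and its cost (negative log joint probability).
    """
    def cost(p):
        return -math.log(p) if p > 0 else math.inf

    K = len(init)
    # Forward pass: lowest cost of any path ending in each state
    best = [cost(init[i]) + cost(emit[i][observations[0]]) for i in range(K)]
    backptr = []
    for o in observations[1:]:
        new_best, ptr = [], []
        for j in range(K):
            cands = [best[i] + cost(trans[i][j]) for i in range(K)]
            i_star = min(range(K), key=cands.__getitem__)
            ptr.append(i_star)
            new_best.append(cands[i_star] + cost(emit[j][o]))
        best = new_best
        backptr.append(ptr)
    # Backward pass: follow the stored argmins from the best final state
    state = min(range(K), key=best.__getitem__)
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best[path[-1]]

# Hypothetical HMM with two hidden states and three observation symbols
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, path_cost = viterbi_min_sum(init, trans, emit, [0, 1, 2])
```

Replacing `min` by a sum over probabilities in the forward pass would give the forward half of the forward-backward algorithm. This is the sense in which both are instances of belief propagation over different semirings.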

Lessons 22 to 27: Causality Analysis with Bayesian Networks
Textbooks: Pearl: chapters 2 and 3; Barber: section 3.4; Murphy: sections 26.4 to 26.6; Koller and Friedman: chapters 14 to 16
Peters (2015): "Causality - Lecture Notes" (PDF)
Le et al. (2015): "A fast PC algorithm for high dimensional causal discovery with multi-core PCs" (PDF)
Eberhardt, Glymour, Scheines (2005): "On the Number of Experiments Sufficient and in the Worst Case Necessary to Identify All Causal Relations Among N Variables" (PDF)
Claassen, Mooij, Heskes (2013): "Learning Sparse Causal Models is not NP-hard" (PDF)
Gretton et al. (2007): "A kernel statistical test of independence" (PDF) and newer papers on this topic
Chickering (2002): "Optimal Structure Identification With Greedy Search" - score-based approximate BN construction (PDF)
Peters, Mooij, Janzing, Schölkopf (2014): "Causal Discovery with Continuous Additive Noise Models" - great tutorial on causality, RESIT algorithm (PDF)
Greenland and Robins (1988): "Identifiability, exchangeability, and epidemiological confounding" (PDF)
Rosenbaum and Rubin (1983): "The Central Role of the Propensity Score in Observational Studies for Causal Effects" (PDF)
Austin (2011): "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies" (PDF)
Daumé et al. (2007, 2010): "Frustratingly Easy Domain Adaptation" (PDF) and "Frustratingly easy semi-supervised domain adaptation" (PDF)
Bottou et al. (2013): "Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising" (PDF)
Bareinboim and Pearl (2013): "A general algorithm for deciding transportability of experimental results" (PDF)

Introduction: fundamental understanding vs. statistical experiments vs. passive observation
Example: Cholera epidemics in London ~1850
Fundamental problem of causal inference: impossibility of rewinding time to observe alternative outcomes
Role of Bayesian networks in causal analysis:

Basic tasks: prediction, counterfactual analysis, decision making
Reichenbach's common cause principle
Modeling interventions in a Bayesian network, Pearl's 'do' operator
Distinction between conditional and interventional distributions, definition of the total causal effect
Markov equivalence classes, skeleton and moral graph, Markov minimality
True causal graphs: reproduction of both observational and interventional distributions, the minimal true causal graph

How to identify causality from data?
- The idealized IC algorithm
- Optimization of the IC algorithm: the PC and the parallel stable PC algorithms
- Examples for BN construction
- The minimum number of experiments needed to completely identify a BN
Conditional independence tests: the G-test and its shortcomings, kernel-based independence tests
Structural equation models
Approximation algorithms for BN construction:
- Move-making (score-based) algorithms
- The RESIT algorithm: causality detection by testing independence between predictors and residuals
Parameter estimation of a BN
Avoiding omitted variable bias in causal analysis:
- Missing mediators in direct effect analysis, example: Berkeley admission
- Missing confounders in total effect analysis, example: kidney stone treatments
- Potential outcomes and exchangeability of treatment groups, the bias of naive conditional expectations
- Randomized experiments as gold standard
- Valid adjustment sets and backdoor adjustment
- Stratification
- Propensity scores, weight adjustment, propensity score matching
Transfer learning:
- How to combine unsatisfactory data in the target domain with good data from a related domain?
- "Frustratingly easy domain adaptation" (EasyAdapt, EasyAdapt++)
- Data augmentation (adaptation of the training set to the new domain)
- Importance sampling for counterfactual queries, example: computational advertising
- Causal theory of transportability (adjustment formula for selection bias and transfer bias)
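
The kidney stone example and the backdoor adjustment listed under omitted variable bias can be illustrated numerically. The counts below are the figures commonly quoted for this example in the literature, not data taken from the lecture. The sketch assumes that stone size is the only confounder, so that it forms a valid adjustment set, and the function names are hypothetical.

```python
# (recovered, treated) counts per treatment and stone size; commonly quoted
# figures for the kidney stone example, assumed here for illustration
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
sizes = ("small", "large")
treatments = ("A", "B")

def naive_rate(treatment):
    """P(recovery | treatment): pools all patients, ignoring stone size."""
    rec = sum(counts[(treatment, s)][0] for s in sizes)
    tot = sum(counts[(treatment, s)][1] for s in sizes)
    return rec / tot

def adjusted_rate(treatment):
    """P(recovery | do(treatment)) by backdoor adjustment over stone size.

    Weights the per-stratum recovery rates by the marginal distribution of
    stone size in the whole population, not within the treatment group.
    """
    n_all = sum(n for _, n in counts.values())
    rate = 0.0
    for s in sizes:
        p_size = sum(counts[(t, s)][1] for t in treatments) / n_all
        rec, tot = counts[(treatment, s)]
        rate += (rec / tot) * p_size
    return rate

for t in treatments:
    print(t, round(naive_rate(t), 3), round(adjusted_rate(t), 3))
```

With these counts the naive comparison favors treatment B, while the adjusted comparison favors treatment A. This reversal is the same effect as in Simpson's paradox.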
