Introduction
Table of contents
for demo
Background
The term âMachine Learningâ was coined by MIT alumnus Arthur Samuel^{1} in 1959. It evolved from many fields including Statistical Learning, Pattern Recognition and so on. The goal of machine learning is to make computers âlearnâ from âdataâ^{2}. From an end userâs perspective, it is about understanding your data, make predictions and decisions. Intellectually, it is a collection of models, methods and algorithms that have evolved over more than a halfcentury now.
Historically both disciplines evolved from different perspectives, but with similar end goals. For example, Machine Learning focused on âpredictionâ and âdecisionsâ. It relied on âpatternsâ or âmodelâ learnt in the process to achieve it. Computation has played key role in its evolution. In contrast, Statistics, founded by statisticians such as Pearson and Fisher, focused on âmodel learningâ. To understand and explain âwhyâ behind a phenomenon. Probability has played key role in development of the field. As a concrete example, recall the ideal gas law $PV = nRT$ for Physics. Historically, machine learning only cared about ability to predict $P$ by knowing $V$ and $T$, did not matter how; on the other hand, Statistics did care about the precise form of the relationship between $P, V$ and $T$, in particular it being linear. Having said that, in current day and age, both disciplines are getting closer and closer, daybyday, and this class is such an amalgamation.
Artificial Intelligenceâs stated goal is to mimic human behavior in an intelligent manner, and to do what humans can do but really well, which includes artificial âcreativityâ and driving cars, playing games, responding to consumer questions, etc. Traditionally, the main tools to achieve these goals are ârulesâ and âdecision treesâ. In that sense, Artificial intelligence seeks to create muscle and mind of humans, and mind requires learning from data, i.e. Machine Learning. However, Machine Learning helps learn from data beyond mimicking humans. Having said that, again the boundaries between AI and ML are getting blurry daybyday.
Course Structure
The course contains four parts:

Part I. Supervised Learning. Learning from data to predict.

Part II. Unsupervised Learning. Understanding the structure within the data.

Part III. Probabilistic Modeling. Probabilistic view to model complex scenarios.

Part IV. Decision Making. Using data to make decisions.
Supervised Learning
The goal of supervised learning is to predict target using input / features, and a model is learned to do so. This can be sufficiently summarized as \(\text{\textit{target}} = f ( \text{\textit{features}} )\) For classification tasks, the target is categorical or takes discrete values (e.g. hot or cold). For regression tasks, the target takes any real value (e.g. temperature). The model type reflects our belief about the reality and different model leads to different algorithm. The philosophy of supervised learning is: future of the past equals future of the future.
Examples of classification include: identify handwritten digits, email spam filtering, detecting malicious network connection based on network log information or predicting whether a client will default on her/his credit based on the clientâs features. For example, suppose we have access to a clientâs features or attributes in terms of the (credit card) balance and income. Consider Figure 1. It plots available data with $X$ axis representing (credit card) balance and $Y$ axis representing income. The color of the point is blue if no default and brown if default. Pictorially, the classifier is trying to learn a boundary as shown in Figure 1 which separate no default from default.
Formally, the data are labeled observations of the following form: $(x_1,y_1),\ldots,(x_N,y_N)$. The goal is to learn a model that maps attribute (or feature) $x$ to label (or target) $y$ so that given attribute $x$, we can predict corresponding unknown (discrete) label $y$. That is, to learn a function $f$ such that $y = f(x)$ (and sometimes also whatâs the confidence).
Various approaches for learning $f$ can be categorized as

Linear: Logistic regression, Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Perceptron

Nonlinear (parametric): Quadratic Discriminant Analysis (QDA), Polynomial, Neural Networks

Nonparametric: Kernels, Nearest Neighbors
How to find $f$? Among all possible choices of $f$, choose the one that fits the data the best. That is, solve optimization: empirical risk minimization (ERM):
\[\text{Minimize } \sum_{i=1}^N \mathrm{loss}\left(y_i,f(x_i)\right) \text{ over all possible } f.\]Stochastic Gradient Descent (SGD) is a method to solve this optimization problem. This is where Optimization meets Machine Learning.
6.390 (or equivalent undergraduate class) discusses the âHowâ or âmechanicsâ of such approaches. In this class, we expect that you know the âHowâ for much of supervised learning and decision making. That is, more than $60\%$ of this class. So, what will we do in 6.790 (since $> 60\%$ is already done!)?
To start with, we will learn âWhyâ behind the âHowâ. We will utilize Probability as our formal language. We will discuss estimators and theoretical guarantees, and generalization: does a good model fit on historical data lead to ability to predict future? Finally, we will have 40% of the course discusses unsupervised learning / probabilistic modeling to understand the structure within the data.
To understand âWhyâ, effectively we need to âlogically deduceâ what we do starting with appropriate goals and axioms. The axioms that are relevant are that of Probability. In particular, to reason about what we do in Machine Learning, we will utilize the language of probability. And probability is entirely based on the three key axioms. Formally, there is a probability space $\Omega$, events $\mathcal{F}$ in it, and a probability function $\mathbb{P}:\mathcal{F}\to [0,1]$.

Axiom 1. $\mathbb{P}(A)\geq 0$, for all $A\in\mathcal{F}$.

Axiom 2. $\mathbb{P}(\Omega)=1$.

Axiom 3. $\mathbb{P}(\cup_{i=1}^\infty E_i)=\sum_{i=1}^\infty \mathbb{P}(E_i)$, if $E_i\cap E_j=\varnothing$, for all $i\neq j$.
The above exercise is a simple example of logical deduction starting from the axioms of probability. In a sense, this is what we will do to explain âwhyâ.
Before proceeding further, it is important to wonder â âIs it possible to have a different set of probability axioms?â This is a question hotly debated in the first half of last century. At the end of the day, All roads lead to Rome: All sorts of reasonable hypothesis about beliefs / decision making lead to axioms of probability^{3}.
In the language of probability, both attributes $X$ and labels $Y$ are random variables. Especially, $Y$ is discretevalued random variable. The conditional distribution $\mathbb{P}(YX)$ is of interest. Suppose labels take value $1$ (e.g. default) or $1$ (e.g. no default), given attribute $X=x$. An ideal classier, also known as Bayes classifier, which in the context of binary classification, predicts
\[\hat{Y}(x)= \begin{cases}1 & \text { if } \mathbb{P}(Y=1 \mid X=x) \geq 1 / 2 \\ 0 & \text { otherwise }\end{cases}\]The performance metric of interest is misclassification probability, i.e. $\mathbb{P}(\hat{Y}(X) \neq Y)$.
Probabilistic view will help us understand how to choose the loss function and how well our model generalizes. In terms of generalization and overfitting, you should trust your data, but only so much. Consider the following example: We have observations $(x_i,y_i)$, $i=1,\ldots,n$. Here attributes ${x}_i$ are points distributed uniformly in the unit square. The label is generated according to the following rule: As sketched in the figure below, $y_i=0$ when the corresponding $\mathbf{x}_i$ lies in the shaded square and $y_i=1$ otherwise. The area of the shaded square is $1/2$.
Pretend we do not know the true label rule and would like to to find a model to approximate it based on the observations. The function fit, \(f({x})=\begin{cases} y_i, & \text{if }{x}={x}_i, \\ 0, & \text{otherwise}, \end{cases}\) which assigns every observed points to the correct label $y_i$ and assign all unseen points to $0$, is a perfect fit for the observation. However, since the possibility we encounter the same points in the set ${({x}_i,y_i), i=1,\ldots,n}$ in the future is zero, we will most certainly assign all future points to $0$ and this function is simply as bad as ârandomâ function! This is overfitting.
In order to prevent overfitting, empirically, we use crossvalidation â split data into three parts: train, (validate) and test, or/and $K$fold crossvalidation. To explain why this the right thing to do, we shall discuss the notion of generalization that utilizes the view that data is generated per an unknown underlying probability distribution. Methodically, we shall use regularization and again probabilistic formalism will help explain why (or why not) it works well. Probabilistic view, again will come to our rescue to explain the implicit regularization that is implemented by modern methods (e.g. âdropoutâ) of neural networks.
Some examples of regression include predict wage given age, year, and education level. Formally, the data are labeled observations of the following form: $(x_1,y_1),\ldots,(x_N,y_N)$. The goal is to learn a model that maps attribute (or feature) $x$ to label (or target) $y$ so that given attribute $x$, we can predict corresponding unknown (continuous) label $y$. That is, to learn a function $f$ such that $y = f(x)$ (and sometimes also what is the confidence interval).
In the language of probability, both attributes $X$ and labels $Y$ are random variables. Now, $Y$ is continuousvalued random variable. The conditional distribution $\mathbb{P}(YX)$ is of interest. Given attribute $X=x$, we estimate $\hat{Y}(x)$ to minimize estimation error. One the most common estimation error is $\mathbb{E}\left[(Y\hat{Y}(x))^2X=x\right]$, which is minimized by $\hat{Y}(x)=\mathbb{E}\left[YX=x\right]$. Finally, we should determine predictive distribution. $\mathbb{E}\left[YX=x\right]$ is unknown. The model fit for regression means to find the best fit for $f(x)\approx \mathbb{E}\left[YX=x\right]$ using observed data.
Unsupervised Learning
In unsupervised learning, there is no target. Only input / features are given. The goal is to learn the data distribution. In this course, we are going to cover topics such as dimensionality reduction, matrix estimation, clustering and mixture distribution, and feature extraction (topic model and deep generative model) from unstructured data such as text, audio or image, or for complexity reduction. Examples of unsupervised learning: Finding the principal component of DNA data (dimensionality reduction)Â , movie recommendation (matrix estimation), analyzing topics in documents (feature extraction: topic model), generating fake faces of celebrities (feature extraction: deep generative model).
Probabilistic Modeling
Two important topics in probabilistic modeling are:
 incorporating prior knowledge from Bayesian perspective, and
 sampling from distribution when probabilistic model is complex.
Most of the key tasks in machine learning are inference tasks. For example, in prediction we need to infer $\mathbb{P}(YX)$. In model learning, we need to infer $\mathbb{P}(\text{parameters}\text{data})$.
The Bayesâ rule states that
\[\underset{\text{posterior}}{\mathbb{P}(\text{parameters}\text{data})}\propto \underset{\text{likelihood}}{\mathbb{P}(\text{data}\text{parameters})}\times \underset{\text{prior}}{\mathbb{P}(\text{parameters})}\]The key question is how to select prior? This is the prior knowledge of the world. One of the classical priors is Gaussian distribution, which for example, leads to ridge regularization in regression.
A probability distribution can be complex. It may have succinct representation but no closedform formula, and hence difficult to evaluate. For example, we may know \(\mathbb{P}(X=x)\propto \exp(f(x))=\frac{1}{Z}\exp(f(x)),\) where \(Z=\int \exp(f(x))dx.\)
This integration can be very hard to evaluate for a general $f(x)$. The key algorithm to evaluate on such complex distributions is Markov Chain Monte Carlo (MCMC)^{4} It has specific forms such as Gibbs sampling and MetropolisHastings. MCMC works for generic form of distribution.
Decision Making
In data driven decision making (in presence of uncertainty), we need to learn the model of uncertainty, given observations. The goal is to make âoptimalâ decision with respect to a longterm objective. The decision vs information timescale are critically important. The following diagram summarizes the framework of decision making,
The two key timescales are state or environment dynamics, and information dynamics. Depending on the two timescales, there are methods / approaches including optimizing given model of uncertainty, Markov decision process, and reinforcement learning.
Â  State Dynamics  Information Dynamics 

Optimizing Given Model of Uncertainty  No change (or extremely slow)  Lots of historical information 
Markov Decision Process  High  Lots of historical information 
Reinforcement Learning  High  Minimal information, learn as you go 
One fundamental challenge in reinforcement learning is explore vs exploit: we need to make decisions based on incomplete information about the world; in turn the decision we make affects future data we get to see. so should we stick to a decision that seems best for the moment, or sacrifice nearterm gain to get more information for future longterm decision making.
Another central question in RL is the credit assignment problem. (TO be expanded.)
Reinforcement learning has found success in various applications such as robotics, automated game player, ads placement. And notably, in the recent success of largelanguagemodels (LLMs) such as ChatGPT.
And then, What Is Not Covered, But Of Interest
We may not be able to cover the following interesting topics in machine learning:

Active Learning, actively obtain data as each data point is expensive.

Transfer Learning, transfer data collected for one task to other learning task.

Semisupervised Learning, supervised setting with (additional) unsupervised data.

Causal inference, Hypothesis testing, âŚ
But hopefully, things youâll learn this in course will provide systematic foundations to approach these topics.

See https://g.co/kgs/Lj3v3k to read more about Arthur Samuel.Â ↩

What is learning? Some food for thought: https://goo.gl/5R1m4S.Â ↩

A good set of readings include , andÂ ↩

MCMC is one of the top 10 algorithms of all timeÂ . Other algorithms include quicksort and fast Fourier transform.Â ↩