Machine Learning Starter Notes

This post includes:

  • Applications of Machine Learning: demos
  • So what is Machine Learning?
  • Machine Learning Problems: the Optimization Basics
  • Machine Learning Problems: an Optimization Perspective
  • Example problem 1: Linear Regression
  • Example problem 2: Neural Network

Applications of Machine Learning: demos

  • Self-driving cars: use machine learning to recognize surrounding obstacles (demo: A ride in Google’s self-driving car).
  • Object detection: computer vision at Facebook.
  • AI: machine learning can make AI more intelligent.
    • Reinforcement learning can help AI agents improve themselves.
    • Neural networks can help a computer evaluate positions on the Go board.
  • Robotics: machine learning can make robots more accurate (video).

So what is Machine Learning?

Reading: Andrew’s Machine Learning Notes, classes 1-2

  • A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E (Tom Mitchell’s definition).
  • Supervised and Unsupervised learning
    • Supervised learning: linear regression, logistic regression, neural network, etc.
    • Unsupervised learning: dimensionality reduction, singular value decomposition, etc.
  • Optimization problem: In mathematics and computer science, an optimization problem is the problem of finding the best solution from all feasible solutions.
  • Many machine learning problems can be treated as optimization problems:
    • Linear regression, multi-variable linear regression, logistic regression, neural networks…

Machine Learning Problems: the Optimization Basics

  • How do we solve an optimization problem?
  • Example: find a zero of a function f.
  • Algorithm 1: Binary search (bisection).

    # Bisection: assumes f is continuous and f(start) * f(end) < 0.
    def bisect(f, start, end, tol=1e-8):
        l, r = start, end
        while r - l > tol:
            mid = (l + r) / 2
            if f(l) * f(mid) < 0:  # the root lies in [l, mid]
                r = mid
            else:                  # the root lies in [mid, r]
                l = mid
        return (l + r) / 2

    # Example: root of x^2 - 2 on [0, 2], i.e., sqrt(2).
    root = bisect(lambda x: x * x - 2, 0.0, 2.0)
    • Not guaranteed to find a root! It depends on the initial interval: f(start) and f(end) must have opposite signs.
  • Algorithm 2: Newton’s method. Tutorial. A minimal sketch is below.
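
A minimal sketch of Newton’s method, assuming we can evaluate both f and its derivative (passed in here as df; both names are illustrative):

    # Newton's method: iterate x <- x - f(x) / f'(x).
    def newton(f, df, x0, tol=1e-8, max_iter=100):
        x = x0
        for _ in range(max_iter):
            step = f(x) / df(x)
            x = x - step
            if abs(step) < tol:   # stop once updates are tiny
                break
        return x

    # Example: root of x^2 - 2, starting from x0 = 1.
    root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)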

  • Algorithm 3: Gradient descent (see the sketch below).
    • Gradient descent minimizes a function rather than finding a zero directly; to find a zero of f, minimize $f(x)^2$. It works fine when $x$ (and the output) are vectors.
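
A minimal gradient descent sketch; grad (an assumed callable) returns the gradient of the objective at x:

    # Gradient descent: iterate x <- x - alpha * grad(x).
    def gradient_descent(grad, x0, alpha=0.1, steps=100):
        x = x0
        for _ in range(steps):
            x = x - alpha * grad(x)
        return x

    # Example: minimize (x - 3)^2, whose gradient is 2(x - 3).
    x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)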

Machine Learning Problems: an Optimization Perspective

  • Given a cost function J, adjust some parameters $\theta$ according to the training data (i.e., learn from experience) such that J decreases.
  • A gradient descent algorithm follows naturally: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$.
  • How do we measure performance? J is computed on the training data, so there should be a training/validation/test split.
  • How many steps should we perform?
  • How to tune the learning rate $\alpha$?
  • Learning curve: how does the cost function change as training proceeds? (See the sketch below.)
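
A minimal training-loop sketch under these questions (J and grad_J are illustrative callables for the cost and its gradient); it records the cost at every step so the learning curve can be plotted:

    # Gradient descent on parameters theta, logging J for the learning curve.
    def train(J, grad_J, theta0, alpha=0.01, steps=1000):
        theta = theta0
        history = []                      # cost per step
        for _ in range(steps):
            theta = theta - alpha * grad_J(theta)
            history.append(J(theta))      # plot this to tune alpha and steps
        return theta, history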

Example problem 1: Linear Regression

Reading: Andrew’s notes on linear regression

  • Problem: Given some points $(x, y)$, where $x$ is an n-dimensional vector, adjust the parameters $\theta$ of the linear regression model such that the cost function J decreases. A common choice is the squared-error cost $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$.
  • Description of the model: fit the points with a linear function: $y = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$. In this model, $x$ is the input, $y$ is the output, and $\theta$ is the parameter vector.
  • Solution: use optimization. Pick an initial value for $\theta$, then run gradient descent (see the sketch after this list):
  • If each update uses a single training example -> stochastic gradient descent.
  • If each update uses the entire training set -> batch gradient descent. (In both variants, all components of $\theta$ are updated simultaneously.)
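
A minimal batch-gradient-descent sketch for linear regression in numpy (names and shapes are assumptions: X is an (m, n+1) design matrix whose first column is all ones, for the intercept $\theta_0$):

    import numpy as np

    def linreg_gd(X, y, alpha=0.01, steps=1000):
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(steps):
            residual = X @ theta - y      # h_theta(x) - y for every example
            grad = X.T @ residual / m     # gradient of the squared-error cost
            theta = theta - alpha * grad  # simultaneous update of all components
        return theta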

Example problem 2: Neural Network

  • Problem: Optimize a model (by adjusting $\theta$) whose performance is measured by $J$ and whose output is computed as $y=f(x)$, where both $J$ and $f$ are determined by the model.
  • Solution: Again, use optimization. Pick an initial value for $\theta$, then run gradient descent.
  • There are two questions here:
    (1) What are the functions $f$ and $J$?
    (2) What is the derivative term $\frac{\partial J}{\partial \theta}$?
  • For a fully-connected neural network, the function is $f(x,W) = \sigma (Wx)$, where W contains the bias already and $\sigma$ is the activation function. Usually two types of loss functions are used: L2 loss (if we treat the activations as raw values) and log loss (if we treat the activations as probabilities). The way the derivatives are taken depends on the activation function; across layers they are computed with backpropagation. A single-layer sketch follows.
    There are many kinds of neural networks. For a convolutional neural network, the function is $f(x,\theta) = \sigma (\theta * x)$, where $*$ is a convolution operation (filtering, in signal-processing terms).
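
A minimal single-layer sketch matching $f(x, W) = \sigma(Wx)$ with a sigmoid activation and an L2 loss (all names, and the convention of appending a 1 to x for the bias, are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(W, x):
        # W already contains the bias column; x has a trailing 1 appended.
        return sigmoid(W @ x)

    def l2_loss(a, y):
        return 0.5 * np.sum((a - y) ** 2)

    def grad_W(W, x, y):
        a = forward(W, x)                  # activations
        delta = (a - y) * a * (1 - a)      # dJ/dz via the sigmoid derivative
        return np.outer(delta, x)          # dJ/dW, used in the theta update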