Sunday, January 26, 2020

Machine Learning



https://developers.google.com/machine-learning/crash-course/ml-intro
  • ML systems learn how to combine input to produce useful predictions on never-before-seen data.

Models

A model defines the relationship between features and label. For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life:
  • Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
  • Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y'). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
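A rough sketch of the two phases, assuming scikit-learn and a made-up housing dataset (the feature values and medianHouseValue labels are invented, not from the course):

```python
# Minimal sketch of training vs. inference, assuming scikit-learn.
# The feature matrix and medianHouseValue labels are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training: show the model labeled examples (features X, labels y).
X_train = np.array([[3, 1200], [4, 2000], [2, 800]])   # e.g. rooms, sqft
y_train = np.array([250000, 410000, 180000])           # medianHouseValue

model = LinearRegression().fit(X_train, y_train)

# Inference: apply the trained model to unlabeled examples to get y'.
X_new = np.array([[3, 1500]])
print(model.predict(X_new))  # predicted medianHouseValue for a new example
```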

Regression vs. classification

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:
  • What is the value of a house in California?
  • What is the probability that a user will click on this ad?
A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:
  • Is a given email message spam or not spam?
  • Is this an image of a dog, a cat, or a hamster?
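In code, the difference shows up as continuous vs. discrete outputs. A sketch, assuming scikit-learn and invented data:

```python
# Sketch: regression outputs a continuous value, classification a discrete class.
# The data below is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

reg = LinearRegression().fit(X, np.array([1.1, 1.9, 3.2, 3.8]))
print(reg.predict(np.array([[2.5]])))   # continuous value, e.g. ~2.5

clf = LogisticRegression().fit(X, np.array([0, 0, 1, 1]))
print(clf.predict(np.array([[2.5]])))   # discrete class label, 0 or 1
```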

Good features are concrete and quantifiable
Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.
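A quick sketch of that fit using numpy's least-squares polyfit (the points are invented):

```python
# Sketch: fit the best straight line y = w*x + b to a set of points
# with ordinary least squares. Points are invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

w, b = np.polyfit(x, y, deg=1)  # degree-1 polynomial = straight line
print(w, b)  # slope ~2, intercept ~0
```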
The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss, or squared error). The squared loss for a single example is as follows:
  = the square of the difference between the label and the prediction
  = (observation - prediction(x))²
  = (y - y')²
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
MSE = (1/N) Σ_{(x,y)∈D} (y - prediction(x))²
where:
  • (x,y) is an example in which
    • x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
    • y is the example's label (for example, temperature).
  • prediction(x) is a function of the weights and bias in combination with the set of features x.
  • D is a data set containing many labeled examples, which are (x,y) pairs.
  • N is the number of examples in D.
Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
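As a sanity check, the definition above translates directly into a few lines of numpy (the labels and predictions below are made up):

```python
# Sketch: compute MSE exactly as defined above, averaging the squared
# loss (y - prediction(x))^2 over all examples in the dataset D.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # labels y
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # prediction(x) for each example

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```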
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. For a single weight, the gradient of the loss equals the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
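A minimal gradient descent sketch for a one-feature linear model under squared loss (the learning rate, step count, and data are arbitrary choices, not values from the course):

```python
# Sketch: gradient descent on squared loss for a one-feature linear model
# prediction(x) = w*x + b. Learning rate and data are arbitrary.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated from y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for step in range(500):
    y_pred = w * x + b
    # Partial derivatives of MSE with respect to w and b.
    grad_w = np.mean(2 * (y_pred - y) * x)
    grad_b = np.mean(2 * (y_pred - y))
    # Step in the direction opposite the gradient ("downhill").
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches w=2, b=1
```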
The partial derivative of f with respect to x, denoted ∂f/∂x, is the derivative of f considered as a function of x alone. To find ∂f/∂x, hold y constant (so f is now a function of the single variable x) and take the regular derivative of f with respect to x.


Intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit. For example, with f(x, y) = e^(2y)·sin(x):

∂f/∂x (0, 1) = e²·cos(0) = e² ≈ 7.4

So when you start at (0, 1), hold y constant, and move x a little, f changes by about 7.4 times the amount that you changed x.
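You can sanity-check that value numerically with a finite difference (a sketch, assuming the example function above):

```python
# Sketch: check the partial derivative numerically with a finite difference,
# using the example f(x, y) = e^(2y) * sin(x) at the point (0, 1).
import math

def f(x, y):
    return math.exp(2 * y) * math.sin(x)

h = 1e-6  # small perturbation of x, with y held constant
df_dx = (f(0 + h, 1) - f(0, 1)) / h
print(df_dx)  # ~7.389, i.e. e**2
```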

In machine learning, partial derivatives are mostly used in conjunction with the gradient of a function.
Instead of predicting exactly 0 or 1, logistic regression generates a probability: a value between 0 and 1, exclusive.
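That probability comes from squashing a linear score through the sigmoid function; a minimal sketch with invented weights:

```python
# Sketch: logistic regression squashes a linear score z = w·x + b
# through the sigmoid, yielding a probability strictly between 0 and 1.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = [1.5, -0.5], 0.2   # invented weights and bias
x = [2.0, 1.0]
z = sum(wi * xi for wi, xi in zip(w, x)) + b
print(sigmoid(z))  # ~0.94 for z = 2.7
```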

https://zhuanlan.zhihu.com/p/36902908
The real leap comes in college, when calculus gives us a unified approach to finding a function's extrema: look for the points where the derivative equals 0, because at an extremum the derivative must be 0. As long as the function is differentiable, this universal method applies, and fortunately, the functions we encounter in practice are almost always differentiable.
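A tiny worked instance of that idea, sketched with sympy (the function here is an arbitrary convex example, not one from the linked post):

```python
# Sketch: find an extremum by solving f'(x) = 0, here for the
# arbitrary convex function f(x) = (x - 3)**2 + 1.
import sympy as sp

x = sp.symbols('x')
f = (x - 3) ** 2 + 1

critical_points = sp.solve(sp.diff(f, x), x)
print(critical_points)                 # [3]
print(f.subs(x, critical_points[0]))   # minimum value 1
```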




