Author: Tyan

Blog: noahsnail.com | CSDN | Jianshu (简书)

## 5. Hyperparameter tuning

#### 5.1 Tuning process

Hyperparameters:

$\alpha$, $\beta$, $\beta_1,\beta_2, \epsilon$, layers, hidden units, learning rate decay, mini-batch size.

The learning rate $\alpha$ is the most important hyperparameter to tune. $\beta$, the mini-batch size and the number of hidden units are second in importance.

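Try random values: don’t use a grid. For the same number of runs, random sampling tries more distinct values of each hyperparameter. Then go from coarse to fine: zoom in on the region that performed best and sample it more densely.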
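A minimal sketch of random search followed by a coarse-to-fine pass, assuming two hyperparameters scaled to $[0, 1]$. The `toy_score` function is a hypothetical stand-in for dev-set performance, and the box half-width of 0.1 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform(low, high, n):
    """Draw n random points, uniformly within [low, high) on each axis."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    return rng.uniform(low, high, size=(n, low.size))

def toy_score(points):
    """Hypothetical stand-in for dev-set performance (higher is better)."""
    target = np.array([0.3, 0.7])
    return -np.sum((points - target) ** 2, axis=1)

# Coarse pass: 25 random points try 25 distinct values of each hyperparameter,
# whereas a 5 x 5 grid would only ever try 5 distinct values of each.
coarse = sample_uniform([0.0, 0.0], [1.0, 1.0], 25)
best = coarse[np.argmax(toy_score(coarse))]

# Fine pass: sample more densely in a smaller box around the best coarse point.
half_width = 0.1
fine_low = np.clip(best - half_width, 0.0, 1.0)
fine_high = np.clip(best + half_width, 0.0, 1.0)
fine = sample_uniform(fine_low, fine_high, 25)
best_fine = fine[np.argmax(toy_score(fine))]
```

In a real search, `toy_score` would be replaced by training a model and evaluating it on the dev set.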

#### 5.2 Using an appropriate scale to pick hyperparameters

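Picking an appropriate scale for a hyperparameter:

For $\alpha \in [0.0001, 1]$, sample the exponent uniformly: `r = -4 * np.random.rand()`, so $r \in (-4, 0]$, then set $\alpha = 10^r$.

In general, if $\alpha \in [10^a, 10^b]$, pick $r$ uniformly at random from $[a, b]$ and set $\alpha = 10^r$.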
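The log-scale recipe above can be sketched as follows. The helper name `sample_log_uniform` and the sample size are assumptions made for illustration.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Sample values in [low, high] uniformly on a log10 scale."""
    a, b = np.log10(low), np.log10(high)
    r = rng.uniform(a, b, size=size)  # uniform exponent in [a, b)
    return 10.0 ** r

rng = np.random.default_rng(0)
alphas = sample_log_uniform(1e-4, 1.0, size=10_000, rng=rng)

# Each decade gets a similar share of the samples (about a quarter here).
fraction_smallest_decade = np.mean(alphas < 1e-3)

# For contrast, sampling uniformly on a linear scale puts about 90% of the
# samples in [0.1, 1] and almost none in the smaller decades.
linear = rng.uniform(1e-4, 1.0, size=10_000)
fraction_linear_above_0_1 = np.mean(linear > 0.1)
```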

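Hyperparameters for exponentially weighted averages

For $\beta \in [0.9, 0.999]$, don’t sample uniformly from $[0.9, 0.999]$. Instead, sample $1-\beta \in [0.001, 0.1]$ on a log scale, using the same method as for $\alpha$: pick $r$ uniformly from $[-3, -1]$ and set $\beta = 1 - 10^r$.

Why not sample on a linear scale? Because when $\beta$ is close to 1, even a small change has a large effect on the algorithm. The average covers roughly $\frac {1} {1-\beta}$ values, so going from $\beta = 0.9$ to $0.9005$ barely matters (about 10 values either way), while going from $0.999$ to $0.9995$ doubles the window from about 1000 to about 2000 values.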
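A small sketch of sampling $\beta$ through $1-\beta$, together with the sensitivity argument. The helper `averaging_window` is a name introduced here; it uses the common rule of thumb that the average spans about $\frac {1} {1-\beta}$ values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 1 - beta on a log scale over [0.001, 0.1], then convert back to beta.
r = rng.uniform(-3, -1, size=5)
betas = 1 - 10.0 ** r

def averaging_window(beta):
    """Approximate number of values the weighted average spans."""
    return 1.0 / (1.0 - beta)

# The same absolute change in beta matters very differently at the two ends.
low_end = (averaging_window(0.9), averaging_window(0.9005))      # about 10 and 10.05
high_end = (averaging_window(0.999), averaging_window(0.9995))   # about 1000 and 2000
```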

#### 5.3 Hyperparameters tuning in practice: Pandas vs Caviar

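Re-test hyperparameters occasionally.

Babysitting one model (Pandas): with limited compute, train a single model and adjust its hyperparameters by hand as training progresses.

Training many models in parallel (Caviar): with enough compute, train many models with different hyperparameter settings at once and pick the best.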

## 6. Batch Normalization

#### 6.1 Normalizing activations in a network

In logistic regression, normalizing the inputs speeds up learning.

- compute the means: $\mu = \frac {1} {m} \sum_{i=1}^m x^{(i)}$
- subtract the means from the training set: $x = x - \mu$
- compute the variances: $\sigma ^2 = \frac {1} {m} \sum_{i=1}^m {(x^{(i)})}^2$ (element-wise, after the means have been subtracted)
- normalize the training set: $x = \frac {x} {\sigma}$
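The input-normalization steps can be sketched in NumPy as below. The function name `normalize_inputs`, the row-per-example layout, and the small `eps` guard against division by zero are assumptions, not part of the original notes.

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Normalize each feature of X (shape: m examples x n features)."""
    mu = X.mean(axis=0)                        # per-feature mean
    X_centered = X - mu                        # subtract the mean
    sigma2 = np.mean(X_centered ** 2, axis=0)  # per-feature variance
    sigma = np.sqrt(sigma2)
    X_norm = X_centered / (sigma + eps)        # divide by the standard deviation
    return X_norm, mu, sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, -2.0], scale=[3.0, 0.5], size=(500, 2))
X_norm, mu, sigma = normalize_inputs(X_train)
```

The same `mu` and `sigma` computed on the training set should be reused to normalize the dev and test sets, so that all data goes through the same transformation.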

Similarly, to speed up training a neural network, we can normalize the intermediate values in its layers (`z` in a hidden layer). This is called Batch Normalization, or Batch Norm.

Implementing Batch Norm

- Given some intermediate values in a layer of the network, $z^{(1)}, z^{(2)},…,z^{(m)}$
- compute the mean: $\mu = \frac {1} {m} \sum_{i=1}^m z^{(i)}$
- compute the variance: $\sigma ^2 = \frac {1} {m} \sum_{i=1}^m (z^{(i)} - \mu)^2$
- normalize: $z^{(i)}_{norm} = \frac {z^{(i)} - \mu} {\sqrt {\sigma ^2 + \epsilon}}$
- scale and shift: $\hat z^{(i)} = \gamma z^{(i)}_{norm} + \beta$

The normalized $z$ now has zero mean and unit variance. But it may make sense for hidden units to have a different distribution, so we use $\hat z$ instead of $z$. $\gamma$ and $\beta$ are learnable parameters of the model that set the spread and mean of $\hat z$; this $\beta$ is distinct from the $\beta$ used in exponentially weighted averages.
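A minimal sketch of the Batch Norm forward pass for one layer, assuming `Z` stores one example per row and one hidden unit per column. The function name `batch_norm_forward` and the default `eps` are choices made for this example.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch Norm over a mini-batch Z of shape (m, n_units)."""
    mu = Z.mean(axis=0)                     # per-unit mean over the mini-batch
    var = Z.var(axis=0)                     # per-unit variance (1/m normalization)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    Z_hat = gamma * Z_norm + beta           # learnable scale and shift
    return Z_hat, Z_norm, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(64, 4))
Z_hat, Z_norm, mu, var = batch_norm_forward(
    Z, gamma=np.full(4, 2.0), beta=np.full(4, -1.0)
)

# With gamma = sqrt(var + eps) and beta = mu, the layer recovers the original Z,
# so the network can undo the normalization if that is what works best.
Z_back, _, _, _ = batch_norm_forward(Z, gamma=np.sqrt(var + 1e-8), beta=mu)
```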

#### 6.2 Fitting Batch Norm into a neural network

Add Batch Norm to a network

$X \rightarrow Z^{[1]} \rightarrow {\hat Z^{[1]}} \rightarrow {a^{[1]}} \rightarrow Z^{[2]} \rightarrow {\hat Z^{[2]}} \rightarrow {a^{[2]}}…$

Parameters:

$W^{[1]}, b^{[1]}$, $W^{[2]}, b^{[2]}…$

$\gamma^{[1]}, \beta^{[1]}$, $\gamma^{[2]}, \beta^{[2]}…$

Batch Norm computes the mean of $z^{[l]}$ and subtracts it, which cancels any constant added to $z^{[l]}$. The bias $b^{[l]}$ therefore has no effect, so we can drop it or set $b^{[l]} = 0$ permanently; $\beta^{[l]}$ takes over its role.
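A quick numerical check of why the bias is redundant under Batch Norm. The shapes and random values below are arbitrary, and `normalize` is a helper defined only for this illustration.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Normalization step of Batch Norm (without gamma and beta)."""
    return (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(32, 4))  # activations from the previous layer
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)

with_bias = normalize(A_prev @ W + b)
without_bias = normalize(A_prev @ W)

# Subtracting the mini-batch mean removes any constant offset, including b.
bias_has_no_effect = np.allclose(with_bias, without_bias)
```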

#### 6.3 Why does Batch Norm work?

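Covariate shift: suppose you have learned a function $x \rightarrow y$ that works well. If the distribution of $x$ changes, you may need to retrain the function for it to keep working well.

Hidden unit values change all the time as the parameters of earlier layers are updated, so each later layer suffers from covariate shift. Batch Norm keeps the mean and variance of each layer’s $z$ values governed by $\gamma$ and $\beta$, which limits how much updates to earlier layers shift the distribution seen by later layers.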

Batch Norm as regularization

- Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
- This adds some noise to the values $z^{[l]}$ within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations.
- This has a slight regularization effect.

#### 6.4 Batch Norm at test time

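To apply the network at test time, where examples may be processed one at a time, you need a separate estimate of $\mu$ and $\sigma^2$. A common choice is an exponentially weighted average of the $\mu$ and $\sigma^2$ values seen across mini-batches during training.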
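One way to keep such estimates is an exponentially weighted average of the mini-batch statistics, sketched below. The class name `RunningStats`, the `momentum` value of 0.9, and the synthetic mini-batches are assumptions for illustration only.

```python
import numpy as np

class RunningStats:
    """Exponentially weighted averages of mini-batch mean and variance."""

    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros(n_units)
        self.var = np.ones(n_units)

    def update(self, Z_batch):
        m = self.momentum
        self.mu = m * self.mu + (1 - m) * Z_batch.mean(axis=0)
        self.var = m * self.var + (1 - m) * Z_batch.var(axis=0)

def batch_norm_test_time(z, stats, gamma, beta, eps=1e-8):
    """Apply Batch Norm to a single example using the running estimates."""
    z_norm = (z - stats.mu) / np.sqrt(stats.var + eps)
    return gamma * z_norm + beta

# During training: update the running estimates after every mini-batch.
rng = np.random.default_rng(0)
stats = RunningStats(n_units=3)
for _ in range(200):
    Z_batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    stats.update(Z_batch)

# At test time: a single example is normalized with the stored estimates.
z_single = np.array([5.0, 7.0, 3.0])
out = batch_norm_test_time(z_single, stats, gamma=np.ones(3), beta=np.zeros(3))
```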

## 7. Multi-class classification

#### 7.1 Softmax regression

#### 7.2 Training a softmax classifier

Hard max.

Loss function.

Gradient descent with softmax.
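The three items above are only named in these notes, so the following is a hedged sketch using the standard definitions: softmax and hard max over the class scores, the cross-entropy loss, and the gradient of that loss with respect to $z$. The function names and the one-row-per-example layout are assumptions.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax for Z of shape (m, C); shifted for numerical stability."""
    shifted = Z - Z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def hard_max(Z):
    """Put 1 at the largest entry of each row and 0 everywhere else."""
    out = np.zeros_like(Z)
    out[np.arange(Z.shape[0]), Z.argmax(axis=1)] = 1.0
    return out

def cross_entropy(Y_hat, Y):
    """Average of -sum_j y_j * log(y_hat_j) over the examples."""
    log_probs = np.log(np.clip(Y_hat, 1e-12, 1.0))
    return -np.mean(np.sum(Y * log_probs, axis=1))

def softmax_grad(Y_hat, Y):
    """Gradient of the per-example loss with respect to Z: y_hat - y."""
    return Y_hat - Y

Z = np.array([[5.0, 2.0, -1.0, 3.0]])  # scores for 4 classes
Y = np.array([[1.0, 0.0, 0.0, 0.0]])   # one-hot label

Y_hat = softmax(Z)            # soft probabilities that sum to 1
Y_hard = hard_max(Z)          # [[1, 0, 0, 0]]
loss = cross_entropy(Y_hat, Y)
dZ = softmax_grad(Y_hat, Y)   # starting point for backpropagation
```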

## 8. Programming Frameworks

#### 8.1 Deep Learning frameworks

- Caffe/Caffe2
- TensorFlow
- Torch
- Theano
- mxnet
- PaddlePaddle
- Keras
- CNTK

Choosing deep learning frameworks

- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)

#### 8.2 TensorFlow

…