Machine Learning (1)
Basic Concepts
Slides: regression (v16).pdf (ntu.edu.tw)
MAE (mean absolute error), MSE (mean squared error)
Cross-entropy (for classification; see the last part of these notes)
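A minimal NumPy sketch of the two regression losses (the numbers are toy values, not from the lecture):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # labels (toy values)
y_pred = np.array([1.5, 1.8, 2.4])   # model outputs (toy values)

mae = np.mean(np.abs(y_pred - y_true))   # mean absolute error
mse = np.mean((y_pred - y_true) ** 2)    # mean squared error
print(mae, mse)                          # ≈ 0.433, 0.217
```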
Linear models have a severe limitation: model bias (a linear relation is too simple to capture the true function).
activation function
- Rectified Linear Unit (ReLU)
- Sigmoid Function
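A small NumPy sketch of the two activation functions and of the idea that a flexible model can be built as a sum of shifted, scaled sigmoids (or ReLUs); the weights below are arbitrary toy values, not from the slides:

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: max(0, x)
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# flexible model: y = b + sum_i c_i * sigmoid(b_i + w_i * x)
x  = np.linspace(-3.0, 3.0, 7)
b  = 0.1
c  = np.array([ 1.0, -2.0])   # toy output weights
bi = np.array([ 0.5, -1.0])   # toy biases inside the activation
wi = np.array([ 2.0,  1.0])   # toy input weights

y_sig  = b + sum(ci * sigmoid(b_ + w_ * x) for ci, b_, w_ in zip(c, bi, wi))
y_relu = b + sum(ci * relu(b_ + w_ * x)    for ci, b_, w_ in zip(c, bi, wi))
```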
ML framework (three steps):
- a function with unknown parameters
- define the loss from training data
- optimization: find the parameter values that minimize the loss
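A minimal PyTorch-style sketch of the three steps; the architecture, data, and hyperparameters below are placeholders for illustration:

```python
import torch
import torch.nn as nn

# Step 1: a function with unknown parameters (a small network; sizes are arbitrary)
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

# Step 2: define the loss (MSE for regression)
criterion = nn.MSELoss()

# Step 3: optimization -- gradient descent on the unknown parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(64, 1)                     # placeholder training inputs
y = 2 * x + 1 + 0.1 * torch.randn(64, 1)   # placeholder targets

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()    # compute gradients of the loss w.r.t. the parameters
    optimizer.step()   # update the parameters
```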
Deep = Many hidden layers
Why don’t we just keep going deeper? More hidden layers keep lowering the training loss, but at some point the model overfits.
General Guidance (Hung-yi Lee)
model bias: the model is too simple. Solution: add more features, or use deep learning (more neurons, more layers) to make the model more flexible.
How do we tell whether the problem is model bias or optimization (gradient descent did not find the best θ)? Gain insight from comparison:
- Start from shallower networks (or other models), which are easier to optimize.
- If deeper networks do not obtain smaller loss on training data, then there is an optimization issue.
overfitting: solutions are data augmentation or a more constrained model (fewer parameters / sharing parameters, fewer features, early stopping, regularization, dropout). The constraints must not be too strong, though, or they re-introduce model bias.
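A hedged PyTorch sketch of two of the constraints listed above, dropout and L2 regularization via weight decay (layer sizes and hyperparameters are illustrative only):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout: randomly zeroes activations during training
    nn.Linear(32, 1),
)

# weight_decay adds an L2 penalty on the weights (regularization)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

model.train()   # dropout is active while training
model.eval()    # dropout is turned off for validation / testing
```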
cross validation: split the training data into a training set and a validation set, and use the validation loss to choose among models
- N-fold cross-validation: split the data into N folds, use each fold as the validation set in turn, and average the validation results
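A minimal sketch of N-fold cross-validation (here N = 3) using scikit-learn's KFold; the model and data are placeholders:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.randn(90, 4)                # placeholder features
y = X @ np.array([1.0, -2.0, 0.5, 3.0])   # placeholder targets

kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# the average validation error over the folds is what you compare across models
print(np.mean(scores))
```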
mismatch: the training and testing data have different distributions. (Simply increasing the training data will not help.)
optimization
local minimum and saddle points
saddle point: the gradient is 0, but it is neither a local minimum nor a local maximum
Taylor series approximation
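The second-order Taylor approximation of the loss around a point θ′, with gradient g and Hessian H, as used on the slides:

$$
L(\boldsymbol{\theta}) \approx L(\boldsymbol{\theta}') + (\boldsymbol{\theta}-\boldsymbol{\theta}')^{\top}\boldsymbol{g} + \frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}')^{\top} H (\boldsymbol{\theta}-\boldsymbol{\theta}')
$$

At a critical point g = 0, so the eigenvalues of H decide the type: all positive means a local minimum, all negative a local maximum, mixed signs a saddle point.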
Hessian
* PPT p.7 has a concrete example of how to compute H (the second derivatives).
H may tell us parameter update direction.
- PPT p.10 has a concrete example.
- But considering the computational cost, escaping saddle points via the Hessian is rarely used in practice to decrease the loss.
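A toy NumPy sketch of the idea: at a saddle point, look at the Hessian, pick an eigenvector with a negative eigenvalue, and step along it to lower the loss. The loss L(w1, w2) = w1·w2 below is an assumed example (not the one on the slides); its Hessian is constant and has a saddle at the origin.

```python
import numpy as np

def loss(w):
    # toy loss with a saddle point at the origin
    return w[0] * w[1]

# Hessian of w1*w2 is constant: [[0, 1], [1, 0]], eigenvalues -1 and +1
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])

eigvals, eigvecs = np.linalg.eigh(H)
u = eigvecs[:, np.argmin(eigvals)]   # eigenvector of the negative eigenvalue

w = np.zeros(2)                 # the critical point: gradient is zero here
w_new = w + 0.1 * u             # step along the negative-eigenvalue direction
print(loss(w), loss(w_new))     # 0.0, then a negative (smaller) loss
```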
Batch and Momentum
batch
- A larger batch size does not require longer time to compute the gradient (unless the batch size is too large): as long as the batch is not too big, the GPU processes it in parallel, so the time per update is about the same.
- Smaller batch requires longer time for one epoch (longer time for seeing all data once)
Smaller batch sizes give better performance. What is wrong with large batch sizes? Optimization fails: the "noisy" updates of small batches are better for training.
Small batch is better on testing data?
Have both fish and bear's paw? (i.e., keep the speed of large batches and the performance of small batches; the papers below explore this)
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962)
- Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (https://arxiv.org/abs/1711.04325)
- Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
- Large Batch Training of Convolutional Networks (https://arxiv.org/abs/1708.03888)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
Momentum
Gradient Descent + Momentum
Movement: m^t = λ·m^(t−1) − η·g^(t−1), a weighted sum of the last movement and the current negative gradient; then update θ^t = θ^(t−1) + m^t.
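A minimal sketch of gradient descent with momentum on a toy quadratic loss; λ and η below are illustrative values, not from the lecture:

```python
def grad(w):
    # gradient of a toy loss L(w) = w^2
    return 2 * w

eta, lam = 0.1, 0.9   # learning rate and momentum weight (illustrative values)
w, m = 5.0, 0.0       # parameter and movement, with m^0 = 0

for step in range(50):
    m = lam * m - eta * grad(w)   # movement = λ * last movement - η * gradient
    w = w + m                     # move the parameter by the movement
```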
Concluding remarks:
- Critical points have zero gradient; a critical point can be either a saddle point or a local minimum.
- Which one it is can be determined by the Hessian matrix.
- It is possible to escape a saddle point along the directions of the eigenvectors of the Hessian matrix.
- Local minima may be rare.
- Smaller batch sizes and momentum help escape critical points.
learning rate
Tips for training: Adaptive Learning Rate
summary
What to do when the error surface is rugged?
- Different parameters need different learning rates
- Method 1: scale each parameter's learning rate by σ, the root mean square of all of that parameter's past gradients (Adagrad)
* But even for the same parameter (the same direction), the learning rate it needs also changes over time
* Method 2 (RMSProp): make σ an exponentially weighted moving average of the squared gradients, so the learning rate adapts over time
* A weight α lets you decide how important the current gradient is relative to the old σ
* Effect: see the example on the slides
* Warm up: start with a small learning rate, increase it first, then decay it
* Residual Network [[1512.03385] Deep Residual Learning for Image Recognition (arxiv.org)](https://arxiv.org/abs/1512.03385)
* Transformer
* One explanation: at the start of training the σ statistics are not yet accurate, so a small learning rate lets the model explore cautiously first. [[1706.03762] Attention Is All You Need (arxiv.org)](https://arxiv.org/abs/1706.03762)
- Adam: RMSProp + Momentum
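A simplified NumPy sketch of the Adam idea (RMSProp-style v plus momentum-style m, with bias correction); the hyperparameter values are the common defaults, not something prescribed by these notes:

```python
import numpy as np

def grad(w):
    return 2 * w          # gradient of a toy loss L(w) = w^2

eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
w, m, v = 5.0, 0.0, 0.0

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
```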
classification, changing loss function
A more detailed explanation: https://youtu.be/fZAZUYEeIMg
ppt: Logistic Regression (ntu.edu.tw)
If classes 1, 2, 3 were grades 1, 2, 3 of elementary school, then class 1 really would be closer to class 2 than to class 3. But if classes 1, 2, 3 have no such relationship, encoding them as class 1 = 1, class 2 = 2, class 3 = 3 is odd, because it imposes an ordering that does not exist.
- Solution: represent the class as a one-hot vector
softmax normalizes the outputs to values between 0 and 1 and makes the gap between large and small values even larger
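A small NumPy example of both effects: the outputs sum to 1, and the largest logit ends up dominating (the logits here are toy values):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return e / e.sum()

y = np.array([3.0, 1.0, -3.0])  # raw network outputs (logits)
print(softmax(y))               # ≈ [0.88, 0.12, 0.00]: the gap is exaggerated
```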
loss of classification.
cross-entropy works better than MSE for classification
(The linked video gives the mathematical explanation.) With cross-entropy, a starting point in the blue (large-loss) region still has a slope to descend along; with MSE, that region is nearly flat, so gradient descent easily gets stuck.
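For reference, with a one-hot target ŷ and softmax outputs y′, the two losses being compared are:

$$
e_{\text{MSE}} = \sum_i (\hat{y}_i - y_i')^2, \qquad
e_{\text{cross-entropy}} = -\sum_i \hat{y}_i \ln y_i'
$$

Minimizing cross-entropy is equivalent to maximizing likelihood; its gradient does not vanish when the prediction is very wrong, while the MSE gradient can, which is why the MSE error surface is flat in the large-loss region.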