CNN

CNN 李宏毅 Hung-yi Lee

image classification.

  • (All the images to be classified have the same size.) 
  • Do we really need “fully connected” in image processing? (e.g., recognizing a bird from its eye alone)
      1. Identifying some critical patterns. A neuron does not have to see the whole image. Some patterns are much smaller than the whole image.
      • Simplification 1: choose receptive fields; they may overlap with each other. The stride is not set too large, since we want neighboring fields to overlap. Padding is applied where a receptive field goes beyond the image. Each receptive field has a set of parameters (filters).
        Each receptive field has a set of neurons (e.g., 64 neurons).

       

      2. The same patterns appear in different regions. Does each receptive field need its own “beak” detector?

       

      • Two neurons with the same receptive field would not share parameters.
      • Each receptive field has a set of neurons (e.g., 64 neurons), and the corresponding neurons in different receptive fields use the same set of parameters (parameter sharing).
    • Benefit of Convolutional Layer

    • The neurons with different receptive fields share their parameters (weight sharing): each filter convolves over the whole input image. This is the convolution operation (see the sketch after this list).

      3. Subsampling the pixels will not change the object, so we can subsample the feature map: pooling.
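
A minimal PyTorch sketch of the three ideas above: small overlapping receptive fields with padding, one shared set of filters convolved over the whole image, and pooling as subsampling. The 3-channel 100x100 input and the 64 filters are illustrative assumptions, not values fixed by the notes.

```python
# Receptive fields, shared filters, and pooling in PyTorch (illustrative sketch).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 100, 100)                       # one RGB image
conv = nn.Conv2d(in_channels=3, out_channels=64,      # 64 filters = 64 shared "neurons"
                 kernel_size=3, stride=1, padding=1)  # 3x3 receptive field, stride 1 (overlapping), padded border
pool = nn.MaxPool2d(kernel_size=2)                    # subsampling: keep the max in each 2x2 block

feature_map = conv(x)        # (1, 64, 100, 100): each filter slides over every receptive field (weight sharing)
pooled = pool(feature_map)   # (1, 64, 50, 50): smaller image, the patterns are still there
print(feature_map.shape, pooled.shape)
```
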
  • Application:
    • Playing Go: predicting the next move. Pooling is not suitable for Go. A fully-connected network can be used, but a CNN performs much better.
      • Why CNN for Go? Some patterns are much smaller than the whole board, and the same patterns appear in different regions. However, subsampling does change a Go position, so pooling is not used.
    • speech https://dl.acm.org/doi/10.110
    • nlp https://www.aclweb.org/anth
  • CNN cannot handle scaling and rotation of images; use a spatial transformer layer.
    The purpose of this operation is to squash the output values into the range (0, 1), i.e., normalization. Since every layer's output becomes the next layer's input, the inputs must be normalized; otherwise one input could grow extremely large and drown out the others, leaving only a single dominant voice, and the network would train very poorly.

Self-attention

Transformer & BERT (ntu.edu.tw)

  • Vector Set as Input. One-hot Encoding. Word Embedding. To learn more: https://youtu.be/X7PH3NuYW0Q (in Mandarin)

    • Voice: the audio is split into frames, and each frame is a vector.
    • Graph is also a set of vectors (consider each node as a vector)
  • output.

    • Each vector has a label. 
    • The whole sequence has a label. 
    • Model decides the number of labels itself. Seq to Seq

  • Self-attention processes the whole sequence; a fully-connected layer only processes the information at a single position.

  • Self-attention computation. If you do not remember the detailed derivation, see the slides, which go through it very carefully. (A math sketch follows below.)
    Parameters to be learned: Wq, Wk, Wv
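
A compact restatement of the derivation referenced above, in the usual lecture notation (the a^i are the input vectors; the 1/sqrt(d_k) scaling used in the original Transformer paper is omitted here for simplicity):

```latex
q^i = W^q a^i, \qquad k^i = W^k a^i, \qquad v^i = W^v a^i
\alpha_{i,j} = q^i \cdot k^j, \qquad
\alpha'_{i,j} = \frac{\exp(\alpha_{i,j})}{\sum_{j'} \exp(\alpha_{i,j'})}
b^i = \sum_j \alpha'_{i,j} \, v^j
```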

  • Multi-head Self-attention: different types of relevance. Each head has its own projections and computes its own attention scores (α) and output; the head outputs are then concatenated and projected: bi = Wo [bi1; bi2]. (A sketch follows below.)
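
A sketch of the multi-head computation above for two heads, matching bi = Wo [bi1; bi2]; all dimensions and tensor names are illustrative assumptions, not the lecture's code.

```python
# Two-head self-attention: each head computes its own scores, outputs are concatenated and projected.
import torch

n, d, d_head, heads = 4, 8, 4, 2
A = torch.randn(n, d)                        # input vectors a^1..a^n as rows
Wq = torch.randn(heads, d, d_head)           # one q/k/v projection per head
Wk = torch.randn(heads, d, d_head)
Wv = torch.randn(heads, d, d_head)
Wo = torch.randn(heads * d_head, d)          # output projection W^O

outs = []
for h in range(heads):
    Q, K, V = A @ Wq[h], A @ Wk[h], A @ Wv[h]
    alpha = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)  # per-head relevance scores
    outs.append(alpha @ V)                                  # b^{i,h} for every position i

B = torch.cat(outs, dim=-1) @ Wo             # concatenate the heads, then project: b^i
print(B.shape)                               # (n, d)
```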

  • Positional encoding (the issue it addresses: there is no position information in self-attention). Each position has a unique positional vector e_i, which can be hand-crafted or learned from data; which approach is better is still an open research question. (A sinusoidal sketch appears after the application list below.)

    • application: Transformer, BERT
    • Self-attention for Speech: speech is a very long vector sequence, so only attend within a limited range (Truncated Self-attention).
    • Self-attention for Image.
    • Self-Attention GAN
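
For the positional-encoding bullet above, here is a sketch of the hand-crafted option: sinusoidal positional vectors e^i as in the original Transformer paper (learned embeddings are the other option named in the notes).

```python
# Sinusoidal positional encoding: each position i gets a unique vector e^i.
import torch

def positional_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)     # (n, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dimensions
    freq = torch.exp(-torch.log(torch.tensor(10000.0)) * dim / d_model)   # 1 / 10000^(2k/d)
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

e = positional_encoding(50, 16)   # e[i] is added to input vector a^i
print(e.shape)                    # (50, 16)
```
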
  • Self-attention vs. CNN: a CNN is self-attention that can only attend within a receptive field, while self-attention is a CNN with a learnable receptive field. CNN is a simplified version of self-attention; self-attention is the complex version of CNN (see “On the Relationship between Self-Attention and Convolutional Layers”). CNN works better with less data; self-attention works better with more data.

  • Self-attention vs. RNN: see the paper “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”.

  • Self-attention for Graph: consider the edges, i.e., only attend to connected nodes (this is one type of GNN).

Seq to Seq

Transformer (ntu.edu.tw)

  • application

    • Text-to-Speech (TTS) Synthesis
    • Seq2seq for Chatbot
    • Seq2seq for Syntactic Parsing
    • Seq2seq for Multi-label Classification
    • Object Detection

  • encoder

    • You can use RNN or CNN

    • Add & Norm: Residual + Layer norm
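
A minimal sketch of one encoder sub-layer with the Add & Norm step described above (residual connection followed by layer normalization); the dimensions and the use of nn.MultiheadAttention are illustrative choices, not the lecture's code.

```python
# One Transformer encoder sub-layer: self-attention, then Add & Norm, then the same around an FFN.
import torch
import torch.nn as nn

d_model = 16
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)
norm1 = nn.LayerNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
norm2 = nn.LayerNorm(d_model)

x = torch.randn(1, 5, d_model)             # (batch, sequence, features)
attn_out, _ = attn(x, x, x)                # self-attention: q, k, v all come from x
x = norm1(x + attn_out)                    # Add (residual connection) & Norm (layer norm)
x = norm2(x + ffn(x))                      # same pattern around the feed-forward block
print(x.shape)                             # (1, 5, 16)
```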

     

  • decoder

    • Autoregressive (Speech Recognition as example)
    • Masked self-attention: each position can only attend to itself and the positions before it (see the masking sketch after this list).
    • Adding a “Stop Token”: when the model decides the output length itself, it has to decide when to stop, so the probability of an END token is added to the output distribution.
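
The masking sketch referenced above: in masked self-attention, position i may only attend to positions up to and including i, so entries above the diagonal of the score matrix are blocked before the softmax (matrix sizes are illustrative).

```python
# Masked (causal) self-attention scores: future positions are blocked.
import torch

n = 4
scores = torch.randn(n, n)                                    # raw q·k relevance scores
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))              # position i cannot see j > i
alpha = torch.softmax(scores, dim=-1)                         # each row sums to 1 over allowed positions
print(alpha)                                                  # upper-triangular entries are 0
```
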
  • Decoder – Non-autoregressive (NAT)


  • Cross attention: the queries come from the decoder, while the keys and values come from the encoder output.

  • Teacher Forcing: using the ground truth as input
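
A small sketch of teacher forcing as described above: during training the decoder input is the ground-truth sequence shifted right behind a BOS token, not the decoder's own previous outputs (the token IDs below are made up for illustration).

```python
# Teacher forcing: feed the ground truth (shifted right) as decoder input.
import torch

BOS, EOS = 1, 2                               # hypothetical special token IDs
ground_truth = torch.tensor([7, 8, 9, EOS])   # target token IDs (illustrative)
decoder_input = torch.cat([torch.tensor([BOS]), ground_truth[:-1]])  # [BOS, 7, 8, 9]
decoder_target = ground_truth                                        # [7, 8, 9, EOS]
print(decoder_input.tolist(), decoder_target.tolist())
```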

  • copy mechanism (Machine Translation, Chat-bot, Summarization )

  • Guided Attention. In some tasks, input and output are monotonically aligned. For example, speech recognition, TTS, etc.

  • Beam Search. It is not possible to check all the paths, so beam search keeps only the most promising ones. Is greedy decoding always the best? For tasks that need the machine to be a bit creative, beam search is not that useful. (A toy sketch follows below.)
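
The toy sketch referenced above, contrasting beam search with greedy decoding: at every step keep the k highest-scoring partial sequences instead of only the single best next token. The scoring function below is a stand-in for a real decoder, not part of the lecture.

```python
# Toy beam search over a fake next-token distribution (beam width k).
import torch

vocab_size, steps, k = 5, 3, 2

def next_token_logprobs(prefix):
    # Stand-in for the decoder: deterministic pseudo-random log-probs per prefix.
    g = torch.Generator().manual_seed(hash(tuple(prefix)) % (2**31))
    return torch.log_softmax(torch.randn(vocab_size, generator=g), dim=-1)

beams = [([], 0.0)]                           # (token sequence, total log-prob)
for _ in range(steps):
    candidates = []
    for seq, score in beams:
        logp = next_token_logprobs(seq)
        for tok in range(vocab_size):
            candidates.append((seq + [tok], score + logp[tok].item()))
    candidates.sort(key=lambda c: c[1], reverse=True)
    beams = candidates[:k]                    # keep only the k best partial sequences

print(beams)                                  # the k highest-scoring sequences found
```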

  • Sampling. Randomness is needed in the decoder when generating sequences for some tasks. Accept that nothing is perfect; true beauty lies in the cracks of imperfection. Occasionally feeding the model wrong inputs may make it learn better.

  • How to do the optimization? When you don’t know how to optimize, just use reinforcement learning (RL)!


  • Scheduled Sampling: during training, sometimes feed the decoder its own (possibly wrong) outputs instead of the ground truth, to reduce the mismatch between training and testing.
