From RNN to CNN:
Using CNN in sequence processing
Dongang Wang
20 Jun 2018
Contents
 Background in sequence processing
• Basic Seq2Seq model and Attention model
 Important tricks
• Dilated convolution, Position Encoding, Multiplicative Attention, etc.
 Example Networks
• ByteNet, ConvS2S, Transformer, etc.
 Application in captioning
• Convolutional Image Captioning
• Transformer in Dense Video Captioning
Main references
• Convolutional Sequence to Sequence Learning
  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann Dauphin
  FAIR, published on arXiv, 2017
• Attention Is All You Need
  Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Illia Polosukhin, et al.
  Google Research & Google Brain, published in NIPS 2017
• An Empirical Evaluation of Generic Convolutional and Recurrent
Networks for Sequence Modeling
  Shaojie Bai, J Zico Kolter, Vladlen Koltun
  CMU & Intel Labs, published on arXiv, 2018 [Bai, 2018]
[Vaswani, 2017]
[Gehring, 2017]
Other references (1)
• Neural Machine Translation in Linear Time
  Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, et al.
  Google DeepMind, published on arXiv, 2016
• Convolutional Image Captioning
  Jyoti Aneja, Aditya Deshpande, Alexander Schwing
  UIUC, published in CVPR 2018
• End-to-End Dense Video Captioning with Masked Transformer
  Luowei Zhou, Yingbo Zhou, Jason Corso
  U of Michigan, published in CVPR 2018
[Aneja, 2018]
[Kalchbrenner, 2016]
[Zhou, 2018]
Other references (2)
• Sequence to sequence learning with neural networks
  Ilya Sutskever, Oriol Vinyals, Quoc V. Le
  Google Research, published in NIPS 2014
• Neural machine translation by jointly learning to align and translate
  Dzmitry Bahdanau , KyungHyun Cho, Yoshua Bengio
  Jacobs U & U of Montreal, published in ICLR 2015 (oral)
• Multi-scale context aggregation by dilated convolutions.
  Fisher Yu, Vladlen Koltun
  Princeton & Intel Labs, published in ICLR 2016 [Yu, 2016]
[Bahdanau, 2014]
[Sutskever, 2014]
Other references (3)
• End-to-End Memory Networks
  Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus
  NYU & FAIR, published in NIPS 2015
• Layer Normalization
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
  U of Toronto, published in NIPS 2016 workshop
• Weight Normalization: A Simple Reparameterization to Accelerate
Training of Deep Neural Networks
  Tim Salimans, Diederik P. Kingma
  OpenAI, published in NIPS 2016
[Sukhbaatar, 2015]
[Salimans, 2016]
[Ba, 2016]
Basic Seq2Seq in NMT
 Model:
– Encoder: the sentence is encoded into a fixed-length vector
– Decoder: the encoded vector serves as the initial input to the decoder, and the
output of each time step is fed as input to the next time step.
 Tricks:
– A deep LSTM with four layers.
– Reverse the word order of the input sentence.
[Sutskever, 2014]
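
A minimal sketch of this encoder-decoder pattern in PyTorch may help; the layer
sizes, the greedy decoding loop, and the <bos> token id are illustrative
assumptions, not details from [Sutskever, 2014]:

    import torch
    import torch.nn as nn

    # Minimal Seq2Seq sketch: encode the source into the final LSTM state,
    # then decode greedily, feeding each output token back in as input.
    vocab, emb, hid = 10000, 256, 512          # illustrative sizes
    embed = nn.Embedding(vocab, emb)
    encoder = nn.LSTM(emb, hid, num_layers=4, batch_first=True)  # deep LSTM
    decoder = nn.LSTM(emb, hid, num_layers=4, batch_first=True)
    project = nn.Linear(hid, vocab)

    src = torch.randint(vocab, (1, 12))        # one source sentence
    src = src.flip(dims=[1])                   # trick: reverse the input order
    _, state = encoder(embed(src))             # fixed-length summary (h, c)

    token = torch.tensor([[1]])                # assumed <bos> id
    for _ in range(20):                        # greedy decoding loop
        out, state = decoder(embed(token), state)
        token = project(out).argmax(-1)        # output becomes the next input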
Basic Attention in NMT
 Model:
– Encoder: bidirectional LSTM
– Decoder: takes the label and hidden state from the previous
time step, plus a weighted combination of all encoder
features.
[Bahdanau, 2014]
Limitations
 Running Time (main concern)
• RNNs cannot run in parallel over time steps because of their serial structure
 Long-term Dependency
• Gradients vanish or explode over long sequences
 Structure almost untouched
• The LSTM remains the best-performing structure so far (among more than ten
thousand evaluated RNN variants), and LSTM variants do not improve on it significantly.
• Techniques like batch normalization do not work properly in LSTMs.
 Relationships are not proper in Seq2Seq
• For NMT, the path between corresponding input and output tokens
should be short, but the original Seq2Seq model cannot keep it short.
Tricks as Building blocks
 Modified ConvNets for sequences (no pooling)
– Stacked CNN with multiple kernels, without padding
– Dilated Convolutional Network
 Residual Connections
 Normalization (Batch, Weight, Layer)
– To accelerate optimization
 Position Encoding
– To remedy the loss of position information
 Multiplicative Attention
– Another kind of attention method
Building block: Stacked CNN
 For sequences:
– Multiple kernels (filters) of size k by d, where k is the context width of
interest and d is the dimension of the word embedding.
– Stacking several layers without padding gives the CNN a larger receptive
field. For example, with 5 convolutions of k=3, one output position
corresponds to 11 input words (11->9->7->5->3->1); see the sketch below.
 For variable lengths:
– Pad all sequences to the same length
– Use a mask to control training
[Gehring, 2017]
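
A quick check of this receptive-field arithmetic, assuming PyTorch and an
arbitrary embedding dimension d = 64:

    import torch
    import torch.nn as nn

    # Five stacked 1D convolutions, kernel size 3, no padding: the length
    # shrinks 11 -> 9 -> 7 -> 5 -> 3 -> 1, so the single remaining output
    # position has a receptive field of 11 input words.
    d = 64                                     # embedding dimension (assumed)
    stack = nn.Sequential(*[nn.Conv1d(d, d, kernel_size=3) for _ in range(5)])

    x = torch.randn(1, d, 11)                  # (batch, channels=d, 11 words)
    print(stack(x).shape)                      # torch.Size([1, 64, 1])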
Building block: Dilated Convolution
 This is used as a causal convolution, in which future information is not
taken into account.
 Dilated convolution was originally used in segmentation, where retaining the
resolution of the input image is essential. In sequence modeling, it is likewise
essential to retain the information in the word embeddings.
Building block: Dilated Convolution
 For sequences: 1D dilated convolution
[Kalchbrenner, 2016]
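
A minimal sketch of a 1D causal dilated convolution, assuming PyTorch;
left-padding by (k-1)*dilation is one common way to enforce causality, not
necessarily the exact ByteNet implementation:

    import torch
    import torch.nn as nn

    # 1D causal dilated convolution: pad on the left by (k-1)*dilation so
    # each output position sees only current and past inputs; the sequence
    # length is preserved.
    class CausalDilatedConv1d(nn.Module):
        def __init__(self, channels, kernel_size=3, dilation=2):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation)

        def forward(self, x):                  # x: (batch, channels, T)
            x = nn.functional.pad(x, (self.pad, 0))   # left padding only
            return self.conv(x)

    x = torch.randn(1, 64, 16)
    print(CausalDilatedConv1d(64)(x).shape)    # torch.Size([1, 64, 16])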
Building block: Residual & Normalization
 For residuals:
– Residual connections proved very powerful in ResNet.
– Since sequence models may need deep networks, it is likewise useful to let
the layers learn modifications (residuals) of their inputs.
 For normalization:
– Intuition: make the gradients less dependent on the data, so that
optimization is accelerated.
– Batch normalization: normalize with the mean/variance of the batch
– Weight normalization: reparameterize with the norm of the weight vector
– Layer normalization: normalize with the mean/variance across the layer
Building block: Residual & Normalization
 Batch Normalization vs. Layer Normalization (comparison figure in the original slides)
 Batch Normalization vs. Weight Normalization (comparison figure in the original slides)
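
A toy PyTorch comparison of which axis each method normalizes over; the
shapes, the epsilon, and the weight-norm example are illustrative assumptions:

    import torch

    # Batch norm averages over the batch axis (one mean/variance per feature);
    # layer norm averages over the feature axis (one mean/variance per
    # example), so it also works with batch size 1 and variable lengths.
    x = torch.randn(8, 16)                     # (batch, features)
    eps = 1e-5

    bn = (x - x.mean(0)) / (x.std(0, unbiased=False) + eps)   # per feature
    ln = (x - x.mean(1, keepdim=True)) / (x.std(1, unbiased=False,
                                                keepdim=True) + eps)  # per example

    # Weight norm instead reparameterizes the weights, not the activations:
    v, g = torch.randn(16), torch.tensor(2.0)
    w = g * v / v.norm()                       # decouple direction and length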
Building block: Position Encoding
 If we process all the words of a sentence in parallel, we lose the
information about their order. To remedy this, we can add a position vector
to the original word-embedding vector.
– Train a separate position embedding parallel to the word embedding, taking
the one-hot position index as input
– For the j-th word out of J words, use a fixed feature with the same dimension
d as the word embedding; one choice [Sukhbaatar, 2015] sets the k-th element to
  l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)
– Use sine and cosine functions [Vaswani, 2017]:
  l_{kj} = sin(j / 10000^{k/d}),      if k is even
  l_{kj} = cos(j / 10000^{(k-1)/d}),  if k is odd
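
A sketch of the sinusoidal variant, assuming PyTorch and an even dimension d;
the function and variable names are my own:

    import torch

    # Sinusoidal position encoding: even dimensions use sin, odd dimensions
    # use cos, with frequency 10000^(-k/d); the result is added to the word
    # embeddings.
    def positional_encoding(J, d):             # J positions, even dimension d
        j = torch.arange(J, dtype=torch.float32).unsqueeze(1)   # (J, 1)
        k = torch.arange(0, d, 2, dtype=torch.float32)          # even indices
        freq = torch.pow(10000.0, -k / d)                       # (d/2,)
        pe = torch.zeros(J, d)
        pe[:, 0::2] = torch.sin(j * freq)
        pe[:, 1::2] = torch.cos(j * freq)
        return pe

    pe = positional_encoding(J=50, d=128)      # same dimension as embeddings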
Building block: Multiplicative Attention
 Additive attention:
– Train an MLP whose inputs are the encoded features and the hidden state of the previous step
– Use the resulting weights to form a weighted sum of the encoded features for decoding
 Multiplicative attention in decoding:
– g is the embedded word from the previous step
– h is the hidden state of the previous step
– z is the encoded feature
 Modified multiplicative attention (Scaled Dot-Product Attention):
– The dot products can become very large in some cases, which makes the
attention distribution overly peaked. To counter this, the dot product is divided by √d_z.
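
A minimal sketch of scaled dot-product attention, assuming PyTorch; the
shapes and names are illustrative:

    import math
    import torch

    # Scaled dot-product attention: dot products between decoder queries and
    # encoded features, divided by sqrt(d_z) so the softmax does not saturate.
    def scaled_dot_product_attention(q, z):
        # q: (batch, T_q, d_z) queries; z: (batch, T_z, d_z) encoded features
        scores = q @ z.transpose(1, 2) / math.sqrt(z.size(-1))
        weights = torch.softmax(scores, dim=-1)    # attention distribution
        return weights @ z                         # weighted sum of features

    q, z = torch.randn(1, 5, 64), torch.randn(1, 12, 64)
    print(scaled_dot_product_attention(q, z).shape)   # torch.Size([1, 5, 64])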
Network: ByteNet
 Blocks:
– Dilated Convolution
– Residual block with layer normalization
– Masked input
 Specialty:
– Dynamic unfolding: in neural machine
translation, the source and target sentence
lengths are roughly linearly related, so the
maximum target length is set to
  |t̂| = a|s| + b
with a = 1.2 and b = 0 (e.g., a 10-word source
unfolds to at most 12 target steps).
Network: ConvS2S
 Stacked CNN without pooling
 Position Encoding
 Multiplicative Attention
Network: Transformer
 Blocks:
– Position Encoding
– Scaled Dot-Product Attention
– Masked input
– Residual block with Layer Normalization
 Specialty:
– Multi-Head Attention: they run 8 attention
layers ("heads") in parallel and concatenate the
outputs into one vector, as sketched below.
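
A sketch of the multi-head pattern using PyTorch's built-in
nn.MultiheadAttention; the 8 heads and model size 512 follow the slide, while
the self-attention usage is illustrative:

    import torch
    import torch.nn as nn

    # Multi-head attention: 8 attention layers run in parallel on projected
    # subspaces and their outputs are concatenated back into one 512-dim vector.
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    x = torch.randn(1, 10, 512)                # (batch, T, d_model)
    out, weights = mha(x, x, x)                # self-attention: q = k = v = x
    print(out.shape)                           # torch.Size([1, 10, 512])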
Application: Convolutional Image Captioning
 Blocks:
– Gated linear units (sketched below)
– Additive attention
– Residual block with weight norm
– Fine-tune the image encoder
 Performance:
– Not as good as LSTM
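
A minimal sketch of a gated linear unit, assuming PyTorch; the channel sizes
are illustrative:

    import torch
    import torch.nn as nn

    # Gated linear unit: a convolution produces twice the channels; one half
    # gates the other through a sigmoid, GLU(a, b) = a * sigmoid(b).
    conv = nn.Conv1d(64, 2 * 64, kernel_size=3, padding=1)

    x = torch.randn(1, 64, 10)                 # (batch, channels, T)
    a, b = conv(x).chunk(2, dim=1)             # split into value and gate
    out = a * torch.sigmoid(b)                 # equivalent to nn.functional.glu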