From RNN to CNN:
Using CNN in sequence processing
Dongang Wang
20 Jun 2018
Contents
 Background in sequence processing
• Basic Seq2Seq model and Attention model
 Important tricks
• Dilated convolution, Position Encoding, Multiplicative Attention, etc.
 Example Networks
• ByteNet, ConvS2S, Transformer, etc.
 Application in captioning
• Convolutional Image Captioning
• Transformer in Dense Video Captioning
Main references
• Convolutional Sequence to Sequence Learning
  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann Dauphin
  FAIR, published on arXiv, 2017
• Attention Is All You Need
  Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Illia Polosukhin, et al.
  Google Research & Google Brain, published in NIPS 2017
• An Empirical Evaluation of Generic Convolutional and Recurrent
Networks for Sequence Modeling
  Shaojie Bai, J Zico Kolter, Vladlen Koltun
  CMU & Intel Labs, published on arXiv, 2018 [Bai, 2018]
[Vaswani, 2017]
[Gehring, 2017]
Other references (1)
• Neural Machine Translation in Linear Time
  Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, et al.
  Google DeepMind, published on arXiv, 2016
• Convolutional Image Captioning
  Jyoti Aneja, Aditya Deshpande, Alexander Schwing
  UIUC, published in CVPR 2018
• End-to-End Dense Video Captioning with Masked Transformer
  Luowei Zhou, Yingbo Zhou, Jason Corso
  U of Michigan, published in CVPR 2018
[Aneja, 2018]
[Kalchbrenner, 2016]
[Zhou, 2018]
Other references (2)
• Sequence to sequence learning with neural networks
  Ilya Sutskever, Oriol Vinyals, Quoc V. Le
  Google Research, published in NIPS 2014
• Neural machine translation by jointly learning to align and translate
  Dzmitry Bahdanau , KyungHyun Cho, Yoshua Bengio
  Jacobs U & U of Montreal, published in ICLR 2015 (oral)
• Multi-scale context aggregation by dilated convolutions.
  Fisher Yu, Vladlen Koltun
  Princeton & Intel Labs, published in ICLR 2016 [Yu, 2016]
[Bahdanau, 2014]
[Sutskever, 2014]
Other references (3)
• End-to-End Memory Networks
  Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus
  NYU & FAIR, published in NIPS 2015
• Layer Normalization
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
  U of Toronto, published in NIPS 2016 workshop
• Weight Normalization: A Simple Reparameterization to Accelerate
Training of Deep Neural Networks
  Tim Salimans, Diederik P. Kingma
  OpenAI, published in NIPS 2016
[Sukhbaatar, 2015]
[Salimans, 2016]
[Ba, 2016]
Basic Seq2Seq in NMT
 Model:
– Encoder: the sentence is encoded into a fixed-length vector
– Decoder: the encoded vector serves as the initial input to the decoder, and the
output of each time step is fed as input to the next time step.
 Tricks:
– A deep LSTM with four layers.
– Reverse the word order of the input sentence.
[Sutskever, 2014]
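
A minimal sketch of this encoder-decoder pattern in PyTorch may help; the layer
sizes, the greedy decoding loop, and the <bos> token id are illustrative
assumptions, not details from [Sutskever, 2014]:

    import torch
    import torch.nn as nn

    # Minimal Seq2Seq sketch: encode the source into the final LSTM state,
    # then decode greedily, feeding each output token back in as input.
    vocab, emb, hid = 10000, 256, 512          # illustrative sizes
    embed = nn.Embedding(vocab, emb)
    encoder = nn.LSTM(emb, hid, num_layers=4, batch_first=True)  # deep LSTM
    decoder = nn.LSTM(emb, hid, num_layers=4, batch_first=True)
    project = nn.Linear(hid, vocab)

    src = torch.randint(vocab, (1, 12))        # one source sentence
    src = src.flip(dims=[1])                   # trick: reverse the input order
    _, state = encoder(embed(src))             # fixed-length summary (h, c)

    token = torch.tensor([[1]])                # assumed <bos> id
    for _ in range(20):                        # greedy decoding loop
        out, state = decoder(embed(token), state)
        token = project(out).argmax(-1)        # output becomes the next input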
Basic Attention in NMT
 Model:
– Encoder: bidirectional LSTM
– Decoder: takes the label and hidden state from the previous
time step, plus a weighted combination of all encoder
features.
[Bahdanau, 2014]
Limitations
 Running Time (main concern)
• RNNs cannot run in parallel over time steps because of their serial structure
 Long-term Dependency
• Gradients vanish or explode over long sequences
 Structure almost untouched
• The LSTM remains the best-performing structure so far (among more than ten
thousand evaluated RNN variants), and LSTM variants do not improve on it significantly.
• Techniques like batch normalization do not work properly in LSTMs.
 Relationships are not proper in Seq2Seq
• For NMT, the path between corresponding input and output tokens
should be short, but the original Seq2Seq model cannot keep it short.
Tricks as Building blocks
 Modified ConvNets for sequences (no pooling)
– Stacked CNN with multiple kernels, without padding
– Dilated Convolutional Network
 Residual Connections
 Normalization (Batch, Weight, Layer)
– To accelerate optimization
 Position Encoding
– To remedy the loss of position information
 Multiplicative Attention
– Another kind of attention method
Building block: Stacked CNN
 For sequences:
– Multiple kernels (filters) of size k by d, where k is the context width of
interest and d is the dimension of the word embedding.
– Stacking several layers without padding gives the CNN a larger receptive
field. For example, with 5 convolutions of k=3, one output position
corresponds to 11 input words (11->9->7->5->3->1); see the sketch below.
 For variable lengths:
– Pad all sequences to the same length
– Use a mask to control training
[Gehring, 2017]
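
A quick check of this receptive-field arithmetic, assuming PyTorch and an
arbitrary embedding dimension d = 64:

    import torch
    import torch.nn as nn

    # Five stacked 1D convolutions, kernel size 3, no padding: the length
    # shrinks 11 -> 9 -> 7 -> 5 -> 3 -> 1, so the single remaining output
    # position has a receptive field of 11 input words.
    d = 64                                     # embedding dimension (assumed)
    stack = nn.Sequential(*[nn.Conv1d(d, d, kernel_size=3) for _ in range(5)])

    x = torch.randn(1, d, 11)                  # (batch, channels=d, 11 words)
    print(stack(x).shape)                      # torch.Size([1, 64, 1])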
Building block: Dilated Convolution
 This is used as a causal convolution, in which future information is not
taken into account.
 Dilated convolution was originally used in segmentation, where retaining the
resolution of the input image is essential. In sequence modeling, it is likewise
essential to retain the information in the word embeddings.
Building block: Dilated Convolution
 For sequences: 1D dilated convolution
[Kalchbrenner, 2016]
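
A minimal sketch of a 1D causal dilated convolution, assuming PyTorch;
left-padding by (k-1)*dilation is one common way to enforce causality, not
necessarily the exact ByteNet implementation:

    import torch
    import torch.nn as nn

    # 1D causal dilated convolution: pad on the left by (k-1)*dilation so
    # each output position sees only current and past inputs; the sequence
    # length is preserved.
    class CausalDilatedConv1d(nn.Module):
        def __init__(self, channels, kernel_size=3, dilation=2):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation)

        def forward(self, x):                  # x: (batch, channels, T)
            x = nn.functional.pad(x, (self.pad, 0))   # left padding only
            return self.conv(x)

    x = torch.randn(1, 64, 16)
    print(CausalDilatedConv1d(64)(x).shape)    # torch.Size([1, 64, 16])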
Building block: Residual & Normalization
 For residuals:
– Residual connections proved very powerful in ResNet.
– Since sequence models may need deep networks, it is likewise useful to let
the layers learn modifications (residuals) of their inputs.
 For normalization:
– Intuition: make the gradients less dependent on the data, so that
optimization is accelerated.
– Batch normalization: normalize with the mean/variance of the batch
– Weight normalization: reparameterize with the norm of the weight vector
– Layer normalization: normalize with the mean/variance across the layer
Building block: Residual & Normalization
 Batch Normalization vs. Layer Normalization (comparison figure in the original slides)
 Batch Normalization vs. Weight Normalization (comparison figure in the original slides)
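
A toy PyTorch comparison of which axis each method normalizes over; the
shapes, the epsilon, and the weight-norm example are illustrative assumptions:

    import torch

    # Batch norm averages over the batch axis (one mean/variance per feature);
    # layer norm averages over the feature axis (one mean/variance per
    # example), so it also works with batch size 1 and variable lengths.
    x = torch.randn(8, 16)                     # (batch, features)
    eps = 1e-5

    bn = (x - x.mean(0)) / (x.std(0, unbiased=False) + eps)   # per feature
    ln = (x - x.mean(1, keepdim=True)) / (x.std(1, unbiased=False,
                                                keepdim=True) + eps)  # per example

    # Weight norm instead reparameterizes the weights, not the activations:
    v, g = torch.randn(16), torch.tensor(2.0)
    w = g * v / v.norm()                       # decouple direction and length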
Building block: Position Encoding
 If we process all the words of a sentence in parallel, we lose the
information about their order. To remedy this, we can add a position vector
to the original word-embedding vector.
– Train a separate position embedding parallel to the word embedding, taking
the one-hot position index as input
– For the j-th word out of J words, use a fixed feature with the same dimension
d as the word embedding; one choice [Sukhbaatar, 2015] sets the k-th element to
  l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)
– Use sine and cosine functions [Vaswani, 2017]:
  l_{kj} = sin(j / 10000^{k/d}),      if k is even
  l_{kj} = cos(j / 10000^{(k-1)/d}),  if k is odd
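
A sketch of the sinusoidal variant, assuming PyTorch and an even dimension d;
the function and variable names are my own:

    import torch

    # Sinusoidal position encoding: even dimensions use sin, odd dimensions
    # use cos, with frequency 10000^(-k/d); the result is added to the word
    # embeddings.
    def positional_encoding(J, d):             # J positions, even dimension d
        j = torch.arange(J, dtype=torch.float32).unsqueeze(1)   # (J, 1)
        k = torch.arange(0, d, 2, dtype=torch.float32)          # even indices
        freq = torch.pow(10000.0, -k / d)                       # (d/2,)
        pe = torch.zeros(J, d)
        pe[:, 0::2] = torch.sin(j * freq)
        pe[:, 1::2] = torch.cos(j * freq)
        return pe

    pe = positional_encoding(J=50, d=128)      # same dimension as embeddings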
Building block: Multiplicative Attention
 Additive attention:
– Train an MLP whose inputs are the encoded features and the hidden state of the previous step
– Use the resulting weights to form a weighted sum of the encoded features for decoding
 Multiplicative attention in decoding:
– g is the embedded word from the previous step
– h is the hidden state of the previous step
– z is the encoded feature
 Modified multiplicative attention (Scaled Dot-Product Attention):
– The dot products can become very large in some cases, which makes the
attention distribution overly peaked. To counter this, the dot product is divided by √d_z.
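
A minimal sketch of scaled dot-product attention, assuming PyTorch; the
shapes and names are illustrative:

    import math
    import torch

    # Scaled dot-product attention: dot products between decoder queries and
    # encoded features, divided by sqrt(d_z) so the softmax does not saturate.
    def scaled_dot_product_attention(q, z):
        # q: (batch, T_q, d_z) queries; z: (batch, T_z, d_z) encoded features
        scores = q @ z.transpose(1, 2) / math.sqrt(z.size(-1))
        weights = torch.softmax(scores, dim=-1)    # attention distribution
        return weights @ z                         # weighted sum of features

    q, z = torch.randn(1, 5, 64), torch.randn(1, 12, 64)
    print(scaled_dot_product_attention(q, z).shape)   # torch.Size([1, 5, 64])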
Network: ByteNet
 Blocks:
– Dilated Convolution
– Residual block with layer normalization
– Masked input
 Specialty:
– Dynamic unfolding: in neural machine
translation, the source and target sentence
lengths are roughly linearly related, so the
maximum target length is set to
  |t̂| = a|s| + b
with a = 1.2 and b = 0 (e.g., a 10-word source
unfolds to at most 12 target steps).
Network: ConvS2S
 Stacked CNN without pooling
 Position Encoding
 Multiplicative Attention
Network: Transformer
 Blocks:
– Position Encoding
– Scaled Dot-Product Attention
– Masked input
– Residual block with Layer Normalization
 Specialty:
– Multi-Head Attention: they run 8 attention
layers ("heads") in parallel and concatenate the
outputs into one vector, as sketched below.
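
A sketch of the multi-head pattern using PyTorch's built-in
nn.MultiheadAttention; the 8 heads and model size 512 follow the slide, while
the self-attention usage is illustrative:

    import torch
    import torch.nn as nn

    # Multi-head attention: 8 attention layers run in parallel on projected
    # subspaces and their outputs are concatenated back into one 512-dim vector.
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    x = torch.randn(1, 10, 512)                # (batch, T, d_model)
    out, weights = mha(x, x, x)                # self-attention: q = k = v = x
    print(out.shape)                           # torch.Size([1, 10, 512])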
Application: Convolutional Image Captioning
 Blocks:
– Gated linear units (sketched below)
– Additive attention
– Residual block with weight norm
– Fine-tune the image encoder
 Performance:
– Not as good as LSTM
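
A minimal sketch of a gated linear unit, assuming PyTorch; the channel sizes
are illustrative:

    import torch
    import torch.nn as nn

    # Gated linear unit: a convolution produces twice the channels; one half
    # gates the other through a sigmoid, GLU(a, b) = a * sigmoid(b).
    conv = nn.Conv1d(64, 2 * 64, kernel_size=3, padding=1)

    x = torch.randn(1, 64, 10)                 # (batch, channels, T)
    a, b = conv(x).chunk(2, dim=1)             # split into value and gate
    out = a * torch.sigmoid(b)                 # equivalent to nn.functional.glu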