This document summarizes a research paper on scaling laws for neural language models. Some key findings of the paper include:
- Language model performance depends strongly on model scale and only weakly on model shape. With enough compute and data, performance scales as a power law in the number of parameters, the amount of compute, and the dataset size (the approximate functional form is sketched after this list).
- Overfitting is universal, with penalties depending on the ratio of parameters to data.
- Large models are more sample-efficient, reaching the same performance with fewer optimization steps and fewer data points.
- The paper motivated subsequent work by OpenAI on applying scaling laws to other domains like computer vision and developing increasingly large language models like GPT-3.
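As a rough reference (my own addition, not quoted from the slides), the power-law relations in the scaling-laws paper take approximately the following form, where N is the parameter count, D the dataset size, C the compute budget, and the constants N_c, D_c, C_c and exponents alpha_N, alpha_D, alpha_C are fit empirically:

```latex
% Approximate single-variable scaling laws (each holds when the other
% resources are not the bottleneck):
\[
  L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
  L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
  L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
% Combined parameter/data form; the second term acts as the overfitting
% penalty that grows when D is small relative to N:
\[
  L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
\]
```

The combined form is what the overfitting bullet above refers to: the penalty depends on how large the model is relative to the amount of data.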
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that reconstruct masked patches of the input image (a minimal sketch follows this list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
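To make approach 1 concrete, below is a minimal, self-contained PyTorch sketch of masked patch prediction. The model, dimensions, 75% masking ratio, and pixel-reconstruction loss are illustrative choices of mine (loosely MAE/BEiT-flavoured), not the exact recipe of any paper in the list.

```python
import torch
import torch.nn as nn

# Toy masked-patch-prediction model: flattened image patches go in, masked
# positions are replaced by a learned [MASK] token, a small Transformer
# encoder processes the sequence, and a linear head reconstructs the patches.
class ToyMaskedPatchModel(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, embed_dim=192, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)             # patch -> token
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))   # learned [MASK]
        self.pos = nn.Parameter(torch.zeros(num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, patch_dim)               # token -> pixels

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = masked
        tokens = self.embed(patches)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        encoded = self.encoder(tokens + self.pos)
        return self.head(encoded)

# Toy training step: mask 75% of patches and reconstruct only the masked ones.
B, N, D = 2, 196, 16 * 16 * 3
patches = torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.75
model = ToyMaskedPatchModel()
pred = model(patches, mask)
loss = ((pred - patches) ** 2)[mask].mean()
loss.backward()
```

Contrastive and self-distillation approaches (2-4) replace the reconstruction loss with losses defined between two augmented views of the same image.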
A rough overview of hyperparameter search using Bayesian optimization (a small usage sketch follows the reference below).
The paper on which the presented content is based:
Bergstra, James, et al. "Algorithms for Hyper-Parameter Optimization." Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011.
https://hal.inria.fr/hal-00642998/
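As a usage illustration (my own toy example, not content from the slides), the hyperopt library implements the TPE algorithm described in this paper; a minimal search over a single parameter looks roughly like this:

```python
# Minimal sketch of TPE-based hyperparameter search with the hyperopt library.
# The objective function and search space are toy examples.
from hyperopt import fmin, tpe, hp

def objective(params):
    x = params["x"]
    return (x - 2.0) ** 2          # stand-in for a validation loss to minimize

space = {"x": hp.uniform("x", -10.0, 10.0)}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)                        # best value of x found after 50 evaluations
```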
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules, such as the Perceiver, which aim to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
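As a toy illustration of what "divided image patches" means (my own example, not code from the papers above), a 224x224 RGB image can be turned into a sequence of flattened 16x16 patch tokens like this:

```python
import torch

img = torch.randn(1, 3, 224, 224)              # (batch, channels, height, width)
P = 16                                          # patch size
# Slide non-overlapping 16x16 windows over H and W, then flatten each patch.
patches = img.unfold(2, P, P).unfold(3, P, P)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * P * P)
print(patches.shape)                            # torch.Size([1, 196, 768])
```

Each patch token is then linearly projected and fed to a standard Transformer encoder, which is essentially the ViT recipe.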
An explanation of a paper that compares DNN-based models proposed for tabular data against XGBoost.
Among other results, it covers experiments showing that an ensemble combining the DNNs and XGBoost performed well (a minimal ensemble sketch follows the reference below).
Shwartz-Ziv, Ravid, and Amitai Armon. "Tabular Data: Deep Learning is Not All You Need." arXiv preprint arXiv:2106.03253 (2021).
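Below is a minimal sketch of the kind of DNN + XGBoost ensemble discussed above, assuming a soft-voting average of predicted probabilities from XGBoost and a small scikit-learn MLP. The synthetic dataset, model settings, and equal weighting are illustrative, not the paper's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Synthetic tabular classification data standing in for a real benchmark.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

xgb = XGBClassifier(n_estimators=200, max_depth=4).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0).fit(X_tr, y_tr)

# Equal-weight soft-voting ensemble: average the two models' class probabilities.
proba = (xgb.predict_proba(X_te) + mlp.predict_proba(X_te)) / 2
pred = proba.argmax(axis=1)
print("ensemble accuracy:", accuracy_score(y_te, pred))
```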
These slides were used by Umemoto of our company at an internal technical study session.
They explain the Transformer, an architecture that has attracted much attention in recent years.
"Arithmer Seminar" is weekly held, where professionals from within and outside our company give lectures on their respective expertise.
The slides are made by the lecturer from outside our company, and shared here with his/her permission.
Arithmer Inc. is a mathematics company that originated in the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring advanced AI systems into solutions across a wide range of fields. Our job is to think about how to use AI effectively to make work more efficient and to produce results that are useful to people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems provides solutions to tough, complex issues. At Arithmer, we believe it is our job to realize the potential of AI by improving work efficiency and producing results that are more useful to society.
Talk given by Akira Shibata at Developer's Summit 2016, one of the largest conferences for software developers in Japan. Akira, a Data Scientist at DataRobot, Inc., talked about the evolution of machine learning techniques, most notably recent developments in DataRobot and TensorFlow.
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation (Deep Learning JP)
The document proposes modifications to self-attention in Transformers to achieve faithful signal propagation without shortcuts such as skip connections or layer normalization. Specifically, it discusses a normalization-free network that uses dynamical isometry to keep transformations close to unitary, the ReZero technique for recovering the benefit of skip connections without adding explicit shortcuts, and modifications to attention and normalization that address issues such as rank collapse in Transformers. The methods are evaluated on tasks such as CIFAR-10 classification and language modeling, demonstrating improved performance over standard Transformer architectures.
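For reference, the ReZero idea mentioned in the summary can be written in a few lines: the residual branch is scaled by a learnable gate alpha initialized to zero, so each block starts out as the identity map. This is my own generic sketch of the technique, not code from the paper under discussion.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block of the form y = x + alpha * F(x), with alpha starting at 0."""
    def __init__(self, dim=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # gate starts at 0 => identity map

    def forward(self, x):
        return x + self.alpha * self.f(x)

x = torch.randn(8, 64)
print(ReZeroBlock()(x).shape)  # torch.Size([8, 64])
```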