Cameron R. Wolfe, Ph.D. @cwolferesearch

Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8] pic.twitter.com/ZpBIsiC4FN

2023-04-22 03:13:40
Cameron R. Wolfe, Ph.D. @cwolferesearch

First of all, what is a decoder-only architecture? Well, the architecture is exactly what it sounds like: a transformer architecture with the encoder removed. See the tweet below for more details. [2/8] twitter.com/cwolferesearch…

2023-04-22 03:13:41
Cameron R. Wolfe @cwolferesearch

Large language models (LLMs) are fun to use, but understanding the fundamentals of how they work is also incredibly important. One major idea and building block of LLMs is their underlying architecture: the decoder-only transformer model. 🧵[1/6] pic.twitter.com/XBi5UzOYp4

2023-03-28 05:10:05
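
To make the "encoder removed" idea concrete, here is a minimal sketch of a single decoder-only block in PyTorch: masked self-attention followed by a feed-forward transformation, with residual connections and layer norms. The dimensions and layer arrangement are illustrative assumptions, not taken from any particular model.

```python
# Minimal decoder-only transformer block (illustrative sketch, not any specific model).
import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # residual connection + layer norm
        return x

x = torch.randn(2, 16, 512)              # (batch, sequence, d_model)
print(DecoderOnlyBlock()(x).shape)       # torch.Size([2, 16, 512])
```

A full decoder-only model stacks many such blocks between a token embedding layer and an output projection over the vocabulary.
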
Cameron R. Wolfe, Ph.D. @cwolferesearch

Decoder-only architectures use masked self-attention in each of their layers, meaning that each token considers only preceding tokens during the computation of self-attention. [3/8] twitter.com/cwolferesearch…

2023-04-22 03:13:41
Cameron R. Wolfe @cwolferesearch

Each “block” of a large language model (LLM) is comprised of self-attention and a feed-forward transformation. However, the exact self-attention variant used by LLMs is masked, multi-headed self-attention. Let’s break down what this means…🧵[1/11] pic.twitter.com/rVcSIFydwL

2023-04-09 03:44:34
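
A toy, single-head version of masked self-attention written out explicitly (random tensors stand in for the learned query/key/value projections), showing that the softmax assigns zero weight to every future position.

```python
# Masked (causal) self-attention for a single head, written out explicitly.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 5, 8
q = torch.randn(seq_len, d)                 # queries
k = torch.randn(seq_len, d)                 # keys
v = torch.randn(seq_len, d)                 # values

scores = q @ k.T / d ** 0.5                 # raw attention scores (seq_len x seq_len)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # hide future positions

weights = F.softmax(scores, dim=-1)         # rows sum to 1; upper triangle is exactly 0
output = weights @ v                        # each token mixes only current and past values
print(weights)
```
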
Cameron R. Wolfe, Ph.D. @cwolferesearch

Masked self-attention makes sense for LLMs because they are pretrained using next-token prediction. If self-attention could look at future tokens, the LLM could cheat by just copying future tokens during pretraining to accurately solve next-token prediction. [4/8] pic.twitter.com/nInSChbLCj

2023-04-22 03:13:41
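
A small sketch of why peeking ahead would be cheating: in next-token prediction the targets are just the inputs shifted by one position, so attending to position i+1 would hand the model the label for position i. The random logits below stand in for the output of some decoder-only model.

```python
# Next-token prediction: targets are the inputs shifted left by one position.
import torch
import torch.nn.functional as F

token_ids = torch.tensor([[11, 42, 7, 99, 3]])   # one toy sequence of token ids
inputs = token_ids[:, :-1]                       # model sees:     11 42  7 99
targets = token_ids[:, 1:]                       # model predicts: 42  7 99  3

vocab_size = 128
logits = torch.randn(1, inputs.size(1), vocab_size)   # stand-in for model(inputs)

# Cross-entropy at every position. With bidirectional attention the model could
# read each target directly from the input and drive this loss to ~zero trivially.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)
```
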
Cameron R. Wolfe, Ph.D. @cwolferesearch

But, decoder-only architectures aren’t our only option! Namely, prefix language models adopt a hybrid architecture that applies both masked and bidirectional (or fully-visible) self-attention to different regions of the input sequence. [5/8] pic.twitter.com/yIou73tApI

2023-04-22 03:13:42
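
A sketch of the prefix-LM attention pattern, assuming the first prefix_len positions of the sequence form the prefix: those positions attend bidirectionally to each other, while the remaining positions attend causally (but can still see the whole prefix).

```python
# Prefix-LM attention mask: bidirectional over the prefix, causal over the rest.
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """True = attention allowed, False = blocked."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:, :prefix_len] = True        # every token may see the entire prefix
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
# Rows 0-2 (the prefix) see all prefix tokens; rows 3-5 see the full prefix plus
# only the target tokens that precede them.
```
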
Cameron R. Wolfe, Ph.D. @cwolferesearch

Such an approach allows us to define a textual prefix over which we perform bidirectional self-attention. Then, we can generate output (with the prefix as context) using masked self-attention. Defining a prefix is useful for tasks like translation or conditional generation. [6/8] pic.twitter.com/lM9cEmlwlW

2023-04-22 03:13:42
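
Here is a sketch of how generation with such a prefix might look for a task like translation, where the source sentence is the prefix. The `model` is a hypothetical callable that accepts token ids and an explicit attention mask (not a real library API), and the mask helper repeats the one from the previous sketch so the snippet stands alone.

```python
# Greedy generation with a prefix LM (hypothetical `model`, illustrative only).
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:, :prefix_len] = True                    # full visibility over the prefix
    return mask

def generate(model, prefix_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    ids = prefix_ids.clone()                       # e.g., the source sentence to translate
    for _ in range(max_new_tokens):
        mask = prefix_lm_mask(ids.size(1), prefix_ids.size(1))
        logits = model(ids, attention_mask=mask)   # (batch, seq_len, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy decoding
        ids = torch.cat([ids, next_id], dim=1)
    return ids[:, prefix_ids.size(1):]             # return only the generated tokens
```
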
Cameron R. Wolfe, Ph.D. @cwolferesearch

Encoder-only models also exist and are incredibly effective at solving classification and span prediction tasks (e.g., by using transfer learning as in BERT). But, encoder-only models do not work well for generative tasks, making them less applicable to language modeling. [7/8] pic.twitter.com/73ljAkBc0E

2023-04-22 03:13:42
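
For contrast, a minimal sketch of the encoder-only transfer-learning setup described above, using the Hugging Face transformers library with a bert-base-uncased checkpoint and a two-label classification head (the checkpoint and label count are illustrative choices).

```python
# Encoder-only transfer learning: BERT with a classification head on top.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # e.g., positive / negative sentiment
)  # the classification head is freshly initialized and would be fine-tuned

batch = tokenizer(["the movie was great"], return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits      # (batch, num_labels)
print(logits.softmax(dim=-1))
```

Because every token attends bidirectionally to the full sequence, this setup suits classification and span prediction, but it offers no natural way to generate text left to right.
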
Cameron R. Wolfe, Ph.D. @cwolferesearch

For more details on different variants of the transformer architecture, check out the following overviews of the T5 model, which studies all of these variants via a unified framework.
- T5 (Part One): bit.ly/3oni24b
- T5 (Part Two): bit.ly/3mFlqH1 [8/8]

2023-04-22 03:13:43