Generative Pre-trained Transformers

Natural language processing (NLP) has been transformed by the rise of decoder-based transformers, a cornerstone of the latest breakthroughs in AI language models. These architectures diverge from the traditional encoder-decoder framework of the pioneering Attention Is All You Need paper and focus exclusively on the generative side of language processing. This singular focus has propelled them to the forefront of tasks like text completion, creative writing, and language synthesis, and, most critically, the development of advanced conversational AI.

At the core of these models is the masked self-attention mechanism, which lets the model generate language by attending only to the previous context, a technique known as unidirectional or causal attention. This approach differs from the bidirectional attention of the full encoder-decoder transformer, making it particularly suited to sequential data generation, where the future context is unknown.
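
To make the idea concrete, here is a minimal sketch of single-head causal self-attention in PyTorch. The function name, the single-head simplification, and the toy dimensions are illustrative assumptions rather than the implementation we build later: a lower-triangular mask blanks out the attention scores for future positions before the softmax, so each token can only attend to itself and to earlier tokens.

```python
# Minimal sketch of causal (masked) self-attention, single head.
# Names and dimensions are illustrative, not the final GPT-2 implementation.
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a sequence x of shape (T, d_model)."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project to queries, keys, values
    scores = (q @ k.T) / math.sqrt(d)                 # scaled dot-product scores, shape (T, T)
    mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    weights = F.softmax(scores, dim=-1)               # row i attends only to positions <= i
    return weights @ v                                # weighted sum of value vectors

# Usage: random projections on a toy sequence of 5 tokens with d_model = 8.
d_model = 8
x = torch.randn(5, d_model)
w_q, w_k, w_v = [torch.randn(d_model, d_model) for _ in range(3)]
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])
```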

To this end, we will discuss the architecture of decoder-based transformers through the family of Generative Pre-trained Transformers (GPT). We will mainly review the GPT-2 paper, discuss some key concepts, and then implement a simplified version of the GPT-2 model from scratch.

Table of Contents

Citations