A Complete GPT in 31 Lines of Python, Encoded in Vintage IBM Format
In 2024, Andrej Karpathy released microgpt—a 240-line implementation of GPT that captured the essence of transformer-based language models. It sparked a competition: how much could you compress a working neural network while keeping it functional?
attogpt represents an 87.1% reduction from the original, while maintaining full functionality: autograd, multi-head attention, training, and inference—all in 31 lines and 3,117 characters.
Despite its tiny size, attogpt is a fully functional GPT implementation that includes:

- a scalar autograd engine for backpropagation
- multi-head self-attention
- a complete training loop
- sampling-based inference
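The autograd engine is the heart of any such implementation. Here is a minimal sketch in the spirit of Karpathy's micrograd; it is illustrative only, not attogpt's actual code:

```python
class Value:
    """Scalar value with reverse-mode autodiff (micrograd-style sketch)."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then propagate gradients
        # from the output back to the leaves.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


x, y = Value(2.0), Value(3.0)
z = x * y + x      # z = xy + x
z.backward()
print(x.grad)      # dz/dx = y + 1 = 4.0
print(y.grad)      # dz/dy = x     = 2.0
```

Everything else in a GPT (attention, layer norms, the training loop) is built on top of exactly this kind of scalar machinery.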
After training on a dataset of names for 1,000 steps, it generates plausible new names:
```
sample 1:  kamon
sample 2:  ann
sample 3:  karai
sample 4:  jaire
sample 5:  vialan
sample 6:  karia
sample 7:  yeran
sample 8:  anna
sample 9:  areli
sample 10: kaina
```
To showcase just how compact this code is, I encoded it using the same physical format as 1960s computing: IBM-style punch cards—80 columns × 12 rows of binary information, just like the cards that helped send humans to the moon.
The encoding uses Base64 with 12-bit binary patterns, mapping each character to a unique punch combination. While this approach is modern (not authentic 1960s Hollerith encoding, which only supported uppercase), it preserves the aesthetic and demonstrates that the entire neural network fits on 52 cards—less than a single drawer in a 1960s data center.
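The 52-card figure follows directly from the source size: Base64 expands every 3 bytes into 4 characters, and each card holds 80 columns, one character per column.

```python
import math

SOURCE_CHARS = 3117                            # attogpt's stated size
b64_chars = math.ceil(SOURCE_CHARS / 3) * 4    # Base64: 3 bytes -> 4 chars
cards = math.ceil(b64_chars / 80)              # 80 columns per card
print(b64_chars, cards)                        # 4156 columns -> 52 cards
```

3,117 is an exact multiple of 3, so the Base64 form is exactly 4,156 characters, which fills 51 cards and spills 76 columns onto a 52nd.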
Each card shows the actual punch pattern: a dark cell marks a punched hole, i.e. a binary 1.
The encoding process is fully reversible and uses modern techniques while preserving the aesthetic of 1960s punch cards:
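That reversibility can be sketched in a few lines. The scheme below is a simplified stand-in for attogpt's actual card layout: it Base64-encodes the source and uses each character's 6-bit alphabet index as the (low bits of a) 12-bit column pattern, then maps back.

```python
import base64
import string

# Base64 alphabet; each character's 6-bit index stands in for a
# 12-bit column pattern (simplified -- not attogpt's real mapping).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_cards(source: str, columns: int = 80):
    """Turn source text into 'cards': lists of per-column patterns."""
    chars = base64.b64encode(source.encode()).decode().rstrip("=")
    cols = [B64.index(c) for c in chars]       # one pattern per column
    return [cols[i:i + columns] for i in range(0, len(cols), columns)]

def decode_cards(cards) -> str:
    """Invert encode_cards: patterns -> Base64 -> original text."""
    chars = "".join(B64[p] for card in cards for p in card)
    chars += "=" * (-len(chars) % 4)           # restore Base64 padding
    return base64.b64decode(chars).decode()

program = "print('hello from a punch card')"
cards = encode_cards(program)
assert decode_cards(cards) == program          # fully reversible
```

Drop a deck of these "cards" on the floor and, unlike in the 1960s, a one-liner puts the program back together.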
While this encoding uses the physical 80×12 format of IBM punch cards, it employs modern binary encoding rather than authentic 1960s Hollerith code. This was necessary because Hollerith code defined only uppercase letters, digits, and a handful of symbols, while Python source requires lowercase letters and punctuation that the original character set simply cannot represent.
The goal is to demonstrate scale and historical perspective, not to recreate period-accurate equipment constraints.
Despite the compression, the code maintains identical functionality to the original microgpt, producing the same quality of results.
Beyond the technical novelty, there's something profound about encoding a modern neural network in a format from the 1960s. It's a tangible reminder of how far we've come—and how the fundamental principles of computing remain unchanged.
IBM punch cards were the primary medium for data storage and program input for decades. A single card held 80 characters, one per column. A full drawer? About 2,000 cards, or roughly 156KB. attogpt would barely make a dent.
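A quick sanity check on those figures, assuming 80 characters per card and 2,000 cards per drawer as stated above, with attogpt's 52 cards from the encoding section:

```python
card_chars = 80
drawer_cards = 2000
drawer_kb = drawer_cards * card_chars / 1024
print(drawer_kb)                       # 156.25 KB per drawer

share = 52 / drawer_cards              # attogpt's deck vs. one drawer
print(f"{share:.1%}")                  # 2.6%
```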
The engineers who wrote software on punch cards would likely be amazed that a complete neural network—capable of learning and generating text—could fit on a stack of cards you could hold in one hand.