A Complete GPT in 31 Lines of Python, Encoded in Vintage IBM Format
In 2024, Andrej Karpathy released microgpt—a 240-line implementation of GPT that captured the essence of transformer-based language models. It sparked a competition: how much could you compress a working neural network while keeping it functional?
attogpt represents an 87.1% reduction from the original, while maintaining full functionality: autograd, multi-head attention, training, and inference—all in 31 lines and 3,117 characters.
Despite its tiny size, attogpt is a fully functional GPT implementation that includes:

- a scalar autograd engine for backpropagation
- multi-head self-attention
- a complete training loop
- sampling-based inference
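The autograd engine is the heart of any such implementation. Here is a minimal sketch in the spirit of Karpathy's micrograd; it is illustrative only, not attogpt's actual code:

```python
class Value:
    """Scalar value with reverse-mode autodiff (micrograd-style sketch)."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then propagate gradients
        # from the output back to the leaves.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


x, y = Value(2.0), Value(3.0)
z = x * y + x      # z = xy + x
z.backward()
print(x.grad)      # dz/dx = y + 1 = 4.0
print(y.grad)      # dz/dy = x     = 2.0
```

Everything else in a GPT (attention, layer norms, the training loop) is built on top of exactly this kind of scalar machinery.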
After training on a dataset of names for 1,000 steps, it generates plausible new names:
```
sample 1:  kamon
sample 2:  ann
sample 3:  karai
sample 4:  jaire
sample 5:  vialan
sample 6:  karia
sample 7:  yeran
sample 8:  anna
sample 9:  areli
sample 10: kaina
```
To showcase just how compact this code is, I encoded it using the same physical format as 1960s computing: IBM-style punch cards—80 columns × 12 rows of binary information, just like the cards that helped send humans to the moon.
The encoding uses Base64 with 12-bit binary patterns, mapping each character to a unique punch combination. While this approach is modern (not authentic 1960s Hollerith encoding, which only supported uppercase), it preserves the aesthetic and demonstrates that the entire neural network fits on 52 cards—less than a single drawer in a 1960s data center.
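The 52-card figure follows directly from the source size: Base64 expands every 3 bytes into 4 characters, and each card holds 80 columns, one character per column.

```python
import math

SOURCE_CHARS = 3117                            # attogpt's stated size
b64_chars = math.ceil(SOURCE_CHARS / 3) * 4    # Base64: 3 bytes -> 4 chars
cards = math.ceil(b64_chars / 80)              # 80 columns per card
print(b64_chars, cards)                        # 4156 columns -> 52 cards
```

3,117 is an exact multiple of 3, so the Base64 form is exactly 4,156 characters, which fills 51 cards and spills 76 columns onto a 52nd.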
Each card shows the actual punch pattern: a dark cell marks a punched hole, i.e. a binary 1.
The encoding process is fully reversible and uses modern techniques while preserving the aesthetic of 1960s punch cards:
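That reversibility can be sketched in a few lines. The scheme below is a simplified stand-in for attogpt's actual card layout: it Base64-encodes the source and uses each character's 6-bit alphabet index as the (low bits of a) 12-bit column pattern, then maps back.

```python
import base64
import string

# Base64 alphabet; each character's 6-bit index stands in for a
# 12-bit column pattern (simplified -- not attogpt's real mapping).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_cards(source: str, columns: int = 80):
    """Turn source text into 'cards': lists of per-column patterns."""
    chars = base64.b64encode(source.encode()).decode().rstrip("=")
    cols = [B64.index(c) for c in chars]       # one pattern per column
    return [cols[i:i + columns] for i in range(0, len(cols), columns)]

def decode_cards(cards) -> str:
    """Invert encode_cards: patterns -> Base64 -> original text."""
    chars = "".join(B64[p] for card in cards for p in card)
    chars += "=" * (-len(chars) % 4)           # restore Base64 padding
    return base64.b64decode(chars).decode()

program = "print('hello from a punch card')"
cards = encode_cards(program)
assert decode_cards(cards) == program          # fully reversible
```

Drop a deck of these "cards" on the floor and, unlike in the 1960s, a one-liner puts the program back together.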
While this encoding uses the physical 80×12 format of IBM punch cards, it employs modern binary encoding rather than authentic 1960s Hollerith code. This was necessary because Hollerith code defined only uppercase letters, digits, and a handful of symbols, while Python source requires lowercase letters and punctuation that the original character set simply cannot represent.
The goal is to demonstrate scale and historical perspective, not to recreate period-accurate equipment constraints.
Despite the compression, the code maintains identical functionality to the original microgpt, producing the same quality of results.
Beyond the technical novelty, there's something profound about encoding a modern neural network in a format from the 1960s. It's a tangible reminder of how far we've come—and how the fundamental principles of computing remain unchanged.
IBM punch cards were the primary medium for data storage and program input for decades. A single card held 80 characters, one per column. A full drawer? About 2,000 cards, or roughly 156KB. attogpt would barely make a dent.
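A quick sanity check on those figures, assuming 80 characters per card and 2,000 cards per drawer as stated above, with attogpt's 52 cards from the encoding section:

```python
card_chars = 80
drawer_cards = 2000
drawer_kb = drawer_cards * card_chars / 1024
print(drawer_kb)                       # 156.25 KB per drawer

share = 52 / drawer_cards              # attogpt's deck vs. one drawer
print(f"{share:.1%}")                  # 2.6%
```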
The engineers who wrote software on punch cards would likely be amazed that a complete neural network—capable of learning and generating text—could fit on a stack of cards you could hold in one hand.