attogpt: A Neural Network on Punch Cards 📇

A Complete GPT in 31 Lines of Python, Encoded in Vintage IBM Format

Author: Cédric Caruzzo [GitHub Repository]

The Challenge: How Small Can a GPT Go?

In 2024, Andrej Karpathy released microgpt—a 240-line implementation of GPT that captured the essence of transformer-based language models. It sparked a competition: how much could you compress a working neural network while keeping it functional?

The Compression Journey

  - microgpt: 240 lines (original by Andrej Karpathy)
  - picogpt: 64 lines (shared as a QR code)
  - femtogpt: 53 lines (encoded as a 3,000-digit prime number)
  - attogpt: 31 lines (preserved on 52 IBM punch cards)

attogpt represents an 87.1% reduction from the original, while maintaining full functionality: autograd, multi-head attention, training, and inference—all in 31 lines and 3,117 characters.

By the Numbers

  - 31 lines of code
  - 52 punch cards
  - 3.1 KB total size

What It Actually Does

Despite its tiny size, attogpt is a fully functional GPT implementation: autograd, multi-head attention, a training loop, and inference all fit in the 31 lines.

After training on a dataset of names for 1,000 steps, it generates plausible new names:

sample  1: kamon
sample  2: ann
sample  3: karai
sample  4: jaire
sample  5: vialan
sample  6: karia
sample  7: yeran
sample  8: anna
sample  9: areli
sample 10: kaina
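attogpt's own training and sampling loop isn't reproduced here, but the underlying idea (learn which characters tend to follow which in a corpus of names, then emit a new sequence one character at a time) can be sketched with a toy bigram model. The name list below is invented for illustration and is not attogpt's training data:

```python
import random
from collections import defaultdict

# Minimal character-bigram sketch of the idea attogpt implements with a
# transformer: count character transitions in names, then sample new ones.
# This is an illustration, not attogpt's actual code.
names = ["anna", "karen", "kamila", "maria", "elena"]

counts = defaultdict(lambda: defaultdict(int))
for name in names:
    chars = "." + name + "."            # "." marks start and end of a name
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def sample(rng: random.Random) -> str:
    out, ch = [], "."
    while True:
        nxt, weights = zip(*counts[ch].items())
        ch = rng.choices(nxt, weights=weights)[0]
        if ch == ".":                   # end-of-name marker: stop
            return "".join(out)
        out.append(ch)

rng = random.Random(0)
print([sample(rng) for _ in range(5)])
```

A transformer replaces the raw bigram counts with learned, context-dependent next-character probabilities, but the sampling loop is essentially the same.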

The Punch Card Format

To showcase just how compact this code is, I encoded it using the same physical format as 1960s computing: IBM-style punch cards—80 columns × 12 rows of binary information, just like the cards that helped send humans to the moon.

The encoding uses Base64 with 12-bit binary patterns, mapping each character to a unique punch combination. While this approach is modern (not authentic 1960s Hollerith encoding, which only supported uppercase), it preserves the aesthetic and demonstrates that the entire neural network fits on 52 cards—less than a single drawer in a 1960s data center.

Explore the Punch Cards

Each card shows the actual punch hole patterns. Black holes represent binary 1s.


Technical Details

Encoding Approach

The encoding process is fully reversible and uses modern techniques while preserving the aesthetic of 1960s punch cards:

  1. Base64 Conversion: Python code → UTF-8 bytes → Base64 string
  2. Binary Punch Mapping: Each Base64 character → unique 12-bit binary pattern
  3. Card Layout: 80 characters per card, 12 possible punch positions per column
  4. Decoding: Punch patterns → Base64 → UTF-8 → working Python code

Historical Note

While this encoding uses the physical 80×12 format of IBM punch cards, it employs modern binary encoding rather than authentic 1960s Hollerith code, which supported only uppercase letters, digits, and a limited set of symbols and so could not represent Python source directly.

The goal is to demonstrate scale and historical perspective, not to recreate period-accurate equipment constraints.

Key Compression Techniques

Despite the compression, the code maintains identical functionality to the original microgpt, producing the same quality of results.
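The write-up doesn't enumerate the specific techniques, but typical code-golf moves for cutting line count in Python include semicolon chaining, comprehensions in place of loops, and collapsing small functions into lambdas. A hypothetical before/after (not attogpt's code) showing one such move:

```python
# Readable version: mean of squared values, four lines of body.
def mean_sq_readable(xs):
    total = 0
    for x in xs:
        total += x * x
    return total / len(xs)

# Compressed version: one line, identical behavior.
mean_sq = lambda xs: sum(x * x for x in xs) / len(xs)

# Both compute the same result.
assert mean_sq_readable([1, 2, 3]) == mean_sq([1, 2, 3]) == 14 / 3
```

Applied consistently across a 240-line implementation, moves like this account for most of the reduction without changing behavior.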

Why Punch Cards?

Beyond the technical novelty, there's something profound about encoding a modern neural network in a format from the 1960s. It's a tangible reminder of how far we've come—and how the fundamental principles of computing remain unchanged.

IBM punch cards were the primary medium for data storage and program input for decades. A single card could hold 80 bytes. A full drawer? About 2,000 cards, or roughly 156KB. attogpt would barely make a dent.

The engineers who wrote software on punch cards would likely be amazed that a complete neural network—capable of learning and generating text—could fit on a stack of cards you could hold in one hand.
