
🧠 MicroGPT — The Most Atomic GPT Implementation

🚧 This project is a work in progress. Features, optimizations, and additional language ports may be added over time.

A multi-language port of Andrej Karpathy's beautifully minimal GPT trainer. One file per language. Zero (or near-zero) external dependencies. The complete algorithm — everything else is just efficiency.


📖 Overview

MicroGPT implements a complete GPT-2 style language model from scratch — including autograd, training, and inference — in a single file per language. It trains on a character-level dataset (names by default) and learns to generate novel, hallucinated names.

The project comprises the original Python implementation and faithful ports to seven other languages:

| Language | File | Dependencies | Compile / Run |
|:---|:---|:---|:---|
| 🐍 Python | microgpt.py | None (stdlib only) | python microgpt.py |
| ⚙️ C++ | microgpt.cpp | STL only | g++ -std=c++17 -O2 -o microgpt microgpt.cpp && ./microgpt |
| 🦀 Rust | microgpt.rs | std only | rustc -O microgpt.rs -o microgpt && ./microgpt |
| 🍎 Swift | microgpt.swift | Foundation only | swiftc -O microgpt.swift -o microgpt && ./microgpt |
| 🎯 Dart | microgpt.dart | dart:io + dart:math | dart compile exe microgpt.dart -o microgpt && ./microgpt |
| 🟣 Kotlin | microgpt.kt | kotlin.math + java.io | kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar |
| 🟡 JavaScript | microgpt.js | fs (Node.js built-in) | node microgpt.js |
| 🔵 Go | microgpt.go | stdlib only | go run microgpt.go |

Key principle: Every port preserves the exact same structure, variable naming conventions, and algorithmic flow as the original Python. Reading any version is reading the same algorithm.


πŸ—οΈ Architecture

The algorithm is structured into six cleanly separated sections, identical across all languages:

1. Data Loading & Tokenization

input.txt → list of documents → character-level tokenizer
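The pipeline above can be sketched in a few lines of Python. This is an illustrative sketch only: the identifiers (docs, stoi, itos) and the choice of id 0 for the BOS delimiter are assumptions, not necessarily microgpt's exact conventions.

```python
# Character-level tokenizer sketch (names and BOS convention are assumptions)
docs = ["emma", "olivia", "ava"]                  # stand-in for the lines of input.txt
chars = sorted(set("".join(docs)))                # unique characters in the corpus
BOS = 0                                           # delimiter token id (assumed)
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}          # token id -> char
vocab_size = len(chars) + 1                       # +1 for the BOS token

def encode(doc):
    # wrap each document in BOS delimiters
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

print(encode("ava"))  # [0, 1, 7, 1, 0]
```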

2. Autograd Engine (Value)

A scalar-valued autograd system that tracks computation graphs and computes gradients via reverse-mode differentiation (backpropagation).

Supported operations:

| Operation | Python | C++ | Rust | Swift | Dart | Kotlin | JavaScript | Go |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| Addition | a + b | a + b | a.add(&b) | a + b | a + b | a + b | a.add(b) | Add(a,b) |
| Multiplication | a * b | a * b | a.mul(&b) | a * b | a * b | a * b | a.mul(b) | Mul(a,b) |
| Power | a ** n | vpow(a, n) | a.vpow(n) | a.vpow(n) | a.vpow(n) | a.vpow(n) | a.pow(n) | Pow(a,n) |
| Log | a.log() | vlog(a) | a.vlog() | a.vlog() | a.vlog() | a.vlog() | a.log() | Log(a) |
| Exp | a.exp() | vexp(a) | a.vexp() | a.vexp() | a.vexp() | a.vexp() | a.exp() | Exp(a) |
| ReLU | a.relu() | vrelu(a) | a.relu() | a.relu() | a.relu() | a.relu() | a.relu() | Relu(a) |
| Negation | -a | -a | a.neg() | -a | -a | -a | a.neg() | Neg(a) |
| Subtraction | a - b | a - b | a.sub(&b) | a - b | a - b | a - b | a.sub(b) | Sub(a,b) |
| Division | a / b | a / b | a.div(&b) | a / b | a / b | a / b | a.div(b) | Div(a,b) |

The backward pass uses topological sorting (DFS) to propagate gradients through the computation graph — the chain rule, applied recursively.
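This is the same machinery as Karpathy's micrograd. A minimal sketch with just addition and multiplication illustrates the idea (names and details here are illustrative, not the exact microgpt Value class):

```python
class Value:
    """Scalar autograd node: stores data, grad, and a local backward rule."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad         # d(a+b)/da = 1
            other.grad += out.grad        # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # topological order via DFS, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a
loss.backward()
print(a.grad, b.grad)  # dL/da = b + 1 = 4.0, dL/db = a = 2.0
```

Note that a appears twice in the graph; the gradient accumulation (+=) handles shared nodes correctly.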

3. Parameter Initialization

Weights are initialized as 2D matrices of Value nodes, drawn from a Gaussian distribution (mean=0, std=0.08):

state_dict:
  wte        → [vocab_size × n_embd]      Token embeddings
  wpe        → [block_size × n_embd]      Position embeddings
  lm_head    → [vocab_size × n_embd]      Output projection

  Per layer:
    attn_wq  → [n_embd × n_embd]          Query projection
    attn_wk  → [n_embd × n_embd]          Key projection
    attn_wv  → [n_embd × n_embd]          Value projection
    attn_wo  → [n_embd × n_embd]          Output projection
    mlp_fc1  → [4·n_embd × n_embd]        MLP up-projection
    mlp_fc2  → [n_embd × 4·n_embd]        MLP down-projection
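A minimal sketch of this initialization, with plain floats standing in for Value nodes and a fixed seed for reproducibility (the helper name and seed are assumptions, not microgpt's exact code):

```python
import random

rng = random.Random(42)  # fixed seed; illustrative, not microgpt's RNG

def init_matrix(nout, nin, std=0.08):
    # 2D matrix of weights drawn from a Gaussian with mean 0, std 0.08
    return [[rng.gauss(0.0, std) for _ in range(nin)] for _ in range(nout)]

vocab_size, n_embd, block_size = 27, 16, 16
state_dict = {
    "wte": init_matrix(vocab_size, n_embd),      # token embeddings
    "wpe": init_matrix(block_size, n_embd),      # position embeddings
    "lm_head": init_matrix(vocab_size, n_embd),  # output projection
}
```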

Default hyperparameters:

| Parameter | Value | Description |
|:---|:---|:---|
| n_embd | 16 | Embedding dimension |
| n_head | 4 | Number of attention heads |
| n_layer | 1 | Number of transformer layers |
| block_size | 16 | Maximum sequence length |
| head_dim | 4 | Dimension per head (n_embd / n_head) |

4. Model Architecture

Follows GPT-2 with minor simplifications:

                    ┌──────────────────────────────────┐
                    │         Token + Position         │
                    │            Embedding             │
                    └────────────────┬─────────────────┘
                                     │
                                ┌────▼────┐
                                │ RMSNorm │
                                └────┬────┘
                                     │
                    ┌────────────────▼─────────────────┐
                    │    Multi-Head Self-Attention     │
                    │  (Q, K, V projections + output)  │
                    │       with causal KV-cache       │
                    └────────────────┬─────────────────┘
                                     │ + residual
                                ┌────▼────┐
                                │ RMSNorm │
                                └────┬────┘
                                     │
                    ┌────────────────▼─────────────────┐
                    │            MLP Block             │
                    │        FC1 → ReLU → FC2          │
                    └────────────────┬─────────────────┘
                                     │ + residual
                                ┌────▼────┐
                                │ lm_head │
                                └────┬────┘
                                     │
                                [ logits ]

Differences from standard GPT-2:

- RMSNorm in place of LayerNorm
- ReLU in place of GELU in the MLP
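The blocks in the diagram reduce to a few small functions. Below is a hedged sketch of RMSNorm and one autoregressive attention step with a KV cache, using plain floats in place of Value nodes (function names, weight layout, and the row-major matvec convention are assumptions, not the exact microgpt code):

```python
import math

def rmsnorm(x, eps=1e-5):
    # normalize by the root of the mean of squares
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]

def matvec(W, x):
    # W is a list of rows: [n_out x n_in]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attention_step(x, wq, wk, wv, wo, kv_cache, n_head):
    # One token's attention: append this position's K/V to the cache and
    # attend over all cached positions. Causality is implicit, since the
    # cache only ever contains past and current tokens.
    head_dim = len(x) // n_head
    q, k, v = matvec(wq, x), matvec(wk, x), matvec(wv, x)
    kv_cache.append((k, v))
    out = []
    for h in range(n_head):
        s = h * head_dim
        e = s + head_dim
        # scaled dot-product scores against every cached key
        scores = [sum(a * b for a, b in zip(q[s:e], kc[s:e])) / math.sqrt(head_dim)
                  for kc, _ in kv_cache]
        m = max(scores)                       # numerically stable softmax
        exps = [math.exp(sc - m) for sc in scores]
        total = sum(exps)
        att = [ex / total for ex in exps]
        # weighted sum of cached values for this head
        out.extend(sum(a * vc[s + i] for a, (_, vc) in zip(att, kv_cache))
                   for i in range(head_dim))
    return matvec(wo, out)
```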

5. Training Loop

For each step (1..1000):
  1. Pick a document, tokenize it with BOS delimiters
  2. Forward each token through the model (autoregressive)
  3. Compute cross-entropy loss against next-token targets
  4. Backward pass (compute all gradients)
  5. Adam optimizer update with:
     - Linear learning rate decay
     - Bias-corrected first moment (β₁ = 0.85)
     - Bias-corrected second moment (β₂ = 0.99)
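Step 5 can be sketched as a single function over flat lists of parameters and gradients, using the hyperparameter values from this section (the function shape itself is illustrative, not microgpt's exact code):

```python
import math

def adam_step(params, grads, m, v, t, steps=1000, lr0=0.01,
              beta1=0.85, beta2=0.99, eps=1e-8):
    # One Adam update at step t (0-indexed) with linear LR decay.
    lr = lr0 * (1 - t / steps)                     # decays linearly to zero
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # first moment (EMA of grad)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # second moment (EMA of grad^2)
        m_hat = m[i] / (1 - beta1 ** (t + 1))      # bias correction
        v_hat = v[i] / (1 - beta2 ** (t + 1))
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# one update on a single toy parameter
params, grads, m, v = [1.0], [0.5], [0.0], [0.0]
adam_step(params, grads, m, v, t=0)
```

With bias correction, the very first step moves the parameter by roughly the full learning rate regardless of the raw moment magnitudes.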

Optimizer hyperparameters:

| Parameter | Value |
|:---|:---|
| Learning rate | 0.01 (linearly decayed to 0) |
| β₁ | 0.85 |
| β₂ | 0.99 |
| ε | 1e-8 |
| Steps | 1000 |

6. Inference

After training, the model generates 20 new samples:

For each sample:
  1. Start with BOS token
  2. Feed through model to get next-token probabilities
  3. Apply temperature scaling (0.5 — slightly conservative)
  4. Sample from the probability distribution
  5. Repeat until BOS is generated or block_size is reached
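Steps 2–4 amount to a temperature-scaled softmax followed by multinomial sampling. A sketch (the seeded RNG and function name are assumptions for illustration):

```python
import math
import random

rng = random.Random(0)  # seeded for reproducibility; illustrative only

def sample(logits, temperature=0.5):
    # Divide logits by the temperature: values < 1 sharpen the distribution
    # toward likelier tokens, values > 1 flatten it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # numerically stable softmax
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # multinomial draw: walk the CDF until it exceeds a uniform random number
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1                    # guard against rounding
```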

⚡ Language-Specific Design Decisions

C++ (microgpt.cpp)

Rust (microgpt.rs)

Swift (microgpt.swift)

Dart (microgpt.dart)

Kotlin (microgpt.kt)

JavaScript (microgpt.js)

Go (microgpt.go)


📂 Project Structure

microgpt/
├── input.txt           # Training dataset (32K+ names)
├── microgpt.py         # Original Python implementation
├── microgpt.cpp        # C++17 port
├── microgpt.rs         # Rust port
├── microgpt.swift      # Swift port
├── microgpt.dart       # Dart port
├── microgpt.kt         # Kotlin port
├── microgpt.js         # JavaScript (Node.js) port
├── microgpt.go         # Go port
├── benchmark.sh        # All-languages benchmark script
└── README.md           # This file

🚀 Quick Start

Prerequisites

To download the training dataset (input.txt):

curl -o input.txt https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt

Run any version

# Python (original)
python microgpt.py

# C++
g++ -std=c++17 -O2 -o microgpt_cpp microgpt.cpp && ./microgpt_cpp

# Rust
rustc -O microgpt.rs -o microgpt_rs && ./microgpt_rs

# Swift
swiftc -O microgpt.swift -o microgpt_swift && ./microgpt_swift

# Dart
dart compile exe microgpt.dart -o microgpt && ./microgpt

# Kotlin
kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar

# JavaScript (Node.js)
node microgpt.js

# Go
go run microgpt.go

Expected Output

num docs: 32033
vocab size: 27
num params: 7185
step    1 / 1000 | loss 3.2958
step    2 / 1000 | loss 3.2730
...
step 1000 / 1000 | loss 2.1042

--- inference (new, hallucinated names) ---
sample  1: Maliya
sample  2: Kori
sample  3: Aden
...

Note: Exact output values may vary slightly between languages due to differences in floating-point arithmetic and RNG implementations, but the overall behavior and convergence should be consistent.


⏱️ Benchmarking

A benchmark script is included that compiles and runs all implementations with maximum optimizations, then prints a ranked comparison table.

bash benchmark.sh

This compiles each language with its best optimization flags (-O3 for C++, -O for Rust/Swift, AOT for Dart, etc.), runs all 1000 training steps, and produces a results table like:

╔═══════════════════════════════════════════════════════════════════════╗
║                           BENCHMARK RESULTS                           ║
╠═══════════════════════════════════════════════════════════════════════╣
║  Rank │ Language              │ Compile (s) │  Run (s)  │ vs Fastest  ║
║  ─────┼───────────────────────┼─────────────┼───────────┼─────────────║
║    #1 │ Rust (rustc -O)       │          3s │      6s   │      1.00x  ║
║    #2 │ C++ (g++ -O3)         │          1s │      7s   │      1.17x  ║
╚═══════════════════════════════════════════════════════════════════════╝

Note: The above is a sample β€” actual timings depend on your hardware. Missing toolchains are gracefully skipped.


📊 Dependency Comparison

| Language | External Deps | Total Imports |
|:---|:---|:---|
| Python | 0 | os, math, random |
| C++ | 0 | `<iostream>`, `<fstream>`, `<vector>`, `<cmath>`, `<random>`, `<algorithm>`, `<set>`, `<map>`, `<string>`, `<sstream>`, `<memory>`, `<functional>`, `<numeric>` |
| Rust | 0 | std::cell, std::collections, std::fs, std::rc |
| Swift | 0 | Foundation |
| Dart | 0 | dart:io, dart:math |
| Kotlin | 0 | java.io.File, kotlin.math.* |
| JavaScript | 0 | fs (Node.js built-in) |
| Go | 0 | fmt, math, os, bufio, sort, strings |

Every implementation achieves zero external dependencies — only standard library / language built-ins are used.


🤔 Why This Exists

This project demonstrates that:

  1. A GPT can be implemented in ~200-300 lines in any modern language
  2. Autograd doesn't require a framework — it's just the chain rule + a topological sort
  3. The algorithm is language-agnostic — the same structure maps naturally to C++, Rust, Swift, Dart, Kotlin, JavaScript, and Go
  4. Dependencies are optional — even RNG and Gaussian sampling can be done from scratch

It's a powerful educational tool for understanding transformers at the most fundamental level.


πŸ™ Credits & Acknowledgments

Massive shoutout to Andrej Karpathy for the original Python implementation. His work on making deep learning accessible — from micrograd to nanoGPT to this beautifully minimal microgpt.py — has been an incredible gift to the ML community. This multi-language port is a tribute to the elegance of his original algorithm. 🎉


"This file is the complete algorithm. Everything else is just efficiency."
— Andrej Karpathy