🚧 This project is a work in progress. Features, optimizations, and additional language ports may be added over time.
A multi-language port of Andrej Karpathy's beautifully minimal GPT trainer. One file per language. Zero (or near-zero) external dependencies. The complete algorithm: everything else is just efficiency.
MicroGPT implements a complete GPT-2 style language model from scratch, including autograd, training, and inference, in a single file per language. It trains on a character-level dataset (names by default) and learns to generate novel, hallucinated names.
The project is a faithful port of the original Python implementation across eight languages:
| Language | File | Dependencies | Compile / Run |
|---|---|---|---|
| 🐍 Python | `microgpt.py` | None (stdlib only) | `python microgpt.py` |
| ⚙️ C++ | `microgpt.cpp` | STL only | `g++ -std=c++17 -O2 -o microgpt microgpt.cpp && ./microgpt` |
| 🦀 Rust | `microgpt.rs` | std only | `rustc -O microgpt.rs -o microgpt && ./microgpt` |
| 🐦 Swift | `microgpt.swift` | Foundation only | `swiftc -O microgpt.swift -o microgpt && ./microgpt` |
| 🎯 Dart | `microgpt.dart` | dart:io + dart:math | `dart compile exe microgpt.dart -o microgpt && ./microgpt` |
| 🟣 Kotlin | `microgpt.kt` | kotlin.math + java.io | `kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar` |
| 🟡 JavaScript | `microgpt.js` | fs (Node.js built-in) | `node microgpt.js` |
| 🔵 Go | `microgpt.go` | stdlib only | `go run microgpt.go` |
Key principle: Every port preserves the exact same structure, variable naming conventions, and algorithmic flow as the original Python. Reading any version is reading the same algorithm.
The algorithm is structured into six cleanly separated sections, identical across all languages:
input.txt → list of documents → character-level tokenizer
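The pipeline above can be sketched in a few lines of Python. This is an illustrative sketch, not the exact code from `microgpt.py`; the variable names (`docs`, `stoi`, `encode`) are assumptions.

```python
# Character-level tokenizer sketch (illustrative; names and details are
# assumptions, not the exact code from microgpt.py).
docs = ["emma", "olivia", "ava"]  # stand-in for the lines of input.txt

chars = sorted(set("".join(docs)))          # unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id in 0..n-1
itos = {i: ch for ch, i in stoi.items()}
BOS = len(chars)                            # one extra id as delimiter

def encode(doc):
    # BOS marks both the start and the end of a document
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)
```

With 26 lowercase letters plus the BOS delimiter, this layout matches the vocab size of 27 reported in the sample output.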
- input.txt (a newline-separated list of names) is split into documents
- the tokenizer maps each unique character to an integer id in 0..n-1
- the RNG is seeded (42) for reproducibility

**Autograd engine (`Value`):** A scalar-valued autograd system that tracks computation graphs and computes gradients via reverse-mode differentiation (backpropagation).
Supported operations:
| Operation | Python | C++ | Rust | Swift | Dart | Kotlin | JavaScript | Go |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| Addition | a + b | a + b | a.add(&b) | a + b | a + b | a + b | a.add(b) | Add(a,b) |
| Multiplication | a * b | a * b | a.mul(&b) | a * b | a * b | a * b | a.mul(b) | Mul(a,b) |
| Power | a ** n | vpow(a, n) | a.vpow(n) | a.vpow(n) | a.vpow(n) | a.vpow(n) | a.pow(n) | Pow(a,n) |
| Log | a.log() | vlog(a) | a.vlog() | a.vlog() | a.vlog() | a.vlog() | a.log() | Log(a) |
| Exp | a.exp() | vexp(a) | a.vexp() | a.vexp() | a.vexp() | a.vexp() | a.exp() | Exp(a) |
| ReLU | a.relu() | vrelu(a) | a.relu() | a.relu() | a.relu() | a.relu() | a.relu() | Relu(a) |
| Negation | -a | -a | a.neg() | -a | -a | -a | a.neg() | Neg(a) |
| Subtraction | a - b | a - b | a.sub(&b) | a - b | a - b | a - b | a.sub(b) | Sub(a,b) |
| Division | a / b | a / b | a.div(&b) | a / b | a / b | a / b | a.div(b) | Div(a,b) |
Backward pass uses topological sorting (DFS) to propagate gradients through the computation graph: the chain rule, applied recursively.
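The core of such an engine fits in a few dozen lines. The sketch below is illustrative, in the spirit of the `Value` class described above; the exact fields and operator set in `microgpt.py` may differ.

```python
# Minimal scalar autograd sketch (illustrative, not the source code).
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # inputs in the computation graph
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Build a topological order via DFS, then apply the chain rule
        # in reverse, accumulating gradients into each node.
        topo, visited = [], set()
        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
c = a * b + a          # dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
```

Because `a` appears twice in the expression, its gradient accumulates across both paths, which is exactly why the backward pass uses `+=` rather than assignment.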
Weights are initialized as 2D matrices of Value nodes, drawn from a Gaussian distribution (mean=0, std=0.08):
```
state_dict:
  wte     → [vocab_size × n_embd]    Token embeddings
  wpe     → [block_size × n_embd]    Position embeddings
  lm_head → [vocab_size × n_embd]    Output projection

Per layer:
  attn_wq → [n_embd × n_embd]        Query projection
  attn_wk → [n_embd × n_embd]        Key projection
  attn_wv → [n_embd × n_embd]        Value projection
  attn_wo → [n_embd × n_embd]        Output projection
  mlp_fc1 → [4·n_embd × n_embd]      MLP up-projection
  mlp_fc2 → [n_embd × 4·n_embd]      MLP down-projection
```
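A sketch of this initialization, with plain floats standing in for `Value` nodes (an assumption for brevity; the real matrices wrap each entry in a `Value` so gradients can flow):

```python
import random

# Gaussian weight initialization sketch (mean 0, std 0.08), following the
# state_dict layout above. Illustrative, not the exact source code.
random.seed(42)

def matrix(rows, cols, std=0.08):
    return [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

vocab_size, block_size, n_embd = 27, 16, 16
state_dict = {
    "wte": matrix(vocab_size, n_embd),      # token embeddings
    "wpe": matrix(block_size, n_embd),      # position embeddings
    "lm_head": matrix(vocab_size, n_embd),  # output projection
    # per layer (n_layer = 1):
    "attn_wq": matrix(n_embd, n_embd),
    "mlp_fc1": matrix(4 * n_embd, n_embd),  # MLP up-projection
}
```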
Default hyperparameters:
| Parameter | Value | Description |
|---|---|---|
| `n_embd` | 16 | Embedding dimension |
| `n_head` | 4 | Number of attention heads |
| `n_layer` | 1 | Number of transformer layers |
| `block_size` | 16 | Maximum sequence length |
| `head_dim` | 4 | Dimension per head (n_embd / n_head) |
Follows GPT-2 with minor simplifications:
```
┌───────────────────────────────────┐
│    Token + Position Embedding     │
└─────────────────┬─────────────────┘
                  │
             ┌────▼────┐
             │ RMSNorm │
             └────┬────┘
                  │
┌─────────────────▼─────────────────┐
│     Multi-Head Self-Attention     │
│  (Q, K, V projections + output)   │
│       with causal KV-cache        │
└─────────────────┬─────────────────┘
                  │  + residual
             ┌────▼────┐
             │ RMSNorm │
             └────┬────┘
                  │
┌─────────────────▼─────────────────┐
│             MLP Block             │
│         FC1 → ReLU → FC2          │
└─────────────────┬─────────────────┘
                  │  + residual
             ┌────▼────┐
             │ lm_head │
             └────┬────┘
                  │
             [ logits ]
```
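The attention block with its causal KV-cache can be sketched for a single head; this is an illustrative simplification (the real model runs `n_head` of these with separate learned projections and concatenates the results):

```python
import math

# Single-head causal attention with a KV-cache, processing one new token
# at a time (illustrative sketch, not the source code).
def attend(q, k_cache, v_cache, k_new, v_new):
    k_cache.append(k_new)   # the cache grows by one entry per token, so
    v_cache.append(v_new)   # past keys/values are never recomputed
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_cache]            # scaled dot-product scores
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]    # softmax over past positions
    # weighted sum over the cached value vectors
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) for j in range(d)]

k_cache, v_cache = [], []
out = attend([1.0, 0.0], k_cache, v_cache, [1.0, 0.0], [5.0, 5.0])
```

Note that no explicit causal mask is needed: because the cache only ever contains past positions, attention is causal by construction.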
Differences from standard GPT-2 include RMSNorm in place of LayerNorm and ReLU in place of GELU in the MLP, as shown in the diagram above.
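RMSNorm is one of those simplifications: it normalizes by the root-mean-square of the activations, with no mean subtraction. The sketch below additionally omits any learned scale, which is an assumption about this implementation:

```python
import math

# RMSNorm sketch (illustrative): scale activations so their
# root-mean-square is ~1; simpler than GPT-2's LayerNorm.
def rmsnorm(x, eps=1e-5):
    ms = sum(v * v for v in x) / len(x)   # mean of squares
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale for v in x]

out = rmsnorm([1.0, 2.0, 3.0, 4.0])
```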
For each step (1..1000):
1. Pick a document, tokenize it with BOS delimiters
2. Forward each token through the model (autoregressive)
3. Compute cross-entropy loss against next-token targets
4. Backward pass (compute all gradients)
5. Adam optimizer update with:
- Linear learning rate decay
- Bias-corrected first moment (β₁ = 0.85)
- Bias-corrected second moment (β₂ = 0.99)
Optimizer hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 0.01 (linearly decayed to 0) |
| β₁ | 0.85 |
| β₂ | 0.99 |
| ε | 1e-8 |
| Steps | 1000 |
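One update step with these hyperparameters can be sketched as follows (illustrative; the source presumably applies this per parameter inside the training loop):

```python
# One Adam update with bias correction and linear learning-rate decay,
# using the hyperparameters above. Illustrative sketch, not the source.
def adam_step(p, g, m, v, t, num_steps=1000,
              lr0=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    lr = lr0 * (1.0 - t / num_steps)        # linear decay toward 0
    m = beta1 * m + (1 - beta1) * g         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g     # second moment (variance)
    m_hat = m / (1 - beta1 ** (t + 1))      # bias correction (t is 0-based)
    v_hat = v / (1 - beta2 ** (t + 1))
    p -= lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v
```

The bias correction matters most early on: at `t = 0` the raw moments are scaled down by `(1 - β)`, and dividing by `1 - β**(t+1)` undoes that shrinkage.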
After training, the model generates 20 new samples:
For each sample:
1. Start with BOS token
2. Feed through model to get next-token probabilities
3. Apply temperature scaling (0.5, slightly conservative)
4. Sample from the probability distribution
5. Repeat until BOS is generated or block_size is reached
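Steps 3 and 4 can be sketched as follows (an illustrative sketch; the helper name and RNG choice are assumptions, not the source code):

```python
import math
import random

# Temperature-scaled sampling sketch: divide logits by T before softmax;
# T = 0.5 sharpens the distribution toward the most likely tokens.
rng = random.Random(42)

def sample(logits, rng, temperature=0.5):
    scaled = [l / temperature for l in logits]
    mx = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - mx) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0             # inverse-CDF categorical draw
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1                  # guard against rounding drift

token = sample([2.0, 1.0, 0.1], rng)
```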
Language-specific implementation notes:

**C++ (`microgpt.cpp`)**
- `std::shared_ptr<Value>` enables shared graph node references
- `operator+`, `operator*`, etc., for natural mathematical syntax
- `std::mt19937` (Mersenne Twister) + `std::normal_distribution` + `std::discrete_distribution`
- C++17 (`-std=c++17`) for structured bindings

**Rust (`microgpt.rs`)**
- `Rc<RefCell<ValueInner>>`: Rust's interior mutability pattern for shared mutable nodes
- std-only, including a hand-rolled xoshiro256** PRNG

**Swift (`microgpt.swift`)**
- Reference semantics via the `class` keyword
- `+`, `*`, `-`, `/` operators matching Python's syntax
- `ObjectIdentifier` for the visited set in topological sort

**Dart (`microgpt.dart`)**
- `operator +`, `operator *`, `operator -`, `operator /`
- `vpow()`, `vlog()`, `vexp()`, `relu()` for operations Dart can't overload
- Character access via `string[i]`

**Kotlin (`microgpt.kt`)**
- Operator overloading (`plus`, `times`, `minus`, `div`, `unaryMinus`)
- `data class LayerWeights` for clean per-layer storage
- `Long` (64-bit signed) PRNG state + Box-Muller, using `Long` literals

**JavaScript (`microgpt.js`)**
- Method chaining (`.add()`, `.mul()`, `.sub()`, `.div()`)
- `BigInt` for 64-bit precision + Box-Muller
- `Float64Array` for Adam optimizer buffers (memory-efficient)

**Go (`microgpt.go`)**
- Pointers (`*Value`) for shared graph references
- Exported functions (`Add`, `Mul`, `Pow`, `Log`, `Exp`, `Relu`, etc.)
- `uint64` PRNG + Box-Muller for Gaussian sampling
- `Rng.Shuffle` with a swap callback (matches the `sort.Slice` idiom)
- Exported fields (`Data`, `Grad`), unexported internals (`children`, `localGrads`)

```
microgpt/
├── input.txt        # Training dataset (32K+ names)
├── microgpt.py      # Original Python implementation
├── microgpt.cpp     # C++17 port
├── microgpt.rs      # Rust port
├── microgpt.swift   # Swift port
├── microgpt.dart    # Dart port
├── microgpt.kt      # Kotlin port
├── microgpt.js      # JavaScript (Node.js) port
├── microgpt.go      # Go port
├── benchmark.sh     # All-languages benchmark script
└── README.md        # This file
```
The only requirement is an `input.txt` file in the working directory (the Python version auto-downloads it; other versions expect it to exist). To auto-download `input.txt`:
curl -o input.txt https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt
# Python (original)
python microgpt.py
# C++
g++ -std=c++17 -O2 -o microgpt_cpp microgpt.cpp && ./microgpt_cpp
# Rust
rustc -O microgpt.rs -o microgpt_rs && ./microgpt_rs
# Swift
swiftc -O microgpt.swift -o microgpt_swift && ./microgpt_swift
# Dart
dart compile exe microgpt.dart -o microgpt && ./microgpt
# Kotlin
kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar
# JavaScript (Node.js)
node microgpt.js
# Go
go run microgpt.go
num docs: 32033
vocab size: 27
num params: 7185
step 1 / 1000 | loss 3.2958
step 2 / 1000 | loss 3.2730
...
step 1000 / 1000 | loss 2.1042
--- inference (new, hallucinated names) ---
sample 1: Maliya
sample 2: Kori
sample 3: Aden
...
Note: Exact output values may vary slightly between languages due to differences in floating-point arithmetic and RNG implementations, but the overall behavior and convergence should be consistent.
A benchmark script is included that compiles and runs all implementations with maximum optimizations, then prints a ranked comparison table.
bash benchmark.sh
This compiles each language with its best optimization flags (-O3 for C++, -O for Rust/Swift, AOT for Dart, etc.), runs all 1000 training steps, and produces a results table like:
```
╔═══════════════════════════════════════════════════════════════╗
║                       BENCHMARK RESULTS                       ║
╠══════╤═══════════════════╤═════════════╤═════════╤════════════╣
║ Rank │ Language          │ Compile (s) │ Run (s) │ vs Fastest ║
╟──────┼───────────────────┼─────────────┼─────────┼────────────╢
║ #1   │ Rust (rustc -O)   │ 3s          │ 6s      │ 1.00x      ║
║ #2   │ C++ (g++ -O3)     │ 1s          │ 7s      │ 1.17x      ║
╚══════╧═══════════════════╧═════════════╧═════════╧════════════╝
```
Note: The above is a sample; actual timings depend on your hardware. Missing toolchains are gracefully skipped.
| Language | External Deps | Total Imports |
|---|---|---|
| Python | 0 | os, math, random |
| C++ | 0 | <iostream>, <fstream>, <vector>, <cmath>, <random>, <algorithm>, <set>, <map>, <string>, <sstream>, <memory>, <functional>, <numeric> |
| Rust | 0 | std::cell, std::collections, std::fs, std::rc |
| Swift | 0 | Foundation |
| Dart | 0 | dart:io, dart:math |
| Kotlin | 0 | java.io.File, kotlin.math.* |
| JavaScript | 0 | fs (Node.js built-in) |
| Go | 0 | fmt, math, os, bufio, sort, strings |
Every implementation achieves zero external dependencies: only standard library / language built-ins are used.
This project demonstrates that a complete GPT, from autograd through training to sampling, fits in a single dependency-free file in any general-purpose language. It's a powerful educational tool for understanding transformers at the most fundamental level.
Massive shoutout to Andrej Karpathy for the original Python implementation. His work on making deep learning accessible, from micrograd to nanoGPT to this beautifully minimal `microgpt.py`, has been an incredible gift to the ML community. This multi-language port is a tribute to the elegance of his original algorithm. 🙏
> "This file is the complete algorithm. Everything else is just efficiency."
>
> — Andrej Karpathy