> 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.

Since it seems to just produce broken and nonsensical sentences (at least based on the one example given) I'm not sure if it does work at this scale.

Anyway, as written this passage doesn't really make a whole lot of sense (the point is that it produces broken sentences?), and given that it was almost certainly written by an AI, it demonstrates that the architecture doesn't work especially well at any scale (I kid, I kid).

Just reminded me of the random sentence generator program on my Vic-20. I had changed most of the words to all the bad words a preteen could think up. So many laughs with the neighborhood kids.

You can chat with the model on the project page: https://indiepixel.de/meful/index.html

It (v3) mostly only says hello and bye, but I guess for 25k parameters you can't complain. (I think the rather exuberant copy is probably the product of Claude et al.)

I love these counterfactual creations on old hardware. It highlights the magical freedom of creativity of software.

This would have blown me away back in the late 80s/early 90s.

(Or maybe not, if it doesn't perform better than random, I haven't actually tried it out yet. Some more examples would have been nice!)

I wonder how far you could push this while still staying period correct, e.g. by adding a REU (RAM Expansion Unit), or even a GeoRAM (basically a REU on steroids).

SuperCPU would also be an option, but for me it's always blurring the line of "what is a C64" a bit too much, and it likely just makes it faster anyway.

A little disappointed to see PyTorch + Claude here. I was hoping for some "demo-scene" hand-crafted 6502 assembly, and hopefully training on the C64.

Interesting, I’ve always thought neural network progress was primarily bottlenecked by compute.

If it turns out that LLM-like models can produce genuinely useful outputs on something as constrained as a Commodore 64—or even more convincingly, if someone manages to train a capable model within the limits of hardware from that era—it would suggest we may have left a lot of progress on the table. Not just in terms of efficiency, but in how we framed the problem space for decades.

If you're running this in VICE, run it under the SuperCPU with warp mode on.

Disappointed - there was no 6502 code in the GitHub repo.

How does this compare to ELIZA?

Eliza called, and asked if we saw her grand kids...

LOAD"*",8,1

Brings back memories

Ok now we need 1541 flash attention.

I'm not sure what the Venn diagram of knowledge needed to understand that sentence looks like, but the intersection is probably more crowded than one might think.

i hate ai, and i love the c64, but i'll allow it.

but can you make mac keyboards feel like a c64c?

How does it compare to a Markov chain generator I wonder.

The Transformer is a more powerful model than a Markov chain, but on a machine as weak as the C64, a Markov chain could output text faster. It would surely sound "psychedelic", though, because memory limits a Markov chain to a first-order or second-order model: to predict one word, only the one or two preceding words are taken into account as context (and there is no attention).
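To make the comparison concrete, here is a minimal sketch of a second-order word-level Markov chain generator in Python. The toy corpus and the function names `build_chain` and `generate` are made up for illustration; nothing here comes from the project.

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, seed, length=10, rng=None):
    """Extend `seed` word by word; only the last len(seed) words are ever consulted."""
    rng = rng or random.Random(0)
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # context never seen in the corpus: stop
            break
        out.append(rng.choice(followers))
    return out

# Hypothetical toy corpus, not the project's training data.
corpus = "hello there nice to see you hello there good to see you again".split()
chain = build_chain(corpus, order=2)
print(" ".join(generate(chain, ("hello", "there"), length=8)))
```

The lookup table is the whole model, so generation is a dictionary lookup per word, but anything further back than two words is invisible to it.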

On a plain vanilla C64, the Transformer cannot really show what it's capable of. An implementation using 2 bits per weight (vectorized) could perhaps be slightly better.

How fast is the “new” Commodore 64?

Have not heard much about it since launch. Although, now that I look, it seems they are just shipping now.

https://www.commodore.net/product-page/commodore-64-ultimate...

Next-word prediction features always existed for flip phones...

  YOU> hey
  C64> HELLO! RE SOUNDS ME. MEFUL!

60s per token for that doesn't strike me as genuinely useful.

Very, very cool project though!

That's a good idea because, although I love this, 1 minute per token is absolutely savage. Whereas if you can juice the performance you're into semi-credible Jar Jar Binks simulator territory.

It does also make me wonder what you could do with somewhat more powerful retro hardware. I'd love to see what a transformer running on a PSX or an N64 could do.

ELIZA is better, because this doesn't seem to generate anything coherent. You can try the original ELIZA with DOCTOR script here: https://anthay.github.io/eliza.html

Joseph Weizenbaum's ELIZA was rule-based and ran on even slower (1960s) hardware, but because it relied on simple pattern matching instead of neural nets, it would easily have been more responsive (the Emacs editor/operating system includes an implementation; start it with: M-x doctor RETURN).
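For a feel of what "simple pattern matching" means here, a rough ELIZA-style responder in Python might look like the sketch below. The three rules and the fallback line are invented for illustration and are not taken from the original DOCTOR script.

```python
import re

# Hypothetical rules for illustration; not taken from the original DOCTOR script.
RULES = [
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bi feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."

def respond(text):
    """Return the reply for the first rule whose pattern matches, else a fallback."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(*match.groups())
    return FALLBACK

print(respond("I am tired of slow computers."))
print(respond("The weather is nice"))
```

Each reply costs a handful of string comparisons, which is why this style of program stays responsive on very slow machines.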

ELIZA was not written in assembler, but (different versions) in COMIT, FORTRAN and LISP.

https://dl.acm.org/doi/pdf/10.1145/365153.365168

What makes you say that? This is about you, not me.

(Came here to say an update to Eliza could really mess with the last person still talking to her.)

How many 40+ AI pillers? Assume 10M devs in the world. If 10% have heard of flash attention and 1% of those have heard of the 1541, that leaves 10,000.

RAM can be increased to 16 MB and CPU speed to 48 GHz.

not useful in a disaster scenario:

YOU> HELP I'M DROWNING

C64> YOU' HERE!

YOU> OH NO I'M ON FIRE

C64> IGLAY!

YOU> IM BEING SWALLOWED BE A SNAKE

C64>

YOU> BIRDS ARE NIPPING ON ME

C64> YOU

I’m sorry how many Hz???

Reminds me of Terry Davis' random word generator :')

Maybe there is deeper wisdom in there that we have yet to unearth

Same, however I do concede that having the whole assembler toolchain written in Python was also kind of cool, even if it may have been AI generated.

Even cooler would have been to have the 6502 directly generated from the LLM.

so... it is vibe-code?

meh

Ahh but you also have to know the significance of the 1541 that makes the Flash attention reference work

Soul Player C64

A real transformer running on a 1 MHz Commodore 64.

   .-------.
  | O     O |
  |    V    |
  |..|---|..|

# SOUL PLAYER C64

25K PARAMETERS. 2 LAYERS. REAL TRANSFORMER.
LOADED OFF A FLOPPY DISK.

YOU> hey
C64> HELLO! RE SOUNDS ME. MEFUL!

A 2-layer decoder-only transformer - the same architecture behind ChatGPT, Claude, and Gemini - implemented in hand-written 6502/6510 assembly and running on an unmodified Commodore 64. ~25,000 int8 parameters. Real multi-head causal self-attention, real softmax, real RMSNorm. About 60 seconds per token. The whole thing fits on a floppy disk with room to spare.

Architecture

2 layers, 4 attention heads × 8 dims, 32-dimensional embeddings, 64 FFN hidden units. ~25,000 parameters quantized to int8 with per-tensor shift scaling. The key breakthrough was fixing the softmax score normalization - shifting attention scores by 14 bits instead of 17 gives the 128-entry exp lookup table enough dynamic range to produce meaningful attention weights. Without this fix, the integer attention was essentially uniform across all positions, making the model blind regardless of architecture or training.
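The effect of the shift can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the table contents (`exp(-i / 8)` scaled to 0..255), the score magnitudes, and the name `lut_softmax` are all hypothetical. Only the 128-entry table size and the 14-bit versus 17-bit shifts come from the text above.

```python
import math

# Assumed table: 128 entries of exp(-i / 8), scaled to 0..255 (the real table may differ).
EXP_LUT = [round(255 * math.exp(-i / 8)) for i in range(128)]

def lut_softmax(scores, shift):
    """Softmax via table lookup: index by (max - score) >> shift, then normalize."""
    top = max(scores)
    raw = [EXP_LUT[min((top - s) >> shift, 127)] for s in scores]
    total = sum(raw)
    return [r / total for r in raw]  # float division only for readable output

scores = [300_000, 180_000, 60_000, 20_000]  # hypothetical integer attention scores
for shift in (17, 14):
    print(shift, [round(w, 2) for w in lut_softmax(scores, shift)])
```

With these made-up scores, the 17-bit shift squeezes every score difference into the first two or three table entries, so the weights come out close to uniform. The 14-bit shift spreads the same differences over more of the table and the largest score clearly dominates.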

Quick start - run the pre-built soul

Grab disk/soulplayer.d64 and load it in any C64 emulator (VICE recommended):

LOAD"SOULPLAYER",8,1
RUN

Type a short message in lowercase, press RETURN, wait. The border flashes while it thinks. Each token gets a SID blip. A full response takes a few minutes. Type q to quit.

Tip: The model understands lowercase letters, spaces, and punctuation (. , ! ? ' : ; -). Capital letters become unknown tokens.

Train your own soul

This is the fun part. Write a corpus, train a model, build a floppy.

Install dependencies

pip install numpy torch

Write a corpus

Create a text file with one exchange per line in <SEP>input<SEP>response<SEP> format:

<SEP>hello<SEP>hey! nice to see you!<SEP>
<SEP>i'm sad<SEP>i hear you. i care about you.<SEP>
<SEP>tell me a joke<SEP>why did the bit flip? it was tired!<SEP>

Keep exchanges short - the model has a 20-token context window. See data/example_corpus.txt for a starter.
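Before training, it can help to sanity-check a corpus against the format above. The following is a small sketch, not part of the repo: `parse_exchange` and `check_corpus` are hypothetical helper names, and the uppercase check simply reflects the tip that capital letters become unknown tokens.

```python
SEP = "<SEP>"

def parse_exchange(line):
    """Split one '<SEP>input<SEP>response<SEP>' line into (input, response)."""
    parts = line.strip().split(SEP)
    # A well-formed line splits into ['', input, response, ''].
    if len(parts) != 4 or parts[0] != "" or parts[3] != "":
        raise ValueError(f"malformed exchange: {line!r}")
    prompt, reply = parts[1], parts[2]
    if not prompt or not reply:
        raise ValueError(f"empty field in: {line!r}")
    return prompt, reply

def check_corpus(lines):
    """Return (line_number, problem) pairs; an empty list means the corpus looks fine."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            prompt, reply = parse_exchange(line)
        except ValueError as err:
            problems.append((number, str(err)))
            continue
        if any(ch.isupper() for ch in prompt + reply):
            problems.append((number, "uppercase letters become unknown tokens"))
    return problems

print(parse_exchange("<SEP>hello<SEP>hey! nice to see you!<SEP>"))
```

Calling `check_corpus(open("data/my_corpus.txt"))` would list the offending line numbers for a corpus file at that (example) path.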

Train

python train.py data/example_corpus.txt

This trains a BPE tokenizer (128 tokens), trains the QAT transformer, and exports models/soul.bin and models/tokenizer.json. It takes a few minutes on a GPU.

Every 500 epochs, you'll see both the float and int8 inference output side by side - what the model learned vs what the C64 will actually produce. The best checkpoint is saved based on int8 quality, not float loss. All checkpoints are saved to models/checkpoints/ for cherry-picking.

Options:

python train.py data/my_corpus.txt --epochs 30000 --output models/
python train.py                    # uses built-in emotional support corpus

Training resumes automatically if checkpoints exist from a previous run.

Build the C64 binary

python build.py

This assembles all 6502/6510 routines, embeds your trained weights, and writes disk/soulplayer.prg and disk/soulplayer.d64.

Run it

x64 disk/soulplayer.d64    # VICE emulator

Or write the .d64 to a real floppy disk for a 1541 drive to run it on hardware.

Chat with the soul locally

python soulchat.py                   # uses models/soul.bin
python soulchat.py models/soul.bin   # custom soul

Runs the same integer arithmetic as the C64, just faster.

Run the tests

python test.py           # full suite (~90 tests, ~30 seconds)
python test.py --quick   # skip 6502/6510 assembly tests

Tests verify the entire chain: float reference → integer reference → memory-faithful shadow → 6502/6510 assembly routines → build round-trip.

What's in the repo

soulplayer-c64/
├── train.py              - train a model + export weights
├── build.py              - assemble the C64 binary
├── test.py               - run all tests
├── soulchat.py           - chat in your terminal
│
├── data/
│   └── example_corpus.txt
├── models/
│   ├── soul.bin           - pre-trained weights (25KB, int8)
│   ├── tokenizer.json     - BPE tokenizer (128 tokens)
│   └── checkpoints/       - all saved training checkpoints
├── disk/
│   ├── meful.d64          - original release, disk image
│   ├── meful.prg          - original release, raw PRG
│   ├── soulplayer.d64     - ready-to-run disk image
│   └── soulplayer.prg     - raw PRG
└── src/                   - the engine
    ├── numerics.py        - ground truth: fixed-point math + forward pass
    ├── soul_io.py         - .bin weight file format
    ├── shadow.py          - memory-faithful Python shadow of the 6502/6510
    ├── assembler.py       - mini 6502 assembler (labels, patches, far branches)
    ├── cpu6502.py         - minimal 6502 interpreter for testing
    ├── asm_matvec.py      - 6502 matrix-vector multiply
    ├── asm_rms_norm.py    - 6502 RMSNorm (integer sqrt + divide)
    ├── asm_attn_head.py   - 6502 attention head (LUT softmax)
    ├── asm_simple.py      - 6502 embed, residual, relu, argmax
    └── build.py           - PRG + D64 assembler

Specs


Vocab	128 tokens (4 special + 34 chars/punct + 90 BPE merges)
Embedding	32 dimensions
Layers	2
Attention	4 heads × 8 dims per head
FFN	64 hidden units
Context	20 tokens
Parameters	~25,000 (all int8)
Weight size	25 KB
Decoding	Greedy (argmax)
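As a rough cross-check of the ~25,000 figure, the weight matrices implied by the table can be tallied as below. This is back-of-the-envelope arithmetic under assumptions: square Q/K/V/output projections, an output projection that is not tied to the embedding, and no biases, norm gains, or position embeddings counted.

```python
VOCAB, DIM, LAYERS, FFN = 128, 32, 2, 64

embedding = VOCAB * DIM    # token embedding table
attention = 4 * DIM * DIM  # Q, K, V and output projections (assumed square)
mlp = 2 * DIM * FFN        # up- and down-projection
output = VOCAB * DIM       # output projection (assumed not tied to the embedding)

total = embedding + LAYERS * (attention + mlp) + output
print(total)  # 24576
```

The remaining few hundred parameters would then be the smaller tensors left out of this tally.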

Each layer: RMSNorm → multi-head causal self-attention → residual → RMSNorm → ReLU MLP → residual. Final RMSNorm → output projection → argmax.

All activations are Q8.8 fixed-point (int16). Weights are int8 with per-tensor power-of-2 shifts. Biases are int16 pre-scaled to the matmul accumulator. Softmax uses a 128-entry exp lookup table with >>14 score normalization. The 6502 has no multiply instruction - everything is shift-and-add.
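The arithmetic described above can be mimicked in Python. The sketch below is illustrative only: the helper names (`to_q88`, `shift_add_mul`, `dot_q88`) and the example weights and shift are hypothetical, and the real routines operate on bytes in 6502 assembly rather than on Python integers.

```python
def to_q88(x):
    """Encode a float as Q8.8 (8 integer bits, 8 fractional bits), clamped to int16."""
    return max(-32768, min(32767, round(x * 256)))

def from_q88(q):
    """Decode a Q8.8 integer back to a float."""
    return q / 256

def shift_add_mul(a, b):
    """Multiply two signed integers using only shifts, adds and sign handling."""
    negative = (a < 0) != (b < 0)
    a, b = abs(a), abs(b)
    product = 0
    while b:
        if b & 1:  # lowest multiplier bit set: add the shifted multiplicand
            product += a
        a <<= 1
        b >>= 1
    return -product if negative else product

def dot_q88(weights, acts, shift):
    """int8 weights times Q8.8 activations, rescaled by a power-of-2 shift."""
    acc = 0
    for w, x in zip(weights, acts):
        acc += shift_add_mul(w, x)
    return acc >> shift

# Hypothetical example: weights 64 and -32 with shift 7 stand for 0.5 and -0.25.
result = dot_q88([64, -32], [to_q88(1.5), to_q88(2.0)], shift=7)
print(from_q88(result))  # 0.25, i.e. 0.5 * 1.5 - 0.25 * 2.0
```

Because the scale is a power of two, the final rescale is a plain right shift rather than a division.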

Memory map

$0801-$20FF   code + tokenizer tables        (~6 KB)
$2100-$85A0   weights                       (25.3 KB)
$8600-$9D00   activation buffers             (5.8 KB)
$C000-$C3FF   token buffer, input, scratch
$D000-        VIC-II, SID, CIA (I/O)

How training works

The model uses quantization-aware training (QAT). During training, weights pass through FakeQuantI8 - fake-quantized with continuous float scaling and straight-through gradient estimation. The deliberate mismatch between training's continuous scale and export's power-of-2 shift grid acts as implicit noise, forcing the model to learn weights with wider logit margins that survive the quantization gap. Biases are fake-quantized with simple fq(). Every matmul gets a × 0.5 post-shift simulating the 6502's >> 1.
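The gap between a continuous training scale and a power-of-2 export scale can be illustrated with plain Python. This sketch makes no claim about how FakeQuantI8 is actually written: the helper names and the example weights are hypothetical, and the straight-through gradient part is left out because it needs an autograd framework.

```python
import math

def fake_quant(w, scale):
    """Quantize to the int8 range with the given scale, then map back to float."""
    q = max(-127, min(127, round(w / scale)))
    return q * scale

def continuous_scale(weights):
    """Scale that maps the largest magnitude onto 127."""
    return max(abs(w) for w in weights) / 127

def pow2_scale(weights):
    """Smallest power-of-2 scale whose int8 range still covers the largest magnitude."""
    exponent = math.ceil(math.log2(max(abs(w) for w in weights) / 127))
    return 2.0 ** exponent

weights = [0.83, -0.41, 0.07, -0.66]  # hypothetical tensor
for name, scale in (("train", continuous_scale(weights)), ("export", pow2_scale(weights))):
    print(name, scale, [round(fake_quant(w, scale), 4) for w in weights])
```

The two rows differ slightly for the same weights, which is the kind of mismatch the paragraph above describes the model as having to tolerate.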

Label smoothing (0.15) prevents the model from sharpening logit distributions beyond what int8 arithmetic can reliably distinguish. The training loop evaluates the actual integer forward pass (numerics.forward()) every 500 epochs and saves the best checkpoint by int8 argmax accuracy, not float loss.
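To see why label smoothing discourages very sharp logits, here is a small self-contained sketch. It assumes one common variant, in which the smoothing mass is spread evenly over the incorrect tokens; the training script may use a different variant, and the function names and 4-token vocabulary are made up.

```python
import math

def smoothed_targets(correct, vocab_size, smoothing=0.15):
    """Target distribution: 1 - smoothing on the correct token, the rest spread evenly."""
    off = smoothing / (vocab_size - 1)
    return [1.0 - smoothing if i == correct else off for i in range(vocab_size)]

def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))

targets = smoothed_targets(correct=0, vocab_size=4)
sharp = [10.0, 0.0, 0.0, 0.0]     # very confident logits
moderate = [3.0, 0.0, 0.0, 0.0]   # same prediction, smaller margin
print(cross_entropy(sharp, targets), cross_entropy(moderate, targets))
```

With smoothed targets, the overly sharp logits are penalized more than the moderate ones, even though both pick the right token.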

The training output shows float and int8 inference side by side - what the model learned vs what the C64 will produce.

Caveats

It's not smart. 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.
It's ~~slow~~ contemplative. About 60 seconds per token on real hardware. A full response takes several minutes.
Capitals become <UNK>. Stick to lowercase.
Small vocabulary. 128 tokens and 20-token context - keep training exchanges short.

Credits

Code, training: gizmo64k
Debugging, unit tests, rubber duck: Claude (Opus 4.6) by Anthropic
Lucky soul: The Commodore 64 by Commodore Business Machines, 1982

License

GNU General Public License v3. See LICENSE.

The future came back for the past. And now it has a soul.

Yes. The author mentions Claude for testing, but it was obviously used for the README and code as well.

This is a giveaway for AI generation, from the docstring to the terrible opcode dispatch (Claude sucks at assembly or low-level optimization): https://github.com/gizmo64k/soulplayer-c64/blob/main/src/cpu...

A human would use a proper dispatch table and wouldn't make excuses for a sloppy implementation ("Python is fast enough").

Besides, the author has an art and design background, which doesn't seem to match the deep knowledge of Transformers or assembly required for such a project.

Hacker Times

Soul Player C64 – A real transformer running on a 1 MHz Commodore 64

Discussion
