CURRENT PROGRESS

For 8B parameter models:

Prompt evaluation: 380 tokens/s

Token generation: 27 tokens/s

Power consumption: 5W

Hardware cost: $55

Everything you see in the video runs locally: voice transcription, text-to-speech, and the LLM.
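At these numbers, the energy cost per generated token is easy to check with back-of-the-envelope arithmetic (a quick sketch; the variable names are ours, and the figures are the measurements quoted above):

```python
power_w = 5.0      # measured power consumption (W)
gen_tok_s = 27.0   # token generation speed (tokens/s)

# Energy per token = power / throughput (W / (tokens/s) = J/token)
joules_per_token = power_w / gen_tok_s
print(f"{joules_per_token:.3f} J per generated token")  # prints: 0.185 J per generated token
```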

Exponential progress coming!

A coding agent evaluation for small language models

Small models are knowledgeable and capable of coding, but still lack the confidence required for agentic tasks.

Date: June 3, 2025

ReplaceMe: Prune & Heal AI models for superior speed

A benchmark of ReplaceMe, a research paper on improving the accuracy of pruned models.

Date: May 6, 2025

Luna v0.3.2

(Accidentally) achieving 210 tokens/s prompt evaluation speed.

Date: May 5, 2025

DyT: Theoretically faster but practically slower

DyT was proposed as an alternative to RMS normalization in Transformer inference. On paper it is 2x faster, but in practice it is 35% slower due to the lack of SIMD support.

Date: May 1, 2025

Luna v0.0.1

The first version of Luna, the AI box, runs on an Orange Pi 5 Pro. Our findings on CPU core manipulation allowed for a 25% increase in LLM inference speed while consuming 40% less power.

Date: March 7, 2025

Now, we're pushing performance per dollar

Want to contribute custom inference firmware, hardware, or a novel AI model? Let's talk!
