CURRENT PROGRESS
For 8B parameter models:
Prompt evaluation: 380 tokens/s
Token generation: 27 tokens/s
Power consumption: 5W
Hardware cost: $55
Everything you see in the video runs locally: voice transcription (speech-to-text) and LLM inference.
Exponential progress coming!
A coding agent evaluation for small language models
Small models are knowledgeable and capable of coding, but still lack the confidence for agentic tasks.
ReplaceMe: Prune & heal AI models for superior speed
A benchmark of ReplaceMe, a method from a research paper on recovering the accuracy of pruned models.
Luna v0.3.2
(Accidentally) achieving 210 tokens/s prompt evaluation speed.
DyT: Theoretically faster but practically slower
DyT was proposed as an alternative to RMS Normalization in Transformer inference. In theory it is 2x faster, but in practice it is 35% slower due to the lack of SIMD support.
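For reference, a minimal sketch of the two operations being compared (function names and parameters are illustrative, not taken from any particular inference codebase). DyT replaces the normalization step with an elementwise gamma * tanh(alpha * x) + beta, avoiding the per-vector reduction that RMSNorm requires:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of the vector. This needs a
    # full reduction over the hidden dimension before any output element
    # can be produced.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [(v / rms) * w for v, w in zip(x, weight)]

def dyt(x, alpha, weight, bias):
    # DyT: purely elementwise weight * tanh(alpha * x) + bias.
    # No reduction at all, which is why it is theoretically cheaper --
    # but tanh lacks the fast SIMD paths that the mul/add/sqrt of
    # RMSNorm enjoy in current kernels.
    return [w * math.tanh(alpha * v) + b for v, w, b in zip(x, weight, bias)]

hidden = 4
x = [0.5, -1.2, 3.0, 0.1]
weight = [1.0] * hidden
bias = [0.0] * hidden

print(rms_norm(x, weight))
print(dyt(x, alpha=0.5, weight=weight, bias=bias))
```

With unit weight and zero bias, DyT's output is always bounded in (-1, 1), while RMSNorm only rescales the input.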
Luna v0.0.1
The first version of Luna, the AI box, runs on an Orange Pi 5 Pro. Our findings on CPU core manipulation allowed for a 25% increase in LLM inference speed while consuming 40% less power.
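The core manipulation described above amounts to pinning the inference process to specific CPU cores. A minimal Linux sketch (the pin_to_cores helper and the core IDs are illustrative assumptions; on big.LITTLE SoCs like the Orange Pi 5 Pro's RK3588S, which IDs map to the Cortex-A76 performance cores varies by board and kernel):

```python
import os

def pin_to_cores(cores):
    # Restrict the current process (pid 0 = self) to the given CPU core IDs.
    # On big.LITTLE SoCs, pinning inference threads to the performance cores
    # prevents the scheduler from migrating them onto efficiency cores;
    # pinning to efficiency cores instead trades speed for power.
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, set(cores))
        return sorted(os.sched_getaffinity(0))
    return None  # affinity control not available on this OS

# Core 0 exists on every machine; on many RK3588-class boards the
# performance cores are a higher-numbered cluster (an assumption -- check
# /proc/cpuinfo or lscpu on your board).
print(pin_to_cores([0]))
```

The same effect can be had without code via `taskset -c <cores> <command>` when launching the inference binary.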
Now, we're pushing performance per dollar
Want to contribute custom inference firmware, hardware, or novel AI models? Let's talk!
