Refiant Benchmark Results
March 17, 2026
Research

Overview

Large Language Models (LLMs) are notoriously computation-heavy and, consequently, power-hungry. Beyond the data-center debate and its limitations, these realities mean that running frontier AI models in private, bounded environments or on edge devices is out of reach for the current state of the art: on-device LLM usage is a sub-optimal experience from an end-user perspective.

Compressed models have been around for a while, using techniques such as pruning, quantization, and distillation to preserve the essence of a model while working within the constraints of edge devices.
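As a concrete, simplified illustration of one of these techniques, the sketch below applies symmetric per-tensor int8 quantization to a weight matrix. This is a generic textbook method, not Refiant's approach: floats are mapped to an 8-bit integer grid, cutting storage 4x relative to float32 at the cost of a small, bounded rounding error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# int8 storage is 4x smaller than float32; the worst-case
# reconstruction error is half a quantization step (scale / 2)
print(np.max(np.abs(w - w_hat)))
```

Production schemes (including the MXFP4 format mentioned below) refine this idea with per-block scales and lower bit widths, trading a little extra metadata for much tighter error bounds.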

Various companies, such as Multiverse Computing in the EU, are exploring this direction for both industrial and consumer use. Much of the research coming out of China, where GPU constraints are acute, is focused on getting more from less.

Refiant is building foundational AI architectures, including compression and context-management systems, as well as, more broadly, physics-based non-convex optimisation techniques and their applications to machine learning.

In particular, we have validated results on novel transformer methods that reduce the computational complexity of the attention mechanism from quadratic to log-linear, as well as compression techniques that preserve model fidelity at around 95%-99% across a number of standard AI benchmarks, including MMLU, AIME, and MMMLU.
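To make the quadratic-to-log-linear claim concrete, the sketch below counts approximate multiply-adds for standard attention versus a hypothetical O(n log n) variant. The constant factors and the cost model are our assumptions (the post does not describe the actual algorithm); only the scaling behaviour is the point.

```python
import math

def attention_flops_quadratic(n: int, d: int) -> int:
    # Standard attention: QK^T and the AV product each cost n^2 * d multiply-adds
    return 2 * n * n * d

def attention_flops_loglinear(n: int, d: int) -> int:
    # Hypothetical log-linear attention: cost grows as n * log2(n) * d
    return int(2 * n * math.log2(n) * d)

# The speedup factor itself grows with sequence length n
for n in (1_000, 100_000):
    quad = attention_flops_quadratic(n, 128)
    loglin = attention_flops_loglinear(n, 128)
    print(n, quad // loglin)
```

At short contexts the two are comparable; at long contexts the quadratic term dominates, which is why the asymptotic class matters more than the constants.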

Benchmarks

To showcase the technology, OpenAI's open-source model GPT-OSS-120B, which uses a mixture-of-experts (MoE) architecture, was used as a test case. Its weights are typically stored in around 60GB at MXFP4 precision, and inference requires at least 80GB of memory. With our compression, this was reduced to ~12GB of stored weights and ~12GB of RAM at runtime.
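The storage figures can be sanity-checked with simple arithmetic. The ~4.25 bits per parameter for MXFP4 (4-bit values plus shared per-block scales) and the nominal 120B parameter count are our assumptions, not figures from the post:

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of stored weights in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

def bits_per_param(size_gb: float, n_params: float) -> float:
    """Effective bits per parameter implied by a given storage size."""
    return size_gb * 1e9 * 8 / n_params

N = 120e9  # nominal GPT-OSS-120B parameter count (assumption)
print(weight_size_gb(N, 4.25))  # ~64GB, close to the quoted ~60GB baseline
print(bits_per_param(12, N))    # ~12GB of stored weights implies under 1 bit/param
```

Read this way, the compressed artifact would sit well below one effective bit per parameter, which is what makes the on-device figures below plausible on 18GB of RAM.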

  1. A redundant-expert analysis was developed, using a unique MoE factorization and routing process.
  2. Expert activations increased from 1.5% per pass to 9.5% per pass.
  3. Compression ratios of ~6.3x were achieved.
  4. The compressed model ran smoothly on a MacBook Pro M3 with 18GB of RAM and at low latency: measured output was 60 tokens per second.
  5. Multiple models were run concurrently on a single device (a quantized Qwen 3-2B alongside GPT-OSS-120B).
  6. The model was compressed and run on the same Apple Silicon device, with no separate GPU time required. The total compression run took approximately 4 hours.
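The expert-activation figures above can be pictured with a toy top-k router. The expert counts below (128 and 21, with top-2 routing) are purely illustrative choices that roughly reproduce the 1.5% and 9.5% fractions; the actual configurations and the factorization method are not disclosed in this post.

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring experts for each token."""
    return np.argsort(logits, axis=-1)[:, -k:]

rng = np.random.default_rng(0)
tokens = 8

# Illustrative only: top-2 routing over 128 experts activates
# 2/128 ~ 1.6% of experts per pass; folding redundant experts into
# ~21 merged experts with the same routing raises that to 2/21 ~ 9.5%
for n_experts in (128, 21):
    logits = rng.standard_normal((tokens, n_experts))
    active = route_topk(logits, k=2)
    print(n_experts, round(2 / n_experts * 100, 1))
```

The intuition: if many experts are near-duplicates, merging them leaves fewer, denser experts, so each forward pass exercises a larger fraction of the weights that remain.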

Remarks

This demonstration is part of a broader body of work on machine learning methods connecting physics, information theory, and language. The core premise is that modern machine learning is highly redundant, and that compression is both possible and optimal. The above results were validated independently by an expert in the field and formed part of the validation for Refiant's seed round of investment.

Email: team@refiant.ai
