fla-mlx is a small research library for running selected flash-linear-attention kernels on Apple Silicon through Appleās MLX framework.
Repository: github.com/HallerPatrick/fla-mlx
Why
Linear attention models such as GLA, Gated DeltaNet, and RetNet usually depend on Triton kernels, which means CUDA. On Apple Silicon this often leaves slow Python fallbacks or CPU execution. fla-mlx ports the inference kernels to native Metal, so these models can run locally on M-series Macs with practical performance.
Usage
pip install mlx
git clone https://github.com/HallerPatrick/fla-mlx
cd fla-mlx
pip install -e .
Generate text with a supported checkpoint:
python scripts/generate.py \
--model PatrickHaller/smollm_gla_small_512 \
--prompt "The key insight of linear attention is" \
--max_new_tokens 200
The project is still experimental, but it already includes Metal kernels for GLA, Gated DeltaNet, RetNet, and linear attention, plus tests and benchmarks for checking numerical parity and performance.