Flash Linear Attention for Apple Silicon

fla-mlx is a small research library for running selected flash-linear-attention kernels on Apple Silicon through Apple’s MLX framework.

Repository: github.com/HallerPatrick/fla-mlx

Why

Linear attention models such as GLA, Gated DeltaNet, and RetNet usually depend on Triton kernels, which means CUDA. On Apple Silicon this often leaves slow Python fallbacks or CPU execution. fla-mlx ports the inference kernels to native Metal, so these models can run locally on M-series Macs with practical performance.

Usage

pip install mlx
git clone https://github.com/HallerPatrick/fla-mlx
cd fla-mlx
pip install -e .

Generate text with a supported checkpoint:

python scripts/generate.py \
    --model PatrickHaller/smollm_gla_small_512 \
    --prompt "The key insight of linear attention is" \
    --max_new_tokens 200

The project is still experimental, but it already includes Metal kernels for GLA, Gated DeltaNet, RetNet, and linear attention, plus tests and benchmarks for checking numerical parity and performance.