Attention
SDP Attention
This is the default built-in attention in PyTorch.

```yaml
sdp_attention: true
```

For more details: PyTorch docs
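Conceptually, SDPA computes `softmax(QK^T / sqrt(d)) V`. A minimal pure-Python sketch of that computation, for intuition only (the real PyTorch kernel is fused, batched, and multi-headed):

```python
import math

def sdp_attention(q, k, v):
    """Reference scaled dot-product attention for single-head,
    unbatched inputs: q, k, v are lists of row vectors."""
    d = len(q[0])
    # Attention scores: QK^T, scaled by sqrt(head dimension)
    scores = [[sum(qi * ki for qi, ki in zip(qrow, krow)) / math.sqrt(d)
               for krow in k] for qrow in q]
    # Row-wise softmax (max-subtracted for numerical stability)
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # Output: attention-weighted sum of value rows
    return [[sum(w * vrow[j] for w, vrow in zip(wrow, v))
             for j in range(len(v[0]))] for wrow in weights]

# One query attending over two key/value pairs
out = sdp_attention([[1.0, 0.0]],
                    [[1.0, 0.0], [0.0, 1.0]],
                    [[1.0, 2.0], [3.0, 4.0]])
```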
Flash Attention
Axolotl supports Flash Attention 2, 3, and 4. The best available version is used automatically based on your installed packages and GPU.
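The selection can be pictured as a priority check over GPU compute capability and installed packages. This is a hypothetical sketch for intuition; the function and package names are illustrative, not Axolotl's actual dispatch code:

```python
def pick_flash_attention(sm_major, installed):
    """Pick the best available Flash Attention version (sketch).

    sm_major: compute capability major version (8 = Ampere/Ada,
    9 = Hopper, 10 = Blackwell); installed: set of package names.
    Hypothetical helper -- not Axolotl's real internals.
    """
    if "flash_attn_4" in installed and sm_major >= 9:  # FA4: Hopper/Blackwell
        return 4
    if "flash_attn_3" in installed and sm_major == 9:  # FA3: Hopper only
        return 3
    if "flash_attn" in installed and sm_major >= 8:    # FA2: Ampere or newer
        return 2
    return None  # fall back to SDPA
```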
```yaml
flash_attention: true
```

For more details: Flash Attention
Flash Attention 2
Requirements: Ampere, Ada, or Hopper GPUs (Turing or older is not supported)
```bash
pip install flash-attn --no-build-isolation
```

If you get an `undefined symbol` error while training, ensure you installed PyTorch before installing Axolotl. Alternatively, try reinstalling flash-attn or downgrading to an earlier version.
Flash Attention 3
Requirements: Hopper GPUs only; CUDA 12.8 recommended
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
```

Flash Attention 4
Requirements: Hopper or Blackwell GPUs
```bash
pip install flash-attn-4
```

Or from source:

```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/flash_attn/cute
pip install -e .

# FA2's flash_attn package includes a cute/ stub that shadows FA4.
# Remove it so Python can find the real FA4 module:
rm -r $(python -c "import flash_attn; print(flash_attn.__path__[0])")/cute
```

Hopper (SM90) users: The backward kernel is not yet included in the pip package. To use FA4 for training on Hopper, install from source using the instructions above.
FA4 only supports head dimensions up to 128 (d ≤ 128). The DeepSeek shape (192, 128) is
also supported but only on Blackwell. Axolotl automatically detects incompatible head dimensions
and falls back to FA2/3.
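The fallback rule above can be sketched as a small predicate. This is a hypothetical helper for illustration, not Axolotl's actual compatibility check:

```python
def fa4_supports(head_dim_qk, head_dim_v, is_blackwell):
    """Whether FA4 can handle the given head dimensions (sketch).

    FA4 supports d <= 128; the DeepSeek shape (192, 128) is allowed
    only on Blackwell. Hypothetical helper, not Axolotl's real code.
    """
    if head_dim_qk <= 128 and head_dim_v <= 128:
        return True
    if (head_dim_qk, head_dim_v) == (192, 128) and is_blackwell:
        return True
    return False  # caller falls back to FA2/3
```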
For more details: flash-attention/flash_attn/cute
AMD
Requirements: ROCm 6.0 and above.
Flex Attention
A flexible PyTorch API for attention used in combination with torch.compile.
```yaml
flex_attention: true
# recommended
torch_compile: true
```

We recommend using the latest stable version of PyTorch for best performance.
For more details: PyTorch docs
SageAttention
Attention kernels with QK Int8 and PV FP16 accumulator.
```yaml
sage_attention: true
```

Requirements: Ampere, Ada, or Hopper GPUs

```bash
pip install sageattention==2.2.0 --no-build-isolation
```

Only LoRA/QLoRA is recommended at the moment; we found the loss drops to 0 for full fine-tuning. See GitHub Issue.
For more details: Sage Attention
We do not support SageAttention 3 at the moment. If you are interested in adding it or improving the SageAttention implementation, please open an Issue.
xFormers
```yaml
xformers_attention: true
```

We recommend using this with Turing GPUs or older (such as on Colab).
For more details: xFormers
Shifted Sparse Attention
We plan to deprecate this! If you use this feature, we recommend switching to methods above.
Requirements: LLaMA model architecture
```yaml
flash_attention: true
s2_attention: true
```

No sample packing support!