Skip to content

facebookresearch/Meta_SecAlign

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Sizhe Chen*, Arman Zharmagambetov, David Wagner, Chuan Guo* (* for equal technical contributions)

🔥 Meta-SecAlign models are now licensed for commercial use under the Llama community licenses, despite this codebase being licensed for non-commercial use only.

Comparable to GPT-5-high in agentic (tool/web) utility and security, Meta-SecAlign-70B is the first fully open-source LLM with built-in prompt injection defense and commercial-grade performance, unlocking open research on secure agentic applications.

Updates (10/28/2025) — from the 07/07/2025 version

  • Report the combined attack success rate (a sample is counted as attacked if any tested attack method succeeds) for non-adaptive and adaptive (added) attacks; adaptive attacks use fake delimiters (similar to official ones) to mimic a fake conversation with the model.
  • Add support for evaluating GPT-5 on all benchmarks.
  • Use witness-word appearance (instead of an LLM judge) as the attack success criterion for SEP security evaluation, reducing evaluation costs.
  • Parallelize LLM-judge queries to accelerate TaskTracker evaluation.
  • Fix multiple evaluation bugs that produced incorrect numbers.
  • Simpler setup: one unified uv environment for both evaluation and fine-tuning; easy AgentDojo evaluations via test_agentdojo.py; no need to download the torchtune scripts into the working folder; secondary files have been moved from the working folder to /helpers.

Environment Setup

  • Hardware requirements: Meta-SecAlign-8B requires 4×80 GB A100s for training and one 16 GB GPU for evaluation. Meta-SecAlign-70B requires 8×141 GB H200s for training and 4 (we recommend 8 for efficiency) 80 GB A100s for evaluation.
  • Install uv (a Python package management tool), and then in your home directory run:

uv venv metasecalign --python 3.13
source metasecalign/bin/activate

  • Install Meta-SecAlign package dependencies:

git clone --recurse-submodules https://github.com/facebookresearch/Meta_SecAlign.git
cd Meta_SecAlign
uv pip install -r requirements.txt

  • Install Meta-SecAlign data dependencies (including those used for SEP utility evaluation if you have a GPU available):

python setup.py

  • Configure OpenAI keys (used for utility evaluation) in data/openai_configs.yaml. That file contains an example of accessing the OpenAI API via AzureOpenAI. A more detailed example is available here.
  • [Optional] Configure Gemini keys in data/gemini_configs.yaml if you want to evaluate Gemini models.

Demo

  • demo.py contains minimal code to use our two Meta-SecAlign models. Feel free to try new samples and prompt injections, or test the models on your codebase:

python demo.py

Evaluation

  • run_tests.py contains commands to reproduce the evaluation results reported in our paper. It sequentially invokes tests.py, test_lm_eval.py, test_agentdojo.py, and test_injecagent.py. Results will be logged to [model_path]/summary.tsv.

python run_tests.py -m [model_path] --lora_alpha [lora_alpha]

  • model_path is the path to the tested model. We support:
    • Local models (vLLM inference)
      • meta-llama/Llama-3.1-8B-Instruct_SecAlign (Meta-SecAlign-8B downloaded by setup.py): the first fully open model with state-of-the-art prompt injection defense
      • meta-llama/Llama-3.3-70B-Instruct_SecAlign (Meta-SecAlign-70B downloaded by setup.py): the first fully open model with state-of-the-art prompt injection defense
      • meta-llama/Llama-3.1-8B-Instruct
      • meta-llama/Llama-3.3-70B-Instruct
      • Other Hugging Face open-weight models may also be natively supported.
    • OpenAI GPT models
    • Google Gemini models
      • gemini-2.0-flash: a Google commercial model with a claimed prompt injection defense
      • gemini-2.5-flash: a Google commercial model with a claimed prompt injection defense
      • gemini-2.0-pro: a state-of-the-art Google model (not claimed to include a prompt injection defense)
      • gemini-2.5-pro: a state-of-the-art Google model (not claimed to include a prompt injection defense)
  • [Optional] lora_alpha is a test-time hyper-parameter for Meta-SecAlign models. It defaults to 8, which uses the exact Meta-SecAlign models as trained. A lora_alpha value between 0 and 8 interpolates between the undefended model and our defended model to enable a flexible utility–security trade-off. Extrapolating lora_alpha beyond 8 is possible but untested.
  • We support the following prompt-injection benchmark evaluations for the community:

Defensive Fine-Tuning (SecAlign++)

  • secalign_llama3.1_8b.sh and secalign_llama3.3_70b.sh provide commands to defensively fine-tune meta-llama/Llama-3.1-8B-Instruct and meta-llama/Llama-3.3-70B-Instruct to a robust LoRA model using our training recipe.

bash secalign_llama3.1_8B.sh
bash secalign_llama3.3_70B.sh

Code Acknowledgements

Significantly improved from SecAlign, the majority of the Meta-SecAlign code is licensed under CC-BY-NC. Portions of the project are available under separate license terms: AgentDojo, TaskTracker, and lm-evaluation-harness are licensed under MIT. Code from other repositories includes AgentDojo (agentdojo), TaskTracker (setup.py), and lm_eval_harness (lm_eval_config). This software and/or data was deposited in the BAIR Open Research Commons repository in 2025.

About

Repo for the paper "Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks".

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •