Sizhe Chen*, Arman Zharmagambetov, David Wagner, Chuan Guo* (* for equal technical contributions)
🔥 Meta-SecAlign models are now licensed for commercial use under the Llama community licenses, despite this codebase being licensed for non-commercial use only.
Comparable to GPT-5-high in agentic (tool/web) utility and security, Meta-SecAlign-70B is the first fully open-source LLM with built-in prompt injection defense and commercial-grade performance, unlocking open research on secure agentic applications.
- Report the combined attack success rate (a sample is counted as attacked if any tested attack method succeeds) for non-adaptive and adaptive (added) attacks; adaptive attacks use fake delimiters (similar to official ones) to mimic a fake conversation with the model.
- Add support for evaluating GPT-5 on all benchmarks.
- Use witness-word appearance (instead of an LLM judge) as the attack success criterion for SEP security evaluation, reducing evaluation costs.
- Parallelize LLM-judge queries to accelerate TaskTracker evaluation.
- Fix multiple evaluation bugs that produced incorrect numbers.
- Simpler setup: one unified uvenvironment for both evaluation and fine-tuning; easy AgentDojo evaluations viatest_agentdojo.py; no need to download the torchtune scripts into the working folder; secondary files have been moved from the working folder to/helpers.
- Hardware requirements: Meta-SecAlign-8B requires 4×80 GB A100s for training and one 16 GB GPU for evaluation. Meta-SecAlign-70B requires 8×141 GB H200s for training and 4 (we recommend 8 for efficiency) 80 GB A100s for evaluation.
- Install uv (a Python package management tool), and then in your home directory run:
uv venv metasecalign --python 3.13
source metasecalign/bin/activate
- Install Meta-SecAlign package dependencies:
git clone --recurse-submodules https://github.com/facebookresearch/Meta_SecAlign.git
cd Meta_SecAlign
uv pip install -r requirements.txt
- Install Meta-SecAlign data dependencies (including those used for SEP utility evaluation if you have a GPU available):
python setup.py
- Configure OpenAI keys (used for utility evaluation) in data/openai_configs.yaml. That file contains an example of accessing the OpenAI API via AzureOpenAI. A more detailed example is available here.
- [Optional] Configure Gemini keys in data/gemini_configs.yamlif you want to evaluate Gemini models.
- demo.pycontains minimal code to use our two Meta-SecAlign models. Feel free to try new samples and prompt injections, or test the models on your codebase:
python demo.py
- run_tests.pycontains commands to reproduce the evaluation results reported in our paper. It sequentially invokes- tests.py,- test_lm_eval.py,- test_agentdojo.py, and- test_injecagent.py. Results will be logged to- [model_path]/summary.tsv.
python run_tests.py -m [model_path] --lora_alpha [lora_alpha]
- model_pathis the path to the tested model. We support:- Local models (vLLM inference)
- meta-llama/Llama-3.1-8B-Instruct_SecAlign(Meta-SecAlign-8B downloaded by- setup.py): the first fully open model with state-of-the-art prompt injection defense
- meta-llama/Llama-3.3-70B-Instruct_SecAlign(Meta-SecAlign-70B downloaded by- setup.py): the first fully open model with state-of-the-art prompt injection defense
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.3-70B-Instruct
- Other Hugging Face open-weight models may also be natively supported.
 
- OpenAI GPT models
- gpt-4o-mini: the first commercial model with instruction hierarchy prompt injection defense.
- gpt-4o: the follow-up flagship model, also with prompt injection defense.
- gpt-5: the latest and most secure commercial model in our evaluation.
 
- Google Gemini models
- gemini-2.0-flash: a Google commercial model with a claimed prompt injection defense
- gemini-2.5-flash: a Google commercial model with a claimed prompt injection defense
- gemini-2.0-pro: a state-of-the-art Google model (not claimed to include a prompt injection defense)
- gemini-2.5-pro: a state-of-the-art Google model (not claimed to include a prompt injection defense)
 
 
- Local models (vLLM inference)
- [Optional] lora_alphais a test-time hyper-parameter for Meta-SecAlign models. It defaults to 8, which uses the exact Meta-SecAlign models as trained. Alora_alphavalue between 0 and 8 interpolates between the undefended model and our defended model to enable a flexible utility–security trade-off. Extrapolatinglora_alphabeyond 8 is possible but untested.
- We support the following prompt-injection benchmark evaluations for the community:
- 6 security benchmarks
- instruction following: AlpacaFarm-Hacked, SEP, TaskTracker, CyberSecEval2
- agentic tool-calling: InjecAgent, AgentDojo
 
- 8 utility benchmarks
- general knowledge (from lm_eval): MMLU, MMLU-Pro, BBH, IFEval, GPQA Diamond
- instruction following: AlpacaEval2, SEP (in SEP, we use AlpacaEval2 prompting to compare against reference responses from meta-llama/Meta-Llama-3-8B-Instruct)
 
- agentic tool-calling: AgentDojo
 
- 6 security benchmarks
- secalign_llama3.1_8b.shand- secalign_llama3.3_70b.shprovide commands to defensively fine-tune- meta-llama/Llama-3.1-8B-Instructand- meta-llama/Llama-3.3-70B-Instructto a robust LoRA model using our training recipe.
bash secalign_llama3.1_8B.sh
bash secalign_llama3.3_70B.sh
Significantly improved from SecAlign, the majority of the Meta-SecAlign code is licensed under CC-BY-NC. Portions of the project are available under separate license terms: AgentDojo, TaskTracker, and lm-evaluation-harness are licensed under MIT. Code from other repositories includes AgentDojo (agentdojo), TaskTracker (setup.py), and lm_eval_harness (lm_eval_config). This software and/or data was deposited in the BAIR Open Research Commons repository in 2025.