Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Sizhe Chen*, Arman Zharmagambetov, David Wagner, Chuan Guo* (* for equal technical contributions)

🔥 Meta-SecAlign models are now licensed for commercial use under the Llama community licenses, despite this codebase being licensed for non-commercial use only.

Comparable to GPT-5-high in agentic (tool/web) utility and security, Meta-SecAlign-70B is the first fully open-source LLM with built-in prompt injection defense and commercial-grade performance, unlocking open research on secure agentic applications.

Updates (10/28/2025) — from the 07/07/2025 version

Report the combined attack success rate (a sample is counted as attacked if any tested attack method succeeds) for non-adaptive and adaptive (added) attacks; adaptive attacks use fake delimiters (similar to official ones) to mimic a fake conversation with the model.
Add support for evaluating GPT-5 on all benchmarks.
Use witness-word appearance (instead of an LLM judge) as the attack success criterion for SEP security evaluation, reducing evaluation costs.
Parallelize LLM-judge queries to accelerate TaskTracker evaluation.
Fix multiple evaluation bugs that produced incorrect numbers.
Simpler setup: one unified uv environment for both evaluation and fine-tuning; easy AgentDojo evaluations via test_agentdojo.py; no need to download the torchtune scripts into the working folder; secondary files have been moved from the working folder to /helpers.

Environment Setup

Hardware requirements: Meta-SecAlign-8B requires 4×80 GB A100s for training and one 16 GB GPU for evaluation. Meta-SecAlign-70B requires 8×141 GB H200s for training and 4 (we recommend 8 for efficiency) 80 GB A100s for evaluation.
Install uv (a Python package management tool), and then in your home directory run:

uv venv metasecalign --python 3.13
source metasecalign/bin/activate

Install Meta-SecAlign package dependencies:

git clone --recurse-submodules https://github.com/facebookresearch/Meta_SecAlign.git
cd Meta_SecAlign
uv pip install -r requirements.txt

Install Meta-SecAlign data dependencies (including those used for SEP utility evaluation if you have a GPU available):

python setup.py

Configure OpenAI keys (used for utility evaluation) in data/openai_configs.yaml. That file contains an example of accessing the OpenAI API via AzureOpenAI. A more detailed example is available here.
[Optional] Configure Gemini keys in data/gemini_configs.yaml if you want to evaluate Gemini models.

Demo

demo.py contains minimal code to use our two Meta-SecAlign models. Feel free to try new samples and prompt injections, or test the models on your codebase:

python demo.py

Evaluation

run_tests.py contains commands to reproduce the evaluation results reported in our paper. It sequentially invokes tests.py, test_lm_eval.py, test_agentdojo.py, and test_injecagent.py. Results will be logged to [model_path]/summary.tsv.

python run_tests.py -m [model_path] --lora_alpha [lora_alpha]

model_path is the path to the tested model. We support:
- Local models (vLLM inference)
  - meta-llama/Llama-3.1-8B-Instruct_SecAlign (Meta-SecAlign-8B downloaded by setup.py): the first fully open model with state-of-the-art prompt injection defense
  - meta-llama/Llama-3.3-70B-Instruct_SecAlign (Meta-SecAlign-70B downloaded by setup.py): the first fully open model with state-of-the-art prompt injection defense
  - meta-llama/Llama-3.1-8B-Instruct
  - meta-llama/Llama-3.3-70B-Instruct
  - Other Hugging Face open-weight models may also be natively supported.
- OpenAI GPT models
  - gpt-4o-mini: the first commercial model with instruction hierarchy prompt injection defense.
  - gpt-4o: the follow-up flagship model, also with prompt injection defense.
  - gpt-5: the latest and most secure commercial model in our evaluation.
- Google Gemini models
  - gemini-2.0-flash: a Google commercial model with a claimed prompt injection defense
  - gemini-2.5-flash: a Google commercial model with a claimed prompt injection defense
  - gemini-2.0-pro: a state-of-the-art Google model (not claimed to include a prompt injection defense)
  - gemini-2.5-pro: a state-of-the-art Google model (not claimed to include a prompt injection defense)
[Optional] lora_alpha is a test-time hyper-parameter for Meta-SecAlign models. It defaults to 8, which uses the exact Meta-SecAlign models as trained. A lora_alpha value between 0 and 8 interpolates between the undefended model and our defended model to enable a flexible utility–security trade-off. Extrapolating lora_alpha beyond 8 is possible but untested.
We support the following prompt-injection benchmark evaluations for the community:
- 6 security benchmarks
  - instruction following: AlpacaFarm-Hacked, SEP, TaskTracker, CyberSecEval2
  - agentic tool-calling: InjecAgent, AgentDojo
- 8 utility benchmarks
  - general knowledge (from lm_eval): MMLU, MMLU-Pro, BBH, IFEval, GPQA Diamond
  - instruction following: AlpacaEval2, SEP (in SEP, we use AlpacaEval2 prompting to compare against reference responses from meta-llama/Meta-Llama-3-8B-Instruct)
- agentic tool-calling: AgentDojo

Defensive Fine-Tuning (SecAlign++)

secalign_llama3.1_8b.sh and secalign_llama3.3_70b.sh provide commands to defensively fine-tune meta-llama/Llama-3.1-8B-Instruct and meta-llama/Llama-3.3-70B-Instruct to a robust LoRA model using our training recipe.

bash secalign_llama3.1_8B.sh
bash secalign_llama3.3_70B.sh

Code Acknowledgements

Significantly improved from SecAlign, the majority of the Meta-SecAlign code is licensed under CC-BY-NC. Portions of the project are available under separate license terms: AgentDojo, TaskTracker, and lm-evaluation-harness are licensed under MIT. Code from other repositories includes AgentDojo (agentdojo), TaskTracker (setup.py), and lm_eval_harness (lm_eval_config). This software and/or data was deposited in the BAIR Open Research Commons repository in 2025.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
agentdojo @ d3640b5		agentdojo @ d3640b5
data		data
helpers		helpers
lm_eval_config		lm_eval_config
.gitignore		.gitignore
.gitmodules		.gitmodules
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
config.py		config.py
demo.py		demo.py
generate_data.py		generate_data.py
requirements.txt		requirements.txt
run_tests.py		run_tests.py
secalign_llama3.1_8b.sh		secalign_llama3.1_8b.sh
secalign_llama3.3_70b.sh		secalign_llama3.3_70b.sh
setup.py		setup.py
test.py		test.py
test_agentdojo.py		test_agentdojo.py
test_injecagent.py		test_injecagent.py
test_lm_eval.py		test_lm_eval.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Updates (10/28/2025) — from the 07/07/2025 version

Environment Setup

Demo

Evaluation

Defensive Fine-Tuning (SecAlign++)

Code Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

License

facebookresearch/Meta_SecAlign

Folders and files

Latest commit

History

Repository files navigation

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Updates (10/28/2025) — from the 07/07/2025 version

Environment Setup

Demo

Evaluation

Defensive Fine-Tuning (SecAlign++)

Code Acknowledgements

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages