Smart-Prompt-Eval is a framework for stress-testing LLMs under different prompt manipulations, making it easy to benchmark robustness across tasks. Are Large Language Models (LLMs) robust yet? LLMs are powerful, but their robustness is still an open question. Smart-Prompt-Eval evaluates it with multiple prompt-manipulation techniques inspired by vulnerabilities identified in prior research. Each experiment highlights a different aspect of robustness:
- Language Errors: Testing model performance by manipulating queries to add various grammatical and spelling errors
- Multilingual Prompting: Testing model performance by translating queries into different languages
- Multiple Roles: Testing with multiple roles (user, assistant, and system)
- Evaluating Harmful Prompts: Testing model responses to potentially harmful prompts, using both original and manipulated versions of each prompt
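To illustrate the kind of manipulation the Language Errors experiment applies, here is a minimal sketch (the function name and swap strategy are illustrative, not the repository's actual implementation) that injects spelling errors by swapping adjacent letters:

```python
import random

def add_typos(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Swap adjacent letters at the given rate to simulate spelling errors."""
    rng = random.Random(seed)  # seeded, so manipulated prompts are reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

print(add_typos("Natalia sold clips to her friends", rate=1.0))
```

Because the swaps only permute letters, the manipulated prompt keeps the same characters as the original, which makes it a controlled perturbation for robustness testing.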
## Installation

- Clone the repository and open the directory:
  ```bash
  git clone https://github.com/Pro-GenAI/Smart-Prompt-Eval.git
  cd Smart-Prompt-Eval
  ```
- Install the package in development mode:
  ```bash
  pip install -e .
  ```
- (Optional) Install development dependencies:
  ```bash
  pip install -e ".[dev]"
  ```
- Set up the `.env` file:
  - Create a `.env` file in the root directory using `.env.example` as a template:
    ```bash
    cp .env.example .env
    ```
  - Edit the `.env` file to access an OpenAI-compatible API of your model (e.g., OpenAI API, Azure OpenAI, vLLM, Ollama).
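As a sketch, a filled-in `.env` for a local Ollama server might look like the following. The variable names here are assumptions for illustration; use the actual keys listed in `.env.example`:

```shell
# Hypothetical key names — copy the real ones from .env.example
OPENAI_API_KEY=none            # local servers often accept any placeholder key
OPENAI_BASE_URL=http://localhost:11434/v1
MODEL_NAME=llama3.1
```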
## Usage

Run all evaluations:
```bash
python smart_prompt_eval/run_eval.py
```

Run a single evaluation script directly:
```bash
python smart_prompt_eval/evals/harmful_eval.py
```

Run selected evaluations by name:
```bash
python smart_prompt_eval/run_eval.py harmful_eval linguistic_errors_eval
```

## Code Quality

This project uses several tools to maintain code quality:
- Black: Code formatting
- isort: Import sorting
- flake8: Linting
- mypy: Type checking
Run all quality checks:
```bash
black smart_prompt_eval tests
isort smart_prompt_eval tests
flake8
mypy
```

## Testing

Run the test suite:
```bash
pytest -q
```

## Citation

If you find this repository useful in your research, please consider citing:
```bibtex
@misc{vadlapati2025smartprompteval,
  author = {Vadlapati, Praneeth},
  title = {Smart-Prompt-Eval: Evaluating LLM Robustness with Manipulated Prompts},
  year = {2025},
  howpublished = {\url{https://github.com/Pro-GenAI/Smart-Prompt-Eval}},
  note = {GitHub repository},
}
```

## Experiments

- Linguistic Errors - Experiments on grammatical and spelling errors in prompts.
- Multilingual Prompting - Experiments on prompting LLMs in multiple languages.
- The Power of Roles - Experiments on the impact of different roles in prompting.
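The roles experiments vary which chat role carries an instruction. A minimal sketch of such variants in the OpenAI-style message format (the content strings and variant names are illustrative, not the repository's actual prompts):

```python
# The same instruction can be delivered via the system, user, or assistant role;
# role experiments compare how model behavior shifts across these placements.
base_question = "Natalia sold 48 clips in April and half as many in May. How many in total?"

system_variant = [
    {"role": "system", "content": "Answer with only the final number."},
    {"role": "user", "content": base_question},
]

assistant_variant = [
    {"role": "user", "content": base_question},
    {"role": "assistant", "content": "Let me work through this step by step."},
    {"role": "user", "content": "Continue, and give only the final number."},
]
```

Sending each variant to the same model and comparing accuracy isolates the effect of role placement from the question itself.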
## Datasets

- GSM8K (Grade School Math 8K): This repository includes a copy of the GSM8K test file for convenience (see `smart_prompt_eval/datasets/gsm8k_test.jsonl` and localized variants). GSM8K is a collection of grade-school math word problems used to evaluate numerical reasoning. The original dataset is available at the project's repository; please cite the original dataset authors and consult that repository for license and citation details.
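A sketch of how the bundled file can be read (the helper names are illustrative; the `question`/`answer` fields and the `#### ` final-answer marker follow the original GSM8K format):

```python
import json

def load_gsm8k(path: str = "smart_prompt_eval/datasets/gsm8k_test.jsonl") -> list[dict]:
    """Read one JSON record per line from the JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def final_answer(record: dict) -> str:
    """GSM8K answers end with '#### <number>'; extract that number as a string."""
    return record["answer"].rsplit("####", 1)[-1].strip()
```

Comparing a model's reply against `final_answer(record)` gives a simple exact-match score per problem.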
## Contact

For personal queries, please find the author's contact details here: https://prane-eth.github.io/
Image credits:
- User icon: https://www.flaticon.com/free-icon/user_9131478
- Robot icon: https://www.flaticon.com/free-icon/robot_18355220


