Smart-Prompt-Eval: Evaluating LLM Robustness using Manipulated Prompts


Introduction

Are Large Language Models (LLMs) robust yet? LLMs are powerful, but their robustness is still an open question. Smart-Prompt-Eval is a framework for stress-testing LLMs with manipulated prompts, making it easy to benchmark robustness across tasks. It evaluates robustness using multiple prompt manipulation techniques inspired by vulnerabilities identified in prior research, and each experiment highlights a different aspect of robustness.

Sample Manipulations

Language Error Demo (image in repository)

Harmful Prompt Demo (image in repository)

Available Experiments

  • Language Errors: Testing model performance by adding various grammatical and spelling errors to queries (an illustrative sketch follows this list)
  • Multilingual Prompting: Testing model performance by translating queries into different languages
  • Multiple Roles: Testing with multiple roles (user, assistant, and system)
  • Evaluating Harmful Prompts: Testing model responses to potentially harmful prompts, using both original and manipulated versions of those prompts
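For illustration, here is a minimal sketch of the kind of language-error manipulation the first experiment applies. The function name and the specific perturbation (swapping adjacent characters) are assumptions for illustration only; the actual manipulations are defined in the repository's evaluation code.

# Illustrative sketch only -- not the project's implementation.
# Perturbs a query with simple typo-style errors before it is sent to the model.
import random

def add_language_errors(query: str, error_rate: float = 0.2, seed: int = 42) -> str:
    """Randomly swap adjacent characters in some words to simulate typos."""
    rng = random.Random(seed)
    words = query.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < error_rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)

print(add_language_errors("Natalia sold clips to 48 of her friends in April."))

The idea behind such a manipulation is to compare model performance on the perturbed queries against performance on the originals, quantifying how much small surface changes degrade accuracy.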

Setup

  1. Clone the repository and open the directory:

    git clone https://github.com/Pro-GenAI/Smart-Prompt-Eval.git
    cd Smart-Prompt-Eval
  2. Install the package in development mode:

    pip install -e .
  3. (Optional) Install development dependencies:

    pip install -e ".[dev]"
  4. Set up .env file:

    • Create a .env file in the root directory using .env.example as a template:
    cp .env.example .env
    • Edit the .env file with the details needed to access an OpenAI-compatible API serving your model (an illustrative usage sketch follows this list).
      • Example: OpenAI API, Azure OpenAI, vLLM, Ollama, etc.
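To show what the .env values are typically used for, here is a minimal sketch of configuring an OpenAI-compatible client from environment variables. The variable names (OPENAI_BASE_URL, OPENAI_API_KEY, MODEL_NAME) are assumptions for illustration; .env.example is the authoritative reference for the names the project actually expects.

# Illustrative sketch, not the project's code. Assumes the openai and
# python-dotenv packages are installed and that the variable names below
# match those in .env.example (check that file for the real names).
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # load key=value pairs from the .env file in the project root

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),  # e.g. an OpenAI, vLLM, or Ollama endpoint
    api_key=os.getenv("OPENAI_API_KEY"),
)
response = client.chat.completions.create(
    model=os.getenv("MODEL_NAME", "gpt-4o-mini"),
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)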

Usage

Running All Evaluations

python smart_prompt_eval/run_eval.py

Running Individual Evaluations

# Run a single evaluation script directly:
python smart_prompt_eval/evals/harmful_eval.py

# Or run selected evaluations by name:
python smart_prompt_eval/run_eval.py harmful_eval linguistic_errors_eval

Development

Code Quality

This project uses several tools to maintain code quality:

  • Black: Code formatting
  • isort: Import sorting
  • flake8: Linting
  • mypy: Type checking

Run all quality checks:

black smart_prompt_eval tests
isort smart_prompt_eval tests
flake8
mypy

Testing

Run the test suite:

pytest -q

Citation

If you find this repository useful in your research, please consider citing:

@misc{vadlapati2025smartprompteval,
  author       = {Vadlapati, Praneeth},
  title        = {Smart-Prompt-Eval: Evaluating LLM Robustness with Manipulated Prompts},
  year         = {2025},
  howpublished = {\url{https://github.com/Pro-GenAI/Smart-Prompt-Eval}},
  note         = {GitHub repository},
}

This project was created based on my past papers and code.

Supported datasets (more can be added in the future):

  • GSM8K (Grade School Math 8K): A collection of grade-school math word problems used to evaluate numerical reasoning. This repository includes a copy of the GSM8K test file for convenience (see smart_prompt_eval/datasets/gsm8k_test.jsonl and localized variants). Please cite the original dataset authors and consult the original GSM8K repository for license and citation details.
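For reference, here is a minimal sketch of loading the bundled GSM8K test file. The question and answer field names follow the original GSM8K JSONL format; this loader is illustrative rather than the project's own data-loading code.

# Illustrative loader for the bundled GSM8K test split (not the project's own code).
import json
from pathlib import Path

dataset_path = Path("smart_prompt_eval/datasets/gsm8k_test.jsonl")
with dataset_path.open(encoding="utf-8") as f:
    problems = [json.loads(line) for line in f if line.strip()]

print(len(problems), "problems loaded")
print(problems[0]["question"])  # each record carries "question" and "answer" fields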

📧 Contact

For personal queries, please find the author's contact details here: https://prane-eth.github.io/

Image credits: