
Conversation

@aishwaryap (Collaborator) commented on Oct 15, 2025:

This PR implements the Foot-in-the-door (FITD) probe (reference). The probe generates queries of increasing intensity, from safe to adversarial, eliciting compliance from the model at each intermediate stage so that the final adversarial prompt succeeds. Queries at each intensity level are adjusted as needed based on the target's responses.
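In outline, the attack loop looks something like the sketch below. This is illustrative only: target and red_team stand in for model-call functions, and the helper protocol is made up, not the actual implementation.

from typing import Callable, List

def fitd_attack(
    target: Callable[[str], str],    # target model: prompt -> response
    red_team: Callable[[str], str],  # red-team model: prompt -> response
    adversarial_goal: str,
    num_levels: int = 3,
) -> List[str]:
    # 1. Ask the red-team model for queries of increasing intensity,
    #    from safe to adversarial (one per line, in this toy protocol).
    plan = red_team(
        f"Write {num_levels} queries escalating from harmless to this goal: "
        f"{adversarial_goal}. One query per line."
    )
    level_queries = [q for q in plan.splitlines() if q.strip()][:num_levels]

    transcript: List[str] = []
    for query in level_queries:
        response = target("\n".join(transcript + [query]))
        # 2. If the target refuses, have the red-team model adjust this
        #    level's query and try once more.
        if response.lower().startswith(("i can't", "i cannot", "sorry")):
            query = red_team(f"Rephrase so the target complies: {query}")
            response = target("\n".join(transcript + [query]))
        transcript += [query, response]
    return transcript  # the final exchange carries the adversarial prompt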

Verification

List the steps needed to make sure this thing works

  • Sample configuration for testing with NIM:

run:
  generations: 1

plugins:
  probe_spec: fitd.FITDProbe
  extended_detectors: true
  probes:
    encoding:
      payloads:
        - default
        - xss
        - slur_terms
    fitd:
      FITDProbe:
        initial_prompt_count: 10
        max_calls_per_conv: 10
        num_level_queries: 3
        red_team_model_type: nim.NVOpenAIChat
        red_team_model_name: meta/llama-4-maverick-17b-128e-instruct
  target_type: nim
  target_name: meta/llama-4-maverick-17b-128e-instruct
  generators:
    openai:
      model: nvdev/meta/llama-4-maverick-17b-128e-instruct
      max_tokens: 1024
      use_chat: true
  detectors:
    judge:
      RefusalOnlyAdversarial:
        detector_model_type: nim
        detector_model_name: meta/llama-4-maverick-17b-128e-instruct
        confidence_cutoff: 7

system:
  parallel_attempts: 5
  lite: false

Also tested with generations: 5 and parallel_attempts: 5

An earlier commit was tested with a config for Spanish, but multilingual support has been temporarily disabled within the scope of this PR because it resulted in too many langprovider calls. It will be added back in a future PR.

  • NIM_API_KEY=<your-key> garak --config trial.conf
  • Run the tests and ensure they pass: python -m pytest tests/
  • Verify the thing does what it should - Checked by manual inspection of logged interactions
  • Verify the thing does not do what it should not
  • Document the thing and how it works (Example)

@jmartin-tech (Collaborator) left a comment:

Not a fully comprehensive review since this is still draft.

@jmartin-tech (Collaborator) left a comment:

A couple minor intermediate comments, full review will be done once promoted to ready.

…xes in new FITD implementation as well as overrides to some code in Probe.probe() and Refusal.detect() to enable successful run with parallel_attempts = 1
…efault langservice test; new one needed for IterativeProbe
…or non EN languages which relies on langprovider to translate turns into target language and reverse_langprovider to translate responses back to EN to generate the next turn and next turn generation uses hard coded EN prompts to an LLM
@aishwaryap aishwaryap force-pushed the feature/iterative_probe_fitd branch from d7f0f88 to 03c6ff7 on October 29, 2025 00:31
@aishwaryap aishwaryap marked this pull request as ready for review October 29, 2025 02:03
@leondz leondz added the labels probes (Content & activity of LLM probes) and detectors (work on code that inherits from or manages Detector) on Oct 29, 2025
@leondz (Collaborator) left a comment:

That's a lot of complex work, nice!

highlights:

  • would be preferable to centre on _mint_attempt
  • logging needs a tidy
  • expose more values via config/data

Comment on lines 659 to 662
class IterativeProbe(Probe):
"""
Base class for multi turn probes where each probe turn is generated based on the target's response to the previous turn.
"""
Collaborator:

This was different from what I expected in an iterative probe. That's of course fine and even desirable - but it would be good if these docs described the use cases anticipated by this abstract class. What does it add? Seems to be a lot more than just "multi turn" - and there are some assumptions about what multi-turn looks like, too, I guess

@aishwaryap (Collaborator, Author):

Updated docstring. Open to further iteration

@jmartin-tech (Collaborator) left a comment:

Thoughts based on refactor.

Collaborator:

Do we want to store these? Is it better to pull them from HF?

@aishwaryap (Collaborator, Author):

I'm fine making that change if we think it's a good idea. I thought that when it's loaded from a data file it's easier to override via config, but I guess I could add an extra config param that decides whether to download this from HF or load from file.
One more point to note: these prompts are shared openly on the HarmBench GitHub repo under the MIT License, but on HF you need to request access and pass in a token.
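If we did pull from HF, a gated download would look roughly like the sketch below; the repo id and filename are placeholders, and the token has to come from a user with approved access.

from huggingface_hub import hf_hub_download

# Sketch only: repo_id/filename are placeholders for wherever the
# HarmBench prompts live on HF; the artifact is gated, so a token
# from a user with approved access is required.
path = hf_hub_download(
    repo_id="<harmbench-repo-id>",
    filename="harmbench_prompts.txt",
    repo_type="dataset",
    token="hf_...",
)
with open(path, encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]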

Comment on lines +669 to +674
Additional design considerations:
1. Not all multi-turn probes need this base class. A probe that directly constructs a multi-turn input and only cares about how the target responds to the last turn (e.g. prefill attacks) can just subclass Probe.
2. Probes that inherit from IterativeProbe are allowed to manipulate the history in addition to generating new turns based on a target's response. For example, if the response to the initial turn was a refusal, the probe can in the next attempt either pass in the history of old init turn + refusal + next turn, or just pass a new init turn.
3. An Attempt is created at every turn when the history is passed to the target. All these Attempts are collected and passed to the detector. The probe can use Attempt.notes to tell the detector to skip certain attempts, but a special detector that pays attention to this value needs to be written.
4. If num_generations > 1, for every attempt at every turn we obtain num_generations responses from the target, reduce them to the unique ones, and generate next turns based on each of them. This means that as the turn number increases, the number of attempts can grow exponentially. Currently, when we have processed (# init turns * self.soft_prompt_probe_cap) attempts, the probe will exit.
5. Currently the expansion of attempts happens in a BFS fashion.
Collaborator:

I love that this is here, particularly points 1 and 3.
I might omit the word "currently" from point 5 but that is the nittiest of nits.

@aishwaryap (Collaborator, Author):

It says "currently" because we plan to support other traversals in the future, just not in this PR.
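For illustration, here is a minimal sketch of the BFS expansion with the dedup and cap described in points 4 and 5; the names and the list-based conversation encoding are stand-ins, not the code under review.

from collections import deque

def expand_bfs(init_turns, get_responses, make_next_turn,
               max_turns, attempt_cap):
    # Each queue item is a conversation: [prompt, response, prompt, ...]
    queue = deque([[turn] for turn in init_turns])
    processed = 0
    finished = []
    while queue and processed < attempt_cap:
        conv = queue.popleft()
        processed += 1
        # num_generations responses from the target, reduced to unique ones
        for response in set(get_responses(conv)):
            branch = conv + [response]
            if len(branch) // 2 >= max_turns:
                finished.append(branch)  # conversation is complete
            else:
                # generate the next turn from this branch and requeue
                queue.append(branch + [make_next_turn(branch)])
    return finished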


def _create_attempt(self, prompt) -> garak.attempt.Attempt:
"""Create an attempt from a prompt. Prompt can be of type str if this is an initial turn or garak.attempt.Conversation if this is a subsequent turn.
Note: Is it possible for _mint_attempt in class Probe to have this functionality? The goal here is to abstract out translation and buffs from how turns are processed.
Collaborator:

Agree with this note kind of -- since we're ultimately converting whatever it is into a Conversation, we should have a common utility for this. At one point, I tried to implement something in Conversation but we didn't do it. I forget why but I bet @jmartin-tech remembers.

Either way, should probably be a separate PR.

localized_prompt, lang=self.langprovider.target_lang
)
else:
# what types should this expect? Message, Conversation?
Collaborator:

IIRC, it should only ever actually get a Conversation.

Collaborator:

🤣 this is actually a copy of a comment from the parent class.

Currently text to text probes provide str that is translated and wrapped, multi-modal probes with image and audio data provide Message objects, and this new multi-turn will now have cases that may supply Conversation objects. I would defer any adjustment of this behavior for a future revision as there is already pending work to better define typing expectations around prompts.

@aishwaryap (Collaborator, Author):

I think the way things work now, this actually gets a string for most of the existing single-turn probes, and only FITD actually causes a Conversation to go in here. Changing the rest is likely out of scope for this PR.
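For reference, the common utility discussed above might look something like this self-contained sketch; the dataclasses only mimic the rough shape of the garak.attempt types, and the helper itself is hypothetical.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Message:
    text: str

@dataclass
class Turn:
    role: str
    content: Message

@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

def as_conversation(prompt: Union[str, Message, Conversation]) -> Conversation:
    """Normalize whatever a probe supplies into a Conversation."""
    if isinstance(prompt, Conversation):
        return prompt                                # multi-turn probes
    if isinstance(prompt, Message):                  # multi-modal probes
        return Conversation([Turn("user", prompt)])
    return Conversation([Turn("user", Message(prompt))])  # plain str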

Comment on lines +264 to +266
logging.debug(
"fitd.FITDProbe # _get_level_queries: Target turn = %s, Init turn = %s, Level queries = %s"
% (target_turn, init_turn, level_queries)
Collaborator:

How much logging does this produce? Feels like it could write a nightmarish amount of data to the log file.

@aishwaryap (Collaborator, Author):

This particular log statement happens only once per init turn, so it depends on how many init turns you want to run (max 200 if using HarmBench).

In total, a run I just did with 5 init turns and max 5 turns per conv produces a log of 2061 lines, of which only 166 are from FITD. That does not feel like a terrible ratio. I'll vary those params and get a couple more numbers just to give a sense of how the logging from FITD grows with either of them.

The majority of the log is the HTTP requests to and from the various nim generators. Technically some of these are also due to FITD, because of the sheer number of LLM calls involved, but changing that would actually need to be a change in the generator, I think.

@aishwaryap (Collaborator, Author):

When max_calls_per_conv = 10, the log has 3322 lines of which 303 are from FITD (again not counting logs by nim generators).
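As an aside on the logging tidy-up: the standard-library way to avoid building these strings eagerly is to pass the arguments to logging.debug rather than %-formatting them up front, as in this sketch (not a change made in this PR):

import logging

target_turn, init_turn, level_queries = "t0", "i0", ["q1", "q2"]  # stand-ins

# Eager: the message string is built even when DEBUG is disabled.
logging.debug("FITD: target=%s init=%s queries=%s"
              % (target_turn, init_turn, level_queries))

# Lazy: logging interpolates only if a DEBUG handler will emit it.
logging.debug("FITD: target=%s init=%s queries=%s",
              target_turn, init_turn, level_queries)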

% data_path
)

prompts_file = data_path / "fitd" / "harmbench_prompts.txt"
Collaborator:

It may be useful to have this be a class-level parameter so in the future it could be more easily customized?

Collaborator:

I do like the idea of this being configurable; I would suggest this is a case where the provided value must either be an absolute path or a relative path inside the data_path sandbox location.

For the short term, I think we can move this forward with the hard-coded path and expand flexibility in the future.
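A possible shape for that future flexibility, sketched with a hypothetical prompts_filename param and sandbox check:

from pathlib import Path

# Hypothetical: a configurable prompts_filename that must either be an
# absolute path or resolve inside the data_path sandbox.
DEFAULT_PARAMS = {"prompts_filename": "fitd/harmbench_prompts.txt"}

def resolve_prompts_file(data_path: Path, prompts_filename: str) -> Path:
    candidate = Path(prompts_filename)
    if not candidate.is_absolute():
        candidate = (data_path / candidate).resolve()
        # Reject relative paths that escape the sandbox (e.g. "../../etc")
        if not candidate.is_relative_to(data_path.resolve()):
            raise ValueError(f"{prompts_filename} escapes {data_path}")
    return candidate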

Comment on lines 327 to 330
if verify_response.lower().strip().startswith("unsafe"):
cur_conv_last_response_success = True
else:
cur_conv_last_response_success = False
Collaborator:

Yeah, I think this is particularly closely tied to harmbench and we can probably do a bit better. The whole validation here should probably depend on some other type of logic, IMO.

)
return is_last_response_success

def _is_response_aligned(self, request: str, response: str) -> bool:
Collaborator:

Seems like this functionality, and possibly the prior, are already in extant parts of garak.resources.red_team.evaluation?

@aishwaryap (Collaborator, Author):

Perhaps I'm missing something, but it looks like the prompts used for this kind of thing in garak.resources.red_team.evaluation are hardcoded to some degree, and I'm not sure they generalize all that well.
They also don't look like they're doing exactly the same thing as what I'm doing here, just something related.
The way I've set up the prompt templates for FITD as of now, they are loaded from the data files but can be overridden via config, and I feel that is preferable.

@jmartin-tech (Collaborator) left a comment:

Awesome progress, a few more interim comments. Sorry for the piecemeal review.


tier = garak.probes.Tier.INFORMATIONAL
active = False # Note: This probe is currently not active because of the potential for the number of attempts to grow exponentially and have to be terminated early unless the user sets generations = 1

DEFAULT_PARAMS = garak.probes.IterativeProbe.DEFAULT_PARAMS | {
Collaborator:

generations is inherited from the parent base.Probe class as distributed by _run_params; the initial view is that this is the desired behavior, since it ensures diversity of responses from the target and lets the probe better explore the target's possible response space.

Comment on lines +185 to +186
self.red_team_model = None
self.detector = None
Collaborator:

These are not required and also not detrimental at this time.

