
Conversation

@aishwaryap (Collaborator) commented on Oct 15, 2025:

This PR implements the Foot-in-the-door (FITD) probe (reference). The probe generates queries of increasing intensity, from safe to adversarial, eliciting compliance from the model at each intermediate stage so that the final adversarial prompt succeeds. Queries at each intensity level are adjusted as needed based on the target's responses.
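In outline, the attack loop looks something like the sketch below. This is illustrative only: target and red_team stand in for model-call functions, and the helper protocol is made up, not the actual implementation.

from typing import Callable, List

def fitd_attack(
    target: Callable[[str], str],    # target model: prompt -> response
    red_team: Callable[[str], str],  # red-team model: prompt -> response
    adversarial_goal: str,
    num_levels: int = 3,
) -> List[str]:
    # 1. Ask the red-team model for queries of increasing intensity,
    #    from safe to adversarial (one per line, in this toy protocol).
    plan = red_team(
        f"Write {num_levels} queries escalating from harmless to this goal: "
        f"{adversarial_goal}. One query per line."
    )
    level_queries = [q for q in plan.splitlines() if q.strip()][:num_levels]

    transcript: List[str] = []
    for query in level_queries:
        response = target("\n".join(transcript + [query]))
        # 2. If the target refuses, have the red-team model adjust this
        #    level's query and try once more.
        if response.lower().startswith(("i can't", "i cannot", "sorry")):
            query = red_team(f"Rephrase so the target complies: {query}")
            response = target("\n".join(transcript + [query]))
        transcript += [query, response]
    return transcript  # the final exchange carries the adversarial prompt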

Verification

List the steps needed to make sure this thing works

  • Sample configuration for testing with NIM:

run:
  generations: 1

plugins:
  probe_spec: fitd.FITDProbe
  extended_detectors: true
  probes:
    encoding:
      payloads:
        - default
        - xss
        - slur_terms
    fitd:
      FITDProbe:
        initial_prompt_count: 10
        max_calls_per_conv: 10
        num_level_queries: 3
        red_team_model_type: nim.NVOpenAIChat
        red_team_model_name: meta/llama-4-maverick-17b-128e-instruct
  target_type: nim
  target_name: meta/llama-4-maverick-17b-128e-instruct
  generators:
    openai:
      model: nvdev/meta/llama-4-maverick-17b-128e-instruct
      max_tokens: 1024
      use_chat: true
  detectors:
    judge:
      RefusalOnlyAdversarial:
        detector_model_type: nim
        detector_model_name: meta/llama-4-maverick-17b-128e-instruct
        confidence_cutoff: 7

system:
  parallel_attempts: 5
  lite: false

Also tested with generations: 5 and parallel_attempts: 5

An earlier commit was tested with a config for Spanish, but multilingual support has been temporarily disabled within the scope of this PR because it resulted in too many langprovider calls. It will be added back in a future PR.

  • NIM_API_KEY=<your-key> garak --config trial.conf
  • Run the tests and ensure they pass: python -m pytest tests/
  • Verify the thing does what it should - Checked by manual inspection of logged interactions
  • Verify the thing does not do what it should not
  • Document the thing and how it works (Example)

@jmartin-tech (Collaborator) left a comment:

Not a fully comprehensive review since this is still draft.

@jmartin-tech (Collaborator) left a comment:

A couple minor intermediate comments, full review will be done once promoted to ready.

…xes in new FITD implementation as well as overrides to some code in Probe.probe() and Refusal.detect() to enable successful run with parallel_attempts = 1
…efault langservice test; new one needed for IterativeProbe
…or non EN languages which relies on langprovider to translate turns into target language and reverse_langprovider to translate responses back to EN to generate the next turn and next turn generation uses hard coded EN prompts to an LLM
@aishwaryap aishwaryap force-pushed the feature/iterative_probe_fitd branch from d7f0f88 to 03c6ff7 on October 29, 2025 00:31
@aishwaryap aishwaryap marked this pull request as ready for review October 29, 2025 02:03
@leondz leondz added the labels probes (Content & activity of LLM probes) and detectors (work on code that inherits from or manages Detector) on Oct 29, 2025
@leondz (Collaborator) left a comment:

That's a lot of complex work, nice!

highlights:

  • would be preferable to centre on _mint_attempt
  • logging needs a tidy
  • expose more values via config/data

Comment on lines 659 to 662
class IterativeProbe(Probe):
"""
Base class for multi turn probes where each probe turn is generated based on the target's response to the previous turn.
"""
Collaborator:

This was different from what I expected in an iterative probe. That's of course fine and even desirable - but it would be good if these docs described the use cases anticipated by this abstract class. What does it add? Seems to be a lot more than just "multi turn" - and there are some assumptions about what multi-turn looks like, too, I guess

@aishwaryap (Collaborator, Author):

Updated docstring. Open to further iteration

@jmartin-tech (Collaborator) left a comment:

Thoughts based on refactor.

Collaborator:

Do we want to store these? Is it better to pull them from HF?

@aishwaryap (Collaborator, Author):

I'm fine making that change if we think it's a good idea. I thought that when it's loaded from a data file it's easier to override via config, but I guess I could add an extra config param that decides whether to download this from HF or load from file.
One more point to note: these prompts are shared openly on the HarmBench GitHub repo under the MIT License, but on HF you need to request access and pass in a token.
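If we did pull from HF, a gated download would look roughly like the sketch below; the repo id and filename are placeholders, and the token has to come from a user with approved access.

from huggingface_hub import hf_hub_download

# Sketch only: repo_id/filename are placeholders for wherever the
# HarmBench prompts live on HF; the artifact is gated, so a token
# from a user with approved access is required.
path = hf_hub_download(
    repo_id="<harmbench-repo-id>",
    filename="harmbench_prompts.txt",
    repo_type="dataset",
    token="hf_...",
)
with open(path, encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]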

Comment on lines +669 to +674
Additional design considerations:
1. Not all multi-turn probes need this base class. A probe that directly constructs a multi-turn input and only cares about how the target responds to the last turn (e.g. prefill attacks) can just subclass Probe.
2. Probes that inherit from IterativeProbe are allowed to manipulate the history in addition to generating new turns based on a target's response. For example, if the response to the initial turn was a refusal, the probe can in the next attempt either pass in the history of old init turn + refusal + next turn, or just pass a new init turn.
3. An Attempt is created at every turn when the history is passed to the target. All these Attempts are collected and passed to the detector. The probe can use Attempt.notes to tell the detector to skip certain attempts, but a special detector that pays attention to this value needs to be written.
4. If num_generations > 1, for every attempt at every turn we obtain num_generations responses from the target, reduce them to the unique ones, and generate next turns based on each of them. This means that as the turn number increases, the number of attempts can grow exponentially. Currently, when we have processed (# init turns * self.soft_prompt_probe_cap) attempts, the probe will exit.
5. Currently the expansion of attempts happens in a BFS fashion.
Collaborator:

I love that this is here, particularly points 1 and 3.
I might omit the word "currently" from point 5 but that is the nittiest of nits.

@aishwaryap (Collaborator, Author):

It says "currently" because we plan to support other traversals in the future, just not in this PR.
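For illustration, here is a minimal sketch of the BFS expansion with the dedup and cap described in points 4 and 5; the names and the list-based conversation encoding are stand-ins, not the code under review.

from collections import deque

def expand_bfs(init_turns, get_responses, make_next_turn,
               max_turns, attempt_cap):
    # Each queue item is a conversation: [prompt, response, prompt, ...]
    queue = deque([[turn] for turn in init_turns])
    processed = 0
    finished = []
    while queue and processed < attempt_cap:
        conv = queue.popleft()
        processed += 1
        # num_generations responses from the target, reduced to unique ones
        for response in set(get_responses(conv)):
            branch = conv + [response]
            if len(branch) // 2 >= max_turns:
                finished.append(branch)  # conversation is complete
            else:
                # generate the next turn from this branch and requeue
                queue.append(branch + [make_next_turn(branch)])
    return finished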


def _create_attempt(self, prompt) -> garak.attempt.Attempt:
"""Create an attempt from a prompt. Prompt can be of type str if this is an initial turn or garak.attempt.Conversation if this is a subsequent turn.
Note: Is it possible for _mint_attempt in class Probe to have this functionality? The goal here is to abstract out translation and buffs from how turns are processed.
Collaborator:

Agree with this note kind of -- since we're ultimately converting whatever it is into a Conversation, we should have a common utility for this. At one point, I tried to implement something in Conversation but we didn't do it. I forget why but I bet @jmartin-tech remembers.

Either way, should probably be a separate PR.

localized_prompt, lang=self.langprovider.target_lang
)
else:
# what types should this expect? Message, Conversation?
Collaborator:

IIRC, it should only ever actually get a Conversation.

Collaborator:

🤣 this is actually a copy of a comment from the parent class.

Currently text to text probes provide str that is translated and wrapped, multi-modal probes with image and audio data provide Message objects, and this new multi-turn will now have cases that may supply Conversation objects. I would defer any adjustment of this behavior for a future revision as there is already pending work to better define typing expectations around prompts.

@aishwaryap (Collaborator, Author):

I think the way things work now, this actually gets a string for most of the existing single-turn probes, and only FITD actually causes a Conversation to go in here. Changing the rest is likely out of scope for this PR.
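For reference, the common utility discussed above might look something like this self-contained sketch; the dataclasses only mimic the rough shape of the garak.attempt types, and the helper itself is hypothetical.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Message:
    text: str

@dataclass
class Turn:
    role: str
    content: Message

@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

def as_conversation(prompt: Union[str, Message, Conversation]) -> Conversation:
    """Normalize whatever a probe supplies into a Conversation."""
    if isinstance(prompt, Conversation):
        return prompt                                # multi-turn probes
    if isinstance(prompt, Message):                  # multi-modal probes
        return Conversation([Turn("user", prompt)])
    return Conversation([Turn("user", Message(prompt))])  # plain str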

Comment on lines +264 to +266
logging.debug(
"fitd.FITDProbe # _get_level_queries: Target turn = %s, Init turn = %s, Level queries = %s"
% (target_turn, init_turn, level_queries)
Collaborator:

How much logging does this produce? Feels like it could write a nightmarish amount of data to the log file.

@aishwaryap (Collaborator, Author):

This particular log statement happens only once per init turn, so it depends on how many init turns you want to run (max 200 if using HarmBench).

In total, a run I just did with 5 init turns and max 5 turns per conv produces a log of 2061 lines, of which only 166 are from FITD. That does not feel like a terrible ratio. I'll vary those params and get a couple more numbers just to give a sense of how the logging from FITD grows with either of them.

The majority of the log is the HTTP requests to and from the various nim generators. Technically some of these are also due to FITD, because of the sheer number of LLM calls involved, but changing that would actually need to be a change in the generator, I think.

@aishwaryap (Collaborator, Author):

When max_calls_per_conv = 10, the log has 3322 lines of which 303 are from FITD (again not counting logs by nim generators).
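As an aside on the logging tidy-up: the standard-library way to avoid building these strings eagerly is to pass the arguments to logging.debug rather than %-formatting them up front, as in this sketch (not a change made in this PR):

import logging

target_turn, init_turn, level_queries = "t0", "i0", ["q1", "q2"]  # stand-ins

# Eager: the message string is built even when DEBUG is disabled.
logging.debug("FITD: target=%s init=%s queries=%s"
              % (target_turn, init_turn, level_queries))

# Lazy: logging interpolates only if a DEBUG handler will emit it.
logging.debug("FITD: target=%s init=%s queries=%s",
              target_turn, init_turn, level_queries)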

% data_path
)

prompts_file = data_path / "fitd" / "harmbench_prompts.txt"
Collaborator:

It may be useful to have this be a class-level parameter so in the future it could be more easily customized?

Collaborator:

I do like the idea of this being configurable; I would suggest this is a case where the provided value must either be an absolute path or a relative path inside the data_path sandbox location.

For the short term, I think we can move this forward with the hard-coded path and expand flexibility in the future.
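A possible shape for that future flexibility, sketched with a hypothetical prompts_filename param and sandbox check:

from pathlib import Path

# Hypothetical: a configurable prompts_filename that must either be an
# absolute path or resolve inside the data_path sandbox.
DEFAULT_PARAMS = {"prompts_filename": "fitd/harmbench_prompts.txt"}

def resolve_prompts_file(data_path: Path, prompts_filename: str) -> Path:
    candidate = Path(prompts_filename)
    if not candidate.is_absolute():
        candidate = (data_path / candidate).resolve()
        # Reject relative paths that escape the sandbox (e.g. "../../etc")
        if not candidate.is_relative_to(data_path.resolve()):
            raise ValueError(f"{prompts_filename} escapes {data_path}")
    return candidate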

Comment on lines 327 to 330
if verify_response.lower().strip().startswith("unsafe"):
cur_conv_last_response_success = True
else:
cur_conv_last_response_success = False
Collaborator:

Yeah, I think this is particularly closely tied to harmbench and we can probably do a bit better. The whole validation here should probably depend on some other type of logic, IMO.

)
return is_last_response_success

def _is_response_aligned(self, request: str, response: str) -> bool:
Collaborator:

Seems like this functionality, and possibly the prior, are already in extant parts of garak.resources.red_team.evaluation?

@aishwaryap (Collaborator, Author):

Perhaps I'm missing something, but it looks like the prompts used for this kind of thing in garak.resources.red_team.evaluation are hardcoded to some degree, and I'm not sure they generalize all that well.
They also don't look like they're doing exactly the same thing as what I'm doing here, just something related.
The way I've set up the prompt templates for FITD as of now, they are loaded from the data files but can be overridden via config, and I feel that is preferable.

@jmartin-tech (Collaborator) left a comment:

Awesome progress, a few more interim comments. Sorry for the piecemeal review.


tier = garak.probes.Tier.INFORMATIONAL
active = False # Note: This probe is currently not active because of the potential for the number of attempts to grow exponentially and have to be terminated early unless the user sets generations = 1

DEFAULT_PARAMS = garak.probes.IterativeProbe.DEFAULT_PARAMS | {
Collaborator:

generations is inherited from the parent base.Probe class as distributed by _run_params; the initial view is that this is the desired behavior, since it ensures diversity of responses from the target and lets the probe better explore the target's possible response space.

Comment on lines +185 to +186
self.red_team_model = None
self.detector = None
Collaborator:

These are not required and also not detrimental at this time.

