
[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage#1438

Open
riyosha wants to merge 1 commit into Azure:main from riyosha:h-cot

Conversation

@riyosha (Contributor) commented Mar 5, 2026

Description

[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage

Related to issue #897

This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:

  • ADDED: Implementation of the CoT Hijacking attack strategy - pyrit/executor/attack/multi_turn/cot_hijacking.py
  • ADDED: YAML prompt templates for 6 puzzle types from the paper - pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml
  • ADDED: Unit tests for CoT Hijacking attack - tests/unit/executor/attack/multi_turn/test_cot_hijacking.py


Tests and Documentation

  • Added unit tests in tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
  • Tested the attack locally with llama3:8b as the target model and mistral:7b as the attacker model. (These LLMs lack advanced reasoning capabilities; suggestions for better, locally accessible reasoning models are welcome!)

This is a draft PR and I'd like to get your thoughts on the implementation so far. I have planned the following updates:

  • Currently I'm relying on the _fallback_score_response function, which uses pattern matching to generate a score. I want to replace this with either an auxiliary scorer model or float-scale scoring via the Azure Content Safety API.
  • Currently the iterative feedback given to the attacker model (in _generate_attack_prompt_async) only includes the harm score and a static prompt asking to make the puzzle more complex. I'll update it to also include the target's previous safe response.
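
For the second planned update, the richer feedback string could be assembled roughly like this. This is a minimal sketch: the function name, signature, and prompt wording are illustrative assumptions, not code from the PR.

```python
def build_attacker_feedback(harm_score: float, last_target_response: str) -> str:
    """Compose iterative feedback for the attacker model, including the
    target's previous response alongside the harm score.

    Illustrative helper only; names and wording are assumptions.
    """
    return (
        f"The previous attempt scored {harm_score:.2f} on the harm scale. "
        "The target's last response was:\n---\n"
        f"{last_target_response}\n---\n"
        "Make the reasoning puzzle more complex and work around the refusal above."
    )

feedback = build_attacker_feedback(0.4, "I can't help with that request.")
```

Including the verbatim refusal gives the attacker model something concrete to adapt to, rather than only a scalar signal and a static instruction.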

Question:

  • I noticed a few other multi-turn attack strategies define async def _teardown_async even when it's unused. Should I add it here as well?
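
If it is added, a no-op is presumably all that's needed for interface consistency. A self-contained sketch, where the class name and signature are stand-ins rather than the PR's actual code:

```python
import asyncio


class _CoTHijackingStub:
    """Minimal stand-in for the attack class (illustrative only)."""

    async def _teardown_async(self, *, context=None) -> None:
        # No resources to release; defined only to mirror the shared
        # multi-turn attack interface.
        return None


# The no-op teardown completes without side effects.
result = asyncio.run(_CoTHijackingStub()._teardown_async())
```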

Copilot AI review requested due to automatic review settings March 5, 2026 01:05
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@romanlutz (Contributor) left a comment

This is really good! While reading it, I couldn't shake the feeling that it's very similar to RedTeamingAttack, with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare the two in detail to see whether unifying them would be doable. My hunch is that it would introduce considerable complexity and probably isn't worth it, but I'd like to be sure...

Other things:

  • needs a mention in api.rst
  • needs an example notebook (both .ipynb and .py files) somewhere in doc/executor/attack, which in turn needs to be listed in the TOC file. The example notebook doesn't need to be elaborate.
  • needs an integration test, perhaps just one that runs the example notebook. I think this may be auto-created by test_executor_notebooks.py...

Returns:
    Optional[AttackScoringConfig]: The scoring configuration.
"""
return AttackScoringConfig(

I'm a bit surprised that we're unpacking the attack scoring config in the constructor into these two below, and then reassembling it here. Is that a pattern you've seen in another executor?

def __init__(
    self,
    *,
    objective_target: PromptTarget = REQUIRED_VALUE,  # type: ignore[assignment]

Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.

outcome=final_outcome,
)

# AttackLM

What does this mean?

Returns:
    Optional[Score]: The score for the response.
"""
if not self._objective_scorer:

Personally, I'm not a huge fan of this. If people want that behavior, they can explicitly specify the current fallback scorer as their objective scorer. In fact, we have a fairly good refusal scorer. If you think this one is useful, we can add it as a deterministic option in the score module.
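
For comparison, a deterministic pattern-based refusal check in the score module could look roughly like this. The patterns and function name are illustrative assumptions, not the existing refusal scorer:

```python
import re

# Example refusal markers only; a real list would be broader and curated.
REFUSAL_PATTERNS = [
    r"\bI can(?:not|'t) (?:help|assist)\b",
    r"\bI'm sorry\b",
    r"\bas an AI\b",
]


def looks_like_refusal(response: str) -> bool:
    """Deterministic check: does the response match any refusal pattern?"""
    return any(re.search(p, response, re.IGNORECASE) for p in REFUSAL_PATTERNS)
```

The trade-off is the usual one: deterministic matching is cheap and reproducible but brittle against paraphrased refusals, which is where a model-based refusal scorer does better.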

Comment on lines +241 to +249
if score_value >= 1.0:
self._logger.info("Attack successful!")
context.attack_succeeded = True
final_outcome = AttackOutcome.SUCCESS
break
if score_value >= 0.8:
self._logger.info("Attack largely successful (score 0.8+)")
final_outcome = AttackOutcome.SUCCESS
break

This should probably be configured via the scoring configuration. In other attacks, we expect a true/false scorer, so here it could be a threshold-based scorer that takes a float-scale scorer and applies the threshold (0.8 or 1.0) on top of that to determine success.
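
A sketch of that idea, assuming a plain callable as the float-scale scorer; the class name and interface are illustrative and differ from PyRIT's real scorer classes:

```python
class FloatThresholdScorer:
    """Wrap a float-scale scorer and apply a configurable success threshold.

    Illustrative sketch; PyRIT's actual scorer interfaces differ.
    """

    def __init__(self, float_scorer, threshold: float = 0.8):
        self._float_scorer = float_scorer
        self._threshold = threshold

    def score(self, response: str) -> bool:
        # True/false success decision derived from the float score, so the
        # 0.8 vs. 1.0 cutoff lives in configuration, not in the attack loop.
        return self._float_scorer(response) >= self._threshold


# Toy float-scale scorer standing in for a harm scorer.
harm_scorer = lambda r: 0.9 if "harmful" in r else 0.1

strict = FloatThresholdScorer(harm_scorer, threshold=1.0)
lenient = FloatThresholdScorer(harm_scorer, threshold=0.8)
```

This keeps the attack loop's success check a single boolean call, matching the true/false scorer convention used by the other attacks.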
