
Support hybrid attention architectures in LayerWrapper #2367

Open
apsonawane wants to merge 1 commit into main from asonawane/qwen3.5

Conversation

@apsonawane
Contributor

Summary

Allow LayerWrapper to handle models with hybrid layer types (e.g., Qwen3.5) where some decoder layers use linear attention instead of standard self-attention.

Problem

Qwen3.5 is a hybrid VL model with 24 decoder layers — 18 use GatedDeltaNet linear attention (linear_attn sub-module) and 6 use standard full attention (self_attn). When Olive's SelectiveMixedPrecision or GPTQ passes wrap each layer with LayerWrapper, the constructor calls:

self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True
)

This raises ValueError for GatedDeltaNet layers since they don't have a self_attn attribute.

Fix
Pass fail_on_not_found=False to the attention sub-module lookup in LayerWrapper.__init__:

- self.attn, self.attn_name = get_submodules(
-     layer, self.ATTENTION, self.model_type, return_name=True
- )
+ # Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
+ # where some layers use linear attention instead of standard self-attention
+ self.attn, self.attn_name = get_submodules(
+     layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
+ )

When a layer doesn't have a standard attention module, self.attn is set to None and the calibration passes gracefully skip attention-specific quantization for that layer while still processing the MLP.
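To make the lookup behavior concrete, here is a minimal, self-contained sketch. The `get_submodules` stub and the layer classes below are simplified stand-ins for Olive's actual helpers (the real function also takes a submodule-name mapping and a `model_type`), so names and signatures here are illustrative only.

```python
# Simplified stand-ins -- illustrative only, not Olive's real API.

class SelfAttnLayer:
    """Standard decoder layer exposing a self_attn sub-module."""
    def __init__(self):
        self.self_attn = object()

class GatedDeltaNetLayer:
    """Hybrid decoder layer exposing linear_attn instead of self_attn."""
    def __init__(self):
        self.linear_attn = object()

def get_submodules(layer, name, return_name=False, fail_on_not_found=True):
    module = getattr(layer, name, None)
    if module is None and fail_on_not_found:
        raise ValueError(f"{name!r} not found on {type(layer).__name__}")
    found_name = name if module is not None else None
    return (module, found_name) if return_name else module

# A strict lookup raises on hybrid layers (the bug described above):
try:
    get_submodules(GatedDeltaNetLayer(), "self_attn", return_name=True)
except ValueError as exc:
    print(exc)  # prints the lookup error

# With fail_on_not_found=False, the lookup degrades to (None, None) and the
# caller can skip attention-specific work while still processing the MLP:
attn, attn_name = get_submodules(
    GatedDeltaNetLayer(), "self_attn", return_name=True, fail_on_not_found=False
)
print(attn is None)  # True
```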

Copilot AI review requested due to automatic review settings March 24, 2026 06:24

Copilot AI left a comment


Pull request overview

Enables olive.common.hf.wrapper.LayerWrapper to tolerate transformer decoder layers that don’t expose a standard attention submodule (e.g., hybrid architectures mixing self-attention with alternative attention blocks), so Olive passes can continue operating on the rest of the layer (notably the MLP).

Changes:

  • Make attention submodule lookup non-fatal in LayerWrapper.__init__ by using fail_on_not_found=False.
  • Guard attention-related accessors (get_attention_inputs / get_attention_outputs) to return empty results when attention is missing.
  • Skip QKV-unpacking logic for layers without attention.

Comment on lines +137 to +141
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)

Copilot AI Mar 24, 2026


LayerWrapper.__init__ now uses fail_on_not_found=False for the attention lookup unconditionally. This can silently treat unsupported / mis-mapped model types as having no attention (e.g., any model_type not covered by ATTENTION), leading to partial/incorrect quantization instead of an early, actionable error. Consider keeping the strict lookup by default and only falling back when an alternate attention module is present (e.g., try standard mapping first, then try a known fallback like linear_attn; if neither exists, re-raise). Also consider setting attn_name to None when attn is None to avoid carrying a misleading module name.

Suggested change
- # Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
- # where some layers use linear attention instead of standard self-attention
- self.attn, self.attn_name = get_submodules(
-     layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
- )
+ # First, try to find standard attention in a strict way so that unsupported/mis-mapped
+ # model types fail fast instead of being treated as having no attention.
+ try:
+     self.attn, self.attn_name = get_submodules(
+         layer,
+         self.ATTENTION,
+         self.model_type,
+         return_name=True,
+     )
+ except Exception as exc:
+     # For hybrid architectures (e.g., Qwen3.5) where some layers use linear attention
+     # instead of standard self-attention, fall back to a known alternate attention module.
+     linear_attn = getattr(layer, "linear_attn", None)
+     if linear_attn is not None:
+         self.attn = linear_attn
+         self.attn_name = "linear_attn"
+     else:
+         # If no known alternate attention exists, re-raise the original error so that
+         # unsupported model types are surfaced early.
+         raise exc
+ # If no attention module is present, ensure attn_name does not carry a misleading value.
+ if self.attn is None:
+     self.attn_name = None

Comment on lines 150 to 153
def get_attention_inputs(self, return_name: bool = True):
    if self.attn is None:
        return ([], []) if return_name else []
    attention_inputs, names = get_submodules(

Copilot AI Mar 24, 2026


get_attention_inputs/get_attention_outputs return empty lists when attn is None. Some downstream code treats the returned iterable as “the set of attention linear layers” and may still mutate adjacent modules even when it’s empty (e.g., RotateBase.fuse_ln_linear always overwrites the layernorm weights/bias after iterating, so an empty list effectively removes the layernorm without fusing it anywhere). To avoid silent model corruption, consider exposing an explicit has_attention/attn is None signal that callers must check (and update known callers like rotate), or otherwise ensure callers can’t accidentally proceed with an empty attention list as a valid case.
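One way to realize this suggestion is an explicit caller-side guard. The sketch below is a hypothetical, self-contained illustration: `LayerWrapperStub` and the toy `fuse_ln_linear` are stand-ins for Olive's actual `LayerWrapper` and `RotateBase.fuse_ln_linear`, reduced to plain dicts so the hazard is visible.

```python
# Hypothetical caller-side guard sketched from the review comment above.
# LayerWrapperStub and fuse_ln_linear are illustrative stand-ins, not
# Olive's actual API.

class LayerWrapperStub:
    def __init__(self, attn, attention_inputs):
        self.attn = attn
        self._attention_inputs = attention_inputs

    def get_attention_inputs(self):
        # Mirrors the PR behavior: empty list when the layer has no attention.
        return [] if self.attn is None else self._attention_inputs

def fuse_ln_linear(ln, linears):
    # Toy fusion: fold the norm scale into each linear, then reset the norm.
    # The reset is unconditional, so calling this with an empty list would
    # silently discard the norm scale -- the hazard the reviewer points out.
    for lin in linears:
        lin["w"] *= ln["w"]
    ln["w"] = 1.0

def rotate_layer(wrapper, ln):
    # Explicit has-attention check: skip fusion entirely for hybrid layers
    # instead of fusing into an empty attention list.
    if wrapper.attn is None:
        return
    fuse_ln_linear(ln, wrapper.get_attention_inputs())

ln = {"w": 2.0}
rotate_layer(LayerWrapperStub(attn=None, attention_inputs=[]), ln)
print(ln["w"])  # 2.0 -- norm preserved for the no-attention layer
```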

Comment on lines +137 to +141
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)

Copilot AI Mar 24, 2026


This change introduces a new behavior path where a layer can have attn is None and attention accessors return empty results. There’s existing test coverage for ModelWrapper/LayerWrapper (e.g., test/common/test_hf_wrapper.py), but it doesn’t cover the no-attention/hybrid case. Adding a unit test with a minimal dummy layer that lacks the standard attention attribute (and optionally has a linear_attn fallback) would help ensure downstream passes don’t regress and that the return shapes (([], []) vs []) stay consistent.
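A sketch of such a test, assuming a dummy layer that lacks `self_attn`. The `LayerWrapper` here is a simplified stand-in reproducing the behavior this PR describes; a real test in test/common/test_hf_wrapper.py would import the actual class from olive.common.hf.wrapper instead.

```python
# Sketch of the suggested unit test. LayerWrapper below is a simplified
# stand-in for the patched wrapper, not the real olive.common.hf.wrapper class.

class DummyHybridLayer:
    """Decoder layer with linear_attn and mlp but no self_attn attribute."""
    def __init__(self):
        self.linear_attn = object()
        self.mlp = object()

class LayerWrapper:
    def __init__(self, layer):
        # Non-fatal lookup, as introduced by this PR.
        self.attn = getattr(layer, "self_attn", None)
        self.attn_name = "self_attn" if self.attn is not None else None

    def get_attention_inputs(self, return_name=True):
        if self.attn is None:
            return ([], []) if return_name else []
        raise NotImplementedError  # real lookup elided in this sketch

def test_no_attention_hybrid_layer():
    wrapper = LayerWrapper(DummyHybridLayer())
    assert wrapper.attn is None
    assert wrapper.attn_name is None
    # Return shapes stay consistent: ([], []) with names, [] without.
    assert wrapper.get_attention_inputs(return_name=True) == ([], [])
    assert wrapper.get_attention_inputs(return_name=False) == []

test_no_attention_hybrid_layer()
print("ok")
```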

