Support hybrid attention architectures in LayerWrapper #2367

apsonawane wants to merge 1 commit into main
Conversation
Pull request overview
Enables olive.common.hf.wrapper.LayerWrapper to tolerate transformer decoder layers that don’t expose a standard attention submodule (e.g., hybrid architectures mixing self-attention with alternative attention blocks), so Olive passes can continue operating on the rest of the layer (notably the MLP).
Changes:
- Make attention submodule lookup non-fatal in `LayerWrapper.__init__` by using `fail_on_not_found=False`.
- Guard attention-related accessors (`get_attention_inputs`/`get_attention_outputs`) to return empty results when attention is missing.
- Skip QKV-unpacking logic for layers without attention.
```python
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)
```
`LayerWrapper.__init__` now uses `fail_on_not_found=False` for the attention lookup unconditionally. This can silently treat unsupported/mis-mapped model types as having no attention (e.g., any `model_type` not covered by `ATTENTION`), leading to partial/incorrect quantization instead of an early, actionable error. Consider keeping the strict lookup by default and only falling back when an alternate attention module is present (e.g., try the standard mapping first, then try a known fallback like `linear_attn`; if neither exists, re-raise). Also consider setting `attn_name` to `None` when `attn` is `None` to avoid carrying a misleading module name.
Suggested change, replacing:

```python
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)
```

with:

```python
# First, try to find standard attention in a strict way so that unsupported/mis-mapped
# model types fail fast instead of being treated as having no attention.
try:
    self.attn, self.attn_name = get_submodules(
        layer,
        self.ATTENTION,
        self.model_type,
        return_name=True,
    )
except Exception as exc:
    # For hybrid architectures (e.g., Qwen3.5) where some layers use linear attention
    # instead of standard self-attention, fall back to a known alternate attention module.
    linear_attn = getattr(layer, "linear_attn", None)
    if linear_attn is not None:
        self.attn = linear_attn
        self.attn_name = "linear_attn"
    else:
        # If no known alternate attention exists, re-raise the original error so that
        # unsupported model types are surfaced early.
        raise exc
# If no attention module is present, ensure attn_name does not carry a misleading value.
if self.attn is None:
    self.attn_name = None
```
```python
def get_attention_inputs(self, return_name: bool = True):
    if self.attn is None:
        return ([], []) if return_name else []
    attention_inputs, names = get_submodules(
```
`get_attention_inputs`/`get_attention_outputs` return empty lists when `attn` is `None`. Some downstream code treats the returned iterable as "the set of attention linear layers" and may still mutate adjacent modules even when it's empty (e.g., `RotateBase.fuse_ln_linear` always overwrites the layernorm weights/bias after iterating, so an empty list effectively removes the layernorm without fusing it anywhere). To avoid silent model corruption, consider exposing an explicit `has_attention`/`attn is None` signal that callers must check (and update known callers like rotate), or otherwise ensure callers can't accidentally proceed with an empty attention list as a valid case.
```python
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)
```
This change introduces a new behavior path where a layer can have `attn is None` and attention accessors return empty results. There's existing test coverage for `ModelWrapper`/`LayerWrapper` (e.g., `test/common/test_hf_wrapper.py`), but it doesn't cover the no-attention/hybrid case. Adding a unit test with a minimal dummy layer that lacks the standard attention attribute (and optionally has a `linear_attn` fallback) would help ensure downstream passes don't regress and that the return shapes (`([], [])` vs `[]`) stay consistent.
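Such a test could be sketched roughly as below. The dummy layer and the commented assertions are illustrative, not the real fixtures from `test/common/test_hf_wrapper.py`, and `LayerWrapper`'s exact constructor arguments are assumed:

```python
# Hypothetical unit-test sketch for the no-attention path; names are
# illustrative, not the real test/common/test_hf_wrapper.py fixtures.
class DummyHybridLayer:
    """Mimics a decoder layer with an alternate attention block and an MLP,
    but without the standard self_attn attribute."""

    def __init__(self):
        self.linear_attn = object()  # alternate attention block
        self.mlp = object()


def test_layer_without_standard_attention():
    layer = DummyHybridLayer()
    # The standard attribute is absent; before this PR the strict lookup in
    # LayerWrapper.__init__ raised for such a layer.
    assert not hasattr(layer, "self_attn")
    assert hasattr(layer, "linear_attn")
    # With the PR applied, wrapping should set attn to None and keep the
    # empty shapes consistent, e.g. (constructor args assumed):
    #   wrapper = LayerWrapper(layer, ...)
    #   assert wrapper.attn is None
    #   assert wrapper.get_attention_inputs() == ([], [])
    #   assert wrapper.get_attention_inputs(return_name=False) == []


test_layer_without_standard_attention()
```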
Summary

Allow `LayerWrapper` to handle models with hybrid layer types (e.g., Qwen3.5) where some decoder layers use linear attention instead of standard self-attention.

Problem

Qwen3.5 is a hybrid VL model with 24 decoder layers: 18 use GatedDeltaNet linear attention (a `linear_attn` sub-module) and 6 use standard full attention (`self_attn`). When Olive's `SelectiveMixedPrecision` or `GPTQ` passes wrap each layer with `LayerWrapper`, the constructor performs a strict attention sub-module lookup via `get_submodules`, which raises `ValueError` for GatedDeltaNet layers since they don't have a `self_attn` attribute.

Fix

Pass `fail_on_not_found=False` to the attention sub-module lookup in `LayerWrapper.__init__`. When a layer doesn't have a standard attention module, `self.attn` is set to `None` and the calibration passes gracefully skip attention-specific quantization for that layer while still processing the MLP.