Support hybrid attention architectures in LayerWrapper #2367

apsonawane wants to merge 1 commit into main
Conversation
Pull request overview
Enables olive.common.hf.wrapper.LayerWrapper to tolerate transformer decoder layers that don’t expose a standard attention submodule (e.g., hybrid architectures mixing self-attention with alternative attention blocks), so Olive passes can continue operating on the rest of the layer (notably the MLP).
Changes:
- Make attention submodule lookup non-fatal in `LayerWrapper.__init__` by using `fail_on_not_found=False`.
- Guard attention-related accessors (`get_attention_inputs`/`get_attention_outputs`) to return empty results when attention is missing.
- Skip QKV-unpacking logic for layers without attention.
```python
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)
```
`LayerWrapper.__init__` now uses `fail_on_not_found=False` for the attention lookup unconditionally. This can silently treat unsupported/mis-mapped model types as having no attention (e.g., any `model_type` not covered by `ATTENTION`), leading to partial/incorrect quantization instead of an early, actionable error. Consider keeping the strict lookup by default and only falling back when an alternate attention module is present (e.g., try the standard mapping first, then try a known fallback like `linear_attn`; if neither exists, re-raise). Also consider setting `attn_name` to `None` when `attn` is `None` to avoid carrying a misleading module name.
Suggested change, replacing:

```python
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)
```

with:

```python
# First, try to find standard attention in a strict way so that unsupported/mis-mapped
# model types fail fast instead of being treated as having no attention.
try:
    self.attn, self.attn_name = get_submodules(
        layer,
        self.ATTENTION,
        self.model_type,
        return_name=True,
    )
except Exception as exc:
    # For hybrid architectures (e.g., Qwen3.5) where some layers use linear attention
    # instead of standard self-attention, fall back to a known alternate attention module.
    linear_attn = getattr(layer, "linear_attn", None)
    if linear_attn is not None:
        self.attn = linear_attn
        self.attn_name = "linear_attn"
    else:
        # If no known alternate attention exists, re-raise the original error so that
        # unsupported model types are surfaced early.
        raise exc
# If no attention module is present, ensure attn_name does not carry a misleading value.
if self.attn is None:
    self.attn_name = None
```
```python
def get_attention_inputs(self, return_name: bool = True):
    if self.attn is None:
        return ([], []) if return_name else []
    attention_inputs, names = get_submodules(
```
`get_attention_inputs`/`get_attention_outputs` return empty lists when `attn` is `None`. Some downstream code treats the returned iterable as "the set of attention linear layers" and may still mutate adjacent modules even when it's empty (e.g., `RotateBase.fuse_ln_linear` always overwrites the layernorm weights/bias after iterating, so an empty list effectively removes the layernorm without fusing it anywhere). To avoid silent model corruption, consider exposing an explicit `has_attention`/`attn is None` signal that callers must check (and update known callers like rotate), or otherwise ensure callers can't accidentally proceed with an empty attention list as a valid case.
```python
# Use fail_on_not_found=False to support hybrid architectures (e.g., Qwen3.5)
# where some layers use linear attention instead of standard self-attention
self.attn, self.attn_name = get_submodules(
    layer, self.ATTENTION, self.model_type, return_name=True, fail_on_not_found=False
)
```
This change introduces a new behavior path where a layer can have `attn is None` and attention accessors return empty results. There's existing test coverage for `ModelWrapper`/`LayerWrapper` (e.g., `test/common/test_hf_wrapper.py`), but it doesn't cover the no-attention/hybrid case. Adding a unit test with a minimal dummy layer that lacks the standard attention attribute (and optionally has a `linear_attn` fallback) would help ensure downstream passes don't regress and that the return shapes (`([], [])` vs `[]`) stay consistent.
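Such a test could be sketched roughly as below. The dummy layer and the commented assertions are illustrative, not the real fixtures from `test/common/test_hf_wrapper.py`, and `LayerWrapper`'s exact constructor arguments are assumed:

```python
# Hypothetical unit-test sketch for the no-attention path; names are
# illustrative, not the real test/common/test_hf_wrapper.py fixtures.
class DummyHybridLayer:
    """Mimics a decoder layer with an alternate attention block and an MLP,
    but without the standard self_attn attribute."""

    def __init__(self):
        self.linear_attn = object()  # alternate attention block
        self.mlp = object()


def test_layer_without_standard_attention():
    layer = DummyHybridLayer()
    # The standard attribute is absent; before this PR the strict lookup in
    # LayerWrapper.__init__ raised for such a layer.
    assert not hasattr(layer, "self_attn")
    assert hasattr(layer, "linear_attn")
    # With the PR applied, wrapping should set attn to None and keep the
    # empty shapes consistent, e.g. (constructor args assumed):
    #   wrapper = LayerWrapper(layer, ...)
    #   assert wrapper.attn is None
    #   assert wrapper.get_attention_inputs() == ([], [])
    #   assert wrapper.get_attention_inputs(return_name=False) == []


test_layer_without_standard_attention()
```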
Summary

Allow `LayerWrapper` to handle models with hybrid layer types (e.g., Qwen3.5) where some decoder layers use linear attention instead of standard self-attention.

Problem

Qwen3.5 is a hybrid VL model with 24 decoder layers: 18 use GatedDeltaNet linear attention (a `linear_attn` sub-module) and 6 use standard full attention (`self_attn`). When Olive's `SelectiveMixedPrecision` or `GPTQ` passes wrap each layer with `LayerWrapper`, the constructor performs a strict attention sub-module lookup via `get_submodules`, which raises `ValueError` for GatedDeltaNet layers since they don't have a `self_attn` attribute.

Fix

Pass `fail_on_not_found=False` to the attention sub-module lookup in `LayerWrapper.__init__`. When a layer doesn't have a standard attention module, `self.attn` is set to `None` and the calibration passes gracefully skip attention-specific quantization for that layer while still processing the MLP.