panic in crucible_downstairs::extent_inner_raw::RawInner::write after disk failure #1801

@wfchandler

Description

During a recent customer incident, a U.2 drive was marked as degraded by ZFS with a large number of checksum errors.

# zpool status -v oxp_879a06eb-1775-4b7f-84b4-f59a8cc35e76 
  pool: oxp_879a06eb-1775-4b7f-84b4-f59a8cc35e76
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Oct 21 16:36:54 2025
        644G scanned at 155M/s, 92.4G issued at 1.17G/s, 644G total
        0B repaired, 14.35% done, 0 days 00:07:51 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        oxp_879a06eb-1775-4b7f-84b4-f59a8cc35e76  DEGRADED     0     0  844K
          c11t0014EE8401D0A400d0s0                DEGRADED     0     0 1.65M  too many errors

errors: Permanent errors have been detected in the following files:

        /pool/ext/879a06eb-1775-4b7f-84b4-f59a8cc35e76/crypt/zone/oxz_crucible_6ad0899b-24b4-4552-991f-aea5848a74c8/root/data/regions/468f9aed-8bb2-406b-8f13-a7e0f4dc07f0/00/000/07D

We found that the crucible-downstairs process had dumped core, with the following backtrace:

> $C
fffff5ffea9fdf70 libc.so.1`_lwp_kill+0xa()
fffff5ffea9fdfa0 libc.so.1`raise+0x22(6)
fffff5ffea9fdff0 libc.so.1`abort+0x58()
fffff5ffea9fe000 ~panic_abort::__rust_start_panic::abort::hc0bbcc6efcc1d5f3+8()
fffff5ffea9fe010 ~__rust_start_panic+8()
fffff5ffea9fe070 rust_panic+0xd()
fffff5ffea9fe130 std::panicking::rust_panic_with_hook::hca562214ce32c15f+0x22f()
fffff5ffea9fe170 std::panicking::begin_panic_handler::{{closure}}::he8e0be607dd83fc3+0x98()
fffff5ffea9fe180 ~std::sys::backtrace::__rust_end_short_backtrace::hbef0695d21ef05d6+8()
fffff5ffea9fe1b0 ~rust_begin_unwind+0x1b()
fffff5ffea9fe1e0 ~core::panicking::panic_fmt::h39e3a70bf5546bf3+0x1e()
fffff5ffea9fe260 ~core::result::unwrap_failed::h93284670db64e191+0x74()
fffff5ffea9fe4f0 <crucible_downstairs::extent_inner_raw::RawInner as crucible_downstairs::extent::ExtentInner>::write::h47a80151526d7d70+0x2966()
fffff5ffea9fe6c0 crucible_downstairs::extent::Extent::write::hfebe255e5a5cec51+0x28f()
fffff5ffea9fe8c0 crucible_downstairs::region::Region::region_write::h736c99de48a623b8+0x664()
fffff5ffea9fec90 crucible_downstairs::ActiveConnection::do_work::{{closure}}::h46d7c31f66a622e8+0x167a()
fffff5ffea9fedb0 crucible_downstairs::ActiveConnection::do_work_if_ready::{{closure}}::h12b619822ba7946b+0x166()
fffff5ffea9ffa10 crucible_downstairs::Downstairs::run::_$u7b$$u7b$closure$u7d$$u7d$::h657b5690ef2c7297 +0x2b34()
fffff5ffea9ffa80 tokio::runtime::task::core::Core<T,S>::poll::had3fbd8690aa096f+0x3b()
fffff5ffea9ffae0 tokio::runtime::task::harness::Harness<T,S>::poll::h607d3b8b6518056c+0x5d()
fffff5ffea9ffb30 tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h633e6ecefb0ff29a+0x12d()
fffff5ffea9ffbd0 tokio::runtime::scheduler::multi_thread::worker::Context::run::h2d2eefa5f7e62b96+0x5ab()
fffff5ffea9ffc30 tokio::runtime::context::scoped::Scoped<T>::set::h535233da306b2299+0x2a()
fffff5ffea9ffce0 tokio::runtime::context::runtime::enter_runtime::hecc6b8cd5a2aea62+0x19a()
fffff5ffea9ffd20 tokio::runtime::scheduler::multi_thread::worker::run::h317a7e14aba180ed+0x8d()
fffff5ffea9ffd80 tokio::runtime::task::core::Core<T,S>::poll::h6ef368a3478b448c+0x3f()
fffff5ffea9ffdd0 tokio::runtime::task::harness::Harness<T,S>::poll::ha00c054f03224c3d+0x59()
fffff5ffea9ffe90 tokio::runtime::blocking::pool::Inner::run::hec8a3eec3b7d6ffa+0xf4()
fffff5ffea9ffed0 std::sys::backtrace::__rust_begin_short_backtrace::h67a93445cce4c663+0x3e()
fffff5ffea9fff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::hde39da4992774261+0xa3()
fffff5ffea9fffb0 std::sys::pal::unix::thread::Thread::new::thread_start::h3db7703b7c3024e8+0x2b()
fffff5ffea9fffe0 libc.so.1`_thrp_setup+0x77(fffff5ffeef36a40)
fffff5ffea9ffff0 libc.so.1`_lwp_start()

This panic appears to have been triggered by an `unwrap`, of which there is only one in `RawInner`'s implementation of `ExtentInner::write`. Note that this rack is running v14rc1.
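For illustration, a minimal sketch of why a single `unwrap` takes down the whole process on a failing disk. The `pwrite_all` helper below is hypothetical (a stand-in for the write syscall path inside `RawInner::write`, not Crucible's actual code): when the degraded drive returns EIO, an `unwrap` on that `Result` panics and aborts the downstairs, whereas `?` would hand the error back to the caller.

```rust
use std::io::{Error, ErrorKind};

// Hypothetical stand-in for the pwrite-style call inside RawInner::write.
// On a degraded drive the kernel can return EIO, which surfaces as Err here.
fn pwrite_all(fail: bool) -> Result<usize, Error> {
    if fail {
        Err(Error::from(ErrorKind::Other)) // e.g. EIO from the failing U.2
    } else {
        Ok(4096)
    }
}

fn write_unwrap(fail: bool) -> usize {
    // An unwrap here panics (and, with panic=abort, dumps core) on I/O error...
    pwrite_all(fail).unwrap()
}

fn write_propagate(fail: bool) -> Result<usize, Error> {
    // ...whereas `?` propagates the error so the caller could fault the extent.
    Ok(pwrite_all(fail)?)
}

fn main() {
    assert_eq!(write_unwrap(false), 4096);
    assert!(write_propagate(true).is_err());
    // write_unwrap(true) would panic: `called `Result::unwrap()` on an `Err` value`
}
```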

Given that the disk was failing, it seems like Crucible is working as intended here, but I thought I'd open this to confirm.
