During a recent customer incident, a U.2 drive was marked as degraded by ZFS with a large number of checksum errors.
```
# zpool status -v oxp_879a06eb-1775-4b7f-84b4-f59a8cc35e76
  pool: oxp_879a06eb-1775-4b7f-84b4-f59a8cc35e76
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Oct 21 16:36:54 2025
        644G scanned at 155M/s, 92.4G issued at 1.17G/s, 644G total
        0B repaired, 14.35% done, 0 days 00:07:51 to go
config:

        NAME                                       STATE     READ WRITE CKSUM
        oxp_879a06eb-1775-4b7f-84b4-f59a8cc35e76   DEGRADED     0     0  844K
          c11t0014EE8401D0A400d0s0                 DEGRADED     0     0 1.65M  too many errors

errors: Permanent errors have been detected in the following files:

        /pool/ext/879a06eb-1775-4b7f-84b4-f59a8cc35e76/crypt/zone/oxz_crucible_6ad0899b-24b4-4552-991f-aea5848a74c8/root/data/regions/468f9aed-8bb2-406b-8f13-a7e0f4dc07f0/00/000/07D
```
We found that the crucible-downstairs process had dumped core, with the following backtrace:
```
> $C
fffff5ffea9fdf70 libc.so.1`_lwp_kill+0xa()
fffff5ffea9fdfa0 libc.so.1`raise+0x22(6)
fffff5ffea9fdff0 libc.so.1`abort+0x58()
fffff5ffea9fe000 ~panic_abort::__rust_start_panic::abort::hc0bbcc6efcc1d5f3+8()
fffff5ffea9fe010 ~__rust_start_panic+8()
fffff5ffea9fe070 rust_panic+0xd()
fffff5ffea9fe130 std::panicking::rust_panic_with_hook::hca562214ce32c15f+0x22f()
fffff5ffea9fe170 std::panicking::begin_panic_handler::{{closure}}::he8e0be607dd83fc3+0x98()
fffff5ffea9fe180 ~std::sys::backtrace::__rust_end_short_backtrace::hbef0695d21ef05d6+8()
fffff5ffea9fe1b0 ~rust_begin_unwind+0x1b()
fffff5ffea9fe1e0 ~core::panicking::panic_fmt::h39e3a70bf5546bf3+0x1e()
fffff5ffea9fe260 ~core::result::unwrap_failed::h93284670db64e191+0x74()
fffff5ffea9fe4f0 <crucible_downstairs::extent_inner_raw::RawInner as crucible_downstairs::extent::ExtentInner>::write::h47a80151526d7d70+0x2966()
fffff5ffea9fe6c0 crucible_downstairs::extent::Extent::write::hfebe255e5a5cec51+0x28f()
fffff5ffea9fe8c0 crucible_downstairs::region::Region::region_write::h736c99de48a623b8+0x664()
fffff5ffea9fec90 crucible_downstairs::ActiveConnection::do_work::{{closure}}::h46d7c31f66a622e8+0x167a()
fffff5ffea9fedb0 crucible_downstairs::ActiveConnection::do_work_if_ready::{{closure}}::h12b619822ba7946b+0x166()
fffff5ffea9ffa10 crucible_downstairs::Downstairs::run::{{closure}}::h657b5690ef2c7297+0x2b34()
fffff5ffea9ffa80 tokio::runtime::task::core::Core<T,S>::poll::had3fbd8690aa096f+0x3b()
fffff5ffea9ffae0 tokio::runtime::task::harness::Harness<T,S>::poll::h607d3b8b6518056c+0x5d()
fffff5ffea9ffb30 tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h633e6ecefb0ff29a+0x12d()
fffff5ffea9ffbd0 tokio::runtime::scheduler::multi_thread::worker::Context::run::h2d2eefa5f7e62b96+0x5ab()
fffff5ffea9ffc30 tokio::runtime::context::scoped::Scoped<T>::set::h535233da306b2299+0x2a()
fffff5ffea9ffce0 tokio::runtime::context::runtime::enter_runtime::hecc6b8cd5a2aea62+0x19a()
fffff5ffea9ffd20 tokio::runtime::scheduler::multi_thread::worker::run::h317a7e14aba180ed+0x8d()
fffff5ffea9ffd80 tokio::runtime::task::core::Core<T,S>::poll::h6ef368a3478b448c+0x3f()
fffff5ffea9ffdd0 tokio::runtime::task::harness::Harness<T,S>::poll::ha00c054f03224c3d+0x59()
fffff5ffea9ffe90 tokio::runtime::blocking::pool::Inner::run::hec8a3eec3b7d6ffa+0xf4()
fffff5ffea9ffed0 std::sys::backtrace::__rust_begin_short_backtrace::h67a93445cce4c663+0x3e()
fffff5ffea9fff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::hde39da4992774261+0xa3()
fffff5ffea9fffb0 std::sys::pal::unix::thread::Thread::new::thread_start::h3db7703b7c3024e8+0x2b()
fffff5ffea9fffe0 libc.so.1`_thrp_setup+0x77(fffff5ffeef36a40)
fffff5ffea9ffff0 libc.so.1`_lwp_start()
```
This panic appears to have been triggered by an `unwrap()`, of which there is only one in `RawInner`'s implementation of `ExtentInner::write`. Note that this rack is running v14rc1.
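For illustration, the failure mode above can be sketched as follows. This is a hypothetical reduction, not the actual Crucible code: `write_block`, its path, and its offset are made up, but it shows the same pattern of calling `unwrap()` on the `Result` of a write to the underlying device, so that an I/O error from a failing disk becomes a panic (and, with `panic = "abort"`, a core dump) rather than a returned error.

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

// Hypothetical sketch of the pattern in question: the Result of each I/O
// call is unwrap()ed. On a healthy device these never fire; on a failing
// device write_all() can return Err(EIO), and unwrap() converts that into
// a panic, producing a backtrace like the one above
// (unwrap_failed -> panic_fmt -> rust_begin_unwind -> abort).
fn write_block(path: &str, offset: u64, data: &[u8]) {
    let mut f = OpenOptions::new().write(true).open(path).unwrap();
    f.seek(SeekFrom::Start(offset)).unwrap();
    f.write_all(data).unwrap();
}
```

Whether to panic here is a policy choice: unwrapping treats a device-level write failure as unrecoverable and stops the downstairs immediately, instead of propagating the error and risking acknowledging a write that never reached stable storage.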
Given that the disk was failing, it seems like Crucible is working as intended here, but I thought I'd open this issue to confirm.