Description
We no longer have a time-based dismissal of a downstairs. We now dismiss a downstairs only once enough IO has accrued that we have decided not to hold any more in memory, at which point replay is no longer an option.
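As a minimal sketch of that policy (not the actual Crucible code), the model below faults an offline downstairs purely on the amount of buffered IO, with no timer involved. The names `Downstairs`, `DsState`, `on_new_io`, and the threshold `MAX_BUFFERED_JOBS` are all hypothetical, chosen only for illustration; the real limit is not stated in this issue.

```rust
/// Hypothetical limit on IOs held in memory for replay (illustrative value).
const MAX_BUFFERED_JOBS: usize = 4;

#[derive(Clone, Copy, Debug, PartialEq)]
enum DsState {
    Active,
    Offline,
    Faulted,
}

struct Downstairs {
    state: DsState,
    /// IOs buffered for replay while this downstairs is offline.
    buffered: usize,
}

impl Downstairs {
    fn new() -> Self {
        Downstairs { state: DsState::Active, buffered: 0 }
    }

    /// The connection dropped; start buffering IO for a later replay.
    fn go_offline(&mut self) {
        if self.state == DsState::Active {
            self.state = DsState::Offline;
        }
    }

    /// Account for one new IO. There is no timeout: only the size of the
    /// backlog can move an offline downstairs to Faulted.
    fn on_new_io(&mut self) {
        if self.state != DsState::Offline {
            return;
        }
        self.buffered += 1;
        if self.buffered > MAX_BUFFERED_JOBS {
            // Too much to hold in memory: drop the backlog, give up on replay.
            self.buffered = 0;
            self.state = DsState::Faulted;
        }
    }
}
```

In this model an offline downstairs that sees no new IO stays `Offline` indefinitely, which is the property that matters for the hang described in this issue.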
This process runs into trouble if we are in the middle of a LiveRepair. LiveRepair IOs require all three downstairs to process the IO at the same time. If a downstairs is present when the LiveRepair starts but then goes missing, we wait forever for it to return so that we can complete the repair IO.
This is tangentially related to #1194: we should not even try to replay IOs to a downstairs if a LiveRepair has happened.
It's easy enough to reproduce this hang.
1. Start IO to all downstairs.
2. Take downstairs 0 offline and wait for enough IO that it goes to faulted.
3. Bring downstairs 0 back and let LiveRepair start.
4. Once LiveRepair has started, stop downstairs 1 or 2.
IO now stops, because we wait indefinitely for the repair IO to complete on the offline downstairs.
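The end state of the steps above can be sketched with a toy model. This is an illustration under stated assumptions, not Crucible's implementation: `RepairIo`, `deliver`, and `is_complete` are hypothetical names, and the only rule encoded is the one from this issue, that a repair IO completes only when all three downstairs have processed it.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum DsState {
    Active,
    Offline,
    LiveRepair,
}

/// One repair IO (hypothetical model); it needs an ack from every downstairs.
struct RepairIo {
    acked: [bool; 3],
}

impl RepairIo {
    fn new() -> Self {
        RepairIo { acked: [false; 3] }
    }

    /// Deliver the IO to every downstairs that can currently process it.
    /// An offline downstairs never acks, and nothing here times it out.
    fn deliver(&mut self, ds: &[DsState; 3]) {
        for (ack, state) in self.acked.iter_mut().zip(ds.iter()) {
            if matches!(state, DsState::Active | DsState::LiveRepair) {
                *ack = true;
            }
        }
    }

    /// The repair IO is done only when all three downstairs have acked.
    fn is_complete(&self) -> bool {
        self.acked.iter().all(|acked| *acked)
    }
}
```

With downstairs 0 in `LiveRepair` and downstairs 1 `Offline`, `deliver` leaves the ack for downstairs 1 unset and `is_complete` stays false. Because the offline downstairs is not dismissed until enough new IO accrues, and IO is blocked behind the repair, nothing in the model ever changes that state.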