public inbox for linux-bcachefs@vger.kernel.org
* Kernel hang with excessive resize-journal on replicated fs
@ 2021-10-24 13:26 Chris Webb
  2021-10-24 16:54 ` Kent Overstreet
  0 siblings, 1 reply; 2+ messages in thread
From: Chris Webb @ 2021-10-24 13:26 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

The following ktest hangs in the kernel instead of producing an error from the
excessive journal resize:

  test_oversize_journal()
  {
      bcachefs format --errors=panic --replicas=2 /dev/sd[bc]
      mount -t bcachefs /dev/sdb:/dev/sdc /mnt

      # should fail, but actually hangs:
      ! bcachefs device resize-journal /dev/sdb $(blockdev --getsize64 /dev/sdb)

      umount /mnt
  }

(It would correctly return with an error on a single-device filesystem.)

The kernel is continuously looping on ret == -EAGAIN in
bch2_set_nr_journal_buckets. The disk reservation (unexpectedly) succeeds but
__bch2_set_nr_journal_buckets then keeps returning -EAGAIN because
bch2_bucket_alloc (correctly) fails.

Is the problem here that the disk reservation isn't confined to the specific
device we then want to allocate from? So although the space is available, it's
not available solely on the device where we want it?
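To make the livelock concrete, here is a hypothetical, heavily simplified paraphrase of the retry loop as I read it - the filesystem-wide reservation succeeds (replicated space exists across both devices), the per-device bucket allocation keeps failing, and the loop spins on -EAGAIN forever. All names and stubs are illustrative, not the real bcachefs code; the iteration cap exists only to make the spin observable:

```c
#include <assert.h>
#include <errno.h>

static int reserve_disk_space(void)
{
    return 0;              /* succeeds: space is free filesystem-wide */
}

static int alloc_bucket_on_device(void)
{
    return -EAGAIN;        /* fails: no room left on this one device */
}

/* Returns the number of iterations taken, bounded by max_iters so the
 * livelock is observable; the real loop has no such bound. */
static int demo_resize_journal(int max_iters)
{
    int iters = 0;

    while (iters < max_iters) {
        iters++;
        if (reserve_disk_space())
            break;         /* reservation failed: would return an error */
        if (alloc_bucket_on_device() != -EAGAIN)
            break;         /* allocation succeeded: never happens here */
    }
    return iters;          /* hits the cap: the loop never exits */
}
```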

Best wishes,

Chris.


* Re: Kernel hang with excessive resize-journal on replicated fs
  2021-10-24 13:26 Kernel hang with excessive resize-journal on replicated fs Chris Webb
@ 2021-10-24 16:54 ` Kent Overstreet
  0 siblings, 0 replies; 2+ messages in thread
From: Kent Overstreet @ 2021-10-24 16:54 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-bcachefs

On Sun, Oct 24, 2021 at 02:26:39PM +0100, Chris Webb wrote:
> The following ktest hangs in the kernel instead of producing an error from the
> excessive journal resize:
> 
>   test_oversize_journal()
>   {
>       bcachefs format --errors=panic --replicas=2 /dev/sd[bc]
>       mount -t bcachefs /dev/sdb:/dev/sdc /mnt
> 
>       # should fail, but actually hangs:
>       ! bcachefs device resize-journal /dev/sdb $(blockdev --getsize64 /dev/sdb)
> 
>       umount /mnt
>   }
> 
> (It would correctly return with an error on a single-device filesystem.)

Cool find, and nice test :)

> The kernel is continuously looping on ret == -EAGAIN in
> bch2_set_nr_journal_buckets. The disk reservation (unexpectedly) succeeds but
> __bch2_set_nr_journal_buckets then keeps returning -EAGAIN because
> bch2_bucket_alloc (correctly) fails.
> 
> Is the problem here that the disk reservation isn't confined to the specific
> device we then want to allocate from? So although the space is available, it's
> not available solely on the device where we want it?

Yes. This isn't uniquely a problem with journal allocations - I think it's still
possible to hit a hang like this by doing replicated writes on a two-device
filesystem with devices with mismatched sizes.

I haven't figured out what the right solution is yet. One (easier) thing that
needs to happen is that disk space allocations need to be changed to use
interruptible, or at least killable, sleeps. The closure code doesn't have good
support for that, though.

Relevant idea I had a while back - I meant to apply it to the standard kernel
wait_queue_head_t, and it should be just as easy to apply to the closure
waitlists we're using here: wait lists should embed a sequence number that is
incremented on wakeup. Then, instead of adding ourselves to the waitlist and
checking the wait condition, we read the sequence number and check the wait
condition; then we sleep until the sequence number differs from the one we
originally read.
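A hedged userspace sketch of that scheme (pthreads standing in for kernel wait queues / closure waitlists; all names illustrative): the waiter snapshots the sequence, checks its condition in a normal task state, and sleeps only until the sequence differs from the snapshot, so a wakeup between the check and the sleep is never lost:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

struct seq_wait {
    atomic_uint     seq;   /* bumped on every wakeup */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

static void seq_wake_all(struct seq_wait *w)
{
    pthread_mutex_lock(&w->lock);
    atomic_fetch_add(&w->seq, 1);       /* increment on wakeup */
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Sleep until the sequence number is no longer 'snapshot'. */
static void seq_sleep(struct seq_wait *w, unsigned snapshot)
{
    pthread_mutex_lock(&w->lock);
    while (atomic_load(&w->seq) == snapshot)
        pthread_cond_wait(&w->cond, &w->lock);
    pthread_mutex_unlock(&w->lock);
}

static struct seq_wait wq = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .cond = PTHREAD_COND_INITIALIZER,
};
static atomic_int flag;

static void *waker(void *arg)
{
    (void)arg;
    usleep(10000);
    atomic_store(&flag, 1);             /* make the condition true... */
    seq_wake_all(&wq);                  /* ...then bump seq and wake */
    return NULL;
}

/* Wait for 'flag' via snapshot-check-sleep; returns its final value. */
static int run_demo(void)
{
    pthread_t t;

    pthread_create(&t, NULL, waker, NULL);
    for (;;) {
        unsigned snap = atomic_load(&wq.seq);
        if (atomic_load(&flag))         /* checked in a normal task state */
            break;
        seq_sleep(&wq, snap);           /* if seq already moved: no sleep */
    }
    pthread_join(t, NULL);
    return atomic_load(&flag);
}
```

Note the condition is checked with no special task state, so it's free to take mutexes - the ordering guarantee comes from the snapshot, not from prepare_to_wait()-style state changes.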

This would be useful for standard kernel wait lists because right now,
prepare_to_wait() changes the task state, and that needs to happen before
checking the wait condition so that wakeups will set it back to TASK_RUNNING
without racing. But changing the task state greatly restricts the code that can
be run inside the wait condition (no mutex_lock()!).

This would be useful for closure waitlists for a different reason - putting a
closure on a closure waitlist doesn't change the task state, but we can't take a
closure off of a waitlist without taking _everything_ off the waitlist (doing a
wake_up_all()) because closure waitlists are singly linked (and lockless) - and
removing just our own closure is exactly what we'd need to do when the operation
is interrupted or the task is killed.
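An illustrative sketch (not the real closure code) of why a lockless, singly linked waitlist only supports "take everything off": waiters push themselves with a compare-and-swap on the head, and the only race-free removal is atomically swapping out the whole list, which is what wake_up_all() effectively does. Unlinking one arbitrary entry would mean rewriting its predecessor's next pointer while concurrent pushers and wakers may be traversing it:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct waiter {
    struct waiter *next;
    int            id;
};

struct waitlist {
    _Atomic(struct waiter *) head;
};

/* Lockless push: CAS the new entry onto the head. */
static void waitlist_push(struct waitlist *wl, struct waiter *w)
{
    struct waiter *old = atomic_load(&wl->head);

    do {
        w->next = old;
    } while (!atomic_compare_exchange_weak(&wl->head, &old, w));
}

/* Atomically detach the entire list - the only safe "removal". */
static struct waiter *waitlist_del_all(struct waitlist *wl)
{
    return atomic_exchange(&wl->head, NULL);
}

/* Push three waiters, then count what del_all hands back. */
static int run_demo(void)
{
    static struct waiter a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
    struct waitlist wl = { .head = NULL };
    int n = 0;

    waitlist_push(&wl, &a);
    waitlist_push(&wl, &b);
    waitlist_push(&wl, &c);

    for (struct waiter *w = waitlist_del_all(&wl); w; w = w->next)
        n++;                /* all three come off together */
    return n;
}
```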

