* [BUG] bcachefs fio lockup via generic/703
@ 2024-02-28 19:47 Brian Foster
2024-02-28 20:03 ` Kent Overstreet
0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2024-02-28 19:47 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
Hi Kent,
Firstly, I confirmed that today's master seems to avoid the splat I sent
previously (re: your comment about a reverse journal replay patch or
some such).
I still reproduce the stall issue on this system. After peeling away at
it, I was eventually able to reproduce without the drop writes
(dm-flakey) behavior from the test, and with fio using either the libaio
or sync I/O engine options. The sync I/O mode fortunately provides a
more useful stack trace:
# cat /proc/177747/stack
[<0>] bch2_dio_write_flush+0x122/0x160 [bcachefs]
[<0>] bch2_direct_write+0xb53/0xce0 [bcachefs]
[<0>] bch2_write_iter+0x142/0xc70 [bcachefs]
[<0>] vfs_write+0x29b/0x470
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x86/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
... which resolves down to the closure_sync() call in
bch2_dio_write_flush(). The problem seems to go away if I remove the
preceding journal flush from that function. This seems to rule out
io_uring/aio and instead suggest that we're getting stuck somehow
waiting on a journal flush.
Based on that I went back to the first commit before 746a33c96b7a0
("bcachefs: better journal pipelining"). With that, I can run hundreds
of iterations of generic/703 without a problem, so this appears to be a
regression associated with the journal pipeline improvements. I'm
currently re-running on the last known good commit with my test tweaks
backed out (i.e. so back to io_uring and drop writes) just to
corroborate that it's the same problem, but so far it's running as
expected...
Brian
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-28 19:47 [BUG] bcachefs fio lockup via generic/703 Brian Foster
@ 2024-02-28 20:03 ` Kent Overstreet
  2024-02-28 20:13   ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2024-02-28 20:03 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs

On Wed, Feb 28, 2024 at 02:47:26PM -0500, Brian Foster wrote:
> Hi Kent,
[...]
> ... which resolves down to the closure_sync() call in
> bch2_dio_write_flush(). The problem seems to go away if I remove the
> preceding journal flush from that function. This seems to rule out
> io_uring/aio and instead suggest that we're getting stuck somehow
> waiting on a journal flush.
[...]

So I suppose the journal must be getting stuck, and a journal write
isn't completing - what does sysfs internal/journal_debug say when it
happens?
* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-28 20:03 ` Kent Overstreet
@ 2024-02-28 20:13   ` Brian Foster
  2024-02-28 23:43     ` Kent Overstreet
  2024-02-29  0:02     ` Kent Overstreet
  0 siblings, 2 replies; 8+ messages in thread
From: Brian Foster @ 2024-02-28 20:13 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, Feb 28, 2024 at 03:03:06PM -0500, Kent Overstreet wrote:
> On Wed, Feb 28, 2024 at 02:47:26PM -0500, Brian Foster wrote:
[...]
>
> So I suppose the journal must be getting stuck, and a journal write
> isn't completing - what does sysfs internal/journal_debug say when it
> happens?
>

dirty journal entries: 0/32768
seq: 11
seq_ondisk: 11
last_seq: 12
last_seq_ondisk: 11
flushed_seq_ondisk: 11
watermark: stripe
each entry reserved: 346
nr flush writes: 10
nr noflush writes: 1
average write size: 10.9 KiB
nr direct reclaim: 0
nr background reclaim: 25
reclaim kicked: 0
reclaim runs in: 0 ms
current entry sectors: 512
current entry error: ok
current entry: closed
unwritten entries:
last buf closed
replay done: 1
space:
	discarded	512:244736
	clean ondisk	512:244736
	clean		512:244736
	total		512:245760
dev 0:
	nr		480
	bucket size	512
	available	478:266
	discard_idx	1
	dirty_ondisk	1 (seq 11)
	dirty_idx	1 (seq 11)
	cur_idx		1 (seq 11)
* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-28 20:13 ` Brian Foster
@ 2024-02-28 23:43   ` Kent Overstreet
  2024-02-29  0:02   ` Kent Overstreet
  1 sibling, 0 replies; 8+ messages in thread
From: Kent Overstreet @ 2024-02-28 23:43 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs

On Wed, Feb 28, 2024 at 03:13:04PM -0500, Brian Foster wrote:
> On Wed, Feb 28, 2024 at 03:03:06PM -0500, Kent Overstreet wrote:
> > So I suppose the journal must be getting stuck, and a journal write
> > isn't completing - what does sysfs internal/journal_debug say when it
> > happens?
>
> dirty journal entries: 0/32768
> seq: 11
> seq_ondisk: 11
> last_seq: 12
> last_seq_ondisk: 11
> flushed_seq_ondisk: 11
[...]

Can you enable closure debugging (in kconfig)? That will show
outstanding closures, and info on what they're waiting on, in debugfs.

bch2_dio_write_flush() is doing a journal flush and waiting on it to
complete, but the journal isn't doing anything, so... lost wakeup,
perhaps?

Another thing to check for would be if we have waiters on journal_bufs
that aren't in use - maybe add some simple assertions in
journal_debug_to_text() -> journal_bufs_to_text(): if a buf isn't
active and isn't in the range that's being printed, just make sure
buf->wait is NULL.

And journal_buf_to_text() doesn't seem to print out whether the
waitlist has waiters - that's an oversight...
* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-28 20:13 ` Brian Foster
  2024-02-28 23:43   ` Kent Overstreet
@ 2024-02-29  0:02   ` Kent Overstreet
  2024-02-29  2:59     ` Kent Overstreet
  1 sibling, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2024-02-29 0:02 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs

On Wed, Feb 28, 2024 at 03:13:04PM -0500, Brian Foster wrote:
> On Wed, Feb 28, 2024 at 03:03:06PM -0500, Kent Overstreet wrote:
> > So I suppose the journal must be getting stuck, and a journal write
> > isn't completing - what does sysfs internal/journal_debug say when it
> > happens?
>
> dirty journal entries: 0/32768
> seq: 11
> seq_ondisk: 11
> flushed_seq_ondisk: 11
[...]

I might see it - journal_last_unwritten_seq() is now wrong, it's
assuming that j->seq_ondisk + 1 hasn't been submitted yet.

bch2_journal_flush_seq_async() then sets that journal buf to "must be a
flush write", but if it's already been submitted - whoops

But your journal_debug says seq_ondisk == flushed_seq_ondisk, so maybe
that's not actually it here?
* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-29  0:02 ` Kent Overstreet
@ 2024-02-29  2:59   ` Kent Overstreet
  2024-02-29 15:55     ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2024-02-29 2:59 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs

On Wed, Feb 28, 2024 at 07:02:39PM -0500, Kent Overstreet wrote:
> I might see it - journal_last_unwritten_seq() is now wrong, it's
> assuming that j->seq_ondisk + 1 hasn't been submitted yet.
>
> bch2_journal_flush_seq_async() then sets that journal buf to "must be a
> flush write", but if it's already been submitted - whoops

On further reading, bch2_journal_flush_seq_async() looks correct, but
bch2_journal_noflush_seq() was wrong - the following patch is in
-testing:

From 3c8f22258ab39992ac8a4b077610a30c733f068d Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Wed, 28 Feb 2024 21:56:57 -0500
Subject: [PATCH] bcachefs: Fix bch2_journal_noflush_seq()

Improved journal pipelining broke journal_noflush_seq(); it implicitly
assumed only the oldest outstanding journal buf could be in flight, but
that's no longer true.

Make this more straightforward by just setting buf->must_flush whenever
we know a journal buf is going to be a flush write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index ed1724b52204..3dc1534d1d3c 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -715,7 +715,7 @@ int bch2_journal_flush_seq_async(struct journal *j, u64 seq,
 		return ret;
 	seq = res.seq;

-	buf = j->buf + (seq & JOURNAL_BUF_MASK);
+	buf = journal_seq_to_buf(j, seq);
 	buf->must_flush = true;

 	if (!buf->flush_time) {
@@ -733,8 +733,8 @@ int bch2_journal_flush_seq_async(struct journal *j, u64 seq,
 	}

 	/*
-	 * if write was kicked off without a flush, flush the next sequence
-	 * number instead
+	 * if write was kicked off without a flush, or if we promised it
+	 * wouldn't be a flush, flush the next sequence number instead
 	 */
 	buf = journal_seq_to_buf(j, seq);
 	if (buf->noflush) {
@@ -812,8 +812,8 @@ bool bch2_journal_noflush_seq(struct journal *j, u64 seq)
 	     unwritten_seq++) {
 		struct journal_buf *buf = journal_seq_to_buf(j, unwritten_seq);

-		/* journal write is already in flight, and was a flush write: */
-		if (unwritten_seq == journal_last_unwritten_seq(j) && !buf->noflush)
+		/* journal flush already in flight, or flush requested */
+		if (buf->must_flush)
 			goto out;

 		buf->noflush = true;
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 4557b5c1f7e1..77e93ad875d6 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1935,6 +1935,7 @@ static int bch2_journal_write_pick_flush(struct journal *j, struct journal_buf *
 		j->nr_noflush_writes++;
 	} else {
+		w->must_flush = true;
 		j->last_flush_write = jiffies;
 		j->nr_flush_writes++;

 		clear_bit(JOURNAL_NEED_FLUSH_WRITE, &j->flags);
* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-29  2:59 ` Kent Overstreet
@ 2024-02-29 15:55   ` Brian Foster
  2024-02-29 16:24     ` Kent Overstreet
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2024-02-29 15:55 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, Feb 28, 2024 at 09:59:12PM -0500, Kent Overstreet wrote:
> On further reading, journal_flush_seq_async() looks correct, but
> bch2_journal_noflush_seq() was wrong - the following patch is in
> -testing:
[...]

So I can still reproduce the issue on master, which includes commit
3c8f22258ab ("bcachefs: Fix bch2_journal_noflush_seq()"). Just for
reference, the related debug info is as follows:

# cat /proc/13918/stack
[<0>] bch2_dio_write_flush+0x15b/0x190 [bcachefs]
[<0>] bch2_direct_write+0xb75/0xd30 [bcachefs]
[<0>] bch2_write_iter+0x4c/0xf10 [bcachefs]
[<0>] vfs_write+0x29b/0x470
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x86/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

# cat journal_debug
dirty journal entries: 0/32768
seq: 72
seq_ondisk: 72
last_seq: 73
last_seq_ondisk: 72
flushed_seq_ondisk: 72
watermark: stripe
each entry reserved: 361
nr flush writes: 65
nr noflush writes: 0
average write size: 2.14 KiB
nr direct reclaim: 0
nr background reclaim: 66
reclaim kicked: 0
reclaim runs in: 0 ms
blocked: 0
current entry sectors: 512
current entry error: ok
current entry: closed
unwritten entries:
last buf closed
replay done: 1
space:
	discarded	512:244736
	clean ondisk	512:244736
	clean		512:244736
	total		512:245760
dev 0:
	nr		480
	bucket size	512
	available	478:188
	discard_idx	1
	dirty_ondisk	1 (seq 72)
	dirty_idx	1 (seq 72)
	cur_idx		1 (seq 72)

# cat closures
0000000044781eb9: __closure_sync+0x49/0x180 -> closure_sync_fn+0x0/0x30 p 0000000000000000 r 1
	W bch2_journal_flush_seq_async.part.0+0xed/0x590 [bcachefs]

000000001b9a5601: bch2_fs_open+0x538/0x15e0 [bcachefs] -> 0x0 p 0000000000000000 r 1 R

Brian
* Re: [BUG] bcachefs fio lockup via generic/703
  2024-02-29 15:55 ` Brian Foster
@ 2024-02-29 16:24   ` Kent Overstreet
  0 siblings, 0 replies; 8+ messages in thread
From: Kent Overstreet @ 2024-02-29 16:24 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs

On Thu, Feb 29, 2024 at 10:55:46AM -0500, Brian Foster wrote:
> So I can still reproduce the issue on master, which includes commit
> 3c8f22258ab ("bcachefs: Fix bch2_journal_noflush_seq()"). Just for
> reference, the related debug info is as follows:
>
> # cat /proc/13918/stack
> [<0>] bch2_dio_write_flush+0x15b/0x190 [bcachefs]
[...]
>
> # cat closures
> 0000000044781eb9: __closure_sync+0x49/0x180 -> closure_sync_fn+0x0/0x30 p 0000000000000000 r 1
> 	W bch2_journal_flush_seq_async.part.0+0xed/0x590 [bcachefs]
>
> 000000001b9a5601: bch2_fs_open+0x538/0x15e0 [bcachefs] -> 0x0 p 0000000000000000 r 1 R

So, definitely some sort of lost wakeup, or we accidentally waited on a
journal buf that wasn't actually outstanding