* writeback completion soft lockup BUG in folio_wake_bit()
@ 2022-03-15 19:07 Brian Foster
2022-03-16 20:59 ` Matthew Wilcox
0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2022-03-15 19:07 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-xfs; +Cc: Linus Torvalds, Hugh Dickins
Hi,
I've been chasing down an issue recently that results in the following
soft lockup warning in XFS writeback (iomap_ioend) completion:
watchdog: BUG: soft lockup - CPU#42 stuck for 208s! [kworker/42:0:52508]
...
CPU: 42 PID: 52508 Comm: kworker/42:0 Tainted: G S L 5.17.0-rc8 #5
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-conv/dm-0 xfs_end_io [xfs]
RIP: 0010:_raw_spin_unlock_irqrestore+0x1a/0x31
Code: 74 01 c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 74 01 fb bf 01 00 00 00 <e8> f1 6e 6a ff 65 8b 05 7a e3 1a 63 85 c0 74 01 c3 0f 1f 44 00 00
RSP: 0018:ff4a81e4a8a37ce8 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffffff9de08b28 RCX: ff4a81e4a6cfbd00
RDX: ff4a81e4a6cfbd00 RSI: 0000000000000246 RDI: 0000000000000001
RBP: 0000000000000246 R08: ff4a81e4a6cfbd00 R09: 000000000002f8c0
R10: 00006dfe673f3ac5 R11: 00000000fa83b2da R12: 0000000000000fa8
R13: ffcb6e504b14e640 R14: 0000000000000000 R15: 0000000000001000
FS: 0000000000000000(0000) GS:ff1a26083fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000888000 CR3: 0000000790f4a006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
folio_wake_bit+0x8a/0x110
folio_end_writeback+0x37/0x80
iomap_finish_ioend+0xc6/0x330
iomap_finish_ioends+0x93/0xd0
xfs_end_ioend+0x5e/0x150 [xfs]
xfs_end_io+0xaf/0xe0 [xfs]
process_one_work+0x1c5/0x390
? process_one_work+0x390/0x390
worker_thread+0x30/0x350
? process_one_work+0x390/0x390
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
This is reproduced via repeated iterations of the LTP diotest3 test [1]
on a 96-CPU system. This usually first occurs in a transient manner,
dumping two or three warnings before the system recovers and the test
continues, but it eventually reproduces a full-on livelock in this same
path. I think I've narrowed the root cause of the warning to commit
c2407cf7d22d0 ("mm: make wait_on_page_writeback() wait for multiple
pending writebacks") (last discussed here[2] afaics). The first thing to
note is that I think the dio aspect of the test program is not
important. The test in this case creates 100 tasks that each run a
"buffered write -> fsync -> dio read" loop on an isolated range of the
test file, where the buffered writes and fsyncs seem to be the primary
contributing factors.
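
Roughly speaking, each child effectively does something like the
following on its own slice of the file (an illustrative sketch only, not
the actual LTP diotest3 source; sizes, offsets and error handling are
simplified and hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* per-child loop: buffered write -> fsync -> dio read of a private range */
static void child_loop(const char *path, off_t off, size_t bufsize, int iters)
{
	int bfd = open(path, O_RDWR);			/* buffered I/O fd */
	int dfd = open(path, O_RDONLY | O_DIRECT);	/* direct I/O fd */
	void *buf;

	if (posix_memalign(&buf, 4096, bufsize))
		return;
	memset(buf, 0xaa, bufsize);

	for (int i = 0; i < iters; i++) {
		pwrite(bfd, buf, bufsize, off);	/* dirty the page cache */
		fsync(bfd);			/* waits on PG_writeback */
		pread(dfd, buf, bufsize, off);	/* dio read of the same range */
	}

	free(buf);
	close(bfd);
	close(dfd);
}
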
What seems to happen is that the majority of the fsync calls end up
waiting on writeback of a particular page. The wakeup of the writeback
bit on that page wakes a task that immediately sets PG_writeback on the
page again, such that N other folio_wait_writeback() waiters see the bit
still set and immediately place themselves back onto the tail of the
wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
in folio_wake_bit() (backing off the lock for a cycle or so in each
iteration) only to find the same bunch of tasks in the queue. This
process repeats for long enough to trigger the soft
lockup warning. I've confirmed this spinning behavior with a tracepoint
in the bookmark loop that indicates we're stuck for many hundreds of
thousands of iterations (at least) of this loop when the soft lockup
warning triggers.
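
For reference, the waker-side loop in question looks roughly like this
(abridged from mm/filemap.c as of 5.17; unrelated details omitted):

static void folio_wake_bit(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	unsigned long flags;
	wait_queue_entry_t bookmark;
	...
	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);

	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		/*
		 * Take a breather from holding the lock, allow pages
		 * that finish wake up asynchronously to acquire the
		 * lock and remove themselves from wait queue
		 */
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	}
	...
}
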
As to the full livelock variant, I think we end up with a
write_cache_pages() task that is woken while waiting for page P,
immediately sets PG_writeback again on P and moves on to P+1. P+1 is
also already under
writeback, so write_cache_pages() blocks here waiting on that. Page P
has still not been submitted for I/O because XFS/iomap batches I/O
submissions across multiple ->writepage() callbacks. The waker task
spinloop occurs on page P as described above, but since the waker task
is the XFS workqueue task associated with ioend completion on the inode
(because these are delalloc writes that require unwritten extent
conversion on completion), it will never get to waking the waiters on page P+1, and
we're stuck for good. IOW:
1. Page P is stuck in PG_writeback on a queue awaiting I/O submission.
2. The task submitting P is blocked in write_cache_pages() waiting on
PG_writeback of page P+1 (see the snippet below).
3. The waker task responsible for waking P+1 is stuck spinning, waking
tasks for a previous PG_writeback state on page P. That wait queue never
drains, however, because it consists of a large number of tasks (>
WAITQUEUE_WALK_BREAK_CNT) inserted via folio_wait_writeback() that will
never break out of their wait loop for P (because of step 1).
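
For reference, the write_cache_pages() side of steps 1-2 is roughly the
following (abridged from mm/page-writeback.c as of 5.17; this is the
"wait loop and BUG_ON()" mentioned further down):

		if (PageWriteback(page)) {
			if (wbc->sync_mode != WB_SYNC_NONE)
				wait_on_page_writeback(page);
			else
				goto continue_unlock;
		}

		BUG_ON(PageWriteback(page));
		if (!clear_page_dirty_for_io(page))
			goto continue_unlock;

		/* ... followed by the ->writepage() call ... */
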
I've run a few quick experiments to try and corroborate this analysis.
The problem goes away completely if I either back out the loop change in
folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
like 128 (i.e. greater than the total possible number of waiter tasks in
this test). I've also played a few games with bookmark behavior mostly
out of curiosity, but usually end up introducing other problems like
missed wakeups, etc. I've not been able to reproduce this problem on
ext4; I suspect that's due to sufficiently different writeback/completion
batching behavior. I was thinking about whether skipping writeback of
PG_writeback pages after an explicit wait (removing the wait loop and
BUG_ON()) might avoid this problem and the one the loop is intended to
fix, but I'm not sure that would be safe. Thoughts or ideas?
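
(For reference, the post-c2407cf7d22d wait side I'm referring to is
roughly:

void folio_wait_writeback(struct folio *folio)
{
	while (folio_test_writeback(folio)) {
		trace_folio_wait_writeback(folio, folio_mapping(folio));
		folio_wait_bit(folio, PG_writeback);
	}
}

i.e. backing out the loop change means turning that "while" back into an
"if".)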
Brian
[1] while [ true ]; do TMPDIR=<xfsdir> diotest3 -b 65536 -n 100 -i 100 -o 1024000; done
[2] https://lore.kernel.org/linux-mm/000000000000886dbd05b7ffa8db@google.com/
^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-15 19:07 writeback completion soft lockup BUG in folio_wake_bit() Brian Foster
@ 2022-03-16 20:59 ` Matthew Wilcox
  2022-03-16 23:35   ` Linus Torvalds
  2022-03-17 13:51   ` Brian Foster
  0 siblings, 2 replies; 27+ messages in thread
From: Matthew Wilcox @ 2022-03-16 20:59 UTC (permalink / raw)
To: Brian Foster
Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins

On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote:
> What seems to happen is that the majority of the fsync calls end up
> waiting on writeback of a particular page, the wakeup of the writeback
> bit on that page wakes a task that immediately resets PG_writeback on
> the page such that N other folio_wait_writeback() waiters see the bit
> still set and immediately place themselves back onto the tail of the
> wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
> in folio_wake_bit() (backing off the lock for a cycle or so in each
> iteration) only to find the same bunch of tasks in the queue. This
> process repeats for a long enough amount of time to trigger the soft
> lockup warning. I've confirmed this spinning behavior with a tracepoint
> in the bookmark loop that indicates we're stuck for many hundreds of
> thousands of iterations (at least) of this loop when the soft lockup
> warning triggers.

[...]

> I've run a few quick experiments to try and corroborate this analysis.
> The problem goes away completely if I either back out the loop change in
> folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
> like 128 (i.e. greater than the total possible number of waiter tasks in
> this test). I've also played a few games with bookmark behavior mostly
> out of curiosity, but usually end up introducing other problems like
> missed wakeups, etc.

As I recall, the bookmark hack was introduced in order to handle
lock_page() problems.  It wasn't really supposed to handle writeback,
but nobody thought it would cause any harm (and indeed, it didn't at the
time).  So how about we only use bookmarks for lock_page(), since
lock_page() usually doesn't have the multiple-waker semantics that
writeback has?

(this is more in the spirit of "minimal patch" -- I think initialising
the bookmark should be moved to folio_unlock()).
diff --git a/mm/filemap.c b/mm/filemap.c index b2728eb52407..9ee3c5f1f489 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, return (flags & WQ_FLAG_EXCLUSIVE) != 0; } -static void folio_wake_bit(struct folio *folio, int bit_nr) +static void folio_wake_bit(struct folio *folio, int bit_nr, + wait_queue_entry_t *bookmark) { wait_queue_head_t *q = folio_waitqueue(folio); struct wait_page_key key; unsigned long flags; - wait_queue_entry_t bookmark; key.folio = folio; key.bit_nr = bit_nr; key.page_match = 0; - bookmark.flags = 0; - bookmark.private = NULL; - bookmark.func = NULL; - INIT_LIST_HEAD(&bookmark.entry); + if (bookmark) { + bookmark->flags = 0; + bookmark->private = NULL; + bookmark->func = NULL; + INIT_LIST_HEAD(&bookmark->entry); + } spin_lock_irqsave(&q->lock, flags); - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); - while (bookmark.flags & WQ_FLAG_BOOKMARK) { + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { /* * Take a breather from holding the lock, * allow pages that finish wake up asynchronously @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) spin_unlock_irqrestore(&q->lock, flags); cpu_relax(); spin_lock_irqsave(&q->lock, flags); - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); } /* @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) { if (!folio_test_waiters(folio)) return; - folio_wake_bit(folio, bit); + folio_wake_bit(folio, bit, NULL); } /* @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem */ void folio_unlock(struct folio *folio) { + wait_queue_entry_t bookmark; + /* Bit 7 allows x86 to check the byte's sign bit */ BUILD_BUG_ON(PG_waiters != 7); BUILD_BUG_ON(PG_locked > 7); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) - folio_wake_bit(folio, PG_locked); + folio_wake_bit(folio, PG_locked, &bookmark); } EXPORT_SYMBOL(folio_unlock); @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) { VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); - folio_wake_bit(folio, PG_private_2); + folio_wake_bit(folio, PG_private_2, NULL); folio_put(folio); } EXPORT_SYMBOL(folio_end_private_2); ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-16 20:59 ` Matthew Wilcox
@ 2022-03-16 23:35 ` Linus Torvalds
  2022-03-17 15:04   ` Matthew Wilcox
  2022-03-17 15:31   ` Brian Foster
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-03-16 23:35 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> As I recall, the bookmark hack was introduced in order to handle
> lock_page() problems.  It wasn't really supposed to handle writeback,
> but nobody thought it would cause any harm (and indeed, it didn't at the
> time).  So how about we only use bookmarks for lock_page(), since
> lock_page() usually doesn't have the multiple-waker semantics that
> writeback has?

I was hoping that some of the page lock problems are gone and we could
maybe try to get rid of the bookmarks entirely.

But the page lock issues only ever showed up on some private
proprietary load and machine, so we never really got confirmation that
they are fixed. There were lots of strong signs to them being related
to the migration page locking, and it may be that the bookmark code is
only hurting these days.

See for example commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated") which doesn't
actually change the *locking* side, but drops the page reference when
waiting for the locked page to be unlocked, which in turn removes a
"loop and try again when migration". And that may have been the real
_fix_ for the problem.

Because while the bookmark thing avoids the NMI lockup detector firing
due to excessive hold times, the bookmarking also _causes_ that "we
now will see the same page multiple times because we dropped the lock
and somebody re-added it at the end of the queue" issue. Which seems
to be the problem here.

Ugh. I wish we had some way to test "could we just remove the bookmark
code entirely again".

Of course, the PG_lock case also works fairly hard to not actually
remove and re-add the lock waiter to the queue, but having an actual
"wait for and get the lock" operation. The writeback bit isn't done
that way.

I do hate how we had to make folio_wait_writeback{_killable}() use
"while" rather than an "if". It *almost* works with just a "wait for
current writeback", but not quite. See commit c2407cf7d22d ("mm: make
wait_on_page_writeback() wait for multiple pending writebacks") for
why we have to loop. Ugly, ugly.

Because I do think that "while" in the writeback waiting is a problem.
Maybe _the_ problem.

            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-16 23:35 ` Linus Torvalds
@ 2022-03-17 15:04   ` Matthew Wilcox
  2022-03-17 19:26     ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-03-17 15:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Mar 16, 2022 at 04:35:10PM -0700, Linus Torvalds wrote:
> On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > As I recall, the bookmark hack was introduced in order to handle
> > lock_page() problems.  It wasn't really supposed to handle writeback,
> > but nobody thought it would cause any harm (and indeed, it didn't at the
> > time).  So how about we only use bookmarks for lock_page(), since
> > lock_page() usually doesn't have the multiple-waker semantics that
> > writeback has?
>
> I was hoping that some of the page lock problems are gone and we could
> maybe try to get rid of the bookmarks entirely.
>
> But the page lock issues only ever showed up on some private
> proprietary load and machine, so we never really got confirmation that
> they are fixed. There were lots of strong signs to them being related
> to the migration page locking, and it may be that the bookmark code is
> only hurting these days.

Ah, I found Tim's mail describing the workload:
https://lore.kernel.org/all/a9e74f64-dee6-dc23-128e-8ef8c7383d77@linux.intel.com/
Relevant part:

: They have a parent process that spawns off 10 children per core and
: kicked them to run. The child processes all access a common library.
: We have 384 cores so 3840 child processes running. When migration occur on
: a page in the common library, the first child that access the page will
: page fault and lock the page, with the other children also page faulting
: quickly and pile up in the page wait list, till the first child is done.

Seems like someone should be able to write a test app that does
something along those lines fairly easily.  The trick is finding a
giant system to run it on.  Although doing it on a smaller system with
more SW threads/CPU would probably be enough.

> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") which doesn't
> actually change the *locking* side, but drops the page reference when
> waiting for the locked page to be unlocked, which in turn removes a
> "loop and try again when migration". And that may have been the real
> _fix_ for the problem.

Actually, I think it fixes a livelock.  If you have hundreds (let alone
thousands) of threads all trying to migrate the same page, they're all
going to fail with the original code because migration can't succeed
with an elevated refcount, and each thread that is waiting holds a
refcount.  I had a similar problem trying to split a folio in ->readpage
with one of the xfstests that has 500 threads all trying to read() from
the same page.  That was solved by also using put_and_wait_on_page_locked()
in bd8a1f3655a7 ("mm/filemap: support readpage splitting a page").

> Because while the bookmark thing avoids the NMI lockup detector firing
> due to excessive hold times, the bookmarking also _causes_ that "we
> now will see the same page multiple times because we dropped the lock
> and somebody re-added it at the end of the queue" issue. Which seems
> to be the problem here.
>
> Ugh. I wish we had some way to test "could we just remove the bookmark
> code entirely again".

I think we should be safe to remove it entirely now.  Of course, that
may show somewhere else that now suffers from a similar livelock ...
but if it does, we ought to fix that anyway, because the bookmark code
is stopping the livelock problem from being seen.

> Of course, the PG_lock case also works fairly hard to not actually
> remove and re-add the lock waiter to the queue, but having an actual
> "wait for and get the lock" operation. The writeback bit isn't done
> that way.
>
> I do hate how we had to make folio_wait_writeback{_killable}() use
> "while" rather than an "if". It *almost* works with just a "wait for
> current writeback", but not quite. See commit c2407cf7d22d ("mm: make
> wait_on_page_writeback() wait for multiple pending writebacks") for
> why we have to loop. Ugly, ugly.
>
> Because I do think that "while" in the writeback waiting is a problem.
> Maybe _the_ problem.

I do like the idea of wait-and-set the writeback flag in the VFS
instead of leaving it to the individual filesystems.  I'll put it on my
list of things to do.

Actually ... I don't think that calling folio_start_writeback() twice
is a problem per se.  It's inefficient (cycling the i_pages lock,
walking the xarray, atomic bitoperations on the flag), but nothing goes
wrong.  Except that NFS and AFS check the return value and squawk.

So how about we do something like this:

 - Make folio_start_writeback() and set_page_writeback() return void,
   fixing up AFS and NFS.
 - Add a folio_wait_start_writeback() to use in the VFS
 - Remove the calls to set_page_writeback() in the filesystems

That keepwrite thing troubles me, and hints it might be more complicated
than that.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-17 15:04 ` Matthew Wilcox
@ 2022-03-17 19:26   ` Linus Torvalds
  2022-03-17 21:16     ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-03-17 19:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> So how about we do something like this:
>
>  - Make folio_start_writeback() and set_page_writeback() return void,
>    fixing up AFS and NFS.
>  - Add a folio_wait_start_writeback() to use in the VFS
>  - Remove the calls to set_page_writeback() in the filesystems

That sounds lovely, but it does worry me a bit. Not just the odd
'keepwrite' thing, but also the whole ordering between the folio bit
and the tagging bits. Does the ordering possibly matter?

That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in
only one place (the folio version isn't used at all):

  ext4_writepage():

        ext4_walk_page_buffers() fails:
                redirty_page_for_writepage(wbc, page);
                keep_towrite = true;
        ext4_bio_write_page().

which just looks odd. Why does it even try to continue to do the
writepage when the page buffer thing has failed?

In the regular write path (ie ext4_write_begin()), a
ext4_walk_page_buffers() failure is fatal or causes a retry). Why is
ext4_writepage() any different? Particularly since it wants to keep
the page dirty, then trying to do the writeback just seems wrong.

So this code is all a bit odd, I suspect there are decades of "people
continued to do what they historically did" changes, and it is all
worrisome.

Your cleanup sounds like the right thing, but I also think that
getting rid of that 'keepwrite' thing would also be the right thing.

And it all worries me.

            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 19:26 ` Linus Torvalds @ 2022-03-17 21:16 ` Matthew Wilcox 2022-03-17 22:52 ` Dave Chinner 2022-03-18 13:16 ` Jan Kara 0 siblings, 2 replies; 27+ messages in thread From: Matthew Wilcox @ 2022-03-17 21:16 UTC (permalink / raw) To: Linus Torvalds Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Jan Kara, linux-ext4 On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote: > On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > So how about we do something like this: > > > > - Make folio_start_writeback() and set_page_writeback() return void, > > fixing up AFS and NFS. > > - Add a folio_wait_start_writeback() to use in the VFS > > - Remove the calls to set_page_writeback() in the filesystems > > That sounds lovely, but it does worry me a bit. Not just the odd > 'keepwrite' thing, but also the whole ordering between the folio bit > and the tagging bits. Does the ordering possibly matter? I wouldn't change the ordering of setting the xarray bits and the writeback flag; they'd just be set a little earlier. It'd all be done while the page was still locked. But you're right, there's lots of subtle interactions here. > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in > only one place (the folio version isn't used at all): > > ext4_writepage(): > > ext4_walk_page_buffers() fails: > redirty_page_for_writepage(wbc, page); > keep_towrite = true; > ext4_bio_write_page(). > > which just looks odd. Why does it even try to continue to do the > writepage when the page buffer thing has failed? > > In the regular write path (ie ext4_write_begin()), a > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is > ext4_writepage() any different? Particularly since it wants to keep > the page dirty, then trying to do the writeback just seems wrong. > > So this code is all a bit odd, I suspect there are decades of "people > continued to do what they historically did" changes, and it is all > worrisome. I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in ordered mode"). Fortunately, we have a documented test for this, generic/127, so we'll know if we've broken it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 21:16 ` Matthew Wilcox @ 2022-03-17 22:52 ` Dave Chinner 2022-03-18 13:16 ` Jan Kara 1 sibling, 0 replies; 27+ messages in thread From: Dave Chinner @ 2022-03-17 22:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Jan Kara, linux-ext4 On Thu, Mar 17, 2022 at 09:16:20PM +0000, Matthew Wilcox wrote: > On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote: > > On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > So how about we do something like this: > > > > > > - Make folio_start_writeback() and set_page_writeback() return void, > > > fixing up AFS and NFS. > > > - Add a folio_wait_start_writeback() to use in the VFS > > > - Remove the calls to set_page_writeback() in the filesystems > > > > That sounds lovely, but it does worry me a bit. Not just the odd > > 'keepwrite' thing, but also the whole ordering between the folio bit > > and the tagging bits. Does the ordering possibly matter? > > I wouldn't change the ordering of setting the xarray bits and the > writeback flag; they'd just be set a little earlier. It'd all be done > while the page was still locked. But you're right, there's lots of > subtle interactions here. > > > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in > > only one place (the folio version isn't used at all): > > > > ext4_writepage(): > > > > ext4_walk_page_buffers() fails: > > redirty_page_for_writepage(wbc, page); > > keep_towrite = true; > > ext4_bio_write_page(). > > > > which just looks odd. Why does it even try to continue to do the > > writepage when the page buffer thing has failed? > > > > In the regular write path (ie ext4_write_begin()), a > > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is > > ext4_writepage() any different? Particularly since it wants to keep > > the page dirty, then trying to do the writeback just seems wrong. > > > > So this code is all a bit odd, I suspect there are decades of "people > > continued to do what they historically did" changes, and it is all > > worrisome. > > I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in > ordered mode"). Fortunately, we have a documented test for this, > generic/127, so we'll know if we've broken it. Looks like a footgun. ext4 needs the keepwrite stuff for block size < page size, in the case where a page has both written and delalloc/unwritten buffers on it. In that case ext4_writepage tries to write just the written blocks and leave the dealloc/unwritten buffers alone because it can't do allocation in ->writepage context. I say footgun, because the nested ->writepage call that needs keepwrite comes a from journal stop context in the high level ->writepages context that is doing allocation that will allow the entire page to be written. i.e. it seems a bit silly to be triggering partial page writeback that skips delalloc/unwritten extents, but then needs special awareness of higher level IO that is in progress that is currently doing the allocation that will allow all the delalloc/unwritten extents on the page to also be written back... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 21:16 ` Matthew Wilcox 2022-03-17 22:52 ` Dave Chinner @ 2022-03-18 13:16 ` Jan Kara 2022-03-18 18:56 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Jan Kara @ 2022-03-18 13:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Jan Kara, linux-ext4 On Thu 17-03-22 21:16:20, Matthew Wilcox wrote: > On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote: > > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in > > only one place (the folio version isn't used at all): > > > > ext4_writepage(): > > > > ext4_walk_page_buffers() fails: > > redirty_page_for_writepage(wbc, page); > > keep_towrite = true; > > ext4_bio_write_page(). > > > > which just looks odd. Why does it even try to continue to do the > > writepage when the page buffer thing has failed? > > > > In the regular write path (ie ext4_write_begin()), a > > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is > > ext4_writepage() any different? Particularly since it wants to keep > > the page dirty, then trying to do the writeback just seems wrong. > > > > So this code is all a bit odd, I suspect there are decades of "people > > continued to do what they historically did" changes, and it is all > > worrisome. > > I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in > ordered mode"). Fortunately, we have a documented test for this, > generic/127, so we'll know if we've broken it. I agree with Dave that 'keep_towrite' thing is kind of self-inflicted damage on the ext4 side (we need to write out some blocks underlying the page but cannot write all from the transaction commit code, so we need to keep xarray tags intact so that data integrity sync cannot miss the page). Also it is no longer needed in the current default ext4 setup. But if you have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' mount options, the hack is still needed AFAIK and we don't have a reasonable way around it. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-18 13:16 ` Jan Kara @ 2022-03-18 18:56 ` Linus Torvalds 2022-03-19 16:23 ` Theodore Ts'o 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2022-03-18 18:56 UTC (permalink / raw) To: Jan Kara Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Ext4 Developers List On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <jack@suse.cz> wrote: > > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted > damage on the ext4 side (we need to write out some blocks underlying the > page but cannot write all from the transaction commit code, so we need to > keep xarray tags intact so that data integrity sync cannot miss the page). > Also it is no longer needed in the current default ext4 setup. But if you > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' > mount options, the hack is still needed AFAIK and we don't have a > reasonable way around it. I assume you meant 'dioread_lock'. Which seems to be the default (even if 'data=ordered' is not). Anyway - if it's not a problem for any current default setting, maybe the solution is to warn about this case and turn it off? IOW, we could simply warn about "data=ordered is no longer supported" and turn it into data=journal. Obviously *only* do this for the case of "blocksize < PAGE_SIZE". If this ext4 thing is (a) obsolete and (b) causes VFS-level problems that nobody else has, I really think we'd be much better off disabling it than trying to work with it. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-18 18:56 ` Linus Torvalds @ 2022-03-19 16:23 ` Theodore Ts'o 2022-03-30 15:55 ` Christoph Hellwig 0 siblings, 1 reply; 27+ messages in thread From: Theodore Ts'o @ 2022-03-19 16:23 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Ext4 Developers List On Fri, Mar 18, 2022 at 11:56:04AM -0700, Linus Torvalds wrote: > On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <jack@suse.cz> wrote: > > > > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted > > damage on the ext4 side (we need to write out some blocks underlying the > > page but cannot write all from the transaction commit code, so we need to > > keep xarray tags intact so that data integrity sync cannot miss the page). > > Also it is no longer needed in the current default ext4 setup. But if you > > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' > > mount options, the hack is still needed AFAIK and we don't have a > > reasonable way around it. > > I assume you meant 'dioread_lock'. > > Which seems to be the default (even if 'data=ordered' is not). That's not quite right. data=ordered is the default, and that has been the case since ext3 was introduced. "dioread_lock" was the default in the days of ext3; "dioread_nolock" was added to allow parallel Direct I/O reads (hence "nolock"). A while back, we tried to make dioread_nolock the default since it tends improve performance for workloads that have a mix of writes that should be affected by fsyncs, and those that shouldn't. Howevver, we had to revert that change when blocksize < pagesize to work around a problem found on Power machines where "echo 3 > drop_caches" on the host appears to cause file system corruptions on the guest. (Commit 626b035b816b "ext4: don't set dioread_nolock by default for blocksize < pagesize"). > IOW, we could simply warn about "data=ordered is no longer supported" > and turn it into data=journal. > > Obviously *only* do this for the case of "blocksize < PAGE_SIZE". Actiavelly, we've discussed a number of times doing the reverse --- removing data=journal entirely, since it's (a) rarely used, and (b) data=journal disables a bunch of optimizations, including preallocation, and so its performance is pretty awful for most workloads. The main reason why haven't until now is that we believe there is a small number of people who do find data=journal useful for their workload, and at least _so_ far it's not been that hard to keep it limping along --- although in most cases, data=journal doesn't get supported for new features or performance optimizations, and it definitely does note get as much testing. So the thing that I've been waiting to do for a while is to replace the whole data=ordered vs data=writeback and dioread_nolock and dioread_lock is a complete reworking of the ext4 buffered writeback path, where we write the data blocks *first*, and only then update the ext4 metadata. Historically, going as far back as ext2, we've always allocated data blocks and updatted the metadata blocks, and only then updated the buffer or page cache for the data blocks. All of the complexities around data=ordered, data=writeback, dioread_nolock, etc., is because we haven't done the fundamental work of reversing the order in which we do buffered writeback. 
What we *should* be doing is: *) Determining where the new allocated data blockblocks should be, and preventing those blocks from being used for any other purposes, but *not* updating the file system metadata to reflect that change. *) Submit the data block write *) On write completion, update the metadata blocks in a kernel thread. Over time, we've been finding more and more reasons why I need to do this work, so it's something I'm going to have to prioritize in the next few months. Cheerse, - Ted ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-19 16:23 ` Theodore Ts'o @ 2022-03-30 15:55 ` Christoph Hellwig 0 siblings, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2022-03-30 15:55 UTC (permalink / raw) To: Theodore Ts'o Cc: Linus Torvalds, Jan Kara, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Ext4 Developers List On Sat, Mar 19, 2022 at 12:23:04PM -0400, Theodore Ts'o wrote: > So the thing that I've been waiting to do for a while is to replace > the whole data=ordered vs data=writeback and dioread_nolock and > dioread_lock is a complete reworking of the ext4 buffered writeback > path, where we write the data blocks *first*, and only then update the > ext4 metadata. > *) Determining where the new allocated data blockblocks should be, and > preventing those blocks from being used for any other purposes, but > *not* updating the file system metadata to reflect that change. > > *) Submit the data block write > > *) On write completion, update the metadata blocks in a kernel thread. I think that would be easily done by switching to the iomap buffered I/O code, which is very much built around that model. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-16 23:35 ` Linus Torvalds
  2022-03-17 15:04   ` Matthew Wilcox
@ 2022-03-17 15:31   ` Brian Foster
  1 sibling, 0 replies; 27+ messages in thread
From: Brian Foster @ 2022-03-17 15:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Mar 16, 2022 at 04:35:10PM -0700, Linus Torvalds wrote:
> On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > As I recall, the bookmark hack was introduced in order to handle
> > lock_page() problems.  It wasn't really supposed to handle writeback,
> > but nobody thought it would cause any harm (and indeed, it didn't at the
> > time).  So how about we only use bookmarks for lock_page(), since
> > lock_page() usually doesn't have the multiple-waker semantics that
> > writeback has?
>
> I was hoping that some of the page lock problems are gone and we could
> maybe try to get rid of the bookmarks entirely.
>
> But the page lock issues only ever showed up on some private
> proprietary load and machine, so we never really got confirmation that
> they are fixed. There were lots of strong signs to them being related
> to the migration page locking, and it may be that the bookmark code is
> only hurting these days.
>
> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") which doesn't
> actually change the *locking* side, but drops the page reference when
> waiting for the locked page to be unlocked, which in turn removes a
> "loop and try again when migration". And that may have been the real
> _fix_ for the problem.
>
> Because while the bookmark thing avoids the NMI lockup detector firing
> due to excessive hold times, the bookmarking also _causes_ that "we
> now will see the same page multiple times because we dropped the lock
> and somebody re-added it at the end of the queue" issue. Which seems
> to be the problem here.
>
> Ugh. I wish we had some way to test "could we just remove the bookmark
> code entirely again".
>
> Of course, the PG_lock case also works fairly hard to not actually
> remove and re-add the lock waiter to the queue, but having an actual
> "wait for and get the lock" operation. The writeback bit isn't done
> that way.
>
> I do hate how we had to make folio_wait_writeback{_killable}() use
> "while" rather than an "if". It *almost* works with just a "wait for
> current writeback", but not quite. See commit c2407cf7d22d ("mm: make
> wait_on_page_writeback() wait for multiple pending writebacks") for
> why we have to loop. Ugly, ugly.
>

Right.. In case you missed it in my too long of a description (sorry),
I pointed out that this problem seems to manifest most recently as of
that commit. In fact, looking through the past discussion for that
patch, it wouldn't surprise me a ton if this problem is some
pathological manifestation of the perf issue that you described here
[1]. Indeed most of the waiters in this case come from fsync() -> ... ->
__filemap_fdatawait_range(), and your test patch in that email performs
a similar sort of trick to skip out of the wake up side (I'm curious if
that was ever determined to help?) to things that I was playing with to
try and narrow this down.

> Because I do think that "while" in the writeback waiting is a problem.
> Maybe _the_ problem.
>

FWIW, Matthew's patch does seem to address this problem. My current
test of that patch is past the point where I usually expect to see the
soft lockup warning, but I'm going to let it continue to run (and then
run it through some xfs regression if the approach is agreeable).

Getting back to the loop thing (and seeing Matthew's latest reply wrt
wait_and_set())... If we wanted to go back to non-looping in
folio_wait_writeback() to avoid the unserialized wait queue build up or
whatever, would it make any sense to lift the looping writeback check
to write_cache_pages()? We hold the page lock and have checked
PageDirty() by that point, so ISTM that would address the
BUG_ON(PageWriteback()) race caused by the delayed/unserialized wakeup
without producing the excess wait queue buildup caused by waiters in
the __filemap_fdatawait_range() path. Then presumably that "wait for
writeback to clear" loop in write_cache_pages() is eventually replaced
by the "wait and set writeback" thing when the rest of the fs code is
fixed up appropriately. Hm? Of course I haven't tested that so it could
be completely bogus, but I can if it makes any sort of sense as an
incremental step..

Brian

[1] https://lore.kernel.org/linux-mm/CAHk-=wgD9GK5CeHopYmRHoYS9cNuCmDMsc=+MbM_KgJ0KB+=ng@mail.gmail.com/

> Linus
>

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-16 20:59 ` Matthew Wilcox 2022-03-16 23:35 ` Linus Torvalds @ 2022-03-17 13:51 ` Brian Foster 2022-03-18 14:14 ` Brian Foster 1 sibling, 1 reply; 27+ messages in thread From: Brian Foster @ 2022-03-17 13:51 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote: > On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote: > > What seems to happen is that the majority of the fsync calls end up > > waiting on writeback of a particular page, the wakeup of the writeback > > bit on that page wakes a task that immediately resets PG_writeback on > > the page such that N other folio_wait_writeback() waiters see the bit > > still set and immediately place themselves back onto the tail of the > > wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop > > in folio_wake_bit() (backing off the lock for a cycle or so in each > > iteration) only to find the same bunch of tasks in the queue. This > > process repeats for a long enough amount of time to trigger the soft > > lockup warning. I've confirmed this spinning behavior with a tracepoint > > in the bookmark loop that indicates we're stuck for many hundreds of > > thousands of iterations (at least) of this loop when the soft lockup > > warning triggers. > > [...] > > > I've run a few quick experiments to try and corroborate this analysis. > > The problem goes away completely if I either back out the loop change in > > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something > > like 128 (i.e. greater than the total possible number of waiter tasks in > > this test). I've also played a few games with bookmark behavior mostly > > out of curiosity, but usually end up introducing other problems like > > missed wakeups, etc. > > As I recall, the bookmark hack was introduced in order to handle > lock_page() problems. It wasn't really supposed to handle writeback, > but nobody thought it would cause any harm (and indeed, it didn't at the > time). So how about we only use bookmarks for lock_page(), since > lock_page() usually doesn't have the multiple-waker semantics that > writeback has? > Oh, interesting. I wasn't aware of the tenuous status of the bookmark code. This is indeed much nicer than anything I was playing with. I suspect it will address the problem, but I'll throw it at my test env for a while and follow up.. thanks! Brian > (this is more in the spirit of "minimal patch" -- I think initialising > the bookmark should be moved to folio_unlock()). 
> > diff --git a/mm/filemap.c b/mm/filemap.c > index b2728eb52407..9ee3c5f1f489 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > return (flags & WQ_FLAG_EXCLUSIVE) != 0; > } > > -static void folio_wake_bit(struct folio *folio, int bit_nr) > +static void folio_wake_bit(struct folio *folio, int bit_nr, > + wait_queue_entry_t *bookmark) > { > wait_queue_head_t *q = folio_waitqueue(folio); > struct wait_page_key key; > unsigned long flags; > - wait_queue_entry_t bookmark; > > key.folio = folio; > key.bit_nr = bit_nr; > key.page_match = 0; > > - bookmark.flags = 0; > - bookmark.private = NULL; > - bookmark.func = NULL; > - INIT_LIST_HEAD(&bookmark.entry); > + if (bookmark) { > + bookmark->flags = 0; > + bookmark->private = NULL; > + bookmark->func = NULL; > + INIT_LIST_HEAD(&bookmark->entry); > + } > > spin_lock_irqsave(&q->lock, flags); > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > - while (bookmark.flags & WQ_FLAG_BOOKMARK) { > + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { > /* > * Take a breather from holding the lock, > * allow pages that finish wake up asynchronously > @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) > spin_unlock_irqrestore(&q->lock, flags); > cpu_relax(); > spin_lock_irqsave(&q->lock, flags); > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > } > > /* > @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) > { > if (!folio_test_waiters(folio)) > return; > - folio_wake_bit(folio, bit); > + folio_wake_bit(folio, bit, NULL); > } > > /* > @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem > */ > void folio_unlock(struct folio *folio) > { > + wait_queue_entry_t bookmark; > + > /* Bit 7 allows x86 to check the byte's sign bit */ > BUILD_BUG_ON(PG_waiters != 7); > BUILD_BUG_ON(PG_locked > 7); > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > + > if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) > - folio_wake_bit(folio, PG_locked); > + folio_wake_bit(folio, PG_locked, &bookmark); > } > EXPORT_SYMBOL(folio_unlock); > > @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) > { > VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); > clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); > - folio_wake_bit(folio, PG_private_2); > + folio_wake_bit(folio, PG_private_2, NULL); > folio_put(folio); > } > EXPORT_SYMBOL(folio_end_private_2); > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 13:51 ` Brian Foster @ 2022-03-18 14:14 ` Brian Foster 2022-03-18 14:45 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Brian Foster @ 2022-03-18 14:14 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins On Thu, Mar 17, 2022 at 09:51:44AM -0400, Brian Foster wrote: > On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote: > > On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote: > > > What seems to happen is that the majority of the fsync calls end up > > > waiting on writeback of a particular page, the wakeup of the writeback > > > bit on that page wakes a task that immediately resets PG_writeback on > > > the page such that N other folio_wait_writeback() waiters see the bit > > > still set and immediately place themselves back onto the tail of the > > > wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop > > > in folio_wake_bit() (backing off the lock for a cycle or so in each > > > iteration) only to find the same bunch of tasks in the queue. This > > > process repeats for a long enough amount of time to trigger the soft > > > lockup warning. I've confirmed this spinning behavior with a tracepoint > > > in the bookmark loop that indicates we're stuck for many hundreds of > > > thousands of iterations (at least) of this loop when the soft lockup > > > warning triggers. > > > > [...] > > > > > I've run a few quick experiments to try and corroborate this analysis. > > > The problem goes away completely if I either back out the loop change in > > > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something > > > like 128 (i.e. greater than the total possible number of waiter tasks in > > > this test). I've also played a few games with bookmark behavior mostly > > > out of curiosity, but usually end up introducing other problems like > > > missed wakeups, etc. > > > > As I recall, the bookmark hack was introduced in order to handle > > lock_page() problems. It wasn't really supposed to handle writeback, > > but nobody thought it would cause any harm (and indeed, it didn't at the > > time). So how about we only use bookmarks for lock_page(), since > > lock_page() usually doesn't have the multiple-waker semantics that > > writeback has? > > > > Oh, interesting. I wasn't aware of the tenuous status of the bookmark > code. This is indeed much nicer than anything I was playing with. I > suspect it will address the problem, but I'll throw it at my test env > for a while and follow up.. thanks! > So I'm not clear on where we're at with this patch vs. something that moves (or drops) the wb wait loop vs. the wait and set thing (which seems more invasive and longer term), but FWIW.. this patch survived over 10k iterations of the reproducer test yesterday (the problem typically reproduces in ~1k or so iterations on average) and an overnight fstests run without regression. Brian > Brian > > > (this is more in the spirit of "minimal patch" -- I think > > initialising the bookmark should be moved to folio_unlock()). 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c index > > b2728eb52407..9ee3c5f1f489 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > > return (flags & WQ_FLAG_EXCLUSIVE) != 0; > > } > > > > -static void folio_wake_bit(struct folio *folio, int bit_nr) > > +static void folio_wake_bit(struct folio *folio, int bit_nr, > > + wait_queue_entry_t *bookmark) > > { > > wait_queue_head_t *q = folio_waitqueue(folio); > > struct wait_page_key key; > > unsigned long flags; > > - wait_queue_entry_t bookmark; > > > > key.folio = folio; > > key.bit_nr = bit_nr; > > key.page_match = 0; > > > > - bookmark.flags = 0; > > - bookmark.private = NULL; > > - bookmark.func = NULL; > > - INIT_LIST_HEAD(&bookmark.entry); > > + if (bookmark) { > > + bookmark->flags = 0; > > + bookmark->private = NULL; > > + bookmark->func = NULL; > > + INIT_LIST_HEAD(&bookmark->entry); > > + } > > > > spin_lock_irqsave(&q->lock, flags); > > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > > > - while (bookmark.flags & WQ_FLAG_BOOKMARK) { > > + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { > > /* > > * Take a breather from holding the lock, > > * allow pages that finish wake up asynchronously > > @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) > > spin_unlock_irqrestore(&q->lock, flags); > > cpu_relax(); > > spin_lock_irqsave(&q->lock, flags); > > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > } > > > > /* > > @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) > > { > > if (!folio_test_waiters(folio)) > > return; > > - folio_wake_bit(folio, bit); > > + folio_wake_bit(folio, bit, NULL); > > } > > > > /* > > @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem > > */ > > void folio_unlock(struct folio *folio) > > { > > + wait_queue_entry_t bookmark; > > + > > /* Bit 7 allows x86 to check the byte's sign bit */ > > BUILD_BUG_ON(PG_waiters != 7); > > BUILD_BUG_ON(PG_locked > 7); > > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > > + > > if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) > > - folio_wake_bit(folio, PG_locked); > > + folio_wake_bit(folio, PG_locked, &bookmark); > > } > > EXPORT_SYMBOL(folio_unlock); > > > > @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) > > { > > VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); > > clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); > > - folio_wake_bit(folio, PG_private_2); > > + folio_wake_bit(folio, PG_private_2, NULL); > > folio_put(folio); > > } > > EXPORT_SYMBOL(folio_end_private_2); > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-18 14:14 ` Brian Foster
@ 2022-03-18 14:45 ` Matthew Wilcox
  2022-03-18 18:58 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-03-18 14:45 UTC (permalink / raw)
To: Brian Foster
Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 654 bytes --]

On Fri, Mar 18, 2022 at 10:14:03AM -0400, Brian Foster wrote:
> So I'm not clear on where we're at with this patch vs. something that
> moves (or drops) the wb wait loop vs. the wait and set thing (which
> seems more invasive and longer term), but FWIW.. this patch survived
> over 10k iterations of the reproducer test yesterday (the problem
> typically reproduces in ~1k or so iterations on average) and an
> overnight fstests run without regression.

Excellent! I'm going to propose these two patches for -rc1 (I don't
think we want to be playing with this after -rc8). I agree the
wait-and-set approach is a little further out.

(patches attached)

[-- Attachment #2: 0001-filemap-Remove-use-of-wait-bookmarks.patch --]
[-- Type: text/plain, Size: 1818 bytes --]

From db81b2f1ce3dc6aa902beb8b3e45a5972cf85737 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Thu, 17 Mar 2022 12:12:49 -0400
Subject: [PATCH 1/2] filemap: Remove use of wait bookmarks

The original problem of the overly long list of waiters on a locked
page was solved properly by commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated"). In the
meantime, using bookmarks for the writeback bit can cause livelocks, so
we need to stop using them.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b2728eb52407..bb8ec585dea8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1151,32 +1151,13 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
 	wait_queue_head_t *q = folio_waitqueue(folio);
 	struct wait_page_key key;
 	unsigned long flags;
-	wait_queue_entry_t bookmark;
 
 	key.folio = folio;
 	key.bit_nr = bit_nr;
 	key.page_match = 0;
 
-	bookmark.flags = 0;
-	bookmark.private = NULL;
-	bookmark.func = NULL;
-	INIT_LIST_HEAD(&bookmark.entry);
-
 	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
-
-	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
-		/*
-		 * Take a breather from holding the lock,
-		 * allow pages that finish wake up asynchronously
-		 * to acquire the lock and remove themselves
-		 * from wait queue
-		 */
-		spin_unlock_irqrestore(&q->lock, flags);
-		cpu_relax();
-		spin_lock_irqsave(&q->lock, flags);
-		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
-	}
+	__wake_up_locked_key(q, TASK_NORMAL, &key);
 
 	/*
 	 * It is possible for other pages to have collided on the waitqueue
-- 
2.34.1


[-- Attachment #3: 0002-sched-Remove-wait-bookmarks.patch --]
[-- Type: text/plain, Size: 5839 bytes --]

From 4eda97c0c8374f948a08c28265f019517d369a4c Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 18 Mar 2022 10:39:49 -0400
Subject: [PATCH 2/2] sched: Remove wait bookmarks

There are no users of wait bookmarks left, so simplify the wait
code by removing them.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/wait.h |  9 +++----
 kernel/sched/wait.c  | 58 +++++++-------------------------------------
 2 files changed, 12 insertions(+), 55 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 851e07da2583..69aa9d5767e7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -19,10 +19,9 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 /* wait_queue_entry::flags */
 #define WQ_FLAG_EXCLUSIVE	0x01
 #define WQ_FLAG_WOKEN		0x02
-#define WQ_FLAG_BOOKMARK	0x04
-#define WQ_FLAG_CUSTOM		0x08
-#define WQ_FLAG_DONE		0x10
-#define WQ_FLAG_PRIORITY	0x20
+#define WQ_FLAG_CUSTOM		0x04
+#define WQ_FLAG_DONE		0x08
+#define WQ_FLAG_PRIORITY	0x10
 
 /*
  * A single wait-queue entry structure:
@@ -211,8 +210,6 @@ __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq
 
 void __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
-void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
-		unsigned int mode, void *key, wait_queue_entry_t *bookmark);
 void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
 void __wake_up_locked_sync_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
 void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index eca38107b32f..e09f5c177662 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -58,13 +58,6 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
 }
 EXPORT_SYMBOL(remove_wait_queue);
 
-/*
- * Scan threshold to break wait queue walk.
- * This allows a waker to take a break from holding the
- * wait queue lock during the wait queue walk.
- */
-#define WAITQUEUE_WALK_BREAK_CNT 64
-
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -79,21 +72,13 @@ EXPORT_SYMBOL(remove_wait_queue);
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
 static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
-			int nr_exclusive, int wake_flags, void *key,
-			wait_queue_entry_t *bookmark)
+			int nr_exclusive, int wake_flags, void *key)
 {
 	wait_queue_entry_t *curr, *next;
-	int cnt = 0;
 
 	lockdep_assert_held(&wq_head->lock);
 
-	if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
-		curr = list_next_entry(bookmark, entry);
-
-		list_del(&bookmark->entry);
-		bookmark->flags = 0;
-	} else
-		curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
+	curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
 
 	if (&curr->entry == &wq_head->head)
 		return nr_exclusive;
@@ -102,21 +87,11 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
 		unsigned flags = curr->flags;
 		int ret;
 
-		if (flags & WQ_FLAG_BOOKMARK)
-			continue;
-
 		ret = curr->func(curr, mode, wake_flags, key);
 		if (ret < 0)
 			break;
 		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
 			break;
-
-		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
-				(&next->entry != &wq_head->head)) {
-			bookmark->flags = WQ_FLAG_BOOKMARK;
-			list_add_tail(&bookmark->entry, &next->entry);
-			break;
-		}
 	}
 
 	return nr_exclusive;
@@ -126,19 +101,11 @@ static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int
 			int nr_exclusive, int wake_flags, void *key)
 {
 	unsigned long flags;
-	wait_queue_entry_t bookmark;
 
-	bookmark.flags = 0;
-	bookmark.private = NULL;
-	bookmark.func = NULL;
-	INIT_LIST_HEAD(&bookmark.entry);
-
-	do {
-		spin_lock_irqsave(&wq_head->lock, flags);
-		nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
-						wake_flags, key, &bookmark);
-		spin_unlock_irqrestore(&wq_head->lock, flags);
-	} while (bookmark.flags & WQ_FLAG_BOOKMARK);
+	spin_lock_irqsave(&wq_head->lock, flags);
+	nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
+					wake_flags, key);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 
 /**
@@ -163,23 +130,16 @@ EXPORT_SYMBOL(__wake_up);
  */
 void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr)
 {
-	__wake_up_common(wq_head, mode, nr, 0, NULL, NULL);
+	__wake_up_common(wq_head, mode, nr, 0, NULL);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
 void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
 {
-	__wake_up_common(wq_head, mode, 1, 0, key, NULL);
+	__wake_up_common(wq_head, mode, 1, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
-void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
-		unsigned int mode, void *key, wait_queue_entry_t *bookmark)
-{
-	__wake_up_common(wq_head, mode, 1, 0, key, bookmark);
-}
-EXPORT_SYMBOL_GPL(__wake_up_locked_key_bookmark);
-
 /**
  * __wake_up_sync_key - wake up threads blocked on a waitqueue.
  * @wq_head: the waitqueue
@@ -225,7 +185,7 @@ EXPORT_SYMBOL_GPL(__wake_up_sync_key);
 void __wake_up_locked_sync_key(struct wait_queue_head *wq_head,
 			       unsigned int mode, void *key)
 {
-	__wake_up_common(wq_head, mode, 1, WF_SYNC, key, NULL);
+	__wake_up_common(wq_head, mode, 1, WF_SYNC, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_sync_key);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread
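For readers following along: the "wait and set thing" referenced in the quoted
text above is the idea that the task starting writeback would claim
PG_writeback in the same atomic step in which it observes the bit clear, so a
writer that wins the race cannot leave every other folio_wait_writeback()
caller re-queueing behind a freshly re-set bit. A minimal sketch of that idea
follows; folio_test_set_writeback() is a hypothetical helper invented purely
for illustration, and none of this is part of the patches above.

```c
/*
 * Illustrative sketch only -- not part of the attached patches.
 * folio_test_set_writeback() is a hypothetical atomic test-and-set of
 * PG_writeback; folio_wait_bit() is the existing wait primitive.
 */
static void folio_wait_and_start_writeback(struct folio *folio)
{
	/* Sleep until we are the task that turns the writeback bit on. */
	while (folio_test_set_writeback(folio))
		folio_wait_bit(folio, PG_writeback);
}
```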
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-18 14:45 ` Matthew Wilcox
@ 2022-03-18 18:58 ` Linus Torvalds
  2022-10-20 1:35 ` Dan Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-03-18 18:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Fri, Mar 18, 2022 at 7:45 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Excellent! I'm going to propose these two patches for -rc1 (I don't
> think we want to be playing with this after -rc8)

Ack. I think your commit message may be a bit too optimistic (who knows
if other loads can trigger the over-long page locking wait-queue
latencies), but since I don't see any other ways to really check this
than just trying it, let's do it.

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-18 18:58 ` Linus Torvalds
@ 2022-10-20 1:35 ` Dan Williams
  2022-10-23 22:38 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Dan Williams @ 2022-10-20 1:35 UTC (permalink / raw)
To: Linus Torvalds, Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

Linus Torvalds wrote:
> On Fri, Mar 18, 2022 at 7:45 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Excellent! I'm going to propose these two patches for -rc1 (I don't
> > think we want to be playing with this after -rc8)
>
> Ack. I think your commit message may be a bit too optimistic (who
> knows if other loads can trigger the over-long page locking wait-queue
> latencies), but since I don't see any other ways to really check this
> than just trying it, let's do it.
>
> Linus

A report from a tester with this call trace:

watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40
[..]
Call Trace:
 <TASK>
 folio_wake_bit+0x8a/0x110
 folio_end_writeback+0x37/0x80
 ext4_finish_bio+0x19a/0x270
 ext4_end_bio+0x47/0x140
 blk_update_request+0x112/0x410

...led me to this thread. This was after I had them force all softirqs
to run in ksoftirqd context, and run with rq_affinity == 2 to force
I/O completion work to throttle new submissions.

Willy, are these headed upstream:

https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org

...or am I missing an alternate solution posted elsewhere?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-20 1:35 ` Dan Williams
@ 2022-10-23 22:38 ` Linus Torvalds
  2022-10-24 19:39 ` Tim Chen
  2022-10-24 20:13 ` Dan Williams
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-10-23 22:38 UTC (permalink / raw)
To: Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Oct 19, 2022 at 6:35 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> A report from a tester with this call trace:
>
> watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
> RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]

Whee.

> ...lead me to this thread. This was after I had them force all softirqs
> to run in ksoftirqd context, and run with rq_affinity == 2 to force
> I/O completion work to throttle new submissions.
>
> Willy, are these headed upstream:
>
> https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org
>
> ...or I am missing an alternate solution posted elsewhere?

Can your reporter test that patch? I think it should still apply
pretty much as-is.. And if we actually had somebody who had a
test-case that was literally fixed by getting rid of the old bookmark
code, that would make applying that patch a no-brainer.

The problem is that the original load that caused us to do that thing
in the first place isn't repeatable because it was special production
code - so removing that bookmark code because we _think_ it now hurts
more than it helps is kind of a big hurdle.

But if we had some hard confirmation from somebody that "yes, the
bookmark code is now hurting", that would make it a lot more palatable
to just remove the code that we just _think_ probably isn't needed
any more..

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-23 22:38 ` Linus Torvalds
@ 2022-10-24 19:39 ` Tim Chen
  2022-10-24 19:43 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Tim Chen @ 2022-10-24 19:39 UTC (permalink / raw)
To: Linus Torvalds, Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Sun, 2022-10-23 at 15:38 -0700, Linus Torvalds wrote:
> On Wed, Oct 19, 2022 at 6:35 PM Dan Williams
> <dan.j.williams@intel.com> wrote:
> >
> > A report from a tester with this call trace:
> >
> > watchdog: BUG: soft lockup - CPU#127 stuck for 134s!
> > [ksoftirqd/127:782]
> > RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]
>
> Whee.
>
> > ...lead me to this thread. This was after I had them force all
> > softirqs
> > to run in ksoftirqd context, and run with rq_affinity == 2 to force
> > I/O completion work to throttle new submissions.
> >
> > Willy, are these headed upstream:
> >
> > https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org
> >
> > ...or I am missing an alternate solution posted elsewhere?
>
> Can your reporter test that patch? I think it should still apply
> pretty much as-is.. And if we actually had somebody who had a
> test-case that was literally fixed by getting rid of the old bookmark
> code, that would make applying that patch a no-brainer.
>
> The problem is that the original load that caused us to do that thing
> in the first place isn't repeatable because it was special production
> code - so removing that bookmark code because we _think_ it now hurts
> more than it helps is kind of a big hurdle.
>
> But if we had some hard confirmation from somebody that "yes, the
> bookmark code is now hurting", that would make it a lot more
> palatable
> to just remove the code that we just _think_ that probably isn't
> needed any more..
>
>

I do think that the original locked page on migration problem was fixed
by commit 9a1ea439b16b. Unfortunately the customer did not respond to us
when we asked them to test their workload when that patch went into the
mainline.

I have no objection to Matthew's fix to remove the bookmark code, now
that it is causing problems with this scenario that I didn't anticipate
in my original code.

Tim

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 19:39 ` Tim Chen
@ 2022-10-24 19:43 ` Linus Torvalds
  2022-10-24 20:14 ` Dan Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-10-24 19:43 UTC (permalink / raw)
To: Tim Chen
Cc: Dan Williams, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Mon, Oct 24, 2022 at 12:39 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> I do think that the original locked page on migration problem was fixed
> by commit 9a1ea439b16b. Unfortunately the customer did not respond to
> us when we asked them to test their workload when that patch went
> into the mainline.

Oh well.

> I don't have objection to Matthew's fix to remove the bookmark code,
> now that it is causing problems with this scenario that I didn't
> anticipate in my original code.

I'd really like to avoid *another* "we can't actually verify that this
helps" change in this area, so I'm hoping that the reporter that Dan
was talking to could test that patch.

Otherwise we're kind of going back-and-forth based on "this might fix
things", which just feels really fragile and reminds me of the bad old
days when we had the "one step forward, two steps back" dance with
some of the suspend/resume issues.

I realize that this code needs some extreme loads (and likely pretty
special hardware too) to actually become problematic, so testing is
_always_ going to be a bit of a problem, but still...

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 19:43 ` Linus Torvalds
@ 2022-10-24 20:14 ` Dan Williams
  0 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2022-10-24 20:14 UTC (permalink / raw)
To: Linus Torvalds, Tim Chen
Cc: Dan Williams, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

Linus Torvalds wrote:
> On Mon, Oct 24, 2022 at 12:39 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > I do think that the original locked page on migration problem was fixed
> > by commit 9a1ea439b16b. Unfortunately the customer did not respond to
> > us when we asked them to test their workload when that patch went
> > into the mainline.
>
> Oh well.
>
> > I don't have objection to Matthew's fix to remove the bookmark code,
> > now that it is causing problems with this scenario that I didn't
> > anticipate in my original code.
>
> I'd really like to avoid *another* "we can't actually verify that this
> helps" change in this area, so I'm hoping that the reporter that Dan
> was talking to could test that patch.

Oh, sorry, I had typed up that reply and contacted Tim offline, but
forgot to send, now sent.

> Otherwise we're kind of going back-and-forth based on "this might fix
> things", which just feels really fragile and reminds me of the bad old
> days when we had the "one step forward, two steps back" dance with
> some of the suspend/resume issues.
>
> I realize that this code needs some extreme loads (and likely pretty
> special hardware too) to actually become problematic, so testing is
> _always_ going to be a bit of a problem, but still...

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-23 22:38 ` Linus Torvalds
  2022-10-24 19:39 ` Tim Chen
@ 2022-10-24 20:13 ` Dan Williams
  2022-10-24 20:28 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2022-10-24 20:13 UTC (permalink / raw)
To: Linus Torvalds, Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

[ add Tim and Arechiga ]

Linus Torvalds wrote:
> On Wed, Oct 19, 2022 at 6:35 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > A report from a tester with this call trace:
> >
> > watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
> > RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]
>
> Whee.
>
> > ...lead me to this thread. This was after I had them force all softirqs
> > to run in ksoftirqd context, and run with rq_affinity == 2 to force
> > I/O completion work to throttle new submissions.
> >
> > Willy, are these headed upstream:
> >
> > https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org
> >
> > ...or I am missing an alternate solution posted elsewhere?
>
> Can your reporter test that patch? I think it should still apply
> pretty much as-is.. And if we actually had somebody who had a
> test-case that was literally fixed by getting rid of the old bookmark
> code, that would make applying that patch a no-brainer.
>
> The problem is that the original load that caused us to do that thing
> in the first place isn't repeatable because it was special production
> code - so removing that bookmark code because we _think_ it now hurts
> more than it helps is kind of a big hurdle.
>
> But if we had some hard confirmation from somebody that "yes, the
> bookmark code is now hurting", that would make it a lot more palatable
> to just remove the code that we just _think_ that probably isn't
> needed any more..

Arechiga reports that his test case that failed "fast" before now ran
for 28 hours without a soft lockup report with the proposed patches
applied. So, I would consider those:

Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>

I notice that the original commit:

11a19c7b099f sched/wait: Introduce wakeup boomark in wake_up_page_bit

...was trying to fix waitqueue lock contention. The general approach of
setting a bookmark and taking a break "could" work, but in this case it
would need to do something like return -EWOULDBLOCK and let ksoftirqd
fall into its cond_resched() retry path. However, that would require
plumbing the bookmark up several levels, not to mention the other
folio_wake_bit() callers that do not have a convenient place to do
cond_resched(). So I think this test case has successfully found a way
that waitqueue lock contention cannot be improved.

^ permalink raw reply	[flat|nested] 27+ messages in thread
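To make the plumbing problem described above concrete, here is a rough,
purely hypothetical sketch of what the bottom-most level might look like,
reusing the bookmark API that the patches earlier in this thread remove; the
function name is invented for illustration, and the hard part is everything
above it.

```c
/*
 * Hypothetical sketch only: a single bounded wakeup pass that tells the
 * caller whether waiters remain, so a task-context caller (ksoftirqd,
 * for example) could cond_resched() and call it again.  It reuses the
 * pre-removal bookmark API shown above; the function name is invented.
 */
static bool folio_wake_bit_once(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	wait_queue_entry_t bookmark;
	unsigned long flags;

	key.folio = folio;
	key.bit_nr = bit_nr;
	key.page_match = 0;

	bookmark.flags = 0;
	bookmark.private = NULL;
	bookmark.func = NULL;
	INIT_LIST_HEAD(&bookmark.entry);

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	spin_unlock_irqrestore(&q->lock, flags);

	/* true: the walk stopped early and waiters are still queued */
	return bookmark.flags & WQ_FLAG_BOOKMARK;
}
```

Propagating that return value up through folio_end_writeback() and the bio
completion paths above it, and teaching each caller to retry, is the plumbing
that has no convenient home, which is the point being made above.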
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:13 ` Dan Williams
@ 2022-10-24 20:28 ` Linus Torvalds
  2022-10-24 20:35 ` Dan Williams
  2022-10-25 19:19 ` Matthew Wilcox
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-10-24 20:28 UTC (permalink / raw)
To: Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Arechiga reports that his test case that failed "fast" before now ran
> for 28 hours without a soft lockup report with the proposed patches
> applied. So, I would consider those:
>
> Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>

Ok, great.

I really like that patch myself (and obviously liked it back when it
was originally proposed), but I think it was always held back by the
fact that we didn't really have any hard data for it.

It does sound like we now very much have hard data for "the page
waitlist complexity is now a bigger problem than the historical
problem it tried to solve".

So I'll happily apply it. The only question is whether it's a "let's
do this for 6.2", or if it's something that we'd want to back-port
anyway, and might as well apply sooner rather than later as a fix.

I think that in turn then depends on just how artificial the test case
was. If the test case was triggered by somebody seeing problems in
real life loads, that would make the urgency a lot higher. But if it
was purely a synthetic test case with no accompanying "this is what
made us look at this" problem, it might be a 6.2 thing.

Arechiga?

Also, considering that Willy authored the patch (even if it's really
just a "remove this whole code logic"), maybe he has opinions? Willy?

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:28 ` Linus Torvalds
@ 2022-10-24 20:35 ` Dan Williams
  2022-10-25 15:58 ` Arechiga Lopez, Jesus A
  1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2022-10-24 20:35 UTC (permalink / raw)
To: Linus Torvalds, Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

Linus Torvalds wrote:
> On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Arechiga reports that his test case that failed "fast" before now ran
> > for 28 hours without a soft lockup report with the proposed patches
> > applied. So, I would consider those:
> >
> > Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>
>
> Ok, great.
>
> I really like that patch myself (and obviously liked it back when it
> was originally proposed), but I think it was always held back by the
> fact that we didn't really have any hard data for it.
>
> It does sound like we now very much have hard data for "the page
> waitlist complexity is now a bigger problem than the historical
> problem it tried to solve".
>
> So I'll happily apply it. The only question is whether it's a "let's
> do this for 6.2", or if it's something that we'd want to back-port
> anyway, and might as well apply sooner rather than later as a fix.
>
> I think that in turn then depends on just how artificial the test case
> was. If the test case was triggered by somebody seeing problems in
> real life loads, that would make the urgency a lot higher. But if it
> was purely a synthetic test case with no accompanying "this is what
> made us look at this" problem, it might be a 6.2 thing.
>
> Arechiga?

I will let Arechiga reply as well, but my sense is that this is more in
the latter camp of not urgent because the test case is trying to
generate platform stress (success!), not necessarily trying to get real
work done.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* RE: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:35 ` Dan Williams
@ 2022-10-25 15:58 ` Arechiga Lopez, Jesus A
  0 siblings, 0 replies; 27+ messages in thread
From: Arechiga Lopez, Jesus A @ 2022-10-25 15:58 UTC (permalink / raw)
To: Williams, Dan J, Torvalds, Linus
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, tim.c.chen@linux.intel.com, Gross, Mark

> -----Original Message-----
> From: Williams, Dan J <dan.j.williams@intel.com>
> Sent: Monday, October 24, 2022 3:36 PM
> To: Torvalds, Linus <torvalds@linux-foundation.org>; Williams, Dan J
> <dan.j.williams@intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>; Brian Foster
> <bfoster@redhat.com>; Linux-MM <linux-mm@kvack.org>; linux-fsdevel
> <linux-fsdevel@vger.kernel.org>; linux-xfs <linux-xfs@vger.kernel.org>;
> Hugh Dickins <hughd@google.com>; Arechiga Lopez, Jesus A
> <jesus.a.arechiga.lopez@intel.com>; tim.c.chen@linux.intel.com
> Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
>
> Linus Torvalds wrote:
> > On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@intel.com>
> wrote:
> > >
> > > Arechiga reports that his test case that failed "fast" before now
> > > ran for 28 hours without a soft lockup report with the proposed
> > > patches applied. So, I would consider those:
> > >
> > > Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>
> >
> > Ok, great.
> >
> > I really like that patch myself (and obviously liked it back when it
> > was originally proposed), but I think it was always held back by the
> > fact that we didn't really have any hard data for it.
> >
> > It does sound like we now very much have hard data for "the page
> > waitlist complexity is now a bigger problem than the historical
> > problem it tried to solve".
> >
> > So I'll happily apply it. The only question is whether it's a "let's
> > do this for 6.2", or if it's something that we'd want to back-port
> > anyway, and might as well apply sooner rather than later as a fix.
> >
> > I think that in turn then depends on just how artificial the test case
> > was. If the test case was triggered by somebody seeing problems in
> > real life loads, that would make the urgency a lot higher. But if it
> > was purely a synthetic test case with no accompanying "this is what
> > made us look at this" problem, it might be a 6.2 thing.
> >
> > Arechiga?
>
> I will let Arechiga reply as well, but my sense is that this is more in the latter
> camp of not urgent because the test case is trying to generate platform
> stress (success!), not necessarily trying to get real work done.

Yes, as Dan mentioned it is trying to generate platform stress. We've
been seeing the soft lockup events on test targets (2 sockets with high
core count CPUs, and a lot of RAM). The workload stresses every core/CPU
thread in various ways and logs results to a shared log file (every core
writing to the same log file). We found that this shared log file was
related to the soft lockups.

With this change applied to 5.19 it seems that the soft lockups are no
longer happening with this workload + configuration.

^ permalink raw reply	[flat|nested] 27+ messages in thread
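For what it's worth, a tiny userspace sketch in the spirit of the workload
described above (many processes appending to, and fsync()ing, one shared log
file so that lots of tasks pile up on the same PG_writeback waitqueue) might
look like the following. The file name and the process/iteration counts are
arbitrary choices for illustration, not the actual test.

```c
/*
 * Rough sketch of a synthetic "shared log file" stressor: NPROC children
 * each append records to the same file and fsync() after every write.
 * Parameters are arbitrary; this is not the reporter's actual workload.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 128
#define ITERS 100000

int main(void)
{
	int fd = open("shared.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
	char buf[256];

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));

	for (int i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			for (int j = 0; j < ITERS; j++) {
				if (write(fd, buf, sizeof(buf)) < 0)
					_exit(1);
				fsync(fd);	/* waits on writeback of the shared file */
			}
			_exit(0);
		}
	}

	/* Reap all children before exiting. */
	while (wait(NULL) > 0)
		;
	return 0;
}
```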
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:28 ` Linus Torvalds
  2022-10-24 20:35 ` Dan Williams
@ 2022-10-25 19:19 ` Matthew Wilcox
  2022-10-25 19:20 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-10-25 19:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dan Williams, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

On Mon, Oct 24, 2022 at 01:28:31PM -0700, Linus Torvalds wrote:
> It does sound like we now very much have hard data for "the page
> waitlist complexity is now a bigger problem than the historical
> problem it tried to solve".
>
> So I'll happily apply it. The only question is whether it's a "let's
> do this for 6.2", or if it's something that we'd want to back-port
> anyway, and might as well apply sooner rather than later as a fix.
>
> I think that in turn then depends on just how artificial the test case
> was. If the test case was triggered by somebody seeing problems in
> real life loads, that would make the urgency a lot higher. But if it
> was purely a synthetic test case with no accompanying "this is what
> made us look at this" problem, it might be a 6.2 thing.
>
> Arechiga?
>
> Also, considering that Willy authored the patch (even if it's really
> just a "remove this whole code logic"), maybe he has opinions? Willy?

I've been carrying a pair of patches in my tree to rip out the wait
bookmarks since March, waiting for me to write a userspace testcase to
reproduce the problem against v4.13 and then check it no longer does so
against v5.0. Unfortunately, that hasn't happened. I'm happy to add
Arechiga's Tested-by, and submit them to Andrew and have him bring them
into v6.2, since this doesn't seem urgent?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-25 19:19 ` Matthew Wilcox
@ 2022-10-25 19:20 ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-10-25 19:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dan Williams, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

On Tue, Oct 25, 2022 at 12:19 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> I've been carrying a pair of patches in my tree to rip out the wait
> bookmarks since March, waiting for me to write a userspace testcase to
> reproduce the problem against v4.13 and then check it no longer does so
> against v5.0. Unfortunately, that hasn't happened. I'm happy to add
> Arechiga's Tested-by, and submit them to Andrew and have him bring them
> into v6.2, since this doesn't seem urgent?

Ack, sounds good to me.

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2022-10-25 19:21 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2022-03-15 19:07 writeback completion soft lockup BUG in folio_wake_bit() Brian Foster
2022-03-16 20:59 ` Matthew Wilcox
2022-03-16 23:35 ` Linus Torvalds
2022-03-17 15:04 ` Matthew Wilcox
2022-03-17 19:26 ` Linus Torvalds
2022-03-17 21:16 ` Matthew Wilcox
2022-03-17 22:52 ` Dave Chinner
2022-03-18 13:16 ` Jan Kara
2022-03-18 18:56 ` Linus Torvalds
2022-03-19 16:23 ` Theodore Ts'o
2022-03-30 15:55 ` Christoph Hellwig
2022-03-17 15:31 ` Brian Foster
2022-03-17 13:51 ` Brian Foster
2022-03-18 14:14 ` Brian Foster
2022-03-18 14:45 ` Matthew Wilcox
2022-03-18 18:58 ` Linus Torvalds
2022-10-20  1:35 ` Dan Williams
2022-10-23 22:38 ` Linus Torvalds
2022-10-24 19:39 ` Tim Chen
2022-10-24 19:43 ` Linus Torvalds
2022-10-24 20:14 ` Dan Williams
2022-10-24 20:13 ` Dan Williams
2022-10-24 20:28 ` Linus Torvalds
2022-10-24 20:35 ` Dan Williams
2022-10-25 15:58 ` Arechiga Lopez, Jesus A
2022-10-25 19:19 ` Matthew Wilcox
2022-10-25 19:20 ` Linus Torvalds