* writeback completion soft lockup BUG in folio_wake_bit()
@ 2022-03-15 19:07 Brian Foster
2022-03-16 20:59 ` Matthew Wilcox
0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2022-03-15 19:07 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-xfs; +Cc: Linus Torvalds, Hugh Dickins
Hi,
I've been chasing down an issue recently that results in the following
soft lockup warning in XFS writeback (iomap_ioend) completion:
watchdog: BUG: soft lockup - CPU#42 stuck for 208s! [kworker/42:0:52508]
...
CPU: 42 PID: 52508 Comm: kworker/42:0 Tainted: G S L 5.17.0-rc8 #5
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-conv/dm-0 xfs_end_io [xfs]
RIP: 0010:_raw_spin_unlock_irqrestore+0x1a/0x31
Code: 74 01 c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 74 01 fb bf 01 00 00 00 <e8> f1 6e 6a ff 65 8b 05 7a e3 1a 63 85 c0 74 01 c3 0f 1f 44 00 00
RSP: 0018:ff4a81e4a8a37ce8 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffffff9de08b28 RCX: ff4a81e4a6cfbd00
RDX: ff4a81e4a6cfbd00 RSI: 0000000000000246 RDI: 0000000000000001
RBP: 0000000000000246 R08: ff4a81e4a6cfbd00 R09: 000000000002f8c0
R10: 00006dfe673f3ac5 R11: 00000000fa83b2da R12: 0000000000000fa8
R13: ffcb6e504b14e640 R14: 0000000000000000 R15: 0000000000001000
FS: 0000000000000000(0000) GS:ff1a26083fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000888000 CR3: 0000000790f4a006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
folio_wake_bit+0x8a/0x110
folio_end_writeback+0x37/0x80
iomap_finish_ioend+0xc6/0x330
iomap_finish_ioends+0x93/0xd0
xfs_end_ioend+0x5e/0x150 [xfs]
xfs_end_io+0xaf/0xe0 [xfs]
process_one_work+0x1c5/0x390
? process_one_work+0x390/0x390
worker_thread+0x30/0x350
? process_one_work+0x390/0x390
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
This is reproduced via repeated iterations of the LTP diotest3 test [1]
on a 96-CPU system. This usually first occurs in a transient manner,
dumping two or three warnings before the system recovers and the test
continues, but it eventually reproduces a full-on livelock in this same
path. I think I've narrowed the root cause of the warning to commit
c2407cf7d22d0 ("mm: make wait_on_page_writeback() wait for multiple
pending writebacks") (last discussed here[2] afaics). The first thing to
note is that I think the dio aspect of the test program is not
important. The test in this case creates 100 tasks that each run a
"buffered write -> fsync -> dio read" loop on an isolated range of the
test file, where the buffered writes and fsyncs seem to be the primary
contributing factors.
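
Roughly speaking, each child effectively does something like the
following on its own slice of the file (an illustrative sketch only, not
the actual LTP diotest3 source; sizes, offsets and error handling are
simplified and hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* per-child loop: buffered write -> fsync -> dio read of a private range */
static void child_loop(const char *path, off_t off, size_t bufsize, int iters)
{
	int bfd = open(path, O_RDWR);			/* buffered I/O fd */
	int dfd = open(path, O_RDONLY | O_DIRECT);	/* direct I/O fd */
	void *buf;

	if (posix_memalign(&buf, 4096, bufsize))
		return;
	memset(buf, 0xaa, bufsize);

	for (int i = 0; i < iters; i++) {
		pwrite(bfd, buf, bufsize, off);	/* dirty the page cache */
		fsync(bfd);			/* waits on PG_writeback */
		pread(dfd, buf, bufsize, off);	/* dio read of the same range */
	}

	free(buf);
	close(bfd);
	close(dfd);
}
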
What seems to happen is that the majority of the fsync calls end up
waiting on writeback of a particular page. The wakeup of the writeback
bit on that page wakes a task that immediately sets PG_writeback on the
page again, such that N other folio_wait_writeback() waiters see the bit
still set and immediately place themselves back onto the tail of the
wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
in folio_wake_bit() (backing off the lock for a cycle or so in each
iteration) only to find the same bunch of tasks in the queue. This
process repeats for long enough to trigger the soft
lockup warning. I've confirmed this spinning behavior with a tracepoint
in the bookmark loop that indicates we're stuck for many hundreds of
thousands of iterations (at least) of this loop when the soft lockup
warning triggers.
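
For reference, the waker-side loop in question looks roughly like this
(abridged from mm/filemap.c as of 5.17; unrelated details omitted):

static void folio_wake_bit(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	unsigned long flags;
	wait_queue_entry_t bookmark;
	...
	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);

	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		/*
		 * Take a breather from holding the lock, allow pages
		 * that finish wake up asynchronously to acquire the
		 * lock and remove themselves from wait queue
		 */
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	}
	...
}
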
As to the full livelock variant, I think we end up with a
write_cache_pages() task that is woken while waiting for page P,
immediately sets PG_writeback again on P and moves on to P+1. P+1 is
also already under
writeback, so write_cache_pages() blocks here waiting on that. Page P
has still not been submitted for I/O because XFS/iomap batches I/O
submissions across multiple ->writepage() callbacks. The waker task
spinloop occurs on page P as described above, but since the waker task
is the XFS workqueue task associated with ioend completion on the inode
(because these are delalloc writes that require unwritten extent
conversion on completion), it will never get to waking the waiters on page P+1, and
we're stuck for good. IOW:
1. Page P is stuck in PG_writeback on a queue awaiting I/O submission.
2. The task submitting P is blocked in write_cache_pages() waiting on
PG_writeback of page P+1 (see the snippet below).
3. The waker task responsible for waking P+1 is stuck spinning, waking
tasks for a previous PG_writeback state on page P. That wait queue never
drains, however, because it consists of a large number of tasks (>
WAITQUEUE_WALK_BREAK_CNT) inserted via folio_wait_writeback() that will
never break out of their wait loop for P (because of step 1).
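
For reference, the write_cache_pages() side of steps 1-2 is roughly the
following (abridged from mm/page-writeback.c as of 5.17; this is the
"wait loop and BUG_ON()" mentioned further down):

		if (PageWriteback(page)) {
			if (wbc->sync_mode != WB_SYNC_NONE)
				wait_on_page_writeback(page);
			else
				goto continue_unlock;
		}

		BUG_ON(PageWriteback(page));
		if (!clear_page_dirty_for_io(page))
			goto continue_unlock;

		/* ... followed by the ->writepage() call ... */
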
I've run a few quick experiments to try and corroborate this analysis.
The problem goes away completely if I either back out the loop change in
folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
like 128 (i.e. greater than the total possible number of waiter tasks in
this test). I've also played a few games with bookmark behavior mostly
out of curiosity, but usually end up introducing other problems like
missed wakeups, etc. I've not been able to reproduce this problem on
ext4; I suspect that's due to sufficiently different writeback/completion
batching behavior. I was thinking about whether skipping writeback of
PG_writeback pages after an explicit wait (removing the wait loop and
BUG_ON()) might avoid this problem and the one the loop is intended to
fix, but I'm not sure that would be safe. Thoughts or ideas?
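
(For reference, the post-c2407cf7d22d wait side I'm referring to is
roughly:

void folio_wait_writeback(struct folio *folio)
{
	while (folio_test_writeback(folio)) {
		trace_folio_wait_writeback(folio, folio_mapping(folio));
		folio_wait_bit(folio, PG_writeback);
	}
}

i.e. backing out the loop change means turning that "while" back into an
"if".)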
Brian
[1] while [ true ]; do TMPDIR=<xfsdir> diotest3 -b 65536 -n 100 -i 100 -o 1024000; done
[2] https://lore.kernel.org/linux-mm/000000000000886dbd05b7ffa8db@google.com/
^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-15 19:07 writeback completion soft lockup BUG in folio_wake_bit() Brian Foster
@ 2022-03-16 20:59 ` Matthew Wilcox
  2022-03-16 23:35   ` Linus Torvalds
  2022-03-17 13:51   ` Brian Foster
  0 siblings, 2 replies; 27+ messages in thread
From: Matthew Wilcox @ 2022-03-16 20:59 UTC (permalink / raw)
To: Brian Foster
Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins

On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote:
> What seems to happen is that the majority of the fsync calls end up
> waiting on writeback of a particular page, the wakeup of the writeback
> bit on that page wakes a task that immediately resets PG_writeback on
> the page such that N other folio_wait_writeback() waiters see the bit
> still set and immediately place themselves back onto the tail of the
> wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
> in folio_wake_bit() (backing off the lock for a cycle or so in each
> iteration) only to find the same bunch of tasks in the queue. This
> process repeats for a long enough amount of time to trigger the soft
> lockup warning. I've confirmed this spinning behavior with a tracepoint
> in the bookmark loop that indicates we're stuck for many hundreds of
> thousands of iterations (at least) of this loop when the soft lockup
> warning triggers.

[...]

> I've run a few quick experiments to try and corroborate this analysis.
> The problem goes away completely if I either back out the loop change in
> folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
> like 128 (i.e. greater than the total possible number of waiter tasks in
> this test). I've also played a few games with bookmark behavior mostly
> out of curiosity, but usually end up introducing other problems like
> missed wakeups, etc.

As I recall, the bookmark hack was introduced in order to handle
lock_page() problems.  It wasn't really supposed to handle writeback,
but nobody thought it would cause any harm (and indeed, it didn't at the
time).  So how about we only use bookmarks for lock_page(), since
lock_page() usually doesn't have the multiple-waker semantics that
writeback has?

(this is more in the spirit of "minimal patch" -- I think initialising
the bookmark should be moved to folio_unlock()).
diff --git a/mm/filemap.c b/mm/filemap.c index b2728eb52407..9ee3c5f1f489 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, return (flags & WQ_FLAG_EXCLUSIVE) != 0; } -static void folio_wake_bit(struct folio *folio, int bit_nr) +static void folio_wake_bit(struct folio *folio, int bit_nr, + wait_queue_entry_t *bookmark) { wait_queue_head_t *q = folio_waitqueue(folio); struct wait_page_key key; unsigned long flags; - wait_queue_entry_t bookmark; key.folio = folio; key.bit_nr = bit_nr; key.page_match = 0; - bookmark.flags = 0; - bookmark.private = NULL; - bookmark.func = NULL; - INIT_LIST_HEAD(&bookmark.entry); + if (bookmark) { + bookmark->flags = 0; + bookmark->private = NULL; + bookmark->func = NULL; + INIT_LIST_HEAD(&bookmark->entry); + } spin_lock_irqsave(&q->lock, flags); - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); - while (bookmark.flags & WQ_FLAG_BOOKMARK) { + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { /* * Take a breather from holding the lock, * allow pages that finish wake up asynchronously @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) spin_unlock_irqrestore(&q->lock, flags); cpu_relax(); spin_lock_irqsave(&q->lock, flags); - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); } /* @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) { if (!folio_test_waiters(folio)) return; - folio_wake_bit(folio, bit); + folio_wake_bit(folio, bit, NULL); } /* @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem */ void folio_unlock(struct folio *folio) { + wait_queue_entry_t bookmark; + /* Bit 7 allows x86 to check the byte's sign bit */ BUILD_BUG_ON(PG_waiters != 7); BUILD_BUG_ON(PG_locked > 7); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) - folio_wake_bit(folio, PG_locked); + folio_wake_bit(folio, PG_locked, &bookmark); } EXPORT_SYMBOL(folio_unlock); @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) { VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); - folio_wake_bit(folio, PG_private_2); + folio_wake_bit(folio, PG_private_2, NULL); folio_put(folio); } EXPORT_SYMBOL(folio_end_private_2); ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-16 20:59 ` Matthew Wilcox
@ 2022-03-16 23:35 ` Linus Torvalds
  2022-03-17 15:04   ` Matthew Wilcox
  2022-03-17 15:31   ` Brian Foster
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-03-16 23:35 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> As I recall, the bookmark hack was introduced in order to handle
> lock_page() problems.  It wasn't really supposed to handle writeback,
> but nobody thought it would cause any harm (and indeed, it didn't at the
> time).  So how about we only use bookmarks for lock_page(), since
> lock_page() usually doesn't have the multiple-waker semantics that
> writeback has?

I was hoping that some of the page lock problems are gone and we could
maybe try to get rid of the bookmarks entirely.

But the page lock issues only ever showed up on some private
proprietary load and machine, so we never really got confirmation that
they are fixed. There were lots of strong signs to them being related
to the migration page locking, and it may be that the bookmark code is
only hurting these days.

See for example commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated") which doesn't
actually change the *locking* side, but drops the page reference when
waiting for the locked page to be unlocked, which in turn removes a
"loop and try again when migration". And that may have been the real
_fix_ for the problem.

Because while the bookmark thing avoids the NMI lockup detector firing
due to excessive hold times, the bookmarking also _causes_ that "we
now will see the same page multiple times because we dropped the lock
and somebody re-added it at the end of the queue" issue. Which seems
to be the problem here.

Ugh. I wish we had some way to test "could we just remove the bookmark
code entirely again".

Of course, the PG_lock case also works fairly hard to not actually
remove and re-add the lock waiter to the queue, but having an actual
"wait for and get the lock" operation. The writeback bit isn't done
that way.

I do hate how we had to make folio_wait_writeback{_killable}() use
"while" rather than an "if". It *almost* works with just a "wait for
current writeback", but not quite. See commit c2407cf7d22d ("mm: make
wait_on_page_writeback() wait for multiple pending writebacks") for
why we have to loop. Ugly, ugly.

Because I do think that "while" in the writeback waiting is a problem.
Maybe _the_ problem.

            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-16 23:35 ` Linus Torvalds
@ 2022-03-17 15:04   ` Matthew Wilcox
  2022-03-17 19:26     ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-03-17 15:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Mar 16, 2022 at 04:35:10PM -0700, Linus Torvalds wrote:
> On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > As I recall, the bookmark hack was introduced in order to handle
> > lock_page() problems.  It wasn't really supposed to handle writeback,
> > but nobody thought it would cause any harm (and indeed, it didn't at the
> > time).  So how about we only use bookmarks for lock_page(), since
> > lock_page() usually doesn't have the multiple-waker semantics that
> > writeback has?
>
> I was hoping that some of the page lock problems are gone and we could
> maybe try to get rid of the bookmarks entirely.
>
> But the page lock issues only ever showed up on some private
> proprietary load and machine, so we never really got confirmation that
> they are fixed. There were lots of strong signs to them being related
> to the migration page locking, and it may be that the bookmark code is
> only hurting these days.

Ah, I found Tim's mail describing the workload:
https://lore.kernel.org/all/a9e74f64-dee6-dc23-128e-8ef8c7383d77@linux.intel.com/
Relevant part:

: They have a parent process that spawns off 10 children per core and
: kicked them to run. The child processes all access a common library.
: We have 384 cores so 3840 child processes running. When migration occur on
: a page in the common library, the first child that access the page will
: page fault and lock the page, with the other children also page faulting
: quickly and pile up in the page wait list, till the first child is done.

Seems like someone should be able to write a test app that does
something along those lines fairly easily.  The trick is finding a
giant system to run it on.  Although doing it on a smaller system with
more SW threads/CPU would probably be enough.

> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") which doesn't
> actually change the *locking* side, but drops the page reference when
> waiting for the locked page to be unlocked, which in turn removes a
> "loop and try again when migration". And that may have been the real
> _fix_ for the problem.

Actually, I think it fixes a livelock.  If you have hundreds (let alone
thousands) of threads all trying to migrate the same page, they're all
going to fail with the original code because migration can't succeed
with an elevated refcount, and each thread that is waiting holds a
refcount.  I had a similar problem trying to split a folio in ->readpage
with one of the xfstests that has 500 threads all trying to read() from
the same page.  That was solved by also using put_and_wait_on_page_locked()
in bd8a1f3655a7 ("mm/filemap: support readpage splitting a page").

> Because while the bookmark thing avoids the NMI lockup detector firing
> due to excessive hold times, the bookmarking also _causes_ that "we
> now will see the same page multiple times because we dropped the lock
> and somebody re-added it at the end of the queue" issue. Which seems
> to be the problem here.
>
> Ugh. I wish we had some way to test "could we just remove the bookmark
> code entirely again".

I think we should be safe to remove it entirely now.  Of course, that
may show somewhere else that now suffers from a similar livelock ...
but if it does, we ought to fix that anyway, because the bookmark code
is stopping the livelock problem from being seen.

> Of course, the PG_lock case also works fairly hard to not actually
> remove and re-add the lock waiter to the queue, but having an actual
> "wait for and get the lock" operation. The writeback bit isn't done
> that way.
>
> I do hate how we had to make folio_wait_writeback{_killable}() use
> "while" rather than an "if". It *almost* works with just a "wait for
> current writeback", but not quite. See commit c2407cf7d22d ("mm: make
> wait_on_page_writeback() wait for multiple pending writebacks") for
> why we have to loop. Ugly, ugly.
>
> Because I do think that "while" in the writeback waiting is a problem.
> Maybe _the_ problem.

I do like the idea of wait-and-set the writeback flag in the VFS
instead of leaving it to the individual filesystems.  I'll put it on my
list of things to do.

Actually ... I don't think that calling folio_start_writeback() twice
is a problem per se.  It's inefficient (cycling the i_pages lock,
walking the xarray, atomic bitoperations on the flag), but nothing goes
wrong.  Except that NFS and AFS check the return value and squawk.

So how about we do something like this:

 - Make folio_start_writeback() and set_page_writeback() return void,
   fixing up AFS and NFS.
 - Add a folio_wait_start_writeback() to use in the VFS
 - Remove the calls to set_page_writeback() in the filesystems

That keepwrite thing troubles me, and hints it might be more complicated
than that.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-17 15:04 ` Matthew Wilcox
@ 2022-03-17 19:26   ` Linus Torvalds
  2022-03-17 21:16     ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-03-17 19:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> So how about we do something like this:
>
>  - Make folio_start_writeback() and set_page_writeback() return void,
>    fixing up AFS and NFS.
>  - Add a folio_wait_start_writeback() to use in the VFS
>  - Remove the calls to set_page_writeback() in the filesystems

That sounds lovely, but it does worry me a bit. Not just the odd
'keepwrite' thing, but also the whole ordering between the folio bit
and the tagging bits. Does the ordering possibly matter?

That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in
only one place (the folio version isn't used at all):

  ext4_writepage():

        ext4_walk_page_buffers() fails:
                redirty_page_for_writepage(wbc, page);
                keep_towrite = true;
        ext4_bio_write_page().

which just looks odd. Why does it even try to continue to do the
writepage when the page buffer thing has failed?

In the regular write path (ie ext4_write_begin()), a
ext4_walk_page_buffers() failure is fatal or causes a retry). Why is
ext4_writepage() any different? Particularly since it wants to keep
the page dirty, then trying to do the writeback just seems wrong.

So this code is all a bit odd, I suspect there are decades of "people
continued to do what they historically did" changes, and it is all
worrisome.

Your cleanup sounds like the right thing, but I also think that
getting rid of that 'keepwrite' thing would also be the right thing.

And it all worries me.

            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 19:26 ` Linus Torvalds @ 2022-03-17 21:16 ` Matthew Wilcox 2022-03-17 22:52 ` Dave Chinner 2022-03-18 13:16 ` Jan Kara 0 siblings, 2 replies; 27+ messages in thread From: Matthew Wilcox @ 2022-03-17 21:16 UTC (permalink / raw) To: Linus Torvalds Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Jan Kara, linux-ext4 On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote: > On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > So how about we do something like this: > > > > - Make folio_start_writeback() and set_page_writeback() return void, > > fixing up AFS and NFS. > > - Add a folio_wait_start_writeback() to use in the VFS > > - Remove the calls to set_page_writeback() in the filesystems > > That sounds lovely, but it does worry me a bit. Not just the odd > 'keepwrite' thing, but also the whole ordering between the folio bit > and the tagging bits. Does the ordering possibly matter? I wouldn't change the ordering of setting the xarray bits and the writeback flag; they'd just be set a little earlier. It'd all be done while the page was still locked. But you're right, there's lots of subtle interactions here. > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in > only one place (the folio version isn't used at all): > > ext4_writepage(): > > ext4_walk_page_buffers() fails: > redirty_page_for_writepage(wbc, page); > keep_towrite = true; > ext4_bio_write_page(). > > which just looks odd. Why does it even try to continue to do the > writepage when the page buffer thing has failed? > > In the regular write path (ie ext4_write_begin()), a > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is > ext4_writepage() any different? Particularly since it wants to keep > the page dirty, then trying to do the writeback just seems wrong. > > So this code is all a bit odd, I suspect there are decades of "people > continued to do what they historically did" changes, and it is all > worrisome. I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in ordered mode"). Fortunately, we have a documented test for this, generic/127, so we'll know if we've broken it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 21:16 ` Matthew Wilcox @ 2022-03-17 22:52 ` Dave Chinner 2022-03-18 13:16 ` Jan Kara 1 sibling, 0 replies; 27+ messages in thread From: Dave Chinner @ 2022-03-17 22:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Jan Kara, linux-ext4 On Thu, Mar 17, 2022 at 09:16:20PM +0000, Matthew Wilcox wrote: > On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote: > > On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > So how about we do something like this: > > > > > > - Make folio_start_writeback() and set_page_writeback() return void, > > > fixing up AFS and NFS. > > > - Add a folio_wait_start_writeback() to use in the VFS > > > - Remove the calls to set_page_writeback() in the filesystems > > > > That sounds lovely, but it does worry me a bit. Not just the odd > > 'keepwrite' thing, but also the whole ordering between the folio bit > > and the tagging bits. Does the ordering possibly matter? > > I wouldn't change the ordering of setting the xarray bits and the > writeback flag; they'd just be set a little earlier. It'd all be done > while the page was still locked. But you're right, there's lots of > subtle interactions here. > > > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in > > only one place (the folio version isn't used at all): > > > > ext4_writepage(): > > > > ext4_walk_page_buffers() fails: > > redirty_page_for_writepage(wbc, page); > > keep_towrite = true; > > ext4_bio_write_page(). > > > > which just looks odd. Why does it even try to continue to do the > > writepage when the page buffer thing has failed? > > > > In the regular write path (ie ext4_write_begin()), a > > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is > > ext4_writepage() any different? Particularly since it wants to keep > > the page dirty, then trying to do the writeback just seems wrong. > > > > So this code is all a bit odd, I suspect there are decades of "people > > continued to do what they historically did" changes, and it is all > > worrisome. > > I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in > ordered mode"). Fortunately, we have a documented test for this, > generic/127, so we'll know if we've broken it. Looks like a footgun. ext4 needs the keepwrite stuff for block size < page size, in the case where a page has both written and delalloc/unwritten buffers on it. In that case ext4_writepage tries to write just the written blocks and leave the dealloc/unwritten buffers alone because it can't do allocation in ->writepage context. I say footgun, because the nested ->writepage call that needs keepwrite comes a from journal stop context in the high level ->writepages context that is doing allocation that will allow the entire page to be written. i.e. it seems a bit silly to be triggering partial page writeback that skips delalloc/unwritten extents, but then needs special awareness of higher level IO that is in progress that is currently doing the allocation that will allow all the delalloc/unwritten extents on the page to also be written back... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 21:16 ` Matthew Wilcox 2022-03-17 22:52 ` Dave Chinner @ 2022-03-18 13:16 ` Jan Kara 2022-03-18 18:56 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Jan Kara @ 2022-03-18 13:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Jan Kara, linux-ext4 On Thu 17-03-22 21:16:20, Matthew Wilcox wrote: > On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote: > > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in > > only one place (the folio version isn't used at all): > > > > ext4_writepage(): > > > > ext4_walk_page_buffers() fails: > > redirty_page_for_writepage(wbc, page); > > keep_towrite = true; > > ext4_bio_write_page(). > > > > which just looks odd. Why does it even try to continue to do the > > writepage when the page buffer thing has failed? > > > > In the regular write path (ie ext4_write_begin()), a > > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is > > ext4_writepage() any different? Particularly since it wants to keep > > the page dirty, then trying to do the writeback just seems wrong. > > > > So this code is all a bit odd, I suspect there are decades of "people > > continued to do what they historically did" changes, and it is all > > worrisome. > > I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in > ordered mode"). Fortunately, we have a documented test for this, > generic/127, so we'll know if we've broken it. I agree with Dave that 'keep_towrite' thing is kind of self-inflicted damage on the ext4 side (we need to write out some blocks underlying the page but cannot write all from the transaction commit code, so we need to keep xarray tags intact so that data integrity sync cannot miss the page). Also it is no longer needed in the current default ext4 setup. But if you have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' mount options, the hack is still needed AFAIK and we don't have a reasonable way around it. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-18 13:16 ` Jan Kara @ 2022-03-18 18:56 ` Linus Torvalds 2022-03-19 16:23 ` Theodore Ts'o 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2022-03-18 18:56 UTC (permalink / raw) To: Jan Kara Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Theodore Ts'o, Ext4 Developers List On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <jack@suse.cz> wrote: > > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted > damage on the ext4 side (we need to write out some blocks underlying the > page but cannot write all from the transaction commit code, so we need to > keep xarray tags intact so that data integrity sync cannot miss the page). > Also it is no longer needed in the current default ext4 setup. But if you > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' > mount options, the hack is still needed AFAIK and we don't have a > reasonable way around it. I assume you meant 'dioread_lock'. Which seems to be the default (even if 'data=ordered' is not). Anyway - if it's not a problem for any current default setting, maybe the solution is to warn about this case and turn it off? IOW, we could simply warn about "data=ordered is no longer supported" and turn it into data=journal. Obviously *only* do this for the case of "blocksize < PAGE_SIZE". If this ext4 thing is (a) obsolete and (b) causes VFS-level problems that nobody else has, I really think we'd be much better off disabling it than trying to work with it. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-18 18:56 ` Linus Torvalds @ 2022-03-19 16:23 ` Theodore Ts'o 2022-03-30 15:55 ` Christoph Hellwig 0 siblings, 1 reply; 27+ messages in thread From: Theodore Ts'o @ 2022-03-19 16:23 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Ext4 Developers List On Fri, Mar 18, 2022 at 11:56:04AM -0700, Linus Torvalds wrote: > On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <jack@suse.cz> wrote: > > > > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted > > damage on the ext4 side (we need to write out some blocks underlying the > > page but cannot write all from the transaction commit code, so we need to > > keep xarray tags intact so that data integrity sync cannot miss the page). > > Also it is no longer needed in the current default ext4 setup. But if you > > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' > > mount options, the hack is still needed AFAIK and we don't have a > > reasonable way around it. > > I assume you meant 'dioread_lock'. > > Which seems to be the default (even if 'data=ordered' is not). That's not quite right. data=ordered is the default, and that has been the case since ext3 was introduced. "dioread_lock" was the default in the days of ext3; "dioread_nolock" was added to allow parallel Direct I/O reads (hence "nolock"). A while back, we tried to make dioread_nolock the default since it tends improve performance for workloads that have a mix of writes that should be affected by fsyncs, and those that shouldn't. Howevver, we had to revert that change when blocksize < pagesize to work around a problem found on Power machines where "echo 3 > drop_caches" on the host appears to cause file system corruptions on the guest. (Commit 626b035b816b "ext4: don't set dioread_nolock by default for blocksize < pagesize"). > IOW, we could simply warn about "data=ordered is no longer supported" > and turn it into data=journal. > > Obviously *only* do this for the case of "blocksize < PAGE_SIZE". Actiavelly, we've discussed a number of times doing the reverse --- removing data=journal entirely, since it's (a) rarely used, and (b) data=journal disables a bunch of optimizations, including preallocation, and so its performance is pretty awful for most workloads. The main reason why haven't until now is that we believe there is a small number of people who do find data=journal useful for their workload, and at least _so_ far it's not been that hard to keep it limping along --- although in most cases, data=journal doesn't get supported for new features or performance optimizations, and it definitely does note get as much testing. So the thing that I've been waiting to do for a while is to replace the whole data=ordered vs data=writeback and dioread_nolock and dioread_lock is a complete reworking of the ext4 buffered writeback path, where we write the data blocks *first*, and only then update the ext4 metadata. Historically, going as far back as ext2, we've always allocated data blocks and updatted the metadata blocks, and only then updated the buffer or page cache for the data blocks. All of the complexities around data=ordered, data=writeback, dioread_nolock, etc., is because we haven't done the fundamental work of reversing the order in which we do buffered writeback. 
What we *should* be doing is: *) Determining where the new allocated data blockblocks should be, and preventing those blocks from being used for any other purposes, but *not* updating the file system metadata to reflect that change. *) Submit the data block write *) On write completion, update the metadata blocks in a kernel thread. Over time, we've been finding more and more reasons why I need to do this work, so it's something I'm going to have to prioritize in the next few months. Cheerse, - Ted ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-19 16:23 ` Theodore Ts'o @ 2022-03-30 15:55 ` Christoph Hellwig 0 siblings, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2022-03-30 15:55 UTC (permalink / raw) To: Theodore Ts'o Cc: Linus Torvalds, Jan Kara, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, Namjae Jeon, Ashish Sangwan, Ext4 Developers List On Sat, Mar 19, 2022 at 12:23:04PM -0400, Theodore Ts'o wrote: > So the thing that I've been waiting to do for a while is to replace > the whole data=ordered vs data=writeback and dioread_nolock and > dioread_lock is a complete reworking of the ext4 buffered writeback > path, where we write the data blocks *first*, and only then update the > ext4 metadata. > *) Determining where the new allocated data blockblocks should be, and > preventing those blocks from being used for any other purposes, but > *not* updating the file system metadata to reflect that change. > > *) Submit the data block write > > *) On write completion, update the metadata blocks in a kernel thread. I think that would be easily done by switching to the iomap buffered I/O code, which is very much built around that model. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-16 23:35 ` Linus Torvalds
  2022-03-17 15:04   ` Matthew Wilcox
@ 2022-03-17 15:31   ` Brian Foster
  1 sibling, 0 replies; 27+ messages in thread
From: Brian Foster @ 2022-03-17 15:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Mar 16, 2022 at 04:35:10PM -0700, Linus Torvalds wrote:
> On Wed, Mar 16, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > As I recall, the bookmark hack was introduced in order to handle
> > lock_page() problems.  It wasn't really supposed to handle writeback,
> > but nobody thought it would cause any harm (and indeed, it didn't at the
> > time).  So how about we only use bookmarks for lock_page(), since
> > lock_page() usually doesn't have the multiple-waker semantics that
> > writeback has?
>
> I was hoping that some of the page lock problems are gone and we could
> maybe try to get rid of the bookmarks entirely.
>
> But the page lock issues only ever showed up on some private
> proprietary load and machine, so we never really got confirmation that
> they are fixed. There were lots of strong signs to them being related
> to the migration page locking, and it may be that the bookmark code is
> only hurting these days.
>
> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") which doesn't
> actually change the *locking* side, but drops the page reference when
> waiting for the locked page to be unlocked, which in turn removes a
> "loop and try again when migration". And that may have been the real
> _fix_ for the problem.
>
> Because while the bookmark thing avoids the NMI lockup detector firing
> due to excessive hold times, the bookmarking also _causes_ that "we
> now will see the same page multiple times because we dropped the lock
> and somebody re-added it at the end of the queue" issue. Which seems
> to be the problem here.
>
> Ugh. I wish we had some way to test "could we just remove the bookmark
> code entirely again".
>
> Of course, the PG_lock case also works fairly hard to not actually
> remove and re-add the lock waiter to the queue, but having an actual
> "wait for and get the lock" operation. The writeback bit isn't done
> that way.
>
> I do hate how we had to make folio_wait_writeback{_killable}() use
> "while" rather than an "if". It *almost* works with just a "wait for
> current writeback", but not quite. See commit c2407cf7d22d ("mm: make
> wait_on_page_writeback() wait for multiple pending writebacks") for
> why we have to loop. Ugly, ugly.
>

Right.. In case you missed it in my too long of a description (sorry),
I pointed out that this problem seems to manifest most recently as of
that commit. In fact, looking through the past discussion for that
patch, it wouldn't surprise me a ton if this problem is some
pathological manifestation of the perf issue that you described here
[1]. Indeed most of the waiters in this case come from fsync() -> ... ->
__filemap_fdatawait_range(), and your test patch in that email performs
a similar sort of trick to skip out of the wake up side (I'm curious if
that was ever determined to help?) to things that I was playing with to
try and narrow this down.

> Because I do think that "while" in the writeback waiting is a problem.
> Maybe _the_ problem.
>

FWIW, Matthew's patch does seem to address this problem. My current
test of that patch is past the point where I usually expect to see the
soft lockup warning, but I'm going to let it continue to run (and then
run it through some xfs regression if the approach is agreeable).

Getting back to the loop thing (and seeing Matthew's latest reply wrt
wait_and_set())... If we wanted to go back to non-looping in
folio_wait_writeback() to avoid the unserialized wait queue build up or
whatever, would it make any sense to lift the looping writeback check
to write_cache_pages()? We hold the page lock and have checked
PageDirty() by that point, so ISTM that would address the
BUG_ON(PageWriteback()) race caused by the delayed/unserialized wakeup
without producing the excess wait queue buildup caused by waiters in
the __filemap_fdatawait_range() path. Then presumably that "wait for
writeback to clear" loop in write_cache_pages() is eventually replaced
by the "wait and set writeback" thing when the rest of the fs code is
fixed up appropriately. Hm? Of course I haven't tested that so it could
be completely bogus, but I can if it makes any sort of sense as an
incremental step..

Brian

[1] https://lore.kernel.org/linux-mm/CAHk-=wgD9GK5CeHopYmRHoYS9cNuCmDMsc=+MbM_KgJ0KB+=ng@mail.gmail.com/

> Linus
>

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-16 20:59 ` Matthew Wilcox 2022-03-16 23:35 ` Linus Torvalds @ 2022-03-17 13:51 ` Brian Foster 2022-03-18 14:14 ` Brian Foster 1 sibling, 1 reply; 27+ messages in thread From: Brian Foster @ 2022-03-17 13:51 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote: > On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote: > > What seems to happen is that the majority of the fsync calls end up > > waiting on writeback of a particular page, the wakeup of the writeback > > bit on that page wakes a task that immediately resets PG_writeback on > > the page such that N other folio_wait_writeback() waiters see the bit > > still set and immediately place themselves back onto the tail of the > > wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop > > in folio_wake_bit() (backing off the lock for a cycle or so in each > > iteration) only to find the same bunch of tasks in the queue. This > > process repeats for a long enough amount of time to trigger the soft > > lockup warning. I've confirmed this spinning behavior with a tracepoint > > in the bookmark loop that indicates we're stuck for many hundreds of > > thousands of iterations (at least) of this loop when the soft lockup > > warning triggers. > > [...] > > > I've run a few quick experiments to try and corroborate this analysis. > > The problem goes away completely if I either back out the loop change in > > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something > > like 128 (i.e. greater than the total possible number of waiter tasks in > > this test). I've also played a few games with bookmark behavior mostly > > out of curiosity, but usually end up introducing other problems like > > missed wakeups, etc. > > As I recall, the bookmark hack was introduced in order to handle > lock_page() problems. It wasn't really supposed to handle writeback, > but nobody thought it would cause any harm (and indeed, it didn't at the > time). So how about we only use bookmarks for lock_page(), since > lock_page() usually doesn't have the multiple-waker semantics that > writeback has? > Oh, interesting. I wasn't aware of the tenuous status of the bookmark code. This is indeed much nicer than anything I was playing with. I suspect it will address the problem, but I'll throw it at my test env for a while and follow up.. thanks! Brian > (this is more in the spirit of "minimal patch" -- I think initialising > the bookmark should be moved to folio_unlock()). 
> > diff --git a/mm/filemap.c b/mm/filemap.c > index b2728eb52407..9ee3c5f1f489 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > return (flags & WQ_FLAG_EXCLUSIVE) != 0; > } > > -static void folio_wake_bit(struct folio *folio, int bit_nr) > +static void folio_wake_bit(struct folio *folio, int bit_nr, > + wait_queue_entry_t *bookmark) > { > wait_queue_head_t *q = folio_waitqueue(folio); > struct wait_page_key key; > unsigned long flags; > - wait_queue_entry_t bookmark; > > key.folio = folio; > key.bit_nr = bit_nr; > key.page_match = 0; > > - bookmark.flags = 0; > - bookmark.private = NULL; > - bookmark.func = NULL; > - INIT_LIST_HEAD(&bookmark.entry); > + if (bookmark) { > + bookmark->flags = 0; > + bookmark->private = NULL; > + bookmark->func = NULL; > + INIT_LIST_HEAD(&bookmark->entry); > + } > > spin_lock_irqsave(&q->lock, flags); > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > - while (bookmark.flags & WQ_FLAG_BOOKMARK) { > + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { > /* > * Take a breather from holding the lock, > * allow pages that finish wake up asynchronously > @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) > spin_unlock_irqrestore(&q->lock, flags); > cpu_relax(); > spin_lock_irqsave(&q->lock, flags); > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > } > > /* > @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) > { > if (!folio_test_waiters(folio)) > return; > - folio_wake_bit(folio, bit); > + folio_wake_bit(folio, bit, NULL); > } > > /* > @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem > */ > void folio_unlock(struct folio *folio) > { > + wait_queue_entry_t bookmark; > + > /* Bit 7 allows x86 to check the byte's sign bit */ > BUILD_BUG_ON(PG_waiters != 7); > BUILD_BUG_ON(PG_locked > 7); > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > + > if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) > - folio_wake_bit(folio, PG_locked); > + folio_wake_bit(folio, PG_locked, &bookmark); > } > EXPORT_SYMBOL(folio_unlock); > > @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) > { > VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); > clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); > - folio_wake_bit(folio, PG_private_2); > + folio_wake_bit(folio, PG_private_2, NULL); > folio_put(folio); > } > EXPORT_SYMBOL(folio_end_private_2); > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit() 2022-03-17 13:51 ` Brian Foster @ 2022-03-18 14:14 ` Brian Foster 2022-03-18 14:45 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Brian Foster @ 2022-03-18 14:14 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins On Thu, Mar 17, 2022 at 09:51:44AM -0400, Brian Foster wrote: > On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote: > > On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote: > > > What seems to happen is that the majority of the fsync calls end up > > > waiting on writeback of a particular page, the wakeup of the writeback > > > bit on that page wakes a task that immediately resets PG_writeback on > > > the page such that N other folio_wait_writeback() waiters see the bit > > > still set and immediately place themselves back onto the tail of the > > > wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop > > > in folio_wake_bit() (backing off the lock for a cycle or so in each > > > iteration) only to find the same bunch of tasks in the queue. This > > > process repeats for a long enough amount of time to trigger the soft > > > lockup warning. I've confirmed this spinning behavior with a tracepoint > > > in the bookmark loop that indicates we're stuck for many hundreds of > > > thousands of iterations (at least) of this loop when the soft lockup > > > warning triggers. > > > > [...] > > > > > I've run a few quick experiments to try and corroborate this analysis. > > > The problem goes away completely if I either back out the loop change in > > > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something > > > like 128 (i.e. greater than the total possible number of waiter tasks in > > > this test). I've also played a few games with bookmark behavior mostly > > > out of curiosity, but usually end up introducing other problems like > > > missed wakeups, etc. > > > > As I recall, the bookmark hack was introduced in order to handle > > lock_page() problems. It wasn't really supposed to handle writeback, > > but nobody thought it would cause any harm (and indeed, it didn't at the > > time). So how about we only use bookmarks for lock_page(), since > > lock_page() usually doesn't have the multiple-waker semantics that > > writeback has? > > > > Oh, interesting. I wasn't aware of the tenuous status of the bookmark > code. This is indeed much nicer than anything I was playing with. I > suspect it will address the problem, but I'll throw it at my test env > for a while and follow up.. thanks! > So I'm not clear on where we're at with this patch vs. something that moves (or drops) the wb wait loop vs. the wait and set thing (which seems more invasive and longer term), but FWIW.. this patch survived over 10k iterations of the reproducer test yesterday (the problem typically reproduces in ~1k or so iterations on average) and an overnight fstests run without regression. Brian > Brian > > > (this is more in the spirit of "minimal patch" -- I think > > initialising the bookmark should be moved to folio_unlock()). 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c index > > b2728eb52407..9ee3c5f1f489 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > > return (flags & WQ_FLAG_EXCLUSIVE) != 0; > > } > > > > -static void folio_wake_bit(struct folio *folio, int bit_nr) > > +static void folio_wake_bit(struct folio *folio, int bit_nr, > > + wait_queue_entry_t *bookmark) > > { > > wait_queue_head_t *q = folio_waitqueue(folio); > > struct wait_page_key key; > > unsigned long flags; > > - wait_queue_entry_t bookmark; > > > > key.folio = folio; > > key.bit_nr = bit_nr; > > key.page_match = 0; > > > > - bookmark.flags = 0; > > - bookmark.private = NULL; > > - bookmark.func = NULL; > > - INIT_LIST_HEAD(&bookmark.entry); > > + if (bookmark) { > > + bookmark->flags = 0; > > + bookmark->private = NULL; > > + bookmark->func = NULL; > > + INIT_LIST_HEAD(&bookmark->entry); > > + } > > > > spin_lock_irqsave(&q->lock, flags); > > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > > > - while (bookmark.flags & WQ_FLAG_BOOKMARK) { > > + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { > > /* > > * Take a breather from holding the lock, > > * allow pages that finish wake up asynchronously > > @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) > > spin_unlock_irqrestore(&q->lock, flags); > > cpu_relax(); > > spin_lock_irqsave(&q->lock, flags); > > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > } > > > > /* > > @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) > > { > > if (!folio_test_waiters(folio)) > > return; > > - folio_wake_bit(folio, bit); > > + folio_wake_bit(folio, bit, NULL); > > } > > > > /* > > @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem > > */ > > void folio_unlock(struct folio *folio) > > { > > + wait_queue_entry_t bookmark; > > + > > /* Bit 7 allows x86 to check the byte's sign bit */ > > BUILD_BUG_ON(PG_waiters != 7); > > BUILD_BUG_ON(PG_locked > 7); > > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > > + > > if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) > > - folio_wake_bit(folio, PG_locked); > > + folio_wake_bit(folio, PG_locked, &bookmark); > > } > > EXPORT_SYMBOL(folio_unlock); > > > > @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) > > { > > VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); > > clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); > > - folio_wake_bit(folio, PG_private_2); > > + folio_wake_bit(folio, PG_private_2, NULL); > > folio_put(folio); > > } > > EXPORT_SYMBOL(folio_end_private_2); > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-18 14:14 ` Brian Foster
@ 2022-03-18 14:45 ` Matthew Wilcox
  2022-03-18 18:58 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-03-18 14:45 UTC (permalink / raw)
To: Brian Foster
Cc: linux-mm, linux-fsdevel, linux-xfs, Linus Torvalds, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 654 bytes --]

On Fri, Mar 18, 2022 at 10:14:03AM -0400, Brian Foster wrote:
> So I'm not clear on where we're at with this patch vs. something that
> moves (or drops) the wb wait loop vs. the wait and set thing (which
> seems more invasive and longer term), but FWIW.. this patch survived
> over 10k iterations of the reproducer test yesterday (the problem
> typically reproduces in ~1k or so iterations on average) and an
> overnight fstests run without regression.

Excellent! I'm going to propose these two patches for -rc1 (I don't
think we want to be playing with this after -rc8). I agree the
wait-and-set approach is a little further out.

(patches attached)

[-- Attachment #2: 0001-filemap-Remove-use-of-wait-bookmarks.patch --]
[-- Type: text/plain, Size: 1818 bytes --]

From db81b2f1ce3dc6aa902beb8b3e45a5972cf85737 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Thu, 17 Mar 2022 12:12:49 -0400
Subject: [PATCH 1/2] filemap: Remove use of wait bookmarks

The original problem of the overly long list of waiters on a locked
page was solved properly by commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated"). In the
meantime, using bookmarks for the writeback bit can cause livelocks, so
we need to stop using them.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b2728eb52407..bb8ec585dea8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1151,32 +1151,13 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
 	wait_queue_head_t *q = folio_waitqueue(folio);
 	struct wait_page_key key;
 	unsigned long flags;
-	wait_queue_entry_t bookmark;
 
 	key.folio = folio;
 	key.bit_nr = bit_nr;
 	key.page_match = 0;
 
-	bookmark.flags = 0;
-	bookmark.private = NULL;
-	bookmark.func = NULL;
-	INIT_LIST_HEAD(&bookmark.entry);
-
 	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
-
-	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
-		/*
-		 * Take a breather from holding the lock,
-		 * allow pages that finish wake up asynchronously
-		 * to acquire the lock and remove themselves
-		 * from wait queue
-		 */
-		spin_unlock_irqrestore(&q->lock, flags);
-		cpu_relax();
-		spin_lock_irqsave(&q->lock, flags);
-		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
-	}
+	__wake_up_locked_key(q, TASK_NORMAL, &key);
 
 	/*
 	 * It is possible for other pages to have collided on the waitqueue
-- 
2.34.1


[-- Attachment #3: 0002-sched-Remove-wait-bookmarks.patch --]
[-- Type: text/plain, Size: 5839 bytes --]

From 4eda97c0c8374f948a08c28265f019517d369a4c Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 18 Mar 2022 10:39:49 -0400
Subject: [PATCH 2/2] sched: Remove wait bookmarks

There are no users of wait bookmarks left, so simplify the wait
code by removing them.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/wait.h |  9 +++----
 kernel/sched/wait.c  | 58 +++++++-------------------------------------
 2 files changed, 12 insertions(+), 55 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 851e07da2583..69aa9d5767e7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -19,10 +19,9 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 /* wait_queue_entry::flags */
 #define WQ_FLAG_EXCLUSIVE	0x01
 #define WQ_FLAG_WOKEN		0x02
-#define WQ_FLAG_BOOKMARK	0x04
-#define WQ_FLAG_CUSTOM		0x08
-#define WQ_FLAG_DONE		0x10
-#define WQ_FLAG_PRIORITY	0x20
+#define WQ_FLAG_CUSTOM		0x04
+#define WQ_FLAG_DONE		0x08
+#define WQ_FLAG_PRIORITY	0x10
 
 /*
  * A single wait-queue entry structure:
@@ -211,8 +210,6 @@ __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq
 
 void __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
-void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
-		unsigned int mode, void *key, wait_queue_entry_t *bookmark);
 void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
 void __wake_up_locked_sync_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
 void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index eca38107b32f..e09f5c177662 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -58,13 +58,6 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
 }
 EXPORT_SYMBOL(remove_wait_queue);
 
-/*
- * Scan threshold to break wait queue walk.
- * This allows a waker to take a break from holding the
- * wait queue lock during the wait queue walk.
- */
-#define WAITQUEUE_WALK_BREAK_CNT 64
-
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -79,21 +72,13 @@ EXPORT_SYMBOL(remove_wait_queue);
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
 static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
-			int nr_exclusive, int wake_flags, void *key,
-			wait_queue_entry_t *bookmark)
+			int nr_exclusive, int wake_flags, void *key)
 {
 	wait_queue_entry_t *curr, *next;
-	int cnt = 0;
 
 	lockdep_assert_held(&wq_head->lock);
 
-	if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
-		curr = list_next_entry(bookmark, entry);
-
-		list_del(&bookmark->entry);
-		bookmark->flags = 0;
-	} else
-		curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
+	curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
 
 	if (&curr->entry == &wq_head->head)
 		return nr_exclusive;
@@ -102,21 +87,11 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
 		unsigned flags = curr->flags;
 		int ret;
 
-		if (flags & WQ_FLAG_BOOKMARK)
-			continue;
-
 		ret = curr->func(curr, mode, wake_flags, key);
 		if (ret < 0)
 			break;
 		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
 			break;
-
-		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
-				(&next->entry != &wq_head->head)) {
-			bookmark->flags = WQ_FLAG_BOOKMARK;
-			list_add_tail(&bookmark->entry, &next->entry);
-			break;
-		}
 	}
 
 	return nr_exclusive;
@@ -126,19 +101,11 @@ static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int
 			int nr_exclusive, int wake_flags, void *key)
 {
 	unsigned long flags;
-	wait_queue_entry_t bookmark;
 
-	bookmark.flags = 0;
-	bookmark.private = NULL;
-	bookmark.func = NULL;
-	INIT_LIST_HEAD(&bookmark.entry);
-
-	do {
-		spin_lock_irqsave(&wq_head->lock, flags);
-		nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
-						wake_flags, key, &bookmark);
-		spin_unlock_irqrestore(&wq_head->lock, flags);
-	} while (bookmark.flags & WQ_FLAG_BOOKMARK);
+	spin_lock_irqsave(&wq_head->lock, flags);
+	nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
+					wake_flags, key);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 
 /**
@@ -163,23 +130,16 @@ EXPORT_SYMBOL(__wake_up);
  */
 void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr)
 {
-	__wake_up_common(wq_head, mode, nr, 0, NULL, NULL);
+	__wake_up_common(wq_head, mode, nr, 0, NULL);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
 void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
 {
-	__wake_up_common(wq_head, mode, 1, 0, key, NULL);
+	__wake_up_common(wq_head, mode, 1, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
-void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
-		unsigned int mode, void *key, wait_queue_entry_t *bookmark)
-{
-	__wake_up_common(wq_head, mode, 1, 0, key, bookmark);
-}
-EXPORT_SYMBOL_GPL(__wake_up_locked_key_bookmark);
-
 /**
  * __wake_up_sync_key - wake up threads blocked on a waitqueue.
  * @wq_head: the waitqueue
@@ -225,7 +185,7 @@ EXPORT_SYMBOL_GPL(__wake_up_sync_key);
 void __wake_up_locked_sync_key(struct wait_queue_head *wq_head,
 			       unsigned int mode, void *key)
 {
-	__wake_up_common(wq_head, mode, 1, WF_SYNC, key, NULL);
+	__wake_up_common(wq_head, mode, 1, WF_SYNC, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_sync_key);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread
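For readers following along: the "wait and set thing" referenced in the quoted
text above is the idea that the task starting writeback would claim
PG_writeback in the same atomic step in which it observes the bit clear, so a
writer that wins the race cannot leave every other folio_wait_writeback()
caller re-queueing behind a freshly re-set bit. A minimal sketch of that idea
follows; folio_test_set_writeback() is a hypothetical helper invented purely
for illustration, and none of this is part of the patches above.

```c
/*
 * Illustrative sketch only -- not part of the attached patches.
 * folio_test_set_writeback() is a hypothetical atomic test-and-set of
 * PG_writeback; folio_wait_bit() is the existing wait primitive.
 */
static void folio_wait_and_start_writeback(struct folio *folio)
{
	/* Sleep until we are the task that turns the writeback bit on. */
	while (folio_test_set_writeback(folio))
		folio_wait_bit(folio, PG_writeback);
}
```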
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-18 14:45 ` Matthew Wilcox
@ 2022-03-18 18:58 ` Linus Torvalds
  2022-10-20 1:35 ` Dan Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-03-18 18:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Fri, Mar 18, 2022 at 7:45 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Excellent! I'm going to propose these two patches for -rc1 (I don't
> think we want to be playing with this after -rc8)

Ack. I think your commit message may be a bit too optimistic (who knows
if other loads can trigger the over-long page locking wait-queue
latencies), but since I don't see any other ways to really check this
than just trying it, let's do it.

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-03-18 18:58 ` Linus Torvalds
@ 2022-10-20 1:35 ` Dan Williams
  2022-10-23 22:38 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Dan Williams @ 2022-10-20 1:35 UTC (permalink / raw)
To: Linus Torvalds, Matthew Wilcox
Cc: Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

Linus Torvalds wrote:
> On Fri, Mar 18, 2022 at 7:45 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Excellent! I'm going to propose these two patches for -rc1 (I don't
> > think we want to be playing with this after -rc8)
>
> Ack. I think your commit message may be a bit too optimistic (who
> knows if other loads can trigger the over-long page locking wait-queue
> latencies), but since I don't see any other ways to really check this
> than just trying it, let's do it.
>
> Linus

A report from a tester with this call trace:

watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40
[..]
Call Trace:
 <TASK>
 folio_wake_bit+0x8a/0x110
 folio_end_writeback+0x37/0x80
 ext4_finish_bio+0x19a/0x270
 ext4_end_bio+0x47/0x140
 blk_update_request+0x112/0x410

...led me to this thread. This was after I had them force all softirqs
to run in ksoftirqd context, and run with rq_affinity == 2 to force
I/O completion work to throttle new submissions.

Willy, are these headed upstream:

https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org

...or am I missing an alternate solution posted elsewhere?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-20 1:35 ` Dan Williams
@ 2022-10-23 22:38 ` Linus Torvalds
  2022-10-24 19:39 ` Tim Chen
  2022-10-24 20:13 ` Dan Williams
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-10-23 22:38 UTC (permalink / raw)
To: Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Wed, Oct 19, 2022 at 6:35 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> A report from a tester with this call trace:
>
> watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
> RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]

Whee.

> ...lead me to this thread. This was after I had them force all softirqs
> to run in ksoftirqd context, and run with rq_affinity == 2 to force
> I/O completion work to throttle new submissions.
>
> Willy, are these headed upstream:
>
> https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org
>
> ...or I am missing an alternate solution posted elsewhere?

Can your reporter test that patch? I think it should still apply
pretty much as-is.. And if we actually had somebody who had a
test-case that was literally fixed by getting rid of the old bookmark
code, that would make applying that patch a no-brainer.

The problem is that the original load that caused us to do that thing
in the first place isn't repeatable because it was special production
code - so removing that bookmark code because we _think_ it now hurts
more than it helps is kind of a big hurdle.

But if we had some hard confirmation from somebody that "yes, the
bookmark code is now hurting", that would make it a lot more palatable
to just remove the code that we just _think_ probably isn't needed
any more..

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-23 22:38 ` Linus Torvalds
@ 2022-10-24 19:39 ` Tim Chen
  2022-10-24 19:43 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Tim Chen @ 2022-10-24 19:39 UTC (permalink / raw)
To: Linus Torvalds, Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Sun, 2022-10-23 at 15:38 -0700, Linus Torvalds wrote:
> On Wed, Oct 19, 2022 at 6:35 PM Dan Williams
> <dan.j.williams@intel.com> wrote:
> >
> > A report from a tester with this call trace:
> >
> > watchdog: BUG: soft lockup - CPU#127 stuck for 134s!
> > [ksoftirqd/127:782]
> > RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]
>
> Whee.
>
> > ...lead me to this thread. This was after I had them force all
> > softirqs
> > to run in ksoftirqd context, and run with rq_affinity == 2 to force
> > I/O completion work to throttle new submissions.
> >
> > Willy, are these headed upstream:
> >
> > https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org
> >
> > ...or I am missing an alternate solution posted elsewhere?
>
> Can your reporter test that patch? I think it should still apply
> pretty much as-is.. And if we actually had somebody who had a
> test-case that was literally fixed by getting rid of the old bookmark
> code, that would make applying that patch a no-brainer.
>
> The problem is that the original load that caused us to do that thing
> in the first place isn't repeatable because it was special production
> code - so removing that bookmark code because we _think_ it now hurts
> more than it helps is kind of a big hurdle.
>
> But if we had some hard confirmation from somebody that "yes, the
> bookmark code is now hurting", that would make it a lot more
> palatable
> to just remove the code that we just _think_ that probably isn't
> needed any more..
>
>

I do think that the original locked page on migration problem was fixed
by commit 9a1ea439b16b. Unfortunately the customer did not respond to us
when we asked them to test their workload when that patch went into the
mainline.

I have no objection to Matthew's fix to remove the bookmark code, now
that it is causing problems with this scenario that I didn't anticipate
in my original code.

Tim

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 19:39 ` Tim Chen
@ 2022-10-24 19:43 ` Linus Torvalds
  2022-10-24 20:14 ` Dan Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-10-24 19:43 UTC (permalink / raw)
To: Tim Chen
Cc: Dan Williams, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

On Mon, Oct 24, 2022 at 12:39 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> I do think that the original locked page on migration problem was fixed
> by commit 9a1ea439b16b. Unfortunately the customer did not respond to
> us when we asked them to test their workload when that patch went
> into the mainline.

Oh well.

> I don't have objection to Matthew's fix to remove the bookmark code,
> now that it is causing problems with this scenario that I didn't
> anticipate in my original code.

I'd really like to avoid *another* "we can't actually verify that this
helps" change in this area, so I'm hoping that the reporter that Dan
was talking to could test that patch.

Otherwise we're kind of going back-and-forth based on "this might fix
things", which just feels really fragile and reminds me of the bad old
days when we had the "one step forward, two steps back" dance with
some of the suspend/resume issues.

I realize that this code needs some extreme loads (and likely pretty
special hardware too) to actually become problematic, so testing is
_always_ going to be a bit of a problem, but still...

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 19:43 ` Linus Torvalds
@ 2022-10-24 20:14 ` Dan Williams
  0 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2022-10-24 20:14 UTC (permalink / raw)
To: Linus Torvalds, Tim Chen
Cc: Dan Williams, Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins

Linus Torvalds wrote:
> On Mon, Oct 24, 2022 at 12:39 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > I do think that the original locked page on migration problem was fixed
> > by commit 9a1ea439b16b. Unfortunately the customer did not respond to
> > us when we asked them to test their workload when that patch went
> > into the mainline.
>
> Oh well.
>
> > I don't have objection to Matthew's fix to remove the bookmark code,
> > now that it is causing problems with this scenario that I didn't
> > anticipate in my original code.
>
> I'd really like to avoid *another* "we can't actually verify that this
> helps" change in this area, so I'm hoping that the reporter that Dan
> was talking to could test that patch.

Oh, sorry, I had typed up that reply and contacted Tim offline, but
forgot to send, now sent.

> Otherwise we're kind of going back-and-forth based on "this might fix
> things", which just feels really fragile and reminds me of the bad old
> days when we had the "one step forward, two steps back" dance with
> some of the suspend/resume issues.
>
> I realize that this code needs some extreme loads (and likely pretty
> special hardware too) to actually become problematic, so testing is
> _always_ going to be a bit of a problem, but still...

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-23 22:38 ` Linus Torvalds
  2022-10-24 19:39 ` Tim Chen
@ 2022-10-24 20:13 ` Dan Williams
  2022-10-24 20:28 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2022-10-24 20:13 UTC (permalink / raw)
To: Linus Torvalds, Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

[ add Tim and Arechiga ]

Linus Torvalds wrote:
> On Wed, Oct 19, 2022 at 6:35 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > A report from a tester with this call trace:
> >
> > watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
> > RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]
>
> Whee.
>
> > ...lead me to this thread. This was after I had them force all softirqs
> > to run in ksoftirqd context, and run with rq_affinity == 2 to force
> > I/O completion work to throttle new submissions.
> >
> > Willy, are these headed upstream:
> >
> > https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@casper.infradead.org
> >
> > ...or I am missing an alternate solution posted elsewhere?
>
> Can your reporter test that patch? I think it should still apply
> pretty much as-is.. And if we actually had somebody who had a
> test-case that was literally fixed by getting rid of the old bookmark
> code, that would make applying that patch a no-brainer.
>
> The problem is that the original load that caused us to do that thing
> in the first place isn't repeatable because it was special production
> code - so removing that bookmark code because we _think_ it now hurts
> more than it helps is kind of a big hurdle.
>
> But if we had some hard confirmation from somebody that "yes, the
> bookmark code is now hurting", that would make it a lot more palatable
> to just remove the code that we just _think_ that probably isn't
> needed any more..

Arechiga reports that his test case that failed "fast" before now ran
for 28 hours without a soft lockup report with the proposed patches
applied. So, I would consider those:

Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>

I notice that the original commit:

11a19c7b099f sched/wait: Introduce wakeup boomark in wake_up_page_bit

...was trying to fix waitqueue lock contention. The general approach of
setting a bookmark and taking a break "could" work, but in this case it
would need to do something like return -EWOULDBLOCK and let ksoftirqd
fall into its cond_resched() retry path. However, that would require
plumbing the bookmark up several levels, not to mention the other
folio_wake_bit() callers that do not have a convenient place to do
cond_resched(). So I think this test case has successfully found a way
that waitqueue lock contention cannot be improved.

^ permalink raw reply	[flat|nested] 27+ messages in thread
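To make the plumbing problem described above concrete, here is a rough,
purely hypothetical sketch of what the bottom-most level might look like,
reusing the bookmark API that the patches earlier in this thread remove; the
function name is invented for illustration, and the hard part is everything
above it.

```c
/*
 * Hypothetical sketch only: a single bounded wakeup pass that tells the
 * caller whether waiters remain, so a task-context caller (ksoftirqd,
 * for example) could cond_resched() and call it again.  It reuses the
 * pre-removal bookmark API shown above; the function name is invented.
 */
static bool folio_wake_bit_once(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	wait_queue_entry_t bookmark;
	unsigned long flags;

	key.folio = folio;
	key.bit_nr = bit_nr;
	key.page_match = 0;

	bookmark.flags = 0;
	bookmark.private = NULL;
	bookmark.func = NULL;
	INIT_LIST_HEAD(&bookmark.entry);

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	spin_unlock_irqrestore(&q->lock, flags);

	/* true: the walk stopped early and waiters are still queued */
	return bookmark.flags & WQ_FLAG_BOOKMARK;
}
```

Propagating that return value up through folio_end_writeback() and the bio
completion paths above it, and teaching each caller to retry, is the plumbing
that has no convenient home, which is the point being made above.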
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:13 ` Dan Williams
@ 2022-10-24 20:28 ` Linus Torvalds
  2022-10-24 20:35 ` Dan Williams
  2022-10-25 19:19 ` Matthew Wilcox
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-10-24 20:28 UTC (permalink / raw)
To: Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Arechiga reports that his test case that failed "fast" before now ran
> for 28 hours without a soft lockup report with the proposed patches
> applied. So, I would consider those:
>
> Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>

Ok, great.

I really like that patch myself (and obviously liked it back when it
was originally proposed), but I think it was always held back by the
fact that we didn't really have any hard data for it.

It does sound like we now very much have hard data for "the page
waitlist complexity is now a bigger problem than the historical
problem it tried to solve".

So I'll happily apply it. The only question is whether it's a "let's
do this for 6.2", or if it's something that we'd want to back-port
anyway, and might as well apply sooner rather than later as a fix.

I think that in turn then depends on just how artificial the test case
was. If the test case was triggered by somebody seeing problems in
real life loads, that would make the urgency a lot higher. But if it
was purely a synthetic test case with no accompanying "this is what
made us look at this" problem, it might be a 6.2 thing.

Arechiga?

Also, considering that Willy authored the patch (even if it's really
just a "remove this whole code logic"), maybe he has opinions? Willy?

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:28 ` Linus Torvalds
@ 2022-10-24 20:35 ` Dan Williams
  2022-10-25 15:58 ` Arechiga Lopez, Jesus A
  1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2022-10-24 20:35 UTC (permalink / raw)
To: Linus Torvalds, Dan Williams
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

Linus Torvalds wrote:
> On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Arechiga reports that his test case that failed "fast" before now ran
> > for 28 hours without a soft lockup report with the proposed patches
> > applied. So, I would consider those:
> >
> > Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>
>
> Ok, great.
>
> I really like that patch myself (and obviously liked it back when it
> was originally proposed), but I think it was always held back by the
> fact that we didn't really have any hard data for it.
>
> It does sound like we now very much have hard data for "the page
> waitlist complexity is now a bigger problem than the historical
> problem it tried to solve".
>
> So I'll happily apply it. The only question is whether it's a "let's
> do this for 6.2", or if it's something that we'd want to back-port
> anyway, and might as well apply sooner rather than later as a fix.
>
> I think that in turn then depends on just how artificial the test case
> was. If the test case was triggered by somebody seeing problems in
> real life loads, that would make the urgency a lot higher. But if it
> was purely a synthetic test case with no accompanying "this is what
> made us look at this" problem, it might be a 6.2 thing.
>
> Arechiga?

I will let Arechiga reply as well, but my sense is that this is more in
the latter camp of not urgent because the test case is trying to
generate platform stress (success!), not necessarily trying to get real
work done.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* RE: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:35 ` Dan Williams
@ 2022-10-25 15:58 ` Arechiga Lopez, Jesus A
  0 siblings, 0 replies; 27+ messages in thread
From: Arechiga Lopez, Jesus A @ 2022-10-25 15:58 UTC (permalink / raw)
To: Williams, Dan J, Torvalds, Linus
Cc: Matthew Wilcox, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, tim.c.chen@linux.intel.com, Gross, Mark

> -----Original Message-----
> From: Williams, Dan J <dan.j.williams@intel.com>
> Sent: Monday, October 24, 2022 3:36 PM
> To: Torvalds, Linus <torvalds@linux-foundation.org>; Williams, Dan J
> <dan.j.williams@intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>; Brian Foster
> <bfoster@redhat.com>; Linux-MM <linux-mm@kvack.org>; linux-fsdevel
> <linux-fsdevel@vger.kernel.org>; linux-xfs <linux-xfs@vger.kernel.org>;
> Hugh Dickins <hughd@google.com>; Arechiga Lopez, Jesus A
> <jesus.a.arechiga.lopez@intel.com>; tim.c.chen@linux.intel.com
> Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
>
> Linus Torvalds wrote:
> > On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@intel.com>
> wrote:
> > >
> > > Arechiga reports that his test case that failed "fast" before now
> > > ran for 28 hours without a soft lockup report with the proposed
> > > patches applied. So, I would consider those:
> > >
> > > Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@intel.com>
> >
> > Ok, great.
> >
> > I really like that patch myself (and obviously liked it back when it
> > was originally proposed), but I think it was always held back by the
> > fact that we didn't really have any hard data for it.
> >
> > It does sound like we now very much have hard data for "the page
> > waitlist complexity is now a bigger problem than the historical
> > problem it tried to solve".
> >
> > So I'll happily apply it. The only question is whether it's a "let's
> > do this for 6.2", or if it's something that we'd want to back-port
> > anyway, and might as well apply sooner rather than later as a fix.
> >
> > I think that in turn then depends on just how artificial the test case
> > was. If the test case was triggered by somebody seeing problems in
> > real life loads, that would make the urgency a lot higher. But if it
> > was purely a synthetic test case with no accompanying "this is what
> > made us look at this" problem, it might be a 6.2 thing.
> >
> > Arechiga?
>
> I will let Arechiga reply as well, but my sense is that this is more in the latter
> camp of not urgent because the test case is trying to generate platform
> stress (success!), not necessarily trying to get real work done.

Yes, as Dan mentioned it is trying to generate platform stress. We've
been seeing the soft lockup events on test targets (2 sockets with high
core count CPUs, and a lot of RAM). The workload stresses every core/CPU
thread in various ways and logs results to a shared log file (every core
writing to the same log file). We found that this shared log file was
related to the soft lockups.

With this change applied to 5.19 it seems that the soft lockups are no
longer happening with this workload + configuration.

^ permalink raw reply	[flat|nested] 27+ messages in thread
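For what it's worth, a tiny userspace sketch in the spirit of the workload
described above (many processes appending to, and fsync()ing, one shared log
file so that lots of tasks pile up on the same PG_writeback waitqueue) might
look like the following. The file name and the process/iteration counts are
arbitrary choices for illustration, not the actual test.

```c
/*
 * Rough sketch of a synthetic "shared log file" stressor: NPROC children
 * each append records to the same file and fsync() after every write.
 * Parameters are arbitrary; this is not the reporter's actual workload.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 128
#define ITERS 100000

int main(void)
{
	int fd = open("shared.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
	char buf[256];

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));

	for (int i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			for (int j = 0; j < ITERS; j++) {
				if (write(fd, buf, sizeof(buf)) < 0)
					_exit(1);
				fsync(fd);	/* waits on writeback of the shared file */
			}
			_exit(0);
		}
	}

	/* Reap all children before exiting. */
	while (wait(NULL) > 0)
		;
	return 0;
}
```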
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-24 20:28 ` Linus Torvalds
  2022-10-24 20:35 ` Dan Williams
@ 2022-10-25 19:19 ` Matthew Wilcox
  2022-10-25 19:20 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-10-25 19:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dan Williams, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

On Mon, Oct 24, 2022 at 01:28:31PM -0700, Linus Torvalds wrote:
> It does sound like we now very much have hard data for "the page
> waitlist complexity is now a bigger problem than the historical
> problem it tried to solve".
>
> So I'll happily apply it. The only question is whether it's a "let's
> do this for 6.2", or if it's something that we'd want to back-port
> anyway, and might as well apply sooner rather than later as a fix.
>
> I think that in turn then depends on just how artificial the test case
> was. If the test case was triggered by somebody seeing problems in
> real life loads, that would make the urgency a lot higher. But if it
> was purely a synthetic test case with no accompanying "this is what
> made us look at this" problem, it might be a 6.2 thing.
>
> Arechiga?
>
> Also, considering that Willy authored the patch (even if it's really
> just a "remove this whole code logic"), maybe he has opinions? Willy?

I've been carrying a pair of patches in my tree to rip out the wait
bookmarks since March, waiting for me to write a userspace testcase to
reproduce the problem against v4.13 and then check it no longer does so
against v5.0. Unfortunately, that hasn't happened. I'm happy to add
Arechiga's Tested-by, and submit them to Andrew and have him bring them
into v6.2, since this doesn't seem urgent?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: writeback completion soft lockup BUG in folio_wake_bit()
  2022-10-25 19:19 ` Matthew Wilcox
@ 2022-10-25 19:20 ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-10-25 19:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dan Williams, Brian Foster, Linux-MM, linux-fsdevel, linux-xfs, Hugh Dickins, jesus.a.arechiga.lopez, tim.c.chen

On Tue, Oct 25, 2022 at 12:19 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> I've been carrying a pair of patches in my tree to rip out the wait
> bookmarks since March, waiting for me to write a userspace testcase to
> reproduce the problem against v4.13 and then check it no longer does so
> against v5.0. Unfortunately, that hasn't happened. I'm happy to add
> Arechiga's Tested-by, and submit them to Andrew and have him bring them
> into v6.2, since this doesn't seem urgent?

Ack, sounds good to me.

Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2022-10-25 19:21 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2022-03-15 19:07 writeback completion soft lockup BUG in folio_wake_bit() Brian Foster
2022-03-16 20:59 ` Matthew Wilcox
2022-03-16 23:35 ` Linus Torvalds
2022-03-17 15:04 ` Matthew Wilcox
2022-03-17 19:26 ` Linus Torvalds
2022-03-17 21:16 ` Matthew Wilcox
2022-03-17 22:52 ` Dave Chinner
2022-03-18 13:16 ` Jan Kara
2022-03-18 18:56 ` Linus Torvalds
2022-03-19 16:23 ` Theodore Ts'o
2022-03-30 15:55 ` Christoph Hellwig
2022-03-17 15:31 ` Brian Foster
2022-03-17 13:51 ` Brian Foster
2022-03-18 14:14 ` Brian Foster
2022-03-18 14:45 ` Matthew Wilcox
2022-03-18 18:58 ` Linus Torvalds
2022-10-20  1:35 ` Dan Williams
2022-10-23 22:38 ` Linus Torvalds
2022-10-24 19:39 ` Tim Chen
2022-10-24 19:43 ` Linus Torvalds
2022-10-24 20:14 ` Dan Williams
2022-10-24 20:13 ` Dan Williams
2022-10-24 20:28 ` Linus Torvalds
2022-10-24 20:35 ` Dan Williams
2022-10-25 15:58 ` Arechiga Lopez, Jesus A
2022-10-25 19:19 ` Matthew Wilcox
2022-10-25 19:20 ` Linus Torvalds