From: Matthew Wilcox <willy@infradead.org>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
Date: Wed, 16 Mar 2022 20:59:39 +0000
Message-ID: <YjJPu/3tYnuKK888@casper.infradead.org>
In-Reply-To: <YjDj3lvlNJK/IPiU@bfoster>

On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote:
> What seems to happen is that the majority of the fsync calls end up
> waiting on writeback of a particular page; the wakeup of the writeback
> bit on that page wakes a task that immediately sets PG_writeback on
> the page again, so the N other folio_wait_writeback() waiters see the
> bit still set and immediately place themselves back onto the tail of
> the wait queue.  Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK
> loop in folio_wake_bit() (backing off the lock for a cycle or so in
> each iteration) only to find the same bunch of tasks in the queue.
> This repeats for long enough to trigger the soft lockup warning.  I've
> confirmed this spinning behavior with a tracepoint in the bookmark
> loop showing that we're stuck for at least many hundreds of thousands
> of iterations of this loop by the time the soft lockup warning
> triggers.

[...]

> I've run a few quick experiments to try and corroborate this analysis.
> The problem goes away completely if I either back out the loop change in
> folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
> like 128 (i.e. greater than the total possible number of waiter tasks in
> this test). I've also played a few games with bookmark behavior mostly
> out of curiosity, but usually end up introducing other problems like
> missed wakeups, etc.
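
For context, the waiter side Brian describes is essentially this
(paraphrasing mm/page-writeback.c from memory; the while loop is the
change mentioned above, and folio_wait_bit() puts the waiter back on
the tail of the waitqueue on every pass):

void folio_wait_writeback(struct folio *folio)
{
	while (folio_test_writeback(folio))
		folio_wait_bit(folio, PG_writeback);
}

(WAITQUEUE_WALK_BREAK_CNT is 64 in kernel/sched/wait.c, so bumping it
to 128 as above is enough to sweep all of the waiters in this test in
a single batch.)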

As I recall, the bookmark hack was introduced in order to handle
lock_page() problems.  It wasn't really supposed to handle writeback,
but nobody thought it would cause any harm (and indeed, it didn't at the
time).  So how about we only use bookmarks for lock_page(), since
lock_page() usually doesn't have the multiple-waker semantics that
writeback has?

(This is more in the spirit of a "minimal patch" -- ideally the
initialisation of the bookmark would move into folio_unlock() itself;
see the sketch below the diff.)

diff --git a/mm/filemap.c b/mm/filemap.c
index b2728eb52407..9ee3c5f1f489 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
-static void folio_wake_bit(struct folio *folio, int bit_nr)
+static void folio_wake_bit(struct folio *folio, int bit_nr,
+		wait_queue_entry_t *bookmark)
 {
 	wait_queue_head_t *q = folio_waitqueue(folio);
 	struct wait_page_key key;
 	unsigned long flags;
-	wait_queue_entry_t bookmark;
 
 	key.folio = folio;
 	key.bit_nr = bit_nr;
 	key.page_match = 0;
 
-	bookmark.flags = 0;
-	bookmark.private = NULL;
-	bookmark.func = NULL;
-	INIT_LIST_HEAD(&bookmark.entry);
+	if (bookmark) {
+		bookmark->flags = 0;
+		bookmark->private = NULL;
+		bookmark->func = NULL;
+		INIT_LIST_HEAD(&bookmark->entry);
+	}
 
 	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
+	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
 
-	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
+	while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
 		/*
 		 * Take a breather from holding the lock,
 		 * allow pages that finish wake up asynchronously
@@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
 		spin_unlock_irqrestore(&q->lock, flags);
 		cpu_relax();
 		spin_lock_irqsave(&q->lock, flags);
-		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
+		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
 	}
 
 	/*
@@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit)
 {
 	if (!folio_test_waiters(folio))
 		return;
-	folio_wake_bit(folio, bit);
+	folio_wake_bit(folio, bit, NULL);
 }
 
 /*
@@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
  */
 void folio_unlock(struct folio *folio)
 {
+	wait_queue_entry_t bookmark;
+
 	/* Bit 7 allows x86 to check the byte's sign bit */
 	BUILD_BUG_ON(PG_waiters != 7);
 	BUILD_BUG_ON(PG_locked > 7);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
 	if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
-		folio_wake_bit(folio, PG_locked);
+		folio_wake_bit(folio, PG_locked, &bookmark);
 }
 EXPORT_SYMBOL(folio_unlock);
 
@@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio)
 {
 	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
 	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
-	folio_wake_bit(folio, PG_private_2);
+	folio_wake_bit(folio, PG_private_2, NULL);
 	folio_put(folio);
 }
 EXPORT_SYMBOL(folio_end_private_2);
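
For reference, the less-minimal version mentioned above would look
something like this (just a sketch, not compile-tested), with
folio_wake_bit() then dropping the "if (bookmark)" initialisation
block entirely:

void folio_unlock(struct folio *folio)
{
	/*
	 * Initialise the bookmark at the caller; only the lock_page()
	 * path bothers with one at all, so folio_wake_bit() can use it
	 * as passed in.
	 */
	wait_queue_entry_t bookmark = {
		.flags = 0,
		.private = NULL,
		.func = NULL,
		.entry = LIST_HEAD_INIT(bookmark.entry),
	};

	/* Bit 7 allows x86 to check the byte's sign bit */
	BUILD_BUG_ON(PG_waiters != 7);
	BUILD_BUG_ON(PG_locked > 7);
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

	if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
		folio_wake_bit(folio, PG_locked, &bookmark);
}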
