* [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads
@ 2025-04-10 1:49 Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
` (7 more replies)
0 siblings, 8 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
We have an eyesore of a spin lock held during page migration which
was added for an ext4 jbd2 corruption fix for which we have no clear
public corruption data. We want to remove the spin lock in mm/migrate
so as to help buffer-head filesystems embrace large folios, since we
can cond_resched() on large folios in folio_mc_copy(). We've managed
to reproduce a corruption by just removing the spinlock and stressing
ext4 with generic/750; the corruption shows up after about 3 hours and
we have hit it many times.
However, while developing an alternative fix based on feedback [0], we've
come to the conclusion that ext4 on vanilla Linux is still affected: we
still have a lingering jbd2 corruption issue.
The underlying race is in jbd2’s use of buffer_migrate_folio_norefs() without
proper synchronization, making it unsafe during folio migration.
ext4 uses jbd2 as its journaling backend. The corruption surfaces in ext4's
metadata operations, like ext4_ext_insert_extent(), when journal metadata fails
to be marked dirty due to the migration race. This leads to ENOSPC, journal
aborts, read-only fallback, and long-term filesystem corruption seen in replay
logs and "error count since last fsck".
This series simply skips folio migration on jbd2 metadata buffers to avoid
races during journal writes that can lead to filesystem corruption, while
paving the way for jbd2 to eventually overcome this limitation and enable
folio migration. It also implements some of the suggested enhancements to
__find_get_block_slow(): the suggested trylock idea is implemented, thereby
potentially reducing i_private_lock contention by leveraging folio_trylock()
when allowed.
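For reference, the lookup scheme implemented in the second patch boils down
to the following condensed sketch (locking logic only, the buffer walk is
elided; see the patch itself for the full function):

static struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block, bool atomic)
{
	struct address_space *bd_mapping = bdev->bd_mapping;
	const int blkbits = bd_mapping->host->i_blkbits;
	struct buffer_head *ret = NULL;
	pgoff_t index = ((loff_t)block << blkbits) / PAGE_SIZE;
	struct folio *folio;
	bool folio_locked = true;

	folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
	if (IS_ERR(folio))
		return NULL;

	/*
	 * The folio lock protects the buffers. Callers that cannot block
	 * fall back to serializing against try_to_free_buffers() via
	 * i_private_lock, as before.
	 */
	if (!folio_trylock(folio)) {
		if (atomic) {
			spin_lock(&bd_mapping->i_private_lock);
			folio_locked = false;
		} else {
			folio_lock(folio);
		}
	}

	/* ... walk folio_buffers(folio) looking for the block ... */

	if (folio_locked)
		folio_unlock(folio);
	else
		spin_unlock(&bd_mapping->i_private_lock);
	folio_put(folio);
	return ret;
}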
The first patch is intended to go through Linus' tree, if agreeable, and then
the rest can be evaluated for fs-next. Although I did not intend to upstream
the debugfs interface, at this point I'm convinced the statistics are extremely
useful while enhancing this path, and should also prove useful in enhancing and
eventually enabling folio migration on jbd2 metadata buffers.
If you want this in tree form, see 20250409-bh-meta-migrate-optimal [1].
[0] https://lore.kernel.org/all/20250330064732.3781046-3-mcgrof@kernel.org/T/#mf2fb79c9ab0d20fab65c65142b7f53680e68d8fa
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20250409-bh-meta-migrate-optimal
Changes on v2:
- replace the heuristic with a buffer_meta() check, as we're convinced the
corruption issue still exists and jbd2 metadata buffers still need work
to enable folio migration
- implement community feedback and performance suggestions on code,
paving the way to eventually enable jbd2 metadata buffers to leverage
folio migration
- add a debugfs interface
Davidlohr Bueso (6):
fs/buffer: try to use folio lock for pagecache lookups
fs/buffer: introduce __find_get_block_nonatomic()
fs/ocfs2: use sleeping version of __find_get_block()
fs/jbd2: use sleeping version of __find_get_block()
fs/ext4: use sleeping version of __find_get_block()
mm/migrate: enable noref migration for jbd2
Luis Chamberlain (2):
migrate: fix skipping metadata buffer heads on migration
mm: add migration buffer-head debugfs interface
fs/buffer.c | 76 ++++++++++----
fs/ext4/ialloc.c | 3 +-
fs/ext4/inode.c | 2 +
fs/ext4/mballoc.c | 3 +-
fs/jbd2/revoke.c | 15 +--
fs/ocfs2/journal.c | 2 +-
include/linux/buffer_head.h | 9 ++
mm/migrate.c | 192 ++++++++++++++++++++++++++++++++++--
8 files changed, 266 insertions(+), 36 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 3:18 ` Matthew Wilcox
` (2 more replies)
2025-04-10 1:49 ` [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups Luis Chamberlain
` (6 subsequent siblings)
7 siblings, 3 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof, syzbot+f3c6fda1297c748a7076
Filesystems which use buffer-heads and cannot guarantee that there
are no other references to the folio, for example via a folio
lock, must use buffer_migrate_folio_norefs() for the address space
mapping's migrate_folio() callback. There are only 3 filesystems which use
this callback:
1) the block device cache
2) ext4 for its ext4_journalled_aops, ie, jbd2
3) nilfs2
jbd2's use of this callback, however, is very race prone: consider
folio migration while reviewing jbd2_journal_write_metadata_buffer()
and the fact that jbd2:
- does not hold the folio lock
- does not have the page writeback bit set
- does not lock the buffer
And so, it can race with folio_set_bh() on folio migration. The commit
ebdf4de5642fb6 ("mm: migrate: fix reference check race between
__find_get_block() and migration") added a spin lock to prevent races
with page migration which ext4 users were reporting through the SUSE
bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
original filesystem corruption, we can reproduce fs corruption on
ext4 by just removing the spinlock and stress testing the filesystem
with generic/750; corruption eventually shows up after about 3 hours of
testing with kdevops using libvirt on the ext4-4k and ext4-2k profiles.
It turns out that the spin lock doesn't, in the end, protect against
corruption; it *helps* reduce the possibility, but ext4 filesystem
corruption can still happen even with the spin lock held. A test was
done using vanilla Linux and adding a udelay(2000) right before the
spin_lock(&bd_mapping->i_private_lock) in __find_get_block_slow(), and
we can reproduce the exact same filesystem corruption issues as observed
without the spinlock with generic/750 [1].
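For reference, the debug instrumentation used in [1] amounts to roughly the
following in __find_get_block_slow() (sketch only, not the exact commit, and
it also needs a <linux/delay.h> include):

 	if (IS_ERR(folio))
 		goto out;
 
+	/* debug only: widen the race window against folio migration */
+	udelay(2000);
 	spin_lock(&bd_mapping->i_private_lock);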
We now have a slew of traces collected for the ext4 corruptions possible
without the current spin lock held [2] [3] [4], but the general pattern is
as follows, as summarized by ChatGPT from all the traces:
do_writepages() # write back -->
ext4_map_blocks() # performs logical to physical block mapping -->
ext4_ext_insert_extent() # updates extent tree -->
jbd2_journal_dirty_metadata() # marks metadata as dirty for
# journaling. This can lead
# to any of the following hints
# as to what happened from
# ext4 / jbd2
- Directory and extent metadata corruption splats, or
- Failure to handle out-of-space conditions gracefully, with
cascading metadata errors and eventual filesystem shutdown
to prevent further damage.
- Failure to journal new extent metadata during extent tree
growth, triggered under memory pressure or heavy writeback.
Commonly results in ENOSPC, journal abort, and read-only
fallback. **
- Journal metadata failure during extent tree growth causes
read-only fallback. Seen repeatedly on small-block (2k)
filesystems under stress (e.g. fsstress). Triggers errors in
bitmap and inode updates, and persists in journal replay logs.
"Error count since last fsck" shows long-term corruption
footprint.
** Reproduced on vanilla Linux with udelay(2000) **
Call trace (ENOSPC journal failure):
do_writepages()
→ ext4_do_writepages()
→ ext4_map_blocks()
→ ext4_ext_map_blocks()
→ ext4_ext_insert_extent()
→ __ext4_handle_dirty_metadata()
→ jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
And so jbd2 still needs more work to avoid races with folio migration.
So replace the current spin lock solution by just skipping jbd2 buffers
on folio migration. We identify jbd2 buffers through buffer_meta(), as
__ext4_handle_dirty_metadata() is the only user of set_buffer_meta() on
them. By checking for buffer_meta() and bailing on migration we fix the
existing racy ext4 corruption, remove the spin-lock-held-while-sleeping
complaints originally reported by 0-day [5], and pave the way for more
buffer-head users of large folios beyond the block device cache.
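For context, BH_Meta is set on these buffers in ext4's
__ext4_handle_dirty_metadata() before they are handed to jbd2, roughly as
follows (abridged sketch, not the verbatim code; see fs/ext4/ext4_jbd2.c
for the real function):

int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
				 handle_t *handle, struct inode *inode,
				 struct buffer_head *bh)
{
	int err = 0;

	set_buffer_meta(bh);	/* the bit buffer_meta() checks above */
	set_buffer_prio(bh);
	if (ext4_handle_valid(handle)) {
		err = jbd2_journal_dirty_metadata(handle, bh);
		/* error handling elided */
	} else {
		/* no-journal mode: mark the buffer dirty directly (elided) */
	}
	return err;
}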
Link: https://bugzilla.suse.com/show_bug.cgi?id=1137609 # [0]
Link: https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250408-ext4-jbd-force-corruption # [1]
Link: https://lkml.kernel.org/r/Z-ZwToVfJbdTVRtG@bombadil.infradead.org # [2]
Link: https://lore.kernel.org/all/Z-rzyrS0Jr7t984Y@bombadil.infradead.org/ # [3]
Link: https://lore.kernel.org/all/Z_AA9SHZdRGq6tb4@bombadil.infradead.org/ # [4]
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com # [5]
Fixes: ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/migrate.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..32fa72ba10b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
if (folio_ref_count(src) != expected_count)
return -EAGAIN;
+ if (buffer_meta(head))
+ return -EAGAIN;
+
if (!buffer_migrate_lock_buffers(head, mode))
return -EAGAIN;
@@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
}
bh = bh->b_this_page;
} while (bh != head);
+ spin_unlock(&mapping->i_private_lock);
if (busy) {
if (invalidated) {
rc = -EAGAIN;
goto unlock_buffers;
}
- spin_unlock(&mapping->i_private_lock);
invalidate_bh_lrus();
invalidated = true;
goto recheck_buffers;
@@ -880,10 +883,7 @@ static int __buffer_migrate_folio(struct address_space *mapping,
folio_set_bh(bh, dst, bh_offset(bh));
bh = bh->b_this_page;
} while (bh != head);
-
unlock_buffers:
- if (check_refs)
- spin_unlock(&mapping->i_private_lock);
bh = head;
do {
unlock_buffer(bh);
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 14:38 ` Jan Kara
2025-04-10 1:49 ` [PATCH v2 3/8] fs/buffer: introduce __find_get_block_nonatomic() Luis Chamberlain
` (5 subsequent siblings)
7 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
From: Davidlohr Bueso <dave@stgolabs.net>
Callers of __find_get_block() may or may not allow for blocking
semantics, and it is currently assumed that they do not. Lay out
two paths based on this. Ultimately the i_private_lock scheme will
be used as a fallback in non-blocking contexts; otherwise
always take the folio lock instead. The suggested trylock idea
is implemented, thereby potentially reducing i_private_lock
contention in addition to later enabling migration support
for large folios and noref migration.
No change in semantics. All lookup users are non-blocking.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 45 +++++++++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 16 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index c7abb4a029dc..5a1a37a6840a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -176,18 +176,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
}
EXPORT_SYMBOL(end_buffer_write_sync);
-/*
- * Various filesystems appear to want __find_get_block to be non-blocking.
- * But it's the page lock which protects the buffers. To get around this,
- * we get exclusion from try_to_free_buffers with the blockdev mapping's
- * i_private_lock.
- *
- * Hack idea: for the blockdev mapping, i_private_lock contention
- * may be quite high. This code could TryLock the page, and if that
- * succeeds, there is no need to take i_private_lock.
- */
static struct buffer_head *
-__find_get_block_slow(struct block_device *bdev, sector_t block)
+__find_get_block_slow(struct block_device *bdev, sector_t block, bool atomic)
{
struct address_space *bd_mapping = bdev->bd_mapping;
const int blkbits = bd_mapping->host->i_blkbits;
@@ -197,6 +187,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
struct buffer_head *head;
struct folio *folio;
int all_mapped = 1;
+ bool folio_locked = true;
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
index = ((loff_t)block << blkbits) / PAGE_SIZE;
@@ -204,7 +195,19 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
if (IS_ERR(folio))
goto out;
- spin_lock(&bd_mapping->i_private_lock);
+ /*
+ * Folio lock protects the buffers. Callers that cannot block
+ * will fallback to serializing vs try_to_free_buffers() via
+ * the i_private_lock.
+ */
+ if (!folio_trylock(folio)) {
+ if (atomic) {
+ spin_lock(&bd_mapping->i_private_lock);
+ folio_locked = false;
+ } else
+ folio_lock(folio);
+ }
+
head = folio_buffers(folio);
if (!head)
goto out_unlock;
@@ -236,7 +239,10 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
1 << blkbits);
}
out_unlock:
- spin_unlock(&bd_mapping->i_private_lock);
+ if (folio_locked)
+ folio_unlock(folio);
+ else
+ spin_unlock(&bd_mapping->i_private_lock);
folio_put(folio);
out:
return ret;
@@ -1388,14 +1394,15 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
* it in the LRU and mark it as accessed. If it is not present then return
* NULL
*/
-struct buffer_head *
-__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
+static struct buffer_head *
+find_get_block_common(struct block_device *bdev, sector_t block,
+ unsigned size, bool atomic)
{
struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
if (bh == NULL) {
/* __find_get_block_slow will mark the page accessed */
- bh = __find_get_block_slow(bdev, block);
+ bh = __find_get_block_slow(bdev, block, atomic);
if (bh)
bh_lru_install(bh);
} else
@@ -1403,6 +1410,12 @@ __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
return bh;
}
+
+struct buffer_head *
+__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
+{
+ return find_get_block_common(bdev, block, size, true);
+}
EXPORT_SYMBOL(__find_get_block);
/**
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 3/8] fs/buffer: introduce __find_get_block_nonatomic()
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 4/8] fs/ocfs2: use sleeping version of __find_get_block() Luis Chamberlain
` (4 subsequent siblings)
7 siblings, 0 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
From: Davidlohr Bueso <dave@stgolabs.net>
Callers that do not require atomic context for pagecache
lookups can use this flavor to avoid the unnecessary
contention on the blockdev mapping's i_private_lock as well
as waiting on noref migration.
Convert the local caller write_boundary_block(), which already
takes the buffer lock, as well as bdev_getblk(), depending
on the respective gfp flags. Either way there are currently
no changes in semantics.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 19 +++++++++++++++++--
include/linux/buffer_head.h | 2 ++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 5a1a37a6840a..07ec57ec100e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -662,7 +662,8 @@ EXPORT_SYMBOL(generic_buffers_fsync);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize)
{
- struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);
+ struct buffer_head *bh = __find_get_block_nonatomic(bdev, bblock + 1,
+ blocksize);
if (bh) {
if (buffer_dirty(bh))
write_dirty_buffer(bh, 0);
@@ -1418,6 +1419,15 @@ __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
}
EXPORT_SYMBOL(__find_get_block);
+/* same as __find_get_block() but allows sleeping contexts */
+struct buffer_head *
+__find_get_block_nonatomic(struct block_device *bdev, sector_t block,
+ unsigned size)
+{
+ return find_get_block_common(bdev, block, size, false);
+}
+EXPORT_SYMBOL(__find_get_block_nonatomic);
+
/**
* bdev_getblk - Get a buffer_head in a block device's buffer cache.
* @bdev: The block device.
@@ -1435,7 +1445,12 @@ EXPORT_SYMBOL(__find_get_block);
struct buffer_head *bdev_getblk(struct block_device *bdev, sector_t block,
unsigned size, gfp_t gfp)
{
- struct buffer_head *bh = __find_get_block(bdev, block, size);
+ struct buffer_head *bh;
+
+ if (gfpflags_allow_blocking(gfp))
+ bh = __find_get_block_nonatomic(bdev, block, size);
+ else
+ bh = __find_get_block(bdev, block, size);
might_alloc(gfp);
if (bh)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index f0a4ad7839b6..2b5458517def 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -222,6 +222,8 @@ void __wait_on_buffer(struct buffer_head *);
wait_queue_head_t *bh_waitq_head(struct buffer_head *bh);
struct buffer_head *__find_get_block(struct block_device *bdev, sector_t block,
unsigned size);
+struct buffer_head *__find_get_block_nonatomic(struct block_device *bdev,
+ sector_t block, unsigned size);
struct buffer_head *bdev_getblk(struct block_device *bdev, sector_t block,
unsigned size, gfp_t gfp);
void __brelse(struct buffer_head *);
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 4/8] fs/ocfs2: use sleeping version of __find_get_block()
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
` (2 preceding siblings ...)
2025-04-10 1:49 ` [PATCH v2 3/8] fs/buffer: introduce __find_get_block_nonatomic() Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 5/8] fs/jbd2: " Luis Chamberlain
` (3 subsequent siblings)
7 siblings, 0 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
From: Davidlohr Bueso <dave@stgolabs.net>
This is a path that allows for blocking as it does IO. Convert
to the new nonatomic flavor to benefit from potential performance
gains and to adapt to future migration changes while keeping the
current semantics.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/ocfs2/journal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index f1b4b3e611cb..c7a9729dc9d0 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1249,7 +1249,7 @@ static int ocfs2_force_read_journal(struct inode *inode)
}
for (i = 0; i < p_blocks; i++, p_blkno++) {
- bh = __find_get_block(osb->sb->s_bdev, p_blkno,
+ bh = __find_get_block_nonatomic(osb->sb->s_bdev, p_blkno,
osb->sb->s_blocksize);
/* block not cached. */
if (!bh)
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 5/8] fs/jbd2: use sleeping version of __find_get_block()
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
` (3 preceding siblings ...)
2025-04-10 1:49 ` [PATCH v2 4/8] fs/ocfs2: use sleeping version of __find_get_block() Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 6/8] fs/ext4: " Luis Chamberlain
` (2 subsequent siblings)
7 siblings, 0 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
From: Davidlohr Bueso <dave@stgolabs.net>
Convert to the new nonatomic flavor to benefit from potential
performance gains and to adapt to future migration changes while
keeping the current semantics. All converted callers can sleep:
- jbd2_journal_revoke(): can sleep (it has a might_sleep() at the beginning)
- jbd2_journal_cancel_revoke(): only used from do_get_write_access() and
do_get_create_access(), which do sleep. So it can sleep.
- jbd2_clear_buffer_revoked_flags(): only called from journal commit code,
which sleeps. So it can sleep.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/jbd2/revoke.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 0cf0fddbee81..1467f6790747 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -345,7 +345,8 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
bh = bh_in;
if (!bh) {
- bh = __find_get_block(bdev, blocknr, journal->j_blocksize);
+ bh = __find_get_block_nonatomic(bdev, blocknr,
+ journal->j_blocksize);
if (bh)
BUFFER_TRACE(bh, "found on hash");
}
@@ -355,7 +356,8 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
/* If there is a different buffer_head lying around in
* memory anywhere... */
- bh2 = __find_get_block(bdev, blocknr, journal->j_blocksize);
+ bh2 = __find_get_block_nonatomic(bdev, blocknr,
+ journal->j_blocksize);
if (bh2) {
/* ... and it has RevokeValid status... */
if (bh2 != bh && buffer_revokevalid(bh2))
@@ -464,7 +466,8 @@ void jbd2_journal_cancel_revoke(handle_t *handle, struct journal_head *jh)
* state machine will get very upset later on. */
if (need_cancel) {
struct buffer_head *bh2;
- bh2 = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);
+ bh2 = __find_get_block_nonatomic(bh->b_bdev, bh->b_blocknr,
+ bh->b_size);
if (bh2) {
if (bh2 != bh)
clear_buffer_revoked(bh2);
@@ -492,9 +495,9 @@ void jbd2_clear_buffer_revoked_flags(journal_t *journal)
struct jbd2_revoke_record_s *record;
struct buffer_head *bh;
record = (struct jbd2_revoke_record_s *)list_entry;
- bh = __find_get_block(journal->j_fs_dev,
- record->blocknr,
- journal->j_blocksize);
+ bh = __find_get_block_nonatomic(journal->j_fs_dev,
+ record->blocknr,
+ journal->j_blocksize);
if (bh) {
clear_buffer_revoked(bh);
__brelse(bh);
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 6/8] fs/ext4: use sleeping version of __find_get_block()
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
` (4 preceding siblings ...)
2025-04-10 1:49 ` [PATCH v2 5/8] fs/jbd2: " Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 13:36 ` Jan Kara
2025-04-10 1:49 ` [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2 Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 8/8] mm: add migration buffer-head debugfs interface Luis Chamberlain
7 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
From: Davidlohr Bueso <dave@stgolabs.net>
Trivially introduce the sb_find_get_block_nonatomic() wrapper and enable
ext4_free_blocks(), which has a cond_resched() to begin with, to use it.
Convert to the new nonatomic flavor to benefit from potential performance
gains and to adapt to future migration changes while keeping the current
semantics.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/ext4/inode.c | 2 ++
fs/ext4/mballoc.c | 3 ++-
include/linux/buffer_head.h | 6 ++++++
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1dc09ed5d403..b7acb5d3adcb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -860,6 +860,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
return sb_find_get_block(inode->i_sb, map.m_pblk);
/*
+ * Potential TODO: use sb_find_get_block_nonatomic() instead.
+ *
* Since bh could introduce extra ref count such as referred by
* journal_head etc. Try to avoid using __GFP_MOVABLE here
* as it may fail the migration when journal_head remains.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0d523e9fb3d5..6f4265b21e19 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6644,7 +6644,8 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
for (i = 0; i < count; i++) {
cond_resched();
if (is_metadata)
- bh = sb_find_get_block(inode->i_sb, block + i);
+ bh = sb_find_get_block_nonatomic(inode->i_sb,
+ block + i);
ext4_forget(handle, is_metadata, inode, bh, block + i);
}
}
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 2b5458517def..8db10ca288fc 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -399,6 +399,12 @@ sb_find_get_block(struct super_block *sb, sector_t block)
return __find_get_block(sb->s_bdev, block, sb->s_blocksize);
}
+static inline struct buffer_head *
+sb_find_get_block_nonatomic(struct super_block *sb, sector_t block)
+{
+ return __find_get_block_nonatomic(sb->s_bdev, block, sb->s_blocksize);
+}
+
static inline void
map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
{
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
` (5 preceding siblings ...)
2025-04-10 1:49 ` [PATCH v2 6/8] fs/ext4: " Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
2025-04-10 13:40 ` Jan Kara
2025-04-10 1:49 ` [PATCH v2 8/8] mm: add migration buffer-head debugfs interface Luis Chamberlain
7 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
From: Davidlohr Bueso <dave@stgolabs.net>
Add semantics to enable future optimizations for noref migration of jbd2
buffer heads. This adds a new BH_Migrate flag so that the lookup path can
detect a buffer head undergoing noref migration and bail out. This should
let jbd2 know when a buffer head is under folio migration, yield to it, and
eventually allow removing the buffer_meta() check that currently skips jbd2
folio migration.
Suggested-by: Jan Kara <jack@suse.cz>
Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 12 +++++++++++-
fs/ext4/ialloc.c | 3 ++-
include/linux/buffer_head.h | 1 +
mm/migrate.c | 4 ++++
4 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 07ec57ec100e..753fc45667da 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -211,6 +211,15 @@ __find_get_block_slow(struct block_device *bdev, sector_t block, bool atomic)
head = folio_buffers(folio);
if (!head)
goto out_unlock;
+ /*
+ * Upon a noref migration, the folio lock serializes here;
+ * otherwise bail.
+ */
+ if (test_bit_acquire(BH_Migrate, &head->b_state)) {
+ WARN_ON(folio_locked);
+ goto out_unlock;
+ }
+
bh = head;
do {
if (!buffer_mapped(bh))
@@ -1393,7 +1402,8 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
/*
* Perform a pagecache lookup for the matching buffer. If it's there, refresh
* it in the LRU and mark it as accessed. If it is not present then return
- * NULL
+ * NULL. Atomic context callers may also return NULL if the buffer is being
+ * migrated; similarly the page is not marked accessed either.
*/
static struct buffer_head *
find_get_block_common(struct block_device *bdev, sector_t block,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 38bc8d74f4cc..e7ecc7c8a729 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -691,7 +691,8 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
if (!bh || !buffer_uptodate(bh))
/*
* If the block is not in the buffer cache, then it
- * must have been written out.
+ * must have been written out, or, most unlikely, is
+ * being migrated - false failure should be OK here.
*/
goto out;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 8db10ca288fc..5559b906b1de 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -34,6 +34,7 @@ enum bh_state_bits {
BH_Meta, /* Buffer contains metadata */
BH_Prio, /* Buffer should be submitted with REQ_PRIO */
BH_Defer_Completion, /* Defer AIO completion to workqueue */
+ BH_Migrate, /* Buffer is being migrated (norefs) */
BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
diff --git a/mm/migrate.c b/mm/migrate.c
index 32fa72ba10b4..8fed2655f2e8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -851,6 +851,8 @@ static int __buffer_migrate_folio(struct address_space *mapping,
bool busy;
bool invalidated = false;
+ VM_WARN_ON_ONCE(test_and_set_bit_lock(BH_Migrate,
+ &head->b_state));
recheck_buffers:
busy = false;
spin_lock(&mapping->i_private_lock);
@@ -884,6 +886,8 @@ static int __buffer_migrate_folio(struct address_space *mapping,
bh = bh->b_this_page;
} while (bh != head);
unlock_buffers:
+ if (check_refs)
+ clear_bit_unlock(BH_Migrate, &head->b_state);
bh = head;
do {
unlock_buffer(bh);
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 8/8] mm: add migration buffer-head debugfs interface
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
` (6 preceding siblings ...)
2025-04-10 1:49 ` [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2 Luis Chamberlain
@ 2025-04-10 1:49 ` Luis Chamberlain
7 siblings, 0 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-10 1:49 UTC (permalink / raw)
To: brauner, jack, tytso, adilger.kernel, linux-ext4, riel
Cc: dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, mcgrof
If you are working on enhancing folio migration it is easy to be
uncertain whether a change actually helps. This debugfs interface enables
you to evaluate gains from improvements to buffer-head folio migration.
It can easily tell you *why* folio migration might fail. For example,
here is the output after a generic/750 run of 18 hours:
root@e3-ext4-2k ~ # cat /sys/kernel/debug/mm/migrate/bh/stats
[buffer_migrate_folio]
calls 50160811
success 50047572
fails 113239
[buffer_migrate_folio_norefs]
calls 23577082468
success 2939858
fails 23574142610
jbd-meta 23425956714
no-head-success 102
no-head-fails 0
invalid 147919982
valid 2939881
valid-success 2939756
valid-fails 125
Success ratios:
buffer_migrate_folio: 99% success (50047572/50160811)
buffer_migrate_folio_norefs: 0% success (2939858/23577082468)
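Individual counters are also exposed as separate read-only debugfs files,
one per routine directory, so a single statistic can be sampled on its own.
For example (path derived from the debugfs layout in this patch):
root@e3-ext4-2k ~ # cat /sys/kernel/debug/mm/migrate/bh/buffer_migrate_folio_norefs/jbd-meta
23425956714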
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/migrate.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 178 insertions(+), 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 8fed2655f2e8..c478e8218cb0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -44,6 +44,7 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/pagewalk.h>
+#include <linux/debugfs.h>
#include <asm/tlbflush.h>
@@ -791,6 +792,126 @@ int migrate_folio(struct address_space *mapping, struct folio *dst,
EXPORT_SYMBOL(migrate_folio);
#ifdef CONFIG_BUFFER_HEAD
+
+static const char * const bh_routine_names[] = {
+ "buffer_migrate_folio",
+ "buffer_migrate_folio_norefs",
+};
+
+#define BH_STATS(X) \
+ X(bh_migrate_folio, 0, "calls") \
+ X(bh_migrate_folio_success, 0, "success") \
+ X(bh_migrate_folio_fails, 0, "fails") \
+ X(bh_migrate_folio_norefs, 1, "calls") \
+ X(bh_migrate_folio_norefs_success, 1, "success") \
+ X(bh_migrate_folio_norefs_fails, 1, "fails") \
+ X(bh_migrate_folio_norefs_meta, 1, "jbd-meta") \
+ X(bh_migrate_folio_norefs_nohead_success, 1, "no-head-success") \
+ X(bh_migrate_folio_norefs_nohead_fails, 1, "no-head-fails") \
+ X(bh_migrate_folio_norefs_invalid, 1, "invalid") \
+ X(bh_migrate_folio_norefs_valid, 1, "valid") \
+ X(bh_migrate_folio_norefs_valid_success, 1, "valid-success") \
+ X(bh_migrate_folio_norefs_valid_fails, 1, "valid-fails")
+
+
+#define DECLARE_STAT(name, routine_idx, meaning) static atomic_long_t name;
+BH_STATS(DECLARE_STAT)
+
+#define BH_STAT_PTR(name, routine_idx, meaning) &name,
+static atomic_long_t * const bh_stat_array[] = {
+ BH_STATS(BH_STAT_PTR)
+};
+
+#define BH_STAT_ROUTINE_IDX(name, routine_idx, meaning) routine_idx,
+static const int bh_stat_routine_index[] = {
+ BH_STATS(BH_STAT_ROUTINE_IDX)
+};
+
+#define BH_STAT_MEANING(name, routine_idx, meaning) meaning,
+static const char * const bh_stat_meanings[] = {
+ BH_STATS(BH_STAT_MEANING)
+};
+
+#define NUM_BH_STATS ARRAY_SIZE(bh_stat_array)
+
+static ssize_t read_file_bh_migrate_stats(struct file *file,
+ char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ char *buf;
+ unsigned int i, len = 0, size = NUM_BH_STATS * 128;
+ int ret, last_routine = -1;
+ unsigned long total, success, rate;
+
+ BUILD_BUG_ON(ARRAY_SIZE(bh_stat_array) != ARRAY_SIZE(bh_stat_meanings));
+
+ buf = kzalloc(size, GFP_KERNEL);
+ if (buf == NULL)
+ return -ENOMEM;
+
+ for (i = 0; i < NUM_BH_STATS; i++) {
+ int routine_idx = bh_stat_routine_index[i];
+
+ if (routine_idx != last_routine) {
+ len += scnprintf(buf + len, size - len, "\n[%s]\n",
+ bh_routine_names[routine_idx]);
+ last_routine = routine_idx;
+ }
+
+ len += scnprintf(buf + len, size - len, "%25s\t%lu\n",
+ bh_stat_meanings[i],
+ atomic_long_read(bh_stat_array[i]));
+
+ }
+
+ len += scnprintf(buf + len, size - len, "\nSuccess ratios:\n");
+
+ total = atomic_long_read(&bh_migrate_folio);
+ success = atomic_long_read(&bh_migrate_folio_success);
+ rate = total ? (success * 100) / total : 0;
+ len += scnprintf(buf + len, size - len,
+ "%s: %lu%% success (%lu/%lu)\n",
+ "buffer_migrate_folio", rate, success, total);
+
+ total = atomic_long_read(&bh_migrate_folio_norefs);
+ success = atomic_long_read(&bh_migrate_folio_norefs_success);
+ rate = total ? (success * 100) / total : 0;
+ len += scnprintf(buf + len, size - len,
+ "%s: %lu%% success (%lu/%lu)\n",
+ "buffer_migrate_folio_norefs", rate, success, total);
+
+ ret = simple_read_from_buffer(user_buf, count, ppos, buf, len);
+ kfree(buf);
+ return ret;
+}
+
+static const struct file_operations fops_bh_migrate_stats = {
+ .read = read_file_bh_migrate_stats,
+ .open = simple_open,
+ .owner = THIS_MODULE,
+ .llseek = default_llseek,
+};
+
+static void mm_migrate_bh_init(struct dentry *migrate_debug_root)
+{
+ struct dentry *parent_dirs[ARRAY_SIZE(bh_routine_names)] = { NULL };
+ struct dentry *root = debugfs_create_dir("bh", migrate_debug_root);
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(bh_routine_names); i++)
+ parent_dirs[i] = debugfs_create_dir(bh_routine_names[i], root);
+
+ for (i = 0; i < NUM_BH_STATS; i++) {
+ int routine = bh_stat_routine_index[i];
+ debugfs_create_ulong(bh_stat_meanings[i], 0400,
+ parent_dirs[routine],
+ (unsigned long *)
+ &bh_stat_array[i]->counter);
+ }
+
+ debugfs_create_file("stats", 0400, root, root, &fops_bh_migrate_stats);
+}
+
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
enum migrate_mode mode)
@@ -833,16 +954,26 @@ static int __buffer_migrate_folio(struct address_space *mapping,
int expected_count;
head = folio_buffers(src);
- if (!head)
- return migrate_folio(mapping, dst, src, mode);
+ if (!head) {
+ rc = migrate_folio(mapping, dst, src, mode);
+ if (check_refs) {
+ if (rc == 0)
+ atomic_long_inc(&bh_migrate_folio_norefs_nohead_success);
+ else
+ atomic_long_inc(&bh_migrate_folio_norefs_nohead_fails);
+ }
+ return rc;
+ }
/* Check whether page does not have extra refs before we do more work */
expected_count = folio_expected_refs(mapping, src);
if (folio_ref_count(src) != expected_count)
return -EAGAIN;
- if (buffer_meta(head))
+ if (buffer_meta(head)) {
+ atomic_long_inc(&bh_migrate_folio_norefs_meta);
return -EAGAIN;
+ }
if (!buffer_migrate_lock_buffers(head, mode))
return -EAGAIN;
@@ -868,17 +999,23 @@ static int __buffer_migrate_folio(struct address_space *mapping,
if (busy) {
if (invalidated) {
rc = -EAGAIN;
+ atomic_long_inc(&bh_migrate_folio_norefs_invalid);
goto unlock_buffers;
}
invalidate_bh_lrus();
invalidated = true;
goto recheck_buffers;
}
+ atomic_long_inc(&bh_migrate_folio_norefs_valid);
}
rc = filemap_migrate_folio(mapping, dst, src, mode);
- if (rc != MIGRATEPAGE_SUCCESS)
+ if (rc != MIGRATEPAGE_SUCCESS) {
+ if (check_refs)
+ atomic_long_inc(&bh_migrate_folio_norefs_valid_fails);
goto unlock_buffers;
+ } else if (check_refs)
+ atomic_long_inc(&bh_migrate_folio_norefs_valid_success);
bh = head;
do {
@@ -915,7 +1052,16 @@ static int __buffer_migrate_folio(struct address_space *mapping,
int buffer_migrate_folio(struct address_space *mapping,
struct folio *dst, struct folio *src, enum migrate_mode mode)
{
- return __buffer_migrate_folio(mapping, dst, src, mode, false);
+ int ret;
+ atomic_long_inc(&bh_migrate_folio);
+
+ ret = __buffer_migrate_folio(mapping, dst, src, mode, false);
+ if (ret == 0)
+ atomic_long_inc(&bh_migrate_folio_success);
+ else
+ atomic_long_inc(&bh_migrate_folio_fails);
+
+ return ret;
}
EXPORT_SYMBOL(buffer_migrate_folio);
@@ -936,9 +1082,21 @@ EXPORT_SYMBOL(buffer_migrate_folio);
int buffer_migrate_folio_norefs(struct address_space *mapping,
struct folio *dst, struct folio *src, enum migrate_mode mode)
{
- return __buffer_migrate_folio(mapping, dst, src, mode, true);
+ int ret;
+
+ atomic_long_inc(&bh_migrate_folio_norefs);
+
+ ret = __buffer_migrate_folio(mapping, dst, src, mode, true);
+ if (ret == 0)
+ atomic_long_inc(&bh_migrate_folio_norefs_success);
+ else
+ atomic_long_inc(&bh_migrate_folio_norefs_fails);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(buffer_migrate_folio_norefs);
+#else
+static inline void mm_migrate_bh_init(struct dentry *migrate_debug_root) { }
#endif /* CONFIG_BUFFER_HEAD */
int filemap_migrate_folio(struct address_space *mapping,
@@ -2737,3 +2895,17 @@ int migrate_misplaced_folio(struct folio *folio, int node)
}
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
+
+static __init int mm_migrate_debugfs_init(void)
+{
+ struct dentry *mm_debug_root;
+ struct dentry *migrate_debug_root;
+
+ mm_debug_root = debugfs_create_dir("mm", NULL);
+ migrate_debug_root = debugfs_create_dir("migrate", mm_debug_root);
+
+ mm_migrate_bh_init(migrate_debug_root);
+
+ return 0;
+}
+fs_initcall(mm_migrate_debugfs_init);
--
2.47.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
@ 2025-04-10 3:18 ` Matthew Wilcox
2025-04-10 12:05 ` Jan Kara
2025-04-15 1:36 ` Davidlohr Bueso
2 siblings, 0 replies; 38+ messages in thread
From: Matthew Wilcox @ 2025-04-10 3:18 UTC (permalink / raw)
To: Luis Chamberlain
Cc: brauner, jack, tytso, adilger.kernel, linux-ext4, riel, dave,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Wed, Apr 09, 2025 at 06:49:38PM -0700, Luis Chamberlain wrote:
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> if (folio_ref_count(src) != expected_count)
> return -EAGAIN;
>
> + if (buffer_meta(head))
> + return -EAGAIN;
This isn't enough on filesystems with bs<PS. You're only testing the
meta bit on the first buffer_head on the folio and not on the rest of
them. If this is the right approach to take, then we want:
+++ b/mm/migrate.c
@@ -799,6 +799,8 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *failed_bh;
do {
+ if (buffer_meta(bh))
+ goto unlock;
if (!trylock_buffer(bh)) {
if (mode == MIGRATE_ASYNC)
goto unlock;
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
2025-04-10 3:18 ` Matthew Wilcox
@ 2025-04-10 12:05 ` Jan Kara
2025-04-14 21:09 ` Luis Chamberlain
2025-04-15 16:18 ` Luis Chamberlain
2025-04-15 1:36 ` Davidlohr Bueso
2 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2025-04-10 12:05 UTC (permalink / raw)
To: Luis Chamberlain
Cc: brauner, jack, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> Filesystems which use buffer-heads and cannot guarantee that there
> are no other references to the folio, for example via a folio
> lock, must use buffer_migrate_folio_norefs() for the address space
> mapping's migrate_folio() callback. There are only 3 filesystems which use
> this callback:
>
> 1) the block device cache
Well, but through this also all simple filesystems that use buffer_heads
for metadata handling...
> 2) ext4 for its ext4_journalled_aops, ie, jbd2
> 3) nilfs2
>
> jbd2's use of this callback, however, is very race prone: consider
> folio migration while reviewing jbd2_journal_write_metadata_buffer()
> and the fact that jbd2:
>
> - does not hold the folio lock
> - does not have the page writeback bit set
> - does not lock the buffer
>
> And so, it can race with folio_set_bh() on folio migration. The commit
> ebdf4de5642fb6 ("mm: migrate: fix reference check race between
> __find_get_block() and migration") added a spin lock to prevent races
> with page migration which ext4 users were reporting through the SUSE
> bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
> original filesystem corruption, we can reproduce fs corruption on
> ext4 by just removing the spinlock and stress testing the filesystem
> with generic/750; corruption eventually shows up after about 3 hours of
> testing with kdevops using libvirt on the ext4-4k and ext4-2k profiles.
Correct, jbd2 holds a bh reference (its private jh structure attached to
bh->b_private holds it) and that is expected to protect jbd2 from anybody
else mucking with the bh.
> It turns out that the spin lock doesn't in the end protect against
> corruption, it *helps* reduce the possibility, but ext4 filesystem
> corruption can still happen even with the spin lock held. A test was
> done using vanilla Linux and adding a udelay(2000) right before we
> spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> we can reproduce the same exact filesystem corruption issues as observed
> without the spinlock with generic/750 [1].
This is unexpected.
> ** Reproduced on vanilla Linux with udelay(2000) **
>
> Call trace (ENOSPC journal failure):
> do_writepages()
> → ext4_do_writepages()
> → ext4_map_blocks()
> → ext4_ext_map_blocks()
> → ext4_ext_insert_extent()
> → __ext4_handle_dirty_metadata()
> → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
Curious. Did you try running e2fsck after the filesystem complained like
this? This complains about the journal handle not having enough credits for
the needed metadata update. Either we've lost some update to the journal_head
structure (b_modified got accidentally cleared) or some update to the extent
tree.
> And so jbd2 still needs more work to avoid races with folio migration.
> So replace the current spin lock solution by just skipping jbd2 buffers
> on folio migration. We identify jbd2 buffers through buffer_meta(), as
> __ext4_handle_dirty_metadata() is the only user of set_buffer_meta() on
> them. By checking for buffer_meta() and bailing on migration we fix the
> existing racy ext4 corruption, remove the spin-lock-held-while-sleeping
> complaints originally reported by 0-day [5], and pave the way for more
> buffer-head users of large folios beyond the block device cache.
I think we need to understand why the private_lock protection does not
protect bh users holding a reference, like jbd2, from folio migration before
papering over this problem with the hack. Because there are chances other
simple filesystems suffer from the same problem...
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f3ee6d8d5e2e..32fa72ba10b4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> if (folio_ref_count(src) != expected_count)
> return -EAGAIN;
>
> + if (buffer_meta(head))
> + return -EAGAIN;
> +
> if (!buffer_migrate_lock_buffers(head, mode))
> return -EAGAIN;
>
> @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> }
> bh = bh->b_this_page;
> } while (bh != head);
> + spin_unlock(&mapping->i_private_lock);
No, you've just broken all simple filesystems (like ext2) with this patch.
You can reduce the spinlock critical section only after providing an
alternative way to protect them from migration. So this should probably
happen at the end of the series.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/8] fs/ext4: use sleeping version of __find_get_block()
2025-04-10 1:49 ` [PATCH v2 6/8] fs/ext4: " Luis Chamberlain
@ 2025-04-10 13:36 ` Jan Kara
2025-04-10 17:32 ` Davidlohr Bueso
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-10 13:36 UTC (permalink / raw)
To: Luis Chamberlain
Cc: brauner, jack, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez
On Wed 09-04-25 18:49:43, Luis Chamberlain wrote:
> From: Davidlohr Bueso <dave@stgolabs.net>
>
> Trivially introduce the wrapper and enable ext4_free_blocks() to use
> it, which has a cond_resched to begin with. Convert to the new nonatomic
> flavor to benefit from potential performance benefits and adapt in the
> future vs migration such that semantics are kept.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/ext4/inode.c | 2 ++
> fs/ext4/mballoc.c | 3 ++-
> include/linux/buffer_head.h | 6 ++++++
> 3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1dc09ed5d403..b7acb5d3adcb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -860,6 +860,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
> return sb_find_get_block(inode->i_sb, map.m_pblk);
>
> /*
> + * Potential TODO: use sb_find_get_block_nonatomic() instead.
> + *
Yes, please. Since we are behind the nowait check, we are fine with blocking...
Honza
> * Since bh could introduce extra ref count such as referred by
> * journal_head etc. Try to avoid using __GFP_MOVABLE here
> * as it may fail the migration when journal_head remains.
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 0d523e9fb3d5..6f4265b21e19 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -6644,7 +6644,8 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
> for (i = 0; i < count; i++) {
> cond_resched();
> if (is_metadata)
> - bh = sb_find_get_block(inode->i_sb, block + i);
> + bh = sb_find_get_block_nonatomic(inode->i_sb,
> + block + i);
> ext4_forget(handle, is_metadata, inode, bh, block + i);
> }
> }
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 2b5458517def..8db10ca288fc 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -399,6 +399,12 @@ sb_find_get_block(struct super_block *sb, sector_t block)
> return __find_get_block(sb->s_bdev, block, sb->s_blocksize);
> }
>
> +static inline struct buffer_head *
> +sb_find_get_block_nonatomic(struct super_block *sb, sector_t block)
> +{
> + return __find_get_block_nonatomic(sb->s_bdev, block, sb->s_blocksize);
> +}
> +
This hunk probably belongs to some introductory patch implementing
nonatomic helpers.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2
2025-04-10 1:49 ` [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2 Luis Chamberlain
@ 2025-04-10 13:40 ` Jan Kara
2025-04-10 17:30 ` Davidlohr Bueso
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-10 13:40 UTC (permalink / raw)
To: Luis Chamberlain
Cc: brauner, jack, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez
On Wed 09-04-25 18:49:44, Luis Chamberlain wrote:
> From: Davidlohr Bueso <dave@stgolabs.net>
>
> Add semantics to enable future optimizations for buffer head noref jbd2
> migration. This adds a new BH_Migrate flag which ensures we can bail
> on the lookup path. This should enable jbd2 to get semantics of when
> a buffer head is under folio migration, and should yield to it and to
> eventually remove the buffer_meta() check skipping current jbd2 folio
> migration.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
..
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 32fa72ba10b4..8fed2655f2e8 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -851,6 +851,8 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> bool busy;
> bool invalidated = false;
>
> + VM_WARN_ON_ONCE(test_and_set_bit_lock(BH_Migrate,
> + &head->b_state));
Careful here. This breaks the logic with !CONFIG_DEBUG_VM.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups
2025-04-10 1:49 ` [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups Luis Chamberlain
@ 2025-04-10 14:38 ` Jan Kara
2025-04-10 17:38 ` Davidlohr Bueso
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-10 14:38 UTC (permalink / raw)
To: Luis Chamberlain
Cc: brauner, jack, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez
On Wed 09-04-25 18:49:39, Luis Chamberlain wrote:
> From: Davidlohr Bueso <dave@stgolabs.net>
>
> Callers of __find_get_block() may or may not allow for blocking
> semantics, and it is currently assumed that they do not. Lay out
> two paths based on this. Ultimately the i_private_lock scheme will
> be used as a fallback in non-blocking contexts; otherwise
> always take the folio lock instead. The suggested trylock idea
> is implemented, thereby potentially reducing i_private_lock
> contention in addition to later enabling migration support
> for large folios and noref migration.
>
> No change in semantics. All lookup users are non-blocking.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
...
> @@ -204,7 +195,19 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
> if (IS_ERR(folio))
> goto out;
>
> - spin_lock(&bd_mapping->i_private_lock);
> + /*
> + * Folio lock protects the buffers. Callers that cannot block
> + * will fallback to serializing vs try_to_free_buffers() via
> + * the i_private_lock.
> + */
> + if (!folio_trylock(folio)) {
> + if (atomic) {
> + spin_lock(&bd_mapping->i_private_lock);
> + folio_locked = false;
> + } else
> + folio_lock(folio);
> + }
Ewww, this is going to be a pain. You will mostly use the folio_trylock() for
protecting the lookup, except when some insane workload / fuzzer manages to
trigger the other path, which will lead to completely unreproducible bugs...
I'd rather do:
if (atomic) {
spin_lock(&bd_mapping->i_private_lock);
folio_locked = false;
} else {
folio_lock(folio);
}
I'd actually love to do something like:
if (atomic) {
if (!folio_trylock(folio))
bail...
} else {
folio_lock(folio);
}
but that may be just too radical this point and would need some serious
testing how frequent the trylock failures are. No point in blocking this
series with it. So just go with the deterministic use of i_private_lock for
atomic users for now.
Honza
> +
> head = folio_buffers(folio);
> if (!head)
> goto out_unlock;
> @@ -236,7 +239,10 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
> 1 << blkbits);
> }
> out_unlock:
> - spin_unlock(&bd_mapping->i_private_lock);
> + if (folio_locked)
> + folio_unlock(folio);
> + else
> + spin_unlock(&bd_mapping->i_private_lock);
> folio_put(folio);
> out:
> return ret;
> @@ -1388,14 +1394,15 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
> * it in the LRU and mark it as accessed. If it is not present then return
> * NULL
> */
> -struct buffer_head *
> -__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
> +static struct buffer_head *
> +find_get_block_common(struct block_device *bdev, sector_t block,
> + unsigned size, bool atomic)
> {
> struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
>
> if (bh == NULL) {
> /* __find_get_block_slow will mark the page accessed */
> - bh = __find_get_block_slow(bdev, block);
> + bh = __find_get_block_slow(bdev, block, atomic);
> if (bh)
> bh_lru_install(bh);
> } else
> @@ -1403,6 +1410,12 @@ __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
>
> return bh;
> }
> +
> +struct buffer_head *
> +__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
> +{
> + return find_get_block_common(bdev, block, size, true);
> +}
> EXPORT_SYMBOL(__find_get_block);
>
> /**
> --
> 2.47.2
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2
2025-04-10 13:40 ` Jan Kara
@ 2025-04-10 17:30 ` Davidlohr Bueso
2025-04-14 12:12 ` Jan Kara
0 siblings, 1 reply; 38+ messages in thread
From: Davidlohr Bueso @ 2025-04-10 17:30 UTC (permalink / raw)
To: Jan Kara
Cc: Luis Chamberlain, brauner, tytso, adilger.kernel, linux-ext4,
riel, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez
On Thu, 10 Apr 2025, Jan Kara wrote:
>> @@ -851,6 +851,8 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>> bool busy;
>> bool invalidated = false;
>>
>> + VM_WARN_ON_ONCE(test_and_set_bit_lock(BH_Migrate,
>> + &head->b_state));
>
>Careful here. This breaks the logic with !CONFIG_DEBUG_VM.
Ok, I guess just a WARN_ON_ONCE() here then.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/8] fs/ext4: use sleeping version of __find_get_block()
2025-04-10 13:36 ` Jan Kara
@ 2025-04-10 17:32 ` Davidlohr Bueso
0 siblings, 0 replies; 38+ messages in thread
From: Davidlohr Bueso @ 2025-04-10 17:32 UTC (permalink / raw)
To: Jan Kara
Cc: Luis Chamberlain, brauner, tytso, adilger.kernel, linux-ext4,
riel, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez
On Thu, 10 Apr 2025, Jan Kara wrote:
>> @@ -860,6 +860,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
>> return sb_find_get_block(inode->i_sb, map.m_pblk);
>>
>> /*
>> + * Potential TODO: use sb_find_get_block_nonatomic() instead.
>> + *
>
>Yes, please. Since we are behind nowait check, we are fine with blocking...
Yes, I will send a follow-up that replaces that getblk_unmovable() with
sb_find_get_block_nonatomic().
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups
2025-04-10 14:38 ` Jan Kara
@ 2025-04-10 17:38 ` Davidlohr Bueso
0 siblings, 0 replies; 38+ messages in thread
From: Davidlohr Bueso @ 2025-04-10 17:38 UTC (permalink / raw)
To: Jan Kara
Cc: Luis Chamberlain, brauner, tytso, adilger.kernel, linux-ext4,
riel, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez
On Thu, 10 Apr 2025, Jan Kara wrote:
>I'd rather do:
>
> if (atomic) {
> spin_lock(&bd_mapping->i_private_lock);
> folio_locked = false;
> } else {
> folio_lock(folio);
> }
>
Fine with me. I just think the trylock for the atomic scenario would have
given a greater chance of successful migration, though at the cost of
determinism, of course.
>I'd actually love to do something like:
>
> if (atomic) {
> if (!folio_trylock(folio))
> bail...
> } else {
> folio_lock(folio);
> }
>
>but that may be just too radical this point and would need some serious
>testing how frequent the trylock failures are. No point in blocking this
>series with it. So just go with the deterministic use of i_private_lock for
>atomic users for now.
This actually crossed my mind, but I also considered the scheme a little
too much for this series.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2
2025-04-10 17:30 ` Davidlohr Bueso
@ 2025-04-14 12:12 ` Jan Kara
0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2025-04-14 12:12 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Jan Kara, Luis Chamberlain, brauner, tytso, adilger.kernel,
linux-ext4, riel, willy, hannes, oliver.sang, david, axboe, hare,
david, djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez
On Thu 10-04-25 10:30:28, Davidlohr Bueso wrote:
> On Thu, 10 Apr 2025, Jan Kara wrote:
>
> > > @@ -851,6 +851,8 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > bool busy;
> > > bool invalidated = false;
> > >
> > > + VM_WARN_ON_ONCE(test_and_set_bit_lock(BH_Migrate,
> > > + &head->b_state));
> >
> > Careful here. This breaks the logic with !CONFIG_DEBUG_VM.
>
> Ok, I guess just a WARN_ON_ONCE() here then.
Or just:
bool migrating = test_and_set_bit_lock(BH_Migrate, &head->b_state);
VM_WARN_ON_ONCE(migrating);
(With !CONFIG_DEBUG_VM, VM_WARN_ON_ONCE() never evaluates its argument at
runtime, so the bit would never get set.) Frankly, I find statements with
side effects inside WARN_ON / BUG_ON rather confusing practice...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-10 12:05 ` Jan Kara
@ 2025-04-14 21:09 ` Luis Chamberlain
2025-04-14 22:19 ` Luis Chamberlain
2025-04-15 11:23 ` Jan Kara
2025-04-15 16:18 ` Luis Chamberlain
1 sibling, 2 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-14 21:09 UTC (permalink / raw)
To: Jan Kara
Cc: brauner, tytso, adilger.kernel, linux-ext4, riel, dave, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > }
> > bh = bh->b_this_page;
> > } while (bh != head);
> > + spin_unlock(&mapping->i_private_lock);
>
> No, you've just broken all simple filesystems (like ext2) with this patch.
> You can reduce the spinlock critical section only after providing
> alternative way to protect them from migration. So this should probably
> happen at the end of the series.
So you're OK with this spin lock move with the other series in place?
And so we punt the hard-to-reproduce corruption issue as future work?
Because the other alternative for now is to just disable migration for
jbd2:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1dc09ed5d403..ef1c3ef68877 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidate_folio = ext4_journalled_invalidate_folio,
.release_folio = ext4_release_folio,
- .migrate_folio = buffer_migrate_folio_norefs,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_folio = generic_error_remove_folio,
.swap_activate = ext4_iomap_swap_activate,
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-14 21:09 ` Luis Chamberlain
@ 2025-04-14 22:19 ` Luis Chamberlain
2025-04-15 9:05 ` Christian Brauner
2025-04-15 11:17 ` Jan Kara
2025-04-15 11:23 ` Jan Kara
1 sibling, 2 replies; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-14 22:19 UTC (permalink / raw)
To: Jan Kara
Cc: brauner, tytso, adilger.kernel, linux-ext4, riel, dave, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > }
> > > bh = bh->b_this_page;
> > > } while (bh != head);
> > > + spin_unlock(&mapping->i_private_lock);
> >
> > No, you've just broken all simple filesystems (like ext2) with this patch.
> > You can reduce the spinlock critical section only after providing
> > alternative way to protect them from migration. So this should probably
> > happen at the end of the series.
>
> So you're OK with this spin lock move with the other series in place?
>
> And so we punt the hard-to-reproduce corruption issue as future work
> to do? Becuase the other alternative for now is to just disable
> migration for jbd2:
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1dc09ed5d403..ef1c3ef68877 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> .bmap = ext4_bmap,
> .invalidate_folio = ext4_journalled_invalidate_folio,
> .release_folio = ext4_release_folio,
> - .migrate_folio = buffer_migrate_folio_norefs,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_folio = generic_error_remove_folio,
> .swap_activate = ext4_iomap_swap_activate,
BTW I ask because... do you expect the next v3 series to also be a target
for Linus' tree as part of a fix for this spinlock replacement?
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
2025-04-10 3:18 ` Matthew Wilcox
2025-04-10 12:05 ` Jan Kara
@ 2025-04-15 1:36 ` Davidlohr Bueso
2025-04-15 11:25 ` Jan Kara
2 siblings, 1 reply; 38+ messages in thread
From: Davidlohr Bueso @ 2025-04-15 1:36 UTC (permalink / raw)
To: Luis Chamberlain
Cc: brauner, jack, tytso, adilger.kernel, linux-ext4, riel, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Wed, 09 Apr 2025, Luis Chamberlain wrote:
>corruption can still happen even with the spin lock held. A test was
>done using vanilla Linux and adding a udelay(2000) right before we
>spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
>we can reproduce the same exact filesystem corruption issues as observed
>without the spinlock with generic/750 [1].
fyi I was actually able to trigger this on a vanilla 6.15-rc1 kernel,
not even having to add the artificial delay.
[336534.157119] ------------[ cut here ]------------
[336534.158911] WARNING: CPU: 3 PID: 87221 at fs/jbd2/transaction.c:1552 jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
[336534.160771] Modules linked in: loop sunrpc 9p kvm_intel nls_iso8859_1 nls_cp437 vfat fat crc32c_generic kvm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd 9pnet_virtio cryptd virtio_balloon virtio_console evdev joydev button nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 md_mod virtio_net net_failover virtio_blk failover psmouse serio_raw virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[336534.173218] CPU: 3 UID: 0 PID: 87221 Comm: kworker/u36:8 Not tainted 6.15.0-rc1 #2 PREEMPT(full)
[336534.175146] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2025.02-5 03/28/2025
[336534.176947] Workqueue: writeback wb_workfn (flush-7:5)
[336534.178183] RIP: 0010:jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
[336534.179626] Code: 30 0f 84 5b fe ff ff 0f 0b 41 bc 8b ff ff ff e9 69 fe ff ff 48 8b 04 24 4c 8b 48 70 4d 39 cf 0f 84 53 ff ff ff e9 32 c3 00 00 <0f> 0b 41 bc e4 ff ff ff e9 41 ff ff ff 0f 0b 90 0f 1f 40 00 90 90
[336534.183983] RSP: 0018:ffff9f168d38f548 EFLAGS: 00010246
[336534.185194] RAX: 0000000000000001 RBX: ffff8c0ae8244e10 RCX: 00000000000000fd
[336534.186810] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[336534.188399] RBP: ffff8c09db0d8618 R08: ffff8c09db0d8618 R09: 0000000000000000
[336534.189977] R10: ffff8c0b2671a83c R11: 0000000000006989 R12: 0000000000000000
[336534.191243] R13: ffff8c09cc3b33f0 R14: ffff8c0ae8244e18 R15: ffff8c0ad5e0ef00
[336534.192469] FS: 0000000000000000(0000) GS:ffff8c0b95b8d000(0000) knlGS:0000000000000000
[336534.193840] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[336534.194831] CR2: 00007f0ebab4f000 CR3: 000000011e616005 CR4: 0000000000772ef0
[336534.196044] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[336534.197274] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[336534.198473] PKRU: 55555554
[336534.198927] Call Trace:
[336534.199350] <TASK>
[336534.199701] __ext4_handle_dirty_metadata+0x5c/0x190 [ext4]
[336534.200626] ext4_ext_insert_extent+0x575/0x1440 [ext4]
[336534.201465] ? ext4_cache_extents+0x5a/0xd0 [ext4]
[336534.202243] ? ext4_find_extent+0x37c/0x3a0 [ext4]
[336534.203024] ext4_ext_map_blocks+0x50e/0x18d0 [ext4]
[336534.203803] ? mpage_map_and_submit_buffers+0x23f/0x270 [ext4]
[336534.204723] ext4_map_blocks+0x11a/0x4d0 [ext4]
[336534.205442] ? ext4_alloc_io_end_vec+0x1f/0x70 [ext4]
[336534.206239] ? kmem_cache_alloc_noprof+0x310/0x3d0
[336534.206982] ext4_do_writepages+0x762/0xd40 [ext4]
[336534.207706] ? __pfx_block_write_full_folio+0x10/0x10
[336534.208451] ? ext4_writepages+0xc6/0x1a0 [ext4]
[336534.209161] ext4_writepages+0xc6/0x1a0 [ext4]
[336534.209834] do_writepages+0xdd/0x250
[336534.210378] ? filemap_get_read_batch+0x170/0x310
[336534.211069] __writeback_single_inode+0x41/0x330
[336534.211738] writeback_sb_inodes+0x21b/0x4d0
[336534.212375] __writeback_inodes_wb+0x4c/0xe0
[336534.212998] wb_writeback+0x19c/0x320
[336534.213546] wb_workfn+0x30e/0x440
[336534.214039] process_one_work+0x188/0x340
[336534.214650] worker_thread+0x246/0x390
[336534.215196] ? _raw_spin_lock_irqsave+0x23/0x50
[336534.215879] ? __pfx_worker_thread+0x10/0x10
[336534.216522] kthread+0x104/0x250
[336534.217004] ? __pfx_kthread+0x10/0x10
[336534.217554] ? _raw_spin_unlock+0x15/0x30
[336534.218140] ? finish_task_switch.isra.0+0x94/0x290
[336534.218979] ? __pfx_kthread+0x10/0x10
[336534.220347] ret_from_fork+0x2d/0x50
[336534.221086] ? __pfx_kthread+0x10/0x10
[336534.221703] ret_from_fork_asm+0x1a/0x30
[336534.222415] </TASK>
[336534.222775] ---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-14 22:19 ` Luis Chamberlain
@ 2025-04-15 9:05 ` Christian Brauner
2025-04-15 15:47 ` Luis Chamberlain
2025-04-15 11:17 ` Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2025-04-15 9:05 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, tytso, adilger.kernel, linux-ext4, riel, dave, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > }
> > > > bh = bh->b_this_page;
> > > > } while (bh != head);
> > > > + spin_unlock(&mapping->i_private_lock);
> > >
> > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > You can reduce the spinlock critical section only after providing
> > > alternative way to protect them from migration. So this should probably
> > > happen at the end of the series.
> >
> > So you're OK with this spin lock move with the other series in place?
> >
> > And so we punt the hard-to-reproduce corruption issue as future work
> > to do? Becuase the other alternative for now is to just disable
> > migration for jbd2:
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 1dc09ed5d403..ef1c3ef68877 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > .bmap = ext4_bmap,
> > .invalidate_folio = ext4_journalled_invalidate_folio,
> > .release_folio = ext4_release_folio,
> > - .migrate_folio = buffer_migrate_folio_norefs,
> > .is_partially_uptodate = block_is_partially_uptodate,
> > .error_remove_folio = generic_error_remove_folio,
> > .swap_activate = ext4_iomap_swap_activate,
>
> BTW I ask because.. are your expectations that the next v3 series also
> be a target for Linus tree as part of a fix for this spinlock
> replacement?
Since this is fixing potential filesystem corruption I will upstream
whatever we need to do to fix this. Ideally we have a minimal fix to
upstream now and a comprehensive fix and cleanup for v6.16.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-14 22:19 ` Luis Chamberlain
2025-04-15 9:05 ` Christian Brauner
@ 2025-04-15 11:17 ` Jan Kara
1 sibling, 0 replies; 38+ messages in thread
From: Jan Kara @ 2025-04-15 11:17 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, brauner, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Mon 14-04-25 15:19:33, Luis Chamberlain wrote:
> On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > }
> > > > bh = bh->b_this_page;
> > > > } while (bh != head);
> > > > + spin_unlock(&mapping->i_private_lock);
> > >
> > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > You can reduce the spinlock critical section only after providing
> > > alternative way to protect them from migration. So this should probably
> > > happen at the end of the series.
> >
> > So you're OK with this spin lock move with the other series in place?
> >
> > And so we punt the hard-to-reproduce corruption issue as future work
> > to do? Becuase the other alternative for now is to just disable
> > migration for jbd2:
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 1dc09ed5d403..ef1c3ef68877 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > .bmap = ext4_bmap,
> > .invalidate_folio = ext4_journalled_invalidate_folio,
> > .release_folio = ext4_release_folio,
> > - .migrate_folio = buffer_migrate_folio_norefs,
> > .is_partially_uptodate = block_is_partially_uptodate,
> > .error_remove_folio = generic_error_remove_folio,
> > .swap_activate = ext4_iomap_swap_activate,
>
> BTW I ask because.. are your expectations that the next v3 series also
> be a target for Linus tree as part of a fix for this spinlock
> replacement?
I have no problem with this (the use of data=journal mode is rare) but I
suspect this won't fix the corruption you're seeing because that happens in
the block device mapping. So to fix that you'd have to disable page
migration for block device mappings (i.e., do the above with def_blk_aops)
and *that* will make a lot of people unhappy.
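I.e., something like the following, which is only an illustration of that
unpalatable option and not a proposal (hunk header abbreviated):

diff --git a/block/fops.c b/block/fops.c
--- a/block/fops.c
+++ b/block/fops.c
@@ ... @@ const struct address_space_operations def_blk_aops = {
-	.migrate_folio		= buffer_migrate_folio_norefs,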
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-14 21:09 ` Luis Chamberlain
2025-04-14 22:19 ` Luis Chamberlain
@ 2025-04-15 11:23 ` Jan Kara
1 sibling, 0 replies; 38+ messages in thread
From: Jan Kara @ 2025-04-15 11:23 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, brauner, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Mon 14-04-25 14:09:44, Luis Chamberlain wrote:
> On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > }
> > > bh = bh->b_this_page;
> > > } while (bh != head);
> > > + spin_unlock(&mapping->i_private_lock);
> >
> > No, you've just broken all simple filesystems (like ext2) with this patch.
> > You can reduce the spinlock critical section only after providing
> > alternative way to protect them from migration. So this should probably
> > happen at the end of the series.
>
> So you're OK with this spin lock move with the other series in place?
Yes.
> And so we punt the hard-to-reproduce corruption issue as future work
> to do?
Yes, I'm not happy about that but I prefer that over putting band aids
into that code. Even more so because we don't even understand whether they
fix the problem or just make this reproducer stop working...
> Becuase the other alternative for now is to just disable
> migration for jbd2:
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1dc09ed5d403..ef1c3ef68877 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> .bmap = ext4_bmap,
> .invalidate_folio = ext4_journalled_invalidate_folio,
> .release_folio = ext4_release_folio,
> - .migrate_folio = buffer_migrate_folio_norefs,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_folio = generic_error_remove_folio,
> .swap_activate = ext4_iomap_swap_activate,
This will not disable migration for JBD2 buffers. This will just disable
migration for files with data journalling. See my other reply for details.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 1:36 ` Davidlohr Bueso
@ 2025-04-15 11:25 ` Jan Kara
2025-04-15 18:14 ` Davidlohr Bueso
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-15 11:25 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Luis Chamberlain, brauner, jack, tytso, adilger.kernel,
linux-ext4, riel, willy, hannes, oliver.sang, david, axboe, hare,
david, djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Mon 14-04-25 18:36:41, Davidlohr Bueso wrote:
> On Wed, 09 Apr 2025, Luis Chamberlain wrote:
>
> > corruption can still happen even with the spin lock held. A test was
> > done using vanilla Linux and adding a udelay(2000) right before we
> > spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> > we can reproduce the same exact filesystem corruption issues as observed
> > without the spinlock with generic/750 [1].
>
> fyi I was actually able to trigger this on a vanilla 6.15-rc1 kernel,
> not even having to add the artificial delay.
OK, so this is using generic/750, isn't it? How long did you have to run it
to trigger this? Because I've never seen this tripping...
Honza
>
> [336534.157119] ------------[ cut here ]------------
> [336534.158911] WARNING: CPU: 3 PID: 87221 at fs/jbd2/transaction.c:1552 jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
> [336534.160771] Modules linked in: loop sunrpc 9p kvm_intel nls_iso8859_1 nls_cp437 vfat fat crc32c_generic kvm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd 9pnet_virtio cryptd virtio_balloon virtio_console evdev joydev button nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 md_mod virtio_net net_failover virtio_blk failover psmouse serio_raw virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
> [336534.173218] CPU: 3 UID: 0 PID: 87221 Comm: kworker/u36:8 Not tainted 6.15.0-rc1 #2 PREEMPT(full)
> [336534.175146] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2025.02-5 03/28/2025
> [336534.176947] Workqueue: writeback wb_workfn (flush-7:5)
> [336534.178183] RIP: 0010:jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
> [336534.179626] Code: 30 0f 84 5b fe ff ff 0f 0b 41 bc 8b ff ff ff e9 69 fe ff ff 48 8b 04 24 4c 8b 48 70 4d 39 cf 0f 84 53 ff ff ff e9 32 c3 00 00 <0f> 0b 41 bc e4 ff ff ff e9 41 ff ff ff 0f 0b 90 0f 1f 40 00 90 90
> [336534.183983] RSP: 0018:ffff9f168d38f548 EFLAGS: 00010246
> [336534.185194] RAX: 0000000000000001 RBX: ffff8c0ae8244e10 RCX: 00000000000000fd
> [336534.186810] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [336534.188399] RBP: ffff8c09db0d8618 R08: ffff8c09db0d8618 R09: 0000000000000000
> [336534.189977] R10: ffff8c0b2671a83c R11: 0000000000006989 R12: 0000000000000000
> [336534.191243] R13: ffff8c09cc3b33f0 R14: ffff8c0ae8244e18 R15: ffff8c0ad5e0ef00
> [336534.192469] FS: 0000000000000000(0000) GS:ffff8c0b95b8d000(0000) knlGS:0000000000000000
> [336534.193840] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [336534.194831] CR2: 00007f0ebab4f000 CR3: 000000011e616005 CR4: 0000000000772ef0
> [336534.196044] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [336534.197274] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [336534.198473] PKRU: 55555554
> [336534.198927] Call Trace:
> [336534.199350] <TASK>
> [336534.199701] __ext4_handle_dirty_metadata+0x5c/0x190 [ext4]
> [336534.200626] ext4_ext_insert_extent+0x575/0x1440 [ext4]
> [336534.201465] ? ext4_cache_extents+0x5a/0xd0 [ext4]
> [336534.202243] ? ext4_find_extent+0x37c/0x3a0 [ext4]
> [336534.203024] ext4_ext_map_blocks+0x50e/0x18d0 [ext4]
> [336534.203803] ? mpage_map_and_submit_buffers+0x23f/0x270 [ext4]
> [336534.204723] ext4_map_blocks+0x11a/0x4d0 [ext4]
> [336534.205442] ? ext4_alloc_io_end_vec+0x1f/0x70 [ext4]
> [336534.206239] ? kmem_cache_alloc_noprof+0x310/0x3d0
> [336534.206982] ext4_do_writepages+0x762/0xd40 [ext4]
> [336534.207706] ? __pfx_block_write_full_folio+0x10/0x10
> [336534.208451] ? ext4_writepages+0xc6/0x1a0 [ext4]
> [336534.209161] ext4_writepages+0xc6/0x1a0 [ext4]
> [336534.209834] do_writepages+0xdd/0x250
> [336534.210378] ? filemap_get_read_batch+0x170/0x310
> [336534.211069] __writeback_single_inode+0x41/0x330
> [336534.211738] writeback_sb_inodes+0x21b/0x4d0
> [336534.212375] __writeback_inodes_wb+0x4c/0xe0
> [336534.212998] wb_writeback+0x19c/0x320
> [336534.213546] wb_workfn+0x30e/0x440
> [336534.214039] process_one_work+0x188/0x340
> [336534.214650] worker_thread+0x246/0x390
> [336534.215196] ? _raw_spin_lock_irqsave+0x23/0x50
> [336534.215879] ? __pfx_worker_thread+0x10/0x10
> [336534.216522] kthread+0x104/0x250
> [336534.217004] ? __pfx_kthread+0x10/0x10
> [336534.217554] ? _raw_spin_unlock+0x15/0x30
> [336534.218140] ? finish_task_switch.isra.0+0x94/0x290
> [336534.218979] ? __pfx_kthread+0x10/0x10
> [336534.220347] ret_from_fork+0x2d/0x50
> [336534.221086] ? __pfx_kthread+0x10/0x10
> [336534.221703] ret_from_fork_asm+0x1a/0x30
> [336534.222415] </TASK>
> [336534.222775] ---[ end trace 0000000000000000 ]---
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 9:05 ` Christian Brauner
@ 2025-04-15 15:47 ` Luis Chamberlain
2025-04-15 16:23 ` Jan Kara
0 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-15 15:47 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, tytso, adilger.kernel, linux-ext4, riel, dave, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > > }
> > > > > bh = bh->b_this_page;
> > > > > } while (bh != head);
> > > > > + spin_unlock(&mapping->i_private_lock);
> > > >
> > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > You can reduce the spinlock critical section only after providing
> > > > alternative way to protect them from migration. So this should probably
> > > > happen at the end of the series.
> > >
> > > So you're OK with this spin lock move with the other series in place?
> > >
> > > And so we punt the hard-to-reproduce corruption issue as future work
> > > to do? Becuase the other alternative for now is to just disable
> > > migration for jbd2:
> > >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > > .bmap = ext4_bmap,
> > > .invalidate_folio = ext4_journalled_invalidate_folio,
> > > .release_folio = ext4_release_folio,
> > > - .migrate_folio = buffer_migrate_folio_norefs,
> > > .is_partially_uptodate = block_is_partially_uptodate,
> > > .error_remove_folio = generic_error_remove_folio,
> > > .swap_activate = ext4_iomap_swap_activate,
> >
> > BTW I ask because.. are your expectations that the next v3 series also
> > be a target for Linus tree as part of a fix for this spinlock
> > replacement?
>
> Since this is fixing potential filesystem corruption I will upstream
> whatever we need to do to fix this. Ideally we have a minimal fix to
> upstream now and a comprehensive fix and cleanup for v6.16.
Despite our efforts we don't yet have agreement on how to fix the ext4
corruption, because Jan noted the buffer_meta() check in this patch is too
broad and would affect other filesystems (I have yet to understand how, but
will review).
And while we do have agreement that the spin lock can be removed by this
patch series to fix the sleeping-while-atomic issue incurred by large folios
for buffer heads, that removal would happen at the end of the series.
So the ext4 corruption is an existing issue as-is today; it's separate from
the spin lock removal goal of fixing the sleeping-while-atomic problem.
However, this series might be quite big for an rc2 or rc3 fix for that spin
lock removal issue. It should bring in substantial performance benefits
though, so it might be worth considering. We can re-run the tests with the
adjustment of deferring the spin lock removal to the last patch in this
series.
The alternative is to revert the spin lock addition commit in Linus' tree,
i.e. commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
between __find_get_block() and migration"), noting that it does not in fact
fix the ext4 corruption, as we've seen, and that it causes a
sleeping-while-atomic issue once buffer heads support large folios. If we do
that, then we punt this series to the next development window, and it would
simply not carry the spin lock removal in its last patch.
Jan Kara, Christian, thoughts?
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-10 12:05 ` Jan Kara
2025-04-14 21:09 ` Luis Chamberlain
@ 2025-04-15 16:18 ` Luis Chamberlain
2025-04-15 16:28 ` Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-15 16:18 UTC (permalink / raw)
To: Jan Kara
Cc: brauner, tytso, adilger.kernel, linux-ext4, riel, dave, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> > ** Reproduced on vanilla Linux with udelay(2000) **
> >
> > Call trace (ENOSPC journal failure):
> > do_writepages()
> > → ext4_do_writepages()
> > → ext4_map_blocks()
> > → ext4_ext_map_blocks()
> > → ext4_ext_insert_extent()
> > → __ext4_handle_dirty_metadata()
> > → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
>
> Curious. Did you try running e2fsck after the filesystem complained like
> this? This complains about journal handle not having enough credits for
> needed metadata update. Either we've lost some update to the journal_head
> structure (b_modified got accidentally cleared) or some update to extent
> tree.
Just tried it after pkill fsstress and stopping the test:
root@e1-ext4-2k /var/lib/xfstests # umount /dev/loop5
root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5
fsck from util-linux 2.41
e2fsck 1.47.2 (1-Jan-2025)
/dev/loop5 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 26 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 129 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 592 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 1095 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 1416 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 1653 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 1929 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 1965 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 2538 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 2765 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 2831 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 2838 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 3595 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 4659 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 5268 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 6400 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 6830 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 8490 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 8555 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 8658 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 8754 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 8996 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 9168 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 9430 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 9468 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 9543 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 9632 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 9746 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 10043 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 10280 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 10623 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 10651 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 10691 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 10708 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 11389 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 11564 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 11578 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 11842 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 11900 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 12721 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 12831 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 13531 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyyyInode 16580 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 16740 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 17123 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 17192 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 17412 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 17519 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyInode 18730 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 18974 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 19118 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyInode 19806 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 19942 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 20303 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 20323 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 21047 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 21919 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 22180 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 22856 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 23462 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 23587 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 23775 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 23916 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 24046 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 24161 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 25576 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 25666 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 25992 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 26404 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 26795 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 26862 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 26868 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 27627 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 27959 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 28014 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 29120 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 29308 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyyyInode 30673 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 31127 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 31332 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 31547 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyInode 32727 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 32888 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 33062 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyyInode 34421 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 34508 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyyyInode 35996 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 36258 extent tree (at level 1) could be shorter. Optimize<y>? yes
yyInode 36867 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 37166 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 37171 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 37548 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 37732 extent tree (at level 1) could be shorter. Optimize<y>? yes
yInode 38028 extent tree (at level 1) could be shorter. Optimize<y>? yes
Inode 38099 extent tree (at level 1) could be shorter. Optimize<y>? yes
....
So I tried:
root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
e2fsck 1.47.2 (1-Jan-2025)
root@e1-ext4-2k /var/lib/xfstests # wc -l log
16411 log
root@e1-ext4-2k /var/lib/xfstests # tail log
Free blocks count wrong for group #609 (62, counted=63).
Fix? yes
Free blocks count wrong (12289, counted=12293).
Fix? yes
/dev/loop5: ***** FILE SYSTEM WAS MODIFIED *****
/dev/loop5: 1310719/1310720 files (10.7% non-contiguous),
10473467/10485760 blocks
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 15:47 ` Luis Chamberlain
@ 2025-04-15 16:23 ` Jan Kara
2025-04-15 21:06 ` Luis Chamberlain
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-15 16:23 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Christian Brauner, Jan Kara, tytso, adilger.kernel, linux-ext4,
riel, dave, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Tue 15-04-25 08:47:51, Luis Chamberlain wrote:
> On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> > On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > > > }
> > > > > > bh = bh->b_this_page;
> > > > > > } while (bh != head);
> > > > > > + spin_unlock(&mapping->i_private_lock);
> > > > >
> > > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > > You can reduce the spinlock critical section only after providing
> > > > > alternative way to protect them from migration. So this should probably
> > > > > happen at the end of the series.
> > > >
> > > > So you're OK with this spin lock move with the other series in place?
> > > >
> > > > And so we punt the hard-to-reproduce corruption issue as future work
> > > > to do? Becuase the other alternative for now is to just disable
> > > > migration for jbd2:
> > > >
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > > > .bmap = ext4_bmap,
> > > > .invalidate_folio = ext4_journalled_invalidate_folio,
> > > > .release_folio = ext4_release_folio,
> > > > - .migrate_folio = buffer_migrate_folio_norefs,
> > > > .is_partially_uptodate = block_is_partially_uptodate,
> > > > .error_remove_folio = generic_error_remove_folio,
> > > > .swap_activate = ext4_iomap_swap_activate,
> > >
> > > BTW I ask because.. are your expectations that the next v3 series also
> > > be a target for Linus tree as part of a fix for this spinlock
> > > replacement?
> >
> > Since this is fixing potential filesystem corruption I will upstream
> > whatever we need to do to fix this. Ideally we have a minimal fix to
> > upstream now and a comprehensive fix and cleanup for v6.16.
>
> Despite our efforts we don't yet have an agreement on how to fix the
> ext4 corruption, becuase Jan noted the buffer_meta() check in this patch
> is too broad and would affect other filesystems (I have yet to
> understand how, but will review).
>
> And so while we have agreement we can remove the spin lock to fix the
> sleeping while atomic incurred by large folios for buffer heads by this
> patch series, the removal of the spin lock would happen at the end of
> this series.
>
> And so the ext4 corruption is an existing issue as-is today, its
> separate from the spin lock removal goal to fix the sleeping while
> atomic..
I agree. Ext4 corruption problems are separate from sleeping in atomic
issues.
> However this series might be quite big for an rc2 or rc3 fix for that spin
> lock removal issue. It should bring in substantial performance benefits
> though, so it might be worthy to consider. We can re-run tests with the
> adjustment to remove the spin lock until the last patch in this series.
>
> The alternative is to revert the spin lock addition commit for Linus'
> tree, ie commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
> between __find_get_block() and migration") and note that it in fact does
> not fix the ext4 corruption as we've noted, and in fact causes an issue
> with sleeping while atomic with support for large folios for buffer
> heads. If we do that then we punt this series for the next development
> window, and it would just not have the spin lock removal on the last
> patch.
Well, the commit ebdf4de5642fb6 is 6 years old. At that time there were no
large folios (in fact there were no folios at all ;)) in the page cache and
it does work quite well (I didn't see a corruption report from real users
since then). So I don't like removing that commit because it makes a
"reproducible with a heavy stress test" problem become a "reproduced by
real world workloads" problem.
If you are looking for a fast way to fix up the sleep-in-atomic issues, then
I'd suggest just disabling large folios for the block device page cache. That
is the new functionality that actually triggered all these investigations and
sleep-in-atomic reports. And once this patch set gets merged, we can
re-enable large folios in the block device page cache again.
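Roughly, I mean something like this (an untested sketch only; the helper
and the exact call site in the bdev code are assumptions on my side, and
LBS devices still need their minimum order):

	/*
	 * Illustrative sketch, not a tested patch: cap the block device
	 * page cache at the smallest folio order the logical block size
	 * requires, instead of opting into arbitrarily large folios,
	 * until the lookup locking is sorted out.
	 */
	unsigned int min_order = get_order(bdev_logical_block_size(bdev));

	mapping_set_folio_order_range(bdev->bd_mapping, min_order, min_order);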
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 16:18 ` Luis Chamberlain
@ 2025-04-15 16:28 ` Jan Kara
2025-04-16 16:58 ` Luis Chamberlain
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-15 16:28 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, brauner, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Tue 15-04-25 09:18:10, Luis Chamberlain wrote:
> On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> > > ** Reproduced on vanilla Linux with udelay(2000) **
> > >
> > > Call trace (ENOSPC journal failure):
> > > do_writepages()
> > > → ext4_do_writepages()
> > > → ext4_map_blocks()
> > > → ext4_ext_map_blocks()
> > > → ext4_ext_insert_extent()
> > > → __ext4_handle_dirty_metadata()
> > > → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
> >
> > Curious. Did you try running e2fsck after the filesystem complained like
> > this? This complains about journal handle not having enough credits for
> > needed metadata update. Either we've lost some update to the journal_head
> > structure (b_modified got accidentally cleared) or some update to extent
> > tree.
>
> Just tried it after pkill fsstress and stopping the test:
>
> root@e1-ext4-2k /var/lib/xfstests # umount /dev/loop5
> root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5
> fsck from util-linux 2.41
> e2fsck 1.47.2 (1-Jan-2025)
> /dev/loop5 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Inode 26 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 129 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 592 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 1095 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 1416 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 1653 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 1929 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 1965 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 2538 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 2765 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 2831 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 2838 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 3595 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 4659 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 5268 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 6400 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 6830 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 8490 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 8555 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 8658 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 8754 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 8996 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 9168 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 9430 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 9468 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 9543 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 9632 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 9746 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 10043 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 10280 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 10623 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 10651 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 10691 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 10708 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 11389 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 11564 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 11578 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 11842 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 11900 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 12721 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 12831 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 13531 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyyyInode 16580 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 16740 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 17123 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 17192 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 17412 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 17519 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyInode 18730 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 18974 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 19118 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyInode 19806 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 19942 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 20303 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 20323 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 21047 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 21919 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 22180 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 22856 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 23462 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 23587 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 23775 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 23916 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 24046 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 24161 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 25576 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 25666 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 25992 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 26404 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 26795 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 26862 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 26868 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 27627 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 27959 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 28014 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 29120 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 29308 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyyyInode 30673 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 31127 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 31332 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 31547 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyInode 32727 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 32888 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 33062 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyyInode 34421 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 34508 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyyyInode 35996 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 36258 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yyInode 36867 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 37166 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 37171 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 37548 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 37732 extent tree (at level 1) could be shorter. Optimize<y>? yes
> yInode 38028 extent tree (at level 1) could be shorter. Optimize<y>? yes
> Inode 38099 extent tree (at level 1) could be shorter. Optimize<y>? yes
> ....
These are harmless. They are not errors, just extent tree optimizations
that e2fsck can make.
> So I tried:
>
> root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> e2fsck 1.47.2 (1-Jan-2025)
> root@e1-ext4-2k /var/lib/xfstests # wc -l log
> 16411 log
Can you share the log please?
> root@e1-ext4-2k /var/lib/xfstests # tail log
>
> Free blocks count wrong for group #609 (62, counted=63).
> Fix? yes
>
> Free blocks count wrong (12289, counted=12293).
> Fix? yes
These could potentially indicate some interesting issues but it depends on
what the above e2fsck messages are...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 11:25 ` Jan Kara
@ 2025-04-15 18:14 ` Davidlohr Bueso
0 siblings, 0 replies; 38+ messages in thread
From: Davidlohr Bueso @ 2025-04-15 18:14 UTC (permalink / raw)
To: Jan Kara
Cc: Luis Chamberlain, brauner, tytso, adilger.kernel, linux-ext4,
riel, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Tue, 15 Apr 2025, Jan Kara wrote:
>On Mon 14-04-25 18:36:41, Davidlohr Bueso wrote:
>> On Wed, 09 Apr 2025, Luis Chamberlain wrote:
>>
>> > corruption can still happen even with the spin lock held. A test was
>> > done using vanilla Linux and adding a udelay(2000) right before we
>> > spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
>> > we can reproduce the same exact filesystem corruption issues as observed
>> > without the spinlock with generic/750 [1].
>>
>> fyi I was actually able to trigger this on a vanilla 6.15-rc1 kernel,
>> not even having to add the artificial delay.
>
>OK, so this is using generic/750, isn't it? How long did you have to run it
>to trigger this? Because I've never seen this tripping...
Correct, generic/750. It triggered after 90+ hrs running.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 16:23 ` Jan Kara
@ 2025-04-15 21:06 ` Luis Chamberlain
2025-04-16 2:02 ` Davidlohr Bueso
0 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-15 21:06 UTC (permalink / raw)
To: Jan Kara
Cc: Christian Brauner, tytso, adilger.kernel, linux-ext4, riel, dave,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Tue, Apr 15, 2025 at 06:23:54PM +0200, Jan Kara wrote:
> On Tue 15-04-25 08:47:51, Luis Chamberlain wrote:
> > On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> > > On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > > > > }
> > > > > > > bh = bh->b_this_page;
> > > > > > > } while (bh != head);
> > > > > > > + spin_unlock(&mapping->i_private_lock);
> > > > > >
> > > > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > > > You can reduce the spinlock critical section only after providing
> > > > > > alternative way to protect them from migration. So this should probably
> > > > > > happen at the end of the series.
> > > > >
> > > > > So you're OK with this spin lock move with the other series in place?
> > > > >
> > > > > And so we punt the hard-to-reproduce corruption issue as future work
> > > > > to do? Becuase the other alternative for now is to just disable
> > > > > migration for jbd2:
> > > > >
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > > > > .bmap = ext4_bmap,
> > > > > .invalidate_folio = ext4_journalled_invalidate_folio,
> > > > > .release_folio = ext4_release_folio,
> > > > > - .migrate_folio = buffer_migrate_folio_norefs,
> > > > > .is_partially_uptodate = block_is_partially_uptodate,
> > > > > .error_remove_folio = generic_error_remove_folio,
> > > > > .swap_activate = ext4_iomap_swap_activate,
> > > >
> > > > BTW I ask because.. are your expectations that the next v3 series also
> > > > be a target for Linus tree as part of a fix for this spinlock
> > > > replacement?
> > >
> > > Since this is fixing potential filesystem corruption I will upstream
> > > whatever we need to do to fix this. Ideally we have a minimal fix to
> > > upstream now and a comprehensive fix and cleanup for v6.16.
> >
> > Despite our efforts we don't yet have an agreement on how to fix the
> > ext4 corruption, because Jan noted the buffer_meta() check in this patch
> > is too broad and would affect other filesystems (I have yet to
> > understand how, but will review).
> >
> > And so, while we agree that we can remove the spin lock in this patch
> > series to fix the sleeping-while-atomic incurred by large folios for
> > buffer heads, the removal of the spin lock would happen at the end of
> > the series.
> >
> > And so the ext4 corruption is an existing issue as-is today; it's
> > separate from the spin lock removal goal of fixing the sleeping while
> > atomic.
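For context, the sleeping-while-atomic problem comes down to roughly the
following, as I understand it from the cover letter; intermediate calls are
elided, so take the exact call sites as assumptions:
/*
 * buffer_migrate_folio_norefs()
 *   spin_lock(&mapping->i_private_lock);    <- atomic context
 *   ... check buffer_head references, migrate the mapping ...
 *   ... copy folio contents: for large folios the copy path
 *       (folio_mc_copy()) may cond_resched(), i.e. sleep ...
 *   spin_unlock(&mapping->i_private_lock);
 */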
>
> I agree. Ext4 corruption problems are separate from sleeping in atomic
> issues.
Glad that's clear.
> > However this series might be quite big for an rc2 or rc3 fix for that spin
> > lock removal issue. It should bring in substantial performance benefits
> > though, so it might be worthy to consider. We can re-run tests with the
> > adjustment to remove the spin lock until the last patch in this series.
> >
> > The alternative is to revert the spin lock addition commit for Linus'
> > tree, ie commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
> > between __find_get_block() and migration") and note that it in fact does
> > not fix the ext4 corruption as we've noted, and in fact causes an issue
> > with sleeping while atomic with support for large folios for buffer
> > heads. If we do that then we punt this series for the next development
> > window, and it would just not have the spin lock removal on the last
> > patch.
>
> Well, the commit ebdf4de5642fb6 is 6 years old. At that time there were no
> large folios (in fact there were no folios at all ;)) in the page cache and
> it does work quite well (I didn't see a corruption report from real users
> since then).
It is still a workaround.
> So I don't like removing that commit because it makes a
> "reproducible with a heavy stress test" problem become a "reproduced by
> real world workloads" problem.
So how about just patches 2 and 8 in this series for Linus' tree, with the
spin lock removal happening in the last patch?
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 21:06 ` Luis Chamberlain
@ 2025-04-16 2:02 ` Davidlohr Bueso
0 siblings, 0 replies; 38+ messages in thread
From: Davidlohr Bueso @ 2025-04-16 2:02 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, Christian Brauner, tytso, adilger.kernel, linux-ext4,
riel, willy, hannes, oliver.sang, david, axboe, hare, david,
djwong, ritesh.list, linux-fsdevel, linux-block, linux-mm,
gost.dev, p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Tue, 15 Apr 2025, Luis Chamberlain wrote:
>On Tue, Apr 15, 2025 at 06:23:54PM +0200, Jan Kara wrote:
>> So I don't like removing that commit because it makes a
>> "reproducible with a heavy stress test" problem become a "reproduced by
>> real world workloads" problem.
>
>So how about just patches 2 and 8 in this series for Linus' tree, with the
>spin lock removal happening in the last patch?
fyi I sent out a new series (trimmed some recipients), addressing the concerns
laid out in this approach.
https://lore.kernel.org/all/20250415231635.83960-1-dave@stgolabs.net/
As with adding artificial delays to a vanilla kernel, the only behavioral
change these modifications cause is a quicker triggering of the aforementioned
(yet independent) ext4 warning splat/corruption.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-15 16:28 ` Jan Kara
@ 2025-04-16 16:58 ` Luis Chamberlain
2025-04-23 17:09 ` Jan Kara
0 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-16 16:58 UTC (permalink / raw)
To: Jan Kara, dave
Cc: brauner, tytso, adilger.kernel, linux-ext4, riel, willy, hannes,
oliver.sang, david, axboe, hare, david, djwong, ritesh.list,
linux-fsdevel, linux-block, linux-mm, gost.dev, p.raghav,
da.gomez, syzbot+f3c6fda1297c748a7076
On Tue, Apr 15, 2025 at 06:28:55PM +0200, Jan Kara wrote:
> > So I tried:
> >
> > root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> > e2fsck 1.47.2 (1-Jan-2025)
> > root@e1-ext4-2k /var/lib/xfstests # wc -l log
> > 16411 log
>
> Can you share the log please?
Sure, here you go:
https://github.com/linux-kdevops/20250416-ext4-jbd2-bh-migrate-corruption
The last trace-0004.txt is a fresh one with Davidlohr's patches
applied. It has trace-0004-fsck.txt.
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-16 16:58 ` Luis Chamberlain
@ 2025-04-23 17:09 ` Jan Kara
2025-04-23 20:30 ` Luis Chamberlain
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-04-23 17:09 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, dave, brauner, tytso, adilger.kernel, linux-ext4, riel,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
[-- Attachment #1: Type: text/plain, Size: 1484 bytes --]
On Wed 16-04-25 09:58:30, Luis Chamberlain wrote:
> On Tue, Apr 15, 2025 at 06:28:55PM +0200, Jan Kara wrote:
> > > So I tried:
> > >
> > > root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> > > e2fsck 1.47.2 (1-Jan-2025)
> > > root@e1-ext4-2k /var/lib/xfstests # wc -l log
> > > 16411 log
> >
> > Can you share the log please?
>
> Sure, here you go:
>
> https://github.com/linux-kdevops/20250416-ext4-jbd2-bh-migrate-corruption
>
> The last trace-0004.txt is a fresh one with Davidlohr's patches
> applied. It has trace-0004-fsck.txt.
Thanks for the data! I was staring at them for some time and at this point
I'm leaning towards a conclusion that this is actually not a case of
metadata corruption but rather a bug in ext4 transaction credit computation
that is completely independent of page migration.
Based on the e2fsck log you've provided the only damage in the filesystem
is from the aborted transaction handle in the middle of extent tree growth.
So nothing points to a lost metadata write or anything like that. And the
credit reservation for page writeback is indeed somewhat racy - we reserve
number of transaction credits based on current tree depth. However by the
time we get to ext4_ext_map_blocks() another process could have modified
the extent tree so we may need to modify more blocks than we originally
expected and reserved credits for.
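To put illustrative numbers on that race (made up for the example, not taken
from the trace):
/*
 * At reservation time ext4_ext_index_trans_blocks() uses the current
 * extent tree depth, e.g.:
 *
 *     depth = ext_depth(inode);     // say 2
 *     index = depth * 3;            // 6 credits reserved for index blocks
 *
 * If another task grows the tree to depth 3 before this writeback
 * reaches ext4_ext_map_blocks(), the insertion may need to dirty more
 * index blocks than those 6 credits cover, which is exactly what the
 * jbd2_journal_dirty_metadata() warning complains about.
 */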
Can you give attached patch a try please?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
[-- Attachment #2: 0001-ext4-Fix-calculation-of-credits-for-extent-tree-modi.patch --]
[-- Type: text/x-patch, Size: 1981 bytes --]
From 4c53fb9f4b9b3eb4a579f69b7adcb6524d55629c Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 23 Apr 2025 18:10:54 +0200
Subject: [PATCH] ext4: Fix calculation of credits for extent tree modification
Luis and David report that after running the generic/750 test for 90+
hours on a 2k ext4 filesystem, they are able to trigger a warning in
jbd2_journal_dirty_metadata() complaining that there are not enough
credits in the running transaction started in ext4_do_writepages().
Indeed the code in ext4_do_writepages() is racy: the extent tree can
change between the time we compute the credits necessary for extent tree
modifications and the time we actually modify the extent tree. Thus the
number of credits actually needed may be higher. Modify
ext4_ext_index_trans_blocks() to assume the worst case of the maximum
tree depth.
Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
Reported-by: Davidlohr Bueso <dave@stgolabs.net>
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext4/extents.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c616a16a9f36..43286632e650 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2396,18 +2396,19 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int nrblocks,
int ext4_ext_index_trans_blocks(struct inode *inode, int extents)
{
int index;
- int depth;
/* If we are converting the inline data, only one is needed here. */
if (ext4_has_inline_data(inode))
return 1;
- depth = ext_depth(inode);
-
+ /*
+ * Extent tree can change between the time we estimate credits and
+ * the time we actually modify the tree. Assume the worst case.
+ */
if (extents <= 1)
- index = depth * 2;
+ index = EXT4_MAX_EXTENT_DEPTH * 2;
else
- index = depth * 3;
+ index = EXT4_MAX_EXTENT_DEPTH * 3;
return index;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-23 17:09 ` Jan Kara
@ 2025-04-23 20:30 ` Luis Chamberlain
2025-04-25 22:51 ` Luis Chamberlain
0 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-23 20:30 UTC (permalink / raw)
To: Jan Kara
Cc: dave, brauner, tytso, adilger.kernel, linux-ext4, riel, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Wed, Apr 23, 2025 at 07:09:28PM +0200, Jan Kara wrote:
> On Wed 16-04-25 09:58:30, Luis Chamberlain wrote:
> > On Tue, Apr 15, 2025 at 06:28:55PM +0200, Jan Kara wrote:
> > > > So I tried:
> > > >
> > > > root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> > > > e2fsck 1.47.2 (1-Jan-2025)
> > > > root@e1-ext4-2k /var/lib/xfstests # wc -l log
> > > > 16411 log
> > >
> > > Can you share the log please?
> >
> > Sure, here you go:
> >
> > https://github.com/linux-kdevops/20250416-ext4-jbd2-bh-migrate-corruption
> >
> > The last trace-0004.txt is a fresh one with Davidlohr's patches
> > applied. It has trace-0004-fsck.txt.
>
> Thanks for the data! I was staring at them for some time and at this point
> I'm leaning towards a conclusion that this is actually not a case of
> metadata corruption but rather a bug in ext4 transaction credit computation
> that is completely independent of page migration.
>
> Based on the e2fsck log you've provided the only damage in the filesystem
> is from the aborted transaction handle in the middle of extent tree growth.
> So nothing points to a lost metadata write or anything like that. And the
> credit reservation for page writeback is indeed somewhat racy - we reserve
> number of transaction credits based on current tree depth. However by the
> time we get to ext4_ext_map_blocks() another process could have modified
> the extent tree so we may need to modify more blocks than we originally
> expected and reserved credits for.
>
> Can you give attached patch a try please?
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> From 4c53fb9f4b9b3eb4a579f69b7adcb6524d55629c Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 23 Apr 2025 18:10:54 +0200
> Subject: [PATCH] ext4: Fix calculation of credits for extent tree modification
>
> Luis and David are reporting that after running generic/750 test for 90+
> hours on 2k ext4 filesystem, they are able to trigger a warning in
> jbd2_journal_dirty_metadata() complaining that there are not enough
> credits in the running transaction started in ext4_do_writepages().
>
> Indeed the code in ext4_do_writepages() is racy and the extent tree can
> change between the time we compute credits necessary for extent tree
> computation and the time we actually modify the extent tree. Thus it may
> happen that the number of credits actually needed is higher. Modify
> ext4_ext_index_trans_blocks() to count with the worst case of maximum
> tree depth.
>
> Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
> Reported-by: Davidlohr Bueso <dave@stgolabs.net>
> Reported-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Jan Kara <jack@suse.cz>
I kicked off tests! Let's see after ~ 90 hours!
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-23 20:30 ` Luis Chamberlain
@ 2025-04-25 22:51 ` Luis Chamberlain
2025-04-28 23:08 ` Luis Chamberlain
0 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-25 22:51 UTC (permalink / raw)
To: Jan Kara
Cc: dave, brauner, tytso, adilger.kernel, linux-ext4, riel, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Wed, Apr 23, 2025 at 01:30:29PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 23, 2025 at 07:09:28PM +0200, Jan Kara wrote:
> > On Wed 16-04-25 09:58:30, Luis Chamberlain wrote:
> > > On Tue, Apr 15, 2025 at 06:28:55PM +0200, Jan Kara wrote:
> > > > > So I tried:
> > > > >
> > > > > root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> > > > > e2fsck 1.47.2 (1-Jan-2025)
> > > > > root@e1-ext4-2k /var/lib/xfstests # wc -l log
> > > > > 16411 log
> > > >
> > > > Can you share the log please?
> > >
> > > Sure, here you go:
> > >
> > > https://github.com/linux-kdevops/20250416-ext4-jbd2-bh-migrate-corruption
> > >
> > > The last trace-0004.txt is a fresh one with Davidlohr's patches
> > > applied. It has trace-0004-fsck.txt.
> >
> > Thanks for the data! I was staring at them for some time and at this point
> > I'm leaning towards a conclusion that this is actually not a case of
> > metadata corruption but rather a bug in ext4 transaction credit computation
> > that is completely independent of page migration.
> >
> > Based on the e2fsck log you've provided the only damage in the filesystem
> > is from the aborted transaction handle in the middle of extent tree growth.
> > So nothing points to a lost metadata write or anything like that. And the
> > credit reservation for page writeback is indeed somewhat racy - we reserve
> > number of transaction credits based on current tree depth. However by the
> > time we get to ext4_ext_map_blocks() another process could have modified
> > the extent tree so we may need to modify more blocks than we originally
> > expected and reserved credits for.
> >
> > Can you give attached patch a try please?
> >
> > Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
>
> > From 4c53fb9f4b9b3eb4a579f69b7adcb6524d55629c Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@suse.cz>
> > Date: Wed, 23 Apr 2025 18:10:54 +0200
> > Subject: [PATCH] ext4: Fix calculation of credits for extent tree modification
> >
> > Luis and David are reporting that after running generic/750 test for 90+
> > hours on 2k ext4 filesystem, they are able to trigger a warning in
> > jbd2_journal_dirty_metadata() complaining that there are not enough
> > credits in the running transaction started in ext4_do_writepages().
> >
> > Indeed the code in ext4_do_writepages() is racy and the extent tree can
> > change between the time we compute credits necessary for extent tree
> > computation and the time we actually modify the extent tree. Thus it may
> > happen that the number of credits actually needed is higher. Modify
> > ext4_ext_index_trans_blocks() to count with the worst case of maximum
> > tree depth.
> >
> > Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
> > Reported-by: Davidlohr Bueso <dave@stgolabs.net>
> > Reported-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Jan Kara <jack@suse.cz>
>
> I kicked off tests! Let's see after ~ 90 hours!
Tested-by: kdevops@lists.linux.dev
I have run the test on 3 separate guests, each testing this for over
48 hours. There is no ext4 fs corruption reported, all is good, so I do
believe this fixes the issue. One of the guests was on Linus' tree, which
didn't yet have Davidlohr's fixes for folio migration. And so I believe
this patch should carry a stable tag so stable gets it.
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-25 22:51 ` Luis Chamberlain
@ 2025-04-28 23:08 ` Luis Chamberlain
2025-04-29 9:32 ` Jan Kara
0 siblings, 1 reply; 38+ messages in thread
From: Luis Chamberlain @ 2025-04-28 23:08 UTC (permalink / raw)
To: Jan Kara
Cc: dave, brauner, tytso, adilger.kernel, linux-ext4, riel, willy,
hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Fri, Apr 25, 2025 at 03:51:55PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 23, 2025 at 01:30:29PM -0700, Luis Chamberlain wrote:
> > On Wed, Apr 23, 2025 at 07:09:28PM +0200, Jan Kara wrote:
> > > On Wed 16-04-25 09:58:30, Luis Chamberlain wrote:
> > > > On Tue, Apr 15, 2025 at 06:28:55PM +0200, Jan Kara wrote:
> > > > > > So I tried:
> > > > > >
> > > > > > root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> > > > > > e2fsck 1.47.2 (1-Jan-2025)
> > > > > > root@e1-ext4-2k /var/lib/xfstests # wc -l log
> > > > > > 16411 log
> > > > >
> > > > > Can you share the log please?
> > > >
> > > > Sure, here you go:
> > > >
> > > > https://github.com/linux-kdevops/20250416-ext4-jbd2-bh-migrate-corruption
> > > >
> > > > The last trace-0004.txt is a fresh one with Davidlohr's patches
> > > > applied. It has trace-0004-fsck.txt.
> > >
> > > Thanks for the data! I was staring at them for some time and at this point
> > > I'm leaning towards a conclusion that this is actually not a case of
> > > metadata corruption but rather a bug in ext4 transaction credit computation
> > > that is completely independent of page migration.
> > >
> > > Based on the e2fsck log you've provided the only damage in the filesystem
> > > is from the aborted transaction handle in the middle of extent tree growth.
> > > So nothing points to a lost metadata write or anything like that. And the
> > > credit reservation for page writeback is indeed somewhat racy - we reserve
> > > number of transaction credits based on current tree depth. However by the
> > > time we get to ext4_ext_map_blocks() another process could have modified
> > > the extent tree so we may need to modify more blocks than we originally
> > > expected and reserved credits for.
> > >
> > > Can you give attached patch a try please?
> > >
> > > Honza
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> >
> > > From 4c53fb9f4b9b3eb4a579f69b7adcb6524d55629c Mon Sep 17 00:00:00 2001
> > > From: Jan Kara <jack@suse.cz>
> > > Date: Wed, 23 Apr 2025 18:10:54 +0200
> > > Subject: [PATCH] ext4: Fix calculation of credits for extent tree modification
> > >
> > > Luis and David are reporting that after running generic/750 test for 90+
> > > hours on 2k ext4 filesystem, they are able to trigger a warning in
> > > jbd2_journal_dirty_metadata() complaining that there are not enough
> > > credits in the running transaction started in ext4_do_writepages().
> > >
> > > Indeed the code in ext4_do_writepages() is racy and the extent tree can
> > > change between the time we compute credits necessary for extent tree
> > > computation and the time we actually modify the extent tree. Thus it may
> > > happen that the number of credits actually needed is higher. Modify
> > > ext4_ext_index_trans_blocks() to count with the worst case of maximum
> > > tree depth.
> > >
> > > Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
> > > Reported-by: Davidlohr Bueso <dave@stgolabs.net>
> > > Reported-by: Luis Chamberlain <mcgrof@kernel.org>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> >
> > I kicked off tests! Let's see after ~ 90 hours!
>
> Tested-by: kdevops@lists.linux.dev
>
> I have run the test on 3 separate guests, each testing this for over
> 48 hours. There is no ext4 fs corruption reported, all is good, so I do
> believe this fixes the issue. One of the guests was on Linus' tree, which
> didn't yet have Davidlohr's fixes for folio migration. And so I believe
> this patch should carry a stable tag so stable gets it.
Jan, my testing has now passed 120 hours on multiple guests. This bug is
certainly fixed with your patch.
Luis
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
2025-04-28 23:08 ` Luis Chamberlain
@ 2025-04-29 9:32 ` Jan Kara
0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2025-04-29 9:32 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jan Kara, dave, brauner, tytso, adilger.kernel, linux-ext4, riel,
willy, hannes, oliver.sang, david, axboe, hare, david, djwong,
ritesh.list, linux-fsdevel, linux-block, linux-mm, gost.dev,
p.raghav, da.gomez, syzbot+f3c6fda1297c748a7076
On Mon 28-04-25 16:08:52, Luis Chamberlain wrote:
> On Fri, Apr 25, 2025 at 03:51:55PM -0700, Luis Chamberlain wrote:
> > On Wed, Apr 23, 2025 at 01:30:29PM -0700, Luis Chamberlain wrote:
> > > On Wed, Apr 23, 2025 at 07:09:28PM +0200, Jan Kara wrote:
> > > > On Wed 16-04-25 09:58:30, Luis Chamberlain wrote:
> > > > > On Tue, Apr 15, 2025 at 06:28:55PM +0200, Jan Kara wrote:
> > > > > > > So I tried:
> > > > > > >
> > > > > > > root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> > > > > > > e2fsck 1.47.2 (1-Jan-2025)
> > > > > > > root@e1-ext4-2k /var/lib/xfstests # wc -l log
> > > > > > > 16411 log
> > > > > >
> > > > > > Can you share the log please?
> > > > >
> > > > > Sure, here you go:
> > > > >
> > > > > https://github.com/linux-kdevops/20250416-ext4-jbd2-bh-migrate-corruption
> > > > >
> > > > > The last trace-0004.txt is a fresh one with Davidlohr's patches
> > > > > applied. It has trace-0004-fsck.txt.
> > > >
> > > > Thanks for the data! I was staring at them for some time and at this point
> > > > I'm leaning towards a conclusion that this is actually not a case of
> > > > metadata corruption but rather a bug in ext4 transaction credit computation
> > > > that is completely independent of page migration.
> > > >
> > > > Based on the e2fsck log you've provided the only damage in the filesystem
> > > > is from the aborted transaction handle in the middle of extent tree growth.
> > > > So nothing points to a lost metadata write or anything like that. And the
> > > > credit reservation for page writeback is indeed somewhat racy - we reserve
> > > > number of transaction credits based on current tree depth. However by the
> > > > time we get to ext4_ext_map_blocks() another process could have modified
> > > > the extent tree so we may need to modify more blocks than we originally
> > > > expected and reserved credits for.
> > > >
> > > > Can you give attached patch a try please?
> > > >
> > > > Honza
> > > > --
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR
> > >
> > > > From 4c53fb9f4b9b3eb4a579f69b7adcb6524d55629c Mon Sep 17 00:00:00 2001
> > > > From: Jan Kara <jack@suse.cz>
> > > > Date: Wed, 23 Apr 2025 18:10:54 +0200
> > > > Subject: [PATCH] ext4: Fix calculation of credits for extent tree modification
> > > >
> > > > Luis and David are reporting that after running generic/750 test for 90+
> > > > hours on 2k ext4 filesystem, they are able to trigger a warning in
> > > > jbd2_journal_dirty_metadata() complaining that there are not enough
> > > > credits in the running transaction started in ext4_do_writepages().
> > > >
> > > > Indeed the code in ext4_do_writepages() is racy and the extent tree can
> > > > change between the time we compute credits necessary for extent tree
> > > > computation and the time we actually modify the extent tree. Thus it may
> > > > happen that the number of credits actually needed is higher. Modify
> > > > ext4_ext_index_trans_blocks() to count with the worst case of maximum
> > > > tree depth.
> > > >
> > > > Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
> > > > Reported-by: Davidlohr Bueso <dave@stgolabs.net>
> > > > Reported-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > Signed-off-by: Jan Kara <jack@suse.cz>
> > >
> > > I kicked off tests! Let's see after ~ 90 hours!
> >
> > Tested-by: kdevops@lists.linux.dev
> >
> > I have run the test on 3 separate guests, each testing this for over
> > 48 hours. There is no ext4 fs corruption reported, all is good, so I do
> > believe this fixes the issue. One of the guests was on Linus' tree, which
> > didn't yet have Davidlohr's fixes for folio migration. And so I believe
> > this patch should carry a stable tag so stable gets it.
>
> Jan, my testing has now passed 120 hours on multiple guests. This bug is
> certainly fixed with your patch.
Thanks for testing! I'll do an official submission of the fix.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread
Thread overview: 38+ messages
2025-04-10 1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
2025-04-10 3:18 ` Matthew Wilcox
2025-04-10 12:05 ` Jan Kara
2025-04-14 21:09 ` Luis Chamberlain
2025-04-14 22:19 ` Luis Chamberlain
2025-04-15 9:05 ` Christian Brauner
2025-04-15 15:47 ` Luis Chamberlain
2025-04-15 16:23 ` Jan Kara
2025-04-15 21:06 ` Luis Chamberlain
2025-04-16 2:02 ` Davidlohr Bueso
2025-04-15 11:17 ` Jan Kara
2025-04-15 11:23 ` Jan Kara
2025-04-15 16:18 ` Luis Chamberlain
2025-04-15 16:28 ` Jan Kara
2025-04-16 16:58 ` Luis Chamberlain
2025-04-23 17:09 ` Jan Kara
2025-04-23 20:30 ` Luis Chamberlain
2025-04-25 22:51 ` Luis Chamberlain
2025-04-28 23:08 ` Luis Chamberlain
2025-04-29 9:32 ` Jan Kara
2025-04-15 1:36 ` Davidlohr Bueso
2025-04-15 11:25 ` Jan Kara
2025-04-15 18:14 ` Davidlohr Bueso
2025-04-10 1:49 ` [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups Luis Chamberlain
2025-04-10 14:38 ` Jan Kara
2025-04-10 17:38 ` Davidlohr Bueso
2025-04-10 1:49 ` [PATCH v2 3/8] fs/buffer: introduce __find_get_block_nonatomic() Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 4/8] fs/ocfs2: use sleeping version of __find_get_block() Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 5/8] fs/jbd2: " Luis Chamberlain
2025-04-10 1:49 ` [PATCH v2 6/8] fs/ext4: " Luis Chamberlain
2025-04-10 13:36 ` Jan Kara
2025-04-10 17:32 ` Davidlohr Bueso
2025-04-10 1:49 ` [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2 Luis Chamberlain
2025-04-10 13:40 ` Jan Kara
2025-04-10 17:30 ` Davidlohr Bueso
2025-04-14 12:12 ` Jan Kara
2025-04-10 1:49 ` [PATCH v2 8/8] mm: add migration buffer-head debugfs interface Luis Chamberlain