* [PATCH 0/5] ext4: fix insufficient credits when writing back large folios
From: Zhang Yi @ 2025-05-30  6:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

This series addresses the issue that Jan pointed out regarding large
folio support for ext4[1]. The problem is that the credits calculation
in ext4_meta_trans_blocks() may be insufficient when allocating blocks
while writing back a sufficiently large and discontinuous folio, since
it does not account for the credits needed to update the bitmap and
group descriptor blocks. However, fixing that calculation alone could
lead to a significant overestimation on filesystems with a lot of
block groups.

The solution is to first reserve credits for one page when writing
back a sufficiently large and discontinuous folio, and then attempt to
extend the current transaction's credits as needed. If the credits
reach the upper limit, the handle is stopped and a new transaction is
initiated. In addition, fix the incorrect credits calculation in
ext4_meta_trans_blocks(). Finally, since this solution only works in
dioread_nolock mode, disable large folios if dioread_nolock is
disabled. Please see the following patches for details.

[1] https://lore.kernel.org/linux-ext4/ht54j6bvjmiqt62xmcveqlo7bmrunqs4ji7wikfteftdjijzek@7tz5gpejaoen/
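
For illustration, the reserve-then-extend strategy of patch 1 has
roughly the following shape (a self-contained userspace sketch; the
names, the credit numbers and the transaction cap are all invented,
not kernel API):

#include <stdbool.h>
#include <stdio.h>

#define TXN_CREDIT_LIMIT        256     /* assumed per-transaction credit cap */

static int txn_credits;                 /* credits held by the current handle */

/* Try to extend the running handle; fail once the cap is reached. */
static bool ensure_credits(int needed)
{
        if (txn_credits + needed > TXN_CREDIT_LIMIT)
                return false;
        txn_credits += needed;
        return true;
}

int main(void)
{
        int extents = 100;      /* discontiguous extents in one large folio */
        int per_extent = 10;    /* assumed credits to map one extent */
        int restarts = 0;

        txn_credits = per_extent;       /* reserve credits for one page */
        for (int i = 0; i < extents; i++) {
                if (!ensure_credits(per_extent)) {
                        /* cap reached: stop the handle and start a new
                         * transaction whose reservation covers extent i */
                        restarts++;
                        txn_credits = per_extent;
                }
        }
        printf("handle restarts: %d\n", restarts);
        return 0;
}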

Thanks,
Yi.

Zhang Yi (5):
  ext4: restart handle if credits are insufficient during writepages
  ext4: correct the reserved credits for extent conversion
  ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
  ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
  ext4: disable large folios if dioread_nolock is not enabled

 fs/ext4/ext4_jbd2.h  | 14 ++++++++++
 fs/ext4/inode.c      | 64 +++++++++++++++++++++++++++++++++++---------
 fs/jbd2/journal.c    |  6 +++++
 include/linux/jbd2.h |  1 +
 4 files changed, 72 insertions(+), 13 deletions(-)

-- 
2.46.1



* [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages
From: Zhang Yi @ 2025-05-30  6:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.
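
For a rough sense of the numbers, here is an illustrative estimate
with an assumed per-block metadata cost (the real credit calculation
differs in detail):

#include <stdio.h>

int main(void)
{
        unsigned int folio_bytes = 2 * 1024 * 1024;     /* 2MB folio */
        unsigned int block_size = 1024;                 /* 1K block size */
        unsigned int blocks = folio_bytes / block_size;         /* 2048 */
        unsigned int journal_blocks = (1024 * 1024) / block_size; /* 1MB journal */
        /* assume each discontiguous block touches a bitmap, a group
         * descriptor and an extent tree block: ~3 credits apiece */
        unsigned int worst_case = blocks * 3;

        printf("worst case ~%u credits vs a %u-block journal\n",
               worst_case, journal_blocks);
        return 0;
}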

Fix this by making the write-back process first reserve credits for one
page and attempt to extend the transaction if the credits are
insufficient. In particular, if the credits for a transaction reach
their upper limit, stop the handle and initiate a new transaction.

Note that since we do not support partial folio writeouts, some blocks
within this folio may have been allocated. These allocated extents are
submitted through the current transaction, but the folio itself is not
submitted. To prevent stale data and potential deadlocks in ordered
mode, only the dioread_nolock mode supports this solution, as it always
allocates unwritten extents.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index be9a4cba35fd..5ef34c0c5633 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1680,6 +1680,7 @@ struct mpage_da_data {
 	unsigned int do_map:1;
 	unsigned int scanned_until_end:1;
 	unsigned int journalled_more_data:1;
+	unsigned int continue_map:1;
 };
 
 static void mpage_release_unused_pages(struct mpage_da_data *mpd,
@@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  *
  * @handle - handle for journal operations
  * @mpd - extent to map
+ * @needed_blocks - journal credits needed for one writepages iteration
+ * @check_blocks - journal credits needed for map one extent
  * @give_up_on_write - we set this to true iff there is a fatal error and there
  *                     is no hope of writing the data. The caller should discard
  *                     dirty pages to avoid infinite loops.
@@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  */
 static int mpage_map_and_submit_extent(handle_t *handle,
 				       struct mpage_da_data *mpd,
+				       int needed_blocks, int check_blocks,
 				       bool *give_up_on_write)
 {
 	struct inode *inode = mpd->inode;
@@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	ext4_io_end_t *io_end = mpd->io_submit.io_end;
 	struct ext4_io_end_vec *io_end_vec;
 
+	mpd->continue_map = 0;
+
 	io_end_vec = ext4_alloc_io_end_vec(io_end);
 	if (IS_ERR(io_end_vec))
 		return PTR_ERR(io_end_vec);
@@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		err = mpage_map_and_submit_buffers(mpd);
 		if (err < 0)
 			goto update_disksize;
+		if (!map->m_len)
+			goto update_disksize;
+
+		/*
+		 * For mapping a folio that is sufficiently large and
+		 * discontinuous, the current handle credits may be
+		 * insufficient, try to extend the handle.
+		 */
+		err = __ext4_journal_ensure_credits(handle, check_blocks,
+				needed_blocks, 0);
+		if (err < 0)
+			goto update_disksize;
+		/*
+		 * The credits for the current handle and transaction have
+		 * reached their upper limit, stop the handle and initiate a
+		 * new transaction. Note that some blocks in this folio may
+		 * have been allocated, and these allocated extents are
+		 * submitted through the current transaction, but the folio
+		 * itself is not submitted. To prevent stale data and
+		 * potential deadlock in ordered mode, only the
+		 * dioread_nolock mode supports this.
+		 */
+		if (err > 0) {
+			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
+			mpd->continue_map = 1;
+			err = 0;
+			goto update_disksize;
+		}
 	} while (map->m_len);
 
 update_disksize:
@@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		if (!err)
 			err = err2;
 	}
+	if (!err && mpd->continue_map)
+		ext4_get_io_end(io_end);
+
 	return err;
 }
 
@@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	handle_t *handle = NULL;
 	struct inode *inode = mpd->inode;
 	struct address_space *mapping = inode->i_mapping;
-	int needed_blocks, rsv_blocks = 0, ret = 0;
+	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 	struct blk_plug plug;
 	bool give_up_on_write = false;
@@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 
 	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
 		/* For each extent of pages we use new io_end */
-		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
 		if (!mpd->io_submit.io_end) {
-			ret = -ENOMEM;
-			break;
+			mpd->io_submit.io_end =
+					ext4_init_io_end(inode, GFP_KERNEL);
+			if (!mpd->io_submit.io_end) {
+				ret = -ENOMEM;
+				break;
+			}
 		}
 
 		WARN_ON_ONCE(!mpd->can_map);
@@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 		 */
 		BUG_ON(ext4_should_journal_data(inode));
 		needed_blocks = ext4_da_writepages_trans_blocks(inode);
+		check_blocks = ext4_chunk_trans_blocks(inode,
+				MAX_WRITEPAGES_EXTENT_LEN);
 
 		/* start a new transaction */
 		handle = ext4_journal_start_with_reserve(inode,
-				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
+				EXT4_HT_WRITE_PAGE, needed_blocks,
+				mpd->continue_map ? 0 : rsv_blocks);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
 			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
@@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 		ret = mpage_prepare_extent_to_map(mpd);
 		if (!ret && mpd->map.m_len)
 			ret = mpage_map_and_submit_extent(handle, mpd,
+					needed_blocks, check_blocks,
 					&give_up_on_write);
 		/*
 		 * Caution: If the handle is synchronous,
@@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 			ext4_journal_stop(handle);
 		} else
 			ext4_put_io_end(mpd->io_submit.io_end);
-		mpd->io_submit.io_end = NULL;
+		if (ret || !mpd->continue_map)
+			mpd->io_submit.io_end = NULL;
 
 		if (ret == -ENOSPC && sbi->s_journal) {
 			/*
-- 
2.46.1



* [PATCH 2/5] ext4: correct the reserved credits for extent conversion
From: Zhang Yi @ 2025-05-30  6:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Now, we reserve journal credits for converting extents in only one page
to written state when the I/O operation is complete. This is
insufficient when large folios are enabled.

Fix this by reserving credits for converting up to one extent per block
in the largest 2MB folio. This calculation only involves extent index
and leaf blocks, so it should not overestimate the credits.

Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5ef34c0c5633..d35c07c1dcac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2808,12 +2808,12 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	mpd->journalled_more_data = 0;
 
 	if (ext4_should_dioread_nolock(inode)) {
+		int bpf = ext4_journal_blocks_per_folio(inode);
 		/*
 		 * We may need to convert up to one extent per block in
-		 * the page and we may dirty the inode.
+		 * the folio and we may dirty the inode.
 		 */
-		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
-						PAGE_SIZE >> inode->i_blkbits);
+		rsv_blocks = 1 + ext4_ext_index_trans_blocks(inode, bpf);
 	}
 
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-- 
2.46.1



* [PATCH 3/5] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
From: Zhang Yi @ 2025-05-30  6:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

This partially reverts commit d6bf294773a4 ("ext4/jbd2: convert
jbd2_journal_blocks_per_page() to support large folio"). Using
jbd2_journal_blocks_per_folio() here leads to a significant
overestimation of journal credits. Since we still reserve credits for
one page and attempt to extend and restart handles during large folio
writeback, convert the affected callers back to
ext4_journal_blocks_per_page().
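
To see the scale of the overestimation, compare the two block counts
(an illustrative calculation assuming 4K pages, 1K blocks and a 2MB
maximum folio):

#include <stdio.h>

int main(void)
{
        unsigned int page_shift = 12;   /* 4K pages */
        unsigned int block_bits = 10;   /* 1K blocks */
        unsigned int folio_order = 9;   /* assumed 2MB maximum folio */

        unsigned int per_page = 1u << (page_shift - block_bits);        /* 4 */
        unsigned int per_folio = per_page << folio_order;               /* 2048 */

        printf("blocks per page: %u, blocks per max folio: %u\n",
               per_page, per_folio);
        return 0;
}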

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4_jbd2.h  | 7 +++++++
 fs/ext4/inode.c      | 6 +++---
 fs/jbd2/journal.c    | 6 ++++++
 include/linux/jbd2.h | 1 +
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..c0ee756cb34c 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -326,6 +326,13 @@ static inline int ext4_journal_blocks_per_folio(struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_journal_blocks_per_page(struct inode *inode)
+{
+	if (EXT4_JOURNAL(inode) != NULL)
+		return jbd2_journal_blocks_per_page(inode);
+	return 0;
+}
+
 static inline int ext4_journal_force_commit(journal_t *journal)
 {
 	if (journal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d35c07c1dcac..1818a2a7ba8f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2516,7 +2516,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
  */
 static int ext4_da_writepages_trans_blocks(struct inode *inode)
 {
-	int bpp = ext4_journal_blocks_per_folio(inode);
+	int bpp = ext4_journal_blocks_per_page(inode);
 
 	return ext4_meta_trans_blocks(inode,
 				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
@@ -2594,7 +2594,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 	ext4_lblk_t lblk;
 	struct buffer_head *head;
 	handle_t *handle = NULL;
-	int bpp = ext4_journal_blocks_per_folio(mpd->inode);
+	int bpp = ext4_journal_blocks_per_page(mpd->inode);
 
 	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
@@ -6221,7 +6221,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
  */
 int ext4_writepage_trans_blocks(struct inode *inode)
 {
-	int bpp = ext4_journal_blocks_per_folio(inode);
+	int bpp = ext4_journal_blocks_per_page(inode);
 	int ret;
 
 	ret = ext4_meta_trans_blocks(inode, bpp, bpp);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 6d5e76848733..2d8e9053c3cf 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -84,6 +84,7 @@ EXPORT_SYMBOL(jbd2_journal_start_commit);
 EXPORT_SYMBOL(jbd2_journal_force_commit_nested);
 EXPORT_SYMBOL(jbd2_journal_wipe);
 EXPORT_SYMBOL(jbd2_journal_blocks_per_folio);
+EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
 EXPORT_SYMBOL(jbd2_journal_invalidate_folio);
 EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
@@ -2661,6 +2662,11 @@ int jbd2_journal_blocks_per_folio(struct inode *inode)
 		     inode->i_sb->s_blocksize_bits);
 }
 
+int jbd2_journal_blocks_per_page(struct inode *inode)
+{
+	return 1 << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
+}
+
 /*
  * helper functions to deal with 32 or 64bit block numbers.
  */
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 43b9297fe8a7..f35369c104ba 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1724,6 +1724,7 @@ static inline int tid_geq(tid_t x, tid_t y)
 }
 
 extern int jbd2_journal_blocks_per_folio(struct inode *inode);
+extern int jbd2_journal_blocks_per_page(struct inode *inode);
 extern size_t journal_tag_bytes(journal_t *journal);
 
 static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
-- 
2.46.1



* [PATCH 4/5] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
From: Zhang Yi @ 2025-05-30  6:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

The calculation of journal credits in ext4_meta_trans_blocks() should
include pextents, as each extent may be allocated from a different
block group and thus require updating a different bitmap and group
descriptor block.
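
A simplified model of the estimate (illustrative only; it abbreviates
the real ext4_meta_trans_blocks() body, and the numbers in main() are
made up):

#include <stdio.h>

/*
 * Model: every distinct block group touched costs one bitmap block and
 * one group descriptor block, clamped by the filesystem totals.
 */
static int meta_trans_blocks_model(int idxblocks, int pextents,
                                   int ngroups, int gdp_total)
{
        int groups = idxblocks + pextents;      /* the fix: count pextents too */
        int gdpblocks = groups;

        if (groups > ngroups)
                groups = ngroups;
        if (gdpblocks > gdp_total)
                gdpblocks = gdp_total;

        /* index/leaf blocks + bitmap blocks + group descriptor blocks */
        return idxblocks + groups + gdpblocks;
}

int main(void)
{
        /* e.g. 8 index/leaf blocks, 32 physical extents on a large fs */
        printf("credits ~%d\n", meta_trans_blocks_model(8, 32, 160, 10));
        return 0;
}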

Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks")
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1818a2a7ba8f..e7de2fafc941 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6184,7 +6184,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
 	int ret;
 
 	/*
-	 * How many index and lead blocks need to touch to map @lblocks
+	 * How many index and leaf blocks need to touch to map @lblocks
 	 * logical blocks to @pextents physical extents?
 	 */
 	idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
@@ -6193,7 +6193,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
 	 * Now let's see how many group bitmaps and group descriptors need
 	 * to account
 	 */
-	groups = idxblocks;
+	groups = idxblocks + pextents;
 	gdpblocks = groups;
 	if (groups > ngroups)
 		groups = ngroups;
-- 
2.46.1



* [PATCH 5/5] ext4: disable large folios if dioread_nolock is not enabled
From: Zhang Yi @ 2025-05-30  6:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

The write-back process cannot restart a journal transaction when
submitting a sufficiently large and discontinuous folio if
dioread_nolock is disabled. To address this, disable large folios when
building an inode if dioread_nolock is disabled, and also ensure that
dioread_nolock cannot be disabled on an active inode that has large
folios enabled.

Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4_jbd2.h | 7 +++++++
 fs/ext4/inode.c     | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index c0ee756cb34c..59292da272ef 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -422,6 +422,13 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
  */
 static inline int ext4_should_dioread_nolock(struct inode *inode)
 {
+	/*
+	 * Cannot disable dioread_nolock on an active inode that has
+	 * large folio enabled.
+	 */
+	if (mapping_large_folio_support(inode->i_mapping))
+		return 1;
+
 	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
 		return 0;
 	if (!S_ISREG(inode->i_mode))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e7de2fafc941..421c7bbc3ca9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5164,6 +5164,8 @@ bool ext4_should_enable_large_folio(struct inode *inode)
 		return false;
 	if (ext4_has_feature_encrypt(sb))
 		return false;
+	if (!ext4_should_dioread_nolock(inode))
+		return false;
 
 	return true;
 }
-- 
2.46.1



* Re: [PATCH 2/5] ext4: correct the reserved credits for extent conversion
From: Jan Kara @ 2025-06-05 11:32 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Fri 30-05-25 14:28:55, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Now, we reserve journal credits for converting extents in only one page
> to written state when the I/O operation is complete. This is
> insufficient when large folios are enabled.
> 
> Fix this by reserving credits for converting up to one extent per block
> in the largest 2MB folio. This calculation only involves extent index
> and leaf blocks, so it should not overestimate the credits.
> 
> Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5ef34c0c5633..d35c07c1dcac 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2808,12 +2808,12 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  	mpd->journalled_more_data = 0;
>  
>  	if (ext4_should_dioread_nolock(inode)) {
> +		int bpf = ext4_journal_blocks_per_folio(inode);
>  		/*
>  		 * We may need to convert up to one extent per block in
> -		 * the page and we may dirty the inode.
> +		 * the folio and we may dirty the inode.
>  		 */
> -		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
> -						PAGE_SIZE >> inode->i_blkbits);
> +		rsv_blocks = 1 + ext4_ext_index_trans_blocks(inode, bpf);
>  	}
>  
>  	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 4/5] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
From: Jan Kara @ 2025-06-05 13:44 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Fri 30-05-25 14:28:57, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The calculation of journal credits in ext4_meta_trans_blocks() should
> include pextents, as each extent may be allocated from a different
> block group and thus require updating a different bitmap and group
> descriptor block.
> 
> Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks")
> Reported-by: Jan Kara <jack@suse.cz>
> Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1818a2a7ba8f..e7de2fafc941 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6184,7 +6184,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
>  	int ret;
>  
>  	/*
> -	 * How many index and lead blocks need to touch to map @lblocks
> +	 * How many index and leaf blocks need to touch to map @lblocks
>  	 * logical blocks to @pextents physical extents?
>  	 */
>  	idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
> @@ -6193,7 +6193,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
>  	 * Now let's see how many group bitmaps and group descriptors need
>  	 * to account
>  	 */
> -	groups = idxblocks;
> +	groups = idxblocks + pextents;
>  	gdpblocks = groups;
>  	if (groups > ngroups)
>  		groups = ngroups;
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages
From: Jan Kara @ 2025-06-05 14:04 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Fri 30-05-25 14:28:54, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> After large folios are supported on ext4, writing back a sufficiently
> large and discontinuous folio may consume a significant number of
> journal credits, placing considerable strain on the journal. For
> example, in a 20GB filesystem with 1K block size and 1MB journal size,
> writing back a 2MB folio could require thousands of credits in the
> worst-case scenario (when each block is discontinuous and distributed
> across different block groups), potentially exceeding the journal size.
> 
> Fix this by making the write-back process first reserve credits for one
> page and attempt to extend the transaction if the credits are
> insufficient. In particular, if the credits for a transaction reach
> their upper limit, stop the handle and initiate a new transaction.
> 
> Note that since we do not support partial folio writeouts, some blocks
> within this folio may have been allocated. These allocated extents are
> submitted through the current transaction, but the folio itself is not
> submitted. To prevent stale data and potential deadlocks in ordered
> mode, only the dioread_nolock mode supports this solution, as it always
> allocates unwritten extents.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Couple of simplification suggestions below and one bigger issue we need to
deal with.

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index be9a4cba35fd..5ef34c0c5633 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1680,6 +1680,7 @@ struct mpage_da_data {
>  	unsigned int do_map:1;
>  	unsigned int scanned_until_end:1;
>  	unsigned int journalled_more_data:1;
> +	unsigned int continue_map:1;
>  };
>  
>  static void mpage_release_unused_pages(struct mpage_da_data *mpd,
> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>   *
>   * @handle - handle for journal operations
>   * @mpd - extent to map
> + * @needed_blocks - journal credits needed for one writepages iteration
> + * @check_blocks - journal credits needed for map one extent
>   * @give_up_on_write - we set this to true iff there is a fatal error and there
>   *                     is no hope of writing the data. The caller should discard
>   *                     dirty pages to avoid infinite loops.
> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>   */
>  static int mpage_map_and_submit_extent(handle_t *handle,
>  				       struct mpage_da_data *mpd,
> +				       int needed_blocks, int check_blocks,
>  				       bool *give_up_on_write)
>  {
>  	struct inode *inode = mpd->inode;
> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  	ext4_io_end_t *io_end = mpd->io_submit.io_end;
>  	struct ext4_io_end_vec *io_end_vec;
>  
> +	mpd->continue_map = 0;
> +
>  	io_end_vec = ext4_alloc_io_end_vec(io_end);
>  	if (IS_ERR(io_end_vec))
>  		return PTR_ERR(io_end_vec);
> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  		err = mpage_map_and_submit_buffers(mpd);
>  		if (err < 0)
>  			goto update_disksize;
> +		if (!map->m_len)
> +			goto update_disksize;
> +
> +		/*
> +		 * For mapping a folio that is sufficiently large and
> +		 * discontinuous, the current handle credits may be
> +		 * insufficient, try to extend the handle.
> +		 */
> +		err = __ext4_journal_ensure_credits(handle, check_blocks,
> +				needed_blocks, 0);
> +		if (err < 0)
> +			goto update_disksize;

IMO it would be more logical to have __ext4_journal_ensure_credits() in
mpage_map_one_extent() where the handle is actually used. Also there it
would be pretty logical to do:

		/* Make sure transaction has enough credits for this extent */
		needed_credits = ext4_chunk_trans_blocks(inode, 1);
		err = ext4_journal_ensure_credits(handle, needed_credits, 0);

No need to extend the transaction by more than we need for this current
extent and also no need to propagate needed credits here.

If __ext4_journal_ensure_credits() cannot extend the transaction, we can
just return -EAGAIN (or something like that) and make sure the retry logic
on ENOSPC or similar transient errors in mpage_map_and_submit_extent()
applies properly.

> +		/*
> +		 * The credits for the current handle and transaction have
> +		 * reached their upper limit, stop the handle and initiate a
> +		 * new transaction. Note that some blocks in this folio may
> +		 * have been allocated, and these allocated extents are
> +		 * submitted through the current transaction, but the folio
> +		 * itself is not submitted. To prevent stale data and
> +		 * potential deadlock in ordered mode, only the
> +		 * dioread_nolock mode supports this.
> +		 */
> +		if (err > 0) {
> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
> +			mpd->continue_map = 1;
> +			err = 0;
> +			goto update_disksize;
> +		}
>  	} while (map->m_len);
>  
>  update_disksize:
> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  		if (!err)
>  			err = err2;
>  	}
> +	if (!err && mpd->continue_map)
> +		ext4_get_io_end(io_end);
> +

IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
ext4_do_writepages() if we see we need to continue doing mapping for the
current io_end.

That way it would also be more obvious that you've just reintroduced the
deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
writeback"). This is actually a fundamental thing because for
ext4_journal_stop() to complete, we may need IO on the folio to finish
which means we need io_end to be processed. Even if we avoided the awkward
case with sync handle described in 646caa9c8e196, to be able to start a new
handle we may need to complete a previous transaction commit to be able to
make space in the journal.

Thinking some more about this, holding ioend for a folio with partially
submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
can call folio_wait_writeback() which will effectively wait for the last
reference to ioend to be dropped so that underlying extents can be
converted and folio_writeback bit cleared.

So what I think we need to do is that if we submit part of the folio and
cannot submit it all, we just redirty the folio and bail out of the mapping
loop (similarly as in ENOSPC case). Then once IO completes
mpage_prepare_extent_to_map() is able to start working on the folio again.
Since we cleared dirty bits in the buffers we should not be repeating the
work we already did...

								Honza

>  	return err;
>  }
>  
> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  	handle_t *handle = NULL;
>  	struct inode *inode = mpd->inode;
>  	struct address_space *mapping = inode->i_mapping;
> -	int needed_blocks, rsv_blocks = 0, ret = 0;
> +	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>  	struct blk_plug plug;
>  	bool give_up_on_write = false;
> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  
>  	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
>  		/* For each extent of pages we use new io_end */
> -		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>  		if (!mpd->io_submit.io_end) {
> -			ret = -ENOMEM;
> -			break;
> +			mpd->io_submit.io_end =
> +					ext4_init_io_end(inode, GFP_KERNEL);
> +			if (!mpd->io_submit.io_end) {
> +				ret = -ENOMEM;
> +				break;
> +			}
>  		}
>  
>  		WARN_ON_ONCE(!mpd->can_map);
> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  		 */
>  		BUG_ON(ext4_should_journal_data(inode));
>  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
> +		check_blocks = ext4_chunk_trans_blocks(inode,
> +				MAX_WRITEPAGES_EXTENT_LEN);
>  
>  		/* start a new transaction */
>  		handle = ext4_journal_start_with_reserve(inode,
> -				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
> +				EXT4_HT_WRITE_PAGE, needed_blocks,
> +				mpd->continue_map ? 0 : rsv_blocks);
>  		if (IS_ERR(handle)) {
>  			ret = PTR_ERR(handle);
>  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  		ret = mpage_prepare_extent_to_map(mpd);
>  		if (!ret && mpd->map.m_len)
>  			ret = mpage_map_and_submit_extent(handle, mpd,
> +					needed_blocks, check_blocks,
>  					&give_up_on_write);
>  		/*
>  		 * Caution: If the handle is synchronous,
> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>  			ext4_journal_stop(handle);
>  		} else
>  			ext4_put_io_end(mpd->io_submit.io_end);
> -		mpd->io_submit.io_end = NULL;
> +		if (ret || !mpd->continue_map)
> +			mpd->io_submit.io_end = NULL;
>  
>  		if (ret == -ENOSPC && sbi->s_journal) {
>  			/*
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages
From: Zhang Yi @ 2025-06-06  6:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/5 22:04, Jan Kara wrote:
> On Fri 30-05-25 14:28:54, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> After large folios are supported on ext4, writing back a sufficiently
>> large and discontinuous folio may consume a significant number of
>> journal credits, placing considerable strain on the journal. For
>> example, in a 20GB filesystem with 1K block size and 1MB journal size,
>> writing back a 2MB folio could require thousands of credits in the
>> worst-case scenario (when each block is discontinuous and distributed
>> across different block groups), potentially exceeding the journal size.
>>
>> Fix this by making the write-back process first reserve credits for one
>> page and attempt to extend the transaction if the credits are
>> insufficient. In particular, if the credits for a transaction reach
>> their upper limit, stop the handle and initiate a new transaction.
>>
>> Note that since we do not support partial folio writeouts, some blocks
>> within this folio may have been allocated. These allocated extents are
>> submitted through the current transaction, but the folio itself is not
>> submitted. To prevent stale data and potential deadlocks in ordered
>> mode, only the dioread_nolock mode supports this solution, as it always
>> allocates unwritten extents.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Couple of simplification suggestions below and one bigger issue we need to
> deal with.
> 
Thanks a lot for your comments and suggestions.


>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index be9a4cba35fd..5ef34c0c5633 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1680,6 +1680,7 @@ struct mpage_da_data {
>>  	unsigned int do_map:1;
>>  	unsigned int scanned_until_end:1;
>>  	unsigned int journalled_more_data:1;
>> +	unsigned int continue_map:1;
>>  };
>>  
>>  static void mpage_release_unused_pages(struct mpage_da_data *mpd,
>> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   *
>>   * @handle - handle for journal operations
>>   * @mpd - extent to map
>> + * @needed_blocks - journal credits needed for one writepages iteration
>> + * @check_blocks - journal credits needed for map one extent
>>   * @give_up_on_write - we set this to true iff there is a fatal error and there
>>   *                     is no hope of writing the data. The caller should discard
>>   *                     dirty pages to avoid infinite loops.
>> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   */
>>  static int mpage_map_and_submit_extent(handle_t *handle,
>>  				       struct mpage_da_data *mpd,
>> +				       int needed_blocks, int check_blocks,
>>  				       bool *give_up_on_write)
>>  {
>>  	struct inode *inode = mpd->inode;
>> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  	ext4_io_end_t *io_end = mpd->io_submit.io_end;
>>  	struct ext4_io_end_vec *io_end_vec;
>>  
>> +	mpd->continue_map = 0;
>> +
>>  	io_end_vec = ext4_alloc_io_end_vec(io_end);
>>  	if (IS_ERR(io_end_vec))
>>  		return PTR_ERR(io_end_vec);
>> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  		err = mpage_map_and_submit_buffers(mpd);
>>  		if (err < 0)
>>  			goto update_disksize;
>> +		if (!map->m_len)
>> +			goto update_disksize;
>> +
>> +		/*
>> +		 * For mapping a folio that is sufficiently large and
>> +		 * discontinuous, the current handle credits may be
>> +		 * insufficient, try to extend the handle.
>> +		 */
>> +		err = __ext4_journal_ensure_credits(handle, check_blocks,
>> +				needed_blocks, 0);
>> +		if (err < 0)
>> +			goto update_disksize;
> 
> IMO it would be more logical to have __ext4_journal_ensure_credits() in
> mpage_map_one_extent() where the handle is actually used. Also there it
> would be pretty logical to do:
> 
> 		/* Make sure transaction has enough credits for this extent */
> 		needed_credits = ext4_chunk_trans_blocks(inode, 1);
> 		err = ext4_journal_ensure_credits(handle, needed_credits, 0);
> 
> No need to extend the transaction by more than we need for this current
> extent and also no need to propagate needed credits here.
> 
> If __ext4_journal_ensure_credits() cannot extend the transaction, we can
> just return -EAGAIN (or something like that) and make sure the retry logic
> on ENOSPC or similar transient errors in mpage_map_and_submit_extent()
> applies properly.

Yeah, that would be simpler.

> 
>> +		/*
>> +		 * The credits for the current handle and transaction have
>> +		 * reached their upper limit, stop the handle and initiate a
>> +		 * new transaction. Note that some blocks in this folio may
>> +		 * have been allocated, and these allocated extents are
>> +		 * submitted through the current transaction, but the folio
>> +		 * itself is not submitted. To prevent stale data and
>> +		 * potential deadlock in ordered mode, only the
>> +		 * dioread_nolock mode supports this.
>> +		 */
>> +		if (err > 0) {
>> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
>> +			mpd->continue_map = 1;
>> +			err = 0;
>> +			goto update_disksize;
>> +		}
>>  	} while (map->m_len);
>>  
>>  update_disksize:
>> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  		if (!err)
>>  			err = err2;
>>  	}
>> +	if (!err && mpd->continue_map)
>> +		ext4_get_io_end(io_end);
>> +
> 
> IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
> ext4_do_writepages() if we see we need to continue doing mapping for the
> current io_end.
> 
> That way it would also be more obvious that you've just reintroduced the
> deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> writeback"). This is actually a fundamental thing because for
> ext4_journal_stop() to complete, we may need IO on the folio to finish
> which means we need io_end to be processed. Even if we avoided the awkward
> case with sync handle described in 646caa9c8e196, to be able to start a new
> handle we may need to complete a previous transaction commit to be able to
> make space in the journal.

Yeah, you are right, I missed the full folios that were attached to the
same io_end in the previous rounds. If we continue to use this solution,
I think we should split the io_end and submit the previous one which
includes those full folios before the previous transaction is
committed.

> 
> Thinking some more about this, holding ioend for a folio with partially
> submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
> can call folio_wait_writeback() which will effectively wait for the last
> reference to ioend to be dropped so that underlying extents can be
> converted and folio_writeback bit cleared.

I don't understand this one. The mpage_prepare_extent_to_map() should
call folio_wait_writeback() for the current processing partial folio,
not the previous full folios that were attached to the io_end. This is
because mpd->first_page should be moved forward in mpage_folio_done()
once we complete the previous full folio. Besides, in my current
solution, the current partial folio will not be submitted, the
folio_writeback flag will not be set, so how does this deadlock happen?

> 
> So what I think we need to do is that if we submit part of the folio and
> cannot submit it all, we just redirty the folio and bail out of the mapping
> loop (similarly as in ENOSPC case).

After looking at the ENOSPC case again, I found that the handling of
ENOSPC before we enabled large folios is also wrong, it may cause stale
data on 1K block size. Suppose we are processing four bhs on a dirty
page. We map the first bh, and the corresponding io_vec is added to the
io_end, with the unwritten flag set. However, when we attempt to map
the second bh, we bail out of the loop due to ENOSPC. At this point,
ext4_do_writepages()->ext4_put_io_end() will convert the extent of the
first bh to written. However, since the folio itself has not been
submitted (mpage_submit_folio() only submits a full folio), it will
trigger a stale data issue. Is that right? I suppose we also have to
write the partial folio out in this case.

> Then once IO completes
> mpage_prepare_extent_to_map() is able to start working on the folio again.
> Since we cleared dirty bits in the buffers we should not be repeating the
> work we already did...
> 

Hmm, it looks like this solution should work. We should introduce a
partial folio version of mpage_submit_folio(), call it and redirty
the folio once we need to bail out of the loop due to insufficient
space or journal credits. But ext4_bio_write_folio() will handle the
logic of the fscrypt case, and I'm not familiar with fscrypt, so I'm not
sure it could handle the partial page properly. I'll give it a try.

Thanks,
Yi.

> 
>>  	return err;
>>  }
>>  
>> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  	handle_t *handle = NULL;
>>  	struct inode *inode = mpd->inode;
>>  	struct address_space *mapping = inode->i_mapping;
>> -	int needed_blocks, rsv_blocks = 0, ret = 0;
>> +	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
>>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>>  	struct blk_plug plug;
>>  	bool give_up_on_write = false;
>> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  
>>  	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
>>  		/* For each extent of pages we use new io_end */
>> -		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>>  		if (!mpd->io_submit.io_end) {
>> -			ret = -ENOMEM;
>> -			break;
>> +			mpd->io_submit.io_end =
>> +					ext4_init_io_end(inode, GFP_KERNEL);
>> +			if (!mpd->io_submit.io_end) {
>> +				ret = -ENOMEM;
>> +				break;
>> +			}
>>  		}
>>  
>>  		WARN_ON_ONCE(!mpd->can_map);
>> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  		 */
>>  		BUG_ON(ext4_should_journal_data(inode));
>>  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
>> +		check_blocks = ext4_chunk_trans_blocks(inode,
>> +				MAX_WRITEPAGES_EXTENT_LEN);
>>  
>>  		/* start a new transaction */
>>  		handle = ext4_journal_start_with_reserve(inode,
>> -				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
>> +				EXT4_HT_WRITE_PAGE, needed_blocks,
>> +				mpd->continue_map ? 0 : rsv_blocks);
>>  		if (IS_ERR(handle)) {
>>  			ret = PTR_ERR(handle);
>>  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
>> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  		ret = mpage_prepare_extent_to_map(mpd);
>>  		if (!ret && mpd->map.m_len)
>>  			ret = mpage_map_and_submit_extent(handle, mpd,
>> +					needed_blocks, check_blocks,
>>  					&give_up_on_write);
>>  		/*
>>  		 * Caution: If the handle is synchronous,
>> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  			ext4_journal_stop(handle);
>>  		} else
>>  			ext4_put_io_end(mpd->io_submit.io_end);
>> -		mpd->io_submit.io_end = NULL;
>> +		if (ret || !mpd->continue_map)
>> +			mpd->io_submit.io_end = NULL;
>>  
>>  		if (ret == -ENOSPC && sbi->s_journal) {
>>  			/*
>> -- 
>> 2.46.1
>>



* Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages
From: Jan Kara @ 2025-06-06 13:16 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Fri 06-06-25 14:54:21, Zhang Yi wrote:
> On 2025/6/5 22:04, Jan Kara wrote:
> >> +		/*
> >> +		 * The credits for the current handle and transaction have
> >> +		 * reached their upper limit, stop the handle and initiate a
> >> +		 * new transaction. Note that some blocks in this folio may
> >> +		 * have been allocated, and these allocated extents are
> >> +		 * submitted through the current transaction, but the folio
> >> +		 * itself is not submitted. To prevent stale data and
> >> +		 * potential deadlock in ordered mode, only the
> >> +		 * dioread_nolock mode supports this.
> >> +		 */
> >> +		if (err > 0) {
> >> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
> >> +			mpd->continue_map = 1;
> >> +			err = 0;
> >> +			goto update_disksize;
> >> +		}
> >>  	} while (map->m_len);
> >>  
> >>  update_disksize:
> >> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> >>  		if (!err)
> >>  			err = err2;
> >>  	}
> >> +	if (!err && mpd->continue_map)
> >> +		ext4_get_io_end(io_end);
> >> +
> > 
> > IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
> > ext4_do_writepages() if we see we need to continue doing mapping for the
> > current io_end.
> > 
> > That way it would also be more obvious that you've just reintroduced the
> > deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> > writeback"). This is actually a fundamental thing because for
> > ext4_journal_stop() to complete, we may need IO on the folio to finish
> > which means we need io_end to be processed. Even if we avoided the awkward
> > case with sync handle described in 646caa9c8e196, to be able to start a new
> > handle we may need to complete a previous transaction commit to be able to
> > make space in the journal.
> 
> Yeah, you are right, I missed the full folios that were attached to the
> same io_end in the previous rounds. If we continue to use this solution,
> I think we should split the io_end and submit the previous one which
> includes those full folios before the previous transaction is
> committed.

Yes, fully mapped folios definitely need to be submitted. But I think that
should be handled by the ext4_io_submit() call in the ext4_do_writepages()
loop?

> > Thinking some more about this, holding ioend for a folio with partially
> > submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
> > can call folio_wait_writeback() which will effectively wait for the last
> > reference to ioend to be dropped so that underlying extents can be
> > converted and folio_writeback bit cleared.
> 
> I don't understand this one. The mpage_prepare_extent_to_map() should
> call folio_wait_writeback() for the current processing partial folio,
> not the previous full folios that were attached to the io_end. This is
> because mpd->first_page should be moved forward in mpage_folio_done()
> once we complete the previous full folio. Besides, in my current
> solution, the current partial folio will not be submitted, the
> folio_writeback flag will not be set, so how does this deadlock happen?

Sorry, this was me being confused. I went through the path again and indeed
if we cannot map all buffers underlying the folio, we don't clear buffer
(and folio) dirty bits and don't set folio writeback bit so there's no
deadlock there.

> > So what I think we need to do is that if we submit part of the folio and
> > cannot submit it all, we just redirty the folio and bail out of the mapping
> > loop (similarly as in ENOSPC case).
> 
> After looking at the ENOSPC case again, I found that the handling of
> ENOSPC before we enabled large folios is also wrong, it may cause stale
> data on 1K block size. Suppose we are processing four bhs on a dirty
> page. We map the first bh, and the corresponding io_vec is added to the
> io_end, with the unwritten flag set. However, when we attempt to map
> the second bh, we bail out of the loop due to ENOSPC. At this point,
> ext4_do_writepages()->ext4_put_io_end() will convert the extent of the
> first bh to written. However, since the folio itself has not been
> submitted (mpage_submit_folio() only submits a full folio), it will
> trigger a stale data issue. Is that right? I suppose we also have to
> write the partial folio out in this case.

Yes, this case will actually be problematic both with dioread_nolock and
without it (as in this case we create written extents from the start and
we tell JBD2 to only wait for data IO to complete but not to submit it).
We really need to make sure the partially mapped folio is submitted for
IO as well in this case.

> > Then once IO completes
> > mpage_prepare_extent_to_map() is able to start working on the folio again.
> > Since we cleared dirty bits in the buffers we should not be repeating the
> > work we already did...
> > 
> 
> Hmm, it looks like this solution should work. We should introduce a
> partial folio version of mpage_submit_folio(), call it and redirty
> the folio once we need to bail out of the loop due to insufficient
> space or journal credits. But ext4_bio_write_folio() will handle the
> logic of the fscrypt case, and I'm not familiar with fscrypt, so I'm not
> sure it could handle the partial page properly. I'll give it a try.

As far as I can tell it should work fine. The logic in
ext4_bio_write_folio() is already prepared for handling partial folio
writeouts, redirtying of the page etc. (because it needs to handle writeout
from transaction commit where we can writeout only parts of folios with
underlying blocks allocated). We just need to teach mpage_submit_folio() to
subtract only the written-out number of pages from nr_to_write.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages
From: Zhang Yi @ 2025-06-07  3:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/6 21:16, Jan Kara wrote:
> On Fri 06-06-25 14:54:21, Zhang Yi wrote:
>> On 2025/6/5 22:04, Jan Kara wrote:
>>>> +		/*
>>>> +		 * The credits for the current handle and transaction have
>>>> +		 * reached their upper limit, stop the handle and initiate a
>>>> +		 * new transaction. Note that some blocks in this folio may
>>>> +		 * have been allocated, and these allocated extents are
>>>> +		 * submitted through the current transaction, but the folio
>>>> +		 * itself is not submitted. To prevent stale data and
>>>> +		 * potential deadlock in ordered mode, only the
>>>> +		 * dioread_nolock mode supports this.
>>>> +		 */
>>>> +		if (err > 0) {
>>>> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
>>>> +			mpd->continue_map = 1;
>>>> +			err = 0;
>>>> +			goto update_disksize;
>>>> +		}
>>>>  	} while (map->m_len);
>>>>  
>>>>  update_disksize:
>>>> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>>>  		if (!err)
>>>>  			err = err2;
>>>>  	}
>>>> +	if (!err && mpd->continue_map)
>>>> +		ext4_get_io_end(io_end);
>>>> +
>>>
>>> IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
>>> ext4_do_writepages() if we see we need to continue doing mapping for the
>>> current io_end.
>>>
>>> That way it would be also more obvious that you've just reintroduced
>>> deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
>>> writeback"). This is actually a fundamental thing because for
>>> ext4_journal_stop() to complete, we may need IO on the folio to finish
>>> which means we need io_end to be processed. Even if we avoided the awkward
>>> case with sync handle described in 646caa9c8e196, to be able to start a new
>>> handle we may need to complete a previous transaction commit to be able to
>>> make space in the journal.
>>
>> Yeah, you are right, I missed the full folios that were attached to the
>> same io_end in the previous rounds. If we continue to use this solution,
>> I think we should split the io_end and submit the previous one which
>> includes those full folios before the previous transaction is
>> committed.
> 
> Yes, fully mapped folios definitely need to be submitted. But I think that
> should be handled by the ext4_io_submit() call in the ext4_do_writepages()
> loop?

Sorry, my previous description may not have been clear enough. The
deadlock issue in this solution should be:

1. In the latest round of ext4_do_writepages(),
   mpage_prepare_extent_to_map() may add some contiguous fully mapped
   folios and an unmapped folio (A) to the currently processing io_end.
2. mpage_map_and_submit_extent() maps some bhs in folio A and bails out
   due to insufficient journal credits; it acquires one more refcount
   on the io_end to prevent ext4_put_io_end() from converting the newly
   allocated extents to written status, since we don't submit this
   partial folio now.
3. ext4_io_submit() in ext4_do_writepages() submits the fully mapped
   folios, but the endio process cannot be invoked properly because of
   the above extra refcount, so the writeback state of the folio cannot be
   cleared.
4. Finally, it stops the current handle and waits for the transaction to
   be committed. However, the commit process also waits for those fully
   mapped folios to complete, which can lead to a deadlock.

Therefore, if the io_end contains both fully mapped folios and a partial
folio, we need to split the io_end: the first one contains those mapped
folios, and the second one only contains the partial folio. We only hold
the extra refcount on the second io_end; the first one can be finished
properly, which breaks the deadlock.

However, the solution of submitting the partial folio sounds simpler
to me.

[..]

>>> Then once IO completes
>>> mpage_prepare_extent_to_map() is able to start working on the folio again.
>>> Since we cleared dirty bits in the buffers we should not be repeating the
>>> work we already did...
>>>
>>
>> Hmm, it looks like this solution should work. We should introduce a
>> partial folio version of mpage_submit_folio(), call it and redirty
>> the folio once we need to bail out of the loop since insufficient
>> space or journal credits. But ext4_bio_write_folio() will handle the
>> the logic of fscrypt case, I'm not familiar with fscrypt, so I'm not
>> sure it could handle the partial page properly. I'll give it a try.
> 
> As far as I can tell it should work fine. The logic in
> ext4_bio_write_folio() is already prepared for handling partial folio
> writeouts, redirtying of the page etc. (because it needs to handle writeout
> from transaction commit where we can writeout only parts of folios with
> underlying blocks allocated). We just need to teach mpage_submit_folio() to
> subtract only the written-out number of pages from nr_to_write.
> 

Yes, indeed. I will try this solution.

Thanks,
Yi.




