* [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios
@ 2025-06-11 11:16 Zhang Yi
  2025-06-11 11:16 ` [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() Zhang Yi
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Changes since v1:
 - Make the write-back process support writing out a partial folio if it
   exits the mapping loop prematurely due to insufficient space or
   journal credits; this also fixes the potential stale data and
   inconsistency issues.
 - Fix the same issue regarding the allocation of blocks in
   ext4_write_begin() and ext4_page_mkwrite() when delalloc is not
   enabled.

This series addresses the issue that Jan pointed out regarding large
folio support for ext4 [1]. The problem is that the credits calculation
in ext4_meta_trans_blocks() may be insufficient when allocating blocks
while writing back a sufficiently large and discontinuous folio, since
it does not include the credits for updating the bitmap and group
descriptor blocks. However, simply fixing that calculation may lead to a
significant overestimation on filesystems with a lot of block groups.
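
To put rough numbers on the overestimation (a back-of-the-envelope
estimate, assuming a 1k block size and the current 2MB maximum folio
size): such a folio spans 2048 blocks, so a per-folio reservation that
also counts one bitmap and one group descriptor block per possible
extent would ask for several thousand credits for every handle. The
caps on the number of groups and group descriptor blocks in
ext4_meta_trans_blocks() only bring this down on small filesystems with
few block groups, hence the overestimation on larger ones.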

The solution is to first ensure that the current journal transaction has
enough credits whenever we map an extent while allocating blocks. Then,
if the credits reach the upper limit, exit the current mapping loop,
submit the partially mapped folio and restart a new transaction.
Finally, fix the wrong credits calculation in ext4_meta_trans_blocks().
Please see the following patches for details.

[1] https://lore.kernel.org/linux-ext4/ht54j6bvjmiqt62xmcveqlo7bmrunqs4ji7wikfteftdjijzek@7tz5gpejaoen/

Thanks,
Yi.

Zhang Yi (6):
  ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()
  ext4: fix stale data if it bails out of the extents mapping loop
  ext4: restart handle if credits are insufficient during allocating
    blocks
  ext4: correct the reserved credits for extent conversion
  ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
  ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()

 fs/ext4/ext4_jbd2.h  |   7 +++
 fs/ext4/inode.c      | 118 ++++++++++++++++++++++++++++++++++++-------
 fs/jbd2/journal.c    |   6 +++
 include/linux/jbd2.h |   1 +
 4 files changed, 113 insertions(+), 19 deletions(-)

-- 
2.46.1



* [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()
  2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
@ 2025-06-11 11:16 ` Zhang Yi
  2025-06-19 15:08   ` Jan Kara
  2025-06-20  1:51   ` Baokun Li
  2025-06-11 11:16 ` [PATCH v2 2/6] ext4: fix stale data if it bails out of the extents mapping loop Zhang Yi
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

mpage_folio_done() is a more appropriate place than mpage_submit_folio()
for updating wbc->nr_to_write after we have submitted a fully mapped
folio. This prepares for allowing mpage_submit_folio() to submit a
partially mapped folio that is still under processing.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index be9a4cba35fd..3a086fee7989 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2024,7 +2024,10 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 
 static void mpage_folio_done(struct mpage_da_data *mpd, struct folio *folio)
 {
-	mpd->first_page += folio_nr_pages(folio);
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	mpd->first_page += nr_pages;
+	mpd->wbc->nr_to_write -= nr_pages;
 	folio_unlock(folio);
 }
 
@@ -2055,8 +2058,6 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
 	    !ext4_verity_in_progress(mpd->inode))
 		len = size & (len - 1);
 	err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
-	if (!err)
-		mpd->wbc->nr_to_write -= folio_nr_pages(folio);
 
 	return err;
 }
-- 
2.46.1



* [PATCH v2 2/6] ext4: fix stale data if it bails out of the extents mapping loop
  2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
  2025-06-11 11:16 ` [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() Zhang Yi
@ 2025-06-11 11:16 ` Zhang Yi
  2025-06-19 16:21   ` Jan Kara
  2025-06-11 11:16 ` [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks Zhang Yi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

During the process of writing back folios, if
mpage_map_and_submit_extent() exits the extent mapping loop due to an
ENOSPC or ENOMEM error, it may result in stale data or filesystem
inconsistency in environments where the block size is smaller than the
folio size.

When mapping a discontinuous folio in mpage_map_and_submit_extent(),
some buffers may have already been mapped. If we exit the mapping loop
prematurely, the folio data within the mapped range will not be written
back, and the file's disk size will not be updated. Once the transaction
that includes this range of extents is committed, this can lead to stale
data or filesystem inconsistency.

Fix this by submitting the partially mapped folio that is currently
being processed and updating the disk size to the end of the mapped
range.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3a086fee7989..d0db6e3bf158 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	return 0;
 }
 
+/*
+ * This is used to submit mapped buffers in a single folio that is not fully
+ * mapped for various reasons, such as insufficient space or journal credits.
+ */
+static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
+{
+	struct inode *inode = mpd->inode;
+	struct folio *folio;
+	int ret;
+
+	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
+	if (IS_ERR(folio))
+		return PTR_ERR(folio);
+
+	ret = mpage_submit_folio(mpd, folio);
+	if (ret)
+		goto out;
+	/*
+	 * Update first_page to prevent this folio from being released in
+	 * mpage_release_unused_pages(), it should not equal to the folio
+	 * index.
+	 *
+	 * The first_page will be reset to the aligned folio index when this
+	 * folio is written again in the next round. Additionally, do not
+	 * update wbc->nr_to_write here, as it will be updated once the
+	 * entire folio has finished processing.
+	 */
+	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
+	WARN_ON_ONCE((folio->index == mpd->first_page) ||
+		     !folio_contains(folio, pos >> PAGE_SHIFT));
+out:
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+}
+
 /*
  * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
  *				 mpd->len and submit pages underlying it for IO
@@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			 */
 			if ((err == -ENOMEM) ||
 			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
-				if (progress)
+				/*
+				 * We may have already allocated extents for
+				 * some bhs inside the folio, issue the
+				 * corresponding data to prevent stale data.
+				 */
+				if (progress) {
+					if (mpage_submit_buffers(mpd, disksize))
+						goto invalidate_dirty_pages;
 					goto update_disksize;
+				}
 				return err;
 			}
 			ext4_msg(sb, KERN_CRIT,
@@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			*give_up_on_write = true;
 			return err;
 		}
+		disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
+				inode->i_blkbits;
 		progress = 1;
 		/*
 		 * Update buffer state, submit mapped pages, and get us new
@@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			goto update_disksize;
 	} while (map->m_len);
 
+	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
 update_disksize:
 	/*
 	 * Update on-disk size after IO is submitted.  Races with
 	 * truncate are avoided by checking i_size under i_data_sem.
 	 */
-	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
 	if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
 		int err2;
 		loff_t i_size;
-- 
2.46.1



* [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks
  2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
  2025-06-11 11:16 ` [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() Zhang Yi
  2025-06-11 11:16 ` [PATCH v2 2/6] ext4: fix stale data if it bails out of the extents mapping loop Zhang Yi
@ 2025-06-11 11:16 ` Zhang Yi
  2025-06-19 16:33   ` Jan Kara
  2025-06-11 11:16 ` [PATCH v2 4/6] ext4: correct the reserved credits for extent conversion Zhang Yi
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.
This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
when delalloc is not enabled.
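
To sketch the arithmetic behind the example above (rough numbers, not an
exact credit count): a 2MB folio with 1k blocks covers 2048 blocks, and
in the worst case every block is a separate extent in a different block
group, so each allocation may touch its own bitmap and group descriptor
block plus extent tree index/leaf blocks, i.e. several credits per block
and several thousand credits for one folio. A 1MB journal holds only
1024 1k blocks in total, and jbd2 further limits a single handle to a
fraction (roughly a quarter) of the journal, so such a reservation can
never be satisfied in one handle.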

Fix this by ensuring that there are sufficient journal credits before
allocating an extent in mpage_map_one_extent() and _ext4_get_block(). If
there are not enough credits, return -EAGAIN, exit the current mapping
loop, restart a new handle with a new transaction, and allocate blocks
for this folio again in the next iteration.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d0db6e3bf158..b51de58518b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -877,20 +877,44 @@ static void ext4_update_bh_state(struct buffer_head *bh, unsigned long flags)
 	} while (unlikely(!try_cmpxchg(&bh->b_state, &old_state, new_state)));
 }
 
+/*
+ * Make sure that the current journal transaction has enough credits to map
+ * one extent. Return -EAGAIN if it cannot extend the current running
+ * transaction.
+ */
+static inline int ext4_journal_ensure_extent_credits(handle_t *handle,
+						     struct inode *inode)
+{
+	int needed_credits;
+	int ret;
+
+	needed_credits = ext4_chunk_trans_blocks(inode, 1);
+	ret = __ext4_journal_ensure_credits(handle, needed_credits,
+					    needed_credits, 0);
+	return ret <= 0 ? ret : -EAGAIN;
+}
+
 static int _ext4_get_block(struct inode *inode, sector_t iblock,
 			   struct buffer_head *bh, int flags)
 {
 	struct ext4_map_blocks map;
+	handle_t *handle = ext4_journal_current_handle();
 	int ret = 0;
 
 	if (ext4_has_inline_data(inode))
 		return -ERANGE;
 
+	/* Make sure transaction has enough credits for this extent */
+	if (flags & EXT4_GET_BLOCKS_CREATE) {
+		ret = ext4_journal_ensure_extent_credits(handle, inode);
+		if (ret)
+			return ret;
+	}
+
 	map.m_lblk = iblock;
 	map.m_len = bh->b_size >> inode->i_blkbits;
 
-	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
-			      flags);
+	ret = ext4_map_blocks(handle, inode, &map, flags);
 	if (ret > 0) {
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		ext4_update_bh_state(bh, map.m_flags);
@@ -1374,8 +1398,9 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 				ext4_orphan_del(NULL, inode);
 		}
 
-		if (ret == -ENOSPC &&
-		    ext4_should_retry_alloc(inode->i_sb, &retries))
+		if (ret == -EAGAIN ||
+		    (ret == -ENOSPC &&
+		     ext4_should_retry_alloc(inode->i_sb, &retries)))
 			goto retry_journal;
 		folio_put(folio);
 		return ret;
@@ -2324,6 +2349,11 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	int get_blocks_flags;
 	int err, dioread_nolock;
 
+	/* Make sure transaction has enough credits for this extent */
+	err = ext4_journal_ensure_extent_credits(handle, inode);
+	if (err < 0)
+		return err;
+
 	trace_ext4_da_write_pages_extent(inode, map);
 	/*
 	 * Call ext4_map_blocks() to allocate any delayed allocation blocks, or
@@ -2446,7 +2476,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			 * In the case of ENOSPC, if ext4_count_free_blocks()
 			 * is non-zero, a commit should free up blocks.
 			 */
-			if ((err == -ENOMEM) ||
+			if ((err == -ENOMEM) || (err == -EAGAIN) ||
 			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
 				/*
 				 * We may have already allocated extents for
@@ -2953,6 +2983,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 			ret = 0;
 			continue;
 		}
+		if (ret == -EAGAIN)
+			ret = 0;
 		/* Fatal error - ENOMEM, EIO... */
 		if (ret)
 			break;
@@ -6722,7 +6754,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		}
 	}
 	ext4_journal_stop(handle);
-	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+	if (err == -EAGAIN ||
+	    (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)))
 		goto retry_alloc;
 out_ret:
 	ret = vmf_fs_error(err);
-- 
2.46.1



* [PATCH v2 4/6] ext4: correct the reserved credits for extent conversion
  2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
                   ` (2 preceding siblings ...)
  2025-06-11 11:16 ` [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks Zhang Yi
@ 2025-06-11 11:16 ` Zhang Yi
  2025-06-23  8:12   ` Baokun Li
  2025-06-11 11:16 ` [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page() Zhang Yi
  2025-06-11 11:16 ` [PATCH v2 6/6] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() Zhang Yi
  5 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Currently, we reserve journal credits for converting extents within only
a single page to written state when the I/O operation completes. This is
insufficient when large folios are enabled.

Fix this by reserving credits for converting up to one extent per block
in the largest 2MB folio. This calculation only involves extent index
and leaf blocks, so it does not reserve an excessive number of credits.
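
As a rough worked example (assuming a 1k block size, so the largest 2MB
folio gives bpf = 2048): up to 2048 unwritten extents may need to be
converted when the I/O completes, but the reservation only has to cover
the extent tree index and leaf blocks touched by those conversions plus
one block for the inode, i.e. rsv_blocks = 1 +
ext4_ext_index_trans_blocks(inode, bpf), which stays modest even for the
largest folio.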

Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b51de58518b2..67e37dd546eb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2848,12 +2848,12 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	mpd->journalled_more_data = 0;
 
 	if (ext4_should_dioread_nolock(inode)) {
+		int bpf = ext4_journal_blocks_per_folio(inode);
 		/*
 		 * We may need to convert up to one extent per block in
-		 * the page and we may dirty the inode.
+		 * the folio and we may dirty the inode.
 		 */
-		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
-						PAGE_SIZE >> inode->i_blkbits);
+		rsv_blocks = 1 + ext4_ext_index_trans_blocks(inode, bpf);
 	}
 
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-- 
2.46.1



* [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
  2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
                   ` (3 preceding siblings ...)
  2025-06-11 11:16 ` [PATCH v2 4/6] ext4: correct the reserved credits for extent conversion Zhang Yi
@ 2025-06-11 11:16 ` Zhang Yi
  2025-06-19 16:44   ` Jan Kara
  2025-06-11 11:16 ` [PATCH v2 6/6] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() Zhang Yi
  5 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

This partially reverts commit d6bf294773a4 ("ext4/jbd2: convert
jbd2_journal_blocks_per_page() to support large folio"). Using
jbd2_journal_blocks_per_folio() leads to a significant overestimation of
journal credits. Since we still reserve credits for one page and attempt
to extend and restart handles during large folio writeback, switch these
callers back to ext4_journal_blocks_per_page().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4_jbd2.h  | 7 +++++++
 fs/ext4/inode.c      | 6 +++---
 fs/jbd2/journal.c    | 6 ++++++
 include/linux/jbd2.h | 1 +
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..c0ee756cb34c 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -326,6 +326,13 @@ static inline int ext4_journal_blocks_per_folio(struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_journal_blocks_per_page(struct inode *inode)
+{
+	if (EXT4_JOURNAL(inode) != NULL)
+		return jbd2_journal_blocks_per_page(inode);
+	return 0;
+}
+
 static inline int ext4_journal_force_commit(journal_t *journal)
 {
 	if (journal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 67e37dd546eb..9835145b1b27 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2556,7 +2556,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
  */
 static int ext4_da_writepages_trans_blocks(struct inode *inode)
 {
-	int bpp = ext4_journal_blocks_per_folio(inode);
+	int bpp = ext4_journal_blocks_per_page(inode);
 
 	return ext4_meta_trans_blocks(inode,
 				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
@@ -2634,7 +2634,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 	ext4_lblk_t lblk;
 	struct buffer_head *head;
 	handle_t *handle = NULL;
-	int bpp = ext4_journal_blocks_per_folio(mpd->inode);
+	int bpp = ext4_journal_blocks_per_page(mpd->inode);
 
 	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
@@ -6255,7 +6255,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
  */
 int ext4_writepage_trans_blocks(struct inode *inode)
 {
-	int bpp = ext4_journal_blocks_per_folio(inode);
+	int bpp = ext4_journal_blocks_per_page(inode);
 	int ret;
 
 	ret = ext4_meta_trans_blocks(inode, bpp, bpp);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index d480b94117cd..7fccb425907f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -84,6 +84,7 @@ EXPORT_SYMBOL(jbd2_journal_start_commit);
 EXPORT_SYMBOL(jbd2_journal_force_commit_nested);
 EXPORT_SYMBOL(jbd2_journal_wipe);
 EXPORT_SYMBOL(jbd2_journal_blocks_per_folio);
+EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
 EXPORT_SYMBOL(jbd2_journal_invalidate_folio);
 EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
@@ -2661,6 +2662,11 @@ int jbd2_journal_blocks_per_folio(struct inode *inode)
 		     inode->i_sb->s_blocksize_bits);
 }
 
+int jbd2_journal_blocks_per_page(struct inode *inode)
+{
+	return 1 << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
+}
+
 /*
  * helper functions to deal with 32 or 64bit block numbers.
  */
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 43b9297fe8a7..f35369c104ba 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1724,6 +1724,7 @@ static inline int tid_geq(tid_t x, tid_t y)
 }
 
 extern int jbd2_journal_blocks_per_folio(struct inode *inode);
+extern int jbd2_journal_blocks_per_page(struct inode *inode);
 extern size_t journal_tag_bytes(journal_t *journal);
 
 static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
-- 
2.46.1



* [PATCH v2 6/6] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
  2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
                   ` (4 preceding siblings ...)
  2025-06-11 11:16 ` [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page() Zhang Yi
@ 2025-06-11 11:16 ` Zhang Yi
  2025-06-23  8:12   ` Baokun Li
  5 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-11 11:16 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

The calculation of journal credits in ext4_meta_trans_blocks() should
include pextents, as each extent may be allocated from a different block
group and thus needs to update a different bitmap and group descriptor
block.
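
As a small worked example (hypothetical worst-case numbers): writing
back four discontiguous blocks as four separate physical extents gives
pextents = 4. If each extent is allocated from a different block group,
four bitmap blocks and up to four group descriptor blocks may be
modified in addition to those needed for the index/leaf updates, which
the old "groups = idxblocks" formula did not account for; the new
"groups = idxblocks + pextents" does, while still being subject to the
ngroups cap in the function.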

Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks")
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9835145b1b27..9b6ebf823740 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6218,7 +6218,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
 	int ret;
 
 	/*
-	 * How many index and lead blocks need to touch to map @lblocks
+	 * How many index and leaf blocks need to touch to map @lblocks
 	 * logical blocks to @pextents physical extents?
 	 */
 	idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
@@ -6227,7 +6227,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
 	 * Now let's see how many group bitmaps and group descriptors need
 	 * to account
 	 */
-	groups = idxblocks;
+	groups = idxblocks + pextents;
 	gdpblocks = groups;
 	if (groups > ngroups)
 		groups = ngroups;
-- 
2.46.1



* Re: [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()
  2025-06-11 11:16 ` [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() Zhang Yi
@ 2025-06-19 15:08   ` Jan Kara
  2025-06-20  1:51   ` Baokun Li
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Kara @ 2025-06-19 15:08 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Wed 11-06-25 19:16:20, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> mpage_folio_done() should be a more appropriate place than
> mpage_submit_folio() for updating the wbc->nr_to_write after we have
> submitted a fully mapped folio. Preparing to make mpage_submit_folio()
> allows to submit partially mapped folio that is still under processing.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Indeed. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index be9a4cba35fd..3a086fee7989 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2024,7 +2024,10 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>  
>  static void mpage_folio_done(struct mpage_da_data *mpd, struct folio *folio)
>  {
> -	mpd->first_page += folio_nr_pages(folio);
> +	unsigned long nr_pages = folio_nr_pages(folio);
> +
> +	mpd->first_page += nr_pages;
> +	mpd->wbc->nr_to_write -= nr_pages;
>  	folio_unlock(folio);
>  }
>  
> @@ -2055,8 +2058,6 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
>  	    !ext4_verity_in_progress(mpd->inode))
>  		len = size & (len - 1);
>  	err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
> -	if (!err)
> -		mpd->wbc->nr_to_write -= folio_nr_pages(folio);
>  
>  	return err;
>  }
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/6] ext4: fix stale data if it bails out of the extents mapping loop
  2025-06-11 11:16 ` [PATCH v2 2/6] ext4: fix stale data if it bails out of the extents mapping loop Zhang Yi
@ 2025-06-19 16:21   ` Jan Kara
  2025-06-20  4:57     ` Zhang Yi
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2025-06-19 16:21 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Wed 11-06-25 19:16:21, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> During the process of writing back folios, if
> mpage_map_and_submit_extent() exits the extent mapping loop due to an
> ENOSPC or ENOMEM error, it may result in stale data or filesystem
> inconsistency in environments where the block size is smaller than the
> folio size.
> 
> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
> some buffers may have already be mapped. If we exit the mapping loop
> prematurely, the folio data within the mapped range will not be written
> back, and the file's disk size will not be updated. Once the transaction
> that includes this range of extents is committed, this can lead to stale
> data or filesystem inconsistency.
> 
> Fix this by submitting the current processing partial mapped folio and
> update the disk size to the end of the mapped range.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3a086fee7989..d0db6e3bf158 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  	return 0;
>  }
>  
> +/*
> + * This is used to submit mapped buffers in a single folio that is not fully
> + * mapped for various reasons, such as insufficient space or journal credits.
> + */
> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
> +{
> +	struct inode *inode = mpd->inode;
> +	struct folio *folio;
> +	int ret;
> +
> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
> +	if (IS_ERR(folio))
> +		return PTR_ERR(folio);
> +
> +	ret = mpage_submit_folio(mpd, folio);
> +	if (ret)
> +		goto out;
> +	/*
> +	 * Update first_page to prevent this folio from being released in
> +	 * mpage_release_unused_pages(), it should not equal to the folio
> +	 * index.
> +	 *
> +	 * The first_page will be reset to the aligned folio index when this
> +	 * folio is written again in the next round. Additionally, do not
> +	 * update wbc->nr_to_write here, as it will be updated once the
> +	 * entire folio has finished processing.
> +	 */
> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;

Well, but there can be many folios between mpd->first_page and pos. And
this way you avoid cleaning them up (unlocking them and dropping the elevated
refcount) before we restart the next loop. How is this going to work?

Also I don't see in this patch where mpd->first_page would get set back to
retry writing this folio. What am I missing?

> +	WARN_ON_ONCE((folio->index == mpd->first_page) ||
> +		     !folio_contains(folio, pos >> PAGE_SHIFT));
> +out:
> +	folio_unlock(folio);
> +	folio_put(folio);
> +	return ret;
> +}
> +
>  /*
>   * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
>   *				 mpd->len and submit pages underlying it for IO
> @@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			 */
>  			if ((err == -ENOMEM) ||
>  			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
> -				if (progress)
> +				/*
> +				 * We may have already allocated extents for
> +				 * some bhs inside the folio, issue the
> +				 * corresponding data to prevent stale data.
> +				 */
> +				if (progress) {
> +					if (mpage_submit_buffers(mpd, disksize))
> +						goto invalidate_dirty_pages;
>  					goto update_disksize;
> +				}
>  				return err;
>  			}
>  			ext4_msg(sb, KERN_CRIT,
> @@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			*give_up_on_write = true;
>  			return err;
>  		}
> +		disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
> +				inode->i_blkbits;

I don't think setting disksize like this is correct in case
mpage_map_and_submit_buffers() below fails (when the extent covers many folios
and we don't succeed in writing them all). In that case we may need to keep
disksize somewhere in the middle of the extent.

Overall I don't think we need to modify disksize handling here. It is fine
to leave (part of) the extent dangling beyond disksize until we retry the
writeback in these rare cases.

>  		progress = 1;
>  		/*
>  		 * Update buffer state, submit mapped pages, and get us new
> @@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			goto update_disksize;
>  	} while (map->m_len);
>  
> +	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>  update_disksize:
>  	/*
>  	 * Update on-disk size after IO is submitted.  Races with
>  	 * truncate are avoided by checking i_size under i_data_sem.
>  	 */
> -	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>  	if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>  		int err2;
>  		loff_t i_size;

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks
  2025-06-11 11:16 ` [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks Zhang Yi
@ 2025-06-19 16:33   ` Jan Kara
  2025-06-20  5:00     ` Zhang Yi
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2025-06-19 16:33 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Wed 11-06-25 19:16:22, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> After large folios are supported on ext4, writing back a sufficiently
> large and discontinuous folio may consume a significant number of
> journal credits, placing considerable strain on the journal. For
> example, in a 20GB filesystem with 1K block size and 1MB journal size,
> writing back a 2MB folio could require thousands of credits in the
> worst-case scenario (when each block is discontinuous and distributed
> across different block groups), potentially exceeding the journal size.
> This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
> when delalloc is not enabled.
> 
> Fix this by ensuring that there are sufficient journal credits before
> allocating an extent in mpage_map_one_extent() and _ext4_get_block(). If
> there are not enough credits, return -EAGAIN, exit the current mapping
> loop, restart a new handle and a new transaction, and allocating blocks
> on this folio again in the next iteration.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

...

>  static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  			   struct buffer_head *bh, int flags)
>  {
>  	struct ext4_map_blocks map;
> +	handle_t *handle = ext4_journal_current_handle();
>  	int ret = 0;
>  
>  	if (ext4_has_inline_data(inode))
>  		return -ERANGE;
>  
> +	/* Make sure transaction has enough credits for this extent */
> +	if (flags & EXT4_GET_BLOCKS_CREATE) {
> +		ret = ext4_journal_ensure_extent_credits(handle, inode);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	map.m_lblk = iblock;
>  	map.m_len = bh->b_size >> inode->i_blkbits;
>  
> -	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
> -			      flags);
> +	ret = ext4_map_blocks(handle, inode, &map, flags);

Good spotting with ext4_page_mkwrite() and ext4_write_begin() also needing
this treatment! But rather than hiding the transaction extension in
_ext4_get_block() I'd do this in ext4_block_write_begin() where it is much
more obvious (and also it is much more obvious who needs to be prepared for
handling the EAGAIN error). Otherwise the patch looks good!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
  2025-06-11 11:16 ` [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page() Zhang Yi
@ 2025-06-19 16:44   ` Jan Kara
  2025-06-21  7:46     ` Zhang Yi
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2025-06-19 16:44 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Wed 11-06-25 19:16:24, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> This partially reverts commit d6bf294773a4 ("ext4/jbd2: convert
> jbd2_journal_blocks_per_page() to support large folio"). This
> jbd2_journal_blocks_per_folio() will lead to a significant
> overestimation of journal credits. Since we still reserve credits for
> one page and attempt to extend and restart handles during large folio
> writebacks, so we should convert this helper back to
> ext4_journal_blocks_per_page().
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Here I'm undecided. Does it make any particular sense to reserve credits
for one *page* worth of blocks when pages don't have any particular meaning
in our writeback code anymore? We could reserve credits just for one
physical extent and that should be enough. For blocksize == pagesize (most
common configs) this would actually be equivalent. If blocksize < pagesize,
this could force us to do some more writeback retries and thus incur somewhat
higher writeback CPU overhead, but do we really care for these configs? It
is well possible I've overlooked something and someone will spot a
performance regression in a practical setup with this, in which case we'd
have to come up with something more clever, but I think it's worth it to
start simple and complicate later.
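
To make that concrete, one possible reading of the above (just an
illustrative sketch, untested and not something this series implements)
would be to base the delalloc writeback reservation on a single mapped
extent:

	/*
	 * Sketch only: reserve credits for one physical extent of up to
	 * MAX_WRITEPAGES_EXTENT_LEN blocks and rely on the handle
	 * extension / restart added earlier in this series for anything
	 * beyond that.
	 */
	static int ext4_da_writepages_trans_blocks(struct inode *inode)
	{
		return ext4_meta_trans_blocks(inode, MAX_WRITEPAGES_EXTENT_LEN, 1);
	}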

								Honza

> ---
>  fs/ext4/ext4_jbd2.h  | 7 +++++++
>  fs/ext4/inode.c      | 6 +++---
>  fs/jbd2/journal.c    | 6 ++++++
>  include/linux/jbd2.h | 1 +
>  4 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 63d17c5201b5..c0ee756cb34c 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -326,6 +326,13 @@ static inline int ext4_journal_blocks_per_folio(struct inode *inode)
>  	return 0;
>  }
>  
> +static inline int ext4_journal_blocks_per_page(struct inode *inode)
> +{
> +	if (EXT4_JOURNAL(inode) != NULL)
> +		return jbd2_journal_blocks_per_page(inode);
> +	return 0;
> +}
> +
>  static inline int ext4_journal_force_commit(journal_t *journal)
>  {
>  	if (journal)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 67e37dd546eb..9835145b1b27 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2556,7 +2556,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>   */
>  static int ext4_da_writepages_trans_blocks(struct inode *inode)
>  {
> -	int bpp = ext4_journal_blocks_per_folio(inode);
> +	int bpp = ext4_journal_blocks_per_page(inode);
>  
>  	return ext4_meta_trans_blocks(inode,
>  				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
> @@ -2634,7 +2634,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>  	ext4_lblk_t lblk;
>  	struct buffer_head *head;
>  	handle_t *handle = NULL;
> -	int bpp = ext4_journal_blocks_per_folio(mpd->inode);
> +	int bpp = ext4_journal_blocks_per_page(mpd->inode);
>  
>  	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
>  		tag = PAGECACHE_TAG_TOWRITE;
> @@ -6255,7 +6255,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
>   */
>  int ext4_writepage_trans_blocks(struct inode *inode)
>  {
> -	int bpp = ext4_journal_blocks_per_folio(inode);
> +	int bpp = ext4_journal_blocks_per_page(inode);
>  	int ret;
>  
>  	ret = ext4_meta_trans_blocks(inode, bpp, bpp);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index d480b94117cd..7fccb425907f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -84,6 +84,7 @@ EXPORT_SYMBOL(jbd2_journal_start_commit);
>  EXPORT_SYMBOL(jbd2_journal_force_commit_nested);
>  EXPORT_SYMBOL(jbd2_journal_wipe);
>  EXPORT_SYMBOL(jbd2_journal_blocks_per_folio);
> +EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
>  EXPORT_SYMBOL(jbd2_journal_invalidate_folio);
>  EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
>  EXPORT_SYMBOL(jbd2_journal_force_commit);
> @@ -2661,6 +2662,11 @@ int jbd2_journal_blocks_per_folio(struct inode *inode)
>  		     inode->i_sb->s_blocksize_bits);
>  }
>  
> +int jbd2_journal_blocks_per_page(struct inode *inode)
> +{
> +	return 1 << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
> +}
> +
>  /*
>   * helper functions to deal with 32 or 64bit block numbers.
>   */
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 43b9297fe8a7..f35369c104ba 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1724,6 +1724,7 @@ static inline int tid_geq(tid_t x, tid_t y)
>  }
>  
>  extern int jbd2_journal_blocks_per_folio(struct inode *inode);
> +extern int jbd2_journal_blocks_per_page(struct inode *inode);
>  extern size_t journal_tag_bytes(journal_t *journal);
>  
>  static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()
  2025-06-11 11:16 ` [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() Zhang Yi
  2025-06-19 15:08   ` Jan Kara
@ 2025-06-20  1:51   ` Baokun Li
  1 sibling, 0 replies; 23+ messages in thread
From: Baokun Li @ 2025-06-20  1:51 UTC (permalink / raw)
  To: Zhang Yi, linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yukuai3, yangerkun

On 2025/6/11 19:16, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> mpage_folio_done() should be a more appropriate place than
> mpage_submit_folio() for updating the wbc->nr_to_write after we have
> submitted a fully mapped folio. Preparing to make mpage_submit_folio()
> allows to submit partially mapped folio that is still under processing.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Makes sense. Feel free to add:

Reviewed-by: Baokun Li <libaokun1@huawei.com>
> ---
>   fs/ext4/inode.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index be9a4cba35fd..3a086fee7989 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2024,7 +2024,10 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>   
>   static void mpage_folio_done(struct mpage_da_data *mpd, struct folio *folio)
>   {
> -	mpd->first_page += folio_nr_pages(folio);
> +	unsigned long nr_pages = folio_nr_pages(folio);
> +
> +	mpd->first_page += nr_pages;
> +	mpd->wbc->nr_to_write -= nr_pages;
>   	folio_unlock(folio);
>   }
>   
> @@ -2055,8 +2058,6 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
>   	    !ext4_verity_in_progress(mpd->inode))
>   		len = size & (len - 1);
>   	err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
> -	if (!err)
> -		mpd->wbc->nr_to_write -= folio_nr_pages(folio);
>   
>   	return err;
>   }




* Re: [PATCH v2 2/6] ext4: fix stale data if it bails out of the extents mapping loop
  2025-06-19 16:21   ` Jan Kara
@ 2025-06-20  4:57     ` Zhang Yi
  2025-06-20 15:21       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-20  4:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/20 0:21, Jan Kara wrote:
> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> During the process of writing back folios, if
>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>> inconsistency in environments where the block size is smaller than the
>> folio size.
>>
>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>> some buffers may have already be mapped. If we exit the mapping loop
>> prematurely, the folio data within the mapped range will not be written
>> back, and the file's disk size will not be updated. Once the transaction
>> that includes this range of extents is committed, this can lead to stale
>> data or filesystem inconsistency.
>>
>> Fix this by submitting the current processing partial mapped folio and
>> update the disk size to the end of the mapped range.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3a086fee7989..d0db6e3bf158 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>  	return 0;
>>  }
>>  
>> +/*
>> + * This is used to submit mapped buffers in a single folio that is not fully
>> + * mapped for various reasons, such as insufficient space or journal credits.
>> + */
>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>> +{
>> +	struct inode *inode = mpd->inode;
>> +	struct folio *folio;
>> +	int ret;
>> +
>> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>> +	if (IS_ERR(folio))
>> +		return PTR_ERR(folio);
>> +
>> +	ret = mpage_submit_folio(mpd, folio);
>> +	if (ret)
>> +		goto out;
>> +	/*
>> +	 * Update first_page to prevent this folio from being released in
>> +	 * mpage_release_unused_pages(), it should not equal to the folio
>> +	 * index.
>> +	 *
>> +	 * The first_page will be reset to the aligned folio index when this
>> +	 * folio is written again in the next round. Additionally, do not
>> +	 * update wbc->nr_to_write here, as it will be updated once the
>> +	 * entire folio has finished processing.
>> +	 */
>> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
> 
> Well, but there can be many folios between mpd->first_page and pos. And
> this way you avoid cleaning them up (unlocking them and dropping elevated
> refcount) before we restart next loop. How is this going to work?
> 

Hmm, I don't think there can be many folios between mpd->first_page and
pos. All of the fully mapped folios should be unlocked by
mpage_folio_done(), and there is no elevated refcount since we always call
folio_batch_release() once we finish processing the folios.
mpage_release_unused_pages() is used to clean up unsubmitted folios.

For example, suppose we have a 4kB block size filesystem and
mpage_prepare_extent_to_map() finds 4 order-2 folios that need to be mapped.

       first_page             next_page
       |                      |
      [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
       LLLL  LLLL  LLLL  LLLL

In the first round in mpage_map_and_submit_extent(), we map the
first two folios along with the first two pages of the third folio;
mpage_map_and_submit_buffers() then submits and unlocks the first
two folios, while also updating mpd->first_page to the beginning of the
third folio.

                  first_page  next_page
                  |          |
      [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
       UUUU  UUUU  LLLL  LLLL               W: mapped  U: unlocked

In the second round in mpage_map_and_submit_extent(), we fail to map
the remaining blocks and call mpage_submit_buffers() to submit and unlock
this partially mapped folio. Additionally, we advance mpd->first_page.

                     first_page next_page
                     |        /
      [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
       UUUU  UUUU  UUUU  LLLL               W: mapped  U: unlocked

Then, we break out to ext4_do_writepages(), which calls
mpage_release_unused_pages() to unlock the last folio.

                     first_page next_page
                     |        /
      [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
       UUUU  UUUU  UUUU  UUUU               W: mapped  U: unlocked

In the next round in ext4_do_writepages(), mpage_prepare_extent_to_map()
restarts processing from the third folio and resets mpd->first_page to
the beginning of it.

                   first_page  next_page
                   |          /
      [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
       UUUU  UUUU  LLLL  LLLL               W: mapped  U: unlocked


> Also I don't see in this patch where mpd->first_page would get set back to
> retry writing this folio. What am I missing?
> 

We already have this; please refer to the following section of
mpage_prepare_extent_to_map().

static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
	pgoff_t index = mpd->first_page;
	...
	mpd->map.m_len = 0;
	mpd->next_page = index;
	...
	while (index <= end) {
...
			if (mpd->map.m_len == 0)
				mpd->first_page = folio->index;
...
}

>> +	WARN_ON_ONCE((folio->index == mpd->first_page) ||
>> +		     !folio_contains(folio, pos >> PAGE_SHIFT));
>> +out:
>> +	folio_unlock(folio);
>> +	folio_put(folio);
>> +	return ret;
>> +}
>> +
>>  /*
>>   * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
>>   *				 mpd->len and submit pages underlying it for IO
>> @@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  			 */
>>  			if ((err == -ENOMEM) ||
>>  			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
>> -				if (progress)
>> +				/*
>> +				 * We may have already allocated extents for
>> +				 * some bhs inside the folio, issue the
>> +				 * corresponding data to prevent stale data.
>> +				 */
>> +				if (progress) {
>> +					if (mpage_submit_buffers(mpd, disksize))
>> +						goto invalidate_dirty_pages;
>>  					goto update_disksize;
>> +				}
>>  				return err;
>>  			}
>>  			ext4_msg(sb, KERN_CRIT,
>> @@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  			*give_up_on_write = true;
>>  			return err;
>>  		}
>> +		disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
>> +				inode->i_blkbits;
> 
> I don't think setting disksize like this is correct in case
> mpage_map_and_submit_buffers() below fails (when extent covers many folios
> and we don't succeed in writing them all). In that case we may need to keep
> disksize somewhere in the middle of the extent.
> 
> Overall I don't think we need to modify disksize handling here. It is fine
> to leave (part of) the extent dangling beyond disksize until we retry the
> writeback in these rare cases.

OK, this is indeed a rare case. Let's keep it as it is.

Thanks,
Yi.

> 
>>  		progress = 1;
>>  		/*
>>  		 * Update buffer state, submit mapped pages, and get us new
>> @@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  			goto update_disksize;
>>  	} while (map->m_len);
>>  
>> +	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>>  update_disksize:
>>  	/*
>>  	 * Update on-disk size after IO is submitted.  Races with
>>  	 * truncate are avoided by checking i_size under i_data_sem.
>>  	 */
>> -	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>>  	if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>>  		int err2;
>>  		loff_t i_size;
> 
> 								Honza



* Re: [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks
  2025-06-19 16:33   ` Jan Kara
@ 2025-06-20  5:00     ` Zhang Yi
  2025-06-20 14:18       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-20  5:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/20 0:33, Jan Kara wrote:
> On Wed 11-06-25 19:16:22, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> After large folios are supported on ext4, writing back a sufficiently
>> large and discontinuous folio may consume a significant number of
>> journal credits, placing considerable strain on the journal. For
>> example, in a 20GB filesystem with 1K block size and 1MB journal size,
>> writing back a 2MB folio could require thousands of credits in the
>> worst-case scenario (when each block is discontinuous and distributed
>> across different block groups), potentially exceeding the journal size.
>> This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
>> when delalloc is not enabled.
>>
>> Fix this by ensuring that there are sufficient journal credits before
>> allocating an extent in mpage_map_one_extent() and _ext4_get_block(). If
>> there are not enough credits, return -EAGAIN, exit the current mapping
>> loop, restart a new handle and a new transaction, and allocating blocks
>> on this folio again in the next iteration.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> ...
> 
>>  static int _ext4_get_block(struct inode *inode, sector_t iblock,
>>  			   struct buffer_head *bh, int flags)
>>  {
>>  	struct ext4_map_blocks map;
>> +	handle_t *handle = ext4_journal_current_handle();
>>  	int ret = 0;
>>  
>>  	if (ext4_has_inline_data(inode))
>>  		return -ERANGE;
>>  
>> +	/* Make sure transaction has enough credits for this extent */
>> +	if (flags & EXT4_GET_BLOCKS_CREATE) {
>> +		ret = ext4_journal_ensure_extent_credits(handle, inode);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>>  	map.m_lblk = iblock;
>>  	map.m_len = bh->b_size >> inode->i_blkbits;
>>  
>> -	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
>> -			      flags);
>> +	ret = ext4_map_blocks(handle, inode, &map, flags);
> 
> Good spotting with ext4_page_mkwrite() and ext4_write_begin() also needing
> this treatment! But rather then hiding the transaction extension in
> _ext4_get_block() I'd do this in ext4_block_write_begin() where it is much
> more obvious (and also it is much more obvious who needs to be prepared for
> handling EAGAIN error). Otherwise the patch looks good!
> 

Yes, I completely agree with you. However, unfortunately, doing this in
ext4_block_write_begin() only works for ext4_write_begin().
ext4_page_mkwrite() does not call ext4_block_write_begin() to allocate
blocks; it calls the VFS helper __block_write_begin_int() instead.

vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
{
	...
	if (!ext4_should_journal_data(inode)) {
		err = block_page_mkwrite(vma, vmf, get_block);
	...
}


So...

Thanks,
Yi.



* Re: [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks
  2025-06-20  5:00     ` Zhang Yi
@ 2025-06-20 14:18       ` Jan Kara
  2025-06-21  6:30         ` Zhang Yi
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2025-06-20 14:18 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Fri 20-06-25 13:00:32, Zhang Yi wrote:
> On 2025/6/20 0:33, Jan Kara wrote:
> > On Wed 11-06-25 19:16:22, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> After large folios are supported on ext4, writing back a sufficiently
> >> large and discontinuous folio may consume a significant number of
> >> journal credits, placing considerable strain on the journal. For
> >> example, in a 20GB filesystem with 1K block size and 1MB journal size,
> >> writing back a 2MB folio could require thousands of credits in the
> >> worst-case scenario (when each block is discontinuous and distributed
> >> across different block groups), potentially exceeding the journal size.
> >> This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
> >> when delalloc is not enabled.
> >>
> >> Fix this by ensuring that there are sufficient journal credits before
> >> allocating an extent in mpage_map_one_extent() and _ext4_get_block(). If
> >> there are not enough credits, return -EAGAIN, exit the current mapping
> >> loop, restart a new handle and a new transaction, and allocating blocks
> >> on this folio again in the next iteration.
> >>
> >> Suggested-by: Jan Kara <jack@suse.cz>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > ...
> > 
> >>  static int _ext4_get_block(struct inode *inode, sector_t iblock,
> >>  			   struct buffer_head *bh, int flags)
> >>  {
> >>  	struct ext4_map_blocks map;
> >> +	handle_t *handle = ext4_journal_current_handle();
> >>  	int ret = 0;
> >>  
> >>  	if (ext4_has_inline_data(inode))
> >>  		return -ERANGE;
> >>  
> >> +	/* Make sure transaction has enough credits for this extent */
> >> +	if (flags & EXT4_GET_BLOCKS_CREATE) {
> >> +		ret = ext4_journal_ensure_extent_credits(handle, inode);
> >> +		if (ret)
> >> +			return ret;
> >> +	}
> >> +
> >>  	map.m_lblk = iblock;
> >>  	map.m_len = bh->b_size >> inode->i_blkbits;
> >>  
> >> -	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
> >> -			      flags);
> >> +	ret = ext4_map_blocks(handle, inode, &map, flags);
> > 
> > Good spotting with ext4_page_mkwrite() and ext4_write_begin() also needing
> > this treatment! But rather then hiding the transaction extension in
> > _ext4_get_block() I'd do this in ext4_block_write_begin() where it is much
> > more obvious (and also it is much more obvious who needs to be prepared for
> > handling EAGAIN error). Otherwise the patch looks good!
> > 
> 
> Yes, I completely agree with you. However, unfortunately, do this in
> ext4_block_write_begin() only works for ext4_write_begin().
> ext4_page_mkwrite() does not call ext4_block_write_begin() to allocate
> blocks, it call the vfs helper __block_write_begin_int() instead.
> 
> vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
> {
> 	...
> 	if (!ext4_should_journal_data(inode)) {
> 		err = block_page_mkwrite(vma, vmf, get_block);
> 	...
> }
> 
> 
> So...

Right, I forgot about the nodelalloc case. But since we do most things
by hand for data=journal mode, perhaps we could lift some code from the
data=journal path and reuse it for nodelalloc as well, like:

        folio_lock(folio);
        size = i_size_read(inode);
        /* Page got truncated from under us? */
        if (folio->mapping != mapping || folio_pos(folio) > size) {
                ret = VM_FAULT_NOPAGE;
                goto out_error;
        }

        len = folio_size(folio);
        if (folio_pos(folio) + len > size)
                len = size - folio_pos(folio);
                
        err = ext4_block_write_begin(handle, folio, 0, len,
                                     get_block);
	if (err)
		goto out_error;
	if (!ext4_should_journal_data(inode)) {
		block_commit_write(folio, 0, len);
		folio_mark_dirty(folio);
	} else {
	        if (ext4_journal_folio_buffers(handle, folio, len)) {
	        	ret = VM_FAULT_SIGBUS;
		        goto out_error;
		}
	}
	ext4_journal_stop(handle);
	folio_wait_stable(folio);
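
Plus, the -EAGAIN case from ext4_block_write_begin() (handle could not be
extended for another extent) would need handling right after that call -
roughly like this (completely untested sketch, reusing the retry_alloc
label ext4_page_mkwrite() already has for the ENOSPC retry):

	if (err == -EAGAIN) {
		/*
		 * Out of credits and the handle could not be extended:
		 * unlock, stop the handle and retry the fault with a
		 * fresh transaction.
		 */
		folio_unlock(folio);
		ext4_journal_stop(handle);
		goto retry_alloc;
	}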

We get an additional bonus for not waiting for page writeback with
transaction handle held (which is a potential deadlock vector). What do you
think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents mapping loop
  2025-06-20  4:57     ` Zhang Yi
@ 2025-06-20 15:21       ` Jan Kara
  2025-06-21  4:42         ` Zhang Yi
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2025-06-20 15:21 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Fri 20-06-25 12:57:18, Zhang Yi wrote:
> On 2025/6/20 0:21, Jan Kara wrote:
> > On Wed 11-06-25 19:16:21, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> During the process of writing back folios, if
> >> mpage_map_and_submit_extent() exits the extent mapping loop due to an
> >> ENOSPC or ENOMEM error, it may result in stale data or filesystem
> >> inconsistency in environments where the block size is smaller than the
> >> folio size.
> >>
> >> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
> >> some buffers may have already be mapped. If we exit the mapping loop
> >> prematurely, the folio data within the mapped range will not be written
> >> back, and the file's disk size will not be updated. Once the transaction
> >> that includes this range of extents is committed, this can lead to stale
> >> data or filesystem inconsistency.
> >>
> >> Fix this by submitting the current processing partial mapped folio and
> >> update the disk size to the end of the mapped range.
> >>
> >> Suggested-by: Jan Kara <jack@suse.cz>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 48 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 3a086fee7989..d0db6e3bf158 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
> >>  	return 0;
> >>  }
> >>  
> >> +/*
> >> + * This is used to submit mapped buffers in a single folio that is not fully
> >> + * mapped for various reasons, such as insufficient space or journal credits.
> >> + */
> >> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
> >> +{
> >> +	struct inode *inode = mpd->inode;
> >> +	struct folio *folio;
> >> +	int ret;
> >> +
> >> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
> >> +	if (IS_ERR(folio))
> >> +		return PTR_ERR(folio);
> >> +
> >> +	ret = mpage_submit_folio(mpd, folio);
> >> +	if (ret)
> >> +		goto out;
> >> +	/*
> >> +	 * Update first_page to prevent this folio from being released in
> >> +	 * mpage_release_unused_pages(), it should not equal to the folio
> >> +	 * index.
> >> +	 *
> >> +	 * The first_page will be reset to the aligned folio index when this
> >> +	 * folio is written again in the next round. Additionally, do not
> >> +	 * update wbc->nr_to_write here, as it will be updated once the
> >> +	 * entire folio has finished processing.
> >> +	 */
> >> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
> > 
> > Well, but there can be many folios between mpd->first_page and pos. And
> > this way you avoid cleaning them up (unlocking them and dropping elevated
> > refcount) before we restart next loop. How is this going to work?
> > 
> 
> Hmm, I don't think there can be many folios between mpd->first_page and
> pos. All of the fully mapped folios should be unlocked by
> mpage_folio_done(), and there is no elevated refcount since it always calls
> folio_batch_release() once we finish processing the folios.

Indeed. I forgot that mpage_map_one_extent() will shorten mpd->map->m_len
to the number of currently mapped blocks.

> mpage_release_unused_pages() is used to clean up unsubmitted folios.
> 
> For example, suppose we have a 4kB block size filesystem and we found 4
> order-2 folios need to be mapped in the mpage_prepare_extent_to_map().
> 
>        first_page             next_page
>        |                      |
>       [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
>        LLLL  LLLL  LLLL  LLLL
> 
> In the first round in the mpage_map_and_submit_extent(), we mapped the
> first two folios along with the first two pages of the third folio, the
> mpage_map_and_submit_buffers() should then submit and unlock the first
> two folios, while also updating mpd->first_page to the beginning of the
> third folio.
> 
>                   first_page  next_page
>                   |          |
>       [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
>        UUUU  UUUU  LLLL  LLLL               W: mapped  U: unlocked
> 
> In the second round in the mpage_map_and_submit_extent(), we failed to
> map the blocks and call mpage_submit_buffers() to submit and unlock
> this partially mapped folio. Additionally, we increased mpd->first_page.
> 
>                      first_page next_page
>                      |        /
>       [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
>        UUUU  UUUU  UUUU  LLLL               W: mapped  U: unlocked

Good. But what if we have a filesystem with 1k blocksize and order 0
folios? I mean situation like:

        first_page             next_page
        |                      |
       [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
        L     L     L     L

Now we map first two folios.

                   first_page  next_page
                   |           |
       [MMMM][MMMM][HHHH][HHHH]              H: hole  L: locked
                    L     L

Now mpage_map_one_extent() maps half of the folio and fails to extend the
transaction further:

                   first_page  next_page
                   |           |
       [MMMM][MMMM][MMHH][HHHH]              H: hole  L: locked
                    L     L

and mpage_submit_folio() now shifts mpd->first page like:

                          first_page
                          |    next_page
                          |    |
       [MMMM][MMMM][MMHH][HHHH]              H: hole  L: locked
                    L     L

and it never gets reset back?

I suspect you thought that the failure to extend the transaction in the middle
of an order 0 folio should not happen because you reserve credits for a full
page worth of writeback? But those credits could be exhausted by the time we
get to mapping the third folio because mpage_map_one_extent() only ensures
there are credits for mapping one extent.

And I think reserving credits for just one extent is fine even from the
beginning (as I wrote in my comment to patch 4). We just need to handle
this partial case - which should be possible by just leaving
mpd->first_page untouched and leave unlocking to
mpage_release_unused_pages(). But I can be missing some effects, the
writeback code is really complex...
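
IOW, something like this completely untested variant of your
mpage_submit_buffers() (dropping the 'pos' argument since first_page is not
advanced here, and the folio_put() placement is a guess):

static int mpage_submit_buffers(struct mpage_da_data *mpd)
{
	struct inode *inode = mpd->inode;
	struct folio *folio;
	int ret;

	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
	if (IS_ERR(folio))
		return PTR_ERR(folio);

	ret = mpage_submit_folio(mpd, folio);
	folio_put(folio);
	/*
	 * Deliberately leave mpd->first_page untouched; the folio then stays
	 * in the range walked by mpage_release_unused_pages(), which unlocks
	 * it when we bail out of the mapping loop.
	 */
	return ret;
}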

BTW long-term the code may be easier to follow if we replace
mpd->first_page and mpd->next_page with logical block based or byte based
indexing. Now when we have large order folios, page is not that important
concept for writeback anymore.
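
E.g. something along these lines (field names invented here, just to show
the direction; mpd->start_pos / mpd->next_pos would replace first_page /
next_page):

static void mpage_folio_done(struct mpage_da_data *mpd, struct folio *folio)
{
	/* advance the byte cursor instead of a page index */
	mpd->start_pos = folio_pos(folio) + folio_size(folio);
	mpd->wbc->nr_to_write -= folio_nr_pages(folio);
	folio_unlock(folio);
}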

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents mapping loop
  2025-06-20 15:21       ` Jan Kara
@ 2025-06-21  4:42         ` Zhang Yi
  2025-06-23  7:38           ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-21  4:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/20 23:21, Jan Kara wrote:
> On Fri 20-06-25 12:57:18, Zhang Yi wrote:
>> On 2025/6/20 0:21, Jan Kara wrote:
>>> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> During the process of writing back folios, if
>>>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>>>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>>>> inconsistency in environments where the block size is smaller than the
>>>> folio size.
>>>>
>>>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>>>> some buffers may have already be mapped. If we exit the mapping loop
>>>> prematurely, the folio data within the mapped range will not be written
>>>> back, and the file's disk size will not be updated. Once the transaction
>>>> that includes this range of extents is committed, this can lead to stale
>>>> data or filesystem inconsistency.
>>>>
>>>> Fix this by submitting the current processing partial mapped folio and
>>>> update the disk size to the end of the mapped range.
>>>>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 48 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 3a086fee7989..d0db6e3bf158 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>>>  	return 0;
>>>>  }
>>>>  
>>>> +/*
>>>> + * This is used to submit mapped buffers in a single folio that is not fully
>>>> + * mapped for various reasons, such as insufficient space or journal credits.
>>>> + */
>>>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>>>> +{
>>>> +	struct inode *inode = mpd->inode;
>>>> +	struct folio *folio;
>>>> +	int ret;
>>>> +
>>>> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>>>> +	if (IS_ERR(folio))
>>>> +		return PTR_ERR(folio);
>>>> +
>>>> +	ret = mpage_submit_folio(mpd, folio);
>>>> +	if (ret)
>>>> +		goto out;
>>>> +	/*
>>>> +	 * Update first_page to prevent this folio from being released in
>>>> +	 * mpage_release_unused_pages(), it should not equal to the folio
>>>> +	 * index.
>>>> +	 *
>>>> +	 * The first_page will be reset to the aligned folio index when this
>>>> +	 * folio is written again in the next round. Additionally, do not
>>>> +	 * update wbc->nr_to_write here, as it will be updated once the
>>>> +	 * entire folio has finished processing.
>>>> +	 */
>>>> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
>>>
>>> Well, but there can be many folios between mpd->first_page and pos. And
>>> this way you avoid cleaning them up (unlocking them and dropping elevated
>>> refcount) before we restart next loop. How is this going to work?
>>>
>>
>> Hmm, I don't think there can be many folios between mpd->first_page and
>> pos. All of the fully mapped folios should be unlocked by
>> mpage_folio_done(), and there is no elevated refcount since it always calls
>> folio_batch_release() once we finish processing the folios.
> 
> Indeed. I forgot that mpage_map_one_extent() will shorten mpd->map->m_len
> to the number of currently mapped blocks.
> 
>> mpage_release_unused_pages() is used to clean up unsubmitted folios.
>>
>> For example, suppose we have a 4kB block size filesystem and we found 4
>> order-2 folios need to be mapped in the mpage_prepare_extent_to_map().
>>
>>        first_page             next_page
>>        |                      |
>>       [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
>>        LLLL  LLLL  LLLL  LLLL
>>
>> In the first round in the mpage_map_and_submit_extent(), we mapped the
>> first two folios along with the first two pages of the third folio, the
>> mpage_map_and_submit_buffers() should then submit and unlock the first
>> two folios, while also updating mpd->first_page to the beginning of the
>> third folio.
>>
>>                   first_page  next_page
>>                   |          |
>>       [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
>>        UUUU  UUUU  LLLL  LLLL               W: mapped  U: unlocked
>>
>> In the second round in the mpage_map_and_submit_extent(), we failed to
>> map the blocks and call mpage_submit_buffers() to submit and unlock
>> this partially mapped folio. Additionally, we increased mpd->first_page.
>>
>>                      first_page next_page
>>                      |        /
>>       [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
>>        UUUU  UUUU  UUUU  LLLL               W: mapped  U: unlocked
> 
> Good. But what if we have a filesystem with 1k blocksize and order 0
> folios? I mean situation like:
> 
>         first_page             next_page
>         |                      |
>        [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
>         L     L     L     L
> 
> Now we map first two folios.
> 
>                    first_page  next_page
>                    |           |
>        [MMMM][MMMM][HHHH][HHHH]              H: hole  L: locked
>                     L     L
> 
> Now mpage_map_one_extent() maps half of the folio and fails to extend the
> transaction further:
> 
>                    first_page  next_page
>                    |           |
>        [MMMM][MMMM][MMHH][HHHH]              H: hole  L: locked
>                     L     L
> 
> and mpage_submit_folio() now shifts mpd->first page like:
> 
>                           first_page
>                           |    next_page
>                           |    |
>        [MMMM][MMMM][MMHH][HHHH]              H: hole  L: locked
>                     L     L
> 
> and it never gets reset back?
> 
> I suspect you thought that the failure to extend the transaction in the middle
> of an order 0 folio should not happen because you reserve credits for a full
> page worth of writeback? But those credits could be exhausted by the time we
> get to mapping the third folio because mpage_map_one_extent() only ensures
> there are credits for mapping one extent.

Ooops, you are right. Sorry, it was my mistake.

> 
> And I think reserving credits for just one extent is fine even from the
> beginning (as I wrote in my comment to patch 4). We just need to handle
> this partial case 

Yeah.

> which should be possible by just leaving
> mpd->first_page untouched and leave unlocking to
> mpage_release_unused_pages(). 

I was going to use this solution, but it breaks the semantics of
mpage_release_unused_pages() and triggers the
BUG_ON(folio_test_writeback(folio)) in that function. I don't want to drop
this BUG_ON since I think it's somewhat useful.

> But I can be missing some effects, the writeback code is really complex...

Indeed, I was confused by this code for a long time. Thank you a lot for
patiently correcting my mistakes in my patch.

> BTW long-term the code may be easier to follow if we replace
> mpd->first_page and mpd->next_page with logical block based or byte based
> indexing. Now when we have large order folios, page is not that important
> concept for writeback anymore.
> 

I suppose we should do this conversion now.

Thanks,
Yi.







^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks
  2025-06-20 14:18       ` Jan Kara
@ 2025-06-21  6:30         ` Zhang Yi
  0 siblings, 0 replies; 23+ messages in thread
From: Zhang Yi @ 2025-06-21  6:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/20 22:18, Jan Kara wrote:
> On Fri 20-06-25 13:00:32, Zhang Yi wrote:
>> On 2025/6/20 0:33, Jan Kara wrote:
>>> On Wed 11-06-25 19:16:22, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> After large folios are supported on ext4, writing back a sufficiently
>>>> large and discontinuous folio may consume a significant number of
>>>> journal credits, placing considerable strain on the journal. For
>>>> example, in a 20GB filesystem with 1K block size and 1MB journal size,
>>>> writing back a 2MB folio could require thousands of credits in the
>>>> worst-case scenario (when each block is discontinuous and distributed
>>>> across different block groups), potentially exceeding the journal size.
>>>> This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
>>>> when delalloc is not enabled.
>>>>
>>>> Fix this by ensuring that there are sufficient journal credits before
>>>> allocating an extent in mpage_map_one_extent() and _ext4_get_block(). If
>>>> there are not enough credits, return -EAGAIN, exit the current mapping
>>>> loop, restart a new handle and a new transaction, and allocating blocks
>>>> on this folio again in the next iteration.
>>>>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> ...
>>>
>>>>  static int _ext4_get_block(struct inode *inode, sector_t iblock,
>>>>  			   struct buffer_head *bh, int flags)
>>>>  {
>>>>  	struct ext4_map_blocks map;
>>>> +	handle_t *handle = ext4_journal_current_handle();
>>>>  	int ret = 0;
>>>>  
>>>>  	if (ext4_has_inline_data(inode))
>>>>  		return -ERANGE;
>>>>  
>>>> +	/* Make sure transaction has enough credits for this extent */
>>>> +	if (flags & EXT4_GET_BLOCKS_CREATE) {
>>>> +		ret = ext4_journal_ensure_extent_credits(handle, inode);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	map.m_lblk = iblock;
>>>>  	map.m_len = bh->b_size >> inode->i_blkbits;
>>>>  
>>>> -	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
>>>> -			      flags);
>>>> +	ret = ext4_map_blocks(handle, inode, &map, flags);
>>>
>>> Good spotting with ext4_page_mkwrite() and ext4_write_begin() also needing
>>> this treatment! But rather than hiding the transaction extension in
>>> _ext4_get_block() I'd do this in ext4_block_write_begin() where it is much
>>> more obvious (and also it is much more obvious who needs to be prepared for
>>> handling EAGAIN error). Otherwise the patch looks good!
>>>
>>
>> Yes, I completely agree with you. However, unfortunately, doing this in
>> ext4_block_write_begin() only works for ext4_write_begin().
>> ext4_page_mkwrite() does not call ext4_block_write_begin() to allocate
>> blocks, it calls the VFS helper __block_write_begin_int() instead.
>>
>> vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>> {
>> 	...
>> 	if (!ext4_should_journal_data(inode)) {
>> 		err = block_page_mkwrite(vma, vmf, get_block);
>> 	...
>> }
>>
>>
>> So...
> 
> Right, I forgot about the nodelalloc case. But since we do most things by
> hand for data=journal mode, perhaps we could lift some code from
> data=journal mode and reuse it for nodelalloc as well, like:
> 
>         folio_lock(folio);
>         size = i_size_read(inode);
>         /* Page got truncated from under us? */
>         if (folio->mapping != mapping || folio_pos(folio) > size) {
>                 ret = VM_FAULT_NOPAGE;
>                 goto out_error;
>         }
> 
>         len = folio_size(folio);
>         if (folio_pos(folio) + len > size)
>                 len = size - folio_pos(folio);
>                 
>         err = ext4_block_write_begin(handle, folio, 0, len,
>                                      get_block);
> 	if (err)
> 		goto out_error;
> 	if (!ext4_should_journal_data(inode)) {
> 		block_commit_write(folio, 0, len);
> 		folio_mark_dirty(folio);
> 	} else {
> 	        if (ext4_journal_folio_buffers(handle, folio, len)) {
> 	        	ret = VM_FAULT_SIGBUS;
> 		        goto out_error;
> 		}
> 	}
> 	ext4_journal_stop(handle);
> 	folio_wait_stable(folio);
> 
> We get an additional bonus for not waiting for page writeback with
> transaction handle held (which is a potential deadlock vector). What do you
> think?
> 

Yeah, this solution looks nice to me, it should work! Thank you for
the suggestion.

Best regards,
Yi.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
  2025-06-19 16:44   ` Jan Kara
@ 2025-06-21  7:46     ` Zhang Yi
  2025-06-23  7:26       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Zhang Yi @ 2025-06-21  7:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On 2025/6/20 0:44, Jan Kara wrote:
> On Wed 11-06-25 19:16:24, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> This partially reverts commit d6bf294773a4 ("ext4/jbd2: convert
>> jbd2_journal_blocks_per_page() to support large folio"). This
>> jbd2_journal_blocks_per_folio() will lead to a significant
>> overestimation of journal credits. Since we still reserve credits for
>> one page and attempt to extend and restart handles during large folio
>> writebacks, so we should convert this helper back to
>> ext4_journal_blocks_per_page().
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Here I'm not decided. Does it make any particular sense to reserve credits
> for one *page* worth of blocks when pages don't have any particular meaning
> in our writeback code anymore? We could reserve credits just for one
> physical extent and that should be enough.

Indeed, reserving credits for a single page is no longer suitable in the
current folio-based context. It does seem more appropriate to allocate
credits for a single extent.

> For blocksize == pagesize (most
> common configs) this would be actually equivalent. If blocksize < pagesize,
> this could force us to do some more writeback retries and thus get somewhat
> higher writeback CPU overhead but do we really care for these configs?  It
> is well possible I've overlooked something and someone will spot a
> performance regression in practical setup with this in which case we'd have
> to come up with something more clever but I think it's worth it to start
> simple and complicate later.

This can indeed be a problem if the file system is already fragmented
enough. However, thanks to the credits extension logic in
__ext4_journal_ensure_credits(), I suppose that on most file system images,
it will not trigger excessive retry operations. Besides, although there
might be some lock cost in jbd2_journal_extend(), I suppose it won't be a
big deal.
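
(For reference, the per-extent credit check added in patch 3 boils down to
roughly the following; this is a simplified sketch from my side, the real
helper may differ in details such as the exact credit estimate:)

static inline int ext4_journal_ensure_extent_credits(handle_t *handle,
						     struct inode *inode)
{
	int credits, ret;

	/* may be called without a running handle */
	if (!handle)
		return 0;

	/* credits to map one extent: index/leaf + bitmap + gdp blocks */
	credits = ext4_chunk_trans_blocks(inode, 1);
	ret = __ext4_journal_ensure_credits(handle, credits, credits, 0);
	/* the handle could not be extended -> ask the caller to restart */
	return ret <= 0 ? ret : -EAGAIN;
}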

Perhaps we could reserve more credits through a complex formula at the
outset, which would lower the cost of expanding the credits. But I don't
think this will help much in reducing the number of retries; it may only
be helpful in extreme cases (the running transaction starts to commit, so we
cannot extend it).
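
Concretely, reserving for a single extent could then be as simple as
something like this (rough sketch on top of the existing helpers; not what
v2 does yet):

static int ext4_da_writepages_trans_blocks(struct inode *inode)
{
	/*
	 * Credits for mapping a single extent covering at most
	 * MAX_WRITEPAGES_EXTENT_LEN blocks.
	 */
	return ext4_chunk_trans_blocks(inode, MAX_WRITEPAGES_EXTENT_LEN);
}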

So, I think we can implement it by reserving for an extent for the time
being. Do you agree?

Thanks,
Yi.

> 
> 
>> ---
>>  fs/ext4/ext4_jbd2.h  | 7 +++++++
>>  fs/ext4/inode.c      | 6 +++---
>>  fs/jbd2/journal.c    | 6 ++++++
>>  include/linux/jbd2.h | 1 +
>>  4 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> index 63d17c5201b5..c0ee756cb34c 100644
>> --- a/fs/ext4/ext4_jbd2.h
>> +++ b/fs/ext4/ext4_jbd2.h
>> @@ -326,6 +326,13 @@ static inline int ext4_journal_blocks_per_folio(struct inode *inode)
>>  	return 0;
>>  }
>>  
>> +static inline int ext4_journal_blocks_per_page(struct inode *inode)
>> +{
>> +	if (EXT4_JOURNAL(inode) != NULL)
>> +		return jbd2_journal_blocks_per_page(inode);
>> +	return 0;
>> +}
>> +
>>  static inline int ext4_journal_force_commit(journal_t *journal)
>>  {
>>  	if (journal)
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 67e37dd546eb..9835145b1b27 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2556,7 +2556,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>   */
>>  static int ext4_da_writepages_trans_blocks(struct inode *inode)
>>  {
>> -	int bpp = ext4_journal_blocks_per_folio(inode);
>> +	int bpp = ext4_journal_blocks_per_page(inode);
>>  
>>  	return ext4_meta_trans_blocks(inode,
>>  				MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
>> @@ -2634,7 +2634,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>>  	ext4_lblk_t lblk;
>>  	struct buffer_head *head;
>>  	handle_t *handle = NULL;
>> -	int bpp = ext4_journal_blocks_per_folio(mpd->inode);
>> +	int bpp = ext4_journal_blocks_per_page(mpd->inode);
>>  
>>  	if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
>>  		tag = PAGECACHE_TAG_TOWRITE;
>> @@ -6255,7 +6255,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
>>   */
>>  int ext4_writepage_trans_blocks(struct inode *inode)
>>  {
>> -	int bpp = ext4_journal_blocks_per_folio(inode);
>> +	int bpp = ext4_journal_blocks_per_page(inode);
>>  	int ret;
>>  
>>  	ret = ext4_meta_trans_blocks(inode, bpp, bpp);
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index d480b94117cd..7fccb425907f 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -84,6 +84,7 @@ EXPORT_SYMBOL(jbd2_journal_start_commit);
>>  EXPORT_SYMBOL(jbd2_journal_force_commit_nested);
>>  EXPORT_SYMBOL(jbd2_journal_wipe);
>>  EXPORT_SYMBOL(jbd2_journal_blocks_per_folio);
>> +EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
>>  EXPORT_SYMBOL(jbd2_journal_invalidate_folio);
>>  EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
>>  EXPORT_SYMBOL(jbd2_journal_force_commit);
>> @@ -2661,6 +2662,11 @@ int jbd2_journal_blocks_per_folio(struct inode *inode)
>>  		     inode->i_sb->s_blocksize_bits);
>>  }
>>  
>> +int jbd2_journal_blocks_per_page(struct inode *inode)
>> +{
>> +	return 1 << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
>> +}
>> +
>>  /*
>>   * helper functions to deal with 32 or 64bit block numbers.
>>   */
>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>> index 43b9297fe8a7..f35369c104ba 100644
>> --- a/include/linux/jbd2.h
>> +++ b/include/linux/jbd2.h
>> @@ -1724,6 +1724,7 @@ static inline int tid_geq(tid_t x, tid_t y)
>>  }
>>  
>>  extern int jbd2_journal_blocks_per_folio(struct inode *inode);
>> +extern int jbd2_journal_blocks_per_page(struct inode *inode);
>>  extern size_t journal_tag_bytes(journal_t *journal);
>>  
>>  static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page()
  2025-06-21  7:46     ` Zhang Yi
@ 2025-06-23  7:26       ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2025-06-23  7:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Sat 21-06-25 15:46:40, Zhang Yi wrote:
> On 2025/6/20 0:44, Jan Kara wrote:
> > On Wed 11-06-25 19:16:24, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> This partially reverts commit d6bf294773a4 ("ext4/jbd2: convert
> >> jbd2_journal_blocks_per_page() to support large folio"). This
> >> jbd2_journal_blocks_per_folio() will lead to a significant
> >> overestimation of journal credits. Since we still reserve credits for
> >> one page and attempt to extend and restart handles during large folio
> >> writebacks, so we should convert this helper back to
> >> ext4_journal_blocks_per_page().
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Here I'm not decided. Does it make any particular sense to reserve credits
> > for one *page* worth of blocks when pages don't have any particular meaning
> > in our writeback code anymore? We could reserve credits just for one
> > physical extent and that should be enough.
> 
> Indeed, reserving credits for a single page is no longer suitable in the
> current folio-based context. It does seem more appropriate to allocate
> credits for a single extent.
> 
> > For blocksize == pagesize (most
> > common configs) this would be actually equivalent. If blocksize < pagesize,
> > this could force us to do some more writeback retries and thus get somewhat
> > higher writeback CPU overhead but do we really care for these configs?  It
> > is well possible I've overlooked something and someone will spot a
> > performance regression in practical setup with this in which case we'd have
> > to come up with something more clever but I think it's worth it to start
> > simple and complicate later.
> 
> This can indeed be a problem if the file system is already fragmented
> enough. However, thanks to the credits extension logic in
> __ext4_journal_ensure_credits(), I suppose that on most file system images,
> it will not trigger excessive retry operations. Besides, although there
> might be some lock cost in jbd2_journal_extend(), I suppose it won't be a
> big deal.
> 
> Perhaps we could reserve more credits through a complex formula at the
> outset, which would lower the cost of expanding the credits. But I don't
> think this will help much in reducing the number of retries; it may only
> be helpful in extreme cases (the running transaction starts to commit, so we
> cannot extend it).
> 
> So, I think we can implement it by reserving for an extent for the time
> being. Do you agree?

Yes, I agree. After all this is easy to change if we find some real issues
with it.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents mapping loop
  2025-06-21  4:42         ` Zhang Yi
@ 2025-06-23  7:38           ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2025-06-23  7:38 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, ojaswin, yi.zhang, libaokun1, yukuai3, yangerkun

On Sat 21-06-25 12:42:11, Zhang Yi wrote:
> On 2025/6/20 23:21, Jan Kara wrote:
> > I suspect you thought that the failure to extend the transaction in the middle
> > of an order 0 folio should not happen because you reserve credits for a full
> > page worth of writeback? But those credits could be exhausted by the time we
> > get to mapping the third folio because mpage_map_one_extent() only ensures
> > there are credits for mapping one extent.
> 
> Ooops, you are right. Sorry, it was my mistake.
> 
> > 
> > And I think reserving credits for just one extent is fine even from the
> > beginning (as I wrote in my comment to patch 4). We just need to handle
> > this partial case 
> 
> Yeah.
> 
> > which should be possible by just leaving
> > mpd->first_page untouched and leave unlocking to
> > mpage_release_unused_pages(). 
> 
> I was going to use this solution, but it breaks the semantics of
> mpage_release_unused_pages() and triggers the
> BUG_ON(folio_test_writeback(folio)) in that function. I don't want to drop
> this BUG_ON since I think it's somewhat useful.

Oh, right. Well, we could modify the BUG_ON to:

	/*
	 * The first_page folio can be under writeback if we need to restart
	 * the transaction to map more of it.
	 */
	BUG_ON((invalidate || folio->index > mpd->first_page) && folio_test_writeback(folio));

> > But I can be missing some effects, the writeback code is really complex...
> 
> Indeed, I was confused by this code for a long time. Thank you a lot for
> patiently correcting my mistakes in my patch.

Thank you for taking time to properly fix these issues :)

> > BTW long-term the code may be easier to follow if we replace
> > mpd->first_page and mpd->next_page with logical block based or byte based
> > indexing. Now when we have large order folios, page is not that important
> > concept for writeback anymore.
> 
> I suppose we should do this conversion now.

Yes or this... I guess it would be more obvious what's going on this way.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/6] ext4: correct the reserved credits for extent conversion
  2025-06-11 11:16 ` [PATCH v2 4/6] ext4: correct the reserved credits for extent conversion Zhang Yi
@ 2025-06-23  8:12   ` Baokun Li
  0 siblings, 0 replies; 23+ messages in thread
From: Baokun Li @ 2025-06-23  8:12 UTC (permalink / raw)
  To: Zhang Yi, linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yukuai3, yangerkun, Baokun Li

On 2025/6/11 19:16, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Now, we reserve journal credits for converting extents in only one page
> to written state when the I/O operation is complete. This is
> insufficient when large folio is enabled.
>
> Fix this by reserving credits for converting up to one extent per block in
> the largest 2MB folio, this calculation should only involve extents index
> and leaf blocks, so it should not estimate too many credits.
>
> Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks good to me. Feel free to add:

Reviewed-by: Baokun Li <libaokun1@huawei.com>

> ---
>   fs/ext4/inode.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b51de58518b2..67e37dd546eb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2848,12 +2848,12 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>   	mpd->journalled_more_data = 0;
>   
>   	if (ext4_should_dioread_nolock(inode)) {
> +		int bpf = ext4_journal_blocks_per_folio(inode);
>   		/*
>   		 * We may need to convert up to one extent per block in
> -		 * the page and we may dirty the inode.
> +		 * the folio and we may dirty the inode.
>   		 */
> -		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
> -						PAGE_SIZE >> inode->i_blkbits);
> +		rsv_blocks = 1 + ext4_ext_index_trans_blocks(inode, bpf);
>   	}
>   
>   	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 6/6] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
  2025-06-11 11:16 ` [PATCH v2 6/6] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() Zhang Yi
@ 2025-06-23  8:12   ` Baokun Li
  0 siblings, 0 replies; 23+ messages in thread
From: Baokun Li @ 2025-06-23  8:12 UTC (permalink / raw)
  To: Zhang Yi, linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
	yi.zhang, yukuai3, yangerkun, Baokun Li

On 2025/6/11 19:16, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The calculation of journal credits in ext4_meta_trans_blocks() should
> include pextents, as each extent separately may be allocated from a
> different group and thus need to update different bitmap and group
> descriptor block.
>
> Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks")
> Reported-by: Jan Kara <jack@suse.cz>
> Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks good to me. Feel free to add:

Reviewed-by: Baokun Li <libaokun1@huawei.com>
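
Just to spell out the numbers behind this, using the example from the
patch 3 description (rough worst-case estimate only): with a 1K block size
a block group spans 8192 blocks (8MB), so a 20GB filesystem has roughly
2560 groups. A fully discontiguous 2MB folio is 2048 blocks, so up to 2048
extents may touch up to 2048 different bitmap blocks plus the corresponding
group descriptor blocks, while the old formula sized this part from
idxblocks alone, which is tiny in comparison.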

> ---
>   fs/ext4/inode.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9835145b1b27..9b6ebf823740 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6218,7 +6218,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
>   	int ret;
>   
>   	/*
> -	 * How many index and lead blocks need to touch to map @lblocks
> +	 * How many index and leaf blocks need to touch to map @lblocks
>   	 * logical blocks to @pextents physical extents?
>   	 */
>   	idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
> @@ -6227,7 +6227,7 @@ int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
>   	 * Now let's see how many group bitmaps and group descriptors need
>   	 * to account
>   	 */
> -	groups = idxblocks;
> +	groups = idxblocks + pextents;
>   	gdpblocks = groups;
>   	if (groups > ngroups)
>   		groups = ngroups;



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-06-23 11:46 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-11 11:16 [PATCH v2 0/6] ext4: fix insufficient credits when writing back large folios Zhang Yi
2025-06-11 11:16 ` [PATCH v2 1/6] ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() Zhang Yi
2025-06-19 15:08   ` Jan Kara
2025-06-20  1:51   ` Baokun Li
2025-06-11 11:16 ` [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents mapping loop Zhang Yi
2025-06-19 16:21   ` Jan Kara
2025-06-20  4:57     ` Zhang Yi
2025-06-20 15:21       ` Jan Kara
2025-06-21  4:42         ` Zhang Yi
2025-06-23  7:38           ` Jan Kara
2025-06-11 11:16 ` [PATCH v2 3/6] ext4: restart handle if credits are insufficient during allocating blocks Zhang Yi
2025-06-19 16:33   ` Jan Kara
2025-06-20  5:00     ` Zhang Yi
2025-06-20 14:18       ` Jan Kara
2025-06-21  6:30         ` Zhang Yi
2025-06-11 11:16 ` [PATCH v2 4/6] ext4: correct the reserved credits for extent conversion Zhang Yi
2025-06-23  8:12   ` Baokun Li
2025-06-11 11:16 ` [PATCH v2 5/6] ext4/jbd2: reintroduce jbd2_journal_blocks_per_page() Zhang Yi
2025-06-19 16:44   ` Jan Kara
2025-06-21  7:46     ` Zhang Yi
2025-06-23  7:26       ` Jan Kara
2025-06-11 11:16 ` [PATCH v2 6/6] ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() Zhang Yi
2025-06-23  8:12   ` Baokun Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).