* [PATCH v3 00/12] ext4: optimize online defragment
@ 2025-10-10 10:33 Zhang Yi
2025-10-10 10:33 ` [PATCH v3 01/12] ext4: correct the checking of quota files before moving extents Zhang Yi
` (11 more replies)
0 siblings, 12 replies; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Changes since v2:
- Rebase patches onto 6.18-5472d60c129f.
- Patch 02, add a TODO comment: we should optimize the incrementing of
the extent sequence counter in ext4_es_insert_extent() in the future,
as Jan suggested.
- Patch 09, add a WARN_ON_ONCE if ext4_swap_extents() returns
successfully but the swapped length is shorter than required. Also,
copy data if some extents have already been swapped, to prevent data
loss. Finally, fix the comment as Jan suggested.
- Patch 10, fix the incrementing of moved_len in ext4_move_extents(),
as Jan pointed out.
- Patch 11, fix potential overflow issues in the left shift, as Jan
pointed out.
- Add Reviewed-by tags from Jan to patches 01-08 and 11-12.
Changes since v1:
- Fix the syzbot issues reported in v1 by adjusting the order of
parameter checks in mext_check_validity() in patches 07 and 08.
v2: https://lore.kernel.org/linux-ext4/20250925092610.1936929-1-yi.zhang@huaweicloud.com/
v1: https://lore.kernel.org/linux-ext4/20250923012724.2378858-1-yi.zhang@huaweicloud.com/
Original Description:
Currently, online defragmentation of ext4 is primarily implemented
through the move extent operation in the kernel. This extent-moving
process operates at PAGE_SIZE granularity, iteratively performing
extent swapping and data movement, which is quite inefficient.
Especially now that ext4 supports large folios, iterating at PAGE_SIZE
granularity is no longer practical and fails to leverage the advantages
of large folios. Additionally, the current implementation is tightly
coupled with buffer_head, so it cannot be supported once the buffered
I/O path is converted to the iomap infrastructure.
This patch set (based on 6.17-rc7) optimizes the extent-moving process,
deprecates the old move_extent_per_page() interface, and introduces a
new mext_move_extent() interface. The new interface iterates over and
copies data based on the extents of the original file instead of
PAGE_SIZE units, and supports large folios. The data processing logic
in each iteration remains largely consistent with previous versions,
with no additional optimizations or changes.
Additionally, the primary objective of this patch set is to prepare for
converting the buffered I/O path for regular files to the iomap
infrastructure. These patches decouple buffer_head from the main
extent-moving process, restricting its use to the helpers
mext_folio_mkwrite() and mext_folio_mkuptodate(), which handle marking
folios in the swapped page cache up to date and dirty. The overall
coding style of the extent-moving process aligns with the iomap
infrastructure, laying the foundation for supporting online
defragmentation once the iomap infrastructure is adopted.
Patch overview:
Patch 1: Fix a minor issue related to validity checking.
Patch 2-4: Introduce a sequence counter for the mapping extent status
tree; this also prepares for the iomap infrastructure.
Patch 5-7: Refactor the mext_check_arguments() helper function and the
validity checking to improve code readability.
Patch 8-12: Drop move_extent_per_page() and switch to using the new
mext_move_extent(). Additionally, add support for large
folios.
With this patch set, the efficiency of online defragmentation for the
ext4 file system can also be improved under general circumstances. Below
is a set of typical test results obtained using the fio e4defrag
ioengine on an environment with an Intel Xeon Gold 6240 CPU, 400G of
memory, and an NVMe SSD.
[defrag]
directory=/mnt
filesize=400G
buffered=1
fadvise_hint=0
ioengine=e4defrag
bs=4k # 4k,32k,128k
donorname=test.def
filename=test
inplace=0
rw=write
overwrite=0 # 0 for unwritten extent and 1 for written extent
numjobs=1
iodepth=1
runtime=30s
[w/o] # without this patch set
U 4k: IOPS=225k, BW=877MiB/s # U: unwritten extent-moving
U 32k: IOPS=33.2k, BW=1037MiB/s
U 128k: IOPS=8510, BW=1064MiB/s
M 4k: IOPS=19.8k, BW=77.2MiB/s # M: written extent-moving
M 32k: IOPS=2502, BW=78.2MiB/s
M 128k: IOPS=635, BW=79.5MiB/s
[w] # with this patch set
U 4k: IOPS=246k, BW=963MiB/s
U 32k: IOPS=209k, BW=6529MiB/s
U 128k: IOPS=146k, BW=17.8GiB/s
M 4k: IOPS=19.5k, BW=76.2MiB/s
M 32k: IOPS=4091, BW=128MiB/s
M 128k: IOPS=2814, BW=352MiB/s
Best Regards,
Yi.
Zhang Yi (12):
ext4: correct the checking of quota files before moving extents
ext4: introduce seq counter for the extent status entry
ext4: make ext4_es_lookup_extent() pass out the extent seq counter
ext4: pass out extent seq counter when mapping blocks
ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()
ext4: add mext_check_validity() to do basic check
ext4: refactor mext_check_arguments()
ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()
ext4: introduce mext_move_extent()
ext4: switch to using the new extent movement method
ext4: add large folios support for moving extents
ext4: add two trace points for moving extents
fs/ext4/ext4.h | 3 +
fs/ext4/extents.c | 2 +-
fs/ext4/extents_status.c | 31 +-
fs/ext4/extents_status.h | 2 +-
fs/ext4/inode.c | 28 +-
fs/ext4/ioctl.c | 10 -
fs/ext4/move_extent.c | 780 +++++++++++++++++-------------------
fs/ext4/super.c | 1 +
include/trace/events/ext4.h | 97 ++++-
9 files changed, 499 insertions(+), 455 deletions(-)
--
2.46.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3 01/12] ext4: correct the checking of quota files before moving extents
From: Zhang Yi <yi.zhang@huawei.com>
The move extent operation should return -EOPNOTSUPP if either of the
inodes is a quota inode, rather than requiring both to be quota inodes.
Fixes: 02749a4c2082 ("ext4: add ext4_is_quota_file()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/move_extent.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 4b091c21908f..0f4b7c89edd3 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -485,7 +485,7 @@ mext_check_arguments(struct inode *orig_inode,
return -ETXTBSY;
}
- if (ext4_is_quota_file(orig_inode) && ext4_is_quota_file(donor_inode)) {
+ if (ext4_is_quota_file(orig_inode) || ext4_is_quota_file(donor_inode)) {
ext4_debug("ext4 move extent: The argument files should not be quota files [ino:orig %lu, donor %lu]\n",
orig_inode->i_ino, donor_inode->i_ino);
return -EOPNOTSUPP;
--
2.46.1
* [PATCH v3 02/12] ext4: introduce seq counter for the extent status entry
From: Zhang Yi <yi.zhang@huawei.com>
In iomap_write_iter(), the iomap buffered write framework does not hold
any locks between querying the inode's extent mapping info and
performing page cache writes. As a result, the extent mapping can be
changed by concurrent I/O in flight. Similarly, in
iomap_writepage_map(), the writeback process faces the same problem:
concurrent changes can invalidate the extent mapping before the I/O is
submitted.
Therefore, both of these processes must recheck the mapping info after
acquiring the folio lock. To address this, similar to XFS, we propose
introducing an extent sequence number to serve as a validity cookie for
the extent. After commit 24b7a2331fcd ("ext4: clairfy the rules for
modifying extents"), we can ensure that extent information is always
processed through the extent status tree, and the extent status tree is
always up to date under i_rwsem, the invalidate_lock, or the folio
lock, so it is safe to introduce this sequence number. The sequence
number is increased whenever the extent status tree changes, preparing
for the buffered write iomap conversion.
Besides, this mechanism is also applicable to the extent-moving case.
In move_extent_per_page(), we also need to reacquire data_sem and check
the mapping info again under the folio lock.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/extents_status.c | 25 +++++++++++++++++++++----
fs/ext4/super.c | 1 +
include/trace/events/ext4.h | 23 +++++++++++++++--------
4 files changed, 39 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 57087da6c7be..eff97b3a1093 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1138,6 +1138,8 @@ struct ext4_inode_info {
ext4_lblk_t i_es_shrink_lblk; /* Offset where we start searching for
extents to shrink. Protected by
i_es_lock */
+ u64 i_es_seq; /* Change counter for extents.
+ Protected by i_es_lock */
/* ialloc */
ext4_group_t i_last_alloc_group;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 31dc0496f8d0..c3daa57ecd35 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -235,6 +235,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
return es->es_lblk + es->es_len - 1;
}
+static inline void ext4_es_inc_seq(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ WRITE_ONCE(ei->i_es_seq, ei->i_es_seq + 1);
+}
+
/*
* search through the tree for an delayed extent with a given offset. If
* it can't be found, try to find next extent.
@@ -906,7 +913,6 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, pblk, status);
- trace_ext4_es_insert_extent(inode, &newes);
ext4_es_insert_extent_check(inode, &newes);
@@ -955,6 +961,11 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
}
pending = err3;
}
+ /*
+ * TODO: For cache on-disk extents, there is no need to increment
+ * the sequence counter, this requires future optimization.
+ */
+ ext4_es_inc_seq(inode);
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
/*
@@ -981,6 +992,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
if (err1 || err2 || err3 < 0)
goto retry;
+ trace_ext4_es_insert_extent(inode, &newes);
ext4_es_print_tree(inode);
return;
}
@@ -1550,7 +1562,6 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;
- trace_ext4_es_remove_extent(inode, lblk, len);
es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
lblk, len, inode->i_ino);
@@ -1570,16 +1581,21 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
*/
write_lock(&EXT4_I(inode)->i_es_lock);
err = __es_remove_extent(inode, lblk, end, &reserved, es);
+ if (err)
+ goto error;
/* Free preallocated extent if it didn't get used. */
if (es) {
if (!es->es_len)
__es_free_extent(es);
es = NULL;
}
+ ext4_es_inc_seq(inode);
+error:
write_unlock(&EXT4_I(inode)->i_es_lock);
if (err)
goto retry;
+ trace_ext4_es_remove_extent(inode, lblk, len);
ext4_es_print_tree(inode);
ext4_da_release_space(inode, reserved);
}
@@ -2140,8 +2156,6 @@ void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, ~0, EXTENT_STATUS_DELAYED);
- trace_ext4_es_insert_delayed_extent(inode, &newes, lclu_allocated,
- end_allocated);
ext4_es_insert_extent_check(inode, &newes);
@@ -2196,11 +2210,14 @@ void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
pr2 = NULL;
}
}
+ ext4_es_inc_seq(inode);
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
if (err1 || err2 || err3 < 0)
goto retry;
+ trace_ext4_es_insert_delayed_extent(inode, &newes, lclu_allocated,
+ end_allocated);
ext4_es_print_tree(inode);
ext4_print_pending_tree(inode);
return;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 33e7c08c9529..760c9d7588be 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1406,6 +1406,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_es_all_nr = 0;
ei->i_es_shk_nr = 0;
ei->i_es_shrink_lblk = 0;
+ ei->i_es_seq = 0;
ei->i_reserved_data_blocks = 0;
spin_lock_init(&(ei->i_block_reservation_lock));
ext4_init_pending_tree(&ei->i_pending_tree);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index a374e7ea7e57..6a0754d38acf 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2210,7 +2210,8 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
__field( ext4_lblk_t, lblk )
__field( ext4_lblk_t, len )
__field( ext4_fsblk_t, pblk )
- __field( char, status )
+ __field( char, status )
+ __field( u64, seq )
),
TP_fast_assign(
@@ -2220,13 +2221,15 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
__entry->len = es->es_len;
__entry->pblk = ext4_es_show_pblock(es);
__entry->status = ext4_es_status(es);
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),
- TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s seq %llu",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->lblk, __entry->len,
- __entry->pblk, show_extent_status(__entry->status))
+ __entry->pblk, show_extent_status(__entry->status),
+ __entry->seq)
);
DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
@@ -2251,6 +2254,7 @@ TRACE_EVENT(ext4_es_remove_extent,
__field( ino_t, ino )
__field( loff_t, lblk )
__field( loff_t, len )
+ __field( u64, seq )
),
TP_fast_assign(
@@ -2258,12 +2262,13 @@ TRACE_EVENT(ext4_es_remove_extent,
__entry->ino = inode->i_ino;
__entry->lblk = lblk;
__entry->len = len;
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),
- TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ TP_printk("dev %d,%d ino %lu es [%lld/%lld) seq %llu",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len)
+ __entry->lblk, __entry->len, __entry->seq)
);
TRACE_EVENT(ext4_es_find_extent_range_enter,
@@ -2523,6 +2528,7 @@ TRACE_EVENT(ext4_es_insert_delayed_extent,
__field( char, status )
__field( bool, lclu_allocated )
__field( bool, end_allocated )
+ __field( u64, seq )
),
TP_fast_assign(
@@ -2534,15 +2540,16 @@ TRACE_EVENT(ext4_es_insert_delayed_extent,
__entry->status = ext4_es_status(es);
__entry->lclu_allocated = lclu_allocated;
__entry->end_allocated = end_allocated;
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),
- TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s "
- "allocated %d %d",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s allocated %d %d seq %llu",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->lblk, __entry->len,
__entry->pblk, show_extent_status(__entry->status),
- __entry->lclu_allocated, __entry->end_allocated)
+ __entry->lclu_allocated, __entry->end_allocated,
+ __entry->seq)
);
/* fsmap traces */
--
2.46.1
* [PATCH v3 03/12] ext4: make ext4_es_lookup_extent() pass out the extent seq counter
From: Zhang Yi <yi.zhang@huawei.com>
When querying extents in the extent status tree, we should hold
data_sem if we want to obtain the sequence number as a valid cookie at
the same time. However, ext4_map_blocks() currently calls
ext4_es_lookup_extent() without holding data_sem. Therefore, we should
acquire i_es_lock instead, which also ensures that the sequence cookie
and the extent remain consistent. Consequently, make
ext4_es_lookup_extent() pass out the sequence number when necessary.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/extents.c | 2 +-
fs/ext4/extents_status.c | 6 ++++--
fs/ext4/extents_status.h | 2 +-
fs/ext4/inode.c | 8 ++++----
4 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ca5499e9412b..c7d219e6c6d8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2213,7 +2213,7 @@ static int ext4_fill_es_cache_info(struct inode *inode,
while (block <= end) {
next = 0;
flags = 0;
- if (!ext4_es_lookup_extent(inode, block, &next, &es))
+ if (!ext4_es_lookup_extent(inode, block, &next, &es, NULL))
break;
if (ext4_es_is_unwritten(&es))
flags |= FIEMAP_EXTENT_UNWRITTEN;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index c3daa57ecd35..e04fbf10fe4f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1039,8 +1039,8 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
* Return: 1 on found, 0 on not
*/
int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t *next_lblk,
- struct extent_status *es)
+ ext4_lblk_t *next_lblk, struct extent_status *es,
+ u64 *pseq)
{
struct ext4_es_tree *tree;
struct ext4_es_stats *stats;
@@ -1099,6 +1099,8 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
} else
*next_lblk = 0;
}
+ if (pseq)
+ *pseq = EXT4_I(inode)->i_es_seq;
} else {
percpu_counter_inc(&stats->es_stats_cache_misses);
}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 8f9c008d11e8..f3396cf32b44 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -148,7 +148,7 @@ extern void ext4_es_find_extent_range(struct inode *inode,
struct extent_status *es);
extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t *next_lblk,
- struct extent_status *es);
+ struct extent_status *es, u64 *pseq);
extern bool ext4_es_scan_range(struct inode *inode,
int (*matching_fn)(struct extent_status *es),
ext4_lblk_t lblk, ext4_lblk_t end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f9e4ac87211e..10792772b450 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -649,7 +649,7 @@ static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
* extent status tree.
*/
if (flags & EXT4_GET_BLOCKS_PRE_IO &&
- ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+ ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
if (ext4_es_is_written(&es))
return retval;
}
@@ -723,7 +723,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
ext4_check_map_extents_env(inode);
/* Lookup extent status tree firstly */
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
map->m_pblk = ext4_es_pblock(&es) +
map->m_lblk - es.es_lblk;
@@ -1908,7 +1908,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
ext4_check_map_extents_env(inode);
/* Lookup extent status tree firstly */
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
map->m_len = min_t(unsigned int, map->m_len,
es.es_len - (map->m_lblk - es.es_lblk));
@@ -1961,7 +1961,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
* is held in write mode, before inserting a new da entry in
* the extent status tree.
*/
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
map->m_len = min_t(unsigned int, map->m_len,
es.es_len - (map->m_lblk - es.es_lblk));
--
2.46.1
* [PATCH v3 04/12] ext4: pass out extent seq counter when mapping blocks
From: Zhang Yi <yi.zhang@huawei.com>
When creating or querying block mappings using the ext4_map_blocks()
and ext4_map_{query|create}_blocks() helpers, also pass out the extent
sequence number of the block mapping info through the ext4_map_blocks
structure. This sequence number can later serve as a validity cookie
within the iomap infrastructure and the move extent procedure.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 24 ++++++++++++++++--------
2 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index eff97b3a1093..9f127aedbaee 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -260,6 +260,7 @@ struct ext4_map_blocks {
ext4_lblk_t m_lblk;
unsigned int m_len;
unsigned int m_flags;
+ u64 m_seq;
};
/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10792772b450..ad8deae1c7c3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -550,10 +550,13 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
retval = ext4_ext_map_blocks(handle, inode, map, flags);
else
retval = ext4_ind_map_blocks(handle, inode, map, flags);
-
- if (retval <= 0)
+ if (retval < 0)
return retval;
+ /* A hole? */
+ if (retval == 0)
+ goto out;
+
if (unlikely(retval != map->m_len)) {
ext4_warning(inode->i_sb,
"ES len assertion failed for inode "
@@ -573,11 +576,13 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status, false);
- return retval;
+ } else {
+ retval = ext4_map_query_blocks_next_in_leaf(handle, inode, map,
+ orig_mlen);
}
-
- return ext4_map_query_blocks_next_in_leaf(handle, inode, map,
- orig_mlen);
+out:
+ map->m_seq = READ_ONCE(EXT4_I(inode)->i_es_seq);
+ return retval;
}
static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
@@ -649,7 +654,7 @@ static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
* extent status tree.
*/
if (flags & EXT4_GET_BLOCKS_PRE_IO &&
- ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
+ ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
if (ext4_es_is_written(&es))
return retval;
}
@@ -658,6 +663,7 @@ static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
ext4_es_insert_extent(inode, map->m_lblk, map->m_len, map->m_pblk,
status, flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE);
+ map->m_seq = READ_ONCE(EXT4_I(inode)->i_es_seq);
return retval;
}
@@ -723,7 +729,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
ext4_check_map_extents_env(inode);
/* Lookup extent status tree firstly */
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
map->m_pblk = ext4_es_pblock(&es) +
map->m_lblk - es.es_lblk;
@@ -1979,6 +1985,8 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
map->m_flags |= EXT4_MAP_DELAYED;
retval = ext4_insert_delayed_blocks(inode, map->m_lblk, map->m_len);
+ if (!retval)
+ map->m_seq = READ_ONCE(EXT4_I(inode)->i_es_seq);
up_write(&EXT4_I(inode)->i_data_sem);
return retval;
--
2.46.1
* [PATCH v3 05/12] ext4: use EXT4_B_TO_LBLK() in mext_check_arguments()
From: Zhang Yi <yi.zhang@huawei.com>
Switch to using EXT4_B_TO_LBLK() to calculate the EOF position of the
origin and donor inodes, instead of open-coded calculations.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/move_extent.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 0f4b7c89edd3..6175906c7119 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -461,12 +461,6 @@ mext_check_arguments(struct inode *orig_inode,
__u64 donor_start, __u64 *len)
{
__u64 orig_eof, donor_eof;
- unsigned int blkbits = orig_inode->i_blkbits;
- unsigned int blocksize = 1 << blkbits;
-
- orig_eof = (i_size_read(orig_inode) + blocksize - 1) >> blkbits;
- donor_eof = (i_size_read(donor_inode) + blocksize - 1) >> blkbits;
-
if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
ext4_debug("ext4 move extent: suid or sgid is set"
@@ -526,6 +520,9 @@ mext_check_arguments(struct inode *orig_inode,
orig_inode->i_ino, donor_inode->i_ino);
return -EINVAL;
}
+
+ orig_eof = EXT4_B_TO_LBLK(orig_inode, i_size_read(orig_inode));
+ donor_eof = EXT4_B_TO_LBLK(donor_inode, i_size_read(donor_inode));
if (orig_eof <= orig_start)
*len = 0;
else if (orig_eof < orig_start + *len - 1)
--
2.46.1
* [PATCH v3 06/12] ext4: add mext_check_validity() to do basic check
From: Zhang Yi <yi.zhang@huawei.com>
Currently, the basic validation checks during the move extent operation
are scattered across __ext4_ioctl() and ext4_move_extents(), which makes
the code somewhat disorganized. Introduce a new helper,
mext_check_validity(), to handle these checks. This change involves only
code relocation without any logical modifications.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/ioctl.c | 10 -----
fs/ext4/move_extent.c | 102 +++++++++++++++++++++++++++---------------
2 files changed, 65 insertions(+), 47 deletions(-)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a93a7baae990..366a9b615bf0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1641,16 +1641,6 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (!(fd_file(donor)->f_mode & FMODE_WRITE))
return -EBADF;
- if (ext4_has_feature_bigalloc(sb)) {
- ext4_msg(sb, KERN_ERR,
- "Online defrag not supported with bigalloc");
- return -EOPNOTSUPP;
- } else if (IS_DAX(inode)) {
- ext4_msg(sb, KERN_ERR,
- "Online defrag not supported with DAX");
- return -EOPNOTSUPP;
- }
-
err = mnt_want_write_file(filp);
if (err)
return err;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 6175906c7119..cdd175d5c1f3 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -442,6 +442,68 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
goto unlock_folios;
}
+/*
+ * Check the validity of the basic filesystem environment and the
+ * inodes' support status.
+ */
+static int mext_check_validity(struct inode *orig_inode,
+ struct inode *donor_inode)
+{
+ struct super_block *sb = orig_inode->i_sb;
+
+ /* origin and donor should be different inodes */
+ if (orig_inode == donor_inode) {
+ ext4_debug("ext4 move extent: The argument files should not be same inode [ino:orig %lu, donor %lu]\n",
+ orig_inode->i_ino, donor_inode->i_ino);
+ return -EINVAL;
+ }
+
+ /* origin and donor should belong to the same filesystem */
+ if (orig_inode->i_sb != donor_inode->i_sb) {
+ ext4_debug("ext4 move extent: The argument files should be in same FS [ino:orig %lu, donor %lu]\n",
+ orig_inode->i_ino, donor_inode->i_ino);
+ return -EINVAL;
+ }
+
+ /* Regular file check */
+ if (!S_ISREG(orig_inode->i_mode) || !S_ISREG(donor_inode->i_mode)) {
+ ext4_debug("ext4 move extent: The argument files should be regular file [ino:orig %lu, donor %lu]\n",
+ orig_inode->i_ino, donor_inode->i_ino);
+ return -EINVAL;
+ }
+
+ if (ext4_has_feature_bigalloc(sb)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
+ if (IS_DAX(orig_inode)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported with DAX");
+ return -EOPNOTSUPP;
+ }
+
+ /*
+ * TODO: it's not obvious how to swap blocks for inodes with full
+ * journaling enabled.
+ */
+ if (ext4_should_journal_data(orig_inode) ||
+ ext4_should_journal_data(donor_inode)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported with data journaling");
+ return -EOPNOTSUPP;
+ }
+
+ if (IS_ENCRYPTED(orig_inode) || IS_ENCRYPTED(donor_inode)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported for encrypted files");
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
/**
* mext_check_arguments - Check whether move extent can be done
*
@@ -567,43 +629,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
ext4_lblk_t d_start = donor_blk;
int ret;
- if (orig_inode->i_sb != donor_inode->i_sb) {
- ext4_debug("ext4 move extent: The argument files "
- "should be in same FS [ino:orig %lu, donor %lu]\n",
- orig_inode->i_ino, donor_inode->i_ino);
- return -EINVAL;
- }
-
- /* orig and donor should be different inodes */
- if (orig_inode == donor_inode) {
- ext4_debug("ext4 move extent: The argument files should not "
- "be same inode [ino:orig %lu, donor %lu]\n",
- orig_inode->i_ino, donor_inode->i_ino);
- return -EINVAL;
- }
-
- /* Regular file check */
- if (!S_ISREG(orig_inode->i_mode) || !S_ISREG(donor_inode->i_mode)) {
- ext4_debug("ext4 move extent: The argument files should be "
- "regular file [ino:orig %lu, donor %lu]\n",
- orig_inode->i_ino, donor_inode->i_ino);
- return -EINVAL;
- }
-
- /* TODO: it's not obvious how to swap blocks for inodes with full
- journaling enabled */
- if (ext4_should_journal_data(orig_inode) ||
- ext4_should_journal_data(donor_inode)) {
- ext4_msg(orig_inode->i_sb, KERN_ERR,
- "Online defrag not supported with data journaling");
- return -EOPNOTSUPP;
- }
-
- if (IS_ENCRYPTED(orig_inode) || IS_ENCRYPTED(donor_inode)) {
- ext4_msg(orig_inode->i_sb, KERN_ERR,
- "Online defrag not supported for encrypted files");
- return -EOPNOTSUPP;
- }
+ ret = mext_check_validity(orig_inode, donor_inode);
+ if (ret)
+ return ret;
/* Protect orig and donor inodes against a truncate */
lock_two_nondirectories(orig_inode, donor_inode);
--
2.46.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 07/12] ext4: refactor mext_check_arguments()
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
` (5 preceding siblings ...)
2025-10-10 10:33 ` [PATCH v3 06/12] ext4: add mext_check_validity() to do basic check Zhang Yi
@ 2025-10-10 10:33 ` Zhang Yi
2025-10-10 10:33 ` [PATCH v3 08/12] ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate() Zhang Yi
` (4 subsequent siblings)
11 siblings, 0 replies; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When moving extents, mext_check_validity() performs some basic file
system and file checks. However, some essential checks that must be
performed after acquiring the i_rwsem are still scattered in
mext_check_arguments(). Move those checks into mext_check_validity() and
make it execute entirely under the i_rwsem so that the checks are clearer.
Furthermore, rename mext_check_arguments() to mext_check_adjust_range(),
as it only performs checks and length adjustments on the move extent
range. Finally, also change the print message for the non-extent file
check to be consistent with the other unsupported-feature checks.
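The resulting split — basic environment and support checks first, range checks later under the data semaphore — can be sketched with a small userspace model. This is illustrative only: the `struct file_model` fields are hypothetical stand-ins for the inode state the patch actually tests (S_ISREG, the extents flag, IS_SWAPFILE, ext4_is_quota_file), but the check order and error codes mirror the diff.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flags standing in for the inode state tested in the patch. */
struct file_model {
	int is_regular;
	int is_swapfile;
	int is_quota;
	int has_extents;
};

/*
 * Model of mext_check_validity(): basic per-file checks, in the order
 * and with the error codes visible in the diff.
 */
int check_validity(const struct file_model *orig, const struct file_model *donor)
{
	if (!orig->is_regular || !donor->is_regular)
		return -EINVAL;		/* both must be regular files */
	if (!orig->has_extents || !donor->has_extents)
		return -EOPNOTSUPP;	/* only extent-based files supported */
	if (orig->is_swapfile || donor->is_swapfile)
		return -ETXTBSY;	/* swap files cannot be moved */
	if (orig->is_quota || donor->is_quota)
		return -EOPNOTSUPP;	/* quota files cannot be moved */
	return 0;
}
```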
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/move_extent.c | 97 +++++++++++++++++++------------------------
1 file changed, 43 insertions(+), 54 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index cdd175d5c1f3..0191a3c746db 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -501,60 +501,36 @@ static int mext_check_validity(struct inode *orig_inode,
return -EOPNOTSUPP;
}
- return 0;
-}
-
-/**
- * mext_check_arguments - Check whether move extent can be done
- *
- * @orig_inode: original inode
- * @donor_inode: donor inode
- * @orig_start: logical start offset in block for orig
- * @donor_start: logical start offset in block for donor
- * @len: the number of blocks to be moved
- *
- * Check the arguments of ext4_move_extents() whether the files can be
- * exchanged with each other.
- * Return 0 on success, or a negative error value on failure.
- */
-static int
-mext_check_arguments(struct inode *orig_inode,
- struct inode *donor_inode, __u64 orig_start,
- __u64 donor_start, __u64 *len)
-{
- __u64 orig_eof, donor_eof;
+ /* Ext4 move extent supports only extent based file */
+ if (!(ext4_test_inode_flag(orig_inode, EXT4_INODE_EXTENTS)) ||
+ !(ext4_test_inode_flag(donor_inode, EXT4_INODE_EXTENTS))) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported for non-extent files");
+ return -EOPNOTSUPP;
+ }
if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
- ext4_debug("ext4 move extent: suid or sgid is set"
- " to donor file [ino:orig %lu, donor %lu]\n",
+ ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %lu, donor %lu]\n",
orig_inode->i_ino, donor_inode->i_ino);
return -EINVAL;
}
- if (IS_IMMUTABLE(donor_inode) || IS_APPEND(donor_inode))
+ if (IS_IMMUTABLE(donor_inode) || IS_APPEND(donor_inode)) {
+ ext4_debug("ext4 move extent: donor should not be immutable or append file [ino:orig %lu, donor %lu]\n",
+ orig_inode->i_ino, donor_inode->i_ino);
return -EPERM;
+ }
/* Ext4 move extent does not support swap files */
if (IS_SWAPFILE(orig_inode) || IS_SWAPFILE(donor_inode)) {
ext4_debug("ext4 move extent: The argument files should not be swap files [ino:orig %lu, donor %lu]\n",
- orig_inode->i_ino, donor_inode->i_ino);
+ orig_inode->i_ino, donor_inode->i_ino);
return -ETXTBSY;
}
if (ext4_is_quota_file(orig_inode) || ext4_is_quota_file(donor_inode)) {
ext4_debug("ext4 move extent: The argument files should not be quota files [ino:orig %lu, donor %lu]\n",
- orig_inode->i_ino, donor_inode->i_ino);
- return -EOPNOTSUPP;
- }
-
- /* Ext4 move extent supports only extent based file */
- if (!(ext4_test_inode_flag(orig_inode, EXT4_INODE_EXTENTS))) {
- ext4_debug("ext4 move extent: orig file is not extents "
- "based file [ino:orig %lu]\n", orig_inode->i_ino);
- return -EOPNOTSUPP;
- } else if (!(ext4_test_inode_flag(donor_inode, EXT4_INODE_EXTENTS))) {
- ext4_debug("ext4 move extent: donor file is not extents "
- "based file [ino:donor %lu]\n", donor_inode->i_ino);
+ orig_inode->i_ino, donor_inode->i_ino);
return -EOPNOTSUPP;
}
@@ -563,12 +539,25 @@ mext_check_arguments(struct inode *orig_inode,
return -EINVAL;
}
+ return 0;
+}
+
+/*
+ * Check the moving range of ext4_move_extents() whether the files can be
+ * exchanged with each other, and adjust the length to fit within the file
+ * size. Return 0 on success, or a negative error value on failure.
+ */
+static int mext_check_adjust_range(struct inode *orig_inode,
+ struct inode *donor_inode, __u64 orig_start,
+ __u64 donor_start, __u64 *len)
+{
+ __u64 orig_eof, donor_eof;
+
/* Start offset should be same */
if ((orig_start & ~(PAGE_MASK >> orig_inode->i_blkbits)) !=
(donor_start & ~(PAGE_MASK >> orig_inode->i_blkbits))) {
- ext4_debug("ext4 move extent: orig and donor's start "
- "offsets are not aligned [ino:orig %lu, donor %lu]\n",
- orig_inode->i_ino, donor_inode->i_ino);
+ ext4_debug("ext4 move extent: orig and donor's start offsets are not aligned [ino:orig %lu, donor %lu]\n",
+ orig_inode->i_ino, donor_inode->i_ino);
return -EINVAL;
}
@@ -577,9 +566,9 @@ mext_check_arguments(struct inode *orig_inode,
(*len > EXT_MAX_BLOCKS) ||
(donor_start + *len >= EXT_MAX_BLOCKS) ||
(orig_start + *len >= EXT_MAX_BLOCKS)) {
- ext4_debug("ext4 move extent: Can't handle over [%u] blocks "
- "[ino:orig %lu, donor %lu]\n", EXT_MAX_BLOCKS,
- orig_inode->i_ino, donor_inode->i_ino);
+ ext4_debug("ext4 move extent: Can't handle over [%u] blocks [ino:orig %lu, donor %lu]\n",
+ EXT_MAX_BLOCKS,
+ orig_inode->i_ino, donor_inode->i_ino);
return -EINVAL;
}
@@ -594,9 +583,8 @@ mext_check_arguments(struct inode *orig_inode,
else if (donor_eof < donor_start + *len - 1)
*len = donor_eof - donor_start;
if (!*len) {
- ext4_debug("ext4 move extent: len should not be 0 "
- "[ino:orig %lu, donor %lu]\n", orig_inode->i_ino,
- donor_inode->i_ino);
+ ext4_debug("ext4 move extent: len should not be 0 [ino:orig %lu, donor %lu]\n",
+ orig_inode->i_ino, donor_inode->i_ino);
return -EINVAL;
}
@@ -629,22 +617,22 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
ext4_lblk_t d_start = donor_blk;
int ret;
- ret = mext_check_validity(orig_inode, donor_inode);
- if (ret)
- return ret;
-
/* Protect orig and donor inodes against a truncate */
lock_two_nondirectories(orig_inode, donor_inode);
+ ret = mext_check_validity(orig_inode, donor_inode);
+ if (ret)
+ goto unlock;
+
/* Wait for all existing dio workers */
inode_dio_wait(orig_inode);
inode_dio_wait(donor_inode);
/* Protect extent tree against block allocations via delalloc */
ext4_double_down_write_data_sem(orig_inode, donor_inode);
- /* Check the filesystem environment whether move_extent can be done */
- ret = mext_check_arguments(orig_inode, donor_inode, orig_blk,
- donor_blk, &len);
+ /* Check and adjust the specified move_extent range. */
+ ret = mext_check_adjust_range(orig_inode, donor_inode, orig_blk,
+ donor_blk, &len);
if (ret)
goto out;
o_end = o_start + len;
@@ -725,6 +713,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
ext4_free_ext_path(path);
ext4_double_up_write_data_sem(orig_inode, donor_inode);
+unlock:
unlock_two_nondirectories(orig_inode, donor_inode);
return ret;
--
2.46.1
* [PATCH v3 08/12] ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
` (6 preceding siblings ...)
2025-10-10 10:33 ` [PATCH v3 07/12] ext4: refactor mext_check_arguments() Zhang Yi
@ 2025-10-10 10:33 ` Zhang Yi
2025-10-10 10:33 ` [PATCH v3 09/12] ext4: introduce mext_move_extent() Zhang Yi
` (3 subsequent siblings)
11 siblings, 0 replies; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
mext_page_mkuptodate() no longer works on a single page, so rename it to
mext_folio_mkuptodate().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/move_extent.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 0191a3c746db..2df6072b26c0 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -165,7 +165,7 @@ mext_folio_double_lock(struct inode *inode1, struct inode *inode2,
}
/* Force folio buffers uptodate w/o dropping folio's lock */
-static int mext_page_mkuptodate(struct folio *folio, size_t from, size_t to)
+static int mext_folio_mkuptodate(struct folio *folio, size_t from, size_t to)
{
struct inode *inode = folio->mapping->host;
sector_t block;
@@ -358,7 +358,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
data_copy:
from = offset_in_folio(folio[0],
orig_blk_offset << orig_inode->i_blkbits);
- *err = mext_page_mkuptodate(folio[0], from, from + replaced_size);
+ *err = mext_folio_mkuptodate(folio[0], from, from + replaced_size);
if (*err)
goto unlock_folios;
--
2.46.1
* [PATCH v3 09/12] ext4: introduce mext_move_extent()
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
` (7 preceding siblings ...)
2025-10-10 10:33 ` [PATCH v3 08/12] ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate() Zhang Yi
@ 2025-10-10 10:33 ` Zhang Yi
2025-10-10 13:38 ` Jan Kara
2025-10-10 10:33 ` [PATCH v3 10/12] ext4: switch to using the new extent movement method Zhang Yi
` (2 subsequent siblings)
11 siblings, 1 reply; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When moving extents, the current move_extent_per_page() process can only
move extents of length PAGE_SIZE at a time, which is highly inefficient.
Especially when the file is not severely fragmented, this results in a
large number of unnecessary extent split and merge operations. Moreover,
since the ext4 file system now supports large folios, using PAGE_SIZE as
the processing unit is no longer practical.
Therefore, introduce a new extent-moving method, mext_move_extent(). It
moves one extent of the origin inode at a time, never exceeding the size
of a folio. The parameters for the move are passed through the new
mext_data structure, which includes the origin inode, the donor inode,
the mapping extent of the origin inode to be moved, and the starting
offset of the donor inode.
The move process is similar to move_extent_per_page() and falls into
three types: MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, and MEXT_COPY_DATA.
MEXT_SKIP_EXTENT indicates that the corresponding area of the donor file
is a hole, meaning no actual space is allocated, so the move is skipped.
MEXT_MOVE_EXTENT indicates that the corresponding areas of both the
origin and donor files are unwritten, so no data needs to be copied;
only the extents are swapped. MEXT_COPY_DATA indicates that the
corresponding areas contain data, so the data must be copied. The data
copying is performed in three steps: first, the data at the original
location is read into the page cache; then, the extents are swapped and
the page cache is rebuilt to reflect the new physical block mapping;
finally, the page cache is dirtied and written back to ensure that the
data reaches disk before the metadata is persisted.
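The three-way classification above can be sketched as a small decision helper. This is a model only: the MAP_* bitmask constants are hypothetical stand-ins for EXT4_MAP_MAPPED/EXT4_MAP_UNWRITTEN, while the enum names mirror the patch.

```c
#include <assert.h>

/* Hypothetical stand-ins for the EXT4_MAP_* mapping flags. */
#define MAP_MAPPED    0x01
#define MAP_UNWRITTEN 0x02

enum mext_move_type { MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, MEXT_COPY_DATA };

enum mext_move_type classify_move(unsigned int orig_flags,
				  unsigned int donor_flags)
{
	/* Donor range is a hole (or delalloc): no space allocated, skip. */
	if (!(donor_flags & (MAP_MAPPED | MAP_UNWRITTEN)))
		return MEXT_SKIP_EXTENT;
	/* Both sides unwritten: swap extents, no data copy needed. */
	if ((orig_flags & MAP_UNWRITTEN) && (donor_flags & MAP_UNWRITTEN))
		return MEXT_MOVE_EXTENT;
	/* Otherwise data must be copied through the page cache. */
	return MEXT_COPY_DATA;
}
```

Note that the skip decision looks only at the donor side, while avoiding the copy requires both sides to be unwritten, matching mext_move_begin() in the diff.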
One important point to note is that the folio lock and i_data_sem are
held only during the moving process. Therefore, before moving an extent,
it is necessary to check, while holding the folio lock, whether the
sequence cookie of the area to be moved has changed. If a change is
detected, concurrent write-back may have occurred in the meantime, and
the type of the extent to be moved can no longer be considered reliable;
for example, it may have changed from unwritten to written. In such
cases, return -ESTALE, and the caller should re-look up the move extent
of the original file and retry the movement.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/move_extent.c | 218 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 218 insertions(+)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 2df6072b26c0..aa243be36200 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -13,6 +13,13 @@
#include "ext4.h"
#include "ext4_extents.h"
+struct mext_data {
+ struct inode *orig_inode; /* Origin file inode */
+ struct inode *donor_inode; /* Donor file inode */
+ struct ext4_map_blocks orig_map;/* Origin file's move mapping */
+ ext4_lblk_t donor_lblk; /* Start block of the donor file */
+};
+
/**
* get_ext_path() - Find an extent path for designated logical block number.
* @inode: inode to be searched
@@ -164,6 +171,14 @@ mext_folio_double_lock(struct inode *inode1, struct inode *inode2,
return 0;
}
+static void mext_folio_double_unlock(struct folio *folio[2])
+{
+ folio_unlock(folio[0]);
+ folio_put(folio[0]);
+ folio_unlock(folio[1]);
+ folio_put(folio[1]);
+}
+
/* Force folio buffers uptodate w/o dropping folio's lock */
static int mext_folio_mkuptodate(struct folio *folio, size_t from, size_t to)
{
@@ -238,6 +253,209 @@ static int mext_folio_mkuptodate(struct folio *folio, size_t from, size_t to)
return 0;
}
+enum mext_move_type {MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, MEXT_COPY_DATA};
+
+/*
+ * Start to move extent between the origin inode and the donor inode,
+ * hold one folio for each inode and check the candidate moving extent
+ * mapping status again.
+ */
+static int mext_move_begin(struct mext_data *mext, struct folio *folio[2],
+ enum mext_move_type *move_type)
+{
+ struct inode *orig_inode = mext->orig_inode;
+ struct inode *donor_inode = mext->donor_inode;
+ unsigned int blkbits = orig_inode->i_blkbits;
+ struct ext4_map_blocks donor_map = {0};
+ loff_t orig_pos, donor_pos;
+ size_t move_len;
+ int ret;
+
+ orig_pos = ((loff_t)mext->orig_map.m_lblk) << blkbits;
+ donor_pos = ((loff_t)mext->donor_lblk) << blkbits;
+ ret = mext_folio_double_lock(orig_inode, donor_inode,
+ orig_pos >> PAGE_SHIFT, donor_pos >> PAGE_SHIFT, folio);
+ if (ret)
+ return ret;
+
+ /*
+ * Check the origin inode's mapping information again under the
+ * folio lock, as we do not hold the i_data_sem at all times, and
+ * it may change during the concurrent write-back operation.
+ */
+ if (mext->orig_map.m_seq != READ_ONCE(EXT4_I(orig_inode)->i_es_seq)) {
+ ret = -ESTALE;
+ goto error;
+ }
+
+ /* Adjust the moving length according to the length of shorter folio. */
+ move_len = umin(folio_pos(folio[0]) + folio_size(folio[0]) - orig_pos,
+ folio_pos(folio[1]) + folio_size(folio[1]) - donor_pos);
+ move_len >>= blkbits;
+ if (move_len < mext->orig_map.m_len)
+ mext->orig_map.m_len = move_len;
+
+ donor_map.m_lblk = mext->donor_lblk;
+ donor_map.m_len = mext->orig_map.m_len;
+ donor_map.m_flags = 0;
+ ret = ext4_map_blocks(NULL, donor_inode, &donor_map, 0);
+ if (ret < 0)
+ goto error;
+
+ /* Adjust the moving length according to the donor mapping length. */
+ mext->orig_map.m_len = donor_map.m_len;
+
+ /* Skip moving if the donor range is a hole or a delalloc extent. */
+ if (!(donor_map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)))
+ *move_type = MEXT_SKIP_EXTENT;
+ /* If both mapping ranges are unwritten, no need to copy data. */
+ else if ((mext->orig_map.m_flags & EXT4_MAP_UNWRITTEN) &&
+ (donor_map.m_flags & EXT4_MAP_UNWRITTEN))
+ *move_type = MEXT_MOVE_EXTENT;
+ else
+ *move_type = MEXT_COPY_DATA;
+
+ return 0;
+error:
+ mext_folio_double_unlock(folio);
+ return ret;
+}
+
+/*
+ * Re-create the new moved mapping buffers of the original inode and commit
+ * the entire written range.
+ */
+static int mext_folio_mkwrite(struct inode *inode, struct folio *folio,
+ size_t from, size_t to)
+{
+ unsigned int blocksize = i_blocksize(inode);
+ struct buffer_head *bh, *head;
+ size_t block_start, block_end;
+ sector_t block;
+ int ret;
+
+ head = folio_buffers(folio);
+ if (!head)
+ head = create_empty_buffers(folio, blocksize, 0);
+
+ block = folio_pos(folio) >> inode->i_blkbits;
+ block_end = 0;
+ bh = head;
+ do {
+ block_start = block_end;
+ block_end = block_start + blocksize;
+ if (block_end <= from || block_start >= to)
+ continue;
+
+ ret = ext4_get_block(inode, block, bh, 0);
+ if (ret)
+ return ret;
+ } while (block++, (bh = bh->b_this_page) != head);
+
+ block_commit_write(folio, from, to);
+ return 0;
+}
+
+/*
+ * Save the data in original inode extent blocks and replace one folio size
+ * aligned original inode extent with one or one partial donor inode extent,
+ * and then write out the saved data in new original inode blocks. Pass out
+ * the replaced block count through m_len. Return 0 on success, and an error
+ * code otherwise.
+ */
+static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
+{
+ struct inode *orig_inode = mext->orig_inode;
+ struct inode *donor_inode = mext->donor_inode;
+ struct ext4_map_blocks *orig_map = &mext->orig_map;
+ unsigned int blkbits = orig_inode->i_blkbits;
+ struct folio *folio[2] = {NULL, NULL};
+ loff_t from, length;
+ enum mext_move_type move_type = 0;
+ handle_t *handle;
+ u64 r_len = 0;
+ unsigned int credits;
+ int ret, ret2;
+
+ *m_len = 0;
+ credits = ext4_chunk_trans_extent(orig_inode, 0) * 2;
+ handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = mext_move_begin(mext, folio, &move_type);
+ if (ret)
+ goto stop_handle;
+
+ if (move_type == MEXT_SKIP_EXTENT)
+ goto unlock;
+
+ /*
+ * Copy the data. First, read the original inode data into the page
+ * cache. Then, release the existing mapping relationships and swap
+ * the extent. Finally, re-establish the new mapping relationships
+ * and dirty the page cache.
+ */
+ if (move_type == MEXT_COPY_DATA) {
+ from = offset_in_folio(folio[0],
+ ((loff_t)orig_map->m_lblk) << blkbits);
+ length = ((loff_t)orig_map->m_len) << blkbits;
+
+ ret = mext_folio_mkuptodate(folio[0], from, from + length);
+ if (ret)
+ goto unlock;
+ }
+
+ if (!filemap_release_folio(folio[0], 0) ||
+ !filemap_release_folio(folio[1], 0)) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ /* Move extent */
+ ext4_double_down_write_data_sem(orig_inode, donor_inode);
+ *m_len = ext4_swap_extents(handle, orig_inode, donor_inode,
+ orig_map->m_lblk, mext->donor_lblk,
+ orig_map->m_len, 1, &ret);
+ ext4_double_up_write_data_sem(orig_inode, donor_inode);
+
+ /* A short-length swap cannot occur after a successful swap extent. */
+ if (WARN_ON_ONCE(!ret && (*m_len != orig_map->m_len)))
+ ret = -EIO;
+
+ if (!(*m_len) || (move_type == MEXT_MOVE_EXTENT))
+ goto unlock;
+
+ /* Copy data */
+ length = (*m_len) << blkbits;
+ ret = mext_folio_mkwrite(orig_inode, folio[0], from, from + length);
+ if (ret)
+ goto repair_branches;
+ /*
+ * Even in case of data=writeback it is reasonable to pin
+ * inode to transaction, to prevent unexpected data loss.
+ */
+ ret = ext4_jbd2_inode_add_write(handle, orig_inode,
+ ((loff_t)orig_map->m_lblk) << blkbits, length);
+unlock:
+ mext_folio_double_unlock(folio);
+stop_handle:
+ ext4_journal_stop(handle);
+ return ret;
+
+repair_branches:
+ r_len = ext4_swap_extents(handle, donor_inode, orig_inode,
+ mext->donor_lblk, orig_map->m_lblk,
+ *m_len, 0, &ret2);
+ if (ret2 || r_len != *m_len) {
+ ext4_error_inode_block(orig_inode, (sector_t)(orig_map->m_lblk),
+ EIO, "Unable to copy data block, data will be lost!");
+ ret = -EIO;
+ }
+ *m_len = 0;
+ goto unlock;
+}
+
/**
* move_extent_per_page - Move extent data per page
*
--
2.46.1
* [PATCH v3 10/12] ext4: switch to using the new extent movement method
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
` (8 preceding siblings ...)
2025-10-10 10:33 ` [PATCH v3 09/12] ext4: introduce mext_move_extent() Zhang Yi
@ 2025-10-10 10:33 ` Zhang Yi
2025-10-10 13:41 ` Jan Kara
2025-10-10 10:33 ` [PATCH v3 11/12] ext4: add large folios support for moving extents Zhang Yi
2025-10-10 10:33 ` [PATCH v3 12/12] ext4: add two trace points " Zhang Yi
11 siblings, 1 reply; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Now that we have mext_move_extent(), we can switch to this new interface
and remove move_extent_per_page(). First, after acquiring the i_rwsem,
we can directly use ext4_map_blocks() to obtain a contiguous extent from
the original inode as the extent to be moved. It is safe to get the
mapping information from the extent status tree without accessing the
on-disk extent tree, because mext_move_extent() rechecks the sequence
cookie under the folio lock. Then, after populating the mext_data
structure, we call mext_move_extent() to move the extent. Finally, the
length of the extent to move is adjusted in mext.orig_map.m_len and the
actual moved length is returned through m_len.
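The caller-side structure described above — map a chunk, move it, advance both files by the length actually moved — can be sketched like this. The helper names, the fixed 8-block per-call cap, and returning the total instead of using a *moved_len out-parameter are illustrative assumptions, not the patch's actual code.

```c
#include <assert.h>

typedef unsigned int lblk_t;

/*
 * Hypothetical stand-in for one mext_move_extent() call: pretend at
 * most 8 blocks fit in one move (e.g. bounded by the folio size) and
 * report the moved length through *m_len, as the real function does.
 */
int move_extent_stub(lblk_t orig_lblk, lblk_t len, unsigned long long *m_len)
{
	(void)orig_lblk;
	*m_len = len < 8 ? len : 8;
	return 0;
}

/*
 * Walk the requested range, advancing both files by the moved length.
 * Returns the total moved length, or a negative error code.
 */
long long move_range(lblk_t orig_blk, lblk_t donor_blk, lblk_t len)
{
	lblk_t o_start = orig_blk, o_end = orig_blk + len;
	lblk_t d_start = donor_blk;
	long long moved = 0;

	while (o_start < o_end) {
		unsigned long long m_len;
		int ret = move_extent_stub(o_start, o_end - o_start, &m_len);

		if (ret)
			return ret;
		o_start += m_len;	/* advance origin position */
		d_start += m_len;	/* keep donor position in step */
		moved += m_len;
	}
	(void)d_start;
	return moved;
}
```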
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/move_extent.c | 395 ++++++------------------------------------
1 file changed, 51 insertions(+), 344 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index aa243be36200..85f5ae53a2d6 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -20,29 +20,6 @@ struct mext_data {
ext4_lblk_t donor_lblk; /* Start block of the donor file */
};
-/**
- * get_ext_path() - Find an extent path for designated logical block number.
- * @inode: inode to be searched
- * @lblock: logical block number to find an extent path
- * @path: pointer to an extent path
- *
- * ext4_find_extent wrapper. Return an extent path pointer on success,
- * or an error pointer on failure.
- */
-static inline struct ext4_ext_path *
-get_ext_path(struct inode *inode, ext4_lblk_t lblock,
- struct ext4_ext_path *path)
-{
- path = ext4_find_extent(inode, lblock, path, EXT4_EX_NOCACHE);
- if (IS_ERR(path))
- return path;
- if (path[ext_depth(inode)].p_ext == NULL) {
- ext4_free_ext_path(path);
- return ERR_PTR(-ENODATA);
- }
- return path;
-}
-
/**
* ext4_double_down_write_data_sem() - write lock two inodes's i_data_sem
* @first: inode to be locked
@@ -59,7 +36,6 @@ ext4_double_down_write_data_sem(struct inode *first, struct inode *second)
} else {
down_write(&EXT4_I(second)->i_data_sem);
down_write_nested(&EXT4_I(first)->i_data_sem, I_DATA_SEM_OTHER);
-
}
}
@@ -78,42 +54,6 @@ ext4_double_up_write_data_sem(struct inode *orig_inode,
up_write(&EXT4_I(donor_inode)->i_data_sem);
}
-/**
- * mext_check_coverage - Check that all extents in range has the same type
- *
- * @inode: inode in question
- * @from: block offset of inode
- * @count: block count to be checked
- * @unwritten: extents expected to be unwritten
- * @err: pointer to save error value
- *
- * Return 1 if all extents in range has expected type, and zero otherwise.
- */
-static int
-mext_check_coverage(struct inode *inode, ext4_lblk_t from, ext4_lblk_t count,
- int unwritten, int *err)
-{
- struct ext4_ext_path *path = NULL;
- struct ext4_extent *ext;
- int ret = 0;
- ext4_lblk_t last = from + count;
- while (from < last) {
- path = get_ext_path(inode, from, path);
- if (IS_ERR(path)) {
- *err = PTR_ERR(path);
- return ret;
- }
- ext = path[ext_depth(inode)].p_ext;
- if (unwritten != ext4_ext_is_unwritten(ext))
- goto out;
- from += ext4_ext_get_actual_len(ext);
- }
- ret = 1;
-out:
- ext4_free_ext_path(path);
- return ret;
-}
-
/**
* mext_folio_double_lock - Grab and lock folio on both @inode1 and @inode2
*
@@ -363,7 +303,7 @@ static int mext_folio_mkwrite(struct inode *inode, struct folio *folio,
* the replaced block count through m_len. Return 0 on success, and an error
* code otherwise.
*/
-static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
+static int mext_move_extent(struct mext_data *mext, u64 *m_len)
{
struct inode *orig_inode = mext->orig_inode;
struct inode *donor_inode = mext->donor_inode;
@@ -456,210 +396,6 @@ static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
goto unlock;
}
-/**
- * move_extent_per_page - Move extent data per page
- *
- * @o_filp: file structure of original file
- * @donor_inode: donor inode
- * @orig_page_offset: page index on original file
- * @donor_page_offset: page index on donor file
- * @data_offset_in_page: block index where data swapping starts
- * @block_len_in_page: the number of blocks to be swapped
- * @unwritten: orig extent is unwritten or not
- * @err: pointer to save return value
- *
- * Save the data in original inode blocks and replace original inode extents
- * with donor inode extents by calling ext4_swap_extents().
- * Finally, write out the saved data in new original inode blocks. Return
- * replaced block count.
- */
-static int
-move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
- pgoff_t orig_page_offset, pgoff_t donor_page_offset,
- int data_offset_in_page,
- int block_len_in_page, int unwritten, int *err)
-{
- struct inode *orig_inode = file_inode(o_filp);
- struct folio *folio[2] = {NULL, NULL};
- handle_t *handle;
- ext4_lblk_t orig_blk_offset, donor_blk_offset;
- unsigned long blocksize = orig_inode->i_sb->s_blocksize;
- unsigned int tmp_data_size, data_size, replaced_size;
- int i, err2, jblocks, retries = 0;
- int replaced_count = 0;
- int from;
- int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
- struct super_block *sb = orig_inode->i_sb;
- struct buffer_head *bh = NULL;
-
- /*
- * It needs twice the amount of ordinary journal buffers because
- * inode and donor_inode may change each different metadata blocks.
- */
-again:
- *err = 0;
- jblocks = ext4_meta_trans_blocks(orig_inode, block_len_in_page,
- block_len_in_page) * 2;
- handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, jblocks);
- if (IS_ERR(handle)) {
- *err = PTR_ERR(handle);
- return 0;
- }
-
- orig_blk_offset = orig_page_offset * blocks_per_page +
- data_offset_in_page;
-
- donor_blk_offset = donor_page_offset * blocks_per_page +
- data_offset_in_page;
-
- /* Calculate data_size */
- if ((orig_blk_offset + block_len_in_page - 1) ==
- ((orig_inode->i_size - 1) >> orig_inode->i_blkbits)) {
- /* Replace the last block */
- tmp_data_size = orig_inode->i_size & (blocksize - 1);
- /*
- * If data_size equal zero, it shows data_size is multiples of
- * blocksize. So we set appropriate value.
- */
- if (tmp_data_size == 0)
- tmp_data_size = blocksize;
-
- data_size = tmp_data_size +
- ((block_len_in_page - 1) << orig_inode->i_blkbits);
- } else
- data_size = block_len_in_page << orig_inode->i_blkbits;
-
- replaced_size = data_size;
-
- *err = mext_folio_double_lock(orig_inode, donor_inode, orig_page_offset,
- donor_page_offset, folio);
- if (unlikely(*err < 0))
- goto stop_journal;
- /*
- * If orig extent was unwritten it can become initialized
- * at any time after i_data_sem was dropped, in order to
- * serialize with delalloc we have recheck extent while we
- * hold page's lock, if it is still the case data copy is not
- * necessary, just swap data blocks between orig and donor.
- */
- if (unwritten) {
- ext4_double_down_write_data_sem(orig_inode, donor_inode);
- /* If any of extents in range became initialized we have to
- * fallback to data copying */
- unwritten = mext_check_coverage(orig_inode, orig_blk_offset,
- block_len_in_page, 1, err);
- if (*err)
- goto drop_data_sem;
-
- unwritten &= mext_check_coverage(donor_inode, donor_blk_offset,
- block_len_in_page, 1, err);
- if (*err)
- goto drop_data_sem;
-
- if (!unwritten) {
- ext4_double_up_write_data_sem(orig_inode, donor_inode);
- goto data_copy;
- }
- if (!filemap_release_folio(folio[0], 0) ||
- !filemap_release_folio(folio[1], 0)) {
- *err = -EBUSY;
- goto drop_data_sem;
- }
- replaced_count = ext4_swap_extents(handle, orig_inode,
- donor_inode, orig_blk_offset,
- donor_blk_offset,
- block_len_in_page, 1, err);
- drop_data_sem:
- ext4_double_up_write_data_sem(orig_inode, donor_inode);
- goto unlock_folios;
- }
-data_copy:
- from = offset_in_folio(folio[0],
- orig_blk_offset << orig_inode->i_blkbits);
- *err = mext_folio_mkuptodate(folio[0], from, from + replaced_size);
- if (*err)
- goto unlock_folios;
-
- /* At this point all buffers in range are uptodate, old mapping layout
- * is no longer required, try to drop it now. */
- if (!filemap_release_folio(folio[0], 0) ||
- !filemap_release_folio(folio[1], 0)) {
- *err = -EBUSY;
- goto unlock_folios;
- }
- ext4_double_down_write_data_sem(orig_inode, donor_inode);
- replaced_count = ext4_swap_extents(handle, orig_inode, donor_inode,
- orig_blk_offset, donor_blk_offset,
- block_len_in_page, 1, err);
- ext4_double_up_write_data_sem(orig_inode, donor_inode);
- if (*err) {
- if (replaced_count) {
- block_len_in_page = replaced_count;
- replaced_size =
- block_len_in_page << orig_inode->i_blkbits;
- } else
- goto unlock_folios;
- }
- /* Perform all necessary steps similar write_begin()/write_end()
- * but keeping in mind that i_size will not change */
- bh = folio_buffers(folio[0]);
- if (!bh)
- bh = create_empty_buffers(folio[0],
- 1 << orig_inode->i_blkbits, 0);
- for (i = 0; i < from >> orig_inode->i_blkbits; i++)
- bh = bh->b_this_page;
- for (i = 0; i < block_len_in_page; i++) {
- *err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0);
- if (*err < 0)
- goto repair_branches;
- bh = bh->b_this_page;
- }
-
- block_commit_write(folio[0], from, from + replaced_size);
-
- /* Even in case of data=writeback it is reasonable to pin
- * inode to transaction, to prevent unexpected data loss */
- *err = ext4_jbd2_inode_add_write(handle, orig_inode,
- (loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
-
-unlock_folios:
- folio_unlock(folio[0]);
- folio_put(folio[0]);
- folio_unlock(folio[1]);
- folio_put(folio[1]);
-stop_journal:
- ext4_journal_stop(handle);
- if (*err == -ENOSPC &&
- ext4_should_retry_alloc(sb, &retries))
- goto again;
- /* Buffer was busy because probably is pinned to journal transaction,
- * force transaction commit may help to free it. */
- if (*err == -EBUSY && retries++ < 4 && EXT4_SB(sb)->s_journal &&
- jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal))
- goto again;
- return replaced_count;
-
-repair_branches:
- /*
- * This should never ever happen!
- * Extents are swapped already, but we are not able to copy data.
- * Try to swap extents to it's original places
- */
- ext4_double_down_write_data_sem(orig_inode, donor_inode);
- replaced_count = ext4_swap_extents(handle, donor_inode, orig_inode,
- orig_blk_offset, donor_blk_offset,
- block_len_in_page, 0, &err2);
- ext4_double_up_write_data_sem(orig_inode, donor_inode);
- if (replaced_count != block_len_in_page) {
- ext4_error_inode_block(orig_inode, (sector_t)(orig_blk_offset),
- EIO, "Unable to copy data block,"
- " data will be lost.");
- *err = -EIO;
- }
- replaced_count = 0;
- goto unlock_folios;
-}
-
/*
* Check the validity of the basic filesystem environment and the
* inodes' support status.
@@ -821,106 +557,81 @@ static int mext_check_adjust_range(struct inode *orig_inode,
*
* This function returns 0 and moved block length is set in moved_len
* if succeed, otherwise returns error value.
- *
*/
-int
-ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
- __u64 donor_blk, __u64 len, __u64 *moved_len)
+int ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
+ __u64 donor_blk, __u64 len, __u64 *moved_len)
{
struct inode *orig_inode = file_inode(o_filp);
struct inode *donor_inode = file_inode(d_filp);
- struct ext4_ext_path *path = NULL;
- int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
- ext4_lblk_t o_end, o_start = orig_blk;
- ext4_lblk_t d_start = donor_blk;
+ struct mext_data mext;
+ struct super_block *sb = orig_inode->i_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int retries = 0;
+ u64 m_len;
int ret;
+ *moved_len = 0;
+
/* Protect orig and donor inodes against a truncate */
lock_two_nondirectories(orig_inode, donor_inode);
ret = mext_check_validity(orig_inode, donor_inode);
if (ret)
- goto unlock;
+ goto out;
/* Wait for all existing dio workers */
inode_dio_wait(orig_inode);
inode_dio_wait(donor_inode);
- /* Protect extent tree against block allocations via delalloc */
- ext4_double_down_write_data_sem(orig_inode, donor_inode);
/* Check and adjust the specified move_extent range. */
ret = mext_check_adjust_range(orig_inode, donor_inode, orig_blk,
donor_blk, &len);
if (ret)
goto out;
- o_end = o_start + len;
- *moved_len = 0;
- while (o_start < o_end) {
- struct ext4_extent *ex;
- ext4_lblk_t cur_blk, next_blk;
- pgoff_t orig_page_index, donor_page_index;
- int offset_in_page;
- int unwritten, cur_len;
-
- path = get_ext_path(orig_inode, o_start, path);
- if (IS_ERR(path)) {
- ret = PTR_ERR(path);
+ mext.orig_inode = orig_inode;
+ mext.donor_inode = donor_inode;
+ while (len) {
+ mext.orig_map.m_lblk = orig_blk;
+ mext.orig_map.m_len = len;
+ mext.orig_map.m_flags = 0;
+ mext.donor_lblk = donor_blk;
+
+ ret = ext4_map_blocks(NULL, orig_inode, &mext.orig_map, 0);
+ if (ret < 0)
goto out;
- }
- ex = path[path->p_depth].p_ext;
- cur_blk = le32_to_cpu(ex->ee_block);
- cur_len = ext4_ext_get_actual_len(ex);
- /* Check hole before the start pos */
- if (cur_blk + cur_len - 1 < o_start) {
- next_blk = ext4_ext_next_allocated_block(path);
- if (next_blk == EXT_MAX_BLOCKS) {
- ret = -ENODATA;
- goto out;
+
+ /* Skip moving if it is a hole or a delalloc extent. */
+ if (mext.orig_map.m_flags &
+ (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
+ ret = mext_move_extent(&mext, &m_len);
+ *moved_len += m_len;
+ if (!ret)
+ goto next;
+
+ /* Move failed or partially failed. */
+ if (m_len) {
+ orig_blk += m_len;
+ donor_blk += m_len;
+ len -= m_len;
}
- d_start += next_blk - o_start;
- o_start = next_blk;
- continue;
- /* Check hole after the start pos */
- } else if (cur_blk > o_start) {
- /* Skip hole */
- d_start += cur_blk - o_start;
- o_start = cur_blk;
- /* Extent inside requested range ?*/
- if (cur_blk >= o_end)
- goto out;
- } else { /* in_range(o_start, o_blk, o_len) */
- cur_len += cur_blk - o_start;
+ if (ret == -ESTALE)
+ continue;
+ if (ret == -ENOSPC &&
+ ext4_should_retry_alloc(sb, &retries))
+ continue;
+ if (ret == -EBUSY &&
+ sbi->s_journal && retries++ < 4 &&
+ jbd2_journal_force_commit_nested(sbi->s_journal))
+ continue;
+
+ goto out;
}
- unwritten = ext4_ext_is_unwritten(ex);
- if (o_end - o_start < cur_len)
- cur_len = o_end - o_start;
-
- orig_page_index = o_start >> (PAGE_SHIFT -
- orig_inode->i_blkbits);
- donor_page_index = d_start >> (PAGE_SHIFT -
- donor_inode->i_blkbits);
- offset_in_page = o_start % blocks_per_page;
- if (cur_len > blocks_per_page - offset_in_page)
- cur_len = blocks_per_page - offset_in_page;
- /*
- * Up semaphore to avoid following problems:
- * a. transaction deadlock among ext4_journal_start,
- * ->write_begin via pagefault, and jbd2_journal_commit
- * b. racing with ->read_folio, ->write_begin, and
- * ext4_get_block in move_extent_per_page
- */
- ext4_double_up_write_data_sem(orig_inode, donor_inode);
- /* Swap original branches with new branches */
- *moved_len += move_extent_per_page(o_filp, donor_inode,
- orig_page_index, donor_page_index,
- offset_in_page, cur_len,
- unwritten, &ret);
- ext4_double_down_write_data_sem(orig_inode, donor_inode);
- if (ret < 0)
- break;
- o_start += cur_len;
- d_start += cur_len;
+next:
+ orig_blk += mext.orig_map.m_len;
+ donor_blk += mext.orig_map.m_len;
+ len -= mext.orig_map.m_len;
+ retries = 0;
}
out:
@@ -929,10 +640,6 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
ext4_discard_preallocations(donor_inode);
}
- ext4_free_ext_path(path);
- ext4_double_up_write_data_sem(orig_inode, donor_inode);
-unlock:
unlock_two_nondirectories(orig_inode, donor_inode);
-
return ret;
}
--
2.46.1
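The retry policy in the new ext4_move_extents() loop above can be sketched as a small userspace model (an illustration only, under assumed semantics: mext_should_retry() and its two boolean parameters are hypothetical stand-ins for the open-coded checks against ext4_should_retry_alloc() and jbd2_journal_force_commit_nested() in the patch):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of the retry decisions in the new ext4_move_extents()
 * loop. The real code open-codes these checks; the helper name and the
 * boolean stand-ins here are hypothetical.
 */
static bool mext_should_retry(int err, int *retries,
			      bool should_retry_alloc, bool force_commit_ok)
{
	/* The extent changed under us (-ESTALE); remap and retry. */
	if (err == -ESTALE)
		return true;
	/* Allocation failed; retry while the allocator says it may help. */
	if (err == -ENOSPC && should_retry_alloc)
		return true;
	/*
	 * Buffers were busy, probably pinned to a journal transaction;
	 * forcing a commit may free them. Bounded to 4 attempts.
	 */
	if (err == -EBUSY && (*retries)++ < 4 && force_commit_ok)
		return true;
	return false;
}
```

Note that, as in the patch, a successful move resets the retry counter before the next iteration, so the -EBUSY bound applies per extent rather than per call.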
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 11/12] ext4: add large folios support for moving extents
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
` (9 preceding siblings ...)
2025-10-10 10:33 ` [PATCH v3 10/12] ext4: switch to using the new extent movement method Zhang Yi
@ 2025-10-10 10:33 ` Zhang Yi
2025-10-10 10:33 ` [PATCH v3 12/12] ext4: add two trace points " Zhang Yi
11 siblings, 0 replies; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Pass the length of the extent being moved into mext_folio_double_lock() so
that it can acquire a higher-order folio when the length exceeds PAGE_SIZE.
This speeds up extent moving when the extent is larger than one page.
Additionally, remove the unnecessary comments from
mext_folio_double_lock().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/move_extent.c | 27 ++++++++++-----------------
1 file changed, 10 insertions(+), 17 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 85f5ae53a2d6..691edcf5d464 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -54,23 +54,14 @@ ext4_double_up_write_data_sem(struct inode *orig_inode,
up_write(&EXT4_I(donor_inode)->i_data_sem);
}
-/**
- * mext_folio_double_lock - Grab and lock folio on both @inode1 and @inode2
- *
- * @inode1: the inode structure
- * @inode2: the inode structure
- * @index1: folio index
- * @index2: folio index
- * @folio: result folio vector
- *
- * Grab two locked folio for inode's by inode order
- */
-static int
-mext_folio_double_lock(struct inode *inode1, struct inode *inode2,
- pgoff_t index1, pgoff_t index2, struct folio *folio[2])
+/* Grab and lock folio on both @inode1 and @inode2 by inode order. */
+static int mext_folio_double_lock(struct inode *inode1, struct inode *inode2,
+ pgoff_t index1, pgoff_t index2, size_t len,
+ struct folio *folio[2])
{
struct address_space *mapping[2];
unsigned int flags;
+ fgf_t fgp_flags = FGP_WRITEBEGIN;
BUG_ON(!inode1 || !inode2);
if (inode1 < inode2) {
@@ -83,14 +74,15 @@ mext_folio_double_lock(struct inode *inode1, struct inode *inode2,
}
flags = memalloc_nofs_save();
- folio[0] = __filemap_get_folio(mapping[0], index1, FGP_WRITEBEGIN,
+ fgp_flags |= fgf_set_order(len);
+ folio[0] = __filemap_get_folio(mapping[0], index1, fgp_flags,
mapping_gfp_mask(mapping[0]));
if (IS_ERR(folio[0])) {
memalloc_nofs_restore(flags);
return PTR_ERR(folio[0]);
}
- folio[1] = __filemap_get_folio(mapping[1], index2, FGP_WRITEBEGIN,
+ folio[1] = __filemap_get_folio(mapping[1], index2, fgp_flags,
mapping_gfp_mask(mapping[1]));
memalloc_nofs_restore(flags);
if (IS_ERR(folio[1])) {
@@ -214,7 +206,8 @@ static int mext_move_begin(struct mext_data *mext, struct folio *folio[2],
orig_pos = ((loff_t)mext->orig_map.m_lblk) << blkbits;
donor_pos = ((loff_t)mext->donor_lblk) << blkbits;
ret = mext_folio_double_lock(orig_inode, donor_inode,
- orig_pos >> PAGE_SHIFT, donor_pos >> PAGE_SHIFT, folio);
+ orig_pos >> PAGE_SHIFT, donor_pos >> PAGE_SHIFT,
+ ((size_t)mext->orig_map.m_len) << blkbits, folio);
if (ret)
return ret;
--
2.46.1
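The order hint that fgf_set_order() encodes for __filemap_get_folio() in the patch above can be modeled in userspace as follows (a sketch assuming 4 KiB pages; model_ilog2() and model_folio_order() are hypothetical names mirroring the kernel's ilog2()-based computation, not the real API):

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assume 4 KiB pages for this model */

/* Integer log2 of a size_t, like the kernel's ilog2() for nonzero input. */
static unsigned int model_ilog2(size_t n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * For a request of `len` bytes, fgf_set_order() asks the page cache for
 * a folio of order ilog2(len) - PAGE_SHIFT, or order 0 (a single page)
 * when the request fits in one page. Lengths that are not powers of two
 * round down, so the hint never over-allocates past the request.
 */
static unsigned int model_folio_order(size_t len)
{
	unsigned int shift = model_ilog2(len);

	if (shift <= MODEL_PAGE_SHIFT)
		return 0;
	return shift - MODEL_PAGE_SHIFT;
}
```

With this hint, a 1 MiB extent move can be backed by a single order-8 folio instead of 256 separate pages, which is where the speedup for large extents comes from.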
* [PATCH v3 12/12] ext4: add two trace points for moving extents
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
` (10 preceding siblings ...)
2025-10-10 10:33 ` [PATCH v3 11/12] ext4: add large folios support for moving extents Zhang Yi
@ 2025-10-10 10:33 ` Zhang Yi
11 siblings, 0 replies; 16+ messages in thread
From: Zhang Yi @ 2025-10-10 10:33 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
To facilitate tracking the length, type, and outcome of each extent move,
add trace points at the entry and exit of mext_move_extent().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/move_extent.c | 14 ++++++-
include/trace/events/ext4.h | 74 +++++++++++++++++++++++++++++++++++++
2 files changed, 86 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 691edcf5d464..87f21a30a82a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -13,6 +13,8 @@
#include "ext4.h"
#include "ext4_extents.h"
+#include <trace/events/ext4.h>
+
struct mext_data {
struct inode *orig_inode; /* Origin file inode */
struct inode *donor_inode; /* Donor file inode */
@@ -311,10 +313,14 @@ static int mext_move_extent(struct mext_data *mext, u64 *m_len)
int ret, ret2;
*m_len = 0;
+ trace_ext4_move_extent_enter(orig_inode, orig_map, donor_inode,
+ mext->donor_lblk);
credits = ext4_chunk_trans_extent(orig_inode, 0) * 2;
handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
ret = mext_move_begin(mext, folio, &move_type);
if (ret)
@@ -374,6 +380,10 @@ static int mext_move_extent(struct mext_data *mext, u64 *m_len)
mext_folio_double_unlock(folio);
stop_handle:
ext4_journal_stop(handle);
+out:
+ trace_ext4_move_extent_exit(orig_inode, orig_map->m_lblk, donor_inode,
+ mext->donor_lblk, orig_map->m_len, *m_len,
+ move_type, ret);
return ret;
repair_branches:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 6a0754d38acf..a05bdd48e16e 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -3016,6 +3016,80 @@ TRACE_EVENT(ext4_update_sb,
__entry->fsblk, __entry->flags)
);
+TRACE_EVENT(ext4_move_extent_enter,
+ TP_PROTO(struct inode *orig_inode, struct ext4_map_blocks *orig_map,
+ struct inode *donor_inode, ext4_lblk_t donor_lblk),
+
+ TP_ARGS(orig_inode, orig_map, donor_inode, donor_lblk),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, orig_ino)
+ __field(ext4_lblk_t, orig_lblk)
+ __field(unsigned int, orig_flags)
+ __field(ino_t, donor_ino)
+ __field(ext4_lblk_t, donor_lblk)
+ __field(unsigned int, len)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = orig_inode->i_sb->s_dev;
+ __entry->orig_ino = orig_inode->i_ino;
+ __entry->orig_lblk = orig_map->m_lblk;
+ __entry->orig_flags = orig_map->m_flags;
+ __entry->donor_ino = donor_inode->i_ino;
+ __entry->donor_lblk = donor_lblk;
+ __entry->len = orig_map->m_len;
+ ),
+
+ TP_printk("dev %d,%d origin ino %lu lblk %u flags %s donor ino %lu lblk %u len %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->orig_ino, __entry->orig_lblk,
+ show_mflags(__entry->orig_flags),
+ (unsigned long) __entry->donor_ino, __entry->donor_lblk,
+ __entry->len)
+);
+
+TRACE_EVENT(ext4_move_extent_exit,
+ TP_PROTO(struct inode *orig_inode, ext4_lblk_t orig_lblk,
+ struct inode *donor_inode, ext4_lblk_t donor_lblk,
+ unsigned int m_len, u64 move_len, int move_type, int ret),
+
+ TP_ARGS(orig_inode, orig_lblk, donor_inode, donor_lblk, m_len,
+ move_len, move_type, ret),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, orig_ino)
+ __field(ext4_lblk_t, orig_lblk)
+ __field(ino_t, donor_ino)
+ __field(ext4_lblk_t, donor_lblk)
+ __field(unsigned int, m_len)
+ __field(u64, move_len)
+ __field(int, move_type)
+ __field(int, ret)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = orig_inode->i_sb->s_dev;
+ __entry->orig_ino = orig_inode->i_ino;
+ __entry->orig_lblk = orig_lblk;
+ __entry->donor_ino = donor_inode->i_ino;
+ __entry->donor_lblk = donor_lblk;
+ __entry->m_len = m_len;
+ __entry->move_len = move_len;
+ __entry->move_type = move_type;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("dev %d,%d origin ino %lu lblk %u donor ino %lu lblk %u m_len %u, move_len %llu type %d ret %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->orig_ino, __entry->orig_lblk,
+ (unsigned long) __entry->donor_ino, __entry->donor_lblk,
+ __entry->m_len, __entry->move_len, __entry->move_type,
+ __entry->ret)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
2.46.1
* Re: [PATCH v3 09/12] ext4: introduce mext_move_extent()
2025-10-10 10:33 ` [PATCH v3 09/12] ext4: introduce mext_move_extent() Zhang Yi
@ 2025-10-10 13:38 ` Jan Kara
2025-10-11 1:20 ` Zhang Yi
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-10-10 13:38 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, yi.zhang, libaokun1, yukuai3, yangerkun
On Fri 10-10-25 18:33:23, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> When moving extents, the current move_extent_per_page() process can only
> move extents of length PAGE_SIZE at a time, which is highly inefficient,
> especially when the fragmentation of the file is not particularly
> severe, this will result in a large number of unnecessary extent split
> and merge operations. Moreover, since the ext4 file system now supports
> large folios, using PAGE_SIZE as the processing unit is no longer
> practical.
>
> Therefore, introduce a new move extents method, mext_move_extent(). It
> moves one extent of the origin inode at a time, but not exceeding the
> size of a folio. The parameters for the move are passed through the new
> mext_data data structure, which includes the origin inode, donor inode,
> the mapping extent of the origin inode to be moved, and the starting
> offset of the donor inode.
>
> The move process is similar to move_extent_per_page() and can be
> categorized into three types: MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, and
> MEXT_COPY_DATA. MEXT_SKIP_EXTENT indicates that the corresponding area
> of the donor file is a hole, meaning no actual space is allocated, so
> the move is skipped. MEXT_MOVE_EXTENT indicates that the corresponding
> areas of both the origin and donor files are unwritten, so no data needs
> to be copied; only the extents are swapped. MEXT_COPY_DATA indicates
> that the corresponding areas of both the origin and donor files contain
> data, so data must be copied. The data copying is performed in three
> steps: first, the data from the original location is read into the page
> cache; then, the extents are swapped, and the page cache is rebuilt to
> reflect the index of the physical blocks; finally, the dirty page cache
> is marked and written back to ensure that the data is written to disk
> before the metadata is persisted.
>
> One important point to note is that the folio lock and i_data_sem are
> held only during the moving process. Therefore, before moving an extent,
> it is necessary to check whether the sequence cookie of the area to be
> moved has changed while holding the folio lock. If a change is detected,
> it indicates that concurrent write-back operations may have occurred
> during this period, and the type of the extent to be moved can no longer
> be considered reliable. For example, it may have changed from unwritten
> to written. In such cases, return -ESTALE, and the calling function
> should reacquire the move extent of the original file and retry the
> movement.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
...
> +static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
> +{
> + struct inode *orig_inode = mext->orig_inode;
> + struct inode *donor_inode = mext->donor_inode;
> + struct ext4_map_blocks *orig_map = &mext->orig_map;
> + unsigned int blkbits = orig_inode->i_blkbits;
> + struct folio *folio[2] = {NULL, NULL};
> + loff_t from, length;
> + enum mext_move_type move_type = 0;
> + handle_t *handle;
> + u64 r_len = 0;
> + unsigned int credits;
> + int ret, ret2;
> +
> + *m_len = 0;
> + credits = ext4_chunk_trans_extent(orig_inode, 0) * 2;
> + handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + ret = mext_move_begin(mext, folio, &move_type);
> + if (ret)
> + goto stop_handle;
> +
> + if (move_type == MEXT_SKIP_EXTENT)
> + goto unlock;
> +
> + /*
> + * Copy the data. First, read the original inode data into the page
> + * cache. Then, release the existing mapping relationships and swap
> + * the extent. Finally, re-establish the new mapping relationships
> + * and dirty the page cache.
> + */
> + if (move_type == MEXT_COPY_DATA) {
> + from = offset_in_folio(folio[0],
> + ((loff_t)orig_map->m_lblk) << blkbits);
> + length = ((loff_t)orig_map->m_len) << blkbits;
> +
> + ret = mext_folio_mkuptodate(folio[0], from, from + length);
> + if (ret)
> + goto unlock;
> + }
> +
> + if (!filemap_release_folio(folio[0], 0) ||
> + !filemap_release_folio(folio[1], 0)) {
> + ret = -EBUSY;
> + goto unlock;
> + }
> +
> + /* Move extent */
> + ext4_double_down_write_data_sem(orig_inode, donor_inode);
> + *m_len = ext4_swap_extents(handle, orig_inode, donor_inode,
> + orig_map->m_lblk, mext->donor_lblk,
> + orig_map->m_len, 1, &ret);
> + ext4_double_up_write_data_sem(orig_inode, donor_inode);
> +
> + /* A short-length swap cannot occur after a successful swap extent. */
> + if (WARN_ON_ONCE(!ret && (*m_len != orig_map->m_len)))
> + ret = -EIO;
> +
> + if (!(*m_len) || (move_type == MEXT_MOVE_EXTENT))
> + goto unlock;
> +
> + /* Copy data */
> + length = (*m_len) << blkbits;
> + ret = mext_folio_mkwrite(orig_inode, folio[0], from, from + length);
> + if (ret)
> + goto repair_branches;
I think you need to be careful here and below to not overwrite 'ret' if it
is != 0. So something like:
ret2 = mext_folio_mkwrite(..)
if (ret2) {
if (!ret)
ret = ret2;
goto repair_branches;
}
and something similar below. Otherwise the patch looks good to me.
Honza
> + /*
> + * Even in case of data=writeback it is reasonable to pin
> + * inode to transaction, to prevent unexpected data loss.
> + */
> + ret = ext4_jbd2_inode_add_write(handle, orig_inode,
> + ((loff_t)orig_map->m_lblk) << blkbits, length);
> +unlock:
> + mext_folio_double_unlock(folio);
> +stop_handle:
> + ext4_journal_stop(handle);
> + return ret;
> +
> +repair_branches:
> + r_len = ext4_swap_extents(handle, donor_inode, orig_inode,
> + mext->donor_lblk, orig_map->m_lblk,
> + *m_len, 0, &ret2);
> + if (ret2 || r_len != *m_len) {
> + ext4_error_inode_block(orig_inode, (sector_t)(orig_map->m_lblk),
> + EIO, "Unable to copy data block, data will be lost!");
> + ret = -EIO;
> + }
> + *m_len = 0;
> + goto unlock;
> +}
> +
> /**
> * move_extent_per_page - Move extent data per page
> *
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
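The error-propagation pattern Jan suggests — a cleanup step must not clobber an earlier failure in 'ret', yet its own failure should still be recorded when 'ret' is 0 — can be captured in a small helper (a sketch; record_err() is a hypothetical name, the patch would open-code the ret/ret2 dance):

```c
#include <assert.h>
#include <errno.h>

/*
 * Record `err` into `*ret` only if no earlier error is already stored.
 * This preserves the first failure while still reporting a failure
 * from a later cleanup step when everything before it succeeded.
 */
static void record_err(int *ret, int err)
{
	if (err && !*ret)
		*ret = err;
}
```

Used at each follow-up step (mkwrite, ext4_jbd2_inode_add_write), this keeps the first error as the function's return value while still jumping to repair_branches on any later failure.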
* Re: [PATCH v3 10/12] ext4: switch to using the new extent movement method
2025-10-10 10:33 ` [PATCH v3 10/12] ext4: switch to using the new extent movement method Zhang Yi
@ 2025-10-10 13:41 ` Jan Kara
0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2025-10-10 13:41 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, yi.zhang, libaokun1, yukuai3, yangerkun
On Fri 10-10-25 18:33:24, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Now that we have mext_move_extent(), we can switch to this new interface
> and deprecate move_extent_per_page(). First, after acquiring the
> i_rwsem, we can directly use ext4_map_blocks() to obtain a contiguous
> extent from the original inode as the extent to be moved. It can and
> it's safe to get mapping information from the extent status tree without
> needing to access the ondisk extent tree, because ext4_move_extent()
> will check the sequence cookie under the folio lock. Then, after
> populating the mext_data structure, we call ext4_move_extent() to move
> the extent. Finally, the length of the extent will be adjusted in
> mext.orig_map.m_len and the actual length moved is returned through
> m_len.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/move_extent.c | 395 ++++++------------------------------------
> 1 file changed, 51 insertions(+), 344 deletions(-)
>
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index aa243be36200..85f5ae53a2d6 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -20,29 +20,6 @@ struct mext_data {
> ext4_lblk_t donor_lblk; /* Start block of the donor file */
> };
>
> -/**
> - * get_ext_path() - Find an extent path for designated logical block number.
> - * @inode: inode to be searched
> - * @lblock: logical block number to find an extent path
> - * @path: pointer to an extent path
> - *
> - * ext4_find_extent wrapper. Return an extent path pointer on success,
> - * or an error pointer on failure.
> - */
> -static inline struct ext4_ext_path *
> -get_ext_path(struct inode *inode, ext4_lblk_t lblock,
> - struct ext4_ext_path *path)
> -{
> - path = ext4_find_extent(inode, lblock, path, EXT4_EX_NOCACHE);
> - if (IS_ERR(path))
> - return path;
> - if (path[ext_depth(inode)].p_ext == NULL) {
> - ext4_free_ext_path(path);
> - return ERR_PTR(-ENODATA);
> - }
> - return path;
> -}
> -
> /**
> * ext4_double_down_write_data_sem() - write lock two inodes's i_data_sem
> * @first: inode to be locked
> @@ -59,7 +36,6 @@ ext4_double_down_write_data_sem(struct inode *first, struct inode *second)
> } else {
> down_write(&EXT4_I(second)->i_data_sem);
> down_write_nested(&EXT4_I(first)->i_data_sem, I_DATA_SEM_OTHER);
> -
> }
> }
>
> @@ -78,42 +54,6 @@ ext4_double_up_write_data_sem(struct inode *orig_inode,
> up_write(&EXT4_I(donor_inode)->i_data_sem);
> }
>
> -/**
> - * mext_check_coverage - Check that all extents in range has the same type
> - *
> - * @inode: inode in question
> - * @from: block offset of inode
> - * @count: block count to be checked
> - * @unwritten: extents expected to be unwritten
> - * @err: pointer to save error value
> - *
> - * Return 1 if all extents in range has expected type, and zero otherwise.
> - */
> -static int
> -mext_check_coverage(struct inode *inode, ext4_lblk_t from, ext4_lblk_t count,
> - int unwritten, int *err)
> -{
> - struct ext4_ext_path *path = NULL;
> - struct ext4_extent *ext;
> - int ret = 0;
> - ext4_lblk_t last = from + count;
> - while (from < last) {
> - path = get_ext_path(inode, from, path);
> - if (IS_ERR(path)) {
> - *err = PTR_ERR(path);
> - return ret;
> - }
> - ext = path[ext_depth(inode)].p_ext;
> - if (unwritten != ext4_ext_is_unwritten(ext))
> - goto out;
> - from += ext4_ext_get_actual_len(ext);
> - }
> - ret = 1;
> -out:
> - ext4_free_ext_path(path);
> - return ret;
> -}
> -
> /**
> * mext_folio_double_lock - Grab and lock folio on both @inode1 and @inode2
> *
> @@ -363,7 +303,7 @@ static int mext_folio_mkwrite(struct inode *inode, struct folio *folio,
> * the replaced block count through m_len. Return 0 on success, and an error
> * code otherwise.
> */
> -static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
> +static int mext_move_extent(struct mext_data *mext, u64 *m_len)
> {
> struct inode *orig_inode = mext->orig_inode;
> struct inode *donor_inode = mext->donor_inode;
> @@ -456,210 +396,6 @@ static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
> goto unlock;
> }
>
> -/**
> - * move_extent_per_page - Move extent data per page
> - *
> - * @o_filp: file structure of original file
> - * @donor_inode: donor inode
> - * @orig_page_offset: page index on original file
> - * @donor_page_offset: page index on donor file
> - * @data_offset_in_page: block index where data swapping starts
> - * @block_len_in_page: the number of blocks to be swapped
> - * @unwritten: orig extent is unwritten or not
> - * @err: pointer to save return value
> - *
> - * Save the data in original inode blocks and replace original inode extents
> - * with donor inode extents by calling ext4_swap_extents().
> - * Finally, write out the saved data in new original inode blocks. Return
> - * replaced block count.
> - */
> -static int
> -move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> - pgoff_t orig_page_offset, pgoff_t donor_page_offset,
> - int data_offset_in_page,
> - int block_len_in_page, int unwritten, int *err)
> -{
> - struct inode *orig_inode = file_inode(o_filp);
> - struct folio *folio[2] = {NULL, NULL};
> - handle_t *handle;
> - ext4_lblk_t orig_blk_offset, donor_blk_offset;
> - unsigned long blocksize = orig_inode->i_sb->s_blocksize;
> - unsigned int tmp_data_size, data_size, replaced_size;
> - int i, err2, jblocks, retries = 0;
> - int replaced_count = 0;
> - int from;
> - int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
> - struct super_block *sb = orig_inode->i_sb;
> - struct buffer_head *bh = NULL;
> -
> - /*
> - * It needs twice the amount of ordinary journal buffers because
> - * inode and donor_inode may change each different metadata blocks.
> - */
> -again:
> - *err = 0;
> - jblocks = ext4_meta_trans_blocks(orig_inode, block_len_in_page,
> - block_len_in_page) * 2;
> - handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, jblocks);
> - if (IS_ERR(handle)) {
> - *err = PTR_ERR(handle);
> - return 0;
> - }
> -
> - orig_blk_offset = orig_page_offset * blocks_per_page +
> - data_offset_in_page;
> -
> - donor_blk_offset = donor_page_offset * blocks_per_page +
> - data_offset_in_page;
> -
> - /* Calculate data_size */
> - if ((orig_blk_offset + block_len_in_page - 1) ==
> - ((orig_inode->i_size - 1) >> orig_inode->i_blkbits)) {
> - /* Replace the last block */
> - tmp_data_size = orig_inode->i_size & (blocksize - 1);
> - /*
> - * If data_size equal zero, it shows data_size is multiples of
> - * blocksize. So we set appropriate value.
> - */
> - if (tmp_data_size == 0)
> - tmp_data_size = blocksize;
> -
> - data_size = tmp_data_size +
> - ((block_len_in_page - 1) << orig_inode->i_blkbits);
> - } else
> - data_size = block_len_in_page << orig_inode->i_blkbits;
> -
> - replaced_size = data_size;
> -
> - *err = mext_folio_double_lock(orig_inode, donor_inode, orig_page_offset,
> - donor_page_offset, folio);
> - if (unlikely(*err < 0))
> - goto stop_journal;
> - /*
> - * If orig extent was unwritten it can become initialized
> - * at any time after i_data_sem was dropped, in order to
> - * serialize with delalloc we have recheck extent while we
> - * hold page's lock, if it is still the case data copy is not
> - * necessary, just swap data blocks between orig and donor.
> - */
> - if (unwritten) {
> - ext4_double_down_write_data_sem(orig_inode, donor_inode);
> - /* If any of extents in range became initialized we have to
> - * fallback to data copying */
> - unwritten = mext_check_coverage(orig_inode, orig_blk_offset,
> - block_len_in_page, 1, err);
> - if (*err)
> - goto drop_data_sem;
> -
> - unwritten &= mext_check_coverage(donor_inode, donor_blk_offset,
> - block_len_in_page, 1, err);
> - if (*err)
> - goto drop_data_sem;
> -
> - if (!unwritten) {
> - ext4_double_up_write_data_sem(orig_inode, donor_inode);
> - goto data_copy;
> - }
> - if (!filemap_release_folio(folio[0], 0) ||
> - !filemap_release_folio(folio[1], 0)) {
> - *err = -EBUSY;
> - goto drop_data_sem;
> - }
> - replaced_count = ext4_swap_extents(handle, orig_inode,
> - donor_inode, orig_blk_offset,
> - donor_blk_offset,
> - block_len_in_page, 1, err);
> - drop_data_sem:
> - ext4_double_up_write_data_sem(orig_inode, donor_inode);
> - goto unlock_folios;
> - }
> -data_copy:
> - from = offset_in_folio(folio[0],
> - orig_blk_offset << orig_inode->i_blkbits);
> - *err = mext_folio_mkuptodate(folio[0], from, from + replaced_size);
> - if (*err)
> - goto unlock_folios;
> -
> - /* At this point all buffers in range are uptodate, old mapping layout
> - * is no longer required, try to drop it now. */
> - if (!filemap_release_folio(folio[0], 0) ||
> - !filemap_release_folio(folio[1], 0)) {
> - *err = -EBUSY;
> - goto unlock_folios;
> - }
> - ext4_double_down_write_data_sem(orig_inode, donor_inode);
> - replaced_count = ext4_swap_extents(handle, orig_inode, donor_inode,
> - orig_blk_offset, donor_blk_offset,
> - block_len_in_page, 1, err);
> - ext4_double_up_write_data_sem(orig_inode, donor_inode);
> - if (*err) {
> - if (replaced_count) {
> - block_len_in_page = replaced_count;
> - replaced_size =
> - block_len_in_page << orig_inode->i_blkbits;
> - } else
> - goto unlock_folios;
> - }
> - /* Perform all necessary steps similar write_begin()/write_end()
> - * but keeping in mind that i_size will not change */
> - bh = folio_buffers(folio[0]);
> - if (!bh)
> - bh = create_empty_buffers(folio[0],
> - 1 << orig_inode->i_blkbits, 0);
> - for (i = 0; i < from >> orig_inode->i_blkbits; i++)
> - bh = bh->b_this_page;
> - for (i = 0; i < block_len_in_page; i++) {
> - *err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0);
> - if (*err < 0)
> - goto repair_branches;
> - bh = bh->b_this_page;
> - }
> -
> - block_commit_write(folio[0], from, from + replaced_size);
> -
> - /* Even in case of data=writeback it is reasonable to pin
> - * inode to transaction, to prevent unexpected data loss */
> - *err = ext4_jbd2_inode_add_write(handle, orig_inode,
> - (loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
> -
> -unlock_folios:
> - folio_unlock(folio[0]);
> - folio_put(folio[0]);
> - folio_unlock(folio[1]);
> - folio_put(folio[1]);
> -stop_journal:
> - ext4_journal_stop(handle);
> - if (*err == -ENOSPC &&
> - ext4_should_retry_alloc(sb, &retries))
> - goto again;
> - /* Buffer was busy because probably is pinned to journal transaction,
> - * force transaction commit may help to free it. */
> - if (*err == -EBUSY && retries++ < 4 && EXT4_SB(sb)->s_journal &&
> - jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal))
> - goto again;
> - return replaced_count;
> -
> -repair_branches:
> - /*
> - * This should never ever happen!
> - * Extents are swapped already, but we are not able to copy data.
> - * Try to swap extents to it's original places
> - */
> - ext4_double_down_write_data_sem(orig_inode, donor_inode);
> - replaced_count = ext4_swap_extents(handle, donor_inode, orig_inode,
> - orig_blk_offset, donor_blk_offset,
> - block_len_in_page, 0, &err2);
> - ext4_double_up_write_data_sem(orig_inode, donor_inode);
> - if (replaced_count != block_len_in_page) {
> - ext4_error_inode_block(orig_inode, (sector_t)(orig_blk_offset),
> - EIO, "Unable to copy data block,"
> - " data will be lost.");
> - *err = -EIO;
> - }
> - replaced_count = 0;
> - goto unlock_folios;
> -}
> -
> /*
> * Check the validity of the basic filesystem environment and the
> * inodes' support status.
> @@ -821,106 +557,81 @@ static int mext_check_adjust_range(struct inode *orig_inode,
> *
> * This function returns 0 and moved block length is set in moved_len
> * if succeed, otherwise returns error value.
> - *
> */
> -int
> -ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
> - __u64 donor_blk, __u64 len, __u64 *moved_len)
> +int ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
> + __u64 donor_blk, __u64 len, __u64 *moved_len)
> {
> struct inode *orig_inode = file_inode(o_filp);
> struct inode *donor_inode = file_inode(d_filp);
> - struct ext4_ext_path *path = NULL;
> - int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
> - ext4_lblk_t o_end, o_start = orig_blk;
> - ext4_lblk_t d_start = donor_blk;
> + struct mext_data mext;
> + struct super_block *sb = orig_inode->i_sb;
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + int retries = 0;
> + u64 m_len;
> int ret;
>
> + *moved_len = 0;
> +
> /* Protect orig and donor inodes against a truncate */
> lock_two_nondirectories(orig_inode, donor_inode);
>
> ret = mext_check_validity(orig_inode, donor_inode);
> if (ret)
> - goto unlock;
> + goto out;
>
> /* Wait for all existing dio workers */
> inode_dio_wait(orig_inode);
> inode_dio_wait(donor_inode);
>
> - /* Protect extent tree against block allocations via delalloc */
> - ext4_double_down_write_data_sem(orig_inode, donor_inode);
> /* Check and adjust the specified move_extent range. */
> ret = mext_check_adjust_range(orig_inode, donor_inode, orig_blk,
> donor_blk, &len);
> if (ret)
> goto out;
> - o_end = o_start + len;
>
> - *moved_len = 0;
> - while (o_start < o_end) {
> - struct ext4_extent *ex;
> - ext4_lblk_t cur_blk, next_blk;
> - pgoff_t orig_page_index, donor_page_index;
> - int offset_in_page;
> - int unwritten, cur_len;
> -
> - path = get_ext_path(orig_inode, o_start, path);
> - if (IS_ERR(path)) {
> - ret = PTR_ERR(path);
> + mext.orig_inode = orig_inode;
> + mext.donor_inode = donor_inode;
> + while (len) {
> + mext.orig_map.m_lblk = orig_blk;
> + mext.orig_map.m_len = len;
> + mext.orig_map.m_flags = 0;
> + mext.donor_lblk = donor_blk;
> +
> + ret = ext4_map_blocks(NULL, orig_inode, &mext.orig_map, 0);
> + if (ret < 0)
> goto out;
> - }
> - ex = path[path->p_depth].p_ext;
> - cur_blk = le32_to_cpu(ex->ee_block);
> - cur_len = ext4_ext_get_actual_len(ex);
> - /* Check hole before the start pos */
> - if (cur_blk + cur_len - 1 < o_start) {
> - next_blk = ext4_ext_next_allocated_block(path);
> - if (next_blk == EXT_MAX_BLOCKS) {
> - ret = -ENODATA;
> - goto out;
> +
> + /* Skip moving if it is a hole or a delalloc extent. */
> + if (mext.orig_map.m_flags &
> + (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
> + ret = mext_move_extent(&mext, &m_len);
> + *moved_len += m_len;
> + if (!ret)
> + goto next;
> +
> + /* Move failed or partially failed. */
> + if (m_len) {
> + orig_blk += m_len;
> + donor_blk += m_len;
> + len -= m_len;
> }
> - d_start += next_blk - o_start;
> - o_start = next_blk;
> - continue;
> - /* Check hole after the start pos */
> - } else if (cur_blk > o_start) {
> - /* Skip hole */
> - d_start += cur_blk - o_start;
> - o_start = cur_blk;
> - /* Extent inside requested range ?*/
> - if (cur_blk >= o_end)
> - goto out;
> - } else { /* in_range(o_start, o_blk, o_len) */
> - cur_len += cur_blk - o_start;
> + if (ret == -ESTALE)
> + continue;
> + if (ret == -ENOSPC &&
> + ext4_should_retry_alloc(sb, &retries))
> + continue;
> + if (ret == -EBUSY &&
> + sbi->s_journal && retries++ < 4 &&
> + jbd2_journal_force_commit_nested(sbi->s_journal))
> + continue;
> +
> + goto out;
> }
> - unwritten = ext4_ext_is_unwritten(ex);
> - if (o_end - o_start < cur_len)
> - cur_len = o_end - o_start;
> -
> - orig_page_index = o_start >> (PAGE_SHIFT -
> - orig_inode->i_blkbits);
> - donor_page_index = d_start >> (PAGE_SHIFT -
> - donor_inode->i_blkbits);
> - offset_in_page = o_start % blocks_per_page;
> - if (cur_len > blocks_per_page - offset_in_page)
> - cur_len = blocks_per_page - offset_in_page;
> - /*
> - * Up semaphore to avoid following problems:
> - * a. transaction deadlock among ext4_journal_start,
> - * ->write_begin via pagefault, and jbd2_journal_commit
> - * b. racing with ->read_folio, ->write_begin, and
> - * ext4_get_block in move_extent_per_page
> - */
> - ext4_double_up_write_data_sem(orig_inode, donor_inode);
> - /* Swap original branches with new branches */
> - *moved_len += move_extent_per_page(o_filp, donor_inode,
> - orig_page_index, donor_page_index,
> - offset_in_page, cur_len,
> - unwritten, &ret);
> - ext4_double_down_write_data_sem(orig_inode, donor_inode);
> - if (ret < 0)
> - break;
> - o_start += cur_len;
> - d_start += cur_len;
> +next:
> + orig_blk += mext.orig_map.m_len;
> + donor_blk += mext.orig_map.m_len;
> + len -= mext.orig_map.m_len;
> + retries = 0;
> }
>
> out:
> @@ -929,10 +640,6 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
> ext4_discard_preallocations(donor_inode);
> }
>
> - ext4_free_ext_path(path);
> - ext4_double_up_write_data_sem(orig_inode, donor_inode);
> -unlock:
> unlock_two_nondirectories(orig_inode, donor_inode);
> -
> return ret;
> }
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
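For readers following the rework: the error-handling policy in the new ext4_move_extents() loop quoted above can be modeled in isolation. The sketch below is illustrative, not kernel code; mext_retry_action() and its boolean parameters are hypothetical stand-ins, where can_retry_alloc models ext4_should_retry_alloc() returning true and has_journal models sbi->s_journal being non-NULL with jbd2_journal_force_commit_nested() succeeding.

```c
#include <errno.h>
#include <stdbool.h>

enum mext_action { MEXT_FAIL, MEXT_RETRY };

/*
 * Decide whether a failed extent move should be retried, mirroring the
 * three retry branches in the reworked loop.
 */
static enum mext_action mext_retry_action(int err, bool can_retry_alloc,
					  bool has_journal, int *retries)
{
	/* Extent changed under us (seq cookie mismatch): remap and retry. */
	if (err == -ESTALE)
		return MEXT_RETRY;
	/* Out of space: retry while a pending commit may still free blocks. */
	if (err == -ENOSPC && can_retry_alloc)
		return MEXT_RETRY;
	/* Folio busy, likely pinned to the journal: force a commit, cap at 4. */
	if (err == -EBUSY && has_journal && (*retries)++ < 4)
		return MEXT_RETRY;
	return MEXT_FAIL;
}
```

Note that unlike the earlier per-page code, a successful iteration resets the retry counter, so the EBUSY cap applies per extent rather than per call.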
* Re: [PATCH v3 09/12] ext4: introduce mext_move_extent()
2025-10-10 13:38 ` Jan Kara
@ 2025-10-11 1:20 ` Zhang Yi
0 siblings, 0 replies; 16+ messages in thread
From: Zhang Yi @ 2025-10-11 1:20 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
yi.zhang, libaokun1, yukuai3, yangerkun
On 10/10/2025 9:38 PM, Jan Kara wrote:
> On Fri 10-10-25 18:33:23, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When moving extents, the current move_extent_per_page() process can
>> only move extents of length PAGE_SIZE at a time, which is highly
>> inefficient. Especially when the file is not severely fragmented, this
>> results in a large number of unnecessary extent split and merge
>> operations. Moreover, since ext4 now supports large folios, using
>> PAGE_SIZE as the processing unit is no longer practical.
>>
>> Therefore, introduce a new extent-moving method, mext_move_extent().
>> It moves one extent of the origin inode at a time, not exceeding the
>> size of a folio. The parameters for the move are passed through the
>> new mext_data structure, which contains the origin inode, the donor
>> inode, the mapping extent of the origin inode to be moved, and the
>> starting offset in the donor inode.
>>
>> The move process is similar to move_extent_per_page() and can be
>> categorized into three types: MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, and
>> MEXT_COPY_DATA. MEXT_SKIP_EXTENT indicates that the corresponding area
>> of the donor file is a hole, meaning no actual space is allocated, so
>> the move is skipped. MEXT_MOVE_EXTENT indicates that the corresponding
>> areas of both the origin and donor files are unwritten, so no data needs
>> to be copied; only the extents are swapped. MEXT_COPY_DATA indicates
>> that the corresponding areas of both the origin and donor files contain
>> data, so data must be copied. The data copying is performed in three
>> steps: first, the data from the original location is read into the page
>> cache; then, the extents are swapped, and the page cache is rebuilt to
>> reflect the index of the physical blocks; finally, the dirty page cache
>> is marked and written back to ensure that the data is written to disk
>> before the metadata is persisted.
>>
>> One important point to note is that the folio lock and i_data_sem are
>> held only during the moving process. Therefore, before moving an
>> extent, it is necessary to check, while holding the folio lock,
>> whether the sequence cookie of the area to be moved has changed. If a
>> change is detected, it indicates that concurrent writeback may have
>> occurred in the meantime, and the type of the extent to be moved can
>> no longer be considered reliable; for example, it may have changed
>> from unwritten to written. In such cases, return -ESTALE, and the
>> caller should re-map the extent of the original file and retry the
>> move.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
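The three move types described in the commit message above can be sketched as a small decision function. This is an illustrative model only; classify_move() and its parameters are hypothetical, and in the kernel the decision is made in mext_move_begin() from the mapping flags of the origin and donor extents.

```c
#include <stdbool.h>

enum mext_move_type { MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, MEXT_COPY_DATA };

/* Classify a move from the state of the origin and donor ranges. */
static enum mext_move_type classify_move(bool donor_mapped,
					 bool orig_unwritten,
					 bool donor_unwritten)
{
	/* Donor range is a hole: no space allocated, skip the move. */
	if (!donor_mapped)
		return MEXT_SKIP_EXTENT;
	/* Both ranges unwritten: no data to copy, swap extents only. */
	if (orig_unwritten && donor_unwritten)
		return MEXT_MOVE_EXTENT;
	/* Data present: read into the page cache, swap, redirty, write back. */
	return MEXT_COPY_DATA;
}
```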
>
> ...
>
>> +static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
>> +{
>> + struct inode *orig_inode = mext->orig_inode;
>> + struct inode *donor_inode = mext->donor_inode;
>> + struct ext4_map_blocks *orig_map = &mext->orig_map;
>> + unsigned int blkbits = orig_inode->i_blkbits;
>> + struct folio *folio[2] = {NULL, NULL};
>> + loff_t from, length;
>> + enum mext_move_type move_type = 0;
>> + handle_t *handle;
>> + u64 r_len = 0;
>> + unsigned int credits;
>> + int ret, ret2;
>> +
>> + *m_len = 0;
>> + credits = ext4_chunk_trans_extent(orig_inode, 0) * 2;
>> + handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, credits);
>> + if (IS_ERR(handle))
>> + return PTR_ERR(handle);
>> +
>> + ret = mext_move_begin(mext, folio, &move_type);
>> + if (ret)
>> + goto stop_handle;
>> +
>> + if (move_type == MEXT_SKIP_EXTENT)
>> + goto unlock;
>> +
>> + /*
>> + * Copy the data. First, read the original inode data into the page
>> + * cache. Then, release the existing mapping relationships and swap
>> + * the extent. Finally, re-establish the new mapping relationships
>> + * and dirty the page cache.
>> + */
>> + if (move_type == MEXT_COPY_DATA) {
>> + from = offset_in_folio(folio[0],
>> + ((loff_t)orig_map->m_lblk) << blkbits);
>> + length = ((loff_t)orig_map->m_len) << blkbits;
>> +
>> + ret = mext_folio_mkuptodate(folio[0], from, from + length);
>> + if (ret)
>> + goto unlock;
>> + }
>> +
>> + if (!filemap_release_folio(folio[0], 0) ||
>> + !filemap_release_folio(folio[1], 0)) {
>> + ret = -EBUSY;
>> + goto unlock;
>> + }
>> +
>> + /* Move extent */
>> + ext4_double_down_write_data_sem(orig_inode, donor_inode);
>> + *m_len = ext4_swap_extents(handle, orig_inode, donor_inode,
>> + orig_map->m_lblk, mext->donor_lblk,
>> + orig_map->m_len, 1, &ret);
>> + ext4_double_up_write_data_sem(orig_inode, donor_inode);
>> +
>> + /* A short-length swap cannot occur after a successful swap extent. */
>> + if (WARN_ON_ONCE(!ret && (*m_len != orig_map->m_len)))
>> + ret = -EIO;
>> +
>> + if (!(*m_len) || (move_type == MEXT_MOVE_EXTENT))
>> + goto unlock;
>> +
>> + /* Copy data */
>> + length = (*m_len) << blkbits;
>> + ret = mext_folio_mkwrite(orig_inode, folio[0], from, from + length);
>> + if (ret)
>> + goto repair_branches;
>
> I think you need to be careful here and below to not overwrite 'ret' if it
> is != 0. So something like:
>
> ret2 = mext_folio_mkwrite(..)
> if (ret2) {
> if (!ret)
> ret = ret2;
> goto repair_branches;
> }
>
> and something similar below. Otherwise the patch looks good to me.
>
> Honza
OK. Although overwriting 'ret' seems fine here, it's better to preserve the original error, so I'll keep it as you suggest.
Thanks,
Yi.
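The error-preservation pattern Jan suggests above can be reduced to a one-line helper. This is a minimal sketch; first_err() is a hypothetical name, and in the patch the pattern is written inline with ret/ret2 rather than factored out.

```c
/*
 * Return the primary error if one is already recorded; otherwise
 * propagate the secondary error. This keeps a later cleanup failure
 * (ret2) from clobbering the first failure (ret).
 */
static int first_err(int ret, int ret2)
{
	return ret ? ret : ret2;
}
```

A caller would write `ret = first_err(ret, mext_folio_mkwrite(...))` at each step whose failure must not mask an earlier one.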
Thread overview: 16+ messages
2025-10-10 10:33 [PATCH v3 00/12] ext4: optimize online defragment Zhang Yi
2025-10-10 10:33 ` [PATCH v3 01/12] ext4: correct the checking of quota files before moving extents Zhang Yi
2025-10-10 10:33 ` [PATCH v3 02/12] ext4: introduce seq counter for the extent status entry Zhang Yi
2025-10-10 10:33 ` [PATCH v3 03/12] ext4: make ext4_es_lookup_extent() pass out the extent seq counter Zhang Yi
2025-10-10 10:33 ` [PATCH v3 04/12] ext4: pass out extent seq counter when mapping blocks Zhang Yi
2025-10-10 10:33 ` [PATCH v3 05/12] ext4: use EXT4_B_TO_LBLK() in mext_check_arguments() Zhang Yi
2025-10-10 10:33 ` [PATCH v3 06/12] ext4: add mext_check_validity() to do basic check Zhang Yi
2025-10-10 10:33 ` [PATCH v3 07/12] ext4: refactor mext_check_arguments() Zhang Yi
2025-10-10 10:33 ` [PATCH v3 08/12] ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate() Zhang Yi
2025-10-10 10:33 ` [PATCH v3 09/12] ext4: introduce mext_move_extent() Zhang Yi
2025-10-10 13:38 ` Jan Kara
2025-10-11 1:20 ` Zhang Yi
2025-10-10 10:33 ` [PATCH v3 10/12] ext4: switch to using the new extent movement method Zhang Yi
2025-10-10 13:41 ` Jan Kara
2025-10-10 10:33 ` [PATCH v3 11/12] ext4: add large folios support for moving extents Zhang Yi
2025-10-10 10:33 ` [PATCH v3 12/12] ext4: add two trace points " Zhang Yi