* [PATCH 0/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents
@ 2025-10-31 6:29 Zhang Yi
2025-10-31 6:29 ` [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Zhang Yi @ 2025-10-31 6:29 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
This series addresses the optimization that Jan pointed out [1]
regarding the introduction of a sequence number to
ext4_es_insert_extent(). The proposal is to replace all instances where
the cache of on-disk extents is updated by using ext4_es_cache_extent()
instead of ext4_es_insert_extent(). This change can prevent excessive
cache invalidations caused by unnecessarily increasing the extent
sequence number when reading from the on-disk extent tree. This series
has no dependency on the patch set[2] that introduced the extent
sequence number, so it can be merged independently.
[1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
[2] https://lore.kernel.org/linux-ext4/20251013015128.499308-1-yi.zhang@huaweicloud.com/
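The effect described above can be sketched with a small userspace model in
C. This is a toy, not kernel code: struct seq_inode, seq_insert_extent(),
seq_cache_extent() and seq_mapping_valid() are hypothetical names, and the
model assumes (following the description above) that only the modifying
path advances the sequence number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one cached mapping plus a sequence number (hypothetical). */
struct seq_inode {
	uint64_t es_seq;	/* advanced when the mapping is modified */
	uint32_t lblk;		/* first logical block of the cached mapping */
	uint32_t len;		/* length of the cached mapping in blocks */
};

/* Models the modifying path: record the mapping and advance the sequence. */
void seq_insert_extent(struct seq_inode *ino, uint32_t lblk, uint32_t len)
{
	assert(len > 0);
	ino->lblk = lblk;
	ino->len = len;
	ino->es_seq++;
}

/* Models the caching path: record on-disk state, sequence is untouched. */
void seq_cache_extent(struct seq_inode *ino, uint32_t lblk, uint32_t len)
{
	assert(len > 0);
	ino->lblk = lblk;
	ino->len = len;
}

/* A consumer that sampled the sequence earlier checks for staleness. */
bool seq_mapping_valid(const struct seq_inode *ino, uint64_t sampled_seq)
{
	return ino->es_seq == sampled_seq;
}
```

In this model a consumer that sampled es_seq before a seq_cache_extent()
call still sees its mapping as valid afterwards, while the same call made
through seq_insert_extent() would force it to revalidate.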
Thanks,
Yi.
Zhang Yi (4):
ext4: make ext4_es_cache_extent() support overwrite existing extents
ext4: check for conflicts when caching extents
ext4: adjust the debug info in ext4_es_cache_extent()
ext4: replace ext4_es_insert_extent() when caching on-disk extents
fs/ext4/extents.c | 8 ++---
fs/ext4/extents_status.c | 75 +++++++++++++++++++++++++++++++++++-----
fs/ext4/extents_status.h | 2 +-
fs/ext4/inode.c | 18 +++++-----
4 files changed, 81 insertions(+), 22 deletions(-)
--
2.46.1
* [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents
2025-10-31 6:29 [PATCH 0/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
@ 2025-10-31 6:29 ` Zhang Yi
2025-11-06 9:15 ` Jan Kara
2025-10-31 6:29 ` [PATCH 2/4] ext4: check for conflicts when caching extents Zhang Yi
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Zhang Yi @ 2025-10-31 6:29 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Currently, ext4_es_cache_extent() is used to load extents into the
extent status tree when reading on-disk extent blocks. Since it may be
called while moving or modifying the extent tree, it does not
overwrite existing extents in the extent status tree and is only used
for the initial loading.
There are many other places in ext4 where on-disk extents are inserted
into the extent status tree, such as in ext4_map_query_blocks().
Currently, they call ext4_es_insert_extent() to perform the insertion,
but they don't modify the extents, so ext4_es_cache_extent() would be a
more appropriate choice. However, when ext4_map_query_blocks() inserts
an extent, it may overwrite a short existing extent of the same type.
Therefore, to prepare for the replacements, we need to extend
ext4_es_cache_extent() to allow it to overwrite existing extents with
the same type.
In addition, since caching extents can be more lenient than modifying
them and does not involve changing reserved blocks, the insertion does
not have to be guaranteed to succeed as strictly as in the
ext4_es_insert_extent() function.
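As a rough illustration of the two modes, here is a minimal userspace
sketch in C. It models the extent status tree as a per-block status array,
which is an assumption made only to keep the example short; struct ow_map,
ow_range_has_info() and ow_cache_extent() are hypothetical names and are
not part of ext4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OW_NBLOCKS 64

enum ow_status { OW_NONE = 0, OW_WRITTEN, OW_UNWRITTEN, OW_HOLE };

/* One status per logical block; OW_NONE means nothing is cached. */
struct ow_map {
	enum ow_status blk[OW_NBLOCKS];
};

/* True if any block in [lblk, lblk + len) already has cached status. */
bool ow_range_has_info(const struct ow_map *m, size_t lblk, size_t len)
{
	for (size_t i = lblk; i < lblk + len; i++)
		if (m->blk[i] != OW_NONE)
			return true;
	return false;
}

/*
 * Cache a status for the range. Without 'overwrite', the range is left
 * alone as soon as any part of it is already cached. Returns true if the
 * range was written.
 */
bool ow_cache_extent(struct ow_map *m, size_t lblk, size_t len,
		     enum ow_status status, bool overwrite)
{
	assert(len > 0 && lblk + len <= OW_NBLOCKS);

	if (!overwrite && ow_range_has_info(m, lblk, len))
		return false;

	for (size_t i = lblk; i < lblk + len; i++)
		m->blk[i] = status;
	return true;
}
```

With overwrite unset, a second call that overlaps an already cached range
is a no-op, which matches the initial-loading use. With overwrite set, a
longer range of the same type replaces the shorter cached one.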
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 4 ++--
fs/ext4/extents_status.c | 28 +++++++++++++++++++++-------
fs/ext4/extents_status.h | 2 +-
3 files changed, 24 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ca5499e9412b..c42ceb5aae37 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -537,12 +537,12 @@ static void ext4_cache_extents(struct inode *inode,
if (prev && (prev != lblk))
ext4_es_cache_extent(inode, prev, lblk - prev, ~0,
- EXTENT_STATUS_HOLE);
+ EXTENT_STATUS_HOLE, false);
if (ext4_ext_is_unwritten(ex))
status = EXTENT_STATUS_UNWRITTEN;
ext4_es_cache_extent(inode, lblk, len,
- ext4_ext_pblock(ex), status);
+ ext4_ext_pblock(ex), status, false);
prev = lblk + len;
}
}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 31dc0496f8d0..f9546ecf7340 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -986,13 +986,19 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
}
/*
- * ext4_es_cache_extent() inserts information into the extent status
- * tree if and only if there isn't information about the range in
- * question already.
+ * ext4_es_cache_extent() inserts extent information into the extent status
+ * tree. If 'overwrite' is not set, it inserts extent only if there isn't
+ * information about the specified range. Otherwise, it overwrites the
+ * current information.
+ *
+ * Note that this interface is only used for caching on-disk extent
+ * information and cannot be used to convert existing extents in the extent
+ * status tree. To convert existing extents, use ext4_es_insert_extent()
+ * instead.
*/
void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk,
- unsigned int status)
+ unsigned int status, bool overwrite)
{
struct extent_status *es;
struct extent_status newes;
@@ -1012,10 +1018,18 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
BUG_ON(end < lblk);
write_lock(&EXT4_I(inode)->i_es_lock);
-
es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
- if (!es || es->es_lblk > end)
- __es_insert_extent(inode, &newes, NULL);
+ if (es && es->es_lblk <= end) {
+ if (!overwrite)
+ goto unlock;
+
+ /* Only extents of the same type can be overwritten. */
+ WARN_ON_ONCE(ext4_es_type(es) != status);
+ if (__es_remove_extent(inode, lblk, end, NULL, NULL))
+ goto unlock;
+ }
+ __es_insert_extent(inode, &newes, NULL);
+unlock:
write_unlock(&EXT4_I(inode)->i_es_lock);
}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 8f9c008d11e8..415f7c223a46 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -139,7 +139,7 @@ extern void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
bool delalloc_reserve_used);
extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk,
- unsigned int status);
+ unsigned int status, bool overwrite);
extern void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern void ext4_es_find_extent_range(struct inode *inode,
--
2.46.1
* [PATCH 2/4] ext4: check for conflicts when caching extents
2025-10-31 6:29 [PATCH 0/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
2025-10-31 6:29 ` [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
@ 2025-10-31 6:29 ` Zhang Yi
2025-10-31 6:29 ` [PATCH 3/4] ext4: adjust the debug info in ext4_es_cache_extent() Zhang Yi
2025-10-31 6:29 ` [PATCH 4/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
3 siblings, 0 replies; 8+ messages in thread
From: Zhang Yi @ 2025-10-31 6:29 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Since ext4_es_cache_extent() can only be used to load on-disk extents
and does not permit modifying extents, it must not overwrite an extent
of a different type. To catch misuse of the interface, the current
implementation checks only the first existing extent; it does not
verify all extents within the range to be inserted, as doing so would
be time-consuming in highly fragmented scenarios. Furthermore, adding
such checks to __es_remove_extent() would complicate its logic.
Therefore, perform the full check only in debug mode to ensure that
the function does not overwrite any valuable extents.
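The full-range check can be pictured with the following userspace sketch
in C. It walks a sorted array rather than an rb-tree, and struct
chk_extent and chk_find_conflict() are hypothetical names chosen for
illustration; only the idea of stopping at the first type mismatch inside
the range is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

struct chk_extent {
	unsigned int lblk;	/* first logical block */
	unsigned int len;	/* number of blocks */
	unsigned int type;	/* status type, e.g. written or hole */
};

/*
 * Walk a sorted, non-overlapping array of existing extents and return the
 * first one that intersects [lblk, lblk + len) with a different type, or
 * NULL if every intersecting extent has the requested type.
 */
const struct chk_extent *chk_find_conflict(const struct chk_extent *ext,
					   size_t n, unsigned int lblk,
					   unsigned int len, unsigned int type)
{
	unsigned int end = lblk + len;	/* exclusive end of the new range */

	assert(len > 0);
	for (size_t i = 0; i < n; i++) {
		if (ext[i].lblk + ext[i].len <= lblk)
			continue;	/* entirely before the range */
		if (ext[i].lblk >= end)
			break;		/* sorted: nothing further overlaps */
		if (ext[i].type != type)
			return &ext[i];
	}
	return NULL;
}
```

The cost of the walk grows with the number of extents that intersect the
range, which is why a heavily fragmented file makes the full check more
expensive than looking at the first extent only.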
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents_status.c | 50 +++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index f9546ecf7340..55103c331b6b 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -985,6 +985,48 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
return;
}
+#ifdef CONFIG_EXT4_DEBUG
+/*
+ * If we find an extent that already exists during caching extents, its
+ * status must match the one to be cached. Otherwise, the extent status
+ * tree may have been corrupted.
+ */
+static void ext4_es_cache_extent_check(struct inode *inode,
+ struct extent_status *es, struct extent_status *newes)
+{
+ unsigned int status = ext4_es_type(newes);
+ struct rb_node *node;
+
+ if (ext4_es_type(es) != status)
+ goto conflict;
+
+ while ((node = rb_next(&es->rb_node)) != NULL) {
+ es = rb_entry(node, struct extent_status, rb_node);
+
+ if (es->es_lblk >= newes->es_lblk + newes->es_len)
+ break;
+ if (ext4_es_type(es) != status)
+ goto conflict;
+ }
+ return;
+
+conflict:
+ ext4_warning_inode(inode,
+ "ES cache extent failed: add [%d,%d,%llu,0x%x] conflict with existing [%d,%d,%llu,0x%x]\n",
+ newes->es_lblk, newes->es_len, ext4_es_pblock(newes),
+ ext4_es_status(newes), es->es_lblk, es->es_len,
+ ext4_es_pblock(es), ext4_es_status(es));
+
+ WARN_ON_ONCE(1);
+}
+#else
+static void ext4_es_cache_extent_check(struct inode __maybe_unused *inode,
+ struct extent_status *es, struct extent_status *newes)
+{
+ WARN_ON_ONCE(ext4_es_type(es) != ext4_es_type(newes));
+}
+#endif
+
/*
* ext4_es_cache_extent() inserts extent information into the extent status
* tree. If 'overwrite' is not set, it inserts extent only if there isn't
@@ -1022,9 +1064,11 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
if (es && es->es_lblk <= end) {
if (!overwrite)
goto unlock;
-
- /* Only extents of the same type can be overwritten. */
- WARN_ON_ONCE(ext4_es_type(es) != status);
+ /*
+ * Check whether the overwrites are safe. Only extents
+ * of the same type can be overwritten.
+ */
+ ext4_es_cache_extent_check(inode, es, &newes);
if (__es_remove_extent(inode, lblk, end, NULL, NULL))
goto unlock;
}
--
2.46.1
* [PATCH 3/4] ext4: adjust the debug info in ext4_es_cache_extent()
2025-10-31 6:29 [PATCH 0/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
2025-10-31 6:29 ` [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
2025-10-31 6:29 ` [PATCH 2/4] ext4: check for conflicts when caching extents Zhang Yi
@ 2025-10-31 6:29 ` Zhang Yi
2025-10-31 6:29 ` [PATCH 4/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
3 siblings, 0 replies; 8+ messages in thread
From: Zhang Yi @ 2025-10-31 6:29 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Emit the trace point after successfully inserting an extent in the
ext4_es_cache_extent() function. Additionally, similar to other extent
cache operation functions, call ext4_es_print_tree() to display the
extent debug information of the inode when in ES_DEBUG mode.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents_status.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 55103c331b6b..ae25a3888de4 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1052,7 +1052,6 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, pblk, status);
- trace_ext4_es_cache_extent(inode, &newes);
if (!len)
return;
@@ -1073,6 +1072,8 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
goto unlock;
}
__es_insert_extent(inode, &newes, NULL);
+ trace_ext4_es_cache_extent(inode, &newes);
+ ext4_es_print_tree(inode);
unlock:
write_unlock(&EXT4_I(inode)->i_es_lock);
}
--
2.46.1
* [PATCH 4/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents
2025-10-31 6:29 [PATCH 0/4] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
` (2 preceding siblings ...)
2025-10-31 6:29 ` [PATCH 3/4] ext4: adjust the debug info in ext4_es_cache_extent() Zhang Yi
@ 2025-10-31 6:29 ` Zhang Yi
3 siblings, 0 replies; 8+ messages in thread
From: Zhang Yi @ 2025-10-31 6:29 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
yi.zhang, yi.zhang, libaokun1, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
The remaining places in ext4 that insert extents into the extent
status tree, in ext4_ext_determine_insert_hole() and
ext4_map_query_blocks(), directly cache on-disk extents. We can use
ext4_es_cache_extent() instead of ext4_es_insert_extent() in these
cases. This will help reduce unnecessary increases in extent sequence
numbers and cache invalidations once IOMAP is supported in the future.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 4 ++--
fs/ext4/inode.c | 18 +++++++++---------
2 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c42ceb5aae37..7dc80141350d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4160,8 +4160,8 @@ static ext4_lblk_t ext4_ext_determine_insert_hole(struct inode *inode,
insert_hole:
/* Put just found gap into cache to speed up subsequent requests */
ext_debug(inode, " -> %u:%u\n", hole_start, len);
- ext4_es_insert_extent(inode, hole_start, len, ~0,
- EXTENT_STATUS_HOLE, false);
+ ext4_es_cache_extent(inode, hole_start, len, ~0,
+ EXTENT_STATUS_HOLE, true);
/* Update hole_len to reflect hole size after lblk */
if (hole_start != lblk)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e99306a8f47c..a3c37de552e9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -504,8 +504,8 @@ static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
retval = ext4_ext_map_blocks(handle, inode, &map2, 0);
if (retval <= 0) {
- ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
- map->m_pblk, status, false);
+ ext4_es_cache_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status, true);
return map->m_len;
}
@@ -526,13 +526,13 @@ static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
*/
if (map->m_pblk + map->m_len == map2.m_pblk &&
status == status2) {
- ext4_es_insert_extent(inode, map->m_lblk,
- map->m_len + map2.m_len, map->m_pblk,
- status, false);
+ ext4_es_cache_extent(inode, map->m_lblk,
+ map->m_len + map2.m_len, map->m_pblk,
+ status, true);
map->m_len += map2.m_len;
} else {
- ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
- map->m_pblk, status, false);
+ ext4_es_cache_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status, true);
}
return map->m_len;
@@ -571,8 +571,8 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
map->m_len == orig_mlen) {
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
- ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
- map->m_pblk, status, false);
+ ext4_es_cache_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status, true);
return retval;
}
--
2.46.1
* Re: [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents
2025-10-31 6:29 ` [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
@ 2025-11-06 9:15 ` Jan Kara
2025-11-06 13:02 ` Zhang Yi
0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2025-11-06 9:15 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, yi.zhang, libaokun1, yangerkun
On Fri 31-10-25 14:29:02, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Currently, ext4_es_cache_extent() is used to load extents into the
> extent status tree when reading on-disk extent blocks. Since it may be
> called while moving or modifying the extent tree, so it does not
> overwrite existing extents in the extent status tree and is only used
> for the initial loading.
>
> There are many other places in ext4 where on-disk extents are inserted
> into the extent status tree, such as in ext4_map_query_blocks().
> Currently, they call ext4_es_insert_extent() to perform the insertion,
> but they don't modify the extents, so ext4_es_cache_extent() would be a
> more appropriate choice. However, when ext4_map_query_blocks() inserts
> an extent, it may overwrite a short existing extent of the same type.
> Therefore, to prepare for the replacements, we need to extend
> ext4_es_cache_extent() to allow it to overwrite existing extents with
> the same type.
>
> In addition, since cached extents can be more lenient than the extents
> they modify and do not involve modifying reserved blocks, it is not
> necessary to ensure that the insertion operation succeeds as strictly as
> in the ext4_es_insert_extent() function.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Thanks for writing this series! I think we can actually simplify things
even further. Extent status tree operations can be divided into three
groups:
1) Lookups in es tree - protected only by i_es_lock.
2) Caching of on-disk state into es tree - protected by i_es_lock and
i_data_sem (at least in read mode).
3) Modification of existing state - protected by i_es_lock and i_data_sem
in write mode.
Now because 2) has exclusion vs 3) due to i_data_sem, the observation is
that 2) should never see a real conflict - i.e., all intersecting entries
in es tree have the same status, otherwise this is a bug. So I think that
ext4_es_cache_extent() should always walk the whole inserted range, verify
the statuses match and merge all these entries into a single one. This
isn't going to be slower than what we have today because your
__es_remove_extent(), __es_insert_extent() pair is effectively doing the
same thing, just without checking the statuses. That way we always get the
checking and also ext4_es_cache_extent() doesn't have to have the
overwriting and non-overwriting variant. What do you think?
Honza
> ---
> fs/ext4/extents.c | 4 ++--
> fs/ext4/extents_status.c | 28 +++++++++++++++++++++-------
> fs/ext4/extents_status.h | 2 +-
> 3 files changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index ca5499e9412b..c42ceb5aae37 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -537,12 +537,12 @@ static void ext4_cache_extents(struct inode *inode,
>
> if (prev && (prev != lblk))
> ext4_es_cache_extent(inode, prev, lblk - prev, ~0,
> - EXTENT_STATUS_HOLE);
> + EXTENT_STATUS_HOLE, false);
>
> if (ext4_ext_is_unwritten(ex))
> status = EXTENT_STATUS_UNWRITTEN;
> ext4_es_cache_extent(inode, lblk, len,
> - ext4_ext_pblock(ex), status);
> + ext4_ext_pblock(ex), status, false);
> prev = lblk + len;
> }
> }
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 31dc0496f8d0..f9546ecf7340 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -986,13 +986,19 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> }
>
> /*
> - * ext4_es_cache_extent() inserts information into the extent status
> - * tree if and only if there isn't information about the range in
> - * question already.
> + * ext4_es_cache_extent() inserts extent information into the extent status
> + * tree. If 'overwrite' is not set, it inserts extent only if there isn't
> + * information about the specified range. Otherwise, it overwrites the
> + * current information.
> + *
> + * Note that this interface is only used for caching on-disk extent
> + * information and cannot be used to convert existing extents in the extent
> + * status tree. To convert existing extents, use ext4_es_insert_extent()
> + * instead.
> */
> void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t len, ext4_fsblk_t pblk,
> - unsigned int status)
> + unsigned int status, bool overwrite)
> {
> struct extent_status *es;
> struct extent_status newes;
> @@ -1012,10 +1018,18 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> BUG_ON(end < lblk);
>
> write_lock(&EXT4_I(inode)->i_es_lock);
> -
> es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
> - if (!es || es->es_lblk > end)
> - __es_insert_extent(inode, &newes, NULL);
> + if (es && es->es_lblk <= end) {
> + if (!overwrite)
> + goto unlock;
> +
> + /* Only extents of the same type can be overwritten. */
> + WARN_ON_ONCE(ext4_es_type(es) != status);
> + if (__es_remove_extent(inode, lblk, end, NULL, NULL))
> + goto unlock;
> + }
> + __es_insert_extent(inode, &newes, NULL);
> +unlock:
> write_unlock(&EXT4_I(inode)->i_es_lock);
> }
>
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 8f9c008d11e8..415f7c223a46 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -139,7 +139,7 @@ extern void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> bool delalloc_reserve_used);
> extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t len, ext4_fsblk_t pblk,
> - unsigned int status);
> + unsigned int status, bool overwrite);
> extern void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t len);
> extern void ext4_es_find_extent_range(struct inode *inode,
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents
2025-11-06 9:15 ` Jan Kara
@ 2025-11-06 13:02 ` Zhang Yi
2025-11-11 10:33 ` Jan Kara
0 siblings, 1 reply; 8+ messages in thread
From: Zhang Yi @ 2025-11-06 13:02 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
yi.zhang, libaokun1, yangerkun
Hi! Thank you for the review and suggestions!
On 11/6/2025 5:15 PM, Jan Kara wrote:
> On Fri 31-10-25 14:29:02, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Currently, ext4_es_cache_extent() is used to load extents into the
>> extent status tree when reading on-disk extent blocks. Since it may be
>> called while moving or modifying the extent tree, so it does not
>> overwrite existing extents in the extent status tree and is only used
>> for the initial loading.
>>
>> There are many other places in ext4 where on-disk extents are inserted
>> into the extent status tree, such as in ext4_map_query_blocks().
>> Currently, they call ext4_es_insert_extent() to perform the insertion,
>> but they don't modify the extents, so ext4_es_cache_extent() would be a
>> more appropriate choice. However, when ext4_map_query_blocks() inserts
>> an extent, it may overwrite a short existing extent of the same type.
>> Therefore, to prepare for the replacements, we need to extend
>> ext4_es_cache_extent() to allow it to overwrite existing extents with
>> the same type.
>>
>> In addition, since cached extents can be more lenient than the extents
>> they modify and do not involve modifying reserved blocks, it is not
>> necessary to ensure that the insertion operation succeeds as strictly as
>> in the ext4_es_insert_extent() function.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>
> Thanks for writing this series! I think we can actually simplify things
> event further. Extent status tree operations can be divided into three
> groups:
> 1) Lookups in es tree - protected only by i_es_lock.
> 2) Caching of on-disk state into es tree - protected by i_es_lock and
> i_data_sem (at least in read mode).
> 3) Modification of existing state - protected by i_es_lock and i_data_sem
> in write mode.
Yeah.
>
> Now because 2) has exclusion vs 3) due to i_data_sem, the observation is
> that 2) should never see a real conflict - i.e., all intersecting entries
> in es tree have the same status, otherwise this is a bug.
While I was debugging, I observed two exceptions here.
A. The first exception is about the delayed extent. Since there is no actual
extent present in the extent tree on the disk, if a delayed extent
already exists in the extent status tree and someone calls
ext4_find_extent()->ext4_cache_extents() to cache an extent at the same
location, then a status mismatch will occur (attempting to replace
the delayed extent with a hole). This is not a bug.
B. I also observed that ext4_find_extent()->ext4_cache_extents() is called
during splitting and conversion between unwritten and written states (in
most scenarios, EXT4_EX_NOCACHE is not added). However, because the
process is in an intermediate state of handling extents, there can be
cases where the statuses do not match. I did not analyze this scenario in
detail, but since ext4_es_insert_extent() is called at the end of the
processing to ensure the final state is correct, I don't think this is a
practical issue either.
Therefore, I believe the non-overwriting mode must be retained for
ext4_cache_extents(), because it can be called during case 3). Apart from
ext4_cache_extents(), no other callers should be affected, at least in
theory.
> So I think that
> ext4_es_cache_extent() should always walk the whole inserted range, verify
> the statuses match and merge all these entries into a single one. This
> isn't going to be slower than what we have today because your
> __es_remove_extent(), __es_insert_extent() pair is effectively doing the
> same thing, just without checking the statuses.
Yes, I agree that we can delegate the verification work in
ext4_es_cache_extent() to __es_remove_extent(). During the process of
overwriting extents, the first step is to remove the existing extents. If
the extent status does not match, a warning will be triggered.
> That way we always get the
> checking and also ext4_es_cache_extent() doesn't have to have the
> overwriting and non-overwriting variant. What do you think?
>
> Honza
For case A, we can add an exception during verification and skip the
warnings. For case B, we need to ensure that ext4_cache_extents() is not
allowed to be called during the intermediate processing of the extent
tree. This seems feasible in theory, but I guess it is somewhat fragile.
So, keep the non-overwriting mode?
Best Regards,
Yi.
>
>> ---
>> fs/ext4/extents.c | 4 ++--
>> fs/ext4/extents_status.c | 28 +++++++++++++++++++++-------
>> fs/ext4/extents_status.h | 2 +-
>> 3 files changed, 24 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index ca5499e9412b..c42ceb5aae37 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -537,12 +537,12 @@ static void ext4_cache_extents(struct inode *inode,
>>
>> if (prev && (prev != lblk))
>> ext4_es_cache_extent(inode, prev, lblk - prev, ~0,
>> - EXTENT_STATUS_HOLE);
>> + EXTENT_STATUS_HOLE, false);
>>
>> if (ext4_ext_is_unwritten(ex))
>> status = EXTENT_STATUS_UNWRITTEN;
>> ext4_es_cache_extent(inode, lblk, len,
>> - ext4_ext_pblock(ex), status);
>> + ext4_ext_pblock(ex), status, false);
>> prev = lblk + len;
>> }
>> }
>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>> index 31dc0496f8d0..f9546ecf7340 100644
>> --- a/fs/ext4/extents_status.c
>> +++ b/fs/ext4/extents_status.c
>> @@ -986,13 +986,19 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>> }
>>
>> /*
>> - * ext4_es_cache_extent() inserts information into the extent status
>> - * tree if and only if there isn't information about the range in
>> - * question already.
>> + * ext4_es_cache_extent() inserts extent information into the extent status
>> + * tree. If 'overwrite' is not set, it inserts extent only if there isn't
>> + * information about the specified range. Otherwise, it overwrites the
>> + * current information.
>> + *
>> + * Note that this interface is only used for caching on-disk extent
>> + * information and cannot be used to convert existing extents in the extent
>> + * status tree. To convert existing extents, use ext4_es_insert_extent()
>> + * instead.
>> */
>> void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
>> ext4_lblk_t len, ext4_fsblk_t pblk,
>> - unsigned int status)
>> + unsigned int status, bool overwrite)
>> {
>> struct extent_status *es;
>> struct extent_status newes;
>> @@ -1012,10 +1018,18 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
>> BUG_ON(end < lblk);
>>
>> write_lock(&EXT4_I(inode)->i_es_lock);
>> -
>> es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
>> - if (!es || es->es_lblk > end)
>> - __es_insert_extent(inode, &newes, NULL);
>> + if (es && es->es_lblk <= end) {
>> + if (!overwrite)
>> + goto unlock;
>> +
>> + /* Only extents of the same type can be overwritten. */
>> + WARN_ON_ONCE(ext4_es_type(es) != status);
>> + if (__es_remove_extent(inode, lblk, end, NULL, NULL))
>> + goto unlock;
>> + }
>> + __es_insert_extent(inode, &newes, NULL);
>> +unlock:
>> write_unlock(&EXT4_I(inode)->i_es_lock);
>> }
>>
>> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
>> index 8f9c008d11e8..415f7c223a46 100644
>> --- a/fs/ext4/extents_status.h
>> +++ b/fs/ext4/extents_status.h
>> @@ -139,7 +139,7 @@ extern void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>> bool delalloc_reserve_used);
>> extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
>> ext4_lblk_t len, ext4_fsblk_t pblk,
>> - unsigned int status);
>> + unsigned int status, bool overwrite);
>> extern void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>> ext4_lblk_t len);
>> extern void ext4_es_find_extent_range(struct inode *inode,
>> --
>> 2.46.1
>>
* Re: [PATCH 1/4] ext4: make ext4_es_cache_extent() support overwrite existing extents
2025-11-06 13:02 ` Zhang Yi
@ 2025-11-11 10:33 ` Jan Kara
0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2025-11-11 10:33 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, yi.zhang, libaokun1, yangerkun
Hi!
On Thu 06-11-25 21:02:35, Zhang Yi wrote:
> On 11/6/2025 5:15 PM, Jan Kara wrote:
> > On Fri 31-10-25 14:29:02, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Currently, ext4_es_cache_extent() is used to load extents into the
> >> extent status tree when reading on-disk extent blocks. Since it may be
> >> called while moving or modifying the extent tree, so it does not
> >> overwrite existing extents in the extent status tree and is only used
> >> for the initial loading.
> >>
> >> There are many other places in ext4 where on-disk extents are inserted
> >> into the extent status tree, such as in ext4_map_query_blocks().
> >> Currently, they call ext4_es_insert_extent() to perform the insertion,
> >> but they don't modify the extents, so ext4_es_cache_extent() would be a
> >> more appropriate choice. However, when ext4_map_query_blocks() inserts
> >> an extent, it may overwrite a short existing extent of the same type.
> >> Therefore, to prepare for the replacements, we need to extend
> >> ext4_es_cache_extent() to allow it to overwrite existing extents with
> >> the same type.
> >>
> >> In addition, since cached extents can be more lenient than the extents
> >> they modify and do not involve modifying reserved blocks, it is not
> >> necessary to ensure that the insertion operation succeeds as strictly as
> >> in the ext4_es_insert_extent() function.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >
> > Thanks for writing this series! I think we can actually simplify things
> > event further. Extent status tree operations can be divided into three
> > groups:
> > 1) Lookups in es tree - protected only by i_es_lock.
> > 2) Caching of on-disk state into es tree - protected by i_es_lock and
> > i_data_sem (at least in read mode).
> > 3) Modification of existing state - protected by i_es_lock and i_data_sem
> > in write mode.
>
> Yeah.
>
> >
> > Now because 2) has exclusion vs 3) due to i_data_sem, the observation is
> > that 2) should never see a real conflict - i.e., all intersecting entries
> > in es tree have the same status, otherwise this is a bug.
>
> While I was debugging, I observed two exceptions here.
>
> A. The first exceptions is about the delay extent. Since there is no actual
> extent present in the extent tree on the disk, if a delayed extent
> already exists in the extent status tree and someone calls
> ext4_find_extent()->ext4_cache_extents() to cache an extent at the same
> location, then a status mismatch will occur (attempting to replace
> the delayed extent with a hole). This is not a bug.
> B. I also observed that ext4_find_extent()->ext4_cache_extents() is called
> during splitting and conversion between unwritten and written states (in
> most scenarios, EXT4_EX_NOCACHE is not added). However, because the
> process is in an intermediate state of handling extents, there can be
> cases where the status do not match. I did not analyze this scenario in
> detail, but since ext4_es_insert_extent() is called at the end of the
> processing to ensure the final state is correct, I don't think this is a
> practical issue either.
Thanks for bringing this up. I didn't think about these two cases. Case A
is easy to deal with, as you write below: a hole insertion can be deemed
compatible with an existing delalloc extent.
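One way to picture the case A rule is the small userspace sketch in C
below. It is only an assumption about how such a compatibility test could
look, not code from the series: enum cmp_type, cmp_cache_compatible() and
cmp_count_conflicts() are hypothetical names, and the per-block array
stands in for the extent status tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cmp_type { CMP_NONE, CMP_WRITTEN, CMP_UNWRITTEN, CMP_DELAYED, CMP_HOLE };

/* Is caching 'incoming' on top of 'existing' acceptable? */
bool cmp_cache_compatible(enum cmp_type existing, enum cmp_type incoming)
{
	if (existing == CMP_NONE || existing == incoming)
		return true;
	/* Case A: on disk there is a hole where a delayed extent is cached. */
	return existing == CMP_DELAYED && incoming == CMP_HOLE;
}

/* Count blocks in [lblk, lblk + len) that are incompatible with 'incoming'. */
size_t cmp_count_conflicts(const enum cmp_type *blk, size_t nblk,
			   size_t lblk, size_t len, enum cmp_type incoming)
{
	size_t conflicts = 0;

	assert(lblk + len <= nblk);
	for (size_t i = lblk; i < lblk + len; i++)
		if (!cmp_cache_compatible(blk[i], incoming))
			conflicts++;
	return conflicts;
}
```

The rule is deliberately one-directional in this sketch: a hole cached
over a delayed block is tolerated, while any other type mismatch is
counted as a conflict.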
Case B is more difficult and I think I need to better understand the
details there to decide what to do. Pure extent splitting (as happens
e.g. with EXT4_GET_BLOCKS_PRE_IO) should keep the extents in the extent tree
and the extent status tree compatible. So it has to be something like the
EXT4_GET_BLOCKS_CONVERT case. There, indeed, after we call
ext4_ext_mark_initialized() we have an initialized extent on disk while the
extent status tree still records it as unwritten. But I did not find a place
in the extent conversion path that would modify the extent state on disk and
then call ext4_find_extent(). Can you perhaps share a stack trace where the
extent incompatibility was hit from ext4_cache_extents()? Thanks!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR