From: "Darrick J. Wong" <djwong@kernel.org>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
Jan Kara <jack@suse.cz>, John Garry <john.g.garry@oracle.com>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
Date: Wed, 14 May 2025 09:16:03 -0700 [thread overview]
Message-ID: <20250514161603.GG25655@frogsfrogsfrogs> (raw)
In-Reply-To: <29f33ff8f54526d332c4cc590ac254543587aaa4.1746734745.git.ritesh.list@gmail.com>
On Fri, May 09, 2025 at 02:20:34AM +0530, Ritesh Harjani (IBM) wrote:
> There can be a case where there are contiguous extents on the adjacent
> leaf nodes of on-disk extent trees. So when someone tries to write to
> this contiguous range, ext4_map_blocks() call will split by returning
> 1 extent at a time if this is not already cached in extent_status tree
> cache (where if these extents when cached can get merged since they are
> contiguous).
>
> This is fine for a normal write however in case of atomic writes, it
> can't afford to break the write into two. Now this is also something
> that will only happen in the slow write case where we call
> ext4_map_blocks() for each of these extents spread across different leaf
> nodes. However, there is no guarantee that these extent status cache
> cannot be reclaimed before the last call to ext4_map_blocks() in
> ext4_map_blocks_atomic_write_slow().
Can you have two physically and logically contiguous mappings within a
single leaf node? Or is the key idea here that the extent status tree
will merge adjacent mappings from the same leaf block, just not between
leaf blocks?
> Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS.
> This flag checks if the requested range can be fully found in extent
> status cache and return. If not, it looks up in on-disk extent
> tree via ext4_map_query_blocks(). If the found extent is the last entry
> in the leaf node, then it goes and queries the next lblk to see if there
> is an adjacent contiguous extent in the adjacent leaf node of the
> on-disk extent tree.
>
> Even though there can be a case where there are multiple adjacent extent
> entries spread across multiple leaf nodes. But we only read an adjacent
> leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes.
> The reason for this is that we are mostly only going to support atomic
> writes with upto 64KB or maybe max upto 1MB of atomic write support.
>
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
> fs/ext4/ext4.h | 18 ++++++++-
> fs/ext4/extents.c | 12 ++++++
> fs/ext4/inode.c | 97 +++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 115 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index e2b36a3c1b0f..b4bbe2837423 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -256,9 +256,19 @@ struct ext4_allocation_request {
> #define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
> #define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
> #define EXT4_MAP_DELAYED BIT(BH_Delay)
> +/*
> + * This is for use in ext4_map_query_blocks() for a special case where we can
> + * have a physically and logically contiguous blocks explit across two leaf
s/explit/split/ ?
--D
> + * nodes instead of a single extent. This is required in case of atomic writes
> + * to know whether the returned extent is last in leaf. If yes, then lookup for
> + * next in leaf block in ext4_map_query_blocks_next_in_leaf().
> + * - This is never going to be added to any buffer head state.
> + * - We use the next available bit after BH_BITMAP_UPTODATE.
> + */
> +#define EXT4_MAP_QUERY_LAST_IN_LEAF BIT(BH_BITMAP_UPTODATE + 1)
> #define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
> EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
> - EXT4_MAP_DELAYED)
> + EXT4_MAP_DELAYED | EXT4_MAP_QUERY_LAST_IN_LEAF)
>
> struct ext4_map_blocks {
> ext4_fsblk_t m_pblk;
> @@ -725,6 +735,12 @@ enum {
> #define EXT4_GET_BLOCKS_IO_SUBMIT 0x0400
> /* Caller is in the atomic contex, find extent if it has been cached */
> #define EXT4_GET_BLOCKS_CACHED_NOWAIT 0x0800
> +/*
> + * Atomic write caller needs this to query in the slow path of mixed mapping
> + * case, when a contiguous extent can be split across two adjacent leaf nodes.
> + * Look EXT4_MAP_QUERY_LAST_IN_LEAF.
> + */
> +#define EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF 0x1000
>
> /*
> * The bit position of these flags must not overlap with any of the
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c616a16a9f36..fa850f188d46 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4433,6 +4433,18 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> allocated = map->m_len;
> ext4_ext_show_leaf(inode, path);
> out:
> + /*
> + * We never use EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF with CREATE flag.
> + * So we know that the depth used here is correct, since there was no
> + * block allocation done if EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is set.
> + * If tomorrow we start using this QUERY flag with CREATE, then we will
> + * need to re-calculate the depth as it might have changed due to block
> + * allocation.
> + */
> + if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF)
> + if (!err && ex && (ex == EXT_LAST_EXTENT(path[depth].p_hdr)))
> + map->m_flags |= EXT4_MAP_QUERY_LAST_IN_LEAF;
> +
> ext4_free_ext_path(path);
>
> trace_ext4_ext_map_blocks_exit(inode, flags, map,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2f99b087a5d8..8b86b1a29bdc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -459,14 +459,71 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
> }
> #endif /* ES_AGGRESSIVE_TEST */
>
> +static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
> + struct inode *inode, struct ext4_map_blocks *map,
> + unsigned int orig_mlen)
> +{
> + struct ext4_map_blocks map2;
> + unsigned int status, status2;
> + int retval;
> +
> + status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> +
> + WARN_ON_ONCE(!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF));
> + WARN_ON_ONCE(orig_mlen <= map->m_len);
> +
> + /* Prepare map2 for lookup in next leaf block */
> + map2.m_lblk = map->m_lblk + map->m_len;
> + map2.m_len = orig_mlen - map->m_len;
> + map2.m_flags = 0;
> + retval = ext4_ext_map_blocks(handle, inode, &map2, 0);
> +
> + if (retval <= 0) {
> + ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status, false);
> + return map->m_len;
> + }
> +
> + if (unlikely(retval != map2.m_len)) {
> + ext4_warning(inode->i_sb,
> + "ES len assertion failed for inode "
> + "%lu: retval %d != map->m_len %d",
> + inode->i_ino, retval, map2.m_len);
> + WARN_ON(1);
> + }
> +
> + status2 = map2.m_flags & EXT4_MAP_UNWRITTEN ?
> + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> +
> + /*
> + * If map2 is contiguous with map, then let's insert it as a single
> + * extent in es cache and return the combined length of both the maps.
> + */
> + if (map->m_pblk + map->m_len == map2.m_pblk &&
> + status == status2) {
> + ext4_es_insert_extent(inode, map->m_lblk,
> + map->m_len + map2.m_len, map->m_pblk,
> + status, false);
> + map->m_len += map2.m_len;
> + } else {
> + ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status, false);
> + }
> +
> + return map->m_len;
> +}
> +
> static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
> - struct ext4_map_blocks *map)
> + struct ext4_map_blocks *map, int flags)
> {
> unsigned int status;
> int retval;
> + unsigned int orig_mlen = map->m_len;
> + unsigned int query_flags = flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF;
>
> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> - retval = ext4_ext_map_blocks(handle, inode, map, 0);
> + retval = ext4_ext_map_blocks(handle, inode, map, query_flags);
> else
> retval = ext4_ind_map_blocks(handle, inode, map, 0);
>
> @@ -481,11 +538,23 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
> WARN_ON(1);
> }
>
> - status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> - EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> - ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> - map->m_pblk, status, false);
> - return retval;
> + /*
> + * No need to query next in leaf:
> + * - if returned extent is not last in leaf or
> + * - if the last in leaf is the full requested range
> + */
> + if (!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) ||
> + ((map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) &&
> + (map->m_len == orig_mlen))) {
> + status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> + ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status, false);
> + return retval;
> + }
> +
> + return ext4_map_query_blocks_next_in_leaf(handle, inode, map,
> + orig_mlen);
> }
>
> static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
> @@ -599,6 +668,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> struct extent_status es;
> int retval;
> int ret = 0;
> + unsigned int orig_mlen = map->m_len;
> #ifdef ES_AGGRESSIVE_TEST
> struct ext4_map_blocks orig_map;
>
> @@ -650,7 +720,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> ext4_map_blocks_es_recheck(handle, inode, map,
> &orig_map, flags);
> #endif
> - goto found;
> + if (!(flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF) ||
> + orig_mlen == map->m_len)
> + goto found;
> +
> + if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF)
> + map->m_len = orig_mlen;
> }
> /*
> * In the query cache no-wait mode, nothing we can do more if we
> @@ -664,7 +739,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> * file system block.
> */
> down_read(&EXT4_I(inode)->i_data_sem);
> - retval = ext4_map_query_blocks(handle, inode, map);
> + retval = ext4_map_query_blocks(handle, inode, map, flags);
> up_read((&EXT4_I(inode)->i_data_sem));
>
> found:
> @@ -1802,7 +1877,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> if (ext4_has_inline_data(inode))
> retval = 0;
> else
> - retval = ext4_map_query_blocks(NULL, inode, map);
> + retval = ext4_map_query_blocks(NULL, inode, map, 0);
> up_read(&EXT4_I(inode)->i_data_sem);
> if (retval)
> return retval < 0 ? retval : 0;
> @@ -1825,7 +1900,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> goto found;
> }
> } else if (!ext4_has_inline_data(inode)) {
> - retval = ext4_map_query_blocks(NULL, inode, map);
> + retval = ext4_map_query_blocks(NULL, inode, map, 0);
> if (retval) {
> up_write(&EXT4_I(inode)->i_data_sem);
> return retval < 0 ? retval : 0;
> --
> 2.49.0
>
>
next prev parent reply other threads:[~2025-05-14 16:16 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
2025-05-09 5:19 ` Ojaswin Mujoo
2025-05-14 16:23 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
2025-05-09 5:20 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
2025-05-09 5:21 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
2025-05-14 16:16 ` Darrick J. Wong [this message]
2025-05-14 18:47 ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-14 16:19 ` Darrick J. Wong
2025-05-14 19:04 ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
2025-05-14 16:21 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
2025-05-09 7:34 ` Ojaswin Mujoo
2025-05-14 16:38 ` Darrick J. Wong
2025-05-15 2:15 ` Ritesh Harjani
2025-05-15 2:18 ` Ritesh Harjani
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
2025-05-14 16:40 ` Darrick J. Wong
2025-05-14 18:55 ` Ritesh Harjani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250514161603.GG25655@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=jack@suse.cz \
--cc=john.g.garry@oracle.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox