From: "Darrick J. Wong" <djwong@kernel.org>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
Jan Kara <jack@suse.cz>, John Garry <john.g.garry@oracle.com>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc
Date: Wed, 14 May 2025 09:19:13 -0700 [thread overview]
Message-ID: <20250514161913.GH25655@frogsfrogsfrogs> (raw)
In-Reply-To: <dc97b425ba7f88f18ef175b014d25d9d6e074278.1746734746.git.ritesh.list@gmail.com>
On Fri, May 09, 2025 at 02:20:35AM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
> mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize]. This
> patch adds the required changes to enable multi-fsblock atomic write
> support using bigalloc in the next patch.
>
> In this patch for block allocation:
> we first query the underlying region of the requested range by calling
> ext4_map_blocks() call. Here are the various cases which we then handle
> depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
> then we don't call ext4_map_blocks() to allocate anything. We don't need to
> even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
> extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
> extent into 2 unwritten extent of required size.
> 5. If the underlying region has any type of mixed mapping, then we call
> ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
> within the requested range. This then provide a single mapped extent type
> mapping for the requested range.
>
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.
>
> Note having a single contiguous underlying region of type mapped,
> unwrittn or hole is not a problem. But the reason to avoid writing on
> top of mixed mapping region is because, atomic writes requires all or
> nothing should get written for the userspace pwritev2 request. So if at
> any point in time during the write if a crash or a sudden poweroff
> occurs, the region undergoing atomic write should read either complete
> old data or complete new data. But it should never have a mix of both
> old and new data.
> So, we first convert any mixed mapping region to a single contiguous
> mapped extent before any data gets written to it. This is because
> normally FS will only convert unwritten extents to written at the end of
> the write in ->end_io() call. And if we allow the writes over a mixed
> mapping and if a sudden power off happens in between, we will end up
> reading mix of new data (over mapped extents) and old data (over
> unwritten extents), because unwritten to written conversion never went
> through.
> So to avoid this and to avoid writes getting torned due to mixed
> mapping, we first allocate a single contiguous block mapping and then
> do the write.
>
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Looks fine (I don't like the pre-zeroing but options are limited on
ext4) except for one thing...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 8b86b1a29bdc..2642e1ef128f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> }
> }
>
> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
> + struct inode *inode, struct ext4_map_blocks *map)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + unsigned int mapped_len = 0, m_flags = 0;
> + ext4_fsblk_t next_pblk;
> + bool check_next_pblk = false;
> + int ret = 0;
> +
> + WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> + /*
> + * This is a slow path in case of mixed mapping. We use
> + * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
> + * contiguous mapped mapping. This will ensure any unwritten or hole
> + * regions within the requested range is zeroed out and we return
> + * a single contiguous mapped extent.
> + */
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> + do {
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (ret < 0 && ret != -ENOSPC)
> + goto out_err;
> + /*
> + * This should never happen, but let's return an error code to
> + * avoid an infinite loop in here.
> + */
> + if (ret == 0) {
> + ret = -EFSCORRUPTED;
> + ext4_warning_inode(inode,
> + "ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
> + m_flags, ret);
> + goto out_err;
> + }
> + /*
> + * With bigalloc we should never get ENOSPC nor discontiguous
> + * physical extents.
> + */
> + if ((check_next_pblk && next_pblk != map->m_pblk) ||
> + ret == -ENOSPC) {
> + ext4_warning_inode(inode,
> + "Non-contiguous allocation detected: expected %llu, got %llu, "
> + "or ext4_map_blocks() returned out of space ret: %d",
> + next_pblk, map->m_pblk, ret);
> + ret = -ENOSPC;
> + goto out_err;
If you get physically discontiguous mappings within a cluster, the
extent tree is corrupt.
--D
> + }
> + next_pblk = map->m_pblk + map->m_len;
> + check_next_pblk = true;
> +
> + mapped_len += map->m_len;
> + map->m_lblk += map->m_len;
> + map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);
> +
> + /*
> + * We might have done some work in above loop, so we need to query the
> + * start of the physical extent, based on the origin m_lblk and m_len.
> + * Let's also ensure we were able to allocate the required range for
> + * mixed mapping case.
> + */
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> + map->m_flags = 0;
> +
> + ret = ext4_map_blocks(handle, inode, map,
> + EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
> + if (ret != m_len) {
> + ext4_warning_inode(inode,
> + "allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
> + m_lblk, m_len, ret);
> + ret = -EINVAL;
> + }
> + return ret;
> +
> +out_err:
> + /* reset map before returning an error */
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> + map->m_flags = 0;
> + return ret;
> +}
> +
> +/*
> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested
> + * range in @map [lblk, lblk + len) is one single contiguous extent with no
> + * mixed mappings.
> + *
> + * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
> + * physical extent for the requested range does not have a single contiguous
> + * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
> + * In that case we will loop over the requested range to allocate and zero out
> + * the unwritten / holes in between, to get a single mapped extent from
> + * [m_lblk, m_lblk + m_len). Note that this is only possible because we know
> + * this can be called only with bigalloc enabled filesystem where the underlying
> + * cluster is already allocated. This avoids allocating discontiguous extents
> + * in the slow path due to multiple calls to ext4_map_blocks().
> + * The slow path is mostly non-performance critical path, so it should be ok to
> + * loop using ext4_map_blocks() with appropriate flags to allocate & zero the
> + * underlying short holes/unwritten extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> + struct ext4_map_blocks *map, int m_flags)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + int ret = 0;
> +
> + WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
> +
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (ret < 0 || ret == m_len)
> + goto out;
> + /*
> + * This is a mixed mapping case where we were not able to allocate
> + * a single contiguous extent. In that case let's reset requested
> + * mapping and call the slow path.
> + */
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> + map->m_flags = 0;
> +
> + return ext4_map_blocks_atomic_write_slow(handle, inode, map);
> +out:
> + return ret;
> +}
> +
> static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> unsigned int flags)
> {
> @@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> */
> if (map->m_len > DIO_MAX_BLOCKS)
> map->m_len = DIO_MAX_BLOCKS;
> - dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +
> + /*
> + * journal credits estimation for atomic writes. We call
> + * ext4_map_blocks(), to find if there could be a mixed mapping. If yes,
> + * then let's assume the no. of pextents required can be m_len i.e.
> + * every alternate block can be unwritten and hole.
> + */
> + if (flags & IOMAP_ATOMIC) {
> + unsigned int orig_mlen = map->m_len;
> +
> + ret = ext4_map_blocks(NULL, inode, map, 0);
> + if (ret < 0)
> + return ret;
> + if (map->m_len < orig_mlen) {
> + map->m_len = orig_mlen;
> + dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
> + map->m_len);
> + } else {
> + dio_credits = ext4_chunk_trans_blocks(inode,
> + map->m_len);
> + }
> + } else {
> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> + }
>
> retry:
> /*
> @@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>
> - ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (flags & IOMAP_ATOMIC)
> + ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
> + else
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
>
> /*
> * We cannot fill holes in indirect tree based inodes as that could
> @@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> int ret;
> struct ext4_map_blocks map;
> u8 blkbits = inode->i_blkbits;
> + unsigned int orig_mlen;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> @@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> map.m_lblk = offset >> blkbits;
> map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> + orig_mlen = map.m_len;
>
> if (flags & IOMAP_WRITE) {
> /*
> @@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> */
> if (offset + length <= i_size_read(inode)) {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> - goto out;
> + /*
> + * For atomic writes the entire requested length should
> + * be mapped.
> + */
> + if (map.m_flags & EXT4_MAP_MAPPED) {
> + if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> + (flags & IOMAP_ATOMIC && ret >= orig_mlen))
> + goto out;
> + }
> + map.m_len = orig_mlen;
> }
> ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> @@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> */
> map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>
> + /*
> + * Before returning to iomap, let's ensure the allocated mapping
> + * covers the entire requested length for atomic writes.
> + */
> + if (flags & IOMAP_ATOMIC) {
> + if (map.m_len < (length >> blkbits)) {
> + WARN_ON(1);
> + return -EINVAL;
> + }
> + }
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>
> return 0;
> --
> 2.49.0
>
>
next prev parent reply other threads:[~2025-05-14 16:19 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
2025-05-09 5:19 ` Ojaswin Mujoo
2025-05-14 16:23 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
2025-05-09 5:20 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
2025-05-09 5:21 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
2025-05-14 16:16 ` Darrick J. Wong
2025-05-14 18:47 ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-14 16:19 ` Darrick J. Wong [this message]
2025-05-14 19:04 ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
2025-05-14 16:21 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
2025-05-09 7:34 ` Ojaswin Mujoo
2025-05-14 16:38 ` Darrick J. Wong
2025-05-15 2:15 ` Ritesh Harjani
2025-05-15 2:18 ` Ritesh Harjani
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
2025-05-14 16:40 ` Darrick J. Wong
2025-05-14 18:55 ` Ritesh Harjani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250514161913.GH25655@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=jack@suse.cz \
--cc=john.g.garry@oracle.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox