Re: [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
	Jan Kara <jack@suse.cz>, John Garry <john.g.garry@oracle.com>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc
Date: Thu, 15 May 2025 00:34:43 +0530	[thread overview]
Message-ID: <87ldqzypx0.fsf@gmail.com> (raw)
In-Reply-To: <20250514161913.GH25655@frogsfrogsfrogs>

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, May 09, 2025 at 02:20:35AM +0530, Ritesh Harjani (IBM) wrote:
>> EXT4 supports bigalloc feature which allows the FS to work in size of
>> clusters (group of blocks) rather than individual blocks. This patch
>> adds atomic write support for bigalloc so that systems with bs = ps can
>> also create FS using -
>>     mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>> 
>> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
>> adjust ext4's atomic write unit max value to cluster size. This can then support
>> atomic write of size anywhere between [blocksize, clustersize]. This
>> patch adds the required changes to enable multi-fsblock atomic write
>> support using bigalloc in the next patch.
>> 
>> In this patch for block allocation:
>> we first query the underlying region of the requested range by calling
>> ext4_map_blocks() call. Here are the various cases which we then handle
>> depending upon the underlying mapping type:
>> 1. If the underlying region for the entire requested range is a mapped extent,
>>    then we don't call ext4_map_blocks() to allocate anything. We don't need to
>>    even start the jbd2 txn in this case.
>> 2. For an append write case, we create a mapped extent.
>> 3. If the underlying region is entirely a hole, then we create an unwritten
>>    extent for the requested range.
>> 4. If the underlying region is a large unwritten extent, then we split the
>>    extent into 2 unwritten extent of required size.
>> 5. If the underlying region has any type of mixed mapping, then we call
>>    ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>>    within the requested range. This then provide a single mapped extent type
>>    mapping for the requested range.
>> 
>> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
>> flag only when the underlying extent mapping of the requested range is
>> not entirely a hole, an unwritten extent, or a fully mapped extent. That
>> is, if the underlying region contains a mix of hole(s), unwritten
>> extent(s), and mapped extent(s), we use this loop to ensure that all the
>> short mappings are zeroed out. This guarantees that the entire requested
>> range becomes a single, uniformly mapped extent. It is ok to do so
>> because we know this is being done on a bigalloc enabled filesystem
>> where the block bitmap represents the entire cluster unit.
>> 
>> Note having a single contiguous underlying region of type mapped,
>> unwrittn or hole is not a problem. But the reason to avoid writing on
>> top of mixed mapping region is because, atomic writes requires all or
>> nothing should get written for the userspace pwritev2 request. So if at
>> any point in time during the write if a crash or a sudden poweroff
>> occurs, the region undergoing atomic write should read either complete
>> old data or complete new data. But it should never have a mix of both
>> old and new data.
>> So, we first convert any mixed mapping region to a single contiguous
>> mapped extent before any data gets written to it. This is because
>> normally FS will only convert unwritten extents to written at the end of
>> the write in ->end_io() call. And if we allow the writes over a mixed
>> mapping and if a sudden power off happens in between, we will end up
>> reading mix of new data (over mapped extents) and old data (over
>> unwritten extents), because unwritten to written conversion never went
>> through.
>> So to avoid this and to avoid writes getting torned due to mixed
>> mapping, we first allocate a single contiguous block mapping and then
>> do the write.
>> 
>> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> Looks fine (I don't like the pre-zeroing but options are limited on
> ext4) except for one thing...
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 8b86b1a29bdc..2642e1ef128f 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>  	}
>>  }
>>  
>> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
>> +			struct inode *inode, struct ext4_map_blocks *map)
>> +{
>> +	ext4_lblk_t m_lblk = map->m_lblk;
>> +	unsigned int m_len = map->m_len;
>> +	unsigned int mapped_len = 0, m_flags = 0;
>> +	ext4_fsblk_t next_pblk;
>> +	bool check_next_pblk = false;
>> +	int ret = 0;
>> +
>> +	WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
>> +
>> +	/*
>> +	 * This is a slow path in case of mixed mapping. We use
>> +	 * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
>> +	 * contiguous mapped mapping. This will ensure any unwritten or hole
>> +	 * regions within the requested range is zeroed out and we return
>> +	 * a single contiguous mapped extent.
>> +	 */
>> +	m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
>> +
>> +	do {
>> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
>> +		if (ret < 0 && ret != -ENOSPC)
>> +			goto out_err;
>> +		/*
>> +		 * This should never happen, but let's return an error code to
>> +		 * avoid an infinite loop in here.
>> +		 */
>> +		if (ret == 0) {
>> +			ret = -EFSCORRUPTED;
>> +			ext4_warning_inode(inode,
>> +				"ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
>> +				m_flags, ret);
>> +			goto out_err;
>> +		}
>> +		/*
>> +		 * With bigalloc we should never get ENOSPC nor discontiguous
>> +		 * physical extents.
>> +		 */
>> +		if ((check_next_pblk && next_pblk != map->m_pblk) ||
>> +				ret == -ENOSPC) {
>> +			ext4_warning_inode(inode,
>> +				"Non-contiguous allocation detected: expected %llu, got %llu, "
>> +				"or ext4_map_blocks() returned out of space ret: %d",
>> +				next_pblk, map->m_pblk, ret);
>> +			ret = -ENOSPC;
>> +			goto out_err;
>
> If you get physically discontiguous mappings within a cluster, the
> extent tree is corrupt.
>

yes, I guess I was just being hesitant to do that. But you are right,
we should return -EFSCORRUPTED here then. 

I will change the error code along with the other forcecommit change in v4. 

> --D
>

Thanks for the review!

-ritesh


>> +		next_pblk = map->m_pblk + map->m_len;
>> +		check_next_pblk = true;
>> +
>> +		mapped_len += map->m_len;
>> +		map->m_lblk += map->m_len;
>> +		map->m_len = m_len - mapped_len;
>> +	} while (mapped_len < m_len);
>> +
>> +	/*
>> +	 * We might have done some work in above loop, so we need to query the
>> +	 * start of the physical extent, based on the origin m_lblk and m_len.
>> +	 * Let's also ensure we were able to allocate the required range for
>> +	 * mixed mapping case.
>> +	 */
>> +	map->m_lblk = m_lblk;
>> +	map->m_len = m_len;
>> +	map->m_flags = 0;
>> +
>> +	ret = ext4_map_blocks(handle, inode, map,
>> +			      EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
>> +	if (ret != m_len) {
>> +		ext4_warning_inode(inode,
>> +			"allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
>> +			m_lblk, m_len, ret);
>> +		ret = -EINVAL;
>> +	}
>> +	return ret;
>> +
>> +out_err:
>> +	/* reset map before returning an error */
>> +	map->m_lblk = m_lblk;
>> +	map->m_len = m_len;
>> +	map->m_flags = 0;
>> +	return ret;
>> +}
>> +
>> +/*
>> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested
>> + * range in @map [lblk, lblk + len) is one single contiguous extent with no
>> + * mixed mappings.
>> + *
>> + * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
>> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
>> + * physical extent for the requested range does not have a single contiguous
>> + * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
>> + * In that case we will loop over the requested range to allocate and zero out
>> + * the unwritten / holes in between, to get a single mapped extent from
>> + * [m_lblk, m_lblk +  m_len). Note that this is only possible because we know
>> + * this can be called only with bigalloc enabled filesystem where the underlying
>> + * cluster is already allocated. This avoids allocating discontiguous extents
>> + * in the slow path due to multiple calls to ext4_map_blocks().
>> + * The slow path is mostly non-performance critical path, so it should be ok to
>> + * loop using ext4_map_blocks() with appropriate flags to allocate & zero the
>> + * underlying short holes/unwritten extents within the requested range.
>> + */
>> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
>> +				struct ext4_map_blocks *map, int m_flags)
>> +{
>> +	ext4_lblk_t m_lblk = map->m_lblk;
>> +	unsigned int m_len = map->m_len;
>> +	int ret = 0;
>> +
>> +	WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
>> +
>> +	ret = ext4_map_blocks(handle, inode, map, m_flags);
>> +	if (ret < 0 || ret == m_len)
>> +		goto out;
>> +	/*
>> +	 * This is a mixed mapping case where we were not able to allocate
>> +	 * a single contiguous extent. In that case let's reset requested
>> +	 * mapping and call the slow path.
>> +	 */
>> +	map->m_lblk = m_lblk;
>> +	map->m_len = m_len;
>> +	map->m_flags = 0;
>> +
>> +	return ext4_map_blocks_atomic_write_slow(handle, inode, map);
>> +out:
>> +	return ret;
>> +}
>> +
>>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>>  			    unsigned int flags)
>>  {
>> @@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>>  	 */
>>  	if (map->m_len > DIO_MAX_BLOCKS)
>>  		map->m_len = DIO_MAX_BLOCKS;
>> -	dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> +
>> +	/*
>> +	 * journal credits estimation for atomic writes. We call
>> +	 * ext4_map_blocks(), to find if there could be a mixed mapping. If yes,
>> +	 * then let's assume the no. of pextents required can be m_len i.e.
>> +	 * every alternate block can be unwritten and hole.
>> +	 */
>> +	if (flags & IOMAP_ATOMIC) {
>> +		unsigned int orig_mlen = map->m_len;
>> +
>> +		ret = ext4_map_blocks(NULL, inode, map, 0);
>> +		if (ret < 0)
>> +			return ret;
>> +		if (map->m_len < orig_mlen) {
>> +			map->m_len = orig_mlen;
>> +			dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
>> +							     map->m_len);
>> +		} else {
>> +			dio_credits = ext4_chunk_trans_blocks(inode,
>> +							      map->m_len);
>> +		}
>> +	} else {
>> +		dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> +	}
>>  
>>  retry:
>>  	/*
>> @@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>>  
>> -	ret = ext4_map_blocks(handle, inode, map, m_flags);
>> +	if (flags & IOMAP_ATOMIC)
>> +		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
>> +	else
>> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
>>  
>>  	/*
>>  	 * We cannot fill holes in indirect tree based inodes as that could
>> @@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	int ret;
>>  	struct ext4_map_blocks map;
>>  	u8 blkbits = inode->i_blkbits;
>> +	unsigned int orig_mlen;
>>  
>>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>  		return -EINVAL;
>> @@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	map.m_lblk = offset >> blkbits;
>>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>> +	orig_mlen = map.m_len;
>>  
>>  	if (flags & IOMAP_WRITE) {
>>  		/*
>> @@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  		 */
>>  		if (offset + length <= i_size_read(inode)) {
>>  			ret = ext4_map_blocks(NULL, inode, &map, 0);
>> -			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
>> -				goto out;
>> +			/*
>> +			 * For atomic writes the entire requested length should
>> +			 * be mapped.
>> +			 */
>> +			if (map.m_flags & EXT4_MAP_MAPPED) {
>> +				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
>> +				   (flags & IOMAP_ATOMIC && ret >= orig_mlen))
>> +					goto out;
>> +			}
>> +			map.m_len = orig_mlen;
>>  		}
>>  		ret = ext4_iomap_alloc(inode, &map, flags);
>>  	} else {
>> @@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	 */
>>  	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>>  
>> +	/*
>> +	 * Before returning to iomap, let's ensure the allocated mapping
>> +	 * covers the entire requested length for atomic writes.
>> +	 */
>> +	if (flags & IOMAP_ATOMIC) {
>> +		if (map.m_len < (length >> blkbits)) {
>> +			WARN_ON(1);
>> +			return -EINVAL;
>> +		}
>> +	}
>>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>  
>>  	return 0;
>> -- 
>> 2.49.0
>> 
>>

next prev parent reply	other threads:[~2025-05-14 19:08 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
2025-05-09  5:19   ` Ojaswin Mujoo
2025-05-14 16:23   ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
2025-05-09  5:20   ` Ojaswin Mujoo
2025-05-14 16:24   ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
2025-05-09  5:21   ` Ojaswin Mujoo
2025-05-14 16:24   ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
2025-05-14 16:16   ` Darrick J. Wong
2025-05-14 18:47     ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-14 16:19   ` Darrick J. Wong
2025-05-14 19:04     ` Ritesh Harjani [this message]
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
2025-05-14 16:21   ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
2025-05-09  7:34   ` Ojaswin Mujoo
2025-05-14 16:38     ` Darrick J. Wong
2025-05-15  2:15       ` Ritesh Harjani
2025-05-15  2:18     ` Ritesh Harjani
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
2025-05-14 16:40   ` Darrick J. Wong
2025-05-14 18:55     ` Ritesh Harjani

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ldqzypx0.fsf@gmail.com \
    --to=ritesh.list@gmail.com \
    --cc=djwong@kernel.org \
    --cc=jack@suse.cz \
    --cc=john.g.garry@oracle.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.