public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Wang Yugui <wangyugui@e16-tech.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate
Date: Tue, 15 Mar 2022 21:16:27 +0800	[thread overview]
Message-ID: <20220315211627.BD66.409509F4@e16-tech.com> (raw)
In-Reply-To: <dde255ba71e1bfe38daf88c1b7ef51df15d5c150.1647346287.git.fdmanana@suse.com>

Hi,

> TODO: test case when the file with/without snapshot
> 
> From: Filipe Manana <fdmanana@suse.com>
> 
> During a plain fallocate, we always start by reserving an amount of data
> space that matches the length of the range passed to fallocate. When we
> already have extents allocated in that range, we may end up trying to
> reserve a lot more data space then we need, which can result in two
> undesired behaviours:

Need we check whether the allocated extents are exclusive  or shared?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/03/15

> 
> 1) We fail with -ENOSPC. For example the passed range has a length
>    of 1G, but there's only one hole with a size of 1M in that range;
> 
> 2) We temporarily reserve excessive data space that could be used by
>    other operations happening concurrently;
> 
> 3) By reserving much more data space then we need, we can end up
>    doing expensive things like triggering dellaloc for other inodes,
>    waiting for the ordered extents to complete, trigger transaction
>    commits, allocate new block groups, etc.
> 
> Example:
> 
>   $ cat test.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdj
>   MNT=/mnt/sdj
> 
>   mkfs.btrfs -f -b 1g $DEV
>   mount $DEV $MNT
> 
>   # Create a file with a size of 600M and two holes, one at [200M, 201M[
>   # and another at [401M, 402M[
>   xfs_io -f -c "pwrite -S 0xab 0 200M" \
>             -c "pwrite -S 0xcd 201M 200M" \
>             -c "pwrite -S 0xef 402M 198M" \
>             $MNT/foobar
> 
>   # Now call fallocate against the whole file range, see if it fails
>   # with -ENOSPC or not - it shouldn't since we only need to allocate
>   # 2M of data space.
>   xfs_io -c "falloc 0 600M" $MNT/foobar
> 
>   umount $MNT
> 
>   $ ./test.sh
>   (...)
>   wrote 209715200/209715200 bytes at offset 0
>   200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
>   wrote 209715200/209715200 bytes at offset 210763776
>   200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
>   wrote 207618048/207618048 bytes at offset 421527552
>   198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
>   fallocate: No space left on device
>   $
> 
> So fix this by not allocating an amount of data space that matches the
> length of the range passed to fallocate. Instead allocate an amount of
> data space that corresponds to the sum of the sizes of each hole found
> in the range. This reservation now happens after we have locked the file
> range, which is safe since we know at this point there's no delalloc
> in the range because we've taken the inode's VFS lock in exclusive mode,
> we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
> delalloc and waited for all ordered extents in the range to complete.
> 
> This type of failure actually seems to happen in pratice with systemd,
> and we had at least one report about this in a very long thread which
> is referenced by the Link tag below.
> 
> Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c | 69 ++++++++++++++++++++++++++-----------------------
>  1 file changed, 37 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 380054c94e4b..b7c0db1000cd 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	u64 alloc_hint = 0;
>  	u64 locked_end;
>  	u64 actual_end = 0;
> +	u64 data_space_needed = 0;
> +	u64 data_space_reserved = 0;
> +	u64 qgroup_reserved = 0;
>  	struct extent_map *em;
>  	int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode));
>  	int ret;
> @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	if (mode & FALLOC_FL_PUNCH_HOLE)
>  		return btrfs_punch_hole(file, offset, len);
>  
> -	/*
> -	 * Only trigger disk allocation, don't trigger qgroup reserve
> -	 *
> -	 * For qgroup space, it will be checked later.
> -	 */
> -	if (!(mode & FALLOC_FL_ZERO_RANGE)) {
> -		ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
> -						      alloc_end - alloc_start);
> -		if (ret < 0)
> -			return ret;
> -	}
> -
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode,
>  		if (em->block_start == EXTENT_MAP_HOLE ||
>  		    (cur_offset >= inode->i_size &&
>  		     !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +			const u64 range_len = last_byte - cur_offset;
> +
>  			ret = add_falloc_range(&reserve_list, cur_offset,
> -					       last_byte - cur_offset);
> +					       range_len);
>  			if (ret < 0) {
>  				free_extent_map(em);
>  				break;
>  			}
>  			ret = btrfs_qgroup_reserve_data(BTRFS_I(inode),
>  					&data_reserved, cur_offset,
> -					last_byte - cur_offset);
> +					range_len);
>  			if (ret < 0) {
> -				cur_offset = last_byte;
>  				free_extent_map(em);
>  				break;
>  			}
> -		} else {
> -			/*
> -			 * Do not need to reserve unwritten extent for this
> -			 * range, free reserved data space first, otherwise
> -			 * it'll result in false ENOSPC error.
> -			 */
> -			btrfs_free_reserved_data_space(BTRFS_I(inode),
> -				data_reserved, cur_offset,
> -				last_byte - cur_offset);
> +			qgroup_reserved += range_len;
> +			data_space_needed += range_len;
>  		}
>  		free_extent_map(em);
>  		cur_offset = last_byte;
>  	}
>  
> +	if (!ret && data_space_needed > 0) {
> +		/*
> +		 * We are safe to reserve space here as we can't have delalloc
> +		 * in the range, see above.
> +		 */
> +		ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
> +						      data_space_needed);
> +		if (!ret)
> +			data_space_reserved = data_space_needed;
> +	}
> +
>  	/*
>  	 * If ret is still 0, means we're OK to fallocate.
>  	 * Or just cleanup the list and exit.
>  	 */
>  	list_for_each_entry_safe(range, tmp, &reserve_list, list) {
> -		if (!ret)
> +		if (!ret) {
>  			ret = btrfs_prealloc_file_range(inode, mode,
>  					range->start,
>  					range->len, i_blocksize(inode),
>  					offset + len, &alloc_hint);
> -		else
> +			/*
> +			 * btrfs_prealloc_file_range() releases space even
> +			 * if it returns an error.
> +			 */
> +			data_space_reserved -= range->len;
> +			qgroup_reserved -= range->len;
> +		} else if (data_space_reserved > 0) {
>  			btrfs_free_reserved_data_space(BTRFS_I(inode),
> -					data_reserved, range->start,
> -					range->len);
> +					       data_reserved, range->start,
> +					       range->len);
> +			data_space_reserved -= range->len;
> +			qgroup_reserved -= range->len;
> +		} else if (qgroup_reserved > 0) {
> +			btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved,
> +					       range->start, range->len);
> +			qgroup_reserved -= range->len;
> +		}
>  		list_del(&range->list);
>  		kfree(range);
>  	}
> @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>  			     &cached_state);
>  out:
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
> -	/* Let go of our reservation. */
> -	if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
> -		btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
> -				cur_offset, alloc_end - cur_offset);
>  	extent_changeset_free(data_reserved);
>  	return ret;
>  }
> -- 
> 2.33.0



  reply	other threads:[~2022-03-15 13:16 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
2022-03-15 10:47 ` [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate fdmanana
2022-03-15 10:47 ` [PATCH 2/6] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
2022-03-15 10:47 ` [PATCH 3/6] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
2022-03-15 10:47 ` [PATCH 4/6] btrfs: remove ordered extent check and wait during fallocate fdmanana
2022-03-15 10:47 ` [PATCH 5/6] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
2022-03-15 10:47 ` [PATCH 6/6] btrfs: add and use helper to assert an inode range is clean fdmanana
2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
2022-03-15 12:18   ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana
2022-03-15 13:16     ` Wang Yugui [this message]
2022-03-15 13:45       ` Filipe Manana
2022-03-15 12:18   ` [PATCH v2 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
2022-03-15 12:18   ` [PATCH v2 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
2022-03-15 12:18   ` [PATCH v2 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana
2022-03-15 12:18   ` [PATCH v2 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana
2022-03-15 12:18   ` [PATCH v2 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
2022-03-15 12:18   ` [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana
2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
2022-03-15 15:22   ` [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana
2022-03-15 15:22   ` [PATCH v3 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
2022-03-15 15:22   ` [PATCH v3 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
2022-03-15 15:22   ` [PATCH v3 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana
2022-03-15 15:22   ` [PATCH v3 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana
2022-03-15 15:22   ` [PATCH v3 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
2022-03-15 15:22   ` [PATCH v3 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana
2022-03-16 16:29   ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220315211627.BD66.409509F4@e16-tech.com \
    --to=wangyugui@e16-tech.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox