From: Wang Yugui <wangyugui@e16-tech.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate
Date: Tue, 15 Mar 2022 21:16:27 +0800
Message-ID: <20220315211627.BD66.409509F4@e16-tech.com>
In-Reply-To: <dde255ba71e1bfe38daf88c1b7ef51df15d5c150.1647346287.git.fdmanana@suse.com>
Hi,
> TODO: add a test case for files with and without a snapshot
>
> From: Filipe Manana <fdmanana@suse.com>
>
> During a plain fallocate, we always start by reserving an amount of data
> space that matches the length of the range passed to fallocate. When we
> already have extents allocated in that range, we may end up trying to
> reserve a lot more data space than we need, which can result in three
> undesired behaviours:
Do we need to check whether the already allocated extents are exclusive
or shared (for example, when the file has a snapshot)?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/03/15
>
> 1) We fail with -ENOSPC. For example the passed range has a length
> of 1G, but there's only one hole with a size of 1M in that range;
>
> 2) We temporarily reserve excessive data space that could be used by
> other operations happening concurrently;
>
> 3) By reserving much more data space than we need, we can end up
> doing expensive things like triggering delalloc for other inodes,
> waiting for the ordered extents to complete, triggering transaction
> commits, allocating new block groups, etc.
>
> Example:
>
> $ cat test.sh
> #!/bin/bash
>
> DEV=/dev/sdj
> MNT=/mnt/sdj
>
> mkfs.btrfs -f -b 1g $DEV
> mount $DEV $MNT
>
> # Create a file with a size of 600M and two holes, one at [200M, 201M[
> # and another at [401M, 402M[
> xfs_io -f -c "pwrite -S 0xab 0 200M" \
>           -c "pwrite -S 0xcd 201M 200M" \
>           -c "pwrite -S 0xef 402M 198M" \
>           $MNT/foobar
>
> # Now call fallocate against the whole file range, see if it fails
> # with -ENOSPC or not - it shouldn't since we only need to allocate
> # 2M of data space.
> xfs_io -c "falloc 0 600M" $MNT/foobar
>
> umount $MNT
>
> $ ./test.sh
> (...)
> wrote 209715200/209715200 bytes at offset 0
> 200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
> wrote 209715200/209715200 bytes at offset 210763776
> 200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
> wrote 207618048/207618048 bytes at offset 421527552
> 198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
> fallocate: No space left on device
> $
>
> So fix this by not allocating an amount of data space that matches the
> length of the range passed to fallocate. Instead allocate an amount of
> data space that corresponds to the sum of the sizes of each hole found
> in the range. This reservation now happens after we have locked the file
> range, which is safe since we know at this point there's no delalloc
> in the range because we've taken the inode's VFS lock in exclusive mode,
> we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
> delalloc and waited for all ordered extents in the range to complete.
>
> This type of failure actually seems to happen in practice with systemd,
> and we had at least one report about this in a very long thread which
> is referenced by the Link tag below.
>
> Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/file.c | 69 ++++++++++++++++++++++++++-----------------------
> 1 file changed, 37 insertions(+), 32 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 380054c94e4b..b7c0db1000cd 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode,
> u64 alloc_hint = 0;
> u64 locked_end;
> u64 actual_end = 0;
> + u64 data_space_needed = 0;
> + u64 data_space_reserved = 0;
> + u64 qgroup_reserved = 0;
> struct extent_map *em;
> int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode));
> int ret;
> @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode,
> if (mode & FALLOC_FL_PUNCH_HOLE)
> return btrfs_punch_hole(file, offset, len);
>
> - /*
> - * Only trigger disk allocation, don't trigger qgroup reserve
> - *
> - * For qgroup space, it will be checked later.
> - */
> - if (!(mode & FALLOC_FL_ZERO_RANGE)) {
> - ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
> - alloc_end - alloc_start);
> - if (ret < 0)
> - return ret;
> - }
> -
> btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
>
> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode,
> if (em->block_start == EXTENT_MAP_HOLE ||
> (cur_offset >= inode->i_size &&
> !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> + const u64 range_len = last_byte - cur_offset;
> +
> ret = add_falloc_range(&reserve_list, cur_offset,
> - last_byte - cur_offset);
> + range_len);
> if (ret < 0) {
> free_extent_map(em);
> break;
> }
> ret = btrfs_qgroup_reserve_data(BTRFS_I(inode),
> &data_reserved, cur_offset,
> - last_byte - cur_offset);
> + range_len);
> if (ret < 0) {
> - cur_offset = last_byte;
> free_extent_map(em);
> break;
> }
> - } else {
> - /*
> - * Do not need to reserve unwritten extent for this
> - * range, free reserved data space first, otherwise
> - * it'll result in false ENOSPC error.
> - */
> - btrfs_free_reserved_data_space(BTRFS_I(inode),
> - data_reserved, cur_offset,
> - last_byte - cur_offset);
> + qgroup_reserved += range_len;
> + data_space_needed += range_len;
> }
> free_extent_map(em);
> cur_offset = last_byte;
> }
>
> + if (!ret && data_space_needed > 0) {
> + /*
> + * We are safe to reserve space here as we can't have delalloc
> + * in the range, see above.
> + */
> + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
> + data_space_needed);
> + if (!ret)
> + data_space_reserved = data_space_needed;
> + }
> +
> /*
> * If ret is still 0, means we're OK to fallocate.
> * Or just cleanup the list and exit.
> */
> list_for_each_entry_safe(range, tmp, &reserve_list, list) {
> - if (!ret)
> + if (!ret) {
> ret = btrfs_prealloc_file_range(inode, mode,
> range->start,
> range->len, i_blocksize(inode),
> offset + len, &alloc_hint);
> - else
> + /*
> + * btrfs_prealloc_file_range() releases space even
> + * if it returns an error.
> + */
> + data_space_reserved -= range->len;
> + qgroup_reserved -= range->len;
> + } else if (data_space_reserved > 0) {
> btrfs_free_reserved_data_space(BTRFS_I(inode),
> - data_reserved, range->start,
> - range->len);
> + data_reserved, range->start,
> + range->len);
> + data_space_reserved -= range->len;
> + qgroup_reserved -= range->len;
> + } else if (qgroup_reserved > 0) {
> + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved,
> + range->start, range->len);
> + qgroup_reserved -= range->len;
> + }
> list_del(&range->list);
> kfree(range);
> }
> @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode,
> &cached_state);
> out:
> btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
> - /* Let go of our reservation. */
> - if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
> - btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
> - cur_offset, alloc_end - cur_offset);
> extent_changeset_free(data_reserved);
> return ret;
> }
> --
> 2.33.0
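
As a side note on the test case quoted above: the amount the patch
reserves is the sum of the hole sizes inside the fallocate range. Below
is a minimal userspace sketch of that arithmetic in C. It is not the
kernel code; struct extent and holes_in_range() are hypothetical names
used only for illustration, and the sketch assumes the extents are sorted
by offset and do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an allocated extent covering [start, start + len). */
struct extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Return the total size of the holes inside [start, end).
 *
 * Assumption (not taken from the patch): ext[] is sorted by start offset
 * and its entries do not overlap.
 */
uint64_t holes_in_range(const struct extent *ext, size_t n,
			uint64_t start, uint64_t end)
{
	uint64_t cur = start;	/* first byte not yet accounted for */
	uint64_t needed = 0;	/* sum of hole sizes found so far */

	assert(start <= end);

	for (size_t i = 0; i < n && cur < end; i++) {
		uint64_t ext_start = ext[i].start;
		uint64_t ext_end = ext[i].start + ext[i].len;

		if (ext_end <= cur)
			continue;	/* extent lies entirely before the cursor */
		if (ext_start >= end)
			break;		/* extent lies entirely after the range */
		if (ext_start > cur)
			needed += ext_start - cur;	/* hole before this extent */
		cur = ext_end;
	}

	if (cur < end)
		needed += end - cur;	/* trailing hole up to the range end */

	return needed;
}
```

For the layout created by test.sh (extents at [0, 200M[, [201M, 401M[
and [402M, 600M[), calling holes_in_range() over [0, 600M[ returns 2M,
which matches the comment in the script, whereas the old code tried to
reserve the full 600M up front.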