Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
Date: Thu, 1 Sep 2022 10:03:03 -0400	[thread overview]
Message-ID: <YxC7l6cL2GtilEf3@localhost.localdomain> (raw)
In-Reply-To: <246cd5358b28e3e11a96fe2abd0a4a34840cdb85.1662022922.git.fdmanana@suse.com>

On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
> 
> In order to better understand it, lets consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, specially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps take by the main
> loop of the current algorithm are the following:
> 
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>    calls btrfs_get_extent(). This will first lookup for an extent map in
>    the inode's extent map tree (a red black tree). If the extent map is
>    not loaded in memory, then it will do a lookup for the corresponding
>    file extent item in the subvolume's b+tree, create an extent map based
>    on the contents of the file extent item and then add the extent map to
>    the extent map tree of the inode;
> 
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>    with a start offset matching the end offset of the previous extent.
>    Again, btrfs_get_extent() will first search the extent map tree, and
>    if it doesn't find an extent map there, it will again search in the
>    b+tree of the subvolume for a matching file extent item, build an
>    extent map based on the file extent item, and add the extent map to
>    to the extent map tree of the inode;
> 
> 3) This repeats over and over until we find the first hole (when seeking
>    for holes) or until we find the first extent (when seeking for data).
> 
>    If there no extent maps loaded in memory for each iteration, then on
>    each iteration we do 1 extent map tree search, 1 b+tree search, plus
>    1 more extent map tree traversal to insert an extent map - plus we
>    allocate memory for the extent map.
> 
>    On each iteration we are growing the size of the extent map tree,
>    making each future search slower, and also visiting the same b+tree
>    leaves over and over again - taking into account with the default leaf
>    size of 16K we can fit more than 200 file extent items in a leaf - so
>    we can visit the same b+tree leaf 200+ times, on each visit walking
>    down a path from the root to the leaf.
> 
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extents maps only when doing IO (writing or
> reading file data).
> 
> This change implements a new algorithm which scales much better, and
> works like this:
> 
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>    file extent items once and only once;
> 
> 2) For any file extent items found, that don't represent holes or prealloc
>    extents, it will not search the extent map tree - there's no need at
>    all for that - an extent map is just an in-memory representation of a
>    file extent item;
> 
> 3) When a hole is found, or a prealloc extent, it will check if there's
>    delalloc for its range. For this it will search for EXTENT_DELALLOC
>    bits in the inode's io tree and check the extent map tree - this is
>    for accounting for unflushed delalloc and for flushed delalloc (the
>    period between running delalloc and ordered extent completion),
>    respectively. This is similar to what the current implementation does
>    when it finds a hole or prealloc extent, but without creating extent
>    maps and adding them to the extent map tree in case they are not
>    loaded in memory;
> 
> 4) It never allocates extent maps, or adds extent maps to the inode's
>    extent map tree. This not only saves memory and time (from the tree
>    insertions and allocations), but also eliminates the possibility of
>    -ENOMEM due to allocating too many extent maps.
> 
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
> 
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
> 
>     $ cat test-seek-hole.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f $DEV
> 
>     mount -o compress=lzo $DEV $MNT
> 
>     # 16G file -> 131073 compressed extents.
>     xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> 
>     # Leave a 1M hole at file offset 15G.
>     xfs_io -c "fpunch 15G 1M" $MNT/foobar
> 
>     # Unmount and mount again, so that we can test when there's no
>     # metadata cached in memory.
>     umount $MNT
>     mount -o compress=lzo $DEV $MNT
> 
>     # Test seeking for hole from offset 0 (hole is at offset 15G).
> 
>     start=$(date +%s%N)
>     xfs_io -c "seek -h 0" $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>     echo
> 
>     start=$(date +%s%N)
>     xfs_io -c "seek -h 0" $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "Took $dur milliseconds to seek first hole (metadata cached)"
>     echo
> 
>     umount $MNT
> 
> Before this change:
> 
>     $ ./test-seek-hole.sh
>     (...)
>     Whence	Result
>     HOLE	16106127360
>     Took 176 milliseconds to seek first hole (metadata not cached)
> 
>     Whence	Result
>     HOLE	16106127360
>     Took 17 milliseconds to seek first hole (metadata cached)
> 
> After this change:
> 
>     $ ./test-seek-hole.sh
>     (...)
>     Whence	Result
>     HOLE	16106127360
>     Took 43 milliseconds to seek first hole (metadata not cached)
> 
>     Whence	Result
>     HOLE	16106127360
>     Took 13 milliseconds to seek first hole (metadata cached)
> 
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
> 
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
> 
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 406 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	return ret;
>  }
>  
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +

You can move this after the lookup, and then remove the if (em && parts above.
Then all you need to do is in the second if statement return (delalloc_len > 0);
Thanks,

Josef

next prev parent reply	other threads:[~2022-09-01 14:03 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
2022-09-01 13:58   ` Josef Bacik
2022-09-01 21:49   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
2022-09-01 14:03   ` Josef Bacik [this message]
2022-09-01 15:00     ` Filipe Manana
2022-09-02 13:26       ` Josef Bacik
2022-09-01 22:18   ` Qu Wenruo
2022-09-02  8:36     ` Filipe Manana
2022-09-11 22:12   ` Qu Wenruo
2022-09-12  8:38     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
2022-09-01 14:03   ` Josef Bacik
2022-09-01 22:19   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
2022-09-01 14:04   ` Josef Bacik
2022-09-01 22:24   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
2022-09-01 14:06   ` Josef Bacik
2022-09-01 22:38   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
2022-09-01 14:07   ` Josef Bacik
2022-09-01 22:42   ` Qu Wenruo
2022-09-02  8:38     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
2022-09-01 14:08   ` Josef Bacik
2022-09-01 22:45   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
2022-09-01 14:23   ` Josef Bacik
2022-09-01 22:50   ` Qu Wenruo
2022-09-02  8:46     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
2022-09-01 14:26   ` Josef Bacik
2022-09-01 23:01   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
2022-09-01 14:35   ` Josef Bacik
2022-09-01 15:04     ` Filipe Manana
2022-09-02 13:25       ` Josef Bacik
2022-09-01 23:27   ` Qu Wenruo
2022-09-02  8:59     ` Filipe Manana
2022-09-02  9:34       ` Qu Wenruo
2022-09-02  9:41         ` Filipe Manana
2022-09-02  9:50           ` Qu Wenruo
2022-09-02  0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
2022-09-02  8:24   ` Filipe Manana
2022-09-02 11:41     ` Wang Yugui
2022-09-02 11:45     ` Filipe Manana
2022-09-05 14:39       ` Filipe Manana
2022-09-06 16:20 ` David Sterba
2022-09-06 17:13   ` Filipe Manana
2022-09-07  9:12 ` Christoph Hellwig
2022-09-07  9:47   ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YxC7l6cL2GtilEf3@localhost.localdomain \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.