Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/8] btrfs: use an xarray to track open inodes in a root
Date: Thu, 9 May 2024 09:48:38 +0930	[thread overview]
Message-ID: <31c23133-72fd-4844-8b64-583d36d5e61c@gmx.com> (raw)
In-Reply-To: <b0f3124d15d38e7ab8283821a123fcdd36900e29.1715169723.git.fdmanana@suse.com>



在 2024/5/8 21:47, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently we use a red black tree (rb-tree) to track the currently open
> inodes of a root (in struct btrfs_root::inode_tree). This however is not
> very efficient when the number of inodes is large since rb-trees are
> binary trees. For example for 100K open inodes, the tree has a depth of
> 17. Besides that, inserting into the tree requires navigating through it
> and pulling useless cache lines in the process since the red black tree
> nodes are embedded within the btrfs inode - on the other hand, by being
> embedded, it requires no extra memory allocations.
>
> We can improve this by using an xarray instead, which is efficient when
> indices are densely clustered (such as inode numbers), is more cache
> friendly and behaves like a resizable array, with a much better search
> and insertion complexity than a red black tree. This only has one small
> disadvantage which is that insertion will sometimes require allocating
> memory for the xarray - which may fail (not that often since it uses a
> kmem_cache) - but on the other hand we can reduce the btrfs inode
> structure size by 24 bytes (from 1064 down to 1040 bytes) after removing
> the embedded red black tree node, which after the next patches will allow
> to reduce the size of the structure to 1024 bytes, meaning we will be able
> to store 4 inodes per 4K page instead of 3 inodes.
>
> This change does a straighforward change to use an xarray, and results
> in a transaction abort if we can't allocate memory for the xarray when
> creating an inode - but the next patch changes things so that we don't
> need to abort.
>
> Running the following fs_mark test showed some improvements:
>
>      $ cat test.sh
>      #!/bin/bash
>
>      DEV=/dev/nullb0
>      MNT=/mnt/nullb0
>      MOUNT_OPTIONS="-o ssd"
>      FILES=100000
>      THREADS=$(nproc --all)
>
>      echo "performance" | \
>          tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>
>      mkfs.btrfs -f $DEV
>      mount $MOUNT_OPTIONS $DEV $MNT
>
>      OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
>      for ((i = 1; i <= $THREADS; i++)); do
>          OPTS="$OPTS -d $MNT/d$i"
>      done
>
>      fs_mark $OPTS
>
>      umount $MNT
>
> Before this patch:
>
>      FSUse%        Count         Size    Files/sec     App Overhead
>          10      1200000            0      92081.6         12505547
>          16      2400000            0     138222.6         13067072
>          23      3600000            0     148833.1         13290336
>          43      4800000            0      97864.7         13931248
>          53      6000000            0      85597.3         14384313
>
> After this patch:
>
>      FSUse%        Count         Size    Files/sec     App Overhead
>          10      1200000            0      93225.1         12571078
>          16      2400000            0     146720.3         12805007
>          23      3600000            0     160626.4         13073835
>          46      4800000            0     116286.2         13802927
>          53      6000000            0      90087.9         14754892
>
> The test was run with a release kernel config (Debian's default config).
>
> Also capturing the insertion times into the rb tree and into the xarray,
> that is measuring the duration of the old function inode_tree_add() and
> the duration of the new btrfs_add_inode_to_root() function, gave the
> following results (in nanoseconds):
>
> Before this patch, inode_tree_add() execution times:
>
>     Count: 5000000
>     Range:  0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
>     Percentiles:  90th: 1015.000; 95th: 1139.000; 99th: 1397.000
>        0.000 -    7.816:    40 |
>        7.816 -   37.858:   209 |
>       37.858 -  170.278:  6059 |
>      170.278 -  753.961: 2754890 #####################################################
>      753.961 - 3326.728: 2232312 ###########################################
>     3326.728 - 14667.018:  4366 |
>     14667.018 - 64652.943:   852 |
>     64652.943 - 284981.761:   550 |
>     284981.761 - 1256150.914:   221 |
>     1256150.914 - 5536887.000:     7 |
>
> After this patch, btrfs_add_inode_to_root() execution times:
>
>     Count: 5000000
>     Range:  0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
>     Percentiles:  90th: 342.000; 95th: 432.000; 99th: 572.000
>        0.000 -    7.264:   104 |
>        7.264 -   33.145:   352 |
>       33.145 -  140.081: 109606 #
>      140.081 -  581.930: 4840090 #####################################################
>      581.930 - 2407.590: 43532 |
>     2407.590 - 9950.979:  2245 |
>     9950.979 - 41119.278:   514 |
>     41119.278 - 169902.616:   155 |
>     169902.616 - 702018.539:    47 |
>     702018.539 - 2900652.000:     9 |
>
> Average, percentiles, standard deviation, etc, are all much better.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Not familiar with XArray, but reviewing this with the XArray doc indeed
helps me get more used to it.

Thanks,
Qu
> ---
>   fs/btrfs/btrfs_inode.h |   3 -
>   fs/btrfs/ctree.h       |   7 ++-
>   fs/btrfs/disk-io.c     |   6 +-
>   fs/btrfs/inode.c       | 128 ++++++++++++++++-------------------------
>   4 files changed, 58 insertions(+), 86 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 91c994b569f3..e577b9745884 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -155,9 +155,6 @@ struct btrfs_inode {
>   	 */
>   	struct list_head delalloc_inodes;
>
> -	/* node for the red-black tree that links inodes in subvolume root */
> -	struct rb_node rb_node;
> -
>   	unsigned long runtime_flags;
>
>   	/* full 64 bit generation number, struct vfs_inode doesn't have a big
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c03c58246033..aa2568f86dc9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -222,8 +222,11 @@ struct btrfs_root {
>   	struct list_head root_list;
>
>   	spinlock_t inode_lock;
> -	/* red-black tree that keeps track of in-memory inodes */
> -	struct rb_root inode_tree;
> +	/*
> +	 * Xarray that keeps track of in-memory inodes, protected by the lock
> +	 * @inode_lock.
> +	 */
> +	struct xarray inodes;
>
>   	/*
>   	 * Xarray that keeps track of delayed nodes of every inode, protected
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a91a8056758a..ed40fe1db53e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -662,7 +662,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>   	root->free_objectid = 0;
>   	root->nr_delalloc_inodes = 0;
>   	root->nr_ordered_extents = 0;
> -	root->inode_tree = RB_ROOT;
> +	xa_init(&root->inodes);
>   	xa_init(&root->delayed_nodes);
>
>   	btrfs_init_root_block_rsv(root);
> @@ -1854,7 +1854,8 @@ void btrfs_put_root(struct btrfs_root *root)
>   		return;
>
>   	if (refcount_dec_and_test(&root->refs)) {
> -		WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
> +		if (WARN_ON(!xa_empty(&root->inodes)))
> +			xa_destroy(&root->inodes);
>   		WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state));
>   		if (root->anon_dev)
>   			free_anon_bdev(root->anon_dev);
> @@ -1939,7 +1940,6 @@ static int btrfs_init_btree_inode(struct super_block *sb)
>   	inode->i_mapping->a_ops = &btree_aops;
>   	mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
>
> -	RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
>   	extent_io_tree_init(fs_info, &BTRFS_I(inode)->io_tree,
>   			    IO_TREE_BTREE_INODE_IO);
>   	extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d0274324c75a..450fe1582f1d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5493,58 +5493,51 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
>   	return err;
>   }
>
> -static void inode_tree_add(struct btrfs_inode *inode)
> +static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
>   {
>   	struct btrfs_root *root = inode->root;
> -	struct btrfs_inode *entry;
> -	struct rb_node **p;
> -	struct rb_node *parent;
> -	struct rb_node *new = &inode->rb_node;
> -	u64 ino = btrfs_ino(inode);
> +	struct btrfs_inode *existing;
> +	const u64 ino = btrfs_ino(inode);
> +	int ret;
>
>   	if (inode_unhashed(&inode->vfs_inode))
> -		return;
> -	parent = NULL;
> +		return 0;
> +
> +	ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
> +	if (ret)
> +		return ret;
> +
>   	spin_lock(&root->inode_lock);
> -	p = &root->inode_tree.rb_node;
> -	while (*p) {
> -		parent = *p;
> -		entry = rb_entry(parent, struct btrfs_inode, rb_node);
> +	existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
> +	spin_unlock(&root->inode_lock);
>
> -		if (ino < btrfs_ino(entry))
> -			p = &parent->rb_left;
> -		else if (ino > btrfs_ino(entry))
> -			p = &parent->rb_right;
> -		else {
> -			WARN_ON(!(entry->vfs_inode.i_state &
> -				  (I_WILL_FREE | I_FREEING)));
> -			rb_replace_node(parent, new, &root->inode_tree);
> -			RB_CLEAR_NODE(parent);
> -			spin_unlock(&root->inode_lock);
> -			return;
> -		}
> +	if (xa_is_err(existing)) {
> +		ret = xa_err(existing);
> +		ASSERT(ret != -EINVAL);
> +		ASSERT(ret != -ENOMEM);
> +		return ret;
> +	} else if (existing) {
> +		WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
>   	}
> -	rb_link_node(new, parent, p);
> -	rb_insert_color(new, &root->inode_tree);
> -	spin_unlock(&root->inode_lock);
> +
> +	return 0;
>   }
>
> -static void inode_tree_del(struct btrfs_inode *inode)
> +static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
>   {
>   	struct btrfs_root *root = inode->root;
> -	int empty = 0;
> +	struct btrfs_inode *entry;
> +	bool empty = false;
>
>   	spin_lock(&root->inode_lock);
> -	if (!RB_EMPTY_NODE(&inode->rb_node)) {
> -		rb_erase(&inode->rb_node, &root->inode_tree);
> -		RB_CLEAR_NODE(&inode->rb_node);
> -		empty = RB_EMPTY_ROOT(&root->inode_tree);
> -	}
> +	entry = xa_erase(&root->inodes, btrfs_ino(inode));
> +	if (entry == inode)
> +		empty = xa_empty(&root->inodes);
>   	spin_unlock(&root->inode_lock);
>
>   	if (empty && btrfs_root_refs(&root->root_item) == 0) {
>   		spin_lock(&root->inode_lock);
> -		empty = RB_EMPTY_ROOT(&root->inode_tree);
> +		empty = xa_empty(&root->inodes);
>   		spin_unlock(&root->inode_lock);
>   		if (empty)
>   			btrfs_add_dead_root(root);
> @@ -5613,8 +5606,13 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
>
>   		ret = btrfs_read_locked_inode(inode, path);
>   		if (!ret) {
> -			inode_tree_add(BTRFS_I(inode));
> -			unlock_new_inode(inode);
> +			ret = btrfs_add_inode_to_root(BTRFS_I(inode));
> +			if (ret) {
> +				iget_failed(inode);
> +				inode = ERR_PTR(ret);
> +			} else {
> +				unlock_new_inode(inode);
> +			}
>   		} else {
>   			iget_failed(inode);
>   			/*
> @@ -6426,7 +6424,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
>   		}
>   	}
>
> -	inode_tree_add(BTRFS_I(inode));
> +	ret = btrfs_add_inode_to_root(BTRFS_I(inode));
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto discard;
> +	}
>
>   	trace_btrfs_inode_new(inode);
>   	btrfs_set_inode_last_trans(trans, BTRFS_I(inode));
> @@ -8466,7 +8468,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>   	ei->ordered_tree_last = NULL;
>   	INIT_LIST_HEAD(&ei->delalloc_inodes);
>   	INIT_LIST_HEAD(&ei->delayed_iput);
> -	RB_CLEAR_NODE(&ei->rb_node);
>   	init_rwsem(&ei->i_mmap_lock);
>
>   	return inode;
> @@ -8538,7 +8539,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
>   		}
>   	}
>   	btrfs_qgroup_check_reserved_leak(inode);
> -	inode_tree_del(inode);
> +	btrfs_del_inode_from_root(inode);
>   	btrfs_drop_extent_map_range(inode, 0, (u64)-1, false);
>   	btrfs_inode_clear_file_extent_range(inode, 0, (u64)-1);
>   	btrfs_put_root(inode->root);
> @@ -10857,52 +10858,23 @@ void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 en
>    */
>   struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
>   {
> -	struct rb_node *node;
> -	struct rb_node *prev;
>   	struct btrfs_inode *inode;
> +	unsigned long from = min_ino;
>
>   	spin_lock(&root->inode_lock);
> -again:
> -	node = root->inode_tree.rb_node;
> -	prev = NULL;
> -	while (node) {
> -		prev = node;
> -		inode = rb_entry(node, struct btrfs_inode, rb_node);
> -		if (min_ino < btrfs_ino(inode))
> -			node = node->rb_left;
> -		else if (min_ino > btrfs_ino(inode))
> -			node = node->rb_right;
> -		else
> +	while (true) {
> +		inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
> +		if (!inode)
> +			break;
> +		if (igrab(&inode->vfs_inode))
>   			break;
> -	}
> -
> -	if (!node) {
> -		while (prev) {
> -			inode = rb_entry(prev, struct btrfs_inode, rb_node);
> -			if (min_ino <= btrfs_ino(inode)) {
> -				node = prev;
> -				break;
> -			}
> -			prev = rb_next(prev);
> -		}
> -	}
> -
> -	while (node) {
> -		inode = rb_entry(prev, struct btrfs_inode, rb_node);
> -		if (igrab(&inode->vfs_inode)) {
> -			spin_unlock(&root->inode_lock);
> -			return inode;
> -		}
> -
> -		min_ino = btrfs_ino(inode) + 1;
> -		if (cond_resched_lock(&root->inode_lock))
> -			goto again;
>
> -		node = rb_next(node);
> +		from = btrfs_ino(inode) + 1;
> +		cond_resched_lock(&root->inode_lock);
>   	}
>   	spin_unlock(&root->inode_lock);
>
> -	return NULL;
> +	return inode;
>   }
>
>   static const struct inode_operations btrfs_dir_inode_operations = {

  reply	other threads:[~2024-05-09  0:18 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
2024-05-08 12:17 ` [PATCH 1/8] btrfs: use an xarray to track open inodes in a root fdmanana
2024-05-09  0:18   ` Qu Wenruo [this message]
2024-05-08 12:17 ` [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
2024-05-09  0:21   ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
2024-05-09  0:23   ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
2024-05-09  0:25   ` Qu Wenruo
2024-05-09  8:38     ` Filipe Manana
2024-05-09  8:42       ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
2024-05-09  0:30   ` Qu Wenruo
2024-05-09  8:39     ` Filipe Manana
2024-05-08 12:17 ` [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files fdmanana
2024-05-09  0:39   ` Qu Wenruo
2024-05-09  8:41     ` Filipe Manana
2024-05-13 18:39       ` David Sterba
2024-05-08 12:17 ` [PATCH 7/8] btrfs: remove location key from struct btrfs_inode fdmanana
2024-05-08 12:17 ` [PATCH 8/8] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
2024-05-09 17:56 ` [PATCH 0/8] btrfs: inode management and memory consumption improvements David Sterba
2024-05-10 11:04   ` Filipe Manana
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
2024-05-10 17:32   ` [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root fdmanana
2024-05-14 15:49     ` David Sterba
2024-05-10 17:32   ` [PATCH v2 02/10] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
2024-05-10 17:32   ` [PATCH v2 03/10] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
2024-05-10 17:32   ` [PATCH v2 04/10] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
2024-05-10 17:32   ` [PATCH v2 05/10] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
2024-05-10 17:32   ` [PATCH v2 06/10] btrfs: don't allocate file extent tree for non regular files fdmanana
2024-05-10 17:32   ` [PATCH v2 07/10] btrfs: remove location key from struct btrfs_inode fdmanana
2024-05-10 17:32   ` [PATCH v2 08/10] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
2024-05-10 17:32   ` [PATCH v2 09/10] btrfs: rename rb_root member of extent_map_tree from map to root fdmanana
2024-05-10 17:32   ` [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree fdmanana
2024-05-14 15:58     ` David Sterba
2024-05-14 16:08   ` [PATCH v2 00/10] btrfs: inode management and memory consumption improvements David Sterba
2024-05-15 18:28   ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=31c23133-72fd-4844-8b64-583d36d5e61c@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox