* [PATCH 0/8] btrfs: inode management and memory consumption improvements
@ 2024-05-08 12:17 fdmanana
2024-05-08 12:17 ` [PATCH 1/8] btrfs: use an xarray to track open inodes in a root fdmanana
` (9 more replies)
0 siblings, 10 replies; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Some inode related improvements: use an xarray to track open inodes per
root instead of a red black tree, reduce lock contention and use less memory
per btrfs inode, so that we can now fit 4 inodes per 4K page instead of 3.
More details in the change logs.
Filipe Manana (8):
btrfs: use an xarray to track open inodes in a root
btrfs: preallocate inodes xarray entry to avoid transaction abort
btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
btrfs: remove inode_lock from struct btrfs_root and use xarray locks
btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
btrfs: don't allocate file extent tree for non regular files
btrfs: remove location key from struct btrfs_inode
btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
fs/btrfs/btrfs_inode.h | 130 +++++++++++-----
fs/btrfs/ctree.h | 8 +-
fs/btrfs/delayed-inode.c | 27 ++--
fs/btrfs/disk-io.c | 12 +-
fs/btrfs/export.c | 2 +-
fs/btrfs/file-item.c | 13 +-
fs/btrfs/inode.c | 286 +++++++++++++++++------------------
fs/btrfs/ioctl.c | 8 +-
fs/btrfs/relocation.c | 12 +-
fs/btrfs/tests/btrfs-tests.c | 5 +-
fs/btrfs/tree-log.c | 9 +-
11 files changed, 285 insertions(+), 227 deletions(-)
--
2.43.0
* [PATCH 1/8] btrfs: use an xarray to track open inodes in a root
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 0:18 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
` (8 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example, for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1064 down to 1040 bytes) after removing
the embedded red black tree node, which after the next patches will allow
us to reduce the size of the structure to 1024 bytes, meaning we will be
able to store 4 inodes per 4K page instead of 3 inodes.
This patch makes a straightforward switch to an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 92081.6 12505547
16 2400000 0 138222.6 13067072
23 3600000 0 148833.1 13290336
43 4800000 0 97864.7 13931248
53 6000000 0 85597.3 14384313
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 93225.1 12571078
16 2400000 0 146720.3 12805007
23 3600000 0 160626.4 13073835
46 4800000 0 116286.2 13802927
53 6000000 0 90087.9 14754892
The test was run with a release kernel config (Debian's default config).
Also, capturing the insertion times into the rb tree and into the xarray,
that is, measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
The average, percentiles, standard deviation, etc., are all much better.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 3 -
fs/btrfs/ctree.h | 7 ++-
fs/btrfs/disk-io.c | 6 +-
fs/btrfs/inode.c | 128 ++++++++++++++++-------------------------
4 files changed, 58 insertions(+), 86 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 91c994b569f3..e577b9745884 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -155,9 +155,6 @@ struct btrfs_inode {
*/
struct list_head delalloc_inodes;
- /* node for the red-black tree that links inodes in subvolume root */
- struct rb_node rb_node;
-
unsigned long runtime_flags;
/* full 64 bit generation number, struct vfs_inode doesn't have a big
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c03c58246033..aa2568f86dc9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -222,8 +222,11 @@ struct btrfs_root {
struct list_head root_list;
spinlock_t inode_lock;
- /* red-black tree that keeps track of in-memory inodes */
- struct rb_root inode_tree;
+ /*
+ * Xarray that keeps track of in-memory inodes, protected by the lock
+ * @inode_lock.
+ */
+ struct xarray inodes;
/*
* Xarray that keeps track of delayed nodes of every inode, protected
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a91a8056758a..ed40fe1db53e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -662,7 +662,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->free_objectid = 0;
root->nr_delalloc_inodes = 0;
root->nr_ordered_extents = 0;
- root->inode_tree = RB_ROOT;
+ xa_init(&root->inodes);
xa_init(&root->delayed_nodes);
btrfs_init_root_block_rsv(root);
@@ -1854,7 +1854,8 @@ void btrfs_put_root(struct btrfs_root *root)
return;
if (refcount_dec_and_test(&root->refs)) {
- WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
+ if (WARN_ON(!xa_empty(&root->inodes)))
+ xa_destroy(&root->inodes);
WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state));
if (root->anon_dev)
free_anon_bdev(root->anon_dev);
@@ -1939,7 +1940,6 @@ static int btrfs_init_btree_inode(struct super_block *sb)
inode->i_mapping->a_ops = &btree_aops;
mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
- RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
extent_io_tree_init(fs_info, &BTRFS_I(inode)->io_tree,
IO_TREE_BTREE_INODE_IO);
extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d0274324c75a..450fe1582f1d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5493,58 +5493,51 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
return err;
}
-static void inode_tree_add(struct btrfs_inode *inode)
+static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
{
struct btrfs_root *root = inode->root;
- struct btrfs_inode *entry;
- struct rb_node **p;
- struct rb_node *parent;
- struct rb_node *new = &inode->rb_node;
- u64 ino = btrfs_ino(inode);
+ struct btrfs_inode *existing;
+ const u64 ino = btrfs_ino(inode);
+ int ret;
if (inode_unhashed(&inode->vfs_inode))
- return;
- parent = NULL;
+ return 0;
+
+ ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
+ if (ret)
+ return ret;
+
spin_lock(&root->inode_lock);
- p = &root->inode_tree.rb_node;
- while (*p) {
- parent = *p;
- entry = rb_entry(parent, struct btrfs_inode, rb_node);
+ existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
+ spin_unlock(&root->inode_lock);
- if (ino < btrfs_ino(entry))
- p = &parent->rb_left;
- else if (ino > btrfs_ino(entry))
- p = &parent->rb_right;
- else {
- WARN_ON(!(entry->vfs_inode.i_state &
- (I_WILL_FREE | I_FREEING)));
- rb_replace_node(parent, new, &root->inode_tree);
- RB_CLEAR_NODE(parent);
- spin_unlock(&root->inode_lock);
- return;
- }
+ if (xa_is_err(existing)) {
+ ret = xa_err(existing);
+ ASSERT(ret != -EINVAL);
+ ASSERT(ret != -ENOMEM);
+ return ret;
+ } else if (existing) {
+ WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
}
- rb_link_node(new, parent, p);
- rb_insert_color(new, &root->inode_tree);
- spin_unlock(&root->inode_lock);
+
+ return 0;
}
-static void inode_tree_del(struct btrfs_inode *inode)
+static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
{
struct btrfs_root *root = inode->root;
- int empty = 0;
+ struct btrfs_inode *entry;
+ bool empty = false;
spin_lock(&root->inode_lock);
- if (!RB_EMPTY_NODE(&inode->rb_node)) {
- rb_erase(&inode->rb_node, &root->inode_tree);
- RB_CLEAR_NODE(&inode->rb_node);
- empty = RB_EMPTY_ROOT(&root->inode_tree);
- }
+ entry = xa_erase(&root->inodes, btrfs_ino(inode));
+ if (entry == inode)
+ empty = xa_empty(&root->inodes);
spin_unlock(&root->inode_lock);
if (empty && btrfs_root_refs(&root->root_item) == 0) {
spin_lock(&root->inode_lock);
- empty = RB_EMPTY_ROOT(&root->inode_tree);
+ empty = xa_empty(&root->inodes);
spin_unlock(&root->inode_lock);
if (empty)
btrfs_add_dead_root(root);
@@ -5613,8 +5606,13 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
ret = btrfs_read_locked_inode(inode, path);
if (!ret) {
- inode_tree_add(BTRFS_I(inode));
- unlock_new_inode(inode);
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode));
+ if (ret) {
+ iget_failed(inode);
+ inode = ERR_PTR(ret);
+ } else {
+ unlock_new_inode(inode);
+ }
} else {
iget_failed(inode);
/*
@@ -6426,7 +6424,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
}
}
- inode_tree_add(BTRFS_I(inode));
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode));
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto discard;
+ }
trace_btrfs_inode_new(inode);
btrfs_set_inode_last_trans(trans, BTRFS_I(inode));
@@ -8466,7 +8468,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->ordered_tree_last = NULL;
INIT_LIST_HEAD(&ei->delalloc_inodes);
INIT_LIST_HEAD(&ei->delayed_iput);
- RB_CLEAR_NODE(&ei->rb_node);
init_rwsem(&ei->i_mmap_lock);
return inode;
@@ -8538,7 +8539,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
}
}
btrfs_qgroup_check_reserved_leak(inode);
- inode_tree_del(inode);
+ btrfs_del_inode_from_root(inode);
btrfs_drop_extent_map_range(inode, 0, (u64)-1, false);
btrfs_inode_clear_file_extent_range(inode, 0, (u64)-1);
btrfs_put_root(inode->root);
@@ -10857,52 +10858,23 @@ void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 en
*/
struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
{
- struct rb_node *node;
- struct rb_node *prev;
struct btrfs_inode *inode;
+ unsigned long from = min_ino;
spin_lock(&root->inode_lock);
-again:
- node = root->inode_tree.rb_node;
- prev = NULL;
- while (node) {
- prev = node;
- inode = rb_entry(node, struct btrfs_inode, rb_node);
- if (min_ino < btrfs_ino(inode))
- node = node->rb_left;
- else if (min_ino > btrfs_ino(inode))
- node = node->rb_right;
- else
+ while (true) {
+ inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
+ if (!inode)
+ break;
+ if (igrab(&inode->vfs_inode))
break;
- }
-
- if (!node) {
- while (prev) {
- inode = rb_entry(prev, struct btrfs_inode, rb_node);
- if (min_ino <= btrfs_ino(inode)) {
- node = prev;
- break;
- }
- prev = rb_next(prev);
- }
- }
-
- while (node) {
- inode = rb_entry(prev, struct btrfs_inode, rb_node);
- if (igrab(&inode->vfs_inode)) {
- spin_unlock(&root->inode_lock);
- return inode;
- }
-
- min_ino = btrfs_ino(inode) + 1;
- if (cond_resched_lock(&root->inode_lock))
- goto again;
- node = rb_next(node);
+ from = btrfs_ino(inode) + 1;
+ cond_resched_lock(&root->inode_lock);
}
spin_unlock(&root->inode_lock);
- return NULL;
+ return inode;
}
static const struct inode_operations btrfs_dir_inode_operations = {
--
2.43.0
* [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
2024-05-08 12:17 ` [PATCH 1/8] btrfs: use an xarray to track open inodes in a root fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 0:21 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
` (7 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When creating a new inode, at btrfs_create_new_inode(), one of the very
last steps is to add the inode to the root's inodes xarray. This often
requires allocating memory which may fail (even though xarrays have a
dedicated kmem_cache, which makes it less likely to fail), and at that point
we are forced to abort the current transaction (as some, but not all, of
the inode metadata was added to its subvolume btree).
To avoid a transaction abort, preallocate memory for the xarray early at
btrfs_create_new_inode(), so that if we fail we don't need to abort the
transaction and the insertion into the xarray is guaranteed to succeed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 450fe1582f1d..85dbc19c2f6f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5493,7 +5493,7 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
return err;
}
-static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
+static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
{
struct btrfs_root *root = inode->root;
struct btrfs_inode *existing;
@@ -5503,9 +5503,11 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
if (inode_unhashed(&inode->vfs_inode))
return 0;
- ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
- if (ret)
- return ret;
+ if (prealloc) {
+ ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
+ if (ret)
+ return ret;
+ }
spin_lock(&root->inode_lock);
existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
@@ -5606,7 +5608,7 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
ret = btrfs_read_locked_inode(inode, path);
if (!ret) {
- ret = btrfs_add_inode_to_root(BTRFS_I(inode));
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
if (ret) {
iget_failed(inode);
inode = ERR_PTR(ret);
@@ -6237,6 +6239,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
struct btrfs_item_batch batch;
unsigned long ptr;
int ret;
+ bool xa_reserved = false;
path = btrfs_alloc_path();
if (!path)
@@ -6251,6 +6254,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
goto out;
inode->i_ino = objectid;
+ ret = xa_reserve(&root->inodes, objectid, GFP_NOFS);
+ if (ret)
+ goto out;
+ xa_reserved = true;
+
if (args->orphan) {
/*
* O_TMPFILE, set link count to 0, so that after this point, we
@@ -6424,8 +6432,9 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
}
}
- ret = btrfs_add_inode_to_root(BTRFS_I(inode));
- if (ret) {
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode), false);
+ if (WARN_ON(ret)) {
+ /* Shouldn't happen, we used xa_reserve() before. */
btrfs_abort_transaction(trans, ret);
goto discard;
}
@@ -6456,6 +6465,9 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
ihold(inode);
discard_new_inode(inode);
out:
+ if (xa_reserved)
+ xa_release(&root->inodes, objectid);
+
btrfs_free_path(path);
return ret;
}
--
2.43.0
* [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
2024-05-08 12:17 ` [PATCH 1/8] btrfs: use an xarray to track open inodes in a root fdmanana
2024-05-08 12:17 ` [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 0:23 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
` (6 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
if-then-else statements and having an error label to do all the error
handling instead of repeating it a couple of times.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 44 +++++++++++++++++++++-----------------------
1 file changed, 21 insertions(+), 23 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85dbc19c2f6f..8ea9fd4c2b66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5598,37 +5598,35 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
struct btrfs_root *root, struct btrfs_path *path)
{
struct inode *inode;
+ int ret;
inode = btrfs_iget_locked(s, ino, root);
if (!inode)
return ERR_PTR(-ENOMEM);
- if (inode->i_state & I_NEW) {
- int ret;
+ if (!(inode->i_state & I_NEW))
+ return inode;
- ret = btrfs_read_locked_inode(inode, path);
- if (!ret) {
- ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
- if (ret) {
- iget_failed(inode);
- inode = ERR_PTR(ret);
- } else {
- unlock_new_inode(inode);
- }
- } else {
- iget_failed(inode);
- /*
- * ret > 0 can come from btrfs_search_slot called by
- * btrfs_read_locked_inode, this means the inode item
- * was not found.
- */
- if (ret > 0)
- ret = -ENOENT;
- inode = ERR_PTR(ret);
- }
- }
+ ret = btrfs_read_locked_inode(inode, path);
+ /*
+ * ret > 0 can come from btrfs_search_slot called by
+ * btrfs_read_locked_inode(), this means the inode item was not found.
+ */
+ if (ret > 0)
+ ret = -ENOENT;
+ if (ret < 0)
+ goto error;
+
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
+ if (ret < 0)
+ goto error;
+
+ unlock_new_inode(inode);
return inode;
+error:
+ iget_failed(inode);
+ return ERR_PTR(ret);
}
struct inode *btrfs_iget(struct super_block *s, u64 ino, struct btrfs_root *root)
--
2.43.0
* [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (2 preceding siblings ...)
2024-05-08 12:17 ` [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 0:25 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
` (5 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:
1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).
Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, by using the xa_lock() and xa_unlock() APIs and
using the xarray APIs with the double underscore prefix that don't take
the xarray locks and assume the caller is using xa_lock() and xa_unlock().
So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings two benefits:
1) We reduce the size of struct btrfs_root, from 1336 bytes down to
1328 bytes on a 64 bits release kernel config;
2) We reduce lock contention by no longer using the same lock to change
two different and unrelated xarrays.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.h | 1 -
fs/btrfs/delayed-inode.c | 24 +++++++++++-------------
fs/btrfs/disk-io.c | 1 -
fs/btrfs/inode.c | 18 ++++++++----------
4 files changed, 19 insertions(+), 25 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aa2568f86dc9..1004cb934b4a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -221,7 +221,6 @@ struct btrfs_root {
struct list_head root_list;
- spinlock_t inode_lock;
/*
* Xarray that keeps track of in-memory inodes, protected by the lock
* @inode_lock.
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 95a0497fa866..1373f474c9b6 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -77,14 +77,14 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
return node;
}
- spin_lock(&root->inode_lock);
+ xa_lock(&root->delayed_nodes);
node = xa_load(&root->delayed_nodes, ino);
if (node) {
if (btrfs_inode->delayed_node) {
refcount_inc(&node->refs); /* can be accessed */
BUG_ON(btrfs_inode->delayed_node != node);
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return node;
}
@@ -111,10 +111,10 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
node = NULL;
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return node;
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return NULL;
}
@@ -148,21 +148,21 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
kmem_cache_free(delayed_node_cache, node);
return ERR_PTR(-ENOMEM);
}
- spin_lock(&root->inode_lock);
+ xa_lock(&root->delayed_nodes);
ptr = xa_load(&root->delayed_nodes, ino);
if (ptr) {
/* Somebody inserted it, go back and read it. */
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
kmem_cache_free(delayed_node_cache, node);
node = NULL;
goto again;
}
- ptr = xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
+ ptr = __xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
ASSERT(xa_err(ptr) != -EINVAL);
ASSERT(xa_err(ptr) != -ENOMEM);
ASSERT(ptr == NULL);
btrfs_inode->delayed_node = node;
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return node;
}
@@ -275,14 +275,12 @@ static void __btrfs_release_delayed_node(
if (refcount_dec_and_test(&delayed_node->refs)) {
struct btrfs_root *root = delayed_node->root;
- spin_lock(&root->inode_lock);
/*
* Once our refcount goes to zero, nobody is allowed to bump it
* back up. We can delete it now.
*/
ASSERT(refcount_read(&delayed_node->refs) == 0);
xa_erase(&root->delayed_nodes, delayed_node->inode_id);
- spin_unlock(&root->inode_lock);
kmem_cache_free(delayed_node_cache, delayed_node);
}
}
@@ -2057,9 +2055,9 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
struct btrfs_delayed_node *node;
int count;
- spin_lock(&root->inode_lock);
+ xa_lock(&root->delayed_nodes);
if (xa_empty(&root->delayed_nodes)) {
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return;
}
@@ -2076,7 +2074,7 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
if (count >= ARRAY_SIZE(delayed_nodes))
break;
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
index++;
for (int i = 0; i < count; i++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed40fe1db53e..d20e400a9ce3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -674,7 +674,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&root->ordered_extents);
INIT_LIST_HEAD(&root->ordered_root);
INIT_LIST_HEAD(&root->reloc_dirty_list);
- spin_lock_init(&root->inode_lock);
spin_lock_init(&root->delalloc_lock);
spin_lock_init(&root->ordered_extent_lock);
spin_lock_init(&root->accounting_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ea9fd4c2b66..4fd41d6b377f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5509,9 +5509,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
return ret;
}
- spin_lock(&root->inode_lock);
existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
- spin_unlock(&root->inode_lock);
if (xa_is_err(existing)) {
ret = xa_err(existing);
@@ -5531,16 +5529,16 @@ static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
struct btrfs_inode *entry;
bool empty = false;
- spin_lock(&root->inode_lock);
- entry = xa_erase(&root->inodes, btrfs_ino(inode));
+ xa_lock(&root->inodes);
+ entry = __xa_erase(&root->inodes, btrfs_ino(inode));
if (entry == inode)
empty = xa_empty(&root->inodes);
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->inodes);
if (empty && btrfs_root_refs(&root->root_item) == 0) {
- spin_lock(&root->inode_lock);
+ xa_lock(&root->inodes);
empty = xa_empty(&root->inodes);
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->inodes);
if (empty)
btrfs_add_dead_root(root);
}
@@ -10871,7 +10869,7 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
struct btrfs_inode *inode;
unsigned long from = min_ino;
- spin_lock(&root->inode_lock);
+ xa_lock(&root->inodes);
while (true) {
inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
if (!inode)
@@ -10880,9 +10878,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
break;
from = btrfs_ino(inode) + 1;
- cond_resched_lock(&root->inode_lock);
+ cond_resched_lock(&root->inodes.xa_lock);
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->inodes);
return inode;
}
--
2.43.0
* [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (3 preceding siblings ...)
2024-05-08 12:17 ` [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 0:30 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files fdmanana
` (4 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans, for example (amongst others).
Since the csum_bytes field is not used for directories, do the following
changes:
1) Put index_cnt and csum_bytes in a union, with index_cnt only
initialized when the inode is a directory. csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1040 bytes down to 1032 bytes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 46 +++++++++++++++++++++++++---------------
fs/btrfs/delayed-inode.c | 3 ++-
fs/btrfs/inode.c | 21 ++++++++++++------
fs/btrfs/relocation.c | 12 +++++------
fs/btrfs/tree-log.c | 3 ++-
5 files changed, 54 insertions(+), 31 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e577b9745884..19bb3d057414 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -215,11 +215,20 @@ struct btrfs_inode {
u64 last_dir_index_offset;
};
- /*
- * Total number of bytes pending defrag, used by stat to check whether
- * it needs COW. Protected by 'lock'.
- */
- u64 defrag_bytes;
+ union {
+ /*
+ * Total number of bytes pending defrag, used by stat to check whether
+ * it needs COW. Protected by 'lock'.
+ * Used by inodes other than the data relocation inode.
+ */
+ u64 defrag_bytes;
+
+ /*
+ * Logical address of the block group being relocated.
+ * Used only by the data relocation inode.
+ */
+ u64 reloc_block_group_start;
+ };
/*
* The size of the file stored in the metadata on disk. data=ordered
@@ -228,12 +237,21 @@ struct btrfs_inode {
*/
u64 disk_i_size;
- /*
- * If this is a directory then index_cnt is the counter for the index
- * number for new files that are created. For an empty directory, this
- * must be initialized to BTRFS_DIR_START_INDEX.
- */
- u64 index_cnt;
+ union {
+ /*
+ * If this is a directory then index_cnt is the counter for the
+ * index number for new files that are created. For an empty
+ * directory, this must be initialized to BTRFS_DIR_START_INDEX.
+ */
+ u64 index_cnt;
+
+ /*
+ * If this is not a directory, this is the number of bytes
+ * outstanding that are going to need csums. This is used in
+ * ENOSPC accounting. Protected by 'lock'.
+ */
+ u64 csum_bytes;
+ };
/* Cache the directory index number to speed the dir/file remove */
u64 dir_index;
@@ -256,12 +274,6 @@ struct btrfs_inode {
*/
u64 last_reflink_trans;
- /*
- * Number of bytes outstanding that are going to need csums. This is
- * used in ENOSPC accounting. Protected by 'lock'.
- */
- u64 csum_bytes;
-
/* Backwards incompatible flags, lower half of inode_item::flags */
u32 flags;
/* Read-only compatibility flags, upper half of inode_item::flags */
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 1373f474c9b6..e298a44eaf69 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1914,7 +1914,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->i_otime_nsec = btrfs_stack_timespec_nsec(&inode_item->otime);
inode->i_generation = BTRFS_I(inode)->generation;
- BTRFS_I(inode)->index_cnt = (u64)-1;
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = (u64)-1;
mutex_unlock(&delayed_node->mutex);
btrfs_release_delayed_node(delayed_node);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4fd41d6b377f..9b98aa65cc63 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3856,7 +3856,9 @@ static int btrfs_read_locked_inode(struct inode *inode,
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
- BTRFS_I(inode)->index_cnt = (u64)-1;
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = (u64)-1;
+
btrfs_inode_split_flags(btrfs_inode_flags(leaf, inode_item),
&BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
@@ -6268,8 +6270,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
if (ret)
goto out;
}
- /* index_cnt is ignored for everything but a dir. */
- BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
+
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
+
BTRFS_I(inode)->generation = trans->transid;
inode->i_generation = BTRFS_I(inode)->generation;
@@ -8435,8 +8439,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->disk_i_size = 0;
ei->flags = 0;
ei->ro_flags = 0;
+ /*
+ * ->index_cnt will be properly initialized later when creating a new
+ * inode (btrfs_create_new_inode()) or when reading an existing inode
+ * from disk (btrfs_read_locked_inode()).
+ */
ei->csum_bytes = 0;
- ei->index_cnt = (u64)-1;
ei->dir_index = 0;
ei->last_unlink_trans = 0;
ei->last_reflink_trans = 0;
@@ -8511,9 +8519,10 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
if (!S_ISDIR(vfs_inode->i_mode)) {
WARN_ON(inode->delalloc_bytes);
WARN_ON(inode->new_delalloc_bytes);
+ WARN_ON(inode->csum_bytes);
}
- WARN_ON(inode->csum_bytes);
- WARN_ON(inode->defrag_bytes);
+ if (!root || !btrfs_is_data_reloc_root(root))
+ WARN_ON(inode->defrag_bytes);
/*
* This can happen where we create an inode, but somebody else also
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8b24bb5a0aa1..9f35524b6664 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -962,7 +962,7 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
if (!path)
return -ENOMEM;
- bytenr -= BTRFS_I(reloc_inode)->index_cnt;
+ bytenr -= BTRFS_I(reloc_inode)->reloc_block_group_start;
ret = btrfs_lookup_file_extent(NULL, root, path,
btrfs_ino(BTRFS_I(reloc_inode)), bytenr, 0);
if (ret < 0)
@@ -2797,7 +2797,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
u64 alloc_hint = 0;
u64 start;
u64 end;
- u64 offset = inode->index_cnt;
+ u64 offset = inode->reloc_block_group_start;
u64 num_bytes;
int nr;
int ret = 0;
@@ -2951,7 +2951,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
int *cluster_nr, unsigned long index)
{
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
- u64 offset = BTRFS_I(inode)->index_cnt;
+ u64 offset = BTRFS_I(inode)->reloc_block_group_start;
const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
struct folio *folio;
@@ -3086,7 +3086,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
static int relocate_file_extent_cluster(struct inode *inode,
const struct file_extent_cluster *cluster)
{
- u64 offset = BTRFS_I(inode)->index_cnt;
+ u64 offset = BTRFS_I(inode)->reloc_block_group_start;
unsigned long index;
unsigned long last_index;
struct file_ra_state *ra;
@@ -3915,7 +3915,7 @@ static noinline_for_stack struct inode *create_reloc_inode(
inode = NULL;
goto out;
}
- BTRFS_I(inode)->index_cnt = group->start;
+ BTRFS_I(inode)->reloc_block_group_start = group->start;
ret = btrfs_orphan_add(trans, BTRFS_I(inode));
out:
@@ -4395,7 +4395,7 @@ int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered)
{
struct btrfs_inode *inode = BTRFS_I(ordered->inode);
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 disk_bytenr = ordered->file_offset + inode->index_cnt;
+ u64 disk_bytenr = ordered->file_offset + inode->reloc_block_group_start;
struct btrfs_root *csum_root = btrfs_csum_root(fs_info, disk_bytenr);
LIST_HEAD(list);
int ret;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5146387b416b..0aee43466c52 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1625,7 +1625,8 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
if (ret)
goto out;
}
- BTRFS_I(inode)->index_cnt = (u64)-1;
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = (u64)-1;
if (inode->i_nlink == 0) {
if (S_ISDIR(inode->i_mode)) {
--
2.43.0
* [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (4 preceding siblings ...)
2024-05-08 12:17 ` [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 0:39 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 7/8] btrfs: remove location key from struct btrfs_inode fdmanana
` (3 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When not using the NO_HOLES feature we always allocate an io tree for an
inode's file_extent_tree. This is wasteful because that io tree is only
used for regular files, so we allocate more memory than needed for inodes
that represent directories or symlinks, for example, or for free space
inodes.
So improve on this by allocating the io tree only for inodes of regular
files that are not free space inodes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/file-item.c | 13 ++++++-----
fs/btrfs/inode.c | 53 +++++++++++++++++++++++++++++---------------
2 files changed, 42 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index bce95f871750..f3ed78e21fa4 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -45,13 +45,12 @@
*/
void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 start, end, i_size;
int ret;
spin_lock(&inode->lock);
i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
- if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
+ if (!inode->file_extent_tree) {
inode->disk_i_size = i_size;
goto out_unlock;
}
@@ -84,13 +83,14 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
u64 len)
{
+ if (!inode->file_extent_tree)
+ return 0;
+
if (len == 0)
return 0;
ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));
- if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
- return 0;
return set_extent_bit(inode->file_extent_tree, start, start + len - 1,
EXTENT_DIRTY, NULL);
}
@@ -112,14 +112,15 @@ int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
u64 len)
{
+ if (!inode->file_extent_tree)
+ return 0;
+
if (len == 0)
return 0;
ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
len == (u64)-1);
- if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
- return 0;
return clear_extent_bit(inode->file_extent_tree, start,
start + len - 1, EXTENT_DIRTY, NULL);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b98aa65cc63..175fd007f0ef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3781,6 +3781,30 @@ static noinline int acls_after_inode_item(struct extent_buffer *leaf,
return 1;
}
+static int btrfs_init_file_extent_tree(struct btrfs_inode *inode)
+{
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+ if (WARN_ON_ONCE(inode->file_extent_tree))
+ return 0;
+ if (btrfs_fs_incompat(fs_info, NO_HOLES))
+ return 0;
+ if (!S_ISREG(inode->vfs_inode.i_mode))
+ return 0;
+ if (btrfs_is_free_space_inode(inode))
+ return 0;
+
+ inode->file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
+ if (!inode->file_extent_tree)
+ return -ENOMEM;
+
+ extent_io_tree_init(fs_info, inode->file_extent_tree, IO_TREE_INODE_FILE_EXTENT);
+ /* Lockdep class is set only for the file extent tree. */
+ lockdep_set_class(&inode->file_extent_tree->lock, &file_extent_tree_class);
+
+ return 0;
+}
+
/*
* read an inode from the btree into the in-memory inode
*/
@@ -3800,6 +3824,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
bool filled = false;
int first_xattr_slot;
+ ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
+ if (ret)
+ return ret;
+
ret = btrfs_fill_inode(inode, &rdev);
if (!ret)
filled = true;
@@ -6247,6 +6275,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
BTRFS_I(inode)->root = btrfs_grab_root(BTRFS_I(dir)->root);
root = BTRFS_I(inode)->root;
+ ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
+ if (ret)
+ goto out;
+
ret = btrfs_get_free_objectid(root, &objectid);
if (ret)
goto out;
@@ -8413,20 +8445,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
struct btrfs_inode *ei;
struct inode *inode;
- struct extent_io_tree *file_extent_tree = NULL;
-
- /* Self tests may pass a NULL fs_info. */
- if (fs_info && !btrfs_fs_incompat(fs_info, NO_HOLES)) {
- file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
- if (!file_extent_tree)
- return NULL;
- }
ei = alloc_inode_sb(sb, btrfs_inode_cachep, GFP_KERNEL);
- if (!ei) {
- kfree(file_extent_tree);
+ if (!ei)
return NULL;
- }
ei->root = NULL;
ei->generation = 0;
@@ -8471,13 +8493,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO);
ei->io_tree.inode = ei;
- ei->file_extent_tree = file_extent_tree;
- if (file_extent_tree) {
- extent_io_tree_init(fs_info, ei->file_extent_tree,
- IO_TREE_INODE_FILE_EXTENT);
- /* Lockdep class is set only for the file extent tree. */
- lockdep_set_class(&ei->file_extent_tree->lock, &file_extent_tree_class);
- }
+ ei->file_extent_tree = NULL;
+
mutex_init(&ei->log_mutex);
spin_lock_init(&ei->ordered_tree_lock);
ei->ordered_tree = RB_ROOT;
--
2.43.0
* [PATCH 7/8] btrfs: remove location key from struct btrfs_inode
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (5 preceding siblings ...)
2024-05-08 12:17 ` [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-08 12:17 ` [PATCH 8/8] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
` (2 subsequent siblings)
9 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently struct btrfs_inode has a key member, named "location", that is
either:
1) The key of the inode's item. In this case the objectid is the number
of the inode;
2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
the case where we have a root that is a snapshot of a subvolume that
points to other subvolumes. In this case the objectid is the ID of
a subvolume inside the snapshotted parent subvolume.
The key is only used to look up the inode item for the first case, while
for the second it's never used, since it corresponds to directory stubs
created with new_simple_dir() and marked as dummy, so there's no actual
inode item to ever update. In the second case we only check the key type
at btrfs_ino() for 32 bits platforms, and its objectid is only needed for
unlink.
Instead of using a key we can do fine with just the objectid, since we
can generate the key from the objectid whenever we need it, as in all
use cases the type is always BTRFS_INODE_ITEM_KEY and the offset is
always 0.
So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1032 bytes down to 1024 bytes on a release kernel,
meaning we can now have 4 inodes per 4K page instead of 3, using fewer
pages for inodes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 47 +++++++++++++++++++++++++++++++-----
fs/btrfs/disk-io.c | 4 +--
fs/btrfs/export.c | 2 +-
fs/btrfs/inode.c | 25 +++++++++----------
fs/btrfs/ioctl.c | 8 +++---
fs/btrfs/tests/btrfs-tests.c | 4 +--
fs/btrfs/tree-log.c | 6 +++--
7 files changed, 63 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 19bb3d057414..fa2f91396ae0 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -89,6 +89,29 @@ enum {
BTRFS_INODE_FREE_SPACE_INODE,
/* Set when there are no capabilities in XATTs for the inode. */
BTRFS_INODE_NO_CAP_XATTR,
+ /*
+ * Indicate this is a directory that points to a subvolume for which
+ * there is no root reference item. That's a case like the following:
+ *
+ * $ btrfs subvolume create /mnt/parent
+ * $ btrfs subvolume create /mnt/parent/child
+ * $ btrfs subvolume snapshot /mnt/parent /mnt/snap
+ *
+ * If subvolume "parent" is root 256, subvolume "child" is root 257 and
+ * snapshot "snap" is root 258, then there's no root reference item (key
+ * BTRFS_ROOT_REF_KEY in the root tree) for the subvolume "child"
+ * associated to root 258 (the snapshot) - there's only for the root
+ * of the "parent" subvolume (root 256). In the root tree we have a
+ * (256 BTRFS_ROOT_REF_KEY 257) key but we don't have a
+ * (258 BTRFS_ROOT_REF_KEY 257) key - the same goes for backrefs, we
+ * have a (257 BTRFS_ROOT_BACKREF_KEY 256) but we don't have a
+ * (257 BTRFS_ROOT_BACKREF_KEY 258) key.
+ *
+ * So when opening the "child" dentry from the snapshot's directory,
+ * we don't find a root ref item and we create a stub inode. This is
+ * done at new_simple_dir(), called from btrfs_lookup_dentry().
+ */
+ BTRFS_INODE_ROOT_STUB,
};
/* in memory btrfs inode */
@@ -96,10 +119,15 @@ struct btrfs_inode {
/* which subvolume this inode belongs to */
struct btrfs_root *root;
- /* key used to find this inode on disk. This is used by the code
- * to read in roots of subvolumes
+ /*
+ * This is either:
+ *
+ * 1) The objectid of the corresponding BTRFS_INODE_ITEM_KEY;
+ *
+ * 2) In case this is a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
+ * the ID of that root.
*/
- struct btrfs_key location;
+ u64 objectid;
/* Cached value of inode property 'compression'. */
u8 prop_compress;
@@ -330,10 +358,9 @@ static inline unsigned long btrfs_inode_hash(u64 objectid,
*/
static inline u64 btrfs_ino(const struct btrfs_inode *inode)
{
- u64 ino = inode->location.objectid;
+ u64 ino = inode->objectid;
- /* type == BTRFS_ROOT_ITEM_KEY: subvol dir */
- if (inode->location.type == BTRFS_ROOT_ITEM_KEY)
+ if (test_bit(BTRFS_INODE_ROOT_STUB, &inode->runtime_flags))
ino = inode->vfs_inode.i_ino;
return ino;
}
@@ -347,6 +374,14 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
#endif
+static inline void btrfs_get_inode_key(const struct btrfs_inode *inode,
+ struct btrfs_key *key)
+{
+ key->objectid = inode->objectid;
+ key->type = BTRFS_INODE_ITEM_KEY;
+ key->offset = 0;
+}
+
static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
{
i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d20e400a9ce3..e3edaf510108 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1944,9 +1944,7 @@ static int btrfs_init_btree_inode(struct super_block *sb)
extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root);
- BTRFS_I(inode)->location.objectid = BTRFS_BTREE_INODE_OBJECTID;
- BTRFS_I(inode)->location.type = 0;
- BTRFS_I(inode)->location.offset = 0;
+ BTRFS_I(inode)->objectid = BTRFS_BTREE_INODE_OBJECTID;
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
__insert_inode_hash(inode, hash);
fs_info->btree_inode = inode;
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 9e81f89e76d8..5526e25ebb3f 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -40,7 +40,7 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
if (parent) {
u64 parent_root_id;
- fid->parent_objectid = BTRFS_I(parent)->location.objectid;
+ fid->parent_objectid = BTRFS_I(parent)->objectid;
fid->parent_gen = parent->i_generation;
parent_root_id = btrfs_root_id(BTRFS_I(parent)->root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 175fd007f0ef..44dc82ff96db 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3838,7 +3838,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
return -ENOMEM;
}
- memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
+ btrfs_get_inode_key(BTRFS_I(inode), &location);
ret = btrfs_lookup_inode(NULL, root, path, &location, 0);
if (ret) {
@@ -4068,13 +4068,15 @@ static noinline int btrfs_update_inode_item(struct btrfs_trans_handle *trans,
struct btrfs_inode_item *inode_item;
struct btrfs_path *path;
struct extent_buffer *leaf;
+ struct btrfs_key key;
int ret;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- ret = btrfs_lookup_inode(trans, inode->root, path, &inode->location, 1);
+ btrfs_get_inode_key(inode, &key);
+ ret = btrfs_lookup_inode(trans, inode->root, path, &key, 1);
if (ret) {
if (ret > 0)
ret = -ENOENT;
@@ -4338,7 +4340,7 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
if (btrfs_ino(inode) == BTRFS_FIRST_FREE_OBJECTID) {
objectid = btrfs_root_id(inode->root);
} else if (btrfs_ino(inode) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
- objectid = inode->location.objectid;
+ objectid = inode->objectid;
} else {
WARN_ON(1);
fscrypt_free_filename(&fname);
@@ -5580,9 +5582,7 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
struct btrfs_iget_args *args = p;
inode->i_ino = args->ino;
- BTRFS_I(inode)->location.objectid = args->ino;
- BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
- BTRFS_I(inode)->location.offset = 0;
+ BTRFS_I(inode)->objectid = args->ino;
BTRFS_I(inode)->root = btrfs_grab_root(args->root);
if (args->root && args->root == args->root->fs_info->tree_root &&
@@ -5596,7 +5596,7 @@ static int btrfs_find_actor(struct inode *inode, void *opaque)
{
struct btrfs_iget_args *args = opaque;
- return args->ino == BTRFS_I(inode)->location.objectid &&
+ return args->ino == BTRFS_I(inode)->objectid &&
args->root == BTRFS_I(inode)->root;
}
@@ -5673,7 +5673,8 @@ static struct inode *new_simple_dir(struct inode *dir,
return ERR_PTR(-ENOMEM);
BTRFS_I(inode)->root = btrfs_grab_root(root);
- memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
+ BTRFS_I(inode)->objectid = key->objectid;
+ set_bit(BTRFS_INODE_ROOT_STUB, &BTRFS_I(inode)->runtime_flags);
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
@@ -6149,7 +6150,7 @@ static int btrfs_insert_inode_locked(struct inode *inode)
{
struct btrfs_iget_args args;
- args.ino = BTRFS_I(inode)->location.objectid;
+ args.ino = BTRFS_I(inode)->objectid;
args.root = BTRFS_I(inode)->root;
return insert_inode_locked4(inode,
@@ -6256,7 +6257,6 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info = inode_to_fs_info(dir);
struct btrfs_root *root;
struct btrfs_inode_item *inode_item;
- struct btrfs_key *location;
struct btrfs_path *path;
u64 objectid;
struct btrfs_inode_ref *ref;
@@ -6332,10 +6332,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
BTRFS_INODE_NODATASUM;
}
- location = &BTRFS_I(inode)->location;
- location->objectid = objectid;
- location->offset = 0;
- location->type = BTRFS_INODE_ITEM_KEY;
+ BTRFS_I(inode)->objectid = objectid;
ret = btrfs_insert_inode_locked(inode);
if (ret < 0) {
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index efd5d6e9589e..2de8f06523b9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1918,7 +1918,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
{
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct super_block *sb = inode->i_sb;
- struct btrfs_key upper_limit = BTRFS_I(inode)->location;
+ u64 upper_limit = BTRFS_I(inode)->objectid;
u64 treeid = btrfs_root_id(BTRFS_I(inode)->root);
u64 dirid = args->dirid;
unsigned long item_off;
@@ -1944,7 +1944,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
* If the bottom subvolume does not exist directly under upper_limit,
* construct the path in from the bottom up.
*/
- if (dirid != upper_limit.objectid) {
+ if (dirid != upper_limit) {
ptr = &args->path[BTRFS_INO_LOOKUP_USER_PATH_MAX - 1];
root = btrfs_get_fs_root(fs_info, treeid, true);
@@ -2019,7 +2019,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
goto out_put;
}
- if (key.offset == upper_limit.objectid)
+ if (key.offset == upper_limit)
break;
if (key.objectid == BTRFS_FIRST_FREE_OBJECTID) {
ret = -EACCES;
@@ -2140,7 +2140,7 @@ static int btrfs_ioctl_ino_lookup_user(struct file *file, void __user *argp)
inode = file_inode(file);
if (args->dirid == BTRFS_FIRST_FREE_OBJECTID &&
- BTRFS_I(inode)->location.objectid != BTRFS_FIRST_FREE_OBJECTID) {
+ BTRFS_I(inode)->objectid != BTRFS_FIRST_FREE_OBJECTID) {
/*
* The subvolume does not exist under fd with which this is
* called
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index dce0387ef155..b28a79935d8e 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -62,9 +62,7 @@ struct inode *btrfs_new_test_inode(void)
inode->i_mode = S_IFREG;
inode->i_ino = BTRFS_FIRST_FREE_OBJECTID;
- BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
- BTRFS_I(inode)->location.objectid = BTRFS_FIRST_FREE_OBJECTID;
- BTRFS_I(inode)->location.offset = 0;
+ BTRFS_I(inode)->objectid = BTRFS_FIRST_FREE_OBJECTID;
inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG);
return inode;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 0aee43466c52..2e762b89d4a2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4235,8 +4235,10 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, bool inode_item_dropped)
{
struct btrfs_inode_item *inode_item;
+ struct btrfs_key key;
int ret;
+ btrfs_get_inode_key(inode, &key);
/*
* If we are doing a fast fsync and the inode was logged before in the
* current transaction, then we know the inode was previously logged and
@@ -4248,7 +4250,7 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
* already exists can also result in unnecessarily splitting a leaf.
*/
if (!inode_item_dropped && inode->logged_trans == trans->transid) {
- ret = btrfs_search_slot(trans, log, &inode->location, path, 0, 1);
+ ret = btrfs_search_slot(trans, log, &key, path, 0, 1);
ASSERT(ret <= 0);
if (ret > 0)
ret = -ENOENT;
@@ -4262,7 +4264,7 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
* the inode, we set BTRFS_INODE_NEEDS_FULL_SYNC on its runtime
* flags and set ->logged_trans to 0.
*/
- ret = btrfs_insert_empty_item(trans, log, path, &inode->location,
+ ret = btrfs_insert_empty_item(trans, log, path, &key,
sizeof(*inode_item));
ASSERT(ret != -EEXIST);
}
--
2.43.0
* [PATCH 8/8] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (6 preceding siblings ...)
2024-05-08 12:17 ` [PATCH 7/8] btrfs: remove location key from struct btrfs_inode fdmanana
@ 2024-05-08 12:17 ` fdmanana
2024-05-09 17:56 ` [PATCH 0/8] btrfs: inode management and memory consumption improvements David Sterba
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
9 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-08 12:17 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number, since we store it in the vfs
inode's i_ino member, which is an unsigned long and therefore 64 bits
wide on 64 bits platforms. We only need that field when we are on a 32
bits platform, because the unsigned long type is only 32 bits wide there.
See commit 33345d01522f ("Btrfs: Always use 64bit inode number") regarding
this 64/32 bits detail.
The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.
So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the vfs inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
an inode is a directory and reflinks can't be done against directories.
So remove the objectid field on 64 bits platforms and alias the
last_reflink_trans field with the name ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1024 bytes down to 1016 bytes, giving an 8 bytes slack space for
future expansion without decreasing the number of inodes we can have
per page (4 with a 4K page, 64 with a 64K page).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 50 ++++++++++++++++++++++++------------
fs/btrfs/disk-io.c | 3 +--
fs/btrfs/export.c | 2 +-
fs/btrfs/inode.c | 17 +++++-------
fs/btrfs/ioctl.c | 4 +--
fs/btrfs/tests/btrfs-tests.c | 3 +--
6 files changed, 45 insertions(+), 34 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fa2f91396ae0..4d9299789a03 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -119,15 +119,14 @@ struct btrfs_inode {
/* which subvolume this inode belongs to */
struct btrfs_root *root;
+#if BITS_PER_LONG == 32
/*
- * This is either:
- *
- * 1) The objectid of the corresponding BTRFS_INODE_ITEM_KEY;
- *
- * 2) In case this a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
- * the ID of that root.
+ * The objectid of the corresponding BTRFS_INODE_ITEM_KEY.
+ * On 64 bits platforms we can get it from vfs_inode.i_ino, which is an
+ * unsigned long and therefore 64 bits on such platforms.
*/
u64 objectid;
+#endif
/* Cached value of inode property 'compression'. */
u8 prop_compress;
@@ -291,16 +290,25 @@ struct btrfs_inode {
*/
u64 last_unlink_trans;
- /*
- * The id/generation of the last transaction where this inode was
- * either the source or the destination of a clone/dedupe operation.
- * Used when logging an inode to know if there are shared extents that
- * need special care when logging checksum items, to avoid duplicate
- * checksum items in a log (which can lead to a corruption where we end
- * up with missing checksum ranges after log replay).
- * Protected by the vfs inode lock.
- */
- u64 last_reflink_trans;
+ union {
+ /*
+ * The id/generation of the last transaction where this inode
+ * was either the source or the destination of a clone/dedupe
+ * operation. Used when logging an inode to know if there are
+ * shared extents that need special care when logging checksum
+ * items, to avoid duplicate checksum items in a log (which can
+ * lead to a corruption where we end up with missing checksum
+ * ranges after log replay). Protected by the vfs inode lock.
+ * Used for regular files only.
+ */
+ u64 last_reflink_trans;
+
+ /*
+ * In case this is a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
+ * the ID of that root.
+ */
+ u64 ref_root_id;
+ };
/* Backwards incompatible flags, lower half of inode_item::flags */
u32 flags;
@@ -377,11 +385,19 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
static inline void btrfs_get_inode_key(const struct btrfs_inode *inode,
struct btrfs_key *key)
{
- key->objectid = inode->objectid;
+ key->objectid = btrfs_ino(inode);
key->type = BTRFS_INODE_ITEM_KEY;
key->offset = 0;
}
+static inline void btrfs_set_inode_number(struct btrfs_inode *inode, u64 ino)
+{
+#if BITS_PER_LONG == 32
+ inode->objectid = ino;
+#endif
+ inode->vfs_inode.i_ino = ino;
+}
+
static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
{
i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e3edaf510108..e6bf895b3547 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1928,7 +1928,7 @@ static int btrfs_init_btree_inode(struct super_block *sb)
if (!inode)
return -ENOMEM;
- inode->i_ino = BTRFS_BTREE_INODE_OBJECTID;
+ btrfs_set_inode_number(BTRFS_I(inode), BTRFS_BTREE_INODE_OBJECTID);
set_nlink(inode, 1);
/*
* we set the i_size on the btree inode to the max possible int.
@@ -1944,7 +1944,6 @@ static int btrfs_init_btree_inode(struct super_block *sb)
extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root);
- BTRFS_I(inode)->objectid = BTRFS_BTREE_INODE_OBJECTID;
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
__insert_inode_hash(inode, hash);
fs_info->btree_inode = inode;
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 5526e25ebb3f..5da56e21ff73 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -40,7 +40,7 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
if (parent) {
u64 parent_root_id;
- fid->parent_objectid = BTRFS_I(parent)->objectid;
+ fid->parent_objectid = btrfs_ino(BTRFS_I(parent));
fid->parent_gen = parent->i_generation;
parent_root_id = btrfs_root_id(BTRFS_I(parent)->root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 44dc82ff96db..5a1014122088 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4340,7 +4340,7 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
if (btrfs_ino(inode) == BTRFS_FIRST_FREE_OBJECTID) {
objectid = btrfs_root_id(inode->root);
} else if (btrfs_ino(inode) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
- objectid = inode->objectid;
+ objectid = inode->ref_root_id;
} else {
WARN_ON(1);
fscrypt_free_filename(&fname);
@@ -5581,8 +5581,7 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
{
struct btrfs_iget_args *args = p;
- inode->i_ino = args->ino;
- BTRFS_I(inode)->objectid = args->ino;
+ btrfs_set_inode_number(BTRFS_I(inode), args->ino);
BTRFS_I(inode)->root = btrfs_grab_root(args->root);
if (args->root && args->root == args->root->fs_info->tree_root &&
@@ -5596,7 +5595,7 @@ static int btrfs_find_actor(struct inode *inode, void *opaque)
{
struct btrfs_iget_args *args = opaque;
- return args->ino == BTRFS_I(inode)->objectid &&
+ return args->ino == btrfs_ino(BTRFS_I(inode)) &&
args->root == BTRFS_I(inode)->root;
}
@@ -5673,11 +5672,11 @@ static struct inode *new_simple_dir(struct inode *dir,
return ERR_PTR(-ENOMEM);
BTRFS_I(inode)->root = btrfs_grab_root(root);
- BTRFS_I(inode)->objectid = key->objectid;
+ BTRFS_I(inode)->ref_root_id = key->objectid;
set_bit(BTRFS_INODE_ROOT_STUB, &BTRFS_I(inode)->runtime_flags);
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
- inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
+ btrfs_set_inode_number(BTRFS_I(inode), BTRFS_EMPTY_SUBVOL_DIR_OBJECTID);
/*
* We only need lookup, the rest is read-only and there's no inode
* associated with the dentry
@@ -6150,7 +6149,7 @@ static int btrfs_insert_inode_locked(struct inode *inode)
{
struct btrfs_iget_args args;
- args.ino = BTRFS_I(inode)->objectid;
+ args.ino = btrfs_ino(BTRFS_I(inode));
args.root = BTRFS_I(inode)->root;
return insert_inode_locked4(inode,
@@ -6282,7 +6281,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
ret = btrfs_get_free_objectid(root, &objectid);
if (ret)
goto out;
- inode->i_ino = objectid;
+ btrfs_set_inode_number(BTRFS_I(inode), objectid);
ret = xa_reserve(&root->inodes, objectid, GFP_NOFS);
if (ret)
@@ -6332,8 +6331,6 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
BTRFS_INODE_NODATASUM;
}
- BTRFS_I(inode)->objectid = objectid;
-
ret = btrfs_insert_inode_locked(inode);
if (ret < 0) {
if (!args->orphan)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2de8f06523b9..1d343cd2435b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1918,7 +1918,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
{
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct super_block *sb = inode->i_sb;
- u64 upper_limit = BTRFS_I(inode)->objectid;
+ u64 upper_limit = btrfs_ino(BTRFS_I(inode));
u64 treeid = btrfs_root_id(BTRFS_I(inode)->root);
u64 dirid = args->dirid;
unsigned long item_off;
@@ -2140,7 +2140,7 @@ static int btrfs_ioctl_ino_lookup_user(struct file *file, void __user *argp)
inode = file_inode(file);
if (args->dirid == BTRFS_FIRST_FREE_OBJECTID &&
- BTRFS_I(inode)->objectid != BTRFS_FIRST_FREE_OBJECTID) {
+ btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) {
/*
* The subvolume does not exist under fd with which this is
* called
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index b28a79935d8e..ce50847e1e01 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -61,8 +61,7 @@ struct inode *btrfs_new_test_inode(void)
return NULL;
inode->i_mode = S_IFREG;
- inode->i_ino = BTRFS_FIRST_FREE_OBJECTID;
- BTRFS_I(inode)->objectid = BTRFS_FIRST_FREE_OBJECTID;
+ btrfs_set_inode_number(BTRFS_I(inode), BTRFS_FIRST_FREE_OBJECTID);
inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG);
return inode;
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 1/8] btrfs: use an xarray to track open inodes in a root
2024-05-08 12:17 ` [PATCH 1/8] btrfs: use an xarray to track open inodes in a root fdmanana
@ 2024-05-09 0:18 ` Qu Wenruo
0 siblings, 0 replies; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 0:18 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently we use a red black tree (rb-tree) to track the currently open
> inodes of a root (in struct btrfs_root::inode_tree). This however is not
> very efficient when the number of inodes is large since rb-trees are
> binary trees. For example for 100K open inodes, the tree has a depth of
> 17. Besides that, inserting into the tree requires navigating through it
> and pulling useless cache lines in the process since the red black tree
> nodes are embedded within the btrfs inode - on the other hand, by being
> embedded, it requires no extra memory allocations.
>
> We can improve this by using an xarray instead, which is efficient when
> indices are densely clustered (such as inode numbers), is more cache
> friendly and behaves like a resizable array, with a much better search
> and insertion complexity than a red black tree. This only has one small
> disadvantage which is that insertion will sometimes require allocating
> memory for the xarray - which may fail (not that often since it uses a
> kmem_cache) - but on the other hand we can reduce the btrfs inode
> structure size by 24 bytes (from 1064 down to 1040 bytes) after removing
> the embedded red black tree node, which after the next patches will allow
> to reduce the size of the structure to 1024 bytes, meaning we will be able
> to store 4 inodes per 4K page instead of 3 inodes.
>
> This change is a straightforward switch to an xarray, and results
> in a transaction abort if we can't allocate memory for the xarray when
> creating an inode - but the next patch changes things so that we don't
> need to abort.
>
> Running the following fs_mark test showed some improvements:
>
> $ cat test.sh
> #!/bin/bash
>
> DEV=/dev/nullb0
> MNT=/mnt/nullb0
> MOUNT_OPTIONS="-o ssd"
> FILES=100000
> THREADS=$(nproc --all)
>
> echo "performance" | \
> tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>
> mkfs.btrfs -f $DEV
> mount $MOUNT_OPTIONS $DEV $MNT
>
> OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
> for ((i = 1; i <= $THREADS; i++)); do
> OPTS="$OPTS -d $MNT/d$i"
> done
>
> fs_mark $OPTS
>
> umount $MNT
>
> Before this patch:
>
> FSUse% Count Size Files/sec App Overhead
> 10 1200000 0 92081.6 12505547
> 16 2400000 0 138222.6 13067072
> 23 3600000 0 148833.1 13290336
> 43 4800000 0 97864.7 13931248
> 53 6000000 0 85597.3 14384313
>
> After this patch:
>
> FSUse% Count Size Files/sec App Overhead
> 10 1200000 0 93225.1 12571078
> 16 2400000 0 146720.3 12805007
> 23 3600000 0 160626.4 13073835
> 46 4800000 0 116286.2 13802927
> 53 6000000 0 90087.9 14754892
>
> The test was run with a release kernel config (Debian's default config).
>
> Also capturing the insertion times into the rb tree and into the xarray,
> that is measuring the duration of the old function inode_tree_add() and
> the duration of the new btrfs_add_inode_to_root() function, gave the
> following results (in nanoseconds):
>
> Before this patch, inode_tree_add() execution times:
>
> Count: 5000000
> Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
> Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
> 0.000 - 7.816: 40 |
> 7.816 - 37.858: 209 |
> 37.858 - 170.278: 6059 |
> 170.278 - 753.961: 2754890 #####################################################
> 753.961 - 3326.728: 2232312 ###########################################
> 3326.728 - 14667.018: 4366 |
> 14667.018 - 64652.943: 852 |
> 64652.943 - 284981.761: 550 |
> 284981.761 - 1256150.914: 221 |
> 1256150.914 - 5536887.000: 7 |
>
> After this patch, btrfs_add_inode_to_root() execution times:
>
> Count: 5000000
> Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
> Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
> 0.000 - 7.264: 104 |
> 7.264 - 33.145: 352 |
> 33.145 - 140.081: 109606 #
> 140.081 - 581.930: 4840090 #####################################################
> 581.930 - 2407.590: 43532 |
> 2407.590 - 9950.979: 2245 |
> 9950.979 - 41119.278: 514 |
> 41119.278 - 169902.616: 155 |
> 169902.616 - 702018.539: 47 |
> 702018.539 - 2900652.000: 9 |
>
> Average, percentiles, standard deviation, etc., are all much better.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
I'm not familiar with XArray, but reviewing this alongside the XArray
doc indeed helps me get more familiar with it.
Thanks,
Qu
> ---
> fs/btrfs/btrfs_inode.h | 3 -
> fs/btrfs/ctree.h | 7 ++-
> fs/btrfs/disk-io.c | 6 +-
> fs/btrfs/inode.c | 128 ++++++++++++++++-------------------------
> 4 files changed, 58 insertions(+), 86 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 91c994b569f3..e577b9745884 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -155,9 +155,6 @@ struct btrfs_inode {
> */
> struct list_head delalloc_inodes;
>
> - /* node for the red-black tree that links inodes in subvolume root */
> - struct rb_node rb_node;
> -
> unsigned long runtime_flags;
>
> /* full 64 bit generation number, struct vfs_inode doesn't have a big
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c03c58246033..aa2568f86dc9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -222,8 +222,11 @@ struct btrfs_root {
> struct list_head root_list;
>
> spinlock_t inode_lock;
> - /* red-black tree that keeps track of in-memory inodes */
> - struct rb_root inode_tree;
> + /*
> + * Xarray that keeps track of in-memory inodes, protected by the lock
> + * @inode_lock.
> + */
> + struct xarray inodes;
>
> /*
> * Xarray that keeps track of delayed nodes of every inode, protected
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a91a8056758a..ed40fe1db53e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -662,7 +662,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> root->free_objectid = 0;
> root->nr_delalloc_inodes = 0;
> root->nr_ordered_extents = 0;
> - root->inode_tree = RB_ROOT;
> + xa_init(&root->inodes);
> xa_init(&root->delayed_nodes);
>
> btrfs_init_root_block_rsv(root);
> @@ -1854,7 +1854,8 @@ void btrfs_put_root(struct btrfs_root *root)
> return;
>
> if (refcount_dec_and_test(&root->refs)) {
> - WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
> + if (WARN_ON(!xa_empty(&root->inodes)))
> + xa_destroy(&root->inodes);
> WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state));
> if (root->anon_dev)
> free_anon_bdev(root->anon_dev);
> @@ -1939,7 +1940,6 @@ static int btrfs_init_btree_inode(struct super_block *sb)
> inode->i_mapping->a_ops = &btree_aops;
> mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
>
> - RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
> extent_io_tree_init(fs_info, &BTRFS_I(inode)->io_tree,
> IO_TREE_BTREE_INODE_IO);
> extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d0274324c75a..450fe1582f1d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5493,58 +5493,51 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
> return err;
> }
>
> -static void inode_tree_add(struct btrfs_inode *inode)
> +static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
> {
> struct btrfs_root *root = inode->root;
> - struct btrfs_inode *entry;
> - struct rb_node **p;
> - struct rb_node *parent;
> - struct rb_node *new = &inode->rb_node;
> - u64 ino = btrfs_ino(inode);
> + struct btrfs_inode *existing;
> + const u64 ino = btrfs_ino(inode);
> + int ret;
>
> if (inode_unhashed(&inode->vfs_inode))
> - return;
> - parent = NULL;
> + return 0;
> +
> + ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
> + if (ret)
> + return ret;
> +
> spin_lock(&root->inode_lock);
> - p = &root->inode_tree.rb_node;
> - while (*p) {
> - parent = *p;
> - entry = rb_entry(parent, struct btrfs_inode, rb_node);
> + existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
> + spin_unlock(&root->inode_lock);
>
> - if (ino < btrfs_ino(entry))
> - p = &parent->rb_left;
> - else if (ino > btrfs_ino(entry))
> - p = &parent->rb_right;
> - else {
> - WARN_ON(!(entry->vfs_inode.i_state &
> - (I_WILL_FREE | I_FREEING)));
> - rb_replace_node(parent, new, &root->inode_tree);
> - RB_CLEAR_NODE(parent);
> - spin_unlock(&root->inode_lock);
> - return;
> - }
> + if (xa_is_err(existing)) {
> + ret = xa_err(existing);
> + ASSERT(ret != -EINVAL);
> + ASSERT(ret != -ENOMEM);
> + return ret;
> + } else if (existing) {
> + WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
> }
> - rb_link_node(new, parent, p);
> - rb_insert_color(new, &root->inode_tree);
> - spin_unlock(&root->inode_lock);
> +
> + return 0;
> }
>
> -static void inode_tree_del(struct btrfs_inode *inode)
> +static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
> {
> struct btrfs_root *root = inode->root;
> - int empty = 0;
> + struct btrfs_inode *entry;
> + bool empty = false;
>
> spin_lock(&root->inode_lock);
> - if (!RB_EMPTY_NODE(&inode->rb_node)) {
> - rb_erase(&inode->rb_node, &root->inode_tree);
> - RB_CLEAR_NODE(&inode->rb_node);
> - empty = RB_EMPTY_ROOT(&root->inode_tree);
> - }
> + entry = xa_erase(&root->inodes, btrfs_ino(inode));
> + if (entry == inode)
> + empty = xa_empty(&root->inodes);
> spin_unlock(&root->inode_lock);
>
> if (empty && btrfs_root_refs(&root->root_item) == 0) {
> spin_lock(&root->inode_lock);
> - empty = RB_EMPTY_ROOT(&root->inode_tree);
> + empty = xa_empty(&root->inodes);
> spin_unlock(&root->inode_lock);
> if (empty)
> btrfs_add_dead_root(root);
> @@ -5613,8 +5606,13 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
>
> ret = btrfs_read_locked_inode(inode, path);
> if (!ret) {
> - inode_tree_add(BTRFS_I(inode));
> - unlock_new_inode(inode);
> + ret = btrfs_add_inode_to_root(BTRFS_I(inode));
> + if (ret) {
> + iget_failed(inode);
> + inode = ERR_PTR(ret);
> + } else {
> + unlock_new_inode(inode);
> + }
> } else {
> iget_failed(inode);
> /*
> @@ -6426,7 +6424,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> }
> }
>
> - inode_tree_add(BTRFS_I(inode));
> + ret = btrfs_add_inode_to_root(BTRFS_I(inode));
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);
> + goto discard;
> + }
>
> trace_btrfs_inode_new(inode);
> btrfs_set_inode_last_trans(trans, BTRFS_I(inode));
> @@ -8466,7 +8468,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> ei->ordered_tree_last = NULL;
> INIT_LIST_HEAD(&ei->delalloc_inodes);
> INIT_LIST_HEAD(&ei->delayed_iput);
> - RB_CLEAR_NODE(&ei->rb_node);
> init_rwsem(&ei->i_mmap_lock);
>
> return inode;
> @@ -8538,7 +8539,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
> }
> }
> btrfs_qgroup_check_reserved_leak(inode);
> - inode_tree_del(inode);
> + btrfs_del_inode_from_root(inode);
> btrfs_drop_extent_map_range(inode, 0, (u64)-1, false);
> btrfs_inode_clear_file_extent_range(inode, 0, (u64)-1);
> btrfs_put_root(inode->root);
> @@ -10857,52 +10858,23 @@ void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 en
> */
> struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
> {
> - struct rb_node *node;
> - struct rb_node *prev;
> struct btrfs_inode *inode;
> + unsigned long from = min_ino;
>
> spin_lock(&root->inode_lock);
> -again:
> - node = root->inode_tree.rb_node;
> - prev = NULL;
> - while (node) {
> - prev = node;
> - inode = rb_entry(node, struct btrfs_inode, rb_node);
> - if (min_ino < btrfs_ino(inode))
> - node = node->rb_left;
> - else if (min_ino > btrfs_ino(inode))
> - node = node->rb_right;
> - else
> + while (true) {
> + inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
> + if (!inode)
> + break;
> + if (igrab(&inode->vfs_inode))
> break;
> - }
> -
> - if (!node) {
> - while (prev) {
> - inode = rb_entry(prev, struct btrfs_inode, rb_node);
> - if (min_ino <= btrfs_ino(inode)) {
> - node = prev;
> - break;
> - }
> - prev = rb_next(prev);
> - }
> - }
> -
> - while (node) {
> - inode = rb_entry(prev, struct btrfs_inode, rb_node);
> - if (igrab(&inode->vfs_inode)) {
> - spin_unlock(&root->inode_lock);
> - return inode;
> - }
> -
> - min_ino = btrfs_ino(inode) + 1;
> - if (cond_resched_lock(&root->inode_lock))
> - goto again;
>
> - node = rb_next(node);
> + from = btrfs_ino(inode) + 1;
> + cond_resched_lock(&root->inode_lock);
> }
> spin_unlock(&root->inode_lock);
>
> - return NULL;
> + return inode;
> }
>
> static const struct inode_operations btrfs_dir_inode_operations = {
* Re: [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort
2024-05-08 12:17 ` [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
@ 2024-05-09 0:21 ` Qu Wenruo
0 siblings, 0 replies; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 0:21 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When creating a new inode, at btrfs_create_new_inode(), one of the very
> last steps is to add the inode to the root's inodes xarray. This often
> requires allocating memory which may fail (even though xarrays have a
> dedicated kmem_cache which make it less likely to fail), and at that point
> we are forced to abort the current transaction (as some, but not all, of
> the inode metadata was added to its subvolume btree).
>
> To avoid a transaction abort, preallocate memory for the xarray early at
> btrfs_create_new_inode(), so that if we fail we don't need to abort the
> transaction and the insertion into the xarray is guaranteed to succeed.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/inode.c | 26 +++++++++++++++++++-------
> 1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 450fe1582f1d..85dbc19c2f6f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5493,7 +5493,7 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
> return err;
> }
>
> -static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
> +static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
> {
> struct btrfs_root *root = inode->root;
> struct btrfs_inode *existing;
> @@ -5503,9 +5503,11 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
> if (inode_unhashed(&inode->vfs_inode))
> return 0;
>
> - ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
> - if (ret)
> - return ret;
> + if (prealloc) {
> + ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
> + if (ret)
> + return ret;
> + }
>
> spin_lock(&root->inode_lock);
> existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
> @@ -5606,7 +5608,7 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
>
> ret = btrfs_read_locked_inode(inode, path);
> if (!ret) {
> - ret = btrfs_add_inode_to_root(BTRFS_I(inode));
> + ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
> if (ret) {
> iget_failed(inode);
> inode = ERR_PTR(ret);
> @@ -6237,6 +6239,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> struct btrfs_item_batch batch;
> unsigned long ptr;
> int ret;
> + bool xa_reserved = false;
>
> path = btrfs_alloc_path();
> if (!path)
> @@ -6251,6 +6254,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> goto out;
> inode->i_ino = objectid;
>
> + ret = xa_reserve(&root->inodes, objectid, GFP_NOFS);
> + if (ret)
> + goto out;
> + xa_reserved = true;
> +
> if (args->orphan) {
> /*
> * O_TMPFILE, set link count to 0, so that after this point, we
> @@ -6424,8 +6432,9 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> }
> }
>
> - ret = btrfs_add_inode_to_root(BTRFS_I(inode));
> - if (ret) {
> + ret = btrfs_add_inode_to_root(BTRFS_I(inode), false);
> + if (WARN_ON(ret)) {
> + /* Shouldn't happen, we used xa_reserve() before. */
> btrfs_abort_transaction(trans, ret);
> goto discard;
> }
> @@ -6456,6 +6465,9 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> ihold(inode);
> discard_new_inode(inode);
> out:
> + if (xa_reserved)
> + xa_release(&root->inodes, objectid);
> +
> btrfs_free_path(path);
> return ret;
> }
* Re: [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
2024-05-08 12:17 ` [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
@ 2024-05-09 0:23 ` Qu Wenruo
0 siblings, 0 replies; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 0:23 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
> if-then-else statements and having an error label to do all the error
> handling instead of repeating it a couple times.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/inode.c | 44 +++++++++++++++++++++-----------------------
> 1 file changed, 21 insertions(+), 23 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 85dbc19c2f6f..8ea9fd4c2b66 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5598,37 +5598,35 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
> struct btrfs_root *root, struct btrfs_path *path)
> {
> struct inode *inode;
> + int ret;
>
> inode = btrfs_iget_locked(s, ino, root);
> if (!inode)
> return ERR_PTR(-ENOMEM);
>
> - if (inode->i_state & I_NEW) {
> - int ret;
> + if (!(inode->i_state & I_NEW))
> + return inode;
>
> - ret = btrfs_read_locked_inode(inode, path);
> - if (!ret) {
> - ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
> - if (ret) {
> - iget_failed(inode);
> - inode = ERR_PTR(ret);
> - } else {
> - unlock_new_inode(inode);
> - }
> - } else {
> - iget_failed(inode);
> - /*
> - * ret > 0 can come from btrfs_search_slot called by
> - * btrfs_read_locked_inode, this means the inode item
> - * was not found.
> - */
> - if (ret > 0)
> - ret = -ENOENT;
> - inode = ERR_PTR(ret);
> - }
> - }
> + ret = btrfs_read_locked_inode(inode, path);
> + /*
> + * ret > 0 can come from btrfs_search_slot called by
> + * btrfs_read_locked_inode(), this means the inode item was not found.
> + */
> + if (ret > 0)
> + ret = -ENOENT;
> + if (ret < 0)
> + goto error;
> +
> + ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
> + if (ret < 0)
> + goto error;
> +
> + unlock_new_inode(inode);
>
> return inode;
> +error:
> + iget_failed(inode);
> + return ERR_PTR(ret);
> }
>
> struct inode *btrfs_iget(struct super_block *s, u64 ino, struct btrfs_root *root)
* Re: [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks
2024-05-08 12:17 ` [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
@ 2024-05-09 0:25 ` Qu Wenruo
2024-05-09 8:38 ` Filipe Manana
0 siblings, 1 reply; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 0:25 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently we use the spinlock inode_lock from struct btrfs_root to
> serialize access to two different data structures:
>
> 1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
> 2) The inodes xarray (struct btrfs_root::inodes).
>
> Instead of using our own lock, we can use the spinlock that is part of the
> xarray implementation, by using the xa_lock() and xa_unlock() APIs and
> using the xarray APIs with the double underscore prefix that don't take
> the xarray locks and assume the caller is using xa_lock() and xa_unlock().
>
> So remove the spinlock inode_lock from struct btrfs_root and use the
> corresponding xarray locks. This brings 2 benefits:
>
> 1) We reduce the size of struct btrfs_root, from 1336 bytes down to
> 1328 bytes on a 64 bits release kernel config;
>
> 2) We reduce lock contention by not using anymore the same lock for
> changing two different and unrelated xarrays.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/ctree.h | 1 -
> fs/btrfs/delayed-inode.c | 24 +++++++++++-------------
> fs/btrfs/disk-io.c | 1 -
> fs/btrfs/inode.c | 18 ++++++++----------
> 4 files changed, 19 insertions(+), 25 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index aa2568f86dc9..1004cb934b4a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -221,7 +221,6 @@ struct btrfs_root {
>
> struct list_head root_list;
>
> - spinlock_t inode_lock;
> /*
> * Xarray that keeps track of in-memory inodes, protected by the lock
> * @inode_lock.
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 95a0497fa866..1373f474c9b6 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -77,14 +77,14 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
> return node;
> }
>
> - spin_lock(&root->inode_lock);
> + xa_lock(&root->delayed_nodes);
> node = xa_load(&root->delayed_nodes, ino);
Do we need xa_lock() here?
The doc shows xa_load() already uses the RCU read lock.
Only xa_store()/xa_find() take the xa_lock internally, thus they need
to be converted.
Or did I miss something else?
Thanks,
Qu
>
> if (node) {
> if (btrfs_inode->delayed_node) {
> refcount_inc(&node->refs); /* can be accessed */
> BUG_ON(btrfs_inode->delayed_node != node);
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
> return node;
> }
>
> @@ -111,10 +111,10 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
> node = NULL;
> }
>
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
> return node;
> }
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
>
> return NULL;
> }
> @@ -148,21 +148,21 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
> kmem_cache_free(delayed_node_cache, node);
> return ERR_PTR(-ENOMEM);
> }
> - spin_lock(&root->inode_lock);
> + xa_lock(&root->delayed_nodes);
> ptr = xa_load(&root->delayed_nodes, ino);
> if (ptr) {
> /* Somebody inserted it, go back and read it. */
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
> kmem_cache_free(delayed_node_cache, node);
> node = NULL;
> goto again;
> }
> - ptr = xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
> + ptr = __xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
> ASSERT(xa_err(ptr) != -EINVAL);
> ASSERT(xa_err(ptr) != -ENOMEM);
> ASSERT(ptr == NULL);
> btrfs_inode->delayed_node = node;
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
>
> return node;
> }
> @@ -275,14 +275,12 @@ static void __btrfs_release_delayed_node(
> if (refcount_dec_and_test(&delayed_node->refs)) {
> struct btrfs_root *root = delayed_node->root;
>
> - spin_lock(&root->inode_lock);
> /*
> * Once our refcount goes to zero, nobody is allowed to bump it
> * back up. We can delete it now.
> */
> ASSERT(refcount_read(&delayed_node->refs) == 0);
> xa_erase(&root->delayed_nodes, delayed_node->inode_id);
> - spin_unlock(&root->inode_lock);
> kmem_cache_free(delayed_node_cache, delayed_node);
> }
> }
> @@ -2057,9 +2055,9 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
> struct btrfs_delayed_node *node;
> int count;
>
> - spin_lock(&root->inode_lock);
> + xa_lock(&root->delayed_nodes);
> if (xa_empty(&root->delayed_nodes)) {
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
> return;
> }
>
> @@ -2076,7 +2074,7 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
> if (count >= ARRAY_SIZE(delayed_nodes))
> break;
> }
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->delayed_nodes);
> index++;
>
> for (int i = 0; i < count; i++) {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index ed40fe1db53e..d20e400a9ce3 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -674,7 +674,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> INIT_LIST_HEAD(&root->ordered_extents);
> INIT_LIST_HEAD(&root->ordered_root);
> INIT_LIST_HEAD(&root->reloc_dirty_list);
> - spin_lock_init(&root->inode_lock);
> spin_lock_init(&root->delalloc_lock);
> spin_lock_init(&root->ordered_extent_lock);
> spin_lock_init(&root->accounting_lock);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8ea9fd4c2b66..4fd41d6b377f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5509,9 +5509,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
> return ret;
> }
>
> - spin_lock(&root->inode_lock);
> existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
> - spin_unlock(&root->inode_lock);
>
> if (xa_is_err(existing)) {
> ret = xa_err(existing);
> @@ -5531,16 +5529,16 @@ static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
> struct btrfs_inode *entry;
> bool empty = false;
>
> - spin_lock(&root->inode_lock);
> - entry = xa_erase(&root->inodes, btrfs_ino(inode));
> + xa_lock(&root->inodes);
> + entry = __xa_erase(&root->inodes, btrfs_ino(inode));
> if (entry == inode)
> empty = xa_empty(&root->inodes);
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->inodes);
>
> if (empty && btrfs_root_refs(&root->root_item) == 0) {
> - spin_lock(&root->inode_lock);
> + xa_lock(&root->inodes);
> empty = xa_empty(&root->inodes);
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->inodes);
> if (empty)
> btrfs_add_dead_root(root);
> }
> @@ -10871,7 +10869,7 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
> struct btrfs_inode *inode;
> unsigned long from = min_ino;
>
> - spin_lock(&root->inode_lock);
> + xa_lock(&root->inodes);
> while (true) {
> inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
> if (!inode)
> @@ -10880,9 +10878,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
> break;
>
> from = btrfs_ino(inode) + 1;
> - cond_resched_lock(&root->inode_lock);
> + cond_resched_lock(&root->inodes.xa_lock);
> }
> - spin_unlock(&root->inode_lock);
> + xa_unlock(&root->inodes);
>
> return inode;
> }
* Re: [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
2024-05-08 12:17 ` [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
@ 2024-05-09 0:30 ` Qu Wenruo
2024-05-09 8:39 ` Filipe Manana
0 siblings, 1 reply; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 0:30 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The index_cnt field of struct btrfs_inode is used only for two purposes:
>
> 1) To store the index for the next entry added to a directory;
>
> 2) For the data relocation inode to track the logical start address of the
> block group currently being relocated.
>
> For the relocation case we use index_cnt because it's not used for
> anything else in the relocation use case - we could have used other fields
> that are not used by relocation such as defrag_bytes, last_unlink_trans
> or last_reflink_trans for example (among others).
>
> Since the csum_bytes field is not used for directories, do the following
> changes:
>
> 1) Put index_cnt and csum_bytes in a union, and index_cnt is only
> initialized when the inode is a directory. The csum_bytes is only
> accessed in IO paths for regular files, so we're fine here;
>
> 2) Use the defrag_bytes field for relocation, since the data relocation
> inode is never used for defrag purposes. And to make the naming better,
> alias it to reloc_block_group_start by using a union.
>
> This reduces the size of struct btrfs_inode by 8 bytes in a release
> kernel, from 1040 bytes down to 1032 bytes.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/btrfs_inode.h | 46 +++++++++++++++++++++++++---------------
> fs/btrfs/delayed-inode.c | 3 ++-
> fs/btrfs/inode.c | 21 ++++++++++++------
> fs/btrfs/relocation.c | 12 +++++------
> fs/btrfs/tree-log.c | 3 ++-
> 5 files changed, 54 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index e577b9745884..19bb3d057414 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -215,11 +215,20 @@ struct btrfs_inode {
> u64 last_dir_index_offset;
> };
>
> - /*
> - * Total number of bytes pending defrag, used by stat to check whether
> - * it needs COW. Protected by 'lock'.
> - */
> - u64 defrag_bytes;
> + union {
> + /*
> + * Total number of bytes pending defrag, used by stat to check whether
> + * it needs COW. Protected by 'lock'.
> + * Used by inodes other than the data relocation inode.
> + */
> + u64 defrag_bytes;
> +
> + /*
> + * Logical address of the block group being relocated.
> + * Used only by the data relocation inode.
> + */
> + u64 reloc_block_group_start;
> + };
>
> /*
> * The size of the file stored in the metadata on disk. data=ordered
> @@ -228,12 +237,21 @@ struct btrfs_inode {
> */
> u64 disk_i_size;
>
> - /*
> - * If this is a directory then index_cnt is the counter for the index
> - * number for new files that are created. For an empty directory, this
> - * must be initialized to BTRFS_DIR_START_INDEX.
> - */
> - u64 index_cnt;
> + union {
> + /*
> + * If this is a directory then index_cnt is the counter for the
> + * index number for new files that are created. For an empty
> + * directory, this must be initialized to BTRFS_DIR_START_INDEX.
> + */
> + u64 index_cnt;
> +
> + /*
> + * If this is not a directory, this is the number of bytes
> + * outstanding that are going to need csums. This is used in
> + * ENOSPC accounting. Protected by 'lock'.
> + */
> + u64 csum_bytes;
> + };
>
> /* Cache the directory index number to speed the dir/file remove */
> u64 dir_index;
> @@ -256,12 +274,6 @@ struct btrfs_inode {
> */
> u64 last_reflink_trans;
>
> - /*
> - * Number of bytes outstanding that are going to need csums. This is
> - * used in ENOSPC accounting. Protected by 'lock'.
> - */
> - u64 csum_bytes;
> -
> /* Backwards incompatible flags, lower half of inode_item::flags */
> u32 flags;
> /* Read-only compatibility flags, upper half of inode_item::flags */
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 1373f474c9b6..e298a44eaf69 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1914,7 +1914,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
> BTRFS_I(inode)->i_otime_nsec = btrfs_stack_timespec_nsec(&inode_item->otime);
>
> inode->i_generation = BTRFS_I(inode)->generation;
> - BTRFS_I(inode)->index_cnt = (u64)-1;
> + if (S_ISDIR(inode->i_mode))
> + BTRFS_I(inode)->index_cnt = (u64)-1;
>
> mutex_unlock(&delayed_node->mutex);
> btrfs_release_delayed_node(delayed_node);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4fd41d6b377f..9b98aa65cc63 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3856,7 +3856,9 @@ static int btrfs_read_locked_inode(struct inode *inode,
> inode->i_rdev = 0;
> rdev = btrfs_inode_rdev(leaf, inode_item);
>
> - BTRFS_I(inode)->index_cnt = (u64)-1;
> + if (S_ISDIR(inode->i_mode))
> + BTRFS_I(inode)->index_cnt = (u64)-1;
> +
> btrfs_inode_split_flags(btrfs_inode_flags(leaf, inode_item),
> &BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
>
> @@ -6268,8 +6270,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> if (ret)
> goto out;
> }
> - /* index_cnt is ignored for everything but a dir. */
> - BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
> +
> + if (S_ISDIR(inode->i_mode))
> + BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
> +
> BTRFS_I(inode)->generation = trans->transid;
> inode->i_generation = BTRFS_I(inode)->generation;
>
> @@ -8435,8 +8439,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> ei->disk_i_size = 0;
> ei->flags = 0;
> ei->ro_flags = 0;
> + /*
> > + * ->index_cnt will be properly initialized later when creating a new
> + * inode (btrfs_create_new_inode()) or when reading an existing inode
> + * from disk (btrfs_read_locked_inode()).
> + */
Would the above comment be a little confusing?
As the comment sits next to csum_bytes, without checking the definition it's
not clear that csum_bytes and index_cnt share the same storage.
Maybe just removing it would be good enough?
Otherwise looks good to me.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ei->csum_bytes = 0;
> - ei->index_cnt = (u64)-1;
> ei->dir_index = 0;
> ei->last_unlink_trans = 0;
> ei->last_reflink_trans = 0;
> @@ -8511,9 +8519,10 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
> if (!S_ISDIR(vfs_inode->i_mode)) {
> WARN_ON(inode->delalloc_bytes);
> WARN_ON(inode->new_delalloc_bytes);
> + WARN_ON(inode->csum_bytes);
> }
> - WARN_ON(inode->csum_bytes);
> - WARN_ON(inode->defrag_bytes);
> + if (!root || !btrfs_is_data_reloc_root(root))
> + WARN_ON(inode->defrag_bytes);
>
> /*
> * This can happen where we create an inode, but somebody else also
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 8b24bb5a0aa1..9f35524b6664 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -962,7 +962,7 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
> if (!path)
> return -ENOMEM;
>
> - bytenr -= BTRFS_I(reloc_inode)->index_cnt;
> + bytenr -= BTRFS_I(reloc_inode)->reloc_block_group_start;
> ret = btrfs_lookup_file_extent(NULL, root, path,
> btrfs_ino(BTRFS_I(reloc_inode)), bytenr, 0);
> if (ret < 0)
> @@ -2797,7 +2797,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
> u64 alloc_hint = 0;
> u64 start;
> u64 end;
> - u64 offset = inode->index_cnt;
> + u64 offset = inode->reloc_block_group_start;
> u64 num_bytes;
> int nr;
> int ret = 0;
> @@ -2951,7 +2951,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
> int *cluster_nr, unsigned long index)
> {
> struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> - u64 offset = BTRFS_I(inode)->index_cnt;
> + u64 offset = BTRFS_I(inode)->reloc_block_group_start;
> const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
> gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
> struct folio *folio;
> @@ -3086,7 +3086,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
> static int relocate_file_extent_cluster(struct inode *inode,
> const struct file_extent_cluster *cluster)
> {
> - u64 offset = BTRFS_I(inode)->index_cnt;
> + u64 offset = BTRFS_I(inode)->reloc_block_group_start;
> unsigned long index;
> unsigned long last_index;
> struct file_ra_state *ra;
> @@ -3915,7 +3915,7 @@ static noinline_for_stack struct inode *create_reloc_inode(
> inode = NULL;
> goto out;
> }
> - BTRFS_I(inode)->index_cnt = group->start;
> + BTRFS_I(inode)->reloc_block_group_start = group->start;
>
> ret = btrfs_orphan_add(trans, BTRFS_I(inode));
> out:
> @@ -4395,7 +4395,7 @@ int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered)
> {
> struct btrfs_inode *inode = BTRFS_I(ordered->inode);
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - u64 disk_bytenr = ordered->file_offset + inode->index_cnt;
> + u64 disk_bytenr = ordered->file_offset + inode->reloc_block_group_start;
> struct btrfs_root *csum_root = btrfs_csum_root(fs_info, disk_bytenr);
> LIST_HEAD(list);
> int ret;
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 5146387b416b..0aee43466c52 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -1625,7 +1625,8 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
> if (ret)
> goto out;
> }
> - BTRFS_I(inode)->index_cnt = (u64)-1;
> + if (S_ISDIR(inode->i_mode))
> + BTRFS_I(inode)->index_cnt = (u64)-1;
>
> if (inode->i_nlink == 0) {
> if (S_ISDIR(inode->i_mode)) {
^ permalink raw reply [flat|nested] 37+ messages in thread
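The space saving discussed in this patch comes from overlaying mutually exclusive fields in anonymous unions. A minimal userspace sketch of the idea (the struct and field subset here are hypothetical stand-ins, much smaller than the real btrfs_inode):

```c
#include <assert.h>
#include <stdint.h>

/* Before: two fields that are never meaningful at the same time. */
struct inode_before {
	uint64_t index_cnt;   /* meaningful only for directories */
	uint64_t csum_bytes;  /* meaningful only for regular files */
};

/* After: the mutually exclusive fields share storage in an anonymous union. */
struct inode_after {
	union {
		uint64_t index_cnt;   /* directories */
		uint64_t csum_bytes;  /* regular files */
	};
};
```

The trade-off, as the review points out, is that code touching one union member must be careful never to run for an inode type that uses the other member, which is why the patch adds S_ISDIR() guards around every index_cnt initialization.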
* Re: [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files
2024-05-08 12:17 ` [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files fdmanana
@ 2024-05-09 0:39 ` Qu Wenruo
2024-05-09 8:41 ` Filipe Manana
0 siblings, 1 reply; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 0:39 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When not using the NO_HOLES feature we always allocate an io tree for an
> inode's file_extent_tree.
I'm wondering whether we can drop non-NO_HOLES support directly so that we
can get rid of the file_extent_tree completely?
Initially I wondered why NO_HOLES makes a difference, especially as
NO_HOLES can be enabled halfway, thus we can still have explicit hole extents.
But it turns out that the whole file_extent_tree is only there to handle the
non-NO_HOLES case, so that we always have a hole for the whole file range
until i_size.
Considering NO_HOLES has been considered safe since the 4.0 kernel, can we
start deprecating non-NO_HOLES?
> This is wasteful because that io tree is only
> used for regular files, so we allocate more memory than needed for inodes
> that represent directories or symlinks for example, or for inodes that
> correspond to free space inodes.
>
> So improve on this by allocating the io tree only for inodes of regular
> files that are not free space inodes.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Otherwise looks good to me.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/file-item.c | 13 ++++++-----
> fs/btrfs/inode.c | 53 +++++++++++++++++++++++++++++---------------
> 2 files changed, 42 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index bce95f871750..f3ed78e21fa4 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -45,13 +45,12 @@
> */
> void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
> {
> - struct btrfs_fs_info *fs_info = inode->root->fs_info;
> u64 start, end, i_size;
> int ret;
>
> spin_lock(&inode->lock);
> i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
> - if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
> + if (!inode->file_extent_tree) {
> inode->disk_i_size = i_size;
> goto out_unlock;
> }
> @@ -84,13 +83,14 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
> int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
> u64 len)
> {
> + if (!inode->file_extent_tree)
> + return 0;
> +
> if (len == 0)
> return 0;
>
> ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));
>
> - if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
> - return 0;
> return set_extent_bit(inode->file_extent_tree, start, start + len - 1,
> EXTENT_DIRTY, NULL);
> }
> @@ -112,14 +112,15 @@ int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
> int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
> u64 len)
> {
> + if (!inode->file_extent_tree)
> + return 0;
> +
> if (len == 0)
> return 0;
>
> ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
> len == (u64)-1);
>
> - if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
> - return 0;
> return clear_extent_bit(inode->file_extent_tree, start,
> start + len - 1, EXTENT_DIRTY, NULL);
> }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9b98aa65cc63..175fd007f0ef 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3781,6 +3781,30 @@ static noinline int acls_after_inode_item(struct extent_buffer *leaf,
> return 1;
> }
>
> +static int btrfs_init_file_extent_tree(struct btrfs_inode *inode)
> +{
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +
> + if (WARN_ON_ONCE(inode->file_extent_tree))
> + return 0;
> + if (btrfs_fs_incompat(fs_info, NO_HOLES))
> + return 0;
> + if (!S_ISREG(inode->vfs_inode.i_mode))
> + return 0;
> + if (btrfs_is_free_space_inode(inode))
> + return 0;
> +
> + inode->file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
> + if (!inode->file_extent_tree)
> + return -ENOMEM;
> +
> + extent_io_tree_init(fs_info, inode->file_extent_tree, IO_TREE_INODE_FILE_EXTENT);
> + /* Lockdep class is set only for the file extent tree. */
> + lockdep_set_class(&inode->file_extent_tree->lock, &file_extent_tree_class);
> +
> + return 0;
> +}
> +
> /*
> * read an inode from the btree into the in-memory inode
> */
> @@ -3800,6 +3824,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
> bool filled = false;
> int first_xattr_slot;
>
> + ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
> + if (ret)
> + return ret;
> +
> ret = btrfs_fill_inode(inode, &rdev);
> if (!ret)
> filled = true;
> @@ -6247,6 +6275,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> BTRFS_I(inode)->root = btrfs_grab_root(BTRFS_I(dir)->root);
> root = BTRFS_I(inode)->root;
>
> + ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
> + if (ret)
> + goto out;
> +
> ret = btrfs_get_free_objectid(root, &objectid);
> if (ret)
> goto out;
> @@ -8413,20 +8445,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> struct btrfs_fs_info *fs_info = btrfs_sb(sb);
> struct btrfs_inode *ei;
> struct inode *inode;
> - struct extent_io_tree *file_extent_tree = NULL;
> -
> - /* Self tests may pass a NULL fs_info. */
> - if (fs_info && !btrfs_fs_incompat(fs_info, NO_HOLES)) {
> - file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
> - if (!file_extent_tree)
> - return NULL;
> - }
>
> ei = alloc_inode_sb(sb, btrfs_inode_cachep, GFP_KERNEL);
> - if (!ei) {
> - kfree(file_extent_tree);
> + if (!ei)
> return NULL;
> - }
>
> ei->root = NULL;
> ei->generation = 0;
> @@ -8471,13 +8493,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO);
> ei->io_tree.inode = ei;
>
> - ei->file_extent_tree = file_extent_tree;
> - if (file_extent_tree) {
> - extent_io_tree_init(fs_info, ei->file_extent_tree,
> - IO_TREE_INODE_FILE_EXTENT);
> - /* Lockdep class is set only for the file extent tree. */
> - lockdep_set_class(&ei->file_extent_tree->lock, &file_extent_tree_class);
> - }
> + ei->file_extent_tree = NULL;
> +
> mutex_init(&ei->log_mutex);
> spin_lock_init(&ei->ordered_tree_lock);
> ei->ordered_tree = RB_ROOT;
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks
2024-05-09 0:25 ` Qu Wenruo
@ 2024-05-09 8:38 ` Filipe Manana
2024-05-09 8:42 ` Qu Wenruo
0 siblings, 1 reply; 37+ messages in thread
From: Filipe Manana @ 2024-05-09 8:38 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, May 9, 2024 at 1:25 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Currently we use the spinlock inode_lock from struct btrfs_root to
> > serialize access to two different data structures:
> >
> > 1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
> > 2) The inodes xarray (struct btrfs_root::inodes).
> >
> > Instead of using our own lock, we can use the spinlock that is part of the
> > xarray implementation, by using the xa_lock() and xa_unlock() APIs and
> > using the xarray APIs with the double underscore prefix that don't take
> > the xarray locks and assume the caller is using xa_lock() and xa_unlock().
> >
> > So remove the spinlock inode_lock from struct btrfs_root and use the
> > corresponding xarray locks. This brings 2 benefits:
> >
> > 1) We reduce the size of struct btrfs_root, from 1336 bytes down to
> > 1328 bytes on a 64 bits release kernel config;
> >
> > 2) We reduce lock contention by not using anymore the same lock for
> > changing two different and unrelated xarrays.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/ctree.h | 1 -
> > fs/btrfs/delayed-inode.c | 24 +++++++++++-------------
> > fs/btrfs/disk-io.c | 1 -
> > fs/btrfs/inode.c | 18 ++++++++----------
> > 4 files changed, 19 insertions(+), 25 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index aa2568f86dc9..1004cb934b4a 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -221,7 +221,6 @@ struct btrfs_root {
> >
> > struct list_head root_list;
> >
> > - spinlock_t inode_lock;
> > /*
> > * Xarray that keeps track of in-memory inodes, protected by the lock
> > * @inode_lock.
> > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> > index 95a0497fa866..1373f474c9b6 100644
> > --- a/fs/btrfs/delayed-inode.c
> > +++ b/fs/btrfs/delayed-inode.c
> > @@ -77,14 +77,14 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
> > return node;
> > }
> >
> > - spin_lock(&root->inode_lock);
> > + xa_lock(&root->delayed_nodes);
> > node = xa_load(&root->delayed_nodes, ino);
>
> Do we need xa_lock() here?
>
> The doc shows xa_load() use RCU read lock already.
> Only xa_store()/xa_find() would take xa_lock internally, thus they need
> to be converted.
>
> Or did I miss something else?
The RCU is only for protection against concurrent xarray operations
that modify the xarray.
After xa_load() returns, the "node" might have been removed from the
xarray and freed by another task.
That's why this change is a straightforward switch from one lock to another.
It may be that our code is structured in a way that we can get away
without the lock.
But that needs to be properly analysed, and given that it's a
non-trivial behavioural change, it should have its own separate patch and
change log with the analysis.
Thanks.
>
> Thanks,
> Qu
> >
> > if (node) {
> > if (btrfs_inode->delayed_node) {
> > refcount_inc(&node->refs); /* can be accessed */
> > BUG_ON(btrfs_inode->delayed_node != node);
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> > return node;
> > }
> >
> > @@ -111,10 +111,10 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
> > node = NULL;
> > }
> >
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> > return node;
> > }
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> >
> > return NULL;
> > }
> > @@ -148,21 +148,21 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
> > kmem_cache_free(delayed_node_cache, node);
> > return ERR_PTR(-ENOMEM);
> > }
> > - spin_lock(&root->inode_lock);
> > + xa_lock(&root->delayed_nodes);
> > ptr = xa_load(&root->delayed_nodes, ino);
> > if (ptr) {
> > /* Somebody inserted it, go back and read it. */
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> > kmem_cache_free(delayed_node_cache, node);
> > node = NULL;
> > goto again;
> > }
> > - ptr = xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
> > + ptr = __xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
> > ASSERT(xa_err(ptr) != -EINVAL);
> > ASSERT(xa_err(ptr) != -ENOMEM);
> > ASSERT(ptr == NULL);
> > btrfs_inode->delayed_node = node;
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> >
> > return node;
> > }
> > @@ -275,14 +275,12 @@ static void __btrfs_release_delayed_node(
> > if (refcount_dec_and_test(&delayed_node->refs)) {
> > struct btrfs_root *root = delayed_node->root;
> >
> > - spin_lock(&root->inode_lock);
> > /*
> > * Once our refcount goes to zero, nobody is allowed to bump it
> > * back up. We can delete it now.
> > */
> > ASSERT(refcount_read(&delayed_node->refs) == 0);
> > xa_erase(&root->delayed_nodes, delayed_node->inode_id);
> > - spin_unlock(&root->inode_lock);
> > kmem_cache_free(delayed_node_cache, delayed_node);
> > }
> > }
> > @@ -2057,9 +2055,9 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
> > struct btrfs_delayed_node *node;
> > int count;
> >
> > - spin_lock(&root->inode_lock);
> > + xa_lock(&root->delayed_nodes);
> > if (xa_empty(&root->delayed_nodes)) {
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> > return;
> > }
> >
> > @@ -2076,7 +2074,7 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
> > if (count >= ARRAY_SIZE(delayed_nodes))
> > break;
> > }
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->delayed_nodes);
> > index++;
> >
> > for (int i = 0; i < count; i++) {
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index ed40fe1db53e..d20e400a9ce3 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -674,7 +674,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> > INIT_LIST_HEAD(&root->ordered_extents);
> > INIT_LIST_HEAD(&root->ordered_root);
> > INIT_LIST_HEAD(&root->reloc_dirty_list);
> > - spin_lock_init(&root->inode_lock);
> > spin_lock_init(&root->delalloc_lock);
> > spin_lock_init(&root->ordered_extent_lock);
> > spin_lock_init(&root->accounting_lock);
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8ea9fd4c2b66..4fd41d6b377f 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -5509,9 +5509,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
> > return ret;
> > }
> >
> > - spin_lock(&root->inode_lock);
> > existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
> > - spin_unlock(&root->inode_lock);
> >
> > if (xa_is_err(existing)) {
> > ret = xa_err(existing);
> > @@ -5531,16 +5529,16 @@ static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
> > struct btrfs_inode *entry;
> > bool empty = false;
> >
> > - spin_lock(&root->inode_lock);
> > - entry = xa_erase(&root->inodes, btrfs_ino(inode));
> > + xa_lock(&root->inodes);
> > + entry = __xa_erase(&root->inodes, btrfs_ino(inode));
> > if (entry == inode)
> > empty = xa_empty(&root->inodes);
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->inodes);
> >
> > if (empty && btrfs_root_refs(&root->root_item) == 0) {
> > - spin_lock(&root->inode_lock);
> > + xa_lock(&root->inodes);
> > empty = xa_empty(&root->inodes);
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->inodes);
> > if (empty)
> > btrfs_add_dead_root(root);
> > }
> > @@ -10871,7 +10869,7 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
> > struct btrfs_inode *inode;
> > unsigned long from = min_ino;
> >
> > - spin_lock(&root->inode_lock);
> > + xa_lock(&root->inodes);
> > while (true) {
> > inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
> > if (!inode)
> > @@ -10880,9 +10878,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
> > break;
> >
> > from = btrfs_ino(inode) + 1;
> > - cond_resched_lock(&root->inode_lock);
> > + cond_resched_lock(&root->inodes.xa_lock);
> > }
> > - spin_unlock(&root->inode_lock);
> > + xa_unlock(&root->inodes);
> >
> > return inode;
> > }
^ permalink raw reply [flat|nested] 37+ messages in thread
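The lifetime issue Filipe describes — a looked-up object may be freed the moment the lock is dropped unless a reference is taken first — can be illustrated with a small userspace analogue. This sketch uses a pthread mutex and a single-slot "store" as hypothetical stand-ins for the xarray and its xa_lock; it is the lookup pattern, not the kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct node {
	int refs;      /* protected by store_lock */
	int inode_id;
};

static pthread_mutex_t store_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *store;  /* single-slot stand-in for the xarray */

/*
 * Safe lookup: bump the reference count while the lock is still held,
 * so a concurrent eraser cannot free the node under us. Doing the
 * lookup without the lock would leave a window between "load pointer"
 * and "take reference" in which the node could be erased and freed.
 */
static struct node *get_node(void)
{
	struct node *n;

	pthread_mutex_lock(&store_lock);
	n = store;
	if (n)
		n->refs++;  /* take a reference before unlocking */
	pthread_mutex_unlock(&store_lock);
	return n;
}

static void put_node(struct node *n)
{
	int free_it;

	pthread_mutex_lock(&store_lock);
	free_it = (--n->refs == 0);
	if (free_it)
		store = NULL;  /* erase from the store before freeing */
	pthread_mutex_unlock(&store_lock);
	if (free_it)
		free(n);
}
```

This is why the patch keeps the lookup under xa_lock() rather than relying on xa_load()'s RCU protection alone: RCU only guarantees the xarray traversal itself is safe, not that the returned object stays alive afterwards.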
* Re: [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
2024-05-09 0:30 ` Qu Wenruo
@ 2024-05-09 8:39 ` Filipe Manana
0 siblings, 0 replies; 37+ messages in thread
From: Filipe Manana @ 2024-05-09 8:39 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, May 9, 2024 at 1:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The index_cnt field of struct btrfs_inode is used only for two purposes:
> >
> > 1) To store the index for the next entry added to a directory;
> >
> > 2) For the data relocation inode to track the logical start address of the
> > block group currently being relocated.
> >
> > For the relocation case we use index_cnt because it's not used for
> > anything else in the relocation use case - we could have used other fields
> > that are not used by relocation such as defrag_bytes, last_unlink_trans
> > or last_reflink_trans for example (amongst others).
> >
> > Since the csum_bytes field is not used for directories, do the following
> > changes:
> >
> > 1) Put index_cnt and csum_bytes in a union, and index_cnt is only
> > initialized when the inode is a directory. The csum_bytes is only
> > accessed in IO paths for regular files, so we're fine here;
> >
> > 2) Use the defrag_bytes field for relocation, since the data relocation
> > inode is never used for defrag purposes. And to make the naming better,
> > alias it to reloc_block_group_start by using a union.
> >
> > This reduces the size of struct btrfs_inode by 8 bytes in a release
> > kernel, from 1040 bytes down to 1032 bytes.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/btrfs_inode.h | 46 +++++++++++++++++++++++++---------------
> > fs/btrfs/delayed-inode.c | 3 ++-
> > fs/btrfs/inode.c | 21 ++++++++++++------
> > fs/btrfs/relocation.c | 12 +++++------
> > fs/btrfs/tree-log.c | 3 ++-
> > 5 files changed, 54 insertions(+), 31 deletions(-)
> >
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index e577b9745884..19bb3d057414 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -215,11 +215,20 @@ struct btrfs_inode {
> > u64 last_dir_index_offset;
> > };
> >
> > - /*
> > - * Total number of bytes pending defrag, used by stat to check whether
> > - * it needs COW. Protected by 'lock'.
> > - */
> > - u64 defrag_bytes;
> > + union {
> > + /*
> > + * Total number of bytes pending defrag, used by stat to check whether
> > + * it needs COW. Protected by 'lock'.
> > + * Used by inodes other than the data relocation inode.
> > + */
> > + u64 defrag_bytes;
> > +
> > + /*
> > + * Logical address of the block group being relocated.
> > + * Used only by the data relocation inode.
> > + */
> > + u64 reloc_block_group_start;
> > + };
> >
> > /*
> > * The size of the file stored in the metadata on disk. data=ordered
> > @@ -228,12 +237,21 @@ struct btrfs_inode {
> > */
> > u64 disk_i_size;
> >
> > - /*
> > - * If this is a directory then index_cnt is the counter for the index
> > - * number for new files that are created. For an empty directory, this
> > - * must be initialized to BTRFS_DIR_START_INDEX.
> > - */
> > - u64 index_cnt;
> > + union {
> > + /*
> > + * If this is a directory then index_cnt is the counter for the
> > + * index number for new files that are created. For an empty
> > + * directory, this must be initialized to BTRFS_DIR_START_INDEX.
> > + */
> > + u64 index_cnt;
> > +
> > + /*
> > + * If this is not a directory, this is the number of bytes
> > + * outstanding that are going to need csums. This is used in
> > + * ENOSPC accounting. Protected by 'lock'.
> > + */
> > + u64 csum_bytes;
> > + };
> >
> > /* Cache the directory index number to speed the dir/file remove */
> > u64 dir_index;
> > @@ -256,12 +274,6 @@ struct btrfs_inode {
> > */
> > u64 last_reflink_trans;
> >
> > - /*
> > - * Number of bytes outstanding that are going to need csums. This is
> > - * used in ENOSPC accounting. Protected by 'lock'.
> > - */
> > - u64 csum_bytes;
> > -
> > /* Backwards incompatible flags, lower half of inode_item::flags */
> > u32 flags;
> > /* Read-only compatibility flags, upper half of inode_item::flags */
> > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> > index 1373f474c9b6..e298a44eaf69 100644
> > --- a/fs/btrfs/delayed-inode.c
> > +++ b/fs/btrfs/delayed-inode.c
> > @@ -1914,7 +1914,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
> > BTRFS_I(inode)->i_otime_nsec = btrfs_stack_timespec_nsec(&inode_item->otime);
> >
> > inode->i_generation = BTRFS_I(inode)->generation;
> > - BTRFS_I(inode)->index_cnt = (u64)-1;
> > + if (S_ISDIR(inode->i_mode))
> > + BTRFS_I(inode)->index_cnt = (u64)-1;
> >
> > mutex_unlock(&delayed_node->mutex);
> > btrfs_release_delayed_node(delayed_node);
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 4fd41d6b377f..9b98aa65cc63 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3856,7 +3856,9 @@ static int btrfs_read_locked_inode(struct inode *inode,
> > inode->i_rdev = 0;
> > rdev = btrfs_inode_rdev(leaf, inode_item);
> >
> > - BTRFS_I(inode)->index_cnt = (u64)-1;
> > + if (S_ISDIR(inode->i_mode))
> > + BTRFS_I(inode)->index_cnt = (u64)-1;
> > +
> > btrfs_inode_split_flags(btrfs_inode_flags(leaf, inode_item),
> > &BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
> >
> > @@ -6268,8 +6270,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> > if (ret)
> > goto out;
> > }
> > - /* index_cnt is ignored for everything but a dir. */
> > - BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
> > +
> > + if (S_ISDIR(inode->i_mode))
> > + BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
> > +
> > BTRFS_I(inode)->generation = trans->transid;
> > inode->i_generation = BTRFS_I(inode)->generation;
> >
> > @@ -8435,8 +8439,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> > ei->disk_i_size = 0;
> > ei->flags = 0;
> > ei->ro_flags = 0;
> > + /*
> > > + * ->index_cnt will be properly initialized later when creating a new
> > + * inode (btrfs_create_new_inode()) or when reading an existing inode
> > + * from disk (btrfs_read_locked_inode()).
> > + */
>
> Would the above comment be a little confusing?
> As the comment sits next to csum_bytes, without checking the definition it's
> not clear that csum_bytes and index_cnt share the same storage.
>
> Maybe just removing it would be good enough?
I found the comment useful, and there's no better place to put it.
If everyone finds it useless, I'll remove it.
Thanks.
>
> Otherwise looks good to me.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
> > ei->csum_bytes = 0;
> > - ei->index_cnt = (u64)-1;
> > ei->dir_index = 0;
> > ei->last_unlink_trans = 0;
> > ei->last_reflink_trans = 0;
> > @@ -8511,9 +8519,10 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
> > if (!S_ISDIR(vfs_inode->i_mode)) {
> > WARN_ON(inode->delalloc_bytes);
> > WARN_ON(inode->new_delalloc_bytes);
> > + WARN_ON(inode->csum_bytes);
> > }
> > - WARN_ON(inode->csum_bytes);
> > - WARN_ON(inode->defrag_bytes);
> > + if (!root || !btrfs_is_data_reloc_root(root))
> > + WARN_ON(inode->defrag_bytes);
> >
> > /*
> > * This can happen where we create an inode, but somebody else also
> > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > index 8b24bb5a0aa1..9f35524b6664 100644
> > --- a/fs/btrfs/relocation.c
> > +++ b/fs/btrfs/relocation.c
> > @@ -962,7 +962,7 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
> > if (!path)
> > return -ENOMEM;
> >
> > - bytenr -= BTRFS_I(reloc_inode)->index_cnt;
> > + bytenr -= BTRFS_I(reloc_inode)->reloc_block_group_start;
> > ret = btrfs_lookup_file_extent(NULL, root, path,
> > btrfs_ino(BTRFS_I(reloc_inode)), bytenr, 0);
> > if (ret < 0)
> > @@ -2797,7 +2797,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
> > u64 alloc_hint = 0;
> > u64 start;
> > u64 end;
> > - u64 offset = inode->index_cnt;
> > + u64 offset = inode->reloc_block_group_start;
> > u64 num_bytes;
> > int nr;
> > int ret = 0;
> > @@ -2951,7 +2951,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
> > int *cluster_nr, unsigned long index)
> > {
> > struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> > - u64 offset = BTRFS_I(inode)->index_cnt;
> > + u64 offset = BTRFS_I(inode)->reloc_block_group_start;
> > const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
> > gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
> > struct folio *folio;
> > @@ -3086,7 +3086,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
> > static int relocate_file_extent_cluster(struct inode *inode,
> > const struct file_extent_cluster *cluster)
> > {
> > - u64 offset = BTRFS_I(inode)->index_cnt;
> > + u64 offset = BTRFS_I(inode)->reloc_block_group_start;
> > unsigned long index;
> > unsigned long last_index;
> > struct file_ra_state *ra;
> > @@ -3915,7 +3915,7 @@ static noinline_for_stack struct inode *create_reloc_inode(
> > inode = NULL;
> > goto out;
> > }
> > - BTRFS_I(inode)->index_cnt = group->start;
> > + BTRFS_I(inode)->reloc_block_group_start = group->start;
> >
> > ret = btrfs_orphan_add(trans, BTRFS_I(inode));
> > out:
> > @@ -4395,7 +4395,7 @@ int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered)
> > {
> > struct btrfs_inode *inode = BTRFS_I(ordered->inode);
> > struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > - u64 disk_bytenr = ordered->file_offset + inode->index_cnt;
> > + u64 disk_bytenr = ordered->file_offset + inode->reloc_block_group_start;
> > struct btrfs_root *csum_root = btrfs_csum_root(fs_info, disk_bytenr);
> > LIST_HEAD(list);
> > int ret;
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 5146387b416b..0aee43466c52 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -1625,7 +1625,8 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
> > if (ret)
> > goto out;
> > }
> > - BTRFS_I(inode)->index_cnt = (u64)-1;
> > + if (S_ISDIR(inode->i_mode))
> > + BTRFS_I(inode)->index_cnt = (u64)-1;
> >
> > if (inode->i_nlink == 0) {
> > if (S_ISDIR(inode->i_mode)) {
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files
2024-05-09 0:39 ` Qu Wenruo
@ 2024-05-09 8:41 ` Filipe Manana
2024-05-13 18:39 ` David Sterba
0 siblings, 1 reply; 37+ messages in thread
From: Filipe Manana @ 2024-05-09 8:41 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, May 9, 2024 at 1:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/5/8 21:47, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When not using the NO_HOLES feature we always allocate an io tree for an
> > inode's file_extent_tree.
>
> I'm wondering whether we can drop non-NO_HOLES support directly so that we
> can get rid of the file_extent_tree completely?
>
> Initially I wondered why NO_HOLES makes a difference, especially as
> NO_HOLES can be enabled halfway, thus we can still have explicit hole extents.
>
> But it turns out that the whole file_extent_tree is only there to handle the
> non-NO_HOLES case, so that we always have a hole for the whole file range
> until i_size.
>
> Considering NO_HOLES has been considered safe since the 4.0 kernel, can we
> start deprecating non-NO_HOLES?
NO_HOLES has been the mkfs default for only 2 or 3 years IIRC. It's not that long.
And how do you know everyone is already using NO_HOLES?
Last week I saw a user filesystem that doesn't use it.
And probably there are many more.
>
> > This is wasteful because that io tree is only
> > used for regular files, so we allocate more memory than needed for inodes
> > that represent directories or symlinks for example, or for inodes that
> > correspond to free space inodes.
> >
> > So improve on this by allocating the io tree only for inodes of regular
> > files that are not free space inodes.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> Otherwise looks good to me.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
> > ---
> > fs/btrfs/file-item.c | 13 ++++++-----
> > fs/btrfs/inode.c | 53 +++++++++++++++++++++++++++++---------------
> > 2 files changed, 42 insertions(+), 24 deletions(-)
> >
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index bce95f871750..f3ed78e21fa4 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -45,13 +45,12 @@
> > */
> > void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
> > {
> > - struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > u64 start, end, i_size;
> > int ret;
> >
> > spin_lock(&inode->lock);
> > i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
> > - if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
> > + if (!inode->file_extent_tree) {
> > inode->disk_i_size = i_size;
> > goto out_unlock;
> > }
> > @@ -84,13 +83,14 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
> > int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
> > u64 len)
> > {
> > + if (!inode->file_extent_tree)
> > + return 0;
> > +
> > if (len == 0)
> > return 0;
> >
> > ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));
> >
> > - if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
> > - return 0;
> > return set_extent_bit(inode->file_extent_tree, start, start + len - 1,
> > EXTENT_DIRTY, NULL);
> > }
> > @@ -112,14 +112,15 @@ int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
> > int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
> > u64 len)
> > {
> > + if (!inode->file_extent_tree)
> > + return 0;
> > +
> > if (len == 0)
> > return 0;
> >
> > ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
> > len == (u64)-1);
> >
> > - if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
> > - return 0;
> > return clear_extent_bit(inode->file_extent_tree, start,
> > start + len - 1, EXTENT_DIRTY, NULL);
> > }
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 9b98aa65cc63..175fd007f0ef 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3781,6 +3781,30 @@ static noinline int acls_after_inode_item(struct extent_buffer *leaf,
> > return 1;
> > }
> >
> > +static int btrfs_init_file_extent_tree(struct btrfs_inode *inode)
> > +{
> > + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +
> > + if (WARN_ON_ONCE(inode->file_extent_tree))
> > + return 0;
> > + if (btrfs_fs_incompat(fs_info, NO_HOLES))
> > + return 0;
> > + if (!S_ISREG(inode->vfs_inode.i_mode))
> > + return 0;
> > + if (btrfs_is_free_space_inode(inode))
> > + return 0;
> > +
> > + inode->file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
> > + if (!inode->file_extent_tree)
> > + return -ENOMEM;
> > +
> > + extent_io_tree_init(fs_info, inode->file_extent_tree, IO_TREE_INODE_FILE_EXTENT);
> > + /* Lockdep class is set only for the file extent tree. */
> > + lockdep_set_class(&inode->file_extent_tree->lock, &file_extent_tree_class);
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * read an inode from the btree into the in-memory inode
> > */
> > @@ -3800,6 +3824,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
> > bool filled = false;
> > int first_xattr_slot;
> >
> > + ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
> > + if (ret)
> > + return ret;
> > +
> > ret = btrfs_fill_inode(inode, &rdev);
> > if (!ret)
> > filled = true;
> > @@ -6247,6 +6275,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
> > BTRFS_I(inode)->root = btrfs_grab_root(BTRFS_I(dir)->root);
> > root = BTRFS_I(inode)->root;
> >
> > + ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
> > + if (ret)
> > + goto out;
> > +
> > ret = btrfs_get_free_objectid(root, &objectid);
> > if (ret)
> > goto out;
> > @@ -8413,20 +8445,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> > struct btrfs_fs_info *fs_info = btrfs_sb(sb);
> > struct btrfs_inode *ei;
> > struct inode *inode;
> > - struct extent_io_tree *file_extent_tree = NULL;
> > -
> > - /* Self tests may pass a NULL fs_info. */
> > - if (fs_info && !btrfs_fs_incompat(fs_info, NO_HOLES)) {
> > - file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
> > - if (!file_extent_tree)
> > - return NULL;
> > - }
> >
> > ei = alloc_inode_sb(sb, btrfs_inode_cachep, GFP_KERNEL);
> > - if (!ei) {
> > - kfree(file_extent_tree);
> > + if (!ei)
> > return NULL;
> > - }
> >
> > ei->root = NULL;
> > ei->generation = 0;
> > @@ -8471,13 +8493,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> > extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO);
> > ei->io_tree.inode = ei;
> >
> > - ei->file_extent_tree = file_extent_tree;
> > - if (file_extent_tree) {
> > - extent_io_tree_init(fs_info, ei->file_extent_tree,
> > - IO_TREE_INODE_FILE_EXTENT);
> > - /* Lockdep class is set only for the file extent tree. */
> > - lockdep_set_class(&ei->file_extent_tree->lock, &file_extent_tree_class);
> > - }
> > + ei->file_extent_tree = NULL;
> > +
> > mutex_init(&ei->log_mutex);
> > spin_lock_init(&ei->ordered_tree_lock);
> > ei->ordered_tree = RB_ROOT;
* Re: [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks
2024-05-09 8:38 ` Filipe Manana
@ 2024-05-09 8:42 ` Qu Wenruo
0 siblings, 0 replies; 37+ messages in thread
From: Qu Wenruo @ 2024-05-09 8:42 UTC
To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs
On 2024/5/9 18:08, Filipe Manana wrote:
> On Thu, May 9, 2024 at 1:25 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2024/5/8 21:47, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Currently we use the spinlock inode_lock from struct btrfs_root to
>>> serialize access to two different data structures:
>>>
>>> 1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
>>> 2) The inodes xarray (struct btrfs_root::inodes).
>>>
>>> Instead of using our own lock, we can use the spinlock that is part of the
>>> xarray implementation, by using the xa_lock() and xa_unlock() APIs and
>>> using the xarray APIs with the double underscore prefix that don't take
>>> the xarray locks and assume the caller is using xa_lock() and xa_unlock().
>>>
>>> So remove the spinlock inode_lock from struct btrfs_root and use the
>>> corresponding xarray locks. This brings 2 benefits:
>>>
>>> 1) We reduce the size of struct btrfs_root, from 1336 bytes down to
>>> 1328 bytes on a 64 bits release kernel config;
>>>
>>> 2) We reduce lock contention by not using anymore the same lock for
>>> changing two different and unrelated xarrays.
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>> fs/btrfs/ctree.h | 1 -
>>> fs/btrfs/delayed-inode.c | 24 +++++++++++-------------
>>> fs/btrfs/disk-io.c | 1 -
>>> fs/btrfs/inode.c | 18 ++++++++----------
>>> 4 files changed, 19 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index aa2568f86dc9..1004cb934b4a 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -221,7 +221,6 @@ struct btrfs_root {
>>>
>>> struct list_head root_list;
>>>
>>> - spinlock_t inode_lock;
>>> /*
>>> * Xarray that keeps track of in-memory inodes, protected by the lock
>>> * @inode_lock.
>>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
>>> index 95a0497fa866..1373f474c9b6 100644
>>> --- a/fs/btrfs/delayed-inode.c
>>> +++ b/fs/btrfs/delayed-inode.c
>>> @@ -77,14 +77,14 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
>>> return node;
>>> }
>>>
>>> - spin_lock(&root->inode_lock);
>>> + xa_lock(&root->delayed_nodes);
>>> node = xa_load(&root->delayed_nodes, ino);
>>
>> Do we need xa_lock() here?
>>
>> The doc shows xa_load() use RCU read lock already.
>> Only xa_store()/xa_find() would take xa_lock internally, thus they need
>> to be converted.
>>
>> Or did I miss something else?
>
> The RCU is only for protection against concurrent xarray operations
> that modify the xarray.
> After xa_load() returns, the "node" might have been removed from the
> xarray and freed by another task.
>
> That's why this change is a straightforward switch from one lock to another.
>
> It may be that our code is structured in a way that we can get away
> without the lock.
> But that needs to be properly analysed and given that it's a
> non-trivial behavioural change, should have its own separate patch and
> change log with the analysis.
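The use-after-free window described above can be sketched as a timeline. This is illustrative pseudocode against the kernel xarray API, not a standalone compilable example:

```c
/* Task A (lookup)                      Task B (teardown)
 *
 * node = xa_load(&root->delayed_nodes, ino);
 * // xa_load()'s internal RCU only guarantees the xarray
 * // structure itself is safe to walk, not the lifetime
 * // of the returned object.
 *                                      xa_erase(&root->delayed_nodes, ino);
 *                                      kmem_cache_free(delayed_node_cache, node);
 * refcount_inc(&node->refs);           // use after free!
 *
 * Holding xa_lock() across both the xa_load() and the
 * refcount_inc() closes the window, because the xa_erase()
 * in task B has to wait for the same lock.
 */
```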
OK, got it.
Then it looks fine to me.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
>
> Thanks.
>
>
>>
>> Thanks,
>> Qu
>>>
>>> if (node) {
>>> if (btrfs_inode->delayed_node) {
>>> refcount_inc(&node->refs); /* can be accessed */
>>> BUG_ON(btrfs_inode->delayed_node != node);
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>> return node;
>>> }
>>>
>>> @@ -111,10 +111,10 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
>>> node = NULL;
>>> }
>>>
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>> return node;
>>> }
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>>
>>> return NULL;
>>> }
>>> @@ -148,21 +148,21 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
>>> kmem_cache_free(delayed_node_cache, node);
>>> return ERR_PTR(-ENOMEM);
>>> }
>>> - spin_lock(&root->inode_lock);
>>> + xa_lock(&root->delayed_nodes);
>>> ptr = xa_load(&root->delayed_nodes, ino);
>>> if (ptr) {
>>> /* Somebody inserted it, go back and read it. */
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>> kmem_cache_free(delayed_node_cache, node);
>>> node = NULL;
>>> goto again;
>>> }
>>> - ptr = xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
>>> + ptr = __xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
>>> ASSERT(xa_err(ptr) != -EINVAL);
>>> ASSERT(xa_err(ptr) != -ENOMEM);
>>> ASSERT(ptr == NULL);
>>> btrfs_inode->delayed_node = node;
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>>
>>> return node;
>>> }
>>> @@ -275,14 +275,12 @@ static void __btrfs_release_delayed_node(
>>> if (refcount_dec_and_test(&delayed_node->refs)) {
>>> struct btrfs_root *root = delayed_node->root;
>>>
>>> - spin_lock(&root->inode_lock);
>>> /*
>>> * Once our refcount goes to zero, nobody is allowed to bump it
>>> * back up. We can delete it now.
>>> */
>>> ASSERT(refcount_read(&delayed_node->refs) == 0);
>>> xa_erase(&root->delayed_nodes, delayed_node->inode_id);
>>> - spin_unlock(&root->inode_lock);
>>> kmem_cache_free(delayed_node_cache, delayed_node);
>>> }
>>> }
>>> @@ -2057,9 +2055,9 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
>>> struct btrfs_delayed_node *node;
>>> int count;
>>>
>>> - spin_lock(&root->inode_lock);
>>> + xa_lock(&root->delayed_nodes);
>>> if (xa_empty(&root->delayed_nodes)) {
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>> return;
>>> }
>>>
>>> @@ -2076,7 +2074,7 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
>>> if (count >= ARRAY_SIZE(delayed_nodes))
>>> break;
>>> }
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->delayed_nodes);
>>> index++;
>>>
>>> for (int i = 0; i < count; i++) {
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index ed40fe1db53e..d20e400a9ce3 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -674,7 +674,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>>> INIT_LIST_HEAD(&root->ordered_extents);
>>> INIT_LIST_HEAD(&root->ordered_root);
>>> INIT_LIST_HEAD(&root->reloc_dirty_list);
>>> - spin_lock_init(&root->inode_lock);
>>> spin_lock_init(&root->delalloc_lock);
>>> spin_lock_init(&root->ordered_extent_lock);
>>> spin_lock_init(&root->accounting_lock);
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 8ea9fd4c2b66..4fd41d6b377f 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -5509,9 +5509,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
>>> return ret;
>>> }
>>>
>>> - spin_lock(&root->inode_lock);
>>> existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
>>> - spin_unlock(&root->inode_lock);
>>>
>>> if (xa_is_err(existing)) {
>>> ret = xa_err(existing);
>>> @@ -5531,16 +5529,16 @@ static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
>>> struct btrfs_inode *entry;
>>> bool empty = false;
>>>
>>> - spin_lock(&root->inode_lock);
>>> - entry = xa_erase(&root->inodes, btrfs_ino(inode));
>>> + xa_lock(&root->inodes);
>>> + entry = __xa_erase(&root->inodes, btrfs_ino(inode));
>>> if (entry == inode)
>>> empty = xa_empty(&root->inodes);
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->inodes);
>>>
>>> if (empty && btrfs_root_refs(&root->root_item) == 0) {
>>> - spin_lock(&root->inode_lock);
>>> + xa_lock(&root->inodes);
>>> empty = xa_empty(&root->inodes);
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->inodes);
>>> if (empty)
>>> btrfs_add_dead_root(root);
>>> }
>>> @@ -10871,7 +10869,7 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
>>> struct btrfs_inode *inode;
>>> unsigned long from = min_ino;
>>>
>>> - spin_lock(&root->inode_lock);
>>> + xa_lock(&root->inodes);
>>> while (true) {
>>> inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
>>> if (!inode)
>>> @@ -10880,9 +10878,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
>>> break;
>>>
>>> from = btrfs_ino(inode) + 1;
>>> - cond_resched_lock(&root->inode_lock);
>>> + cond_resched_lock(&root->inodes.xa_lock);
>>> }
>>> - spin_unlock(&root->inode_lock);
>>> + xa_unlock(&root->inodes);
>>>
>>> return inode;
>>> }
>
* Re: [PATCH 0/8] btrfs: inode management and memory consumption improvements
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (7 preceding siblings ...)
2024-05-08 12:17 ` [PATCH 8/8] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
@ 2024-05-09 17:56 ` David Sterba
2024-05-10 11:04 ` Filipe Manana
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
9 siblings, 1 reply; 37+ messages in thread
From: David Sterba @ 2024-05-09 17:56 UTC
To: fdmanana; +Cc: linux-btrfs
On Wed, May 08, 2024 at 01:17:23PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Some inode related improvements, to use an xarray to track open inodes per
> root instead of a red black tree, reduce lock contention and use less memory
> per btrfs inode, so now we can fit 4 inodes per 4K page instead of 3.
> More details in the change logs.
Outstanding! You managed to reduce the size by 48 bytes, on my config
from 1080 to 1032. Which unfortunately means it's still 3 inodes in a
page. The config is maximal regarding the conditional features that
affect size of struct inode. All of them could be enabled on distro
kernels (checked on openSUSE):
Ifdefs in include/linux/fs.h struct inode:
#ifdef CONFIG_FS_POSIX_ACL
#ifdef CONFIG_SECURITY
#ifdef CONFIG_CGROUP_WRITEBACK
#ifdef CONFIG_FSNOTIFY
#ifdef CONFIG_FS_ENCRYPTION
#ifdef CONFIG_FS_VERITY
There's also #ifdef __NEED_I_SIZE_ORDERED but that's for 32bit only.
This is the pahole diff summary before and after the patchset on
for-next with my reference release config:
- /* size: 1080, cachelines: 17, members: 39 */
- /* sum members: 1075, holes: 2, sum holes: 5 */
- /* forced alignments: 2 */
- /* last cacheline: 56 bytes */
+ /* size: 1032, cachelines: 17, members: 36 */
+ /* sum members: 1026, holes: 2, sum holes: 6 */
+ /* forced alignments: 1 */
+ /* last cacheline: 8 bytes */
The sum is still over 1024 so we'll need to find more tricks to reduce
the space. There are 2 holes, one is 4 bytes (after i_otime_nsec) so
there's still some potential.
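The layout summaries quoted above come from pahole; they can be reproduced on any kernel build with debug info. The command below is an example invocation, and the object path will vary per build:

```
# Print the full layout of struct btrfs_inode, including holes and
# padding, from a kernel image built with CONFIG_DEBUG_INFO:
pahole -C btrfs_inode vmlinux
```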
* Re: [PATCH 0/8] btrfs: inode management and memory consumption improvements
2024-05-09 17:56 ` [PATCH 0/8] btrfs: inode management and memory consumption improvements David Sterba
@ 2024-05-10 11:04 ` Filipe Manana
0 siblings, 0 replies; 37+ messages in thread
From: Filipe Manana @ 2024-05-10 11:04 UTC
To: dsterba; +Cc: linux-btrfs
On Thu, May 9, 2024 at 6:56 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, May 08, 2024 at 01:17:23PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Some inode related improvements, to use an xarray to track open inodes per
> > root instead of a red black tree, reduce lock contention and use less memory
> > per btrfs inode, so now we can fit 4 inodes per 4K page instead of 3.
> > More details in the change logs.
>
> Outstanding! You managed to reduce the size by 48 bytes, on my config
> from 1080 to 1032. Which unfortunately means it's still 3 inodes in a
> page. The config is maximal regarding the conditional features that
> affect size of struct inode. All of them could be enabled on distro
> kernels (checked on openSUSE):
>
> Ifdefs in include/linux/fs.h struct inode:
>
> #ifdef CONFIG_FS_POSIX_ACL
> #ifdef CONFIG_SECURITY
> #ifdef CONFIG_CGROUP_WRITEBACK
> #ifdef CONFIG_FSNOTIFY
> #ifdef CONFIG_FS_ENCRYPTION
> #ifdef CONFIG_FS_VERITY
>
> There's also #ifdef __NEED_I_SIZE_ORDERED but that's for 32bit only.
>
> This is the pahole diff summary before and after the patchset on
> for-next with my reference release config:
>
> - /* size: 1080, cachelines: 17, members: 39 */
> - /* sum members: 1075, holes: 2, sum holes: 5 */
> - /* forced alignments: 2 */
> - /* last cacheline: 56 bytes */
> + /* size: 1032, cachelines: 17, members: 36 */
> + /* sum members: 1026, holes: 2, sum holes: 6 */
> + /* forced alignments: 1 */
> + /* last cacheline: 8 bytes */
>
> The sum is still over 1024 so we'll need to find more tricks to reduce
> the space. There are 2 holes, one is 4 bytes (after i_otime_nsec) so
> there's still some potential.
I'm seeing a reduction down to 1016 bytes because I don't have
CONFIG_FS_ENCRYPTION and CONFIG_FS_VERITY set.
It's a very old kernel config I keep reusing for several years, so
that explains it.
* [PATCH v2 00/10] btrfs: inode management and memory consumption improvements
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
` (8 preceding siblings ...)
2024-05-09 17:56 ` [PATCH 0/8] btrfs: inode management and memory consumption improvements David Sterba
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root fdmanana
` (11 more replies)
9 siblings, 12 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Some inode related improvements, to use an xarray to track open inodes per
root instead of a red black tree, reduce lock contention and use less memory
per btrfs inode, so now we can fit 4 inodes per 4K page instead of 3.
More details in the change logs.
V2: Added two more patches (9/10 and 10/10) to make sure btrfs_inode size
is reduced to 1024 bytes when CONFIG_FS_ENCRYPTION and CONFIG_FS_VERITY
are set. I wasn't using these configs before, and with them the final
size for btrfs_inode was 1032 bytes and not 1016 bytes as I initially
had - now the final size is 1024 bytes with those configs enabled.
Filipe Manana (10):
btrfs: use an xarray to track open inodes in a root
btrfs: preallocate inodes xarray entry to avoid transaction abort
btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
btrfs: remove inode_lock from struct btrfs_root and use xarray locks
btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
btrfs: don't allocate file extent tree for non regular files
btrfs: remove location key from struct btrfs_inode
btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
btrfs: rename rb_root member of extent_map_tree from map to root
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree
fs/btrfs/btrfs_inode.h | 130 ++++++++++----
fs/btrfs/ctree.h | 8 +-
fs/btrfs/delayed-inode.c | 29 ++-
fs/btrfs/disk-io.c | 12 +-
fs/btrfs/export.c | 2 +-
fs/btrfs/extent_map.c | 50 +++---
fs/btrfs/extent_map.h | 2 +-
fs/btrfs/file-item.c | 13 +-
fs/btrfs/inode.c | 286 +++++++++++++++---------------
fs/btrfs/ioctl.c | 8 +-
fs/btrfs/relocation.c | 12 +-
fs/btrfs/tests/btrfs-tests.c | 5 +-
fs/btrfs/tests/extent-map-tests.c | 6 +-
fs/btrfs/tree-log.c | 9 +-
14 files changed, 317 insertions(+), 255 deletions(-)
--
2.43.0
* [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-14 15:49 ` David Sterba
2024-05-10 17:32 ` [PATCH v2 02/10] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
` (10 subsequent siblings)
11 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
us to reduce the size of the structure to 1024 bytes, meaning we will be
able to store 4 inodes per 4K page instead of 3 inodes.
This change does a straightforward switch to an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 92081.6 12505547
16 2400000 0 138222.6 13067072
23 3600000 0 148833.1 13290336
43 4800000 0 97864.7 13931248
53 6000000 0 85597.3 14384313
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 93225.1 12571078
16 2400000 0 146720.3 12805007
23 3600000 0 160626.4 13073835
46 4800000 0 116286.2 13802927
53 6000000 0 90087.9 14754892
The test was run with a release kernel config (Debian's default config).
Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
Average, percentiles, standard deviation, etc, are all much better.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 3 -
fs/btrfs/ctree.h | 7 ++-
fs/btrfs/disk-io.c | 6 +-
fs/btrfs/inode.c | 128 ++++++++++++++++-------------------------
4 files changed, 58 insertions(+), 86 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 91c994b569f3..e577b9745884 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -155,9 +155,6 @@ struct btrfs_inode {
*/
struct list_head delalloc_inodes;
- /* node for the red-black tree that links inodes in subvolume root */
- struct rb_node rb_node;
-
unsigned long runtime_flags;
/* full 64 bit generation number, struct vfs_inode doesn't have a big
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c03c58246033..aa2568f86dc9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -222,8 +222,11 @@ struct btrfs_root {
struct list_head root_list;
spinlock_t inode_lock;
- /* red-black tree that keeps track of in-memory inodes */
- struct rb_root inode_tree;
+ /*
+ * Xarray that keeps track of in-memory inodes, protected by the lock
+ * @inode_lock.
+ */
+ struct xarray inodes;
/*
* Xarray that keeps track of delayed nodes of every inode, protected
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a91a8056758a..ed40fe1db53e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -662,7 +662,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->free_objectid = 0;
root->nr_delalloc_inodes = 0;
root->nr_ordered_extents = 0;
- root->inode_tree = RB_ROOT;
+ xa_init(&root->inodes);
xa_init(&root->delayed_nodes);
btrfs_init_root_block_rsv(root);
@@ -1854,7 +1854,8 @@ void btrfs_put_root(struct btrfs_root *root)
return;
if (refcount_dec_and_test(&root->refs)) {
- WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
+ if (WARN_ON(!xa_empty(&root->inodes)))
+ xa_destroy(&root->inodes);
WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state));
if (root->anon_dev)
free_anon_bdev(root->anon_dev);
@@ -1939,7 +1940,6 @@ static int btrfs_init_btree_inode(struct super_block *sb)
inode->i_mapping->a_ops = &btree_aops;
mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
- RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
extent_io_tree_init(fs_info, &BTRFS_I(inode)->io_tree,
IO_TREE_BTREE_INODE_IO);
extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d0274324c75a..450fe1582f1d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5493,58 +5493,51 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
return err;
}
-static void inode_tree_add(struct btrfs_inode *inode)
+static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
{
struct btrfs_root *root = inode->root;
- struct btrfs_inode *entry;
- struct rb_node **p;
- struct rb_node *parent;
- struct rb_node *new = &inode->rb_node;
- u64 ino = btrfs_ino(inode);
+ struct btrfs_inode *existing;
+ const u64 ino = btrfs_ino(inode);
+ int ret;
if (inode_unhashed(&inode->vfs_inode))
- return;
- parent = NULL;
+ return 0;
+
+ ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
+ if (ret)
+ return ret;
+
spin_lock(&root->inode_lock);
- p = &root->inode_tree.rb_node;
- while (*p) {
- parent = *p;
- entry = rb_entry(parent, struct btrfs_inode, rb_node);
+ existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
+ spin_unlock(&root->inode_lock);
- if (ino < btrfs_ino(entry))
- p = &parent->rb_left;
- else if (ino > btrfs_ino(entry))
- p = &parent->rb_right;
- else {
- WARN_ON(!(entry->vfs_inode.i_state &
- (I_WILL_FREE | I_FREEING)));
- rb_replace_node(parent, new, &root->inode_tree);
- RB_CLEAR_NODE(parent);
- spin_unlock(&root->inode_lock);
- return;
- }
+ if (xa_is_err(existing)) {
+ ret = xa_err(existing);
+ ASSERT(ret != -EINVAL);
+ ASSERT(ret != -ENOMEM);
+ return ret;
+ } else if (existing) {
+ WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
}
- rb_link_node(new, parent, p);
- rb_insert_color(new, &root->inode_tree);
- spin_unlock(&root->inode_lock);
+
+ return 0;
}
-static void inode_tree_del(struct btrfs_inode *inode)
+static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
{
struct btrfs_root *root = inode->root;
- int empty = 0;
+ struct btrfs_inode *entry;
+ bool empty = false;
spin_lock(&root->inode_lock);
- if (!RB_EMPTY_NODE(&inode->rb_node)) {
- rb_erase(&inode->rb_node, &root->inode_tree);
- RB_CLEAR_NODE(&inode->rb_node);
- empty = RB_EMPTY_ROOT(&root->inode_tree);
- }
+ entry = xa_erase(&root->inodes, btrfs_ino(inode));
+ if (entry == inode)
+ empty = xa_empty(&root->inodes);
spin_unlock(&root->inode_lock);
if (empty && btrfs_root_refs(&root->root_item) == 0) {
spin_lock(&root->inode_lock);
- empty = RB_EMPTY_ROOT(&root->inode_tree);
+ empty = xa_empty(&root->inodes);
spin_unlock(&root->inode_lock);
if (empty)
btrfs_add_dead_root(root);
@@ -5613,8 +5606,13 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
ret = btrfs_read_locked_inode(inode, path);
if (!ret) {
- inode_tree_add(BTRFS_I(inode));
- unlock_new_inode(inode);
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode));
+ if (ret) {
+ iget_failed(inode);
+ inode = ERR_PTR(ret);
+ } else {
+ unlock_new_inode(inode);
+ }
} else {
iget_failed(inode);
/*
@@ -6426,7 +6424,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
}
}
- inode_tree_add(BTRFS_I(inode));
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode));
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto discard;
+ }
trace_btrfs_inode_new(inode);
btrfs_set_inode_last_trans(trans, BTRFS_I(inode));
@@ -8466,7 +8468,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->ordered_tree_last = NULL;
INIT_LIST_HEAD(&ei->delalloc_inodes);
INIT_LIST_HEAD(&ei->delayed_iput);
- RB_CLEAR_NODE(&ei->rb_node);
init_rwsem(&ei->i_mmap_lock);
return inode;
@@ -8538,7 +8539,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
}
}
btrfs_qgroup_check_reserved_leak(inode);
- inode_tree_del(inode);
+ btrfs_del_inode_from_root(inode);
btrfs_drop_extent_map_range(inode, 0, (u64)-1, false);
btrfs_inode_clear_file_extent_range(inode, 0, (u64)-1);
btrfs_put_root(inode->root);
@@ -10857,52 +10858,23 @@ void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 en
*/
struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
{
- struct rb_node *node;
- struct rb_node *prev;
struct btrfs_inode *inode;
+ unsigned long from = min_ino;
spin_lock(&root->inode_lock);
-again:
- node = root->inode_tree.rb_node;
- prev = NULL;
- while (node) {
- prev = node;
- inode = rb_entry(node, struct btrfs_inode, rb_node);
- if (min_ino < btrfs_ino(inode))
- node = node->rb_left;
- else if (min_ino > btrfs_ino(inode))
- node = node->rb_right;
- else
+ while (true) {
+ inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
+ if (!inode)
+ break;
+ if (igrab(&inode->vfs_inode))
break;
- }
-
- if (!node) {
- while (prev) {
- inode = rb_entry(prev, struct btrfs_inode, rb_node);
- if (min_ino <= btrfs_ino(inode)) {
- node = prev;
- break;
- }
- prev = rb_next(prev);
- }
- }
-
- while (node) {
- inode = rb_entry(prev, struct btrfs_inode, rb_node);
- if (igrab(&inode->vfs_inode)) {
- spin_unlock(&root->inode_lock);
- return inode;
- }
-
- min_ino = btrfs_ino(inode) + 1;
- if (cond_resched_lock(&root->inode_lock))
- goto again;
- node = rb_next(node);
+ from = btrfs_ino(inode) + 1;
+ cond_resched_lock(&root->inode_lock);
}
spin_unlock(&root->inode_lock);
- return NULL;
+ return inode;
}
static const struct inode_operations btrfs_dir_inode_operations = {
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v2 02/10] btrfs: preallocate inodes xarray entry to avoid transaction abort
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
2024-05-10 17:32 ` [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 03/10] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
` (9 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When creating a new inode, at btrfs_create_new_inode(), one of the very
last steps is to add the inode to the root's inodes xarray. This often
requires allocating memory, which may fail (even though xarrays have a
dedicated kmem_cache that makes it less likely to fail), and at that point
we are forced to abort the current transaction (as some, but not all, of
the inode metadata was added to its subvolume btree).
To avoid a transaction abort, preallocate memory for the xarray early at
btrfs_create_new_inode(), so that if the allocation fails we can return an
error before doing anything that would require aborting the transaction,
and the later insertion into the xarray is guaranteed to succeed.
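The reserve-then-store pattern described above can be illustrated outside the kernel. The sketch below is a hypothetical userspace analogue (a toy "slotmap", not the kernel xarray API): memory for the entry is allocated up front, while failure is still cheap to handle, so the later store cannot fail.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy analogue of xa_reserve()/xa_store(): reserve allocates the node
 * (and may fail), store only fills in the already-reserved node.
 */
struct slot {
	unsigned long index;
	void *value;
	struct slot *next;
};

struct slotmap {
	struct slot *head;
};

/* May fail; called early, before any hard-to-undo work. */
static int slotmap_reserve(struct slotmap *map, unsigned long index)
{
	struct slot *s = malloc(sizeof(*s));

	if (!s)
		return -1;
	s->index = index;
	s->value = NULL;	/* reserved but empty */
	s->next = map->head;
	map->head = s;
	return 0;
}

/* Infallible: only fills in the slot reserved earlier. */
static void slotmap_store(struct slotmap *map, unsigned long index, void *value)
{
	struct slot *s;

	for (s = map->head; s; s = s->next) {
		if (s->index == index) {
			s->value = value;
			return;
		}
	}
	assert(0 && "slot must have been reserved first");
}
```

The key property, as in the patch, is that the only fallible step happens before any state that would be expensive to roll back.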
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 450fe1582f1d..85dbc19c2f6f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5493,7 +5493,7 @@ static int fixup_tree_root_location(struct btrfs_fs_info *fs_info,
return err;
}
-static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
+static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
{
struct btrfs_root *root = inode->root;
struct btrfs_inode *existing;
@@ -5503,9 +5503,11 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode)
if (inode_unhashed(&inode->vfs_inode))
return 0;
- ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
- if (ret)
- return ret;
+ if (prealloc) {
+ ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
+ if (ret)
+ return ret;
+ }
spin_lock(&root->inode_lock);
existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
@@ -5606,7 +5608,7 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
ret = btrfs_read_locked_inode(inode, path);
if (!ret) {
- ret = btrfs_add_inode_to_root(BTRFS_I(inode));
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
if (ret) {
iget_failed(inode);
inode = ERR_PTR(ret);
@@ -6237,6 +6239,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
struct btrfs_item_batch batch;
unsigned long ptr;
int ret;
+ bool xa_reserved = false;
path = btrfs_alloc_path();
if (!path)
@@ -6251,6 +6254,11 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
goto out;
inode->i_ino = objectid;
+ ret = xa_reserve(&root->inodes, objectid, GFP_NOFS);
+ if (ret)
+ goto out;
+ xa_reserved = true;
+
if (args->orphan) {
/*
* O_TMPFILE, set link count to 0, so that after this point, we
@@ -6424,8 +6432,9 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
}
}
- ret = btrfs_add_inode_to_root(BTRFS_I(inode));
- if (ret) {
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode), false);
+ if (WARN_ON(ret)) {
+ /* Shouldn't happen, we used xa_reserve() before. */
btrfs_abort_transaction(trans, ret);
goto discard;
}
@@ -6456,6 +6465,9 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
ihold(inode);
discard_new_inode(inode);
out:
+ if (xa_reserved)
+ xa_release(&root->inodes, objectid);
+
btrfs_free_path(path);
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v2 03/10] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
2024-05-10 17:32 ` [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root fdmanana
2024-05-10 17:32 ` [PATCH v2 02/10] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 04/10] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
` (8 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
if-then-else statements and having an error label to do all the error
handling instead of repeating it a couple of times.
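The shape of the refactor is the common kernel idiom of early returns plus a single error label. A standalone sketch (hypothetical helpers, not the real btrfs functions):

```c
#include <errno.h>
#include <stddef.h>

struct obj {
	int state;
};

#define STATE_NEW	1
#define STATE_READY	2
#define STATE_FAILED	3

/* Mimics btrfs_read_locked_inode(): < 0 error, > 0 "not found", 0 ok. */
static int read_obj(struct obj *o, int simulated_ret)
{
	(void)o;	/* unused in this sketch */
	return simulated_ret;
}

static int register_obj(struct obj *o, int simulated_ret)
{
	(void)o;	/* unused in this sketch */
	return simulated_ret;
}

static struct obj *setup_obj(struct obj *o, int read_ret, int reg_ret)
{
	int ret;

	if (o->state != STATE_NEW)
		return o;		/* nothing to do, early return */

	ret = read_obj(o, read_ret);
	/* A positive return means "not found", map it to -ENOENT. */
	if (ret > 0)
		ret = -ENOENT;
	if (ret < 0)
		goto error;

	ret = register_obj(o, reg_ret);
	if (ret < 0)
		goto error;

	o->state = STATE_READY;
	return o;
error:
	o->state = STATE_FAILED;	/* single cleanup path, like iget_failed() */
	return NULL;
}
```

Every failure path funnels through one label, so the cleanup is written once instead of being duplicated inside each nested branch.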
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 44 +++++++++++++++++++++-----------------------
1 file changed, 21 insertions(+), 23 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85dbc19c2f6f..8ea9fd4c2b66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5598,37 +5598,35 @@ struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
struct btrfs_root *root, struct btrfs_path *path)
{
struct inode *inode;
+ int ret;
inode = btrfs_iget_locked(s, ino, root);
if (!inode)
return ERR_PTR(-ENOMEM);
- if (inode->i_state & I_NEW) {
- int ret;
+ if (!(inode->i_state & I_NEW))
+ return inode;
- ret = btrfs_read_locked_inode(inode, path);
- if (!ret) {
- ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
- if (ret) {
- iget_failed(inode);
- inode = ERR_PTR(ret);
- } else {
- unlock_new_inode(inode);
- }
- } else {
- iget_failed(inode);
- /*
- * ret > 0 can come from btrfs_search_slot called by
- * btrfs_read_locked_inode, this means the inode item
- * was not found.
- */
- if (ret > 0)
- ret = -ENOENT;
- inode = ERR_PTR(ret);
- }
- }
+ ret = btrfs_read_locked_inode(inode, path);
+ /*
+ * ret > 0 can come from btrfs_search_slot called by
+ * btrfs_read_locked_inode(), this means the inode item was not found.
+ */
+ if (ret > 0)
+ ret = -ENOENT;
+ if (ret < 0)
+ goto error;
+
+ ret = btrfs_add_inode_to_root(BTRFS_I(inode), true);
+ if (ret < 0)
+ goto error;
+
+ unlock_new_inode(inode);
return inode;
+error:
+ iget_failed(inode);
+ return ERR_PTR(ret);
}
struct inode *btrfs_iget(struct super_block *s, u64 ino, struct btrfs_root *root)
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v2 04/10] btrfs: remove inode_lock from struct btrfs_root and use xarray locks
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (2 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 03/10] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 05/10] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
` (7 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:
1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).
Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, through the xa_lock() and xa_unlock() APIs, and by
using the double underscore prefixed xarray APIs, which don't take the
xarray lock themselves and assume the caller already holds it.
So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings 2 benefits:
1) We reduce the size of struct btrfs_root, from 1336 bytes down to
1328 bytes on a 64 bits release kernel config;
2) We reduce lock contention by no longer using the same lock to modify
two different and unrelated xarrays.
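The size benefit is easy to reproduce in a standalone sketch (hypothetical userspace types, with a pthread mutex standing in for both the removed spinlock and the lock an xarray already embeds internally):

```c
#include <pthread.h>

/* Stands in for a structure that embeds its own lock, like an xarray. */
struct locked_map {
	pthread_mutex_t lock;
	void *entries;
};

/* Before: a separate, redundant lock shared by two unrelated members. */
struct root_before {
	pthread_mutex_t inode_lock;
	struct locked_map delayed_nodes;
	struct locked_map inodes;
};

/* After: each member is serialized by its own embedded lock. */
struct root_after {
	struct locked_map delayed_nodes;
	struct locked_map inodes;
};
```

Dropping the redundant field shrinks the container, and the two maps no longer contend on a single shared lock.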
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.h | 1 -
fs/btrfs/delayed-inode.c | 26 ++++++++++++--------------
fs/btrfs/disk-io.c | 1 -
fs/btrfs/inode.c | 18 ++++++++----------
4 files changed, 20 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aa2568f86dc9..1004cb934b4a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -221,7 +221,6 @@ struct btrfs_root {
struct list_head root_list;
- spinlock_t inode_lock;
/*
* Xarray that keeps track of in-memory inodes, protected by the lock
* @inode_lock.
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 95a0497fa866..40e617c7e8a1 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -77,14 +77,14 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
return node;
}
- spin_lock(&root->inode_lock);
+ xa_lock(&root->delayed_nodes);
node = xa_load(&root->delayed_nodes, ino);
if (node) {
if (btrfs_inode->delayed_node) {
refcount_inc(&node->refs); /* can be accessed */
BUG_ON(btrfs_inode->delayed_node != node);
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return node;
}
@@ -111,10 +111,10 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
node = NULL;
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return node;
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return NULL;
}
@@ -148,21 +148,21 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
kmem_cache_free(delayed_node_cache, node);
return ERR_PTR(-ENOMEM);
}
- spin_lock(&root->inode_lock);
+ xa_lock(&root->delayed_nodes);
ptr = xa_load(&root->delayed_nodes, ino);
if (ptr) {
/* Somebody inserted it, go back and read it. */
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
kmem_cache_free(delayed_node_cache, node);
node = NULL;
goto again;
}
- ptr = xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
+ ptr = __xa_store(&root->delayed_nodes, ino, node, GFP_ATOMIC);
ASSERT(xa_err(ptr) != -EINVAL);
ASSERT(xa_err(ptr) != -ENOMEM);
ASSERT(ptr == NULL);
btrfs_inode->delayed_node = node;
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return node;
}
@@ -275,14 +275,12 @@ static void __btrfs_release_delayed_node(
if (refcount_dec_and_test(&delayed_node->refs)) {
struct btrfs_root *root = delayed_node->root;
- spin_lock(&root->inode_lock);
+ xa_erase(&root->delayed_nodes, delayed_node->inode_id);
/*
* Once our refcount goes to zero, nobody is allowed to bump it
* back up. We can delete it now.
*/
ASSERT(refcount_read(&delayed_node->refs) == 0);
- xa_erase(&root->delayed_nodes, delayed_node->inode_id);
- spin_unlock(&root->inode_lock);
kmem_cache_free(delayed_node_cache, delayed_node);
}
}
@@ -2057,9 +2055,9 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
struct btrfs_delayed_node *node;
int count;
- spin_lock(&root->inode_lock);
+ xa_lock(&root->delayed_nodes);
if (xa_empty(&root->delayed_nodes)) {
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
return;
}
@@ -2076,7 +2074,7 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
if (count >= ARRAY_SIZE(delayed_nodes))
break;
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->delayed_nodes);
index++;
for (int i = 0; i < count; i++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed40fe1db53e..d20e400a9ce3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -674,7 +674,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&root->ordered_extents);
INIT_LIST_HEAD(&root->ordered_root);
INIT_LIST_HEAD(&root->reloc_dirty_list);
- spin_lock_init(&root->inode_lock);
spin_lock_init(&root->delalloc_lock);
spin_lock_init(&root->ordered_extent_lock);
spin_lock_init(&root->accounting_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ea9fd4c2b66..4fd41d6b377f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5509,9 +5509,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
return ret;
}
- spin_lock(&root->inode_lock);
existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
- spin_unlock(&root->inode_lock);
if (xa_is_err(existing)) {
ret = xa_err(existing);
@@ -5531,16 +5529,16 @@ static void btrfs_del_inode_from_root(struct btrfs_inode *inode)
struct btrfs_inode *entry;
bool empty = false;
- spin_lock(&root->inode_lock);
- entry = xa_erase(&root->inodes, btrfs_ino(inode));
+ xa_lock(&root->inodes);
+ entry = __xa_erase(&root->inodes, btrfs_ino(inode));
if (entry == inode)
empty = xa_empty(&root->inodes);
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->inodes);
if (empty && btrfs_root_refs(&root->root_item) == 0) {
- spin_lock(&root->inode_lock);
+ xa_lock(&root->inodes);
empty = xa_empty(&root->inodes);
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->inodes);
if (empty)
btrfs_add_dead_root(root);
}
@@ -10871,7 +10869,7 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
struct btrfs_inode *inode;
unsigned long from = min_ino;
- spin_lock(&root->inode_lock);
+ xa_lock(&root->inodes);
while (true) {
inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
if (!inode)
@@ -10880,9 +10878,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
break;
from = btrfs_ino(inode) + 1;
- cond_resched_lock(&root->inode_lock);
+ cond_resched_lock(&root->inodes.xa_lock);
}
- spin_unlock(&root->inode_lock);
+ xa_unlock(&root->inodes);
return inode;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v2 05/10] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (3 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 04/10] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 06/10] btrfs: don't allocate file extent tree for non regular files fdmanana
` (6 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt simply because it's not used for
anything else in the relocation use case - we could just as well have used
other fields that relocation doesn't need, such as defrag_bytes,
last_unlink_trans or last_reflink_trans (among others).
Since the csum_bytes field is not used for directories, make the following
changes:
1) Put index_cnt and csum_bytes in a union, and index_cnt is only
initialized when the inode is a directory. The csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
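The union trick can be checked with a self-contained sketch (illustrative field names only, not the actual struct btrfs_inode layout): two u64 fields that are never used by the same kind of inode can share storage, shrinking the struct by 8 bytes per unified pair.

```c
#include <stdint.h>

struct inode_before {
	uint64_t defrag_bytes;
	uint64_t index_cnt;	/* only meaningful for directories */
	uint64_t csum_bytes;	/* only meaningful for regular files */
};

struct inode_after {
	union {
		uint64_t defrag_bytes;			/* all but the data reloc inode */
		uint64_t reloc_block_group_start;	/* data reloc inode only */
	};
	union {
		uint64_t index_cnt;	/* directories */
		uint64_t csum_bytes;	/* non-directories */
	};
};
```

This is safe only because the aliased fields are provably never used on the same inode, which is exactly what the changelog argues above.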
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 46 +++++++++++++++++++++++++---------------
fs/btrfs/delayed-inode.c | 3 ++-
fs/btrfs/inode.c | 21 ++++++++++++------
fs/btrfs/relocation.c | 12 +++++------
fs/btrfs/tree-log.c | 3 ++-
5 files changed, 54 insertions(+), 31 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e577b9745884..19bb3d057414 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -215,11 +215,20 @@ struct btrfs_inode {
u64 last_dir_index_offset;
};
- /*
- * Total number of bytes pending defrag, used by stat to check whether
- * it needs COW. Protected by 'lock'.
- */
- u64 defrag_bytes;
+ union {
+ /*
+ * Total number of bytes pending defrag, used by stat to check whether
+ * it needs COW. Protected by 'lock'.
+ * Used by inodes other than the data relocation inode.
+ */
+ u64 defrag_bytes;
+
+ /*
+ * Logical address of the block group being relocated.
+ * Used only by the data relocation inode.
+ */
+ u64 reloc_block_group_start;
+ };
/*
* The size of the file stored in the metadata on disk. data=ordered
@@ -228,12 +237,21 @@ struct btrfs_inode {
*/
u64 disk_i_size;
- /*
- * If this is a directory then index_cnt is the counter for the index
- * number for new files that are created. For an empty directory, this
- * must be initialized to BTRFS_DIR_START_INDEX.
- */
- u64 index_cnt;
+ union {
+ /*
+ * If this is a directory then index_cnt is the counter for the
+ * index number for new files that are created. For an empty
+ * directory, this must be initialized to BTRFS_DIR_START_INDEX.
+ */
+ u64 index_cnt;
+
+ /*
+ * If this is not a directory, this is the number of bytes
+ * outstanding that are going to need csums. This is used in
+ * ENOSPC accounting. Protected by 'lock'.
+ */
+ u64 csum_bytes;
+ };
/* Cache the directory index number to speed the dir/file remove */
u64 dir_index;
@@ -256,12 +274,6 @@ struct btrfs_inode {
*/
u64 last_reflink_trans;
- /*
- * Number of bytes outstanding that are going to need csums. This is
- * used in ENOSPC accounting. Protected by 'lock'.
- */
- u64 csum_bytes;
-
/* Backwards incompatible flags, lower half of inode_item::flags */
u32 flags;
/* Read-only compatibility flags, upper half of inode_item::flags */
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 40e617c7e8a1..483c141dc488 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1914,7 +1914,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->i_otime_nsec = btrfs_stack_timespec_nsec(&inode_item->otime);
inode->i_generation = BTRFS_I(inode)->generation;
- BTRFS_I(inode)->index_cnt = (u64)-1;
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = (u64)-1;
mutex_unlock(&delayed_node->mutex);
btrfs_release_delayed_node(delayed_node);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4fd41d6b377f..9b98aa65cc63 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3856,7 +3856,9 @@ static int btrfs_read_locked_inode(struct inode *inode,
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
- BTRFS_I(inode)->index_cnt = (u64)-1;
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = (u64)-1;
+
btrfs_inode_split_flags(btrfs_inode_flags(leaf, inode_item),
&BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
@@ -6268,8 +6270,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
if (ret)
goto out;
}
- /* index_cnt is ignored for everything but a dir. */
- BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
+
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;
+
BTRFS_I(inode)->generation = trans->transid;
inode->i_generation = BTRFS_I(inode)->generation;
@@ -8435,8 +8439,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->disk_i_size = 0;
ei->flags = 0;
ei->ro_flags = 0;
+ /*
+ * ->index_cnt will be properly initialized later when creating a new
+ * inode (btrfs_create_new_inode()) or when reading an existing inode
+ * from disk (btrfs_read_locked_inode()).
+ */
ei->csum_bytes = 0;
- ei->index_cnt = (u64)-1;
ei->dir_index = 0;
ei->last_unlink_trans = 0;
ei->last_reflink_trans = 0;
@@ -8511,9 +8519,10 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
if (!S_ISDIR(vfs_inode->i_mode)) {
WARN_ON(inode->delalloc_bytes);
WARN_ON(inode->new_delalloc_bytes);
+ WARN_ON(inode->csum_bytes);
}
- WARN_ON(inode->csum_bytes);
- WARN_ON(inode->defrag_bytes);
+ if (!root || !btrfs_is_data_reloc_root(root))
+ WARN_ON(inode->defrag_bytes);
/*
* This can happen where we create an inode, but somebody else also
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8b24bb5a0aa1..9f35524b6664 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -962,7 +962,7 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
if (!path)
return -ENOMEM;
- bytenr -= BTRFS_I(reloc_inode)->index_cnt;
+ bytenr -= BTRFS_I(reloc_inode)->reloc_block_group_start;
ret = btrfs_lookup_file_extent(NULL, root, path,
btrfs_ino(BTRFS_I(reloc_inode)), bytenr, 0);
if (ret < 0)
@@ -2797,7 +2797,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
u64 alloc_hint = 0;
u64 start;
u64 end;
- u64 offset = inode->index_cnt;
+ u64 offset = inode->reloc_block_group_start;
u64 num_bytes;
int nr;
int ret = 0;
@@ -2951,7 +2951,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
int *cluster_nr, unsigned long index)
{
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
- u64 offset = BTRFS_I(inode)->index_cnt;
+ u64 offset = BTRFS_I(inode)->reloc_block_group_start;
const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
struct folio *folio;
@@ -3086,7 +3086,7 @@ static int relocate_one_folio(struct inode *inode, struct file_ra_state *ra,
static int relocate_file_extent_cluster(struct inode *inode,
const struct file_extent_cluster *cluster)
{
- u64 offset = BTRFS_I(inode)->index_cnt;
+ u64 offset = BTRFS_I(inode)->reloc_block_group_start;
unsigned long index;
unsigned long last_index;
struct file_ra_state *ra;
@@ -3915,7 +3915,7 @@ static noinline_for_stack struct inode *create_reloc_inode(
inode = NULL;
goto out;
}
- BTRFS_I(inode)->index_cnt = group->start;
+ BTRFS_I(inode)->reloc_block_group_start = group->start;
ret = btrfs_orphan_add(trans, BTRFS_I(inode));
out:
@@ -4395,7 +4395,7 @@ int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered)
{
struct btrfs_inode *inode = BTRFS_I(ordered->inode);
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 disk_bytenr = ordered->file_offset + inode->index_cnt;
+ u64 disk_bytenr = ordered->file_offset + inode->reloc_block_group_start;
struct btrfs_root *csum_root = btrfs_csum_root(fs_info, disk_bytenr);
LIST_HEAD(list);
int ret;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5146387b416b..0aee43466c52 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1625,7 +1625,8 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
if (ret)
goto out;
}
- BTRFS_I(inode)->index_cnt = (u64)-1;
+ if (S_ISDIR(inode->i_mode))
+ BTRFS_I(inode)->index_cnt = (u64)-1;
if (inode->i_nlink == 0) {
if (S_ISDIR(inode->i_mode)) {
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v2 06/10] btrfs: don't allocate file extent tree for non regular files
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (4 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 05/10] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 07/10] btrfs: remove location key from struct btrfs_inode fdmanana
` (5 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When not using the NO_HOLES feature we always allocate an io tree for an
inode's file_extent_tree. This is wasteful because that io tree is only
used for regular files, so we allocate more memory than needed for inodes
that represent directories or symlinks for example, or for inodes that
correspond to free space inodes.
So improve on this by allocating the io tree only for inodes of regular
files that are not free space inodes.
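A minimal sketch of this lazy-allocation pattern, with hypothetical userspace types rather than the real btrfs structures:

```c
#include <stdbool.h>
#include <stdlib.h>

struct io_tree {
	int dummy;
};

struct inode {
	bool is_regular;
	bool is_free_space;
	struct io_tree *file_extent_tree;	/* NULL unless actually needed */
};

/* Allocate the auxiliary tree only for inodes that can use it. */
static int init_file_extent_tree(struct inode *inode)
{
	if (!inode->is_regular || inode->is_free_space)
		return 0;	/* leave the pointer NULL, nothing to free later */

	inode->file_extent_tree = malloc(sizeof(*inode->file_extent_tree));
	if (!inode->file_extent_tree)
		return -1;
	inode->file_extent_tree->dummy = 0;
	return 0;
}
```

All users then gate on the pointer being non-NULL (as the btrfs_inode_set_file_extent_range() hunk above does), instead of gating on the NO_HOLES feature.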
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/file-item.c | 13 ++++++-----
fs/btrfs/inode.c | 53 +++++++++++++++++++++++++++++---------------
2 files changed, 42 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index bce95f871750..f3ed78e21fa4 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -45,13 +45,12 @@
*/
void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 start, end, i_size;
int ret;
spin_lock(&inode->lock);
i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
- if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
+ if (!inode->file_extent_tree) {
inode->disk_i_size = i_size;
goto out_unlock;
}
@@ -84,13 +83,14 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
u64 len)
{
+ if (!inode->file_extent_tree)
+ return 0;
+
if (len == 0)
return 0;
ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));
- if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
- return 0;
return set_extent_bit(inode->file_extent_tree, start, start + len - 1,
EXTENT_DIRTY, NULL);
}
@@ -112,14 +112,15 @@ int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
u64 len)
{
+ if (!inode->file_extent_tree)
+ return 0;
+
if (len == 0)
return 0;
ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
len == (u64)-1);
- if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
- return 0;
return clear_extent_bit(inode->file_extent_tree, start,
start + len - 1, EXTENT_DIRTY, NULL);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b98aa65cc63..175fd007f0ef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3781,6 +3781,30 @@ static noinline int acls_after_inode_item(struct extent_buffer *leaf,
return 1;
}
+static int btrfs_init_file_extent_tree(struct btrfs_inode *inode)
+{
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+ if (WARN_ON_ONCE(inode->file_extent_tree))
+ return 0;
+ if (btrfs_fs_incompat(fs_info, NO_HOLES))
+ return 0;
+ if (!S_ISREG(inode->vfs_inode.i_mode))
+ return 0;
+ if (btrfs_is_free_space_inode(inode))
+ return 0;
+
+ inode->file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
+ if (!inode->file_extent_tree)
+ return -ENOMEM;
+
+ extent_io_tree_init(fs_info, inode->file_extent_tree, IO_TREE_INODE_FILE_EXTENT);
+ /* Lockdep class is set only for the file extent tree. */
+ lockdep_set_class(&inode->file_extent_tree->lock, &file_extent_tree_class);
+
+ return 0;
+}
+
/*
* read an inode from the btree into the in-memory inode
*/
@@ -3800,6 +3824,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
bool filled = false;
int first_xattr_slot;
+ ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
+ if (ret)
+ return ret;
+
ret = btrfs_fill_inode(inode, &rdev);
if (!ret)
filled = true;
@@ -6247,6 +6275,10 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
BTRFS_I(inode)->root = btrfs_grab_root(BTRFS_I(dir)->root);
root = BTRFS_I(inode)->root;
+ ret = btrfs_init_file_extent_tree(BTRFS_I(inode));
+ if (ret)
+ goto out;
+
ret = btrfs_get_free_objectid(root, &objectid);
if (ret)
goto out;
@@ -8413,20 +8445,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
struct btrfs_inode *ei;
struct inode *inode;
- struct extent_io_tree *file_extent_tree = NULL;
-
- /* Self tests may pass a NULL fs_info. */
- if (fs_info && !btrfs_fs_incompat(fs_info, NO_HOLES)) {
- file_extent_tree = kmalloc(sizeof(struct extent_io_tree), GFP_KERNEL);
- if (!file_extent_tree)
- return NULL;
- }
ei = alloc_inode_sb(sb, btrfs_inode_cachep, GFP_KERNEL);
- if (!ei) {
- kfree(file_extent_tree);
+ if (!ei)
return NULL;
- }
ei->root = NULL;
ei->generation = 0;
@@ -8471,13 +8493,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO);
ei->io_tree.inode = ei;
- ei->file_extent_tree = file_extent_tree;
- if (file_extent_tree) {
- extent_io_tree_init(fs_info, ei->file_extent_tree,
- IO_TREE_INODE_FILE_EXTENT);
- /* Lockdep class is set only for the file extent tree. */
- lockdep_set_class(&ei->file_extent_tree->lock, &file_extent_tree_class);
- }
+ ei->file_extent_tree = NULL;
+
mutex_init(&ei->log_mutex);
spin_lock_init(&ei->ordered_tree_lock);
ei->ordered_tree = RB_ROOT;
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v2 07/10] btrfs: remove location key from struct btrfs_inode
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (5 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 06/10] btrfs: don't allocate file extent tree for non regular files fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 08/10] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
` (4 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently struct btrfs_inode has a key member, named "location", that is
either:
1) The key of the inode's item. In this case the objectid is the number
of the inode;
2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
the case where we have a root that is a snapshot of a subvolume that
points to other subvolumes. In this case the objectid is the ID of
a subvolume inside the snapshotted parent subvolume.
The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() on 32-bit platforms, and its objectid is
only needed for unlink.
Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.
So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
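The space saving comes from caching only the varying part of the key and regenerating the rest on demand. A simplified sketch (not the real btrfs structures):

```c
#include <stdint.h>

#define INODE_ITEM_KEY 1	/* plays the role of BTRFS_INODE_ITEM_KEY */

struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct inode_with_key {
	struct key location;	/* full key cached in the inode */
};

struct inode_with_objectid {
	uint64_t objectid;	/* the only part of the key that varies */
};

/* Rebuild the full key on demand; type and offset are constants. */
static void get_inode_key(const struct inode_with_objectid *inode,
			  struct key *key)
{
	key->objectid = inode->objectid;
	key->type = INODE_ITEM_KEY;
	key->offset = 0;
}
```

Since every lookup uses the same constant type and a zero offset, nothing is lost by dropping the cached key, and the struct shrinks by the size of the redundant fields (plus their padding).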
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 47 +++++++++++++++++++++++++++++++-----
fs/btrfs/disk-io.c | 4 +--
fs/btrfs/export.c | 2 +-
fs/btrfs/inode.c | 25 +++++++++----------
fs/btrfs/ioctl.c | 8 +++---
fs/btrfs/tests/btrfs-tests.c | 4 +--
fs/btrfs/tree-log.c | 6 +++--
7 files changed, 63 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 19bb3d057414..fa2f91396ae0 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -89,6 +89,29 @@ enum {
BTRFS_INODE_FREE_SPACE_INODE,
/* Set when there are no capabilities in XATTs for the inode. */
BTRFS_INODE_NO_CAP_XATTR,
+ /*
+ * Indicate this is a directory that points to a subvolume for which
+ * there is no root reference item. That's a case like the following:
+ *
+ * $ btrfs subvolume create /mnt/parent
+ * $ btrfs subvolume create /mnt/parent/child
+ * $ btrfs subvolume snapshot /mnt/parent /mnt/snap
+ *
+ * If subvolume "parent" is root 256, subvolume "child" is root 257 and
+ * snapshot "snap" is root 258, then there's no root reference item (key
+ * BTRFS_ROOT_REF_KEY in the root tree) for the subvolume "child"
+ * associated to root 258 (the snapshot) - there's only one for the
+ * root of the "parent" subvolume (root 256). In the root tree we have
+ * a (256 BTRFS_ROOT_REF_KEY 257) key but we don't have a
+ * (258 BTRFS_ROOT_REF_KEY 257) key - the same goes for backrefs, we
+ * have a (257 BTRFS_ROOT_BACKREF_KEY 256) key but we don't have a
+ * (257 BTRFS_ROOT_BACKREF_KEY 258) key.
+ *
+ * So when opening the "child" dentry from the snapshot's directory,
+ * we don't find a root ref item and we create a stub inode. This is
+ * done at new_simple_dir(), called from btrfs_lookup_dentry().
+ */
+ BTRFS_INODE_ROOT_STUB,
};
/* in memory btrfs inode */
@@ -96,10 +119,15 @@ struct btrfs_inode {
/* which subvolume this inode belongs to */
struct btrfs_root *root;
- /* key used to find this inode on disk. This is used by the code
- * to read in roots of subvolumes
+ /*
+ * This is either:
+ *
+ * 1) The objectid of the corresponding BTRFS_INODE_ITEM_KEY;
+ *
+ * 2) In case this is a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
+ * the ID of that root.
*/
- struct btrfs_key location;
+ u64 objectid;
/* Cached value of inode property 'compression'. */
u8 prop_compress;
@@ -330,10 +358,9 @@ static inline unsigned long btrfs_inode_hash(u64 objectid,
*/
static inline u64 btrfs_ino(const struct btrfs_inode *inode)
{
- u64 ino = inode->location.objectid;
+ u64 ino = inode->objectid;
- /* type == BTRFS_ROOT_ITEM_KEY: subvol dir */
- if (inode->location.type == BTRFS_ROOT_ITEM_KEY)
+ if (test_bit(BTRFS_INODE_ROOT_STUB, &inode->runtime_flags))
ino = inode->vfs_inode.i_ino;
return ino;
}
@@ -347,6 +374,14 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
#endif
+static inline void btrfs_get_inode_key(const struct btrfs_inode *inode,
+ struct btrfs_key *key)
+{
+ key->objectid = inode->objectid;
+ key->type = BTRFS_INODE_ITEM_KEY;
+ key->offset = 0;
+}
+
static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
{
i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d20e400a9ce3..e3edaf510108 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1944,9 +1944,7 @@ static int btrfs_init_btree_inode(struct super_block *sb)
extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root);
- BTRFS_I(inode)->location.objectid = BTRFS_BTREE_INODE_OBJECTID;
- BTRFS_I(inode)->location.type = 0;
- BTRFS_I(inode)->location.offset = 0;
+ BTRFS_I(inode)->objectid = BTRFS_BTREE_INODE_OBJECTID;
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
__insert_inode_hash(inode, hash);
fs_info->btree_inode = inode;
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 9e81f89e76d8..5526e25ebb3f 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -40,7 +40,7 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
if (parent) {
u64 parent_root_id;
- fid->parent_objectid = BTRFS_I(parent)->location.objectid;
+ fid->parent_objectid = BTRFS_I(parent)->objectid;
fid->parent_gen = parent->i_generation;
parent_root_id = btrfs_root_id(BTRFS_I(parent)->root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 175fd007f0ef..44dc82ff96db 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3838,7 +3838,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
return -ENOMEM;
}
- memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
+ btrfs_get_inode_key(BTRFS_I(inode), &location);
ret = btrfs_lookup_inode(NULL, root, path, &location, 0);
if (ret) {
@@ -4068,13 +4068,15 @@ static noinline int btrfs_update_inode_item(struct btrfs_trans_handle *trans,
struct btrfs_inode_item *inode_item;
struct btrfs_path *path;
struct extent_buffer *leaf;
+ struct btrfs_key key;
int ret;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- ret = btrfs_lookup_inode(trans, inode->root, path, &inode->location, 1);
+ btrfs_get_inode_key(inode, &key);
+ ret = btrfs_lookup_inode(trans, inode->root, path, &key, 1);
if (ret) {
if (ret > 0)
ret = -ENOENT;
@@ -4338,7 +4340,7 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
if (btrfs_ino(inode) == BTRFS_FIRST_FREE_OBJECTID) {
objectid = btrfs_root_id(inode->root);
} else if (btrfs_ino(inode) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
- objectid = inode->location.objectid;
+ objectid = inode->objectid;
} else {
WARN_ON(1);
fscrypt_free_filename(&fname);
@@ -5580,9 +5582,7 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
struct btrfs_iget_args *args = p;
inode->i_ino = args->ino;
- BTRFS_I(inode)->location.objectid = args->ino;
- BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
- BTRFS_I(inode)->location.offset = 0;
+ BTRFS_I(inode)->objectid = args->ino;
BTRFS_I(inode)->root = btrfs_grab_root(args->root);
if (args->root && args->root == args->root->fs_info->tree_root &&
@@ -5596,7 +5596,7 @@ static int btrfs_find_actor(struct inode *inode, void *opaque)
{
struct btrfs_iget_args *args = opaque;
- return args->ino == BTRFS_I(inode)->location.objectid &&
+ return args->ino == BTRFS_I(inode)->objectid &&
args->root == BTRFS_I(inode)->root;
}
@@ -5673,7 +5673,8 @@ static struct inode *new_simple_dir(struct inode *dir,
return ERR_PTR(-ENOMEM);
BTRFS_I(inode)->root = btrfs_grab_root(root);
- memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
+ BTRFS_I(inode)->objectid = key->objectid;
+ set_bit(BTRFS_INODE_ROOT_STUB, &BTRFS_I(inode)->runtime_flags);
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
@@ -6149,7 +6150,7 @@ static int btrfs_insert_inode_locked(struct inode *inode)
{
struct btrfs_iget_args args;
- args.ino = BTRFS_I(inode)->location.objectid;
+ args.ino = BTRFS_I(inode)->objectid;
args.root = BTRFS_I(inode)->root;
return insert_inode_locked4(inode,
@@ -6256,7 +6257,6 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info = inode_to_fs_info(dir);
struct btrfs_root *root;
struct btrfs_inode_item *inode_item;
- struct btrfs_key *location;
struct btrfs_path *path;
u64 objectid;
struct btrfs_inode_ref *ref;
@@ -6332,10 +6332,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
BTRFS_INODE_NODATASUM;
}
- location = &BTRFS_I(inode)->location;
- location->objectid = objectid;
- location->offset = 0;
- location->type = BTRFS_INODE_ITEM_KEY;
+ BTRFS_I(inode)->objectid = objectid;
ret = btrfs_insert_inode_locked(inode);
if (ret < 0) {
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 28df28e50ad9..79a5ccb27b92 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1918,7 +1918,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
{
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct super_block *sb = inode->i_sb;
- struct btrfs_key upper_limit = BTRFS_I(inode)->location;
+ u64 upper_limit = BTRFS_I(inode)->objectid;
u64 treeid = btrfs_root_id(BTRFS_I(inode)->root);
u64 dirid = args->dirid;
unsigned long item_off;
@@ -1944,7 +1944,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
* If the bottom subvolume does not exist directly under upper_limit,
* construct the path in from the bottom up.
*/
- if (dirid != upper_limit.objectid) {
+ if (dirid != upper_limit) {
ptr = &args->path[BTRFS_INO_LOOKUP_USER_PATH_MAX - 1];
root = btrfs_get_fs_root(fs_info, treeid, true);
@@ -2019,7 +2019,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
goto out_put;
}
- if (key.offset == upper_limit.objectid)
+ if (key.offset == upper_limit)
break;
if (key.objectid == BTRFS_FIRST_FREE_OBJECTID) {
ret = -EACCES;
@@ -2140,7 +2140,7 @@ static int btrfs_ioctl_ino_lookup_user(struct file *file, void __user *argp)
inode = file_inode(file);
if (args->dirid == BTRFS_FIRST_FREE_OBJECTID &&
- BTRFS_I(inode)->location.objectid != BTRFS_FIRST_FREE_OBJECTID) {
+ BTRFS_I(inode)->objectid != BTRFS_FIRST_FREE_OBJECTID) {
/*
* The subvolume does not exist under fd with which this is
* called
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index dce0387ef155..b28a79935d8e 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -62,9 +62,7 @@ struct inode *btrfs_new_test_inode(void)
inode->i_mode = S_IFREG;
inode->i_ino = BTRFS_FIRST_FREE_OBJECTID;
- BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
- BTRFS_I(inode)->location.objectid = BTRFS_FIRST_FREE_OBJECTID;
- BTRFS_I(inode)->location.offset = 0;
+ BTRFS_I(inode)->objectid = BTRFS_FIRST_FREE_OBJECTID;
inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG);
return inode;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 0aee43466c52..2e762b89d4a2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4235,8 +4235,10 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, bool inode_item_dropped)
{
struct btrfs_inode_item *inode_item;
+ struct btrfs_key key;
int ret;
+ btrfs_get_inode_key(inode, &key);
/*
* If we are doing a fast fsync and the inode was logged before in the
* current transaction, then we know the inode was previously logged and
@@ -4248,7 +4250,7 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
* already exists can also result in unnecessarily splitting a leaf.
*/
if (!inode_item_dropped && inode->logged_trans == trans->transid) {
- ret = btrfs_search_slot(trans, log, &inode->location, path, 0, 1);
+ ret = btrfs_search_slot(trans, log, &key, path, 0, 1);
ASSERT(ret <= 0);
if (ret > 0)
ret = -ENOENT;
@@ -4262,7 +4264,7 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
* the inode, we set BTRFS_INODE_NEEDS_FULL_SYNC on its runtime
* flags and set ->logged_trans to 0.
*/
- ret = btrfs_insert_empty_item(trans, log, path, &inode->location,
+ ret = btrfs_insert_empty_item(trans, log, path, &key,
sizeof(*inode_item));
ASSERT(ret != -EEXIST);
}
--
2.43.0
* [PATCH v2 08/10] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (6 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 07/10] btrfs: remove location key from struct btrfs_inode fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 09/10] btrfs: rename rb_root member of extent_map_tree from map to root fdmanana
` (3 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number, since we store it in the vfs
inode's i_ino member, which is an unsigned long and that type is 64 bits
wide on 64 bits platforms. We only need that field on 32 bits platforms,
because there the unsigned long type is only 32 bits wide. See commit
33345d01522f ("Btrfs: Always use 64bit inode number") regarding this
64/32 bits detail.
The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.
So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the vfs inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
an inode is a directory and reflinks can't be done against directories.
So remove the objectid field on 64 bits platforms and alias the
last_reflink_trans field under the name ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1040 bytes down to 1032 bytes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 50 ++++++++++++++++++++++++------------
fs/btrfs/disk-io.c | 3 +--
fs/btrfs/export.c | 2 +-
fs/btrfs/inode.c | 17 +++++-------
fs/btrfs/ioctl.c | 4 +--
fs/btrfs/tests/btrfs-tests.c | 3 +--
6 files changed, 45 insertions(+), 34 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fa2f91396ae0..4d9299789a03 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -119,15 +119,14 @@ struct btrfs_inode {
/* which subvolume this inode belongs to */
struct btrfs_root *root;
+#if BITS_PER_LONG == 32
/*
- * This is either:
- *
- * 1) The objectid of the corresponding BTRFS_INODE_ITEM_KEY;
- *
- * 2) In case this a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
- * the ID of that root.
+ * The objectid of the corresponding BTRFS_INODE_ITEM_KEY.
+ * On 64 bits platforms we can get it from vfs_inode.i_ino, which is an
+ * unsigned long and therefore 64 bits on such platforms.
*/
u64 objectid;
+#endif
/* Cached value of inode property 'compression'. */
u8 prop_compress;
@@ -291,16 +290,25 @@ struct btrfs_inode {
*/
u64 last_unlink_trans;
- /*
- * The id/generation of the last transaction where this inode was
- * either the source or the destination of a clone/dedupe operation.
- * Used when logging an inode to know if there are shared extents that
- * need special care when logging checksum items, to avoid duplicate
- * checksum items in a log (which can lead to a corruption where we end
- * up with missing checksum ranges after log replay).
- * Protected by the vfs inode lock.
- */
- u64 last_reflink_trans;
+ union {
+ /*
+ * The id/generation of the last transaction where this inode
+ * was either the source or the destination of a clone/dedupe
+ * operation. Used when logging an inode to know if there are
+ * shared extents that need special care when logging checksum
+ * items, to avoid duplicate checksum items in a log (which can
+ * lead to a corruption where we end up with missing checksum
+ * ranges after log replay). Protected by the vfs inode lock.
+ * Used for regular files only.
+ */
+ u64 last_reflink_trans;
+
+ /*
+ * In case this is a root stub inode (BTRFS_INODE_ROOT_STUB flag set),
+ * the ID of that root.
+ */
+ u64 ref_root_id;
+ };
/* Backwards incompatible flags, lower half of inode_item::flags */
u32 flags;
@@ -377,11 +385,19 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
static inline void btrfs_get_inode_key(const struct btrfs_inode *inode,
struct btrfs_key *key)
{
- key->objectid = inode->objectid;
+ key->objectid = btrfs_ino(inode);
key->type = BTRFS_INODE_ITEM_KEY;
key->offset = 0;
}
+static inline void btrfs_set_inode_number(struct btrfs_inode *inode, u64 ino)
+{
+#if BITS_PER_LONG == 32
+ inode->objectid = ino;
+#endif
+ inode->vfs_inode.i_ino = ino;
+}
+
static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
{
i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e3edaf510108..e6bf895b3547 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1928,7 +1928,7 @@ static int btrfs_init_btree_inode(struct super_block *sb)
if (!inode)
return -ENOMEM;
- inode->i_ino = BTRFS_BTREE_INODE_OBJECTID;
+ btrfs_set_inode_number(BTRFS_I(inode), BTRFS_BTREE_INODE_OBJECTID);
set_nlink(inode, 1);
/*
* we set the i_size on the btree inode to the max possible int.
@@ -1944,7 +1944,6 @@ static int btrfs_init_btree_inode(struct super_block *sb)
extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root);
- BTRFS_I(inode)->objectid = BTRFS_BTREE_INODE_OBJECTID;
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
__insert_inode_hash(inode, hash);
fs_info->btree_inode = inode;
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 5526e25ebb3f..5da56e21ff73 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -40,7 +40,7 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
if (parent) {
u64 parent_root_id;
- fid->parent_objectid = BTRFS_I(parent)->objectid;
+ fid->parent_objectid = btrfs_ino(BTRFS_I(parent));
fid->parent_gen = parent->i_generation;
parent_root_id = btrfs_root_id(BTRFS_I(parent)->root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 44dc82ff96db..5a1014122088 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4340,7 +4340,7 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
if (btrfs_ino(inode) == BTRFS_FIRST_FREE_OBJECTID) {
objectid = btrfs_root_id(inode->root);
} else if (btrfs_ino(inode) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
- objectid = inode->objectid;
+ objectid = inode->ref_root_id;
} else {
WARN_ON(1);
fscrypt_free_filename(&fname);
@@ -5581,8 +5581,7 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
{
struct btrfs_iget_args *args = p;
- inode->i_ino = args->ino;
- BTRFS_I(inode)->objectid = args->ino;
+ btrfs_set_inode_number(BTRFS_I(inode), args->ino);
BTRFS_I(inode)->root = btrfs_grab_root(args->root);
if (args->root && args->root == args->root->fs_info->tree_root &&
@@ -5596,7 +5595,7 @@ static int btrfs_find_actor(struct inode *inode, void *opaque)
{
struct btrfs_iget_args *args = opaque;
- return args->ino == BTRFS_I(inode)->objectid &&
+ return args->ino == btrfs_ino(BTRFS_I(inode)) &&
args->root == BTRFS_I(inode)->root;
}
@@ -5673,11 +5672,11 @@ static struct inode *new_simple_dir(struct inode *dir,
return ERR_PTR(-ENOMEM);
BTRFS_I(inode)->root = btrfs_grab_root(root);
- BTRFS_I(inode)->objectid = key->objectid;
+ BTRFS_I(inode)->ref_root_id = key->objectid;
set_bit(BTRFS_INODE_ROOT_STUB, &BTRFS_I(inode)->runtime_flags);
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
- inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
+ btrfs_set_inode_number(BTRFS_I(inode), BTRFS_EMPTY_SUBVOL_DIR_OBJECTID);
/*
* We only need lookup, the rest is read-only and there's no inode
* associated with the dentry
@@ -6150,7 +6149,7 @@ static int btrfs_insert_inode_locked(struct inode *inode)
{
struct btrfs_iget_args args;
- args.ino = BTRFS_I(inode)->objectid;
+ args.ino = btrfs_ino(BTRFS_I(inode));
args.root = BTRFS_I(inode)->root;
return insert_inode_locked4(inode,
@@ -6282,7 +6281,7 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
ret = btrfs_get_free_objectid(root, &objectid);
if (ret)
goto out;
- inode->i_ino = objectid;
+ btrfs_set_inode_number(BTRFS_I(inode), objectid);
ret = xa_reserve(&root->inodes, objectid, GFP_NOFS);
if (ret)
@@ -6332,8 +6331,6 @@ int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
BTRFS_INODE_NODATASUM;
}
- BTRFS_I(inode)->objectid = objectid;
-
ret = btrfs_insert_inode_locked(inode);
if (ret < 0) {
if (!args->orphan)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 79a5ccb27b92..968e256003af 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1918,7 +1918,7 @@ static int btrfs_search_path_in_tree_user(struct mnt_idmap *idmap,
{
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct super_block *sb = inode->i_sb;
- u64 upper_limit = BTRFS_I(inode)->objectid;
+ u64 upper_limit = btrfs_ino(BTRFS_I(inode));
u64 treeid = btrfs_root_id(BTRFS_I(inode)->root);
u64 dirid = args->dirid;
unsigned long item_off;
@@ -2140,7 +2140,7 @@ static int btrfs_ioctl_ino_lookup_user(struct file *file, void __user *argp)
inode = file_inode(file);
if (args->dirid == BTRFS_FIRST_FREE_OBJECTID &&
- BTRFS_I(inode)->objectid != BTRFS_FIRST_FREE_OBJECTID) {
+ btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) {
/*
* The subvolume does not exist under fd with which this is
* called
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index b28a79935d8e..ce50847e1e01 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -61,8 +61,7 @@ struct inode *btrfs_new_test_inode(void)
return NULL;
inode->i_mode = S_IFREG;
- inode->i_ino = BTRFS_FIRST_FREE_OBJECTID;
- BTRFS_I(inode)->objectid = BTRFS_FIRST_FREE_OBJECTID;
+ btrfs_set_inode_number(BTRFS_I(inode), BTRFS_FIRST_FREE_OBJECTID);
inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG);
return inode;
--
2.43.0
* [PATCH v2 09/10] btrfs: rename rb_root member of extent_map_tree from map to root
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (7 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 08/10] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-10 17:32 ` [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree fdmanana
` (2 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently we name the rb_root member of struct extent_map_tree as 'map',
which is odd and confusing. Since it's a root node, rename it to 'root'.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent_map.c | 22 +++++++++++-----------
fs/btrfs/extent_map.h | 2 +-
fs/btrfs/tests/extent-map-tests.c | 6 +++---
3 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 744e8952abb0..4bc41b0dd701 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -33,7 +33,7 @@ void __cold extent_map_exit(void)
*/
void extent_map_tree_init(struct extent_map_tree *tree)
{
- tree->map = RB_ROOT_CACHED;
+ tree->root = RB_ROOT_CACHED;
INIT_LIST_HEAD(&tree->modified_extents);
rwlock_init(&tree->lock);
}
@@ -265,7 +265,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
em->generation = max(em->generation, merge->generation);
em->flags |= EXTENT_FLAG_MERGED;
- rb_erase_cached(&merge->rb_node, &tree->map);
+ rb_erase_cached(&merge->rb_node, &tree->root);
RB_CLEAR_NODE(&merge->rb_node);
free_extent_map(merge);
dec_evictable_extent_maps(inode);
@@ -278,7 +278,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
em->len += merge->len;
em->block_len += merge->block_len;
- rb_erase_cached(&merge->rb_node, &tree->map);
+ rb_erase_cached(&merge->rb_node, &tree->root);
RB_CLEAR_NODE(&merge->rb_node);
em->generation = max(em->generation, merge->generation);
em->flags |= EXTENT_FLAG_MERGED;
@@ -389,7 +389,7 @@ static int add_extent_mapping(struct btrfs_inode *inode,
lockdep_assert_held_write(&tree->lock);
- ret = tree_insert(&tree->map, em);
+ ret = tree_insert(&tree->root, em);
if (ret)
return ret;
@@ -410,7 +410,7 @@ __lookup_extent_mapping(struct extent_map_tree *tree,
struct rb_node *prev_or_next = NULL;
u64 end = range_end(start, len);
- rb_node = __tree_search(&tree->map.rb_root, start, &prev_or_next);
+ rb_node = __tree_search(&tree->root.rb_root, start, &prev_or_next);
if (!rb_node) {
if (prev_or_next)
rb_node = prev_or_next;
@@ -479,7 +479,7 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
lockdep_assert_held_write(&tree->lock);
WARN_ON(em->flags & EXTENT_FLAG_PINNED);
- rb_erase_cached(&em->rb_node, &tree->map);
+ rb_erase_cached(&em->rb_node, &tree->root);
if (!(em->flags & EXTENT_FLAG_LOGGING))
list_del_init(&em->list);
RB_CLEAR_NODE(&em->rb_node);
@@ -500,7 +500,7 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
ASSERT(extent_map_in_tree(cur));
if (!(cur->flags & EXTENT_FLAG_LOGGING))
list_del_init(&cur->list);
- rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->map);
+ rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->root);
RB_CLEAR_NODE(&cur->rb_node);
setup_extent_mapping(inode, new, modified);
@@ -659,11 +659,11 @@ static void drop_all_extent_maps_fast(struct btrfs_inode *inode)
struct extent_map_tree *tree = &inode->extent_tree;
write_lock(&tree->lock);
- while (!RB_EMPTY_ROOT(&tree->map.rb_root)) {
+ while (!RB_EMPTY_ROOT(&tree->root.rb_root)) {
struct extent_map *em;
struct rb_node *node;
- node = rb_first_cached(&tree->map);
+ node = rb_first_cached(&tree->root);
em = rb_entry(node, struct extent_map, rb_node);
em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
remove_extent_mapping(inode, em);
@@ -1058,7 +1058,7 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t
return 0;
write_lock(&tree->lock);
- node = rb_first_cached(&tree->map);
+ node = rb_first_cached(&tree->root);
while (node) {
struct extent_map *em;
@@ -1094,7 +1094,7 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t
* lock and took it again.
*/
if (cond_resched_rwlock_write(&tree->lock))
- node = rb_first_cached(&tree->map);
+ node = rb_first_cached(&tree->root);
}
write_unlock(&tree->lock);
up_read(&inode->i_mmap_lock);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 6d587111f73a..9c0793888d13 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -115,7 +115,7 @@ struct extent_map {
};
struct extent_map_tree {
- struct rb_root_cached map;
+ struct rb_root_cached root;
struct list_head modified_extents;
rwlock_t lock;
};
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index ba36794ba2d5..075e6930acda 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -19,8 +19,8 @@ static int free_extent_map_tree(struct btrfs_inode *inode)
int ret = 0;
write_lock(&em_tree->lock);
- while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) {
- node = rb_first_cached(&em_tree->map);
+ while (!RB_EMPTY_ROOT(&em_tree->root.rb_root)) {
+ node = rb_first_cached(&em_tree->root);
em = rb_entry(node, struct extent_map, rb_node);
remove_extent_mapping(inode, em);
@@ -551,7 +551,7 @@ static int validate_range(struct extent_map_tree *em_tree, int index)
struct rb_node *n;
int i;
- for (i = 0, n = rb_first_cached(&em_tree->map);
+ for (i = 0, n = rb_first_cached(&em_tree->root);
valid_ranges[index][i].len && n;
i++, n = rb_next(n)) {
struct extent_map *entry = rb_entry(n, struct extent_map, rb_node);
--
2.43.0
* [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (8 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 09/10] btrfs: rename rb_root member of extent_map_tree from map to root fdmanana
@ 2024-05-10 17:32 ` fdmanana
2024-05-14 15:58 ` David Sterba
2024-05-14 16:08 ` [PATCH v2 00/10] btrfs: inode management and memory consumption improvements David Sterba
2024-05-15 18:28 ` David Sterba
11 siblings, 1 reply; 37+ messages in thread
From: fdmanana @ 2024-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
We are currently using a cached rb_root (struct rb_root_cached) for the
rb root of struct extent_map_tree. This doesn't offer much of an advantage
here because:
1) Its only advantage over the regular rb_root is that it caches a
pointer to the left most node (first node), so a call to
rb_first_cached() doesn't have to chase pointers until it reaches
the left most node;
2) We only have two scenarios that access the left most node with
rb_first_cached():
When dropping all extent maps from an inode, during inode eviction;
When iterating over extent maps during the extent map shrinker;
3) In both cases we keep removing extent maps, which causes deletion of
the left most node, so rb_erase_cached() has to call rb_next() to find
out what's the next left most node and assign it to
struct rb_root_cached::rb_leftmost;
4) We can do that ourselves in those two use cases, stop using a
rb_root_cached rb tree and instead use a regular rb_root rb tree.
This reduces the size of struct extent_map_tree by 8 bytes and, since
this structure is embedded in struct btrfs_inode, it also reduces the
size of that structure by 8 bytes.
So on a 64 bits platform the size of btrfs_inode is reduced from 1032
bytes down to 1024 bytes.
This means we will be able to have 4 inodes per 4K page instead of 3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent_map.c | 48 +++++++++++++++++--------------
fs/btrfs/extent_map.h | 2 +-
fs/btrfs/tests/extent-map-tests.c | 6 ++--
3 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 4bc41b0dd701..35e163152dbc 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -33,7 +33,7 @@ void __cold extent_map_exit(void)
*/
void extent_map_tree_init(struct extent_map_tree *tree)
{
- tree->root = RB_ROOT_CACHED;
+ tree->root = RB_ROOT;
INIT_LIST_HEAD(&tree->modified_extents);
rwlock_init(&tree->lock);
}
@@ -85,27 +85,24 @@ static void dec_evictable_extent_maps(struct btrfs_inode *inode)
percpu_counter_dec(&fs_info->evictable_extent_maps);
}
-static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
+static int tree_insert(struct rb_root *root, struct extent_map *em)
{
- struct rb_node **p = &root->rb_root.rb_node;
+ struct rb_node **p = &root->rb_node;
struct rb_node *parent = NULL;
struct extent_map *entry = NULL;
struct rb_node *orig_parent = NULL;
u64 end = range_end(em->start, em->len);
- bool leftmost = true;
while (*p) {
parent = *p;
entry = rb_entry(parent, struct extent_map, rb_node);
- if (em->start < entry->start) {
+ if (em->start < entry->start)
p = &(*p)->rb_left;
- } else if (em->start >= extent_map_end(entry)) {
+ else if (em->start >= extent_map_end(entry))
p = &(*p)->rb_right;
- leftmost = false;
- } else {
+ else
return -EEXIST;
- }
}
orig_parent = parent;
@@ -128,7 +125,7 @@ static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
return -EEXIST;
rb_link_node(&em->rb_node, orig_parent, p);
- rb_insert_color_cached(&em->rb_node, root, leftmost);
+ rb_insert_color(&em->rb_node, root);
return 0;
}
@@ -265,7 +262,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
em->generation = max(em->generation, merge->generation);
em->flags |= EXTENT_FLAG_MERGED;
- rb_erase_cached(&merge->rb_node, &tree->root);
+ rb_erase(&merge->rb_node, &tree->root);
RB_CLEAR_NODE(&merge->rb_node);
free_extent_map(merge);
dec_evictable_extent_maps(inode);
@@ -278,7 +275,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
em->len += merge->len;
em->block_len += merge->block_len;
- rb_erase_cached(&merge->rb_node, &tree->root);
+ rb_erase(&merge->rb_node, &tree->root);
RB_CLEAR_NODE(&merge->rb_node);
em->generation = max(em->generation, merge->generation);
em->flags |= EXTENT_FLAG_MERGED;
@@ -410,7 +407,7 @@ __lookup_extent_mapping(struct extent_map_tree *tree,
struct rb_node *prev_or_next = NULL;
u64 end = range_end(start, len);
- rb_node = __tree_search(&tree->root.rb_root, start, &prev_or_next);
+ rb_node = __tree_search(&tree->root, start, &prev_or_next);
if (!rb_node) {
if (prev_or_next)
rb_node = prev_or_next;
@@ -479,7 +476,7 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
lockdep_assert_held_write(&tree->lock);
WARN_ON(em->flags & EXTENT_FLAG_PINNED);
- rb_erase_cached(&em->rb_node, &tree->root);
+ rb_erase(&em->rb_node, &tree->root);
if (!(em->flags & EXTENT_FLAG_LOGGING))
list_del_init(&em->list);
RB_CLEAR_NODE(&em->rb_node);
@@ -500,7 +497,7 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
ASSERT(extent_map_in_tree(cur));
if (!(cur->flags & EXTENT_FLAG_LOGGING))
list_del_init(&cur->list);
- rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->root);
+ rb_replace_node(&cur->rb_node, &new->rb_node, &tree->root);
RB_CLEAR_NODE(&cur->rb_node);
setup_extent_mapping(inode, new, modified);
@@ -657,18 +654,23 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
static void drop_all_extent_maps_fast(struct btrfs_inode *inode)
{
struct extent_map_tree *tree = &inode->extent_tree;
+ struct rb_node *node;
write_lock(&tree->lock);
- while (!RB_EMPTY_ROOT(&tree->root.rb_root)) {
+ node = rb_first(&tree->root);
+ while (node) {
struct extent_map *em;
- struct rb_node *node;
+ struct rb_node *next = rb_next(node);
- node = rb_first_cached(&tree->root);
em = rb_entry(node, struct extent_map, rb_node);
em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
remove_extent_mapping(inode, em);
free_extent_map(em);
- cond_resched_rwlock_write(&tree->lock);
+
+ if (cond_resched_rwlock_write(&tree->lock))
+ node = rb_first(&tree->root);
+ else
+ node = next;
}
write_unlock(&tree->lock);
}
@@ -1058,12 +1060,12 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t
return 0;
write_lock(&tree->lock);
- node = rb_first_cached(&tree->root);
+ node = rb_first(&tree->root);
while (node) {
+ struct rb_node *next = rb_next(node);
struct extent_map *em;
em = rb_entry(node, struct extent_map, rb_node);
- node = rb_next(node);
(*scanned)++;
if (em->flags & EXTENT_FLAG_PINNED)
@@ -1094,7 +1096,9 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t
* lock and took it again.
*/
if (cond_resched_rwlock_write(&tree->lock))
- node = rb_first_cached(&tree->root);
+ node = rb_first(&tree->root);
+ else
+ node = next;
}
write_unlock(&tree->lock);
up_read(&inode->i_mmap_lock);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 9c0793888d13..9144721b88a5 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -115,7 +115,7 @@ struct extent_map {
};
struct extent_map_tree {
- struct rb_root_cached root;
+ struct rb_root root;
struct list_head modified_extents;
rwlock_t lock;
};
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 075e6930acda..c511a1297956 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -19,8 +19,8 @@ static int free_extent_map_tree(struct btrfs_inode *inode)
int ret = 0;
write_lock(&em_tree->lock);
- while (!RB_EMPTY_ROOT(&em_tree->root.rb_root)) {
- node = rb_first_cached(&em_tree->root);
+ while (!RB_EMPTY_ROOT(&em_tree->root)) {
+ node = rb_first(&em_tree->root);
em = rb_entry(node, struct extent_map, rb_node);
remove_extent_mapping(inode, em);
@@ -551,7 +551,7 @@ static int validate_range(struct extent_map_tree *em_tree, int index)
struct rb_node *n;
int i;
- for (i = 0, n = rb_first_cached(&em_tree->root);
+ for (i = 0, n = rb_first(&em_tree->root);
valid_ranges[index][i].len && n;
i++, n = rb_next(n)) {
struct extent_map *entry = rb_entry(n, struct extent_map, rb_node);
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files
2024-05-09 8:41 ` Filipe Manana
@ 2024-05-13 18:39 ` David Sterba
0 siblings, 0 replies; 37+ messages in thread
From: David Sterba @ 2024-05-13 18:39 UTC (permalink / raw)
To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs
On Thu, May 09, 2024 at 09:41:50AM +0100, Filipe Manana wrote:
> On Thu, May 9, 2024 at 1:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > 在 2024/5/8 21:47, fdmanana@kernel.org 写道:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When not using the NO_HOLES feature we always allocate an io tree for an
> > > inode's file_extent_tree.
> >
> > I'm wondering can we discard non-NO_HOLES support directly so that we
> > can get rid of the file_extent_tree completely?
> >
> > Initially I'm wondering why NO_HOLES makes a difference, especially as
> > NO_HOLES can be set halfway, thus we can still have explicit hole extents.
> >
> > But it turns out that the whole file_extent_tree() is only to handle
> > non-NO_HOLES case so that we always have a hole for the whole file range
> > until i_size.
> >
> > Considering NO_HOLES is considered safe at 4.0 kernel, can we start
> > deprecating non-NO_HOLES?
>
> NO_HOLES has been the mkfs default for only 2 or 3 years IIRC. It's not that long.
>
> And how do you know everyone is already using NO_HOLES?
> Last week I saw a user filesystem that doesn't use it.
>
> And probably there are many more.
I think so. Soft deprecation can start, but we still have to keep the
code around, and given that many filesystems do not change features after
mkfs time ("mkfs and forget"), there are many created before the 5.15
defaults change. So, the estimated hard deprecation time is when kernel
5.15 falls out of LTS support upstream or in the distros we care about. A
rough estimate is 10 years from now.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root
2024-05-10 17:32 ` [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root fdmanana
@ 2024-05-14 15:49 ` David Sterba
0 siblings, 0 replies; 37+ messages in thread
From: David Sterba @ 2024-05-14 15:49 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, May 10, 2024 at 06:32:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
The xarray pattern in this patch seems to solve the problem with
preallocation and search. I was stuck on that with the extent buffer
conversion from radix -> xarray, so I'm just clarifying things out loud
for myself:
> + ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
> + if (ret)
> + return ret;
Unlocked reservation with NOFS
> +
> spin_lock(&root->inode_lock);
> - p = &root->inode_tree.rb_node;
> - while (*p) {
> - parent = *p;
> - entry = rb_entry(parent, struct btrfs_inode, rb_node);
> + existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
Locked insertion with preallocation, thus no ATOMIC needed
> spin_lock(&root->inode_lock);
...
> + inode = xa_find(&root->inodes, &from, ULONG_MAX, XA_PRESENT);
Locked search that does not need to handle the preallocated but not
inserted element.
What I did with the delayed inodes tree was xa_reserve + xa_load, which
needs special handling of potentially finding NULL from xa_load. This
worked for the delayed inodes but not for the extent buffers. We'd have
to add special cases for that, loop and wait, but xa_find(..., XA_PRESENT)
seems to be doing exactly that.
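For reference, the reserve-then-store pattern summarized above looks roughly like this. This is only a sketch against the kernel xarray API, not the exact btrfs code; the function name and error handling are illustrative:

```c
/* Sketch of unlocked reservation followed by locked, non-allocating store.
 * Illustrative only; not the exact code from the patch. */
static int add_inode_sketch(struct btrfs_root *root, struct btrfs_inode *inode,
			    unsigned long ino)
{
	struct btrfs_inode *existing;
	int ret;

	/* Unlocked reservation: may sleep and allocate, with GFP_NOFS. */
	ret = xa_reserve(&root->inodes, ino, GFP_NOFS);
	if (ret)
		return ret;

	spin_lock(&root->inode_lock);
	/*
	 * The slot is already reserved, so this store does not allocate;
	 * GFP_ATOMIC is passed only to satisfy the API in atomic context.
	 */
	existing = xa_store(&root->inodes, ino, inode, GFP_ATOMIC);
	spin_unlock(&root->inode_lock);

	if (xa_is_err(existing))
		return xa_err(existing);
	/* A non-NULL old entry means another task raced us on this inode. */
	return existing ? -EEXIST : 0;
}
```

Readers following along with xa_find(..., XA_PRESENT) will note it skips reserved-but-empty slots, which is exactly why the locked search above needs no special case.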
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree
2024-05-10 17:32 ` [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree fdmanana
@ 2024-05-14 15:58 ` David Sterba
0 siblings, 0 replies; 37+ messages in thread
From: David Sterba @ 2024-05-14 15:58 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, May 10, 2024 at 06:32:58PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> We are currently using a cached rb_root (struct rb_root_cached) for the
> rb root of struct extent_map_tree. This doesn't offer much of an advantage
> here because:
Good catch, this one escaped me. We need the RB tree tracking, so I did
not consider it; the cached pointer was hidden.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 00/10] btrfs: inode management and memory consumption improvements
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (9 preceding siblings ...)
2024-05-10 17:32 ` [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree fdmanana
@ 2024-05-14 16:08 ` David Sterba
2024-05-15 18:28 ` David Sterba
11 siblings, 0 replies; 37+ messages in thread
From: David Sterba @ 2024-05-14 16:08 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, May 10, 2024 at 06:32:48PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Some inode related improvements, to use an xarray to track open inodes per
> root instead of a red black tree, reduce lock contention and use less memory
> per btrfs inode, so now we can fit 4 inodes per 4K page instead of 3.
That's great, thank you very much.
The slack space per page is slightly less than 1/4 (25%) in SLUB, as a
slab can span multiple pages and thus pack objects more tightly.
Depending on that it can go down to 12.5% (for 8 pages per slab). But
this is still a noticeable improvement.
The 4-byte hole after the otime members is still there; we might find a
use for it in the future.
I'm thinking about adding a _static_assert(sizeof(btrfs_inode) <= 1024)
on a release config (and x86_64). Given the amount of time and effort
it took to shrink the size, I want to make it visible when a deliberate
change to either struct inode or any embedded structure increases the
size again.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 00/10] btrfs: inode management and memory consumption improvements
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
` (10 preceding siblings ...)
2024-05-14 16:08 ` [PATCH v2 00/10] btrfs: inode management and memory consumption improvements David Sterba
@ 2024-05-15 18:28 ` David Sterba
11 siblings, 0 replies; 37+ messages in thread
From: David Sterba @ 2024-05-15 18:28 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, May 10, 2024 at 06:32:48PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Some inode related improvements, to use an xarray to track open inodes per
> root instead of a red black tree, reduce lock contention and use less memory
> per btrfs inode, so now we can fit 4 inodes per 4K page instead of 3.
> More details in the change logs.
>
> V2: Added two more patches (9/10 and 10/10) to make sure btrfs_inode size
> is reduced to 1024 bytes when CONFIG_FS_ENCRYPTION and CONFIG_FS_VERITY
> are set. I wasn't using these configs before, and with them the final
> size for btrfs_inode was 1032 bytes and not 1016 bytes as I initially
> had - now the final size is 1024 bytes with those configs enabled.
>
> Filipe Manana (10):
> btrfs: use an xarray to track open inodes in a root
> btrfs: preallocate inodes xarray entry to avoid transaction abort
> btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
> btrfs: remove inode_lock from struct btrfs_root and use xarray locks
> btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
> btrfs: don't allocate file extent tree for non regular files
> btrfs: remove location key from struct btrfs_inode
> btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
> btrfs: rename rb_root member of extent_map_tree from map to root
> btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree
Reviewed-by: David Sterba <dsterba@suse.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
2024-05-08 12:17 [PATCH 0/8] btrfs: inode management and memory consumption improvements fdmanana
2024-05-08 12:17 ` [PATCH 1/8] btrfs: use an xarray to track open inodes in a root fdmanana
2024-05-09 0:18 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 2/8] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
2024-05-09 0:21 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 3/8] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
2024-05-09 0:23 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 4/8] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
2024-05-09 0:25 ` Qu Wenruo
2024-05-09 8:38 ` Filipe Manana
2024-05-09 8:42 ` Qu Wenruo
2024-05-08 12:17 ` [PATCH 5/8] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
2024-05-09 0:30 ` Qu Wenruo
2024-05-09 8:39 ` Filipe Manana
2024-05-08 12:17 ` [PATCH 6/8] btrfs: don't allocate file extent tree for non regular files fdmanana
2024-05-09 0:39 ` Qu Wenruo
2024-05-09 8:41 ` Filipe Manana
2024-05-13 18:39 ` David Sterba
2024-05-08 12:17 ` [PATCH 7/8] btrfs: remove location key from struct btrfs_inode fdmanana
2024-05-08 12:17 ` [PATCH 8/8] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
2024-05-09 17:56 ` [PATCH 0/8] btrfs: inode management and memory consumption improvements David Sterba
2024-05-10 11:04 ` Filipe Manana
2024-05-10 17:32 ` [PATCH v2 00/10] " fdmanana
2024-05-10 17:32 ` [PATCH v2 01/10] btrfs: use an xarray to track open inodes in a root fdmanana
2024-05-14 15:49 ` David Sterba
2024-05-10 17:32 ` [PATCH v2 02/10] btrfs: preallocate inodes xarray entry to avoid transaction abort fdmanana
2024-05-10 17:32 ` [PATCH v2 03/10] btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() fdmanana
2024-05-10 17:32 ` [PATCH v2 04/10] btrfs: remove inode_lock from struct btrfs_root and use xarray locks fdmanana
2024-05-10 17:32 ` [PATCH v2 05/10] btrfs: unify index_cnt and csum_bytes from struct btrfs_inode fdmanana
2024-05-10 17:32 ` [PATCH v2 06/10] btrfs: don't allocate file extent tree for non regular files fdmanana
2024-05-10 17:32 ` [PATCH v2 07/10] btrfs: remove location key from struct btrfs_inode fdmanana
2024-05-10 17:32 ` [PATCH v2 08/10] btrfs: remove objectid from struct btrfs_inode on 64 bits platforms fdmanana
2024-05-10 17:32 ` [PATCH v2 09/10] btrfs: rename rb_root member of extent_map_tree from map to root fdmanana
2024-05-10 17:32 ` [PATCH v2 10/10] btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree fdmanana
2024-05-14 15:58 ` David Sterba
2024-05-14 16:08 ` [PATCH v2 00/10] btrfs: inode management and memory consumption improvements David Sterba
2024-05-15 18:28 ` David Sterba