* [PATCH v4 0/2] rcu-based inode lookup for iget* @ 2024-06-11 17:38 Mateusz Guzik 2024-06-11 17:38 ` [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops Mateusz Guzik ` (2 more replies) 0 siblings, 3 replies; 6+ messages in thread From: Mateusz Guzik @ 2024-06-11 17:38 UTC (permalink / raw) To: brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-btrfs, josef, hch, Mateusz Guzik I revamped the commit message for patch 1, explicitly spelling out a bunch of things and adding bpftrace output. Please read it. There was some massaging of lines in the include/linux/fs.h header files. If you don't like it I would appreciate if you adjusted it however you see fit on your own. This adjusts the state to what was suggested by Christian. Specific results: ext4 (needed mkfs.ext4 -N 24000000): before: 3.77s user 890.90s system 1939% cpu 46.118 total after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%) btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c): before: 3.54s user 892.30s system 1966% cpu 45.549 total after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%) btrfs bottlenecks itself on its own locks here. Benchmark info in the commit message to the first patch. fs rundown is as follows: - ext4 patched implicitly - xfs does not use the inode hash - bcachefs is out of the picture as Kent decided to implement his own inode hashing based on rhashtable, for now private to his fs. - btrfs handled in the patchset I have not looked at others. v4: - only provide iget5_locked_rcu - add a btrfs ack v3: - export new routines with _GPL - don't use the extern keyword - add ilookup5_rcu to follow iget5_locked scheme v2: - add argument lists to new routines - assert the inode hash lock is not held as applicable - real btrfs patch included Mateusz Guzik (2): vfs: add rcu-based find_inode variants for iget ops btrfs: use iget5_locked_rcu fs/btrfs/inode.c | 2 +- fs/inode.c | 119 ++++++++++++++++++++++++++++++++++++++------- include/linux/fs.h | 10 +++- 3 files changed, 112 insertions(+), 19 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops 2024-06-11 17:38 [PATCH v4 0/2] rcu-based inode lookup for iget* Mateusz Guzik @ 2024-06-11 17:38 ` Mateusz Guzik 2024-06-13 14:40 ` Jan Kara 2024-06-11 17:38 ` [PATCH v4 2/2] btrfs: use iget5_locked_rcu Mateusz Guzik 2024-06-12 12:08 ` [PATCH v4 0/2] rcu-based inode lookup for iget* Christian Brauner 2 siblings, 1 reply; 6+ messages in thread From: Mateusz Guzik @ 2024-06-11 17:38 UTC (permalink / raw) To: brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-btrfs, josef, hch, Mateusz Guzik This avoids one inode hash lock acquire in the common case on inode creation, in effect significantly reducing contention. On the stock kernel said lock is typically taken twice: 1. once to check if the inode happens to already be present 2. once to add it to the hash The back-to-back lock/unlock pattern is known to degrade performance significantly, which is further exacerbated if the hash is heavily populated (long chains to walk, extending hold time). Arguably hash sizing and hashing algo need to be revisited, but that's beyond the scope of this patch. With the acquire from step 1 eliminated with RCU lookup throughput increases significantly at the scale of 20 cores (benchmark results at the bottom). So happens the hash already supports RCU-based operation, but lookups on inode insertions didn't take advantage of it. This of course has its limits as the global lock is still a bottleneck. There was a patchset posted which introduced fine-grained locking[1] but it appears staled. Apart from that doubt was expressed whether a handrolled hash implementation is appropriate to begin with, suggesting replacement with rhashtables. Nobody committed to carrying [1] across the finish line or implementing anything better, thus the bandaid below. iget_locked consumers (notably ext4) get away without any changes because inode comparison method is built-in. iget5_locked consumers pass a custom callback. Since removal of locking adds more problems (inode can be changing) it's not safe to assume all filesystems happen to cope. Thus iget5_locked_rcu gets added, requiring manual conversion of interested filesystems. In order to reduce code duplication find_inode and find_inode_fast grow an argument indicating whether inode hash lock is held, which is passed down in case sleeping is necessary. They always rcu_read_lock, which is redundant but harmless. Doing it conditionally reduces readability for no real gain that I can see. RCU-alike restrictions were already put on callbacks due to the hash spinlock being held. Benchmarking: There is a real cache-busting workload scanning millions of files in parallel (it's a backup appliance), where the initial lookup is guaranteed to fail resulting in the two lock acquires on stock kernel (and one with the patch at hand). Implemented below is a synthetic benchmark providing the same behavior. [I shall note the workload is not running on Linux, instead it was causing trouble elsewhere. Benchmark below was used while addressing said problems and was found to adequately represent the real workload.] Total real time fluctuates by 1-2s. With 20 threads each walking a dedicated 1000 dirs * 1000 files directory tree to stat(2) on a 32 core + 24GB RAM vm: ext4 (needed mkfs.ext4 -N 24000000): before: 3.77s user 890.90s system 1939% cpu 46.118 total after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%) That's 20 million files to visit, while the machine can only cache about 15 million at a time (obtained from ext4_inode_cache object count in /proc/slabinfo). Since each terminal inode is only visited once per run this amounts to 0% hit ratio for the dentry cache and the hash table (there are however hits for the intermediate directories). On repeated runs the kernel caches the last ~15 mln, meaning there is ~5 mln of uncached inodes which are going to be visited first, evicting the previously cached state as it happens. Lack of hits can be trivially verified with bpftrace, like so: bpftrace -e 'kretprobe:find_inode_fast { @[kstack(), retval != 0] = count(); }'\ -c "/bin/sh walktrees /testfs 20" Best ran more than once. Expected results after "warmup": [snip] @[ __ext4_iget+275 ext4_lookup+224 __lookup_slow+130 walk_component+219 link_path_walk.part.0.constprop.0+614 path_lookupat+62 filename_lookup+204 vfs_statx+128 vfs_fstatat+131 __do_sys_newfstatat+38 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+118 , 1]: 20000 @[ __ext4_iget+275 ext4_lookup+224 __lookup_slow+130 walk_component+219 path_lookupat+106 filename_lookup+204 vfs_statx+128 vfs_fstatat+131 __do_sys_newfstatat+38 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+118 , 1]: 20000000 That is 20 million calls for the initial lookup and 20 million after allocating a new inode, all of them failing to return a value != 0 (i.e., they are returning NULL -- no match found). Of course aborting the benchmark in the middle and starting it again (or messing with the state in other ways) is going to alter these results. Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/ Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- fs/inode.c | 97 ++++++++++++++++++++++++++++++++++++++-------- include/linux/fs.h | 7 +++- 2 files changed, 86 insertions(+), 18 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 3a41f83a4ba5..8c57cea7bbbb 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -886,36 +886,45 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) return freed; } -static void __wait_on_freeing_inode(struct inode *inode); +static void __wait_on_freeing_inode(struct inode *inode, bool locked); /* * Called with the inode lock held. */ static struct inode *find_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), - void *data) + void *data, bool locked) { struct inode *inode = NULL; + if (locked) + lockdep_assert_held(&inode_hash_lock); + else + lockdep_assert_not_held(&inode_hash_lock); + + rcu_read_lock(); repeat: - hlist_for_each_entry(inode, head, i_hash) { + hlist_for_each_entry_rcu(inode, head, i_hash) { if (inode->i_sb != sb) continue; if (!test(inode, data)) continue; spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING|I_WILL_FREE)) { - __wait_on_freeing_inode(inode); + __wait_on_freeing_inode(inode, locked); goto repeat; } if (unlikely(inode->i_state & I_CREATING)) { spin_unlock(&inode->i_lock); + rcu_read_unlock(); return ERR_PTR(-ESTALE); } __iget(inode); spin_unlock(&inode->i_lock); + rcu_read_unlock(); return inode; } + rcu_read_unlock(); return NULL; } @@ -924,29 +933,39 @@ static struct inode *find_inode(struct super_block *sb, * iget_locked for details. */ static struct inode *find_inode_fast(struct super_block *sb, - struct hlist_head *head, unsigned long ino) + struct hlist_head *head, unsigned long ino, + bool locked) { struct inode *inode = NULL; + if (locked) + lockdep_assert_held(&inode_hash_lock); + else + lockdep_assert_not_held(&inode_hash_lock); + + rcu_read_lock(); repeat: - hlist_for_each_entry(inode, head, i_hash) { + hlist_for_each_entry_rcu(inode, head, i_hash) { if (inode->i_ino != ino) continue; if (inode->i_sb != sb) continue; spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING|I_WILL_FREE)) { - __wait_on_freeing_inode(inode); + __wait_on_freeing_inode(inode, locked); goto repeat; } if (unlikely(inode->i_state & I_CREATING)) { spin_unlock(&inode->i_lock); + rcu_read_unlock(); return ERR_PTR(-ESTALE); } __iget(inode); spin_unlock(&inode->i_lock); + rcu_read_unlock(); return inode; } + rcu_read_unlock(); return NULL; } @@ -1161,7 +1180,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval, again: spin_lock(&inode_hash_lock); - old = find_inode(inode->i_sb, head, test, data); + old = find_inode(inode->i_sb, head, test, data, true); if (unlikely(old)) { /* * Uhhuh, somebody else created the same inode under us. @@ -1245,6 +1264,48 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval, } EXPORT_SYMBOL(iget5_locked); +/** + * iget5_locked_rcu - obtain an inode from a mounted file system + * @sb: super block of file system + * @hashval: hash value (usually inode number) to get + * @test: callback used for comparisons between inodes + * @set: callback used to initialize a new struct inode + * @data: opaque data pointer to pass to @test and @set + * + * This is equivalent to iget5_locked, except the @test callback must + * tolerate the inode not being stable, including being mid-teardown. + */ +struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval, + int (*test)(struct inode *, void *), + int (*set)(struct inode *, void *), void *data) +{ + struct hlist_head *head = inode_hashtable + hash(sb, hashval); + struct inode *inode, *new; + +again: + inode = find_inode(sb, head, test, data, false); + if (inode) { + if (IS_ERR(inode)) + return NULL; + wait_on_inode(inode); + if (unlikely(inode_unhashed(inode))) { + iput(inode); + goto again; + } + return inode; + } + + new = alloc_inode(sb); + if (new) { + new->i_state = 0; + inode = inode_insert5(new, hashval, test, set, data); + if (unlikely(inode != new)) + destroy_inode(new); + } + return inode; +} +EXPORT_SYMBOL_GPL(iget5_locked_rcu); + /** * iget_locked - obtain an inode from a mounted file system * @sb: super block of file system @@ -1263,9 +1324,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino) struct hlist_head *head = inode_hashtable + hash(sb, ino); struct inode *inode; again: - spin_lock(&inode_hash_lock); - inode = find_inode_fast(sb, head, ino); - spin_unlock(&inode_hash_lock); + inode = find_inode_fast(sb, head, ino, false); if (inode) { if (IS_ERR(inode)) return NULL; @@ -1283,7 +1342,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino) spin_lock(&inode_hash_lock); /* We released the lock, so.. */ - old = find_inode_fast(sb, head, ino); + old = find_inode_fast(sb, head, ino, true); if (!old) { inode->i_ino = ino; spin_lock(&inode->i_lock); @@ -1419,7 +1478,7 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval, struct inode *inode; spin_lock(&inode_hash_lock); - inode = find_inode(sb, head, test, data); + inode = find_inode(sb, head, test, data, true); spin_unlock(&inode_hash_lock); return IS_ERR(inode) ? NULL : inode; @@ -1474,7 +1533,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino) struct inode *inode; again: spin_lock(&inode_hash_lock); - inode = find_inode_fast(sb, head, ino); + inode = find_inode_fast(sb, head, ino, true); spin_unlock(&inode_hash_lock); if (inode) { @@ -2235,17 +2294,21 @@ EXPORT_SYMBOL(inode_needs_sync); * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list * will DTRT. */ -static void __wait_on_freeing_inode(struct inode *inode) +static void __wait_on_freeing_inode(struct inode *inode, bool locked) { wait_queue_head_t *wq; DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW); wq = bit_waitqueue(&inode->i_state, __I_NEW); prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); spin_unlock(&inode->i_lock); - spin_unlock(&inode_hash_lock); + rcu_read_unlock(); + if (locked) + spin_unlock(&inode_hash_lock); schedule(); finish_wait(wq, &wait.wq_entry); - spin_lock(&inode_hash_lock); + if (locked) + spin_lock(&inode_hash_lock); + rcu_read_lock(); } static __initdata unsigned long ihash_entries; diff --git a/include/linux/fs.h b/include/linux/fs.h index bfc1e6407bf6..3eb88cb3c1e6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3045,7 +3045,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data); -extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); +struct inode *iget5_locked(struct super_block *, unsigned long, + int (*test)(struct inode *, void *), + int (*set)(struct inode *, void *), void *); +struct inode *iget5_locked_rcu(struct super_block *, unsigned long, + int (*test)(struct inode *, void *), + int (*set)(struct inode *, void *), void *); extern struct inode * iget_locked(struct super_block *, unsigned long); extern struct inode *find_inode_nowait(struct super_block *, unsigned long, -- 2.43.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops 2024-06-11 17:38 ` [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops Mateusz Guzik @ 2024-06-13 14:40 ` Jan Kara 0 siblings, 0 replies; 6+ messages in thread From: Jan Kara @ 2024-06-13 14:40 UTC (permalink / raw) To: Mateusz Guzik Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, linux-btrfs, josef, hch On Tue 11-06-24 19:38:22, Mateusz Guzik wrote: > This avoids one inode hash lock acquire in the common case on inode > creation, in effect significantly reducing contention. > > On the stock kernel said lock is typically taken twice: > 1. once to check if the inode happens to already be present > 2. once to add it to the hash > > The back-to-back lock/unlock pattern is known to degrade performance > significantly, which is further exacerbated if the hash is heavily > populated (long chains to walk, extending hold time). Arguably hash > sizing and hashing algo need to be revisited, but that's beyond the > scope of this patch. > > With the acquire from step 1 eliminated with RCU lookup throughput > increases significantly at the scale of 20 cores (benchmark results at > the bottom). > > So happens the hash already supports RCU-based operation, but lookups on > inode insertions didn't take advantage of it. > > This of course has its limits as the global lock is still a bottleneck. > There was a patchset posted which introduced fine-grained locking[1] but > it appears staled. Apart from that doubt was expressed whether a > handrolled hash implementation is appropriate to begin with, suggesting > replacement with rhashtables. Nobody committed to carrying [1] across > the finish line or implementing anything better, thus the bandaid below. > > iget_locked consumers (notably ext4) get away without any changes > because inode comparison method is built-in. > > iget5_locked consumers pass a custom callback. Since removal of locking > adds more problems (inode can be changing) it's not safe to assume all > filesystems happen to cope. Thus iget5_locked_rcu gets added, requiring > manual conversion of interested filesystems. > > In order to reduce code duplication find_inode and find_inode_fast grow > an argument indicating whether inode hash lock is held, which is passed > down in case sleeping is necessary. They always rcu_read_lock, which is > redundant but harmless. Doing it conditionally reduces readability for > no real gain that I can see. RCU-alike restrictions were already put on > callbacks due to the hash spinlock being held. > > Benchmarking: > There is a real cache-busting workload scanning millions of files in > parallel (it's a backup appliance), where the initial lookup is > guaranteed to fail resulting in the two lock acquires on stock kernel > (and one with the patch at hand). > > Implemented below is a synthetic benchmark providing the same behavior. > [I shall note the workload is not running on Linux, instead it was > causing trouble elsewhere. Benchmark below was used while addressing > said problems and was found to adequately represent the real workload.] > > Total real time fluctuates by 1-2s. > > With 20 threads each walking a dedicated 1000 dirs * 1000 files > directory tree to stat(2) on a 32 core + 24GB RAM vm: > > ext4 (needed mkfs.ext4 -N 24000000): > before: 3.77s user 890.90s system 1939% cpu 46.118 total > after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%) > > That's 20 million files to visit, while the machine can only cache about > 15 million at a time (obtained from ext4_inode_cache object count in > /proc/slabinfo). Since each terminal inode is only visited once per run > this amounts to 0% hit ratio for the dentry cache and the hash table > (there are however hits for the intermediate directories). > > On repeated runs the kernel caches the last ~15 mln, meaning there is ~5 > mln of uncached inodes which are going to be visited first, evicting the > previously cached state as it happens. > > Lack of hits can be trivially verified with bpftrace, like so: > bpftrace -e 'kretprobe:find_inode_fast { @[kstack(), retval != 0] = count(); }'\ > -c "/bin/sh walktrees /testfs 20" > > Best ran more than once. > > Expected results after "warmup": > [snip] > @[ > __ext4_iget+275 > ext4_lookup+224 > __lookup_slow+130 > walk_component+219 > link_path_walk.part.0.constprop.0+614 > path_lookupat+62 > filename_lookup+204 > vfs_statx+128 > vfs_fstatat+131 > __do_sys_newfstatat+38 > do_syscall_64+87 > entry_SYSCALL_64_after_hwframe+118 > , 1]: 20000 > @[ > __ext4_iget+275 > ext4_lookup+224 > __lookup_slow+130 > walk_component+219 > path_lookupat+106 > filename_lookup+204 > vfs_statx+128 > vfs_fstatat+131 > __do_sys_newfstatat+38 > do_syscall_64+87 > entry_SYSCALL_64_after_hwframe+118 > , 1]: 20000000 > > That is 20 million calls for the initial lookup and 20 million after > allocating a new inode, all of them failing to return a value != 0 > (i.e., they are returning NULL -- no match found). > > Of course aborting the benchmark in the middle and starting it again (or > messing with the state in other ways) is going to alter these results. > > Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz > > [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/ > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Looks good to me. Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/inode.c | 97 ++++++++++++++++++++++++++++++++++++++-------- > include/linux/fs.h | 7 +++- > 2 files changed, 86 insertions(+), 18 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index 3a41f83a4ba5..8c57cea7bbbb 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -886,36 +886,45 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) > return freed; > } > > -static void __wait_on_freeing_inode(struct inode *inode); > +static void __wait_on_freeing_inode(struct inode *inode, bool locked); > /* > * Called with the inode lock held. > */ > static struct inode *find_inode(struct super_block *sb, > struct hlist_head *head, > int (*test)(struct inode *, void *), > - void *data) > + void *data, bool locked) > { > struct inode *inode = NULL; > > + if (locked) > + lockdep_assert_held(&inode_hash_lock); > + else > + lockdep_assert_not_held(&inode_hash_lock); > + > + rcu_read_lock(); > repeat: > - hlist_for_each_entry(inode, head, i_hash) { > + hlist_for_each_entry_rcu(inode, head, i_hash) { > if (inode->i_sb != sb) > continue; > if (!test(inode, data)) > continue; > spin_lock(&inode->i_lock); > if (inode->i_state & (I_FREEING|I_WILL_FREE)) { > - __wait_on_freeing_inode(inode); > + __wait_on_freeing_inode(inode, locked); > goto repeat; > } > if (unlikely(inode->i_state & I_CREATING)) { > spin_unlock(&inode->i_lock); > + rcu_read_unlock(); > return ERR_PTR(-ESTALE); > } > __iget(inode); > spin_unlock(&inode->i_lock); > + rcu_read_unlock(); > return inode; > } > + rcu_read_unlock(); > return NULL; > } > > @@ -924,29 +933,39 @@ static struct inode *find_inode(struct super_block *sb, > * iget_locked for details. > */ > static struct inode *find_inode_fast(struct super_block *sb, > - struct hlist_head *head, unsigned long ino) > + struct hlist_head *head, unsigned long ino, > + bool locked) > { > struct inode *inode = NULL; > > + if (locked) > + lockdep_assert_held(&inode_hash_lock); > + else > + lockdep_assert_not_held(&inode_hash_lock); > + > + rcu_read_lock(); > repeat: > - hlist_for_each_entry(inode, head, i_hash) { > + hlist_for_each_entry_rcu(inode, head, i_hash) { > if (inode->i_ino != ino) > continue; > if (inode->i_sb != sb) > continue; > spin_lock(&inode->i_lock); > if (inode->i_state & (I_FREEING|I_WILL_FREE)) { > - __wait_on_freeing_inode(inode); > + __wait_on_freeing_inode(inode, locked); > goto repeat; > } > if (unlikely(inode->i_state & I_CREATING)) { > spin_unlock(&inode->i_lock); > + rcu_read_unlock(); > return ERR_PTR(-ESTALE); > } > __iget(inode); > spin_unlock(&inode->i_lock); > + rcu_read_unlock(); > return inode; > } > + rcu_read_unlock(); > return NULL; > } > > @@ -1161,7 +1180,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval, > > again: > spin_lock(&inode_hash_lock); > - old = find_inode(inode->i_sb, head, test, data); > + old = find_inode(inode->i_sb, head, test, data, true); > if (unlikely(old)) { > /* > * Uhhuh, somebody else created the same inode under us. > @@ -1245,6 +1264,48 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval, > } > EXPORT_SYMBOL(iget5_locked); > > +/** > + * iget5_locked_rcu - obtain an inode from a mounted file system > + * @sb: super block of file system > + * @hashval: hash value (usually inode number) to get > + * @test: callback used for comparisons between inodes > + * @set: callback used to initialize a new struct inode > + * @data: opaque data pointer to pass to @test and @set > + * > + * This is equivalent to iget5_locked, except the @test callback must > + * tolerate the inode not being stable, including being mid-teardown. > + */ > +struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval, > + int (*test)(struct inode *, void *), > + int (*set)(struct inode *, void *), void *data) > +{ > + struct hlist_head *head = inode_hashtable + hash(sb, hashval); > + struct inode *inode, *new; > + > +again: > + inode = find_inode(sb, head, test, data, false); > + if (inode) { > + if (IS_ERR(inode)) > + return NULL; > + wait_on_inode(inode); > + if (unlikely(inode_unhashed(inode))) { > + iput(inode); > + goto again; > + } > + return inode; > + } > + > + new = alloc_inode(sb); > + if (new) { > + new->i_state = 0; > + inode = inode_insert5(new, hashval, test, set, data); > + if (unlikely(inode != new)) > + destroy_inode(new); > + } > + return inode; > +} > +EXPORT_SYMBOL_GPL(iget5_locked_rcu); > + > /** > * iget_locked - obtain an inode from a mounted file system > * @sb: super block of file system > @@ -1263,9 +1324,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino) > struct hlist_head *head = inode_hashtable + hash(sb, ino); > struct inode *inode; > again: > - spin_lock(&inode_hash_lock); > - inode = find_inode_fast(sb, head, ino); > - spin_unlock(&inode_hash_lock); > + inode = find_inode_fast(sb, head, ino, false); > if (inode) { > if (IS_ERR(inode)) > return NULL; > @@ -1283,7 +1342,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino) > > spin_lock(&inode_hash_lock); > /* We released the lock, so.. */ > - old = find_inode_fast(sb, head, ino); > + old = find_inode_fast(sb, head, ino, true); > if (!old) { > inode->i_ino = ino; > spin_lock(&inode->i_lock); > @@ -1419,7 +1478,7 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval, > struct inode *inode; > > spin_lock(&inode_hash_lock); > - inode = find_inode(sb, head, test, data); > + inode = find_inode(sb, head, test, data, true); > spin_unlock(&inode_hash_lock); > > return IS_ERR(inode) ? NULL : inode; > @@ -1474,7 +1533,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino) > struct inode *inode; > again: > spin_lock(&inode_hash_lock); > - inode = find_inode_fast(sb, head, ino); > + inode = find_inode_fast(sb, head, ino, true); > spin_unlock(&inode_hash_lock); > > if (inode) { > @@ -2235,17 +2294,21 @@ EXPORT_SYMBOL(inode_needs_sync); > * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list > * will DTRT. > */ > -static void __wait_on_freeing_inode(struct inode *inode) > +static void __wait_on_freeing_inode(struct inode *inode, bool locked) > { > wait_queue_head_t *wq; > DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW); > wq = bit_waitqueue(&inode->i_state, __I_NEW); > prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); > spin_unlock(&inode->i_lock); > - spin_unlock(&inode_hash_lock); > + rcu_read_unlock(); > + if (locked) > + spin_unlock(&inode_hash_lock); > schedule(); > finish_wait(wq, &wait.wq_entry); > - spin_lock(&inode_hash_lock); > + if (locked) > + spin_lock(&inode_hash_lock); > + rcu_read_lock(); > } > > static __initdata unsigned long ihash_entries; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index bfc1e6407bf6..3eb88cb3c1e6 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -3045,7 +3045,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval, > int (*test)(struct inode *, void *), > int (*set)(struct inode *, void *), > void *data); > -extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); > +struct inode *iget5_locked(struct super_block *, unsigned long, > + int (*test)(struct inode *, void *), > + int (*set)(struct inode *, void *), void *); > +struct inode *iget5_locked_rcu(struct super_block *, unsigned long, > + int (*test)(struct inode *, void *), > + int (*set)(struct inode *, void *), void *); > extern struct inode * iget_locked(struct super_block *, unsigned long); > extern struct inode *find_inode_nowait(struct super_block *, > unsigned long, > -- > 2.43.0 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v4 2/2] btrfs: use iget5_locked_rcu 2024-06-11 17:38 [PATCH v4 0/2] rcu-based inode lookup for iget* Mateusz Guzik 2024-06-11 17:38 ` [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops Mateusz Guzik @ 2024-06-11 17:38 ` Mateusz Guzik 2024-06-12 18:33 ` David Sterba 2024-06-12 12:08 ` [PATCH v4 0/2] rcu-based inode lookup for iget* Christian Brauner 2 siblings, 1 reply; 6+ messages in thread From: Mateusz Guzik @ 2024-06-11 17:38 UTC (permalink / raw) To: brauner Cc: viro, jack, linux-kernel, linux-fsdevel, linux-btrfs, josef, hch, Mateusz Guzik With 20 threads each walking a dedicated 1000 dirs * 1000 files directory tree to stat(2) on a 32 core + 24GB ram vm: before: 3.54s user 892.30s system 1966% cpu 45.549 total after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%) Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- fs/btrfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4883cb512379..457d2c18d071 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5588,7 +5588,7 @@ static struct inode *btrfs_iget_locked(struct super_block *s, u64 ino, args.ino = ino; args.root = root; - inode = iget5_locked(s, hashval, btrfs_find_actor, + inode = iget5_locked_rcu(s, hashval, btrfs_find_actor, btrfs_init_locked_inode, (void *)&args); return inode; -- 2.43.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v4 2/2] btrfs: use iget5_locked_rcu 2024-06-11 17:38 ` [PATCH v4 2/2] btrfs: use iget5_locked_rcu Mateusz Guzik @ 2024-06-12 18:33 ` David Sterba 0 siblings, 0 replies; 6+ messages in thread From: David Sterba @ 2024-06-12 18:33 UTC (permalink / raw) To: Mateusz Guzik Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, linux-btrfs, josef, hch On Tue, Jun 11, 2024 at 07:38:23PM +0200, Mateusz Guzik wrote: > With 20 threads each walking a dedicated 1000 dirs * 1000 files > directory tree to stat(2) on a 32 core + 24GB ram vm: > > before: 3.54s user 892.30s system 1966% cpu 45.549 total > after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%) > > Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz > > Reviewed-by: Josef Bacik <josef@toxicpanda.com> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Acked-by: David Sterba <dsterba@suse.com> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v4 0/2] rcu-based inode lookup for iget* 2024-06-11 17:38 [PATCH v4 0/2] rcu-based inode lookup for iget* Mateusz Guzik 2024-06-11 17:38 ` [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops Mateusz Guzik 2024-06-11 17:38 ` [PATCH v4 2/2] btrfs: use iget5_locked_rcu Mateusz Guzik @ 2024-06-12 12:08 ` Christian Brauner 2 siblings, 0 replies; 6+ messages in thread From: Christian Brauner @ 2024-06-12 12:08 UTC (permalink / raw) To: Mateusz Guzik Cc: Christian Brauner, viro, jack, linux-kernel, linux-fsdevel, linux-btrfs, josef, hch On Tue, 11 Jun 2024 19:38:21 +0200, Mateusz Guzik wrote: > I revamped the commit message for patch 1, explicitly spelling out a > bunch of things and adding bpftrace output. Please read it. > > There was some massaging of lines in the include/linux/fs.h header > files. If you don't like it I would appreciate if you adjusted it > however you see fit on your own. > > [...] Applied to the vfs.inode.rcu branch of the vfs/vfs.git tree. Patches in the vfs.inode.rcu branch should appear in linux-next soon. Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it. It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated. Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch. tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.inode.rcu [1/2] vfs: add rcu-based find_inode variants for iget ops https://git.kernel.org/vfs/vfs/c/d50a5495bae7 [2/2] btrfs: use iget5_locked_rcu https://git.kernel.org/vfs/vfs/c/4921028a9c89 ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-06-13 14:40 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-06-11 17:38 [PATCH v4 0/2] rcu-based inode lookup for iget* Mateusz Guzik 2024-06-11 17:38 ` [PATCH v4 1/2] vfs: add rcu-based find_inode variants for iget ops Mateusz Guzik 2024-06-13 14:40 ` Jan Kara 2024-06-11 17:38 ` [PATCH v4 2/2] btrfs: use iget5_locked_rcu Mateusz Guzik 2024-06-12 18:33 ` David Sterba 2024-06-12 12:08 ` [PATCH v4 0/2] rcu-based inode lookup for iget* Christian Brauner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).