* [GIT PULL] bcachefs changes for 6.12-rc1
@ 2024-09-21 19:27 Kent Overstreet
2024-09-23 17:18 ` Linus Torvalds
2024-09-23 19:06 ` pr-tracker-bot
0 siblings, 2 replies; 28+ messages in thread
From: Kent Overstreet @ 2024-09-21 19:27 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-bcachefs, linux-fsdevel, linux-kernel
Relatively quiet cycle here - which is good, because 6.11 was huge.
I think we'll be able to take off EXPERIMENTAL inside of a year; see
below for more.
Cheers,
Kent
The following changes since commit 16005147cca41a0f67b5def2a4656286f8c0db4a:
bcachefs: Don't delete open files in online fsck (2024-09-09 09:41:47 -0400)
are available in the Git repository at:
git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21
for you to fetch changes up to 025c55a4c7f11ea38521c6e797f3192ad8768c93:
bcachefs: return err ptr instead of null in read sb clean (2024-09-21 11:39:49 -0400)
----------------------------------------------------------------
bcachefs changes for 6.12-rc1
rcu_pending, btree key cache rework: this solves lock contention in the
key cache, eliminating the biggest source of the srcu lock hold time
warnings, and drastically improving performance on some metadata heavy
workloads - on multithreaded creates we're now 3-4x faster than xfs.
We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open coded
"subvolume_get_snapshot()" and also fixing another source of srcu lock
time warnings, by running each loop iteration in its own transaction (as
the existing for_each_btree_key() does).
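A toy model of that looping pattern - trans_begin()/trans_end() here are
stand-ins for the real btree transaction API, and the only point being
modeled is that each key gets its own short-lived section rather than one
long hold:

```c
/* Toy model of the pattern described above: a "transaction" begun afresh
 * for every key visited, so no single read-side section (and thus no SRCU
 * hold) spans the whole walk.  All names are illustrative, not the actual
 * bcachefs API. */
struct sim_trans {
	int begins;	/* how many sections we opened */
	int active;	/* nonzero while inside a section */
};

static void trans_begin(struct sim_trans *t)
{
	t->begins++;
	t->active = 1;
}

static void trans_end(struct sim_trans *t)
{
	t->active = 0;
}

/* walk keys [start, end]: each iteration runs in its own section */
static int walk_keys(struct sim_trans *t, int start, int end)
{
	int visited = 0;

	for (int k = start; k <= end; k++) {
		trans_begin(t);		/* fresh section per key */
		visited++;		/* ... process key k here ... */
		trans_end(t);
	}
	return visited;
}
```

The real helper also handles transaction restarts and iterator
repositioning; this sketch only models the hold-time shape.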
More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node locks.
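The invariant can be modeled in a few lines of userspace C - struct
fields and function names below are illustrative stand-ins, not
bcachefs's actual btree_trans layout:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the assert described above: a transaction tracks
 * whether it believes it may hold locks (locked), and a count of node
 * locks actually held; !locked must imply no node locks. */
struct trans {
	bool locked;		/* "I may hold node locks" */
	int  nodes_locked;	/* node locks actually held */
};

static void trans_assert_invariant(struct trans *t)
{
	/* the invariant: not-locked implies zero node locks held */
	assert(t->locked || t->nodes_locked == 0);
}

static void node_lock(struct trans *t)
{
	assert(t->locked);	/* must declare intent before locking */
	t->nodes_locked++;
}

static void node_unlock(struct trans *t)
{
	t->nodes_locked--;
	trans_assert_invariant(t);
}

static void trans_unlock(struct trans *t)
{
	while (t->nodes_locked)
		node_unlock(t);
	t->locked = false;
	trans_assert_invariant(t);
}
```

Because the check is a plain counter comparison rather than lockdep
tracking, it costs nothing on release builds and catches the class of
bug lockdep can't see here.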
Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.
Rework of btree node pinning, which we use in backpointers fsck. The old
hacky implementation, where the shrinker just skipped over nodes in the
pinned range, was causing OOMs; instead we now use another shrinker with
a much higher seeks number for pinned nodes.
Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data to a
specific target.
Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
allocations.
Idmap mounts are now supported - Hongbo.
Rename whiteouts are now supported - Hongbo.
Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.
Status, and when will we be taking off experimental:
----------------------------------------------------
Going by critical, user facing bugs getting found and fixed, we're
nearly there. There are a couple key items that need to be finished
before we can take off the experimental label:
- The end-user experience is still pretty painful when the root
filesystem needs a fsck; we need some form of limited self healing so
that necessary repair gets run automatically. Errors (by type) are
recorded in the superblock, so what we need to do next is convert
remaining inconsistent() errors to fsck() errors (so that all runtime
inconsistencies are logged in the superblock), and we need to go
through the list of fsck errors and classify them by which fsck passes
are needed to repair them.
- We need comprehensive torture testing for all our repair paths, to
shake out remaining bugs there. Thomas has been working on the tooling
for this, so this is coming soonish.
Slightly less critical items:
- We need to improve the end-user experience for degraded mounts: right
now, a degraded root filesystem means dropping to an initramfs shell
or somehow inputting mount options manually (we don't want to allow
degraded mounts without some form of user input, except on unattended
servers) - we need the mount helper to prompt the user to allow
mounting degraded, and make sure this works with systemd.
- Scalability: we have users running 100TB+ filesystems, and that's
effectively the limit right now due to fsck times. We have some
reworks in the pipeline to address this; we're aiming to make petabyte
sized filesystems practical.
----------------------------------------------------------------
Alan Huang (8):
bcachefs: Remove unused parameter of bkey_mantissa
bcachefs: Remove unused parameter of bkey_mantissa_bits_dropped
bcachefs: Remove dead code in __build_ro_aux_tree
bcachefs: Convert open-coded extra computation to helper
bcachefs: Minimize the search range used to calculate the mantissa
bcachefs: Remove the prev array stuff
bcachefs: Remove unused parameter
bcachefs: Refactor bch2_bset_fix_lookup_table
Alyssa Ross (1):
bcachefs: Fix negative timespecs
Chen Yufan (1):
bcachefs: Convert to use jiffies macros
Diogo Jahchan Koike (1):
bcachefs: return err ptr instead of null in read sb clean
Feiko Nanninga (1):
bcachefs: Fix sysfs rebalance duration waited formatting
Hongbo Li (2):
bcachefs: support idmap mounts
bcachefs: Fix compilation error for bch2_sb_member_alloc
Julian Sun (4):
bcachefs: remove the unused macro definition
bcachefs: fix macro definition allocate_dropping_locks_errcode
bcachefs: fix macro definition allocate_dropping_locks
bcachefs: remove the unused parameter in macro bkey_crc_next
Kent Overstreet (68):
inode: make __iget() a static inline
bcachefs: switch to rhashtable for vfs inodes hash
bcachefs: Fix deadlock in __wait_on_freeing_inode()
lib/generic-radix-tree.c: genradix_ptr_inlined()
lib/generic-radix-tree.c: add preallocation
bcachefs: rcu_pending
bcachefs: rcu_pending now works in userspace
bcachefs: Rip out freelists from btree key cache
bcachefs: key cache can now allocate from pending
bcachefs: data_allowed is now an opts.h option
bcachefs: bch2_opt_set_sb() can now set (some) device options
bcachefs: Opt_durability can now be set via bch2_opt_set_sb()
bcachefs: Add check for btree_path ref overflow
bcachefs: Btree path tracepoints
bcachefs: kill bch2_btree_iter_peek_and_restart()
bcachefs: bchfs_read(): call trans_begin() on every loop iter
bcachefs: bch2_fiemap(): call trans_begin() on every loop iter
bcachefs: for_each_btree_key_in_subvolume_upto()
bcachefs: bch2_readdir() -> for_each_btree_key_in_subvolume_upto
bcachefs: bch2_xattr_list() -> for_each_btree_key_in_subvolume_upto
bcachefs: bch2_seek_data() -> for_each_btree_key_in_subvolume_upto
bcachefs: bch2_seek_hole() -> for_each_btree_key_in_subvolume_upto
bcachefs: range_has_data() -> for_each_btree_key_in_subvolume_upto
bcachefs: bch2_folio_set() -> for_each_btree_key_in_subvolume_upto
bcachefs: quota_reserve_range() -> for_each_btree_key_in_subvolume_upto
bcachefs: Move rebalance_status out of sysfs/internal
bcachefs: promote_whole_extents is now a normal option
bcachefs: trivial open_bucket_add_buckets() cleanup
bcachefs: bch2_sb_nr_devices()
bcachefs: Drop memalloc_nofs_save() in bch2_btree_node_mem_alloc()
bcachefs: bch2_time_stats_reset()
bcachefs: Assert that we don't lock nodes when !trans->locked
bcachefs: darray: convert to alloc_hooks()
bcachefs: Switch gc bucket array to a genradix
bcachefs: Add pinned to btree cache not freed counters
bcachefs: do_encrypt() now handles allocation failures
bcachefs: convert __bch2_encrypt_bio() to darray
bcachefs: kill redundant is_vmalloc_addr()
bcachefs: fix prototype to bch2_alloc_sectors_start_trans()
bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation
bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS
bcachefs: Use __GFP_ACCOUNT for reclaimable memory
bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes
bcachefs: Options for recovery_passes, recovery_passes_exclude
bcachefs: Move tabstop setup to bch2_dev_usage_to_text()
bcachefs: bch2_dev_remove_alloc() -> alloc_background.c
bcachefs: bch2_sb_member_alloc()
bcachefs: improve "no device to read from" message
bcachefs: bch2_opts_to_text()
bcachefs: Progress indicator for extents_to_backpointers
bcachefs: bch2_dev_rcu_noerror()
bcachefs: Failed devices no longer require mounting in degraded mode
bcachefs: Don't count "skipped access bit" as touched in btree cache scan
bcachefs: btree cache counters should be size_t
bcachefs: split up btree cache counters for live, freeable
bcachefs: Rework btree node pinning
bcachefs: EIO errcode cleanup
bcachefs: stripe_to_mem()
bcachefs: bch_stripe.disk_label
bcachefs: ec_stripe_head.nr_created
bcachefs: improve bch2_new_stripe_to_text()
bcachefs: improve error message on too few devices for ec
bcachefs: improve error messages in bch2_ec_read_extent()
bcachefs: bch2_trigger_ptr() calculates sectors even when no device
bcachefs: bch2_dev_remove_stripes()
bcachefs: bch_fs.rw_devs_change_count
bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
bcachefs: Don't drop devices with stripe pointers
Matthew Wilcox (Oracle) (1):
bcachefs: Do not check folio_has_private()
Nathan Chancellor (1):
bcachefs: Fix format specifier in bch2_btree_key_cache_to_text()
Reed Riley (1):
bcachefs: Replace div_u64 with div64_u64 where second param is u64
Sasha Finkelstein (1):
bcachefs: Hook up RENAME_WHITEOUT in rename.
Thorsten Blum (3):
bcachefs: Annotate struct bucket_array with __counted_by()
bcachefs: Annotate struct bch_xattr with __counted_by()
bcachefs: Annotate bch_replicas_entry_{v0,v1} with __counted_by()
Xiaxi Shen (1):
bcachefs: Fix a spelling error in docs
Yang Li (1):
bcachefs: Remove duplicated include in backpointers.c
Youling Tang (4):
bcachefs: allocate inode by using alloc_inode_sb()
bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
bcachefs: drop unused posix acl handlers
bcachefs: Simplify bch2_xattr_emit() implementation
Documentation/filesystems/bcachefs/CodingStyle.rst | 2 +-
fs/bcachefs/Kconfig | 7 +
fs/bcachefs/Makefile | 1 +
fs/bcachefs/acl.c | 2 +-
fs/bcachefs/alloc_background.c | 45 +-
fs/bcachefs/alloc_background.h | 3 +-
fs/bcachefs/alloc_foreground.c | 59 +-
fs/bcachefs/alloc_foreground.h | 5 +-
fs/bcachefs/backpointers.c | 106 +++-
fs/bcachefs/backpointers.h | 23 +-
fs/bcachefs/bcachefs.h | 14 +-
fs/bcachefs/bcachefs_format.h | 2 +
fs/bcachefs/bset.c | 182 +++---
fs/bcachefs/bset.h | 4 +-
fs/bcachefs/btree_cache.c | 273 ++++++---
fs/bcachefs/btree_cache.h | 3 +
fs/bcachefs/btree_gc.c | 21 +-
fs/bcachefs/btree_io.c | 8 +-
fs/bcachefs/btree_io.h | 4 +-
fs/bcachefs/btree_iter.c | 63 +-
fs/bcachefs/btree_iter.h | 52 +-
fs/bcachefs/btree_key_cache.c | 405 +++----------
fs/bcachefs/btree_key_cache_types.h | 18 +-
fs/bcachefs/btree_locking.h | 13 +-
fs/bcachefs/btree_trans_commit.c | 2 +-
fs/bcachefs/btree_types.h | 60 +-
fs/bcachefs/btree_update.c | 12 +-
fs/bcachefs/btree_update_interior.c | 37 +-
fs/bcachefs/btree_update_interior.h | 2 +
fs/bcachefs/buckets.c | 35 +-
fs/bcachefs/buckets.h | 15 +-
fs/bcachefs/buckets_types.h | 8 -
fs/bcachefs/checksum.c | 101 ++--
fs/bcachefs/clock.h | 9 -
fs/bcachefs/darray.c | 4 +-
fs/bcachefs/darray.h | 26 +-
fs/bcachefs/data_update.c | 2 +-
fs/bcachefs/dirent.c | 66 +--
fs/bcachefs/ec.c | 303 +++++++---
fs/bcachefs/ec.h | 11 +-
fs/bcachefs/ec_format.h | 9 +-
fs/bcachefs/ec_types.h | 1 +
fs/bcachefs/errcode.h | 14 +-
fs/bcachefs/extents.c | 33 +-
fs/bcachefs/extents.h | 24 +-
fs/bcachefs/fs-common.c | 5 +-
fs/bcachefs/fs-io-buffered.c | 41 +-
fs/bcachefs/fs-io-direct.c | 2 +-
fs/bcachefs/fs-io-pagecache.c | 90 ++-
fs/bcachefs/fs-io-pagecache.h | 4 +-
fs/bcachefs/fs-io.c | 178 ++----
fs/bcachefs/fs-ioctl.c | 4 +-
fs/bcachefs/fs.c | 427 +++++++++-----
fs/bcachefs/fs.h | 18 +-
fs/bcachefs/inode.c | 2 +-
fs/bcachefs/io_read.c | 18 +-
fs/bcachefs/io_write.c | 7 +-
fs/bcachefs/journal_io.c | 6 +-
fs/bcachefs/journal_reclaim.c | 7 +-
fs/bcachefs/opts.c | 85 ++-
fs/bcachefs/opts.h | 61 +-
fs/bcachefs/rcu_pending.c | 650 +++++++++++++++++++++
fs/bcachefs/rcu_pending.h | 27 +
fs/bcachefs/rebalance.c | 3 +
fs/bcachefs/recovery.c | 22 +-
fs/bcachefs/recovery_passes.c | 10 +-
fs/bcachefs/replicas.c | 10 +-
fs/bcachefs/replicas_format.h | 9 +-
fs/bcachefs/sb-clean.c | 2 +-
fs/bcachefs/sb-members.c | 57 ++
fs/bcachefs/sb-members.h | 22 +-
fs/bcachefs/str_hash.h | 2 +-
fs/bcachefs/subvolume.h | 45 ++
fs/bcachefs/subvolume_types.h | 3 +-
fs/bcachefs/super-io.c | 12 +-
fs/bcachefs/super.c | 85 +--
fs/bcachefs/sysfs.c | 55 +-
fs/bcachefs/thread_with_file.c | 2 +-
fs/bcachefs/time_stats.c | 14 +
fs/bcachefs/time_stats.h | 3 +-
fs/bcachefs/trace.h | 465 ++++++++++++++-
fs/bcachefs/util.c | 16 +-
fs/bcachefs/util.h | 2 +-
fs/bcachefs/xattr.c | 81 +--
fs/bcachefs/xattr_format.h | 2 +-
fs/inode.c | 8 -
include/linux/fs.h | 9 +-
include/linux/generic-radix-tree.h | 105 +++-
lib/generic-radix-tree.c | 80 +--
89 files changed, 3155 insertions(+), 1690 deletions(-)
create mode 100644 fs/bcachefs/rcu_pending.c
create mode 100644 fs/bcachefs/rcu_pending.h
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Linus Torvalds @ 2024-09-23 17:18 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-fsdevel, linux-kernel

On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> We're now using an rhashtable instead of the system inode hash table;
> this is another significant performance improvement on multithreaded
> metadata workloads, eliminating more lock contention.

So I've pulled this, but I reacted to this issue - what is the load
where the inode hash table is actually causing issues?

Because honestly, the reason we're using that "one single lock" for
the inode hash table is that nobody has ever bothered.

In particular, it *should* be reasonably straightforward (lots of
small details, but not fundamentally hard) to just make the inode hash
table be a "bl" hash - with the locking done per-bucket, not using one
global lock.

But nobody has ever seemed to care before. Because the inode hash just
isn't all that heavily used, since normal loads don't look up inodes
directly (they look up dentries, and you find the inodes off those -
and the *dentry* hash scales well thanks to both RCU and doing the
bucket lock thing).

So the fact that bcachefs cares makes me go "Why?"

Normal filesystems *never* call ilookup() and friends. Why does
bcachefs do it so much that you decided that you need to use a whole
other hashtable?

               Linus
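For illustration, the per-bucket scheme Linus sketches can be modeled in
userspace C - bl_table, bl_insert, and friends are invented names for
this toy, not the kernel's hlist_bl API, but the structure is the same:
the lock lives in the bucket, so operations on different buckets never
contend:

```c
#include <pthread.h>
#include <stdlib.h>

#define NR_BUCKETS 256

struct bl_node {
	unsigned long ino;
	struct bl_node *next;
};

struct bl_bucket {
	pthread_mutex_t lock;	/* per-bucket, not one global lock */
	struct bl_node *head;
};

struct bl_table {
	struct bl_bucket buckets[NR_BUCKETS];
};

static struct bl_bucket *bl_hash(struct bl_table *t, unsigned long ino)
{
	return &t->buckets[ino % NR_BUCKETS];
}

void bl_init(struct bl_table *t)
{
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_init(&t->buckets[i].lock, NULL);
		t->buckets[i].head = NULL;
	}
}

void bl_insert(struct bl_table *t, unsigned long ino)
{
	struct bl_bucket *b = bl_hash(t, ino);
	struct bl_node *n = malloc(sizeof(*n));

	n->ino = ino;
	pthread_mutex_lock(&b->lock);	/* contends only within this bucket */
	n->next = b->head;
	b->head = n;
	pthread_mutex_unlock(&b->lock);
}

int bl_lookup(struct bl_table *t, unsigned long ino)
{
	struct bl_bucket *b = bl_hash(t, ino);
	int found = 0;

	pthread_mutex_lock(&b->lock);
	for (struct bl_node *n = b->head; n; n = n->next) {
		if (n->ino == ino) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&b->lock);
	return found;
}
```

The kernel's hlist_bl goes further by folding the lock into the low bit
of the list-head pointer to keep the bucket one word wide; this sketch
uses a plain mutex per bucket to keep the idea visible.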
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Linus Torvalds @ 2024-09-23 19:07 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-fsdevel, linux-kernel

On Mon, 23 Sept 2024 at 10:18, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > We're now using an rhashtable instead of the system inode hash table;
> > this is another significant performance improvement on multithreaded
> > metadata workloads, eliminating more lock contention.
>
> So I've pulled this, but I reacted to this issue - what is the load
> where the inode hash table is actually causing issues?

Oh, and I also only now noticed that a third of the commits are fairly
recent and from after the merge window even started.

Maybe there's a good reason for that, but it sure wasn't explained in
the pull request. Grr.

               Linus
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Kent Overstreet @ 2024-09-23 19:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Andrew Morton

On Mon, Sep 23, 2024 at 12:07:24PM GMT, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 10:18, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I've pulled this, but I reacted to this issue - what is the load
> > where the inode hash table is actually causing issues?
>
> Oh, and I also only now noticed that a third of the commits are fairly
> recent and from after the merge window even started.
>
> Maybe there's a good reason for that, but it sure wasn't explained in
> the pull request. Grr.

I rebased to drop a patch that conflicts with Andrew's tree (the
PF_MEMALLOC_NORECLAIM stuff, which I would still like to chat with him
about since I missed him at Plumbers...)
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Kent Overstreet @ 2024-09-23 19:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > We're now using an rhashtable instead of the system inode hash table;
> > this is another significant performance improvement on multithreaded
> > metadata workloads, eliminating more lock contention.
>
> So I've pulled this, but I reacted to this issue - what is the load
> where the inode hash table is actually causing issues?
>
> Because honestly, the reason we're using that "one single lock" for
> the inode hash table is that nobody has ever bothered.
>
> In particular, it *should* be reasonably straightforward (lots of
> small details, but not fundamentally hard) to just make the inode hash
> table be a "bl" hash - with the locking done per-bucket, not using one
> global lock.

Dave was working on that - he posted patches and it seemed like they
were mostly done, not sure what happened with those.

> But nobody has ever seemed to care before. Because the inode hash just
> isn't all that heavily used, since normal loads don't look up inodes
> directly (they look up dentries, and you find the inodes off those -
> and the *dentry* hash scales well thanks to both RCU and doing the
> bucket lock thing).
>
> So the fact that bcachefs cares makes me go "Why?"

I've seen benchmarks (not just on bcachefs, ext4 as well) where it was
actually _really_ heavy. They were synthetics, but I'd expect to see
the same effect from workloads where we're streaming a lot of inodes
through the cache - think rsync.

Another data point is that the btree key cache work I just did was also
for exactly this type of situation - it addressed lock contention when
streaming items into and out of the key cache, which are mostly inodes
(it's also used for the alloc btree) - and I had reports from multiple
users it was a drastic improvement, much more than expected, and
eliminated the main source of our "srcu lock held too long" warnings -
and if lock contention is bad enough to trip a 10 second warning, you
know it's bad, heh.

I try to really jump on these sorts of lock contention issues because
I've seen many instances where they caused performance cliffs - in
practice they're often O(n^2) issues, because as your workload scales
up it's usually not just the number of instances where you take the
lock that increases; usually something else is scaling too - even just
increased cacheline bouncing from the lock contention itself will be
problematic.
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Dave Chinner @ 2024-09-24 0:26 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel,
	Dave Chinner

On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > We're now using an rhashtable instead of the system inode hash table;
> > > this is another significant performance improvement on multithreaded
> > > metadata workloads, eliminating more lock contention.
> >
> > So I've pulled this, but I reacted to this issue - what is the load
> > where the inode hash table is actually causing issues?
> >
> > Because honestly, the reason we're using that "one single lock" for
> > the inode hash table is that nobody has ever bothered.
> >
> > In particular, it *should* be reasonably straightforward (lots of
> > small details, but not fundamentally hard) to just make the inode hash
> > table be a "bl" hash - with the locking done per-bucket, not using one
> > global lock.
>
> Dave was working on that - he posted patches and it seemed like they
> were mostly done, not sure what happened with those.

I lost interest in that patchset a long while ago. The last posting I
did was largely as a community service to get the completed, lockdep
and RTPREEMPT compatible version of the hashbl infrastructure it
needed out there for other people to be able to easily push this
forward if they needed it.

That was here:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/

and I posted it because Kent was asking about it because his attempts
to convert the inode hash to use rhashtables weren't working.

I've suggested multiple times since then when people have asked me
about that patch set that they are free to pick it up and finish it
off themselves. Unfortunately, that usually results in silence, a
comment of "I don't have time for that", and/or they go off and hack
around the issue in some other way....

> > But nobody has ever seemed to care before. Because the inode hash just
> > isn't all that heavily used, since normal loads don't look up inodes
> > directly (they look up dentries, and you find the inodes off those -
> > and the *dentry* hash scales well thanks to both RCU and doing the
> > bucket lock thing).

Not entirely true. Yes, the dentry cache protects the inode hash from
contention when the workload repeatedly hits cached dentries.

However, the problematic workload is cold cache operations where the
dentry cache repeatedly misses. This places all the operational
concurrency directly on the inode hash as new inodes are inserted into
the hash. Add memory reclaim and that adds contention as it removes
inodes from the hash on eviction.

This happens in workloads that cycle lots of inodes through memory.
Concurrent cold cache lookups, creates and unlinks hammer the global
filesystem inode hash lock at more than 4 active tasks, especially
when there is also concurrent memory reclaim running evicting inodes
from the inode hash.

This inode cache miss contention issue was sort-of hacked around
recently by commit 7180f8d91fcb ("vfs: add rcu-based find_inode
variants for iget ops"). This allowed filesystems to use -partially-
RCU-based lookups rather than fully locked lookups. In the case of a
hit, it will be an RCU operation, but as you state, cache hits on the
inode hash are the rare case, not the common case.

I say "sort-of hacked around" because the RCU lookup only avoids the
inode hash lock on the initial lookup that misses. The insert is still
done under the inode_hash_lock and so inode_hash_lock contention still
definitely exists in this path.

All the above commit did was move the catastrophic cacheline
contention breakdown point to be higher than the test workload that
was being run could hit. In review of that commit, I suggested that
the inode hash_bl conversion patches be finished off instead, but the
response was "I don't have time for that".

And so here we are again....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Kent Overstreet @ 2024-09-24 1:55 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel,
	Dave Chinner

On Tue, Sep 24, 2024 at 10:26:56AM GMT, Dave Chinner wrote:
> On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote:
> > On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> > > In particular, it *should* be reasonably straightforward (lots of
> > > small details, but not fundamentally hard) to just make the inode hash
> > > table be a "bl" hash - with the locking done per-bucket, not using one
> > > global lock.
> >
> > Dave was working on that - he posted patches and it seemed like they
> > were mostly done, not sure what happened with those.
>
> I lost interest in that patchset a long while ago. The last posting
> I did was largely as a community service to get the completed,
> lockdep and RTPREEMPT compatible version of the hashbl
> infrastructure it needed out there for other people to be able to
> easily push this forward if they needed it.
>
> That was here:
>
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/
>
> and I posted it because Kent was asking about it because his
> attempts to convert the inode hash to use rhashtables weren't
> working.
>
> I've suggested multiple times since then when people have asked me
> about that patch set that they are free to pick it up and finish it
> off themselves. Unfortunately, that usually results in silence, a
> comment of "I don't have time for that", and/or they go off and hack
> around the issue in some other way....

Looking over that thread I'm not clear on what was missing - you
mentioned RTPREEMPT/lockdep but you had a patch for that, which looked
serviceable enough?
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Linus Torvalds @ 2024-09-24 2:26 UTC (permalink / raw)
To: Dave Chinner
Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel,
	Dave Chinner

On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote:
>
> However, the problematic workload is cold cache operations where
> the dentry cache repeatedly misses. This places all the operational
> concurrency directly on the inode hash as new inodes are inserted
> into the hash. Add memory reclaim and that adds contention as it
> removes inodes from the hash on eviction.

Yeah, and then we spend all the time just adding the inodes to the
hashes, and probably fairly seldom use them. Oh well.

And I had missed the issue with PREEMPT_RT and the fact that right now
the inode hash lock is outside the inode lock, which is problematic.

So it's all a bit nasty.

But I also assume most of the bad issues end up mainly showing up on
just fairly synthetic benchmarks with ramdisks, because even with a
good SSD I suspect the IO for the cold cache would still dominate?

               Linus
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Linus Torvalds @ 2024-09-24 2:48 UTC (permalink / raw)
To: Dave Chinner
Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel,
	Dave Chinner

On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.

.. although I guess that could be solved by just making the spinlock
be external to the hash list, rather than part of the list like
list_bh.

You wouldn't need to do one lock per list head, you could just do some
grouping of the hash.

That said, the vfs inode hash code really is pretty disgusting and has
been accumulating these warts for a long time. So maybe a "filesystems
do their own thing" together with some helper infrastructure (like
using rhashtable) is actually the right thing to do.

               Linus
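The "grouping" idea - fewer locks than list heads, with each bucket's
lock derived from the same hash - could be sketched like this (all names
here are invented for illustration, not kernel API):

```c
#include <pthread.h>

#define NR_BUCKETS	1024
#define NR_LOCKS	64	/* one lock shared by 16 buckets */

/* External lock array: the locks live outside the hash lists, so each
 * list head stays a single pointer wide. */
static pthread_mutex_t stripe_locks[NR_LOCKS] = {
	[0 ... NR_LOCKS - 1] = PTHREAD_MUTEX_INITIALIZER,
};

static unsigned int stripe_bucket(unsigned long hash)
{
	return hash % NR_BUCKETS;
}

/* A bucket's lock is a pure function of its index, so every path that
 * touches a given bucket agrees on which lock protects it. */
static pthread_mutex_t *stripe_lock(unsigned long hash)
{
	return &stripe_locks[stripe_bucket(hash) % NR_LOCKS];
}
```

A caller would take *stripe_lock(hash) around any walk of bucket
stripe_bucket(hash); letting 16 buckets share each lock trades a little
contention for a much smaller lock array than one lock per list head.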
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
From: Dave Chinner @ 2024-09-24 3:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel,
	Dave Chinner

On Mon, Sep 23, 2024 at 07:48:03PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
>
> .. although I guess that could be solved by just making the spinlock
> be external to the hash list, rather than part of the list like
> list_bh.

That's effectively what the patch did - it added a spinlock per hash
list.

> You wouldn't need to do one lock per list head, you could just do some
> grouping of the hash.

Yes, that's a more complex thing to do, though - it is effectively
using hashed locks for the hash table.

> That said, the vfs inode hash code really is pretty disgusting and has
> been accumulating these warts for a long time. So maybe a "filesystems
> do their own thing" together with some helper infrastructure (like
> using rhashtable) is actually the right thing to do.

Perhaps. XFS has done its own thing since 2007, but that was because
the XFS inode lifecycle has always existed outside the VFS inode
lifecycle. i.e. the inode has to outlive evict() and ->destroy_inode
because it can still be dirty and tracked by the journal when the VFS
evicts it from the cache. We also have internal inode lookup stuff
that we don't want to go anywhere near the VFS (e.g. inode writeback
clustering from journal item cleaning callbacks).

But this has come at a significant complexity cost, and some of the
biggest problems we've had to solve have been a result of this
architecture. e.g. memory reclaim getting blocked on dirty XFS inodes
that the VFS is unaware of via the fs specific callout in the
superblock shrinker caused all sorts of long tail memory allocation
latency issues for years. We've fixed that, but it took a long time to
work out how to avoid blocking in memory reclaim without driving the
system immediately into OOM kill frenzies....

There are also issues around RCU freeing of inodes and how the
filesystem level cache handles that. e.g. what happens when the
filesystem doesn't immediately free the inode after memory reclaim
evicts it, but then the fs re-instantiates it moments later when a new
lookup for that inode comes in? The rest of the VFS assumes that the
inode won't reappear until a RCU grace period expires, but having the
re-instantiation code call synchronize_rcu() (or something similar) is
not an option because this sort of situation can occur thousands of
times a second under memory pressure.... We know this because there is
an outstanding issue in XFS around this situation - none of the
general solutions to the problem we've tried have been viable.

I suspect that any other filesystem that implements its own inode
cache and does VFS inode re-instantiation on an inode cache hit will
have similar issues, and so there will be wider VFS architecture
impacts as a result.

IOWs, it's not clear to me that this is a caching model we really want
to pursue in general because of a) the code duplication and b) the
issues such an inode life-cycle model has interacting with the VFS
life-cycle expectations...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-24 3:55 ` Dave Chinner @ 2024-09-24 16:57 ` Linus Torvalds 2024-09-24 17:27 ` Kent Overstreet 2024-09-25 0:17 ` Dave Chinner 0 siblings, 2 replies; 28+ messages in thread From: Linus Torvalds @ 2024-09-24 16:57 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > That's effectively what the patch did - it added a spinlock per hash > list. Yeah, no, I like your patches, they all seem to be doing sane things and improve the code. The part I don't like is how the basic model that your patches improve seems quite a bit broken. For example, that whole superblock list lock contention just makes me go "Christ, that's stupid". Not because your patches to fix it with Waiman's fancy list would be wrong or bad, but because the whole pain is self-inflicted garbage. And it's all historically very very understandable. It wasn't broken when it was written. That singly linked list is 100% sane in the historical context of "we really don't want anything fancy for this". The whole list of inodes for a superblock is basically unimportant, and its main use (only use?) is for the final "check that we don't have busy inodes at umount time, remove any pending stuff". So using a stupid linked list was absolutely the right thing to do, but now that just the *locking* for that list is a pain, it turns out that we probably shouldn't use a list at all. Not because the list was wrong, but because a flat list is such a pain for locking, since there's no hierarchy to spread the locking out over. (We used to have that kind of "flat lock" for the dcache too, but "dcache_lock" went away many moons ago, and good riddance - but the only reason it could go away is that the dcache has a hierarchy, so that you can always lock just the local dentry (or the parent dentry) if you are just careful). 
> [ filesystems doing their own optimized thing ] > > IOWs, it's not clear to me that this is a caching model we really > want to persue in general because of a) the code duplication and b) > the issues such an inode life-cycle model has interacting with the > VFS life-cycle expectations... No, I'm sure you are right, and I'm just frustrated with this code that probably _should_ look a lot more like the dcache code, but doesn't. I get the very strong feeling that we should have a per-superblock hash table that we could also traverse the entries of. That way the superblock inode list would get some structure (the hash buckets) that would allow the locking to be distributed (and we'd only need one lock and just share it between the "hash inode" and "add it to the superblock list"). But that would require something much more involved than "improve the current code". Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-24 16:57 ` Linus Torvalds @ 2024-09-24 17:27 ` Kent Overstreet 2024-09-25 0:17 ` Dave Chinner 1 sibling, 0 replies; 28+ messages in thread From: Kent Overstreet @ 2024-09-24 17:27 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Tue, Sep 24, 2024 at 09:57:13AM GMT, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > > > That's effectively what the patch did - it added a spinlock per hash > > list. > > Yeah, no, I like your patches, they all seem to be doing sane things > and improve the code. > > The part I don't like is how the basic model that your patches improve > seems quite a bit broken. > > For example, that whole superblock list lock contention just makes me > go "Christ, that's stupid". Not because your patches to fix it with > Waiman's fancy list would be wrong or bad, but because the whole pain > is self-inflicted garbage. > > And it's all historically very very understandable. It wasn't broken > when it was written. > > That singly linked list is 100% sane in the historical context of "we > really don't want anything fancy for this". The whole list of inodes > for a superblock is basically unimportant, and it's main use (only > use?) is for the final "check that we don't have busy inodes at umount > time, remove any pending stuff". > > So using a stupid linked list was absolutely the right thing to do, > but now that just the *locking* for that list is a pain, it turns out > that we probably shouldn't use a list at all. Not because the list was > wrong, but because a flat list is such a pain for locking, since > there's no hierarchy to spread the locking out over. Well, lists are just a pain. They should really be radix trees, we're not doing actual LRU anyways. 
> > [ filesystems doing their own optimized thing ] > > > > IOWs, it's not clear to me that this is a caching model we really > > want to persue in general because of a) the code duplication and b) > > the issues such an inode life-cycle model has interacting with the > > VFS life-cycle expectations... > > No, I'm sure you are right, and I'm just frustrated with this code > that probably _should_ look a lot more like the dcache code, but > doesn't. > > I get the very strong feeling that we should have a per-superblock > hash table that we could also traverse the entries of. That way the > superblock inode list would get some structure (the hash buckets) that > would allow the locking to be distributed (and we'd only need one lock > and just share it between the "hash inode" and "add it to the > superblock list"). I was considering that - that's what the bcachefs btree key cache does, the shrinker just iterates over the rhashtable. But referenced inodes aren't kept on the shrinker lists; I hate the overhead of adding/removing, but Dave was emphatic on there being workloads where that was really necessary and I'm somewhat convinced. The other big issue is memcg; I think for that to work we'd need reclaim to coalesce shrinker calls and shrink from multiple memcgs at the same time. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-24 16:57 ` Linus Torvalds 2024-09-24 17:27 ` Kent Overstreet @ 2024-09-25 0:17 ` Dave Chinner 2024-09-25 1:45 ` Linus Torvalds 2024-09-25 2:48 ` Kent Overstreet 1 sibling, 2 replies; 28+ messages in thread From: Dave Chinner @ 2024-09-25 0:17 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > > > That's effectively what the patch did - it added a spinlock per hash > > list. > > Yeah, no, I like your patches, they all seem to be doing sane things > and improve the code. > > The part I don't like is how the basic model that your patches improve > seems quite a bit broken. > > For example, that whole superblock list lock contention just makes me > go "Christ, that's stupid". Not because your patches to fix it with > Waiman's fancy list would be wrong or bad, but because the whole pain > is self-inflicted garbage. I'm not about to disagree with that assessment. > And it's all historically very very understandable. It wasn't broken > when it was written. > > That singly linked list is 100% sane in the historical context of "we > really don't want anything fancy for this". The whole list of inodes > for a superblock is basically unimportant, and it's main use (only > use?) is for the final "check that we don't have busy inodes at umount > time, remove any pending stuff". > > So using a stupid linked list was absolutely the right thing to do, > but now that just the *locking* for that list is a pain, it turns out > that we probably shouldn't use a list at all. Not because the list was > wrong, but because a flat list is such a pain for locking, since > there's no hierarchy to spread the locking out over. 
Right, that's effectively what the dlist infrastructure has taught us - we need some kind of hierarchy to spread the locking over. But what that hierarchy looks like for an "iterate every object" list isn't really clear cut... Another option I'd considered was to offload the iteration to filesystems that have internal tracking mechanisms. e.g. use a superblock method (say ->for_all_cached_inodes()) that gets passed a callback function for the operation to perform on a stabilised inode. This would work for XFS - we already have such an internal callback based cache-walking infrastructure - xfs_iwalk() - that is used extensively for internal admin, gc and scrub/repair functions. If we could use this for VFS icache traversals, then we wouldn't need to maintain the sb->s_inodes list at all in XFS. But I didn't go down that route because I didn't think we wanted to encourage each major filesystem to have its own unique internal inode caching implementation with its own special subtle differences. The XFS icache code is pretty complex and really requires a lot of XFS expertise to understand - that makes changing global inode caching behaviour or life cycle semantics much more difficult than it already is. That said, if we do decide that a model where filesystems will provide their own inode caches is acceptable, then as a first step we could convert the generic s_inodes list iteration to the callback model fairly easily.... > (We used to have that kind of "flat lock" for the dcache too, but > "dcache_lock" went away many moons ago, and good riddance - but the > only reason it could go away is that the dcache has a hierarchy, so > that you can always lock just the local dentry (or the parent dentry) > if you are just careful). 
> > > [ filesystems doing their own optimized thing ] > > > > IOWs, it's not clear to me that this is a caching model we really > > want to persue in general because of a) the code duplication and b) > > the issues such an inode life-cycle model has interacting with the > > VFS life-cycle expectations... > > No, I'm sure you are right, and I'm just frustrated with this code > that probably _should_ look a lot more like the dcache code, but > doesn't. > > I get the very strong feeling that we should have a per-superblock > hash table that we could also traverse the entries of. That way the > superblock inode list would get some structure (the hash buckets) that > would allow the locking to be distributed (and we'd only need one lock > and just share it between the "hash inode" and "add it to the > superblock list"). The only problem with this is the size of the per-sb hash tables needed for scalability - we can't allocate system-sized hash tables for every superblock just in case a superblock might be asked to cache 100 million inodes. That's why Kent used rhashtables for the bcachefs implementation - they resize according to how many objects are being indexed, and hence scale both up and down. That is also why XFS uses multiple radix trees per-sb in its icache implementation - they scale up efficiently, yet have a small memory footprint when only a few inodes are in cache in a little-used filesystem. > But that would require something much more involved than "improve the > current code". Yup. FWIW, I think all this "how do we cache inodes better" discussion is somewhat glossing over a more important question we need to think about first: do we even need a fully fledged inode cache anymore? Every inode brought into cache is pinned by a dentry. The dentry cache has an LRU and cache aging, and so by the time a dentry is aged out of the dentry cache and the inode is finally dropped, it has not been in use for some time. It has aged out of the current working set. 
Given that we've already aged the inode out of the working set, why do we then need to dump the inode into another LRU and age it out again before we reclaim the inode? What is this double caching actually gaining us? I've done experiments on XFS marking all inodes with I_DONT_CACHE, which means an inode gets removed from the inode cache the moment the reference count goes to zero (i.e. when the dentry is dropped from cache). This leaves the inode lifecycle mirroring the dentry lifecycle. i.e. the inode is evicted when the dentry is aged out as per normal. On SSD based systems, I really don't see any performance degradation for my typical workstation workloads like git tree operations and kernel compiles. I don't see any noticeable impact on streaming inode/data workloads, either. IOWs, the dentry cache appears to be handling the working set maintenance duties pretty well for most common workloads. Hence I question whether LRU based inode cache aging is needed at all. So: should we be looking towards gutting the inode cache so that the in-memory VFS inode lifecycle only tracks actively referenced inodes? If so, then the management of the VFS inodes becomes a lot simpler as the LRU lock, maintenance and shrinker-based reclaim go away entirely. Lots of code gets simpler if we trim down the VFS inode lifecycle to remove the caching of unreferenced inodes... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-25 0:17 ` Dave Chinner @ 2024-09-25 1:45 ` Linus Torvalds 2024-09-25 11:41 ` Christian Brauner 2024-09-25 2:48 ` Kent Overstreet 1 sibling, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2024-09-25 1:45 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Tue, 24 Sept 2024 at 17:17, Dave Chinner <david@fromorbit.com> wrote: > > FWIW, I think all this "how do we cache inodes better" discussion is > somehwat glossing over a more important question we need to think > about first: do we even need a fully fledged inode cache anymore? I'd be more than happy to try. I have to admit that it's largely my mental picture (ie "inode caches don't really matter"), and why I was surprised that the inode locks even show up on benchmarks. If we just always drop inodes when the refcount goes to zero and never have any cached inodes outside of other references, I think most of the reasons for the superblock inode list just disappear. Most of the heavy lifting has long since been moved to the dentry shrinking. I'd worry that we have a lot of tuning that depends on the whole inode LRU thing (that iirc takes the mapping size etc into account). But maybe it would be worth trying to make I_DONTCACHE the default, with the aim of removing the independent inode lifetimes entirely... Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-25 1:45 ` Linus Torvalds @ 2024-09-25 11:41 ` Christian Brauner 0 siblings, 0 replies; 28+ messages in thread From: Christian Brauner @ 2024-09-25 11:41 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Tue, Sep 24, 2024 at 06:45:55PM GMT, Linus Torvalds wrote: > On Tue, 24 Sept 2024 at 17:17, Dave Chinner <david@fromorbit.com> wrote: > > > > FWIW, I think all this "how do we cache inodes better" discussion is > > somehwat glossing over a more important question we need to think > > about first: do we even need a fully fledged inode cache anymore? > > I'd be more than happy to try. I have to admit that it's largely my > mental picture (ie "inode caches don't really matter"), and why I was > surprised that the inode locks even show up on benchmarks. > > If we just always drop inodes when the refcount goes to zero and never > have any cached inodes outside of other references, I think most of > the reasons for the superblock inode list just disappear. Most of the > heavy lifting has long since been moved to the dentry shrinking. > > I'd worry that we have a lot of tuning that depends on the whole inode > LRU thing (that iirc takes the mapping size etc into account). > > But maybe it would be worth trying to make I_DONTCACHE the default, > with the aim of removing the independent inode lifetimes entirely... I would be rather surprised if we didn't see regressions for some workloads but who knows. If we wanted to test this then one easy way would be to add a FS_INODE_NOCACHE fs_flag and have inode_init_always() raise I_DONTCACHE based on that. That way we can experiment with this on a per-fs basis. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-25 0:17 ` Dave Chinner 2024-09-25 1:45 ` Linus Torvalds @ 2024-09-25 2:48 ` Kent Overstreet 2024-09-27 0:48 ` Herbert Xu 1 sibling, 1 reply; 28+ messages in thread From: Kent Overstreet @ 2024-09-25 2:48 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner, Thomas Graf, Herbert Xu, netdev +cc rhashtable folks for bug report On Wed, Sep 25, 2024 at 10:17:46AM GMT, Dave Chinner wrote: > On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote: > > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > > > > > That's effectively what the patch did - it added a spinlock per hash > > > list. > > > > Yeah, no, I like your patches, they all seem to be doing sane things > > and improve the code. > > > > The part I don't like is how the basic model that your patches improve > > seems quite a bit broken. > > > > For example, that whole superblock list lock contention just makes me > > go "Christ, that's stupid". Not because your patches to fix it with > > Waiman's fancy list would be wrong or bad, but because the whole pain > > is self-inflicted garbage. > > I'm not about to disagree with that assessment. > > > And it's all historically very very understandable. It wasn't broken > > when it was written. > > > > That singly linked list is 100% sane in the historical context of "we > > really don't want anything fancy for this". The whole list of inodes > > for a superblock is basically unimportant, and it's main use (only > > use?) is for the final "check that we don't have busy inodes at umount > > time, remove any pending stuff". > > > > So using a stupid linked list was absolutely the right thing to do, > > but now that just the *locking* for that list is a pain, it turns out > > that we probably shouldn't use a list at all. 
Not because the list was > > wrong, but because a flat list is such a pain for locking, since > > there's no hierarchy to spread the locking out over. > > Right, that's effectively what the dlist infrastructure has taught > us - we need some kind of heirarchy to spread the locking over. But > what that heirachy is for a "iterate every object" list looks like > isn't really clear cut... If you think of dlist as just "faster lru list", I think that's a completely viable approach. So one thing I've been mulling over is a data structure for efficiently tracking async op structs - "some async op went out to lunch" is something that we still have crap debug tooling for, and those bugs take forever to get fixed. There seems to be an outstanding bug in compaction or migration that I've seen multiple reports for, and I doubt it's bcachefs (we're quite standard there) - and it's some async op that should've released a folio lock but isn't. And all we need for that is, roughly, an idr with a percpu buffer for allocating/freeing slots in front. Which would be perfect for shrinker lists, too: walking the list is a lockless radix tree walk, and adding/removing entries mostly just hits the percpu buffers. So now we have another use case for that data structure, so I'll try to get that done in the near future and see how it does. > > (We used to have that kind of "flat lock" for the dcache too, but > > "dcache_lock" went away many moons ago, and good riddance - but the > > only reason it could go away is that the dcache has a hierarchy, so > > that you can always lock just the local dentry (or the parent dentry) > > if you are just careful). > > > > > > [ filesystems doing their own optimized thing ] > > > > > > IOWs, it's not clear to me that this is a caching model we really > > > want to persue in general because of a) the code duplication and b) > > > the issues such an inode life-cycle model has interacting with the > > > VFS life-cycle expectations... 
> > No, I'm sure you are right, and I'm just frustrated with this code > > that probably _should_ look a lot more like the dcache code, but > > doesn't. > > > > I get the very strong feeling that we should have a per-superblock > > hash table that we could also traverse the entries of. That way the > > superblock inode list would get some structure (the hash buckets) that > > would allow the locking to be distributed (and we'd only need one lock > > and just share it between the "hash inode" and "add it to the > > superblock list"). > > The only problem with this is the size of the per-sb hash tables > needed for scalability - we can't allocate system sized hash tables > for every superblock just in case a superblock might be asked to > cache 100 million inodes. That's why Kent used rhashtables for the > bcachefs implementation - they resize according to how many objects > are being indexed, and hence scale both up and down. I've been noticing rhashtable resize is surprisingly heavy (the default parameters don't ever shrink the table, which is why it's not seen as much). And, when I was torture testing that code I tripped over what appeared to be an infinite loop in rht_bucket() when a rehash is in progress, which I worked around in a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() (because it makes more sense to just not run while rehashing and avoid the function call overhead of rht_bucket()). If rhashtable folks are interested in looking at that I have a test that reproduces it. > So: should we be looking towards gutting the inode cache and so the > in-memory VFS inode lifecycle tracks actively referenced inodes? If > so, then the management of the VFS inodes becomes a lot simpler as > the LRU lock, maintenance and shrinker-based reclaim goes away > entirely. Lots of code gets simpler if we trim down the VFS inode > life cycle to remove the caching of unreferenced inodes... 
Sounds completely legit to me - except for the weird filesystems. I just started cscoping and afs at least seems to do iget() without ever attaching it to a dentry, I didn't look further. So... If fs/inode.c is grotty, perhaps it makes sense to start a fs/inode_rhashtable.c with all the new semantics and convert filesystems one by one, like we've been doing with the dio code and iomap. I finally have my head sufficiently around all the weird inode lifetime rules that I could do it, although it'll be a bit before I'm not swamped. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-25 2:48 ` Kent Overstreet @ 2024-09-27 0:48 ` Herbert Xu 2024-09-28 0:11 ` Kent Overstreet 0 siblings, 1 reply; 28+ messages in thread From: Herbert Xu @ 2024-09-27 0:48 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner, Thomas Graf, netdev On Tue, Sep 24, 2024 at 10:48:07PM -0400, Kent Overstreet wrote: > > I've been noticing rhashtable resize is surprisingly heavy (the default > parameters don't ever shrink the table, which is why it's not seen as > much). Most rhashtable users enable automatic shrinking. > And, when I was torture testing that code I tripped over what appeared > to be an infinite loop in rht_bucket() when a rehsah is in progress, > which I worked around in > > a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() You must not walk the rhashtable internal data structures by hand. If you need to iterate through the whole table, use the provided walking mechanism. However, as rhashtable is dynamic, walking it during a resize may cause you to see some elements twice. If you want to avoid that, set up your own linked list of all elements for a completely stable walk. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-27 0:48 ` Herbert Xu @ 2024-09-28 0:11 ` Kent Overstreet 2024-09-28 0:47 ` Herbert Xu 0 siblings, 1 reply; 28+ messages in thread From: Kent Overstreet @ 2024-09-28 0:11 UTC (permalink / raw) To: Herbert Xu Cc: Dave Chinner, Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner, Thomas Graf, netdev On Fri, Sep 27, 2024 at 08:48:14AM GMT, Herbert Xu wrote: > On Tue, Sep 24, 2024 at 10:48:07PM -0400, Kent Overstreet wrote: > > > > I've been noticing rhashtable resize is surprisingly heavy (the default > > parameters don't ever shrink the table, which is why it's not seen as > > much). > > Most rhashtable users enable automatic shrinking. > > > And, when I was torture testing that code I tripped over what appeared > > to be an infinite loop in rht_bucket() when a rehsah is in progress, > > which I worked around in > > > > a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() > > You must not walk the rhashtable internal data structures by hand. > If you need to iterate through the whole table, use the provided > walking mechanism. I was just doing rht_for_each_entry_rcu() (open coded because you don't provide a safe version). > However, as rhashtable is dynamic, walking it during a resize may > cause you to see some elements twice. If you want to avoid that, > set up your own linked list of all elements for a completely stable > walk. Yeah that's fine, but like I said, I was seeing an infinite loop inside rht_bucket(). That would be an rhashtable bug... ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-28 0:11 ` Kent Overstreet @ 2024-09-28 0:47 ` Herbert Xu 0 siblings, 0 replies; 28+ messages in thread From: Herbert Xu @ 2024-09-28 0:47 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner, Thomas Graf, netdev On Fri, Sep 27, 2024 at 08:11:06PM -0400, Kent Overstreet wrote: > > > > And, when I was torture testing that code I tripped over what appeared > > > to be an infinite loop in rht_bucket() when a rehsah is in progress, > > > which I worked around in > > > > > > a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() > > > > You must not walk the rhashtable internal data structures by hand. > > If you need to iterate through the whole table, use the provided > > walking mechanism. > > I was just doing rht_for_each_entry_rcu() (open coded because you don't > provide a safe version). I'm talking about the commit a592cd above. You were doing this: for (i = 0; i < tbl->size; i++) rht_for_each_entry_rcu(ck, pos, tbl, i, hash) { This is absolutely not allowed. The rhashtable must only be accessed through the published API and not iterated over directly. This is guaranteed to break during a resize/rehash. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
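For reference, the provided walking mechanism Herbert refers to is the rhashtable_walk_* API. A typical caller looks roughly like the following kernel-code sketch; it is not a standalone compilable program, and `struct obj` plus the exact error-handling shape are illustrative rather than copied from any real caller, so check the rhashtable kernel-doc before relying on the details.

```c
struct rhashtable_iter iter;
struct obj *obj;

rhashtable_walk_enter(&ht, &iter);
rhashtable_walk_start(&iter);		/* takes the RCU read lock */

while ((obj = rhashtable_walk_next(&iter)) != NULL) {
	if (IS_ERR(obj)) {
		/* -EAGAIN: a resize/rehash happened under us; the walk
		 * is still valid, but elements may be seen twice, as
		 * noted above. */
		if (PTR_ERR(obj) == -EAGAIN)
			continue;
		break;
	}
	/* ... operate on obj; must not sleep while started ... */
}

rhashtable_walk_stop(&iter);		/* drops the RCU read lock */
rhashtable_walk_exit(&iter);
```

A walker may sleep between a walk_stop/walk_start pair, which is what makes this pattern usable from contexts like shrinkers; the ERR_PTR(-EAGAIN) case is exactly the resize window where open-coded rht_bucket()-style traversal gets into trouble.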
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-24 2:26 ` Linus Torvalds 2024-09-24 2:48 ` Linus Torvalds @ 2024-09-24 2:55 ` Kent Overstreet 2024-09-24 3:34 ` Dave Chinner 2024-09-24 3:04 ` Dave Chinner 2 siblings, 1 reply; 28+ messages in thread From: Kent Overstreet @ 2024-09-24 2:55 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote: > > > > However, the problematic workload is cold cache operations where > > the dentry cache repeatedly misses. This places all the operational > > concurrency directly on the inode hash as new inodes are inserted > > into the hash. Add memory reclaim and that adds contention as it > > removes inodes from the hash on eviction. > > Yeah, and then we spend all the time just adding the inodes to the > hashes, and probably fairly seldom use them. Oh well. > > And I had missed the issue with PREEMPT_RT and the fact that right now > the inode hash lock is outside the inode lock, which is problematic. > > So it's all a bit nasty. > > But I also assume most of the bad issues end up mainly showing up on > just fairly synthetic benchmarks with ramdisks, because even with a > good SSD I suspect the IO for the cold cache would still dominate? Not for bcachefs, because filling into the vfs inode cache doesn't require a disk read - they're cached in the inodes btree and much smaller there. We use a varint encoding so they're typically 50-100 bytes, last I checked. Compare to ~1k, plus or minus, in the vfs inode cache. Thomas Bertshinger has been working on applications at LANL where avoiding pulling into the vfs inode cache seems to make a significant difference (file indexing in his case) - it turns out there's an xattr syscall that's missing, which I believe he'll be submitting a patch for. 
But stat/statx always pulls into the vfs inode cache, and that's likely worth fixing. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1 2024-09-24 2:55 ` Kent Overstreet @ 2024-09-24 3:34 ` Dave Chinner 2024-09-24 3:47 ` Kent Overstreet 0 siblings, 1 reply; 28+ messages in thread From: Dave Chinner @ 2024-09-24 3:34 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote: > On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote: > > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote: > > > > > > However, the problematic workload is cold cache operations where > > > the dentry cache repeatedly misses. This places all the operational > > > concurrency directly on the inode hash as new inodes are inserted > > > into the hash. Add memory reclaim and that adds contention as it > > > removes inodes from the hash on eviction. > > > > Yeah, and then we spend all the time just adding the inodes to the > > hashes, and probably fairly seldom use them. Oh well. > > > > And I had missed the issue with PREEMPT_RT and the fact that right now > > the inode hash lock is outside the inode lock, which is problematic. > > > > So it's all a bit nasty. > > > > But I also assume most of the bad issues end up mainly showing up on > > just fairly synthetic benchmarks with ramdisks, because even with a > > good SSD I suspect the IO for the cold cache would still dominate? > > Not for bcachefs, because filling into the vfs inode cache doesn't > require a disk read - they're cached in the inodes btree and much > smaller there. We use a varint encoding so they're typically 50-100 > bytes, last I checked. > > Compare to ~1k, plus or minus, in the vfs inode cache. 
> > Thomas Bertshinger has been working on applications at LANL where > avoiding pulling into the vfs inode cache seems to make a significant > difference (file indexing in his case) - it turns out there's an xattr > syscall that's missing, which I believe he'll be submitting a patch for. > > But stat/statx always pulls into the vfs inode cache, and that's likely > worth fixing. No, let's not even consider going there. Unlike most people, old-time XFS developers have direct experience with the problems caused by "uncached" inode access for stat purposes. XFS has had the bulkstat API for a long, long time (i.e. since 1998 on Irix). When it was first implemented on Irix, it was VFS cache coherent. But in the early 2000s, that caused problems with HSMs needing to scan billions of inodes indexing petabytes of stored data with certain SLA guarantees (i.e. needing to scan at least a million inodes a second). The CPU overhead of cache instantiation and teardown was too great to meet those performance targets on 500MHz MIPS CPUs. So we converted bulkstat to run directly out of the XFS buffer cache (i.e. uncached from the perspective of the VFS). This reduced the CPU overhead per inode substantially, allowing bulkstat rates to increase by a factor of 10. However, it introduced all sorts of coherency problems between cached inode state vs what was stored in the buffer cache. It was basically O_DIRECT for stat() and, as you'd expect from that description, the coherency problems were horrible. Detecting iallocated-but-not-yet-updated and unlinked-but-not-yet-freed inodes was a particularly consistent source of issues. The only way to fix these coherency problems was to check the inode cache for a resident inode first, which basically defeated the entire purpose of bypassing the VFS cache in the first place. 
So we went back to using coherent lookups once we stopped caring about the old SGI HSM SLA requirements back in 2010: commit 7dce11dbac54fce777eea0f5fb25b2694ccd7900 Author: Christoph Hellwig <hch@lst.de> Date: Wed Jun 23 18:11:11 2010 +1000 xfs: always use iget in bulkstat The non-coherent bulkstat versionsthat look directly at the inode buffers causes various problems with performance optimizations that make increased use of just logging inodes. This patch makes bulkstat always use iget, which should be fast enough for normal use with the radix-tree based inode cache introduced a while ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Essentially, we now always being inodes into cache as fully instantiated VFS inodes, and mark them with I_DONT_CACHE so they get torn down immediately after we've used them if they weren't already cached. (i.e. all accesses are done in a single task context with the inode hot in the CPU caches.) Without the patchset I pointed to, bulkstat maxxes out the VFS inode cache cycling on the sb->s_inodes_list_lock at about 700,000 inodes/sec being cycled through the cache. With that patchset, it maxxed out the bandwidth of the NVMe SSDs I was using at ~7GB/s read IO and cycling about 6 million inodes/sec through the VFS cache. IOWs, we can make the VFS inode cache scale way beyond what most users actually need right now. The lesson in all this? Don't hack around VFS scalability issues if it can be avoided. If we fix the problems properly in the first place, then most fs developers (let alone users!) won't even realise there was a scalability problem in the VFS to begin with. Users will, however, notice every inconsistency and/or corruption caused by fs developers trying to work around VFS cache limitations... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 28+ messages in thread
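The use-once pattern Dave describes — fully instantiate the inode, then tear it down on the final put unless something else already had it cached — can be sketched as a refcounted cache with a "don't retain" hint. This is a userspace illustration with invented names (`obj`, `dont_cache`), not the kernel's actual iget/iput/I_DONT_CACHE code path:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy analogue of a cached inode: a refcount plus a dont_cache hint. */
struct obj {
    int  id;
    int  refcount;
    bool dont_cache;   /* like I_DONT_CACHE: drop on final put */
    bool resident;     /* still in the cache after the last put? */
};

static struct obj *cache[16];  /* trivial id-indexed "inode cache" */

/* Lookup-or-create: a hit reuses the cached object unchanged; a miss
 * instantiates it fully and marks it scan-only, like bulkstat's iget. */
static struct obj *obj_get(int id)
{
    struct obj *o = cache[id];
    if (!o) {
        o = calloc(1, sizeof(*o));
        o->id = id;
        o->dont_cache = true;   /* scan-only access: don't retain */
        o->resident = true;
        cache[id] = o;
    }
    o->refcount++;
    return o;
}

/* Final put tears the object down immediately iff dont_cache is set
 * (a real implementation would free it; kept around here so a caller
 * can still inspect the resident flag). */
static void obj_put(struct obj *o)
{
    if (--o->refcount == 0 && o->dont_cache) {
        cache[o->id] = NULL;
        o->resident = false;
    }
}
```

The point of the hint is exactly Dave's "single task context with the inode hot in the CPU caches": a pure scan never leaves residue behind, but anything already cached for normal use is untouched.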
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-24  3:34 ` Dave Chinner
@ 2024-09-24  3:47 ` Kent Overstreet
  2024-09-25  1:00 ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Kent Overstreet @ 2024-09-24  3:47 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > But stat/statx always pulls into the vfs inode cache, and that's likely
> > worth fixing.
>
> No, let's not even consider going there.
>
> Unlike most people, old time XFS developers have direct experience
> with the problems that "uncached" inode access for stat purposes.
>
> XFS has had the bulkstat API for a long, long time (i.e. since 1998
> on Irix). When it was first implemented on Irix, it was VFS cache
> coherent. But in the early 2000s, that caused problems with HSMs
> needing to scan billions inodes indexing petabytes of stored data
> with certain SLA guarantees (i.e. needing to scan at least a million
> inodes a second). The CPU overhead of cache instantiation and
> teardown was too great to meet those performance targets on 500MHz
> MIPS CPUs.
>
> So we converted bulkstat to run directly out of the XFS buffer cache
> (i.e. uncached from the perspective of the VFS). This reduced the
> CPU over per-inode substantially, allowing bulkstat rates to
> increase by a factor of 10. However, it introduced all sorts of
> coherency problems between cached inode state vs what was stored in
> the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> expect from that description, the coherency problems were horrible.
> Detecting iallocated-but-not-yet-updated and
> unlinked-but-not-yet-freed inodes were particularly consistent
> sources of issues.
>
> The only way to fix these coherency problems was to check the inode
> cache for a resident inode first, which basically defeated the
> entire purpose of bypassing the VFS cache in the first place.

Eh? Of course it'd have to be coherent, but just checking if an inode is
present in the VFS cache is what, 1-2 cache misses? Depending on hash
table fill factor...

That's going to show up, but I have a hard time seeing that "defeating
the entire purpose" of bypassing the VFS cache, as you say.

> Don't hack around VFS scalability issues if it can be avoided.

Well, maybe if your dlock list patches make it in - I still see crazy
lock contention there...

^ permalink raw reply	[flat|nested] 28+ messages in thread
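The kind of no-lock presence check Kent is pointing at can be sketched with C11 atomics: a reader walks one hash chain with acquire loads and pays only the cache misses of the bucket head and node, taking no lock at all. This is an illustration of the technique, not the kernel's RCU-based inode hash code; the names and the hash function are invented:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_BUCKETS 64u

struct node {
    unsigned long key;
    struct node *_Atomic next;
};

static struct node *_Atomic buckets[NR_BUCKETS];

static unsigned bucket_of(unsigned long key)
{
    /* Fibonacci hashing: multiply, keep the top bits. */
    return (unsigned)((key * 0x9E3779B97F4A7C15ull) >> 58) % NR_BUCKETS;
}

/* Writer side: publish a new node with a release CAS so that readers
 * who see the pointer also see the initialized key. */
static void hash_insert(struct node *n)
{
    struct node *_Atomic *head = &buckets[bucket_of(n->key)];
    struct node *first = atomic_load_explicit(head, memory_order_relaxed);
    do {
        atomic_store_explicit(&n->next, first, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(head, &first, n,
                 memory_order_release, memory_order_relaxed));
}

/* Reader side: no lock taken; at sane fill factors this is typically
 * one or two cache misses (the bucket head, then the node itself). */
static bool hash_contains(unsigned long key)
{
    struct node *n = atomic_load_explicit(&buckets[bucket_of(key)],
                                          memory_order_acquire);
    while (n) {
        if (n->key == key)
            return true;
        n = atomic_load_explicit(&n->next, memory_order_acquire);
    }
    return false;
}
```

Removal is where the real complexity lives — a reader may still be walking a node after it is unlinked, which is what RCU grace periods handle in the kernel; the insert/lookup half shown here is the cheap part both sides of the argument agree on.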
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-24  3:47 ` Kent Overstreet
@ 2024-09-25  1:00 ` Dave Chinner
  2024-09-25  2:13 ` Kent Overstreet
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2024-09-25  1:00 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Mon, Sep 23, 2024 at 11:47:54PM -0400, Kent Overstreet wrote:
> On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> > On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > > But stat/statx always pulls into the vfs inode cache, and that's likely
> > > worth fixing.
> >
> > No, let's not even consider going there.
> >
> > Unlike most people, old time XFS developers have direct experience
> > with the problems that "uncached" inode access for stat purposes.
> >
> > XFS has had the bulkstat API for a long, long time (i.e. since 1998
> > on Irix). When it was first implemented on Irix, it was VFS cache
> > coherent. But in the early 2000s, that caused problems with HSMs
> > needing to scan billions inodes indexing petabytes of stored data
> > with certain SLA guarantees (i.e. needing to scan at least a million
> > inodes a second). The CPU overhead of cache instantiation and
> > teardown was too great to meet those performance targets on 500MHz
> > MIPS CPUs.
> >
> > So we converted bulkstat to run directly out of the XFS buffer cache
> > (i.e. uncached from the perspective of the VFS). This reduced the
> > CPU over per-inode substantially, allowing bulkstat rates to
> > increase by a factor of 10. However, it introduced all sorts of
> > coherency problems between cached inode state vs what was stored in
> > the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> > expect from that description, the coherency problems were horrible.
> > Detecting iallocated-but-not-yet-updated and
> > unlinked-but-not-yet-freed inodes were particularly consistent
> > sources of issues.
> >
> > The only way to fix these coherency problems was to check the inode
> > cache for a resident inode first, which basically defeated the
> > entire purpose of bypassing the VFS cache in the first place.
>
> Eh? Of course it'd have to be coherent, but just checking if an inode is
> present in the VFS cache is what, 1-2 cache misses? Depending on hash
> table fill factor...

Sure, when there is no contention and you have CPU to spare. But the
moment the lookup hits contention problems (i.e. we are exceeding the
cache lookup scalability capability), we are straight back to running at
VFS cache speed instead of uncached speed. IOWs, needing to perform the
cache lookup defeated the purpose of using uncached lookups to avoid the
cache scalability problems.

Keep in mind that not having a referenced inode opens up the code to
things like pre-emption races. i.e. a cache miss doesn't prevent the
current task from being preempted before it reads the inode information
into the user buffer. The VFS inode could be instantiated and modified
before the uncached access runs again and pulls stale information from
the underlying buffer and returns that to userspace.

Those were the sorts of problems we continually had with using low level
inode information for stat operations vs using the up-to-date VFS inode
state....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread
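The stale-read race Dave describes — a preempted reader returning data that a concurrent VFS-side update has since invalidated — is the classic problem a sequence counter solves: copy the data, then retry if the version moved underneath you. A minimal userspace analogue of the kernel's seqcount, shown only to make the race concrete (this is not what XFS actually implemented):

```c
#include <stdatomic.h>

/* A seqcount-protected record: the writer bumps seq to odd while
 * updating and back to even when done; readers retry on any change. */
struct seq_record {
    atomic_uint seq;
    long data;
};

static void record_write(struct seq_record *r, long v)
{
    atomic_fetch_add(&r->seq, 1);   /* odd: update in progress */
    r->data = v;
    atomic_fetch_add(&r->seq, 1);   /* even again: stable */
}

/* Returns a value that was stable for the whole copy: a reader that
 * was preempted mid-copy sees the counter move and simply retries,
 * so stale data is never returned as current. */
static long record_read(struct seq_record *r)
{
    unsigned int s;
    long v;
    do {
        s = atomic_load(&r->seq);
        v = r->data;
    } while ((s & 1) || s != atomic_load(&r->seq));
    return v;
}
```

The catch, and arguably Dave's underlying point, is that this only works if every writer — including the VFS instantiation path — actually bumps the counter, which is precisely the cross-subsystem coordination an "uncached" fast path was trying to avoid.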
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-25  1:00 ` Dave Chinner
@ 2024-09-25  2:13 ` Kent Overstreet
  2024-09-25  4:43 ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Kent Overstreet @ 2024-09-25  2:13 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > Eh? Of course it'd have to be coherent, but just checking if an inode is
> > present in the VFS cache is what, 1-2 cache misses? Depending on hash
> > table fill factor...
>
> Sure, when there is no contention and you have CPU to spare. But the
> moment the lookup hits contention problems (i.e. we are exceeding
> the cache lookup scalability capability), we are straight back to
> running a VFS cache speed instead of uncached speed.

The cache lookups are just reads; they don't introduce scalability
issues, unless they're contending with other cores writing to those
cachelines - checking if an item is present in a hash table is trivial
to do locklessly.

But pulling an inode into and then evicting it from the inode cache
entails a lot more work - just initializing a struct inode is
nontrivial, and then there's the (multiple) shared data structures you
have to manipulate.

> Keep in mind that not having a referenced inode opens up the code to
> things like pre-emption races. i.e. a cache miss doesn't prevent
> the current task from being preempted before it reads the inode
> information into the user buffer. The VFS inode could bei
> instantiated and modified before the uncached access runs again and
> pulls stale information from the underlying buffer and returns that
> to userspace.

Yeah, if you're reading from a buffer cache that doesn't have a lock
that does get dicey - but for bcachefs, where we're reading from a btree
node that does have a lock, it's quite manageable.

And incidentally, this sort of "we have a cache on top of the btree, but
sometimes we have to do direct access" is already something that comes
up a lot in bcachefs, primarily for the alloc btree. _Tons_ of fun, but
doesn't actually come up here for us since we don't use the vfs inode
cache as a writeback cache.

Now, for some completely different silliness, there's actually _three_
levels of caching for inodes in bcachefs: btree node cache, btree key
cache, then the vfs cache. In the first two, inodes are packed down to
~100 bytes so it's not that bad, but it does make you go "...what?". It
would be nice in theory to collapse - but the upside is that we don't
have the interactions between the vfs inode cache and journalling that
xfs has.

But if vfs inodes no longer have their own lifetime like you've been
talking about, that might open up interesting possibilities.

^ permalink raw reply	[flat|nested] 28+ messages in thread
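The small on-disk inode sizes Kent cites come from variable-length integer encoding. bcachefs has its own format (bch2_varint); this LEB128-style sketch just shows the mechanism — small values, which dominate in inode fields, take one or two bytes instead of a fixed eight:

```c
#include <stddef.h>
#include <stdint.h>

/* LEB128-style varint: 7 payload bits per byte, high bit = "more".
 * Values under 128 take one byte; under 16384, two; and so on. */
static size_t varint_encode(uint8_t *out, uint64_t v)
{
    size_t n = 0;
    do {
        uint8_t byte = v & 0x7f;
        v >>= 7;
        out[n++] = byte | (v ? 0x80 : 0);
    } while (v);
    return n;
}

/* Inverse: accumulate 7 bits per byte until the high bit is clear.
 * Returns the number of bytes consumed. */
static size_t varint_decode(const uint8_t *in, uint64_t *v)
{
    size_t n = 0;
    unsigned shift = 0;
    *v = 0;
    do {
        *v |= (uint64_t)(in[n] & 0x7f) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    return n;
}
```

Pack a dozen mostly-small fields (size, nlink, timestamps as offsets, flags) this way and a typical inode lands in the 50-100 byte range Kent describes, versus ~1k for a fully instantiated struct inode.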
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-25  2:13 ` Kent Overstreet
@ 2024-09-25  4:43 ` Dave Chinner
  2024-09-25  5:11 ` Kent Overstreet
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2024-09-25  4:43 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Tue, Sep 24, 2024 at 10:13:01PM -0400, Kent Overstreet wrote:
> On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > > Eh? Of course it'd have to be coherent, but just checking if an inode is
> > > present in the VFS cache is what, 1-2 cache misses? Depending on hash
> > > table fill factor...
> >
> > Sure, when there is no contention and you have CPU to spare. But the
> > moment the lookup hits contention problems (i.e. we are exceeding
> > the cache lookup scalability capability), we are straight back to
> > running a VFS cache speed instead of uncached speed.
>
> The cache lookups are just reads; they don't introduce scalability
> issues, unless they're contending with other cores writing to those
> cachelines - checking if an item is present in a hash table is trivial
> to do locklessly.

Which was not something the VFS inode cache did until a couple of months
ago. Just because something is possible/easy today, it doesn't mean it
was possible or viable 15-20 years ago.

> But pulling an inode into and then evicting it from the inode cache
> entails a lot more work - just initializing a struct inode is
> nontrivial, and then there's the (multiple) shared data structures you
> have to manipulate.

Yes, but to avoid this we'd need to come up with a mechanism that is
generally safe for most filesystems, not just bcachefs.

I mean, if you can come up with a stat() mechanism that is safe enough
for us to read straight out of the XFS buffer cache for inode cache
misses, then we'll switch over to using it ASAP.

That's your challenge - if you want bcachefs to be able to do this, then
you have to make sure the infrastructure required works for other
filesystems just as safely, too.

> And incidentally this sort of "we have a cache on top of the btree, but
> sometimes we have to do direct access" is already something that comes
> up a lot in bcachefs, primarily for the alloc btree. _Tons_ of fun, but
> doesn't actually come up here for us since we don't use the vfs inode
> cache as a writeback cache.

And therein lies the problem for the general case. Most filesystems do
use writeback caching of inode metadata via the VFS inode state, XFS
included, and that's where all the dragons are hiding.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-25  4:43 ` Dave Chinner
@ 2024-09-25  5:11 ` Kent Overstreet
  0 siblings, 0 replies; 28+ messages in thread
From: Kent Overstreet @ 2024-09-25  5:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Wed, Sep 25, 2024 at 02:43:15PM GMT, Dave Chinner wrote:
> On Tue, Sep 24, 2024 at 10:13:01PM -0400, Kent Overstreet wrote:
> > On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > > > Eh? Of course it'd have to be coherent, but just checking if an inode is
> > > > present in the VFS cache is what, 1-2 cache misses? Depending on hash
> > > > table fill factor...
> > >
> > > Sure, when there is no contention and you have CPU to spare. But the
> > > moment the lookup hits contention problems (i.e. we are exceeding
> > > the cache lookup scalability capability), we are straight back to
> > > running a VFS cache speed instead of uncached speed.
> >
> > The cache lookups are just reads; they don't introduce scalability
> > issues, unless they're contending with other cores writing to those
> > cachelines - checking if an item is present in a hash table is trivial
> > to do locklessly.
>
> Which was not something the VFS inode cache did until a couple of
> months ago. Just because something is possible/easy today, it
> doesn't mean it was possible or viable 15-20 years ago.

We've been doing lockless hash table lookups for a lot longer than that.

> > But pulling an inode into and then evicting it from the inode cache
> > entails a lot more work - just initializing a struct inode is
> > nontrivial, and then there's the (multiple) shared data structures you
> > have to manipulate.
>
> Yes, but to avoid this we'd need to come up with a mechanism that is
> generally safe for most filesystems, not just bcachefs.
>
> I mean, if you can come up with a stat() mechanism that is safe
> enough for us read straight out the XFS buffer cache for inode cache
> misses, then we'll switch over to using it ASAP.

So add a per buffer lock to your buffer cache, and check for vfs cache
residency under the lock.

Whether bulkstat is worth adding that lock, for a filesystem that
doesn't need it? That I can't say, it's a tradeoff.

> That's your challenge - if you want bcachefs to be able to do this,
> then you have to make sure the infrastructure required works for
> other filesystems just as safely, too.

So... you think bcachefs should be limited to the capabilities of what
other filesystems can do? Interesting position...

^ permalink raw reply	[flat|nested] 28+ messages in thread
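Kent's proposal — take a per-buffer lock and check VFS cache residency under it before trusting the buffer — can be sketched in miniature. Every name here is hypothetical (`buf`, `vfs_resident`, `coherent_uncached_stat`); a real implementation would also need the VFS instantiation path to take the same buffer lock, which is exactly the cost being debated:

```c
#include <pthread.h>
#include <stdbool.h>

/* On-disk-side copy of an inode's stat data, guarded by a per-buffer
 * lock, plus a toy stand-in for the VFS inode cache. */
struct buf {
    pthread_mutex_t lock;
    long stat_data;            /* what an uncached stat would return */
};

static bool vfs_resident[16];  /* is inode ino in the "VFS cache"?   */
static long vfs_stat[16];      /* authoritative data when resident   */

/* The proposal in miniature: under the per-buffer lock, only trust
 * the buffer if the VFS cache has no (possibly newer) copy. */
static long coherent_uncached_stat(struct buf *b, int ino)
{
    long v;

    pthread_mutex_lock(&b->lock);
    if (vfs_resident[ino])
        v = vfs_stat[ino];     /* coherent: defer to the VFS copy     */
    else
        v = b->stat_data;      /* fast path: straight from the buffer */
    pthread_mutex_unlock(&b->lock);
    return v;
}
```

The lock closes the preemption window from earlier in the thread: an inode cannot become resident (or be modified) between the residency check and the buffer read, because instantiation would have to wait on the same lock.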
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-24  2:26 ` Linus Torvalds
  2024-09-24  2:48 ` Linus Torvalds
  2024-09-24  2:55 ` Kent Overstreet
@ 2024-09-24  3:04 ` Dave Chinner
  2 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2024-09-24  3:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel, Dave Chinner

On Mon, Sep 23, 2024 at 07:26:31PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote:
> >
> > However, the problematic workload is cold cache operations where
> > the dentry cache repeatedly misses. This places all the operational
> > concurrency directly on the inode hash as new inodes are inserted
> > into the hash. Add memory reclaim and that adds contention as it
> > removes inodes from the hash on eviction.
>
> Yeah, and then we spend all the time just adding the inodes to the
> hashes, and probably fairly seldom use them. Oh well.
>
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.

*nod*

> So it's all a bit nasty.
>
> But I also assume most of the bad issues end up mainly showing up on
> just fairly synthetic benchmarks with ramdisks, because even with a
> good SSD I suspect the IO for the cold cache would still dominate?

No, all the issues show up on consumer level NVMe SSDs - they have more
than enough IO concurrency to cause these CPU concurrency related
problems.

Keep in mind that when it comes to doing huge amounts of IO, ramdisks
are fundamentally flawed and don't scale. That is, the IO is synchronous
memcpy() based and so consumes CPU time and both read and write memory
bandwidth, and concurrency is limited to the number of CPUs in the
system.

With NVMe SSDs, all the data movement is asynchronous and offloaded to
hardware with DMA engines that move the data. Those DMA engines can
often handle hundreds of concurrent IOs at once.

DMA sourced data is also only written to RAM once, and there are no
dependent data reads to slow down the DMA write to RAM like there is
with a data copy streamed through a CPU. IOWs, once the IO rates and
concurrency go up, it is generally much faster to use the CPU to program
the DMA engines to move the data than it is to move the data with the
CPU itself.

The testing I did (and so the numbers in those benchmarks) was done on
2018-era PCIe 3.0 enterprise NVMe SSDs that could do approximately 400k
4kB random read IOPS. The latest consumer PCIe 5.0 NVMe SSDs are *way
faster* than these drives when subject to highly concurrent IO
requests...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [GIT PULL] bcachefs changes for 6.12-rc1
  2024-09-21 19:27 [GIT PULL] bcachefs changes for 6.12-rc1 Kent Overstreet
  2024-09-23 17:18 ` Linus Torvalds
@ 2024-09-23 19:06 ` pr-tracker-bot
  1 sibling, 0 replies; 28+ messages in thread
From: pr-tracker-bot @ 2024-09-23 19:06 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, linux-bcachefs, linux-fsdevel, linux-kernel

The pull request you sent on Sat, 21 Sep 2024 15:27:58 -0400:

> git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b3f391fddf3cfaadda59ec8da8fd17f4520bbf42

Thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2024-09-28  0:47 UTC | newest]

Thread overview: 28+ messages
2024-09-21 19:27 [GIT PULL] bcachefs changes for 6.12-rc1 Kent Overstreet
2024-09-23 17:18 ` Linus Torvalds
2024-09-23 19:07 ` Linus Torvalds
2024-09-23 19:58 ` Kent Overstreet
2024-09-23 19:56 ` Kent Overstreet
2024-09-24  0:26 ` Dave Chinner
2024-09-24  1:55 ` Kent Overstreet
2024-09-24  2:26 ` Linus Torvalds
2024-09-24  2:48 ` Linus Torvalds
2024-09-24  3:55 ` Dave Chinner
2024-09-24 16:57 ` Linus Torvalds
2024-09-24 17:27 ` Kent Overstreet
2024-09-25  0:17 ` Dave Chinner
2024-09-25  1:45 ` Linus Torvalds
2024-09-25 11:41 ` Christian Brauner
2024-09-25  2:48 ` Kent Overstreet
2024-09-27  0:48 ` Herbert Xu
2024-09-28  0:11 ` Kent Overstreet
2024-09-28  0:47 ` Herbert Xu
2024-09-24  2:55 ` Kent Overstreet
2024-09-24  3:34 ` Dave Chinner
2024-09-24  3:47 ` Kent Overstreet
2024-09-25  1:00 ` Dave Chinner
2024-09-25  2:13 ` Kent Overstreet
2024-09-25  4:43 ` Dave Chinner
2024-09-25  5:11 ` Kent Overstreet
2024-09-24  3:04 ` Dave Chinner
2024-09-23 19:06 ` pr-tracker-bot