From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
linux-bcachefs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, Dave Chinner <dchinner@redhat.com>
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1
Date: Wed, 25 Sep 2024 11:00:10 +1000
Message-ID: <ZvNgmoKgWF0TBXP8@dread.disaster.area>
In-Reply-To: <dia6l34faugmuwmgpyvpeeppqjwmv2qhhvu57nrerc34qknwlo@ltwkoy7pstrm>
On Mon, Sep 23, 2024 at 11:47:54PM -0400, Kent Overstreet wrote:
> On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> > On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > > But stat/statx always pulls into the vfs inode cache, and that's likely
> > > worth fixing.
> >
> > No, let's not even consider going there.
> >
> > Unlike most people, old-time XFS developers have direct experience
> > with the problems caused by "uncached" inode access for stat
> > purposes.
> >
> > XFS has had the bulkstat API for a long, long time (i.e. since 1998
> > on Irix). When it was first implemented on Irix, it was VFS cache
> > coherent. But in the early 2000s, that caused problems with HSMs
> > needing to scan billions of inodes indexing petabytes of stored data
> > with certain SLA guarantees (i.e. needing to scan at least a million
> > inodes a second). The CPU overhead of cache instantiation and
> > teardown was too great to meet those performance targets on 500MHz
> > MIPS CPUs.
> >
> > So we converted bulkstat to run directly out of the XFS buffer cache
> > (i.e. uncached from the perspective of the VFS). This reduced the
> > CPU overhead per inode substantially, allowing bulkstat rates to
> > increase by a factor of 10. However, it introduced all sorts of
> > coherency problems between cached inode state vs what was stored in
> > the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> > expect from that description, the coherency problems were horrible.
> > Detecting iallocated-but-not-yet-updated and
> > unlinked-but-not-yet-freed inodes was a particularly consistent
> > source of issues.
> >
> > The only way to fix these coherency problems was to check the inode
> > cache for a resident inode first, which basically defeated the
> > entire purpose of bypassing the VFS cache in the first place.
>
> Eh? Of course it'd have to be coherent, but just checking if an inode is
> present in the VFS cache is what, 1-2 cache misses? Depending on hash
> table fill factor...
Sure, when there is no contention and you have CPU to spare. But the
moment the lookup hits contention problems (i.e. we are exceeding
the scalability limits of the cache lookup), we are straight back to
running at VFS cache speed instead of uncached speed.

IOWs, needing to perform the cache lookup defeated the purpose of
using uncached lookups to avoid the cache scalability problems.
Keep in mind that not holding a referenced inode opens the code up
to things like pre-emption races. i.e. a cache miss doesn't prevent
the current task from being preempted before it reads the inode
information into the user buffer. The VFS inode could be
instantiated and modified before the uncached access runs again, so
it pulls stale information from the underlying buffer and returns
that to userspace.
Those were the sorts of problems we continually had with using low
level inode information for stat operations vs using the up-to-date
VFS inode state....
-Dave.
--
Dave Chinner
david@fromorbit.com