public inbox for linux-bcachefs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>,
	linux-bcachefs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Dave Chinner <dchinner@redhat.com>
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1
Date: Tue, 24 Sep 2024 13:55:35 +1000	[thread overview]
Message-ID: <ZvI4N55fzO7kg0W/@dread.disaster.area> (raw)
In-Reply-To: <CAHk-=wh+atcBWa34mDdG1bFGRc28eJas3tP+9QrYXX6C7BX0JQ@mail.gmail.com>

On Mon, Sep 23, 2024 at 07:48:03PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
> 
> .. although I guess that could be solved by just making the spinlock
> be external to the hash list, rather than part of the list like
> list_bh.

That's effectively what the patch did - it added a spinlock per hash
list.

> You wouldn't need to do one lock per list head, you could just do some
> grouping of the hash.

Yes, that's a more complex thing to do, though - it is effectively
using hashed locks for the hash table.

> That said, the vfs inode hash code really is pretty disgusting and has
> been accumulating these warts for a long time. So maybe a "filesystems
> do their own thing" together with some helper infrastructure (like
> using rhashtable) is actually the right thing to do.

Perhaps.

XFS has done its own thing since 2007, but that was because the XFS
inode lifecycle has always existed outside the VFS inode lifecycle.
i.e. the inode has to outlive evict() and ->destroy_inode because it
can still be dirty and tracked by the journal when the VFS evicts it
from the cache.

We also have internal inode lookup infrastructure that we don't want
going anywhere near the VFS (e.g. inode writeback clustering from
journal item cleaning callbacks).

But this has come at a significant complexity cost, and some of the
biggest problems we've had to solve have been a result of this
architecture. e.g. memory reclaim blocking on dirty XFS inodes that
the VFS was unaware of (via the fs-specific callout in the
superblock shrinker) caused all sorts of long-tail memory allocation
latency issues for years. We've fixed that, but it took a long time
to work out how to avoid blocking in memory reclaim without
immediately driving the system into OOM-kill frenzies....

There are also issues around RCU freeing of inodes and how the
filesystem-level cache handles that. e.g. what happens when the
filesystem doesn't immediately free the inode after memory reclaim
evicts it, but then re-instantiates it moments later when a new
lookup for that inode comes in? The rest of the VFS assumes that
the inode won't reappear until an RCU grace period expires, but
having the re-instantiation code call synchronize_rcu() (or
something similar) is not an option because this sort of situation
can occur thousands of times a second under memory pressure....

We know this because there is an outstanding issue in XFS around
this situation - none of the general solutions to the problem
we've tried have been viable. I suspect that any other filesystem
that implements its own inode cache and does VFS inode
re-instantiation on an inode cache hit will have similar issues,
and so there will be wider VFS architecture impacts as a result.

IOWs, it's not clear to me that this is a caching model we really
want to pursue in general because of a) the code duplication and b)
the issues such an inode life-cycle model has interacting with the
VFS life-cycle expectations...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 28+ messages
2024-09-21 19:27 [GIT PULL] bcachefs changes for 6.12-rc1 Kent Overstreet
2024-09-23 17:18 ` Linus Torvalds
2024-09-23 19:07   ` Linus Torvalds
2024-09-23 19:58     ` Kent Overstreet
2024-09-23 19:56   ` Kent Overstreet
2024-09-24  0:26     ` Dave Chinner
2024-09-24  1:55       ` Kent Overstreet
2024-09-24  2:26       ` Linus Torvalds
2024-09-24  2:48         ` Linus Torvalds
2024-09-24  3:55           ` Dave Chinner [this message]
2024-09-24 16:57             ` Linus Torvalds
2024-09-24 17:27               ` Kent Overstreet
2024-09-25  0:17               ` Dave Chinner
2024-09-25  1:45                 ` Linus Torvalds
2024-09-25 11:41                   ` Christian Brauner
2024-09-25  2:48                 ` Kent Overstreet
2024-09-27  0:48                   ` Herbert Xu
2024-09-28  0:11                     ` Kent Overstreet
2024-09-28  0:47                       ` Herbert Xu
2024-09-24  2:55         ` Kent Overstreet
2024-09-24  3:34           ` Dave Chinner
2024-09-24  3:47             ` Kent Overstreet
2024-09-25  1:00               ` Dave Chinner
2024-09-25  2:13                 ` Kent Overstreet
2024-09-25  4:43                   ` Dave Chinner
2024-09-25  5:11                     ` Kent Overstreet
2024-09-24  3:04         ` Dave Chinner
2024-09-23 19:06 ` pr-tracker-bot
