From: Dave Chinner <david@fromorbit.com>
To: NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
lsf-pc@lists.linuxfoundation.org,
Al Viro <viro@zeniv.linux.org.uk>,
linux-fsdevel@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
Date: Tue, 21 Jan 2025 12:20:22 +1100
Message-ID: <Z472VjIvT78DskGv@dread.disaster.area>
In-Reply-To: <173732553757.22054.12851849131700067664@noble.neil.brown.name>
On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Dave Chinner wrote:
> > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > >
> > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > advise me on converting?) to use the new interface and do some testing
> > > and be open to exploring any bugs that appear?
> >
> > tl;dr: You're asking for people to put in a *lot* of time to convert
> > complex filesystems to concurrent directory modifications without
> > clear indication that it will improve performance. Hence I wouldn't
> > expect widespread enthusiasm to suddenly implement it...
>
> Thanks Dave!
> Your point as detailed below seems to be that, for xfs at least, it may
> be better to reduce hold times for exclusive locks rather than allow
> concurrent locks. That seems entirely credible for a local fs but
> doesn't apply for NFS as we cannot get a success status before the
> operation is complete.
How is that different from a local filesystem? A local filesystem
can't return from open(O_CREAT) with a struct file referencing a
newly allocated inode until the VFS inode is fully instantiated (or
failed), either...
i.e. this sounds like you want concurrent share-locked dirent ops so
that synchronously processed operations can be issued concurrently.
Could the NFS client implement asynchronous directory ops, keeping
track of the operations in flight without needing to hold the parent
i_rwsem across each individual operation? This is basically what I've
been describing for XFS to minimise parent dir lock hold times.
What would VFS support for that look like? Is that of similar
complexity to implementing shared locking support so that concurrent
blocking directory operations can be issued? Is async processing a
better model to move the directory ops towards so we can tie
userspace directly into it via io_uring?
> So it seems likely that different filesystems
> will want different approaches. No surprise.
>
> There is some evidence that ext4 can be converted to concurrent
> modification without a lot of work, and with measurable benefits. I
> guess I should focus there for local filesystems.
>
> But I don't want to assume what is best for each file system which is
> why I asked for input from developers of the various filesystems.
>
> But even for xfs, I think that to provide a successful return from mkdir
> would require waiting for some IO to complete, and that other operations
I don't see where IO enters the picture, to be honest. File creation
does not typically require foreground IO on XFS at all (ignoring
dirsync mode). How do you think we scale XFS to near a million file
creates per second? :)
In the case of mkdir, it does not take a direct reference to the
inode being created so it potentially doesn't even need to wait for
the completion of the operation. i.e. to use the new subdir it has
to be open()d; that means going through the ->lookup path, which
will block on I_NEW until the background creation is completed...
That said, open(O_CREAT) would need to call wait_on_inode()
somewhere to wait for I_NEW to clear so operations on the inode can
proceed immediately via the persistent struct file reference it
creates. With the right support, that waiting can be done without
holding the parent directory locked, as any new lookup on that
dirent/inode pair will block until I_NEW is cleared...
Hence my question above about what VFS support for async dirops
actually looks like, and whether something like this:
> might benefit from starting before that IO completes.
> So maybe an xfs implementation of mkdir_shared would be:
> - take internal exclusive lock on directory
> - run fast foreground part of mkdir
> - drop the lock
> - wait for background stuff, which could affect error return, to
> complete
> - return appropriate error, or success
as natively supported functionality might be a better solution to
the problem....
From the XFS perspective, we already have internal exclusive locking
for dirent mods, but that is needed for serialising the physical
directory mods against other (shared access) readers (e.g. lookup
and readdir). We would not want to be sharing such an internal lock
across the unbounded fast-path concurrency of lookup, create, unlink,
readdir -and- multiple background processing threads.
Logging a modification intent against an inode wouldn't require a
new internal inode lock; AFAICT all it requires is exclusivity
against lookup so that lookup can find new/unlinked dirents that
have been logged but not yet added to or removed from the physical
directory blocks.
We can do that by inserting the VFS inode into cache in the
foreground operation and leaving I_NEW set on create. We can do a
similar thing with unlink (I_WILL_FREE?) to make the VFS inode
invisible to new lookups before the unlink has actually been
processed. In this way, we can push the actual processing into
background work queues without actually changing how the namespace
behaves from a user perspective...
We -might- be able to do all these operations under a shared VFS
lock, but it is not necessary to have a shared VFS lock to enable
such an async processing model. If the performance gains for the NFS
client come from allowing concurrent processing of individual
synchronous operations, then we really need to consider if there are
other methods of hiding synchronous operation latency that might
be more applicable to more filesystems and high performance user
interfaces...
-Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 10+ messages
2025-01-17 17:26 [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer Jeff Layton
2025-01-18 1:06 ` NeilBrown
2025-01-19 21:51 ` Dave Chinner
2025-01-19 22:25 ` NeilBrown
2025-01-20 11:55 ` Jeff Layton
2025-01-21 1:20 ` Dave Chinner [this message]
2025-01-21 13:04 ` Jeff Layton
2025-01-22 0:22 ` Dave Chinner
2025-01-22 1:04 ` NeilBrown
2025-02-03 17:19 ` Steve French