* [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
@ 2025-01-17 17:26 Jeff Layton
2025-01-18 1:06 ` NeilBrown
0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2025-01-17 17:26 UTC (permalink / raw)
To: lsf-pc; +Cc: NeilBrown, Al Viro, linux-fsdevel
We've hit a number of cases in testing recently where the parent's
i_rwsem ends up being the bottleneck in heavy parallel create
workloads. Currently we have to take the parent's inode->i_rwsem
exclusively when altering a directory, which means that any directory-
morphing operations in the same directory are serialized.
This is particularly onerous in the ->create codepath, since a
filesystem may have to do a number of blocking operations to create a
new file (allocate memory, start a transaction, etc.)
Neil recently posted this RFC series, which allows parallel directory
modifying operations:
https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@suse.de/
Al pointed out a number of problems in it, but the basic approach seems
sound. I'd like to have a discussion at LSF/MM about this.
Are there any problems with the basic approach? Are there other
approaches that might be better? Are there incremental steps we could
take to pave the way for this to become a reality?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-17 17:26 [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer Jeff Layton
@ 2025-01-18 1:06 ` NeilBrown
2025-01-19 21:51 ` Dave Chinner
0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2025-01-18 1:06 UTC (permalink / raw)
To: Jeff Layton; +Cc: lsf-pc, Al Viro, linux-fsdevel
On Sat, 18 Jan 2025, Jeff Layton wrote:
> We've hit a number of cases in testing recently where the parent's
> i_rwsem ends up being the bottleneck in heavy parallel create
> workloads. Currently we have to take the parent's inode->i_rwsem
> exclusively when altering a directory, which means that any directory-
> morphing operations in the same directory are serialized.
>
> This is particularly onerous in the ->create codepath, since a
> filesystem may have to do a number of blocking operations to create a
> new file (allocate memory, start a transaction, etc.)
>
> Neil recently posted this RFC series, which allows parallel directory
> modifying operations:
>
> https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@suse.de/
>
> Al pointed out a number of problems in it, but the basic approach seems
> sound. I'd like to have a discussion at LSF/MM about this.
>
> Are there any problems with the basic approach? Are there other
> approaches that might be better? Are there incremental steps we could
> take to pave the way for this to become a reality?
Thanks for raising this!
There was at least one problem with the approach but I have a plan to
address that. I won't go into detail here. I hope to get a new
patch set out sometime in the coming week.
My question to fs-devel is: is anyone willing to convert their fs (or
advise me on converting?) to use the new interface and do some testing
and be open to exploring any bugs that appear?
I'd like to try ext4 using the patches that Lustre maintains for
parallel directory ops in ext4, but making them suitable for upstream
doesn't look to be straightforward.
https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob;f=ldiskfs/kernel_patches/patches/linux-6.5/ext4-pdirop.patch;h=208d9dc44f4860fbf27072ed1969744131e30108;hb=HEAD
NeilBrown
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-18 1:06 ` NeilBrown
@ 2025-01-19 21:51 ` Dave Chinner
2025-01-19 22:25 ` NeilBrown
0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2025-01-19 21:51 UTC (permalink / raw)
To: NeilBrown; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel
On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> On Sat, 18 Jan 2025, Jeff Layton wrote:
> > We've hit a number of cases in testing recently where the parent's
> > i_rwsem ends up being the bottleneck in heavy parallel create
> > workloads. Currently we have to take the parent's inode->i_rwsem
> > exclusively when altering a directory, which means that any directory-
> > morphing operations in the same directory are serialized.
> >
> > This is particularly onerous in the ->create codepath, since a
> > filesystem may have to do a number of blocking operations to create a
> > new file (allocate memory, start a transaction, etc.)
> >
> > Neil recently posted this RFC series, which allows parallel directory
> > modifying operations:
> >
> > https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@suse.de/
> >
> > Al pointed out a number of problems in it, but the basic approach seems
> > sound. I'd like to have a discussion at LSF/MM about this.
> >
> > Are there any problems with the basic approach? Are there other
> > approaches that might be better? Are there incremental steps we could
> > take to pave the way for this to become a reality?
>
> Thanks for raising this!
> There was at least one problem with the approach but I have a plan to
> address that. I won't go into detail here. I hope to get a new
> patch set out sometime in the coming week.
>
> My question to fs-devel is: is anyone willing to convert their fs (or
> advise me on converting?) to use the new interface and do some testing
> and be open to exploring any bugs that appear?
tl;dr: You're asking for people to put in a *lot* of time to convert
complex filesystems to concurrent directory modifications without
clear indication that it will improve performance. Hence I wouldn't
expect widespread enthusiasm to suddenly implement it...
In more detail....
It's not exactly simple to take a directory tree structure that is
exclusively locked for modification and make it safe for concurrent
updates. It -might- be possible to make the directory updates in XFS
more concurrent, but it still has an internal name hash btree index
that would have to be completely re-written to support concurrent
updates.
That's also ignoring all the other bits of the filesystem that will
single thread outside the directory. e.g. during create we have to
allocate an inode, and locality algorithms will cluster new inodes
in the same directory close together. That means they are covered by
the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
Unlink has the same problem.
IOWs, it's not just directory ops and structures that need locking
changes; the way filesystems do inode and block allocation and
freeing also needs to change to support increased concurrency in
directory operations.
Hence I suspect that concurrent directory mods for filesystems like
XFS will need a new processing model - possibly a low overhead
intent-based modification model using in-memory whiteouts and async
background batching of intents. We kinda already do this with unlink
- we do the directory removal in the foreground, and defer the rest
of the unlink (i.e. inode freeing) to async background worker
threads.
e.g. doing background batching of namespace ops means things like
"rm *" in a directory doesn't need to transactionally modify the
directory as it runs. We could track all the inodes we are unlinking
via the intents and then simply truncate away the entire directory
when it becomes empty and rmdir() is called. We still have to clean
up and mark all the inodes free, but that can be done in the
background.
As such, I suspect that moving XFS to a more async processing model
for directory namespace ops to minimise lock hold times will be
simpler (and potentially faster!) than rewriting large chunks of the
XFS directory and inode management operations to allow for
i_rwsem/ILOCK/AGI/AGF locking concurrency...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-19 21:51 ` Dave Chinner
@ 2025-01-19 22:25 ` NeilBrown
2025-01-20 11:55 ` Jeff Layton
2025-01-21 1:20 ` Dave Chinner
0 siblings, 2 replies; 10+ messages in thread
From: NeilBrown @ 2025-01-19 22:25 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel
On Mon, 20 Jan 2025, Dave Chinner wrote:
> On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> >
> > My question to fs-devel is: is anyone willing to convert their fs (or
> > advise me on converting?) to use the new interface and do some testing
> > and be open to exploring any bugs that appear?
>
> tl;dr: You're asking for people to put in a *lot* of time to convert
> complex filesystems to concurrent directory modifications without
> clear indication that it will improve performance. Hence I wouldn't
> expect widespread enthusiasm to suddenly implement it...
Thanks Dave!
Your point as detailed below seems to be that, for xfs at least, it may
be better to reduce hold times for exclusive locks rather than allow
concurrent locks. That seems entirely credible for a local fs but
doesn't apply for NFS as we cannot get a success status before the
operation is complete. So it seems likely that different filesystems
will want different approaches. No surprise.
There is some evidence that ext4 can be converted to concurrent
modification without a lot of work, and with measurable benefits. I
guess I should focus there for local filesystems.
But I don't want to assume what is best for each filesystem, which is
why I asked for input from developers of the various filesystems.
But even for xfs, I think that providing a successful return from mkdir
would require waiting for some IO to complete, and that other operations
might benefit from starting before that IO completes.
So maybe an xfs implementation of mkdir_shared would be:
- take internal exclusive lock on directory
- run fast foreground part of mkdir
- drop the lock
- wait for background stuff, which could affect error return, to
complete
- return appropriate error, or success
So xfs could clearly use exclusive locking where that is the best
choice, but not have exclusive locking imposed for the entire operation.
That is my core goal: don't impose a particular locking style - allow
the filesystem to manage locking within an umbrella that ensures the
guarantees that the vfs needs (like no creation inside a directory
during rmdir).
Thanks,
NeilBrown
>
> In more detail....
>
> It's not exactly simple to take a directory tree structure that is
> exclusively locked for modification and make it safe for concurrent
> updates. It -might- be possible to make the directory updates in XFS
> more concurrent, but it still has an internal name hash btree index
> that would have to be completely re-written to support concurrent
> updates.
>
> That's also ignoring all the other bits of the filesystem that will
> single thread outside the directory. e.g. during create we have to
> allocate an inode, and locality algorithms will cluster new inodes
> in the same directory close together. That means they are covered by
> the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
> Unlink has the same problem.
>
> IOWs, it's not just directory ops and structures that need locking
> changes; the way filesystems do inode and block allocation and
> freeing also needs to change to support increased concurrency in
> directory operations.
>
> Hence I suspect that concurrent directory mods for filesystems like
> XFS will need a new processing model - possibly a low overhead
> intent-based modification model using in-memory whiteouts and async
> background batching of intents. We kinda already do this with unlink
> - we do the directory removal in the foreground, and defer the rest
> of the unlink (i.e. inode freeing) to async background worker
> threads.
>
> e.g. doing background batching of namespace ops means things like
> "rm *" in a directory doesn't need to transactionally modify the
> directory as it runs. We could track all the inodes we are unlinking
> via the intents and then simply truncate away the entire directory
> when it becomes empty and rmdir() is called. We still have to clean
> up and mark all the inodes free, but that can be done in the
> background.
>
> As such, I suspect that moving XFS to a more async processing model
> for directory namespace ops to minimise lock hold times will be
> simpler (and potentially faster!) than rewriting large chunks of the
> XFS directory and inode management operations to allow for
> i_rwsem/ILOCK/AGI/AGF locking concurrency...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-19 22:25 ` NeilBrown
@ 2025-01-20 11:55 ` Jeff Layton
2025-01-21 1:20 ` Dave Chinner
1 sibling, 0 replies; 10+ messages in thread
From: Jeff Layton @ 2025-01-20 11:55 UTC (permalink / raw)
To: NeilBrown, Dave Chinner; +Cc: lsf-pc, Al Viro, linux-fsdevel
On Mon, 2025-01-20 at 09:25 +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Dave Chinner wrote:
> > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > >
> > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > advise me on converting?) to use the new interface and do some testing
> > > and be open to exploring any bugs that appear?
> >
> > tl;dr: You're asking for people to put in a *lot* of time to convert
> > complex filesystems to concurrent directory modifications without
> > clear indication that it will improve performance. Hence I wouldn't
> > expect widespread enthusiasm to suddenly implement it...
>
> Thanks Dave!
> Your point as detailed below seems to be that, for xfs at least, it may
> be better to reduce hold times for exclusive locks rather than allow
> concurrent locks. That seems entirely credible for a local fs but
> doesn't apply for NFS as we cannot get a success status before the
> operation is complete. So it seems likely that different filesystems
> will want different approaches. No surprise.
>
> There is some evidence that ext4 can be converted to concurrent
> modification without a lot of work, and with measurable benefits. I
> guess I should focus there for local filesystems.
>
> But I don't want to assume what is best for each file system which is
> why I asked for input from developers of the various filesystems.
>
> But even for xfs, I think that to provide a successful return from mkdir
> would require waiting for some IO to complete, and that other operations
> might benefit from starting before that IO completes.
> So maybe an xfs implementation of mkdir_shared would be:
> - take internal exclusive lock on directory
> - run fast foreground part of mkdir
> - drop the lock
> - wait for background stuff, which could affect error return, to
> complete
> - return appropriate error, or success
>
> So xfs could clearly use exclusive locking where that is the best
> choice, but not have exclusive locking imposed for the entire operation.
> That is my core goal: don't impose a particular locking style - allow
> the filesystem to manage locking within an umbrella that ensures the
> guarantees that the vfs needs (like no creation inside a directory
> during rmdir).
>
>
I too don't think this approach is necessarily incompatible with XFS,
but we may very well find that once we remove the bottleneck of the
exclusive i_rwsem around directory modifications, the bottleneck
just moves down into the filesystem driver. We have to start
somewhere though!
Also, I should mention that it looks like Al won't be able to attend
LSF/MM this year. We may not want to schedule a time slot for this
after all, as I think his involvement will be key here.
>
> >
> > In more detail....
> >
> > It's not exactly simple to take a directory tree structure that is
> > exclusively locked for modification and make it safe for concurrent
> > updates. It -might- be possible to make the directory updates in XFS
> > more concurrent, but it still has an internal name hash btree index
> > that would have to be completely re-written to support concurrent
> > updates.
> >
> > That's also ignoring all the other bits of the filesystem that will
> > single thread outside the directory. e.g. during create we have to
> > allocate an inode, and locality algorithms will cluster new inodes
> > in the same directory close together. That means they are covered by
> > the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
> > Unlink has the same problem.
> >
> > IOWs, it's not just directory ops and structures that need locking
> > changes; the way filesystems do inode and block allocation and
> > freeing also needs to change to support increased concurrency in
> > directory operations.
> >
> > Hence I suspect that concurrent directory mods for filesystems like
> > XFS will need a new processing model - possibly a low overhead
> > intent-based modification model using in-memory whiteouts and async
> > background batching of intents. We kinda already do this with unlink
> > - we do the directory removal in the foreground, and defer the rest
> > of the unlink (i.e. inode freeing) to async background worker
> > threads.
> >
> > e.g. doing background batching of namespace ops means things like
> > "rm *" in a directory doesn't need to transactionally modify the
> > directory as it runs. We could track all the inodes we are unlinking
> > via the intents and then simply truncate away the entire directory
> > when it becomes empty and rmdir() is called. We still have to clean
> > up and mark all the inodes free, but that can be done in the
> > background.
> >
> > As such, I suspect that moving XFS to a more async processing model
> > for directory namespace ops to minimise lock hold times will be
> > simpler (and potentially faster!) than rewriting large chunks of the
> > XFS directory and inode management operations to allow for
> > i_rwsem/ILOCK/AGI/AGF locking concurrency...
> >
> > -Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> >
>
--
Jeff Layton <jlayton@kernel.org>
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-19 22:25 ` NeilBrown
2025-01-20 11:55 ` Jeff Layton
@ 2025-01-21 1:20 ` Dave Chinner
2025-01-21 13:04 ` Jeff Layton
1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2025-01-21 1:20 UTC (permalink / raw)
To: NeilBrown; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel
On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Dave Chinner wrote:
> > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > >
> > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > advise me on converting?) to use the new interface and do some testing
> > > and be open to exploring any bugs that appear?
> >
> > tl;dr: You're asking for people to put in a *lot* of time to convert
> > complex filesystems to concurrent directory modifications without
> > clear indication that it will improve performance. Hence I wouldn't
> > expect widespread enthusiasm to suddenly implement it...
>
> Thanks Dave!
> Your point as detailed below seems to be that, for xfs at least, it may
> be better to reduce hold times for exclusive locks rather than allow
> concurrent locks. That seems entirely credible for a local fs but
> doesn't apply for NFS as we cannot get a success status before the
> operation is complete.
How is that different from a local filesystem? A local filesystem
can't return from open(O_CREAT) with a struct file referencing a
newly allocated inode until the VFS inode is fully instantiated (or
failed), either...
i.e. this sounds like you want concurrent share-locked dirent ops so
that synchronously processed operations can be issued concurrently.
Could the NFS client implement asynchronous directory ops, keeping
track of the operations in flight without needing to hold the parent
i_rwsem across each individual operation? This is basically what I've
been describing for XFS to minimise parent dir lock hold times.
What would VFS support for that look like? Is that of similar
complexity to implementing shared locking support so that concurrent
blocking directory operations can be issued? Is async processing a
better model to move the directory ops towards so we can tie
userspace directly into it via io_uring?
> So it seems likely that different filesystems
> will want different approaches. No surprise.
>
> There is some evidence that ext4 can be converted to concurrent
> modification without a lot of work, and with measurable benefits. I
> guess I should focus there for local filesystems.
>
> But I don't want to assume what is best for each file system which is
> why I asked for input from developers of the various filesystems.
>
> But even for xfs, I think that to provide a successful return from mkdir
> would require waiting for some IO to complete, and that other operations
I don't see where IO enters the picture, to be honest. File creation
does not typically require foreground IO on XFS at all (ignoring
dirsync mode). How do you think we scale XFS to near a million file
creates a second? :)
In the case of mkdir, it does not take a direct reference to the
inode being created so it potentially doesn't even need to wait for
the completion of the operation. i.e. to use the new subdir it has
to be open()d; that means going through the ->lookup path, which
will block on I_NEW until the background creation is completed...
That said, open(O_CREAT) would need to call wait_on_inode()
somewhere to wait for I_NEW to clear so operations on the inode can
proceed immediately via the persistent struct file reference it
creates. With the right support, that waiting can be done without
holding the parent directory locked, as any new lookup on that
dirent/inode pair will block until I_NEW is cleared...
Hence my question above about what VFS support for async dirops
actually looks like, and whether something like this:
> might benefit from starting before that IO completes.
> So maybe an xfs implementation of mkdir_shared would be:
> - take internal exclusive lock on directory
> - run fast foreground part of mkdir
> - drop the lock
> - wait for background stuff, which could affect error return, to
> complete
> - return appropriate error, or success
as natively supported functionality might be a better solution to
the problem....
From the XFS perspective, we already have internal exclusive locking
for dirent mods, but that is needed for serialising the physical
directory mods against other (shared access) readers (e.g. lookup
and readdir). We would not want to be sharing such an internal lock
across the unbound fast path concurrency of lookup, create, unlink,
readdir -and- multiple background processing threads.
Logging a modification intent against an inode wouldn't require a
new internal inode lock; AFAICT all it requires is exclusivity
against lookup so that lookup can find new/unlinked dirents that
have been logged but not yet added to or removed from the physical
directory blocks.
We can do that by inserting the VFS inode into cache in the
foreground operation and leaving I_NEW set on create. We can do a
similar thing with unlink (I_WILL_FREE?) to make the VFS inode
invisible to new lookups before the unlink has actually been
processed. In this way, we can push the actual processing into
background work queues without actually changing how the namespace
behaves from a user perspective...
We -might- be able to do all these operations under a shared VFS
lock, but it is not necessary to have a shared VFS lock to enable
such an async processing model. If the performance gains for the NFS
client come from allowing concurrent processing of individual
synchronous operations, then we really need to consider if there are
other methods of hiding synchronous operation latency that might
be more applicable to more filesystems and high performance user
interfaces...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-21 1:20 ` Dave Chinner
@ 2025-01-21 13:04 ` Jeff Layton
2025-01-22 0:22 ` Dave Chinner
2025-02-03 17:19 ` Steve French
0 siblings, 2 replies; 10+ messages in thread
From: Jeff Layton @ 2025-01-21 13:04 UTC (permalink / raw)
To: Dave Chinner, NeilBrown; +Cc: lsf-pc, Al Viro, linux-fsdevel
On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > >
> > > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > > advise me on converting?) to use the new interface and do some testing
> > > > and be open to exploring any bugs that appear?
> > >
> > > tl;dr: You're asking for people to put in a *lot* of time to convert
> > > complex filesystems to concurrent directory modifications without
> > > clear indication that it will improve performance. Hence I wouldn't
> > > expect widespread enthusiasm to suddenly implement it...
> >
> > Thanks Dave!
> > Your point as detailed below seems to be that, for xfs at least, it may
> > be better to reduce hold times for exclusive locks rather than allow
> > concurrent locks. That seems entirely credible for a local fs but
> > doesn't apply for NFS as we cannot get a success status before the
> > operation is complete.
>
> How is that different from a local filesystem? A local filesystem
> can't return from open(O_CREAT) with a struct file referencing a
> newly allocated inode until the VFS inode is fully instantiated (or
> failed), either...
>
> i.e. this sounds like you want concurrent share-locked dirent ops so
> that synchronously processed operations can be issued concurrently.
>
> Could the NFS client implement asynchronous directory ops, keeping
> track of the operations in flight without needing to hold the parent
> i_rwsem across each individual operation? This is basically what I've
> been describing for XFS to minimise parent dir lock hold times.
>
Yes, basically. The protocol and NFS client have no requirement to
serialize directory operations. We'd be happy to spray as many at the
server in parallel as we can get away with. We don't do that
today, largely because the VFS prohibits it.
The NFS server, or exported filesystem may have requirements that
serialize these operations though.
> What would VFS support for that look like? Is that of similar
> complexity to implementing shared locking support so that concurrent
> blocking directory operations can be issued? Is async processing a
> better model to move the directory ops towards so we can tie
> userspace directly into it via io_uring?
>
Given that the VFS requires an exclusive lock today for directory
morphing ops, moving to a model where we can take a shared lock on the
directory instead seems like a nice, incremental approach to dealing
with this problem.
That said, I get your objection. Not being able to upgrade an rwsem
makes that shared lock kind of nasty for filesystems that actually do
rely on it for some parts of their work today.
The usual method of dealing with that would be to create a new XFS-only
per-inode lock that would take over that serialization. The nice thing
there is that you could (over time) reduce its scope.
> > So it seems likely that different filesystems
> > will want different approaches. No surprise.
> >
> > There is some evidence that ext4 can be converted to concurrent
> > modification without a lot of work, and with measurable benefits. I
> > guess I should focus there for local filesystems.
> >
> > But I don't want to assume what is best for each file system which is
> > why I asked for input from developers of the various filesystems.
> >
> > But even for xfs, I think that to provide a successful return from mkdir
> > would require waiting for some IO to complete, and that other operations
>
> I don't see where IO enters the picture, to be honest. File creation
> does not typically require foreground IO on XFS at all (ignoring
> dirsync mode). How do you think we scale XFS to near a million file
> creates a second? :)
>
> In the case of mkdir, it does not take a direct reference to the
> inode being created so it potentially doesn't even need to wait for
> the completion of the operation. i.e. to use the new subdir it has
> to be open()d; that means going through the ->lookup path, which
> will block on I_NEW until the background creation is completed...
>
> That said, open(O_CREAT) would need to call wait_on_inode()
> somewhere to wait for I_NEW to clear so operations on the inode can
> proceed immediately via the persistent struct file reference it
> creates. With the right support, that waiting can be done without
> holding the parent directory locked, as any new lookup on that
> dirent/inode pair will block until I_NEW is cleared...
>
> Hence my question above about what VFS support for async dirops
> actually looks like, and whether something like this:
>
> > might benefit from starting before that IO completes.
> > So maybe an xfs implementation of mkdir_shared would be:
> > - take internal exclusive lock on directory
> > - run fast foreground part of mkdir
> > - drop the lock
> > - wait for background stuff, which could affect error return, to
> > complete
> > - return appropriate error, or success
>
> as natively supported functionality might be a better solution to
> the problem....
>
> From the XFS perspective, we already have internal exclusive locking
> for dirent mods, but that is needed for serialising the physical
> directory mods against other (shared access) readers (e.g. lookup
> and readdir). We would not want to be sharing such an internal lock
> across the unbound fast path concurrency of lookup, create, unlink,
> readdir -and- multiple background processing threads.
>
> Logging a modification intent against an inode wouldn't require a
> new internal inode lock; AFAICT all it requires is exclusivity
> against lookup so that lookup can find new/unlinked dirents that
> have been logged but not yet added to or removed from the physical
> directory blocks.
>
> We can do that by inserting the VFS inode into cache in the
> foreground operation and leaving I_NEW set on create. We can do a
> similar thing with unlink (I_WILL_FREE?) to make the VFS inode
> invisible to new lookups before the unlink has actually been
> processed. In this way, we can push the actual processing into
> background work queues without actually changing how the namespace
> behaves from a user perspective...
>
> We -might- be able to do all these operations under a shared VFS
> lock, but it is not necessary to have a shared VFS lock to enable
> such an async processing model. If the performance gains for the NFS
> client come from allowing concurrent processing of individual
> synchronous operations, then we really need to consider if there are
> other methods of hiding synchronous operation latency that might
> be more applicable to more filesystems and high performance user
> interfaces...
>
So maybe we'd have something like:
struct mkdir_context {
struct mnt_idmap *idmap; // current args to mkdir op
struct inode *dir;
struct dentry *dentry;
umode_t mode;
int status; // return status
struct completion complete; // when done -- maybe this would be completion callback function?
};
...and an inode operation like:
int (*mkdir_begin)(struct mkdir_context *ctx);
Then the fs could just pass a pointer to that around and, when the
operation is done, signal the completion, which would let the original
task that called mkdir() finalize the results of the op?
My worry here would be that dealing with the workqueue and context
switching would be a performance killer in the simple cases that do it
all in a single thread today.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-21 13:04 ` Jeff Layton
@ 2025-01-22 0:22 ` Dave Chinner
2025-01-22 1:04 ` NeilBrown
2025-02-03 17:19 ` Steve French
1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2025-01-22 0:22 UTC (permalink / raw)
To: Jeff Layton; +Cc: NeilBrown, lsf-pc, Al Viro, linux-fsdevel
On Tue, Jan 21, 2025 at 08:04:46AM -0500, Jeff Layton wrote:
> On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > > >
> > > > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > > > advise me on converting?) to use the new interface and do some testing
> > > > > and be open to exploring any bugs that appear?
> > > >
> > > > tl;dr: You're asking for people to put in a *lot* of time to convert
> > > > complex filesystems to concurrent directory modifications without
> > > > clear indication that it will improve performance. Hence I wouldn't
> > > > expect widespread enthusiasm to suddenly implement it...
> > >
> > > Thanks Dave!
> > > Your point as detailed below seems to be that, for xfs at least, it may
> > > be better to reduce hold times for exclusive locks rather than allow
> > > concurrent locks. That seems entirely credible for a local fs but
> > > doesn't apply for NFS as we cannot get a success status before the
> > > operation is complete.
> >
> > How is that different from a local filesystem? A local filesystem
> > can't return from open(O_CREAT) with a struct file referencing a
> > newly allocated inode until the VFS inode is fully instantiated (or
> > failed), either...
> >
> > i.e. this sounds like you want concurrent share-locked dirent ops so
> > that synchronously processed operations can be issued concurrently.
> >
> > Could the NFS client implement asynchronous directory ops, keeping
> > track of the operations in flight without needing to hold the parent
> i_rwsem across each individual operation? This is basically what I've
> > been describing for XFS to minimise parent dir lock hold times.
> >
>
> Yes, basically. The protocol and NFS client have no requirement to
> serialize directory operations. We'd be happy to spray as many at the
> server in parallel as we can get away with. We currently don't do that
> today, largely because the VFS prohibits it.
>
> The NFS server, or exported filesystem may have requirements that
> serialize these operations though.
Sure,
>
> > What would VFS support for that look like? Is that of similar
> > complexity to implementing shared locking support so that concurrent
> > blocking directory operations can be issued? Is async processing a
> > better model to move the directory ops towards so we can tie
> > userspace directly into it via io_uring?
>
> Given that the VFS requires an exclusive lock today for directory
> morphing ops, moving to a model where we can take a shared lock on the
> directory instead seems like a nice, incremental approach to dealing
> with this problem.
I understand this, but it's not really "incremental" in that it
entrenches the "concurrency is only possible for synchronous
processing models" that shared locking across the entire operation
entails.
i.e. we can't hand the shared lock to another thread to release on
completion (e.g. an async processing pipeline) because lockdep will
get upset about that and {down,up}_read_non_owner() is very much
discouraged and {down,up}_write_non_owner() doesn't even exist.
> That said, I get your objection. Not being able to upgrade a rwsem
> makes that shared lock kind of nasty for filesystems that actually do
> rely on it for some parts of their work today.
It's more than that - the use of a rwsem for exclusion basically
forces us into "task that takes the lock must release the lock"
model, and so if the VFS expects the entire operation to be done
under a down_read() context then we must complete the entire
operation before handing back the completion to the original task.
That's why I'm pushing back against a "whole task" centric shared
locking model as it is ultimately unfriendly to efficient async
dispatch and completion of background tasks.
> The usual method of dealing with that would be to create a new XFS-only
> per-inode lock that would take over that serialization. The nice thing
> there is that you could (over time) reduce its scope.
As I've already said: we already have an internal per-inode rwsem in
XFS for protecting physical directory operations against races.
Suggesting this is the solution is missing the point I was trying to
make: that it doesn't actually solve anything and the better
solution for filesystems like XFS is to decouple the front end VFS
serialisation requirements from the back end filesystem
implementation.
That's kinda my point: this isn't an "XFS-only" problem - it's
something that almost every filesystem we have is going to have
problems with. I very much doubt btrfs will be able to do concurrent
directory mods due to its namespace btree exclusion model, and I
suspect that bcachefs is going to have the same issues, too.
Hence I think shared locking is fundamentally the wrong model here -
the problem that we need to address is the excessive latency of
synchronous back end FS ops, not the lack of concurrency in
processing directory modifications. Yes, a lack of back end
filesystem concurrency contributes to the excessive latency of
synchronous directory modification, but that doesn't mean that the
best solution to the problem is to change the concurrency model.
i.e. If we have to make the same mods to every filesystem to do
background async processing to take any sort of advantage of shared
locking, and those background processing mods don't actually require
a shared locking model to realise the performance benefits, then why
add the complexity of a "shared lock for modification" model in the
first place?
[snip stuff about how to decouple VFS/fs serialisation]
> So maybe we'd have something like:
>
> struct mkdir_context {
> 	struct mnt_idmap *idmap;	// current args to mkdir op
> 	struct inode *dir;
> 	struct dentry *dentry;
> 	umode_t mode;
> 	int status;			// return status
> 	struct completion complete;	// when done -- maybe this would be a completion callback function?
> };
>
> ...and an inode operation like:
>
> int (*mkdir_begin)(struct mkdir_context *ctx);
>
> Then the fs could just pass a pointer to that around, and when the
> operation is done, complete the variable, which would make the original
> task that called mkdir() finalize the results of the op?
That's still serial/synchronous task processing of the entire
operation and does nothing to hide the latency of the operation from
the user.
As I've already explained, this is not the model I've been talking
about.
Yes, we'd need a dynamically allocated control structure that
defines the inode instantiation task that needs completing, but
there's no need for completions within it as the caller can call
wait_on_inode() to wait for the async inode instantiation to
complete.
If the instantiation fails, then we mark the inode bad, stash the
errno somewhere in the inode, and the waiter then grabs the errno
and tears down the inode to clean up the mess.
> My worry here would be that dealing with the workqueue and context
> switching would be a performance killer in the simple cases that do it
> all in a single thread today.
Catch-22.
The premise behind shared locking is that mkdir operations block
(i.e. context switch multiple times) and so have high latency and
long lock hold times. Therefore we need shared locking to be able to
issue lots of them concurrently to get bulk directory performance.
The argument now being made is that adding context switches to mkdir
operations that -don't block- (i.e. the opposite behaviour that
shared locking is trying to address) will cause performance problems.
i.e. The "simple cases that do it all in a single thread today" are
blocking and taking multiple context switches on every operation.
e.g. nfs client submits to the server, has to wait for reply, so
there's two context switches per operation.
This misses the point of doing async background processing: it
removes the blocking from the foreground task until it is absolutely
necessary, hence helping the foreground process *not block* whilst
holding the parent directory lock and so reduce lock hold times and
increase the number of ops that can be done under that exclusive
lock.
Using shared locking doesn't change the context switch overhead.
Using async processing doesn't make the context switch overhead
worse. What changes is where those context switches occur and how
much foreground task and lock latency they result in. And in some
cases, async processing drastically reduces context switch overhead
because it allows for subsystems to batch process pending
operations.
Anyone who has been following io_uring development should know all
these things about async processing already. There's a reason that
that infrastructure exists: async processing is more efficient and
faster than the concurrent synchronous processing model being
proposed here....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-22 0:22 ` Dave Chinner
@ 2025-01-22 1:04 ` NeilBrown
0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2025-01-22 1:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel
On Wed, 22 Jan 2025, Dave Chinner wrote:
>
> Anyone who has been following io_uring development should know all
> these things about async processing already. There's a reason that
> that infrastructure exists: async processing is more efficient and
> faster than the concurrent synchronous processing model being
> proposed here....
I understand that asynchronous is best. I think we are a long way from
achieving that. I think shared locking is still a good step in that
direction.
Shared locking allows the exclusion to be pushed down into the
filesystem to whatever extent the filesystem needs. That will be needed
for an async approach too.
We already have a hint of async in the dcache in that ->lookup() can
complete without a result if an intent flag is set. The actual lookup
might then happen any time before the intended operation completes. For
NFS exclusive open, that lookup is combined with the create/open. For
unlink (which doesn't have an intent flag yet) it could be combined with
the nfs REMOVE operation (if that seemed like a good idea). Other
filesystems could do other things. But this is just a hint of async as
yet.
I imagine that in the longer term we could drop the i_rwsem completely
for directories. The VFS would set up a locked dentry much like it does
before ->lookup and then calls into the filesystem. The filesystem
might do the op synchronously or might take note of what is needed and
schedule the relevant changes or whatever. When the op finished it does
clear_and_wake_up_bit() (or similar) after stashing the result ...
somewhere.
For synchronous operations like syscalls, an on-stack result struct would
be passed which contains an error status and optionally a new dentry (if
e.g. mkdir found it needed to splice in an existing dentry).
For async operations io_uring would allocate the result struct and would
store in it a callback function to be called after the
clear_and_wake_up_bit().
Rather than using i_rwsem to block additions to a directory while it is
being removed, we would lock the dentry (so no more locked children can
be added) and wait for any locked children to be unlocked.
There are doubtless details that I have missed but it is clear that to
allow async dirops we need to remove the need for i_rwsem, and I think
transitioning from exclusive to shared is a useful step in that
direction.
I'm almost tempted to add the result struct to the new _shared
inode_operations that I want to add, but that would likely be premature.
Thanks,
NeilBrown
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
2025-01-21 13:04 ` Jeff Layton
2025-01-22 0:22 ` Dave Chinner
@ 2025-02-03 17:19 ` Steve French
1 sibling, 0 replies; 10+ messages in thread
From: Steve French @ 2025-02-03 17:19 UTC (permalink / raw)
To: Jeff Layton; +Cc: Dave Chinner, NeilBrown, lsf-pc, Al Viro, linux-fsdevel
On Tue, Jan 21, 2025 at 7:07 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > > >
> > > > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > > > advise me on converting?) to use the new interface and do some testing
> > > > > and be open to exploring any bugs that appear?
> > > >
> > > > tl;dr: You're asking for people to put in a *lot* of time to convert
> > > > complex filesystems to concurrent directory modifications without
> > > > clear indication that it will improve performance. Hence I wouldn't
> > > > expect widespread enthusiasm to suddenly implement it...
> > >
> > > Thanks Dave!
> > > Your point as detailed below seems to be that, for xfs at least, it may
> > > be better to reduce hold times for exclusive locks rather than allow
> > > concurrent locks. That seems entirely credible for a local fs but
> > > doesn't apply for NFS as we cannot get a success status before the
> > > operation is complete.
> >
> > How is that different from a local filesystem? A local filesystem
> > can't return from open(O_CREAT) with a struct file referencing a
> > newly allocated inode until the VFS inode is fully instantiated (or
> > failed), either...
> >
> > i.e. this sounds like you want concurrent share-locked dirent ops so
> > that synchronously processed operations can be issued concurrently.
> >
> > Could the NFS client implement asynchronous directory ops, keeping
> > track of the operations in flight without needing to hold the parent
> > i_rwsem across each individual operation? This is basically what I've
> > been describing for XFS to minimise parent dir lock hold times.
> >
>
> Yes, basically. The protocol and NFS client have no requirement to
> serialize directory operations. We'd be happy to spray as many at the
> server in parallel as we can get away with. We currently don't do that
> today, largely because the VFS prohibits it.
>
> The NFS server, or exported filesystem may have requirements that
> serialize these operations though.
>
> > What would VFS support for that look like? Is that of similar
> > complexity to implementing shared locking support so that concurrent
> > blocking directory operations can be issued? Is async processing a
> > better model to move the directory ops towards so we can tie
> > userspace directly into it via io_uring?
> >
>
> Given that the VFS requires an exclusive lock today for directory
> morphing ops, moving to a model where we can take a shared lock on the
> directory instead seems like a nice, incremental approach to dealing
> with this problem.
>
> That said, I get your objection. Not being able to upgrade a rwsem
> makes that shared lock kind of nasty for filesystems that actually do
> rely on it for some parts of their work today.
>
> The usual method of dealing with that would be to create a new XFS-only
> per-inode lock that would take over that serialization. The nice thing
> there is that you could (over time) reduce its scope.
>
> > > So it seems likely that different filesystems
> > > will want different approaches. No surprise.
> > >
> > > There is some evidence that ext4 can be converted to concurrent
> > > modification without a lot of work, and with measurable benefits. I
> > > guess I should focus there for local filesystems.
> > >
> > > But I don't want to assume what is best for each file system which is
> > > why I asked for input from developers of the various filesystems.
This is an interesting question for SMB3.1.1 as well (cifs.ko etc.), especially
in workloads where the client already has a directory lease on the directory
being updated (and with multichannel there could be quite a bit of i/o in
flight), but even without a directory lease there is no restriction on sending
multiple simultaneous directory related requests to the server
--
Thanks,
Steve