* [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
@ 2025-01-17 17:26 Jeff Layton
2025-01-18 1:06 ` NeilBrown
0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2025-01-17 17:26 UTC (permalink / raw)
To: lsf-pc; +Cc: NeilBrown, Al Viro, linux-fsdevel
We've hit a number of cases in testing recently where the parent's
i_rwsem ends up being the bottleneck in heavy parallel create
workloads. Currently we have to take the parent's inode->i_rwsem
exclusively when altering a directory, which means that any directory-
morphing operations in the same directory are serialized.
This is particularly onerous in the ->create codepath, since a
filesystem may have to do a number of blocking operations to create a
new file (allocate memory, start a transaction, etc.)
Neil recently posted this RFC series, which allows parallel directory
modifying operations:
https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@suse.de/
Al pointed out a number of problems in it, but the basic approach seems
sound. I'd like to have a discussion at LSF/MM about this.
Are there any problems with the basic approach? Are there other
approaches that might be better? Are there incremental steps we could
do to pave the way for this to be a reality?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer 2025-01-17 17:26 [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer Jeff Layton @ 2025-01-18 1:06 ` NeilBrown 2025-01-19 21:51 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: NeilBrown @ 2025-01-18 1:06 UTC (permalink / raw) To: Jeff Layton; +Cc: lsf-pc, Al Viro, linux-fsdevel On Sat, 18 Jan 2025, Jeff Layton wrote: > We've hit a number of cases in testing recently where the parent's > i_rwsem ends up being the bottleneck in heavy parallel create > workloads. Currently we have to take the parent's inode->i_rwsem > exclusively when altering a directory, which means that any directory- > morphing operations in the same directory are serialized. > > This is particularly onerous in the ->create codepath, since a > filesystem may have to do a number of blocking operations to create a > new file (allocate memory, start a transaction, etc.) > > Neil recently posted this RFC series, which allows parallel directory > modifying operations: > > https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@suse.de/ > > Al pointed out a number of problems in it, but the basic approach seems > sound. I'd like to have a discussion at LSF/MM about this. > > Are there any problems with the basic approach? Are there other > approaches that might be better? Are there incremental steps we could > do pave the way for this to be a reality? Thanks for raising this! There was at least one problem with the approach but I have a plan to address that. I won't go into detail here. I hope to get a new patch set out sometime in the coming week. My question to fs-devel is: is anyone willing to convert their fs (or advise me on converting?) to use the new interface and do some testing and be open to exploring any bugs that appear? 
I'd like to try ext4 using the patches that lustre maintains for parallel directory ops in ext4, but making them suitable for upstream doesn't look to be straightforward. https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob;f=ldiskfs/kernel_patches/patches/linux-6.5/ext4-pdirop.patch;h=208d9dc44f4860fbf27072ed1969744131e30108;hb=HEAD NeilBrown ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer 2025-01-18 1:06 ` NeilBrown @ 2025-01-19 21:51 ` Dave Chinner 2025-01-19 22:25 ` NeilBrown 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2025-01-19 21:51 UTC (permalink / raw) To: NeilBrown; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote: > On Sat, 18 Jan 2025, Jeff Layton wrote: > > We've hit a number of cases in testing recently where the parent's > > i_rwsem ends up being the bottleneck in heavy parallel create > > workloads. Currently we have to take the parent's inode->i_rwsem > > exclusively when altering a directory, which means that any directory- > > morphing operations in the same directory are serialized. > > > > This is particularly onerous in the ->create codepath, since a > > filesystem may have to do a number of blocking operations to create a > > new file (allocate memory, start a transaction, etc.) > > > > Neil recently posted this RFC series, which allows parallel directory > > modifying operations: > > > > https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@suse.de/ > > > > Al pointed out a number of problems in it, but the basic approach seems > > sound. I'd like to have a discussion at LSF/MM about this. > > > > Are there any problems with the basic approach? Are there other > > approaches that might be better? Are there incremental steps we could > > do pave the way for this to be a reality? > > Thanks for raising this! > There was at least one problem with the approach but I have a plan to > address that. I won't go into detail here. I hope to get a new > patch set out sometime in the coming week. > > My question to fs-devel is: is anyone willing to convert their fs (or > advice me on converting?) to use the new interface and do some testing > and be open to exploring any bugs that appear? 
tl;dr: You're asking for people to put in a *lot* of time to convert complex filesystems to concurrent directory modifications without clear indication that it will improve performance. Hence I wouldn't expect widespread enthusiasm to suddenly implement it... In more detail.... It's not exactly simple to take a directory tree structure that is exclusively locked for modification and make it safe for concurrent updates. It -might- be possible to make the directory updates in XFS more concurrent, but it still has an internal name hash btree index that would have to be completely re-written to support concurrent updates. That's also ignoring all the other bits of the filesystem that will single thread outside the directory. e.g. during create we have to allocate an inode, and locality algorithms will cluster new inodes in the same directory close together. That means they are covered by the same exclusive lock (e.g. the AGI and AGF header blocks in XFS). Unlink has the same problem. IOWs, it's not just directory ops and structures that need locking changes; the way filesystems do inode and block allocation and freeing also needs to change to support increased concurrency in directory operations. Hence I suspect that concurrent directory mods for filesystems like XFS will need a new processing model - possibly a low overhead intent-based modification model using in-memory whiteouts and async background batching of intents. We kinda already do this with unlink - we do the directory removal in the foreground, and defer the rest of the unlink (i.e. inode freeing) to async background worker threads. e.g. doing background batching of namespace ops means things like "rm *" in a directory doesn't need to transactionally modify the directory as it runs. We could track all the inodes we are unlinking via the intents and then simply truncate away the entire directory when it becomes empty and rmdir() is called. 
We still have to clean up and mark all the inodes free, but that can be done in the background. As such, I suspect that moving XFS to a more async processing model for directory namespace ops to minimise lock hold times will be simpler (and potentially faster!) than rewriting large chunks of the XFS directory and inode management operations to allow for i_rwsem/ILOCK/AGI/AGF locking concurrency... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer 2025-01-19 21:51 ` Dave Chinner @ 2025-01-19 22:25 ` NeilBrown 2025-01-20 11:55 ` Jeff Layton 2025-01-21 1:20 ` Dave Chinner 0 siblings, 2 replies; 10+ messages in thread From: NeilBrown @ 2025-01-19 22:25 UTC (permalink / raw) To: Dave Chinner; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel On Mon, 20 Jan 2025, Dave Chinner wrote: > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote: > > > > My question to fs-devel is: is anyone willing to convert their fs (or > > advice me on converting?) to use the new interface and do some testing > > and be open to exploring any bugs that appear? > > tl;dr: You're asking for people to put in a *lot* of time to convert > complex filesystems to concurrent directory modifications without > clear indication that it will improve performance. Hence I wouldn't > expect widespread enthusiasm to suddenly implement it... Thanks Dave! Your point as detailed below seems to be that, for xfs at least, it may be better to reduce hold times for exclusive locks rather than allow concurrent locks. That seems entirely credible for a local fs but doesn't apply for NFS as we cannot get a success status before the operation is complete. So it seems likely that different filesystems will want different approaches. No surprise. There is some evidence that ext4 can be converted to concurrent modification without a lot of work, and with measurable benefits. I guess I should focus there for local filesystems. But I don't want to assume what is best for each file system which is why I asked for input from developers of the various filesystems. But even for xfs, I think that to provide a successful return from mkdir would require waiting for some IO to complete, and that other operations might benefit from starting before that IO completes. 
So maybe an xfs implementation of mkdir_shared would be: - take internal exclusive lock on directory - run fast foreground part of mkdir - drop the lock - wait for background stuff, which could affect error return, to complete - return appropriate error, or success So xfs could clearly use exclusive locking where that is the best choice, but not have exclusive locking imposed for the entire operation. That is my core goal : don't impose a particular locking style - allow the filesystem to manage locking within an umbrella that ensures the guarantees that the vfs needs (like no creation inside a directory during rmdir). Thanks, NeilBrown > > In more detail.... > > It's not exactly simple to take a directory tree structure that is > exclusively locked for modification and make it safe for concurrent > updates. It -might- be possible to make the directory updates in XFS > more concurrent, but it still has an internal name hash btree index > that would have to be completely re-written to support concurrent > updates. > > That's also ignoring all the other bits of the filesystem that will > single thread outside the directory. e.g. during create we have to > allocate an inode, and locality algorithms will cluster new inodes > in the same directory close together. That means they are covered by > the same exclusive lock (e.g. the AGI and AGF header blocks in XFS). > Unlink has the same problem. > > IOWs, it's not just directory ops and structures that need locking > changes; the way filesystems do inode and block allocation and > freeing also needs to change to support increased concurrency in > directory operations. > > Hence I suspect that concurrent directory mods for filesystems like > XFS will need a new processing model - possibly a low overhead > intent-based modification model using in-memory whiteouts and async > background batching of intents. 
We kinda already do this with unlink > - we do the directory removal in the foreground, and defer the rest > of the unlink (i.e. inode freeing) to async background worker > threads. > > e.g. doing background batching of namespace ops means things like > "rm *" in a directory doesn't need to transactionally modify the > directory as it runs. We could track all the inodes we are unlinking > via the intents and then simply truncate away the entire directory > when it becomes empty and rmdir() is called. We still have to clean > up and mark all the inodes free, but that can be done in the > background. > > As such, I suspect that moving XFS to a more async processing model > for directory namespace ops to minimise lock hold times will be > simpler (and potentially faster!) than rewriting large chunks of the > XFS directory and inode management operations to allow for > i_rwsem/ILOCK/AGI/AGF locking concurrency... > > -Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer 2025-01-19 22:25 ` NeilBrown @ 2025-01-20 11:55 ` Jeff Layton 2025-01-21 1:20 ` Dave Chinner 1 sibling, 0 replies; 10+ messages in thread From: Jeff Layton @ 2025-01-20 11:55 UTC (permalink / raw) To: NeilBrown, Dave Chinner; +Cc: lsf-pc, Al Viro, linux-fsdevel On Mon, 2025-01-20 at 09:25 +1100, NeilBrown wrote: > On Mon, 20 Jan 2025, Dave Chinner wrote: > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote: > > > > > > My question to fs-devel is: is anyone willing to convert their fs (or > > > advice me on converting?) to use the new interface and do some testing > > > and be open to exploring any bugs that appear? > > > > tl;dr: You're asking for people to put in a *lot* of time to convert > > complex filesystems to concurrent directory modifications without > > clear indication that it will improve performance. Hence I wouldn't > > expect widespread enthusiasm to suddenly implement it... > > Thanks Dave! > Your point as detailed below seems to be that, for xfs at least, it may > be better to reduce hold times for exclusive locks rather than allow > concurrent locks. That seems entirely credible for a local fs but > doesn't apply for NFS as we cannot get a success status before the > operation is complete. So it seems likely that different filesystems > will want different approaches. No surprise. > > There is some evidence that ext4 can be converted to concurrent > modification without a lot of work, and with measurable benefits. I > guess I should focus there for local filesystems. > > But I don't want to assume what is best for each file system which is > why I asked for input from developers of the various filesystems. > > But even for xfs, I think that to provide a successful return from mkdir > would require waiting for some IO to complete, and that other operations > might benefit from starting before that IO completes. 
> So maybe an xfs implementation of mkdir_shared would be: > - take internal exclusive lock on directory > - run fast foreground part of mkdir > - drop the lock > - wait for background stuff, which could affect error return, to > complete > - return appropriate error, or success > > So xfs could clearly use exclusive locking where that is the best > choice, but not have exclusive locking imposed for the entire operation. > That is my core goal : don't impose a particular locking style - allow > the filesystem to manage locking within an umbrella that ensures the > guarantees that the vfs needs (like no creation inside a directory > during rmdir). > > I too don't think this approach is incompatible with XFS necessarily, but we may very well find that once we remove the bottleneck of the exclusive i_rwsem around directory modifications, that the bottleneck just moves down to within the filesystem driver. We have to start somewhere though! Also, I should mention that it looks like Al won't be able to attend LSF/MM this year. We may not want to schedule a time slot for this after all, as I think his involvement will be key here. > > > > > In more detail.... > > > > It's not exactly simple to take a directory tree structure that is > > exclusively locked for modification and make it safe for concurrent > > updates. It -might- be possible to make the directory updates in XFS > > more concurrent, but it still has an internal name hash btree index > > that would have to be completely re-written to support concurrent > > updates. > > > > That's also ignoring all the other bits of the filesystem that will > > single thread outside the directory. e.g. during create we have to > > allocate an inode, and locality algorithms will cluster new inodes > > in the same directory close together. That means they are covered by > > the same exclusive lock (e.g. the AGI and AGF header blocks in XFS). > > Unlink has the same problem. 
> > > > IOWs, it's not just directory ops and structures that need locking > > changes; the way filesystems do inode and block allocation and > > freeing also needs to change to support increased concurrency in > > directory operations. > > > > Hence I suspect that concurrent directory mods for filesystems like > > XFS will need a new processing model - possibly a low overhead > > intent-based modification model using in-memory whiteouts and async > > background batching of intents. We kinda already do this with unlink > > - we do the directory removal in the foreground, and defer the rest > > of the unlink (i.e. inode freeing) to async background worker > > threads. > > > > e.g. doing background batching of namespace ops means things like > > "rm *" in a directory doesn't need to transactionally modify the > > directory as it runs. We could track all the inodes we are unlinking > > via the intents and then simply truncate away the entire directory > > when it becomes empty and rmdir() is called. We still have to clean > > up and mark all the inodes free, but that can be done in the > > background. > > > > As such, I suspect that moving XFS to a more async processing model > > for directory namespace ops to minimise lock hold times will be > > simpler (and potentially faster!) than rewriting large chunks of the > > XFS directory and inode management operations to allow for > > i_rwsem/ILOCK/AGI/AGF locking concurrency... > > > > -Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer 2025-01-19 22:25 ` NeilBrown 2025-01-20 11:55 ` Jeff Layton @ 2025-01-21 1:20 ` Dave Chinner 2025-01-21 13:04 ` Jeff Layton 1 sibling, 1 reply; 10+ messages in thread From: Dave Chinner @ 2025-01-21 1:20 UTC (permalink / raw) To: NeilBrown; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote: > On Mon, 20 Jan 2025, Dave Chinner wrote: > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote: > > > > > > My question to fs-devel is: is anyone willing to convert their fs (or > > > advice me on converting?) to use the new interface and do some testing > > > and be open to exploring any bugs that appear? > > > > tl;dr: You're asking for people to put in a *lot* of time to convert > > complex filesystems to concurrent directory modifications without > > clear indication that it will improve performance. Hence I wouldn't > > expect widespread enthusiasm to suddenly implement it... > > Thanks Dave! > Your point as detailed below seems to be that, for xfs at least, it may > be better to reduce hold times for exclusive locks rather than allow > concurrent locks. That seems entirely credible for a local fs but > doesn't apply for NFS as we cannot get a success status before the > operation is complete. How is that different from a local filesystem? A local filesystem can't return from open(O_CREAT) with a struct file referencing a newly allocated inode until the VFS inode is fully instantiated (or failed), either... i.e. this sounds like you want concurrent share-locked dirent ops so that synchronously processed operations can be issued concurrently. Could the NFS client implement asynchronous directory ops, keeping track of the operations in flight without needing to hold the parent i_rwsem across each individual operation? This is basically what I've been describing for XFS to minimise parent dir lock hold times. 
What would VFS support for that look like? Is that of similar complexity to implementing shared locking support so that concurrent blocking directory operations can be issued? Is async processing a better model to move the directory ops towards so we can tie userspace directly into it via io_uring? > So it seems likely that different filesystems > will want different approaches. No surprise. > > There is some evidence that ext4 can be converted to concurrent > modification without a lot of work, and with measurable benefits. I > guess I should focus there for local filesystems. > > But I don't want to assume what is best for each file system which is > why I asked for input from developers of the various filesystems. > > But even for xfs, I think that to provide a successful return from mkdir > would require waiting for some IO to complete, and that other operations I don't see where IO enters the picture, to be honest. File creation does not typically require foreground IO on XFS at all (ignoring dirsync mode). How did you think we scale XFS to near a million file creates a second? :) In the case of mkdir, it does not take a direct reference to the inode being created so it potentially doesn't even need to wait for the completion of the operation. i.e. to use the new subdir it has to be open()d; that means going through the ->lookup path and which will block on I_NEW until the background creation is completed... That said, open(O_CREAT) would need to call wait_on_inode() somewhere to wait for I_NEW to clear so operations on the inode can proceed immediately via the persistent struct file reference it creates. With the right support, that waiting can be done without holding the parent directory locked, as any new lookup on that dirent/inode pair will block until I_NEW is cleared... Hence my question above about what does VFS support for async dirops actually looks like, and whether something like this: > might benefit from starting before that IO completes. 
> So maybe an xfs implementation of mkdir_shared would be: > - take internal exclusive lock on directory > - run fast foreground part of mkdir > - drop the lock > - wait for background stuff, which could affect error return, to > complete > - return appropriate error, or success as natively supported functionality might be a better solution to the problem.... From the XFS perspective, we already have internal exclusive locking for dirent mods, but that is needed for serialising the physical directory mods against other (shared access) readers (e.g. lookup and readdir). We would not want to be sharing such an internal lock across the unbound fast path concurrency of lookup, create, unlink, readdir -and- multiple background processing threads. Logging a modification intent against an inode wouldn't require a new internal inode lock; AFAICT all it requires is exclusivity against lookup so that lookup can find new/unlinked dirents that have been logged but not yet added to or removed from the physical directory blocks. We can do that by inserting the VFS inode into cache in the foreground operation and leaving I_NEW set on create. We can do a similar thing with unlink (I_WILL_FREE?) to make the VFS inode invisible to new lookups before the unlink has actually been processed. In this way, we can push the actual processing into background work queues without actually changing how the namespace behaves from a user perspective... We -might- be able to do all these operations under a shared VFS lock, but it is not necessary to have a shared VFS lock to enable such an async processing model. If the performance gains for the NFS client come from allowing concurrent processing of individual synchronous operations, then we really need to consider if there are other methods of hiding synchronous operation latency that might be more applicable to more filesystems and high performance user interfaces... -Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer 2025-01-21 1:20 ` Dave Chinner @ 2025-01-21 13:04 ` Jeff Layton 2025-01-22 0:22 ` Dave Chinner 2025-02-03 17:19 ` Steve French 0 siblings, 2 replies; 10+ messages in thread From: Jeff Layton @ 2025-01-21 13:04 UTC (permalink / raw) To: Dave Chinner, NeilBrown; +Cc: lsf-pc, Al Viro, linux-fsdevel On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote: > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote: > > On Mon, 20 Jan 2025, Dave Chinner wrote: > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote: > > > > > > > > My question to fs-devel is: is anyone willing to convert their fs (or > > > > advice me on converting?) to use the new interface and do some testing > > > > and be open to exploring any bugs that appear? > > > > > > tl;dr: You're asking for people to put in a *lot* of time to convert > > > complex filesystems to concurrent directory modifications without > > > clear indication that it will improve performance. Hence I wouldn't > > > expect widespread enthusiasm to suddenly implement it... > > > > Thanks Dave! > > Your point as detailed below seems to be that, for xfs at least, it may > > be better to reduce hold times for exclusive locks rather than allow > > concurrent locks. That seems entirely credible for a local fs but > > doesn't apply for NFS as we cannot get a success status before the > > operation is complete. > > How is that different from a local filesystem? A local filesystem > can't return from open(O_CREAT) with a struct file referencing a > newly allocated inode until the VFS inode is fully instantiated (or > failed), either... > > i.e. this sounds like you want concurrent share-locked dirent ops so > that synchronously processed operations can be issued concurrently. 
> > Could the NFS client implement asynchronous directory ops, keeping > track of the operations in flight without needing to hold the parent > i_rwsem across each individual operation? This basically what I've > been describing for XFS to minimise parent dir lock hold times. > Yes, basically. The protocol and NFS client have no requirement to serialize directory operations. We'd be happy to spray as many at the server in parallel as we can get away with. We currently don't do that today, largely because the VFS prohibits it. The NFS server, or exported filesystem may have requirements that serialize these operations though. > What would VFS support for that look like? Is that of similar > complexity to implementing shared locking support so that concurrent > blocking directory operations can be issued? Is async processing a > better model to move the directory ops towards so we can tie > userspace directly into it via io_uring? > Given that the VFS requires an exclusive lock today for directory morphing ops, moving to a model where we can take a shared lock on the directory instead seems like a nice, incremental approach to dealing with this problem. That said, I get your objection. Not being able to upgrade a rwsem makes that shared lock kind of nasty for filesystems that actually do rely on it for some parts of their work today. The usual method of dealing with that would be to create a new XFS-only per-inode lock that would take over that serialization. The nice thing there is that you could (over time) reduce its scope. > > So it seems likely that different filesystems > > will want different approaches. No surprise. > > > > There is some evidence that ext4 can be converted to concurrent > > modification without a lot of work, and with measurable benefits. I > > guess I should focus there for local filesystems. > > > > But I don't want to assume what is best for each file system which is > > why I asked for input from developers of the various filesystems. 
> > > > But even for xfs, I think that to provide a successful return from mkdir > > would require waiting for some IO to complete, and that other operations > > I don't see where IO enters the picture, to be honest. File creation > does not typically require foreground IO on XFS at all (ignoring > dirsync mode). How did you think we scale XFS to near a million file > creates a second? :) > > In the case of mkdir, it does not take a direct reference to the > inode being created so it potentially doesn't even need to wait for > the completion of the operation. i.e. to use the new subdir it has > to be open()d; that means going through the ->lookup path and which > will block on I_NEW until the background creation is completed... > > That said, open(O_CREAT) would need to call wait_on_inode() > somewhere to wait for I_NEW to clear so operations on the inode can > proceed immediately via the persistent struct file reference it > creates. With the right support, that waiting can be done without > holding the parent directory locked, as any new lookup on that > dirent/inode pair will block until I_NEW is cleared... > > Hence my question above about what does VFS support for > async dirops actually looks like, and whether something like this: > > > might benefit from starting before that IO completes. > > So maybe an xfs implementation of mkdir_shared would be: > > - take internal exclusive lock on directory > > - run fast foreground part of mkdir > > - drop the lock > > - wait for background stuff, which could affect error return, to > > complete > > - return appropriate error, or success > > as natively supported functionality might be a better solution to > the problem.... > > From the XFs perspective, we already have internal exclusive locking > for dirent mods, but that is needed for serialising the physical > directory mods against other (shared access) readers (e.g. lookup > and readdir). 
We would not want to be sharing such an internal lock > across the unbound fast path concurrency of lookup, create, unlink, > readdir -and- multiple background processing threads. > > Logging a modification intent against an inode wouldn't require a > new internal inode lock; AFAICT all it requires is exclusivity > against lookup so that lookup can find new/unlinked dirents that > have been logged but not yet added to or removed from the physical > directory blocks. > > We can do that by inserting the VFS inode into cache in the > foreground operation and leaving I_NEW set on create. We can do a > similar thing with unlink (I_WILL_FREE?) to make the VFS inode > invisible to new lookups before the unlink has actually been > processed. In this way, we can push the actual processing into > background work queues without actually changing how the namespace > behaves from a user perspective... > > We -might- be able to do all these operations under a shared VFS > lock, but it is not necessary to have a shared VFS lock to enable > such a async processing model. If the performance gains for the NFS > client come from allowing concurrent processing of individual > synchronous operations, then we really need to consider if there are > other methods of hiding synchronous operation latency that might > be more applicable to more filesystems and high performance user > interfaces... >
So maybe we'd have something like:

struct mkdir_context {
	struct mnt_idmap *idmap;	// current args to mkdir op
	struct inode *dir;
	struct dentry *dentry;
	umode_t mode;
	int status;			// return status
	struct completion complete;	// when done -- maybe this would be a completion callback function?
};

...and an inode operation like:

	int (*mkdir_begin)(struct mkdir_context *ctx);

Then the fs could just pass a pointer to that around, and when the operation is done, complete the completion, which would make the original task that called mkdir() finalize the results of the op? 
My worry here would be that dealing with the workqueue and context
switching would be a performance killer in the simple cases that do it
all in a single thread today.
--
Jeff Layton <jlayton@kernel.org>
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
  2025-01-21 13:04 ` Jeff Layton
@ 2025-01-22  0:22 ` Dave Chinner
  2025-01-22  1:04 ` NeilBrown
  2025-02-03 17:19 ` Steve French
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2025-01-22 0:22 UTC (permalink / raw)
To: Jeff Layton; +Cc: NeilBrown, lsf-pc, Al Viro, linux-fsdevel

On Tue, Jan 21, 2025 at 08:04:46AM -0500, Jeff Layton wrote:
> On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > > >
> > > > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > > > advise me on converting?) to use the new interface and do some testing
> > > > > and be open to exploring any bugs that appear?
> > > >
> > > > tl;dr: You're asking for people to put in a *lot* of time to convert
> > > > complex filesystems to concurrent directory modifications without
> > > > clear indication that it will improve performance. Hence I wouldn't
> > > > expect widespread enthusiasm to suddenly implement it...
> > >
> > > Thanks Dave!
> > > Your point as detailed below seems to be that, for xfs at least, it may
> > > be better to reduce hold times for exclusive locks rather than allow
> > > concurrent locks. That seems entirely credible for a local fs but
> > > doesn't apply for NFS as we cannot get a success status before the
> > > operation is complete.
> >
> > How is that different from a local filesystem? A local filesystem
> > can't return from open(O_CREAT) with a struct file referencing a
> > newly allocated inode until the VFS inode is fully instantiated (or
> > failed), either...
> >
> > i.e. this sounds like you want concurrent share-locked dirent ops so
> > that synchronously processed operations can be issued concurrently.
> >
> > Could the NFS client implement asynchronous directory ops, keeping
> > track of the operations in flight without needing to hold the parent
> > i_rwsem across each individual operation? This is basically what I've
> > been describing for XFS to minimise parent dir lock hold times.
> >
>
> Yes, basically. The protocol and NFS client have no requirement to
> serialize directory operations. We'd be happy to spray as many at the
> server in parallel as we can get away with. We currently don't do that
> today, largely because the VFS prohibits it.
>
> The NFS server, or exported filesystem may have requirements that
> serialize these operations though.

Sure,

> > What would VFS support for that look like? Is that of similar
> > complexity to implementing shared locking support so that concurrent
> > blocking directory operations can be issued? Is async processing a
> > better model to move the directory ops towards so we can tie
> > userspace directly into it via io_uring?
>
> Given that the VFS requires an exclusive lock today for directory
> morphing ops, moving to a model where we can take a shared lock on the
> directory instead seems like a nice, incremental approach to dealing
> with this problem.

I understand this, but it's not really "incremental" in that it
entrenches the "concurrency is only possible for synchronous
processing models" that shared locking across the entire operation
entails.

i.e. we can't hand the shared lock to another thread to release on
completion (e.g. an async processing pipeline) because lockdep will
get upset about that, {down,up}_read_non_owner() is very much
discouraged, and {down,up}_write_non_owner() doesn't even exist.

> That said, I get your objection. Not being able to upgrade a rwsem
> makes that shared lock kind of nasty for filesystems that actually do
> rely on it for some parts of their work today.
It's more than that - the use of a rwsem for exclusion basically
forces us into a "task that takes the lock must release the lock"
model, and so if the VFS expects the entire operation to be done
under a down_read() context then we must complete the entire
operation before handing back the completion to the original task.

That's why I'm pushing back against a "whole task" centric shared
locking model as it is ultimately unfriendly to efficient async
dispatch and completion of background tasks.

> The usual method of dealing with that would be to create a new XFS-only
> per-inode lock that would take over that serialization. The nice thing
> there is that you could (over time) reduce its scope.

As I've already said: we already have an internal per-inode rwsem in
XFS for protecting physical directory operations against races.
Suggesting this is the solution is missing the point I was trying to
make: that it doesn't actually solve anything, and the better
solution for filesystems like XFS is to decouple the front end VFS
serialisation requirements from the back end filesystem
implementation.

That's kinda my point: this isn't an "XFS-only" problem - it's
something that almost every filesystem we have is going to have
problems with. I very much doubt btrfs will be able to do concurrent
directory mods due to its namespace btree exclusion model, and I
suspect that bcachefs is going to have the same issues, too.

Hence I think shared locking is fundamentally the wrong model here -
the problem that we need to address is the excessive latency of
synchronous back end FS ops, not the lack of concurrency in
processing directory modifications.

Yes, a lack of back end filesystem concurrency contributes to the
excessive latency of synchronous directory modification, but that
doesn't mean that the best solution to the problem is to change the
concurrency model. i.e.
If we have to make the same mods to every filesystem to do
background async processing to take any sort of advantage of shared
locking, and those background processing mods don't actually require
a shared locking model to realise the performance benefits, then why
add the complexity of a "shared lock for modification" model in the
first place?

[snip stuff about how to decouple VFS/fs serialisation]

> So maybe we'd have something like:
>
> struct mkdir_context {
> 	struct mnt_idmap *idmap;	// current args to mkdir op
> 	struct inode *dir;
> 	struct dentry *dentry;
> 	umode_t mode;
> 	int status;			// return status
> 	struct completion complete;	// when done -- maybe this would be
> 					// a completion callback function?
> };
>
> ...and an inode operation like:
>
> int (*mkdir_begin)(struct mkdir_context *ctx);
>
> Then the fs could just pass a pointer to that around, and when the
> operation is done, complete the variable, which would make the original
> task that called mkdir() finalize the results of the op?

That's still serial/synchronous task processing of the entire
operation and does nothing to hide the latency of the operation from
the user. As I've already explained, this is not the model I've been
talking about.

Yes, we'd need a dynamically allocated control structure that
defines the inode instantiation task that needs completing, but
there's no need for completions within it, as the caller can call
wait_on_inode() to wait for the async inode instantiation to
complete. If the instantiation fails, then we mark the inode bad,
stash the errno somewhere in the inode, and the waiter then grabs
the errno and tears down the inode to clean up the mess.

> My worry here would be that dealing with the workqueue and context
> switching would be a performance killer in the simple cases that do it
> all in a single thread today.

Catch-22. The premise behind shared locking is that mkdir operations
block (i.e. context switch multiple times) and so have high latency
and long lock hold times.
Therefore we need shared locking to be able to issue lots of them
concurrently to get bulk directory performance.

The argument now being made is that adding context switches to mkdir
operations that -don't block- (i.e. the opposite behaviour to what
shared locking is trying to address) will cause performance
problems.

i.e. the "simple cases that do it all in a single thread today" are
blocking and taking multiple context switches on every operation.
e.g. the nfs client submits to the server and has to wait for the
reply, so there's two context switches per operation.

This misses the point of doing async background processing: it
removes the blocking from the foreground task until it is absolutely
necessary, hence helping the foreground process *not block* whilst
holding the parent directory lock, and so reduces lock hold times
and increases the number of ops that can be done under that
exclusive lock.

Using shared locking doesn't change the context switch overhead.
Using async processing doesn't make the context switch overhead
worse. What changes is where those context switches occur and how
much foreground task and lock latency they result in. And in some
cases, async processing drastically reduces context switch overhead
because it allows subsystems to batch process pending operations.

Anyone who has been following io_uring development should know all
these things about async processing already. There's a reason that
infrastructure exists: async processing is more efficient and
faster than the concurrent synchronous processing model being
proposed here....

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
  2025-01-22  0:22 ` Dave Chinner
@ 2025-01-22  1:04 ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2025-01-22 1:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jeff Layton, lsf-pc, Al Viro, linux-fsdevel

On Wed, 22 Jan 2025, Dave Chinner wrote:
>
> Anyone who has been following io_uring development should know all
> these things about async processing already. There's a reason that
> that infrastructure exists: async processing is more efficient and
> faster than the concurrent synchronous processing model being
> proposed here....

I understand that asynchronous is best. I think we are a long way from
achieving that. I think shared locking is still a good step in that
direction. Shared locking allows the exclusion to be pushed down into
the filesystem to whatever extent the filesystem needs. That will be
needed for an async approach too.

We already have a hint of async in the dcache in that ->lookup() can
complete without a result if an intent flag is set. The actual lookup
might then happen any time before the intended operation completes.
For NFS exclusive open, that lookup is combined with the create/open.
For unlink (which doesn't have an intent flag yet) it could be combined
with the nfs REMOVE operation (if that seemed like a good idea). Other
filesystems could do other things. But this is just a hint of async as
yet.

I imagine that in the longer term we could drop the i_rwsem completely
for directories. The VFS would set up a locked dentry much like it does
before ->lookup and then call into the filesystem. The filesystem might
do the op synchronously or might take note of what is needed and
schedule the relevant changes or whatever. When the op finishes it does
clear_and_wake_up_bit() (or similar) after stashing the result ...
somewhere.
For synchronous operations like syscalls, an on-stack result struct
would be passed which contains an error status and optionally a new
dentry (if e.g. mkdir found it needed to splice in an existing dentry).
For async operations io_uring would allocate the result struct and
would store in it a callback function to be called after the
clear_and_wake_up_bit().

Rather than using i_rwsem to block additions to a directory while it is
being removed, we would lock the dentry (so no more locked children can
be added) and wait for any locked children to be unlocked.

There are doubtless details that I have missed but it is clear that to
allow async dirops we need to remove the need for i_rwsem, and I think
transitioning from exclusive to shared is a useful step in that
direction.

I'm almost tempted to add the result struct to the new _shared
inode_operations that I want to add, but that would likely be
premature.

Thanks,
NeilBrown
* Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer
  2025-01-21 13:04 ` Jeff Layton
  2025-01-22  0:22 ` Dave Chinner
@ 2025-02-03 17:19 ` Steve French
  1 sibling, 0 replies; 10+ messages in thread
From: Steve French @ 2025-02-03 17:19 UTC (permalink / raw)
To: Jeff Layton; +Cc: Dave Chinner, NeilBrown, lsf-pc, Al Viro, linux-fsdevel

On Tue, Jan 21, 2025 at 7:07 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > > >
> > > > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > > > advise me on converting?) to use the new interface and do some testing
> > > > > and be open to exploring any bugs that appear?
> > > >
> > > > tl;dr: You're asking for people to put in a *lot* of time to convert
> > > > complex filesystems to concurrent directory modifications without
> > > > clear indication that it will improve performance. Hence I wouldn't
> > > > expect widespread enthusiasm to suddenly implement it...
> > >
> > > Thanks Dave!
> > > Your point as detailed below seems to be that, for xfs at least, it may
> > > be better to reduce hold times for exclusive locks rather than allow
> > > concurrent locks. That seems entirely credible for a local fs but
> > > doesn't apply for NFS as we cannot get a success status before the
> > > operation is complete.
> >
> > How is that different from a local filesystem? A local filesystem
> > can't return from open(O_CREAT) with a struct file referencing a
> > newly allocated inode until the VFS inode is fully instantiated (or
> > failed), either...
> >
> > i.e. this sounds like you want concurrent share-locked dirent ops so
> > that synchronously processed operations can be issued concurrently.
> >
> > Could the NFS client implement asynchronous directory ops, keeping
> > track of the operations in flight without needing to hold the parent
> > i_rwsem across each individual operation? This is basically what I've
> > been describing for XFS to minimise parent dir lock hold times.
> >
>
> Yes, basically. The protocol and NFS client have no requirement to
> serialize directory operations. We'd be happy to spray as many at the
> server in parallel as we can get away with. We currently don't do that
> today, largely because the VFS prohibits it.
>
> The NFS server, or exported filesystem may have requirements that
> serialize these operations though.
>
> > What would VFS support for that look like? Is that of similar
> > complexity to implementing shared locking support so that concurrent
> > blocking directory operations can be issued? Is async processing a
> > better model to move the directory ops towards so we can tie
> > userspace directly into it via io_uring?
>
> Given that the VFS requires an exclusive lock today for directory
> morphing ops, moving to a model where we can take a shared lock on the
> directory instead seems like a nice, incremental approach to dealing
> with this problem.
>
> That said, I get your objection. Not being able to upgrade a rwsem
> makes that shared lock kind of nasty for filesystems that actually do
> rely on it for some parts of their work today.
>
> The usual method of dealing with that would be to create a new XFS-only
> per-inode lock that would take over that serialization. The nice thing
> there is that you could (over time) reduce its scope.
>
> > > So it seems likely that different filesystems
> > > will want different approaches. No surprise.
> > >
> > > There is some evidence that ext4 can be converted to concurrent
> > > modification without a lot of work, and with measurable benefits. I
> > > guess I should focus there for local filesystems.
> > >
> > > But I don't want to assume what is best for each file system, which is
> > > why I asked for input from developers of the various filesystems.

This is an interesting question for SMB3.1.1 as well (cifs.ko etc.),
especially in workloads where the client already has a directory lease
on the directory being updated (and with multichannel there could be
quite a bit of I/O in flight), but even without a directory lease there
is no restriction on sending multiple simultaneous directory-related
requests to the server.

--
Thanks,

Steve
end of thread, other threads:[~2025-02-03 17:20 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2025-01-17 17:26 [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer Jeff Layton
2025-01-18  1:06 ` NeilBrown
2025-01-19 21:51   ` Dave Chinner
2025-01-19 22:25     ` NeilBrown
2025-01-20 11:55       ` Jeff Layton
2025-01-21  1:20         ` Dave Chinner
2025-01-21 13:04           ` Jeff Layton
2025-01-22  0:22             ` Dave Chinner
2025-01-22  1:04               ` NeilBrown
2025-02-03 17:19             ` Steve French