From: NeilBrown <neil@brown.name>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org
Subject: [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories
Date: Mon, 9 Jun 2025 17:34:05 +1000
Message-ID: <20250609075950.159417-1-neil@brown.name>
These patches are still under development; in particular, proper
documentation is still needed. They are sufficient to demonstrate my design.
They add an alternate mechanism for providing the locking that the VFS
needs for directory operations. This includes:
- only one operation per name at a time
- no operations in a directory being removed
- no concurrent cross-directory renames which might result in an
ancestor loop
I had originally hoped to push the locking of i_rw_sem down into the
filesystems and have the new locking on top of that. This turned out to
be impractical. This series leaves the i_rw_sem locking where it is,
introduces new locking that happens while the directory is locked, and
gives the filesystem the option of disabling (most of) the i_rw_sem
locking. Once all filesystems are converted the i_rw_sem locking can be
removed.
A shared lock on i_rw_sem is still used for readdir and simple lookup, to
exclude them while rmdir is happening.
The problem with pushing i_rw_sem down is that I still want to use it to
exclude readdir while rmdir is happening. Some readdir implementations
use the result to prime the dcache which means creating d_in_lookup()
dentries in the directory. If we can do this while holding i_rw_sem,
then it is not safe to take i_rw_sem while holding a d_in_lookup()
dentry. So i_rw_sem CANNOT be taken after a lookup has been performed -
it must be before, or never.
Another issue is that after taking i_rw_sem in rmdir() I need to wait
for any dentries that are still locked. Waiting for the dentry lock
while holding i_rw_sem means we cannot take i_rw_sem after getting a
dentry lock.
So we take i_rw_sem for filesystems that still require it (initially
all) but still do the other locking which will be uncontended. This
exercises the code to help ensure it is ready when we remove the
i_rw_sem requirement for any given filesystem.
The central feature is a per-dentry lock implemented with a couple of
d_flags and wait_var_event/wake_up_var. A single thread can take 1,
sometimes 2, occasionally 3 locks on different dentries.
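To make the shape of the per-dentry lock concrete, here is a minimal
userspace sketch. It is a model under stated assumptions, not the code from
the patches: struct model_dentry, MODEL_DENTRY_LOCKED and the model_*()
helpers are hypothetical names, a pthread mutex stands in for whatever
protects d_flags, a condition variable stands in for
wait_var_event/wake_up_var, and only one flag bit is modelled where the
series uses a couple.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical flag bit; the series uses a couple of bits in d_flags. */
#define MODEL_DENTRY_LOCKED 0x1u

struct model_dentry {
	unsigned int flags;	/* protected by 'guard' */
	pthread_mutex_t guard;	/* stands in for the lock protecting d_flags */
	pthread_cond_t waitq;	/* stands in for wait_var_event/wake_up_var */
};

static void model_dentry_init(struct model_dentry *d)
{
	d->flags = 0;
	pthread_mutex_init(&d->guard, NULL);
	pthread_cond_init(&d->waitq, NULL);
}

/* Take the lock only if it is free; never sleeps waiting for a holder. */
static bool model_dentry_trylock(struct model_dentry *d)
{
	bool got = false;

	pthread_mutex_lock(&d->guard);
	if (!(d->flags & MODEL_DENTRY_LOCKED)) {
		d->flags |= MODEL_DENTRY_LOCKED;
		got = true;
	}
	pthread_mutex_unlock(&d->guard);
	return got;
}

/* Sleep until the flag is clear, then set it. */
static void model_dentry_lock(struct model_dentry *d)
{
	pthread_mutex_lock(&d->guard);
	while (d->flags & MODEL_DENTRY_LOCKED)
		pthread_cond_wait(&d->waitq, &d->guard);
	d->flags |= MODEL_DENTRY_LOCKED;
	pthread_mutex_unlock(&d->guard);
}

/* Clear the flag and wake every waiter so one of them can retry. */
static void model_dentry_unlock(struct model_dentry *d)
{
	pthread_mutex_lock(&d->guard);
	assert(d->flags & MODEL_DENTRY_LOCKED);
	d->flags &= ~MODEL_DENTRY_LOCKED;
	pthread_mutex_unlock(&d->guard);
	pthread_cond_broadcast(&d->waitq);
}
```

The point of the model is only that the lock state lives in the dentry's
own flags word, so no per-dentry mutex or semaphore object is needed.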
A second lock is needed for rename - we lock the two dentries in
address-order after confirming there is no hierarchical relationship.
It is also needed for silly-rename as part of unlink. In this case the
plan is for the second dentry to always be a d_in_lookup dentry so the
lock is guaranteed to be uncontended. I'm not sure I got that finished
yet.
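The address-order rule for the two rename locks can be sketched as below.
This is a userspace illustration only: struct model_dentry and the
model_*() helpers are hypothetical, a pthread mutex stands in for the
per-dentry lock, and the check for a hierarchical relationship is assumed
to have been done by the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct model_dentry {
	pthread_mutex_t lock;	/* stands in for the per-dentry lock */
};

/* Reorder the pair so that *a holds the lower address. */
static void model_order(struct model_dentry **a, struct model_dentry **b)
{
	if ((uintptr_t)*a > (uintptr_t)*b) {
		struct model_dentry *tmp = *a;

		*a = *b;
		*b = tmp;
	}
}

/*
 * Lock two distinct dentries, lower address first. Two threads that
 * lock the same pair with the arguments swapped then acquire in the
 * same order and cannot deadlock against each other. The caller is
 * assumed to have confirmed that neither dentry is an ancestor of
 * the other.
 */
static void model_lock_two(struct model_dentry *a, struct model_dentry *b)
{
	assert(a != b);
	model_order(&a, &b);
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

/* Release order does not matter for deadlock avoidance. */
static void model_unlock_two(struct model_dentry *a, struct model_dentry *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}
```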
The three-dentry case is a rename which results in a silly-rename of the
target.
For rmdir we introduce S_DYING so that marking a directory as S_DEAD is
two-stage. We mark it S_DYING, which will prevent more dentry locks
being taken, then we wait for the locks that were already taken, then
set S_DEAD.
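The two-stage sequence can be modelled in userspace roughly as follows.
Everything here is an assumption made for illustration: struct model_dir,
the MODEL_S_* bits, the locked_children counter and the model_*() helpers
are hypothetical, and the patches need not track lock holders with a
counter at all. The sketch also leaves the dying bit set once the
directory is dead, which may differ from the real code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for S_DYING and S_DEAD. */
#define MODEL_S_DYING 0x1u
#define MODEL_S_DEAD  0x2u

struct model_dir {
	unsigned int flags;
	unsigned int locked_children;	/* dentry locks held in this dir */
	pthread_mutex_t guard;
	pthread_cond_t drained;
};

static void model_dir_init(struct model_dir *dir)
{
	dir->flags = 0;
	dir->locked_children = 0;
	pthread_mutex_init(&dir->guard, NULL);
	pthread_cond_init(&dir->drained, NULL);
}

/* Take a dentry lock in 'dir'; refused once the dir is dying or dead. */
static bool model_child_lock(struct model_dir *dir)
{
	bool ok = false;

	pthread_mutex_lock(&dir->guard);
	if (!(dir->flags & (MODEL_S_DYING | MODEL_S_DEAD))) {
		dir->locked_children++;
		ok = true;
	}
	pthread_mutex_unlock(&dir->guard);
	return ok;
}

/* Drop a dentry lock and wake rmdir when the last holder leaves. */
static void model_child_unlock(struct model_dir *dir)
{
	pthread_mutex_lock(&dir->guard);
	assert(dir->locked_children > 0);
	if (--dir->locked_children == 0)
		pthread_cond_broadcast(&dir->drained);
	pthread_mutex_unlock(&dir->guard);
}

/* Stage 1: stop new dentry locks from being taken. */
static void model_rmdir_begin(struct model_dir *dir)
{
	pthread_mutex_lock(&dir->guard);
	dir->flags |= MODEL_S_DYING;
	pthread_mutex_unlock(&dir->guard);
}

/* Stage 2: wait for locks that were already taken, then mark dead. */
static void model_rmdir_finish(struct model_dir *dir)
{
	pthread_mutex_lock(&dir->guard);
	while (dir->locked_children > 0)
		pthread_cond_wait(&dir->drained, &dir->guard);
	dir->flags |= MODEL_S_DEAD;
	pthread_mutex_unlock(&dir->guard);
}
```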
For rename ... maybe just read the patch. I tried to explain it
thoroughly.
The goal is to perform create/remove/rename without any mutex/semaphore
held by the VFS. This will allow concurrent operations in a directory
and prepare the way for async operation so that e.g. io_uring could be
given a list of many names in a directory to unlink and it could unlink
them in parallel. We probably need to make changes to the locking on
the inode being removed before this can be fully achieved - I haven't
explored that in detail yet.
Thanks,
NeilBrown
Thread overview: 15+ messages
2025-06-09 7:34 NeilBrown [this message]
2025-06-09 7:34 ` [PATCH 1/8] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
2025-06-09 7:34 ` [PATCH 2/8] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() NeilBrown
2025-06-09 7:34 ` [PATCH 3/8] fs/proc: take rcu_read_lock() in proc_sys_compare() NeilBrown
2025-06-09 7:34 ` [PATCH 4/8] VFS: Add ability to exclusively lock a dentry and use for open/create NeilBrown
2025-06-09 7:34 ` [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow NeilBrown
2025-06-10 20:57 ` Al Viro
2025-06-11 1:00 ` NeilBrown
2025-06-11 1:13 ` Al Viro
2025-06-11 2:49 ` NeilBrown
2025-06-09 7:34 ` [PATCH 6/8] VFS: provide alternative to s_vfs_rename_mutex NeilBrown
2025-06-09 7:34 ` [PATCH 7/8] VFS: use new dentry locking for create/remove/rename NeilBrown
2025-06-10 20:36 ` Al Viro
2025-06-11 0:34 ` NeilBrown
2025-06-09 7:34 ` [PATCH 8/8] VFS: allow a filesystem to opt out of directory locking NeilBrown