* [PATCH 01/11] VFS: introduce vfs_mkdir_return()
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-23 5:04 ` Al Viro
2024-12-20 2:54 ` [PATCH 02/11] VFS: add _shared versions of the various directory modifying inode_operations NeilBrown
` (11 subsequent siblings)
12 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
vfs_mkdir() does not guarantee to make the child dentry positive on
success. It may leave it negative and then the caller needs to perform a
lookup to find the target dentry.
This patch introduced vfs_mkdir_return() which performs the lookup if
needed so that this code is centralised.
This prepares for a new inode operation which will perform mkdir and
returns the correct dentry.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/cachefiles/namei.c | 7 +----
fs/namei.c | 63 ++++++++++++++++++++++++++++++++++++++++
fs/nfsd/vfs.c | 21 ++------------
fs/overlayfs/dir.c | 33 +--------------------
fs/overlayfs/overlayfs.h | 10 +++----
fs/overlayfs/super.c | 2 +-
fs/smb/server/vfs.c | 23 +++------------
include/linux/fs.h | 2 ++
8 files changed, 80 insertions(+), 81 deletions(-)
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 7cf59713f0f7..3c866c3b9534 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
/* search the current directory for the element name */
inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
-retry:
ret = cachefiles_inject_read_error();
if (ret == 0)
subdir = lookup_one_len(dirname, dir, strlen(dirname));
@@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
goto mkdir_error;
ret = cachefiles_inject_write_error();
if (ret == 0)
- ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
+ ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
if (ret < 0) {
trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
cachefiles_trace_mkdir_error);
@@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
}
trace_cachefiles_mkdir(dir, subdir);
- if (unlikely(d_unhashed(subdir))) {
- cachefiles_put_directory(subdir);
- goto retry;
- }
ASSERT(d_backing_inode(subdir));
_debug("mkdir -> %pd{ino=%lu}",
diff --git a/fs/namei.c b/fs/namei.c
index 9d30c7aa9aa6..cdd1fc9d56a0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4315,6 +4315,69 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
}
EXPORT_SYMBOL(vfs_mkdir);
+/**
+ * vfs_mkdir_return - create directory returning correct dentry
+ * @idmap: idmap of the mount the inode was found from
+ * @dir: inode of the parent directory
+ * @dentryp: pointer to dentry of the child directory
+ * @mode: mode of the child directory
+ *
+ * Create a directory.
+ *
+ * If the inode has been found through an idmapped mount the idmap of
+ * the vfsmount must be passed through @idmap. This function will then take
+ * care to map the inode according to @idmap before checking permissions.
+ * On non-idmapped mounts or if permission checking is to be performed on the
+ * raw inode simply pass @nop_mnt_idmap.
+ *
+ * The filesystem may not use the dentry that was passed in. In that case
+ * the passed-in dentry is put and a new one is placed in *@dentryp;
+ * So on successful return *@dentryp will always be positive.
+ */
+int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry **dentryp, umode_t mode)
+{
+ struct dentry *dentry = *dentryp;
+ int error;
+ unsigned max_links = dir->i_sb->s_max_links;
+
+ error = may_create(idmap, dir, dentry);
+ if (error)
+ return error;
+
+ if (!dir->i_op->mkdir)
+ return -EPERM;
+
+ mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
+ error = security_inode_mkdir(dir, dentry, mode);
+ if (error)
+ return error;
+
+ if (max_links && dir->i_nlink >= max_links)
+ return -EMLINK;
+
+ error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+ if (!error) {
+ fsnotify_mkdir(dir, dentry);
+ if (unlikely(d_unhashed(dentry))) {
+ struct dentry *d;
+ d = lookup_dcache((const struct qstr *)&dentry->d_name,
+ dentry->d_parent, 0);
+ if (IS_ERR(d)) {
+ error = PTR_ERR(d);
+ } else if (unlikely(d_is_negative(d))) {
+ dput(d);
+ error = -ENOENT;
+ } else {
+ dput(dentry);
+ *dentryp = d;
+ }
+ }
+ }
+ return error;
+}
+EXPORT_SYMBOL(vfs_mkdir_return);
+
int do_mkdirat(int dfd, struct filename *name, umode_t mode)
{
struct dentry *dentry;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 29cb7b812d71..740332413138 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1488,26 +1488,11 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
nfsd_check_ignore_resizing(iap);
break;
case S_IFDIR:
- host_err = vfs_mkdir(&nop_mnt_idmap, dirp, dchild, iap->ia_mode);
- if (!host_err && unlikely(d_unhashed(dchild))) {
- struct dentry *d;
- d = lookup_one_len(dchild->d_name.name,
- dchild->d_parent,
- dchild->d_name.len);
- if (IS_ERR(d)) {
- host_err = PTR_ERR(d);
- break;
- }
- if (unlikely(d_is_negative(d))) {
- dput(d);
- err = nfserr_serverfault;
- goto out;
- }
+ host_err = vfs_mkdir_return(&nop_mnt_idmap, dirp, &dchild, iap->ia_mode);
+ if (!host_err && unlikely(dchild != resfhp->fh_dentry)) {
dput(resfhp->fh_dentry);
- resfhp->fh_dentry = dget(d);
+ resfhp->fh_dentry = dget(dchild);
err = fh_update(resfhp);
- dput(dchild);
- dchild = d;
if (err)
goto out;
}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 08e683917d12..92a277ccc419 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -138,37 +138,6 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
goto out;
}
-int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
- struct dentry **newdentry, umode_t mode)
-{
- int err;
- struct dentry *d, *dentry = *newdentry;
-
- err = ovl_do_mkdir(ofs, dir, dentry, mode);
- if (err)
- return err;
-
- if (likely(!d_unhashed(dentry)))
- return 0;
-
- /*
- * vfs_mkdir() may succeed and leave the dentry passed
- * to it unhashed and negative. If that happens, try to
- * lookup a new hashed and positive dentry.
- */
- d = ovl_lookup_upper(ofs, dentry->d_name.name, dentry->d_parent,
- dentry->d_name.len);
- if (IS_ERR(d)) {
- pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
- dentry, err);
- return PTR_ERR(d);
- }
- dput(dentry);
- *newdentry = d;
-
- return 0;
-}
-
struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
struct dentry *newdentry, struct ovl_cattr *attr)
{
@@ -191,7 +160,7 @@ struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
case S_IFDIR:
/* mkdir is special... */
- err = ovl_mkdir_real(ofs, dir, &newdentry, attr->mode);
+ err = ovl_do_mkdir(ofs, dir, &newdentry, attr->mode);
break;
case S_IFCHR:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index b361f35762be..9ff34f84fc50 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -242,11 +242,11 @@ static inline int ovl_do_create(struct ovl_fs *ofs,
}
static inline int ovl_do_mkdir(struct ovl_fs *ofs,
- struct inode *dir, struct dentry *dentry,
+ struct inode *dir, struct dentry **dentry,
umode_t mode)
{
- int err = vfs_mkdir(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
- pr_debug("mkdir(%pd2, 0%o) = %i\n", dentry, mode, err);
+ int err = vfs_mkdir_return(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
+ pr_debug("mkdir(%pd2, 0%o) = %i\n", *dentry, mode, err);
return err;
}
@@ -838,8 +838,8 @@ struct ovl_cattr {
#define OVL_CATTR(m) (&(struct ovl_cattr) { .mode = (m) })
-int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
- struct dentry **newdentry, umode_t mode);
+int ovl_do_mkdir(struct ovl_fs *ofs, struct inode *dir,
+ struct dentry **newdentry, umode_t mode);
struct dentry *ovl_create_real(struct ovl_fs *ofs,
struct inode *dir, struct dentry *newdentry,
struct ovl_cattr *attr);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index fe511192f83c..24c44f7c1c8b 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -309,7 +309,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
goto retry;
}
- err = ovl_mkdir_real(ofs, dir, &work, attr.ia_mode);
+ err = ovl_do_mkdir(ofs, dir, &work, attr.ia_mode);
if (err)
goto out_dput;
diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index 88d167a5f8b7..dfb0eee5f5f3 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -211,7 +211,7 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
{
struct mnt_idmap *idmap;
struct path path;
- struct dentry *dentry;
+ struct dentry *dentry, *d;
int err;
dentry = ksmbd_vfs_kern_path_create(work, name,
@@ -227,25 +227,10 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
idmap = mnt_idmap(path.mnt);
mode |= S_IFDIR;
- err = vfs_mkdir(idmap, d_inode(path.dentry), dentry, mode);
- if (!err && d_unhashed(dentry)) {
- struct dentry *d;
-
- d = lookup_one(idmap, dentry->d_name.name, dentry->d_parent,
- dentry->d_name.len);
- if (IS_ERR(d)) {
- err = PTR_ERR(d);
- goto out_err;
- }
- if (unlikely(d_is_negative(d))) {
- dput(d);
- err = -ENOENT;
- goto out_err;
- }
-
+ d = dentry;
+ err = vfs_mkdir_return(idmap, d_inode(path.dentry), &dentry, mode);
+ if (!err && dentry != d)
ksmbd_vfs_inherit_owner(work, d_inode(path.dentry), d_inode(d));
- dput(d);
- }
out_err:
done_path_create(&path, dentry);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..406887d0394e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1923,6 +1923,8 @@ int vfs_create(struct mnt_idmap *, struct inode *,
struct dentry *, umode_t, bool);
int vfs_mkdir(struct mnt_idmap *, struct inode *,
struct dentry *, umode_t);
+int vfs_mkdir_return(struct mnt_idmap *, struct inode *,
+ struct dentry **, umode_t);
int vfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *,
umode_t, dev_t);
int vfs_symlink(struct mnt_idmap *, struct inode *,
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 01/11] VFS: introduce vfs_mkdir_return()
2024-12-20 2:54 ` [PATCH 01/11] VFS: introduce vfs_mkdir_return() NeilBrown
@ 2024-12-23 5:04 ` Al Viro
2024-12-23 7:26 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2024-12-23 5:04 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Fri, Dec 20, 2024 at 01:54:19PM +1100, NeilBrown wrote:
> + error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> + if (!error) {
> + fsnotify_mkdir(dir, dentry);
> + if (unlikely(d_unhashed(dentry))) {
> + struct dentry *d;
> + d = lookup_dcache((const struct qstr *)&dentry->d_name,
> + dentry->d_parent, 0);
> + if (IS_ERR(d)) {
> + error = PTR_ERR(d);
> + } else if (unlikely(d_is_negative(d))) {
... which will instantly oops if there's no cached dentry with
such name and parent. lookup_dcache() is pure dcache lookup;
it does *NOT* call ->lookup() on miss - just returns NULL.
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 01/11] VFS: introduce vfs_mkdir_return()
2024-12-23 5:04 ` Al Viro
@ 2024-12-23 7:26 ` NeilBrown
0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-23 7:26 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Mon, 23 Dec 2024, Al Viro wrote:
> On Fri, Dec 20, 2024 at 01:54:19PM +1100, NeilBrown wrote:
> > + error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> > + if (!error) {
> > + fsnotify_mkdir(dir, dentry);
> > + if (unlikely(d_unhashed(dentry))) {
> > + struct dentry *d;
> > + d = lookup_dcache((const struct qstr *)&dentry->d_name,
> > + dentry->d_parent, 0);
> > + if (IS_ERR(d)) {
> > + error = PTR_ERR(d);
> > + } else if (unlikely(d_is_negative(d))) {
>
>
> ... which will instantly oops if there's no cached dentry with
> such name and parent. lookup_dcache() is pure dcache lookup;
> it does *NOT* call ->lookup() on miss - just returns NULL.
>
I originally had lookup_one_len() but realised that I don't really want
any of lookup_one_common() so optimised - badly.
Maybe I should go back to lookup_one_len().
Or maybe I can just insert
if (!d)
d = __lookup_slow(....->d_name, dentry->d_parent, 0);
I'll look more closely and see which seems best.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 02/11] VFS: add _shared versions of the various directory modifying inode_operations
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
2024-12-20 2:54 ` [PATCH 01/11] VFS: introduce vfs_mkdir_return() NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-23 5:08 ` Al Viro
2024-12-20 2:54 ` [PATCH 03/11] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
` (10 subsequent siblings)
12 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
These "_shared" versions of various inode operations are not guaranteed
an exclusive lock on the directory but are guaranteed an exclusive lock
on the dentry within the directory.
i_rwsem *may* be held exclusively or *may* be held shared, in which case
an exclusive lock will be held on the dentry - provided by a later
patch.
This will allow a graceful transition from exclusive to shared locking
for directory updates.
mkdir_shared is a bit different as it optionally returns a new dentry
for cases when the filesystem is not able to use the original dentry.
This allows vfs_mkdir_return() to avoid the need for an extra lookup.
Signed-off-by: NeilBrown <neilb@suse.de>
---
Documentation/filesystems/locking.rst | 28 ++++++-
Documentation/filesystems/porting.rst | 10 +++
Documentation/filesystems/vfs.rst | 24 ++++++
fs/namei.c | 108 +++++++++++++++++++-------
include/linux/fs.h | 16 ++++
5 files changed, 158 insertions(+), 28 deletions(-)
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index f5e3676db954..7cacff59356f 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -57,15 +57,24 @@ inode_operations
prototypes::
int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
+ int (*create_shared) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*link_shared) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
+ int (*unlink_shared) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+ int (*symlink_shared) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+ struct dentry * (*mkdir_shared) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
+ int (*rmdir_shared) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+ int (*mknod_shared) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
+ int (*rename_shared) (struct mnt_idmap *, struct inode *, struct dentry *,
+ struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
void (*truncate) (struct inode *);
@@ -79,6 +88,9 @@ prototypes::
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode);
+ int (*atomic_open_shared)(struct inode *, struct dentry *,
+ struct file *, unsigned open_flag,
+ umode_t create_mode);
int (*tmpfile) (struct mnt_idmap *, struct inode *,
struct file *, umode_t);
int (*fileattr_set)(struct mnt_idmap *idmap,
@@ -90,18 +102,29 @@ prototypes::
locking rules:
all may block
+A "mixed" lock means that either that i_rwsem on the directory is held
+exclusively, or it is held as a shared lock, and an exclusive lock is held
+on the dentry in that directory.
============== ==================================================
ops i_rwsem(inode)
============== ==================================================
lookup: shared
create: exclusive
+create_shared: mixed
link: exclusive (both)
+link_shared: exclusive on source, mixed on target
mknod: exclusive
+mknod_shared: mixed
symlink: exclusive
+symlink_shared: mixed
mkdir: exclusive
+mkdir_shared: mixed
unlink: exclusive (both)
+unlink_shared: exclusive on object, mixed on directory/name
rmdir: exclusive (both)(see below)
+rmdir_shared: exclusive on object, mixed on directory/name (see below)
rename: exclusive (both parents, some children) (see below)
+rename_shared: mixed (both parents) exclusive (some children) (see below)
readlink: no
get_link: no
setattr: exclusive
@@ -113,6 +136,7 @@ listxattr: no
fiemap: no
update_time: no
atomic_open: shared (exclusive if O_CREAT is set in open flags)
+atomic_open_shared: mixed (if O_CREAT is not set, then may not have exclusive lock on name)
tmpfile: no
fileattr_get: no or exclusive
fileattr_set: exclusive
@@ -120,8 +144,8 @@ get_offset_ctx no
============== ==================================================
- Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
- exclusive on victim.
+ Additionally, ->rmdir(), ->unlink() and ->rename(), as well as _shared
+ versions, have ->i_rwsem exclusive on victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
involved.
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 9ab2a3d6f2b4..c7f3825f280c 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1141,3 +1141,13 @@ pointer are gone.
set_blocksize() takes opened struct file instead of struct block_device now
and it *must* be opened exclusive.
+
+---
+
+**recommended**
+
+create_shared, link_shared, unlink_shared, rmdir_shared, mknod_shared,
+rename_shared, atomic_open_shared can be provided instead of the
+corresponding inode_operations with the "_shared" suffix. Multiple
+_shared operations can be performed in a given directory concurrently,
+but never on the same name.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 0b18af3f954e..c4860597975a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -491,15 +491,24 @@ As of kernel 2.6.22, the following members are defined:
struct inode_operations {
int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
+ int (*create_shared) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*link_shared) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
+ int (*unlink_shared) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+ int (*symlink_shared) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+ struct dentry * (*mkdir_shared) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
+ int (*rmdir_shared) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+ int (*mknod_shared) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
+ int (*rename_shared) (struct mnt_idmap *, struct inode *, struct dentry *,
+ struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
const char *(*get_link) (struct dentry *, struct inode *,
struct delayed_call *);
@@ -511,6 +520,8 @@ As of kernel 2.6.22, the following members are defined:
void (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *, struct file *,
unsigned open_flag, umode_t create_mode);
+ int (*atomic_open_shared)(struct inode *, struct dentry *, struct file *,
+ unsigned open_flag, umode_t create_mode);
int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
@@ -524,6 +535,7 @@ Again, all methods are called without any locks being held, unless
otherwise noted.
``create``
+``create_shared``
called by the open(2) and creat(2) system calls. Only required
if you want to support regular files. The dentry you get should
not have an inode (i.e. it should be a negative dentry). Here
@@ -546,29 +558,39 @@ otherwise noted.
directory inode semaphore held
``link``
+``link_shared``
called by the link(2) system call. Only required if you want to
support hard links. You will probably need to call
d_instantiate() just as you would in the create() method
``unlink``
+``unlink_shared``
called by the unlink(2) system call. Only required if you want
to support deleting inodes
``symlink``
+``symlink_shared``
called by the symlink(2) system call. Only required if you want
to support symlinks. You will probably need to call
d_instantiate() just as you would in the create() method
``mkdir``
+``mkdir_shared``
called by the mkdir(2) system call. Only required if you want
to support creating subdirectories. You will probably need to
call d_instantiate() just as you would in the create() method
+ mkdir_shared can return an alternate dentry, much like lookup.
+ In this case the original dentry will still be negative and will
+ be unhashed.
+
``rmdir``
+``rmdir_shared``
called by the rmdir(2) system call. Only required if you want
to support deleting subdirectories
``mknod``
+``mknod_shared``
called by the mknod(2) system call to create a device (char,
block) inode or a named pipe (FIFO) or socket. Only required if
you want to support creating these types of inodes. You will
@@ -576,6 +598,7 @@ otherwise noted.
create() method
``rename``
+``rename_shared``
called by the rename(2) system call to rename the object to have
the parent and name given by the second inode and dentry.
@@ -647,6 +670,7 @@ otherwise noted.
itself and call mark_inode_dirty_sync.
``atomic_open``
+``atomic_open_shared``
called on the last component of an open. Using this optional
method the filesystem can look up, possibly create and open the
file in one atomic operation. If it wants to leave actual
diff --git a/fs/namei.c b/fs/namei.c
index cdd1fc9d56a0..65082378dc60 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3338,14 +3338,17 @@ int vfs_create(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (!dir->i_op->create)
+ if (!dir->i_op->create && !dir->i_op->create_shared)
return -EACCES; /* shouldn't it be ENOSYS? */
mode = vfs_prepare_mode(idmap, dir, mode, S_IALLUGO, S_IFREG);
error = security_inode_create(dir, dentry, mode);
if (error)
return error;
- error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
+ if (dir->i_op->create_shared)
+ error = dir->i_op->create_shared(idmap, dir, dentry, mode, want_excl);
+ else
+ error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
if (!error)
fsnotify_create(dir, dentry);
return error;
@@ -3506,8 +3509,12 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
file->f_path.dentry = DENTRY_NOT_SET;
file->f_path.mnt = nd->path.mnt;
- error = dir->i_op->atomic_open(dir, dentry, file,
- open_to_namei_flags(open_flag), mode);
+ if (dir->i_op->atomic_open_shared)
+ error = dir->i_op->atomic_open_shared(dir, dentry, file,
+ open_to_namei_flags(open_flag), mode);
+ else
+ error = dir->i_op->atomic_open(dir, dentry, file,
+ open_to_namei_flags(open_flag), mode);
d_lookup_done(dentry);
if (!error) {
if (file->f_mode & FMODE_OPENED) {
@@ -3616,7 +3623,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
}
if (create_error)
open_flag &= ~O_CREAT;
- if (dir_inode->i_op->atomic_open) {
+ if (dir_inode->i_op->atomic_open || dir_inode->i_op->atomic_open_shared) {
dentry = atomic_open(nd, dentry, file, open_flag, mode);
if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
dentry = ERR_PTR(create_error);
@@ -3641,13 +3648,17 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
if (!dentry->d_inode && (open_flag & O_CREAT)) {
file->f_mode |= FMODE_CREATED;
audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
- if (!dir_inode->i_op->create) {
- error = -EACCES;
- goto out_dput;
- }
- error = dir_inode->i_op->create(idmap, dir_inode, dentry,
- mode, open_flag & O_EXCL);
+ if (dir_inode->i_op->create_shared)
+ error = dir_inode->i_op->create_shared(idmap, dir_inode,
+ dentry, mode,
+ open_flag & O_EXCL);
+ else if (dir_inode->i_op->create)
+ error = dir_inode->i_op->create(idmap, dir_inode,
+ dentry, mode,
+ open_flag & O_EXCL);
+ else
+ error = -EACCES;
if (error)
goto out_dput;
}
@@ -4174,7 +4185,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
!capable(CAP_MKNOD))
return -EPERM;
- if (!dir->i_op->mknod)
+ if (!dir->i_op->mknod && !dir->i_op->mknod_shared)
return -EPERM;
mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
@@ -4186,7 +4197,10 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
+ if (dir->i_op->mknod_shared)
+ error = dir->i_op->mknod_shared(idmap, dir, dentry, mode, dev);
+ else
+ error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
if (!error)
fsnotify_create(dir, dentry);
return error;
@@ -4297,7 +4311,7 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (!dir->i_op->mkdir)
+ if (!dir->i_op->mkdir && !dir->i_op->mkdir_shared)
return -EPERM;
mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
@@ -4308,7 +4322,16 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
if (max_links && dir->i_nlink >= max_links)
return -EMLINK;
- error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+ if (dir->i_op->mkdir_shared) {
+ struct dentry *de;
+ de = dir->i_op->mkdir_shared(idmap, dir, dentry, mode);
+ if (IS_ERR(de))
+ error = PTR_ERR(de);
+ else if (de)
+ dput(de);
+ } else {
+ error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+ }
if (!error)
fsnotify_mkdir(dir, dentry);
return error;
@@ -4356,6 +4379,20 @@ int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
if (max_links && dir->i_nlink >= max_links)
return -EMLINK;
+ if (dir->i_op->mkdir_shared) {
+ struct dentry *de;
+
+ de = dir->i_op->mkdir_shared(idmap, dir, dentry, mode);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+ if (de) {
+ dput(dentry);
+ *dentryp = de;
+ }
+ fsnotify_mkdir(dir, dentry);
+ return 0;
+ }
+
error = dir->i_op->mkdir(idmap, dir, dentry, mode);
if (!error) {
fsnotify_mkdir(dir, dentry);
@@ -4439,7 +4476,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (!dir->i_op->rmdir)
+ if (!dir->i_op->rmdir && !dir->i_op->rmdir_shared)
return -EPERM;
dget(dentry);
@@ -4454,7 +4491,10 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
if (error)
goto out;
- error = dir->i_op->rmdir(dir, dentry);
+ if (dir->i_op->rmdir_shared)
+ error = dir->i_op->rmdir_shared(dir, dentry);
+ else
+ error = dir->i_op->rmdir(dir, dentry);
if (error)
goto out;
@@ -4569,7 +4609,7 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (!dir->i_op->unlink)
+ if (!dir->i_op->unlink && !dir->i_op->unlink_shared)
return -EPERM;
inode_lock(target);
@@ -4583,7 +4623,10 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
error = try_break_deleg(target, delegated_inode);
if (error)
goto out;
- error = dir->i_op->unlink(dir, dentry);
+ if (dir->i_op->unlink_shared)
+ error = dir->i_op->unlink_shared(dir, dentry);
+ else
+ error = dir->i_op->unlink(dir, dentry);
if (!error) {
dont_mount(dentry);
detach_mounts(dentry);
@@ -4722,14 +4765,17 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (!dir->i_op->symlink)
+ if (!dir->i_op->symlink && !dir->i_op->symlink_shared)
return -EPERM;
error = security_inode_symlink(dir, dentry, oldname);
if (error)
return error;
- error = dir->i_op->symlink(idmap, dir, dentry, oldname);
+ if (dir->i_op->symlink_shared)
+ error = dir->i_op->symlink_shared(idmap, dir, dentry, oldname);
+ else
+ error = dir->i_op->symlink(idmap, dir, dentry, oldname);
if (!error)
fsnotify_create(dir, dentry);
return error;
@@ -4835,7 +4881,7 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
*/
if (HAS_UNMAPPED_ID(idmap, inode))
return -EPERM;
- if (!dir->i_op->link)
+ if (!dir->i_op->link && !dir->i_op->link_shared)
return -EPERM;
if (S_ISDIR(inode->i_mode))
return -EPERM;
@@ -4852,7 +4898,11 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
error = -EMLINK;
else {
error = try_break_deleg(inode, delegated_inode);
- if (!error)
+ if (error)
+ ;
+ else if (dir->i_op->link_shared)
+ error = dir->i_op->link_shared(old_dentry, dir, new_dentry);
+ else
error = dir->i_op->link(old_dentry, dir, new_dentry);
}
@@ -5044,7 +5094,7 @@ int vfs_rename(struct renamedata *rd)
if (error)
return error;
- if (!old_dir->i_op->rename)
+ if (!old_dir->i_op->rename && !old_dir->i_op->rename_shared)
return -EPERM;
/*
@@ -5127,8 +5177,14 @@ int vfs_rename(struct renamedata *rd)
if (error)
goto out;
}
- error = old_dir->i_op->rename(rd->new_mnt_idmap, old_dir, old_dentry,
- new_dir, new_dentry, flags);
+ if (old_dir->i_op->rename_shared)
+ error = old_dir->i_op->rename_shared(rd->new_mnt_idmap,
+ old_dir, old_dentry,
+ new_dir, new_dentry, flags);
+ else
+ error = old_dir->i_op->rename(rd->new_mnt_idmap,
+ old_dir, old_dentry,
+ new_dir, new_dentry, flags);
if (error)
goto out;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 406887d0394e..68eba181175b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2147,17 +2147,30 @@ struct inode_operations {
int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
umode_t, bool);
+ int (*create_shared) (struct mnt_idmap *, struct inode *,struct dentry *,
+ umode_t, bool);
int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*link_shared) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
+ int (*unlink_shared) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
const char *);
+ int (*symlink_shared) (struct mnt_idmap *, struct inode *,struct dentry *,
+ const char *);
int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,
umode_t);
+ struct dentry * (*mkdir_shared) (struct mnt_idmap *, struct inode *,struct dentry *,
+ umode_t);
int (*rmdir) (struct inode *,struct dentry *);
+ int (*rmdir_shared) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,
umode_t,dev_t);
+ int (*mknod_shared) (struct mnt_idmap *, struct inode *,struct dentry *,
+ umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
+ int (*rename_shared) (struct mnt_idmap *, struct inode *, struct dentry *,
+ struct inode *, struct dentry *, unsigned int);
int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
int (*getattr) (struct mnt_idmap *, const struct path *,
struct kstat *, u32, unsigned int);
@@ -2168,6 +2181,9 @@ struct inode_operations {
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode);
+ int (*atomic_open_shared)(struct inode *, struct dentry *,
+ struct file *, unsigned open_flag,
+ umode_t create_mode);
int (*tmpfile) (struct mnt_idmap *, struct inode *,
struct file *, umode_t);
struct posix_acl *(*get_acl)(struct mnt_idmap *, struct dentry *,
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 02/11] VFS: add _shared versions of the various directory modifying inode_operations
2024-12-20 2:54 ` [PATCH 02/11] VFS: add _shared versions of the various directory modifying inode_operations NeilBrown
@ 2024-12-23 5:08 ` Al Viro
0 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2024-12-23 5:08 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Fri, Dec 20, 2024 at 01:54:20PM +1100, NeilBrown wrote:
> i_rwsem *may* be held exclusively or *may* be held shared, in which case
> an exclusive lock will be held on the dentry - provided by a later
> patch.
>
> This will allow a graceful transition from exclusive to shared locking
> for directory updates.
> +A "mixed" lock means that either that i_rwsem on the directory is held
> +exclusively, or it is held as a shared lock, and an exclusive lock is held
> +on the dentry in that directory.
... it also means that ->d_parent and ->d_name are completely unstable.
Yes, really - have rmdir victim moved around on server, then look it up
in a new place and you'll get __d_unalias() move the existing dentry
to new location right under you.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 03/11] VFS: use global wait-queue table for d_alloc_parallel()
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
2024-12-20 2:54 ` [PATCH 01/11] VFS: introduce vfs_mkdir_return() NeilBrown
2024-12-20 2:54 ` [PATCH 02/11] VFS: add _shared versions of the various directory modifying inode_operations NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 2:54 ` [PATCH 04/11] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() NeilBrown
` (9 subsequent siblings)
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
d_alloc_parallel() currently requires a wait_queue_head to be passed in.
This must have a life time which extends until the lookup is completed.
Future proposed patches will use d_alloc_parallel() for names being
created/unlinked etc. Some filesystems combine lookup with create
making a longer code path that the wq needs to live for. If it is still
to be allocated on-stack this can be cumbersome.
This patch replaces the on-stack wqs with a global array of wqs which
are used as needed. A wq is NOT allocated when a dentry is first
created but only when a second thread attempts to use the same name and
so is forced to wait. At this moment a wq is chosen using the
least-significant bits on the task's pid and that wq is assigned to
->d_wait. The ->d_lock is then dropped and the task waits.
When the dentry is finally moved out of "in_lookup" a wake up is only
sent if ->d_wait is not NULL. This avoids an (uncontended) spin
lock/unlock which saves a couple of atomic operations in a common case.
The wake up passes the dentry that the wake up is for as the "key" and
the waiter will only wake processes waiting on the same key. This means
that when these global waitqueues are shared (which is inevitable
though unlikely to be frequent), a task will not be woken prematurely.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/afs/dir_silly.c | 4 +--
fs/dcache.c | 69 +++++++++++++++++++++++++++++++++--------
fs/fuse/readdir.c | 3 +-
fs/namei.c | 6 ++--
fs/nfs/dir.c | 6 ++--
fs/nfs/unlink.c | 3 +-
fs/proc/base.c | 3 +-
fs/proc/proc_sysctl.c | 3 +-
fs/smb/client/readdir.c | 3 +-
include/linux/dcache.h | 3 +-
include/linux/nfs_xdr.h | 1 -
11 files changed, 67 insertions(+), 37 deletions(-)
diff --git a/fs/afs/dir_silly.c b/fs/afs/dir_silly.c
index a1e581946b93..aa4363a1c6fa 100644
--- a/fs/afs/dir_silly.c
+++ b/fs/afs/dir_silly.c
@@ -239,13 +239,11 @@ int afs_silly_iput(struct dentry *dentry, struct inode *inode)
struct dentry *alias;
int ret;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-
_enter("%p{%pd},%llx", dentry, dentry, vnode->fid.vnode);
down_read(&dvnode->rmdir_lock);
- alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name, &wq);
+ alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name);
if (IS_ERR(alias)) {
up_read(&dvnode->rmdir_lock);
return 0;
diff --git a/fs/dcache.c b/fs/dcache.c
index b4d5e9e1e43d..ebe849474bd8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2078,8 +2078,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
return found;
}
if (d_in_lookup(dentry)) {
- found = d_alloc_parallel(dentry->d_parent, name,
- dentry->d_wait);
+ found = d_alloc_parallel(dentry->d_parent, name);
if (IS_ERR(found) || !d_in_lookup(found)) {
iput(inode);
return found;
@@ -2089,7 +2088,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
if (!found) {
iput(inode);
return ERR_PTR(-ENOMEM);
- }
+ }
}
res = d_splice_alias(inode, found);
if (res) {
@@ -2459,30 +2458,70 @@ static inline unsigned start_dir_add(struct inode *dir)
}
static inline void end_dir_add(struct inode *dir, unsigned int n,
- wait_queue_head_t *d_wait)
+ wait_queue_head_t *d_wait, struct dentry *de)
{
smp_store_release(&dir->i_dir_seq, n + 2);
preempt_enable_nested();
- wake_up_all(d_wait);
+ if (d_wait)
+ __wake_up(d_wait, TASK_NORMAL, 0, de);
+}
+
+#define PAR_LOOKUP_WQS 256
+static wait_queue_head_t par_wait_table[PAR_LOOKUP_WQS] __cacheline_aligned;
+
+static int __init par_wait_init(void)
+{
+ int i;
+
+ for (i = 0; i < PAR_LOOKUP_WQS; i++)
+ init_waitqueue_head(&par_wait_table[i]);
+ return 0;
+}
+fs_initcall(par_wait_init);
+
+struct par_wait_key {
+ struct dentry *de;
+ struct wait_queue_entry wqe;
+};
+
+static int d_wait_wake_fn(struct wait_queue_entry *wq_entry,
+ unsigned mode, int sync, void *key)
+{
+ struct par_wait_key *pwk = container_of(wq_entry,
+ struct par_wait_key, wqe);
+ if (pwk->de == key)
+ return default_wake_function(wq_entry, mode, sync, key);
+ return 0;
}
static void d_wait_lookup(struct dentry *dentry)
{
if (d_in_lookup(dentry)) {
- DECLARE_WAITQUEUE(wait, current);
- add_wait_queue(dentry->d_wait, &wait);
+ struct par_wait_key wk = {
+ .de = dentry,
+ .wqe = {
+ .private = current,
+ .func = d_wait_wake_fn,
+ },
+ };
+ struct wait_queue_head *wq;
+ if (!dentry->d_wait)
+ dentry->d_wait = &par_wait_table[current->pid %
+ PAR_LOOKUP_WQS];
+ wq = dentry->d_wait;
+ add_wait_queue(wq, &wk.wqe);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
spin_unlock(&dentry->d_lock);
schedule();
spin_lock(&dentry->d_lock);
} while (d_in_lookup(dentry));
+ remove_wait_queue(wq, &wk.wqe);
}
}
struct dentry *d_alloc_parallel(struct dentry *parent,
- const struct qstr *name,
- wait_queue_head_t *wq)
+ const struct qstr *name)
{
unsigned int hash = name->hash;
struct hlist_bl_head *b = in_lookup_hash(parent, hash);
@@ -2579,7 +2618,7 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
rcu_read_unlock();
/* we can't take ->d_lock here; it's OK, though. */
new->d_flags |= DCACHE_PAR_LOOKUP;
- new->d_wait = wq;
+ new->d_wait = NULL;
hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
hlist_bl_unlock(b);
return new;
@@ -2616,8 +2655,12 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
void __d_lookup_unhash_wake(struct dentry *dentry)
{
+ wait_queue_head_t *d_wait;
+
spin_lock(&dentry->d_lock);
- wake_up_all(__d_lookup_unhash(dentry));
+ d_wait = __d_lookup_unhash(dentry);
+ if (d_wait)
+ __wake_up(d_wait, TASK_NORMAL, 0, dentry);
spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(__d_lookup_unhash_wake);
@@ -2645,7 +2688,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
}
__d_rehash(dentry);
if (dir)
- end_dir_add(dir, n, d_wait);
+ end_dir_add(dir, n, d_wait, dentry);
spin_unlock(&dentry->d_lock);
if (inode)
spin_unlock(&inode->i_lock);
@@ -2863,7 +2906,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
write_seqcount_end(&dentry->d_seq);
if (dir)
- end_dir_add(dir, n, d_wait);
+ end_dir_add(dir, n, d_wait, target);
if (dentry->d_parent != old_parent)
spin_unlock(&dentry->d_parent->d_lock);
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index 17ce9636a2b1..c6b646a3f1bd 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -160,7 +160,6 @@ static int fuse_direntplus_link(struct file *file,
struct inode *dir = d_inode(parent);
struct fuse_conn *fc;
struct inode *inode;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
if (!o->nodeid) {
/*
@@ -195,7 +194,7 @@ static int fuse_direntplus_link(struct file *file,
dentry = d_lookup(parent, &name);
if (!dentry) {
retry:
- dentry = d_alloc_parallel(parent, &name, &wq);
+ dentry = d_alloc_parallel(parent, &name);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
}
diff --git a/fs/namei.c b/fs/namei.c
index 65082378dc60..174e6693304e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1767,13 +1767,12 @@ static struct dentry *__lookup_slow(const struct qstr *name,
{
struct dentry *dentry, *old;
struct inode *inode = dir->d_inode;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
/* Don't go there if it's already dead */
if (unlikely(IS_DEADDIR(inode)))
return ERR_PTR(-ENOENT);
again:
- dentry = d_alloc_parallel(dir, name, &wq);
+ dentry = d_alloc_parallel(dir, name);
if (IS_ERR(dentry))
return dentry;
if (unlikely(!d_in_lookup(dentry))) {
@@ -3566,7 +3565,6 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
struct dentry *dentry;
int error, create_error = 0;
umode_t mode = op->mode;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
if (unlikely(IS_DEADDIR(dir_inode)))
return ERR_PTR(-ENOENT);
@@ -3575,7 +3573,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
dentry = d_lookup(dir, &nd->last);
for (;;) {
if (!dentry) {
- dentry = d_alloc_parallel(dir, &nd->last, &wq);
+ dentry = d_alloc_parallel(dir, &nd->last);
if (IS_ERR(dentry))
return dentry;
}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 492cffd9d3d8..531bf586501f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -725,7 +725,6 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
unsigned long dir_verifier)
{
struct qstr filename = QSTR_INIT(entry->name, entry->len);
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct dentry *dentry;
struct dentry *alias;
struct inode *inode;
@@ -754,7 +753,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
dentry = d_lookup(parent, &filename);
again:
if (!dentry) {
- dentry = d_alloc_parallel(parent, &filename, &wq);
+ dentry = d_alloc_parallel(parent, &filename);
if (IS_ERR(dentry))
return;
}
@@ -2069,7 +2068,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
struct file *file, unsigned open_flags,
umode_t mode)
{
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct nfs_open_context *ctx;
struct dentry *res;
struct iattr attr = { .ia_valid = ATTR_OPEN };
@@ -2125,7 +2123,7 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
d_drop(dentry);
switched = true;
dentry = d_alloc_parallel(dentry->d_parent,
- &dentry->d_name, &wq);
+ &dentry->d_name);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
if (unlikely(!d_in_lookup(dentry)))
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index bf77399696a7..d44162d3a8f1 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -124,7 +124,7 @@ static int nfs_call_unlink(struct dentry *dentry, struct inode *inode, struct nf
struct dentry *alias;
down_read_non_owner(&NFS_I(dir)->rmdir_sem);
- alias = d_alloc_parallel(dentry->d_parent, &data->args.name, &data->wq);
+ alias = d_alloc_parallel(dentry->d_parent, &data->args.name);
if (IS_ERR(alias)) {
up_read_non_owner(&NFS_I(dir)->rmdir_sem);
return 0;
@@ -185,7 +185,6 @@ nfs_async_unlink(struct dentry *dentry, const struct qstr *name)
data->cred = get_current_cred();
data->res.dir_attr = &data->dir_attr;
- init_waitqueue_head(&data->wq);
status = -EBUSY;
spin_lock(&dentry->d_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0edf14a9840e..0c2d5583aef3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2125,8 +2125,7 @@ bool proc_fill_cache(struct file *file, struct dir_context *ctx,
child = d_hash_and_lookup(dir, &qname);
if (!child) {
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
- child = d_alloc_parallel(dir, &qname, &wq);
+ child = d_alloc_parallel(dir, &qname);
if (IS_ERR(child))
goto end_instantiate;
if (d_in_lookup(child)) {
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 27a283d85a6e..cd7999439aa9 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -693,8 +693,7 @@ static bool proc_sys_fill_cache(struct file *file,
child = d_lookup(dir, &qname);
if (!child) {
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
- child = d_alloc_parallel(dir, &qname, &wq);
+ child = d_alloc_parallel(dir, &qname);
if (IS_ERR(child))
return false;
if (d_in_lookup(child)) {
diff --git a/fs/smb/client/readdir.c b/fs/smb/client/readdir.c
index 273358d20a46..2c4c9b5ce3a2 100644
--- a/fs/smb/client/readdir.c
+++ b/fs/smb/client/readdir.c
@@ -73,7 +73,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
struct cifs_sb_info *cifs_sb = CIFS_SB(sb);
bool posix = cifs_sb_master_tcon(cifs_sb)->posix_extensions;
bool reparse_need_reval = false;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
int rc;
cifs_dbg(FYI, "%s: for %s\n", __func__, name->name);
@@ -105,7 +104,7 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
(fattr->cf_flags & CIFS_FATTR_NEED_REVAL))
return;
- dentry = d_alloc_parallel(parent, name, &wq);
+ dentry = d_alloc_parallel(parent, name);
}
if (IS_ERR(dentry))
return;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bff956f7b2b9..b64c0260e4be 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -235,8 +235,7 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
/* allocate/de-allocate */
extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
extern struct dentry * d_alloc_anon(struct super_block *);
-extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *,
- wait_queue_head_t *);
+extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
extern bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 559273a0f16d..7b807b6c1e81 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1735,7 +1735,6 @@ struct nfs_unlinkdata {
struct nfs_removeargs args;
struct nfs_removeres res;
struct dentry *dentry;
- wait_queue_head_t wq;
const struct cred *cred;
struct nfs_fattr dir_attr;
long timeout;
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 04/11] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (2 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 03/11] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 2:54 ` [PATCH 05/11] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry NeilBrown
` (8 subsequent siblings)
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
lookup_one_qstr_excl() is used for lookups prior to directory
modifications, whether create, unlink, rename, or whatever.
To prepare for allowing modification to happen in parallel, change
lookup_one_qstr_excl() to use d_alloc_parallel().
If any for the "intent" LOOKUP flags are passed, the caller must ensure
d_lookup_done() is called at an appropriate time. If none are passed
then we can be sure ->lookup() will do a real lookup and d_lookup_done()
is called internally.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/namei.c | 21 ++++++++++++++-------
fs/smb/server/vfs.c | 1 +
include/linux/namei.h | 3 +++
3 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 174e6693304e..395bfbc8fc92 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1664,11 +1664,9 @@ static struct dentry *lookup_dcache(const struct qstr *name,
}
/*
- * Parent directory has inode locked exclusive. This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
+ * Parent directory has inode locked exclusive.
+ * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
+ * must be called after the intended operation is performed - or aborted.
*/
struct dentry *lookup_one_qstr_excl(const struct qstr *name,
struct dentry *base,
@@ -1685,15 +1683,22 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
if (unlikely(IS_DEADDIR(dir)))
return ERR_PTR(-ENOENT);
- dentry = d_alloc(base, name);
- if (unlikely(!dentry))
+ dentry = d_alloc_parallel(base, name);
+ if (unlikely(IS_ERR_OR_NULL(dentry)))
return ERR_PTR(-ENOMEM);
+ if (!d_in_lookup(dentry))
+ /* Raced with another thread which did the lookup */
+ return dentry;
old = dir->i_op->lookup(dir, dentry, flags);
if (unlikely(old)) {
+ d_lookup_done(dentry);
dput(dentry);
dentry = old;
}
+ if ((flags & LOOKUP_INTENT_FLAGS) == 0)
+ /* ->lookup must have given final answer */
+ d_lookup_done(dentry);
return dentry;
}
EXPORT_SYMBOL(lookup_one_qstr_excl);
@@ -4112,6 +4117,7 @@ static struct dentry *filename_create(int dfd, struct filename *name,
}
return dentry;
fail:
+ d_lookup_done(dentry);
dput(dentry);
dentry = ERR_PTR(error);
unlock:
@@ -5340,6 +5346,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
rd.flags = flags;
error = vfs_rename(&rd);
exit5:
+ d_lookup_done(new_dentry);
dput(new_dentry);
exit4:
dput(old_dentry);
diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index dfb0eee5f5f3..83131f08bfb4 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -772,6 +772,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
ksmbd_debug(VFS, "vfs_rename failed err %d\n", err);
out4:
+ d_lookup_done(new_dentry);
dput(new_dentry);
out3:
dput(old_parent);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 8ec8fed3bce8..15118992f745 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -34,6 +34,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
#define LOOKUP_EXCL 0x0400 /* ... in exclusive creation */
#define LOOKUP_RENAME_TARGET 0x0800 /* ... in destination of rename() */
+#define LOOKUP_INTENT_FLAGS (LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL | \
+ LOOKUP_RENAME_TARGET)
+
/* internal use only */
#define LOOKUP_PARENT 0x0010
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 05/11] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (3 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 04/11] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 2:54 ` [PATCH 06/11] VFS: introduce done_lookup_and_lock() NeilBrown
` (7 subsequent siblings)
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
No callers of kern_path_locked() or user_path_locked_at() want a
negative dentry. So change them to return -ENOENT instead. This
simplifies callers.
This results in a subtle change to bcachefs in that an ioctl will now
return -ENOENT in preference to -EXDEV. I believe this restores the
behaviour to what it was prior to
Commit bbe6a7c899e7 ("bch2_ioctl_subvolume_destroy(): fix locking")
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/base/devtmpfs.c | 65 +++++++++++++++++++----------------------
fs/bcachefs/fs-ioctl.c | 4 ---
fs/namei.c | 4 +++
kernel/audit_watch.c | 12 ++++----
4 files changed, 40 insertions(+), 45 deletions(-)
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index b848764ef018..c9e34842139f 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -245,15 +245,12 @@ static int dev_rmdir(const char *name)
dentry = kern_path_locked(name, &parent);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
- if (d_really_is_positive(dentry)) {
- if (d_inode(dentry)->i_private == &thread)
- err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
- dentry);
- else
- err = -EPERM;
- } else {
- err = -ENOENT;
- }
+ if (d_inode(dentry)->i_private == &thread)
+ err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
+ dentry);
+ else
+ err = -EPERM;
+
dput(dentry);
inode_unlock(d_inode(parent.dentry));
path_put(&parent);
@@ -310,6 +307,8 @@ static int handle_remove(const char *nodename, struct device *dev)
{
struct path parent;
struct dentry *dentry;
+ struct kstat stat;
+ struct path p;
int deleted = 0;
int err;
@@ -317,32 +316,28 @@ static int handle_remove(const char *nodename, struct device *dev)
if (IS_ERR(dentry))
return PTR_ERR(dentry);
- if (d_really_is_positive(dentry)) {
- struct kstat stat;
- struct path p = {.mnt = parent.mnt, .dentry = dentry};
- err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
- AT_STATX_SYNC_AS_STAT);
- if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
- struct iattr newattrs;
- /*
- * before unlinking this node, reset permissions
- * of possible references like hardlinks
- */
- newattrs.ia_uid = GLOBAL_ROOT_UID;
- newattrs.ia_gid = GLOBAL_ROOT_GID;
- newattrs.ia_mode = stat.mode & ~0777;
- newattrs.ia_valid =
- ATTR_UID|ATTR_GID|ATTR_MODE;
- inode_lock(d_inode(dentry));
- notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
- inode_unlock(d_inode(dentry));
- err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
- dentry, NULL);
- if (!err || err == -ENOENT)
- deleted = 1;
- }
- } else {
- err = -ENOENT;
+ p.mnt = parent.mnt;
+ p.dentry = dentry;
+ err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
+ AT_STATX_SYNC_AS_STAT);
+ if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
+ struct iattr newattrs;
+ /*
+ * before unlinking this node, reset permissions
+ * of possible references like hardlinks
+ */
+ newattrs.ia_uid = GLOBAL_ROOT_UID;
+ newattrs.ia_gid = GLOBAL_ROOT_GID;
+ newattrs.ia_mode = stat.mode & ~0777;
+ newattrs.ia_valid =
+ ATTR_UID|ATTR_GID|ATTR_MODE;
+ inode_lock(d_inode(dentry));
+ notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
+ inode_unlock(d_inode(dentry));
+ err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
+ dentry, NULL);
+ if (!err || err == -ENOENT)
+ deleted = 1;
}
dput(dentry);
inode_unlock(d_inode(parent.dentry));
diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
index 405cf08bda34..c5464219b23f 100644
--- a/fs/bcachefs/fs-ioctl.c
+++ b/fs/bcachefs/fs-ioctl.c
@@ -516,10 +516,6 @@ static long bch2_ioctl_subvolume_destroy(struct bch_fs *c, struct file *filp,
ret = -EXDEV;
goto err;
}
- if (!d_is_positive(victim)) {
- ret = -ENOENT;
- goto err;
- }
ret = __bch2_unlink(dir, victim, true);
if (!ret) {
fsnotify_rmdir(dir, victim);
diff --git a/fs/namei.c b/fs/namei.c
index 395bfbc8fc92..8780406cb4d7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2743,6 +2743,10 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
}
inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
d = lookup_one_qstr_excl(&last, path->dentry, 0);
+ if (!IS_ERR(d) && d_is_negative(d)) {
+ dput(d);
+ d = ERR_PTR(-ENOENT);
+ }
if (IS_ERR(d)) {
inode_unlock(path->dentry->d_inode);
path_put(path);
diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index 7f358740e958..e3130675ee6b 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -350,11 +350,10 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
struct dentry *d = kern_path_locked(watch->path, parent);
if (IS_ERR(d))
return PTR_ERR(d);
- if (d_is_positive(d)) {
- /* update watch filter fields */
- watch->dev = d->d_sb->s_dev;
- watch->ino = d_backing_inode(d)->i_ino;
- }
+ /* update watch filter fields */
+ watch->dev = d->d_sb->s_dev;
+ watch->ino = d_backing_inode(d)->i_ino;
+
inode_unlock(d_backing_inode(parent->dentry));
dput(d);
return 0;
@@ -419,7 +418,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
/* caller expects mutex locked */
mutex_lock(&audit_filter_mutex);
- if (ret) {
+ if (ret && ret != -ENOENT) {
audit_put_watch(watch);
return ret;
}
@@ -438,6 +437,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
h = audit_hash_ino((u32)watch->ino);
*list = &audit_inode_hash[h];
+ ret = 0;
error:
path_put(&parent_path);
audit_put_watch(watch);
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 06/11] VFS: introduce done_lookup_and_lock()
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (4 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 05/11] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 2:54 ` [PATCH 07/11] VFS: introduce lookup_and_lock() NeilBrown
` (6 subsequent siblings)
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
Callers of kern_path_locked() and user_path_locked_at() should now call
done_lookup_and_lock() to unlock the directory and dput() the
dentry.
This will allow the locking rules to be changed in a central place.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/base/devtmpfs.c | 7 +++----
fs/bcachefs/fs-ioctl.c | 3 +--
fs/namei.c | 10 ++++++++--
include/linux/namei.h | 1 +
kernel/audit_fsnotify.c | 3 +--
kernel/audit_watch.c | 3 +--
6 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9e34842139f..bb6d26338b6c 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -251,8 +251,8 @@ static int dev_rmdir(const char *name)
else
err = -EPERM;
- dput(dentry);
- inode_unlock(d_inode(parent.dentry));
+ done_lookup_and_lock(parent.dentry, dentry);
+
path_put(&parent);
return err;
}
@@ -339,8 +339,7 @@ static int handle_remove(const char *nodename, struct device *dev)
if (!err || err == -ENOENT)
deleted = 1;
}
- dput(dentry);
- inode_unlock(d_inode(parent.dentry));
+ done_lookup_and_lock(parent.dentry, dentry);
path_put(&parent);
if (deleted && strchr(nodename, '/'))
diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
index c5464219b23f..d51c86e24bef 100644
--- a/fs/bcachefs/fs-ioctl.c
+++ b/fs/bcachefs/fs-ioctl.c
@@ -522,8 +522,7 @@ static long bch2_ioctl_subvolume_destroy(struct bch_fs *c, struct file *filp,
d_delete(victim);
}
err:
- inode_unlock(dir);
- dput(victim);
+ done_lookup_and_lock(path.dentry, victim);
path_put(&path);
return ret;
}
diff --git a/fs/namei.c b/fs/namei.c
index 8780406cb4d7..29f86df4b9dc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2773,6 +2773,13 @@ struct dentry *user_path_locked_at(int dfd, const char __user *name, struct path
}
EXPORT_SYMBOL(user_path_locked_at);
+void done_lookup_and_lock(struct dentry *parent, struct dentry *child)
+{
+ dput(child);
+ inode_unlock(d_inode(parent));
+}
+EXPORT_SYMBOL(done_lookup_and_lock);
+
int kern_path(const char *name, unsigned int flags, struct path *path)
{
struct filename *filename = getname_kernel(name);
@@ -4146,8 +4153,7 @@ EXPORT_SYMBOL(kern_path_create);
void done_path_create(struct path *path, struct dentry *dentry)
{
- dput(dentry);
- inode_unlock(path->dentry->d_inode);
+ done_lookup_and_lock(path->dentry, dentry);
mnt_drop_write(path->mnt);
path_put(path);
}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 15118992f745..898fc8ba37e1 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -65,6 +65,7 @@ extern struct dentry *user_path_create(int, const char __user *, struct path *,
extern void done_path_create(struct path *, struct dentry *);
extern struct dentry *kern_path_locked(const char *, struct path *);
extern struct dentry *user_path_locked_at(int , const char __user *, struct path *);
+extern void done_lookup_and_lock(struct dentry *parent, struct dentry *child);
int vfs_path_parent_lookup(struct filename *filename, unsigned int flags,
struct path *parent, struct qstr *last, int *type,
const struct path *root);
diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c
index c565fbf66ac8..db2c03caa74d 100644
--- a/kernel/audit_fsnotify.c
+++ b/kernel/audit_fsnotify.c
@@ -86,7 +86,6 @@ struct audit_fsnotify_mark *audit_alloc_mark(struct audit_krule *krule, char *pa
if (IS_ERR(dentry))
return ERR_CAST(dentry); /* returning an error */
inode = path.dentry->d_inode;
- inode_unlock(inode);
audit_mark = kzalloc(sizeof(*audit_mark), GFP_KERNEL);
if (unlikely(!audit_mark)) {
@@ -107,7 +106,7 @@ struct audit_fsnotify_mark *audit_alloc_mark(struct audit_krule *krule, char *pa
audit_mark = ERR_PTR(ret);
}
out:
- dput(dentry);
+ done_lookup_and_lock(path.dentry, dentry);
path_put(&path);
return audit_mark;
}
diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index e3130675ee6b..e1137ea9294b 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -354,8 +354,7 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
watch->dev = d->d_sb->s_dev;
watch->ino = d_backing_inode(d)->i_ino;
- inode_unlock(d_backing_inode(parent->dentry));
- dput(d);
+ done_lookup_and_lock(parent->dentry, d);
return 0;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 07/11] VFS: introduce lookup_and_lock()
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (5 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 06/11] VFS: introduce done_lookup_and_lock() NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 2:54 ` [PATCH 08/11] VFS: add inode_dir_lock/unlock NeilBrown
` (5 subsequent siblings)
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
lookup_and_lock() combines locking the directory and performing a lookup
prior to a change to the directory.
Abstracting this prepares for changing the locking requirements.
done_lookup_and_lock() will be called by all callers of
lookup_and_lock() to unlock and dput()
lookup_and_lock() returns -ENOENT if LOOKUP_CREATE was NOT given and the
name cannot be found,, and returns -EEXIST if LOOKUP_EXCL WAS given and
the name CAN be found. This is what callers want.
These functions replace all uses of lookup_one_qstr_excl() in namei.c
except for those used for rename.
The name might seem backwards as the lock happens before the lookup.
A future patch will change this so that only a shared lock is taken
before the lookup, and an exclusive lock on the dentry is taken after a
successful lookup. So the order "lookup" then "lock" will make sense.
This functionality is exported as lookup_and_lock_one() which takes a
name and len rather than a qstr.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/namei.c | 118 ++++++++++++++++++++++--------------------
include/linux/namei.h | 3 ++
2 files changed, 65 insertions(+), 56 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 29f86df4b9dc..371c80902c59 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1703,6 +1703,33 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
}
EXPORT_SYMBOL(lookup_one_qstr_excl);
+static struct dentry *lookup_and_lock(const struct qstr *last,
+ struct dentry *base,
+ unsigned int lookup_flags)
+{
+ struct dentry *dentry;
+ int err;
+
+ inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
+ dentry = lookup_one_qstr_excl(last, base, lookup_flags);
+ if (IS_ERR(dentry))
+ goto out;
+ err = -EEXIST;
+ if ((lookup_flags & LOOKUP_EXCL) && d_is_positive(dentry))
+ goto err;
+ err = -ENOENT;
+ if (!(lookup_flags & LOOKUP_CREATE) && d_is_negative(dentry))
+ goto err;
+ return dentry;
+
+err:
+ dput(dentry);
+ dentry = ERR_PTR(err);
+out:
+ inode_unlock(base->d_inode);
+ return dentry;
+}
+
/**
* lookup_fast - do fast lockless (but racy) lookup of a dentry
* @nd: current nameidata
@@ -2741,16 +2768,9 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
path_put(path);
return ERR_PTR(-EINVAL);
}
- inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
- d = lookup_one_qstr_excl(&last, path->dentry, 0);
- if (!IS_ERR(d) && d_is_negative(d)) {
- dput(d);
- d = ERR_PTR(-ENOENT);
- }
- if (IS_ERR(d)) {
- inode_unlock(path->dentry->d_inode);
+ d = lookup_and_lock(&last, path->dentry, 0);
+ if (IS_ERR(d))
path_put(path);
- }
return d;
}
@@ -3051,6 +3071,22 @@ struct dentry *lookup_positive_unlocked(const char *name,
}
EXPORT_SYMBOL(lookup_positive_unlocked);
+struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
+ const char *name, int len, struct dentry *base,
+ unsigned int lookup_flags)
+{
+ struct qstr this;
+ int err;
+
+ if (!idmap)
+ idmap = &nop_mnt_idmap;
+ err = lookup_one_common(idmap, name, base, len, &this);
+ if (err)
+ return ERR_PTR(err);
+ return lookup_and_lock(&this, base, lookup_flags);
+}
+EXPORT_SYMBOL(lookup_and_lock_one);
+
#ifdef CONFIG_UNIX98_PTYS
int path_pts(struct path *path)
{
@@ -4080,7 +4116,6 @@ static struct dentry *filename_create(int dfd, struct filename *name,
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
int type;
- int err2;
int error;
error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
@@ -4092,50 +4127,31 @@ static struct dentry *filename_create(int dfd, struct filename *name,
* (foo/., foo/.., /////)
*/
if (unlikely(type != LAST_NORM))
- goto out;
+ goto put;
/* don't fail immediately if it's r/o, at least try to report other errors */
- err2 = mnt_want_write(path->mnt);
+ error = mnt_want_write(path->mnt);
/*
* Do the final lookup. Suppress 'create' if there is a trailing
* '/', and a directory wasn't requested.
*/
if (last.name[last.len] && !want_dir)
- create_flags = 0;
- inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
- dentry = lookup_one_qstr_excl(&last, path->dentry,
- reval_flag | create_flags);
+ create_flags &= ~LOOKUP_CREATE;
+ dentry = lookup_and_lock(&last, path->dentry, reval_flag | create_flags);
if (IS_ERR(dentry))
- goto unlock;
+ goto drop;
- error = -EEXIST;
- if (d_is_positive(dentry))
- goto fail;
-
- /*
- * Special case - lookup gave negative, but... we had foo/bar/
- * From the vfs_mknod() POV we just have a negative dentry -
- * all is fine. Let's be bastards - you had / on the end, you've
- * been asking for (non-existent) directory. -ENOENT for you.
- */
- if (unlikely(!create_flags)) {
- error = -ENOENT;
- goto fail;
- }
- if (unlikely(err2)) {
- error = err2;
+ if (unlikely(error))
goto fail;
- }
return dentry;
fail:
d_lookup_done(dentry);
- dput(dentry);
+ done_lookup_and_lock(path->dentry, dentry);
dentry = ERR_PTR(error);
-unlock:
- inode_unlock(path->dentry->d_inode);
- if (!err2)
+drop:
+ if (!error)
mnt_drop_write(path->mnt);
-out:
+put:
path_put(path);
return dentry;
}
@@ -4555,23 +4571,18 @@ int do_rmdir(int dfd, struct filename *name)
if (error)
goto exit2;
- inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
- dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
+ dentry = lookup_and_lock(&last, path.dentry, lookup_flags);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit3;
- if (!dentry->d_inode) {
- error = -ENOENT;
- goto exit4;
- }
+
error = security_path_rmdir(&path, dentry);
if (error)
goto exit4;
error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
exit4:
- dput(dentry);
+ done_lookup_and_lock(path.dentry, dentry);
exit3:
- inode_unlock(path.dentry->d_inode);
mnt_drop_write(path.mnt);
exit2:
path_put(&path);
@@ -4691,13 +4702,11 @@ int do_unlinkat(int dfd, struct filename *name)
if (error)
goto exit2;
retry_deleg:
- inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
- dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
+ dentry = lookup_and_lock(&last, path.dentry, lookup_flags);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
-
/* Why not before? Because we want correct error value */
- if (last.name[last.len] || d_is_negative(dentry))
+ if (last.name[last.len])
goto slashes;
inode = dentry->d_inode;
ihold(inode);
@@ -4707,9 +4716,8 @@ int do_unlinkat(int dfd, struct filename *name)
error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
dentry, &delegated_inode);
exit3:
- dput(dentry);
+ done_lookup_and_lock(path.dentry, dentry);
}
- inode_unlock(path.dentry->d_inode);
if (inode)
iput(inode); /* truncate the inode here */
inode = NULL;
@@ -4731,9 +4739,7 @@ int do_unlinkat(int dfd, struct filename *name)
return error;
slashes:
- if (d_is_negative(dentry))
- error = -ENOENT;
- else if (d_is_dir(dentry))
+ if (d_is_dir(dentry))
error = -EISDIR;
else
error = -ENOTDIR;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 898fc8ba37e1..f882874a7b00 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -83,6 +83,9 @@ struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap,
struct dentry *lookup_one_positive_unlocked(struct mnt_idmap *idmap,
const char *name,
struct dentry *base, int len);
+struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
+ const char *name, int len, struct dentry *base,
+ unsigned int lookup_flags);
extern int follow_down_one(struct path *);
extern int follow_down(struct path *path, unsigned int flags);
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 08/11] VFS: add inode_dir_lock/unlock
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (6 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 07/11] VFS: introduce lookup_and_lock() NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-21 1:21 ` Hillf Danton
2024-12-20 2:54 ` [PATCH 09/11] VFS: re-pack DENTRY_ flags NeilBrown
` (4 subsequent siblings)
12 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
During the transition from providing exclusive locking on the directory
for directory modifying operation to providing exclusive locking only on
the dentry with a shared lock on the directory - we need an alternate
way to provide exclusion on the directory for file systems which haven't
been converted. This is provided by inode_dir_lock() and
inode_dir_inlock().
This uses a bit in i_state for locking, and wait_var_event_spinlock() for
waiting.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/inode.c | 3 ++
fs/namei.c | 81 +++++++++++++++++++++++++++++++++++++---------
include/linux/fs.h | 5 +++
3 files changed, 74 insertions(+), 15 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 6b4c77268fc0..9ba69837aa56 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -492,6 +492,8 @@ EXPORT_SYMBOL(address_space_init_once);
*/
void inode_init_once(struct inode *inode)
{
+ static struct lock_class_key __key;
+
memset(inode, 0, sizeof(*inode));
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
@@ -501,6 +503,7 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->i_sb_list);
__address_space_init_once(&inode->i_data);
i_size_ordered_init(inode);
+ lockdep_init_map(&inode->i_dirlock_map, "I_DIR_LOCKED", &__key, 0);
}
EXPORT_SYMBOL(inode_init_once);
diff --git a/fs/namei.c b/fs/namei.c
index 371c80902c59..68750b15dbf4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3364,6 +3364,34 @@ static inline umode_t vfs_prepare_mode(struct mnt_idmap *idmap,
return mode;
}
+static bool check_dir_locked(struct inode *dir)
+{
+ if (dir->i_state & I_DIR_LOCKED) {
+ dir->i_state |= I_DIR_LOCK_WAITER;
+ return true;
+ }
+ return false;
+}
+
+static void inode_lock_dir(struct inode *dir)
+{
+ lock_acquire_exclusive(&dir->i_dirlock_map, 0, 0, NULL, _THIS_IP_);
+ spin_lock(&dir->i_lock);
+ wait_var_event_spinlock(dir, !check_dir_locked(dir),
+ &dir->i_lock);
+ dir->i_state |= I_DIR_LOCKED;
+ spin_unlock(&dir->i_lock);
+}
+
+static void inode_unlock_dir(struct inode *dir)
+{
+ lock_map_release(&dir->i_dirlock_map);
+ spin_lock(&dir->i_lock);
+ dir->i_state &= ~(I_DIR_LOCKED | I_DIR_LOCK_WAITER);
+ wake_up_var_locked(dir, &dir->i_lock);
+ spin_unlock(&dir->i_lock);
+}
+
/**
* vfs_create - create new file
* @idmap: idmap of the mount the inode was found from
@@ -3396,10 +3424,13 @@ int vfs_create(struct mnt_idmap *idmap, struct inode *dir,
error = security_inode_create(dir, dentry, mode);
if (error)
return error;
- if (dir->i_op->create_shared)
+ if (dir->i_op->create_shared) {
error = dir->i_op->create_shared(idmap, dir, dentry, mode, want_excl);
- else
+ } else {
+ inode_lock_dir(dir);
error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
+ inode_unlock_dir(dir);
+ }
if (!error)
fsnotify_create(dir, dentry);
return error;
@@ -3699,16 +3730,19 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
file->f_mode |= FMODE_CREATED;
audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
- if (dir_inode->i_op->create_shared)
+ if (dir_inode->i_op->create_shared) {
error = dir_inode->i_op->create_shared(idmap, dir_inode,
dentry, mode,
open_flag & O_EXCL);
- else if (dir_inode->i_op->create)
+ } else if (dir_inode->i_op->create) {
+ inode_lock_dir(dir_inode);
error = dir_inode->i_op->create(idmap, dir_inode,
dentry, mode,
open_flag & O_EXCL);
- else
+ inode_unlock_dir(dir_inode);
+ } else {
error = -EACCES;
+ }
if (error)
goto out_dput;
}
@@ -4227,10 +4261,13 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (dir->i_op->mknod_shared)
+ if (dir->i_op->mknod_shared) {
error = dir->i_op->mknod_shared(idmap, dir, dentry, mode, dev);
- else
+ } else {
+ inode_lock_dir(dir);
error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
+ inode_unlock_dir(dir);
+ }
if (!error)
fsnotify_create(dir, dentry);
return error;
@@ -4360,7 +4397,9 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
else if (de)
dput(de);
} else {
+ inode_lock_dir(dir);
error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+ inode_unlock_dir(dir);
}
if (!error)
fsnotify_mkdir(dir, dentry);
@@ -4521,10 +4560,13 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
if (error)
goto out;
- if (dir->i_op->rmdir_shared)
+ if (dir->i_op->rmdir_shared) {
error = dir->i_op->rmdir_shared(dir, dentry);
- else
+ } else {
+ inode_lock_dir(dir);
error = dir->i_op->rmdir(dir, dentry);
+ inode_unlock_dir(dir);
+ }
if (error)
goto out;
@@ -4648,10 +4690,13 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
error = try_break_deleg(target, delegated_inode);
if (error)
goto out;
- if (dir->i_op->unlink_shared)
+ if (dir->i_op->unlink_shared) {
error = dir->i_op->unlink_shared(dir, dentry);
- else
+ } else {
+ inode_lock_dir(dir);
error = dir->i_op->unlink(dir, dentry);
+ inode_unlock_dir(dir);
+ }
if (!error) {
dont_mount(dentry);
detach_mounts(dentry);
@@ -4792,10 +4837,13 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if (dir->i_op->symlink_shared)
+ if (dir->i_op->symlink_shared) {
error = dir->i_op->symlink_shared(idmap, dir, dentry, oldname);
- else
+ } else {
+ inode_lock_dir(dir);
error = dir->i_op->symlink(idmap, dir, dentry, oldname);
+ inode_unlock_dir(dir);
+ }
if (!error)
fsnotify_create(dir, dentry);
return error;
@@ -4920,10 +4968,13 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
error = try_break_deleg(inode, delegated_inode);
if (error)
;
- else if (dir->i_op->link_shared)
+ else if (dir->i_op->link_shared) {
error = dir->i_op->link_shared(old_dentry, dir, new_dentry);
- else
+ } else {
+ inode_lock_dir(dir);
error = dir->i_op->link(old_dentry, dir, new_dentry);
+ inode_unlock_dir(dir);
+ }
}
if (!error && (inode->i_state & I_LINKABLE)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 68eba181175b..3ca92a54f28e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -722,6 +722,8 @@ struct inode {
void (*free_inode)(struct inode *);
};
struct file_lock_context *i_flctx;
+
+ struct lockdep_map i_dirlock_map; /* For tracking I_DIR_LOCKED locks */
struct address_space i_data;
struct list_head i_devices;
union {
@@ -2493,6 +2495,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
#define I_SYNC_QUEUED (1 << 16)
#define I_PINNING_NETFS_WB (1 << 17)
+#define I_DIR_LOCK_WAITER (1 << 30)
+#define I_DIR_LOCKED (1 << 31)
+
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 08/11] VFS: add inode_dir_lock/unlock
2024-12-20 2:54 ` [PATCH 08/11] VFS: add inode_dir_lock/unlock NeilBrown
@ 2024-12-21 1:21 ` Hillf Danton
2024-12-23 3:10 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Hillf Danton @ 2024-12-21 1:21 UTC (permalink / raw)
To: NeilBrown
Cc: Alexander Viro, Christian Brauner, Jan Kara, Peter Zijlstra,
Linus Torvalds, linux-fsdevel, linux-kernel
On Fri, 20 Dec 2024 13:54:26 +1100 NeilBrown <neilb@suse.de>
> During the transition from providing exclusive locking on the directory
> for directory modifying operation to providing exclusive locking only on
> the dentry with a shared lock on the directory - we need an alternate
> way to provide exclusion on the directory for file systems which haven't
> been converted. This is provided by inode_dir_lock() and
> inode_dir_inlock().
> This uses a bit in i_state for locking, and wait_var_event_spinlock() for
> waiting.
>
Inventing anything like mutex sounds bad.
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/inode.c | 3 ++
> fs/namei.c | 81 +++++++++++++++++++++++++++++++++++++---------
> include/linux/fs.h | 5 +++
> 3 files changed, 74 insertions(+), 15 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 6b4c77268fc0..9ba69837aa56 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -492,6 +492,8 @@ EXPORT_SYMBOL(address_space_init_once);
> */
> void inode_init_once(struct inode *inode)
> {
> + static struct lock_class_key __key;
> +
> memset(inode, 0, sizeof(*inode));
> INIT_HLIST_NODE(&inode->i_hash);
> INIT_LIST_HEAD(&inode->i_devices);
> @@ -501,6 +503,7 @@ void inode_init_once(struct inode *inode)
> INIT_LIST_HEAD(&inode->i_sb_list);
> __address_space_init_once(&inode->i_data);
> i_size_ordered_init(inode);
> + lockdep_init_map(&inode->i_dirlock_map, "I_DIR_LOCKED", &__key, 0);
> }
> EXPORT_SYMBOL(inode_init_once);
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 371c80902c59..68750b15dbf4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3364,6 +3364,34 @@ static inline umode_t vfs_prepare_mode(struct mnt_idmap *idmap,
> return mode;
> }
>
> +static bool check_dir_locked(struct inode *dir)
> +{
> + if (dir->i_state & I_DIR_LOCKED) {
> + dir->i_state |= I_DIR_LOCK_WAITER;
> + return true;
> + }
> + return false;
> +}
> +
> +static void inode_lock_dir(struct inode *dir)
> +{
> + lock_acquire_exclusive(&dir->i_dirlock_map, 0, 0, NULL, _THIS_IP_);
> + spin_lock(&dir->i_lock);
> + wait_var_event_spinlock(dir, !check_dir_locked(dir),
> + &dir->i_lock);
> + dir->i_state |= I_DIR_LOCKED;
> + spin_unlock(&dir->i_lock);
> +}
> +
> +static void inode_unlock_dir(struct inode *dir)
> +{
> + lock_map_release(&dir->i_dirlock_map);
> + spin_lock(&dir->i_lock);
> + dir->i_state &= ~(I_DIR_LOCKED | I_DIR_LOCK_WAITER);
> + wake_up_var_locked(dir, &dir->i_lock);
> + spin_unlock(&dir->i_lock);
> +}
> +
> /**
> * vfs_create - create new file
> * @idmap: idmap of the mount the inode was found from
> @@ -3396,10 +3424,13 @@ int vfs_create(struct mnt_idmap *idmap, struct inode *dir,
> error = security_inode_create(dir, dentry, mode);
> if (error)
> return error;
> - if (dir->i_op->create_shared)
> + if (dir->i_op->create_shared) {
> error = dir->i_op->create_shared(idmap, dir, dentry, mode, want_excl);
> - else
> + } else {
> + inode_lock_dir(dir);
> error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
> + inode_unlock_dir(dir);
> + }
> if (!error)
> fsnotify_create(dir, dentry);
> return error;
> @@ -3699,16 +3730,19 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> file->f_mode |= FMODE_CREATED;
> audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>
> - if (dir_inode->i_op->create_shared)
> + if (dir_inode->i_op->create_shared) {
> error = dir_inode->i_op->create_shared(idmap, dir_inode,
> dentry, mode,
> open_flag & O_EXCL);
> - else if (dir_inode->i_op->create)
> + } else if (dir_inode->i_op->create) {
> + inode_lock_dir(dir_inode);
> error = dir_inode->i_op->create(idmap, dir_inode,
> dentry, mode,
> open_flag & O_EXCL);
> - else
> + inode_unlock_dir(dir_inode);
> + } else {
> error = -EACCES;
> + }
> if (error)
> goto out_dput;
> }
> @@ -4227,10 +4261,13 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
> if (error)
> return error;
>
> - if (dir->i_op->mknod_shared)
> + if (dir->i_op->mknod_shared) {
> error = dir->i_op->mknod_shared(idmap, dir, dentry, mode, dev);
> - else
> + } else {
> + inode_lock_dir(dir);
> error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
> + inode_unlock_dir(dir);
> + }
> if (!error)
> fsnotify_create(dir, dentry);
> return error;
> @@ -4360,7 +4397,9 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> else if (de)
> dput(de);
> } else {
> + inode_lock_dir(dir);
> error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> + inode_unlock_dir(dir);
> }
> if (!error)
> fsnotify_mkdir(dir, dentry);
> @@ -4521,10 +4560,13 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> if (error)
> goto out;
>
> - if (dir->i_op->rmdir_shared)
> + if (dir->i_op->rmdir_shared) {
> error = dir->i_op->rmdir_shared(dir, dentry);
> - else
> + } else {
> + inode_lock_dir(dir);
> error = dir->i_op->rmdir(dir, dentry);
> + inode_unlock_dir(dir);
> + }
> if (error)
> goto out;
>
> @@ -4648,10 +4690,13 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
> error = try_break_deleg(target, delegated_inode);
> if (error)
> goto out;
> - if (dir->i_op->unlink_shared)
> + if (dir->i_op->unlink_shared) {
> error = dir->i_op->unlink_shared(dir, dentry);
> - else
> + } else {
> + inode_lock_dir(dir);
> error = dir->i_op->unlink(dir, dentry);
> + inode_unlock_dir(dir);
> + }
> if (!error) {
> dont_mount(dentry);
> detach_mounts(dentry);
> @@ -4792,10 +4837,13 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
> if (error)
> return error;
>
> - if (dir->i_op->symlink_shared)
> + if (dir->i_op->symlink_shared) {
> error = dir->i_op->symlink_shared(idmap, dir, dentry, oldname);
> - else
> + } else {
> + inode_lock_dir(dir);
> error = dir->i_op->symlink(idmap, dir, dentry, oldname);
> + inode_unlock_dir(dir);
> + }
> if (!error)
> fsnotify_create(dir, dentry);
> return error;
> @@ -4920,10 +4968,13 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
> error = try_break_deleg(inode, delegated_inode);
> if (error)
> ;
> - else if (dir->i_op->link_shared)
> + else if (dir->i_op->link_shared) {
> error = dir->i_op->link_shared(old_dentry, dir, new_dentry);
> - else
> + } else {
> + inode_lock_dir(dir);
> error = dir->i_op->link(old_dentry, dir, new_dentry);
> + inode_unlock_dir(dir);
> + }
> }
>
> if (!error && (inode->i_state & I_LINKABLE)) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 68eba181175b..3ca92a54f28e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -722,6 +722,8 @@ struct inode {
> void (*free_inode)(struct inode *);
> };
> struct file_lock_context *i_flctx;
> +
> + struct lockdep_map i_dirlock_map; /* For tracking I_DIR_LOCKED locks */
The cost of this map says no to any attempt inventing mutex in any form.
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 08/11] VFS: add inode_dir_lock/unlock
2024-12-21 1:21 ` Hillf Danton
@ 2024-12-23 3:10 ` NeilBrown
2024-12-23 11:12 ` Hillf Danton
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-23 3:10 UTC (permalink / raw)
To: Hillf Danton
Cc: Alexander Viro, Christian Brauner, Jan Kara, Peter Zijlstra,
Linus Torvalds, linux-fsdevel, linux-kernel
On Sat, 21 Dec 2024, Hillf Danton wrote:
> On Fri, 20 Dec 2024 13:54:26 +1100 NeilBrown <neilb@suse.de>
> > During the transition from providing exclusive locking on the directory
> > for directory modifying operation to providing exclusive locking only on
> > the dentry with a shared lock on the directory - we need an alternate
> > way to provide exclusion on the directory for file systems which haven't
> > been converted. This is provided by inode_dir_lock() and
> > inode_dir_inlock().
> > This uses a bit in i_state for locking, and wait_var_event_spinlock() for
> > waiting.
> >
> Inventing anything like mutex sounds bad.
In general I would agree. But when the cost of adding a mutex exceeds
the cost of using an alternate solution that only requires 2 bits, I
think the alternate solution is justified.
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -722,6 +722,8 @@ struct inode {
> > void (*free_inode)(struct inode *);
> > };
> > struct file_lock_context *i_flctx;
> > +
> > + struct lockdep_map i_dirlock_map; /* For tracking I_DIR_LOCKED locks */
>
> The cost of this map says no to any attempt inventing mutex in any form.
>
"struct lockdep_map" is size-zero when lockdep is not enabled. And when
it is enabled we accept the cost of larger structures to benefit from
deadlock detection.
So I don't think this is a sound argument.
Thanks for the review,
NeilBrown
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 08/11] VFS: add inode_dir_lock/unlock
2024-12-23 3:10 ` NeilBrown
@ 2024-12-23 11:12 ` Hillf Danton
2024-12-23 20:36 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Hillf Danton @ 2024-12-23 11:12 UTC (permalink / raw)
To: NeilBrown
Cc: Alexander Viro, Christian Brauner, Jan Kara, Peter Zijlstra,
Linus Torvalds, linux-fsdevel, linux-kernel
On Mon, 23 Dec 2024 14:10:07 +1100 NeilBrown <neilb@suse.de>
> On Sat, 21 Dec 2024, Hillf Danton wrote:
> > Inventing anything like mutex sounds bad.
>
> In general I would agree. But when the cost of adding a mutex exceeds
> the cost of using an alternate solution that only requires 2 bits, I
> think the alternate solution is justified.
>
Inode deserves more than the 2 bits before such a solution is able to
rework mutex.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 08/11] VFS: add inode_dir_lock/unlock
2024-12-23 11:12 ` Hillf Danton
@ 2024-12-23 20:36 ` NeilBrown
2024-12-24 10:26 ` Hillf Danton
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-23 20:36 UTC (permalink / raw)
To: Hillf Danton
Cc: Alexander Viro, Christian Brauner, Jan Kara, Peter Zijlstra,
Linus Torvalds, linux-fsdevel, linux-kernel
On Mon, 23 Dec 2024, Hillf Danton wrote:
> On Mon, 23 Dec 2024 14:10:07 +1100 NeilBrown <neilb@suse.de>
> > On Sat, 21 Dec 2024, Hillf Danton wrote:
> > > Inventing anything like mutex sounds bad.
> >
> > In general I would agree. But when the cost of adding a mutex exceeds
> > the cost of using an alternate solution that only requires 2 bits, I
> > think the alternate solution is justified.
> >
> Inode deserves more than the 2 bits before such a solution is able to
> rework mutex.
I'm sorry but I don't understand what you are saying. Could you please
give more details about your concern?
Are you concerned about correctness? Performance? Maintainability?
Something else?
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 08/11] VFS: add inode_dir_lock/unlock
2024-12-23 20:36 ` NeilBrown
@ 2024-12-24 10:26 ` Hillf Danton
0 siblings, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2024-12-24 10:26 UTC (permalink / raw)
To: NeilBrown
Cc: Alexander Viro, Christian Brauner, Jan Kara, Peter Zijlstra,
Linus Torvalds, linux-fsdevel, linux-kernel
On Tue, 24 Dec 2024 07:36:08 +1100 NeilBrown <neilb@suse.de>
> On Mon, 23 Dec 2024, Hillf Danton wrote:
> > On Mon, 23 Dec 2024 14:10:07 +1100 NeilBrown <neilb@suse.de>
> > > On Sat, 21 Dec 2024, Hillf Danton wrote:
> > > > Inventing anything like mutex sounds bad.
> > >
> > > In general I would agree. But when the cost of adding a mutex exceeds
> > > the cost of using an alternate solution that only requires 2 bits, I
> > > think the alternate solution is justified.
> > >
> > Inode deserves more than the 2 bits before such a solution is able to
> > rework mutex.
>
> I'm sorry but I don't understand what you are saying. Could you please
> give more details about your concern?
> Are you concerned about correctness? Performance? Maintainability?
> Something else?
>
It is that you are adding a pair of honey bee wings to A380, so no takeoff
can be expected.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 09/11] VFS: re-pack DENTRY_ flags.
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (7 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 08/11] VFS: add inode_dir_lock/unlock NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 2:54 ` [PATCH 10/11] VFS: take a shared lock for create/remove directory operations NeilBrown
` (3 subsequent siblings)
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
Bits 13, 23, 24, and 27 are not used. Move all those holes to the end.
Signed-off-by: NeilBrown <neilb@suse.de>
---
include/linux/dcache.h | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index b64c0260e4be..fc7f571bd5bb 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -191,34 +191,34 @@ struct dentry_operations {
#define DCACHE_NFSFS_RENAMED BIT(12)
/* this dentry has been "silly renamed" and has to be deleted on the last
* dput() */
-#define DCACHE_FSNOTIFY_PARENT_WATCHED BIT(14)
+#define DCACHE_FSNOTIFY_PARENT_WATCHED BIT(13)
/* Parent inode is watched by some fsnotify listener */
-#define DCACHE_DENTRY_KILLED BIT(15)
+#define DCACHE_DENTRY_KILLED BIT(14)
-#define DCACHE_MOUNTED BIT(16) /* is a mountpoint */
-#define DCACHE_NEED_AUTOMOUNT BIT(17) /* handle automount on this dir */
-#define DCACHE_MANAGE_TRANSIT BIT(18) /* manage transit from this dirent */
+#define DCACHE_MOUNTED BIT(15) /* is a mountpoint */
+#define DCACHE_NEED_AUTOMOUNT BIT(16) /* handle automount on this dir */
+#define DCACHE_MANAGE_TRANSIT BIT(17) /* manage transit from this dirent */
#define DCACHE_MANAGED_DENTRY \
(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
-#define DCACHE_LRU_LIST BIT(19)
+#define DCACHE_LRU_LIST BIT(18)
-#define DCACHE_ENTRY_TYPE (7 << 20) /* bits 20..22 are for storing type: */
-#define DCACHE_MISS_TYPE (0 << 20) /* Negative dentry */
-#define DCACHE_WHITEOUT_TYPE (1 << 20) /* Whiteout dentry (stop pathwalk) */
-#define DCACHE_DIRECTORY_TYPE (2 << 20) /* Normal directory */
-#define DCACHE_AUTODIR_TYPE (3 << 20) /* Lookupless directory (presumed automount) */
-#define DCACHE_REGULAR_TYPE (4 << 20) /* Regular file type */
-#define DCACHE_SPECIAL_TYPE (5 << 20) /* Other file type */
-#define DCACHE_SYMLINK_TYPE (6 << 20) /* Symlink */
+#define DCACHE_ENTRY_TYPE (7 << 19) /* bits 19..21 are for storing type: */
+#define DCACHE_MISS_TYPE (0 << 19) /* Negative dentry */
+#define DCACHE_WHITEOUT_TYPE (1 << 19) /* Whiteout dentry (stop pathwalk) */
+#define DCACHE_DIRECTORY_TYPE (2 << 19) /* Normal directory */
+#define DCACHE_AUTODIR_TYPE (3 << 19) /* Lookupless directory (presumed automount) */
+#define DCACHE_REGULAR_TYPE (4 << 19) /* Regular file type */
+#define DCACHE_SPECIAL_TYPE (5 << 19) /* Other file type */
+#define DCACHE_SYMLINK_TYPE (6 << 19) /* Symlink */
-#define DCACHE_NOKEY_NAME BIT(25) /* Encrypted name encoded without key */
-#define DCACHE_OP_REAL BIT(26)
+#define DCACHE_NOKEY_NAME BIT(22) /* Encrypted name encoded without key */
+#define DCACHE_OP_REAL BIT(23)
-#define DCACHE_PAR_LOOKUP BIT(28) /* being looked up (with parent locked shared) */
-#define DCACHE_DENTRY_CURSOR BIT(29)
-#define DCACHE_NORCU BIT(30) /* No RCU delay for freeing */
+#define DCACHE_PAR_LOOKUP BIT(24) /* being looked up (with parent locked shared) */
+#define DCACHE_DENTRY_CURSOR BIT(25)
+#define DCACHE_NORCU BIT(26) /* No RCU delay for freeing */
extern seqlock_t rename_lock;
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 10/11] VFS: take a shared lock for create/remove directory operations.
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (8 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 09/11] VFS: re-pack DENTRY_ flags NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-23 5:19 ` Al Viro
2024-12-20 2:54 ` [PATCH 11/11] nfsd: use lookup_and_lock_one() NeilBrown
` (2 subsequent siblings)
12 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
With this patch the VFS takes a shared lock on the directory (i_rwsem)
when performing create or remove operations. Rename is as yet
unchanged.
Not all callers are changed, only the common ones in fs/namei.c
While the directory only has a shared lock, the dentry being updated has
an exclusive lock using a bit in ->d_flags. Waiters use
wait_var_event_spinlock(), and a wakeup is only sent in the unusual case
that some other task is actually waiting - indicated by another d_flags
bit.
Once the exclusive "update" lock is obtained on the dentry we must make
sure it wasn't unlinked or renamed while we slept. If it was we repeat
the lookup.
The filesystem operations that expect an exclusive lock are still
provided with exclusion, but this is handled by inode_dir_lock().
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/dcache.c | 9 ++++++-
fs/namei.c | 53 ++++++++++++++++++++++++++++++++++++++----
include/linux/dcache.h | 4 ++++
3 files changed, 60 insertions(+), 6 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index ebe849474bd8..3fb3af83add5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1636,9 +1636,10 @@ EXPORT_SYMBOL(d_invalidate);
* available. On a success the dentry is returned. The name passed in is
* copied and the copy passed in may be reused after this call.
*/
-
+
static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
{
+ static struct lock_class_key __key;
struct dentry *dentry;
char *dname;
int err;
@@ -1697,6 +1698,8 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_HLIST_NODE(&dentry->d_sib);
d_set_d_op(dentry, dentry->d_sb->s_d_op);
+ lockdep_init_map(&dentry->d_update_map, "DCACHE_PAR_UPDATE", &__key, 0);
+
if (dentry->d_op && dentry->d_op->d_init) {
err = dentry->d_op->d_init(dentry);
if (err) {
@@ -3030,6 +3033,10 @@ static int __d_unalias(struct dentry *dentry, struct dentry *alias)
* In that case, we know that the inode will be a regular file, and also this
* will only occur during atomic_open. So we need to check for the dentry
* being already hashed only in the final case.
+ *
+ * @dentry must have a valid ->d_parent and that directory must be
+ * locked (i_rwsem) either exclusively or shared. If shared then
+ * @dentry must have %DCACHE_PAR_LOOKUP or %DCACHE_PAR_UPDATE set.
*/
struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
{
diff --git a/fs/namei.c b/fs/namei.c
index 68750b15dbf4..fb40ae64dc8d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1703,6 +1703,15 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
}
EXPORT_SYMBOL(lookup_one_qstr_excl);
+static bool check_dentry_locked(struct dentry *de)
+{
+ if (de->d_flags & DCACHE_PAR_UPDATE) {
+ de->d_flags |= DCACHE_PAR_WAITER;
+ return true;
+ }
+ return false;
+}
+
static struct dentry *lookup_and_lock(const struct qstr *last,
struct dentry *base,
unsigned int lookup_flags)
@@ -1710,10 +1719,36 @@ static struct dentry *lookup_and_lock(const struct qstr *last,
struct dentry *dentry;
int err;
- inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(base->d_inode, I_MUTEX_PARENT);
+retry:
dentry = lookup_one_qstr_excl(last, base, lookup_flags);
if (IS_ERR(dentry))
goto out;
+ lock_acquire_exclusive(&dentry->d_update_map, 0, 0, NULL, _THIS_IP_);
+ spin_lock(&dentry->d_lock);
+ wait_var_event_spinlock(&dentry->d_flags,
+ !check_dentry_locked(dentry),
+ &dentry->d_lock);
+ if (d_is_positive(dentry)) {
+ rcu_read_lock(); /* needed for d_same_name() */
+ if (
+ /* Was unlinked while we waited ?*/
+ d_unhashed(dentry) ||
+ /* Or was dentry renamed ?? */
+ dentry->d_parent != base ||
+ dentry->d_name.hash != last->hash ||
+ !d_same_name(dentry, base, last)
+ ) {
+ rcu_read_unlock();
+ spin_unlock(&dentry->d_lock);
+ lock_map_release(&dentry->d_update_map);
+ dput(dentry);
+ goto retry;
+ }
+ rcu_read_unlock();
+ }
+ dentry->d_flags |= DCACHE_PAR_UPDATE;
+ spin_unlock(&dentry->d_lock);
err = -EEXIST;
if ((lookup_flags & LOOKUP_EXCL) && d_is_positive(dentry))
goto err;
@@ -1723,10 +1758,11 @@ static struct dentry *lookup_and_lock(const struct qstr *last,
return dentry;
err:
- dput(dentry);
- dentry = ERR_PTR(err);
+ done_lookup_and_lock(base, dentry);
+ return ERR_PTR(err);
+
out:
- inode_unlock(base->d_inode);
+ inode_unlock_shared(base->d_inode);
return dentry;
}
@@ -2795,8 +2831,15 @@ EXPORT_SYMBOL(user_path_locked_at);
void done_lookup_and_lock(struct dentry *parent, struct dentry *child)
{
+ lock_map_release(&child->d_update_map);
+ spin_lock(&child->d_lock);
+ if (child->d_flags & DCACHE_PAR_WAITER)
+ wake_up_var_locked(&child->d_flags, &child->d_lock);
+ child->d_flags &= ~(DCACHE_PAR_UPDATE | DCACHE_PAR_WAITER);
+ spin_unlock(&child->d_lock);
+
+ inode_unlock_shared(parent->d_inode);
dput(child);
- inode_unlock(d_inode(parent));
}
EXPORT_SYMBOL(done_lookup_and_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index fc7f571bd5bb..6d404c296ac0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -102,6 +102,8 @@ struct dentry {
* possible!
*/
+ /* lockdep tracking of DCACHE_PAR_UPDATE locks */
+ struct lockdep_map d_update_map;
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
@@ -220,6 +222,8 @@ struct dentry_operations {
#define DCACHE_DENTRY_CURSOR BIT(25)
#define DCACHE_NORCU BIT(26) /* No RCU delay for freeing */
+#define DCACHE_PAR_UPDATE BIT(27) /* Locked for update */
+#define DCACHE_PAR_WAITER BIT(28) /* someone is waiting for PAR_UPDATE */
extern seqlock_t rename_lock;
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 10/11] VFS: take a shared lock for create/remove directory operations.
2024-12-20 2:54 ` [PATCH 10/11] VFS: take a shared lock for create/remove directory operations NeilBrown
@ 2024-12-23 5:19 ` Al Viro
2024-12-23 7:11 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2024-12-23 5:19 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Fri, Dec 20, 2024 at 01:54:28PM +1100, NeilBrown wrote:
> Once the exclusive "update" lock is obtained on the dentry we must make
> sure it wasn't unlinked or renamed while we slept. If it was we repeat
> the lookup.
> + if (
> + /* Was unlinked while we waited ?*/
> + d_unhashed(dentry) ||
> + /* Or was dentry renamed ?? */
> + dentry->d_parent != base ||
> + dentry->d_name.hash != last->hash ||
> + !d_same_name(dentry, base, last)
> + ) {
> + rcu_read_unlock();
> + spin_unlock(&dentry->d_lock);
> + lock_map_release(&dentry->d_update_map);
> + dput(dentry);
> + goto retry;
> + }
> + rcu_read_unlock();
> + }
> + dentry->d_flags |= DCACHE_PAR_UPDATE;
> + spin_unlock(&dentry->d_lock);
... and now __d_unalias() moves it to new place, making all the checks
you've just done completely useless.
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 10/11] VFS: take a shared lock for create/remove directory operations.
2024-12-23 5:19 ` Al Viro
@ 2024-12-23 7:11 ` NeilBrown
2024-12-23 7:26 ` Al Viro
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2024-12-23 7:11 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Mon, 23 Dec 2024, Al Viro wrote:
> On Fri, Dec 20, 2024 at 01:54:28PM +1100, NeilBrown wrote:
>
> > Once the exclusive "update" lock is obtained on the dentry we must make
> > sure it wasn't unlinked or renamed while we slept. If it was we repeat
> > the lookup.
>
> > + if (
> > + /* Was unlinked while we waited ?*/
> > + d_unhashed(dentry) ||
> > + /* Or was dentry renamed ?? */
> > + dentry->d_parent != base ||
> > + dentry->d_name.hash != last->hash ||
> > + !d_same_name(dentry, base, last)
> > + ) {
> > + rcu_read_unlock();
> > + spin_unlock(&dentry->d_lock);
> > + lock_map_release(&dentry->d_update_map);
> > + dput(dentry);
> > + goto retry;
> > + }
> > + rcu_read_unlock();
> > + }
> > + dentry->d_flags |= DCACHE_PAR_UPDATE;
> > + spin_unlock(&dentry->d_lock);
>
> ... and now __d_unalias() moves it to new place, making all the checks
> you've just done completely useless.
>
... Yes, thanks.
So I need __d_unalias() to effectively do a "try_lock" of
DCACHE_PAR_UPDATE and hold the lock across __d_move().
I think that would address the concerned you raised.
Did you see anything else that might be problematic?
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 10/11] VFS: take a shared lock for create/remove directory operations.
2024-12-23 7:11 ` NeilBrown
@ 2024-12-23 7:26 ` Al Viro
2024-12-23 20:40 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2024-12-23 7:26 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Mon, Dec 23, 2024 at 06:11:16PM +1100, NeilBrown wrote:
> ... Yes, thanks.
>
> So I need __d_unalias() to effectively do a "try_lock" of
> DCACHE_PAR_UPDATE and hold the lock across __d_move().
> I think that would address the concerned you raised.
>
> Did you see anything else that might be problematic?
That might work with ->d_parent, but it won't help with ->d_name
in same-parent case of __d_unalias()...
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 10/11] VFS: take a shared lock for create/remove directory operations.
2024-12-23 7:26 ` Al Viro
@ 2024-12-23 20:40 ` NeilBrown
0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-23 20:40 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Mon, 23 Dec 2024, Al Viro wrote:
> On Mon, Dec 23, 2024 at 06:11:16PM +1100, NeilBrown wrote:
> > ... Yes, thanks.
> >
> > So I need __d_unalias() to effectively do a "try_lock" of
> > DCACHE_PAR_UPDATE and hold the lock across __d_move().
> > I think that would address the concerned you raised.
> >
> > Did you see anything else that might be problematic?
>
> That might work with ->d_parent, but it won't help with ->d_name
> in same-parent case of __d_unalias()...
>
Why would the same-parent case be any different?
Certainly it doesn't need s_vfs_rename_mutex and it there is no second
parent to get a shared lock on. But we would still need to set
DCACHE_PAR_UPDATE under ->d_lock on "alias". If we found that it was
already set and instead failed with -ESTALE, that would prevent
__d_unalias from changing anything including ->d_name after
lookup_and_lock has checked that the parent and d_name are unchanged
(until done_lookup_and_lock is called of course).
What am I missing?
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 11/11] nfsd: use lookup_and_lock_one()
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (9 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 10/11] VFS: take a shared lock for create/remove directory operations NeilBrown
@ 2024-12-20 2:54 ` NeilBrown
2024-12-20 20:55 ` [PATCH 00/11 RFC] Allow concurrent changes in a directory Andreas Dilger
2024-12-23 5:22 ` Al Viro
12 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2024-12-20 2:54 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds
Cc: linux-fsdevel, linux-kernel
nfsd now used looksup_and_lock_one() when creating/removing names in the
exported filesystem.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/nfsproc.c | 12 ++++-----
fs/nfsd/vfs.c | 67 +++++++++++++++--------------------------------
2 files changed, 27 insertions(+), 52 deletions(-)
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 6dda081eb24c..11ad710a0853 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -311,17 +311,16 @@ nfsd_proc_create(struct svc_rqst *rqstp)
goto done;
}
- inode_lock_nested(dirfhp->fh_dentry->d_inode, I_MUTEX_PARENT);
- dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
+ dchild = lookup_and_lock_one(NULL, argp->name, argp->len,
+ dirfhp->fh_dentry, LOOKUP_CREATE);
if (IS_ERR(dchild)) {
resp->status = nfserrno(PTR_ERR(dchild));
- goto out_unlock;
+ goto put_write;
}
fh_init(newfhp, NFS_FHSIZE);
resp->status = fh_compose(newfhp, dirfhp->fh_export, dchild, dirfhp);
if (!resp->status && d_really_is_negative(dchild))
resp->status = nfserr_noent;
- dput(dchild);
if (resp->status) {
if (resp->status != nfserr_noent)
goto out_unlock;
@@ -331,7 +330,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
*/
resp->status = nfserr_acces;
if (!newfhp->fh_dentry) {
- printk(KERN_WARNING
+ printk(KERN_WARNING
"nfsd_proc_create: file handle not verified\n");
goto out_unlock;
}
@@ -427,7 +426,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
}
out_unlock:
- inode_unlock(dirfhp->fh_dentry->d_inode);
+ done_lookup_and_lock(dirfhp->fh_dentry, dchild);
+put_write:
fh_drop_write(dirfhp);
done:
fh_put(dirfhp);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 740332413138..011fb68bfa4b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1551,19 +1551,13 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (host_err)
return nfserrno(host_err);
- inode_lock_nested(dentry->d_inode, I_MUTEX_PARENT);
- dchild = lookup_one_len(fname, dentry, flen);
+ dchild = lookup_and_lock_one(NULL, fname, flen, dentry, LOOKUP_CREATE);
host_err = PTR_ERR(dchild);
if (IS_ERR(dchild)) {
err = nfserrno(host_err);
- goto out_unlock;
+ goto out;
}
err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
- /*
- * We unconditionally drop our ref to dchild as fh_compose will have
- * already grabbed its own ref for it.
- */
- dput(dchild);
if (err)
goto out_unlock;
err = fh_fill_pre_attrs(fhp);
@@ -1572,7 +1566,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
err = nfsd_create_locked(rqstp, fhp, attrs, type, rdev, resfhp);
fh_fill_post_attrs(fhp);
out_unlock:
- inode_unlock(dentry->d_inode);
+ done_lookup_and_lock(dentry, dchild);
+out:
return err;
}
@@ -1656,8 +1651,7 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
dentry = fhp->fh_dentry;
- inode_lock_nested(dentry->d_inode, I_MUTEX_PARENT);
- dnew = lookup_one_len(fname, dentry, flen);
+ dnew = lookup_and_lock_one(NULL, fname, flen, dentry, LOOKUP_CREATE);
if (IS_ERR(dnew)) {
err = nfserrno(PTR_ERR(dnew));
inode_unlock(dentry->d_inode);
@@ -1673,11 +1667,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
nfsd_create_setattr(rqstp, fhp, resfhp, attrs);
fh_fill_post_attrs(fhp);
out_unlock:
- inode_unlock(dentry->d_inode);
+ done_lookup_and_lock(dentry, dnew);
if (!err)
err = nfserrno(commit_metadata(fhp));
- dput(dnew);
- if (err==0) err = cerr;
+ if (err==0)
+ err = cerr;
out_drop_write:
fh_drop_write(fhp);
out:
@@ -1721,43 +1715,35 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
ddir = ffhp->fh_dentry;
dirp = d_inode(ddir);
- inode_lock_nested(dirp, I_MUTEX_PARENT);
-
- dnew = lookup_one_len(name, ddir, len);
+ dnew = lookup_and_lock_one(NULL, name, len, ddir, LOOKUP_CREATE);
if (IS_ERR(dnew)) {
- err = nfserrno(PTR_ERR(dnew));
- goto out_unlock;
+ err = PTR_ERR(dnew);
+ goto out_drop_write;
}
dold = tfhp->fh_dentry;
err = nfserr_noent;
if (d_really_is_negative(dold))
- goto out_dput;
+ goto out_unlock;
err = fh_fill_pre_attrs(ffhp);
if (err != nfs_ok)
- goto out_dput;
+ goto out_unlock;
host_err = vfs_link(dold, &nop_mnt_idmap, dirp, dnew, NULL);
fh_fill_post_attrs(ffhp);
- inode_unlock(dirp);
- if (!host_err) {
+out_unlock:
+ done_lookup_and_lock(ddir, dnew);
+ if (!err && !host_err) {
err = nfserrno(commit_metadata(ffhp));
if (!err)
err = nfserrno(commit_metadata(tfhp));
- } else {
+ } else if (!err) {
err = nfserrno(host_err);
}
- dput(dnew);
out_drop_write:
fh_drop_write(tfhp);
out:
return err;
-
-out_dput:
- dput(dnew);
-out_unlock:
- inode_unlock(dirp);
- goto out_drop_write;
}
static void
@@ -1943,18 +1929,11 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
dentry = fhp->fh_dentry;
dirp = d_inode(dentry);
- inode_lock_nested(dirp, I_MUTEX_PARENT);
-
- rdentry = lookup_one_len(fname, dentry, flen);
+ rdentry = lookup_and_lock_one(NULL, fname, flen, dentry, 0);
host_err = PTR_ERR(rdentry);
if (IS_ERR(rdentry))
- goto out_unlock;
+ goto out_drop_write;
- if (d_really_is_negative(rdentry)) {
- dput(rdentry);
- host_err = -ENOENT;
- goto out_unlock;
- }
rinode = d_inode(rdentry);
err = fh_fill_pre_attrs(fhp);
if (err != nfs_ok)
@@ -1981,11 +1960,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
host_err = vfs_rmdir(&nop_mnt_idmap, dirp, rdentry);
}
fh_fill_post_attrs(fhp);
-
- inode_unlock(dirp);
+out_unlock:
+ done_lookup_and_lock(dentry, rdentry);
if (!host_err)
host_err = commit_metadata(fhp);
- dput(rdentry);
iput(rinode); /* truncate the inode here */
out_drop_write:
@@ -2001,9 +1979,6 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
}
out:
return err;
-out_unlock:
- inode_unlock(dirp);
- goto out_drop_write;
}
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 00/11 RFC] Allow concurrent changes in a directory
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (10 preceding siblings ...)
2024-12-20 2:54 ` [PATCH 11/11] nfsd: use lookup_and_lock_one() NeilBrown
@ 2024-12-20 20:55 ` Andreas Dilger
2024-12-23 5:22 ` Al Viro
12 siblings, 0 replies; 26+ messages in thread
From: Andreas Dilger @ 2024-12-20 20:55 UTC (permalink / raw)
To: NeilBrown
Cc: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
linux-fsdevel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2091 bytes --]
On Dec 19, 2024, at 7:54 PM, NeilBrown <neilb@suse.de> wrote:
>
> A while ago I posted a patchset with a similar goal as this:
>
> https://lore.kernel.org/all/166147828344.25420.13834885828450967910.stgit@noble.brown/
>
> and recieved useful feedback. Here is a new version.
>
> This version is not complete. It does not change rename and does not
> change any filesystem to make use of the new opportunity for
> parallelism. I'll work on those once the bases functionality is agreed
> on.
>
> With this series, instead of a filesystem setting a flag to indiciate
> that parallel updates are support, there are now a new set of inode
> operations with a _shared prefix. If a directory provides a _shared
> interface it will be used with a shared lock on the inode, else the
> current interface will be used with an exclusive lock.
Hi Neil, thanks for the patch. One minor nit for the next revision
of the cover letter:
> Another motivation is lustre which
> can use a modified ext4 as the storage backend. One of the current
> modification is to allow concurrent updates in a directory as lustre uses a flat directory structure to store data.
This isn't really correct. Lustre uses a directory tree for the
namespace, but directories might become very large in some cases
with 1M+ cores working in a single directory (hey, I don't write
the applications, I just need to deal with them). The servers will
only have 500-2000 threads working on a single directory, but the
fine-grained locking on the servers is definitely a big win.
Being able to have parallel locking on the client VFS side would
also be a win, given that large nodes commonly have 192 or 256
cores/threads today. We know parallel directory locking will be
a win because mounting the filesystem multiple times on a single
client (which the VFS treats as multiple separate filesystems)
and running a multi-threaded benchmark in each mount in parallel
is considerably faster than running the same number of threads in
a single mountpoint.
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 00/11 RFC] Allow concurrent changes in a directory
2024-12-20 2:54 [PATCH 00/11 RFC] Allow concurrent changes in a directory NeilBrown
` (11 preceding siblings ...)
2024-12-20 20:55 ` [PATCH 00/11 RFC] Allow concurrent changes in a directory Andreas Dilger
@ 2024-12-23 5:22 ` Al Viro
12 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2024-12-23 5:22 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Jan Kara, Linus Torvalds, linux-fsdevel,
linux-kernel
On Fri, Dec 20, 2024 at 01:54:18PM +1100, NeilBrown wrote:
> Not all calling code has been converted. Some callers outside of
> fs/namei.c still take an exclusive lock with i_rw_sem. Some might never
> be changed.
>
> As yet this has only been lightly tested as I haven't add foo_shared
> operations to any filesystem yet.
It's still fundamentally broken. You assume that shared lock on
parent is enough to prevent the parent/name changes of children.
That assumption is simply not true.
^ permalink raw reply [flat|nested] 26+ messages in thread