linux-fsdevel.vger.kernel.org archive mirror
* [PATCHES] fdtable series v2
@ 2024-08-22  0:20 Al Viro
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
  2024-10-07 17:39 ` [PATCHES] fdtable series v3 Al Viro
  0 siblings, 2 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

	close_range() fix added at the beginning of the series,
dup_fd() calling convention change folded into it (as requested
by Linus), the rest rebased on top of that.
	sane_fdtable_size() change is dropped, as it's obsoleted
by the close_range() fix.  Several patches added at the end of
the series.

	Same branch -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.fdtable
with individual patches in followups.  If nobody objects, I'm adding
that pile to #for-next.

Shortlog:

Al Viro (9):
      close_range(): fix the logics in descriptor table trimming
      get rid of ...lookup...fdget_rcu() family
      remove pointless includes of <linux/fdtable.h>
      close_files(): don't bother with xchg()
      move close_range(2) into fs/file.c, fold __close_range() into it
      alloc_fdtable(): change calling conventions.
      file.c: merge __{set,clear}_close_on_exec()
      make __set_open_fd() set cloexec state as well
      expand_files(): simplify calling conventions

Yu Ma (3):
      fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
      fs/file.c: conditionally clear full_fds
      fs/file.c: add fast path in find_next_fd()

Diffstat:
 arch/powerpc/platforms/cell/spufs/coredump.c |   4 +-
 fs/fcntl.c                                   |   1 -
 fs/file.c                                    | 291 ++++++++++-----------------
 fs/file_table.c                              |   1 -
 fs/gfs2/glock.c                              |  12 +-
 fs/notify/dnotify/dnotify.c                  |   5 +-
 fs/notify/fanotify/fanotify.c                |   1 -
 fs/notify/fanotify/fanotify_user.c           |   1 -
 fs/open.c                                    |  17 --
 fs/overlayfs/copy_up.c                       |   1 -
 fs/proc/base.c                               |   1 -
 fs/proc/fd.c                                 |  12 +-
 include/linux/fdtable.h                      |  13 +-
 include/linux/file.h                         |   1 +
 io_uring/io_uring.c                          |   1 -
 kernel/bpf/bpf_inode_storage.c               |   1 -
 kernel/bpf/bpf_task_storage.c                |   1 -
 kernel/bpf/task_iter.c                       |   6 +-
 kernel/bpf/token.c                           |   1 -
 kernel/exit.c                                |   1 -
 kernel/fork.c                                |  32 ++-
 kernel/kcmp.c                                |   4 +-
 kernel/module/dups.c                         |   1 -
 kernel/module/kmod.c                         |   1 -
 kernel/umh.c                                 |   1 -
 net/handshake/request.c                      |   1 -
 security/apparmor/domain.c                   |   1 -
 27 files changed, 136 insertions(+), 277 deletions(-)


* [PATCH 01/12] close_range(): fix the logics in descriptor table trimming
  2024-08-22  0:20 [PATCHES] fdtable series v2 Al Viro
@ 2024-08-22  0:22 ` Al Viro
  2024-08-22  0:22   ` [PATCH 02/12] get rid of ...lookup...fdget_rcu() family Al Viro
                     ` (10 more replies)
  2024-10-07 17:39 ` [PATCHES] fdtable series v3 Al Viro
  1 sibling, 11 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

Cloning a descriptor table picks the size that would cover all currently
opened files.  That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
	close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor happened
to be open before our call took it out.
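
For illustration, a minimal userspace sketch of the scenario above (the
raw syscall(2) invocation and the particular descriptor number are
assumptions made for the example, not anything the patch relies on):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/close_range.h>

	int main(void)
	{
		/* grow the descriptor table; 1000 is arbitrary, but has to
		 * stay under RLIMIT_NOFILE for dup2() to succeed */
		dup2(0, 1000);
		/* unshare and keep only 0..2; there is no reason for the
		 * cloned private table to be sized to cover fd 1000 */
		syscall(__NR_close_range, 3, ~0U, CLOSE_RANGE_UNSHARE);
		return 0;
	}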

Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().

The problem is that a range which extended to the end of the descriptor
table back when close_range() looked at it might very well have had
descriptors opened past it by the time dup_fd() has allocated a new
files_struct and started to figure out the capacity of the fdtable to be
attached to it.

That leads to interesting pathological cases; at the very least it's a
QoI (quality of implementation) issue, since unshare(CLONE_FILES) is
atomic in the sense that it takes a snapshot of a descriptor table one
might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.

It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size().  There we are under ->files_lock,
so the race is trivially avoided.

So we do the following:
	* switch close_range() from calling unshare_fd() to calling
dup_fd().
	* undo the calling convention change done to unshare_fd() in
60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
	* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting.  NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
	* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
	* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.

Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c               | 95 +++++++++++++++--------------------------
 include/linux/fdtable.h |  8 ++--
 kernel/fork.c           | 32 ++++++--------
 3 files changed, 52 insertions(+), 83 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 655338effe9c..c2403cde40e4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -272,59 +272,45 @@ static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->open_fds);
 }
 
-static unsigned int count_open_files(struct fdtable *fdt)
-{
-	unsigned int size = fdt->max_fds;
-	unsigned int i;
-
-	/* Find the last open fd */
-	for (i = size / BITS_PER_LONG; i > 0; ) {
-		if (fdt->open_fds[--i])
-			break;
-	}
-	i = (i + 1) * BITS_PER_LONG;
-	return i;
-}
-
 /*
  * Note that a sane fdtable size always has to be a multiple of
  * BITS_PER_LONG, since we have bitmaps that are sized by this.
  *
- * 'max_fds' will normally already be properly aligned, but it
- * turns out that in the close_range() -> __close_range() ->
- * unshare_fd() -> dup_fd() -> sane_fdtable_size() we can end
- * up having a 'max_fds' value that isn't already aligned.
- *
- * Rather than make close_range() have to worry about this,
- * just make that BITS_PER_LONG alignment be part of a sane
- * fdtable size. Becuase that's really what it is.
+ * punch_hole is optional - when close_range() is asked to unshare
+ * and close, we don't need to copy descriptors in that range, so
+ * a smaller cloned descriptor table might suffice if the last
+ * currently opened descriptor falls into that range.
  */
-static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds)
+static unsigned int sane_fdtable_size(struct fdtable *fdt, struct fd_range *punch_hole)
 {
-	unsigned int count;
-
-	count = count_open_files(fdt);
-	if (max_fds < NR_OPEN_DEFAULT)
-		max_fds = NR_OPEN_DEFAULT;
-	return ALIGN(min(count, max_fds), BITS_PER_LONG);
+	unsigned int last = find_last_bit(fdt->open_fds, fdt->max_fds);
+
+	if (last == fdt->max_fds)
+		return NR_OPEN_DEFAULT;
+	if (punch_hole && punch_hole->to >= last && punch_hole->from <= last) {
+		last = find_last_bit(fdt->open_fds, punch_hole->from);
+		if (last == punch_hole->from)
+			return NR_OPEN_DEFAULT;
+	}
+	return ALIGN(last + 1, BITS_PER_LONG);
 }
 
 /*
- * Allocate a new files structure and copy contents from the
- * passed in files structure.
- * errorp will be valid only when the returned files_struct is NULL.
+ * Allocate a new descriptor table and copy contents from the passed in
+ * instance.  Returns a pointer to cloned table on success, ERR_PTR()
+ * on failure.  For 'punch_hole' see sane_fdtable_size().
  */
-struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp)
+struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_hole)
 {
 	struct files_struct *newf;
 	struct file **old_fds, **new_fds;
 	unsigned int open_files, i;
 	struct fdtable *old_fdt, *new_fdt;
+	int error;
 
-	*errorp = -ENOMEM;
 	newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
 	if (!newf)
-		goto out;
+		return ERR_PTR(-ENOMEM);
 
 	atomic_set(&newf->count, 1);
 
@@ -341,7 +327,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 	spin_lock(&oldf->file_lock);
 	old_fdt = files_fdtable(oldf);
-	open_files = sane_fdtable_size(old_fdt, max_fds);
+	open_files = sane_fdtable_size(old_fdt, punch_hole);
 
 	/*
 	 * Check whether we need to allocate a larger fd array and fd set.
@@ -354,14 +340,14 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 		new_fdt = alloc_fdtable(open_files - 1);
 		if (!new_fdt) {
-			*errorp = -ENOMEM;
+			error = -ENOMEM;
 			goto out_release;
 		}
 
 		/* beyond sysctl_nr_open; nothing to do */
 		if (unlikely(new_fdt->max_fds < open_files)) {
 			__free_fdtable(new_fdt);
-			*errorp = -EMFILE;
+			error = -EMFILE;
 			goto out_release;
 		}
 
@@ -372,7 +358,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 		 */
 		spin_lock(&oldf->file_lock);
 		old_fdt = files_fdtable(oldf);
-		open_files = sane_fdtable_size(old_fdt, max_fds);
+		open_files = sane_fdtable_size(old_fdt, punch_hole);
 	}
 
 	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
@@ -406,8 +392,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 out_release:
 	kmem_cache_free(files_cachep, newf);
-out:
-	return NULL;
+	return ERR_PTR(error);
 }
 
 static struct fdtable *close_files(struct files_struct * files)
@@ -748,37 +733,25 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
 	if (fd > max_fd)
 		return -EINVAL;
 
-	if (flags & CLOSE_RANGE_UNSHARE) {
-		int ret;
-		unsigned int max_unshare_fds = NR_OPEN_MAX;
+	if ((flags & CLOSE_RANGE_UNSHARE) && atomic_read(&cur_fds->count) > 1) {
+		struct fd_range range = {fd, max_fd}, *punch_hole = &range;
 
 		/*
 		 * If the caller requested all fds to be made cloexec we always
 		 * copy all of the file descriptors since they still want to
 		 * use them.
 		 */
-		if (!(flags & CLOSE_RANGE_CLOEXEC)) {
-			/*
-			 * If the requested range is greater than the current
-			 * maximum, we're closing everything so only copy all
-			 * file descriptors beneath the lowest file descriptor.
-			 */
-			rcu_read_lock();
-			if (max_fd >= last_fd(files_fdtable(cur_fds)))
-				max_unshare_fds = fd;
-			rcu_read_unlock();
-		}
-
-		ret = unshare_fd(CLONE_FILES, max_unshare_fds, &fds);
-		if (ret)
-			return ret;
+		if (flags & CLOSE_RANGE_CLOEXEC)
+			punch_hole = NULL;
 
+		fds = dup_fd(cur_fds, punch_hole);
+		if (IS_ERR(fds))
+			return PTR_ERR(fds);
 		/*
 		 * We used to share our file descriptor table, and have now
 		 * created a private one, make sure we're using it below.
 		 */
-		if (fds)
-			swap(cur_fds, fds);
+		swap(cur_fds, fds);
 	}
 
 	if (flags & CLOSE_RANGE_CLOEXEC)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2944d4aa413b..b1c5722f2b3c 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -22,7 +22,6 @@
  * as this is the granularity returned by copy_fdset().
  */
 #define NR_OPEN_DEFAULT BITS_PER_LONG
-#define NR_OPEN_MAX ~0U
 
 struct fdtable {
 	unsigned int max_fds;
@@ -106,7 +105,10 @@ struct task_struct;
 
 void put_files_struct(struct files_struct *fs);
 int unshare_files(void);
-struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
+struct fd_range {
+	unsigned int from, to;
+};
+struct files_struct *dup_fd(struct files_struct *, struct fd_range *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
 		int (*)(const void *, struct file *, unsigned),
@@ -115,8 +117,6 @@ int iterate_fd(struct files_struct *, unsigned,
 extern int close_fd(unsigned int fd);
 extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
 extern struct file *file_close_fd(unsigned int fd);
-extern int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
-		      struct files_struct **new_fdp);
 
 extern struct kmem_cache *files_cachep;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..6b97fb2ac4af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1754,33 +1754,30 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
 		      int no_files)
 {
 	struct files_struct *oldf, *newf;
-	int error = 0;
 
 	/*
 	 * A background process may not have any files ...
 	 */
 	oldf = current->files;
 	if (!oldf)
-		goto out;
+		return 0;
 
 	if (no_files) {
 		tsk->files = NULL;
-		goto out;
+		return 0;
 	}
 
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
-		goto out;
+		return 0;
 	}
 
-	newf = dup_fd(oldf, NR_OPEN_MAX, &error);
-	if (!newf)
-		goto out;
+	newf = dup_fd(oldf, NULL);
+	if (IS_ERR(newf))
+		return PTR_ERR(newf);
 
 	tsk->files = newf;
-	error = 0;
-out:
-	return error;
+	return 0;
 }
 
 static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
@@ -3232,17 +3229,16 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 /*
  * Unshare file descriptor table if it is being shared
  */
-int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
-	       struct files_struct **new_fdp)
+static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp)
 {
 	struct files_struct *fd = current->files;
-	int error = 0;
 
 	if ((unshare_flags & CLONE_FILES) &&
 	    (fd && atomic_read(&fd->count) > 1)) {
-		*new_fdp = dup_fd(fd, max_fds, &error);
-		if (!*new_fdp)
-			return error;
+		fd = dup_fd(fd, NULL);
+		if (IS_ERR(fd))
+			return PTR_ERR(fd);
+		*new_fdp = fd;
 	}
 
 	return 0;
@@ -3300,7 +3296,7 @@ int ksys_unshare(unsigned long unshare_flags)
 	err = unshare_fs(unshare_flags, &new_fs);
 	if (err)
 		goto bad_unshare_out;
-	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
+	err = unshare_fd(unshare_flags, &new_fd);
 	if (err)
 		goto bad_unshare_cleanup_fs;
 	err = unshare_userns(unshare_flags, &new_cred);
@@ -3392,7 +3388,7 @@ int unshare_files(void)
 	struct files_struct *old, *copy = NULL;
 	int error;
 
-	error = unshare_fd(CLONE_FILES, NR_OPEN_MAX, &copy);
+	error = unshare_fd(CLONE_FILES, &copy);
 	if (error || !copy)
 		return error;
 
-- 
2.39.2



* [PATCH 02/12] get rid of ...lookup...fdget_rcu() family
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 03/12] remove pointless includes of <linux/fdtable.h> Al Viro
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

Once upon a time, predecessors of those used to do file lookup
without bumping a refcount, provided that the caller held rcu_read_lock()
across the lookup and across whatever it wanted to read from the struct
file found.  When struct file allocation switched to SLAB_TYPESAFE_BY_RCU,
that stopped being feasible and these primitives started to bump the
file refcount for the lookup result, requiring the caller to call fput()
afterwards.

But that rendered them pointless - e.g.
	rcu_read_lock();
	file = lookup_fdget_rcu(fd);
	rcu_read_unlock();
is equivalent to
	file = fget_raw(fd);
and all callers of lookup_fdget_rcu() are of that form.  Similarly,
task_lookup_fdget_rcu() calls can be replaced with calling fget_task().
task_lookup_next_fdget_rcu() doesn't have direct counterparts, but
its callers would be happier if we replaced it with an analogue that
deals with RCU internally.
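
The task-variant conversions in the diff below follow the same shape,
e.g.
	rcu_read_lock();
	file = task_lookup_fdget_rcu(task, fd);
	rcu_read_unlock();
becomes
	file = fget_task(task, fd);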

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/powerpc/platforms/cell/spufs/coredump.c |  4 +--
 fs/file.c                                    | 28 +++-----------------
 fs/gfs2/glock.c                              | 12 ++-------
 fs/notify/dnotify/dnotify.c                  |  5 +---
 fs/proc/fd.c                                 | 12 +++------
 include/linux/fdtable.h                      |  4 ---
 include/linux/file.h                         |  1 +
 kernel/bpf/task_iter.c                       |  6 +----
 kernel/kcmp.c                                |  4 +--
 9 files changed, 14 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
index 18daafbe2e65..301ee7d8b7df 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -73,9 +73,7 @@ static struct spu_context *coredump_next_context(int *fd)
 		return NULL;
 	*fd = n - 1;
 
-	rcu_read_lock();
-	file = lookup_fdget_rcu(*fd);
-	rcu_read_unlock();
+	file = fget_raw(*fd);
 	if (file) {
 		ctx = SPUFS_I(file_inode(file))->i_ctx;
 		get_spu_context(ctx);
diff --git a/fs/file.c b/fs/file.c
index c2403cde40e4..8c5b8569045c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1037,29 +1037,7 @@ struct file *fget_task(struct task_struct *task, unsigned int fd)
 	return file;
 }
 
-struct file *lookup_fdget_rcu(unsigned int fd)
-{
-	return __fget_files_rcu(current->files, fd, 0);
-
-}
-EXPORT_SYMBOL_GPL(lookup_fdget_rcu);
-
-struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd)
-{
-	/* Must be called with rcu_read_lock held */
-	struct files_struct *files;
-	struct file *file = NULL;
-
-	task_lock(task);
-	files = task->files;
-	if (files)
-		file = __fget_files_rcu(files, fd, 0);
-	task_unlock(task);
-
-	return file;
-}
-
-struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *ret_fd)
+struct file *fget_task_next(struct task_struct *task, unsigned int *ret_fd)
 {
 	/* Must be called with rcu_read_lock held */
 	struct files_struct *files;
@@ -1069,17 +1047,19 @@ struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *
 	task_lock(task);
 	files = task->files;
 	if (files) {
+		rcu_read_lock();
 		for (; fd < files_fdtable(files)->max_fds; fd++) {
 			file = __fget_files_rcu(files, fd, 0);
 			if (file)
 				break;
 		}
+		rcu_read_unlock();
 	}
 	task_unlock(task);
 	*ret_fd = fd;
 	return file;
 }
-EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
+EXPORT_SYMBOL(fget_task_next);
 
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 12a769077ea0..a4f5940c3e0a 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -34,7 +34,6 @@
 #include <linux/lockref.h>
 #include <linux/rhashtable.h>
 #include <linux/pid_namespace.h>
-#include <linux/fdtable.h>
 #include <linux/file.h>
 
 #include "gfs2.h"
@@ -2765,25 +2764,18 @@ static struct file *gfs2_glockfd_next_file(struct gfs2_glockfd_iter *i)
 		i->file = NULL;
 	}
 
-	rcu_read_lock();
 	for(;; i->fd++) {
-		struct inode *inode;
-
-		i->file = task_lookup_next_fdget_rcu(i->task, &i->fd);
+		i->file = fget_task_next(i->task, &i->fd);
 		if (!i->file) {
 			i->fd = 0;
 			break;
 		}
 
-		inode = file_inode(i->file);
-		if (inode->i_sb == i->sb)
+		if (file_inode(i->file)->i_sb == i->sb)
 			break;
 
-		rcu_read_unlock();
 		fput(i->file);
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 	return i->file;
 }
 
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index f3669403fabf..65521c01d2a4 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -16,7 +16,6 @@
 #include <linux/security.h>
 #include <linux/spinlock.h>
 #include <linux/slab.h>
-#include <linux/fdtable.h>
 #include <linux/fsnotify_backend.h>
 
 static int dir_notify_enable __read_mostly = 1;
@@ -343,9 +342,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg)
 		new_fsn_mark = NULL;
 	}
 
-	rcu_read_lock();
-	f = lookup_fdget_rcu(fd);
-	rcu_read_unlock();
+	f = fget_raw(fd);
 
 	/* if (f != filp) means that we lost a race and another task/thread
 	 * actually closed the fd we are still playing with before we grabbed
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 586bbc84ca04..077c51ba1ba7 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -116,9 +116,7 @@ static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
 {
 	struct file *file;
 
-	rcu_read_lock();
-	file = task_lookup_fdget_rcu(task, fd);
-	rcu_read_unlock();
+	file = fget_task(task, fd);
 	if (file) {
 		*mode = file->f_mode;
 		fput(file);
@@ -258,19 +256,17 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 	if (!dir_emit_dots(file, ctx))
 		goto out;
 
-	rcu_read_lock();
 	for (fd = ctx->pos - 2;; fd++) {
 		struct file *f;
 		struct fd_data data;
 		char name[10 + 1];
 		unsigned int len;
 
-		f = task_lookup_next_fdget_rcu(p, &fd);
+		f = fget_task_next(p, &fd);
 		ctx->pos = fd + 2LL;
 		if (!f)
 			break;
 		data.mode = f->f_mode;
-		rcu_read_unlock();
 		fput(f);
 		data.fd = fd;
 
@@ -278,11 +274,9 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 		if (!proc_fill_cache(file, ctx,
 				     name, len, instantiate, p,
 				     &data))
-			goto out;
+			break;
 		cond_resched();
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 out:
 	put_task_struct(p);
 	return 0;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index b1c5722f2b3c..e25e2cb65d30 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -92,10 +92,6 @@ static inline struct file *files_lookup_fd_locked(struct files_struct *files, un
 	return files_lookup_fd_raw(files, fd);
 }
 
-struct file *lookup_fdget_rcu(unsigned int fd);
-struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd);
-struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *fd);
-
 static inline bool close_on_exec(unsigned int fd, const struct files_struct *files)
 {
 	return test_bit(fd, files_fdtable(files)->close_on_exec);
diff --git a/include/linux/file.h b/include/linux/file.h
index 237931f20739..006005f621d1 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -51,6 +51,7 @@ static inline void fdput(struct fd fd)
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_raw(unsigned int fd);
 extern struct file *fget_task(struct task_struct *task, unsigned int fd);
+extern struct file *fget_task_next(struct task_struct *task, unsigned int *fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
 extern unsigned long __fdget_pos(unsigned int fd);
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 02aa9db8d796..7fe602ca74a0 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -5,7 +5,6 @@
 #include <linux/namei.h>
 #include <linux/pid_namespace.h>
 #include <linux/fs.h>
-#include <linux/fdtable.h>
 #include <linux/filter.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/btf_ids.h>
@@ -286,17 +285,14 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info)
 			curr_fd = 0;
 	}
 
-	rcu_read_lock();
-	f = task_lookup_next_fdget_rcu(curr_task, &curr_fd);
+	f = fget_task_next(curr_task, &curr_fd);
 	if (f) {
 		/* set info->fd */
 		info->fd = curr_fd;
-		rcu_read_unlock();
 		return f;
 	}
 
 	/* the current task is done, go to the next task */
-	rcu_read_unlock();
 	put_task_struct(curr_task);
 
 	if (info->common.type == BPF_TASK_ITER_TID) {
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index b0639f21041f..2c596851f8a9 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -63,9 +63,7 @@ get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
 	struct file *file;
 
-	rcu_read_lock();
-	file = task_lookup_fdget_rcu(task, idx);
-	rcu_read_unlock();
+	file = fget_task(task, idx);
 	if (file)
 		fput(file);
 
-- 
2.39.2



* [PATCH 03/12] remove pointless includes of <linux/fdtable.h>
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
  2024-08-22  0:22   ` [PATCH 02/12] get rid of ...lookup...fdget_rcu() family Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 04/12] close_files(): don't bother with xchg() Al Viro
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

Some of those used to be needed; some had been cargo-culted for
no reason...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fcntl.c                         | 1 -
 fs/file_table.c                    | 1 -
 fs/notify/fanotify/fanotify.c      | 1 -
 fs/notify/fanotify/fanotify_user.c | 1 -
 fs/overlayfs/copy_up.c             | 1 -
 fs/proc/base.c                     | 1 -
 io_uring/io_uring.c                | 1 -
 kernel/bpf/bpf_inode_storage.c     | 1 -
 kernel/bpf/bpf_task_storage.c      | 1 -
 kernel/bpf/token.c                 | 1 -
 kernel/exit.c                      | 1 -
 kernel/module/dups.c               | 1 -
 kernel/module/kmod.c               | 1 -
 kernel/umh.c                       | 1 -
 net/handshake/request.c            | 1 -
 security/apparmor/domain.c         | 1 -
 16 files changed, 16 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 300e5d9ad913..56262bdbc544 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -12,7 +12,6 @@
 #include <linux/fs.h>
 #include <linux/filelock.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/capability.h>
 #include <linux/dnotify.h>
 #include <linux/slab.h>
diff --git a/fs/file_table.c b/fs/file_table.c
index ca7843dde56d..15adc085df4f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -9,7 +9,6 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/fs.h>
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 224bccaab4cc..24c7c5df4998 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/fanotify.h>
-#include <linux/fdtable.h>
 #include <linux/fsnotify_backend.h>
 #include <linux/init.h>
 #include <linux/jiffies.h>
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9ec313e9f6e1..cc91977cf202 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1,7 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/fanotify.h>
 #include <linux/fcntl.h>
-#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/anon_inodes.h>
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index a5ef2005a2cc..73b502524d1c 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -16,7 +16,6 @@
 #include <linux/sched/signal.h>
 #include <linux/cred.h>
 #include <linux/namei.h>
-#include <linux/fdtable.h>
 #include <linux/ratelimit.h>
 #include <linux/exportfs.h>
 #include "overlayfs.h"
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 72a1acd03675..d4ecad2b0a2e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -58,7 +58,6 @@
 #include <linux/init.h>
 #include <linux/capability.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/generic-radix-tree.h>
 #include <linux/string.h>
 #include <linux/seq_file.h>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3942db160f18..0b4be6d5edaf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -51,7 +51,6 @@
 #include <linux/sched/signal.h>
 #include <linux/fs.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/percpu.h>
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index b0ef45db207c..f8b97d8a874a 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -16,7 +16,6 @@
 #include <uapi/linux/btf.h>
 #include <linux/bpf_lsm.h>
 #include <linux/btf_ids.h>
-#include <linux/fdtable.h>
 #include <linux/rcupdate_trace.h>
 
 DEFINE_BPF_STORAGE_CACHE(inode_cache);
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index adf6dfe0ba68..1eb9852a9f8e 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -16,7 +16,6 @@
 #include <linux/filter.h>
 #include <uapi/linux/btf.h>
 #include <linux/btf_ids.h>
-#include <linux/fdtable.h>
 #include <linux/rcupdate_trace.h>
 
 DEFINE_BPF_STORAGE_CACHE(task_cache);
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index d6ccf8d00eab..3ea6a7505662 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -1,6 +1,5 @@
 #include <linux/bpf.h>
 #include <linux/vmalloc.h>
-#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/kernel.h>
diff --git a/kernel/exit.c b/kernel/exit.c
index 7430852a8571..d441193a4537 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -25,7 +25,6 @@
 #include <linux/acct.h>
 #include <linux/tsacct_kern.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/freezer.h>
 #include <linux/binfmts.h>
 #include <linux/nsproxy.h>
diff --git a/kernel/module/dups.c b/kernel/module/dups.c
index 9a92f2f8c9d3..bd2149fbe117 100644
--- a/kernel/module/dups.c
+++ b/kernel/module/dups.c
@@ -18,7 +18,6 @@
 #include <linux/completion.h>
 #include <linux/cred.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
 #include <linux/mount.h>
diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c
index 0800d9891692..25f253812512 100644
--- a/kernel/module/kmod.c
+++ b/kernel/module/kmod.c
@@ -15,7 +15,6 @@
 #include <linux/completion.h>
 #include <linux/cred.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
 #include <linux/mount.h>
diff --git a/kernel/umh.c b/kernel/umh.c
index ff1f13a27d29..be9234270777 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -13,7 +13,6 @@
 #include <linux/completion.h>
 #include <linux/cred.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/fs_struct.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
diff --git a/net/handshake/request.c b/net/handshake/request.c
index 94d5cef3e048..274d2c89b6b2 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -13,7 +13,6 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/inet.h>
-#include <linux/fdtable.h>
 #include <linux/rhashtable.h>
 
 #include <net/sock.h>
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 571158ec6188..2bc34dce9a46 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -9,7 +9,6 @@
  */
 
 #include <linux/errno.h>
-#include <linux/fdtable.h>
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/mount.h>
-- 
2.39.2



* [PATCH 04/12] close_files(): don't bother with xchg()
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
  2024-08-22  0:22   ` [PATCH 02/12] get rid of ...lookup...fdget_rcu() family Al Viro
  2024-08-22  0:22   ` [PATCH 03/12] remove pointless includes of <linux/fdtable.h> Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 05/12] move close_range(2) into fs/file.c, fold __close_range() into it Al Viro
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

At that point nobody else has references to the victim files_struct;
as a matter of fact, the caller will free it immediately after
close_files() returns, with no RCU delays or anything of that sort.

That's why we are not protecting against fdtable reallocation on
expansion, not cleaning the bitmaps, etc.  There's no point in
zeroing the pointers in ->fd[] either, let alone making that an
atomic operation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 8c5b8569045c..4ea7afce828c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -413,7 +413,7 @@ static struct fdtable *close_files(struct files_struct * files)
 		set = fdt->open_fds[j++];
 		while (set) {
 			if (set & 1) {
-				struct file * file = xchg(&fdt->fd[i], NULL);
+				struct file *file = fdt->fd[i];
 				if (file) {
 					filp_close(file, files);
 					cond_resched();
-- 
2.39.2



* [PATCH 05/12] move close_range(2) into fs/file.c, fold __close_range() into it
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (2 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 04/12] close_files(): don't bother with xchg() Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 06/12] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Al Viro
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

	We never had callers for __close_range() except for close_range(2)
itself.  Nothing of that sort has appeared in four years, and if any users
do show up, we can always separate those suckers again.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c               |  6 ++++--
 fs/open.c               | 17 -----------------
 include/linux/fdtable.h |  1 -
 3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 4ea7afce828c..dcafae2acae8 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -713,7 +713,7 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
 }
 
 /**
- * __close_range() - Close all file descriptors in a given range.
+ * sys_close_range() - Close all file descriptors in a given range.
  *
  * @fd:     starting file descriptor to close
  * @max_fd: last file descriptor to close
@@ -721,8 +721,10 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
  *
  * This closes a range of file descriptors. All file descriptors
  * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
  */
-int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
 {
 	struct task_struct *me = current;
 	struct files_struct *cur_fds = me->files, *fds = NULL;
diff --git a/fs/open.c b/fs/open.c
index 22adbef7ecc2..25443d846d5e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1575,23 +1575,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
-/**
- * sys_close_range() - Close all file descriptors in a given range.
- *
- * @fd:     starting file descriptor to close
- * @max_fd: last file descriptor to close
- * @flags:  reserved for future extensions
- *
- * This closes a range of file descriptors. All file descriptors
- * from @fd up to and including @max_fd are closed.
- * Currently, errors to close a given file descriptor are ignored.
- */
-SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
-		unsigned int, flags)
-{
-	return __close_range(fd, max_fd, flags);
-}
-
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e25e2cb65d30..c45306a9f007 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -111,7 +111,6 @@ int iterate_fd(struct files_struct *, unsigned,
 		const void *);
 
 extern int close_fd(unsigned int fd);
-extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
 extern struct file *file_close_fd(unsigned int fd);
 
 extern struct kmem_cache *files_cachep;
-- 
2.39.2



* [PATCH 06/12] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (3 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 05/12] move close_range(2) into fs/file.c, fold __close_range() into it Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 07/12] fs/file.c: conditionally clear full_fds Al Viro
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

From: Yu Ma <yu.ma@intel.com>

alloc_fd() has a sanity check inside to make sure the struct file mapping to
the allocated fd is NULL. Remove this sanity check since it is already assured
by the existing zero initialization and by setting the slot to NULL when an fd
is recycled. Meanwhile, add likely/unlikely annotations and avoid the
expand_files() call when possible, to reduce the work done under file_lock.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-2-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index dcafae2acae8..9b8df3bc6fea 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -496,7 +496,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (fd < files->next_fd)
 		fd = files->next_fd;
 
-	if (fd < fdt->max_fds)
+	if (likely(fd < fdt->max_fds))
 		fd = find_next_fd(fdt, fd);
 
 	/*
@@ -504,19 +504,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	 * will limit the total number of files that can be opened.
 	 */
 	error = -EMFILE;
-	if (fd >= end)
+	if (unlikely(fd >= end))
 		goto out;
 
-	error = expand_files(files, fd);
-	if (error < 0)
-		goto out;
+	if (unlikely(fd >= fdt->max_fds)) {
+		error = expand_files(files, fd);
+		if (error < 0)
+			goto out;
 
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
-		goto repeat;
+		/*
+		 * If we needed to expand the fs array we
+		 * might have blocked - try again.
+		 */
+		if (error)
+			goto repeat;
+	}
 
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
@@ -527,13 +529,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
 
 out:
 	spin_unlock(&files->file_lock);
@@ -599,7 +594,7 @@ void fd_install(unsigned int fd, struct file *file)
 		rcu_read_unlock_sched();
 		spin_lock(&files->file_lock);
 		fdt = files_fdtable(files);
-		BUG_ON(fdt->fd[fd] != NULL);
+		WARN_ON(fdt->fd[fd] != NULL);
 		rcu_assign_pointer(fdt->fd[fd], file);
 		spin_unlock(&files->file_lock);
 		return;
-- 
2.39.2



* [PATCH 07/12] fs/file.c: conditionally clear full_fds
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (4 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 06/12] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 08/12] fs/file.c: add fast path in find_next_fd() Al Viro
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

From: Yu Ma <yu.ma@intel.com>

64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the bit in full_fds_bits has already been cleared by an earlier
__clear_open_fd() call. Check whether the bit in full_fds_bits is set before
clearing it, to avoid an unnecessary write and the resulting cache-line
bouncing. See commit fc90888d07b8 ("vfs: conditionally clear close-on-exec
flag") for a similar optimization. Taking a stock kernel with patch 1 as the
baseline, this improves pts/blogbench-1.1.0 read by 13% and write by 5% on an
Intel ICX 160-core configuration with v6.10-rc7.
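
For reference, the close-on-exec analogue mentioned above is the
conditional clear already done in __clear_close_on_exec() (quoted here
for context, not part of this patch):

	if (test_bit(fd, fdt->close_on_exec))
		__clear_bit(fd, fdt->close_on_exec);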

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-3-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 9b8df3bc6fea..d6f5add1a786 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -264,7 +264,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
-	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+	fd /= BITS_PER_LONG;
+	if (test_bit(fd, fdt->full_fds_bits))
+		__clear_bit(fd, fdt->full_fds_bits);
 }
 
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
-- 
2.39.2



* [PATCH 08/12] fs/file.c: add fast path in find_next_fd()
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (5 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 07/12] fs/file.c: conditionally clear full_fds Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 09/12] alloc_fdtable(): change calling conventions Al Viro
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

From: Yu Ma <yu.ma@intel.com>

Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word containing next_fd, because:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() for size <= 64 that
speeds up the search.
(3) Once fdt has been expanded (the bitmap size doubles on each expansion),
it is never shrunk.  The search size keeps increasing even though typically
only a few fds are open.

This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed
to by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous
versions.  On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on an Intel ICX 160-core configuration with v6.10-rc7.
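
A standalone sketch of just this fast path, as a userspace approximation
of the same logic (not the kernel helpers themselves; it assumes 64-bit
longs and the GCC/Clang __builtin_ctzl()):

	#include <stdio.h>

	/* first open slot at or after 'start' within its word, or ~0U */
	static unsigned int fast_path(const unsigned long *open_fds,
				      unsigned int start)
	{
		unsigned int word = start / 64, off = start % 64;
		/* treat every bit below 'start' as busy */
		unsigned long busy = open_fds[word] | ((1UL << off) - 1UL);

		if (~busy)
			return word * 64 + (unsigned int)__builtin_ctzl(~busy);
		return ~0U;	/* would fall back to the full_fds_bits search */
	}

	int main(void)
	{
		/* fds 0..63 and 64..72 open */
		unsigned long open_fds[2] = { ~0UL, 0x1ffUL };

		printf("%u\n", fast_path(open_fds, 70));	/* prints 73 */
		return 0;
	}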

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-4-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index d6f5add1a786..b94ee8270867 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -472,6 +472,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
 	unsigned int maxbit = maxfd / BITS_PER_LONG;
 	unsigned int bitbit = start / BITS_PER_LONG;
+	unsigned int bit;
+
+	/*
+	 * Try to avoid looking at the second level bitmap
+	 */
+	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+				 start & (BITS_PER_LONG - 1));
+	if (bit < BITS_PER_LONG)
+		return bit + bitbit * BITS_PER_LONG;
 
 	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
 	if (bitbit >= maxfd)
-- 
2.39.2



* [PATCH 09/12] alloc_fdtable(): change calling conventions.
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (6 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 08/12] fs/file.c: add fast path in find_next_fd() Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 10/12] file.c: merge __{set,clear}_close_on_exec() Al Viro
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

First of all, tell it how many slots we want, not which slot
is wanted.  That makes one caller (dup_fd()) more straightforward
and doesn't harm the other (expand_fdtable()).

Furthermore, make it return ERR_PTR() on failure rather than
returning NULL.  Simplifies the callers.

Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG.  What the
rules boil down to is
	* use the smallest power of two large enough to give us
that many slots
	* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
	* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less.  128 slots means
1Kb fd array, again.
	* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.
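
Worked through on a couple of assumed values: slots_wanted == 65 gives
roundup_pow_of_two(65) == 128 slots on 64bit (a 1Kb fd array) and gets
bumped to 256 slots on 32bit (also a 1Kb array, the pointers being half
the size); slots_wanted == 300 gives 512 slots on either.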

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 75 +++++++++++++++++++++----------------------------------
 1 file changed, 29 insertions(+), 46 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index b94ee8270867..4fbd4e323f6c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -89,18 +89,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
  * 'unsigned long' in some places, but simply because that is how the Linux
  * kernel bitmaps are defined to work: they are not "bits in an array of bytes",
  * they are very much "bits in an array of unsigned long".
- *
- * The ALIGN(nr, BITS_PER_LONG) here is for clarity: since we just multiplied
- * by that "1024/sizeof(ptr)" before, we already know there are sufficient
- * clear low bits. Clang seems to realize that, gcc ends up being confused.
- *
- * On a 128-bit machine, the ALIGN() would actually matter. In the meantime,
- * let's consider it documentation (and maybe a test-case for gcc to improve
- * its code generation ;)
  */
-static struct fdtable * alloc_fdtable(unsigned int nr)
+static struct fdtable *alloc_fdtable(unsigned int slots_wanted)
 {
 	struct fdtable *fdt;
+	unsigned int nr;
 	void *data;
 
 	/*
@@ -108,22 +101,32 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	 * Allocation steps are keyed to the size of the fdarray, since it
 	 * grows far faster than any of the other dynamic data. We try to fit
 	 * the fdarray into comfortable page-tuned chunks: starting at 1024B
-	 * and growing in powers of two from there on.
+	 * and growing in powers of two from there on.  Since we called only
+	 * with slots_wanted > BITS_PER_LONG (embedded instance in files->fdtab
+	 * already gives BITS_PER_LONG slots), the above boils down to
+	 * 1.  use the smallest power of two large enough to give us that many
+	 * slots.
+	 * 2.  on 32bit skip 64 and 128 - the minimal capacity we want there is
+	 * 256 slots (i.e. 1Kb fd array).
+	 * 3.  on 64bit don't skip anything, 1Kb fd array means 128 slots there
+	 * and we are never going to be asked for 64 or less.
 	 */
-	nr /= (1024 / sizeof(struct file *));
-	nr = roundup_pow_of_two(nr + 1);
-	nr *= (1024 / sizeof(struct file *));
-	nr = ALIGN(nr, BITS_PER_LONG);
+	if (IS_ENABLED(CONFIG_32BIT) && slots_wanted < 256)
+		nr = 256;
+	else
+		nr = roundup_pow_of_two(slots_wanted);
 	/*
 	 * Note that this can drive nr *below* what we had passed if sysctl_nr_open
-	 * had been set lower between the check in expand_files() and here.  Deal
-	 * with that in caller, it's cheaper that way.
+	 * had been set lower between the check in expand_files() and here.
 	 *
 	 * We make sure that nr remains a multiple of BITS_PER_LONG - otherwise
 	 * bitmaps handling below becomes unpleasant, to put it mildly...
 	 */
-	if (unlikely(nr > sysctl_nr_open))
-		nr = ((sysctl_nr_open - 1) | (BITS_PER_LONG - 1)) + 1;
+	if (unlikely(nr > sysctl_nr_open)) {
+		nr = round_down(sysctl_nr_open, BITS_PER_LONG);
+		if (nr < slots_wanted)
+			return ERR_PTR(-EMFILE);
+	}
 
 	fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL_ACCOUNT);
 	if (!fdt)
@@ -152,7 +155,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 out_fdt:
 	kfree(fdt);
 out:
-	return NULL;
+	return ERR_PTR(-ENOMEM);
 }
 
 /*
@@ -169,7 +172,7 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	struct fdtable *new_fdt, *cur_fdt;
 
 	spin_unlock(&files->file_lock);
-	new_fdt = alloc_fdtable(nr);
+	new_fdt = alloc_fdtable(nr + 1);
 
 	/* make sure all fd_install() have seen resize_in_progress
 	 * or have finished their rcu_read_lock_sched() section.
@@ -178,16 +181,8 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 		synchronize_rcu();
 
 	spin_lock(&files->file_lock);
-	if (!new_fdt)
-		return -ENOMEM;
-	/*
-	 * extremely unlikely race - sysctl_nr_open decreased between the check in
-	 * caller and alloc_fdtable().  Cheaper to catch it here...
-	 */
-	if (unlikely(new_fdt->max_fds <= nr)) {
-		__free_fdtable(new_fdt);
-		return -EMFILE;
-	}
+	if (IS_ERR(new_fdt))
+		return PTR_ERR(new_fdt);
 	cur_fdt = files_fdtable(files);
 	BUG_ON(nr < cur_fdt->max_fds);
 	copy_fdtable(new_fdt, cur_fdt);
@@ -308,7 +303,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_ho
 	struct file **old_fds, **new_fds;
 	unsigned int open_files, i;
 	struct fdtable *old_fdt, *new_fdt;
-	int error;
 
 	newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
 	if (!newf)
@@ -340,17 +334,10 @@ struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_ho
 		if (new_fdt != &newf->fdtab)
 			__free_fdtable(new_fdt);
 
-		new_fdt = alloc_fdtable(open_files - 1);
-		if (!new_fdt) {
-			error = -ENOMEM;
-			goto out_release;
-		}
-
-		/* beyond sysctl_nr_open; nothing to do */
-		if (unlikely(new_fdt->max_fds < open_files)) {
-			__free_fdtable(new_fdt);
-			error = -EMFILE;
-			goto out_release;
+		new_fdt = alloc_fdtable(open_files);
+		if (IS_ERR(new_fdt)) {
+			kmem_cache_free(files_cachep, newf);
+			return ERR_CAST(new_fdt);
 		}
 
 		/*
@@ -391,10 +378,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_ho
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
 	return newf;
-
-out_release:
-	kmem_cache_free(files_cachep, newf);
-	return ERR_PTR(error);
 }
 
 static struct fdtable *close_files(struct files_struct * files)
-- 
2.39.2



* [PATCH 10/12] file.c: merge __{set,clear}_close_on_exec()
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (7 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 09/12] alloc_fdtable(): change calling conventions Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 11/12] make __set_open_fd() set cloexec state as well Al Viro
  2024-08-22  0:22   ` [PATCH 12/12] expand_files(): simplify calling conventions Al Viro
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

They always go in pairs; seeing that they are inlined, we might
as well make that a single inline function taking a boolean
argument ("do we want close_on_exec set for that descriptor?").

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 4fbd4e323f6c..61fb8994203f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -237,15 +237,15 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 	return expanded;
 }
 
-static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt)
+static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
+				       bool set)
 {
-	__set_bit(fd, fdt->close_on_exec);
-}
-
-static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
-{
-	if (test_bit(fd, fdt->close_on_exec))
-		__clear_bit(fd, fdt->close_on_exec);
+	if (set) {
+		__set_bit(fd, fdt->close_on_exec);
+	} else {
+		if (test_bit(fd, fdt->close_on_exec))
+			__clear_bit(fd, fdt->close_on_exec);
+	}
 }
 
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
@@ -518,10 +518,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 		files->next_fd = fd + 1;
 
 	__set_open_fd(fd, fdt);
-	if (flags & O_CLOEXEC)
-		__set_close_on_exec(fd, fdt);
-	else
-		__clear_close_on_exec(fd, fdt);
+	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
 	error = fd;
 
 out:
@@ -1147,13 +1144,8 @@ void __f_unlock_pos(struct file *f)
 void set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
-	struct fdtable *fdt;
 	spin_lock(&files->file_lock);
-	fdt = files_fdtable(files);
-	if (flag)
-		__set_close_on_exec(fd, fdt);
-	else
-		__clear_close_on_exec(fd, fdt);
+	__set_close_on_exec(fd, files_fdtable(files), flag);
 	spin_unlock(&files->file_lock);
 }
 
@@ -1195,10 +1187,7 @@ __releases(&files->file_lock)
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
 	__set_open_fd(fd, fdt);
-	if (flags & O_CLOEXEC)
-		__set_close_on_exec(fd, fdt);
-	else
-		__clear_close_on_exec(fd, fdt);
+	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
-- 
2.39.2



* [PATCH 11/12] make __set_open_fd() set cloexec state as well
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (8 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 10/12] file.c: merge __{set,clear}_close_on_exec() Al Viro
@ 2024-08-22  0:22   ` Al Viro
  2024-08-22  0:22   ` [PATCH 12/12] expand_files(): simplify calling conventions Al Viro
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

->close_on_exec[] state is maintained only for opened descriptors;
as a result, anything that marks a descriptor opened has to
set its cloexec state explicitly.

Consequently, all calls of __set_open_fd() are followed by
__set_close_on_exec(); we might as well fold the latter into
__set_open_fd() so that cloexec state is defined as soon as the
descriptor is marked opened.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 61fb8994203f..411e658c3fa3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
 	}
 }
 
-static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
 {
 	__set_bit(fd, fdt->open_fds);
 	fd /= BITS_PER_LONG;
 	if (!~fdt->open_fds[fd])
 		__set_bit(fd, fdt->full_fds_bits);
+	__set_close_on_exec(fd, fdt, set);
 }
 
 static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
@@ -517,8 +518,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
 
-	__set_open_fd(fd, fdt);
-	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
+	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
 	error = fd;
 
 out:
@@ -1186,8 +1186,7 @@ __releases(&files->file_lock)
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
-	__set_open_fd(fd, fdt);
-	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
+	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 12/12] expand_files(): simplify calling conventions
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
                     ` (9 preceding siblings ...)
  2024-08-22  0:22   ` [PATCH 11/12] make __set_open_fd() set cloexec state as well Al Viro
@ 2024-08-22  0:22   ` Al Viro
  10 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-08-22  0:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

All callers treat 0 and 1 returned by expand_files() in the same way
now that the call in alloc_fd() has been made conditional.  Just make
it return 0 on success and be done with it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 411e658c3fa3..0dbb3f68249d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -162,7 +162,7 @@ static struct fdtable *alloc_fdtable(unsigned int slots_wanted)
  * Expand the file descriptor table.
  * This function will allocate a new fdtable and both fd array and fdset, of
  * the given size.
- * Return <0 error code on error; 1 on successful completion.
+ * Return <0 error code on error; 0 on successful completion.
  * The files->file_lock should be held on entry, and will be held on exit.
  */
 static int expand_fdtable(struct files_struct *files, unsigned int nr)
@@ -191,15 +191,14 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
 	/* coupled with smp_rmb() in fd_install() */
 	smp_wmb();
-	return 1;
+	return 0;
 }
 
 /*
  * Expand files.
  * This function will expand the file structures, if the requested size exceeds
  * the current capacity and there is room for expansion.
- * Return <0 error code on error; 0 when nothing done; 1 when files were
- * expanded and execution may have blocked.
+ * Return <0 error code on error; 0 on success.
  * The files->file_lock should be held on entry, and will be held on exit.
  */
 static int expand_files(struct files_struct *files, unsigned int nr)
@@ -207,14 +206,14 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 	__acquires(files->file_lock)
 {
 	struct fdtable *fdt;
-	int expanded = 0;
+	int error;
 
 repeat:
 	fdt = files_fdtable(files);
 
 	/* Do we need to expand? */
 	if (nr < fdt->max_fds)
-		return expanded;
+		return 0;
 
 	/* Can we expand? */
 	if (nr >= sysctl_nr_open)
@@ -222,7 +221,6 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 
 	if (unlikely(files->resize_in_progress)) {
 		spin_unlock(&files->file_lock);
-		expanded = 1;
 		wait_event(files->resize_wait, !files->resize_in_progress);
 		spin_lock(&files->file_lock);
 		goto repeat;
@@ -230,11 +228,11 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 
 	/* All good, so we try */
 	files->resize_in_progress = true;
-	expanded = expand_fdtable(files, nr);
+	error = expand_fdtable(files, nr);
 	files->resize_in_progress = false;
 
 	wake_up_all(&files->resize_wait);
-	return expanded;
+	return error;
 }
 
 static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
@@ -507,12 +505,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 		if (error < 0)
 			goto out;
 
-		/*
-		 * If we needed to expand the fs array we
-		 * might have blocked - try again.
-		 */
-		if (error)
-			goto repeat;
+		goto repeat;
 	}
 
 	if (start <= files->next_fd)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCHES] fdtable series v3
  2024-08-22  0:20 [PATCHES] fdtable series v2 Al Viro
  2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
@ 2024-10-07 17:39 ` Al Viro
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                     ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:39 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner

Changes since v2:
	Rebased to 6.12-rc2, now that close_range() fix got into
mainline.  No more than that so far (there will be followup
cleanups, but for now I just want it posted and in -next).

Changes since v1:
 	close_range() fix added in the beginning of the series,
dup_fd() calling convention change folded into it (as requested
by Linus), the rest rebased on top of that.
 	sane_fdtable_size() change is dropped, as it's obsoleted
by close_range() fix.  Several patches added at the end of
series.

	Same branch -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.fdtable
individual patches in followups.

Christian, I can move that to vfs/vfs.git if that's more convenient
for you - we are about to step on each other's toes with that and
your ->f_count work.

Shortlog:

Al Viro (8):
      get rid of ...lookup...fdget_rcu() family
      remove pointless includes of <linux/fdtable.h>
      close_files(): don't bother with xchg()
      move close_range(2) into fs/file.c, fold __close_range() into it
      alloc_fdtable(): change calling conventions.
      file.c: merge __{set,clear}_close_on_exec()
      make __set_open_fd() set cloexec state as well
      expand_files(): simplify calling conventions

Yu Ma (3):
      fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
      fs/file.c: conditionally clear full_fds
      fs/file.c: add fast path in find_next_fd()

Diffstat:

 arch/powerpc/platforms/cell/spufs/coredump.c |   4 +-
 fs/fcntl.c                                   |   1 -
 fs/file.c                                    | 204 ++++++++++-----------------
 fs/file_table.c                              |   1 -
 fs/gfs2/glock.c                              |  12 +-
 fs/notify/dnotify/dnotify.c                  |   5 +-
 fs/notify/fanotify/fanotify.c                |   1 -
 fs/notify/fanotify/fanotify_user.c           |   1 -
 fs/open.c                                    |  17 ---
 fs/overlayfs/copy_up.c                       |   1 -
 fs/proc/base.c                               |   1 -
 fs/proc/fd.c                                 |  12 +-
 include/linux/fdtable.h                      |   5 -
 include/linux/file.h                         |   1 +
 io_uring/io_uring.c                          |   1 -
 kernel/bpf/bpf_inode_storage.c               |   1 -
 kernel/bpf/bpf_task_storage.c                |   1 -
 kernel/bpf/task_iter.c                       |   6 +-
 kernel/bpf/token.c                           |   1 -
 kernel/exit.c                                |   1 -
 kernel/kcmp.c                                |   4 +-
 kernel/module/dups.c                         |   1 -
 kernel/module/kmod.c                         |   1 -
 kernel/umh.c                                 |   1 -
 net/handshake/request.c                      |   1 -
 security/apparmor/domain.c                   |   1 -
 26 files changed, 88 insertions(+), 198 deletions(-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family
  2024-10-07 17:39 ` [PATCHES] fdtable series v3 Al Viro
@ 2024-10-07 17:43   ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 02/11] remove pointless includes of <linux/fdtable.h> Al Viro
                       ` (9 more replies)
  2024-10-07 18:12   ` [PATCHES] fdtable series v3 Linus Torvalds
  2024-10-08 11:15   ` Christian Brauner
  2 siblings, 10 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

Once upon a time, predecessors of those used to do file lookup
without bumping a refcount, provided that the caller held rcu_read_lock()
across the lookup and whatever it wanted to read from the struct
file found.  When struct file allocation switched to SLAB_TYPESAFE_BY_RCU,
that stopped being feasible and these primitives started to bump the
file refcount for the lookup result, requiring the caller to call fput()
afterwards.

But that made them pointless - e.g.
	rcu_read_lock();
	file = lookup_fdget_rcu(fd);
	rcu_read_unlock();
is equivalent to
	file = fget_raw(fd);
and all callers of lookup_fdget_rcu() are of that form.  Similarly,
task_lookup_fdget_rcu() calls can be replaced with calling fget_task().
task_lookup_next_fdget_rcu() doesn't have direct counterparts, but
its callers would be happier if we replaced it with an analogue that
deals with RCU internally.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/powerpc/platforms/cell/spufs/coredump.c |  4 +--
 fs/file.c                                    | 28 +++-----------------
 fs/gfs2/glock.c                              | 12 ++-------
 fs/notify/dnotify/dnotify.c                  |  5 +---
 fs/proc/fd.c                                 | 12 +++------
 include/linux/fdtable.h                      |  4 ---
 include/linux/file.h                         |  1 +
 kernel/bpf/task_iter.c                       |  6 +----
 kernel/kcmp.c                                |  4 +--
 9 files changed, 14 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
index 18daafbe2e65..301ee7d8b7df 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -73,9 +73,7 @@ static struct spu_context *coredump_next_context(int *fd)
 		return NULL;
 	*fd = n - 1;
 
-	rcu_read_lock();
-	file = lookup_fdget_rcu(*fd);
-	rcu_read_unlock();
+	file = fget_raw(*fd);
 	if (file) {
 		ctx = SPUFS_I(file_inode(file))->i_ctx;
 		get_spu_context(ctx);
diff --git a/fs/file.c b/fs/file.c
index eb093e736972..991860ee7848 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1037,29 +1037,7 @@ struct file *fget_task(struct task_struct *task, unsigned int fd)
 	return file;
 }
 
-struct file *lookup_fdget_rcu(unsigned int fd)
-{
-	return __fget_files_rcu(current->files, fd, 0);
-
-}
-EXPORT_SYMBOL_GPL(lookup_fdget_rcu);
-
-struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd)
-{
-	/* Must be called with rcu_read_lock held */
-	struct files_struct *files;
-	struct file *file = NULL;
-
-	task_lock(task);
-	files = task->files;
-	if (files)
-		file = __fget_files_rcu(files, fd, 0);
-	task_unlock(task);
-
-	return file;
-}
-
-struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *ret_fd)
+struct file *fget_task_next(struct task_struct *task, unsigned int *ret_fd)
 {
 	/* Must be called with rcu_read_lock held */
 	struct files_struct *files;
@@ -1069,17 +1047,19 @@ struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *
 	task_lock(task);
 	files = task->files;
 	if (files) {
+		rcu_read_lock();
 		for (; fd < files_fdtable(files)->max_fds; fd++) {
 			file = __fget_files_rcu(files, fd, 0);
 			if (file)
 				break;
 		}
+		rcu_read_unlock();
 	}
 	task_unlock(task);
 	*ret_fd = fd;
 	return file;
 }
-EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
+EXPORT_SYMBOL(fget_task_next);
 
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 269c3bc7fced..4701c4aafbf4 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -34,7 +34,6 @@
 #include <linux/lockref.h>
 #include <linux/rhashtable.h>
 #include <linux/pid_namespace.h>
-#include <linux/fdtable.h>
 #include <linux/file.h>
 
 #include "gfs2.h"
@@ -2768,25 +2767,18 @@ static struct file *gfs2_glockfd_next_file(struct gfs2_glockfd_iter *i)
 		i->file = NULL;
 	}
 
-	rcu_read_lock();
 	for(;; i->fd++) {
-		struct inode *inode;
-
-		i->file = task_lookup_next_fdget_rcu(i->task, &i->fd);
+		i->file = fget_task_next(i->task, &i->fd);
 		if (!i->file) {
 			i->fd = 0;
 			break;
 		}
 
-		inode = file_inode(i->file);
-		if (inode->i_sb == i->sb)
+		if (file_inode(i->file)->i_sb == i->sb)
 			break;
 
-		rcu_read_unlock();
 		fput(i->file);
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 	return i->file;
 }
 
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index d5dbef7f5c95..6004dfdfdf0f 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -16,7 +16,6 @@
 #include <linux/security.h>
 #include <linux/spinlock.h>
 #include <linux/slab.h>
-#include <linux/fdtable.h>
 #include <linux/fsnotify_backend.h>
 
 static int dir_notify_enable __read_mostly = 1;
@@ -347,9 +346,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg)
 		new_fsn_mark = NULL;
 	}
 
-	rcu_read_lock();
-	f = lookup_fdget_rcu(fd);
-	rcu_read_unlock();
+	f = fget_raw(fd);
 
 	/* if (f != filp) means that we lost a race and another task/thread
 	 * actually closed the fd we are still playing with before we grabbed
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 1f54a54bfb91..18d0dddc8e2f 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -116,9 +116,7 @@ static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
 {
 	struct file *file;
 
-	rcu_read_lock();
-	file = task_lookup_fdget_rcu(task, fd);
-	rcu_read_unlock();
+	file = fget_task(task, fd);
 	if (file) {
 		*mode = file->f_mode;
 		fput(file);
@@ -258,19 +256,17 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 	if (!dir_emit_dots(file, ctx))
 		goto out;
 
-	rcu_read_lock();
 	for (fd = ctx->pos - 2;; fd++) {
 		struct file *f;
 		struct fd_data data;
 		char name[10 + 1];
 		unsigned int len;
 
-		f = task_lookup_next_fdget_rcu(p, &fd);
+		f = fget_task_next(p, &fd);
 		ctx->pos = fd + 2LL;
 		if (!f)
 			break;
 		data.mode = f->f_mode;
-		rcu_read_unlock();
 		fput(f);
 		data.fd = fd;
 
@@ -278,11 +274,9 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 		if (!proc_fill_cache(file, ctx,
 				     name, len, instantiate, p,
 				     &data))
-			goto out;
+			break;
 		cond_resched();
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 out:
 	put_task_struct(p);
 	return 0;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index b1c5722f2b3c..e25e2cb65d30 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -92,10 +92,6 @@ static inline struct file *files_lookup_fd_locked(struct files_struct *files, un
 	return files_lookup_fd_raw(files, fd);
 }
 
-struct file *lookup_fdget_rcu(unsigned int fd);
-struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd);
-struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *fd);
-
 static inline bool close_on_exec(unsigned int fd, const struct files_struct *files)
 {
 	return test_bit(fd, files_fdtable(files)->close_on_exec);
diff --git a/include/linux/file.h b/include/linux/file.h
index f98de143245a..ec4ad5e6a061 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -72,6 +72,7 @@ static inline void fdput(struct fd fd)
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_raw(unsigned int fd);
 extern struct file *fget_task(struct task_struct *task, unsigned int fd);
+extern struct file *fget_task_next(struct task_struct *task, unsigned int *fd);
 extern void __f_unlock_pos(struct file *);
 
 struct fd fdget(unsigned int fd);
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 02aa9db8d796..7fe602ca74a0 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -5,7 +5,6 @@
 #include <linux/namei.h>
 #include <linux/pid_namespace.h>
 #include <linux/fs.h>
-#include <linux/fdtable.h>
 #include <linux/filter.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/btf_ids.h>
@@ -286,17 +285,14 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info)
 			curr_fd = 0;
 	}
 
-	rcu_read_lock();
-	f = task_lookup_next_fdget_rcu(curr_task, &curr_fd);
+	f = fget_task_next(curr_task, &curr_fd);
 	if (f) {
 		/* set info->fd */
 		info->fd = curr_fd;
-		rcu_read_unlock();
 		return f;
 	}
 
 	/* the current task is done, go to the next task */
-	rcu_read_unlock();
 	put_task_struct(curr_task);
 
 	if (info->common.type == BPF_TASK_ITER_TID) {
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index b0639f21041f..2c596851f8a9 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -63,9 +63,7 @@ get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
 	struct file *file;
 
-	rcu_read_lock();
-	file = task_lookup_fdget_rcu(task, idx);
-	rcu_read_unlock();
+	file = fget_task(task, idx);
 	if (file)
 		fput(file);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 02/11] remove pointless includes of <linux/fdtable.h>
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 03/11] close_files(): don't bother with xchg() Al Viro
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

some of those used to be needed, some had been cargo-culted for
no reason...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fcntl.c                         | 1 -
 fs/file_table.c                    | 1 -
 fs/notify/fanotify/fanotify.c      | 1 -
 fs/notify/fanotify/fanotify_user.c | 1 -
 fs/overlayfs/copy_up.c             | 1 -
 fs/proc/base.c                     | 1 -
 io_uring/io_uring.c                | 1 -
 kernel/bpf/bpf_inode_storage.c     | 1 -
 kernel/bpf/bpf_task_storage.c      | 1 -
 kernel/bpf/token.c                 | 1 -
 kernel/exit.c                      | 1 -
 kernel/module/dups.c               | 1 -
 kernel/module/kmod.c               | 1 -
 kernel/umh.c                       | 1 -
 net/handshake/request.c            | 1 -
 security/apparmor/domain.c         | 1 -
 16 files changed, 16 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22dd9dcce7ec..8928874c8a2e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -12,7 +12,6 @@
 #include <linux/fs.h>
 #include <linux/filelock.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/capability.h>
 #include <linux/dnotify.h>
 #include <linux/slab.h>
diff --git a/fs/file_table.c b/fs/file_table.c
index eed5ffad9997..9e46fd4336b0 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -9,7 +9,6 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/fs.h>
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 224bccaab4cc..24c7c5df4998 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/fanotify.h>
-#include <linux/fdtable.h>
 #include <linux/fsnotify_backend.h>
 #include <linux/init.h>
 #include <linux/jiffies.h>
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9644bc72e457..61b83039771e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1,7 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/fanotify.h>
 #include <linux/fcntl.h>
-#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/anon_inodes.h>
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 2ed6ad641a20..ee2cbd044ce6 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -16,7 +16,6 @@
 #include <linux/sched/signal.h>
 #include <linux/cred.h>
 #include <linux/namei.h>
-#include <linux/fdtable.h>
 #include <linux/ratelimit.h>
 #include <linux/exportfs.h>
 #include "overlayfs.h"
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b31283d81c52..e9d7ddc52f69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -58,7 +58,6 @@
 #include <linux/init.h>
 #include <linux/capability.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/generic-radix-tree.h>
 #include <linux/string.h>
 #include <linux/seq_file.h>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index b2736e3491b8..5a1676bab998 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -51,7 +51,6 @@
 #include <linux/sched/signal.h>
 #include <linux/fs.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/percpu.h>
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 29da6d3838f6..e16e79f8cd6d 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -16,7 +16,6 @@
 #include <uapi/linux/btf.h>
 #include <linux/bpf_lsm.h>
 #include <linux/btf_ids.h>
-#include <linux/fdtable.h>
 #include <linux/rcupdate_trace.h>
 
 DEFINE_BPF_STORAGE_CACHE(inode_cache);
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index adf6dfe0ba68..1eb9852a9f8e 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -16,7 +16,6 @@
 #include <linux/filter.h>
 #include <uapi/linux/btf.h>
 #include <linux/btf_ids.h>
-#include <linux/fdtable.h>
 #include <linux/rcupdate_trace.h>
 
 DEFINE_BPF_STORAGE_CACHE(task_cache);
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index dcbec1a0dfb3..26057aa13503 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -1,6 +1,5 @@
 #include <linux/bpf.h>
 #include <linux/vmalloc.h>
-#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/kernel.h>
diff --git a/kernel/exit.c b/kernel/exit.c
index 619f0014c33b..1dcddfe537ee 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -25,7 +25,6 @@
 #include <linux/acct.h>
 #include <linux/tsacct_kern.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/freezer.h>
 #include <linux/binfmts.h>
 #include <linux/nsproxy.h>
diff --git a/kernel/module/dups.c b/kernel/module/dups.c
index 9a92f2f8c9d3..bd2149fbe117 100644
--- a/kernel/module/dups.c
+++ b/kernel/module/dups.c
@@ -18,7 +18,6 @@
 #include <linux/completion.h>
 #include <linux/cred.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
 #include <linux/mount.h>
diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c
index 0800d9891692..25f253812512 100644
--- a/kernel/module/kmod.c
+++ b/kernel/module/kmod.c
@@ -15,7 +15,6 @@
 #include <linux/completion.h>
 #include <linux/cred.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
 #include <linux/mount.h>
diff --git a/kernel/umh.c b/kernel/umh.c
index ff1f13a27d29..be9234270777 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -13,7 +13,6 @@
 #include <linux/completion.h>
 #include <linux/cred.h>
 #include <linux/file.h>
-#include <linux/fdtable.h>
 #include <linux/fs_struct.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
diff --git a/net/handshake/request.c b/net/handshake/request.c
index 94d5cef3e048..274d2c89b6b2 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -13,7 +13,6 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/inet.h>
-#include <linux/fdtable.h>
 #include <linux/rhashtable.h>
 
 #include <net/sock.h>
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 571158ec6188..2bc34dce9a46 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -9,7 +9,6 @@
  */
 
 #include <linux/errno.h>
-#include <linux/fdtable.h>
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/mount.h>
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 03/11] close_files(): don't bother with xchg()
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
  2024-10-07 17:43     ` [PATCH v3 02/11] remove pointless includes of <linux/fdtable.h> Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 04/11] move close_range(2) into fs/file.c, fold __close_range() into it Al Viro
                       ` (7 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

At that point nobody else has references to the victim files_struct;
as a matter of fact, the caller will free it immediately after
close_files() returns, with no RCU delays or anything of that sort.

That's why we are not protecting against fdtable reallocation on
expansion, nor cleaning the bitmaps, etc.  There's no point
zeroing the pointers in ->fd[] either, let alone making that an
atomic operation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 991860ee7848..8770010170c5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -413,7 +413,7 @@ static struct fdtable *close_files(struct files_struct * files)
 		set = fdt->open_fds[j++];
 		while (set) {
 			if (set & 1) {
-				struct file * file = xchg(&fdt->fd[i], NULL);
+				struct file *file = fdt->fd[i];
 				if (file) {
 					filp_close(file, files);
 					cond_resched();
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 04/11] move close_range(2) into fs/file.c, fold __close_range() into it
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
  2024-10-07 17:43     ` [PATCH v3 02/11] remove pointless includes of <linux/fdtable.h> Al Viro
  2024-10-07 17:43     ` [PATCH v3 03/11] close_files(): don't bother with xchg() Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 05/11] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Al Viro
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

	We never had callers for __close_range() except for close_range(2)
itself.  Nothing of that sort has appeared in four years and if any users
do show up, we can always separate those suckers again.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c               |  6 ++++--
 fs/open.c               | 17 -----------------
 include/linux/fdtable.h |  1 -
 3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 8770010170c5..8e8f504782bf 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -713,7 +713,7 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
 }
 
 /**
- * __close_range() - Close all file descriptors in a given range.
+ * sys_close_range() - Close all file descriptors in a given range.
  *
  * @fd:     starting file descriptor to close
  * @max_fd: last file descriptor to close
@@ -721,8 +721,10 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
  *
  * This closes a range of file descriptors. All file descriptors
  * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
  */
-int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
 {
 	struct task_struct *me = current;
 	struct files_struct *cur_fds = me->files, *fds = NULL;
diff --git a/fs/open.c b/fs/open.c
index acaeb3e25c88..62dd1383d6f9 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1574,23 +1574,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
-/**
- * sys_close_range() - Close all file descriptors in a given range.
- *
- * @fd:     starting file descriptor to close
- * @max_fd: last file descriptor to close
- * @flags:  reserved for future extensions
- *
- * This closes a range of file descriptors. All file descriptors
- * from @fd up to and including @max_fd are closed.
- * Currently, errors to close a given file descriptor are ignored.
- */
-SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
-		unsigned int, flags)
-{
-	return __close_range(fd, max_fd, flags);
-}
-
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e25e2cb65d30..c45306a9f007 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -111,7 +111,6 @@ int iterate_fd(struct files_struct *, unsigned,
 		const void *);
 
 extern int close_fd(unsigned int fd);
-extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
 extern struct file *file_close_fd(unsigned int fd);
 
 extern struct kmem_cache *files_cachep;
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread
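
For reference, a minimal userspace sketch of the syscall being consolidated
above (illustrative only, not part of the series; it assumes kernel headers
and a libc that expose SYS_close_range, available since Linux 5.9):

	/* Close every descriptor above stderr; per-fd errors are ignored. */
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		if (syscall(SYS_close_range, 3U, ~0U, 0) == -1) {
			perror("close_range");
			return 1;
		}
		return 0;
	}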

* [PATCH v3 05/11] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (2 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 04/11] move close_range(2) into fs/file.c, fold __close_range() into it Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 06/11] fs/file.c: conditionally clear full_fds Al Viro
                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

From: Yu Ma <yu.ma@intel.com>

alloc_fd() has a sanity check inside to make sure the struct file mapping to the
allocated fd is NULL.  Remove this sanity check, since it is already assured by
the existing zero initialization and by setting the slot to NULL when recycling
an fd.  Meanwhile, add likely/unlikely annotations and avoid the expand_files()
call where possible to reduce the work done under file_lock.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-2-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 8e8f504782bf..90b8aa2378cc 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -496,7 +496,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (fd < files->next_fd)
 		fd = files->next_fd;
 
-	if (fd < fdt->max_fds)
+	if (likely(fd < fdt->max_fds))
 		fd = find_next_fd(fdt, fd);
 
 	/*
@@ -504,19 +504,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	 * will limit the total number of files that can be opened.
 	 */
 	error = -EMFILE;
-	if (fd >= end)
+	if (unlikely(fd >= end))
 		goto out;
 
-	error = expand_files(files, fd);
-	if (error < 0)
-		goto out;
+	if (unlikely(fd >= fdt->max_fds)) {
+		error = expand_files(files, fd);
+		if (error < 0)
+			goto out;
 
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
-		goto repeat;
+		/*
+		 * If we needed to expand the fs array we
+		 * might have blocked - try again.
+		 */
+		if (error)
+			goto repeat;
+	}
 
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
@@ -527,13 +529,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
 
 out:
 	spin_unlock(&files->file_lock);
@@ -599,7 +594,7 @@ void fd_install(unsigned int fd, struct file *file)
 		rcu_read_unlock_sched();
 		spin_lock(&files->file_lock);
 		fdt = files_fdtable(files);
-		BUG_ON(fdt->fd[fd] != NULL);
+		WARN_ON(fdt->fd[fd] != NULL);
 		rcu_assign_pointer(fdt->fd[fd], file);
 		spin_unlock(&files->file_lock);
 		return;
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 06/11] fs/file.c: conditionally clear full_fds
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (3 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 05/11] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 07/11] fs/file.c: add fast path in find_next_fd() Al Viro
                       ` (4 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

From: Yu Ma <yu.ma@intel.com>

64 bits in open_fds are mapped to a common bit in full_fds_bits.  It is very
likely that the corresponding bit in full_fds_bits has already been cleared
by an earlier __clear_open_fd().  Check that bit in full_fds_bits before
clearing it, to avoid an unnecessary write and the cache bouncing it causes.
See commit fc90888d07b8 ("vfs: conditionally clear close-on-exec flag") for
a similar optimization.  Taking the stock kernel with patch 1 as baseline,
this improves pts/blogbench-1.1.0 read by 13% and write by 5% on an Intel
ICX 160-core configuration with v6.10-rc7.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-3-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 90b8aa2378cc..36c5089812f5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -264,7 +264,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
-	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+	fd /= BITS_PER_LONG;
+	if (test_bit(fd, fdt->full_fds_bits))
+		__clear_bit(fd, fdt->full_fds_bits);
 }
 
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread
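
The test-before-write trick used above is worth spelling out; a generic
sketch of the idea (illustrative, not the kernel helper):

	/*
	 * Clearing a bit that is already clear still dirties the cache line
	 * and forces other CPUs sharing it to re-fetch; reading first keeps
	 * the line in shared state in the common case.
	 */
	static inline void clear_bit_if_set(unsigned int nr, unsigned long *map)
	{
		unsigned long mask = 1UL << (nr % (8 * sizeof(unsigned long)));
		unsigned long *word = &map[nr / (8 * sizeof(unsigned long))];

		if (*word & mask)	/* read-only in the common case */
			*word &= ~mask;
	}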

* [PATCH v3 07/11] fs/file.c: add fast path in find_next_fd()
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (4 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 06/11] fs/file.c: conditionally clear full_fds Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 08/11] alloc_fdtable(): change calling conventions Al Viro
                       ` (3 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

From: Yu Ma <yu.ma@intel.com>

Skip the 2-level search via find_next_zero_bit() when there is a free slot
in the word containing next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64, which
speeds up the search.
(3) Once the fdt has been expanded (the bitmap size doubles with each
expansion), it is never shrunk.  The search range keeps growing even though
relatively few fds are actually open.

This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed
to by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous
versions.  On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read
by 8% and write by 4% on an Intel ICX 160-core configuration with v6.10-rc7.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-4-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 36c5089812f5..236d8bbadb0e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -472,6 +472,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
 	unsigned int maxbit = maxfd / BITS_PER_LONG;
 	unsigned int bitbit = start / BITS_PER_LONG;
+	unsigned int bit;
+
+	/*
+	 * Try to avoid looking at the second level bitmap
+	 */
+	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+				 start & (BITS_PER_LONG - 1));
+	if (bit < BITS_PER_LONG)
+		return bit + bitbit * BITS_PER_LONG;
 
 	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
 	if (bitbit >= maxfd)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread
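
To make the two-level structure easier to picture, a standalone userspace
sketch of the lookup (illustrative; names are made up, maxfd is assumed to
be a multiple of the word size, and sysctl handling is omitted):

	#define BITS_PER_LONG	(8 * (unsigned int)sizeof(unsigned long))

	/*
	 * open[] has one bit per fd; full[] has one bit per word of open[]
	 * that is completely allocated.  Fast path: look at the word holding
	 * 'start' before touching the second-level bitmap at all.
	 */
	static unsigned int find_free_slot(const unsigned long *open,
					   const unsigned long *full,
					   unsigned int maxfd, unsigned int start)
	{
		unsigned int word = start / BITS_PER_LONG;
		unsigned long w = open[word] | ((1UL << (start % BITS_PER_LONG)) - 1);

		if (~w)		/* free slot at or above 'start' in its own word */
			return word * BITS_PER_LONG + __builtin_ctzl(~w);

		for (word++; word < maxfd / BITS_PER_LONG; word++) {
			if (full[word / BITS_PER_LONG] & (1UL << (word % BITS_PER_LONG)))
				continue;	/* whole word known to be in use */
			if (~open[word])
				return word * BITS_PER_LONG + __builtin_ctzl(~open[word]);
		}
		return maxfd;	/* no free slot at or above 'start' */
	}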

* [PATCH v3 08/11] alloc_fdtable(): change calling conventions.
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (5 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 07/11] fs/file.c: add fast path in find_next_fd() Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 09/11] file.c: merge __{set,clear}_close_on_exec() Al Viro
                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

First of all, tell it how many slots we want, not which slot
is wanted.  That makes one caller (dup_fd()) more straightforward
and doesn't harm the other (expand_fdtable()).

Furthermore, make it return ERR_PTR() on failure rather than
returning NULL.  Simplifies the callers.

Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG.  What the
rules boil down to is
	* use the smallest power of two large enough to give us
that many slots
	* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
	* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less.  128 slots means
1Kb fd array, again.
	* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 75 +++++++++++++++++++++----------------------------------
 1 file changed, 29 insertions(+), 46 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 236d8bbadb0e..7e5e9803a173 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -89,18 +89,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
  * 'unsigned long' in some places, but simply because that is how the Linux
  * kernel bitmaps are defined to work: they are not "bits in an array of bytes",
  * they are very much "bits in an array of unsigned long".
- *
- * The ALIGN(nr, BITS_PER_LONG) here is for clarity: since we just multiplied
- * by that "1024/sizeof(ptr)" before, we already know there are sufficient
- * clear low bits. Clang seems to realize that, gcc ends up being confused.
- *
- * On a 128-bit machine, the ALIGN() would actually matter. In the meantime,
- * let's consider it documentation (and maybe a test-case for gcc to improve
- * its code generation ;)
  */
-static struct fdtable * alloc_fdtable(unsigned int nr)
+static struct fdtable *alloc_fdtable(unsigned int slots_wanted)
 {
 	struct fdtable *fdt;
+	unsigned int nr;
 	void *data;
 
 	/*
@@ -108,22 +101,32 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	 * Allocation steps are keyed to the size of the fdarray, since it
 	 * grows far faster than any of the other dynamic data. We try to fit
 	 * the fdarray into comfortable page-tuned chunks: starting at 1024B
-	 * and growing in powers of two from there on.
+	 * and growing in powers of two from there on.  Since we called only
+	 * with slots_wanted > BITS_PER_LONG (embedded instance in files->fdtab
+	 * already gives BITS_PER_LONG slots), the above boils down to
+	 * 1.  use the smallest power of two large enough to give us that many
+	 * slots.
+	 * 2.  on 32bit skip 64 and 128 - the minimal capacity we want there is
+	 * 256 slots (i.e. 1Kb fd array).
+	 * 3.  on 64bit don't skip anything, 1Kb fd array means 128 slots there
+	 * and we are never going to be asked for 64 or less.
 	 */
-	nr /= (1024 / sizeof(struct file *));
-	nr = roundup_pow_of_two(nr + 1);
-	nr *= (1024 / sizeof(struct file *));
-	nr = ALIGN(nr, BITS_PER_LONG);
+	if (IS_ENABLED(CONFIG_32BIT) && slots_wanted < 256)
+		nr = 256;
+	else
+		nr = roundup_pow_of_two(slots_wanted);
 	/*
 	 * Note that this can drive nr *below* what we had passed if sysctl_nr_open
-	 * had been set lower between the check in expand_files() and here.  Deal
-	 * with that in caller, it's cheaper that way.
+	 * had been set lower between the check in expand_files() and here.
 	 *
 	 * We make sure that nr remains a multiple of BITS_PER_LONG - otherwise
 	 * bitmaps handling below becomes unpleasant, to put it mildly...
 	 */
-	if (unlikely(nr > sysctl_nr_open))
-		nr = ((sysctl_nr_open - 1) | (BITS_PER_LONG - 1)) + 1;
+	if (unlikely(nr > sysctl_nr_open)) {
+		nr = round_down(sysctl_nr_open, BITS_PER_LONG);
+		if (nr < slots_wanted)
+			return ERR_PTR(-EMFILE);
+	}
 
 	fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL_ACCOUNT);
 	if (!fdt)
@@ -152,7 +155,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 out_fdt:
 	kfree(fdt);
 out:
-	return NULL;
+	return ERR_PTR(-ENOMEM);
 }
 
 /*
@@ -169,7 +172,7 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	struct fdtable *new_fdt, *cur_fdt;
 
 	spin_unlock(&files->file_lock);
-	new_fdt = alloc_fdtable(nr);
+	new_fdt = alloc_fdtable(nr + 1);
 
 	/* make sure all fd_install() have seen resize_in_progress
 	 * or have finished their rcu_read_lock_sched() section.
@@ -178,16 +181,8 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 		synchronize_rcu();
 
 	spin_lock(&files->file_lock);
-	if (!new_fdt)
-		return -ENOMEM;
-	/*
-	 * extremely unlikely race - sysctl_nr_open decreased between the check in
-	 * caller and alloc_fdtable().  Cheaper to catch it here...
-	 */
-	if (unlikely(new_fdt->max_fds <= nr)) {
-		__free_fdtable(new_fdt);
-		return -EMFILE;
-	}
+	if (IS_ERR(new_fdt))
+		return PTR_ERR(new_fdt);
 	cur_fdt = files_fdtable(files);
 	BUG_ON(nr < cur_fdt->max_fds);
 	copy_fdtable(new_fdt, cur_fdt);
@@ -308,7 +303,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_ho
 	struct file **old_fds, **new_fds;
 	unsigned int open_files, i;
 	struct fdtable *old_fdt, *new_fdt;
-	int error;
 
 	newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
 	if (!newf)
@@ -340,17 +334,10 @@ struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_ho
 		if (new_fdt != &newf->fdtab)
 			__free_fdtable(new_fdt);
 
-		new_fdt = alloc_fdtable(open_files - 1);
-		if (!new_fdt) {
-			error = -ENOMEM;
-			goto out_release;
-		}
-
-		/* beyond sysctl_nr_open; nothing to do */
-		if (unlikely(new_fdt->max_fds < open_files)) {
-			__free_fdtable(new_fdt);
-			error = -EMFILE;
-			goto out_release;
+		new_fdt = alloc_fdtable(open_files);
+		if (IS_ERR(new_fdt)) {
+			kmem_cache_free(files_cachep, newf);
+			return ERR_CAST(new_fdt);
 		}
 
 		/*
@@ -391,10 +378,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, struct fd_range *punch_ho
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
 	return newf;
-
-out_release:
-	kmem_cache_free(files_cachep, newf);
-	return ERR_PTR(error);
 }
 
 static struct fdtable *close_files(struct files_struct * files)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread
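
As a quick check of the sizing rules spelled out above, a small userspace
sketch that mirrors the slots_wanted -> nr computation (an illustrative
reimplementation; the kernel's roundup_pow_of_two() is replaced by a loop
and the sysctl_nr_open clamp is omitted):

	#include <stdio.h>

	static unsigned int roundup_pow_of_two(unsigned int n)
	{
		unsigned int p = 1;

		while (p < n)
			p <<= 1;
		return p;
	}

	/* callers always ask for more than BITS_PER_LONG slots */
	static unsigned int fdtable_slots(unsigned int slots_wanted, int is_32bit)
	{
		if (is_32bit && slots_wanted < 256)
			return 256;	/* never below a 1Kb fd array on 32bit */
		return roundup_pow_of_two(slots_wanted);
	}

	int main(void)
	{
		unsigned int wanted[] = { 65, 100, 300, 1000 };

		for (int i = 0; i < 4; i++)
			printf("%4u -> 64bit: %4u  32bit: %4u\n", wanted[i],
			       fdtable_slots(wanted[i], 0),
			       fdtable_slots(wanted[i], 1));
		return 0;
	}

For instance, 65 slots wanted rounds up to 128 on 64bit (a 1Kb array of
struct file pointers) and to 256 on 32bit, matching the "never less than a
1Kb fd array" rule above.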

* [PATCH v3 09/11] file.c: merge __{set,clear}_close_on_exec()
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (6 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 08/11] alloc_fdtable(): change calling conventions Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-07 17:43     ` [PATCH v3 10/11] make __set_open_fd() set cloexec state as well Al Viro
  2024-10-07 17:43     ` [PATCH v3 11/11] expand_files(): simplify calling conventions Al Viro
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

they always go in pairs; seeing that they are inlined, might
as well make that a single inline function taking a boolean
argument ("do we want close_on_exec set for that descriptor")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7e5e9803a173..d8fccd4796a9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -237,15 +237,15 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 	return expanded;
 }
 
-static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt)
+static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
+				       bool set)
 {
-	__set_bit(fd, fdt->close_on_exec);
-}
-
-static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
-{
-	if (test_bit(fd, fdt->close_on_exec))
-		__clear_bit(fd, fdt->close_on_exec);
+	if (set) {
+		__set_bit(fd, fdt->close_on_exec);
+	} else {
+		if (test_bit(fd, fdt->close_on_exec))
+			__clear_bit(fd, fdt->close_on_exec);
+	}
 }
 
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
@@ -518,10 +518,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 		files->next_fd = fd + 1;
 
 	__set_open_fd(fd, fdt);
-	if (flags & O_CLOEXEC)
-		__set_close_on_exec(fd, fdt);
-	else
-		__clear_close_on_exec(fd, fdt);
+	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
 	error = fd;
 
 out:
@@ -1147,13 +1144,8 @@ void __f_unlock_pos(struct file *f)
 void set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
-	struct fdtable *fdt;
 	spin_lock(&files->file_lock);
-	fdt = files_fdtable(files);
-	if (flag)
-		__set_close_on_exec(fd, fdt);
-	else
-		__clear_close_on_exec(fd, fdt);
+	__set_close_on_exec(fd, files_fdtable(files), flag);
 	spin_unlock(&files->file_lock);
 }
 
@@ -1195,10 +1187,7 @@ __releases(&files->file_lock)
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
 	__set_open_fd(fd, fdt);
-	if (flags & O_CLOEXEC)
-		__set_close_on_exec(fd, fdt);
-	else
-		__clear_close_on_exec(fd, fdt);
+	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (7 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 09/11] file.c: merge __{set,clear}_close_on_exec() Al Viro
@ 2024-10-07 17:43     ` Al Viro
  2024-10-09 10:50       ` Steven Price
       [not found]       ` <CGME20241009114254eucas1p2c174307fe53b3d5563795cc8eb92b91d@eucas1p2.samsung.com>
  2024-10-07 17:43     ` [PATCH v3 11/11] expand_files(): simplify calling conventions Al Viro
  9 siblings, 2 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

->close_on_exec[] state is maintained only for opened descriptors;
as the result, anything that marks a descriptor opened has to
set its cloexec state explicitly.

As the result, all calls of __set_open_fd() are followed by
__set_close_on_exec(); might as well fold it into __set_open_fd()
so that cloexec state is defined as soon as the descriptor is
marked opened.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index d8fccd4796a9..b63294ed85ec 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
 	}
 }
 
-static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
 {
 	__set_bit(fd, fdt->open_fds);
 	fd /= BITS_PER_LONG;
 	if (!~fdt->open_fds[fd])
 		__set_bit(fd, fdt->full_fds_bits);
+	__set_close_on_exec(fd, fdt, set);
 }
 
 static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
@@ -517,8 +518,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
 
-	__set_open_fd(fd, fdt);
-	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
+	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
 	error = fd;
 
 out:
@@ -1186,8 +1186,7 @@ __releases(&files->file_lock)
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
-	__set_open_fd(fd, fdt);
-	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
+	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 11/11] expand_files(): simplify calling conventions
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
                       ` (8 preceding siblings ...)
  2024-10-07 17:43     ` [PATCH v3 10/11] make __set_open_fd() set cloexec state as well Al Viro
@ 2024-10-07 17:43     ` Al Viro
  9 siblings, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-07 17:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, brauner

All callers treat 0 and 1 returned by expand_files() in the same way
now that the call in alloc_fd() has been made conditional.  Just make
it return 0 on success and be done with it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index b63294ed85ec..70cd6d8c6257 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -162,7 +162,7 @@ static struct fdtable *alloc_fdtable(unsigned int slots_wanted)
  * Expand the file descriptor table.
  * This function will allocate a new fdtable and both fd array and fdset, of
  * the given size.
- * Return <0 error code on error; 1 on successful completion.
+ * Return <0 error code on error; 0 on successful completion.
  * The files->file_lock should be held on entry, and will be held on exit.
  */
 static int expand_fdtable(struct files_struct *files, unsigned int nr)
@@ -191,15 +191,14 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
 	/* coupled with smp_rmb() in fd_install() */
 	smp_wmb();
-	return 1;
+	return 0;
 }
 
 /*
  * Expand files.
  * This function will expand the file structures, if the requested size exceeds
  * the current capacity and there is room for expansion.
- * Return <0 error code on error; 0 when nothing done; 1 when files were
- * expanded and execution may have blocked.
+ * Return <0 error code on error; 0 on success.
  * The files->file_lock should be held on entry, and will be held on exit.
  */
 static int expand_files(struct files_struct *files, unsigned int nr)
@@ -207,14 +206,14 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 	__acquires(files->file_lock)
 {
 	struct fdtable *fdt;
-	int expanded = 0;
+	int error;
 
 repeat:
 	fdt = files_fdtable(files);
 
 	/* Do we need to expand? */
 	if (nr < fdt->max_fds)
-		return expanded;
+		return 0;
 
 	/* Can we expand? */
 	if (nr >= sysctl_nr_open)
@@ -222,7 +221,6 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 
 	if (unlikely(files->resize_in_progress)) {
 		spin_unlock(&files->file_lock);
-		expanded = 1;
 		wait_event(files->resize_wait, !files->resize_in_progress);
 		spin_lock(&files->file_lock);
 		goto repeat;
@@ -230,11 +228,11 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 
 	/* All good, so we try */
 	files->resize_in_progress = true;
-	expanded = expand_fdtable(files, nr);
+	error = expand_fdtable(files, nr);
 	files->resize_in_progress = false;
 
 	wake_up_all(&files->resize_wait);
-	return expanded;
+	return error;
 }
 
 static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
@@ -507,12 +505,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 		if (error < 0)
 			goto out;
 
-		/*
-		 * If we needed to expand the fs array we
-		 * might have blocked - try again.
-		 */
-		if (error)
-			goto repeat;
+		goto repeat;
 	}
 
 	if (start <= files->next_fd)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCHES] fdtable series v3
  2024-10-07 17:39 ` [PATCHES] fdtable series v3 Al Viro
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
@ 2024-10-07 18:12   ` Linus Torvalds
  2024-10-08 11:15   ` Christian Brauner
  2 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-10-07 18:12 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Christian Brauner

On Mon, 7 Oct 2024 at 10:39, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Changes since v2:
>         Rebased to 6.12-rc2, now that close_range() fix got into
> mainline.  No more than that so far (there will be followup
> cleanups, but for now I just want it posted and in -next).

Looks all sane to me, the patches look clean, and removing over a
hundred lines certainly is an improvement.

                     Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCHES] fdtable series v3
  2024-10-07 17:39 ` [PATCHES] fdtable series v3 Al Viro
  2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
  2024-10-07 18:12   ` [PATCHES] fdtable series v3 Linus Torvalds
@ 2024-10-08 11:15   ` Christian Brauner
  2 siblings, 0 replies; 33+ messages in thread
From: Christian Brauner @ 2024-10-08 11:15 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Oct 07, 2024 at 06:39:12PM GMT, Al Viro wrote:
> Changes since v2:
> 	Rebased to 6.12-rc2, now that close_range() fix got into
> mainline.  No more than that so far (there will be followup
> cleanups, but for now I just want it posted and in -next).
> 
> Changes since v1:
>  	close_range() fix added in the beginning of the series,
> dup_fd() calling convention change folded into it (as requested
> by Linus), the rest rebased on top of that.
>  	sane_fdtable_size() change is dropped, as it's obsoleted
> by close_range() fix.  Several patches added at the end of
> series.
> 
> 	Same branch -
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.fdtable
> individual patches in followups.
> 
> Christian, I can move that to vfs/vfs.git if that's more convenient
> for you - we are about to step on each other's toes with that and

That's fine with me but note, there were no conflicts at all when I
pulled this whole series into vfs.file on top of my stuff. If you don't
mind I can just keep it there.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
  2024-10-07 17:43     ` [PATCH v3 10/11] make __set_open_fd() set cloexec state as well Al Viro
@ 2024-10-09 10:50       ` Steven Price
  2024-10-09 14:19         ` Christian Brauner
       [not found]       ` <CGME20241009114254eucas1p2c174307fe53b3d5563795cc8eb92b91d@eucas1p2.samsung.com>
  1 sibling, 1 reply; 33+ messages in thread
From: Steven Price @ 2024-10-09 10:50 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel; +Cc: torvalds, brauner

On 07/10/2024 18:43, Al Viro wrote:
> ->close_on_exec[] state is maintained only for opened descriptors;
> as the result, anything that marks a descriptor opened has to
> set its cloexec state explicitly.
> 
> As the result, all calls of __set_open_fd() are followed by
> __set_close_on_exec(); might as well fold it into __set_open_fd()
> so that cloexec state is defined as soon as the descriptor is
> marked opened.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/file.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index d8fccd4796a9..b63294ed85ec 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
>  	}
>  }
>  
> -static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
>  {
>  	__set_bit(fd, fdt->open_fds);
>  	fd /= BITS_PER_LONG;

Here fd is being modified...

>  	if (!~fdt->open_fds[fd])
>  		__set_bit(fd, fdt->full_fds_bits);
> +	__set_close_on_exec(fd, fdt, set);

... which means fd here isn't the same as the passed in value. So this
call to __set_close_on_exec affects a different fd to the expected one.
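
For illustration, a small user-space model of the mix-up (this is not
the kernel code; the helpers below are simplified stand-ins that ignore
full_fds_bits and only handle setting the bit, never clearing it):

#include <limits.h>
#include <stdio.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(long))

static unsigned long open_fds[8], close_on_exec[8];

static void set_bit_in(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static int test_bit_in(unsigned int nr, const unsigned long *map)
{
	return !!(map[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG)));
}

/* mirrors the buggy ordering: fd becomes a word index before the
 * close-on-exec helper gets to see it */
static void buggy_set_open_fd(unsigned int fd, int cloexec)
{
	set_bit_in(fd, open_fds);
	fd /= BITS_PER_LONG;			/* fd is now a word index... */
	if (cloexec)
		set_bit_in(fd, close_on_exec);	/* ...but used as a descriptor */
}

int main(void)
{
	unsigned int fd = 130, wrong = 130 / BITS_PER_LONG;

	buggy_set_open_fd(fd, 1);	/* meant to mark fd 130 close-on-exec */
	printf("fd %u cloexec: %d, fd %u cloexec: %d\n",
	       fd, test_bit_in(fd, close_on_exec),
	       wrong, test_bit_in(wrong, close_on_exec));
	return 0;	/* prints 0 for fd 130 and 1 for fd / BITS_PER_LONG */
}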

Steve

>  }
>  
>  static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
> @@ -517,8 +518,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>  	if (start <= files->next_fd)
>  		files->next_fd = fd + 1;
>  
> -	__set_open_fd(fd, fdt);
> -	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
> +	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
>  	error = fd;
>  
>  out:
> @@ -1186,8 +1186,7 @@ __releases(&files->file_lock)
>  		goto Ebusy;
>  	get_file(file);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> -	__set_open_fd(fd, fdt);
> -	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
> +	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
>  	spin_unlock(&files->file_lock);
>  
>  	if (tofree)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
       [not found]       ` <CGME20241009114254eucas1p2c174307fe53b3d5563795cc8eb92b91d@eucas1p2.samsung.com>
@ 2024-10-09 11:42         ` Marek Szyprowski
  0 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2024-10-09 11:42 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel; +Cc: torvalds, brauner

On 07.10.2024 19:43, Al Viro wrote:
> ->close_on_exec[] state is maintained only for opened descriptors;
> as the result, anything that marks a descriptor opened has to
> set its cloexec state explicitly.
>
> As the result, all calls of __set_open_fd() are followed by
> __set_close_on_exec(); might as well fold it into __set_open_fd()
> so that cloexec state is defined as soon as the descriptor is
> marked opened.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

This patch landed in today's linux-next as commit 218a562f273b ("make 
__set_open_fd() set cloexec state as well"). In my tests I found that it 
breaks booting of many of my test systems (arm 32bit, arm 64bit and 
riscv64). It's hard to describe what exactly is broken, but none of the 
affected boards reached the login shell. All crashed somewhere in
userspace during systemd startup. This can be easily reproduced even
with QEMU.

> ---
>   fs/file.c | 9 ++++-----
>   1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index d8fccd4796a9..b63294ed85ec 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
>   	}
>   }
>   
> -static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
>   {
>   	__set_bit(fd, fdt->open_fds);
>   	fd /= BITS_PER_LONG;
>   	if (!~fdt->open_fds[fd])
>   		__set_bit(fd, fdt->full_fds_bits);
> +	__set_close_on_exec(fd, fdt, set);
>   }
>   
>   static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
> @@ -517,8 +518,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>   	if (start <= files->next_fd)
>   		files->next_fd = fd + 1;
>   
> -	__set_open_fd(fd, fdt);
> -	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
> +	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
>   	error = fd;
>   
>   out:
> @@ -1186,8 +1186,7 @@ __releases(&files->file_lock)
>   		goto Ebusy;
>   	get_file(file);
>   	rcu_assign_pointer(fdt->fd[fd], file);
> -	__set_open_fd(fd, fdt);
> -	__set_close_on_exec(fd, fdt, flags & O_CLOEXEC);
> +	__set_open_fd(fd, fdt, flags & O_CLOEXEC);
>   	spin_unlock(&files->file_lock);
>   
>   	if (tofree)

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
  2024-10-09 10:50       ` Steven Price
@ 2024-10-09 14:19         ` Christian Brauner
  2024-10-09 15:24           ` Al Viro
  0 siblings, 1 reply; 33+ messages in thread
From: Christian Brauner @ 2024-10-09 14:19 UTC (permalink / raw)
  To: Steven Price; +Cc: Al Viro, linux-fsdevel, torvalds

On Wed, Oct 09, 2024 at 11:50:36AM GMT, Steven Price wrote:
> On 07/10/2024 18:43, Al Viro wrote:
> > ->close_on_exec[] state is maintained only for opened descriptors;
> > as the result, anything that marks a descriptor opened has to
> > set its cloexec state explicitly.
> > 
> > As the result, all calls of __set_open_fd() are followed by
> > __set_close_on_exec(); might as well fold it into __set_open_fd()
> > so that cloexec state is defined as soon as the descriptor is
> > marked opened.
> > 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> >  fs/file.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/file.c b/fs/file.c
> > index d8fccd4796a9..b63294ed85ec 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
> >  	}
> >  }
> >  
> > -static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> > +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
> >  {
> >  	__set_bit(fd, fdt->open_fds);
> >  	fd /= BITS_PER_LONG;
> 
> Here fd is being modified...
> 
> >  	if (!~fdt->open_fds[fd])
> >  		__set_bit(fd, fdt->full_fds_bits);
> > +	__set_close_on_exec(fd, fdt, set);
> 
> ... which means fd here isn't the same as the passed in value. So this
> call to __set_close_on_exec affects a different fd to the expected one.

Good spot. Al, I folded in the fix below, so from my end you don't have to
do anything unless you want to nitpick how to fix it. A local variable
looked ugly to me.

Pushed and running through tests now.

diff --git a/fs/file.c b/fs/file.c
index c039e21c1cd1..52654415f17e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -354,12 +354,17 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
        }
 }

-static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
+static inline void __set_open(unsigned int fd, struct fdtable *fdt, bool set)
 {
        __set_bit(fd, fdt->open_fds);
        fd /= BITS_PER_LONG;
        if (!~fdt->open_fds[fd])
                __set_bit(fd, fdt->full_fds_bits);
+}
+
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
+{
+       __set_open(fd, fdt, set);
        __set_close_on_exec(fd, fdt, set);
 }

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
  2024-10-09 14:19         ` Christian Brauner
@ 2024-10-09 15:24           ` Al Viro
  2024-10-09 15:36             ` Al Viro
  2024-10-10  8:13             ` Christian Brauner
  0 siblings, 2 replies; 33+ messages in thread
From: Al Viro @ 2024-10-09 15:24 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Steven Price, linux-fsdevel, torvalds

On Wed, Oct 09, 2024 at 04:19:47PM +0200, Christian Brauner wrote:
> On Wed, Oct 09, 2024 at 11:50:36AM GMT, Steven Price wrote:
> > On 07/10/2024 18:43, Al Viro wrote:
> > > ->close_on_exec[] state is maintained only for opened descriptors;
> > > as the result, anything that marks a descriptor opened has to
> > > set its cloexec state explicitly.
> > > 
> > > As the result, all calls of __set_open_fd() are followed by
> > > __set_close_on_exec(); might as well fold it into __set_open_fd()
> > > so that cloexec state is defined as soon as the descriptor is
> > > marked opened.
> > > 
> > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > > ---
> > >  fs/file.c | 9 ++++-----
> > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/file.c b/fs/file.c
> > > index d8fccd4796a9..b63294ed85ec 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
> > >  	}
> > >  }
> > >  
> > > -static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> > > +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
> > >  {
> > >  	__set_bit(fd, fdt->open_fds);
> > >  	fd /= BITS_PER_LONG;
> > 
> > Here fd is being modified...
> > 
> > >  	if (!~fdt->open_fds[fd])
> > >  		__set_bit(fd, fdt->full_fds_bits);
> > > +	__set_close_on_exec(fd, fdt, set);
> > 
> > ... which means fd here isn't the same as the passed in value. So this
> > call to __set_close_on_exec affects a different fd to the expected one.

ACK.

> Good spot. Al, I folded the below fix so from my end you don't have to
> do anything unless you want to nitpick how to fix it. Local variable
> looked ugly to me.

Wait, folded it _where_?  And how did it end up pushed to -next in the
first place?

<checks vfs/vfs.git>

FWIW, I do have a problem with your variant.  My preferred incremental would be
simply to deal with the close_on_exec bitmap between updating open_fds and full_fds_bits:

diff --git a/fs/file.c b/fs/file.c
index 70cd6d8c6257..7251d215048d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -249,10 +249,10 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
 {
 	__set_bit(fd, fdt->open_fds);
+	__set_close_on_exec(fd, fdt, set);
 	fd /= BITS_PER_LONG;
 	if (!~fdt->open_fds[fd])
 		__set_bit(fd, fdt->full_fds_bits);
-	__set_close_on_exec(fd, fdt, set);
 }
 
 static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
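
For completeness, the same kind of user-space model with this ordering
(again not the kernel code, just an illustration, and only handling the
set case): the close-on-exec bit is set while fd is still the descriptor
number, and only afterwards is fd reused as a word index for the
full_fds_bits update.

#include <limits.h>
#include <stdio.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(long))

static unsigned long open_fds[8], close_on_exec[8], full_fds_bits[1];

static void set_bit_in(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static int test_bit_in(unsigned int nr, const unsigned long *map)
{
	return !!(map[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG)));
}

/* models the reordered helper: cloexec handled before fd is reused */
static void set_open_fd(unsigned int fd, int cloexec)
{
	set_bit_in(fd, open_fds);
	if (cloexec)
		set_bit_in(fd, close_on_exec);	/* fd is still the descriptor */
	fd /= BITS_PER_LONG;			/* from here on, a word index */
	if (!~open_fds[fd])
		set_bit_in(fd, full_fds_bits);
}

int main(void)
{
	set_open_fd(130, 1);
	printf("fd 130 cloexec: %d\n", test_bit_in(130, close_on_exec));
	return 0;	/* prints 1: the intended descriptor is marked */
}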

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
  2024-10-09 15:24           ` Al Viro
@ 2024-10-09 15:36             ` Al Viro
  2024-10-10  8:13             ` Christian Brauner
  1 sibling, 0 replies; 33+ messages in thread
From: Al Viro @ 2024-10-09 15:36 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Steven Price, linux-fsdevel, torvalds

On Wed, Oct 09, 2024 at 04:24:11PM +0100, Al Viro wrote:

> Wait, folded it _where_?  And how did it end up pushed to -next in the
> first place?
> 
> <checks vfs/vfs.git>

Preferred variant force-pushed to #work.fdtable, and pushed to vfs/vfs.git#work.fdtable
as well.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/11] make __set_open_fd() set cloexec state as well
  2024-10-09 15:24           ` Al Viro
  2024-10-09 15:36             ` Al Viro
@ 2024-10-10  8:13             ` Christian Brauner
  1 sibling, 0 replies; 33+ messages in thread
From: Christian Brauner @ 2024-10-10  8:13 UTC (permalink / raw)
  To: Al Viro; +Cc: Steven Price, linux-fsdevel, torvalds

On Wed, Oct 09, 2024 at 04:24:11PM +0100, Al Viro wrote:
> On Wed, Oct 09, 2024 at 04:19:47PM +0200, Christian Brauner wrote:
> > On Wed, Oct 09, 2024 at 11:50:36AM GMT, Steven Price wrote:
> > > On 07/10/2024 18:43, Al Viro wrote:
> > > > ->close_on_exec[] state is maintained only for opened descriptors;
> > > > as the result, anything that marks a descriptor opened has to
> > > > set its cloexec state explicitly.
> > > > 
> > > > As the result, all calls of __set_open_fd() are followed by
> > > > __set_close_on_exec(); might as well fold it into __set_open_fd()
> > > > so that cloexec state is defined as soon as the descriptor is
> > > > marked opened.
> > > > 
> > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > > > ---
> > > >  fs/file.c | 9 ++++-----
> > > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index d8fccd4796a9..b63294ed85ec 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -248,12 +248,13 @@ static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt,
> > > >  	}
> > > >  }
> > > >  
> > > > -static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> > > > +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt, bool set)
> > > >  {
> > > >  	__set_bit(fd, fdt->open_fds);
> > > >  	fd /= BITS_PER_LONG;
> > > 
> > > Here fd is being modified...
> > > 
> > > >  	if (!~fdt->open_fds[fd])
> > > >  		__set_bit(fd, fdt->full_fds_bits);
> > > > +	__set_close_on_exec(fd, fdt, set);
> > > 
> > > ... which means fd here isn't the same as the passed in value. So this
> > > call to __set_close_on_exec affects a different fd to the expected one.
> 
> ACK.
> 
> > Good spot. Al, I folded the below fix so from my end you don't have to
> > do anything unless you want to nitpick how to fix it. Local variable
> > looked ugly to me.
> 
> Wait, folded it _where_?  And how did it end up pushed to -next in the
> first place?
> 
> <checks vfs/vfs.git>

Did you just not read:

https://lore.kernel.org/r/20241008-baufinanzierung-lakai-00b02ba0ac19@brauner

by any chance?

^ permalink raw reply	[flat|nested] 33+ messages in thread

Thread overview: 33+ messages
2024-08-22  0:20 [PATCHES] fdtable series v2 Al Viro
2024-08-22  0:22 ` [PATCH 01/12] close_range(): fix the logics in descriptor table trimming Al Viro
2024-08-22  0:22   ` [PATCH 02/12] get rid of ...lookup...fdget_rcu() family Al Viro
2024-08-22  0:22   ` [PATCH 03/12] remove pointless includes of <linux/fdtable.h> Al Viro
2024-08-22  0:22   ` [PATCH 04/12] close_files(): don't bother with xchg() Al Viro
2024-08-22  0:22   ` [PATCH 05/12] move close_range(2) into fs/file.c, fold __close_range() into it Al Viro
2024-08-22  0:22   ` [PATCH 06/12] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Al Viro
2024-08-22  0:22   ` [PATCH 07/12] fs/file.c: conditionally clear full_fds Al Viro
2024-08-22  0:22   ` [PATCH 08/12] fs/file.c: add fast path in find_next_fd() Al Viro
2024-08-22  0:22   ` [PATCH 09/12] alloc_fdtable(): change calling conventions Al Viro
2024-08-22  0:22   ` [PATCH 10/12] file.c: merge __{set,clear}_close_on_exec() Al Viro
2024-08-22  0:22   ` [PATCH 11/12] make __set_open_fd() set cloexec state as well Al Viro
2024-08-22  0:22   ` [PATCH 12/12] expand_files(): simplify calling conventions Al Viro
2024-10-07 17:39 ` [PATCHES] fdtable series v3 Al Viro
2024-10-07 17:43   ` [PATCH v3 01/11] get rid of ...lookup...fdget_rcu() family Al Viro
2024-10-07 17:43     ` [PATCH v3 02/11] remove pointless includes of <linux/fdtable.h> Al Viro
2024-10-07 17:43     ` [PATCH v3 03/11] close_files(): don't bother with xchg() Al Viro
2024-10-07 17:43     ` [PATCH v3 04/11] move close_range(2) into fs/file.c, fold __close_range() into it Al Viro
2024-10-07 17:43     ` [PATCH v3 05/11] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Al Viro
2024-10-07 17:43     ` [PATCH v3 06/11] fs/file.c: conditionally clear full_fds Al Viro
2024-10-07 17:43     ` [PATCH v3 07/11] fs/file.c: add fast path in find_next_fd() Al Viro
2024-10-07 17:43     ` [PATCH v3 08/11] alloc_fdtable(): change calling conventions Al Viro
2024-10-07 17:43     ` [PATCH v3 09/11] file.c: merge __{set,clear}_close_on_exec() Al Viro
2024-10-07 17:43     ` [PATCH v3 10/11] make __set_open_fd() set cloexec state as well Al Viro
2024-10-09 10:50       ` Steven Price
2024-10-09 14:19         ` Christian Brauner
2024-10-09 15:24           ` Al Viro
2024-10-09 15:36             ` Al Viro
2024-10-10  8:13             ` Christian Brauner
     [not found]       ` <CGME20241009114254eucas1p2c174307fe53b3d5563795cc8eb92b91d@eucas1p2.samsung.com>
2024-10-09 11:42         ` Marek Szyprowski
2024-10-07 17:43     ` [PATCH v3 11/11] expand_files(): simplify calling conventions Al Viro
2024-10-07 18:12   ` [PATCHES] fdtable series v3 Linus Torvalds
2024-10-08 11:15   ` Christian Brauner
