public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd
@ 2025-03-04 18:35 Mateusz Guzik
  2025-03-04 18:35 ` [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use " Mateusz Guzik
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-04 18:35 UTC (permalink / raw)
  To: brauner, viro; +Cc: jack, linux-kernel, linux-fsdevel, Mateusz Guzik

The stock kernel transitioning the file to no refs held penalizes the
caller with an extra atomic to block any increments.

For cases where the file is highly likely to be going away this is
easily avoidable.

In the open+close case the win is very modest because of the following
problems:
- kmem and memcg having terrible performance
- putname using an atomic (I have a wip to whack that)
- open performing an extra ref/unref on the dentry (there are patches to
  do it, including by Al. I mailed about them in [1])
- creds using atomics (I have a wip to whack that)
- apparmor using atomics (ditto, same mechanism)

On top of that I have a WIP patch to dodge some of the work at lookup
itself.

All in all there is several % avoidably lost here.

stats colected during a kernel build with:
bpftrace -e 'kprobe:filp_close,kprobe:fput,kprobe:fput_close* { @[probe] = hist(((struct file *)arg0)->f_ref.refcnt.counter > 0); }'

@[kprobe:filp_close]:
[0]                32195 |@@@@@@@@@@                                          |
[1]               164567 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@[kprobe:fput]:
[0]               339240 |@@@@@@                                              |
[1]              2888064 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@[kprobe:fput_close]:
[0]              5116767 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]               164544 |@                                                   |

@[kprobe:fput_close_sync]:
[0]              5340660 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]               358943 |@@@                                                 |


0 indicates the last reference, 1 that there is more.

filp_close is largely skewed because of close_on_exec.

vast majority of last fputs are from remove_vma. I think that code wants
to be patched to batch them (as in something like fput_many should be
added -- something for later).

[1] https://lore.kernel.org/linux-fsdevel/20250304165728.491785-1-mjguzik@gmail.com/T/#u

v2:
- patch filp_close
- patch failing open

Mateusz Guzik (4):
  file: add fput and file_ref_put routines optimized for use when
    closing a fd
  fs: use fput_close_sync() in close()
  fs: use fput_close() in filp_close()
  fs: use fput_close() in path_openat()

 fs/file.c                | 75 ++++++++++++++++++++++++++++++----------
 fs/file_table.c          | 72 +++++++++++++++++++++++++++-----------
 fs/namei.c               |  2 +-
 fs/open.c                |  4 +--
 include/linux/file.h     |  2 ++
 include/linux/file_ref.h |  1 +
 6 files changed, 114 insertions(+), 42 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use when closing a fd
  2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
@ 2025-03-04 18:35 ` Mateusz Guzik
  2025-03-04 18:48   ` Mateusz Guzik
  2025-03-05 10:46   ` Christian Brauner
  2025-03-04 18:35 ` [PATCH v2 2/4] fs: use fput_close_sync() in close() Mateusz Guzik
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-04 18:35 UTC (permalink / raw)
  To: brauner, viro; +Cc: jack, linux-kernel, linux-fsdevel, Mateusz Guzik

Vast majority of the time closing a file descriptor also operates on the
last reference, where a regular fput usage will result in 2 atomics.
This can be changed to only suffer 1.

See commentary above file_ref_put_close() for more information.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/file.c                | 75 ++++++++++++++++++++++++++++++----------
 fs/file_table.c          | 72 +++++++++++++++++++++++++++-----------
 include/linux/file.h     |  2 ++
 include/linux/file_ref.h |  1 +
 4 files changed, 111 insertions(+), 39 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 44efdc8c1e27..ea753f9c8e08 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -26,6 +26,28 @@
 
 #include "internal.h"
 
+static bool __file_ref_put_badval(file_ref_t *ref, unsigned long cnt)
+{
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
+		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
+		return false;
+	}
+
+	/*
+	 * This is a put() operation on a saturated refcount. Restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	if (cnt > FILE_REF_MAXREF)
+		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
+	return false;
+}
+
 /**
  * __file_ref_put - Slowpath of file_ref_put()
  * @ref:	Pointer to the reference count
@@ -67,27 +89,44 @@ bool __file_ref_put(file_ref_t *ref, unsigned long cnt)
 		return true;
 	}
 
-	/*
-	 * If the reference count was already in the dead zone, then this
-	 * put() operation is imbalanced. Warn, put the reference count back to
-	 * DEAD and tell the caller to not deconstruct the object.
-	 */
-	if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
-		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
-		return false;
-	}
-
-	/*
-	 * This is a put() operation on a saturated refcount. Restore the
-	 * mean saturation value and tell the caller to not deconstruct the
-	 * object.
-	 */
-	if (cnt > FILE_REF_MAXREF)
-		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
-	return false;
+	return __file_ref_put_badval(ref, cnt);
 }
 EXPORT_SYMBOL_GPL(__file_ref_put);
 
+/**
+ * file_ref_put_close - drop a reference expecting it would transition to FILE_REF_NOREF
+ * @ref:	Pointer to the reference count
+ *
+ * Semantically it is equivalent to calling file_ref_put(), but it trades lower
+ * performance in face of other CPUs also modifying the refcount for higher
+ * performance when this happens to be the last reference.
+ *
+ * For the last reference file_ref_put() issues 2 atomics. One to drop the
+ * reference and another to transition it to FILE_REF_DEAD. This routine does
+ * the work in one step, but in order to do it has to pre-read the variable which
+ * decreases scalability.
+ *
+ * Use with close() et al, stick to file_ref_put() by default.
+ */
+bool file_ref_put_close(file_ref_t *ref)
+{
+	long old, new;
+
+	old = atomic_long_read(&ref->refcnt);
+	do {
+		if (unlikely(old < 0))
+			return __file_ref_put_badval(ref, old);
+
+		if (old == FILE_REF_ONEREF)
+			new = FILE_REF_DEAD;
+		else
+			new = old - 1;
+	} while (!atomic_long_try_cmpxchg(&ref->refcnt, &old, new));
+
+	return new == FILE_REF_DEAD;
+}
+EXPORT_SYMBOL_GPL(file_ref_put_close);
+
 unsigned int sysctl_nr_open __read_mostly = 1024*1024;
 unsigned int sysctl_nr_open_min = BITS_PER_LONG;
 /* our min() is unusable in constant expressions ;-/ */
diff --git a/fs/file_table.c b/fs/file_table.c
index 5c00dc38558d..4189c682eb06 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -511,31 +511,37 @@ void flush_delayed_fput(void)
 }
 EXPORT_SYMBOL_GPL(flush_delayed_fput);
 
-void fput(struct file *file)
+static void __fput_defer_free(struct file *file)
 {
-	if (file_ref_put(&file->f_ref)) {
-		struct task_struct *task = current;
+	struct task_struct *task = current;
 
-		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
-			file_free(file);
+	if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
+		file_free(file);
+		return;
+	}
+	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		init_task_work(&file->f_task_work, ____fput);
+		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
 			return;
-		}
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
-			init_task_work(&file->f_task_work, ____fput);
-			if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
-				return;
-			/*
-			 * After this task has run exit_task_work(),
-			 * task_work_add() will fail.  Fall through to delayed
-			 * fput to avoid leaking *file.
-			 */
-		}
-
-		if (llist_add(&file->f_llist, &delayed_fput_list))
-			schedule_delayed_work(&delayed_fput_work, 1);
+		/*
+		 * After this task has run exit_task_work(),
+		 * task_work_add() will fail.  Fall through to delayed
+		 * fput to avoid leaking *file.
+		 */
 	}
+
+	if (llist_add(&file->f_llist, &delayed_fput_list))
+		schedule_delayed_work(&delayed_fput_work, 1);
 }
 
+void fput(struct file *file)
+{
+	if (unlikely(file_ref_put(&file->f_ref))) {
+		__fput_defer_free(file);
+	}
+}
+EXPORT_SYMBOL(fput);
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
@@ -549,10 +555,34 @@ void __fput_sync(struct file *file)
 	if (file_ref_put(&file->f_ref))
 		__fput(file);
 }
-
-EXPORT_SYMBOL(fput);
 EXPORT_SYMBOL(__fput_sync);
 
+/*
+ * Equivalent to __fput_sync(), but optimized for being called with the last
+ * reference.
+ *
+ * See file_ref_put_close() for details.
+ */
+void fput_close_sync(struct file *file)
+{
+	if (unlikely(file_ref_put_close(&file->f_ref)))
+		__fput(file);
+}
+EXPORT_SYMBOL(fput_close_sync);
+
+/*
+ * Equivalent to fput(), but optimized for being called with the last
+ * reference.
+ *
+ * See file_ref_put_close() for details.
+ */
+void fput_close(struct file *file)
+{
+	if (file_ref_put_close(&file->f_ref))
+		__fput_defer_free(file);
+}
+EXPORT_SYMBOL(fput_close);
+
 void __init files_init(void)
 {
 	struct kmem_cache_args args = {
diff --git a/include/linux/file.h b/include/linux/file.h
index 302f11355b10..7b04e87cbde6 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -124,6 +124,8 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
 
 extern void flush_delayed_fput(void);
 extern void __fput_sync(struct file *);
+void fput_close_sync(struct file *);
+void fput_close(struct file *);
 
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 
diff --git a/include/linux/file_ref.h b/include/linux/file_ref.h
index 9b3a8d9b17ab..f269299941aa 100644
--- a/include/linux/file_ref.h
+++ b/include/linux/file_ref.h
@@ -62,6 +62,7 @@ static inline void file_ref_init(file_ref_t *ref, unsigned long cnt)
 }
 
 bool __file_ref_put(file_ref_t *ref, unsigned long cnt);
+bool file_ref_put_close(file_ref_t *ref);
 
 /**
  * file_ref_get - Acquire one reference on a file
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/4] fs: use fput_close_sync() in close()
  2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
  2025-03-04 18:35 ` [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use " Mateusz Guzik
@ 2025-03-04 18:35 ` Mateusz Guzik
  2025-03-04 18:35 ` [PATCH v2 3/4] fs: use fput_close() in filp_close() Mateusz Guzik
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-04 18:35 UTC (permalink / raw)
  To: brauner, viro; +Cc: jack, linux-kernel, linux-fsdevel, Mateusz Guzik

This bumps open+close rate by 1% on Sapphire Rapids by eliding one
atomic.

It would be higher if it was not for several other slowdowns of the same
nature.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/open.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 57e088c01ea4..fc1c6118eb30 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1615,7 +1615,7 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	 * We're returning to user space. Don't bother
 	 * with any delayed fput() cases.
 	 */
-	__fput_sync(file);
+	fput_close_sync(file);
 
 	if (likely(retval == 0))
 		return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/4] fs: use fput_close() in filp_close()
  2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
  2025-03-04 18:35 ` [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use " Mateusz Guzik
  2025-03-04 18:35 ` [PATCH v2 2/4] fs: use fput_close_sync() in close() Mateusz Guzik
@ 2025-03-04 18:35 ` Mateusz Guzik
  2025-03-04 18:35 ` [PATCH v2 4/4] fs: use fput_close() in path_openat() Mateusz Guzik
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-04 18:35 UTC (permalink / raw)
  To: brauner, viro; +Cc: jack, linux-kernel, linux-fsdevel, Mateusz Guzik

When tracing a kernel build over refcounts seen this is a wash:
@[kprobe:filp_close]:
[0]                32195 |@@@@@@@@@@                                          |
[1]               164567 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

I verified vast majority of the skew comes from do_close_on_exec() which
could be changed to use a different variant instead.

Even without changing that, the 19.5% of calls which got here still can
save the extra atomic. Calls here are borderline non-existent compared
to fput (over 3.2 mln!), so they should not negatively affect
scalability.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/open.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index fc1c6118eb30..ffb7e2e5e1ef 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1589,7 +1589,7 @@ int filp_close(struct file *filp, fl_owner_t id)
 	int retval;
 
 	retval = filp_flush(filp, id);
-	fput(filp);
+	fput_close(filp);
 
 	return retval;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/4] fs: use fput_close() in path_openat()
  2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
                   ` (2 preceding siblings ...)
  2025-03-04 18:35 ` [PATCH v2 3/4] fs: use fput_close() in filp_close() Mateusz Guzik
@ 2025-03-04 18:35 ` Mateusz Guzik
  2025-03-05  2:19 ` [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd David Laight
  2025-03-05 10:38 ` Christian Brauner
  5 siblings, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-04 18:35 UTC (permalink / raw)
  To: brauner, viro; +Cc: jack, linux-kernel, linux-fsdevel, Mateusz Guzik

This bumps failing open rate by 1.7% on Sapphire Rapids by avoiding one
atomic.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8ce8e6038346..d38b8dc16b4b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4014,7 +4014,7 @@ static struct file *path_openat(struct nameidata *nd,
 		WARN_ON(1);
 		error = -EINVAL;
 	}
-	fput(file);
+	fput_close(file);
 	if (error == -EOPENSTALE) {
 		if (flags & LOOKUP_RCU)
 			error = -ECHILD;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use when closing a fd
  2025-03-04 18:35 ` [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use " Mateusz Guzik
@ 2025-03-04 18:48   ` Mateusz Guzik
  2025-03-05 10:46   ` Christian Brauner
  1 sibling, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-04 18:48 UTC (permalink / raw)
  To: brauner, viro; +Cc: jack, linux-kernel, linux-fsdevel

On Tue, Mar 4, 2025 at 7:35 PM Mateusz Guzik <mjguzik@gmail.com> wrote:

> +void fput_close_sync(struct file *file)
> +{
> +       if (unlikely(file_ref_put_close(&file->f_ref)))
> +               __fput(file);
> +}

this of course should be: *likely*

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd
  2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
                   ` (3 preceding siblings ...)
  2025-03-04 18:35 ` [PATCH v2 4/4] fs: use fput_close() in path_openat() Mateusz Guzik
@ 2025-03-05  2:19 ` David Laight
  2025-03-05 12:43   ` Mateusz Guzik
  2025-03-05 10:38 ` Christian Brauner
  5 siblings, 1 reply; 12+ messages in thread
From: David Laight @ 2025-03-05  2:19 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: brauner, viro, jack, linux-kernel, linux-fsdevel

On Tue,  4 Mar 2025 19:35:02 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> The stock kernel transitioning the file to no refs held penalizes the
> caller with an extra atomic to block any increments.
> 
> For cases where the file is highly likely to be going away this is
> easily avoidable.

Have you looked at the problem caused by epoll() ?
The epoll code has a 'hidden' extra reference to the fd.
This doesn't usualy matter, but some of the driver callbacks add and
remove an extra reference - which doesn't work well if fput() has
just decremented it to zero.

The fput code might need to do a 'decrement not one' so that the
epoll tidyup can be done while the refcount is still one.

That would save the extra atomic pair that (IIRC) got added into
the epoll callback code.

Thoughts?

	David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd
  2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
                   ` (4 preceding siblings ...)
  2025-03-05  2:19 ` [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd David Laight
@ 2025-03-05 10:38 ` Christian Brauner
  2025-03-05 12:42   ` Mateusz Guzik
  5 siblings, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2025-03-05 10:38 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: viro, jack, linux-kernel, linux-fsdevel

On Tue, Mar 04, 2025 at 07:35:02PM +0100, Mateusz Guzik wrote:
> The stock kernel transitioning the file to no refs held penalizes the
> caller with an extra atomic to block any increments.
> 
> For cases where the file is highly likely to be going away this is
> easily avoidable.
> 
> In the open+close case the win is very modest because of the following
> problems:
> - kmem and memcg having terrible performance

I thought that was going to be addressed by Vlastimil, i.e., the mm guys
to provide a new memcg api.

> - putname using an atomic (I have a wip to whack that)
> - open performing an extra ref/unref on the dentry (there are patches to
>   do it, including by Al. I mailed about them in [1])
> - creds using atomics (I have a wip to whack that)
> - apparmor using atomics (ditto, same mechanism)
> 
> On top of that I have a WIP patch to dodge some of the work at lookup
> itself.
> 
> All in all there is several % avoidably lost here.
> 
> stats colected during a kernel build with:
> bpftrace -e 'kprobe:filp_close,kprobe:fput,kprobe:fput_close* { @[probe] = hist(((struct file *)arg0)->f_ref.refcnt.counter > 0); }'
> 
> @[kprobe:filp_close]:
> [0]                32195 |@@@@@@@@@@                                          |
> [1]               164567 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @[kprobe:fput]:
> [0]               339240 |@@@@@@                                              |
> [1]              2888064 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @[kprobe:fput_close]:
> [0]              5116767 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1]               164544 |@                                                   |
> 
> @[kprobe:fput_close_sync]:
> [0]              5340660 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1]               358943 |@@@                                                 |
> 
> 
> 0 indicates the last reference, 1 that there is more.
> 
> filp_close is largely skewed because of close_on_exec.
> 
> vast majority of last fputs are from remove_vma. I think that code wants
> to be patched to batch them (as in something like fput_many should be
> added -- something for later).

We used to have that for io_uring and got rid of it. The less fput()
primitives the better tbh. But let's see.

> 
> [1] https://lore.kernel.org/linux-fsdevel/20250304165728.491785-1-mjguzik@gmail.com/T/#u
> 
> v2:
> - patch filp_close
> - patch failing open
> 
> Mateusz Guzik (4):
>   file: add fput and file_ref_put routines optimized for use when
>     closing a fd
>   fs: use fput_close_sync() in close()
>   fs: use fput_close() in filp_close()
>   fs: use fput_close() in path_openat()
> 
>  fs/file.c                | 75 ++++++++++++++++++++++++++++++----------
>  fs/file_table.c          | 72 +++++++++++++++++++++++++++-----------
>  fs/namei.c               |  2 +-
>  fs/open.c                |  4 +--
>  include/linux/file.h     |  2 ++
>  include/linux/file_ref.h |  1 +
>  6 files changed, 114 insertions(+), 42 deletions(-)
> 
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use when closing a fd
  2025-03-04 18:35 ` [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use " Mateusz Guzik
  2025-03-04 18:48   ` Mateusz Guzik
@ 2025-03-05 10:46   ` Christian Brauner
  2025-03-05 12:40     ` Mateusz Guzik
  1 sibling, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2025-03-05 10:46 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: viro, jack, linux-kernel, linux-fsdevel

On Tue, Mar 04, 2025 at 07:35:03PM +0100, Mateusz Guzik wrote:
> Vast majority of the time closing a file descriptor also operates on the
> last reference, where a regular fput usage will result in 2 atomics.
> This can be changed to only suffer 1.
> 
> See commentary above file_ref_put_close() for more information.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---

I'm not enthused about the patches tbh because we end up with two new
primivites and I really dislike so many new primitives with slightly
different semantics. But it should at least all be kept private to fs/.

>  fs/file.c                | 75 ++++++++++++++++++++++++++++++----------
>  fs/file_table.c          | 72 +++++++++++++++++++++++++++-----------
>  include/linux/file.h     |  2 ++
>  include/linux/file_ref.h |  1 +
>  4 files changed, 111 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 44efdc8c1e27..ea753f9c8e08 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -26,6 +26,28 @@
>  
>  #include "internal.h"
>  
> +static bool __file_ref_put_badval(file_ref_t *ref, unsigned long cnt)
> +{
> +	/*
> +	 * If the reference count was already in the dead zone, then this
> +	 * put() operation is imbalanced. Warn, put the reference count back to
> +	 * DEAD and tell the caller to not deconstruct the object.
> +	 */
> +	if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
> +		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
> +		return false;
> +	}
> +
> +	/*
> +	 * This is a put() operation on a saturated refcount. Restore the
> +	 * mean saturation value and tell the caller to not deconstruct the
> +	 * object.
> +	 */
> +	if (cnt > FILE_REF_MAXREF)
> +		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
> +	return false;
> +}
> +
>  /**
>   * __file_ref_put - Slowpath of file_ref_put()
>   * @ref:	Pointer to the reference count
> @@ -67,27 +89,44 @@ bool __file_ref_put(file_ref_t *ref, unsigned long cnt)
>  		return true;
>  	}
>  
> -	/*
> -	 * If the reference count was already in the dead zone, then this
> -	 * put() operation is imbalanced. Warn, put the reference count back to
> -	 * DEAD and tell the caller to not deconstruct the object.
> -	 */
> -	if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
> -		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
> -		return false;
> -	}
> -
> -	/*
> -	 * This is a put() operation on a saturated refcount. Restore the
> -	 * mean saturation value and tell the caller to not deconstruct the
> -	 * object.
> -	 */
> -	if (cnt > FILE_REF_MAXREF)
> -		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
> -	return false;
> +	return __file_ref_put_badval(ref, cnt);
>  }
>  EXPORT_SYMBOL_GPL(__file_ref_put);
>  
> +/**
> + * file_ref_put_close - drop a reference expecting it would transition to FILE_REF_NOREF
> + * @ref:	Pointer to the reference count
> + *
> + * Semantically it is equivalent to calling file_ref_put(), but it trades lower
> + * performance in face of other CPUs also modifying the refcount for higher
> + * performance when this happens to be the last reference.
> + *
> + * For the last reference file_ref_put() issues 2 atomics. One to drop the
> + * reference and another to transition it to FILE_REF_DEAD. This routine does
> + * the work in one step, but in order to do it has to pre-read the variable which
> + * decreases scalability.
> + *
> + * Use with close() et al, stick to file_ref_put() by default.
> + */
> +bool file_ref_put_close(file_ref_t *ref)
> +{
> +	long old, new;
> +
> +	old = atomic_long_read(&ref->refcnt);
> +	do {
> +		if (unlikely(old < 0))
> +			return __file_ref_put_badval(ref, old);
> +
> +		if (old == FILE_REF_ONEREF)
> +			new = FILE_REF_DEAD;
> +		else
> +			new = old - 1;
> +	} while (!atomic_long_try_cmpxchg(&ref->refcnt, &old, new));
> +
> +	return new == FILE_REF_DEAD;
> +}
> +EXPORT_SYMBOL_GPL(file_ref_put_close);
> +
>  unsigned int sysctl_nr_open __read_mostly = 1024*1024;
>  unsigned int sysctl_nr_open_min = BITS_PER_LONG;
>  /* our min() is unusable in constant expressions ;-/ */
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 5c00dc38558d..4189c682eb06 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -511,31 +511,37 @@ void flush_delayed_fput(void)
>  }
>  EXPORT_SYMBOL_GPL(flush_delayed_fput);
>  
> -void fput(struct file *file)
> +static void __fput_defer_free(struct file *file)

Imho just call it __fput_deferred().

>  {
> -	if (file_ref_put(&file->f_ref)) {
> -		struct task_struct *task = current;
> +	struct task_struct *task = current;
>  
> -		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
> -			file_free(file);
> +	if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
> +		file_free(file);
> +		return;
> +	}
> +	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +		init_task_work(&file->f_task_work, ____fput);
> +		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>  			return;
> -		}
> -		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> -			init_task_work(&file->f_task_work, ____fput);
> -			if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
> -				return;
> -			/*
> -			 * After this task has run exit_task_work(),
> -			 * task_work_add() will fail.  Fall through to delayed
> -			 * fput to avoid leaking *file.
> -			 */
> -		}
> -
> -		if (llist_add(&file->f_llist, &delayed_fput_list))
> -			schedule_delayed_work(&delayed_fput_work, 1);
> +		/*
> +		 * After this task has run exit_task_work(),
> +		 * task_work_add() will fail.  Fall through to delayed
> +		 * fput to avoid leaking *file.
> +		 */
>  	}
> +
> +	if (llist_add(&file->f_llist, &delayed_fput_list))
> +		schedule_delayed_work(&delayed_fput_work, 1);
>  }
>  
> +void fput(struct file *file)
> +{
> +	if (unlikely(file_ref_put(&file->f_ref))) {
> +		__fput_defer_free(file);
> +	}
> +}
> +EXPORT_SYMBOL(fput);
> +
>  /*
>   * synchronous analog of fput(); for kernel threads that might be needed
>   * in some umount() (and thus can't use flush_delayed_fput() without
> @@ -549,10 +555,34 @@ void __fput_sync(struct file *file)
>  	if (file_ref_put(&file->f_ref))
>  		__fput(file);
>  }
> -
> -EXPORT_SYMBOL(fput);
>  EXPORT_SYMBOL(__fput_sync);
>  
> +/*
> + * Equivalent to __fput_sync(), but optimized for being called with the last
> + * reference.
> + *
> + * See file_ref_put_close() for details.
> + */
> +void fput_close_sync(struct file *file)
> +{
> +	if (unlikely(file_ref_put_close(&file->f_ref)))
> +		__fput(file);
> +}
> +EXPORT_SYMBOL(fput_close_sync);

Shouldn't be exported to modules, please.

> +
> +/*
> + * Equivalent to fput(), but optimized for being called with the last
> + * reference.
> + *
> + * See file_ref_put_close() for details.
> + */
> +void fput_close(struct file *file)
> +{
> +	if (file_ref_put_close(&file->f_ref))
> +		__fput_defer_free(file);
> +}
> +EXPORT_SYMBOL(fput_close);

Shouldn't be exported to modules, please.

> +
>  void __init files_init(void)
>  {
>  	struct kmem_cache_args args = {
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 302f11355b10..7b04e87cbde6 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -124,6 +124,8 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
>  
>  extern void flush_delayed_fput(void);
>  extern void __fput_sync(struct file *);
> +void fput_close_sync(struct file *);
> +void fput_close(struct file *);

Should go into internal.h.

>  
>  extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
>  
> diff --git a/include/linux/file_ref.h b/include/linux/file_ref.h
> index 9b3a8d9b17ab..f269299941aa 100644
> --- a/include/linux/file_ref.h
> +++ b/include/linux/file_ref.h
> @@ -62,6 +62,7 @@ static inline void file_ref_init(file_ref_t *ref, unsigned long cnt)
>  }
>  
>  bool __file_ref_put(file_ref_t *ref, unsigned long cnt);
> +bool file_ref_put_close(file_ref_t *ref);
>  
>  /**
>   * file_ref_get - Acquire one reference on a file
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use when closing a fd
  2025-03-05 10:46   ` Christian Brauner
@ 2025-03-05 12:40     ` Mateusz Guzik
  0 siblings, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-05 12:40 UTC (permalink / raw)
  To: Christian Brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel

On Wed, Mar 5, 2025 at 11:47 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Mar 04, 2025 at 07:35:03PM +0100, Mateusz Guzik wrote:
> > Vast majority of the time closing a file descriptor also operates on the
> > last reference, where a regular fput usage will result in 2 atomics.
> > This can be changed to only suffer 1.
> >
> > See commentary above file_ref_put_close() for more information.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
>
> I'm not enthused about the patches tbh because we end up with two new
> primivites and I really dislike so many new primitives with slightly
> different semantics. But it should at least all be kept private to fs/.
>

Caller-observable behavior is the same, so I don't think this is a big
deal. The intent was to keep this very much for vfs-internal usage.
The EXPORT thing was copy-pasto.

Anyhow, everyone interested can still just fput so this does not put
any burden on other people afaics. The few patched spots are very much
vfs-internal.

I sent v3 with the feedback addressed.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd
  2025-03-05 10:38 ` Christian Brauner
@ 2025-03-05 12:42   ` Mateusz Guzik
  0 siblings, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-05 12:42 UTC (permalink / raw)
  To: Christian Brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel

On Wed, Mar 5, 2025 at 11:38 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Mar 04, 2025 at 07:35:02PM +0100, Mateusz Guzik wrote:
> > The stock kernel transitioning the file to no refs held penalizes the
> > caller with an extra atomic to block any increments.
> >
> > For cases where the file is highly likely to be going away this is
> > easily avoidable.
> >
> > In the open+close case the win is very modest because of the following
> > problems:
> > - kmem and memcg having terrible performance
>
> I thought that was going to be addressed by Vlastimil, i.e., the mm guys
> to provide a new memcg api.

afaics this is for kmem only, no memcg support.

There is an old patch which sped up memcg, but got reverted and it is
on the table to bring it back, but Vlastimil is dragging his foot.

> > vast majority of last fputs are from remove_vma. I think that code wants
> > to be patched to batch them (as in something like fput_many should be
> > added -- something for later).
>
> We used to have that for io_uring and got rid of it. The less fput()
> primitives the better tbh. But let's see.
>

I would say a _many (or whatever) variant is pretty idiomatic, but I'm
not going to insist. This is a side remark to the patchset.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd
  2025-03-05  2:19 ` [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd David Laight
@ 2025-03-05 12:43   ` Mateusz Guzik
  0 siblings, 0 replies; 12+ messages in thread
From: Mateusz Guzik @ 2025-03-05 12:43 UTC (permalink / raw)
  To: David Laight; +Cc: brauner, viro, jack, linux-kernel, linux-fsdevel

On Wed, Mar 5, 2025 at 3:19 AM David Laight
<david.laight.linux@gmail.com> wrote:
> Have you looked at the problem caused by epoll() ?
> The epoll code has a 'hidden' extra reference to the fd.
> This doesn't usualy matter, but some of the driver callbacks add and
> remove an extra reference - which doesn't work well if fput() has
> just decremented it to zero.
>
> The fput code might need to do a 'decrement not one' so that the
> epoll tidyup can be done while the refcount is still one.
>
> That would save the extra atomic pair that (IIRC) got added into
> the epoll callback code.
>
> Thoughts?

I am not aware of this problem and don't have particular interest in
looking at it either, sorry.

Good thing you are free to make the case to Christian. :)
-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-03-05 12:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-04 18:35 [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd Mateusz Guzik
2025-03-04 18:35 ` [PATCH v2 1/4] file: add fput and file_ref_put routines optimized for use " Mateusz Guzik
2025-03-04 18:48   ` Mateusz Guzik
2025-03-05 10:46   ` Christian Brauner
2025-03-05 12:40     ` Mateusz Guzik
2025-03-04 18:35 ` [PATCH v2 2/4] fs: use fput_close_sync() in close() Mateusz Guzik
2025-03-04 18:35 ` [PATCH v2 3/4] fs: use fput_close() in filp_close() Mateusz Guzik
2025-03-04 18:35 ` [PATCH v2 4/4] fs: use fput_close() in path_openat() Mateusz Guzik
2025-03-05  2:19 ` [RFC PATCH v v2 0/4] avoid the extra atomic on a ref when closing a fd David Laight
2025-03-05 12:43   ` Mateusz Guzik
2025-03-05 10:38 ` Christian Brauner
2025-03-05 12:42   ` Mateusz Guzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox