* [PATCH v2 0/3] fs: introduce file_ref_t
@ 2024-10-07 14:23 Christian Brauner
  2024-10-07 14:23 ` [PATCH v2 1/3] fs: protect backing files with rcu Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Christian Brauner @ 2024-10-07 14:23 UTC
  To: Linus Torvalds
  Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe, Christian Brauner

Because atomic_inc_not_zero() is implemented with a try_cmpxchg() loop, it
has O(N^2) behaviour under contention with N concurrent operations, and it
sits in a hot path in __fget_files_rcu().
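
To illustrate, here is a minimal sketch of such a counter (modeled on the
generic atomic fallbacks; the _sketch suffix marks it as illustrative, not
the actual kernel helper):

        static inline bool atomic_long_inc_not_zero_sketch(atomic_long_t *v)
        {
                long old = atomic_long_read(v);

                /*
                 * Every failed try_cmpxchg() re-reads the counter and
                 * retries. With N CPUs hammering the same cacheline, each
                 * successful update can force the other N-1 CPUs to retry,
                 * which is where the O(N^2) behaviour comes from.
                 */
                do {
                        if (!old)
                                return false;
                } while (!atomic_long_try_cmpxchg(v, &old, old + 1));

                return true;
        }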

The rcuref infrastructure remedies this problem by using an unconditional
increment, relying on saturation and dead zones to make this work, and by
requiring rcu protection for the data structure in question. This not only
scales better, it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier and
thus cannot rely on *_relaxed() atomic operations. They also need to be
built on atomic_long_t, as massive reference counts aren't unheard of, even
if only as the result of an attack.

As suggested by Linus, add a file-specific variant instead of making this
a generic library.

I've been testing this with will-it-scale using a multi-threaded fstat()
on the same file descriptor, on a machine that Jens gave me access to
(thank you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on workloads with 256 or more
threads, comparing a v6.12-rc1 base with and without these patches applied.

In vfs.file now.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Don't introduce a separate rcuref_long_t library; just implement it for
  files for now.
- Retain memory barrier by using atomic_long_add_negative() instead of
  atomic_long_add_negative_relaxed() to order the loads before and after
  the increment in __fget_files_rcu().
- Link to v1: https://lore.kernel.org/r/20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org

---
Christian Brauner (3):
      fs: protect backing files with rcu
      fs: add file_ref
      fs: port files to file_ref

 drivers/gpu/drm/i915/gt/shmem_utils.c |   2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |   2 +-
 fs/eventpoll.c                        |   2 +-
 fs/file.c                             | 120 ++++++++++++++++++++++++++++++++--
 fs/file_table.c                       |  23 +++++--
 include/linux/file_ref.h              | 116 ++++++++++++++++++++++++++++++++
 include/linux/fs.h                    |   9 +--
 7 files changed, 253 insertions(+), 21 deletions(-)
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20240927-brauner-file-rcuref-bfa4a4ba915b



* [PATCH v2 1/3] fs: protect backing files with rcu
  2024-10-07 14:23 [PATCH v2 0/3] fs: introduce file_ref_t Christian Brauner
@ 2024-10-07 14:23 ` Christian Brauner
  2024-10-07 14:23 ` [PATCH v2 2/3] fs: add file_ref Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2024-10-07 14:23 UTC
  To: Linus Torvalds
  Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe, Christian Brauner

Currently backing files are not under any form of rcu protection.
Switching to file_ref requires rcu protection and so does the
speculative vma lookup. Switch backing files to the same rcu slab type
as regular files. There should be no additional magic required as the
lifetime of a backing file is always tied to a regular file.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/file_table.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index eed5ffad9997c24e533f88285deb537ddf9429ed..4b23eb7b79dd9d4ec779f4c01ba2e902988895dc 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -40,13 +40,17 @@ static struct files_stat_struct files_stat = {
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __ro_after_init;
+static struct kmem_cache *bfilp_cachep __ro_after_init;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
 /* Container for backing file with optional user path */
 struct backing_file {
 	struct file file;
-	struct path user_path;
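+	/*
+	 * SLAB_TYPESAFE_BY_RCU needs a dedicated free pointer. It can
+	 * share storage with user_path because the path has already been
+	 * put by the time the object is freed.
+	 */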
+	union {
+		struct path user_path;
+		freeptr_t bf_freeptr;
+	};
 };
 
 static inline struct backing_file *backing_file(struct file *f)
@@ -68,7 +72,7 @@ static inline void file_free(struct file *f)
 	put_cred(f->f_cred);
 	if (unlikely(f->f_mode & FMODE_BACKING)) {
 		path_put(backing_file_user_path(f));
-		kfree(backing_file(f));
+		kmem_cache_free(bfilp_cachep, backing_file(f));
 	} else {
 		kmem_cache_free(filp_cachep, f);
 	}
@@ -267,13 +271,13 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
 	struct backing_file *ff;
 	int error;
 
-	ff = kzalloc(sizeof(struct backing_file), GFP_KERNEL);
+	ff = kmem_cache_zalloc(bfilp_cachep, GFP_KERNEL);
 	if (unlikely(!ff))
 		return ERR_PTR(-ENOMEM);
 
 	error = init_file(&ff->file, flags, cred);
 	if (unlikely(error)) {
-		kfree(ff);
+		kmem_cache_free(bfilp_cachep, ff);
 		return ERR_PTR(error);
 	}
 
@@ -529,6 +533,11 @@ void __init files_init(void)
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), &args,
 				SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 				SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
+
+	args.freeptr_offset = offsetof(struct backing_file, bf_freeptr);
+	bfilp_cachep = kmem_cache_create("bfilp", sizeof(struct backing_file),
+				&args, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
 

-- 
2.45.2



* [PATCH v2 2/3] fs: add file_ref
  2024-10-07 14:23 [PATCH v2 0/3] fs: introduce file_ref_t Christian Brauner
  2024-10-07 14:23 ` [PATCH v2 1/3] fs: protect backing files with rcu Christian Brauner
@ 2024-10-07 14:23 ` Christian Brauner
  2024-10-07 18:07   ` Linus Torvalds
  2024-10-07 14:23 ` [PATCH v2 3/3] fs: port files to file_ref Christian Brauner
  2024-10-07 18:27 ` [PATCH v2 0/3] fs: introduce file_ref_t Jens Axboe
  3 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2024-10-07 14:23 UTC
  To: Linus Torvalds
  Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe, Christian Brauner

Because atomic_inc_not_zero() is implemented with a try_cmpxchg() loop, it
has O(N^2) behaviour under contention with N concurrent operations, and it
sits in a hot path in __fget_files_rcu().

The rcuref infrastructure remedies this problem by using an unconditional
increment, relying on saturation and dead zones to make this work, and by
requiring rcu protection for the data structure in question. This not only
scales better, it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier and
thus cannot rely on *_relaxed() atomic operations. They also need to be
built on atomic_long_t, as massive reference counts aren't unheard of, even
if only as the result of an attack.

As suggested by Linus, add a file-specific variant instead of making this
a generic library.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.

So when picking a file from the file descriptor table via its file
descriptor number, it is possible to see an elevated reference count on
file->f_count even though the file has already been recycled, possibly
multiple times, by another task.

To guard against this, the vfs picks the file from the file descriptor
table twice: once before the refcount increment and once after, to compare
the pointers (grossly simplified). If they match, the file is still valid.
If not, the caller needs to fput() it, as sketched below.
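
Roughly, the pattern looks like this (a simplified sketch of today's
__fget_files_rcu(); fd masking and the fdtable re-check are omitted):

        static struct file *fget_sketch(struct fdtable *fdt, unsigned int fd)
        {
                struct file *file;

                rcu_read_lock();
                for (;;) {
                        file = rcu_dereference(fdt->fd[fd]);
                        if (!file)
                                break;

                        /* Speculative grab; the slot may hold a recycled file. */
                        if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
                                continue;

                        /*
                         * Second lookup: if the pointer changed, the
                         * reference was taken on a recycled file and
                         * must be dropped again.
                         */
                        if (unlikely(rcu_dereference(fdt->fd[fd]) != file)) {
                                fput(file);
                                continue;
                        }
                        break;
                }
                rcu_read_unlock();
                return file;
        }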

The unconditional increment makes the following race possible as
illustrated by rcuref:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
>      T1                              T2
>      put()                           get()
>      // ref->refcnt = ONEREF
>      if (!atomic_add_negative(-1, &ref->refcnt))
>              return false;                           <- Not taken
>
>      // ref->refcnt == NOREF
>      --> preemption
>                                      // Elevates ref->refcnt to ONEREF
>                                      if (!atomic_add_negative(1, &ref->refcnt))
>                                              return true;                    <- taken
>
>                                      if (put(&p->ref)) { <-- Succeeds
>                                              remove_pointer(p);
>                                              kfree_rcu(p, rcu);
>                                      }
>
>              RCU grace period ends, object is freed
>
>      atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files be SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for this
deconstruction race. Afaict, the only interesting case would be someone
freeing the file and someone immediately recycling it within the same grace
period and reinitializing file->f_count to ONEREF while a concurrent fput()
is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as in the race above.

But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should
be safe from rcuref's perspective.

      T1                              T2                                                    T3
      fput()                          fget()
      // f_count->refcnt = ONEREF
      if (!atomic_add_negative(-1, &f_count->refcnt))
              return false;                           <- Not taken

      // f_count->refcnt == NOREF
      --> preemption
                                      // Elevates f_count->refcnt to ONEREF
                                      if (!atomic_add_negative(1, &f_count->refcnt))
                                              return true;                    <- taken

                                      if (put(&f_count)) { <-- Succeeds
                                              remove_pointer(p);
                                              /*
                                               * Cache is SLAB_TYPESAFE_BY_RCU
                                               * so this is freed without a grace period.
                                               */
                                              kmem_cache_free(p);
                                      }

                                                                                             kmem_cache_alloc()
                                                                                             init_file() {
                                                                                                     // Sets f_count->refcnt to ONEREF
                                                                                                     rcuref_long_init(&f->f_count, 1);
                                                                                             }

                        Object has been reused within the same grace period
                        via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.

      /*
       * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
       * it would work correctly because the atomic_cmpxchg()
       * will fail because the refcount has been reset to ONEREF by T3.
       */
      atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF

I've been testing this with will-it-scale using fstat() on a machine that
Jens gave me access to (thank you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.
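
For reference, the intended lifecycle of the API (a minimal sketch with a
hypothetical consumer; struct foo and free_foo() are made up, the real
users are wired up in the next patch):

        struct foo {
                file_ref_t ref;
        };

        static void example(struct foo *foo)
        {
                /* The creator holds the initial reference. */
                file_ref_init(&foo->ref, 1);

                /* A reader can only acquire while the object is alive. */
                if (file_ref_get(&foo->ref)) {
                        /* ... use foo ... */
                        if (file_ref_put(&foo->ref))
                                free_foo(foo);
                }

                /* Creator drops its reference; frees if it was the last. */
                if (file_ref_put(&foo->ref))
                        free_foo(foo);
        }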

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/file.c                | 106 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/file_ref.h | 116 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 222 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 5125607d040a2ff073e170d043124db5f444a90a..1b5fc867d8ddff856501ba49d8c751f888810330 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -20,10 +20,116 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/close_range.h>
+#include <linux/file_ref.h>
 #include <net/sock.h>
 
 #include "internal.h"
 
+/**
+ * __file_ref_get - Slowpath of file_ref_get()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	False if the reference count was already marked dead
+ *
+ *	True if the reference count is saturated, which prevents the
+ *	object from being deconstructed ever.
+ */
+bool __file_ref_get(file_ref_t *ref)
+{
+	unsigned long cnt;
+
+	/*
+	 * If the reference count was already marked dead, undo the
+	 * increment so it stays in the middle of the dead zone and return
+	 * fail.
+	 */
+	cnt = atomic_long_read(&ref->refcnt);
+	if (cnt >= FILE_REF_RELEASED) {
+		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
+		return false;
+	}
+
+	/*
+	 * If it was saturated, warn and mark it so. In case the increment
+	 * was already on a saturated value restore the saturation
+	 * marker. This keeps it in the middle of the saturation zone and
+	 * prevents the reference count from overflowing. This leaks the
+	 * object memory, but prevents the obvious reference count overflow
+	 * damage.
+	 */
+	if (WARN_ONCE(cnt > FILE_REF_MAXREF, "leaking memory because file reference count is saturated"))
+		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
+	return true;
+}
+EXPORT_SYMBOL_GPL(__file_ref_get);
+
+/**
+ * __file_ref_put - Slowpath of file_ref_put()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+bool __file_ref_put(file_ref_t *ref)
+{
+	unsigned long cnt;
+
+	/* Did this drop the last reference? */
+	cnt = atomic_long_read(&ref->refcnt);
+	if (likely(cnt == FILE_REF_NOREF)) {
+		/*
+		 * Carefully try to set the reference count to FILE_REF_DEAD.
+		 *
+		 * This can fail if a concurrent get() operation has
+		 * elevated it again or the corresponding put() even marked
+		 * it dead already. Both are valid situations and do not
+		 * require a retry. If this fails the caller is not
+		 * allowed to deconstruct the object.
+		 */
+		if (!atomic_long_try_cmpxchg_release(&ref->refcnt, &cnt, FILE_REF_DEAD))
+			return false;
+
+		/*
+		 * The caller can safely schedule the object for
+		 * deconstruction. Provide acquire ordering.
+		 */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
+		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
+		return false;
+	}
+
+	/*
+	 * This is a put() operation on a saturated refcount. Restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	if (cnt > FILE_REF_MAXREF)
+		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
+	return false;
+}
+EXPORT_SYMBOL_GPL(__file_ref_put);
+
 unsigned int sysctl_nr_open __read_mostly = 1024*1024;
 unsigned int sysctl_nr_open_min = BITS_PER_LONG;
 /* our min() is unusable in constant expressions ;-/ */
diff --git a/include/linux/file_ref.h b/include/linux/file_ref.h
new file mode 100644
index 0000000000000000000000000000000000000000..ff3f36eac0efffd21f6298c42102e41c1422212d
--- /dev/null
+++ b/include/linux/file_ref.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_FILE_REF_H
+#define _LINUX_FILE_REF_H
+
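+/*
+ * The count is biased by -1: FILE_REF_ONEREF is 0 and FILE_REF_NOREF is
+ * all bits set (i.e. -1), so the fastpaths can detect hitting the zero
+ * reference case with a single atomic_long_add_negative(). Everything
+ * above FILE_REF_MAXREF forms the saturation and dead zones which make
+ * the unconditional increment and decrement tolerable.
+ */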
+#ifdef CONFIG_64BIT
+#define FILE_REF_ONEREF		0x0000000000000000U
+#define FILE_REF_MAXREF		0x7FFFFFFFFFFFFFFFU
+#define FILE_REF_SATURATED	0xA000000000000000U
+#define FILE_REF_RELEASED	0xC000000000000000U
+#define FILE_REF_DEAD		0xE000000000000000U
+#define FILE_REF_NOREF		0xFFFFFFFFFFFFFFFFU
+#else
+#define FILE_REF_ONEREF		0x00000000U
+#define FILE_REF_MAXREF		0x7FFFFFFFU
+#define FILE_REF_SATURATED	0xA0000000U
+#define FILE_REF_RELEASED	0xC0000000U
+#define FILE_REF_DEAD		0xE0000000U
+#define FILE_REF_NOREF		0xFFFFFFFFU
+#endif
+
+typedef struct {
+#ifdef CONFIG_64BIT
+	atomic64_t refcnt;
+#else
+	atomic_t refcnt;
+#endif
+} file_ref_t;
+
+/**
+ * file_ref_init - Initialize a file reference count
+ * @ref: Pointer to the reference count
+ * @cnt: The initial reference count, typically '1'
+ */
+static inline void file_ref_init(file_ref_t *ref, unsigned long cnt)
+{
+	atomic_long_set(&ref->refcnt, cnt - 1);
+}
+
+bool __file_ref_get(file_ref_t *ref);
+bool __file_ref_put(file_ref_t *ref);
+
+/**
+ * file_ref_get - Acquire one reference on a file
+ * @ref: Pointer to the reference count
+ *
+ * Similar to atomic_inc_not_zero() but saturates at FILE_REF_MAXREF.
+ *
+ * Provides full memory ordering.
+ *
+ * Return: False if the attempt to acquire a reference failed. This happens
+ *         when the last reference has been put already. True if a reference
+ *         was successfully acquired
+ */
+static __always_inline __must_check bool file_ref_get(file_ref_t *ref)
+{
+	/*
+	 * Unconditionally increase the reference count with full
+	 * ordering. The saturation and dead zones provide enough
+	 * tolerance for this.
+	 */
+	if (likely(!atomic_long_add_negative(1, &ref->refcnt)))
+		return true;
+
+	/* Handle the cases inside the saturation and dead zones */
+	return __file_ref_get(ref);
+}
+
+/**
+ * file_ref_put - Release a file reference
+ * @ref:	Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores
+ * are done before, and provides an acquire ordering on success such
+ * that free() must come after.
+ *
+ * Return: True if this was the last reference with no future references
+ *         possible. This signals the caller that it can safely release
+ *         the object which is protected by the reference counter.
+ *         False if there are still active references or the put() raced
+ *         with a concurrent get()/put() pair. Caller is not allowed to
+ *         release the protected object.
+ */
+static __always_inline __must_check bool file_ref_put(file_ref_t *ref)
+{
+	bool released;
+
+	preempt_disable();
+	/*
+	 * Unconditionally decrease the reference count. The saturation
+	 * and dead zones provide enough tolerance for this. If this
+	 * fails then we need to handle the last reference drop and
+	 * cases inside the saturation and dead zones.
+	 */
+	if (likely(!atomic_long_add_negative_release(-1, &ref->refcnt)))
+		released = false;
+	else
+		released = __file_ref_put(ref);
+	preempt_enable();
+	return released;
+}
+
+/**
+ * file_ref_read - Read the number of file references
+ * @ref: Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned long file_ref_read(file_ref_t *ref)
+{
+	unsigned long c = atomic_long_read(&ref->refcnt);
+
+	/* Return 0 if within the DEAD zone. */
+	return c >= FILE_REF_RELEASED ? 0 : c + 1;
+}
+
+#endif

-- 
2.45.2



* [PATCH v2 3/3] fs: port files to file_ref
  2024-10-07 14:23 [PATCH v2 0/3] fs: introduce file_ref_t Christian Brauner
  2024-10-07 14:23 ` [PATCH v2 1/3] fs: protect backing files with rcu Christian Brauner
  2024-10-07 14:23 ` [PATCH v2 2/3] fs: add file_ref Christian Brauner
@ 2024-10-07 14:23 ` Christian Brauner
  2024-10-25 20:45   ` Jann Horn
  2024-10-07 18:27 ` [PATCH v2 0/3] fs: introduce file_ref_t Jens Axboe
  3 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2024-10-07 14:23 UTC
  To: Linus Torvalds
  Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe, Christian Brauner

Port files to rely on file_ref reference to improve scaling and gain
overflow protection.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/gpu/drm/i915/gt/shmem_utils.c |  2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |  2 +-
 fs/eventpoll.c                        |  2 +-
 fs/file.c                             | 14 +++++++-------
 fs/file_table.c                       |  6 +++---
 include/linux/fs.h                    |  9 +++++----
 6 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 1fb6ff77fd899111a0797dac0edd3f2cfa01f42d..bb696b29ee2c992c6b6d0ec5ae538f9ebbb9ed29 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -40,7 +40,7 @@ struct file *shmem_create_from_object(struct drm_i915_gem_object *obj)
 
 	if (i915_gem_object_is_shmem(obj)) {
 		file = obj->base.filp;
-		atomic_long_inc(&file->f_count);
+		get_file(file);
 		return file;
 	}
 
diff --git a/drivers/gpu/drm/vmwgfx/ttm_object.c b/drivers/gpu/drm/vmwgfx/ttm_object.c
index 3353e97687d1d5d0e05bdc8f26ae4b0aae53a997..a17e62867f3b33cd1aafade244d387b43bb66b51 100644
--- a/drivers/gpu/drm/vmwgfx/ttm_object.c
+++ b/drivers/gpu/drm/vmwgfx/ttm_object.c
@@ -471,7 +471,7 @@ void ttm_object_device_release(struct ttm_object_device **p_tdev)
  */
 static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
 {
-	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
+	return file_ref_get(&dmabuf->file->f_ref);
 }
 
 /**
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1ae4542f0bd88b07e323d0dd75be6c0fe9fff54f..212383cefe6c9fe13a38061c2c81e5b6ff857925 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1002,7 +1002,7 @@ static struct file *epi_fget(const struct epitem *epi)
 	struct file *file;
 
 	file = epi->ffd.file;
-	if (!atomic_long_inc_not_zero(&file->f_count))
+	if (!file_ref_get(&file->f_ref))
 		file = NULL;
 	return file;
 }
diff --git a/fs/file.c b/fs/file.c
index 1b5fc867d8ddff856501ba49d8c751f888810330..f4df09e92b6158ee40a794a4e0462f57acf595cf 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -972,7 +972,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	if (!file)
 		return NULL;
 
-	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+	if (unlikely(!file_ref_get(&file->f_ref)))
 		return ERR_PTR(-EAGAIN);
 
 	file_reloaded = rcu_dereference_raw(*f);
@@ -986,8 +986,8 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
 
 	/*
-	 * atomic_long_inc_not_zero() above provided a full memory
-	 * barrier when we acquired a reference.
+	 * file_ref_get() above provided a full memory barrier when we
+	 * acquired a reference.
 	 *
 	 * This is paired with the write barrier from assigning to the
 	 * __rcu protected file pointer so that if that pointer still
@@ -1085,11 +1085,11 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 * We need to confirm it by incrementing the refcount
 		 * and then check the lookup again.
 		 *
-		 * atomic_long_inc_not_zero() gives us a full memory
-		 * barrier. We only really need an 'acquire' one to
-		 * protect the loads below, but we don't have that.
+		 * file_ref_get() gives us a full memory barrier. We
+		 * only really need an 'acquire' one to protect the
+		 * loads below, but we don't have that.
 		 */
-		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+		if (unlikely(!file_ref_get(&file->f_ref)))
 			continue;
 
 		/*
diff --git a/fs/file_table.c b/fs/file_table.c
index 4b23eb7b79dd9d4ec779f4c01ba2e902988895dc..3f5dc4176b21ff82cc9440ed92a0ad962fdb2046 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -178,7 +178,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
 	 * fget-rcu pattern users need to be able to handle spurious
 	 * refcount bumps we should reinitialize the reused file first.
 	 */
-	atomic_long_set(&f->f_count, 1);
+	file_ref_init(&f->f_ref, 1);
 	return 0;
 }
 
@@ -483,7 +483,7 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
 void fput(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (file_ref_put(&file->f_ref)) {
 		struct task_struct *task = current;
 
 		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
@@ -516,7 +516,7 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count))
+	if (file_ref_put(&file->f_ref))
 		__fput(file);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337650d562405500013f5c4cfed8eb6..fece097d41a8fb47a1483e5418f1e7319405bba2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@
 #include <linux/slab.h>
 #include <linux/maple_tree.h>
 #include <linux/rw_hint.h>
+#include <linux/file_ref.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1030,7 +1031,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
  */
 struct file {
-	atomic_long_t			f_count;
+	file_ref_t			f_ref;
 	spinlock_t			f_lock;
 	fmode_t				f_mode;
 	const struct file_operations	*f_op;
@@ -1078,15 +1079,15 @@ struct file_handle {
 
 static inline struct file *get_file(struct file *f)
 {
-	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
-	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
+	WARN_ONCE(!file_ref_get(&f->f_ref),
+		  "struct file::f_ref incremented from zero; use-after-free condition present!\n");
 	return f;
 }
 
 struct file *get_file_rcu(struct file __rcu **f);
 struct file *get_file_active(struct file **f);
 
-#define file_count(x)	atomic_long_read(&(x)->f_count)
+#define file_count(f)	file_ref_read(&(f)->f_ref)
 
 #define	MAX_NON_LFS	((1UL<<31) - 1)
 

-- 
2.45.2



* Re: [PATCH v2 2/3] fs: add file_ref
  2024-10-07 14:23 ` [PATCH v2 2/3] fs: add file_ref Christian Brauner
@ 2024-10-07 18:07   ` Linus Torvalds
  2024-10-08 10:12     ` Christian Brauner
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2024-10-07 18:07 UTC
  To: Christian Brauner; +Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe

On Mon, 7 Oct 2024 at 07:24, Christian Brauner <brauner@kernel.org> wrote:
>
> +static __always_inline __must_check bool file_ref_get(file_ref_t *ref)
> +{
> +       /*
> +        * Unconditionally increase the reference count with full
> +        * ordering. The saturation and dead zones provide enough
> +        * tolerance for this.
> +        */
> +       if (likely(!atomic_long_add_negative(1, &ref->refcnt)))
> +               return true;
> +
> +       /* Handle the cases inside the saturation and dead zones */
> +       return __file_ref_get(ref);
> +}

Ack. This looks like it does the right thing.

That said, I wonder if we could clarify this code sequence by using

        long old = atomic_long_fetch_inc(&ref->refcnt);
        if (old > 0)
                return true;
        return __file_ref_get(ref, old);

instead, which would obviate all the games with using the magic
offset? IOW, it could use 0 as "free" instead of the special
off-by-one "-1 is free".

The reason we have that special "add_negative()" thing is that this
*used* to be much better on x86, because "xadd" was only added in the
i486, and used to be a few cycles slower than "lock ; add".

We got rid of i386 support some time ago, and the lack of xadd was
_one_ of the reasons for it (the supervisor mode WP bit handling issue
was the big one).

And yes, "add_negative()" still generates simpler code, in that it
just returns the sign flag, but I do feel it makes that code harder to
follow.  And I say that despite being very aware of the whole
background to it all and somewhat used to this pattern.

Then file_ref_put() would do

        old = atomic_long_fetch_dec(&ref->refcnt);
        if (old > 1)
                released = false;
        else
                released = __file_ref_put(ref, old);

(inside the preempt disable, of course), and again I feel like this
would be more legible.

And file_ref_read() would just become

        unsigned long val = atomic_long_read(&ref->refcnt);
        return val >= FILE_REF_RELEASED ? 0 : val;

without that odd "+1" to fix the off-by-one, and for the same reason
file_ref_init() would just become

        atomic_long_set(&ref->refcnt, cnt);

with no off-by-one games.

I dunno. This is very much not a big deal, but I did react to the code
being a bit hard to read, and I think it could be friendlier to
humans..

                  Linus


* Re: [PATCH v2 0/3] fs: introduce file_ref_t
  2024-10-07 14:23 [PATCH v2 0/3] fs: introduce file_ref_t Christian Brauner
                   ` (2 preceding siblings ...)
  2024-10-07 14:23 ` [PATCH v2 3/3] fs: port files to file_ref Christian Brauner
@ 2024-10-07 18:27 ` Jens Axboe
  3 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2024-10-07 18:27 UTC
  To: Christian Brauner, Linus Torvalds; +Cc: linux-fsdevel, Thomas Gleixner

On 10/7/24 8:23 AM, Christian Brauner wrote:
> Because atomic_inc_not_zero() is implemented with a try_cmpxchg() loop,
> it has O(N^2) behaviour under contention with N concurrent operations,
> and it sits in a hot path in __fget_files_rcu().
> 
> The rcuref infrastructure remedies this problem by using an
> unconditional increment, relying on saturation and dead zones to make
> this work, and by requiring rcu protection for the data structure in
> question. This not only scales better, it also introduces overflow
> protection.
> 
> However, in contrast to generic rcuref, files require a memory barrier
> and thus cannot rely on *_relaxed() atomic operations. They also need to
> be built on atomic_long_t, as massive reference counts aren't unheard
> of, even if only as the result of an attack.
> 
> As suggested by Linus, add a file-specific variant instead of making
> this a generic library.
> 
> I've been testing this with will-it-scale using a multi-threaded fstat()
> on the same file descriptor, on a machine that Jens gave me access to
> (thank you very much!):
> 
> processor       : 511
> vendor_id       : AuthenticAMD
> cpu family      : 25
> model           : 160
> model name      : AMD EPYC 9754 128-Core Processor
> 
> and I consistently get a 3-5% improvement on workloads with 256 or more
> threads, comparing a v6.12-rc1 base with and without these patches
> applied.

FWIW, I ran this on another box, which is a 2-socket with these CPUs:

AMD EPYC 7763 64-Core Processor

hence 128 cores, 256 threads. I ran my usual max iops test case, which
is 24 threads, each driving a fast drive. If I run without io_uring
direct descriptors, then fget/fput is hit decently hard. In that case, I
see a net reduction of about 1.2% CPU time for the fget/fput parts. So
not as huge a win as mentioned above, but it's also using way fewer
threads and different file descriptors. I'd say that's a pretty
noticeable win!

-- 
Jens Axboe


* Re: [PATCH v2 2/3] fs: add file_ref
  2024-10-07 18:07   ` Linus Torvalds
@ 2024-10-08 10:12     ` Christian Brauner
  2024-10-08 17:29       ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2024-10-08 10:12 UTC
  To: Linus Torvalds; +Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe

On Mon, Oct 07, 2024 at 11:07:51AM GMT, Linus Torvalds wrote:
> On Mon, 7 Oct 2024 at 07:24, Christian Brauner <brauner@kernel.org> wrote:
> >
> > +static __always_inline __must_check bool file_ref_get(file_ref_t *ref)
> > +{
> > +       /*
> > +        * Unconditionally increase the reference count with full
> > +        * ordering. The saturation and dead zones provide enough
> > +        * tolerance for this.
> > +        */
> > +       if (likely(!atomic_long_add_negative(1, &ref->refcnt)))
> > +               return true;
> > +
> > +       /* Handle the cases inside the saturation and dead zones */
> > +       return __file_ref_get(ref);
> > +}
> 
> Ack. This looks like it does the right thing.
> 
> That said, I wonder if we could clarify this code sequence by using
> 
>         long old = atomic_long_fetch_inc(&ref->refcnt);
>         if (old > 0)
>                 return true;
>         return __file_ref_get(ref, old);
> 
> instead, which would obviate all the games with using the magic
> offset? IOW, it could use 0 as "free" instead of the special
> off-by-one "-1 is free".
> 
> The reason we have that special "add_negative()" thing is that this
> *used* to be much better on x86, because "xadd" was only added in the
> i486, and used to be a few cycles slower than "lock ; add".
> 
> We got rid of i386 support some time ago, and the lack of xadd was
> _one_ of the reasons for it (the supervisor mode WP bit handling issue
> was the big one).
> 
> And yes, "add_negative()" still generates simpler code, in that it
> just returns the sign flag, but I do feel it makes that code harder to
> follow.  And I say that despite being very aware of the whole

Thanks for the background. That was helpful!

Switching to atomic_long_fetch_inc() would change the logic quite a bit
though, as atomic_long_fetch_inc() returns the previous value. In the
current scheme atomic_long_add_negative() will signal that an overflow
happened, and then we drop into the slowpath, which reads the updated
value and resets to FILE_REF_DEAD if we overflowed.

So if T1 overflows but accidentally happens to be the last thread that
does a file_ref_get(), then T1 will reset the counter to FILE_REF_DEAD.

But iiuc, if we now rely on the previous value, T1 wouldn't reset the
counter if it overflowed while happening to be the last thread performing
some operation that takes a lot of time. It would also mean that T1
overflows but still returns success.

We could handle such cases by either using atomic_long_inc_return() or
by manually checking the previous value against FILE_REF_MAXREF and
passing a +1 or -1 value to the slowpath helpers.

But imho I find the current code easier to follow. Maybe I'm being held
hostage by having read the rcuref code a bit too often by now. But if
you don't feel strongly I'd leave it as is.


* Re: [PATCH v2 2/3] fs: add file_ref
  2024-10-08 10:12     ` Christian Brauner
@ 2024-10-08 17:29       ` Linus Torvalds
  0 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2024-10-08 17:29 UTC
  To: Christian Brauner; +Cc: linux-fsdevel, Thomas Gleixner, Jens Axboe

On Tue, 8 Oct 2024 at 03:12, Christian Brauner <brauner@kernel.org> wrote:
>
> Switching atomic_long_fetch_inc() would change the logic quite a bit
> though as atomic_long_fetch_inc() returns the previous value.

You can use atomic_long_inc_return() if you want the new value.

So the "atomic_fetch_op()" is a "fetch old value and do the op".

The "atomic_op_return()" is "do the op and return the new value".
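
IOW, for this counter (just spelling out the difference):

        long old = atomic_long_fetch_inc(&ref->refcnt);  /* value before the inc */
        long new = atomic_long_inc_return(&ref->refcnt); /* value after the inc */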

We do have both, although on x86, the "fetch_op" is the one that most
closely maps to "xadd".

But if you find the current code easier, that's fine.

            Linus


* Re: [PATCH v2 3/3] fs: port files to file_ref
  2024-10-07 14:23 ` [PATCH v2 3/3] fs: port files to file_ref Christian Brauner
@ 2024-10-25 20:45   ` Jann Horn
  2024-10-25 23:55     ` Jann Horn
  0 siblings, 1 reply; 14+ messages in thread
From: Jann Horn @ 2024-10-25 20:45 UTC
  To: Christian Brauner
  Cc: Linus Torvalds, linux-fsdevel, Thomas Gleixner, Jens Axboe

On Mon, Oct 7, 2024 at 4:23 PM Christian Brauner <brauner@kernel.org> wrote:
> Port files to rely on file_ref reference to improve scaling and gain
> overflow protection.
[...]
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 4b23eb7b79dd9d4ec779f4c01ba2e902988895dc..3f5dc4176b21ff82cc9440ed92a0ad962fdb2046 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -178,7 +178,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
>          * fget-rcu pattern users need to be able to handle spurious
>          * refcount bumps we should reinitialize the reused file first.
>          */
> -       atomic_long_set(&f->f_count, 1);
> +       file_ref_init(&f->f_ref, 1);

It is good that you use file_ref_init() here to atomically initialize
the file_ref; however, I think it is problematic that before this,
alloc_empty_file() uses kmem_cache_zalloc(filp_cachep, GFP_KERNEL) to
allocate the file, because that sets __GFP_ZERO, which means that
slab_post_alloc_hook() will use memset() to zero the file object. That
causes trouble in two different ways:


1. After the memset() has changed the file ref to zero, I think
file_ref_get() can return true? Which means __get_file_rcu() could
believe that it acquired a reference, and we could race like this:

task A                          task B
                                __get_file_rcu()
                                  rcu_dereference_raw()
close()
  [frees file]
alloc_empty_file()
  kmem_cache_zalloc()
    [reallocates same file]
    memset(..., 0, ...)
                                  file_ref_get()
                                    [increments 0->1, returns true]
  init_file()
    file_ref_init(..., 1)
      [sets to 0]
                                  rcu_dereference_raw()
                                  fput()
                                    file_ref_put()
                                      [decrements 0->FILE_REF_NOREF, frees file]
  [UAF]


2. AFAIK the memset() is not guaranteed to atomically update an
"unsigned long", so you could see an entirely bogus torn counter
value.

The only reason this worked in the old code is that the refcount value
stored in freed files is 0.

So I think you need to stop using kmem_cache_zalloc() to allocate
files, and instead use a constructor function that zeroes the refcount
field, and manually memset() the rest of the "struct file" to 0 after
calling kmem_cache_alloc().
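
A rough sketch of the constructor variant (file_cache_ctor is a made-up
name; the slab constructor only runs when objects are first set up in a
slab page, not on every kmem_cache_alloc(), so reused objects keep
whatever value the last file_ref_put() left behind):

        static void file_cache_ctor(void *p)
        {
                struct file *f = p;

                /* Set once per object; never blindly rewritten on reuse. */
                file_ref_init(&f->f_ref, 1);
        }

        filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
                                        SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                        SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU,
                                        file_cache_ctor);

with alloc_empty_file() then memset()ing everything after f_ref instead of
using kmem_cache_zalloc().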

>         return 0;
>  }
>


* Re: [PATCH v2 3/3] fs: port files to file_ref
  2024-10-25 20:45   ` Jann Horn
@ 2024-10-25 23:55     ` Jann Horn
  2024-10-28 11:17       ` Christian Brauner
  0 siblings, 1 reply; 14+ messages in thread
From: Jann Horn @ 2024-10-25 23:55 UTC
  To: Christian Brauner
  Cc: Linus Torvalds, linux-fsdevel, Thomas Gleixner, Jens Axboe

On Fri, Oct 25, 2024 at 10:45 PM Jann Horn <jannh@google.com> wrote:
> On Mon, Oct 7, 2024 at 4:23 PM Christian Brauner <brauner@kernel.org> wrote:
> > Port files to rely on file_ref reference to improve scaling and gain
> > overflow protection.
> [...]
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 4b23eb7b79dd9d4ec779f4c01ba2e902988895dc..3f5dc4176b21ff82cc9440ed92a0ad962fdb2046 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -178,7 +178,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
> >          * fget-rcu pattern users need to be able to handle spurious
> >          * refcount bumps we should reinitialize the reused file first.
> >          */
> > -       atomic_long_set(&f->f_count, 1);
> > +       file_ref_init(&f->f_ref, 1);
>
> It is good that you use file_ref_init() here to atomically initialize
> the file_ref; however, I think it is problematic that before this,
> alloc_empty_file() uses kmem_cache_zalloc(filp_cachep, GFP_KERNEL) to
> allocate the file, because that sets __GFP_ZERO, which means that
> slab_post_alloc_hook() will use memset() to zero the file object. That
> causes trouble in two different ways:
>
>
> 1. After the memset() has changed the file ref to zero, I think
> file_ref_get() can return true? Which means __get_file_rcu() could
> believe that it acquired a reference, and we could race like this:
>
> task A                          task B
>                                 __get_file_rcu()
>                                   rcu_dereference_raw()
> close()
>   [frees file]
> alloc_empty_file()
>   kmem_cache_zalloc()
>     [reallocates same file]
>     memset(..., 0, ...)
>                                   file_ref_get()
>                                     [increments 0->1, returns true]
>   init_file()
>     file_ref_init(..., 1)
>       [sets to 0]
>                                   rcu_dereference_raw()
>                                   fput()
>                                     file_ref_put()
>                                       [decrements 0->FILE_REF_NOREF, frees file]
>   [UAF]
>
>
> 2. AFAIK the memset() is not guaranteed to atomically update an
> "unsigned long", so you could see an entirely bogus torn counter
> value.
>
> The only reason this worked in the old code is that the refcount value
> stored in freed files is 0.
>
> So I think you need to stop using kmem_cache_zalloc() to allocate
> files, and instead use a constructor function that zeroes the refcount
> field, and manually memset() the rest of the "struct file" to 0 after
> calling kmem_cache_alloc().

Actually, thinking about this again, I guess there's actually no
reason why you'd need a constructor function; if you just avoid
memset()ing the refcount field on allocation, that'd be good enough.


* Re: [PATCH v2 3/3] fs: port files to file_ref
  2024-10-25 23:55     ` Jann Horn
@ 2024-10-28 11:17       ` Christian Brauner
  2024-10-28 18:30         ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2024-10-28 11:17 UTC
  To: Jann Horn; +Cc: Linus Torvalds, linux-fsdevel, Thomas Gleixner, Jens Axboe

> Actually, thinking about this again, I guess there's actually no
> reason why you'd need a constructor function; if you just avoid
> memset()ing the refcount field on allocation, that'd be good enough.

Thanks for catching this. So what I did is:

diff --git a/fs/file_table.c b/fs/file_table.c
index 5b8b18259eca..c81a47aea47f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -148,10 +148,14 @@ static int __init init_fs_stat_sysctls(void)
 fs_initcall(init_fs_stat_sysctls);
 #endif

+static_assert(offsetof(struct file, f_ref) == 0);
+
 static int init_file(struct file *f, int flags, const struct cred *cred)
 {
        int error;

+       memset((void *)f + sizeof(file_ref_t), 0, sizeof(*f) - sizeof(file_ref_t));
+
        f->f_cred = get_cred(cred);
        error = security_file_alloc(f);
        if (unlikely(error)) {
@@ -209,7 +213,7 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
                        goto over;
        }

-       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+       f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
        if (unlikely(!f))
                return ERR_PTR(-ENOMEM);

@@ -243,7 +247,7 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred)
        struct file *f;
        int error;

-       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+       f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
        if (unlikely(!f))
                return ERR_PTR(-ENOMEM);

@@ -270,7 +274,7 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
        struct backing_file *ff;
        int error;

-       ff = kmem_cache_zalloc(bfilp_cachep, GFP_KERNEL);
+       ff = kmem_cache_alloc(bfilp_cachep, GFP_KERNEL);
        if (unlikely(!ff))
                return ERR_PTR(-ENOMEM);



* Re: [PATCH v2 3/3] fs: port files to file_ref
  2024-10-28 11:17       ` Christian Brauner
@ 2024-10-28 18:30         ` Linus Torvalds
  2024-10-29 14:18           ` Christian Brauner
  2024-10-29 17:30           ` Endorsing __randomize_layout for projects! " Cedric Blancher
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2024-10-28 18:30 UTC
  To: Christian Brauner; +Cc: Jann Horn, linux-fsdevel, Thomas Gleixner, Jens Axboe

On Mon, 28 Oct 2024 at 01:17, Christian Brauner <brauner@kernel.org> wrote:
>
> Thanks for catching this. So what I did is:

You had better remove the __randomize_layout from 'struct file' too,
otherwise your patch is entirely broken.

We should damn well remove it anyway, the whole struct randomization
is just a bad joke. Nobody sane enables it, afaik. But for your patch
in particular, it's now an active bug.

Also, I wonder if we would be better off with f_count _away_ from the
other fields we touch, because the file count ref always ends up
making it cpu-local, so no shared caching behavior. We had that
reported for the inode contents.

So any threaded use of the same file will end up bouncing not just the
refcount, but also won't be caching some of the useful info at the
beginning of the file, that is basically read-only and could be shared
across CPUs.
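
IOW, something along these lines (purely illustrative, not a proposal for
the final layout):

        struct file {
                /* read-mostly state, shareable across CPUs */
                fmode_t                         f_mode;
                const struct file_operations    *f_op;
                /* ... */

                /* keep the bouncing refcount away from the above */
                file_ref_t                      f_ref ____cacheline_aligned;
        };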

            Linus


* Re: [PATCH v2 3/3] fs: port files to file_ref
  2024-10-28 18:30         ` Linus Torvalds
@ 2024-10-29 14:18           ` Christian Brauner
  2024-10-29 17:30           ` Endorsing __randomize_layout for projects! " Cedric Blancher
  1 sibling, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2024-10-29 14:18 UTC
  To: Linus Torvalds; +Cc: Jann Horn, linux-fsdevel, Thomas Gleixner, Jens Axboe

On Mon, Oct 28, 2024 at 08:30:39AM -1000, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 01:17, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Thanks for catching this. So what I did is:
> 
> You had better remove the __randomize_layout from 'struct file' too,
> otherwise your patch is entirely broken.

Yeah, it has actively been baffling me why this ever made it to struct
file in the first place.

> 
> We should damn well remove it anyway, the whole struct randomization
> is just a bad joke. Nobody sane enables it, afaik. But for your patch
> in particular, it's now an active bug.

I'm aware, and so I ended up just initializing things manually instead of
the proposed patch. We always call init_file() and we already initialize a
bunch of fields in there anyway, so we might just as well do the rest of
the fields.

> 
> Also, I wonder if we would be better off with f_count _away_ from the
> other fields we touch, because the file count ref always ends up
> making it cpu-local, so no shared caching behavior. We had that
> reported for the inode contents.
> 
> So any threaded use of the same file will end up bouncing not just the
> refcount, but also won't be caching some of the useful info at the
> beginning of the file, that is basically read-only and could be shared
> across CPUs.

Yes, I'm aware of that and I commented on that in the commit message
that we will likely end up reordering fields to prevent such false
sharing.


* Endorsing __randomize_layout for projects! Re: [PATCH v2 3/3] fs: port files to file_ref
  2024-10-28 18:30         ` Linus Torvalds
  2024-10-29 14:18           ` Christian Brauner
@ 2024-10-29 17:30           ` Cedric Blancher
  1 sibling, 0 replies; 14+ messages in thread
From: Cedric Blancher @ 2024-10-29 17:30 UTC
  To: Linus Torvalds
  Cc: Christian Brauner, Jann Horn, linux-fsdevel, Thomas Gleixner,
	Jens Axboe

On Mon, 28 Oct 2024 at 19:33, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 28 Oct 2024 at 01:17, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Thanks for catching this. So what I did is:
>
> You had better remove the __randomize_layout from 'struct file' too,
> otherwise your patch is entirely broken.
>
> We should damn well remove it anyway, the whole struct randomization
> is just a bad joke. Nobody sane enables it, afaik.

The French Cybersecurity Agency (ANSSI) and the American NSA actively
endorse it, and want to put that crap on the list of mandatory things for
projects.
Maybe you can find some curses for them? Linus-style and loud, please!

Ced

