* [PATCH RFC 1/4] fs: protect backing files with rcu
2024-10-05 19:16 [PATCH RFC 0/4] fs: port files to rcuref_long_t Christian Brauner
@ 2024-10-05 19:16 ` Christian Brauner
2024-10-05 19:16 ` [PATCH RFC 2/4] types: add rcuref_long_t Christian Brauner
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-05 19:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Thomas Gleixner, Jann Horn, Christian Brauner
Currently backing files are not under any form of rcu protection.
Switching to rcuref requires rcu protection and so does the speculative
vma lookup. Switch them to the same rcu slab as regular files. There
should be no additional magic required as the lifetime of backing files
are always tied to a regular file.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/file_table.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index eed5ffad9997c24e533f88285deb537ddf9429ed..9fc9048145ca023ef8af8769d5f1234a69f10df1 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -40,6 +40,7 @@ static struct files_stat_struct files_stat = {
/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __ro_after_init;
+static struct kmem_cache *bfilp_cachep __ro_after_init;
static struct percpu_counter nr_files __cacheline_aligned_in_smp;
@@ -68,7 +69,7 @@ static inline void file_free(struct file *f)
put_cred(f->f_cred);
if (unlikely(f->f_mode & FMODE_BACKING)) {
path_put(backing_file_user_path(f));
- kfree(backing_file(f));
+ kmem_cache_free(bfilp_cachep, backing_file(f));
} else {
kmem_cache_free(filp_cachep, f);
}
@@ -267,13 +268,13 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
struct backing_file *ff;
int error;
- ff = kzalloc(sizeof(struct backing_file), GFP_KERNEL);
+ ff = kmem_cache_zalloc(bfilp_cachep, GFP_KERNEL);
if (unlikely(!ff))
return ERR_PTR(-ENOMEM);
error = init_file(&ff->file, flags, cred);
if (unlikely(error)) {
- kfree(ff);
+ kmem_cache_free(bfilp_cachep, ff);
return ERR_PTR(error);
}
@@ -529,6 +530,10 @@ void __init files_init(void)
filp_cachep = kmem_cache_create("filp", sizeof(struct file), &args,
SLAB_HWCACHE_ALIGN | SLAB_PANIC |
SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
+
+ bfilp_cachep = kmem_cache_create("bfilp", sizeof(struct backing_file),
+ NULL, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+ SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
percpu_counter_init(&nr_files, 0, GFP_KERNEL);
}
--
2.45.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC 2/4] types: add rcuref_long_t
2024-10-05 19:16 [PATCH RFC 0/4] fs: port files to rcuref_long_t Christian Brauner
2024-10-05 19:16 ` [PATCH RFC 1/4] fs: protect backing files with rcu Christian Brauner
@ 2024-10-05 19:16 ` Christian Brauner
2024-10-05 19:16 ` [PATCH RFC 3/4] rcuref: add rcuref_long_*() helpers Christian Brauner
` (2 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-05 19:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Thomas Gleixner, Jann Horn, Christian Brauner
Add a variant of rcuref that operates on atomic_long_t instead of
atomic_t so it can be used for data structures that require
atomic_long_t.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/types.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/types.h b/include/linux/types.h
index 2bc8766ba20cab014a380f02e5644bd0d772ec67..b10bf351f3e4d1f1c1ca16248199470306de4aa0 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -190,6 +190,16 @@ typedef struct {
#define RCUREF_INIT(i) { .refcnt = ATOMIC_INIT(i - 1) }
+typedef struct {
+#ifdef CONFIG_64BIT
+ atomic64_t refcnt;
+#else
+ atomic_t refcnt;
+#endif
+} rcuref_long_t;
+
+#define RCUREF_LONG_INIT(i) { .refcnt = ATOMIC_LONG_INIT(i - 1) }
+
struct list_head {
struct list_head *next, *prev;
};
--
2.45.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC 3/4] rcuref: add rcuref_long_*() helpers
2024-10-05 19:16 [PATCH RFC 0/4] fs: port files to rcuref_long_t Christian Brauner
2024-10-05 19:16 ` [PATCH RFC 1/4] fs: protect backing files with rcu Christian Brauner
2024-10-05 19:16 ` [PATCH RFC 2/4] types: add rcuref_long_t Christian Brauner
@ 2024-10-05 19:16 ` Christian Brauner
2024-10-05 19:16 ` [PATCH RFC 4/4] fs: port files to rcuref_long_t Christian Brauner
2024-10-05 21:42 ` [PATCH RFC 0/4] " Linus Torvalds
4 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-05 19:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Thomas Gleixner, Jann Horn, Christian Brauner
Add a variant of rcuref helpers that operate on atomic_long_t instead of
atomic_t so rcuref can be used for data structures that require
atomic_long_t.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/rcuref_long.h | 165 ++++++++++++++++++++++++++++++++++++++++++++
lib/rcuref.c | 104 ++++++++++++++++++++++++++++
2 files changed, 269 insertions(+)
diff --git a/include/linux/rcuref_long.h b/include/linux/rcuref_long.h
new file mode 100644
index 0000000000000000000000000000000000000000..7cedc537e5268e114f1a4221a4f1b0cb8d0e1241
--- /dev/null
+++ b/include/linux/rcuref_long.h
@@ -0,0 +1,165 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_RCUREF_LONG_H
+#define _LINUX_RCUREF_LONG_H
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/limits.h>
+#include <linux/lockdep.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/rcuref.h>
+
+#ifdef CONFIG_64BIT
+#define RCUREF_LONG_ONEREF 0x0000000000000000U
+#define RCUREF_LONG_MAXREF 0x7FFFFFFFFFFFFFFFU
+#define RCUREF_LONG_SATURATED 0xA000000000000000U
+#define RCUREF_LONG_RELEASED 0xC000000000000000U
+#define RCUREF_LONG_DEAD 0xE000000000000000U
+#define RCUREF_LONG_NOREF 0xFFFFFFFFFFFFFFFFU
+#else
+#define RCUREF_LONG_ONEREF RCUREF_ONEREF
+#define RCUREF_LONG_MAXREF RCUREF_MAXREF
+#define RCUREF_LONG_SATURATED RCUREF_SATURATED
+#define RCUREF_LONG_RELEASED RCUREF_RELEASED
+#define RCUREF_LONG_DEAD RCUREF_DEAD
+#define RCUREF_LONG_NOREF RCUREF_NOREF
+#endif
+
+/**
+ * rcuref_long_init - Initialize a rcuref reference count with the given reference count
+ * @ref: Pointer to the reference count
+ * @cnt: The initial reference count typically '1'
+ */
+static inline void rcuref_long_init(rcuref_long_t *ref, unsigned long cnt)
+{
+ atomic_long_set(&ref->refcnt, cnt - 1);
+}
+
+/**
+ * rcuref_long_read - Read the number of held reference counts of a rcuref
+ * @ref: Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned long rcuref_long_read(rcuref_long_t *ref)
+{
+ unsigned long c = atomic_long_read(&ref->refcnt);
+
+ /* Return 0 if within the DEAD zone. */
+ return c >= RCUREF_LONG_RELEASED ? 0 : c + 1;
+}
+
+__must_check bool rcuref_long_get_slowpath(rcuref_long_t *ref);
+
+/**
+ * rcuref_long_get - Acquire one reference on a rcuref reference count
+ * @ref: Pointer to the reference count
+ *
+ * Similar to atomic_long_inc_not_zero() but saturates at RCUREF_LONG_MAXREF.
+ *
+ * Provides no memory ordering, it is assumed the caller has guaranteed the
+ * object memory to be stable (RCU, etc.). It does provide a control dependency
+ * and thereby orders future stores. See documentation in lib/rcuref.c
+ *
+ * Return:
+ * False if the attempt to acquire a reference failed. This happens
+ * when the last reference has been put already
+ *
+ * True if a reference was successfully acquired
+ */
+static inline __must_check bool rcuref_long_get(rcuref_long_t *ref)
+{
+ /*
+ * Unconditionally increase the reference count. The saturation and
+ * dead zones provide enough tolerance for this.
+ */
+ if (likely(!atomic_long_add_negative_relaxed(1, &ref->refcnt)))
+ return true;
+
+ /* Handle the cases inside the saturation and dead zones */
+ return rcuref_long_get_slowpath(ref);
+}
+
+__must_check bool rcuref_long_put_slowpath(rcuref_long_t *ref);
+
+/*
+ * Internal helper. Do not invoke directly.
+ */
+static __always_inline __must_check bool __rcuref_long_put(rcuref_long_t *ref)
+{
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
+ "suspicious rcuref_put_rcusafe() usage");
+ /*
+ * Unconditionally decrease the reference count. The saturation and
+ * dead zones provide enough tolerance for this.
+ */
+ if (likely(!atomic_long_add_negative_release(-1, &ref->refcnt)))
+ return false;
+
+ /*
+ * Handle the last reference drop and cases inside the saturation
+ * and dead zones.
+ */
+ return rcuref_long_put_slowpath(ref);
+}
+
+/**
+ * rcuref_long_put_rcusafe -- Release one reference for a rcuref reference count RCU safe
+ * @ref: Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Can be invoked from contexts, which guarantee that no grace period can
+ * happen which would free the object concurrently if the decrement drops
+ * the last reference and the slowpath races against a concurrent get() and
+ * put() pair. rcu_read_lock()'ed and atomic contexts qualify.
+ *
+ * Return:
+ * True if this was the last reference with no future references
+ * possible. This signals the caller that it can safely release the
+ * object which is protected by the reference counter.
+ *
+ * False if there are still active references or the put() raced
+ * with a concurrent get()/put() pair. Caller is not allowed to
+ * release the protected object.
+ */
+static inline __must_check bool rcuref_long_put_rcusafe(rcuref_long_t *ref)
+{
+ return __rcuref_long_put(ref);
+}
+
+/**
+ * rcuref_long_put -- Release one reference for a rcuref reference count
+ * @ref: Pointer to the reference count
+ *
+ * Can be invoked from any context.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Return:
+ *
+ * True if this was the last reference with no future references
+ * possible. This signals the caller that it can safely schedule the
+ * object, which is protected by the reference counter, for
+ * deconstruction.
+ *
+ * False if there are still active references or the put() raced
+ * with a concurrent get()/put() pair. Caller is not allowed to
+ * deconstruct the protected object.
+ */
+static inline __must_check bool rcuref_long_put(rcuref_long_t *ref)
+{
+ bool released;
+
+ preempt_disable();
+ released = __rcuref_long_put(ref);
+ preempt_enable();
+ return released;
+}
+
+#endif
diff --git a/lib/rcuref.c b/lib/rcuref.c
index 97f300eca927ced7f36fe0c932d2a9d3759809b8..01a4c317c8bb7ff24632334ddb4520aa79aa46f3 100644
--- a/lib/rcuref.c
+++ b/lib/rcuref.c
@@ -176,6 +176,7 @@
#include <linux/export.h>
#include <linux/rcuref.h>
+#include <linux/rcuref_long.h>
/**
* rcuref_get_slowpath - Slowpath of rcuref_get()
@@ -217,6 +218,46 @@ bool rcuref_get_slowpath(rcuref_t *ref)
}
EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
+/**
+ * rcuref_long_get_slowpath - Slowpath of rcuref_long_get()
+ * @ref: Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ * False if the reference count was already marked dead
+ *
+ * True if the reference count is saturated, which prevents the
+ * object from being deconstructed ever.
+ */
+bool rcuref_long_get_slowpath(rcuref_long_t *ref)
+{
+ unsigned long cnt = atomic_long_read(&ref->refcnt);
+
+ /*
+ * If the reference count was already marked dead, undo the
+ * increment so it stays in the middle of the dead zone and return
+ * fail.
+ */
+ if (cnt >= RCUREF_LONG_RELEASED) {
+ atomic_long_set(&ref->refcnt, RCUREF_LONG_DEAD);
+ return false;
+ }
+
+ /*
+ * If it was saturated, warn and mark it so. In case the increment
+ * was already on a saturated value restore the saturation
+ * marker. This keeps it in the middle of the saturation zone and
+ * prevents the reference count from overflowing. This leaks the
+ * object memory, but prevents the obvious reference count overflow
+ * damage.
+ */
+ if (WARN_ONCE(cnt > RCUREF_LONG_MAXREF, "rcuref saturated - leaking memory"))
+ atomic_long_set(&ref->refcnt, RCUREF_LONG_SATURATED);
+ return true;
+}
+EXPORT_SYMBOL_GPL(rcuref_long_get_slowpath);
+
/**
* rcuref_put_slowpath - Slowpath of __rcuref_put()
* @ref: Pointer to the reference count
@@ -279,3 +320,66 @@ bool rcuref_put_slowpath(rcuref_t *ref)
return false;
}
EXPORT_SYMBOL_GPL(rcuref_put_slowpath);
+
+/**
+ * rcuref_long_put_slowpath - Slowpath of __rcuref_long_put()
+ * @ref: Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ * True if this was the last reference with no future references
+ * possible. This signals the caller that it can safely schedule the
+ * object, which is protected by the reference counter, for
+ * deconstruction.
+ *
+ * False if there are still active references or the put() raced
+ * with a concurrent get()/put() pair. Caller is not allowed to
+ * deconstruct the protected object.
+ */
+bool rcuref_long_put_slowpath(rcuref_long_t *ref)
+{
+ unsigned long cnt = atomic_long_read(&ref->refcnt);
+
+ /* Did this drop the last reference? */
+ if (likely(cnt == RCUREF_LONG_NOREF)) {
+ /*
+ * Carefully try to set the reference count to RCUREF_LONG_DEAD.
+ *
+ * This can fail if a concurrent get() operation has
+ * elevated it again or the corresponding put() even marked
+ * it dead already. Both are valid situations and do not
+ * require a retry. If this fails the caller is not
+ * allowed to deconstruct the object.
+ */
+ if (!atomic_long_try_cmpxchg_release(&ref->refcnt, &cnt, RCUREF_LONG_DEAD))
+ return false;
+
+ /*
+ * The caller can safely schedule the object for
+ * deconstruction. Provide acquire ordering.
+ */
+ smp_acquire__after_ctrl_dep();
+ return true;
+ }
+
+ /*
+ * If the reference count was already in the dead zone, then this
+ * put() operation is imbalanced. Warn, put the reference count back to
+ * DEAD and tell the caller to not deconstruct the object.
+ */
+ if (WARN_ONCE(cnt >= RCUREF_LONG_RELEASED, "rcuref - imbalanced put()")) {
+ atomic_long_set(&ref->refcnt, RCUREF_LONG_DEAD);
+ return false;
+ }
+
+ /*
+ * This is a put() operation on a saturated refcount. Restore the
+ * mean saturation value and tell the caller to not deconstruct the
+ * object.
+ */
+ if (cnt > RCUREF_LONG_MAXREF)
+ atomic_long_set(&ref->refcnt, RCUREF_LONG_SATURATED);
+ return false;
+}
+EXPORT_SYMBOL_GPL(rcuref_long_put_slowpath);
--
2.45.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC 4/4] fs: port files to rcuref_long_t
2024-10-05 19:16 [PATCH RFC 0/4] fs: port files to rcuref_long_t Christian Brauner
` (2 preceding siblings ...)
2024-10-05 19:16 ` [PATCH RFC 3/4] rcuref: add rcuref_long_*() helpers Christian Brauner
@ 2024-10-05 19:16 ` Christian Brauner
2024-10-05 21:42 ` [PATCH RFC 0/4] " Linus Torvalds
4 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-05 19:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Thomas Gleixner, Jann Horn, Christian Brauner
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations. The
rcuref infrastructure uses atomic_add_negative_relaxed() for the fast
path, which scales better under contention and we get overflow
protection for free.
I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access (thank you very much!):
processor : 511
vendor_id : AuthenticAMD
cpu family : 25
model : 160
model name : AMD EPYC 9754 128-Core Processor
and I consistently get a 3-5% improvement on 256+ threads.
Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.
So when picking a file from the file descriptor table via its file
descriptor number it is thus possible to see an elevated reference count
on file->f_count even though the file has already been recycled possibly
multiple times by another task. To guard against this the vfs will pick
the file from the file descriptor via its file descriptor number twice.
Once before the rcuref_long_get() and once after and compare the
pointers (grossly simplified). If they match then the file is still
valid. If not the caller needs to fput() it.
The rcuref infrastructure requires explicit rcu protection to handle the
following race:
> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
> T1 T2
> put() get()
> // ref->refcnt = ONEREF
> if (!atomic_add_negative(-1, &ref->refcnt))
> return false; <- Not taken
>
> // ref->refcnt == NOREF
> --> preemption
> // Elevates ref->refcnt to ONEREF
> if (!atomic_add_negative(1, &ref->refcnt))
> return true; <- taken
>
> if (put(&p->ref)) { <-- Succeeds
> remove_pointer(p);
> kfree_rcu(p, rcu);
> }
>
> RCU grace period ends, object is freed
>
> atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.
Having files by SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
the rcuref deconstruction race. Afaict, the only interesting case would
be someone freeing the file and someone immediately recycling it within
the same grace period and reinitializing file->f_count to ONEREF while a
concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as
in the race above.
But this seems safe from SLAB_TYPESAFE_BY_RCU's perspective and it
should be safe from rcuref's perspective.
T1 T2 T3
fput() fget()
// f_count->refcnt = ONEREF
if (!atomic_add_negative(-1, &f_count->refcnt))
return false; <- Not taken
// f_count->refcnt == NOREF
--> preemption
// Elevates f_count->refcnt to ONEREF
if (!atomic_add_negative(1, &f_count->refcnt))
return true; <- taken
if (put(&f_count)) { <-- Succeeds
remove_pointer(p);
/*
* Cache is SLAB_TYPESAFE_BY_RCU
* so this is freed without a grace period.
*/
kmem_cache_free(p);
}
kmem_cache_alloc()
init_file() {
// Sets f_count->refcnt to ONEREF
rcuref_long_init(&f->f_count, 1);
}
Object has been reused within the same grace period
via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.
/*
* With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
* it would probably work correctly because the atomic_cmpxchg()
* will fail because the refcount has been reset to ONEREF by T3.
*/
atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
drivers/gpu/drm/i915/gt/shmem_utils.c | 2 +-
drivers/gpu/drm/vmwgfx/ttm_object.c | 2 +-
fs/eventpoll.c | 2 +-
fs/file.c | 17 ++++++++---------
fs/file_table.c | 7 ++++---
include/linux/fs.h | 9 +++++----
include/linux/rcuref_long.h | 5 +++--
7 files changed, 23 insertions(+), 21 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 1fb6ff77fd899111a0797dac0edd3f2cfa01f42d..bb696b29ee2c992c6b6d0ec5ae538f9ebbb9ed29 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -40,7 +40,7 @@ struct file *shmem_create_from_object(struct drm_i915_gem_object *obj)
if (i915_gem_object_is_shmem(obj)) {
file = obj->base.filp;
- atomic_long_inc(&file->f_count);
+ get_file(file);
return file;
}
diff --git a/drivers/gpu/drm/vmwgfx/ttm_object.c b/drivers/gpu/drm/vmwgfx/ttm_object.c
index 3353e97687d1d5d0e05bdc8f26ae4b0aae53a997..539dfec0e623ec2be730924fe7b8e28a2ff1face 100644
--- a/drivers/gpu/drm/vmwgfx/ttm_object.c
+++ b/drivers/gpu/drm/vmwgfx/ttm_object.c
@@ -471,7 +471,7 @@ void ttm_object_device_release(struct ttm_object_device **p_tdev)
*/
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
- return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
+ return rcuref_long_get(&dmabuf->file->f_count);
}
/**
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1ae4542f0bd88b07e323d0dd75be6c0fe9fff54f..0a033950225af274c21e503a6ea4813e5bab5dc2 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1002,7 +1002,7 @@ static struct file *epi_fget(const struct epitem *epi)
struct file *file;
file = epi->ffd.file;
- if (!atomic_long_inc_not_zero(&file->f_count))
+ if (!rcuref_long_get(&file->f_count))
file = NULL;
return file;
}
diff --git a/fs/file.c b/fs/file.c
index 5125607d040a2ff073e170d043124db5f444a90a..74e7a1cd709fc2147655d5e4b75cc2d8250bed88 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -866,10 +866,10 @@ static struct file *__get_file_rcu(struct file __rcu **f)
if (!file)
return NULL;
- if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+ if (unlikely(!rcuref_long_get(&file->f_count)))
return ERR_PTR(-EAGAIN);
- file_reloaded = rcu_dereference_raw(*f);
+ file_reloaded = smp_load_acquire(f);
/*
* Ensure that all accesses have a dependency on the load from
@@ -880,8 +880,8 @@ static struct file *__get_file_rcu(struct file __rcu **f)
OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
/*
- * atomic_long_inc_not_zero() above provided a full memory
- * barrier when we acquired a reference.
+ * smp_load_acquire() provided an acquire barrier when we loaded
+ * the file pointer.
*
* This is paired with the write barrier from assigning to the
* __rcu protected file pointer so that if that pointer still
@@ -979,11 +979,10 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
* We need to confirm it by incrementing the refcount
* and then check the lookup again.
*
- * atomic_long_inc_not_zero() gives us a full memory
- * barrier. We only really need an 'acquire' one to
- * protect the loads below, but we don't have that.
+ * rcuref_long_get() doesn't provide a memory barrier so
+ * we use smp_load_acquire() on the file pointer below.
*/
- if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+ if (unlikely(!rcuref_long_get(&file->f_count)))
continue;
/*
@@ -1000,7 +999,7 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
*
* If so, we need to put our ref and try again.
*/
- if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
+ if (unlikely(file != smp_load_acquire(fdentry)) ||
unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
fput(file);
continue;
diff --git a/fs/file_table.c b/fs/file_table.c
index 9fc9048145ca023ef8af8769d5f1234a69f10df1..f4b96a9dade804a81347865625418a0fdc9a7c09 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -28,6 +28,7 @@
#include <linux/task_work.h>
#include <linux/swap.h>
#include <linux/kmemleak.h>
+#include <linux/rcuref_long.h>
#include <linux/atomic.h>
@@ -175,7 +176,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
* fget-rcu pattern users need to be able to handle spurious
* refcount bumps we should reinitialize the reused file first.
*/
- atomic_long_set(&f->f_count, 1);
+ rcuref_long_init(&f->f_count, 1);
return 0;
}
@@ -480,7 +481,7 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
void fput(struct file *file)
{
- if (atomic_long_dec_and_test(&file->f_count)) {
+ if (rcuref_long_put_rcusafe(&file->f_count)) {
struct task_struct *task = current;
if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
@@ -513,7 +514,7 @@ void fput(struct file *file)
*/
void __fput_sync(struct file *file)
{
- if (atomic_long_dec_and_test(&file->f_count))
+ if (rcuref_long_put_rcusafe(&file->f_count))
__fput(file);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337650d562405500013f5c4cfed8eb6..a7831eaf0edd13ebe9765e532602688b317da315 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@
#include <linux/slab.h>
#include <linux/maple_tree.h>
#include <linux/rw_hint.h>
+#include <linux/rcuref_long.h>
#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1030,7 +1031,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
* @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
*/
struct file {
- atomic_long_t f_count;
+ rcuref_long_t f_count;
spinlock_t f_lock;
fmode_t f_mode;
const struct file_operations *f_op;
@@ -1078,15 +1079,15 @@ struct file_handle {
static inline struct file *get_file(struct file *f)
{
- long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
- WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
+ WARN_ONCE(!rcuref_long_get(&f->f_count),
+ "struct file::f_count incremented from zero; use-after-free condition present!\n");
return f;
}
struct file *get_file_rcu(struct file __rcu **f);
struct file *get_file_active(struct file **f);
-#define file_count(x) atomic_long_read(&(x)->f_count)
+#define file_count(x) rcuref_long_read(&(x)->f_count)
#define MAX_NON_LFS ((1UL<<31) - 1)
diff --git a/include/linux/rcuref_long.h b/include/linux/rcuref_long.h
index 7cedc537e5268e114f1a4221a4f1b0cb8d0e1241..10623119bb5038a1b171e31b8fd962a87e3670f5 100644
--- a/include/linux/rcuref_long.h
+++ b/include/linux/rcuref_long.h
@@ -85,11 +85,12 @@ __must_check bool rcuref_long_put_slowpath(rcuref_long_t *ref);
/*
* Internal helper. Do not invoke directly.
+ *
+ * Ideally we'd RCU_LOCKDEP_WARN() here but we can't since this api is
+ * used with SLAB_TYPSAFE_BY_RCU.
*/
static __always_inline __must_check bool __rcuref_long_put(rcuref_long_t *ref)
{
- RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
- "suspicious rcuref_put_rcusafe() usage");
/*
* Unconditionally decrease the reference count. The saturation and
* dead zones provide enough tolerance for this.
--
2.45.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 19:16 [PATCH RFC 0/4] fs: port files to rcuref_long_t Christian Brauner
` (3 preceding siblings ...)
2024-10-05 19:16 ` [PATCH RFC 4/4] fs: port files to rcuref_long_t Christian Brauner
@ 2024-10-05 21:42 ` Linus Torvalds
2024-10-05 22:01 ` Al Viro
` (2 more replies)
4 siblings, 3 replies; 17+ messages in thread
From: Linus Torvalds @ 2024-10-05 21:42 UTC (permalink / raw)
To: Christian Brauner; +Cc: linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, 5 Oct 2024 at 12:17, Christian Brauner <brauner@kernel.org> wrote:
>
> Note that atomic_inc_not_zero() contained a full memory barrier that we
> relied upon. But we only need an acquire barrier and so I replaced the
> second load from the file table with a smp_load_acquire(). I'm not
> completely sure this is correct or if we could get away with something
> else. Linus, maybe you have input here?
I don't think this is valid.
You go from this:
file = rcu_dereference_raw(*f);
if (!file)
return NULL;
if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
return ERR_PTR(-EAGAIN);
file_reloaded = rcu_dereference_raw(*f);
to this:
file = rcu_dereference_raw(*f);
if (!file)
return NULL;
if (unlikely(!rcuref_long_get(&file->f_count)))
return ERR_PTR(-EAGAIN);
file_reloaded = smp_load_acquire(f);
and the thing is, that rcuref_long_get() had better be a *full* memory barrier.
The smp_load_acquire() does absolutely nothing: it means that the load
will be done before *subsequent* memory operations. But it is not a
barrier to preceding memory operations.
So if rcuref_long_get() isn't ordered (and you made it relaxed, the
same way the existing rcuref_get() is), the CPU can basically move
that down past the smp_load_acquire(), so the code actually
effectively turns into this:
file = rcu_dereference_raw(*f);
if (!file)
return NULL;
file_reloaded = smp_load_acquire(f);
if (unlikely(!rcuref_long_get(&file->f_count)))
return ERR_PTR(-EAGAIN);
and now the "file_reloaded" thing is completely pointless.
Of course, you will never *see* this on x86, because all atomics are
always full memory barriers on x86, so when you test this all on that
noce 256-thread CPU, it all works fine. Because on x86, the "relaxed"
memory ordering isn't.
So no. You can't use atomic_long_add_negative_relaxed() in the file
handling, because it really does require a proper memory barrier.
Now, I do think that replacing the cmpxchg loop with
atomic_long_add_negative() is a good idea.
But you can't do it this way, and you can't use the RCUREF logic and
just extend it to the file ref.
I would suggest you take that whole "rcuref_long_get()" code and *not*
try to make it look like the rcuref code, but instead make it
explicitly just about the file counting. Because the file counting
really does have these special rules.
Also, honestly, the only reason the file counting is using a "long" is
because the code does *NOT* do overflow checking. But once you start
looking at the sign and do conditional increments, you can actually
just make the whole refcount be a "int" instead, and make "struct
file" potentially smaller.
And yes, that requires that people who get a file() will fail when the
file count goes negative, but that's *good*.
That's in fact exactly what we do for the page ref count, for the
exact same reason, btw.
IOW, I think you should make "f_count" be a regular "atomic_t", and
you should make the above sequence actually just more along the lines
of
file = rcu_dereference_raw(*f);
if (!file)
return NULL;
ret = atomic_inc_return(&file->f_count);
file_reloaded = smp_load_acquire(f);
and now you've got the right memory ordering, but you also have the
newly incremented return value and the file reloaded value, so now you
can continue on with something like this:
// Check that we didn't increment the refcount
// out of bounds or from zero..
if (ret > 1) {
struct file *file_reloaded_cmp = file_reloaded;
OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
if (file == file_reloaded_cmp)
return file_reloaded;
}
// Uhhuh, we got some other file or the file count had
// overflowed. Decrement the refcount again
fput(file);
return ERR_PTR(-EAGAIN);
and I think that might work, although the zero count case worries me
(ie 'fput twice').
Currently we avoid the fput twice because we use that
"inc_not_zero()". So that needs some thinking about.
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 21:42 ` [PATCH RFC 0/4] " Linus Torvalds
@ 2024-10-05 22:01 ` Al Viro
2024-10-05 22:14 ` Linus Torvalds
2024-10-05 22:01 ` Linus Torvalds
2024-10-06 9:48 ` Christian Brauner
2 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2024-10-05 22:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, Oct 05, 2024 at 02:42:25PM -0700, Linus Torvalds wrote:
> Also, honestly, the only reason the file counting is using a "long" is
> because the code does *NOT* do overflow checking. But once you start
> looking at the sign and do conditional increments, you can actually
> just make the whole refcount be a "int" instead, and make "struct
> file" potentially smaller.
I wouldn't bet on that. You *can* get over 2G references on 64bit box
with arseloads of memory, and we have no way to make fget() et.al.
fail when refcount gets that high - no valid error to return and
a serious DoS potential if we start doing that.
Overflow on leaks is one thing, a huge pile of real references is a
different story, and yes, we can get that. And boxen with 1Tb RAM
are not as exotic these days as they used to be...
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 22:01 ` Al Viro
@ 2024-10-05 22:14 ` Linus Torvalds
2024-10-05 22:28 ` Al Viro
0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2024-10-05 22:14 UTC (permalink / raw)
To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, 5 Oct 2024 at 15:01, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I wouldn't bet on that. You *can* get over 2G references on 64bit box
> with arseloads of memory, and we have no way to make fget() et.al.
> fail when refcount gets that high - no valid error to return and
> a serious DoS potential if we start doing that.
Why do you think we can't make fget() fail?
That's what we did for the page count, and it worked just fine. Two
billion file references is not a valid thing - it's an attack.
We don't have to fix every file ref increment - the same way we didn't
fix all the get_page() cases. We only need to fix the ones that are
directly triggerable from user accesses (ie temporary kernel refs are
fine).
And the main case that is user-accessible is exactly the fget()
variatiosn that result in get_file_rcu(). It's perfectly fine to say
"nope, I'm not getting you this file, you're clearly a bad person".
There are probably other cases that just do "get_file()" on an
existing 'struct file *" directly, but it's usually fairly hard to get
users to refer to file pointers without giving an actual file
descriptor number. But things like "split a vma" will do it (on the
file that is mapped), and probably a few other cases, but I bet it
really is just a few cases.
That said, the simple "atomic_inc_return()" suggestion of mine was
broken for other reasons anyway.
So I do think it needs some of the rcuref code complexity with the
rcuref_put_slowpath() turning the last ref into RCUREF_DEAD).
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 22:14 ` Linus Torvalds
@ 2024-10-05 22:28 ` Al Viro
2024-10-05 22:43 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2024-10-05 22:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, Oct 05, 2024 at 03:14:14PM -0700, Linus Torvalds wrote:
> There are probably other cases that just do "get_file()" on an
> existing 'struct file *" directly, but it's usually fairly hard to get
> users to refer to file pointers without giving an actual file
> descriptor number. But things like "split a vma" will do it (on the
> file that is mapped), and probably a few other cases, but I bet it
> really is just a few cases.
You can keep sending SCM_RIGHTS packets filled with references
to the same file, for example...
Anyway, which error would you return? EBADF? Because fget() callers
really expect a struct file reference or NULL. Sure, we can teach that
to return ERR_PTR() and deal with all callers (not that many these days),
but then there's fdget(). I've looked through those recently; more than
half treats "not open" as EBADF, but there's a bunch of EINVAL, etc.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 22:28 ` Al Viro
@ 2024-10-05 22:43 ` Linus Torvalds
2024-10-05 22:51 ` Al Viro
2024-10-06 9:55 ` Christian Brauner
0 siblings, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2024-10-05 22:43 UTC (permalink / raw)
To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, 5 Oct 2024 at 15:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> You can keep sending SCM_RIGHTS packets filled with references
> to the same file, for example...
Yeah, it's always SCM_RIGHTS or splice() that ends up having those
issues. But it's a really easy place to go "Oh, no, you can't do
that".
> Anyway, which error would you return? EBADF?
I really don't think it matters.
Come on - the only possible reason for two billion files is a bad
user. Do we care what error return an attacker gets? No. We're not
talking about breaking existing programs.
So EBADF sounds fine to me particularly from fget() and friends, since
they have that whole case of "no such file" anyway.
For try_get_page(), we also still have the WARN_ON_ONCE() if it ever
triggers. I don't recall having ever heard a report of it actually
triggering, but I do think we would do the same thing for the file ref
counting.
Anyway, maybe making f_count a 32-bit thing isn't worth it, simply
because 'struct file' is so much bigger (and less critical) than
'struct page' is anyway.
So I don't think that's actually the important thing here. If we keep
it a 'long', I won't lose any sleep over it.
But using the rcuref code doesn't work (regardless of whether 32-bit
or 'long' ref), because of the memory ordering issues.
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 22:43 ` Linus Torvalds
@ 2024-10-05 22:51 ` Al Viro
2024-10-06 9:55 ` Christian Brauner
1 sibling, 0 replies; 17+ messages in thread
From: Al Viro @ 2024-10-05 22:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, Oct 05, 2024 at 03:43:16PM -0700, Linus Torvalds wrote:
> Come on - the only possible reason for two billion files is a bad
> user. Do we care what error return an attacker gets? No. We're not
> talking about breaking existing programs.
Umm... It can be a bad failure mode of a buggy program, really
unpleasant to debug if the pile of references kept growing on stdin
of that sucker, which just happens to be controlling tty of your
login session; fork(2) starting to fail for everything that has it
opened is not fun.
That's really nit-picking though, since
> So I don't think that's actually the important thing here. If we keep
> it a 'long', I won't lose any sleep over it.
... exactly.
> But using the rcuref code doesn't work (regardless of whether 32-bit
> or 'long' ref), because of the memory ordering issues.
*nod*
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 22:43 ` Linus Torvalds
2024-10-05 22:51 ` Al Viro
@ 2024-10-06 9:55 ` Christian Brauner
1 sibling, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-06 9:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, Oct 05, 2024 at 03:43:16PM GMT, Linus Torvalds wrote:
> On Sat, 5 Oct 2024 at 15:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > You can keep sending SCM_RIGHTS packets filled with references
> > to the same file, for example...
>
> Yeah, it's always SCM_RIGHTS or splice() that ends up having those
> issues. But it's a really easy place to go "Oh, no, you can't do
> that".
>
> > Anyway, which error would you return? EBADF?
>
> I really don't think it matters.
>
> Come on - the only possible reason for two billion files is a bad
> user. Do we care what error return an attacker gets? No. We're not
> talking about breaking existing programs.
>
> So EBADF sounds fine to me particularly from fget() and friends, since
> they have that whole case of "no such file" anyway.
>
> For try_get_page(), we also still have the WARN_ON_ONCE() if it ever
> triggers. I don't recall having ever heard a report of it actually
> triggering, but I do think we would do the same thing for the file ref
> counting.
I really don't see the issue in making fget() _theoretically_ fail. And
I agree it would be a bug. We already have that WARN_ON() in get_file()
just to protect against completely theoretical overflow issues. If that
ever triggers anywhere we should worry about it.
>
> Anyway, maybe making f_count a 32-bit thing isn't worth it, simply
> because 'struct file' is so much bigger (and less critical) than
> 'struct page' is anyway.
>
> So I don't think that's actually the important thing here. If we keep
> it a 'long', I won't lose any sleep over it.
struct file is significantly smaller since the work I did the last two
cycles. So I don't think there's any pressure to go from long to int.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 21:42 ` [PATCH RFC 0/4] " Linus Torvalds
2024-10-05 22:01 ` Al Viro
@ 2024-10-05 22:01 ` Linus Torvalds
2024-10-06 10:21 ` Christian Brauner
2024-10-06 9:48 ` Christian Brauner
2 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2024-10-05 22:01 UTC (permalink / raw)
To: Christian Brauner; +Cc: linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, 5 Oct 2024 at 14:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> and I think that might work, although the zero count case worries me
> (ie 'fput twice').
>
> Currently we avoid the fput twice because we use that
> "inc_not_zero()". So that needs some thinking about.
Actually, it's worse. Even the
if (ret > 1)
case is dangerous, because we could have two or more threads doing
that atomic_inc_return() on a dead file descriptor at the same time.
So that approach is just broken.
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 22:01 ` Linus Torvalds
@ 2024-10-06 10:21 ` Christian Brauner
2024-10-06 18:09 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2024-10-06 10:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, Oct 05, 2024 at 03:01:45PM GMT, Linus Torvalds wrote:
> On Sat, 5 Oct 2024 at 14:42, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > and I think that might work, although the zero count case worries me
> > (ie 'fput twice').
> >
> > Currently we avoid the fput twice because we use that
> > "inc_not_zero()". So that needs some thinking about.
>
> Actually, it's worse. Even the
>
> if (ret > 1)
>
> case is dangerous, because we could have two or more threads doing
> that atomic_inc_return() on a dead file descriptor at the same time.
>
> So that approach is just broken.
Iiuc, then we should retain the deadzone handling but should replace
atomic_long_add_negative() with atomic_long_add_negative_relaxed().
So I would add:
static inline __must_check bool rcuref_long_inc(rcuref_long_t *ref)
{
/*
* Unconditionally increase the reference count with full
* ordering. The saturation and dead zones provide enough
* tolerance for this.
*/
if (likely(!atomic_long_add_negative(1, &ref->refcnt)))
return true;
/* Handle the cases inside the saturation and dead zones */
return rcuref_long_get_slowpath(ref);
}
Or did I miss something else?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-06 10:21 ` Christian Brauner
@ 2024-10-06 18:09 ` Linus Torvalds
2024-10-07 7:37 ` Christian Brauner
0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2024-10-06 18:09 UTC (permalink / raw)
To: Christian Brauner; +Cc: linux-fsdevel, Thomas Gleixner, Jann Horn
On Sun, 6 Oct 2024 at 03:21, Christian Brauner <brauner@kernel.org> wrote:
>
> Iiuc, then we should retain the deadzone handling but should replace
> atomic_long_add_negative() with atomic_long_add_negative_relaxed().
I assume you meant the other way around.
However, then it's not the same as the regular rcuref any more. It
looks similar, it sounds similar, but it's something completely
different.
I definitely do *not* want to have "rcuref_long_get()" fundamentally
different from just plain "rcuref_get()" .
Now, maybe we should just make the plain version also do a full memory
barrier. Honestly, we have exactly *one* user of rcyref_get(): the
networking code dst cache. Using the relaxed op clearly makes no
difference at all on x86, and it _probably_ makes little to no
difference on other relevant architectures either.
But if the networking people want their relaxed version, I really
really don't want rcuref_long_get() using non-relaxed one. And with
just one single user of the existing rcuref code, and now another
single user of the "long" variant, I really don't think it makes much
sense as a "library".
IOW, my gut feeling is that you'd actually be better off just taking
the rcuref code, changing it to using atomic_long_t and the
non-relaxed version, and renaming it to "file_ref", and keep it all
purely in fs/file.c (actually right now it's oddly split between
fs/file.c and fs/file_table.c, but whatever - you get the idea).
Trying to make it a library when it has one user and that one user
wants a very very different model than the other user that looked
similar smells like a BAD idea to me.
The whole "let's make it generic" is a disease, if the result only has
a single use.
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-06 18:09 ` Linus Torvalds
@ 2024-10-07 7:37 ` Christian Brauner
0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-07 7:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Thomas Gleixner, Jann Horn
On Sun, Oct 06, 2024 at 11:09:30AM GMT, Linus Torvalds wrote:
> On Sun, 6 Oct 2024 at 03:21, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Iiuc, then we should retain the deadzone handling but should replace
> > atomic_long_add_negative() with atomic_long_add_negative_relaxed().
>
> I assume you meant the other way around.
>
> However, then it's not the same as the regular rcuref any more. It
> looks similar, it sounds similar, but it's something completely
> different.
>
> I definitely do *not* want to have "rcuref_long_get()" fundamentally
> different from just plain "rcuref_get()" .
Right, that's why I added a separate helper. IOW, I didn't change the
behavior of the helper.
>
> Now, maybe we should just make the plain version also do a full memory
> barrier. Honestly, we have exactly *one* user of rcyref_get(): the
> networking code dst cache. Using the relaxed op clearly makes no
> difference at all on x86, and it _probably_ makes little to no
> difference on other relevant architectures either.
>
> But if the networking people want their relaxed version, I really
> really don't want rcuref_long_get() using non-relaxed one. And with
> just one single user of the existing rcuref code, and now another
> single user of the "long" variant, I really don't think it makes much
> sense as a "library".
>
> IOW, my gut feeling is that you'd actually be better off just taking
> the rcuref code, changing it to using atomic_long_t and the
> non-relaxed version, and renaming it to "file_ref", and keep it all
> purely in fs/file.c (actually right now it's oddly split between
> fs/file.c and fs/file_table.c, but whatever - you get the idea).
>
> Trying to make it a library when it has one user and that one user
> wants a very very different model than the other user that looked
> similar smells like a BAD idea to me.
Ok, sounds good.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC 0/4] fs: port files to rcuref_long_t
2024-10-05 21:42 ` [PATCH RFC 0/4] " Linus Torvalds
2024-10-05 22:01 ` Al Viro
2024-10-05 22:01 ` Linus Torvalds
@ 2024-10-06 9:48 ` Christian Brauner
2 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2024-10-06 9:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Thomas Gleixner, Jann Horn
On Sat, Oct 05, 2024 at 02:42:25PM GMT, Linus Torvalds wrote:
> On Sat, 5 Oct 2024 at 12:17, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Note that atomic_inc_not_zero() contained a full memory barrier that we
> > relied upon. But we only need an acquire barrier and so I replaced the
> > second load from the file table with a smp_load_acquire(). I'm not
> > completely sure this is correct or if we could get away with something
> > else. Linus, maybe you have input here?
>
> I don't think this is valid.
>
> You go from this:
>
> file = rcu_dereference_raw(*f);
> if (!file)
> return NULL;
> if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
> return ERR_PTR(-EAGAIN);
> file_reloaded = rcu_dereference_raw(*f);
>
> to this:
>
> file = rcu_dereference_raw(*f);
> if (!file)
> return NULL;
> if (unlikely(!rcuref_long_get(&file->f_count)))
> return ERR_PTR(-EAGAIN);
> file_reloaded = smp_load_acquire(f);
>
> and the thing is, that rcuref_long_get() had better be a *full* memory barrier.
>
> The smp_load_acquire() does absolutely nothing: it means that the load
> will be done before *subsequent* memory operations. But it is not a
> barrier to preceding memory operations.
Right, because we need the increment to be ordered against the second
load not the two loads against each other.
>
> So if rcuref_long_get() isn't ordered (and you made it relaxed, the
> same way the existing rcuref_get() is), the CPU can basically move
> that down past the smp_load_acquire(), so the code actually
> effectively turns into this:
>
> file = rcu_dereference_raw(*f);
> if (!file)
> return NULL;
> file_reloaded = smp_load_acquire(f);
> if (unlikely(!rcuref_long_get(&file->f_count)))
> return ERR_PTR(-EAGAIN);
>
> and now the "file_reloaded" thing is completely pointless.
>
> Of course, you will never *see* this on x86, because all atomics are
> always full memory barriers on x86, so when you test this all on that
> noce 256-thread CPU, it all works fine. Because on x86, the "relaxed"
> memory ordering isn't.
Yeah, I was aware of that but I don't have a 64bit arm box. It would be
nice for testing anyway.
>
> So no. You can't use atomic_long_add_negative_relaxed() in the file
> handling, because it really does require a proper memory barrier.
>
> Now, I do think that replacing the cmpxchg loop with
> atomic_long_add_negative() is a good idea.
>
> But you can't do it this way, and you can't use the RCUREF logic and
> just extend it to the file ref.
>
> I would suggest you take that whole "rcuref_long_get()" code and *not*
> try to make it look like the rcuref code, but instead make it
> explicitly just about the file counting. Because the file counting
> really does have these special rules.
>
> Also, honestly, the only reason the file counting is using a "long" is
> because the code does *NOT* do overflow checking. But once you start
> looking at the sign and do conditional increments, you can actually
> just make the whole refcount be a "int" instead, and make "struct
> file" potentially smaller.
I've already shrunk it down to three cachelines.
>
> And yes, that requires that people who get a file() will fail when the
> file count goes negative, but that's *good*.
Yeah, the overflow protection would be neat to have.
^ permalink raw reply [flat|nested] 17+ messages in thread