* Re: [PATCH bpf-next v10 4/8] bpf: Add syscall common attributes support for prog_load
From: Andrii Nakryiko @ 2026-02-18 18:44 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <eb82cc40-e5c0-4f23-ad92-92633ccb2e0d@linux.dev>
On Wed, Feb 11, 2026 at 9:50 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 12/2/26 06:08, Andrii Nakryiko wrote:
> > On Wed, Feb 11, 2026 at 7:13 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
>
> [...]
>
> >> diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
> >> index e31747b84fe2..a2b41bf5e9cb 100644
> >> --- a/kernel/bpf/log.c
> >> +++ b/kernel/bpf/log.c
> >> @@ -864,14 +864,43 @@ void print_insn_state(struct bpf_verifier_env *env, const struct bpf_verifier_st
> >> print_verifier_state(env, vstate, frameno, false);
> >> }
> >>
> >> +static bool bpf_log_attrs_set(u64 log_buf, u32 log_size, u32 log_level)
> >> +{
> >> + return log_buf && log_size && log_level;
> >> +}
> >> +
> >> +static bool bpf_log_attrs_diff(struct bpf_common_attr *common, u64 log_buf, u32 log_size,
> >> + u32 log_level)
> >> +{
> >> + return bpf_log_attrs_set(log_buf, log_size, log_level) &&
> >> + bpf_log_attrs_set(common->log_buf, common->log_size, common->log_level) &&
> >> + (log_buf != common->log_buf || log_size != common->log_size ||
> >> + log_level != common->log_level);
> >> +}
> >> +
> >
> > I'm not sure this check is doing what we discussed previously?... If
> > log_buf is set, but log_size or log_level is zero, you'll just ignore
> > log_buf here...
> >
> > Maybe let's keep it super simple:
> >
> > if (log_buf && common->log_buf && log_buf != common->log_buf)
> > return -EINVAL;
> > /* same for log_size, log_level, log_true_size */
> >
> > and then below just
> >
> > log->log_buf = u64_to_user_ptr(log_buf ?: common->log_buf);
> > log->log_size = log_size ?: common->log_size;
> >
> > and so on
> >
> >
> > We can be stricter than that, of course (as in, all triplets have to
> > be completely set in either/both common_attr and attr, and they should
> > completely match), but it's just more code for little benefit.
> >
>
> We cannot mix fields across the two sources. For example, using log_buf
> from attr together with common->log_size when log_size is zero would mix
> the configuration and make the effective log setup ambiguous.
>
> The intent is to align strictly with the semantics enforced by
> bpf_verifier_log_attr_valid():
>
> * log_buf and log_size must be specified together.
> * A non-NULL log_buf requires log_level != 0.
> * All values must pass basic sanity checks.
>
> Given that contract, we should:
>
> 1. Validate the log attributes from attr and common independently using
> the same helper.
> 2. if both sides provide log buffers, require the tuples to match
> exactly.
> 3. select either the attr tuple or the common tuple as a whole — never
> mix fields across the two.
>
> The patch below implements this by reusing bpf_verifier_log_attr_valid()
> for both sources and resolving conflicts before selecting the effective
> log configuration.
> >
> >> int bpf_log_attr_init(struct bpf_log_attr *log, u64 log_buf, u32 log_size, u32 log_level,
> >> - u32 __user *log_true_size)
> >> + u32 __user *log_true_size, struct bpf_common_attr *common, bpfptr_t uattr,
> >> + u32 size)
> >> {
> >> + if (bpf_log_attrs_diff(common, log_buf, log_size, log_level))
> >> + return -EINVAL;
> >> +
> >> memset(log, 0, sizeof(*log));
> >> log->log_buf = u64_to_user_ptr(log_buf);
> >> log->log_size = log_size;
> >> log->log_level = log_level;
> >> log->log_true_size = log_true_size;
> >> +
> >> + if (!log_buf && common->log_buf) {
> >> + log->log_buf = u64_to_user_ptr(common->log_buf);
> >> + log->log_size = common->log_size;
> >> + log->log_level = common->log_level;
> >> + if (size >= offsetofend(struct bpf_common_attr, log_true_size))
> >> + log->log_true_size = uattr.user +
> >> + offsetof(struct bpf_common_attr, log_true_size);
> >> + else
> >> + log->log_true_size = NULL;
> >
> > why not treat log_true_size same as log_buf/log_level/log_size? If
> > both are provided, they should match, and then we don't have a
> > possibility of inconsistency?
> >
> log_true_size is different from log_buf/log_size/log_level.
>
> It is not a regular attribute stored in either union bpf_attr or
> struct bpf_common_attr. Instead, it is a user pointer derived from
> uattr.user + offset.
>
> As a result, the computed log_true_size pointer for union bpf_attr
> and for struct bpf_common_attr will always differ, because they are
> based on different base user pointers (uattr.user vs
> uattr_common.user).
>
> So unlike the other log attributes, pointer equality is not a
> meaningful consistency check for log_true_size. The only sensible
> rule is that whichever side provides the effective log triplet also
> determines the write-back destination.
yeah, you are right, I forgot that log_true_size is not a pointer
itself, it's just a field in user-provided attrs. I'll check what you
did in v11, let's continue there.
>
> Thanks,
> Leon
>
> ---
>
> Based-on commit 19de32d4cb58 ("selftests/bpf: Migrate align.c tests to
> test_loader framework").
>
> From 32ec02c06d2abacbde17a45edbda46ef8a16fa2d Mon Sep 17 00:00:00 2001
> From: Leon Hwang <leon.hwang@linux.dev>
> Date: Wed, 11 Feb 2026 23:11:11 +0800
> Subject: [PATCH bpf-next v11 4/8] bpf: Add syscall common attributes support
> for prog_load
>
> BPF_PROG_LOAD can now take log parameters from both union bpf_attr and
> struct bpf_common_attr. The merge rules are:
>
> - if both sides provide a complete log tuple (buf/size/level) and they
> match, use it;
> - if only one side provides log parameters, use that one;
> - if both sides provide complete tuples but they differ, return -EINVAL.
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> include/linux/bpf_verifier.h | 3 ++-
> kernel/bpf/log.c | 38 ++++++++++++++++++++++++++++--------
> kernel/bpf/syscall.c | 2 +-
> 3 files changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index dbd9bdb955b3..34f28d40022a 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -643,7 +643,8 @@ struct bpf_log_attr {
> };
>
> int bpf_log_attr_init(struct bpf_log_attr *log, u64 log_buf, u32
> log_size, u32 log_level,
> - u32 __user *log_true_size);
> + u32 __user *log_true_size, struct bpf_common_attr *common,
> bpfptr_t uattr,
> + u32 size);
> int bpf_log_attr_finalize(struct bpf_log_attr *attr, struct
> bpf_verifier_log *log);
>
> #define BPF_MAX_SUBPROGS 256
> diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
> index e31747b84fe2..47bf496b673e 100644
> --- a/kernel/bpf/log.c
> +++ b/kernel/bpf/log.c
> @@ -13,17 +13,17 @@
>
> #define verbose(env, fmt, args...) bpf_verifier_log_write(env, fmt, ##args)
>
> -static bool bpf_verifier_log_attr_valid(const struct bpf_verifier_log *log)
> +static bool bpf_verifier_log_attr_valid(u32 log_level, char __user
> *log_buf, u32 log_size)
> {
> /* ubuf and len_total should both be specified (or not) together */
> - if (!!log->ubuf != !!log->len_total)
> + if (!!log_buf != !!log_size)
> return false;
> /* log buf without log_level is meaningless */
> - if (log->ubuf && log->level == 0)
> + if (log_buf && log_level == 0)
> return false;
> - if (log->level & ~BPF_LOG_MASK)
> + if (log_level & ~BPF_LOG_MASK)
> return false;
> - if (log->len_total > UINT_MAX >> 2)
> + if (log_size > UINT_MAX >> 2)
> return false;
> return true;
> }
> @@ -36,7 +36,7 @@ int bpf_vlog_init(struct bpf_verifier_log *log, u32
> log_level,
> log->len_total = log_size;
>
> /* log attributes have to be sane */
> - if (!bpf_verifier_log_attr_valid(log))
> + if (!bpf_verifier_log_attr_valid(log_level, log_buf, log_size))
> return -EINVAL;
>
> return 0;
> @@ -865,13 +865,35 @@ void print_insn_state(struct bpf_verifier_env
> *env, const struct bpf_verifier_st
> }
>
> int bpf_log_attr_init(struct bpf_log_attr *log, u64 log_buf, u32
> log_size, u32 log_level,
> - u32 __user *log_true_size)
> + u32 __user *log_true_size, struct bpf_common_attr *common,
> bpfptr_t uattr,
> + u32 size)
> {
> + char __user *ubuf_common = u64_to_user_ptr(common->log_buf);
> + char __user *ubuf = u64_to_user_ptr(log_buf);
> +
> + if (!bpf_verifier_log_attr_valid(common->log_level, ubuf_common,
> common->log_size) ||
> + !bpf_verifier_log_attr_valid(log_level, ubuf, log_size))
> + return -EINVAL;
> +
> + if (ubuf && ubuf_common && (ubuf != ubuf_common || log_size !=
> common->log_size ||
> + log_level != common->log_level))
> + return -EINVAL;
> +
> memset(log, 0, sizeof(*log));
> - log->log_buf = u64_to_user_ptr(log_buf);
> + log->log_buf = ubuf;
> log->log_size = log_size;
> log->log_level = log_level;
> log->log_true_size = log_true_size;
> +
> + if (!ubuf && ubuf_common) {
> + log->log_buf = ubuf_common;
> + log->log_size = common->log_size;
> + log->log_level = common->log_level;
> + log->log_true_size = NULL;
> + if (size >= offsetofend(struct bpf_common_attr, log_true_size))
> + log->log_true_size = uattr.user +
> + offsetof(struct bpf_common_attr, log_true_size);
> + }
> return 0;
> }
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e86674811996..17116603ff51 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -6247,7 +6247,7 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t
> uattr, unsigned int size,
> if (from_user && size >= offsetofend(union bpf_attr, log_true_size))
> log_true_size = uattr.user + offsetof(union bpf_attr, log_true_size);
> err = bpf_log_attr_init(&attr_log, attr.log_buf, attr.log_size,
> attr.log_level,
> - log_true_size);
> + log_true_size, &attr_common, uattr_common, size_common);
> err = err ?: bpf_prog_load(&attr, uattr, &attr_log);
> break;
> case BPF_OBJ_PIN:
> --
> 2.52.0
>
>
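For illustration only (not part of the patch): a minimal userspace sketch of the selection rule described in the v11 commit message, where each source is validated on its own, two provided tuples must match exactly, and one tuple is taken as a whole. The names log_tuple, log_tuple_valid(), log_tuple_select() and LOG_MASK are hypothetical stand-ins, and the mask value is an assumption rather than a copy of BPF_LOG_MASK.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define LOG_MASK 0x7u	/* assumed stand-in for BPF_LOG_MASK */

/* Hypothetical model of one (log_buf, log_size, log_level) triplet. */
struct log_tuple {
	uint64_t buf;
	uint32_t size;
	uint32_t level;
};

/* Mirrors the checks listed for bpf_verifier_log_attr_valid(). */
static bool log_tuple_valid(const struct log_tuple *t)
{
	if (!!t->buf != !!t->size)	/* buf and size go together */
		return false;
	if (t->buf && t->level == 0)	/* a buffer needs a log level */
		return false;
	if (t->level & ~LOG_MASK)
		return false;
	if (t->size > UINT_MAX >> 2)
		return false;
	return true;
}

/*
 * Pick the effective tuple without mixing fields across sources.
 * *from_common tells the caller which side owns log_true_size.
 */
static int log_tuple_select(const struct log_tuple *attr,
			    const struct log_tuple *common,
			    struct log_tuple *out, bool *from_common)
{
	if (!log_tuple_valid(attr) || !log_tuple_valid(common))
		return -EINVAL;

	if (attr->buf && common->buf &&
	    (attr->buf != common->buf || attr->size != common->size ||
	     attr->level != common->level))
		return -EINVAL;

	*from_common = !attr->buf && common->buf;
	*out = *from_common ? *common : *attr;
	return 0;
}
```

In this model the from_common flag plays the role of the write-back decision: whichever side supplies the effective triplet also decides where log_true_size is written.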
* Re: [PATCH v8 08/17] ext4: Report case sensitivity in fileattr_get
From: Theodore Tso @ 2026-02-19 13:14 UTC (permalink / raw)
To: Chuck Lever
Cc: Al Viro, Christian Brauner, Jan Kara, linux-fsdevel, linux-ext4,
linux-xfs, linux-cifs, linux-nfs, linux-api, linux-f2fs-devel,
hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
almaz.alexandrovich, slava, glaubitz, frank.li, adilger.kernel,
cem, sfrench, pc, ronniesahlberg, sprasad, trondmy, anna, jaegeuk,
chao, hansg, senozhatsky, Chuck Lever
In-Reply-To: <20260217214741.1928576-9-cel@kernel.org>
On Tue, Feb 17, 2026 at 04:47:32PM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Report ext4's case sensitivity behavior via the FS_XFLAG_CASEFOLD
> flag. ext4 always preserves case at rest.
>
> Case sensitivity is a per-directory setting in ext4. If the queried
> inode is a casefolded directory, report case-insensitive; otherwise
> report case-sensitive (standard POSIX behavior).
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
From: Askar Safin @ 2026-02-19 23:42 UTC (permalink / raw)
To: rob; +Cc: containers, initramfs, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <6375f293-709c-41b8-a23d-12010baa3cae@landley.net>
Rob Landley <rob@landley.net>:
> Also, could you guys make CONFIG_DEVTMPFS_MOUNT work with initramfs?
I did something similar:
https://lore.kernel.org/initramfs/20260219210312.3468980-1-safinaskar@gmail.com/T/#u
Does this solve your problem?
--
Askar Safin
* [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
From: André Almeida @ 2026-02-20 20:26 UTC (permalink / raw)
To: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, Liam R . Howlett
Cc: kernel-dev, linux-api, linux-kernel, André Almeida
In-Reply-To: <20260220202620.139584-1-andrealmeid@igalia.com>
Create a reproducer for https://sourceware.org/bugzilla/show_bug.cgi?id=14485
This is not supposed to be merged.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
robust_bug.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 178 insertions(+)
create mode 100644 robust_bug.c
diff --git a/robust_bug.c b/robust_bug.c
new file mode 100644
index 000000000000..1ade4e6d66dd
--- /dev/null
+++ b/robust_bug.c
@@ -0,0 +1,178 @@
+/*
+ * gcc robust_bug.c -o robust_bug
+ *
+ * This is a reproducer for "File corruption race condition in robust
+ * mutex unlocking" from https://sourceware.org/bugzilla/show_bug.cgi?id=14485
+ *
+ * To increase the chances of reaching the race condition, a delay can be added
+ * to the kernel function handle_futex_death(), just before the user memory
+ * write futex_cmpxchg_value_locked().
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <time.h>
+
+#define cpu_relax() asm volatile("rep; nop" ::: "memory")
+
+/*
+ * This struct is an example of a lock struct, shared between the threads.
+ */
+struct lock_struct {
+ uint32_t futex;
+ struct robust_list list;
+};
+
+static struct lock_struct *lock;
+
+/*
+ * This is the struct that we are going to use to allocate on top of the
+ * freed memory to observe the race condition.
+ */
+struct another_struct {
+ uint64_t value;
+};
+
+static pthread_barrier_t barrier;
+
+static int set_robust_list(struct robust_list_head *head)
+{
+ return syscall(SYS_set_robust_list, head, sizeof(*head));
+}
+
+/*
+ * This thread emulates the behaviour of a thread releasing a robust mutex:
+ * - It starts by adding the mutex to the op_pending field
+ * - Remove the mutex from the robust list
+ * - Release the lock and wake up waiters
+ * - Remove the mutex from the op_pending field
+ *
+ * However, this thread dies before doing this last step, leaving the mutex
+ * behind in the op_pending field.
+ */
+void *func_b(void *arg)
+{
+ static struct robust_list_head head;
+ pid_t tid = gettid() | FUTEX_WAITERS;
+
+ /*
+ * Initial thread setup. This would happen in an earlier stage of the
+ * thread execution.
+ */
+ set_robust_list(&head);
+ head.list.next = &head.list;
+ head.futex_offset = (size_t) offsetof(struct lock_struct, futex) -
+ (size_t) offsetof(struct lock_struct, list);
+
+ /* This thread takes the lock... */
+ lock->futex = tid;
+
+ /* ...would do some work here... */
+
+ /*
+ * ...and starts the release process. Adds the mutex to be released on
+ * the op_pending.
+ */
+ head.list_op_pending = &lock->list;
+
+ /* Barrier to synchronize thread B taking the lock */
+ pthread_barrier_wait(&barrier);
+ usleep(100);
+
+ /*
+ * Here we would release the lock and wake up any waiters.
+ *
+ * lock->futex = LOCK_FREE;
+ * futex_wake(lock->futex, 1);
+ */
+
+ /*
+ * We would remove the lock from op_pending, but we emulate a thread
+ * exiting before doing it.
+ */
+ return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+ struct another_struct *new;
+ uint64_t original_val;
+ pthread_t thread_b;
+ uint32_t value;
+ int ret;
+
+ ret = pthread_barrier_init(&barrier, NULL, 2);
+ if (ret) {
+ puts("pthread_barrier_init failed");
+ return -1;
+ }
+
+ /* Initialize the lock */
+ lock = mmap(NULL, sizeof(struct lock_struct), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (lock == MAP_FAILED) {
+ puts("mmap failed");
+ return -1;
+ }
+ memset(lock, 0, sizeof(*lock));
+
+ /* Create the thread B that will take the lock */
+ pthread_create(&thread_b, NULL, func_b, NULL);
+
+ /* Barrier to synchronize thread B taking the lock */
+ pthread_barrier_wait(&barrier);
+
+ /* Copy this value as we will use it later */
+ value = lock->futex;
+
+ /*
+ * Here, this thread would do the following:
+ * - It would wait for the lock, and be woken up by thread B
+ * - Take the lock, do some work, and release it
+ * - After releasing the lock and being the last user, it can correctly
+ * free it
+ */
+ munmap(lock, sizeof(struct lock_struct));
+
+ /*
+ * After freeing the lock, this thread allocates memory, which
+ * happens to be at the same address of the lock, and by chance, it fills
+ * the memory with the TID of thread B.
+ */
+ new = mmap(NULL, sizeof(struct another_struct), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (new == MAP_FAILED) {
+ puts("mmap failed");
+ return -1;
+ }
+ if ((uintptr_t) lock != (uintptr_t) new) {
+ puts("mmap got a different address");
+ return -1;
+ }
+
+ new->value = ((uint64_t) value << 32) + value;
+
+ /* Create a backup of the current value */
+ original_val = new->value;
+
+ /* Wait for the memory corruption to happen... */
+ while (new->value == original_val)
+ cpu_relax();
+
+ /* ...and now the kernel just overwrote an unrelated user memory! */
+ printf("Memory was corrupted by the kernel: %lx vs %lx\n",
+ original_val, new->value);
+
+ munmap(new, sizeof(struct another_struct));
+
+ return 0;
+}
--
2.53.0
* [RFC PATCH 2/2] futex: hack: Add debug delays
From: André Almeida @ 2026-02-20 20:26 UTC (permalink / raw)
To: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, Liam R . Howlett
Cc: kernel-dev, linux-api, linux-kernel, André Almeida
In-Reply-To: <20260220202620.139584-1-andrealmeid@igalia.com>
Add delays to handle_futex_death() to increase the chance of hitting the race
condition.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
kernel/futex/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cf7e610eac42..d409b3368cb3 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -44,6 +44,7 @@
#include <linux/prctl.h>
#include <linux/mempolicy.h>
#include <linux/mmap_lock.h>
+#include <linux/delay.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -1095,6 +1096,12 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
* does not guarantee R/W access. If that fails we
* give up and leave the futex locked.
*/
+
+ if (!strcmp(current->comm, "robust_bug")) {
+ printk("robust_bug is exiting\n");
+ msleep(500);
+ }
+
if ((err = futex_cmpxchg_value_locked(&nval, uaddr, uval, mval))) {
switch (err) {
case -EFAULT:
@@ -1112,6 +1119,9 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
}
}
+ if (!strcmp(current->comm, "robust_bug"))
+ printk("memory written\n");
+
if (nval != uval)
goto retry;
--
2.53.0
* [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: André Almeida @ 2026-02-20 20:26 UTC (permalink / raw)
To: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, Liam R . Howlett
Cc: kernel-dev, linux-api, linux-kernel, André Almeida
During LPC 2025, I presented a session about creating a new syscall for
robust_list[0][1]. However, most of the session discussion wasn't much related
to the new syscall itself, but much more related to an old bug that exists in
the current robust_list mechanism.
Since at least 2012, there's an open bug reporting a race condition, as
Carlos O'Donell pointed out:
"File corruption race condition in robust mutex unlocking"
https://sourceware.org/bugzilla/show_bug.cgi?id=14485
To help understand the bug, I've created a reproducer (patch 1/2) and a
companion kernel hack (patch 2/2) that helps to make the race condition
more likely. When the bug happens, the reproducer shows a message
comparing the original memory with the corrupted one:
"Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
I'm not sure yet what the appropriate approach to fix it would be, so I
decided to reach out to the community before moving forward in some direction.
One suggestion from Peter[2] revolves around serializing the mmap() and the
robust list exit path, which might cause overhead for the common case,
where list_op_pending is empty.
However, given that there's a new interface being prepared, this could
also be an opportunity to rethink how list_op_pending works, and get
rid of the race condition by design.
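As a worked example of the corrupted value in the message quoted earlier: a small sketch, assuming a little-endian machine, the futex word sitting in the low 32 bits of the reused 64-bit value, and an exit-path update of the form (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED when the TID bits match the dying thread. The constants are redefined locally with their UAPI values, and owner_died_word() and corrupt_low_word() are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

#define FUTEX_WAITERS		0x80000000u
#define FUTEX_OWNER_DIED	0x40000000u
#define FUTEX_TID_MASK		0x3fffffffu

/* Model of the word written on thread exit, under the assumption above. */
static uint32_t owner_died_word(uint32_t uval, uint32_t tid)
{
	if ((uval & FUTEX_TID_MASK) != tid)
		return uval;	/* not owned by the dying thread: untouched */
	return (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}

/* Apply that update to the low 32 bits of a reused 64-bit object. */
static uint64_t corrupt_low_word(uint64_t value, uint32_t tid)
{
	uint32_t low = owner_died_word((uint32_t)value, tid);

	return (value & 0xffffffff00000000ull) | low;
}
```

With TID 0x1fe8d, the reproducer stores 0x8001fe8d (TID plus FUTEX_WAITERS) in both halves; in this model the low half becomes 0xc0000000, which is consistent with 8001fe8d8001fe8d turning into 8001fe8dc0000000.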
Feedback is very much welcome.
Thanks!
André
[0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
[1] https://lpc.events/event/19/contributions/2108/
[2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
André Almeida (2):
futex: Create reproducer for robust_list race condition
futex: Add debug delays
kernel/futex/core.c | 10 +++
robust_bug.c | 178 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 188 insertions(+)
create mode 100644 robust_bug.c
--
2.53.0
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Liam R. Howlett @ 2026-02-20 20:51 UTC (permalink / raw)
To: André Almeida
Cc: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, kernel-dev, linux-api, linux-kernel,
Suren Baghdasaryan, Lorenzo Stoakes, Michal Hocko
In-Reply-To: <20260220202620.139584-1-andrealmeid@igalia.com>
+Cc Suren, Lorenzo, and Michal
* André Almeida <andrealmeid@igalia.com> [260220 15:27]:
> During LPC 2025, I presented a session about creating a new syscall for
> robust_list[0][1]. However, most of the session discussion wasn't much related
> to the new syscall itself, but much more related to an old bug that exists in
> the current robust_list mechanism.
Ah, sorry for hijacking the session, that was not my intention, but this
needs to be addressed before we propagate the issue into the next
iteration.
>
> Since at least 2012, there's an open bug reporting a race condition, as
> Carlos O'Donell pointed out:
>
> "File corruption race condition in robust mutex unlocking"
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> To help understand the bug, I've created a reproducer (patch 1/2) and a
> companion kernel hack (patch 2/2) that helps to make the race condition
> more likely. When the bug happens, the reproducer shows a message
> comparing the original memory with the corrupted one:
>
> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
>
> I'm not sure yet what would be the appropriated approach to fix it, so I
> decided to reach the community before moving forward in some direction.
> One suggestion from Peter[2] resolves around serializing the mmap() and the
> robust list exit path, which might cause overheads for the common case,
> where list_op_pending is empty.
>
> However, giving that there's a new interface being prepared, this could
> also give the opportunity to rethink how list_op_pending works, and get
> rid of the race condition by design.
>
> Feedback is very much welcome.
There was a delay added to the oom reaper for these tasks [1] by commit
e4a38402c36e ("oom_kill.c: futex: delay the OOM reaper to allow time for
proper futex cleanup")
We did discuss marking the vmas as needing to be skipped by the oom
manager, but no path forward was clear. It's also not clear if
that's the only area where such a problem exists.
[1]. https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
>
> Thanks!
> André
>
> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
> [1] https://lpc.events/event/19/contributions/2108/
> [2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
>
> André Almeida (2):
> futex: Create reproducer for robust_list race condition
> futex: Add debug delays
>
> kernel/futex/core.c | 10 +++
> robust_bug.c | 178 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 188 insertions(+)
> create mode 100644 robust_bug.c
>
> --
> 2.53.0
>
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Mathieu Desnoyers @ 2026-02-20 21:42 UTC (permalink / raw)
To: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
Lorenzo Stoakes, Michal Hocko
Cc: kernel-dev, linux-api, linux-kernel, libc-alpha
In-Reply-To: <20260220202620.139584-1-andrealmeid@igalia.com>
+CC libc-alpha.
On 2026-02-20 15:26, André Almeida wrote:
> During LPC 2025, I presented a session about creating a new syscall for
> robust_list[0][1]. However, most of the session discussion wasn't much related
> to the new syscall itself, but much more related to an old bug that exists in
> the current robust_list mechanism.
>
> Since at least 2012, there's an open bug reporting a race condition, as
> Carlos O'Donell pointed out:
>
> "File corruption race condition in robust mutex unlocking"
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> To help understand the bug, I've created a reproducer (patch 1/2) and a
> companion kernel hack (patch 2/2) that helps to make the race condition
> more likely. When the bug happens, the reproducer shows a message
> comparing the original memory with the corrupted one:
>
> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
>
> I'm not sure yet what would be the appropriated approach to fix it, so I
> decided to reach the community before moving forward in some direction.
> One suggestion from Peter[2] resolves around serializing the mmap() and the
> robust list exit path, which might cause overheads for the common case,
> where list_op_pending is empty.
>
> However, giving that there's a new interface being prepared, this could
> also give the opportunity to rethink how list_op_pending works, and get
> rid of the race condition by design.
>
> Feedback is very much welcome.
Looking at this bug, one thing I'm starting to consider is that it
appears to be an issue inherent to the lack of synchronization between
pthread_mutex_destroy(3) and the per-thread list_op_pending fields
and not so much a kernel issue.
Here is why I think the issue is purely userspace:
Let's suppose we have a shared memory area across Process 1 and Process 2,
which internally has its own custom memory allocator in userspace to
allocate/free space within that shared memory.
Process 1, Thread A stumbles through the scenario highlighted by this bug, and
basically gets preempted at this FIXME in libc __pthread_mutex_unlock_full():
if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
& FUTEX_WAITERS) != 0))
futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
/* We must clear op_pending after we release the mutex.
FIXME However, this violates the mutex destruction requirements
because another thread could acquire the mutex, destroy it, and
reuse the memory for something else; then, if this thread crashes,
and the memory happens to have a value equal to the TID, the kernel
will believe it is still related to the mutex (which has been
destroyed already) and will modify some other random object. */
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
Then Process 1, Thread B runs, grabs the lock, releases it, and based on
program state it knows it can pthread_mutex_destroy() this lock, free its
associated memory through the custom shared memory allocator, and allocate
it for other purposes. Then we get to the point where Process 1 is
killed, and where the robust futex kernel code corrupts data in shared
memory because of the dangling list_op_pending pointer.
That shared memory data is still observable by Process 2, which will get a
corrupted state.
Notice how this all happens without any munmap(2)/mmap(2) in the sequence ?
This is why I think this is purely a userspace issue rather than an issue
we can solve by adding extra synchronization in the kernel.
The one point we have in that sequence where I think we can add synchronization
is pthread_mutex_destroy(3) in libc. One possible "big hammer" solution would be
to make pthread_mutex_destroy iterate over all other threads' list_op_pending
fields and busy-wait if it finds that the mutex address is in use. It would of
course only have to do that for robust futexes.
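A rough sketch of what that "big hammer" could look like, written as a standalone userspace model rather than libc code. It assumes a hypothetical fixed-size registry with one slot per thread mirroring that thread's list_op_pending; the names op_pending, op_pending_set(), mutex_is_pending() and robust_mutex_destroy_wait() are made up for illustration, and a registry like this only covers threads of one process.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 64	/* hypothetical registry size */

/* One slot per thread, mirroring that thread's list_op_pending. */
static _Atomic(void *) op_pending[MAX_THREADS];

/* Unlock path: publish (or clear, with NULL) the mutex being released. */
static void op_pending_set(size_t slot, void *mutex)
{
	assert(slot < MAX_THREADS);
	atomic_store_explicit(&op_pending[slot], mutex, memory_order_seq_cst);
}

/* Does any thread still have this mutex address in its pending slot? */
static bool mutex_is_pending(void *mutex)
{
	for (size_t i = 0; i < MAX_THREADS; i++) {
		if (atomic_load_explicit(&op_pending[i],
					 memory_order_acquire) == mutex)
			return true;
	}
	return false;
}

/* Destroy path: wait until no unlock in flight references the mutex. */
static void robust_mutex_destroy_wait(void *mutex)
{
	while (mutex_is_pending(mutex))
		sched_yield();
}
```

The scan costs one load per registered thread on every destroy of a robust mutex, which is the nr_threads cost that the alternatives in the next paragraph aim to reduce.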
If that big hammer solution is not fast enough for many-threaded use-cases,
then we can think of other approaches such as adding a reference counter
in the mutex structure, or introducing hazard pointers in userspace to reduce
synchronization iteration from nr_threads to nr_cpus (or even down to max
rseq mm_cid).
Thoughts ?
Thanks,
Mathieu
>
> Thanks!
> André
>
> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
> [1] https://lpc.events/event/19/contributions/2108/
> [2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Mathieu Desnoyers @ 2026-02-20 22:41 UTC (permalink / raw)
To: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
Lorenzo Stoakes, Michal Hocko
Cc: kernel-dev, linux-api, linux-kernel, libc-alpha
In-Reply-To: <0d334517-63ee-46c9-884d-6c2ae8388b87@efficios.com>
On 2026-02-20 16:42, Mathieu Desnoyers wrote:
> +CC libc-alpha.
>
> On 2026-02-20 15:26, André Almeida wrote:
>> During LPC 2025, I presented a session about creating a new syscall for
>> robust_list[0][1]. However, most of the session discussion wasn't much
>> related
>> to the new syscall itself, but much more related to an old bug that
>> exists in
>> the current robust_list mechanism.
>>
>> Since at least 2012, there's an open bug reporting a race condition, as
>> Carlos O'Donell pointed out:
>>
>> "File corruption race condition in robust mutex unlocking"
>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>
>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>> companion kernel hack (patch 2/2) that helps to make the race condition
>> more likely. When the bug happens, the reproducer shows a message
>> comparing the original memory with the corrupted one:
>>
>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>> 8001fe8dc0000000"
>>
>> I'm not sure yet what would be the appropriated approach to fix it, so I
>> decided to reach the community before moving forward in some direction.
>> One suggestion from Peter[2] resolves around serializing the mmap()
>> and the
>> robust list exit path, which might cause overheads for the common case,
>> where list_op_pending is empty.
>>
>> However, giving that there's a new interface being prepared, this could
>> also give the opportunity to rethink how list_op_pending works, and get
>> rid of the race condition by design.
>>
>> Feedback is very much welcome.
>
> Looking at this bug, one thing I'm starting to consider is that it
> appears to be an issue inherent to lack of synchronization between
> pthread_mutex_destroy(3) and the per-thread list_op_pending fields
> and not so much a kernel issue.
>
> Here is why I think the issue is purely userspace:
>
> Let's suppose we have a shared memory area across Process 1 and Process 2,
> which internally has its own custom memory allocator in userspace to
> allocate/free space within that shared memory.
>
> Process 1, Thread A stumbles through the scenario highlighted by this bug,
> and basically gets preempted at this FIXME in libc
> __pthread_mutex_unlock_full():
>
>   if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>                          & FUTEX_WAITERS) != 0))
>     futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>
>   /* We must clear op_pending after we release the mutex.
>      FIXME However, this violates the mutex destruction requirements
>      because another thread could acquire the mutex, destroy it, and
>      reuse the memory for something else; then, if this thread crashes,
>      and the memory happens to have a value equal to the TID, the kernel
>      will believe it is still related to the mutex (which has been
>      destroyed already) and will modify some other random object.  */
>   __asm ("" ::: "memory");
>   THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>
> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
> program state it knows it can pthread_mutex_destroy() this lock, free its
> associated memory through the custom shared memory allocator, and allocate
> it for other purposes. Then we get to the point where Process 1 is
> killed, and where the robust futex kernel code corrupts data in shared
> memory because of the dangling list_op_pending pointer.
>
> That shared memory data is still observable by Process 2, which will get a
> corrupted state.
>
> Notice how this all happens without any munmap(2)/mmap(2) in the sequence ?
> This is why I think this is purely a userspace issue rather than an issue
> we can solve by adding extra synchronization in the kernel.
>
> The one point we have in that sequence where I think we can add
> synchronization is pthread_mutex_destroy(3) in libc. One possible
> "big hammer" solution would be to make pthread_mutex_destroy iterate
> over all other threads' list_op_pending and busy-wait if it finds that
> the mutex address is in use. It would of course only have to do that
> for robust futexes.
>
> If that big hammer solution is not fast enough for many-threaded use-cases,
> then we can think of other approaches such as adding a reference counter
> in the mutex structure, or introducing hazard pointers in userspace to
> reduce synchronization iteration from nr_threads to nr_cpus (or even down
> to max rseq mm_cid).
To make matters even worse, the pthread_mutex_destroy(3) and reallocation
could happen from Process 2 rather than Process 1. So iterating on the
threads from Process 1 is not sufficient. We'd need to synchronize
pthread_mutex_destroy on something within the mutex structure which is
observable from all processes using the lock, for instance a reference count.
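
For reference, here is a minimal userspace model of the word update that the robust list exit path applies to whatever list_op_pending still points at. It is a sketch, not the kernel code: model_futex_death() and demo_futex_death() are hypothetical names, the constants are copied from the futex uapi header, and the real path additionally uses a cmpxchg loop and wakes waiters. It assumes the dying thread's TID was 0x1fe8d, which is only an inference from the reproducer output quoted above.

```c
#include <stdint.h>
#include <stdio.h>

/* Values mirror include/uapi/linux/futex.h. */
#define FUTEX_WAITERS     0x80000000u
#define FUTEX_OWNER_DIED  0x40000000u
#define FUTEX_TID_MASK    0x3fffffffu

/*
 * Hypothetical model of the exit-time update: the only check is that the
 * TID bits of the word match the dying thread. Returns 1 if the word was
 * rewritten, 0 if it was left alone.
 */
int model_futex_death(uint32_t *word, uint32_t dead_tid)
{
	uint32_t uval = *word;

	if ((uval & FUTEX_TID_MASK) != dead_tid)
		return 0;

	/* Keep the WAITERS bit, drop the TID, flag the owner as dead. */
	*word = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
	return 1;
}

/* Replays the update on a reused slot that merely looks like WAITERS|TID. */
void demo_futex_death(void)
{
	uint32_t reused = 0x8001fe8du;	/* unrelated data after reallocation */

	if (model_futex_death(&reused, 0x1fe8du))
		printf("corrupted: 0x%08x\n", (unsigned int)reused);
}
```

With those inputs the low word becomes 0xc0000000, which is consistent with the "8001fe8d8001fe8d vs 8001fe8dc0000000" message on a little-endian machine. Nothing in the check can tell a live robust mutex from reused memory, which is why the synchronization has to happen before the memory is reused.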
Thanks,
Mathieu
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
>>
>> Thanks!
>> André
>>
>> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
>> [1] https://lpc.events/event/19/contributions/2108/
>> [2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Mathieu Desnoyers @ 2026-02-20 23:17 UTC (permalink / raw)
To: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
Lorenzo Stoakes, Michal Hocko
Cc: kernel-dev, linux-api, linux-kernel, libc-alpha
In-Reply-To: <67be0aa1-2241-43ef-aa01-a89ced26c8f6@efficios.com>
On 2026-02-20 17:41, Mathieu Desnoyers wrote:
> On 2026-02-20 16:42, Mathieu Desnoyers wrote:
>> +CC libc-alpha.
>>
>> On 2026-02-20 15:26, André Almeida wrote:
>>> During LPC 2025, I presented a session about creating a new syscall for
>>> robust_list[0][1]. However, most of the session discussion wasn't much
>>> related to the new syscall itself, but much more related to an old bug
>>> that exists in the current robust_list mechanism.
>>>
>>> Since at least 2012, there's an open bug reporting a race condition, as
>>> Carlos O'Donell pointed out:
>>>
>>> "File corruption race condition in robust mutex unlocking"
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>>
>>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>>> companion kernel hack (patch 2/2) that helps to make the race condition
>>> more likely. When the bug happens, the reproducer shows a message
>>> comparing the original memory with the corrupted one:
>>>
>>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>>> 8001fe8dc0000000"
>>>
>>> I'm not sure yet what the appropriate approach to fix it would be, so I
>>> decided to reach out to the community before moving in some direction.
>>> One suggestion from Peter[2] revolves around serializing the mmap() and the
>>> robust list exit path, which might cause overhead for the common case,
>>> where list_op_pending is empty.
>>>
>>> However, given that there's a new interface being prepared, this could
>>> also be an opportunity to rethink how list_op_pending works, and get
>>> rid of the race condition by design.
>>>
>>> Feedback is very much welcome.
>>
>> Looking at this bug, one thing I'm starting to consider is that it
>> appears to be an issue inherent to lack of synchronization between
>> pthread_mutex_destroy(3) and the per-thread list_op_pending fields
>> and not so much a kernel issue.
>>
>> Here is why I think the issue is purely userspace:
>>
>> Let's suppose we have a shared memory area across Process 1 and Process 2,
>> which internally has its own custom memory allocator in userspace to
>> allocate/free space within that shared memory.
>>
>> Process 1, Thread A stumbles through the scenario highlighted by this bug,
>> and basically gets preempted at this FIXME in libc
>> __pthread_mutex_unlock_full():
>>
>>   if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>>                          & FUTEX_WAITERS) != 0))
>>     futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>>
>>   /* We must clear op_pending after we release the mutex.
>>      FIXME However, this violates the mutex destruction requirements
>>      because another thread could acquire the mutex, destroy it, and
>>      reuse the memory for something else; then, if this thread crashes,
>>      and the memory happens to have a value equal to the TID, the kernel
>>      will believe it is still related to the mutex (which has been
>>      destroyed already) and will modify some other random object.  */
>>   __asm ("" ::: "memory");
>>   THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>>
>> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
>> program state it knows it can pthread_mutex_destroy() this lock, free its
>> associated memory through the custom shared memory allocator, and
>> allocate
>> it for other purposes. Then we get to the point where Process 1 is
>> killed, and where the robust futex kernel code corrupts data in shared
>> memory because of the dangling list_op_pending pointer.
>>
>> That shared memory data is still observable by Process 2, which will get
>> a corrupted state.
>>
>> Notice how this all happens without any munmap(2)/mmap(2) in the
>> sequence ?
>> This is why I think this is purely a userspace issue rather than an issue
>> we can solve by adding extra synchronization in the kernel.
>>
>> The one point we have in that sequence where I think we can add
>> synchronization is pthread_mutex_destroy(3) in libc. One possible
>> "big hammer" solution would be to make pthread_mutex_destroy iterate
>> over all other threads' list_op_pending and busy-wait if it finds that
>> the mutex address is in use. It would of course only have to do that
>> for robust futexes.
>>
>> If that big hammer solution is not fast enough for many-threaded use-cases,
>> then we can think of other approaches such as adding a reference counter
>> in the mutex structure, or introducing hazard pointers in userspace to
>> reduce synchronization iteration from nr_threads to nr_cpus (or even down
>> to max rseq mm_cid).
>
> To make matters even worse, the pthread_mutex_destroy(3) and reallocation
> could happen from Process 2 rather than Process 1. So iterating on the
> threads from Process 1 is not sufficient. We'd need to synchronize
> pthread_mutex_destroy on something within the mutex structure which is
> observable from all processes using the lock, for instance a reference
> count.
Trying to find a backward compatible way to solve this may be tricky.
Here is one possible approach I have in mind: Introduce a new syscall,
e.g. sys_cleanup_robust_list(void *addr)
This system call would be invoked on pthread_mutex_destroy(3) of
robust mutexes, and do the following:
- Calculate the offset of @addr within its mapping,
- Iterate on all processes which map the backing store that contains
the lock address @addr,
- Iterate on each sibling thread within each of those processes,
- If the thread has a robust list, and its list_op_pending points
to the same offset within the backing store mapping, clear the
list_op_pending pointer.
The overhead would be added specifically to pthread_mutex_destroy(3),
and only for robust mutexes.
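
To make the steps above concrete, here is a rough userspace model of the walk. Everything in it is hypothetical: the model_* types and model_cleanup_robust_list() are made up for illustration, each process is assumed to have a single shared mapping, and list_op_pending is treated as pointing directly at the lock address (the real robust list applies a futex_offset). It says nothing about locking or how the kernel would actually find the other mappers.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 4

/* One shared mapping per modeled process, to keep the sketch small. */
struct model_mapping {
	int       backing_id;	/* identifies the shared backing store */
	uintptr_t base;		/* where this process mapped it */
	size_t    len;
};

struct model_thread {
	int       has_robust_list;
	uintptr_t list_op_pending;	/* 0 stands for NULL */
};

struct model_process {
	struct model_mapping map;
	struct model_thread  threads[MAX_THREADS];
	size_t               nr_threads;
};

static int addr_in_mapping(const struct model_mapping *m, uintptr_t addr)
{
	return addr >= m->base && addr - m->base < m->len;
}

/* Returns the number of list_op_pending pointers cleared, or -1 on bad addr. */
int model_cleanup_robust_list(struct model_process *procs, size_t nr_procs,
			      const struct model_process *caller,
			      uintptr_t addr)
{
	int cleared = 0;
	size_t off;

	/* Step 1: offset of @addr within the caller's mapping. */
	if (!addr_in_mapping(&caller->map, addr))
		return -1;
	off = addr - caller->map.base;

	/* Step 2: every process mapping the same backing store. */
	for (size_t p = 0; p < nr_procs; p++) {
		struct model_process *proc = &procs[p];

		if (proc->map.backing_id != caller->map.backing_id)
			continue;

		/* Step 3: every thread of that process. */
		for (size_t t = 0; t < proc->nr_threads; t++) {
			struct model_thread *th = &proc->threads[t];

			if (!th->has_robust_list)
				continue;
			if (!addr_in_mapping(&proc->map, th->list_op_pending))
				continue;

			/* Step 4: same offset in the backing store, so clear it. */
			if (th->list_op_pending - proc->map.base == off) {
				th->list_op_pending = 0;
				cleared++;
			}
		}
	}
	return cleared;
}
```

Comparing offsets rather than raw addresses is what lets a destroy issued from one process clear a pending pointer held by a thread in another process that mapped the same store at a different base.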
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Florian Weimer @ 2026-02-23 11:13 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett, Lorenzo Stoakes,
Michal Hocko, kernel-dev, linux-api, linux-kernel, libc-alpha
In-Reply-To: <a1e24288-6ffc-438d-8a2a-d152134c9555@efficios.com>
* Mathieu Desnoyers:
> Trying to find a backward compatible way to solve this may be tricky.
> Here is one possible approach I have in mind: Introduce a new syscall,
> e.g. sys_cleanup_robust_list(void *addr)
>
> This system call would be invoked on pthread_mutex_destroy(3) of
> robust mutexes, and do the following:
>
> - Calculate the offset of @addr within its mapping,
> - Iterate on all processes which map the backing store which contain
> the lock address @addr.
> - Iterate on each thread sibling within each of those processes,
> - If the thread has a robust list, and its list_op_pending points
> to the same offset within the backing store mapping, clear the
> list_op_pending pointer.
>
> The overhead would be added specifically to pthread_mutex_destroy(3),
> and only for robust mutexes.
Would we have to do this for pthread_mutex_destroy only, or also for
pthread_join? A thread is allowed to exit with mutexes still locked,
and after pthread_join returns the application may determine by its own
logic that the backing store can be deallocated.
Thanks,
Florian
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Mathieu Desnoyers @ 2026-02-23 13:37 UTC (permalink / raw)
To: Florian Weimer
Cc: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett, Lorenzo Stoakes,
Michal Hocko, kernel-dev, linux-api, linux-kernel, libc-alpha
In-Reply-To: <lhusearzp8o.fsf@oldenburg.str.redhat.com>
On 2026-02-23 06:13, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>> Trying to find a backward compatible way to solve this may be tricky.
>> Here is one possible approach I have in mind: Introduce a new syscall,
>> e.g. sys_cleanup_robust_list(void *addr)
>>
>> This system call would be invoked on pthread_mutex_destroy(3) of
>> robust mutexes, and do the following:
>>
>> - Calculate the offset of @addr within its mapping,
>> - Iterate on all processes which map the backing store which contain
>> the lock address @addr.
>> - Iterate on each thread sibling within each of those processes,
>> - If the thread has a robust list, and its list_op_pending points
>> to the same offset within the backing store mapping, clear the
>> list_op_pending pointer.
>>
>> The overhead would be added specifically to pthread_mutex_destroy(3),
>> and only for robust mutexes.
>
> Would we have to do this for pthread_mutex_destroy only, or also for
> pthread_join? It is defined to exit a thread with mutexes still locked,
> and the pthread_join call could mean that the application can determine
> by its own logic that the backing store can be deallocated.
Let me try to wrap my head around this scenario.
AFAIU, the https://man7.org/linux/man-pages/man3/pthread_join.3.html
NOTES section states the following for pthread_join(3):
After a successful call to pthread_join(), the caller is
guaranteed that the target thread has terminated. The caller may
then choose to do any clean-up that is required after termination
of the thread (e.g., freeing memory or other resources that were
allocated to the target thread).
What is the behavior when a thread exits with a mutex locked ? I would
expect that this mutex stays locked and the pthread_join(3) caller gets
to release that mutex and eventually calls pthread_mutex_destroy(3) if
the application logic allows it.
But it looks like you are implying that the pthread_mutex_destroy(3) is
somehow implicit to pthread_join, and I really don't understand that
part. Am I missing something ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Rich Felker @ 2026-02-23 13:47 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Florian Weimer, André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Torvald Riegel,
Darren Hart, Thomas Gleixner, Ingo Molnar, Davidlohr Bueso,
Arnd Bergmann, Liam R . Howlett, Lorenzo Stoakes, Michal Hocko,
kernel-dev, linux-api, linux-kernel, libc-alpha
In-Reply-To: <87003e32-eae2-41c8-8b83-2530f084b3c7@efficios.com>
On Mon, Feb 23, 2026 at 08:37:13AM -0500, Mathieu Desnoyers wrote:
> On 2026-02-23 06:13, Florian Weimer wrote:
> > * Mathieu Desnoyers:
> >
> > > Trying to find a backward compatible way to solve this may be tricky.
> > > Here is one possible approach I have in mind: Introduce a new syscall,
> > > e.g. sys_cleanup_robust_list(void *addr)
> > >
> > > This system call would be invoked on pthread_mutex_destroy(3) of
> > > robust mutexes, and do the following:
> > >
> > > - Calculate the offset of @addr within its mapping,
> > > - Iterate on all processes which map the backing store which contain
> > > the lock address @addr.
> > > - Iterate on each thread sibling within each of those processes,
> > > - If the thread has a robust list, and its list_op_pending points
> > > to the same offset within the backing store mapping, clear the
> > > list_op_pending pointer.
> > >
> > > The overhead would be added specifically to pthread_mutex_destroy(3),
> > > and only for robust mutexes.
> >
> > Would we have to do this for pthread_mutex_destroy only, or also for
> > pthread_join? It is defined to exit a thread with mutexes still locked,
> > and the pthread_join call could mean that the application can determine
> > by its own logic that the backing store can be deallocated.
> Let me try to wrap my head around this scenario.
>
> AFAIU, the https://man7.org/linux/man-pages/man3/pthread_join.3.html
> NOTES section states the following for pthread_join(3):
>
> After a successful call to pthread_join(), the caller is
> guaranteed that the target thread has terminated. The caller may
> then choose to do any clean-up that is required after termination
> of the thread (e.g., freeing memory or other resources that were
> allocated to the target thread).
>
> What is the behavior when a thread exits with a mutex locked ? I would
> expect that this mutex stays locked
For a robust mutex, if the owning thread exits, the mutex enters
EOWNERDEAD state.
Otherwise, per POSIX the mutex just remains permanently locked and
undestroyable. glibc does not actually implement this for recursive or
errorchecking mutexes, as the tid might get reused and then the new
thread that got the same tid will now behave as if it were the owner
(e.g. it's allowed to take further recursive locks or observe itself
as the owner via EDEADLK). In musl we implement this by putting all
recursive and errorchecking mutexes on a robust list to reassign an
unmatchable tid to them at pthread_exit time.
> and the pthread_join(3) caller gets
> to release that mutex and eventually calls pthread_mutex_destroy(3) if
> the application logic allows it.
No other thread can release the mutex that was left locked unless it
was robust and it goes via the EOWNERDEAD/recovery process. Nor can
you legally call pthread_mutex_destroy on a mutex that's still owned.
Rich
* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
From: Florian Weimer @ 2026-02-24 11:23 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi
In-Reply-To: <20251229-work-empty-namespace-v1-1-bfb24c7b061f@kernel.org>
* Christian Brauner:
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index 5d3f8c9e3a62..acbc22241c9c 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -61,7 +61,8 @@
> /*
> * open_tree() flags.
> */
> -#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
> +#define OPEN_TREE_CLONE (1 << 0) /* Clone the target tree and attach the clone */
This change causes pointless -Werror=undef errors in projects that have
settled on the old definition.
Reported here:
Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
<https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
Thanks,
Florian
* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
From: Christian Brauner @ 2026-02-24 12:05 UTC (permalink / raw)
To: Florian Weimer
Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi
In-Reply-To: <lhuecmaz8p6.fsf@oldenburg.str.redhat.com>
On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
> * Christian Brauner:
>
> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> > index 5d3f8c9e3a62..acbc22241c9c 100644
> > --- a/include/uapi/linux/mount.h
> > +++ b/include/uapi/linux/mount.h
> > @@ -61,7 +61,8 @@
> > /*
> > * open_tree() flags.
> > */
> > -#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
> > +#define OPEN_TREE_CLONE (1 << 0) /* Clone the target tree and attach the clone */
>
> This change causes pointless -Werror=undef errors in projects that have
> settled on the old definition.
>
> Reported here:
>
> Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
> <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
Send a patch to change it back, please.
Otherwise it might take a few days until I get around to it.
* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
From: Florian Weimer @ 2026-02-24 13:30 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi
In-Reply-To: <20260224-erbitten-kaufleute-6f14e3072c5d@brauner>
* Christian Brauner:
> On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
>> * Christian Brauner:
>>
>> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
>> > index 5d3f8c9e3a62..acbc22241c9c 100644
>> > --- a/include/uapi/linux/mount.h
>> > +++ b/include/uapi/linux/mount.h
>> > @@ -61,7 +61,8 @@
>> > /*
>> > * open_tree() flags.
>> > */
>> > -#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
>> > +#define OPEN_TREE_CLONE (1 << 0) /* Clone the target tree and attach the clone */
>>
>> This change causes pointless -Werror=undef errors in projects that have
>> settled on the old definition.
>>
>> Reported here:
>>
>> Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
>> <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
>
> Send a patch to change it back, please.
> Otherwise it might take a few days until I get around to it.
Rudi, could you post a patch?
Thanks,
Florian
* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
From: Christian Brauner @ 2026-02-24 14:33 UTC (permalink / raw)
To: Florian Weimer
Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi
In-Reply-To: <lhuv7fmxo8y.fsf@oldenburg.str.redhat.com>
On Tue, Feb 24, 2026 at 02:30:37PM +0100, Florian Weimer wrote:
> * Christian Brauner:
>
> > On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
> >> * Christian Brauner:
> >>
> >> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> >> > index 5d3f8c9e3a62..acbc22241c9c 100644
> >> > --- a/include/uapi/linux/mount.h
> >> > +++ b/include/uapi/linux/mount.h
> >> > @@ -61,7 +61,8 @@
> >> > /*
> >> > * open_tree() flags.
> >> > */
> >> > -#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
> >> > +#define OPEN_TREE_CLONE (1 << 0) /* Clone the target tree and attach the clone */
> >>
> >> This change causes pointless -Werror=undef errors in projects that have
> >> settled on the old definition.
> >>
> >> Reported here:
> >>
> >> Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
> >> <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
> >
> > Send a patch to change it back, please.
> > Otherwise it might take a few days until I get around to it.
>
> Rudi, could you post a patch?
I'm a bit confused though and not super happy that you're basically
asking us to be so constrained that we aren't even allowed to change 1
to 1 - just syntactically different.
* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
From: Jan Kara @ 2026-02-26 11:54 UTC (permalink / raw)
To: Christian Brauner
Cc: Florian Weimer, linux-fsdevel, Jeff Layton, Alexander Viro,
Amir Goldstein, Josef Bacik, Jan Kara, Aleksa Sarai, linux-api,
rudi
In-Reply-To: <20260224-kandidat-wohltat-ae8fb7a57738@brauner>
On Tue 24-02-26 15:33:13, Christian Brauner wrote:
> On Tue, Feb 24, 2026 at 02:30:37PM +0100, Florian Weimer wrote:
> > * Christian Brauner:
> >
> > > On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
> > >> * Christian Brauner:
> > >>
> > >> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> > >> > index 5d3f8c9e3a62..acbc22241c9c 100644
> > >> > --- a/include/uapi/linux/mount.h
> > >> > +++ b/include/uapi/linux/mount.h
> > >> > @@ -61,7 +61,8 @@
> > >> > /*
> > >> > * open_tree() flags.
> > >> > */
> > >> > -#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
> > >> > +#define OPEN_TREE_CLONE (1 << 0) /* Clone the target tree and attach the clone */
> > >>
> > >> This change causes pointless -Werror=undef errors in projects that have
> > >> settled on the old definition.
> > >>
> > >> Reported here:
> > >>
> > >> Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
> > >> <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
> > >
> > > Send a patch to change it back, please.
> > > Otherwise it might take a few days until I get around to it.
> >
> > Rudi, could you post a patch?
>
> I'm a bit confused though and not super happy that you're basically
> asking us to be so constrained that we aren't even allowed to change 1
> to 1 - just syntactically different.
Agreed, this looks more like a tooling bug than anything else...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 01/17] fs: Move file_kattr initialization to callers
From: Jan Kara @ 2026-02-27 11:34 UTC (permalink / raw)
To: Chuck Lever
Cc: Al Viro, Christian Brauner, Jan Kara, linux-fsdevel, linux-ext4,
linux-xfs, linux-cifs, linux-nfs, linux-api, linux-f2fs-devel,
hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever,
Darrick J. Wong
In-Reply-To: <20260217214741.1928576-2-cel@kernel.org>
On Tue 17-02-26 16:47:25, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> fileattr_fill_xflags() and fileattr_fill_flags() zero the entire
> file_kattr struct before populating select fields. This behavior
> prevents callers from setting flags in fa->fsx_xflags before
> calling these helpers; the zeroing clears any pre-set values.
>
> As Darrick Wong observed, when a function named "fill_xflags"
> modifies more than just xflags, filesystems must understand
> implementation details beyond the function's apparent scope. When
> initialization occurs at entry points, helper functions need not
> duplicate that zeroing.
>
> Move struct file_kattr zero-initialization from the fill functions
> to their callers. Entry points such as ioctl_setflags(),
> ioctl_fssetxattr(), and the file_getattr/file_setattr syscalls
> now perform aggregate initialization directly. The fill functions
> retain their field-setting logic but no longer clear the struct.
>
> This change enables subsequent patches where filesystem
> ->fileattr_get() handlers can set case-sensitivity flags
> (FS_XFLAG_CASEFOLD, FS_XFLAG_CASENONPRESERVING) in fa->fsx_xflags
> before calling the fill functions.
>
> Suggested-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
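
For readers skimming the thread, the effect described in the commit message quoted above can be shown with a small stand-alone sketch. The demo_* names, the trimmed-down struct, and the flag values are hypothetical stand-ins, not the kernel definitions; the helper is modeled on the shape of fileattr_fill_flags(), which ORs translated bits into fsx_xflags.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only; see the uapi headers for the real flags. */
#define DEMO_SYNC_FL         0x00000008u
#define DEMO_XFLAG_SYNC      0x00000020u
#define DEMO_XFLAG_CASEFOLD  0x00040000u

/* Trimmed-down stand-in for struct file_kattr. */
struct demo_kattr {
	uint32_t flags;
	uint32_t fsx_xflags;
	bool     flags_valid;
};

/* Old shape: the helper zeroes the struct, wiping anything pre-set. */
void demo_fill_flags_old(struct demo_kattr *fa, uint32_t flags)
{
	memset(fa, 0, sizeof(*fa));
	fa->flags_valid = true;
	fa->flags = flags;
	if (fa->flags & DEMO_SYNC_FL)
		fa->fsx_xflags |= DEMO_XFLAG_SYNC;
}

/* New shape: no memset, the caller is responsible for initialization. */
void demo_fill_flags_new(struct demo_kattr *fa, uint32_t flags)
{
	fa->flags_valid = true;
	fa->flags = flags;
	if (fa->flags & DEMO_SYNC_FL)
		fa->fsx_xflags |= DEMO_XFLAG_SYNC;
}

/* Models a handler that pre-sets a flag and then calls the fill helper. */
uint32_t demo_get_xflags(void (*fill)(struct demo_kattr *, uint32_t))
{
	struct demo_kattr fa = {0};	/* caller-side zero-initialization */

	fa.fsx_xflags |= DEMO_XFLAG_CASEFOLD;
	fill(&fa, DEMO_SYNC_FL);
	return fa.fsx_xflags;
}
```

With the old helper the pre-set CASEFOLD bit is lost and only the translated SYNC bit survives; with the new one both bits are present, provided the caller zero-initialized the struct first.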
> ---
> fs/file_attr.c | 14 +++++---------
> fs/xfs/xfs_ioctl.c | 2 +-
> 2 files changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/fs/file_attr.c b/fs/file_attr.c
> index 6d2a298a786d..42aa511111a0 100644
> --- a/fs/file_attr.c
> +++ b/fs/file_attr.c
> @@ -15,12 +15,10 @@
> * @fa: fileattr pointer
> * @xflags: FS_XFLAG_* flags
> *
> - * Set ->fsx_xflags, ->fsx_valid and ->flags (translated xflags). All
> - * other fields are zeroed.
> + * Set ->fsx_xflags, ->fsx_valid and ->flags (translated xflags).
> */
> void fileattr_fill_xflags(struct file_kattr *fa, u32 xflags)
> {
> - memset(fa, 0, sizeof(*fa));
> fa->fsx_valid = true;
> fa->fsx_xflags = xflags;
> if (fa->fsx_xflags & FS_XFLAG_IMMUTABLE)
> @@ -48,11 +46,9 @@ EXPORT_SYMBOL(fileattr_fill_xflags);
> * @flags: FS_*_FL flags
> *
> * Set ->flags, ->flags_valid and ->fsx_xflags (translated flags).
> - * All other fields are zeroed.
> */
> void fileattr_fill_flags(struct file_kattr *fa, u32 flags)
> {
> - memset(fa, 0, sizeof(*fa));
> fa->flags_valid = true;
> fa->flags = flags;
> if (fa->flags & FS_SYNC_FL)
> @@ -325,7 +321,7 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> {
> struct mnt_idmap *idmap = file_mnt_idmap(file);
> struct dentry *dentry = file->f_path.dentry;
> - struct file_kattr fa;
> + struct file_kattr fa = {};
> unsigned int flags;
> int err;
>
> @@ -357,7 +353,7 @@ int ioctl_fssetxattr(struct file *file, void __user *argp)
> {
> struct mnt_idmap *idmap = file_mnt_idmap(file);
> struct dentry *dentry = file->f_path.dentry;
> - struct file_kattr fa;
> + struct file_kattr fa = {};
> int err;
>
> err = copy_fsxattr_from_user(&fa, argp);
> @@ -378,7 +374,7 @@ SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
> struct path filepath __free(path_put) = {};
> unsigned int lookup_flags = 0;
> struct file_attr fattr;
> - struct file_kattr fa;
> + struct file_kattr fa = {};
> int error;
>
> BUILD_BUG_ON(sizeof(struct file_attr) < FILE_ATTR_SIZE_VER0);
> @@ -431,7 +427,7 @@ SYSCALL_DEFINE5(file_setattr, int, dfd, const char __user *, filename,
> struct path filepath __free(path_put) = {};
> unsigned int lookup_flags = 0;
> struct file_attr fattr;
> - struct file_kattr fa;
> + struct file_kattr fa = {};
> int error;
>
> BUILD_BUG_ON(sizeof(struct file_attr) < FILE_ATTR_SIZE_VER0);
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 4eeda4d4e3ab..369555275140 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -498,7 +498,7 @@ xfs_ioc_fsgetxattra(
> xfs_inode_t *ip,
> void __user *arg)
> {
> - struct file_kattr fa;
> + struct file_kattr fa = {};
>
> xfs_ilock(ip, XFS_ILOCK_SHARED);
> xfs_fill_fsxattr(ip, XFS_ATTR_FORK, &fa);
> --
> 2.53.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 02/17] fs: Add case sensitivity flags to file_kattr
From: Jan Kara @ 2026-02-27 11:37 UTC (permalink / raw)
To: Chuck Lever
Cc: Al Viro, Christian Brauner, Jan Kara, linux-fsdevel, linux-ext4,
linux-xfs, linux-cifs, linux-nfs, linux-api, linux-f2fs-devel,
hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever,
Darrick J. Wong
In-Reply-To: <20260217214741.1928576-3-cel@kernel.org>
On Tue 17-02-26 16:47:26, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Enable upper layers such as NFSD to retrieve case sensitivity
> information from file systems by adding FS_XFLAG_CASEFOLD and
> FS_XFLAG_CASENONPRESERVING flags.
>
> Filesystems report case-insensitive or case-nonpreserving behavior
> by setting these flags directly in fa->fsx_xflags. The default
> (flags unset) indicates POSIX semantics: case-sensitive and
> case-preserving. These flags are read-only; userspace cannot set
> them via ioctl.
>
> Case sensitivity information is exported to userspace via the
> fa_xflags field in the FS_IOC_FSGETXATTR ioctl and file_getattr()
> system call.
>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
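
As a sketch of how a consumer might interpret the two bits described in the commit message quoted above: the flag values below are copied from the patch under review and are not assumed to be present in installed headers, and the demo_* helper names are hypothetical.

```c
#include <stdint.h>

/* Values as proposed in the patch; prefixed to avoid header clashes. */
#define DEMO_FS_XFLAG_CASEFOLD          0x00040000u
#define DEMO_FS_XFLAG_CASENONPRESERVING 0x00080000u

/* Nonzero when lookups on the file system ignore case. */
int demo_is_case_insensitive(uint32_t xflags)
{
	return (xflags & DEMO_FS_XFLAG_CASEFOLD) != 0;
}

/* Nonzero when names are stored with the case the caller supplied. */
int demo_preserves_case(uint32_t xflags)
{
	return (xflags & DEMO_FS_XFLAG_CASENONPRESERVING) == 0;
}

/* Nonzero for the default: case-sensitive and case-preserving. */
int demo_has_posix_case_semantics(uint32_t xflags)
{
	return !demo_is_case_insensitive(xflags) &&
	       demo_preserves_case(xflags);
}
```

The xflags argument would come from the xflags field returned by FS_IOC_FSGETXATTR or file_getattr(). Note the inverted sense of the second flag: it is set when case is not preserved, so both bits clear means plain POSIX behavior.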
> ---
> fs/file_attr.c | 4 ++++
> include/linux/fileattr.h | 3 ++-
> include/uapi/linux/fs.h | 7 +++++++
> 3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/file_attr.c b/fs/file_attr.c
> index 42aa511111a0..5d9a7ed159fb 100644
> --- a/fs/file_attr.c
> +++ b/fs/file_attr.c
> @@ -37,6 +37,8 @@ void fileattr_fill_xflags(struct file_kattr *fa, u32 xflags)
> fa->flags |= FS_PROJINHERIT_FL;
> if (fa->fsx_xflags & FS_XFLAG_VERITY)
> fa->flags |= FS_VERITY_FL;
> + if (fa->fsx_xflags & FS_XFLAG_CASEFOLD)
> + fa->flags |= FS_CASEFOLD_FL;
> }
> EXPORT_SYMBOL(fileattr_fill_xflags);
>
> @@ -67,6 +69,8 @@ void fileattr_fill_flags(struct file_kattr *fa, u32 flags)
> fa->fsx_xflags |= FS_XFLAG_PROJINHERIT;
> if (fa->flags & FS_VERITY_FL)
> fa->fsx_xflags |= FS_XFLAG_VERITY;
> + if (fa->flags & FS_CASEFOLD_FL)
> + fa->fsx_xflags |= FS_XFLAG_CASEFOLD;
> }
> EXPORT_SYMBOL(fileattr_fill_flags);
>
> diff --git a/include/linux/fileattr.h b/include/linux/fileattr.h
> index 3780904a63a6..58044b598016 100644
> --- a/include/linux/fileattr.h
> +++ b/include/linux/fileattr.h
> @@ -16,7 +16,8 @@
>
> /* Read-only inode flags */
> #define FS_XFLAG_RDONLY_MASK \
> - (FS_XFLAG_PREALLOC | FS_XFLAG_HASATTR | FS_XFLAG_VERITY)
> + (FS_XFLAG_PREALLOC | FS_XFLAG_HASATTR | FS_XFLAG_VERITY | \
> + FS_XFLAG_CASEFOLD | FS_XFLAG_CASENONPRESERVING)
>
> /* Flags to indicate valid value of fsx_ fields */
> #define FS_XFLAG_VALUES_MASK \
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 70b2b661f42c..2fa003575e8b 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -254,6 +254,13 @@ struct file_attr {
> #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
> #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
> #define FS_XFLAG_VERITY 0x00020000 /* fs-verity enabled */
> +/*
> + * Case handling flags (read-only, cannot be set via ioctl).
> + * Default (neither set) indicates POSIX semantics: case-sensitive
> + * lookups and case-preserving storage.
> + */
> +#define FS_XFLAG_CASEFOLD 0x00040000 /* case-insensitive lookups */
> +#define FS_XFLAG_CASENONPRESERVING 0x00080000 /* case not preserved */
> #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
>
> /* the read-only stuff doesn't really belong here, but any other place is
> --
> 2.53.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
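
For readers who want to try the new bits from userspace, here is a
minimal sketch of reading them through FS_IOC_FSGETXATTR. The helper
names case_handling_name() and query_xflags() are made up for
illustration and are not part of the series. The two flag values are
copied from the uapi hunk quoted above, on the assumption that released
headers may not define them yet.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Values copied from the quoted patch; assumed absent from older headers. */
#ifndef FS_XFLAG_CASEFOLD
#define FS_XFLAG_CASEFOLD          0x00040000 /* case-insensitive lookups */
#endif
#ifndef FS_XFLAG_CASENONPRESERVING
#define FS_XFLAG_CASENONPRESERVING 0x00080000 /* case not preserved */
#endif

/* Hypothetical helper: summarize the two case-handling bits as text. */
const char *case_handling_name(uint32_t xflags)
{
	int folds = (xflags & FS_XFLAG_CASEFOLD) != 0;
	int lossy = (xflags & FS_XFLAG_CASENONPRESERVING) != 0;

	if (!folds && !lossy)
		return "case-sensitive, case-preserving (POSIX)";
	if (folds && !lossy)
		return "case-insensitive, case-preserving";
	if (folds && lossy)
		return "case-insensitive, case not preserved";
	return "case-sensitive, case not preserved";
}

/*
 * Hypothetical helper: fetch fsx_xflags for @path.
 * Returns 0 on success, -1 on failure with errno set.
 */
int query_xflags(const char *path, uint32_t *xflags)
{
	struct fsxattr fsx;
	int fd, ret, saved_errno;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);

	if (ret < 0) {
		errno = saved_errno;
		return -1;
	}
	*xflags = fsx.fsx_xflags;
	return 0;
}
```

A caller would pass a path on the filesystem of interest to
query_xflags() and feed the result to case_handling_name(). Both bits
are in the read-only mask in the quoted hunk, so there is no matching
"set" path to sketch.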
* Re: [PATCH v8 03/17] fat: Implement fileattr_get for case sensitivity
From: Jan Kara @ 2026-02-27 11:41 UTC (permalink / raw)
To: Chuck Lever
Cc: Al Viro, Christian Brauner, Jan Kara, linux-fsdevel, linux-ext4,
linux-xfs, linux-cifs, linux-nfs, linux-api, linux-f2fs-devel,
hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever
In-Reply-To: <20260217214741.1928576-4-cel@kernel.org>
On Tue 17-02-26 16:47:27, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Report FAT's case sensitivity behavior via the FS_XFLAG_CASEFOLD
> and FS_XFLAG_CASENONPRESERVING flags. FAT filesystems are
> case-insensitive by default.
>
> MSDOS supports a 'nocase' mount option that enables case-sensitive
> behavior; check this option when reporting case sensitivity.
>
> VFAT long filename entries preserve case; without VFAT, only
> uppercased 8.3 short names are stored. MSDOS with 'nocase' also
> preserves case since the name-formatting code skips upcasing when
> 'nocase' is set. Check both options when reporting case preservation.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Looks good to me from a general POV. It would be good to get confirmation
from the FAT maintainer that you've got all the corner cases of FAT
configuration right :) Anyway, feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/fat/fat.h | 3 +++
> fs/fat/file.c | 22 ++++++++++++++++++++++
> fs/fat/namei_msdos.c | 1 +
> fs/fat/namei_vfat.c | 1 +
> 4 files changed, 27 insertions(+)
>
> diff --git a/fs/fat/fat.h b/fs/fat/fat.h
> index 0d269dba897b..c5bcd1063f9c 100644
> --- a/fs/fat/fat.h
> +++ b/fs/fat/fat.h
> @@ -10,6 +10,8 @@
> #include <linux/fs_context.h>
> #include <linux/fs_parser.h>
>
> +struct file_kattr;
> +
> /*
> * vfat shortname flags
> */
> @@ -407,6 +409,7 @@ extern void fat_truncate_blocks(struct inode *inode, loff_t offset);
> extern int fat_getattr(struct mnt_idmap *idmap,
> const struct path *path, struct kstat *stat,
> u32 request_mask, unsigned int flags);
> +int fat_fileattr_get(struct dentry *dentry, struct file_kattr *fa);
> extern int fat_file_fsync(struct file *file, loff_t start, loff_t end,
> int datasync);
>
> diff --git a/fs/fat/file.c b/fs/fat/file.c
> index 124d9c5431c8..6823269a8604 100644
> --- a/fs/fat/file.c
> +++ b/fs/fat/file.c
> @@ -17,6 +17,7 @@
> #include <linux/fsnotify.h>
> #include <linux/security.h>
> #include <linux/falloc.h>
> +#include <linux/fileattr.h>
> #include "fat.h"
>
> static long fat_fallocate(struct file *file, int mode,
> @@ -396,6 +397,26 @@ void fat_truncate_blocks(struct inode *inode, loff_t offset)
> fat_flush_inodes(inode->i_sb, inode, NULL);
> }
>
> +int fat_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
> +{
> + struct msdos_sb_info *sbi = MSDOS_SB(dentry->d_sb);
> +
> + /*
> + * FAT filesystems are case-insensitive by default. MSDOS
> + * supports a 'nocase' mount option for case-sensitive behavior.
> + *
> + * VFAT long filename entries preserve case. Without VFAT, only
> + * uppercased 8.3 short names are stored. MSDOS with 'nocase'
> + * also preserves case.
> + */
> + if (!sbi->options.nocase)
> + fa->fsx_xflags |= FS_XFLAG_CASEFOLD;
> + if (!sbi->options.isvfat && !sbi->options.nocase)
> + fa->fsx_xflags |= FS_XFLAG_CASENONPRESERVING;
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(fat_fileattr_get);
> +
> int fat_getattr(struct mnt_idmap *idmap, const struct path *path,
> struct kstat *stat, u32 request_mask, unsigned int flags)
> {
> @@ -573,5 +594,6 @@ EXPORT_SYMBOL_GPL(fat_setattr);
> const struct inode_operations fat_file_inode_operations = {
> .setattr = fat_setattr,
> .getattr = fat_getattr,
> + .fileattr_get = fat_fileattr_get,
> .update_time = fat_update_time,
> };
> diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
> index 048c103b506a..4a3db08e51c0 100644
> --- a/fs/fat/namei_msdos.c
> +++ b/fs/fat/namei_msdos.c
> @@ -642,6 +642,7 @@ static const struct inode_operations msdos_dir_inode_operations = {
> .rename = msdos_rename,
> .setattr = fat_setattr,
> .getattr = fat_getattr,
> + .fileattr_get = fat_fileattr_get,
> .update_time = fat_update_time,
> };
>
> diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
> index 2acfe3123a72..18f4c316aa05 100644
> --- a/fs/fat/namei_vfat.c
> +++ b/fs/fat/namei_vfat.c
> @@ -1185,6 +1185,7 @@ static const struct inode_operations vfat_dir_inode_operations = {
> .rename = vfat_rename2,
> .setattr = fat_setattr,
> .getattr = fat_getattr,
> + .fileattr_get = fat_fileattr_get,
> .update_time = fat_update_time,
> };
>
> --
> 2.53.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
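
As a quick cross-check of the commit message against the two if
statements in fat_fileattr_get(), the logic can be modelled in userspace
as below. This is only a sketch: struct fat_opts_model and
fat_case_xflags() are invented names, the struct stands in for the two
sbi->options fields the patch reads, and the flag values are copied from
patch 01 of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as introduced by the uapi patch earlier in the series. */
#define FS_XFLAG_CASEFOLD          0x00040000u /* case-insensitive lookups */
#define FS_XFLAG_CASENONPRESERVING 0x00080000u /* case not preserved */

/* Userspace stand-in for the two mount options the patch inspects. */
struct fat_opts_model {
	bool isvfat;	/* mounted as vfat (long filename entries) */
	bool nocase;	/* msdos 'nocase' mount option */
};

/* Mirrors the two conditionals of the quoted fat_fileattr_get(). */
uint32_t fat_case_xflags(struct fat_opts_model o)
{
	uint32_t xflags = 0;

	/* Case-insensitive unless 'nocase' was given. */
	if (!o.nocase)
		xflags |= FS_XFLAG_CASEFOLD;
	/* Case is lost only for plain msdos without 'nocase'. */
	if (!o.isvfat && !o.nocase)
		xflags |= FS_XFLAG_CASENONPRESERVING;
	return xflags;
}
```

For the three configurations the commit message describes, the model
gives: msdos without 'nocase' reports both flags, vfat reports only
FS_XFLAG_CASEFOLD, and msdos with 'nocase' reports neither.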
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: André Almeida @ 2026-02-27 19:15 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, kernel-dev, linux-api, linux-kernel,
Suren Baghdasaryan, Lorenzo Stoakes, Michal Hocko
In-Reply-To: <sn6isqtjcgzix4iwifcg6fy2lq3klfdykezyodzbt7fz7urhcs@dc5sxuzypdoc>
Hi Liam,
Em 20/02/2026 17:51, Liam R. Howlett escreveu:
> +Cc Suren, Lorenzo, and Michal
>
> * André Almeida <andrealmeid@igalia.com> [260220 15:27]:
>> During LPC 2025, I presented a session about creating a new syscall for
>> robust_list[0][1]. However, most of the session discussion wasn't much related
>> to the new syscall itself, but much more related to an old bug that exists in
>> the current robust_list mechanism.
>
> Ah, sorry for hijacking the session, that was not my intention, but this
> needs to be addressed before we propagate the issue into the next
> iteration.
>
No problem! I believe that this reflects the fact that the race
condition is the main concern about this new interface, and that we
should focus our discussion around this.
>>
>> Since at least 2012, there's an open bug reporting a race condition, as
>> Carlos O'Donell pointed out:
>>
>> "File corruption race condition in robust mutex unlocking"
>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>
[...]
>
> There was a delay added to the oom reaper for these tasks [1] by commit
> e4a38402c36e ("oom_kill.c: futex: delay the OOM reaper to allow time for
> proper futex cleanup")
>
> We did discuss marking the vmas as needing to be skipped by the oom
> manager, but no path forward was clear. It's also not clear if
> that's the only area where such a problem exists.
>
> [1]. https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
>
So how would you detect which vmas should be skipped? And this wouldn't
fix the issue when the memory is unmapped, right? Only the OOM case?
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: André Almeida @ 2026-02-27 19:16 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: kernel-dev, Liam R . Howlett, linux-api, Darren Hart,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Florian Weimer,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel, libc-alpha,
Arnd Bergmann, Sebastian Andrzej Siewior
In-Reply-To: <a1e24288-6ffc-438d-8a2a-d152134c9555@efficios.com>
Hi Mathieu,
Em 20/02/2026 20:17, Mathieu Desnoyers escreveu:
> On 2026-02-20 17:41, Mathieu Desnoyers wrote:
>> On 2026-02-20 16:42, Mathieu Desnoyers wrote:
>>> +CC libc-alpha.
>>>
>>> On 2026-02-20 15:26, André Almeida wrote:
>>>> During LPC 2025, I presented a session about creating a new syscall for
>>>> robust_list[0][1]. However, most of the session discussion wasn't
>>>> much related
>>>> to the new syscall itself, but much more related to an old bug that
>>>> exists in
>>>> the current robust_list mechanism.
>>>>
>>>> Since at least 2012, there's an open bug reporting a race condition, as
>>>> Carlos O'Donell pointed out:
>>>>
>>>> "File corruption race condition in robust mutex unlocking"
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>>>
>>>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>>>> companion kernel hack (patch 2/2) that helps to make the race condition
>>>> more likely. When the bug happens, the reproducer shows a message
>>>> comparing the original memory with the corrupted one:
>>>>
>>>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>>>> 8001fe8dc0000000"
>>>>
>>>> I'm not sure yet what would be the appropriate approach to fix it,
>>>> so I
>>>> decided to reach the community before moving forward in some direction.
>>>> One suggestion from Peter[2] revolves around serializing the mmap()
>>>> and the
>>>> robust list exit path, which might cause overheads for the common case,
>>>> where list_op_pending is empty.
>>>>
>>>> However, given that there's a new interface being prepared, this could
>>>> also give the opportunity to rethink how list_op_pending works, and get
>>>> rid of the race condition by design.
>>>>
>>>> Feedback is very much welcome.
>>>
>>> Looking at this bug, one thing I'm starting to consider is that it
>>> appears to be an issue inherent to lack of synchronization between
>>> pthread_mutex_destroy(3) and the per-thread list_op_pending fields
>>> and not so much a kernel issue.
>>>
>>> Here is why I think the issue is purely userspace:
>>>
>>> Let's suppose we have a shared memory area across Process 1 and Process 2,
>>> which internally has its own custom memory allocator in userspace to
>>> allocate/free space within that shared memory.
>>>
>>> Process 1, Thread A stumbles through the scenario highlighted by this
>>> bug, and
>>> basically gets preempted at this FIXME in libc
>>> __pthread_mutex_unlock_full():
>>>
>>>        if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>>>                               & FUTEX_WAITERS) != 0))
>>>          futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>>>
>>>        /* We must clear op_pending after we release the mutex.
>>>           FIXME However, this violates the mutex destruction requirements
>>>           because another thread could acquire the mutex, destroy it, and
>>>           reuse the memory for something else; then, if this thread crashes,
>>>           and the memory happens to have a value equal to the TID, the kernel
>>>           will believe it is still related to the mutex (which has been
>>>           destroyed already) and will modify some other random object.  */
>>>        __asm ("" ::: "memory");
>>>        THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>>>
>>> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
>>> program state it knows it can pthread_mutex_destroy() this lock, free
>>> its
>>> associated memory through the custom shared memory allocator, and
>>> allocate
>>> it for other purposes. Then we get to the point where Process 1 is
>>> killed, and where the robust futex kernel code corrupts data in shared
>>> memory because of the dangling list_op_pending pointer.
>>>
>>> That shared memory data is still observable by Process 2, which will
>>> get a corrupted state.
>>>
>>> Notice how this all happens without any munmap(2)/mmap(2) in the
>>> sequence ?
>>> This is why I think this is purely a userspace issue rather than an
>>> issue
>>> we can solve by adding extra synchronization in the kernel.
>>>
>>> The one point we have in that sequence where I think we can add
>>> synchronization
>>> is pthread_mutex_destroy(3) in libc. One possible "big hammer"
>>> solution would be
>>> to make pthread_mutex_destroy iterate on all other threads
>>> list_op_pending
>>> and busy-wait if it finds that the mutex address is in use. It would
>>> of course
>>> only have to do that for robust futexes.
>>>
>>> If that big hammer solution is not fast enough for many-threaded use-
>>> cases,
>>> then we can think of other approaches such as adding a reference counter
>>> in the mutex structure, or introducing hazard pointers in userspace
>>> to reduce
>>> synchronization iteration from nr_threads to nr_cpus (or even down to
>>> max
>>> rseq mm_cid).
>>
>> To make matters even worse, the pthread_mutex_destroy(3) and reallocation
>> could happen from Process 2 rather than Process 1. So iterating on
>> threads from Process 1 is not sufficient. We'd need to synchronize
>> pthread_mutex_destroy on something within the mutex structure which is
>> observable from all processes using the lock, for instance a reference
>> count.
> Trying to find a backward compatible way to solve this may be tricky.
> Here is one possible approach I have in mind: Introduce a new syscall,
> e.g. sys_cleanup_robust_list(void *addr)
>
> This system call would be invoked on pthread_mutex_destroy(3) of
> robust mutexes, and do the following:
>
> - Calculate the offset of @addr within its mapping,
> - Iterate on all processes which map the backing store which contains
> the lock address @addr.
> - Iterate on each thread sibling within each of those processes,
> - If the thread has a robust list, and its list_op_pending points
> to the same offset within the backing store mapping, clear the
> list_op_pending pointer.
>
> The overhead would be added specifically to pthread_mutex_destroy(3),
> and only for robust mutexes.
>
> Thoughts ?
>
Right, your explanation makes sense to me. I think the only difference
between alloc/free and mmap/munmap is that "'freeing' memory does not
actually return it to the operating system for other applications to
use"[1], so I don't know if this custom allocator is violating some
memory rules.
About the system call, we would call sys_cleanup_robust_list() before
freeing/unmapping the robust mutex. To guarantee that we check every
process that shares the memory region, would we need to check *every*
single process? I don't think there's a way to find such maps without
checking them all.
I'm trying to explore the idea of a reference counter. Would the
munmap() be blocked until the refcount goes to zero or something like
that? I've also tried to find more examples of a memory region that's
shared between one or more processes and the kernel at the same time to
get some inspiration, but it seems robust_list is quite a unique design
on its own regarding this memory sharing problem.
[1] https://sourceware.org/glibc/wiki/MallocInternals
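
To make the quoted sequence concrete, here is a single-threaded replay
of it as a userspace model. It is a sketch under stated assumptions, not
the reproducer from patch 1/2: model_futex_death() is a simplified
stand-in for the exit-time check (keep FUTEX_WAITERS, set
FUTEX_OWNER_DIED when the TID bits match), and replay_race() is a
hypothetical helper that walks through "unlock, reuse, die" in order.
The three FUTEX_* constants are the ones from the futex uapi header.

```c
#include <stddef.h>
#include <stdint.h>

#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/*
 * Simplified model of the exit-time check on one futex word: if the TID
 * bits match the dying thread, keep FUTEX_WAITERS and set
 * FUTEX_OWNER_DIED; otherwise leave the word untouched.
 */
uint32_t model_futex_death(uint32_t word, uint32_t tid)
{
	if ((word & FUTEX_TID_MASK) != tid)
		return word;
	return (word & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}

/*
 * Replay the sequence for a thread with the given TID and return the
 * reused 64-bit object (mem[1] as the high half, mem[0] as the low half)
 * as it looks after the thread has died.
 */
uint64_t replay_race(uint32_t tid)
{
	uint32_t mem[2];		/* mem[0] starts out as the lock word */
	uint32_t *list_op_pending;

	/* Thread A owns the contended lock and announces the unlock. */
	mem[0] = FUTEX_WAITERS | tid;
	mem[1] = 0;
	list_op_pending = &mem[0];

	/* A releases the lock but is preempted before clearing the pointer. */
	mem[0] = 0;

	/*
	 * Another thread destroys the mutex and reuses the memory for
	 * unrelated data whose bits happen to look like "owned by A".
	 */
	mem[0] = FUTEX_WAITERS | tid;
	mem[1] = FUTEX_WAITERS | tid;

	/* A dies; the stale pointer is processed as if it were a lock. */
	if (list_op_pending != NULL)
		*list_op_pending = model_futex_death(*list_op_pending, tid);

	return ((uint64_t)mem[1] << 32) | mem[0];
}
```

With a TID of 0x1fe8d the model turns 8001fe8d8001fe8d into
8001fe8dc0000000, which is the same pair of values as in the reproducer
message quoted above. That the reproducer hit exactly this path with
that TID is an inference from the numbers, not something stated in the
thread.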
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Mathieu Desnoyers @ 2026-02-27 19:59 UTC (permalink / raw)
To: André Almeida
Cc: kernel-dev, Liam R . Howlett, linux-api, Darren Hart,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Florian Weimer,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel, libc-alpha,
Arnd Bergmann, Sebastian Andrzej Siewior
In-Reply-To: <ed918547-1406-4ae6-8a94-4e03712a4923@igalia.com>
On 2026-02-27 14:16, André Almeida wrote:
[...]
>> Trying to find a backward compatible way to solve this may be tricky.
>> Here is one possible approach I have in mind: Introduce a new syscall,
>> e.g. sys_cleanup_robust_list(void *addr)
>>
>> This system call would be invoked on pthread_mutex_destroy(3) of
>> robust mutexes, and do the following:
>>
>> - Calculate the offset of @addr within its mapping,
>> - Iterate on all processes which map the backing store which contains
>> the lock address @addr.
>> - Iterate on each thread sibling within each of those processes,
>> - If the thread has a robust list, and its list_op_pending points
>> to the same offset within the backing store mapping, clear the
>> list_op_pending pointer.
>>
>> The overhead would be added specifically to pthread_mutex_destroy(3),
>> and only for robust mutexes.
>>
>> Thoughts ?
>>
[...]
>
> About the system call, we would call sys_cleanup_robust_list() before
> freeing/unmapping the robust mutex. To guarantee that we check every
> process that shares the memory region, would we need to check *every*
> single process? I don't think there's a way to find such maps without
> checking them all.
We should be able to do it with just an iteration on the struct address_space
reverse mapping (list of vma which map the shared mapping).
AFAIU we'd want to get the struct address_space associated with the
__user pointer, then, while holding i_mmap_lock_read(mapping), iterate
on its reverse mapping (i_mmap field) with vma_interval_tree_foreach. We
can get each mm_struct through vma->vm_mm.
We'd want to do most of this in a kthread and use other mm_struct through
use_mm().
For each mm_struct, we go through the owner field to get the thread
group leader, and iterate on all thread siblings (for_each_thread).
For each of those threads, we'd want to clear the list_op_pending
if it matches the offset of @addr within the mapping. I suspect we'd
want to clear that userspace pointer with a futex_atomic_cmpxchg_inatomic
which only clears the pointer if the old value matches the one we expect.
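
The matching step can be sketched as a userspace model. Everything here
is hypothetical and heavily simplified: struct model_mapping stands in
for one vma of the shared backing store, struct model_thread for one
thread and its list_op_pending, and model_cleanup_robust_list() for the
proposed syscall. The model assumes every mapping starts at file offset
zero and that list_op_pending holds the lock address itself. Locking and
the reverse-mapping walk are left out, and a C11 compare-exchange plays
the role of the cmpxchg on the user pointer.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One mapping of the shared backing store in some process. */
struct model_mapping {
	uintptr_t start;	/* address the store is mapped at */
	size_t len;		/* length of the mapping in bytes */
};

/* One thread: the mapping of its process and its pending-op pointer. */
struct model_thread {
	const struct model_mapping *map;
	_Atomic uintptr_t list_op_pending;
};

/* Store the offset of @addr within @m in @off; return 0 if outside. */
static int offset_in_mapping(const struct model_mapping *m, uintptr_t addr,
			     size_t *off)
{
	if (addr < m->start || addr - m->start >= m->len)
		return 0;
	*off = addr - m->start;
	return 1;
}

/*
 * For the lock at @addr in the caller's mapping, clear list_op_pending
 * of every thread whose pointer refers to the same offset in its own
 * mapping. Returns the number of pointers cleared, or -1 if @addr is
 * not inside @caller_map.
 */
int model_cleanup_robust_list(const struct model_mapping *caller_map,
			      uintptr_t addr,
			      struct model_thread *threads, size_t nr)
{
	size_t off;
	int cleared = 0;

	if (!offset_in_mapping(caller_map, addr, &off))
		return -1;

	for (size_t i = 0; i < nr; i++) {
		uintptr_t expected;

		if (off >= threads[i].map->len)
			continue;
		expected = threads[i].map->start + off;
		/* Clear only if the pointer still holds the expected value. */
		if (atomic_compare_exchange_strong(&threads[i].list_op_pending,
						   &expected, (uintptr_t)0))
			cleared++;
	}
	return cleared;
}
```

The compare-exchange matters because a thread may have moved on to a
different lock between the lookup and the clear; a plain store would
wipe a pointer that is still legitimately in use.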
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Suren Baghdasaryan @ 2026-02-27 20:41 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Florian Weimer, Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes,
Rich Felker, Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha, Arnd Bergmann, Sebastian Andrzej Siewior, npache
In-Reply-To: <bd7a8dd3-8dee-4886-abe6-bdda25fe4a0d@efficios.com>
On Fri, Feb 27, 2026 at 8:00 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2026-02-27 14:16, André Almeida wrote:
> [...]
> >> Trying to find a backward compatible way to solve this may be tricky.
> >> Here is one possible approach I have in mind: Introduce a new syscall,
> >> e.g. sys_cleanup_robust_list(void *addr)
> >>
> >> This system call would be invoked on pthread_mutex_destroy(3) of
> >> robust mutexes, and do the following:
> >>
> >> - Calculate the offset of @addr within its mapping,
> >> - Iterate on all processes which map the backing store which contains
> >> the lock address @addr.
> >> - Iterate on each thread sibling within each of those processes,
> >> - If the thread has a robust list, and its list_op_pending points
> >> to the same offset within the backing store mapping, clear the
> >> list_op_pending pointer.
> >>
> >> The overhead would be added specifically to pthread_mutex_destroy(3),
> >> and only for robust mutexes.
> >>
> >> Thoughts ?
> >>
> [...]
> >
> > About the system call, we would call sys_cleanup_robust_list() before
> > freeing/unmapping the robust mutex. To guarantee that we check every
> > process that shares the memory region, would we need to check *every*
> > single process? I don't think there's a way to find such maps without
> > checking them all.
>
> We should be able to do it with just an iteration on the struct address_space
> reverse mapping (list of vma which map the shared mapping).
>
> AFAIU we'd want to get the struct address_space associated with the
> __user pointer, then, while holding i_mmap_lock_read(mapping), iterate
> on its reverse mapping (i_mmap field) with vma_interval_tree_foreach. We
> can get each mm_struct through vma->vm_mm.
>
> We'd want to do most of this in a kthread and use other mm_struct through
> use_mm().
>
> For each mm_struct, we go through the owner field to get the thread
> group leader, and iterate on all thread siblings (for_each_thread).
>
> For each of those threads, we'd want to clear the list_op_pending
> if it matches the offset of @addr within the mapping. I suspect we'd
> want to clear that userspace pointer with a futex_atomic_cmpxchg_inatomic
> which only clears the pointer if the old value matches the one we expect.
I've been looking into this problem this week and IIUC Nico Pache
pursued this direction at some point (see [1]). I'm CC'ing him to
share his experience.
FYI, the link also contains an interesting discussion between Thomas
and Michal about the difficulty of identifying all the VMAs possibly
involved in the lock chain, and some technical challenges.
[1] https://lore.kernel.org/all/bd61369c-ef50-2eb4-2cca-91422fbfa328@redhat.com/
Thanks,
Suren.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>