Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH 0/9] Kernel API Specification Framework
From: Sasha Levin @ 2026-03-13 15:09 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E. McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Ingo Molnar, Arnd Bergmann, Sasha Levin

This proposal introduces machinery for documenting kernel APIs, addressing the
long-standing challenge of maintaining stable interfaces between the kernel and
user-space programs. Despite the kernel's commitment to never breaking user
space, the lack of machine-readable API specifications has led to breakages and
across system calls and IOCTLs.

Specifications can document parameter types, valid ranges, constraints, and
alignment requirements. They capture return value semantics including success
conditions and error codes with their meaning. Execution context requirements,
capabilities, locking constraints, signal handling behavior, and side effects
can all be formally specified.

These specifications live alongside the code they document and are both
human-readable and machine-parseable. They can be validated at runtime when
CONFIG_KAPI_RUNTIME_CHECKS is enabled, exported via debugfs for userspace
tools, and extracted from either vmlinux or source code.

This enables static analysis tools to verify userspace API usage at compile
time, test generation based on formal specifications, consistent error handling
validation, automated documentation generation, and formal verification of
kernel interfaces.

The implementation includes a core framework with ELF section storage,
kerneldoc integration for inline specification, a debugfs interface for runtime
querying, and a Rust-based extraction tool (tools/kapi) supporting JSON, RST,
and plain text output formats. Example specifications are provided for the four
fundamental file syscalls (sys_open, sys_close, sys_read, sys_write). The
series also includes a KUnit test suite with 38 tests and a runtime
verification selftest with 29+ TAP tests.

The series with runtime testing enabled (CONFIG_KAPI_RUNTIME_CHECKS=y)
currently survives LTP tests in a KVM VM.

Changes since RFC v5:

- Streamlined example specs: focus on open/close/read/write to start with.

- Added KUnit test suite.

- Added runtime verification selftest.

- Fixed kernel test robot warnings from v5: fixed "document isn't included in
  any toctree" (kernel-api-spec.rst now properly added to
  Documentation/dev-tools/index.rst), fixed sparse "non size-preserving
  integer to pointer cast" warnings in kernel_api_spec.c.

- Rebased on v7.0-rc1.

References:

  RFC v5: https://lore.kernel.org/lkml/20251218204239.4159453-1-sashal@kernel.org/
  RFC v4: https://lore.kernel.org/lkml/20250825181434.3340805-1-sashal@kernel.org/
  RFC v3: https://lore.kernel.org/lkml/20250711114248.2288591-1-sashal@kernel.org/
  RFC v2: https://lore.kernel.org/lkml/20250624180742.5795-1-sashal@kernel.org/
  RFC v1: https://lore.kernel.org/lkml/20250614134858.790460-1-sashal@kernel.org/

Sasha Levin (9):
  kernel/api: introduce kernel API specification framework
  kernel/api: enable kerneldoc-based API specifications
  kernel/api: add debugfs interface for kernel API specifications
  tools/kapi: Add kernel API specification extraction tool
  kernel/api: add API specification for sys_open
  kernel/api: add API specification for sys_close
  kernel/api: add API specification for sys_read
  kernel/api: add API specification for sys_write
  kernel/api: add runtime verification selftest

 .gitignore                                    |    1 +
 Documentation/dev-tools/index.rst             |    1 +
 Documentation/dev-tools/kernel-api-spec.rst   |  629 +++++++
 MAINTAINERS                                   |   12 +
 arch/x86/include/asm/syscall_wrapper.h        |   40 +
 fs/open.c                                     |  576 +++++-
 fs/read_write.c                               |  687 +++++++
 include/asm-generic/vmlinux.lds.h             |   28 +
 include/linux/kernel_api_spec.h               | 1580 +++++++++++++++++
 include/linux/syscall_api_spec.h              |  192 ++
 include/linux/syscalls.h                      |   39 +
 init/Kconfig                                  |    2 +
 kernel/Makefile                               |    3 +
 kernel/api/.gitignore                         |    2 +
 kernel/api/Kconfig                            |   70 +
 kernel/api/Makefile                           |   14 +
 kernel/api/kapi_debugfs.c                     |  503 ++++++
 kernel/api/kapi_kunit.c                       |  536 ++++++
 kernel/api/kernel_api_spec.c                  | 1277 +++++++++++++
 scripts/Makefile.build                        |   31 +
 scripts/Makefile.clean                        |    3 +
 tools/docs/kernel-doc                         |    5 +
 tools/kapi/.gitignore                         |    4 +
 tools/kapi/Cargo.toml                         |   19 +
 tools/kapi/src/extractor/debugfs.rs           |  581 ++++++
 tools/kapi/src/extractor/kerneldoc_parser.rs  | 1554 ++++++++++++++++
 tools/kapi/src/extractor/mod.rs               |  463 +++++
 tools/kapi/src/extractor/source_parser.rs     |  405 +++++
 .../src/extractor/vmlinux/binary_utils.rs     |  505 ++++++
 .../src/extractor/vmlinux/magic_finder.rs     |  112 ++
 tools/kapi/src/extractor/vmlinux/mod.rs       |  842 +++++++++
 tools/kapi/src/formatter/json.rs              |  727 ++++++++
 tools/kapi/src/formatter/mod.rs               |  140 ++
 tools/kapi/src/formatter/plain.rs             |  708 ++++++++
 tools/kapi/src/formatter/rst.rs               |  852 +++++++++
 tools/kapi/src/main.rs                        |  119 ++
 tools/lib/python/kdoc/kdoc_apispec.py         |  887 +++++++++
 tools/lib/python/kdoc/kdoc_output.py          |    9 +-
 tools/lib/python/kdoc/kdoc_parser.py          |   86 +-
 tools/testing/selftests/kapi/Makefile         |    7 +
 tools/testing/selftests/kapi/kapi_test_util.h |   31 +
 tools/testing/selftests/kapi/test_kapi.c      | 1021 +++++++++++
 42 files changed, 15294 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/dev-tools/kernel-api-spec.rst
 create mode 100644 include/linux/kernel_api_spec.h
 create mode 100644 include/linux/syscall_api_spec.h
 create mode 100644 kernel/api/.gitignore
 create mode 100644 kernel/api/Kconfig
 create mode 100644 kernel/api/Makefile
 create mode 100644 kernel/api/kapi_debugfs.c
 create mode 100644 kernel/api/kapi_kunit.c
 create mode 100644 kernel/api/kernel_api_spec.c
 create mode 100644 tools/kapi/.gitignore
 create mode 100644 tools/kapi/Cargo.toml
 create mode 100644 tools/kapi/src/extractor/debugfs.rs
 create mode 100644 tools/kapi/src/extractor/kerneldoc_parser.rs
 create mode 100644 tools/kapi/src/extractor/mod.rs
 create mode 100644 tools/kapi/src/extractor/source_parser.rs
 create mode 100644 tools/kapi/src/extractor/vmlinux/binary_utils.rs
 create mode 100644 tools/kapi/src/extractor/vmlinux/magic_finder.rs
 create mode 100644 tools/kapi/src/extractor/vmlinux/mod.rs
 create mode 100644 tools/kapi/src/formatter/json.rs
 create mode 100644 tools/kapi/src/formatter/mod.rs
 create mode 100644 tools/kapi/src/formatter/plain.rs
 create mode 100644 tools/kapi/src/formatter/rst.rs
 create mode 100644 tools/kapi/src/main.rs
 create mode 100644 tools/lib/python/kdoc/kdoc_apispec.py
 create mode 100644 tools/testing/selftests/kapi/Makefile
 create mode 100644 tools/testing/selftests/kapi/kapi_test_util.h
 create mode 100644 tools/testing/selftests/kapi/test_kapi.c

-- 
2.51.0


^ permalink raw reply

* Re: [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Vincent Lefevre @ 2026-03-13 14:45 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: Roman Bakshansky, linux-api, Thorsten Kukuk, linux-kernel, audit,
	libc-alpha
In-Reply-To: <2d5de14a-17d2-4d08-992e-cbc5d430e231@linaro.org>

On 2026-03-13 10:59:11 -0300, Adhemerval Zanella Netto wrote:
> From the glibc standpoint my plan is just to make the accounting database
> function no-op [1] (I hopefully to get this in the next 2.44 release).
> 
> And I think Thorsten Kukuk already adapted most of the usages in current
> distros [2][3] using similar strategy, along with a better systemd
> integration.  I am not sure if/when distros are incorporating his work.
> 
> [1] https://patchwork.sourceware.org/project/glibc/list/?series=37271
> [2] https://www.thkukuk.de/blog/Y2038_glibc_lastlog_64bit/
> [3] https://www.thkukuk.de/blog/Y2038_glibc_utmp_64bit/

FYI, utmp has been reintroduced in Debian for libutempter (and thus
applications that use this library), because systemd was not working
or at least not sufficiently documented:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1125682

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Pascaline project (LIP, ENS-Lyon)

^ permalink raw reply

* Re: [RFC PATCH v4] futex: Introduce __vdso_robust_futex_unlock and __vdso_robust_pi_futex_try_unlock
From: Thomas Weißschuh @ 2026-03-13 14:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: André Almeida, linux-kernel, Carlos O'Donell,
	Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
	Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
	Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
	linux-api
In-Reply-To: <20260313133903.2202079-1-mathieu.desnoyers@efficios.com>

Hi Mathieu,

some small remarks around the vDSO code.

On Fri, Mar 13, 2026 at 09:39:03AM -0400, Mathieu Desnoyers wrote:

(...)

> The approach taken to fix this is to introduce two vDSO and extend the
> x86 vDSO exception table to track the relevant ip ranges: one for non-PI
> robust futexes, and one for PI robust futexes.

One of the central points behind the vDSO so far was that it is only a
performance optimization. It is never required for correctness.
What are applications supposed to do when the vDSO is disabled?

(...)

> diff --git a/arch/x86/entry/vdso/common/vfutex.c b/arch/x86/entry/vdso/common/vfutex.c
> new file mode 100644
> index 000000000000..336095b04952
> --- /dev/null
> +++ b/arch/x86/entry/vdso/common/vfutex.c
> @@ -0,0 +1,90 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2026 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + */
> +#include <linux/types.h>
> +#include <linux/futex.h>

This should be uapi/linux/futex.h. Headers from the linux/ namespace should
not be used in vDSO code. The definitions from them may end up being wrong
in the compat vDSO. Either use uapi/ or vdso/ headers. (linux/types.h is a bit
of an exception for historic reasons, it could be replaced by uapi/linux/types.h)

> +#include <vdso/futex.h>
> +#include "extable.h"

> +
> +#ifdef CONFIG_X86_64

This only works because of the ugly hacks in fake_32bit_build.h.
Testing for '#ifdef __x86_64__' is simpler and nicer to read.

> +# define ASM_PTR_BIT_SET	"btsq "
> +# define ASM_PTR_SET		"movq "
> +#else
> +# define ASM_PTR_BIT_SET	"btsl "
> +# define ASM_PTR_SET		"movl "
> +#endif
> +
> +u32 __vdso_robust_futex_unlock(u32 *uaddr, struct robust_list_head *robust_list_head)
> +{
> +	u32 val = 0;
> +
> +	/*
> +	 * Within the ip range identified by the futex exception table,
> +	 * the register "eax" contains the value loaded by xchg. This is
> +	 * expected by futex_vdso_exception() to check whether waiters
> +	 * need to be woken up. This register state is transferred to
> +	 * bit 1 (NEED_ACTION) of *op_pending_addr before the ip range
> +	 * ends.
> +	 */
> +	asm volatile (
> +		_ASM_VDSO_EXTABLE_FUTEX_HANDLE(1f, 3f)
> +		/* Exchange uaddr (store-release). */
> +		"xchg %[uaddr], %[val]\n\t"
> +		"1:\n\t"
> +		/* Test if FUTEX_WAITERS (0x80000000) is set. */
> +		"test %[val], %[val]\n\t"
> +		"js 2f\n\t"
> +		/* Clear *op_pending_addr if there are no waiters. */
> +		ASM_PTR_SET "$0, %[op_pending_addr]\n\t"
> +		"jmp 3f\n\t"
> +		"2:\n\t"
> +		/* Set bit 1 (NEED_ACTION) in *op_pending_addr. */
> +		ASM_PTR_BIT_SET "$1, %[op_pending_addr]\n\t"
> +		"3:\n\t"
> +		: [val] "+a" (val),
> +		  [uaddr] "+m" (*uaddr)
> +		: [op_pending_addr] "m" (robust_list_head->list_op_pending)
> +		: "memory"
> +	);
> +	return val;
> +}
> +
> +u32 robust_futex_unlock(u32 *, struct robust_list_head *)
> +	__attribute__((weak, alias("__vdso_robust_futex_unlock")));

__weak and __alias() as per checkpatch.pl.

The entries in the linkerscripts are missing.

(...)

> --- /dev/null
> +++ b/include/vdso/futex.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2026 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + */
> +
> +#ifndef _VDSO_FUTEX_H
> +#define _VDSO_FUTEX_H
> +
> +#include <linux/types.h>
> +#include <linux/futex.h>

Same remarks about the linux/ namespace as before.

> +/**
> + * __vdso_robust_futex_unlock - Architecture-specific vDSO implementation of robust futex unlock.
> + * @uaddr:		Lock address (points to a 32-bit unsigned integer type).
> + * @robust_list_head:	The thread-specific robust list that has been registered with set_robust_list.
> + *
> + * This vDSO unlocks the robust futex by exchanging the content of
> + * *@uaddr with 0 with a store-release semantic. If the futex has
> + * waiters, it sets bit 1 of *@robust_list_head->list_op_pending, else
> + * it clears *@robust_list_head->list_op_pending. Those operations are
> + * within a code region known by the kernel, making them safe with
> + * respect to asynchronous program termination either from thread
> + * context or from a nested signal handler.
> + *
> + * Returns:	The old value present at *@uaddr.
> + *
> + * Expected use of this vDSO:
> + *
> + * robust_list_head is the thread-specific robust list that has been
> + * registered with set_robust_list.
> + *
> + * if ((__vdso_robust_futex_unlock((u32 *) &mutex->__data.__lock, robust_list_head)
> + *     & FUTEX_WAITERS) != 0)
> + *         futex_wake((u32 *) &mutex->__data.__lock, 1, private);
> + * WRITE_ONCE(robust_list_head->list_op_pending, 0);
> + */
> +extern u32 __vdso_robust_futex_unlock(u32 *uaddr, struct robust_list_head *robust_list_head);

Drop the extern.

(...)

> +#endif /* _VDSO_FUTEX_H */

(...)

^ permalink raw reply

* Re: [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Adhemerval Zanella Netto @ 2026-03-13 13:59 UTC (permalink / raw)
  To: Roman Bakshansky, linux-api, Thorsten Kukuk
  Cc: linux-kernel, audit, libc-alpha
In-Reply-To: <660c10e6-f8b5-46e2-a424-e3e052992b3a@gmail.com>



On 12/03/26 18:01, Roman Bakshansky wrote:
> Hi all,
> 
> I'd like to share a draft RFC proposing a complete overhaul of the legacy
> binary logs used for authentication auditing in Linux: lastlog, btmp, utmp,
> and wtmp.
> 
> These files, designed decades ago, are running into fundamental limitations:
> 
> - Y2038 problem - they use 32-bit timestamps (time_t in lastlog,
>   tv_sec in utmpx). Even on 64-bit systems the fields remain 32-bit
>   due to ABI constraints, so all Linux systems are affected.
> - No extensibility - any new field (e.g., container ID, service name,
>   source IP) requires changing fixed structures, breaking all existing
>   tools that read them.
> - Poor query performance - tools like last, lastb, who have to
>   scan whole files linearly; with millions of records this becomes
>   painfully slow.
> - No atomicity - partial writes during a crash can corrupt logs.
> - Concurrency bottlenecks - multiple writers (sshd, login, etc.)
>   contend for the same file with coarse locking.
> 
> To address this once and for all, the RFC proposes replacing these logs
> with dedicated shared libraries that use SQLite as the storage backend:
> 
> - liblastlog2 - last login time
> - libbtmp2    - failed login attempts
> - libutmp2    - current sessions
> - libwtmp2    - login/logout history
> 
> SQLite brings:
> - 64-bit time -> Y2038 solved forever.
> - Indexes -> O(log N) queries instead of full scans.
> - Extensible schema -> new fields can be added without breaking old tools.
> - ACID and WAL mode -> atomic writes and concurrent access.
> - Portability - runs on any Linux system, no systemd dependency.
> 
> The full RFC, including preliminary database schemas and API drafts,
> is available in the discussion repository:
> 
>     https://github.com/bakshansky/linux-auth-logs
> 
> I'm looking for feedback on the overall direction, the proposed
> interfaces, and the open questions listed in the document (e.g.,
> library naming, database location, fallback options for embedded
> systems). Please use GitHub Issues for comments, or reply to this
> thread - I'll monitor both.
> 
> Thanks for your time and input!

From the glibc standpoint my plan is just to make the accounting database
function no-op [1] (I hopefully to get this in the next 2.44 release).

And I think Thorsten Kukuk already adapted most of the usages in current
distros [2][3] using similar strategy, along with a better systemd
integration.  I am not sure if/when distros are incorporating his work.

[1] https://patchwork.sourceware.org/project/glibc/list/?series=37271
[2] https://www.thkukuk.de/blog/Y2038_glibc_lastlog_64bit/
[3] https://www.thkukuk.de/blog/Y2038_glibc_utmp_64bit/


^ permalink raw reply

* [RFC PATCH v4] futex: Introduce __vdso_robust_futex_unlock and __vdso_robust_pi_futex_try_unlock
From: Mathieu Desnoyers @ 2026-03-13 13:39 UTC (permalink / raw)
  To: André Almeida
  Cc: linux-kernel, Mathieu Desnoyers, Carlos O'Donell,
	Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
	Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
	Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
	linux-api

Fix a long standing data corruption race condition with robust futexes,
as pointed out here:

  "File corruption race condition in robust mutex unlocking"
  https://sourceware.org/bugzilla/show_bug.cgi?id=14485

The culprit of this issue is the fact that atomic instructions (at least
on x86) only take a single memory operand. The exchange and
compare-and-exchange populate the result (loaded value or success) into
registers or flags, which creates a window of instructions where the
kernel does not easily have access to that state. Therefore, handling of
process termination can incorrectly assume that the futex value state
still needs to be cleared, which can lead to corruption.

The approach taken to fix this is to introduce two vDSO and extend the
x86 vDSO exception table to track the relevant ip ranges: one for non-PI
robust futexes, and one for PI robust futexes.

The __vdso_robust_futex_unlock vDSO unlocks the robust futex by
exchanging the content of uaddr with 0 with a store-release
semantic. If the futex has waiters, it sets bit 1 of
*robust_list_head->list_op_pending, else it clears
*robust_list_head->list_op_pending. Those operations are within a code
region known by the kernel, making them safe with respect to
asynchronous program termination either from thread context or from a
nested signal handler.

Expected use of this vDSO:

if ((__vdso_robust_futex_unlock((u32 *) &mutex->__data.__lock, robust_list_head)
    & FUTEX_WAITERS) != 0)
        futex_wake((u32 *) &mutex->__data.__lock, 1, private);
WRITE_ONCE(robust_list_head->list_op_pending, 0);

Also introduce __vdso_robust_pi_futex_try_unlock to fix a similar
unlock race with robust PI futexes.

The __vdso_robust_pi_futex_try_unlock vDSO try to perform a
compare-and-exchange with release semantic to set the expected
*uaddr content to 0. If the futex has waiters, it fails, and
userspace needs to call futex_unlock_pi(). Before exiting the
critical section, if the cmpxchg fails, it sets bit 1 of
*robust_list_head->list_op_pending. If the cmpxchg succeeds, it
clears *@robust_list_head->list_op_pending. Those operations are
within a code region known by the kernel, making them safe with
respect to asynchronous program termination either from thread
context or from a nested signal handler.

Expected use of this vDSO:

int l = atomic_load_relaxed(&mutex->__data.__lock);
do {
        if (((l & FUTEX_WAITERS) != 0) || (l != READ_ONCE(pd->tid))) {
                futex_unlock_pi((unsigned int *) &mutex->__data.__lock, private);
                break;
        }
} while (!__vdso_robust_pi_futex_try_unlock(&mutex->__data.__lock,
                                            &l, robust_list_head));
WRITE_ONCE(robust_list_head->list_op_pending, 0);

The four kernel execution paths impacted by this change are:

  1) exit_robust_list/compat_exit_robust_list (process exit)
  2) setup_rt_frame (signal delivery)
  3) futex_wake
  4) futex_unlock_pi

Bit 1 of the robust_list_head->list_op_pending pointer is used to flag
whether there is either a pending wakeup or futex_unlock_pi action
(FUTEX_UADDR_NEED_ACTION). This allows extending the "need action" state
beyond the vDSO and lets the caller issue futex_wake and futex_unlock_pi
system calls. This "need action" flag is cleared by the caller when
zeroing robust_list_head->list_op_pending.

futex_wake now clears the robust_list_head->list_op_pending to close the
race between call to futex_wake and clearing of the
robust_list_head->list_op_pending by the application. This prevents
multiple calls to futex_wake in case a crash happens within that window.

futex_unlock_pi now clears the robust_list_head->list_op_pending
to close the race between call to futex_unlock_pi and
clearing of the robust_list_head->list_op_pending by the application.
This prevents multiple calls to futex_unlock_pi in case a crash happens
within that window.

[ This patch is lightly compiled tested on x86-64 only, submitted for feedback.
  It implements the vDSO for x86-32 and x86-64.
  It is based on v7.0-rc3. ]

Link: https://lore.kernel.org/lkml/20260311185409.1988269-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20260220202620.139584-1-andrealmeid@igalia.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "André Almeida" <andrealmeid@igalia.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Torvald Riegel <triegel@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Liam R . Howlett" <Liam.Howlett@oracle.com>
Cc: linux-api@vger.kernel.org
---
Changes since v3:
- Remove _u32 suffix.
- Remove "val" parameter (always 0 per robust futex ABI).

Changes since v2:
- Pass robust_list_head as vdso argument.
- Add "val" parameter to each vdso.
- Add _u32 suffix to each vdso.
- Introduce ARCH_HAS_VDSO_FUTEX to provide a futex_vdso_exception stub
  when not implemented by the architecture.
- Wire up x86 vdso32 vfutex.o.

Changes since v1:
- Remove unlock_store_done leftover code from handle_futex_death.
- Handle robust PI futexes.
---
 arch/Kconfig                        |   3 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/entry/vdso/common/vfutex.c |  90 +++++++++++++
 arch/x86/entry/vdso/extable.c       |  59 ++++++++-
 arch/x86/entry/vdso/extable.h       |  37 ++++--
 arch/x86/entry/vdso/vdso32/Makefile |   1 +
 arch/x86/entry/vdso/vdso32/vfutex.c |   1 +
 arch/x86/entry/vdso/vdso64/Makefile |   1 +
 arch/x86/entry/vdso/vdso64/vfutex.c |   1 +
 arch/x86/entry/vdso/vdso64/vsgx.S   |   2 +-
 arch/x86/include/asm/vdso.h         |   3 +
 arch/x86/kernel/signal.c            |   4 +
 include/linux/futex.h               |   1 +
 include/vdso/futex.h                |  72 +++++++++++
 kernel/futex/core.c                 | 188 ++++++++++++++++++++++++----
 kernel/futex/futex.h                |   2 +
 kernel/futex/pi.c                   |   3 +
 kernel/futex/waitwake.c             |   3 +
 18 files changed, 439 insertions(+), 33 deletions(-)
 create mode 100644 arch/x86/entry/vdso/common/vfutex.c
 create mode 100644 arch/x86/entry/vdso/vdso32/vfutex.c
 create mode 100644 arch/x86/entry/vdso/vdso64/vfutex.c
 create mode 100644 include/vdso/futex.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298e..4f3e1be29af1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1670,6 +1670,9 @@ config ARCH_HAS_VDSO_ARCH_DATA
 config ARCH_HAS_VDSO_TIME_DATA
 	bool
 
+config ARCH_HAS_VDSO_FUTEX
+	bool
+
 config HAVE_STATIC_CALL
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..957d5d9209a1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -111,6 +111,7 @@ config X86
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN
 	select ARCH_HAS_DEBUG_WX
+	select ARCH_HAS_VDSO_FUTEX
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
diff --git a/arch/x86/entry/vdso/common/vfutex.c b/arch/x86/entry/vdso/common/vfutex.c
new file mode 100644
index 000000000000..336095b04952
--- /dev/null
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2026 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+#include <linux/types.h>
+#include <linux/futex.h>
+#include <vdso/futex.h>
+#include "extable.h"
+
+#ifdef CONFIG_X86_64
+# define ASM_PTR_BIT_SET	"btsq "
+# define ASM_PTR_SET		"movq "
+#else
+# define ASM_PTR_BIT_SET	"btsl "
+# define ASM_PTR_SET		"movl "
+#endif
+
+u32 __vdso_robust_futex_unlock(u32 *uaddr, struct robust_list_head *robust_list_head)
+{
+	u32 val = 0;
+
+	/*
+	 * Within the ip range identified by the futex exception table,
+	 * the register "eax" contains the value loaded by xchg. This is
+	 * expected by futex_vdso_exception() to check whether waiters
+	 * need to be woken up. This register state is transferred to
+	 * bit 1 (NEED_ACTION) of *op_pending_addr before the ip range
+	 * ends.
+	 */
+	asm volatile (
+		_ASM_VDSO_EXTABLE_FUTEX_HANDLE(1f, 3f)
+		/* Exchange uaddr (store-release). */
+		"xchg %[uaddr], %[val]\n\t"
+		"1:\n\t"
+		/* Test if FUTEX_WAITERS (0x80000000) is set. */
+		"test %[val], %[val]\n\t"
+		"js 2f\n\t"
+		/* Clear *op_pending_addr if there are no waiters. */
+		ASM_PTR_SET "$0, %[op_pending_addr]\n\t"
+		"jmp 3f\n\t"
+		"2:\n\t"
+		/* Set bit 1 (NEED_ACTION) in *op_pending_addr. */
+		ASM_PTR_BIT_SET "$1, %[op_pending_addr]\n\t"
+		"3:\n\t"
+		: [val] "+a" (val),
+		  [uaddr] "+m" (*uaddr)
+		: [op_pending_addr] "m" (robust_list_head->list_op_pending)
+		: "memory"
+	);
+	return val;
+}
+
+u32 robust_futex_unlock(u32 *, struct robust_list_head *)
+	__attribute__((weak, alias("__vdso_robust_futex_unlock")));
+
+int __vdso_robust_pi_futex_try_unlock(u32 *uaddr, u32 *expected, struct robust_list_head *robust_list_head)
+{
+	u32 val = 0, orig, expect = *expected;
+
+	orig = expect;
+	/*
+	 * The ZF is set/cleared by cmpxchg and expected to stay
+	 * invariant for the rest of the code region.
+	 */
+	asm volatile (
+		_ASM_VDSO_EXTABLE_PI_FUTEX_HANDLE(1f, 3f)
+		/* Compare-and-exchange uaddr (store-release). Set/clear the ZF. */
+		"lock; cmpxchg %[val], %[uaddr]\n\t"
+		"1:\n\t"
+		/* Check whether cmpxchg fails. */
+		"jnz 2f\n\t"
+		/* Clear *op_pending_addr. */
+		ASM_PTR_SET "$0, %[op_pending_addr]\n\t"
+		"jmp 3f\n\t"
+		"2:\n\t"
+		/* Set bit 1 (NEED_ACTION) in *op_pending_addr. */
+		ASM_PTR_BIT_SET "$1, %[op_pending_addr]\n\t"
+		"3:\n\t"
+		: [expect] "+a" (expect),
+		  [uaddr] "+m" (*uaddr)
+		: [op_pending_addr] "m" (robust_list_head->list_op_pending),
+		  [val] "r" (val)
+		: "memory"
+	);
+	*expected = expect;
+	return expect == orig;
+}
+
+int robust_pi_futex_try_unlock(u32 *, u32 *, struct robust_list_head *)
+	__attribute__((weak, alias("__vdso_robust_pi_futex_try_unlock")));
diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index afcf5b65beef..90a31ffb9c6d 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -1,12 +1,27 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/err.h>
 #include <linux/mm.h>
+#include <linux/futex.h>
 #include <asm/current.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
 
+enum vdso_extable_entry_type {
+	VDSO_EXTABLE_ENTRY_FIXUP = 0,
+	VDSO_EXTABLE_ENTRY_FUTEX = 1,
+	VDSO_EXTABLE_ENTRY_PI_FUTEX = 2,
+};
+
 struct vdso_exception_table_entry {
-	int insn, fixup;
+	int type;	/* enum vdso_extable_entry_type */
+	union {
+		struct {
+			int insn, fixup_insn;
+		} fixup;
+		struct {
+			int start, end;
+		} futex;
+	};
 };
 
 bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
@@ -33,8 +48,10 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 	extable = image->extable;
 
 	for (i = 0; i < nr_entries; i++) {
-		if (regs->ip == base + extable[i].insn) {
-			regs->ip = base + extable[i].fixup;
+		if (extable[i].type != VDSO_EXTABLE_ENTRY_FIXUP)
+			continue;
+		if (regs->ip == base + extable[i].fixup.insn) {
+			regs->ip = base + extable[i].fixup.fixup_insn;
 			regs->di = trapnr;
 			regs->si = error_code;
 			regs->dx = fault_addr;
@@ -44,3 +61,39 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 
 	return false;
 }
+
+void futex_vdso_exception(struct pt_regs *regs,
+			  bool *_in_futex_vdso,
+			  bool *_need_action)
+{
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	const struct vdso_exception_table_entry *extable;
+	bool in_futex_vdso = false, need_action = false;
+	unsigned int nr_entries, i;
+	unsigned long base;
+
+	if (!current->mm->context.vdso)
+		goto end;
+
+	base = (unsigned long)current->mm->context.vdso + image->extable_base;
+	nr_entries = image->extable_len / (sizeof(*extable));
+	extable = image->extable;
+
+	for (i = 0; i < nr_entries; i++) {
+		if (extable[i].type != VDSO_EXTABLE_ENTRY_FUTEX &&
+		    extable[i].type != VDSO_EXTABLE_ENTRY_PI_FUTEX)
+			continue;
+		if (regs->ip >= base + extable[i].futex.start &&
+		    regs->ip < base + extable[i].futex.end) {
+			in_futex_vdso = true;
+			if (extable[i].type == VDSO_EXTABLE_ENTRY_FUTEX)
+				need_action = (regs->ax & FUTEX_WAITERS);
+			else
+				need_action = !(regs->flags & X86_EFLAGS_ZF);
+			break;
+		}
+	}
+end:
+	*_in_futex_vdso = in_futex_vdso;
+	*_need_action = need_action;
+}
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
index baba612b832c..5dfbde724065 100644
--- a/arch/x86/entry/vdso/extable.h
+++ b/arch/x86/entry/vdso/extable.h
@@ -8,21 +8,44 @@
  * exception table, not each individual entry.
  */
 #ifdef __ASSEMBLER__
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
-	ASM_VDSO_EXTABLE_HANDLE from to
+#define _ASM_VDSO_EXTABLE_FIXUP_HANDLE(from, to)	\
+	ASM_VDSO_EXTABLE_FIXUP_HANDLE from to
 
-.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
+.macro ASM_VDSO_EXTABLE_FIXUP_HANDLE from:req to:req
 	.pushsection __ex_table, "a"
+	.long 0		/* type: fixup */
 	.long (\from) - __ex_table
 	.long (\to) - __ex_table
 	.popsection
 .endm
 #else
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
-	".pushsection __ex_table, \"a\"\n"      \
-	".long (" #from ") - __ex_table\n"      \
-	".long (" #to ") - __ex_table\n"        \
+#define _ASM_VDSO_EXTABLE_FIXUP_HANDLE(from, to)	\
+	".pushsection __ex_table, \"a\"\n"		\
+	".long 0\n"	/* type: fixup */		\
+	".long (" #from ") - __ex_table\n"		\
+	".long (" #to ") - __ex_table\n"		\
 	".popsection\n"
+
+/*
+ * Identify robust futex unlock critical section.
+ */
+#define _ASM_VDSO_EXTABLE_FUTEX_HANDLE(start, end)	\
+	".pushsection __ex_table, \"a\"\n"		\
+	".long 1\n"	/* type: futex */		\
+	".long (" #start ") - __ex_table\n"		\
+	".long (" #end ") - __ex_table\n"		\
+	".popsection\n"
+
+/*
+ * Identify robust PI futex unlock critical section.
+ */
+#define _ASM_VDSO_EXTABLE_PI_FUTEX_HANDLE(start, end)	\
+	".pushsection __ex_table, \"a\"\n"		\
+	".long 2\n"	/* type: pi_futex */		\
+	".long (" #start ") - __ex_table\n"		\
+	".long (" #end ") - __ex_table\n"		\
+	".popsection\n"
+
 #endif
 
 #endif /* __VDSO_EXTABLE_H */
diff --git a/arch/x86/entry/vdso/vdso32/Makefile b/arch/x86/entry/vdso/vdso32/Makefile
index add6afb484ba..acf4f990be98 100644
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -9,6 +9,7 @@ vdsos-y			:= 32
 # Files to link into the vDSO:
 vobjs-y			:= note.o vclock_gettime.o vgetcpu.o
 vobjs-y			+= system_call.o sigreturn.o
+vobjs-y			+= vfutex.o
 
 # Compilation flags
 flags-y			:= -DBUILD_VDSO32 -m32 -mregparm=0
diff --git a/arch/x86/entry/vdso/vdso32/vfutex.c b/arch/x86/entry/vdso/vdso32/vfutex.c
new file mode 100644
index 000000000000..940a6ee30026
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
diff --git a/arch/x86/entry/vdso/vdso64/Makefile b/arch/x86/entry/vdso/vdso64/Makefile
index bfffaf1aeecc..df53c2d0037d 100644
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -10,6 +10,7 @@ vdsos-$(CONFIG_X86_X32_ABI)	+= x32
 # Files to link into the vDSO:
 vobjs-y				:= note.o vclock_gettime.o vgetcpu.o
 vobjs-y				+= vgetrandom.o vgetrandom-chacha.o
+vobjs-y				+= vfutex.o
 vobjs-$(CONFIG_X86_SGX)		+= vsgx.o
 
 # Compilation flags
diff --git a/arch/x86/entry/vdso/vdso64/vfutex.c b/arch/x86/entry/vdso/vdso64/vfutex.c
new file mode 100644
index 000000000000..940a6ee30026
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso64/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
diff --git a/arch/x86/entry/vdso/vdso64/vsgx.S b/arch/x86/entry/vdso/vdso64/vsgx.S
index 37a3d4c02366..0ea5a1ebd455 100644
--- a/arch/x86/entry/vdso/vdso64/vsgx.S
+++ b/arch/x86/entry/vdso/vdso64/vsgx.S
@@ -145,6 +145,6 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
 
 	.cfi_endproc
 
-_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
+_ASM_VDSO_EXTABLE_FIXUP_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
 
 SYM_FUNC_END(__vdso_sgx_enter_enclave)
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index e8afbe9faa5b..9ac7af34cdc4 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -38,6 +38,9 @@ extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
 extern bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 				 unsigned long error_code,
 				 unsigned long fault_addr);
+extern void futex_vdso_exception(struct pt_regs *regs,
+				 bool *in_futex_vdso,
+				 bool *need_action);
 #endif /* __ASSEMBLER__ */
 
 #endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 2404233336ab..c2e4db89f16d 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -28,6 +28,7 @@
 #include <linux/entry-common.h>
 #include <linux/syscalls.h>
 #include <linux/rseq.h>
+#include <linux/futex.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -235,6 +236,9 @@ unsigned long get_sigframe_size(void)
 static int
 setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 {
+	/* Handle futex robust list fixup. */
+	futex_signal_deliver(ksig, regs);
+
 	/* Perform fixup for the pre-signal frame. */
 	rseq_signal_deliver(ksig, regs);
 
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 9e9750f04980..6c274c79e176 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,6 +81,7 @@ void futex_exec_release(struct task_struct *tsk);
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+void futex_signal_deliver(struct ksignal *ksig, struct pt_regs *regs);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
diff --git a/include/vdso/futex.h b/include/vdso/futex.h
new file mode 100644
index 000000000000..757623b99250
--- /dev/null
+++ b/include/vdso/futex.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#ifndef _VDSO_FUTEX_H
+#define _VDSO_FUTEX_H
+
+#include <linux/types.h>
+#include <linux/futex.h>
+
+/**
+ * __vdso_robust_futex_unlock - Architecture-specific vDSO implementation of robust futex unlock.
+ * @uaddr:		Lock address (points to a 32-bit unsigned integer type).
+ * @robust_list_head:	The thread-specific robust list that has been registered with set_robust_list.
+ *
+ * This vDSO unlocks the robust futex by exchanging the content of
+ * *@uaddr with 0 with a store-release semantic. If the futex has
+ * waiters, it sets bit 1 of *@robust_list_head->list_op_pending, else
+ * it clears *@robust_list_head->list_op_pending. Those operations are
+ * within a code region known by the kernel, making them safe with
+ * respect to asynchronous program termination either from thread
+ * context or from a nested signal handler.
+ *
+ * Returns:	The old value present at *@uaddr.
+ *
+ * Expected use of this vDSO:
+ *
+ * robust_list_head is the thread-specific robust list that has been
+ * registered with set_robust_list.
+ *
+ * if ((__vdso_robust_futex_unlock((u32 *) &mutex->__data.__lock, robust_list_head)
+ *     & FUTEX_WAITERS) != 0)
+ *         futex_wake((u32 *) &mutex->__data.__lock, 1, private);
+ * WRITE_ONCE(robust_list_head->list_op_pending, 0);
+ */
+extern u32 __vdso_robust_futex_unlock(u32 *uaddr, struct robust_list_head *robust_list_head);
+
+/*
+ * __vdso_robust_pi_futex_try_unlock - Architecture-specific vDSO implementation of robust PI futex unlock.
+ * @uaddr:		Lock address (points to a 32-bit unsigned integer type).
+ * @expected:		Expected value (in), value loaded by compare-and-exchange (out).
+ * @robust_list_head:	The thread-specific robust list that has been registered with set_robust_list.
+ *
+ * This vDSO try to perform a compare-and-exchange with release semantic
+ * to set the expected *@uaddr content to 0. If the futex has
+ * waiters, it fails, and userspace needs to call futex_unlock_pi().
+ * Before exiting the critical section, if the cmpxchg fails, it sets
+ * bit 1 of *@robust_list_head->list_op_pending. If the cmpxchg
+ * succeeds, it clears *@robust_list_head->list_op_pending. Those
+ * operations are within a code region known by the kernel, making them
+ * safe with respect to asynchronous program termination either from
+ * thread context or from a nested signal handler.
+ *
+ * Returns:	Zero if the operation fails to release the lock, non-zero on success.
+ *
+ * Expected use of this vDSO:
+ *
+ *
+ * int l = atomic_load_relaxed(&mutex->__data.__lock);
+ * do {
+ *         if (((l & FUTEX_WAITERS) != 0) || (l != READ_ONCE(pd->tid))) {
+ *                 futex_unlock_pi((unsigned int *) &mutex->__data.__lock, private);
+ *                 break;
+ *         }
+ * } while (!__vdso_robust_pi_futex_try_unlock(&mutex->__data.__lock,
+ *                                             &l, robust_list_head));
+ * WRITE_ONCE(robust_list_head->list_op_pending, 0);
+ */
+int __vdso_robust_pi_futex_try_unlock(u32 *uaddr, u32 *expected, struct robust_list_head *robust_list_head);
+
+#endif /* _VDSO_FUTEX_H */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cf7e610eac42..28bcbe6156ee 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -48,6 +48,10 @@
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
 
+#define FUTEX_UADDR_PI		(1UL << 0)
+#define FUTEX_UADDR_NEED_ACTION	(1UL << 1)
+#define FUTEX_UADDR_MASK	(~(FUTEX_UADDR_PI | FUTEX_UADDR_NEED_ACTION))
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in futex_hash()), so ensure that they
@@ -1004,6 +1008,118 @@ void futex_unqueue_pi(struct futex_q *q)
 	q->pi_state = NULL;
 }
 
+#ifndef CONFIG_ARCH_HAS_VDSO_FUTEX
+static void futex_vdso_exception(struct pt_regs *regs, bool *in_futex_vdso, bool *need_action)
+{
+	*in_futex_vdso = false;
+	*need_action = false;
+}
+#endif
+
+/*
+ * Transfer the need action state from vDSO stack to the
+ * FUTEX_UADDR_NEED_ACTION list_op_pending bit so it's observed if the
+ * program is terminated while executing the signal handler.
+ */
+static void signal_delivery_fixup_robust_list(struct task_struct *curr, struct pt_regs *regs)
+{
+	struct robust_list_head __user *head = curr->robust_list;
+	bool in_futex_vdso, need_action;
+	unsigned long pending;
+
+	if (!head)
+		return;
+	futex_vdso_exception(regs, &in_futex_vdso, &need_action);
+	if (!in_futex_vdso)
+		return;
+
+	if (need_action) {
+		if (get_user(pending, (unsigned long __user *)&head->list_op_pending))
+			goto fault;
+		pending |= FUTEX_UADDR_NEED_ACTION;
+		if (put_user(pending, (unsigned long __user *)&head->list_op_pending))
+			goto fault;
+	} else {
+		if (put_user(0UL, (unsigned long __user *)&head->list_op_pending))
+			goto fault;
+	}
+	return;
+fault:
+	force_sig(SIGSEGV);
+}
+
+#ifdef CONFIG_COMPAT
+static void compat_signal_delivery_fixup_robust_list(struct task_struct *curr, struct pt_regs *regs)
+{
+	struct compat_robust_list_head __user *head = curr->compat_robust_list;
+	bool in_futex_vdso, need_action;
+	unsigned int pending;
+
+	if (!head)
+		return;
+	futex_vdso_exception(regs, &in_futex_vdso, &need_action);
+	if (!in_futex_vdso)
+		return;
+	if (need_action) {
+		if (get_user(pending, (compat_uptr_t __user *)&head->list_op_pending))
+			goto fault;
+		pending |= FUTEX_UADDR_NEED_ACTION;
+		if (put_user(pending, (compat_uptr_t __user *)&head->list_op_pending))
+			goto fault;
+	} else {
+		if (put_user(0U, (compat_uptr_t __user *)&head->list_op_pending))
+			goto fault;
+	}
+	return;
+fault:
+	force_sig(SIGSEGV);
+}
+#endif
+
+void futex_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+
+	if (unlikely(tsk->robust_list))
+		signal_delivery_fixup_robust_list(tsk, regs);
+#ifdef CONFIG_COMPAT
+	if (unlikely(tsk->compat_robust_list))
+		compat_signal_delivery_fixup_robust_list(tsk, regs);
+#endif
+}
+
+static void do_clear_robust_list_pending_op(struct task_struct *curr)
+{
+	struct robust_list_head __user *head = curr->robust_list;
+
+	if (!head)
+		return;
+	if (put_user(0UL, (unsigned long __user *)&head->list_op_pending))
+		force_sig(SIGSEGV);
+}
+
+#ifdef CONFIG_COMPAT
+static void do_compat_clear_robust_list_pending_op(struct task_struct *curr)
+{
+	struct robust_list_head __user *head = curr->robust_list;
+
+	if (!head)
+		return;
+	if (put_user(0U, (unsigned int __user *)&head->list_op_pending))
+		force_sig(SIGSEGV);
+}
+#endif
+
+void clear_robust_list_pending_op(struct task_struct *curr)
+{
+	if (unlikely(curr->robust_list))
+		do_clear_robust_list_pending_op(curr);
+#ifdef CONFIG_COMPAT
+	if (unlikely(curr->compat_robust_list))
+		do_compat_clear_robust_list_pending_op(curr);
+#endif
+}
+
 /* Constants for the pending_op argument of handle_futex_death */
 #define HANDLE_DEATH_PENDING	true
 #define HANDLE_DEATH_LIST	false
@@ -1013,12 +1129,34 @@ void futex_unqueue_pi(struct futex_q *q)
  * dying task, and do notification if so:
  */
 static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
-			      bool pi, bool pending_op)
+			      bool pi, bool pending_op, bool need_action)
 {
 	u32 uval, nval, mval;
 	pid_t owner;
 	int err;
 
+	/*
+	 * Process dies after the store unlocking futex, before clearing
+	 * the pending ops. Perform the required action if needed.
+	 * For non-PI futex, the action is to wake up the waiter.
+	 * For PI futex, the action is to call robust_unlock_pi.
+	 * Prevent storing to the futex after it was unlocked.
+	 */
+	if (pending_op) {
+		bool in_futex_vdso, vdso_need_action;
+
+		futex_vdso_exception(task_pt_regs(curr), &in_futex_vdso, &vdso_need_action);
+		if (need_action || vdso_need_action) {
+			if (pi)
+				futex_unlock_pi(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED);
+			else
+				futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+					   FUTEX_BITSET_MATCH_ANY);
+		}
+		if (need_action || in_futex_vdso)
+			return 0;
+	}
+
 	/* Futex address must be 32bit aligned */
 	if ((((unsigned long)uaddr) % sizeof(*uaddr)) != 0)
 		return -1;
@@ -1128,19 +1266,23 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
 }
 
 /*
- * Fetch a robust-list pointer. Bit 0 signals PI futexes:
+ * Fetch a robust-list pointer. Bit 0 signals PI futexes, bit 1 signals
+ * need action:
  */
 static inline int fetch_robust_entry(struct robust_list __user **entry,
 				     struct robust_list __user * __user *head,
-				     unsigned int *pi)
+				     unsigned int *pi,
+				     unsigned int *need_action)
 {
 	unsigned long uentry;
 
 	if (get_user(uentry, (unsigned long __user *)head))
 		return -EFAULT;
 
-	*entry = (void __user *)(uentry & ~1UL);
-	*pi = uentry & 1;
+	*entry = (void __user *)(uentry & FUTEX_UADDR_MASK);
+	*pi = uentry & FUTEX_UADDR_PI;
+	if (need_action)
+		*need_action = uentry & FUTEX_UADDR_NEED_ACTION;
 
 	return 0;
 }
@@ -1155,7 +1297,7 @@ static void exit_robust_list(struct task_struct *curr)
 {
 	struct robust_list_head __user *head = curr->robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
-	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
+	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip, need_action;
 	unsigned int next_pi;
 	unsigned long futex_offset;
 	int rc;
@@ -1164,7 +1306,7 @@ static void exit_robust_list(struct task_struct *curr)
 	 * Fetch the list head (which was registered earlier, via
 	 * sys_set_robust_list()):
 	 */
-	if (fetch_robust_entry(&entry, &head->list.next, &pi))
+	if (fetch_robust_entry(&entry, &head->list.next, &pi, NULL))
 		return;
 	/*
 	 * Fetch the relative futex offset:
@@ -1175,7 +1317,7 @@ static void exit_robust_list(struct task_struct *curr)
 	 * Fetch any possibly pending lock-add first, and handle it
 	 * if it exists:
 	 */
-	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
+	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip, &need_action))
 		return;
 
 	next_entry = NULL;	/* avoid warning with gcc */
@@ -1184,14 +1326,14 @@ static void exit_robust_list(struct task_struct *curr)
 		 * Fetch the next entry in the list before calling
 		 * handle_futex_death:
 		 */
-		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
+		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi, NULL);
 		/*
 		 * A pending lock might already be on the list, so
 		 * don't process it twice:
 		 */
 		if (entry != pending) {
 			if (handle_futex_death((void __user *)entry + futex_offset,
-						curr, pi, HANDLE_DEATH_LIST))
+						curr, pi, HANDLE_DEATH_LIST, false))
 				return;
 		}
 		if (rc)
@@ -1209,7 +1351,7 @@ static void exit_robust_list(struct task_struct *curr)
 
 	if (pending) {
 		handle_futex_death((void __user *)pending + futex_offset,
-				   curr, pip, HANDLE_DEATH_PENDING);
+				   curr, pip, HANDLE_DEATH_PENDING, need_action);
 	}
 }
 
@@ -1224,17 +1366,20 @@ static void __user *futex_uaddr(struct robust_list __user *entry,
 }
 
 /*
- * Fetch a robust-list pointer. Bit 0 signals PI futexes:
+ * Fetch a robust-list pointer. Bit 0 signals PI futexes, bit 1 signals
+ * need action:
  */
 static inline int
 compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry,
-		   compat_uptr_t __user *head, unsigned int *pi)
+		   compat_uptr_t __user *head, unsigned int *pi, unsigned int *need_action)
 {
 	if (get_user(*uentry, head))
 		return -EFAULT;
 
-	*entry = compat_ptr((*uentry) & ~1);
-	*pi = (unsigned int)(*uentry) & 1;
+	*entry = compat_ptr((*uentry) & FUTEX_UADDR_MASK);
+	*pi = (unsigned int)(*uentry) & FUTEX_UADDR_PI;
+	if (need_action)
+		*need_action = (unsigned int)(*uentry) & FUTEX_UADDR_NEED_ACTION;
 
 	return 0;
 }
@@ -1249,7 +1394,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 {
 	struct compat_robust_list_head __user *head = curr->compat_robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
-	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
+	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip, need_action;
 	unsigned int next_pi;
 	compat_uptr_t uentry, next_uentry, upending;
 	compat_long_t futex_offset;
@@ -1259,7 +1404,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	 * Fetch the list head (which was registered earlier, via
 	 * sys_set_robust_list()):
 	 */
-	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi))
+	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi, NULL))
 		return;
 	/*
 	 * Fetch the relative futex offset:
@@ -1271,7 +1416,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	 * if it exists:
 	 */
 	if (compat_fetch_robust_entry(&upending, &pending,
-			       &head->list_op_pending, &pip))
+			       &head->list_op_pending, &pip, &need_action))
 		return;
 
 	next_entry = NULL;	/* avoid warning with gcc */
@@ -1281,7 +1426,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 		 * handle_futex_death:
 		 */
 		rc = compat_fetch_robust_entry(&next_uentry, &next_entry,
-			(compat_uptr_t __user *)&entry->next, &next_pi);
+			(compat_uptr_t __user *)&entry->next, &next_pi, NULL);
 		/*
 		 * A pending lock might already be on the list, so
 		 * dont process it twice:
@@ -1289,8 +1434,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 		if (entry != pending) {
 			void __user *uaddr = futex_uaddr(entry, futex_offset);
 
-			if (handle_futex_death(uaddr, curr, pi,
-					       HANDLE_DEATH_LIST))
+			if (handle_futex_death(uaddr, curr, pi, HANDLE_DEATH_LIST, false))
 				return;
 		}
 		if (rc)
@@ -1309,7 +1453,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	if (pending) {
 		void __user *uaddr = futex_uaddr(pending, futex_offset);
 
-		handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
+		handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING, need_action);
 	}
 }
 #endif
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 30c2afa03889..f64ed00463ca 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -396,6 +396,8 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 		spin_unlock(&hb2->lock);
 }
 
+extern void clear_robust_list_pending_op(struct task_struct *curr);
+
 /* syscalls */
 
 extern int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, u32
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index bc1f7e83a37e..3b889dfbcdd5 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1148,6 +1148,9 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	if ((uval & FUTEX_TID_MASK) != vpid)
 		return -EPERM;
 
+	/* Clear the pending_op_list. */
+	clear_robust_list_pending_op(current);
+
 	ret = get_futex_key(uaddr, flags, &key, FUTEX_WRITE);
 	if (ret)
 		return ret;
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 1c2dd03f11ec..7752ed8c6dc1 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -162,6 +162,9 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	if (!bitset)
 		return -EINVAL;
 
+	/* Clear the pending_op_list. */
+	clear_robust_list_pending_op(current);
+
 	ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
 	if (unlikely(ret != 0))
 		return ret;
-- 
2.39.5


^ permalink raw reply related

* [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Roman Bakshansky @ 2026-03-12 21:01 UTC (permalink / raw)
  To: linux-api; +Cc: linux-kernel, audit, libc-alpha

Hi all,

I'd like to share a draft RFC proposing a complete overhaul of the legacy
binary logs used for authentication auditing in Linux: lastlog, btmp, utmp,
and wtmp.

These files, designed decades ago, are running into fundamental limitations:

- Y2038 problem - they use 32-bit timestamps (time_t in lastlog,
   tv_sec in utmpx). Even on 64-bit systems the fields remain 32-bit
   due to ABI constraints, so all Linux systems are affected.
- No extensibility - any new field (e.g., container ID, service name,
   source IP) requires changing fixed structures, breaking all existing
   tools that read them.
- Poor query performance - tools like last, lastb, who have to
   scan whole files linearly; with millions of records this becomes
   painfully slow.
- No atomicity - partial writes during a crash can corrupt logs.
- Concurrency bottlenecks - multiple writers (sshd, login, etc.)
   contend for the same file with coarse locking.

To address this once and for all, the RFC proposes replacing these logs
with dedicated shared libraries that use SQLite as the storage backend:

- liblastlog2 - last login time
- libbtmp2    - failed login attempts
- libutmp2    - current sessions
- libwtmp2    - login/logout history

SQLite brings:
- 64-bit time -> Y2038 solved forever.
- Indexes -> O(log N) queries instead of full scans.
- Extensible schema -> new fields can be added without breaking old tools.
- ACID and WAL mode -> atomic writes and concurrent access.
- Portability - runs on any Linux system, no systemd dependency.

The full RFC, including preliminary database schemas and API drafts,
is available in the discussion repository:

     https://github.com/bakshansky/linux-auth-logs

I'm looking for feedback on the overall direction, the proposed
interfaces, and the open questions listed in the document (e.g.,
library naming, database location, fallback options for embedded
systems). Please use GitHub Issues for comments, or reply to this
thread - I'll monitor both.

Thanks for your time and input!

^ permalink raw reply

* Re: [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
From: André Almeida @ 2026-03-12 13:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
	Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
	Davidlohr Bueso, Arnd Bergmann, Mathieu Desnoyers,
	Liam R . Howlett, kernel-dev, linux-api, linux-kernel
In-Reply-To: <20260312090445.4Zabebfp@linutronix.de>

Em 12/03/2026 06:04, Sebastian Andrzej Siewior escreveu:
> On 2026-02-20 17:26:19 [-0300], André Almeida wrote:
>> --- /dev/null
>> +++ b/robust_bug.c
> …
>> +	new->value = ((uint64_t) value << 32) + value;
>> +
>> +	/* Create a backup of the current value */
>> +	original_val = new->value;
> 
> Now that I finally got it and I might have understood the issue.
> 
> You exit before unlocking the futex. You free this block and this new
> memory (address) is the same as the old one. Your corruption comes from
> the fact that the old content is the same as the new content.
> 
> If the thread does unlock in userland (or kernel) but the lock remains
> on the robust_list while it gets killed then the kernel will attempt to
> unlock the lock. But this requires that the futex value matches the
> value.
> So if it is unlocked (0x0) or used again then nothing happens. Unless
> the new memory gets the same value assigned as the pid value by
> accident. Then it gets changed…
> 
> If the unlock did not happen and is still owned by the thread, that is
> killed, then the "fixup" here is the right thing to do. The memory
> should not be free()ed because the lock was still owned by the thread.
> The misunderstanding here might be "once the thread is gone, the lock is
> free we can throw away the memory". At the very least, it was a locked
> mutex and I think pthread_mutex_destroy() would complain here.
> 
> So is the issue here that the "new" value is the same as the "old" value
> and the robust-death-handle part in the kernel does its job? Or did I
> over simplify something?
> Let me continue with the thread…
> 

Yes, this is exactly what I understood as well.

User thread A releases the lock, but exits before setting op_pending = 
NULL. Thread B can free the lock after using it, and by chance needs to 
use the same value as the PID in the same memory. Then thread A do the 
robust list handle inside the kernel and the corruption happens.

^ permalink raw reply

* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-03-12  9:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
	linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrVMF3VBr0cuEYOg-M_u+hX77Jfdujv3ZMtLGCzHgOcsGA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]

On 2026-03-11, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Mar 10, 2026 at 9:49 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2026-03-09, Christian Brauner <brauner@kernel.org> wrote:
> > > > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > > > I think this needs more clarification as to what "regular" means,
> > > > > > since S_IFREG may not be sufficient.  The UAPI group page says:
> > > > > >
> > > > > > Use-Case: this would be very useful to write secure programs that want
> > > > > > to avoid being tricked into opening device nodes with special
> > > > > > semantics while thinking they operate on regular files. This is
> > > > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > > > implementation of a naive web browser which is pointed to
> > > > > > file://dev/zero, not expecting an endless amount of data to read.
> > > > > >
> > > > > > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > > > > > where that fd is a memfd?  What about files backed by non-"fast" disk
> > > > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > > > >
> > > > > > Are we concerned about blocking open?  (open blocks as a matter of
> > > > > > course.)  Are we concerned about open having strange side effects?
> > > > > > Are we concerned about write having strange side effects?  Are we
> > > > > > concerned about cases where opening the file as root results in
> > > > > > elevated privilege beyond merely gaining the ability to write to that
> > > > > > specific path on an ordinary filesystem?
> > >
> > > I think this is opening up a barrage of question that I'm not sure are
> > > all that useful. The ability to only open regular file isn't intended to
> > > defend against hung FUSE or NFS servers or other random Linux
> > > special-sauce murder-suicide file descriptor traps. For a lot of those
> > > we have O_PATH which can easily function with the new extension. A lot
> > > of the other special-sauce files (most anonymous inode fds) cannot even
> > > be reopened via e.g., /proc.
> >
> > Indeed, I see OPENAT2_REGULAR as a way of optimising the tedious checks
> > that userspace does using O_PATH+/proc/self/fd/$n re-opening when
> > dealing with regular files.
> 
> Can you give a brief decription or a link to what these checks are and
> what problem they solve?

There are a few variations, but in this particular case they are just
doing fstat() and then checking whether the file is a regular file
(i.e., S_IFREG) or not.

A container rootfs can contain arbitrary files (because container images
are just tar archives, usually with no restrictions on inodes -- a fair
few container runtimes assume that the devices cgroup is sufficient to
protect against the container overwriting your rootfs). The S_IFREG
check avoids an administrative process from being tricked into opening a
block device or an endlessly-streaming FIFO -- if you also use
RESOLVE_NO_XDEV you can also make sure that you don't land on procfs or
sysfs by accident.

I will say that in a previous version of this patchset I said that I
would prefer this be done with an allow-bitmask of S_IFMT rather than a
single O_REGULAR toggle -- this would allow for usecases such as "only
open a regular file or directory" (inode_type_can_chattr() from systemd
is a practical example of this kind of usage) or "anything except for
block/char devices", but the definition of S_IFBLK (S_IFCHR|S_IFDIR)
makes this a little too ugly. :/

-- 
Aleksa Sarai
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
From: Sebastian Andrzej Siewior @ 2026-03-12  9:04 UTC (permalink / raw)
  To: André Almeida
  Cc: Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
	Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
	Davidlohr Bueso, Arnd Bergmann, Mathieu Desnoyers,
	Liam R . Howlett, kernel-dev, linux-api, linux-kernel
In-Reply-To: <20260220202620.139584-2-andrealmeid@igalia.com>

On 2026-02-20 17:26:19 [-0300], André Almeida wrote:
> --- /dev/null
> +++ b/robust_bug.c
…
> +	new->value = ((uint64_t) value << 32) + value;
> +
> +	/* Create a backup of the current value */
> +	original_val = new->value;

Now that I finally got it and I might have understood the issue.

You exit before unlocking the futex. You free this block and this new
memory (address) is the same as the old one. Your corruption comes from
the fact that the old content is the same as the new content.

If the thread does unlock in userland (or kernel) but the lock remains
on the robust_list while it gets killed then the kernel will attempt to
unlock the lock. But this requires that the futex value matches the
value.
So if it is unlocked (0x0) or used again then nothing happens. Unless
the new memory gets the same value assigned as the pid value by
accident. Then it gets changed…

If the unlock did not happen and is still owned by the thread, that is
killed, then the "fixup" here is the right thing to do. The memory
should not be free()ed because the lock was still owned by the thread.
The misunderstanding here might be "once the thread is gone, the lock is
free we can throw away the memory". At the very least, it was a locked
mutex and I think pthread_mutex_destroy() would complain here.

So is the issue here that the "new" value is the same as the "old" value
and the robust-death-handle part in the kernel does its job? Or did I
over simplify something?
Let me continue with the thread…

Sebastian

^ permalink raw reply

* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Eric Dumazet @ 2026-03-12  1:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Wesley Atwell, Simon Baatz, davem, pabeni, ncardwell, dsahern,
	matttbe, martineau, netdev, mptcp, kuniyu, horms, geliang, corbet,
	skhan, rostedt, mhiramat, mathieu.desnoyers, 0x7f454c46,
	linux-doc, linux-trace-kernel, linux-kselftest, linux-kernel,
	linux-api
In-Reply-To: <20260311174154.5fadb207@kernel.org>

On Thu, Mar 12, 2026 at 1:41 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 11 Mar 2026 09:34:32 +0100 Eric Dumazet wrote:
> > Your series will heavily conflict with Simon's one
> >
> > https://patchwork.kernel.org/project/netdevbpf/list/?series=1063486&state=%2A&archive=both
> >
> > I suggest you rebase/retest/resend after we merge it.
>
> Would it make sense to extend netdevsim and packetdrill to be able to
> exercise scaling ratio a little more? Having it optionally clone the
> skb and truesize += X would be trivial. IDK how many bugs this would
> let us catch tho :(

Yes, I think we mentioned this at some point.
packetdrill uses tun device.
Adding a TUN ioctl() to control how many additional bytes are added to
skb->truesize after tun allocates an skb is doable.

^ permalink raw reply

* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Jakub Kicinski @ 2026-03-12  0:43 UTC (permalink / raw)
  To: Wesley Atwell
  Cc: davem, pabeni, edumazet, ncardwell, dsahern, matttbe, martineau,
	netdev, mptcp, kuniyu, horms, geliang, corbet, skhan, rostedt,
	mhiramat, mathieu.desnoyers, 0x7f454c46, linux-doc,
	linux-trace-kernel, linux-kselftest, linux-kernel, linux-api
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

On Wed, 11 Mar 2026 01:55:53 -0600 Wesley Atwell wrote:
> Subject: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions

when you repost please make sure you use "PATCH net-next v2" 
as the tag / prefix. "net" is a tree we use to fast track fixes.

^ permalink raw reply

* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Jakub Kicinski @ 2026-03-12  0:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Wesley Atwell, Simon Baatz, davem, pabeni, ncardwell, dsahern,
	matttbe, martineau, netdev, mptcp, kuniyu, horms, geliang, corbet,
	skhan, rostedt, mhiramat, mathieu.desnoyers, 0x7f454c46,
	linux-doc, linux-trace-kernel, linux-kselftest, linux-kernel,
	linux-api
In-Reply-To: <CANn89i+dojcg=TDh6E1++g_TM7qdcpnyu47n2Q9DRW_w73TjzA@mail.gmail.com>

On Wed, 11 Mar 2026 09:34:32 +0100 Eric Dumazet wrote:
> Your series will heavily conflict with Simon's one
> 
> https://patchwork.kernel.org/project/netdevbpf/list/?series=1063486&state=%2A&archive=both
> 
> I suggest you rebase/retest/resend after we merge it.

Would it make sense to extend netdevsim and packetdrill to be able to
exercise scaling ratio a little more? Having it optionally clone the
skb and truesize += X would be trivial. IDK how many bugs this would
let us catch tho :(

^ permalink raw reply

* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-11 16:10 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
	linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-03-11-regular-sore-census-shops-DqYcUT@cyphar.com>

On Tue, Mar 10, 2026 at 9:49 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2026-03-09, Christian Brauner <brauner@kernel.org> wrote:
> > > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > > I think this needs more clarification as to what "regular" means,
> > > > > since S_IFREG may not be sufficient.  The UAPI group page says:
> > > > >
> > > > > Use-Case: this would be very useful to write secure programs that want
> > > > > to avoid being tricked into opening device nodes with special
> > > > > semantics while thinking they operate on regular files. This is
> > > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > > implementation of a naive web browser which is pointed to
> > > > > file://dev/zero, not expecting an endless amount of data to read.
> > > > >
> > > > > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > > > > where that fd is a memfd?  What about files backed by non-"fast" disk
> > > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > > >
> > > > > Are we concerned about blocking open?  (open blocks as a matter of
> > > > > course.)  Are we concerned about open having strange side effects?
> > > > > Are we concerned about write having strange side effects?  Are we
> > > > > concerned about cases where opening the file as root results in
> > > > > elevated privilege beyond merely gaining the ability to write to that
> > > > > specific path on an ordinary filesystem?
> >
> > I think this is opening up a barrage of question that I'm not sure are
> > all that useful. The ability to only open regular file isn't intended to
> > defend against hung FUSE or NFS servers or other random Linux
> > special-sauce murder-suicide file descriptor traps. For a lot of those
> > we have O_PATH which can easily function with the new extension. A lot
> > of the other special-sauce files (most anonymous inode fds) cannot even
> > be reopened via e.g., /proc.
>
> Indeed, I see OPENAT2_REGULAR as a way of optimising the tedious checks
> that userspace does using O_PATH+/proc/self/fd/$n re-opening when
> dealing with regular files.

Can you give a brief decription or a link to what these checks are and
what problem they solve?

--Andy

^ permalink raw reply

* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Eric Dumazet @ 2026-03-11  8:34 UTC (permalink / raw)
  To: Wesley Atwell, Simon Baatz
  Cc: davem, kuba, pabeni, ncardwell, dsahern, matttbe, martineau,
	netdev, mptcp, kuniyu, horms, geliang, corbet, skhan, rostedt,
	mhiramat, mathieu.desnoyers, 0x7f454c46, linux-doc,
	linux-trace-kernel, linux-kselftest, linux-kernel, linux-api
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

On Wed, Mar 11, 2026 at 8:56 AM Wesley Atwell <atwellwea@gmail.com> wrote:
>
> This series keeps sender-visible TCP receive-window accounting tied to the
> scaling basis that was in force when the window was advertised.
>
> Problem
> -------
>
> `tp->rcv_wnd` is an advertised promise to the sender, but later
> receive-memory admission and clamping could reconstruct that promise
> through the mutable live `scaling_ratio`. After ratio drift, the stack
> could retain or advertise a receive window that no longer matched the
> local hard rmem budget.
>
> Fix
> ---
>
> - store the advertise-time scaling basis alongside `tp->rcv_wnd`
> - refresh that pair at the TCP and MPTCP receive-window write sites
> - consume the snapshot in receive-memory admission, clamping, and the
>   scaled-window quantization path
> - preserve the snapshot across `TCP_REPAIR_WINDOW` restore when userspace
>   provides it, and fall back safely when legacy userspace cannot
> - expose the accounting in tracepoints and cover the ABI/runtime contract
>   in selftests
>

Your series will heavily conflict with Simon's one

https://patchwork.kernel.org/project/netdevbpf/list/?series=1063486&state=%2A&archive=both

I suggest you rebase/retest/resend after we merge it.

> Series layout
> -------------
>
> 1. track the receive-window snapshot state and helpers
> 2. refresh the snapshot when TCP advertises or initializes windows
> 3. use the snapshot in receive-memory admission and clamping
> 4. extend `TCP_REPAIR_WINDOW` for exact restore plus legacy compatibility
> 5. refresh the TCP shadow window snapshot in MPTCP
> 6. expose rmem/backlog state in `rcvbuf_grow` tracepoints
> 7. cover legacy and extended repair-window layouts in selftests
>
> Testing
> -------
>
> - `git diff --check origin/main..HEAD`
> - `scripts/checkpatch.pl --strict --show-types` on patches 1-7
> - `make -j8 headers`
> - `make -j8 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
> - `make -j8 C=1 CF='-D__CHECK_ENDIAN__' W=1 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
> - `make SPHINXDIRS='networking/net_cachelines' htmldocs`
> - `make -j8 vmlinux bzImage modules`
> - `make -C tools/testing/selftests/net/tcp_ao -j8`
> - `make -C tools/testing/selftests/net/mptcp -j8`
> - `packetdrill --dry_run` for `tcp_rcv_toobig.pkt` and
>   `tcp_rcv_toobig_default.pkt`
> - `virtme-run` guest pass for both packetdrill tests
> - feature-enabled guest pass for `restore_ipv4`, `self-connect_ipv4`, and
>   `mptcp_sockopt.sh`
>
> Thanks,
> Wesley
>
> ---
> base-commit: 908c344d5cfa0ee6efb3226d22ea661e078ebfa0
> --
> 2.43.0
>

^ permalink raw reply

* [PATCH net 7/7] selftests: tcp_ao: cover legacy and extended TCP_REPAIR_WINDOW layouts
From: Wesley Atwell @ 2026-03-11  7:56 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

Extend the repair helpers and selftests so the ABI contract is pinned
down in-tree.

The TCP-AO restore coverage now exercises both the exact and legacy
TCP_REPAIR_WINDOW layouts, verifies that intermediate lengths are
rejected, and keeps the packetdrill coverage for the advertised-window
receive-memory regressions in the same net selftest series.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 .../net/packetdrill/tcp_rcv_toobig.pkt        | 35 +++++++
 .../packetdrill/tcp_rcv_toobig_default.pkt    | 97 +++++++++++++++++++
 .../testing/selftests/net/tcp_ao/lib/aolib.h  | 56 +++++++++--
 .../testing/selftests/net/tcp_ao/lib/repair.c | 18 ++--
 .../selftests/net/tcp_ao/self-connect.c       | 61 ++++++++++--
 5 files changed, 244 insertions(+), 23 deletions(-)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
new file mode 100644
index 000000000000..723c739ddc32
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh`
+
+    0 `nstat -n`
+
+// Establish a connection.
+   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [20000], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 win 18980 <mss 1460,nop,wscale 0>
+  +.1 < . 1:1(0) ack 1 win 257
+
+   +0 accept(3, ..., ...) = 4
+
+   +0 < P. 1:20001(20000) ack 1 win 257
+ +.04 > .  1:1(0) ack 20001 win 18000
+
+   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [12000], 4) = 0
+   +0 < P. 20001:80001(60000) ack 1 win 257
+   +0 > .  1:1(0) ack 20001 win 18000
+
+   +0 read(4, ..., 20000) = 20000
+
+// A too big packet is accepted if the receive queue is empty, but the
+// stronger admission path must not zero the receive buffer while doing so.
+   +0 < P. 20001:80001(60000) ack 1 win 257
+   +0 > .  1:1(0) ack 80001 win 0
+   +0 %{ assert SK_MEMINFO_RCVBUF > 0, SK_MEMINFO_RCVBUF }%
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt
new file mode 100644
index 000000000000..b2e4950e0b83
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh
+sysctl -q net.ipv4.tcp_moderate_rcvbuf=0`
+
+// Establish a connection on the default receive buffer. Leave a large skb in
+// the queue, then deliver another one which still fits the remaining rwnd.
+// We should grow sk_rcvbuf to honor the already-advertised window instead of
+// dropping the packet.
+   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 <...>
+  +.1 < . 1:1(0) ack 1 win 257
+
+   +0 accept(3, ..., ...) = 4
+
+// Exchange enough data to get past the completely fresh-socket case while
+// still keeping the receive buffer at its 128kB default.
+   +0 < P. 1:65001(65000) ack 1 win 257
+   * > .  1:1(0) ack 65001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 65001:130001(65000) ack 1 win 257
+   * > .  1:1(0) ack 130001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 130001:195001(65000) ack 1 win 257
+   * > .  1:1(0) ack 195001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 195001:260001(65000) ack 1 win 257
+   * > .  1:1(0) ack 260001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 260001:325001(65000) ack 1 win 257
+   * > .  1:1(0) ack 325001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 325001:390001(65000) ack 1 win 257
+   * > .  1:1(0) ack 390001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 390001:455001(65000) ack 1 win 257
+   * > .  1:1(0) ack 455001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 455001:520001(65000) ack 1 win 257
+   * > .  1:1(0) ack 520001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 520001:585001(65000) ack 1 win 257
+   * > .  1:1(0) ack 585001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 585001:650001(65000) ack 1 win 257
+   * > .  1:1(0) ack 650001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 650001:715001(65000) ack 1 win 257
+   * > .  1:1(0) ack 715001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 715001:780001(65000) ack 1 win 257
+   * > .  1:1(0) ack 780001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 780001:845001(65000) ack 1 win 257
+   * > .  1:1(0) ack 845001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 845001:910001(65000) ack 1 win 257
+   * > .  1:1(0) ack 910001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 910001:975001(65000) ack 1 win 257
+   * > .  1:1(0) ack 975001
+   +0 read(4, ..., 65000) = 65000
+
+   +0 < P. 975001:1040001(65000) ack 1 win 257
+   * > .  1:1(0) ack 1040001
+   +0 read(4, ..., 65000) = 65000
+
+// Leave about 60kB queued, then accept another large skb which still fits
+// the rwnd we already exposed to the peer. The regression is the drop; the
+// exact sk_rcvbuf growth path is an implementation detail.
+   +0 < P. 1040001:1102001(62000) ack 1 win 257
+   * > .  1:1(0) ack 1102001
+
+   +0 < P. 1102001:1167001(65000) ack 1 win 257
+   * > .  1:1(0) ack 1167001
+   +0 read(4, ..., 127000) = 127000
diff --git a/tools/testing/selftests/net/tcp_ao/lib/aolib.h b/tools/testing/selftests/net/tcp_ao/lib/aolib.h
index ebb2899c12fe..ff259795a4a0 100644
--- a/tools/testing/selftests/net/tcp_ao/lib/aolib.h
+++ b/tools/testing/selftests/net/tcp_ao/lib/aolib.h
@@ -13,6 +13,7 @@
 #include <linux/snmp.h>
 #include <linux/tcp.h>
 #include <netinet/in.h>
+#include <stddef.h>
 #include <stdarg.h>
 #include <stdbool.h>
 #include <stdlib.h>
@@ -671,17 +672,42 @@ struct tcp_sock_state {
 	int timestamp;
 };
 
-extern void __test_sock_checkpoint(int sk, struct tcp_sock_state *state,
-				   void *addr, size_t addr_size);
+/* Legacy userspace stops before the snapshot field and therefore exercises
+ * the kernel's unknown-snapshot fallback path.
+ */
+static inline socklen_t test_tcp_repair_window_legacy_size(void)
+{
+	return offsetof(struct tcp_repair_window, rcv_wnd_scaling_ratio);
+}
+
+static inline socklen_t test_tcp_repair_window_exact_size(void)
+{
+	return sizeof(struct tcp_repair_window);
+}
+
+void __test_sock_checkpoint_opt(int sk, struct tcp_sock_state *state,
+				socklen_t trw_len,
+				void *addr, size_t addr_size);
 static inline void test_sock_checkpoint(int sk, struct tcp_sock_state *state,
 					sockaddr_af *saddr)
 {
-	__test_sock_checkpoint(sk, state, saddr, sizeof(*saddr));
+	__test_sock_checkpoint_opt(sk, state, test_tcp_repair_window_exact_size(),
+				   saddr, sizeof(*saddr));
+}
+
+static inline void test_sock_checkpoint_legacy(int sk,
+					       struct tcp_sock_state *state,
+					       sockaddr_af *saddr)
+{
+	__test_sock_checkpoint_opt(sk, state, test_tcp_repair_window_legacy_size(),
+				   saddr, sizeof(*saddr));
 }
 extern void test_ao_checkpoint(int sk, struct tcp_ao_repair *state);
-extern void __test_sock_restore(int sk, const char *device,
-				struct tcp_sock_state *state,
-				void *saddr, void *daddr, size_t addr_size);
+void __test_sock_restore_opt(int sk, const char *device,
+			     struct tcp_sock_state *state,
+			     socklen_t trw_len,
+			     void *saddr, void *daddr,
+			     size_t addr_size);
 static inline void test_sock_restore(int sk, struct tcp_sock_state *state,
 				     sockaddr_af *saddr,
 				     const union tcp_addr daddr,
@@ -690,7 +716,23 @@ static inline void test_sock_restore(int sk, struct tcp_sock_state *state,
 	sockaddr_af addr;
 
 	tcp_addr_to_sockaddr_in(&addr, &daddr, htons(dport));
-	__test_sock_restore(sk, veth_name, state, saddr, &addr, sizeof(addr));
+	__test_sock_restore_opt(sk, veth_name, state,
+				test_tcp_repair_window_exact_size(),
+				saddr, &addr, sizeof(addr));
+}
+
+static inline void test_sock_restore_legacy(int sk,
+					    struct tcp_sock_state *state,
+					    sockaddr_af *saddr,
+					    const union tcp_addr daddr,
+					    unsigned int dport)
+{
+	sockaddr_af addr;
+
+	tcp_addr_to_sockaddr_in(&addr, &daddr, htons(dport));
+	__test_sock_restore_opt(sk, veth_name, state,
+				test_tcp_repair_window_legacy_size(),
+				saddr, &addr, sizeof(addr));
 }
 extern void test_ao_restore(int sk, struct tcp_ao_repair *state);
 extern void test_sock_state_free(struct tcp_sock_state *state);
diff --git a/tools/testing/selftests/net/tcp_ao/lib/repair.c b/tools/testing/selftests/net/tcp_ao/lib/repair.c
index 9893b3ba69f5..befbd0f72db5 100644
--- a/tools/testing/selftests/net/tcp_ao/lib/repair.c
+++ b/tools/testing/selftests/net/tcp_ao/lib/repair.c
@@ -66,8 +66,9 @@ static void test_sock_checkpoint_queue(int sk, int queue, int qlen,
 		test_error("recv(%d): %d", qlen, ret);
 }
 
-void __test_sock_checkpoint(int sk, struct tcp_sock_state *state,
-			    void *addr, size_t addr_size)
+void __test_sock_checkpoint_opt(int sk, struct tcp_sock_state *state,
+				socklen_t trw_len,
+				void *addr, size_t addr_size)
 {
 	socklen_t len = sizeof(state->info);
 	int ret;
@@ -82,9 +83,9 @@ void __test_sock_checkpoint(int sk, struct tcp_sock_state *state,
 	if (getsockname(sk, addr, &len) || len != addr_size)
 		test_error("getsockname(): %d", (int)len);
 
-	len = sizeof(state->trw);
+	len = trw_len;
 	ret = getsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &state->trw, &len);
-	if (ret || len != sizeof(state->trw))
+	if (ret || len != trw_len)
 		test_error("getsockopt(TCP_REPAIR_WINDOW): %d", (int)len);
 
 	if (ioctl(sk, SIOCOUTQ, &state->outq_len))
@@ -160,9 +161,10 @@ static void test_sock_restore_queue(int sk, int queue, void *buf, int len)
 	} while (len > 0);
 }
 
-void __test_sock_restore(int sk, const char *device,
-			 struct tcp_sock_state *state,
-			 void *saddr, void *daddr, size_t addr_size)
+void __test_sock_restore_opt(int sk, const char *device,
+			     struct tcp_sock_state *state,
+			     socklen_t trw_len,
+			     void *saddr, void *daddr, size_t addr_size)
 {
 	struct tcp_repair_opt opts[4];
 	unsigned int opt_nr = 0;
@@ -215,7 +217,7 @@ void __test_sock_restore(int sk, const char *device,
 	}
 	test_sock_restore_queue(sk, TCP_RECV_QUEUE, state->in.buf, state->inq_len);
 	test_sock_restore_queue(sk, TCP_SEND_QUEUE, state->out.buf, state->outq_len);
-	if (setsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &state->trw, sizeof(state->trw)))
+	if (setsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &state->trw, trw_len))
 		test_error("setsockopt(TCP_REPAIR_WINDOW)");
 }
 
diff --git a/tools/testing/selftests/net/tcp_ao/self-connect.c b/tools/testing/selftests/net/tcp_ao/self-connect.c
index 2c73bea698a6..a7edd72ab28d 100644
--- a/tools/testing/selftests/net/tcp_ao/self-connect.c
+++ b/tools/testing/selftests/net/tcp_ao/self-connect.c
@@ -4,6 +4,7 @@
 #include "aolib.h"
 
 static union tcp_addr local_addr;
+static bool checked_repair_window_lens;
 
 static void __setup_lo_intf(const char *lo_intf,
 			    const char *addr_str, uint8_t prefix)
@@ -30,8 +31,40 @@ static void setup_lo_intf(const char *lo_intf)
 #endif
 }
 
+/* The repair ABI accepts exactly the legacy and extended layouts. */
+static void test_repair_window_len_contract(int sk)
+{
+	struct tcp_repair_window trw = {};
+	socklen_t len = test_tcp_repair_window_exact_size();
+	socklen_t bad_len = test_tcp_repair_window_legacy_size() + 1;
+	int ret;
+
+	if (checked_repair_window_lens)
+		return;
+
+	checked_repair_window_lens = true;
+
+	ret = getsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &trw, &len);
+	if (ret || len != test_tcp_repair_window_exact_size())
+		test_error("getsockopt(TCP_REPAIR_WINDOW): %d", (int)len);
+
+	len = bad_len;
+	ret = getsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &trw, &len);
+	if (ret == 0 || errno != EINVAL)
+		test_fail("repair-window get rejects invalid len");
+	else
+		test_ok("repair-window get rejects invalid len");
+
+	ret = setsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &trw, bad_len);
+	if (ret == 0 || errno != EINVAL)
+		test_fail("repair-window set rejects invalid len");
+	else
+		test_ok("repair-window set rejects invalid len");
+}
+
 static void tcp_self_connect(const char *tst, unsigned int port,
-			     bool different_keyids, bool check_restore)
+			     bool different_keyids, bool check_restore,
+			     bool legacy_repair_window)
 {
 	struct tcp_counters before, after;
 	uint64_t before_aogood, after_aogood;
@@ -109,7 +142,11 @@ static void tcp_self_connect(const char *tst, unsigned int port,
 	}
 
 	test_enable_repair(sk);
-	test_sock_checkpoint(sk, &img, &addr);
+	test_repair_window_len_contract(sk);
+	if (legacy_repair_window)
+		test_sock_checkpoint_legacy(sk, &img, &addr);
+	else
+		test_sock_checkpoint(sk, &img, &addr);
 #ifdef IPV6_TEST
 	addr.sin6_port = htons(port + 1);
 #else
@@ -123,7 +160,11 @@ static void tcp_self_connect(const char *tst, unsigned int port,
 		test_error("socket()");
 
 	test_enable_repair(sk);
-	__test_sock_restore(sk, "lo", &img, &addr, &addr, sizeof(addr));
+	__test_sock_restore_opt(sk, "lo", &img,
+				legacy_repair_window ?
+				test_tcp_repair_window_legacy_size() :
+				test_tcp_repair_window_exact_size(),
+				&addr, &addr, sizeof(addr));
 	if (different_keyids) {
 		if (test_add_repaired_key(sk, DEFAULT_TEST_PASSWORD, 0,
 					  local_addr, -1, 7, 5))
@@ -165,20 +206,24 @@ static void *client_fn(void *arg)
 
 	setup_lo_intf("lo");
 
-	tcp_self_connect("self-connect(same keyids)", port++, false, false);
+	tcp_self_connect("self-connect(same keyids)", port++, false, false, false);
 
 	/* expecting rnext to change based on the first segment RNext != Current */
 	trace_ao_event_expect(TCP_AO_RNEXT_REQUEST, local_addr, local_addr,
 			      port, port, 0, -1, -1, -1, -1, -1, 7, 5, -1);
-	tcp_self_connect("self-connect(different keyids)", port++, true, false);
-	tcp_self_connect("self-connect(restore)", port, false, true);
+	tcp_self_connect("self-connect(different keyids)", port++, true, false, false);
+	tcp_self_connect("self-connect(restore)", port, false, true, false);
+	port += 2; /* restore test restores over different port */
+	tcp_self_connect("self-connect(restore, legacy repair window)",
+			 port, false, true, true);
 	port += 2; /* restore test restores over different port */
 	trace_ao_event_expect(TCP_AO_RNEXT_REQUEST, local_addr, local_addr,
 			      port, port, 0, -1, -1, -1, -1, -1, 7, 5, -1);
 	/* intentionally on restore they are added to the socket in different order */
 	trace_ao_event_expect(TCP_AO_RNEXT_REQUEST, local_addr, local_addr,
 			      port + 1, port + 1, 0, -1, -1, -1, -1, -1, 5, 7, -1);
-	tcp_self_connect("self-connect(restore, different keyids)", port, true, true);
+	tcp_self_connect("self-connect(restore, different keyids)",
+			 port, true, true, false);
 	port += 2; /* restore test restores over different port */
 
 	return NULL;
@@ -186,6 +231,6 @@ static void *client_fn(void *arg)
 
 int main(int argc, char *argv[])
 {
-	test_init(5, client_fn, NULL);
+	test_init(8, client_fn, NULL);
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 6/7] tcp: expose rmem and backlog accounting in rcvbuf_grow tracepoints
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

The receive-window work now depends on keeping sender-visible rwnd and
hard receive-memory accounting aligned.

Expose the current rmem charge and backlog reservation in the TCP and
MPTCP rcvbuf_grow tracepoints so that later drift between advertised
window and local backing is visible during review and debugging.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 include/trace/events/mptcp.h | 11 +++++++----
 include/trace/events/tcp.h   | 12 +++++++-----
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 269d949b2025..167970e8e0a5 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -199,6 +199,8 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 		__field(__u32, inq)
 		__field(__u32, space)
 		__field(__u32, ooo_space)
+		__field(__u32, rmem_alloc)
+		__field(__u32, backlog_len)
 		__field(__u32, rcvbuf)
 		__field(__u32, rcv_wnd)
 		__field(__u8, scaling_ratio)
@@ -228,6 +230,8 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 				     MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq -
 				     msk->ack_seq;
 
+		__entry->rmem_alloc = tcp_rmem_used(sk);
+		__entry->backlog_len = READ_ONCE(msk->backlog_len);
 		__entry->rcvbuf = sk->sk_rcvbuf;
 		__entry->rcv_wnd = atomic64_read(&msk->rcv_wnd_sent) -
 				   msk->ack_seq;
@@ -248,12 +252,11 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 		__entry->skaddr = sk;
 	),
 
-	TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u "
-		  "rcvbuf=%u rcv_wnd=%u family=%d sport=%hu dport=%hu saddr=%pI4 "
-		  "daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p",
+	TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rmem_alloc=%u backlog_len=%u rcvbuf=%u rcv_wnd=%u family=%d sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p",
 		  __entry->time, __entry->rtt_us, __entry->copied,
 		  __entry->inq, __entry->space, __entry->ooo_space,
-		  __entry->scaling_ratio, __entry->rcvbuf, __entry->rcv_wnd,
+		  __entry->scaling_ratio, __entry->rmem_alloc,
+		  __entry->backlog_len, __entry->rcvbuf, __entry->rcv_wnd,
 		  __entry->family, __entry->sport, __entry->dport,
 		  __entry->saddr, __entry->daddr, __entry->saddr_v6,
 		  __entry->daddr_v6, __entry->skaddr)
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index f155f95cdb6e..92d0bd6be0ba 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -217,6 +217,8 @@ TRACE_EVENT(tcp_rcvbuf_grow,
 		__field(__u32, inq)
 		__field(__u32, space)
 		__field(__u32, ooo_space)
+		__field(__u32, rmem_alloc)
+		__field(__u32, backlog_len)
 		__field(__u32, rcvbuf)
 		__field(__u32, rcv_ssthresh)
 		__field(__u32, window_clamp)
@@ -247,6 +249,8 @@ TRACE_EVENT(tcp_rcvbuf_grow,
 				     TCP_SKB_CB(tp->ooo_last_skb)->end_seq -
 				     tp->rcv_nxt;
 
+		__entry->rmem_alloc = tcp_rmem_used(sk);
+		__entry->backlog_len = READ_ONCE(sk->sk_backlog.len);
 		__entry->rcvbuf = sk->sk_rcvbuf;
 		__entry->rcv_ssthresh = tp->rcv_ssthresh;
 		__entry->window_clamp = tp->window_clamp;
@@ -269,13 +273,11 @@ TRACE_EVENT(tcp_rcvbuf_grow,
 		__entry->sock_cookie = sock_gen_cookie(sk);
 	),
 
-	TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rcvbuf=%u "
-		  "rcv_ssthresh=%u window_clamp=%u rcv_wnd=%u "
-		  "family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 "
-		  "saddrv6=%pI6c daddrv6=%pI6c skaddr=%p sock_cookie=%llx",
+	TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rmem_alloc=%u backlog_len=%u rcvbuf=%u rcv_ssthresh=%u window_clamp=%u rcv_wnd=%u family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p sock_cookie=%llx",
 		  __entry->time, __entry->rtt_us, __entry->copied,
 		  __entry->inq, __entry->space, __entry->ooo_space,
-		  __entry->scaling_ratio, __entry->rcvbuf,
+		  __entry->scaling_ratio, __entry->rmem_alloc,
+		  __entry->backlog_len, __entry->rcvbuf,
 		  __entry->rcv_ssthresh, __entry->window_clamp,
 		  __entry->rcv_wnd,
 		  show_family_name(__entry->family),
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 5/7] mptcp: refresh tcp rcv_wnd snapshot when syncing receive windows
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

MPTCP rewrites the TCP shadow receive window on subflows when shared
receive-window state changes.

Once tp->rcv_wnd carries paired snapshot semantics, those subflow shadow
updates have to refresh the snapshot too. Convert the MPTCP window-sync
write sites to use the helper and keep the aggregate receive-space
arithmetic using the explicit rwnd-availability helper.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 net/mptcp/options.c  | 12 ++++++++----
 net/mptcp/protocol.h | 14 +++++++++++---
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 43df4293f58b..6e6aa084cbfa 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1073,9 +1073,12 @@ static void rwin_update(struct mptcp_sock *msk, struct sock *ssk,
 		return;
 
 	/* Some other subflow grew the mptcp-level rwin since rcv_wup,
-	 * resync.
+	 * resync. Keep the TCP shadow window in its advertised u32 domain
+	 * and refresh the advertise-time scaling snapshot while doing so.
 	 */
-	tp->rcv_wnd += mptcp_rcv_wnd - subflow->rcv_wnd_sent;
+	tcp_set_rcv_wnd(tp, min_t(u64, (u64)tp->rcv_wnd +
+				  (mptcp_rcv_wnd - subflow->rcv_wnd_sent),
+				  U32_MAX));
 	subflow->rcv_wnd_sent = mptcp_rcv_wnd;
 }
 
@@ -1334,11 +1337,12 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 	if (rcv_wnd_new != rcv_wnd_old) {
 raise_win:
 		/* The msk-level rcv wnd is after the tcp level one,
-		 * sync the latter.
+		 * sync the latter and refresh its advertise-time scaling
+		 * snapshot.
 		 */
 		rcv_wnd_new = rcv_wnd_old;
 		win = rcv_wnd_old - ack_seq;
-		tp->rcv_wnd = min_t(u64, win, U32_MAX);
+		tcp_set_rcv_wnd(tp, min_t(u64, win, U32_MAX));
 		new_win = tp->rcv_wnd;
 
 		/* Make sure we do not exceed the maximum possible
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0bd1ee860316..4ea95c9c0c7a 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -408,11 +408,19 @@ static inline int mptcp_space_from_win(const struct sock *sk, int win)
 	return __tcp_space_from_win(mptcp_sk(sk)->scaling_ratio, win);
 }
 
+/* MPTCP exposes window space from the mptcp-level receive queue, so it tracks
+ * a separate backlog counter from the subflow backlog embedded in struct sock.
+ */
+static inline int mptcp_rwnd_avail(const struct sock *sk)
+{
+	return READ_ONCE(sk->sk_rcvbuf) -
+	       READ_ONCE(mptcp_sk(sk)->backlog_len) -
+	       tcp_rmem_used(sk);
+}
+
 static inline int __mptcp_space(const struct sock *sk)
 {
-	return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
-				    READ_ONCE(mptcp_sk(sk)->backlog_len) -
-				    sk_rmem_alloc_get(sk));
+	return mptcp_win_from_space(sk, mptcp_rwnd_avail(sk));
 }
 
 static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 4/7] tcp: extend TCP_REPAIR_WINDOW with receive-window scaling snapshot
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

The paired receive-window state is now part of the live TCP socket
semantics, so repair and restore need a way to preserve it.

Extend TCP_REPAIR_WINDOW with the advertise-time scaling snapshot while
keeping old userspace working. The kernel now accepts exactly the legacy
layout and the extended layout. Legacy restore leaves the snapshot
unknown so the socket falls back safely until a fresh local window
advertisement refreshes the pair, while the extended layout restores the
exact snapshot.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 include/uapi/linux/tcp.h |  1 +
 net/ipv4/tcp.c           | 34 ++++++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 03772dd4d399..3a799f4c0e1e 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -159,6 +159,7 @@ struct tcp_repair_window {
 
 	__u32	rcv_wnd;
 	__u32	rcv_wup;
+	__u32	rcv_wnd_scaling_ratio; /* 0 means advertise-time basis unknown */
 };
 
 enum {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cec9ae1bf875..dd2b4fe61bd8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3551,17 +3551,25 @@ static inline bool tcp_can_repair_sock(const struct sock *sk)
 		(sk->sk_state != TCP_LISTEN);
 }
 
+/* Keep accepting the pre-extension TCP_REPAIR_WINDOW layout so legacy
+ * userspace can restore sockets without fabricating a snapshot basis.
+ */
+static inline int tcp_repair_window_legacy_size(void)
+{
+	return offsetof(struct tcp_repair_window, rcv_wnd_scaling_ratio);
+}
+
 static int tcp_repair_set_window(struct tcp_sock *tp, sockptr_t optbuf, int len)
 {
-	struct tcp_repair_window opt;
+	struct tcp_repair_window opt = {};
 
 	if (!tp->repair)
 		return -EPERM;
 
-	if (len != sizeof(opt))
+	if (len != tcp_repair_window_legacy_size() && len != sizeof(opt))
 		return -EINVAL;
 
-	if (copy_from_sockptr(&opt, optbuf, sizeof(opt)))
+	if (copy_from_sockptr(&opt, optbuf, len))
 		return -EFAULT;
 
 	if (opt.max_window < opt.snd_wnd)
@@ -3577,7 +3585,20 @@ static int tcp_repair_set_window(struct tcp_sock *tp, sockptr_t optbuf, int len)
 	tp->snd_wnd	= opt.snd_wnd;
 	tp->max_window	= opt.max_window;
 
-	tp->rcv_wnd	= opt.rcv_wnd;
+	if (len == tcp_repair_window_legacy_size()) {
+		/* Legacy repair UAPI has no advertise-time basis for tp->rcv_wnd.
+		 * Mark the snapshot unknown until a fresh local advertisement
+		 * re-establishes the pair.
+		 */
+		tcp_set_rcv_wnd_unknown(tp, opt.rcv_wnd);
+		tp->rcv_wup	= opt.rcv_wup;
+		return 0;
+	}
+
+	if (opt.rcv_wnd_scaling_ratio > U8_MAX)
+		return -EINVAL;
+
+	tcp_set_rcv_wnd_snapshot(tp, opt.rcv_wnd, opt.rcv_wnd_scaling_ratio);
 	tp->rcv_wup	= opt.rcv_wup;
 
 	return 0;
@@ -4667,12 +4688,12 @@ int do_tcp_getsockopt(struct sock *sk, int level,
 		break;
 
 	case TCP_REPAIR_WINDOW: {
-		struct tcp_repair_window opt;
+		struct tcp_repair_window opt = {};
 
 		if (copy_from_sockptr(&len, optlen, sizeof(int)))
 			return -EFAULT;
 
-		if (len != sizeof(opt))
+		if (len != tcp_repair_window_legacy_size() && len != sizeof(opt))
 			return -EINVAL;
 
 		if (!tp->repair)
@@ -4683,6 +4704,7 @@ int do_tcp_getsockopt(struct sock *sk, int level,
 		opt.max_window	= tp->max_window;
 		opt.rcv_wnd	= tp->rcv_wnd;
 		opt.rcv_wup	= tp->rcv_wup;
+		opt.rcv_wnd_scaling_ratio = tp->rcv_wnd_scaling_ratio;
 
 		if (copy_to_sockptr(optval, &opt, len))
 			return -EFAULT;
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 3/7] tcp: honor advertised receive window in memory admission and clamping
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

tp->rcv_wnd is an advertised promise to the sender, but receive-memory
accounting was still reconstructing that promise through mutable live
state.

Switch the receive-side decisions over to the advertise-time snapshot.
Use it when deciding whether a packet can be admitted, when deciding how
far to clamp future window growth, and when handling the scaled-window
quantization slack in __tcp_select_window(). If a snapshot is not
available, keep the legacy fallback behavior.

This keeps sender-visible rwnd and the local hard rmem budget in the
same unit system instead of letting ratio drift create accounting
mismatches.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 include/net/tcp.h     |  1 +
 net/ipv4/tcp_input.c  | 86 ++++++++++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_output.c | 14 ++++++-
 3 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 187e6d660f62..88ddf7ee826e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -384,6 +384,7 @@ int tcp_ioctl(struct sock *sk, int cmd, int *karg);
 enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
 void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
 void tcp_rcvbuf_grow(struct sock *sk, u32 newval);
+bool tcp_try_grow_rcvbuf(struct sock *sk, int needed);
 void tcp_rcv_space_adjust(struct sock *sk);
 int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
 void tcp_twsk_destructor(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cba89733d121..f76011fc1b7a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -774,8 +774,37 @@ static void tcp_init_buffer_space(struct sock *sk)
 				    (u32)TCP_INIT_CWND * tp->advmss);
 }
 
+/* Try to grow sk_rcvbuf so the hard receive-memory limit covers @needed
+ * bytes beyond the memory already charged in sk_rmem_alloc.
+ */
+bool tcp_try_grow_rcvbuf(struct sock *sk, int needed)
+{
+	struct net *net = sock_net(sk);
+	int target;
+	int rmem2;
+
+	needed = max(needed, 0);
+	target = tcp_rmem_used(sk) + needed;
+
+	if (target <= READ_ONCE(sk->sk_rcvbuf))
+		return true;
+
+	rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
+	if (READ_ONCE(sk->sk_rcvbuf) >= rmem2 ||
+	    (sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
+	    tcp_under_memory_pressure(sk) ||
+	    sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
+		return false;
+
+	WRITE_ONCE(sk->sk_rcvbuf,
+		   min_t(int, rmem2,
+			 max_t(int, READ_ONCE(sk->sk_rcvbuf), target)));
+
+	return target <= READ_ONCE(sk->sk_rcvbuf);
+}
+
 /* 4. Recalculate window clamp after socket hit its memory bounds. */
-static void tcp_clamp_window(struct sock *sk)
+static void tcp_clamp_window_legacy(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -785,14 +814,42 @@ static void tcp_clamp_window(struct sock *sk)
 	icsk->icsk_ack.quick = 0;
 	rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
 
-	if (sk->sk_rcvbuf < rmem2 &&
+	if (READ_ONCE(sk->sk_rcvbuf) < rmem2 &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_under_memory_pressure(sk) &&
 	    sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
 		WRITE_ONCE(sk->sk_rcvbuf,
 			   min(atomic_read(&sk->sk_rmem_alloc), rmem2));
 	}
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+	if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
+		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
+}
+
+static void tcp_clamp_window(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 cur_rwnd = tcp_receive_window(tp);
+	int need;
+
+	if (!tcp_space_from_rcv_wnd(tp, cur_rwnd, &need)) {
+		tcp_clamp_window_legacy(sk);
+		return;
+	}
+
+	inet_csk(sk)->icsk_ack.quick = 0;
+	need = max_t(int, need, 0);
+
+	/* Keep the hard receive-memory cap large enough to honor the
+	 * remaining receive window we already exposed to the sender. Use
+	 * the scaling_ratio snapshot taken when tp->rcv_wnd was advertised,
+	 * not the mutable live ratio which may drift later in the flow.
+	 */
+	tcp_try_grow_rcvbuf(sk, need);
+
+	/* If the remaining advertised rwnd no longer fits the hard budget,
+	 * slow future window growth until the accounting converges again.
+	 */
+	if (need > tcp_rmem_avail(sk))
 		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
 }
 
@@ -5374,11 +5431,28 @@ static void tcp_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
 static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
 
+/* Sequence checks run against the sender-visible receive window before this
+ * point. Convert the incoming payload back to the hard receive-memory budget
+ * using the scaling_ratio that was in force when tp->rcv_wnd was advertised,
+ * so admission keeps honoring the same exposed window even if the live ratio
+ * changes later in the flow. Legacy TCP_REPAIR restores do not have that
+ * advertise-time basis, so they fall back to the pre-series admission rule
+ * until a fresh local advertisement refreshes the pair.
+ *
+ * Do not subtract sk_backlog.len here. tcp_space() already reserves backlog
+ * bytes when selecting future advertised windows, and sk_backlog.len stays
+ * inflated until __release_sock() finishes draining backlog. Subtracting it
+ * again here would double count already-queued backlog packets as they move
+ * into sk_rmem_alloc.
+ */
 static bool tcp_can_ingest(const struct sock *sk, const struct sk_buff *skb)
 {
-	unsigned int rmem = atomic_read(&sk->sk_rmem_alloc);
+	int need;
+
+	if (!tcp_space_from_rcv_wnd(tcp_sk(sk), skb->len, &need))
+		return atomic_read(&sk->sk_rmem_alloc) <= READ_ONCE(sk->sk_rcvbuf);
 
-	return rmem <= sk->sk_rcvbuf;
+	return need <= tcp_rmem_avail(sk);
 }
 
 static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,
@@ -6014,7 +6088,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* Do nothing if our queues are empty. */
-	if (!atomic_read(&sk->sk_rmem_alloc))
+	if (!tcp_rmem_used(sk))
 		return -1;
 
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c1b94d67d8fe..5e69fc31a4da 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3377,13 +3377,23 @@ u32 __tcp_select_window(struct sock *sk)
 	 * scaled window will not line up with the MSS boundary anyway.
 	 */
 	if (tp->rx_opt.rcv_wscale) {
+		int rcv_wscale = 1 << tp->rx_opt.rcv_wscale;
+
 		window = free_space;
 
 		/* Advertise enough space so that it won't get scaled away.
-		 * Import case: prevent zero window announcement if
+		 * Important case: prevent zero-window announcement if
 		 * 1<<rcv_wscale > mss.
 		 */
-		window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+		window = ALIGN(window, rcv_wscale);
+
+		/* Back any scale-quantization slack before we expose it.
+		 * Otherwise tcp_can_ingest() can reject data which is still
+		 * within the sender-visible window.
+		 */
+		if (window > free_space &&
+		    !tcp_try_grow_rcvbuf(sk, tcp_space_from_win(sk, window)))
+			window = round_down(free_space, rcv_wscale);
 	} else {
 		window = tp->rcv_wnd;
 		/* Get the largest window that is a nice multiple of mss.
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 2/7] tcp: preserve rcv_wnd snapshot when updating advertised windows
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

Once tp->rcv_wnd carries paired snapshot semantics, every write of the
advertised window has to refresh the snapshot at the same time.

Convert the active-open, passive-open, and normal advertised-window
update sites to use tcp_set_rcv_wnd(). This keeps new sockets and later
window advertisements initialized with a valid advertise-time basis
before the receive-memory logic starts consuming it.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 net/ipv4/tcp_minisocks.c | 2 +-
 net/ipv4/tcp_output.c    | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index dafb63b923d0..ae8a466b5298 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -603,7 +603,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->rx_opt.sack_ok = ireq->sack_ok;
 	newtp->window_clamp = req->rsk_window_clamp;
 	newtp->rcv_ssthresh = req->rsk_rcv_wnd;
-	newtp->rcv_wnd = req->rsk_rcv_wnd;
+	tcp_set_rcv_wnd(newtp, req->rsk_rcv_wnd);
 	newtp->rx_opt.wscale_ok = ireq->wscale_ok;
 	if (newtp->rx_opt.wscale_ok) {
 		newtp->rx_opt.snd_wscale = ireq->snd_wscale;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..c1b94d67d8fe 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -291,7 +291,7 @@ static u16 tcp_select_window(struct sock *sk)
 	 */
 	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
 		tp->pred_flags = 0;
-		tp->rcv_wnd = 0;
+		tcp_set_rcv_wnd(tp, 0);
 		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
 	}
@@ -314,7 +314,7 @@ static u16 tcp_select_window(struct sock *sk)
 		}
 	}
 
-	tp->rcv_wnd = new_win;
+	tcp_set_rcv_wnd(tp, new_win);
 	tp->rcv_wup = tp->rcv_nxt;
 
 	/* Make sure we do not exceed the maximum possible
@@ -4150,6 +4150,10 @@ static void tcp_connect_init(struct sock *sk)
 				  READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_window_scaling),
 				  &rcv_wscale,
 				  rcv_wnd);
+	/* tcp_select_initial_window() filled tp->rcv_wnd through its out-param,
+	 * so snapshot the scaling_ratio we will use for that initial rwnd.
+	 */
+	tcp_set_rcv_wnd(tp, tp->rcv_wnd);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
 	tp->rcv_ssthresh = tp->rcv_wnd;
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 1/7] tcp: track advertise-time scaling basis for rcv_wnd
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>

tp->rcv_wnd is an advertised window, but later receive-side accounting
needs to recover the hard memory budget that window represented when it
was exposed.

Prepare for that by storing the scaling basis alongside tp->rcv_wnd and
centralizing the helper API around the paired state. While here, make the
existing receive-memory arithmetic use the shared helper names so later
behavioral changes can build on one explicit accounting model.

This patch is groundwork only. Later patches will refresh the snapshot at
window write sites and consume it in the receive-memory paths.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 .../networking/net_cachelines/tcp_sock.rst    |  1 +
 include/linux/tcp.h                           |  1 +
 include/net/tcp.h                             | 79 +++++++++++++++++--
 net/ipv4/tcp.c                                |  1 +
 4 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index 563daea10d6c..1415981b9d8a 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -12,6 +12,7 @@ struct inet_connection_sock   inet_conn
 u16                           tcp_header_len          read_mostly         read_mostly         tcp_bound_to_half_wnd,tcp_current_mss(tx);tcp_rcv_established(rx)
 u16                           gso_segs                read_mostly                             tcp_xmit_size_goal
 __be32                        pred_flags              read_write          read_mostly         tcp_select_window(tx);tcp_rcv_established(rx)
+u8                            rcv_wnd_scaling_ratio   read_write          read_mostly         tcp_set_rcv_wnd,tcp_can_ingest,tcp_clamp_window
 u64                           bytes_received                              read_write          tcp_rcv_nxt_update(rx)
 u32                           segs_in                                     read_write          tcp_v6_rcv(rx)
 u32                           data_segs_in                                read_write          tcp_v6_rcv(rx)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f72eef31fa23..ec6b70c1174b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -297,6 +297,7 @@ struct tcp_sock {
 		est_ecnfield:2,/* ECN field for AccECN delivered estimates */
 		accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
 		prev_ecnfield:2; /* ECN bits from the previous segment */
+	u8	rcv_wnd_scaling_ratio; /* 0 if unknown, else tp->rcv_wnd basis */
 	__be32	pred_flags;
 	u64	tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
 	u64	tcp_mstamp;	/* most recent packet received/sent */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 978eea2d5df0..187e6d660f62 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1702,6 +1702,26 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
 	return __tcp_space_from_win(tcp_sk(sk)->scaling_ratio, win);
 }
 
+static inline bool tcp_rcv_wnd_snapshot_valid(const struct tcp_sock *tp)
+{
+	return tp->rcv_wnd_scaling_ratio != 0;
+}
+
+/* Rebuild hard receive-memory units for data already covered by tp->rcv_wnd if
+ * the advertise-time basis is known. Legacy TCP_REPAIR restores can only
+ * recover tp->rcv_wnd itself; callers must fall back when the snapshot is
+ * unknown.
+ */
+static inline bool tcp_space_from_rcv_wnd(const struct tcp_sock *tp, int win,
+					  int *space)
+{
+	if (!tcp_rcv_wnd_snapshot_valid(tp))
+		return false;
+
+	*space = __tcp_space_from_win(tp->rcv_wnd_scaling_ratio, win);
+	return true;
+}
+
 /* Assume a 50% default for skb->len/skb->truesize ratio.
  * This may be adjusted later in tcp_measure_rcv_mss().
  */
@@ -1709,15 +1729,62 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
 
 static inline void tcp_scaling_ratio_init(struct sock *sk)
 {
-	tcp_sk(sk)->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tp->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+	tp->rcv_wnd_scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+/* tp->rcv_wnd is paired with the scaling_ratio that was in force when that
+ * window was last advertised. Legacy TCP_REPAIR restores can only recover the
+ * window value itself and use a zero snapshot until a fresh local window
+ * advertisement refreshes the pair.
+ */
+static inline void tcp_set_rcv_wnd_snapshot(struct tcp_sock *tp, u32 win,
+					    u8 scaling_ratio)
+{
+	tp->rcv_wnd = win;
+	tp->rcv_wnd_scaling_ratio = scaling_ratio;
+}
+
+static inline void tcp_set_rcv_wnd(struct tcp_sock *tp, u32 win)
+{
+	tcp_set_rcv_wnd_snapshot(tp, win, tp->scaling_ratio);
+}
+
+static inline void tcp_set_rcv_wnd_unknown(struct tcp_sock *tp, u32 win)
+{
+	tcp_set_rcv_wnd_snapshot(tp, win, 0);
+}
+
+/* TCP receive-side accounting reuses sk_rcvbuf as both a hard memory limit
+ * and as the source material for the advertised receive window after
+ * scaling_ratio conversion. Keep the byte accounting explicit so admission,
+ * pruning, and rwnd selection all start from the same quantities.
+ */
+static inline int tcp_rmem_used(const struct sock *sk)
+{
+	return atomic_read(&sk->sk_rmem_alloc);
+}
+
+static inline int tcp_rmem_avail(const struct sock *sk)
+{
+	return READ_ONCE(sk->sk_rcvbuf) - tcp_rmem_used(sk);
+}
+
+/* Sender-visible rwnd headroom also reserves bytes already queued on backlog.
+ * Those bytes are not free to advertise again until __release_sock() drains
+ * backlog and clears sk_backlog.len.
+ */
+static inline int tcp_rwnd_avail(const struct sock *sk)
+{
+	return tcp_rmem_avail(sk) - READ_ONCE(sk->sk_backlog.len);
 }
 
 /* Note: caller must be prepared to deal with negative returns */
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
-				  READ_ONCE(sk->sk_backlog.len) -
-				  atomic_read(&sk->sk_rmem_alloc));
+	return tcp_win_from_space(sk, tcp_rwnd_avail(sk));
 }
 
 static inline int tcp_full_space(const struct sock *sk)
@@ -1760,7 +1827,7 @@ static inline bool tcp_rmem_pressure(const struct sock *sk)
 	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
 	threshold = rcvbuf - (rcvbuf >> 3);
 
-	return atomic_read(&sk->sk_rmem_alloc) > threshold;
+	return tcp_rmem_used(sk) > threshold;
 }
 
 static inline bool tcp_epollin_ready(const struct sock *sk, int target)
@@ -1910,7 +1977,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
-	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+	    tcp_rmem_avail(sk) > 0 &&
 	    !tp->urg_data)
 		tcp_fast_path_on(tp);
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 202a4e57a218..cec9ae1bf875 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5238,6 +5238,7 @@ static void __init tcp_struct_check(void)
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd_scaling_ratio);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_tstamp);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rx_opt);
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Wesley Atwell @ 2026-03-11  7:55 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
	martineau, netdev, mptcp
  Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
	mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
	linux-kselftest, linux-kernel, linux-api, atwellwea

This series keeps sender-visible TCP receive-window accounting tied to the
scaling basis that was in force when the window was advertised.

Problem
-------

`tp->rcv_wnd` is an advertised promise to the sender, but later
receive-memory admission and clamping could reconstruct that promise
through the mutable live `scaling_ratio`. After ratio drift, the stack
could retain or advertise a receive window that no longer matched the
local hard rmem budget.

Fix
---

- store the advertise-time scaling basis alongside `tp->rcv_wnd`
- refresh that pair at the TCP and MPTCP receive-window write sites
- consume the snapshot in receive-memory admission, clamping, and the
  scaled-window quantization path
- preserve the snapshot across `TCP_REPAIR_WINDOW` restore when userspace
  provides it, and fall back safely when legacy userspace cannot
- expose the accounting in tracepoints and cover the ABI/runtime contract
  in selftests

Series layout
-------------

1. track the receive-window snapshot state and helpers
2. refresh the snapshot when TCP advertises or initializes windows
3. use the snapshot in receive-memory admission and clamping
4. extend `TCP_REPAIR_WINDOW` for exact restore plus legacy compatibility
5. refresh the TCP shadow window snapshot in MPTCP
6. expose rmem/backlog state in `rcvbuf_grow` tracepoints
7. cover legacy and extended repair-window layouts in selftests

Testing
-------

- `git diff --check origin/main..HEAD`
- `scripts/checkpatch.pl --strict --show-types` on patches 1-7
- `make -j8 headers`
- `make -j8 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
- `make -j8 C=1 CF='-D__CHECK_ENDIAN__' W=1 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
- `make SPHINXDIRS='networking/net_cachelines' htmldocs`
- `make -j8 vmlinux bzImage modules`
- `make -C tools/testing/selftests/net/tcp_ao -j8`
- `make -C tools/testing/selftests/net/mptcp -j8`
- `packetdrill --dry_run` for `tcp_rcv_toobig.pkt` and
  `tcp_rcv_toobig_default.pkt`
- `virtme-run` guest pass for both packetdrill tests
- feature-enabled guest pass for `restore_ipv4`, `self-connect_ipv4`, and
  `mptcp_sockopt.sh`

Thanks,
Wesley

---
base-commit: 908c344d5cfa0ee6efb3226d22ea661e078ebfa0
-- 
2.43.0

^ permalink raw reply

* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-03-11  4:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
	linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260309-umsturz-herfallen-067eb2df7ec2@brauner>

[-- Attachment #1: Type: text/plain, Size: 2709 bytes --]

On 2026-03-09, Christian Brauner <brauner@kernel.org> wrote:
> > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > I think this needs more clarification as to what "regular" means,
> > > > since S_IFREG may not be sufficient.  The UAPI group page says:
> > > >
> > > > Use-Case: this would be very useful to write secure programs that want
> > > > to avoid being tricked into opening device nodes with special
> > > > semantics while thinking they operate on regular files. This is
> > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > implementation of a naive web browser which is pointed to
> > > > file://dev/zero, not expecting an endless amount of data to read.
> > > >
> > > > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > > > where that fd is a memfd?  What about files backed by non-"fast" disk
> > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > >
> > > > Are we concerned about blocking open?  (open blocks as a matter of
> > > > course.)  Are we concerned about open having strange side effects?
> > > > Are we concerned about write having strange side effects?  Are we
> > > > concerned about cases where opening the file as root results in
> > > > elevated privilege beyond merely gaining the ability to write to that
> > > > specific path on an ordinary filesystem?
> 
> I think this is opening up a barrage of question that I'm not sure are
> all that useful. The ability to only open regular file isn't intended to
> defend against hung FUSE or NFS servers or other random Linux
> special-sauce murder-suicide file descriptor traps. For a lot of those
> we have O_PATH which can easily function with the new extension. A lot
> of the other special-sauce files (most anonymous inode fds) cannot even
> be reopened via e.g., /proc.

Indeed, I see OPENAT2_REGULAR as a way of optimising the tedious checks
that userspace does using O_PATH+/proc/self/fd/$n re-opening when
dealing with regular files.

For the problem of stuck NFS handles and so on, an idea I've had on my
backlog for a long time was RESOLVE_NO_REMOTE that would block those
kinds of things. IMHO it doesn't make sense to block those things with
an O_* flag because (especially in the NFS example) directory components
can also cause the syscall to block indefinitely and so RESOLVE_* flags
make more sense for this anyway. But in my mind this is a separate
problem to OPENAT2_REGULAR.

-- 
Aleksa Sarai
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Darrick J. Wong @ 2026-03-10 20:44 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
	gost.dev, linux-api
In-Reply-To: <2cde8902-6d50-4035-b9c4-89bd5e2c9468@samsung.com>

On Tue, Mar 10, 2026 at 11:25:25PM +0530, Kanchan Joshi wrote:
> On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
> >> +struct fs_write_stream {
> >> +	__u32		op_flags;	/* IN: operation flags */
> >> +	__u32		stream_id;	/* IN/OUT:  stream value to assign/guery */
> >> +	__u32		max_streams;	/* OUT: max streams values supported */
> >> +	__u32		rsvd;
> >> +};
> > This isn't an very cohesive interface -- GET_MAX probably only needs
> > op_flags and max_streams, right?  And GET/SET only use op_flags and
> > stream_id, right?
> 
> Yeah, right. That's the trade-off with swiss army knife type ioctl which 
> uses op_flags to decide what it should do. Apart from keeping a single 
> ioctl I was thinking a bit about extensibility (for anything new we may 
> be able to do a new op_flags with some rsvd or union) too. But if you 
> feel strong about this, I can take 3 ioctl route?

struct fs_write_stream {
	__u32		op_flags;
	union {
		__u32	stream_id;
		__u32	max_ids;
	};
	__u64		reserved;
};

perhaps?  You might want to look into whether or not we're allowed to
have anonymous unions in UAPI headers.  We all ❤️ C11, right?

--D

> >> +#define FS_WRITE_STREAM_OP_GET_MAX		(1 << 0)
> >> +#define FS_WRITE_STREAM_OP_GET			(1 << 1)
> >> +#define FS_WRITE_STREAM_OP_SET			(1 << 2)
> >> +
> >> +#define FS_IOC_WRITE_STREAM		_IOWR('f', 43, struct fs_write_stream)
> > EXT4_IOC_CHECKPOINT already took 'f' / 43.  I/think/ there's no problem
> > because its argument is a u32 and ioctl definitions incorporate the
> > lower bits of of the argument size but you might want to be careful
> > anyway.
> 
> Indeed, thanks!
> 

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Kanchan Joshi @ 2026-03-10 17:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
	gost.dev, linux-api
In-Reply-To: <20260309163325.GE6033@frogsfrogsfrogs>

On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
>> +struct fs_write_stream {
>> +	__u32		op_flags;	/* IN: operation flags */
>> +	__u32		stream_id;	/* IN/OUT:  stream value to assign/guery */
>> +	__u32		max_streams;	/* OUT: max streams values supported */
>> +	__u32		rsvd;
>> +};
> This isn't an very cohesive interface -- GET_MAX probably only needs
> op_flags and max_streams, right?  And GET/SET only use op_flags and
> stream_id, right?

Yeah, right. That's the trade-off with swiss army knife type ioctl which 
uses op_flags to decide what it should do. Apart from keeping a single 
ioctl I was thinking a bit about extensibility (for anything new we may 
be able to do a new op_flags with some rsvd or union) too. But if you 
feel strong about this, I can take 3 ioctl route?

>> +#define FS_WRITE_STREAM_OP_GET_MAX		(1 << 0)
>> +#define FS_WRITE_STREAM_OP_GET			(1 << 1)
>> +#define FS_WRITE_STREAM_OP_SET			(1 << 2)
>> +
>> +#define FS_IOC_WRITE_STREAM		_IOWR('f', 43, struct fs_write_stream)
> EXT4_IOC_CHECKPOINT already took 'f' / 43.  I/think/ there's no problem
> because its argument is a u32 and ioctl definitions incorporate the
> lower bits of of the argument size but you might want to be careful
> anyway.

Indeed, thanks!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox