* Re: [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
From: Sebastian Andrzej Siewior @ 2026-03-12 9:04 UTC
To: André Almeida
Cc: Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Mathieu Desnoyers,
Liam R . Howlett, kernel-dev, linux-api, linux-kernel
In-Reply-To: <20260220202620.139584-2-andrealmeid@igalia.com>
On 2026-02-20 17:26:19 [-0300], André Almeida wrote:
> --- /dev/null
> +++ b/robust_bug.c
…
> + new->value = ((uint64_t) value << 32) + value;
> +
> + /* Create a backup of the current value */
> + original_val = new->value;
I think I finally got it and might have understood the issue.
You exit before unlocking the futex. You free this block and this new
memory (address) is the same as the old one. Your corruption comes from
the fact that the old content is the same as the new content.
If the thread does unlock in userland (or kernel) but the lock remains
on the robust_list while it gets killed then the kernel will attempt to
unlock the lock. But this requires that the futex value matches the
pid value of the exiting thread.
So if it is unlocked (0x0) or used again then nothing happens. Unless
the new memory gets the same value assigned as the pid value by
accident. Then it gets changed…
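To make the matching rule above concrete, here is a minimal userspace sketch of the exit-time check as I understand it. The function name model_handle_futex_death is made up for illustration, and the real kernel code works on the user address with an atomic compare-exchange; only the three FUTEX_* constants are taken from <linux/futex.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of the futex word, as defined in <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/*
 * Hypothetical model of the robust-list walk for one entry: the futex
 * word is rewritten only if its TID field still names the dying thread.
 * Returns true if the word was modified.
 */
static bool model_handle_futex_death(uint32_t *uaddr, uint32_t dying_tid)
{
	uint32_t uval = *uaddr;

	/* Unlocked (0) or owned by somebody else: leave the memory alone. */
	if ((uval & FUTEX_TID_MASK) != dying_tid)
		return false;

	/* Keep the waiters bit, drop the TID, flag the dead owner. */
	*uaddr = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
	return true;
}
```

A word holding 0 or a foreign TID is left untouched in this model. If reused memory happens to hold the dying thread's TID, the word is rewritten, which is the accidental change described above.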
If the unlock did not happen and the lock is still owned by the thread that is
killed, then the "fixup" here is the right thing to do. The memory
should not be free()ed because the lock was still owned by the thread.
The misunderstanding here might be "once the thread is gone, the lock is
free, so we can throw away the memory". At the very least, it was a locked
mutex and I think pthread_mutex_destroy() would complain here.
So is the issue here that the "new" value is the same as the "old" value
and the robust-death-handle part in the kernel does its job? Or did I
oversimplify something?
Let me continue with the thread…
Sebastian
* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Eric Dumazet @ 2026-03-12 1:49 UTC
To: Jakub Kicinski
Cc: Wesley Atwell, Simon Baatz, davem, pabeni, ncardwell, dsahern,
matttbe, martineau, netdev, mptcp, kuniyu, horms, geliang, corbet,
skhan, rostedt, mhiramat, mathieu.desnoyers, 0x7f454c46,
linux-doc, linux-trace-kernel, linux-kselftest, linux-kernel,
linux-api
In-Reply-To: <20260311174154.5fadb207@kernel.org>
On Thu, Mar 12, 2026 at 1:41 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 11 Mar 2026 09:34:32 +0100 Eric Dumazet wrote:
> > Your series will heavily conflict with Simon's one
> >
> > https://patchwork.kernel.org/project/netdevbpf/list/?series=1063486&state=%2A&archive=both
> >
> > I suggest you rebase/retest/resend after we merge it.
>
> Would it make sense to extend netdevsim and packetdrill to be able to
> exercise scaling ratio a little more? Having it optionally clone the
> skb and truesize += X would be trivial. IDK how many bugs this would
> let us catch tho :(
Yes, I think we mentioned this at some point.
packetdrill uses a tun device.
Adding a TUN ioctl() to control how many additional bytes are added to
skb->truesize after tun allocates an skb is doable.
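As a rough illustration of why inflating truesize would exercise the scaling ratio, here is a small userspace model of the arithmetic. The model_* names are hypothetical, and the shift of 8 is meant to mirror the kernel's TCP_RMEM_TO_WIN_SCALE as far as I know; treat it as a sketch and not as the kernel implementation.

```c
#include <stdint.h>

#define RMEM_TO_WIN_SCALE 8	/* assumed to mirror TCP_RMEM_TO_WIN_SCALE */

/* Payload-to-truesize ratio in 1/256 units, kept within 1..255. */
static uint8_t model_scaling_ratio(uint32_t len, uint32_t truesize)
{
	uint64_t val;

	if (truesize == 0)
		return 1;
	val = ((uint64_t)len << RMEM_TO_WIN_SCALE) / truesize;
	if (val > 255)
		val = 255;
	return val ? (uint8_t)val : 1;
}

/* Window that a given amount of buffer space can back under a ratio. */
static int64_t model_win_from_space(uint8_t ratio, int64_t space)
{
	return (space * ratio) >> RMEM_TO_WIN_SCALE;
}

/* Buffer space needed to back a given window under a ratio. */
static int64_t model_space_from_win(uint8_t ratio, int64_t win)
{
	return (win << RMEM_TO_WIN_SCALE) / ratio;
}
```

With a 1000-byte payload, a truesize of 2000 gives a ratio of 128 and a truesize of 4000 gives 64. A window of 10000 derived from 20000 bytes of space under the first ratio needs 40000 bytes under the second, which is the kind of drift such a knob would let a test provoke.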
* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Jakub Kicinski @ 2026-03-12 0:43 UTC
To: Wesley Atwell
Cc: davem, pabeni, edumazet, ncardwell, dsahern, matttbe, martineau,
netdev, mptcp, kuniyu, horms, geliang, corbet, skhan, rostedt,
mhiramat, mathieu.desnoyers, 0x7f454c46, linux-doc,
linux-trace-kernel, linux-kselftest, linux-kernel, linux-api
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
On Wed, 11 Mar 2026 01:55:53 -0600 Wesley Atwell wrote:
> Subject: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
When you repost, please make sure you use "PATCH net-next v2"
as the tag / prefix. "net" is a tree we use to fast track fixes.
* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Jakub Kicinski @ 2026-03-12 0:41 UTC
To: Eric Dumazet
Cc: Wesley Atwell, Simon Baatz, davem, pabeni, ncardwell, dsahern,
matttbe, martineau, netdev, mptcp, kuniyu, horms, geliang, corbet,
skhan, rostedt, mhiramat, mathieu.desnoyers, 0x7f454c46,
linux-doc, linux-trace-kernel, linux-kselftest, linux-kernel,
linux-api
In-Reply-To: <CANn89i+dojcg=TDh6E1++g_TM7qdcpnyu47n2Q9DRW_w73TjzA@mail.gmail.com>
On Wed, 11 Mar 2026 09:34:32 +0100 Eric Dumazet wrote:
> Your series will heavily conflict with Simon's one
>
> https://patchwork.kernel.org/project/netdevbpf/list/?series=1063486&state=%2A&archive=both
>
> I suggest you rebase/retest/resend after we merge it.
Would it make sense to extend netdevsim and packetdrill to be able to
exercise scaling ratio a little more? Having it optionally clone the
skb and truesize += X would be trivial. IDK how many bugs this would
let us catch tho :(
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-11 16:10 UTC
To: Aleksa Sarai
Cc: Christian Brauner, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-03-11-regular-sore-census-shops-DqYcUT@cyphar.com>
On Tue, Mar 10, 2026 at 9:49 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2026-03-09, Christian Brauner <brauner@kernel.org> wrote:
> > > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > > I think this needs more clarification as to what "regular" means,
> > > > > since S_IFREG may not be sufficient. The UAPI group page says:
> > > > >
> > > > > Use-Case: this would be very useful to write secure programs that want
> > > > > to avoid being tricked into opening device nodes with special
> > > > > semantics while thinking they operate on regular files. This is
> > > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > > implementation of a naive web browser which is pointed to
> > > > > file://dev/zero, not expecting an endless amount of data to read.
> > > > >
> > > > > What about procfs? What about sysfs? What about /proc/self/fd/17
> > > > > where that fd is a memfd? What about files backed by non-"fast" disk
> > > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > > >
> > > > > Are we concerned about blocking open? (open blocks as a matter of
> > > > > course.) Are we concerned about open having strange side effects?
> > > > > Are we concerned about write having strange side effects? Are we
> > > > > concerned about cases where opening the file as root results in
> > > > > elevated privilege beyond merely gaining the ability to write to that
> > > > > specific path on an ordinary filesystem?
> >
> > I think this is opening up a barrage of questions that I'm not sure are
> > all that useful. The ability to only open regular files isn't intended to
> > defend against hung FUSE or NFS servers or other random Linux
> > special-sauce murder-suicide file descriptor traps. For a lot of those
> > we have O_PATH which can easily function with the new extension. A lot
> > of the other special-sauce files (most anonymous inode fds) cannot even
> > be reopened via e.g., /proc.
>
> Indeed, I see OPENAT2_REGULAR as a way of optimising the tedious checks
> that userspace does using O_PATH+/proc/self/fd/$n re-opening when
> dealing with regular files.
Can you give a brief description or a link to what these checks are and
what problem they solve?
--Andy
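For context, the O_PATH plus /proc/self/fd re-opening mentioned in the quoted text is commonly written along the following lines. This is only a sketch of the general pattern: the helper name open_regular_rdonly is hypothetical, it assumes /proc is mounted, and real users may add further checks (O_NOFOLLOW, ownership, filesystem type) that are not shown here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a read-only fd for path only if it names a
 * regular file. The O_PATH open has no device or FIFO side effects, so the
 * type can be checked before anything is opened for I/O.
 */
static int open_regular_rdonly(const char *path)
{
	char proc_path[64];
	struct stat st;
	int pfd, fd, saved;

	pfd = open(path, O_PATH | O_CLOEXEC);
	if (pfd < 0)
		return -1;

	if (fstat(pfd, &st) < 0) {
		saved = errno;
		close(pfd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(pfd);
		errno = EINVAL;
		return -1;
	}

	/* Re-open the exact same inode through the magic link. */
	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", pfd);
	fd = open(proc_path, O_RDONLY | O_CLOEXEC);
	saved = errno;
	close(pfd);
	errno = saved;
	return fd;
}
```

The sequence costs three system calls plus a /proc dependency per file, which is presumably the tedium a single flag would replace.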
* Re: [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Eric Dumazet @ 2026-03-11 8:34 UTC
To: Wesley Atwell, Simon Baatz
Cc: davem, kuba, pabeni, ncardwell, dsahern, matttbe, martineau,
netdev, mptcp, kuniyu, horms, geliang, corbet, skhan, rostedt,
mhiramat, mathieu.desnoyers, 0x7f454c46, linux-doc,
linux-trace-kernel, linux-kselftest, linux-kernel, linux-api
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
On Wed, Mar 11, 2026 at 8:56 AM Wesley Atwell <atwellwea@gmail.com> wrote:
>
> This series keeps sender-visible TCP receive-window accounting tied to the
> scaling basis that was in force when the window was advertised.
>
> Problem
> -------
>
> `tp->rcv_wnd` is an advertised promise to the sender, but later
> receive-memory admission and clamping could reconstruct that promise
> through the mutable live `scaling_ratio`. After ratio drift, the stack
> could retain or advertise a receive window that no longer matched the
> local hard rmem budget.
>
> Fix
> ---
>
> - store the advertise-time scaling basis alongside `tp->rcv_wnd`
> - refresh that pair at the TCP and MPTCP receive-window write sites
> - consume the snapshot in receive-memory admission, clamping, and the
> scaled-window quantization path
> - preserve the snapshot across `TCP_REPAIR_WINDOW` restore when userspace
> provides it, and fall back safely when legacy userspace cannot
> - expose the accounting in tracepoints and cover the ABI/runtime contract
> in selftests
>
Your series will heavily conflict with Simon's one
https://patchwork.kernel.org/project/netdevbpf/list/?series=1063486&state=%2A&archive=both
I suggest you rebase/retest/resend after we merge it.
> Series layout
> -------------
>
> 1. track the receive-window snapshot state and helpers
> 2. refresh the snapshot when TCP advertises or initializes windows
> 3. use the snapshot in receive-memory admission and clamping
> 4. extend `TCP_REPAIR_WINDOW` for exact restore plus legacy compatibility
> 5. refresh the TCP shadow window snapshot in MPTCP
> 6. expose rmem/backlog state in `rcvbuf_grow` tracepoints
> 7. cover legacy and extended repair-window layouts in selftests
>
> Testing
> -------
>
> - `git diff --check origin/main..HEAD`
> - `scripts/checkpatch.pl --strict --show-types` on patches 1-7
> - `make -j8 headers`
> - `make -j8 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
> - `make -j8 C=1 CF='-D__CHECK_ENDIAN__' W=1 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
> - `make SPHINXDIRS='networking/net_cachelines' htmldocs`
> - `make -j8 vmlinux bzImage modules`
> - `make -C tools/testing/selftests/net/tcp_ao -j8`
> - `make -C tools/testing/selftests/net/mptcp -j8`
> - `packetdrill --dry_run` for `tcp_rcv_toobig.pkt` and
> `tcp_rcv_toobig_default.pkt`
> - `virtme-run` guest pass for both packetdrill tests
> - feature-enabled guest pass for `restore_ipv4`, `self-connect_ipv4`, and
> `mptcp_sockopt.sh`
>
> Thanks,
> Wesley
>
> ---
> base-commit: 908c344d5cfa0ee6efb3226d22ea661e078ebfa0
> --
> 2.43.0
>
* [PATCH net 7/7] selftests: tcp_ao: cover legacy and extended TCP_REPAIR_WINDOW layouts
From: Wesley Atwell @ 2026-03-11 7:56 UTC
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
Extend the repair helpers and selftests so the ABI contract is pinned
down in-tree.
The TCP-AO restore coverage now exercises both the exact and legacy
TCP_REPAIR_WINDOW layouts, verifies that intermediate lengths are
rejected, and keeps the packetdrill coverage for the advertised-window
receive-memory regressions in the same net selftest series.
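The length contract described above can be sketched in isolation as follows. The struct is a hypothetical stand-in: the first five fields follow the existing struct tcp_repair_window, while the type of the trailing rcv_wnd_scaling_ratio field is an assumption, since its definition lives in an earlier patch of the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the extended struct tcp_repair_window. */
struct model_repair_window {
	uint32_t snd_wl1;
	uint32_t snd_wnd;
	uint32_t max_window;
	uint32_t rcv_wnd;
	uint32_t rcv_wup;
	uint32_t rcv_wnd_scaling_ratio;	/* assumed type for the new field */
};

/* Legacy callers stop right before the new trailing field. */
static size_t model_legacy_size(void)
{
	return offsetof(struct model_repair_window, rcv_wnd_scaling_ratio);
}

static size_t model_exact_size(void)
{
	return sizeof(struct model_repair_window);
}

/* Only the two known layouts are accepted; anything in between is not. */
static bool model_len_is_valid(size_t len)
{
	return len == model_legacy_size() || len == model_exact_size();
}
```

Under these assumptions the legacy length is 20 bytes and the exact length is 24, so a length of 21 falls between the two layouts and would be rejected.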
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
.../net/packetdrill/tcp_rcv_toobig.pkt | 35 +++++++
.../packetdrill/tcp_rcv_toobig_default.pkt | 97 +++++++++++++++++++
.../testing/selftests/net/tcp_ao/lib/aolib.h | 56 +++++++++--
.../testing/selftests/net/tcp_ao/lib/repair.c | 18 ++--
.../selftests/net/tcp_ao/self-connect.c | 61 ++++++++++--
5 files changed, 244 insertions(+), 23 deletions(-)
create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
new file mode 100644
index 000000000000..723c739ddc32
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh`
+
+ 0 `nstat -n`
+
+// Establish a connection.
+ +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+ +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+ +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [20000], 4) = 0
+ +0 bind(3, ..., ...) = 0
+ +0 listen(3, 1) = 0
+
+ +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+ +0 > S. 0:0(0) ack 1 win 18980 <mss 1460,nop,wscale 0>
+ +.1 < . 1:1(0) ack 1 win 257
+
+ +0 accept(3, ..., ...) = 4
+
+ +0 < P. 1:20001(20000) ack 1 win 257
+ +.04 > . 1:1(0) ack 20001 win 18000
+
+ +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [12000], 4) = 0
+ +0 < P. 20001:80001(60000) ack 1 win 257
+ +0 > . 1:1(0) ack 20001 win 18000
+
+ +0 read(4, ..., 20000) = 20000
+
+// A too big packet is accepted if the receive queue is empty, but the
+// stronger admission path must not zero the receive buffer while doing so.
+ +0 < P. 20001:80001(60000) ack 1 win 257
+ +0 > . 1:1(0) ack 80001 win 0
+ +0 %{ assert SK_MEMINFO_RCVBUF > 0, SK_MEMINFO_RCVBUF }%
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt
new file mode 100644
index 000000000000..b2e4950e0b83
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh
+sysctl -q net.ipv4.tcp_moderate_rcvbuf=0`
+
+// Establish a connection on the default receive buffer. Leave a large skb in
+// the queue, then deliver another one which still fits the remaining rwnd.
+// We should grow sk_rcvbuf to honor the already-advertised window instead of
+// dropping the packet.
+ +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+ +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+ +0 bind(3, ..., ...) = 0
+ +0 listen(3, 1) = 0
+
+ +0 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 7>
+ +0 > S. 0:0(0) ack 1 <...>
+ +.1 < . 1:1(0) ack 1 win 257
+
+ +0 accept(3, ..., ...) = 4
+
+// Exchange enough data to get past the completely fresh-socket case while
+// still keeping the receive buffer at its 128kB default.
+ +0 < P. 1:65001(65000) ack 1 win 257
+ * > . 1:1(0) ack 65001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 65001:130001(65000) ack 1 win 257
+ * > . 1:1(0) ack 130001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 130001:195001(65000) ack 1 win 257
+ * > . 1:1(0) ack 195001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 195001:260001(65000) ack 1 win 257
+ * > . 1:1(0) ack 260001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 260001:325001(65000) ack 1 win 257
+ * > . 1:1(0) ack 325001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 325001:390001(65000) ack 1 win 257
+ * > . 1:1(0) ack 390001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 390001:455001(65000) ack 1 win 257
+ * > . 1:1(0) ack 455001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 455001:520001(65000) ack 1 win 257
+ * > . 1:1(0) ack 520001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 520001:585001(65000) ack 1 win 257
+ * > . 1:1(0) ack 585001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 585001:650001(65000) ack 1 win 257
+ * > . 1:1(0) ack 650001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 650001:715001(65000) ack 1 win 257
+ * > . 1:1(0) ack 715001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 715001:780001(65000) ack 1 win 257
+ * > . 1:1(0) ack 780001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 780001:845001(65000) ack 1 win 257
+ * > . 1:1(0) ack 845001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 845001:910001(65000) ack 1 win 257
+ * > . 1:1(0) ack 910001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 910001:975001(65000) ack 1 win 257
+ * > . 1:1(0) ack 975001
+ +0 read(4, ..., 65000) = 65000
+
+ +0 < P. 975001:1040001(65000) ack 1 win 257
+ * > . 1:1(0) ack 1040001
+ +0 read(4, ..., 65000) = 65000
+
+// Leave about 60kB queued, then accept another large skb which still fits
+// the rwnd we already exposed to the peer. The regression is the drop; the
+// exact sk_rcvbuf growth path is an implementation detail.
+ +0 < P. 1040001:1102001(62000) ack 1 win 257
+ * > . 1:1(0) ack 1102001
+
+ +0 < P. 1102001:1167001(65000) ack 1 win 257
+ * > . 1:1(0) ack 1167001
+ +0 read(4, ..., 127000) = 127000
diff --git a/tools/testing/selftests/net/tcp_ao/lib/aolib.h b/tools/testing/selftests/net/tcp_ao/lib/aolib.h
index ebb2899c12fe..ff259795a4a0 100644
--- a/tools/testing/selftests/net/tcp_ao/lib/aolib.h
+++ b/tools/testing/selftests/net/tcp_ao/lib/aolib.h
@@ -13,6 +13,7 @@
#include <linux/snmp.h>
#include <linux/tcp.h>
#include <netinet/in.h>
+#include <stddef.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdlib.h>
@@ -671,17 +672,42 @@ struct tcp_sock_state {
int timestamp;
};
-extern void __test_sock_checkpoint(int sk, struct tcp_sock_state *state,
- void *addr, size_t addr_size);
+/* Legacy userspace stops before the snapshot field and therefore exercises
+ * the kernel's unknown-snapshot fallback path.
+ */
+static inline socklen_t test_tcp_repair_window_legacy_size(void)
+{
+ return offsetof(struct tcp_repair_window, rcv_wnd_scaling_ratio);
+}
+
+static inline socklen_t test_tcp_repair_window_exact_size(void)
+{
+ return sizeof(struct tcp_repair_window);
+}
+
+void __test_sock_checkpoint_opt(int sk, struct tcp_sock_state *state,
+ socklen_t trw_len,
+ void *addr, size_t addr_size);
static inline void test_sock_checkpoint(int sk, struct tcp_sock_state *state,
sockaddr_af *saddr)
{
- __test_sock_checkpoint(sk, state, saddr, sizeof(*saddr));
+ __test_sock_checkpoint_opt(sk, state, test_tcp_repair_window_exact_size(),
+ saddr, sizeof(*saddr));
+}
+
+static inline void test_sock_checkpoint_legacy(int sk,
+ struct tcp_sock_state *state,
+ sockaddr_af *saddr)
+{
+ __test_sock_checkpoint_opt(sk, state, test_tcp_repair_window_legacy_size(),
+ saddr, sizeof(*saddr));
}
extern void test_ao_checkpoint(int sk, struct tcp_ao_repair *state);
-extern void __test_sock_restore(int sk, const char *device,
- struct tcp_sock_state *state,
- void *saddr, void *daddr, size_t addr_size);
+void __test_sock_restore_opt(int sk, const char *device,
+ struct tcp_sock_state *state,
+ socklen_t trw_len,
+ void *saddr, void *daddr,
+ size_t addr_size);
static inline void test_sock_restore(int sk, struct tcp_sock_state *state,
sockaddr_af *saddr,
const union tcp_addr daddr,
@@ -690,7 +716,23 @@ static inline void test_sock_restore(int sk, struct tcp_sock_state *state,
sockaddr_af addr;
tcp_addr_to_sockaddr_in(&addr, &daddr, htons(dport));
- __test_sock_restore(sk, veth_name, state, saddr, &addr, sizeof(addr));
+ __test_sock_restore_opt(sk, veth_name, state,
+ test_tcp_repair_window_exact_size(),
+ saddr, &addr, sizeof(addr));
+}
+
+static inline void test_sock_restore_legacy(int sk,
+ struct tcp_sock_state *state,
+ sockaddr_af *saddr,
+ const union tcp_addr daddr,
+ unsigned int dport)
+{
+ sockaddr_af addr;
+
+ tcp_addr_to_sockaddr_in(&addr, &daddr, htons(dport));
+ __test_sock_restore_opt(sk, veth_name, state,
+ test_tcp_repair_window_legacy_size(),
+ saddr, &addr, sizeof(addr));
}
extern void test_ao_restore(int sk, struct tcp_ao_repair *state);
extern void test_sock_state_free(struct tcp_sock_state *state);
diff --git a/tools/testing/selftests/net/tcp_ao/lib/repair.c b/tools/testing/selftests/net/tcp_ao/lib/repair.c
index 9893b3ba69f5..befbd0f72db5 100644
--- a/tools/testing/selftests/net/tcp_ao/lib/repair.c
+++ b/tools/testing/selftests/net/tcp_ao/lib/repair.c
@@ -66,8 +66,9 @@ static void test_sock_checkpoint_queue(int sk, int queue, int qlen,
test_error("recv(%d): %d", qlen, ret);
}
-void __test_sock_checkpoint(int sk, struct tcp_sock_state *state,
- void *addr, size_t addr_size)
+void __test_sock_checkpoint_opt(int sk, struct tcp_sock_state *state,
+ socklen_t trw_len,
+ void *addr, size_t addr_size)
{
socklen_t len = sizeof(state->info);
int ret;
@@ -82,9 +83,9 @@ void __test_sock_checkpoint(int sk, struct tcp_sock_state *state,
if (getsockname(sk, addr, &len) || len != addr_size)
test_error("getsockname(): %d", (int)len);
- len = sizeof(state->trw);
+ len = trw_len;
ret = getsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &state->trw, &len);
- if (ret || len != sizeof(state->trw))
+ if (ret || len != trw_len)
test_error("getsockopt(TCP_REPAIR_WINDOW): %d", (int)len);
if (ioctl(sk, SIOCOUTQ, &state->outq_len))
@@ -160,9 +161,10 @@ static void test_sock_restore_queue(int sk, int queue, void *buf, int len)
} while (len > 0);
}
-void __test_sock_restore(int sk, const char *device,
- struct tcp_sock_state *state,
- void *saddr, void *daddr, size_t addr_size)
+void __test_sock_restore_opt(int sk, const char *device,
+ struct tcp_sock_state *state,
+ socklen_t trw_len,
+ void *saddr, void *daddr, size_t addr_size)
{
struct tcp_repair_opt opts[4];
unsigned int opt_nr = 0;
@@ -215,7 +217,7 @@ void __test_sock_restore(int sk, const char *device,
}
test_sock_restore_queue(sk, TCP_RECV_QUEUE, state->in.buf, state->inq_len);
test_sock_restore_queue(sk, TCP_SEND_QUEUE, state->out.buf, state->outq_len);
- if (setsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &state->trw, sizeof(state->trw)))
+ if (setsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &state->trw, trw_len))
test_error("setsockopt(TCP_REPAIR_WINDOW)");
}
diff --git a/tools/testing/selftests/net/tcp_ao/self-connect.c b/tools/testing/selftests/net/tcp_ao/self-connect.c
index 2c73bea698a6..a7edd72ab28d 100644
--- a/tools/testing/selftests/net/tcp_ao/self-connect.c
+++ b/tools/testing/selftests/net/tcp_ao/self-connect.c
@@ -4,6 +4,7 @@
#include "aolib.h"
static union tcp_addr local_addr;
+static bool checked_repair_window_lens;
static void __setup_lo_intf(const char *lo_intf,
const char *addr_str, uint8_t prefix)
@@ -30,8 +31,40 @@ static void setup_lo_intf(const char *lo_intf)
#endif
}
+/* The repair ABI accepts exactly the legacy and extended layouts. */
+static void test_repair_window_len_contract(int sk)
+{
+ struct tcp_repair_window trw = {};
+ socklen_t len = test_tcp_repair_window_exact_size();
+ socklen_t bad_len = test_tcp_repair_window_legacy_size() + 1;
+ int ret;
+
+ if (checked_repair_window_lens)
+ return;
+
+ checked_repair_window_lens = true;
+
+ ret = getsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &trw, &len);
+ if (ret || len != test_tcp_repair_window_exact_size())
+ test_error("getsockopt(TCP_REPAIR_WINDOW): %d", (int)len);
+
+ len = bad_len;
+ ret = getsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &trw, &len);
+ if (ret == 0 || errno != EINVAL)
+ test_fail("repair-window get rejects invalid len");
+ else
+ test_ok("repair-window get rejects invalid len");
+
+ ret = setsockopt(sk, SOL_TCP, TCP_REPAIR_WINDOW, &trw, bad_len);
+ if (ret == 0 || errno != EINVAL)
+ test_fail("repair-window set rejects invalid len");
+ else
+ test_ok("repair-window set rejects invalid len");
+}
+
static void tcp_self_connect(const char *tst, unsigned int port,
- bool different_keyids, bool check_restore)
+ bool different_keyids, bool check_restore,
+ bool legacy_repair_window)
{
struct tcp_counters before, after;
uint64_t before_aogood, after_aogood;
@@ -109,7 +142,11 @@ static void tcp_self_connect(const char *tst, unsigned int port,
}
test_enable_repair(sk);
- test_sock_checkpoint(sk, &img, &addr);
+ test_repair_window_len_contract(sk);
+ if (legacy_repair_window)
+ test_sock_checkpoint_legacy(sk, &img, &addr);
+ else
+ test_sock_checkpoint(sk, &img, &addr);
#ifdef IPV6_TEST
addr.sin6_port = htons(port + 1);
#else
@@ -123,7 +160,11 @@ static void tcp_self_connect(const char *tst, unsigned int port,
test_error("socket()");
test_enable_repair(sk);
- __test_sock_restore(sk, "lo", &img, &addr, &addr, sizeof(addr));
+ __test_sock_restore_opt(sk, "lo", &img,
+ legacy_repair_window ?
+ test_tcp_repair_window_legacy_size() :
+ test_tcp_repair_window_exact_size(),
+ &addr, &addr, sizeof(addr));
if (different_keyids) {
if (test_add_repaired_key(sk, DEFAULT_TEST_PASSWORD, 0,
local_addr, -1, 7, 5))
@@ -165,20 +206,24 @@ static void *client_fn(void *arg)
setup_lo_intf("lo");
- tcp_self_connect("self-connect(same keyids)", port++, false, false);
+ tcp_self_connect("self-connect(same keyids)", port++, false, false, false);
/* expecting rnext to change based on the first segment RNext != Current */
trace_ao_event_expect(TCP_AO_RNEXT_REQUEST, local_addr, local_addr,
port, port, 0, -1, -1, -1, -1, -1, 7, 5, -1);
- tcp_self_connect("self-connect(different keyids)", port++, true, false);
- tcp_self_connect("self-connect(restore)", port, false, true);
+ tcp_self_connect("self-connect(different keyids)", port++, true, false, false);
+ tcp_self_connect("self-connect(restore)", port, false, true, false);
+ port += 2; /* restore test restores over different port */
+ tcp_self_connect("self-connect(restore, legacy repair window)",
+ port, false, true, true);
port += 2; /* restore test restores over different port */
trace_ao_event_expect(TCP_AO_RNEXT_REQUEST, local_addr, local_addr,
port, port, 0, -1, -1, -1, -1, -1, 7, 5, -1);
/* intentionally on restore they are added to the socket in different order */
trace_ao_event_expect(TCP_AO_RNEXT_REQUEST, local_addr, local_addr,
port + 1, port + 1, 0, -1, -1, -1, -1, -1, 5, 7, -1);
- tcp_self_connect("self-connect(restore, different keyids)", port, true, true);
+ tcp_self_connect("self-connect(restore, different keyids)",
+ port, true, true, false);
port += 2; /* restore test restores over different port */
return NULL;
@@ -186,6 +231,6 @@ static void *client_fn(void *arg)
int main(int argc, char *argv[])
{
- test_init(5, client_fn, NULL);
+ test_init(8, client_fn, NULL);
return 0;
}
--
2.34.1
* [PATCH net 6/7] tcp: expose rmem and backlog accounting in rcvbuf_grow tracepoints
From: Wesley Atwell @ 2026-03-11 7:55 UTC
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
The receive-window work now depends on keeping sender-visible rwnd and
hard receive-memory accounting aligned.
Expose the current rmem charge and backlog reservation in the TCP and
MPTCP rcvbuf_grow tracepoints so that later drift between advertised
window and local backing is visible during review and debugging.
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
include/trace/events/mptcp.h | 11 +++++++----
include/trace/events/tcp.h | 12 +++++++-----
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 269d949b2025..167970e8e0a5 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -199,6 +199,8 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
__field(__u32, inq)
__field(__u32, space)
__field(__u32, ooo_space)
+ __field(__u32, rmem_alloc)
+ __field(__u32, backlog_len)
__field(__u32, rcvbuf)
__field(__u32, rcv_wnd)
__field(__u8, scaling_ratio)
@@ -228,6 +230,8 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq -
msk->ack_seq;
+ __entry->rmem_alloc = tcp_rmem_used(sk);
+ __entry->backlog_len = READ_ONCE(msk->backlog_len);
__entry->rcvbuf = sk->sk_rcvbuf;
__entry->rcv_wnd = atomic64_read(&msk->rcv_wnd_sent) -
msk->ack_seq;
@@ -248,12 +252,11 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
__entry->skaddr = sk;
),
- TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u "
- "rcvbuf=%u rcv_wnd=%u family=%d sport=%hu dport=%hu saddr=%pI4 "
- "daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p",
+ TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rmem_alloc=%u backlog_len=%u rcvbuf=%u rcv_wnd=%u family=%d sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p",
__entry->time, __entry->rtt_us, __entry->copied,
__entry->inq, __entry->space, __entry->ooo_space,
- __entry->scaling_ratio, __entry->rcvbuf, __entry->rcv_wnd,
+ __entry->scaling_ratio, __entry->rmem_alloc,
+ __entry->backlog_len, __entry->rcvbuf, __entry->rcv_wnd,
__entry->family, __entry->sport, __entry->dport,
__entry->saddr, __entry->daddr, __entry->saddr_v6,
__entry->daddr_v6, __entry->skaddr)
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index f155f95cdb6e..92d0bd6be0ba 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -217,6 +217,8 @@ TRACE_EVENT(tcp_rcvbuf_grow,
__field(__u32, inq)
__field(__u32, space)
__field(__u32, ooo_space)
+ __field(__u32, rmem_alloc)
+ __field(__u32, backlog_len)
__field(__u32, rcvbuf)
__field(__u32, rcv_ssthresh)
__field(__u32, window_clamp)
@@ -247,6 +249,8 @@ TRACE_EVENT(tcp_rcvbuf_grow,
TCP_SKB_CB(tp->ooo_last_skb)->end_seq -
tp->rcv_nxt;
+ __entry->rmem_alloc = tcp_rmem_used(sk);
+ __entry->backlog_len = READ_ONCE(sk->sk_backlog.len);
__entry->rcvbuf = sk->sk_rcvbuf;
__entry->rcv_ssthresh = tp->rcv_ssthresh;
__entry->window_clamp = tp->window_clamp;
@@ -269,13 +273,11 @@ TRACE_EVENT(tcp_rcvbuf_grow,
__entry->sock_cookie = sock_gen_cookie(sk);
),
- TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rcvbuf=%u "
- "rcv_ssthresh=%u window_clamp=%u rcv_wnd=%u "
- "family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 "
- "saddrv6=%pI6c daddrv6=%pI6c skaddr=%p sock_cookie=%llx",
+ TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rmem_alloc=%u backlog_len=%u rcvbuf=%u rcv_ssthresh=%u window_clamp=%u rcv_wnd=%u family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p sock_cookie=%llx",
__entry->time, __entry->rtt_us, __entry->copied,
__entry->inq, __entry->space, __entry->ooo_space,
- __entry->scaling_ratio, __entry->rcvbuf,
+ __entry->scaling_ratio, __entry->rmem_alloc,
+ __entry->backlog_len, __entry->rcvbuf,
__entry->rcv_ssthresh, __entry->window_clamp,
__entry->rcv_wnd,
show_family_name(__entry->family),
--
2.34.1
* [PATCH net 5/7] mptcp: refresh tcp rcv_wnd snapshot when syncing receive windows
From: Wesley Atwell @ 2026-03-11 7:55 UTC
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
MPTCP rewrites the TCP shadow receive window on subflows when shared
receive-window state changes.
Once tp->rcv_wnd carries paired snapshot semantics, those subflow shadow
updates have to refresh the snapshot too. Convert the MPTCP window-sync
write sites to use the helper and keep the aggregate receive-space
arithmetic using the explicit rwnd-availability helper.
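The difference between a plain u32 "+=" and the clamped update can be shown with a small userspace sketch. The function names are hypothetical; the inputs are assumed to satisfy mptcp_rcv_wnd >= rcv_wnd_sent, as I understand the resync path to guarantee before it reaches this point.

```c
#include <stdint.h>

/* Plain u32 accumulation: silently wraps when the sum exceeds 32 bits. */
static uint32_t model_resync_wrapping(uint32_t rcv_wnd, uint64_t mptcp_rcv_wnd,
				      uint64_t rcv_wnd_sent)
{
	rcv_wnd += (uint32_t)(mptcp_rcv_wnd - rcv_wnd_sent);
	return rcv_wnd;
}

/* Clamped update: do the sum in 64 bits, then saturate at U32_MAX. */
static uint32_t model_resync_clamped(uint32_t rcv_wnd, uint64_t mptcp_rcv_wnd,
				     uint64_t rcv_wnd_sent)
{
	uint64_t sum = (uint64_t)rcv_wnd + (mptcp_rcv_wnd - rcv_wnd_sent);

	return sum > UINT32_MAX ? UINT32_MAX : (uint32_t)sum;
}
```

For small deltas both variants agree. Near the top of the u32 range the wrapping variant yields a tiny window, while the clamped one stays at the maximum.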
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
net/mptcp/options.c | 12 ++++++++----
net/mptcp/protocol.h | 14 +++++++++++---
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 43df4293f58b..6e6aa084cbfa 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1073,9 +1073,12 @@ static void rwin_update(struct mptcp_sock *msk, struct sock *ssk,
return;
/* Some other subflow grew the mptcp-level rwin since rcv_wup,
- * resync.
+ * resync. Keep the TCP shadow window in its advertised u32 domain
+ * and refresh the advertise-time scaling snapshot while doing so.
*/
- tp->rcv_wnd += mptcp_rcv_wnd - subflow->rcv_wnd_sent;
+ tcp_set_rcv_wnd(tp, min_t(u64, (u64)tp->rcv_wnd +
+ (mptcp_rcv_wnd - subflow->rcv_wnd_sent),
+ U32_MAX));
subflow->rcv_wnd_sent = mptcp_rcv_wnd;
}
@@ -1334,11 +1337,12 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
if (rcv_wnd_new != rcv_wnd_old) {
raise_win:
/* The msk-level rcv wnd is after the tcp level one,
- * sync the latter.
+ * sync the latter and refresh its advertise-time scaling
+ * snapshot.
*/
rcv_wnd_new = rcv_wnd_old;
win = rcv_wnd_old - ack_seq;
- tp->rcv_wnd = min_t(u64, win, U32_MAX);
+ tcp_set_rcv_wnd(tp, min_t(u64, win, U32_MAX));
new_win = tp->rcv_wnd;
/* Make sure we do not exceed the maximum possible
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0bd1ee860316..4ea95c9c0c7a 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -408,11 +408,19 @@ static inline int mptcp_space_from_win(const struct sock *sk, int win)
return __tcp_space_from_win(mptcp_sk(sk)->scaling_ratio, win);
}
+/* MPTCP exposes window space from the mptcp-level receive queue, so it tracks
+ * a separate backlog counter from the subflow backlog embedded in struct sock.
+ */
+static inline int mptcp_rwnd_avail(const struct sock *sk)
+{
+ return READ_ONCE(sk->sk_rcvbuf) -
+ READ_ONCE(mptcp_sk(sk)->backlog_len) -
+ tcp_rmem_used(sk);
+}
+
static inline int __mptcp_space(const struct sock *sk)
{
- return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
- READ_ONCE(mptcp_sk(sk)->backlog_len) -
- sk_rmem_alloc_get(sk));
+ return mptcp_win_from_space(sk, mptcp_rwnd_avail(sk));
}
static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
--
2.34.1
^ permalink raw reply related
* [PATCH net 4/7] tcp: extend TCP_REPAIR_WINDOW with receive-window scaling snapshot
From: Wesley Atwell @ 2026-03-11 7:55 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
The paired receive-window state is now part of the live TCP socket
semantics, so repair and restore need a way to preserve it.
Extend TCP_REPAIR_WINDOW with the advertise-time scaling snapshot while
keeping old userspace working. The kernel now accepts exactly the legacy
layout and the extended layout. Legacy restore leaves the snapshot
unknown so the socket falls back safely until a fresh local window
advertisement refreshes the pair, while the extended layout restores the
exact snapshot.
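The length check described above can be sketched in userspace as follows. This is only an illustration: the struct name and helper names are hypothetical, and the first three fields are assumed to match the existing UAPI struct tcp_repair_window (the hunk below only shows rcv_wnd, rcv_wup and the new field).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the extended struct tcp_repair_window.
 * snd_wl1/snd_wnd/max_window are assumed from the existing UAPI layout.
 */
struct repair_window_ext {
	uint32_t snd_wl1;
	uint32_t snd_wnd;
	uint32_t max_window;
	uint32_t rcv_wnd;
	uint32_t rcv_wup;
	uint32_t rcv_wnd_scaling_ratio; /* 0 means advertise-time basis unknown */
};

/* Size of the pre-extension layout: everything before the new field. */
static size_t repair_window_legacy_size(void)
{
	return offsetof(struct repair_window_ext, rcv_wnd_scaling_ratio);
}

/* Only the two exact layouts are accepted; nothing in between. */
static bool repair_window_len_ok(size_t len)
{
	return len == repair_window_legacy_size() ||
	       len == sizeof(struct repair_window_ext);
}
```

With this layout the legacy size is 20 bytes and the extended size is 24 bytes, so a caller built against old headers keeps passing 20 and is treated as "snapshot unknown".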
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
include/uapi/linux/tcp.h | 1 +
net/ipv4/tcp.c | 34 ++++++++++++++++++++++++++++------
2 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 03772dd4d399..3a799f4c0e1e 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -159,6 +159,7 @@ struct tcp_repair_window {
__u32 rcv_wnd;
__u32 rcv_wup;
+ __u32 rcv_wnd_scaling_ratio; /* 0 means advertise-time basis unknown */
};
enum {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cec9ae1bf875..dd2b4fe61bd8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3551,17 +3551,25 @@ static inline bool tcp_can_repair_sock(const struct sock *sk)
(sk->sk_state != TCP_LISTEN);
}
+/* Keep accepting the pre-extension TCP_REPAIR_WINDOW layout so legacy
+ * userspace can restore sockets without fabricating a snapshot basis.
+ */
+static inline int tcp_repair_window_legacy_size(void)
+{
+ return offsetof(struct tcp_repair_window, rcv_wnd_scaling_ratio);
+}
+
static int tcp_repair_set_window(struct tcp_sock *tp, sockptr_t optbuf, int len)
{
- struct tcp_repair_window opt;
+ struct tcp_repair_window opt = {};
if (!tp->repair)
return -EPERM;
- if (len != sizeof(opt))
+ if (len != tcp_repair_window_legacy_size() && len != sizeof(opt))
return -EINVAL;
- if (copy_from_sockptr(&opt, optbuf, sizeof(opt)))
+ if (copy_from_sockptr(&opt, optbuf, len))
return -EFAULT;
if (opt.max_window < opt.snd_wnd)
@@ -3577,7 +3585,20 @@ static int tcp_repair_set_window(struct tcp_sock *tp, sockptr_t optbuf, int len)
tp->snd_wnd = opt.snd_wnd;
tp->max_window = opt.max_window;
- tp->rcv_wnd = opt.rcv_wnd;
+ if (len == tcp_repair_window_legacy_size()) {
+ /* Legacy repair UAPI has no advertise-time basis for tp->rcv_wnd.
+ * Mark the snapshot unknown until a fresh local advertisement
+ * re-establishes the pair.
+ */
+ tcp_set_rcv_wnd_unknown(tp, opt.rcv_wnd);
+ tp->rcv_wup = opt.rcv_wup;
+ return 0;
+ }
+
+ if (opt.rcv_wnd_scaling_ratio > U8_MAX)
+ return -EINVAL;
+
+ tcp_set_rcv_wnd_snapshot(tp, opt.rcv_wnd, opt.rcv_wnd_scaling_ratio);
tp->rcv_wup = opt.rcv_wup;
return 0;
@@ -4667,12 +4688,12 @@ int do_tcp_getsockopt(struct sock *sk, int level,
break;
case TCP_REPAIR_WINDOW: {
- struct tcp_repair_window opt;
+ struct tcp_repair_window opt = {};
if (copy_from_sockptr(&len, optlen, sizeof(int)))
return -EFAULT;
- if (len != sizeof(opt))
+ if (len != tcp_repair_window_legacy_size() && len != sizeof(opt))
return -EINVAL;
if (!tp->repair)
@@ -4683,6 +4704,7 @@ int do_tcp_getsockopt(struct sock *sk, int level,
opt.max_window = tp->max_window;
opt.rcv_wnd = tp->rcv_wnd;
opt.rcv_wup = tp->rcv_wup;
+ opt.rcv_wnd_scaling_ratio = tp->rcv_wnd_scaling_ratio;
if (copy_to_sockptr(optval, &opt, len))
return -EFAULT;
--
2.34.1
^ permalink raw reply related
* [PATCH net 3/7] tcp: honor advertised receive window in memory admission and clamping
From: Wesley Atwell @ 2026-03-11 7:55 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
tp->rcv_wnd is an advertised promise to the sender, but receive-memory
accounting was still reconstructing that promise through mutable live
state.
Switch the receive-side decisions over to the advertise-time snapshot.
Use it when deciding whether a packet can be admitted, when deciding how
far to clamp future window growth, and when handling the scaled-window
quantization slack in __tcp_select_window(). If a snapshot is not
available, keep the legacy fallback behavior.
This keeps sender-visible rwnd and the local hard rmem budget in the
same unit system instead of letting ratio drift create accounting
mismatches.
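The scaled-window quantization slack mentioned above can be illustrated with a small userspace sketch. The helper names are hypothetical, and the slack_backed flag stands in for the result of growing the receive buffer; it is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to a multiple of step (step must be a power of two). */
static uint32_t align_up(uint32_t v, uint32_t step)
{
	return (v + step - 1) & ~(step - 1);
}

/* Round v down to a multiple of step (step must be a power of two). */
static uint32_t round_down_pow2(uint32_t v, uint32_t step)
{
	return v & ~(step - 1);
}

/* Bytes that rounding up to the window-scale granularity adds on top of
 * the space that is actually free.
 */
static uint32_t quantization_slack(uint32_t free_space, unsigned int wscale)
{
	return align_up(free_space, 1U << wscale) - free_space;
}

/* Advertise the rounded-up window only if the extra bytes can be backed
 * by receive memory; otherwise fall back to rounding down.
 */
static uint32_t select_scaled_window(uint32_t free_space, unsigned int wscale,
				     int slack_backed)
{
	uint32_t step = 1U << wscale;
	uint32_t window = align_up(free_space, step);

	if (window > free_space && !slack_backed)
		window = round_down_pow2(free_space, step);
	return window;
}
```

For example, with 1000 bytes free and a window scale of 7 (granularity 128), rounding up advertises 1024 bytes, 24 more than are free; if those 24 bytes cannot be backed, the sketch advertises 896 instead.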
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
include/net/tcp.h | 1 +
net/ipv4/tcp_input.c | 86 ++++++++++++++++++++++++++++++++++++++++---
net/ipv4/tcp_output.c | 14 ++++++-
3 files changed, 93 insertions(+), 8 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 187e6d660f62..88ddf7ee826e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -384,6 +384,7 @@ int tcp_ioctl(struct sock *sk, int cmd, int *karg);
enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
void tcp_rcvbuf_grow(struct sock *sk, u32 newval);
+bool tcp_try_grow_rcvbuf(struct sock *sk, int needed);
void tcp_rcv_space_adjust(struct sock *sk);
int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
void tcp_twsk_destructor(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cba89733d121..f76011fc1b7a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -774,8 +774,37 @@ static void tcp_init_buffer_space(struct sock *sk)
(u32)TCP_INIT_CWND * tp->advmss);
}
+/* Try to grow sk_rcvbuf so the hard receive-memory limit covers @needed
+ * bytes beyond the memory already charged in sk_rmem_alloc.
+ */
+bool tcp_try_grow_rcvbuf(struct sock *sk, int needed)
+{
+ struct net *net = sock_net(sk);
+ int target;
+ int rmem2;
+
+ needed = max(needed, 0);
+ target = tcp_rmem_used(sk) + needed;
+
+ if (target <= READ_ONCE(sk->sk_rcvbuf))
+ return true;
+
+ rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
+ if (READ_ONCE(sk->sk_rcvbuf) >= rmem2 ||
+ (sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
+ tcp_under_memory_pressure(sk) ||
+ sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
+ return false;
+
+ WRITE_ONCE(sk->sk_rcvbuf,
+ min_t(int, rmem2,
+ max_t(int, READ_ONCE(sk->sk_rcvbuf), target)));
+
+ return target <= READ_ONCE(sk->sk_rcvbuf);
+}
+
/* 4. Recalculate window clamp after socket hit its memory bounds. */
-static void tcp_clamp_window(struct sock *sk)
+static void tcp_clamp_window_legacy(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -785,14 +814,42 @@ static void tcp_clamp_window(struct sock *sk)
icsk->icsk_ack.quick = 0;
rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
- if (sk->sk_rcvbuf < rmem2 &&
+ if (READ_ONCE(sk->sk_rcvbuf) < rmem2 &&
!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
!tcp_under_memory_pressure(sk) &&
sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
WRITE_ONCE(sk->sk_rcvbuf,
min(atomic_read(&sk->sk_rmem_alloc), rmem2));
}
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+ if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
+ tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
+}
+
+static void tcp_clamp_window(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 cur_rwnd = tcp_receive_window(tp);
+ int need;
+
+ if (!tcp_space_from_rcv_wnd(tp, cur_rwnd, &need)) {
+ tcp_clamp_window_legacy(sk);
+ return;
+ }
+
+ inet_csk(sk)->icsk_ack.quick = 0;
+ need = max_t(int, need, 0);
+
+ /* Keep the hard receive-memory cap large enough to honor the
+ * remaining receive window we already exposed to the sender. Use
+ * the scaling_ratio snapshot taken when tp->rcv_wnd was advertised,
+ * not the mutable live ratio which may drift later in the flow.
+ */
+ tcp_try_grow_rcvbuf(sk, need);
+
+ /* If the remaining advertised rwnd no longer fits the hard budget,
+ * slow future window growth until the accounting converges again.
+ */
+ if (need > tcp_rmem_avail(sk))
tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
}
@@ -5374,11 +5431,28 @@ static void tcp_ofo_queue(struct sock *sk)
static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
+/* Sequence checks run against the sender-visible receive window before this
+ * point. Convert the incoming payload back to the hard receive-memory budget
+ * using the scaling_ratio that was in force when tp->rcv_wnd was advertised,
+ * so admission keeps honoring the same exposed window even if the live ratio
+ * changes later in the flow. Legacy TCP_REPAIR restores do not have that
+ * advertise-time basis, so they fall back to the pre-series admission rule
+ * until a fresh local advertisement refreshes the pair.
+ *
+ * Do not subtract sk_backlog.len here. tcp_space() already reserves backlog
+ * bytes when selecting future advertised windows, and sk_backlog.len stays
+ * inflated until __release_sock() finishes draining backlog. Subtracting it
+ * again here would double count already-queued backlog packets as they move
+ * into sk_rmem_alloc.
+ */
static bool tcp_can_ingest(const struct sock *sk, const struct sk_buff *skb)
{
- unsigned int rmem = atomic_read(&sk->sk_rmem_alloc);
+ int need;
+
+ if (!tcp_space_from_rcv_wnd(tcp_sk(sk), skb->len, &need))
+ return atomic_read(&sk->sk_rmem_alloc) <= READ_ONCE(sk->sk_rcvbuf);
- return rmem <= sk->sk_rcvbuf;
+ return need <= tcp_rmem_avail(sk);
}
static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,
@@ -6014,7 +6088,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
struct tcp_sock *tp = tcp_sk(sk);
/* Do nothing if our queues are empty. */
- if (!atomic_read(&sk->sk_rmem_alloc))
+ if (!tcp_rmem_used(sk))
return -1;
NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c1b94d67d8fe..5e69fc31a4da 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3377,13 +3377,23 @@ u32 __tcp_select_window(struct sock *sk)
* scaled window will not line up with the MSS boundary anyway.
*/
if (tp->rx_opt.rcv_wscale) {
+ int rcv_wscale = 1 << tp->rx_opt.rcv_wscale;
+
window = free_space;
/* Advertise enough space so that it won't get scaled away.
- * Import case: prevent zero window announcement if
+ * Important case: prevent zero-window announcement if
* 1<<rcv_wscale > mss.
*/
- window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+ window = ALIGN(window, rcv_wscale);
+
+ /* Back any scale-quantization slack before we expose it.
+ * Otherwise tcp_can_ingest() can reject data which is still
+ * within the sender-visible window.
+ */
+ if (window > free_space &&
+ !tcp_try_grow_rcvbuf(sk, tcp_space_from_win(sk, window)))
+ window = round_down(free_space, rcv_wscale);
} else {
window = tp->rcv_wnd;
/* Get the largest window that is a nice multiple of mss.
--
2.34.1
^ permalink raw reply related
* [PATCH net 2/7] tcp: preserve rcv_wnd snapshot when updating advertised windows
From: Wesley Atwell @ 2026-03-11 7:55 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
Once tp->rcv_wnd carries paired snapshot semantics, every write of the
advertised window has to refresh the snapshot at the same time.
Convert the active-open, passive-open, and normal advertised-window
update sites to use tcp_set_rcv_wnd(). This keeps new sockets and later
window advertisements initialized with a valid advertise-time basis
before the receive-memory logic starts consuming it.
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv4/tcp_output.c | 8 ++++++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index dafb63b923d0..ae8a466b5298 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -603,7 +603,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->rx_opt.sack_ok = ireq->sack_ok;
newtp->window_clamp = req->rsk_window_clamp;
newtp->rcv_ssthresh = req->rsk_rcv_wnd;
- newtp->rcv_wnd = req->rsk_rcv_wnd;
+ tcp_set_rcv_wnd(newtp, req->rsk_rcv_wnd);
newtp->rx_opt.wscale_ok = ireq->wscale_ok;
if (newtp->rx_opt.wscale_ok) {
newtp->rx_opt.snd_wscale = ireq->snd_wscale;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..c1b94d67d8fe 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -291,7 +291,7 @@ static u16 tcp_select_window(struct sock *sk)
*/
if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
tp->pred_flags = 0;
- tp->rcv_wnd = 0;
+ tcp_set_rcv_wnd(tp, 0);
tp->rcv_wup = tp->rcv_nxt;
return 0;
}
@@ -314,7 +314,7 @@ static u16 tcp_select_window(struct sock *sk)
}
}
- tp->rcv_wnd = new_win;
+ tcp_set_rcv_wnd(tp, new_win);
tp->rcv_wup = tp->rcv_nxt;
/* Make sure we do not exceed the maximum possible
@@ -4150,6 +4150,10 @@ static void tcp_connect_init(struct sock *sk)
READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_window_scaling),
&rcv_wscale,
rcv_wnd);
+ /* tcp_select_initial_window() filled tp->rcv_wnd through its out-param,
+ * so snapshot the scaling_ratio we will use for that initial rwnd.
+ */
+ tcp_set_rcv_wnd(tp, tp->rcv_wnd);
tp->rx_opt.rcv_wscale = rcv_wscale;
tp->rcv_ssthresh = tp->rcv_wnd;
--
2.34.1
^ permalink raw reply related
* [PATCH net 1/7] tcp: track advertise-time scaling basis for rcv_wnd
From: Wesley Atwell @ 2026-03-11 7:55 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
tp->rcv_wnd is an advertised window, but later receive-side accounting
needs to recover the hard memory budget that window represented when it
was exposed.
Prepare for that by storing the scaling basis alongside tp->rcv_wnd and
centralizing the helper API around the paired state. While here, make the
existing receive-memory arithmetic use the shared helper names so later
behavioral changes can build on one explicit accounting model.
This patch is groundwork only. Later patches will refresh the snapshot at
window write sites and consume it in the receive-memory paths.
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
.../networking/net_cachelines/tcp_sock.rst | 1 +
include/linux/tcp.h | 1 +
include/net/tcp.h | 79 +++++++++++++++++--
net/ipv4/tcp.c | 1 +
4 files changed, 76 insertions(+), 6 deletions(-)
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index 563daea10d6c..1415981b9d8a 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -12,6 +12,7 @@ struct inet_connection_sock inet_conn
u16 tcp_header_len read_mostly read_mostly tcp_bound_to_half_wnd,tcp_current_mss(tx);tcp_rcv_established(rx)
u16 gso_segs read_mostly tcp_xmit_size_goal
__be32 pred_flags read_write read_mostly tcp_select_window(tx);tcp_rcv_established(rx)
+u8 rcv_wnd_scaling_ratio read_write read_mostly tcp_set_rcv_wnd,tcp_can_ingest,tcp_clamp_window
u64 bytes_received read_write tcp_rcv_nxt_update(rx)
u32 segs_in read_write tcp_v6_rcv(rx)
u32 data_segs_in read_write tcp_v6_rcv(rx)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f72eef31fa23..ec6b70c1174b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -297,6 +297,7 @@ struct tcp_sock {
est_ecnfield:2,/* ECN field for AccECN delivered estimates */
accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
prev_ecnfield:2; /* ECN bits from the previous segment */
+ u8 rcv_wnd_scaling_ratio; /* 0 if unknown, else tp->rcv_wnd basis */
__be32 pred_flags;
u64 tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
u64 tcp_mstamp; /* most recent packet received/sent */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 978eea2d5df0..187e6d660f62 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1702,6 +1702,26 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
return __tcp_space_from_win(tcp_sk(sk)->scaling_ratio, win);
}
+static inline bool tcp_rcv_wnd_snapshot_valid(const struct tcp_sock *tp)
+{
+ return tp->rcv_wnd_scaling_ratio != 0;
+}
+
+/* Rebuild hard receive-memory units for data already covered by tp->rcv_wnd if
+ * the advertise-time basis is known. Legacy TCP_REPAIR restores can only
+ * recover tp->rcv_wnd itself; callers must fall back when the snapshot is
+ * unknown.
+ */
+static inline bool tcp_space_from_rcv_wnd(const struct tcp_sock *tp, int win,
+ int *space)
+{
+ if (!tcp_rcv_wnd_snapshot_valid(tp))
+ return false;
+
+ *space = __tcp_space_from_win(tp->rcv_wnd_scaling_ratio, win);
+ return true;
+}
+
/* Assume a 50% default for skb->len/skb->truesize ratio.
* This may be adjusted later in tcp_measure_rcv_mss().
*/
@@ -1709,15 +1729,62 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
static inline void tcp_scaling_ratio_init(struct sock *sk)
{
- tcp_sk(sk)->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ tp->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+ tp->rcv_wnd_scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+/* tp->rcv_wnd is paired with the scaling_ratio that was in force when that
+ * window was last advertised. Legacy TCP_REPAIR restores can only recover the
+ * window value itself and use a zero snapshot until a fresh local window
+ * advertisement refreshes the pair.
+ */
+static inline void tcp_set_rcv_wnd_snapshot(struct tcp_sock *tp, u32 win,
+ u8 scaling_ratio)
+{
+ tp->rcv_wnd = win;
+ tp->rcv_wnd_scaling_ratio = scaling_ratio;
+}
+
+static inline void tcp_set_rcv_wnd(struct tcp_sock *tp, u32 win)
+{
+ tcp_set_rcv_wnd_snapshot(tp, win, tp->scaling_ratio);
+}
+
+static inline void tcp_set_rcv_wnd_unknown(struct tcp_sock *tp, u32 win)
+{
+ tcp_set_rcv_wnd_snapshot(tp, win, 0);
+}
+
+/* TCP receive-side accounting reuses sk_rcvbuf as both a hard memory limit
+ * and as the source material for the advertised receive window after
+ * scaling_ratio conversion. Keep the byte accounting explicit so admission,
+ * pruning, and rwnd selection all start from the same quantities.
+ */
+static inline int tcp_rmem_used(const struct sock *sk)
+{
+ return atomic_read(&sk->sk_rmem_alloc);
+}
+
+static inline int tcp_rmem_avail(const struct sock *sk)
+{
+ return READ_ONCE(sk->sk_rcvbuf) - tcp_rmem_used(sk);
+}
+
+/* Sender-visible rwnd headroom also reserves bytes already queued on backlog.
+ * Those bytes are not free to advertise again until __release_sock() drains
+ * backlog and clears sk_backlog.len.
+ */
+static inline int tcp_rwnd_avail(const struct sock *sk)
+{
+ return tcp_rmem_avail(sk) - READ_ONCE(sk->sk_backlog.len);
}
/* Note: caller must be prepared to deal with negative returns */
static inline int tcp_space(const struct sock *sk)
{
- return tcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
- READ_ONCE(sk->sk_backlog.len) -
- atomic_read(&sk->sk_rmem_alloc));
+ return tcp_win_from_space(sk, tcp_rwnd_avail(sk));
}
static inline int tcp_full_space(const struct sock *sk)
@@ -1760,7 +1827,7 @@ static inline bool tcp_rmem_pressure(const struct sock *sk)
rcvbuf = READ_ONCE(sk->sk_rcvbuf);
threshold = rcvbuf - (rcvbuf >> 3);
- return atomic_read(&sk->sk_rmem_alloc) > threshold;
+ return tcp_rmem_used(sk) > threshold;
}
static inline bool tcp_epollin_ready(const struct sock *sk, int target)
@@ -1910,7 +1977,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
tp->rcv_wnd &&
- atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+ tcp_rmem_avail(sk) > 0 &&
!tp->urg_data)
tcp_fast_path_on(tp);
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 202a4e57a218..cec9ae1bf875 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5238,6 +5238,7 @@ static void __init tcp_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd_scaling_ratio);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_tstamp);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rx_opt);
--
2.34.1
^ permalink raw reply related
* [PATCH net 0/7] tcp: preserve advertised rwnd accounting across receive-memory decisions
From: Wesley Atwell @ 2026-03-11 7:55 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, ncardwell, dsahern, matttbe,
martineau, netdev, mptcp
Cc: kuniyu, horms, geliang, corbet, skhan, rostedt, mhiramat,
mathieu.desnoyers, 0x7f454c46, linux-doc, linux-trace-kernel,
linux-kselftest, linux-kernel, linux-api, atwellwea
This series keeps sender-visible TCP receive-window accounting tied to the
scaling basis that was in force when the window was advertised.
Problem
-------
`tp->rcv_wnd` is an advertised promise to the sender, but later
receive-memory admission and clamping could reconstruct that promise
through the mutable live `scaling_ratio`. After ratio drift, the stack
could retain or advertise a receive window that no longer matched the
local hard rmem budget.
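A rough userspace sketch of the drift, assuming the conversion mirrors the existing window/space helpers in include/net/tcp.h (an 8-bit fixed-point ratio, where 128 means 50%). The function names and the constant name are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror TCP_RMEM_TO_WIN_SCALE in include/net/tcp.h. */
#define RMEM_TO_WIN_SCALE 8

/* Window bytes that a given amount of receive memory can back. */
static int64_t win_from_space(uint8_t ratio, int64_t space)
{
	return (space * ratio) >> RMEM_TO_WIN_SCALE;
}

/* Receive memory needed to back a given window; ratio must be non-zero. */
static int64_t space_from_win(uint8_t ratio, int64_t win)
{
	return (win << RMEM_TO_WIN_SCALE) / ratio;
}
```

With 65536 bytes of budget and a ratio of 128, the advertised window is 32768. Converting that same window back with the advertise-time ratio recovers 65536, but converting it with a live ratio that has drifted to 64 yields 131072, so the two sides of the accounting no longer agree.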
Fix
---
- store the advertise-time scaling basis alongside `tp->rcv_wnd`
- refresh that pair at the TCP and MPTCP receive-window write sites
- consume the snapshot in receive-memory admission, clamping, and the
scaled-window quantization path
- preserve the snapshot across `TCP_REPAIR_WINDOW` restore when userspace
provides it, and fall back safely when legacy userspace cannot
- expose the accounting in tracepoints and cover the ABI/runtime contract
in selftests
Series layout
-------------
1. track the receive-window snapshot state and helpers
2. refresh the snapshot when TCP advertises or initializes windows
3. use the snapshot in receive-memory admission and clamping
4. extend `TCP_REPAIR_WINDOW` for exact restore plus legacy compatibility
5. refresh the TCP shadow window snapshot in MPTCP
6. expose rmem/backlog state in `rcvbuf_grow` tracepoints
7. cover legacy and extended repair-window layouts in selftests
Testing
-------
- `git diff --check origin/main..HEAD`
- `scripts/checkpatch.pl --strict --show-types` on patches 1-7
- `make -j8 headers`
- `make -j8 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
- `make -j8 C=1 CF='-D__CHECK_ENDIAN__' W=1 net/ipv4/tcp_input.o net/ipv4/tcp_output.o net/ipv4/tcp_minisocks.o net/ipv4/tcp.o`
- `make SPHINXDIRS='networking/net_cachelines' htmldocs`
- `make -j8 vmlinux bzImage modules`
- `make -C tools/testing/selftests/net/tcp_ao -j8`
- `make -C tools/testing/selftests/net/mptcp -j8`
- `packetdrill --dry_run` for `tcp_rcv_toobig.pkt` and
`tcp_rcv_toobig_default.pkt`
- `virtme-run` guest pass for both packetdrill tests
- feature-enabled guest pass for `restore_ipv4`, `self-connect_ipv4`, and
`mptcp_sockopt.sh`
Thanks,
Wesley
---
base-commit: 908c344d5cfa0ee6efb3226d22ea661e078ebfa0
--
2.43.0
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-03-11 4:48 UTC (permalink / raw)
To: Christian Brauner
Cc: Andy Lutomirski, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260309-umsturz-herfallen-067eb2df7ec2@brauner>
[-- Attachment #1: Type: text/plain, Size: 2709 bytes --]
On 2026-03-09, Christian Brauner <brauner@kernel.org> wrote:
> > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > I think this needs more clarification as to what "regular" means,
> > > > since S_IFREG may not be sufficient. The UAPI group page says:
> > > >
> > > > Use-Case: this would be very useful to write secure programs that want
> > > > to avoid being tricked into opening device nodes with special
> > > > semantics while thinking they operate on regular files. This is
> > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > implementation of a naive web browser which is pointed to
> > > > file://dev/zero, not expecting an endless amount of data to read.
> > > >
> > > > What about procfs? What about sysfs? What about /proc/self/fd/17
> > > > where that fd is a memfd? What about files backed by non-"fast" disk
> > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > >
> > > > Are we concerned about blocking open? (open blocks as a matter of
> > > > course.) Are we concerned about open having strange side effects?
> > > > Are we concerned about write having strange side effects? Are we
> > > > concerned about cases where opening the file as root results in
> > > > elevated privilege beyond merely gaining the ability to write to that
> > > > specific path on an ordinary filesystem?
>
> I think this is opening up a barrage of questions that I'm not sure are
> all that useful. The ability to only open regular files isn't intended to
> defend against hung FUSE or NFS servers or other random Linux
> special-sauce murder-suicide file descriptor traps. For a lot of those
> we have O_PATH which can easily function with the new extension. A lot
> of the other special-sauce files (most anonymous inode fds) cannot even
> be reopened via e.g., /proc.
Indeed, I see OPENAT2_REGULAR as a way of optimising the tedious checks
that userspace does using O_PATH+/proc/self/fd/$n re-opening when
dealing with regular files.
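For reference, the userspace pattern in question looks roughly like the sketch below. The function name and the errno chosen for the non-regular case are arbitrary; the caller is assumed not to pass O_NOFOLLOW or O_CREAT in flags, and /proc must be mounted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open @path only if it is a regular file.
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_regular_only(const char *path, int flags)
{
	char link[64];
	struct stat st;
	int pfd, fd, saved;

	/* O_PATH does not call into a device driver or block on a FIFO. */
	pfd = open(path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
	if (pfd < 0)
		return -1;

	if (fstat(pfd, &st) < 0) {
		saved = errno;
		close(pfd);
		errno = saved;
		return -1;
	}

	if (!S_ISREG(st.st_mode)) {
		close(pfd);
		errno = EINVAL; /* arbitrary choice for this sketch */
		return -1;
	}

	/* Re-open the already-checked inode through its /proc magic link. */
	snprintf(link, sizeof(link), "/proc/self/fd/%d", pfd);
	fd = open(link, flags | O_CLOEXEC);
	saved = errno;
	close(pfd);
	errno = saved;
	return fd;
}
```

That is three syscalls plus a /proc dependency for what would be a single openat2() call with the new flag.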
For the problem of stuck NFS handles and so on, an idea I've had on my
backlog for a long time was RESOLVE_NO_REMOTE that would block those
kinds of things. IMHO it doesn't make sense to block those things with
an O_* flag because (especially in the NFS example) directory components
can also cause the syscall to block indefinitely and so RESOLVE_* flags
make more sense for this anyway. But in my mind this is a separate
problem to OPENAT2_REGULAR.
--
Aleksa Sarai
https://www.cyphar.com/
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]
^ permalink raw reply
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Darrick J. Wong @ 2026-03-10 20:44 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev, linux-api
In-Reply-To: <2cde8902-6d50-4035-b9c4-89bd5e2c9468@samsung.com>
On Tue, Mar 10, 2026 at 11:25:25PM +0530, Kanchan Joshi wrote:
> On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
> >> +struct fs_write_stream {
> >> + __u32 op_flags; /* IN: operation flags */
> >> + __u32 stream_id; /* IN/OUT: stream value to assign/query */
> >> + __u32 max_streams; /* OUT: max streams values supported */
> >> + __u32 rsvd;
> >> +};
> > This isn't a very cohesive interface -- GET_MAX probably only needs
> > op_flags and max_streams, right? And GET/SET only use op_flags and
> > stream_id, right?
>
> Yeah, right. That's the trade-off with swiss army knife type ioctl which
> uses op_flags to decide what it should do. Apart from keeping a single
> ioctl I was thinking a bit about extensibility (for anything new we may
> be able to do a new op_flags with some rsvd or union) too. But if you
> feel strongly about this, I can take the 3-ioctl route?
struct fs_write_stream {
__u32 op_flags;
union {
__u32 stream_id;
__u32 max_ids;
};
__u64 reserved;
};
perhaps? You might want to look into whether or not we're allowed to
have anonymous unions in UAPI headers. We all ❤️ C11, right?
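A quick userspace check of that layout, as a sketch only: fixed-width stdint types stand in for __u32/__u64, the struct and helper names are made up, and it assumes a C11 compiler for the anonymous union.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the proposed layout; not a UAPI definition. */
struct fs_write_stream_sketch {
	uint32_t op_flags;
	union {
		uint32_t stream_id; /* used by GET/SET */
		uint32_t max_ids;   /* used by GET_MAX */
	};
	uint64_t reserved;
};

/* 4 + 4 + 8 bytes with natural alignment, so no implicit padding. */
_Static_assert(sizeof(struct fs_write_stream_sketch) == 16,
	       "unexpected padding");
_Static_assert(offsetof(struct fs_write_stream_sketch, stream_id) ==
	       offsetof(struct fs_write_stream_sketch, max_ids),
	       "union members must share an offset");
_Static_assert(offsetof(struct fs_write_stream_sketch, reserved) == 8,
	       "reserved must be 8-byte aligned");

/* Both union members name the same 32 bits. */
static uint32_t read_back_via_alias(uint32_t v)
{
	struct fs_write_stream_sketch ws = { 0 };

	ws.max_ids = v;
	return ws.stream_id;
}
```

The size stays at 16 bytes, the same as the four-u32 version, so the ioctl number would not change with this layout.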
--D
> >> +#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
> >> +#define FS_WRITE_STREAM_OP_GET (1 << 1)
> >> +#define FS_WRITE_STREAM_OP_SET (1 << 2)
> >> +
> >> +#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
> > EXT4_IOC_CHECKPOINT already took 'f' / 43. I /think/ there's no problem
> > because its argument is a u32 and ioctl definitions incorporate the
> > lower bits of the argument size but you might want to be careful
> > anyway.
>
> Indeed, thanks!
>
^ permalink raw reply
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Kanchan Joshi @ 2026-03-10 17:55 UTC (permalink / raw)
To: Darrick J. Wong
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev, linux-api
In-Reply-To: <20260309163325.GE6033@frogsfrogsfrogs>
On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
>> +struct fs_write_stream {
>> + __u32 op_flags; /* IN: operation flags */
>> + __u32 stream_id; /* IN/OUT: stream value to assign/query */
>> + __u32 max_streams; /* OUT: max streams values supported */
>> + __u32 rsvd;
>> +};
> This isn't a very cohesive interface -- GET_MAX probably only needs
> op_flags and max_streams, right? And GET/SET only use op_flags and
> stream_id, right?
Yeah, right. That's the trade-off with a Swiss-army-knife ioctl that
uses op_flags to decide what it should do. Apart from keeping a single
ioctl, I was also thinking about extensibility (for anything new, we may
be able to add a new op_flags value backed by rsvd or a union). But if you
feel strongly about this, I can take the three-ioctl route?
>> +#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
>> +#define FS_WRITE_STREAM_OP_GET (1 << 1)
>> +#define FS_WRITE_STREAM_OP_SET (1 << 2)
>> +
>> +#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
> EXT4_IOC_CHECKPOINT already took 'f' / 43. I /think/ there's no problem
> because its argument is a u32 and ioctl definitions incorporate the
> lower bits of the argument size but you might want to be careful
> anyway.
Indeed, thanks!
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Christian Brauner @ 2026-03-10 11:24 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jeff Layton, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrWjb+V-zrMT412MtmgDCx9y8simJBQ7+45C9MtdiSMnuw@mail.gmail.com>
On Mon, Mar 09, 2026 at 09:50:18AM -0700, Andy Lutomirski wrote:
> On Mon, Mar 9, 2026 at 1:58 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Sun, Mar 08, 2026 at 10:10:05AM -0700, Andy Lutomirski wrote:
> > > On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > > > >
> > > > > > This flag indicates the path should be opened if it's a regular file.
> > > > > > This is useful to write secure programs that want to avoid being
> > > > > > tricked into opening device nodes with special semantics while thinking
> > > > > > they operate on regular files. This is a requested feature from the
> > > > > > uapi-group[1].
> > > > > >
> > > > >
> > > > > I think this needs a lot more clarification as to what "regular"
> > > > > means. If it's literally
> > > > >
> > > > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > > > like FreeBSD, macOS.
> > > > >
> > > > > I think this needs more clarification as to what "regular" means,
> > > > > since S_IFREG may not be sufficient. The UAPI group page says:
> > > > >
> > > > > Use-Case: this would be very useful to write secure programs that want
> > > > > to avoid being tricked into opening device nodes with special
> > > > > semantics while thinking they operate on regular files. This is
> > > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > > implementation of a naive web browser which is pointed to
> > > > > file://dev/zero, not expecting an endless amount of data to read.
> > > > >
> > > > > What about procfs? What about sysfs? What about /proc/self/fd/17
> > > > > where that fd is a memfd? What about files backed by non-"fast" disk
> > > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > > >
> > > > > Are we concerned about blocking open? (open blocks as a matter of
> > > > > course.) Are we concerned about open having strange side effects?
> > > > > Are we concerned about write having strange side effects? Are we
> > > > > concerned about cases where opening the file as root results in
> > > > > elevated privilege beyond merely gaining the ability to write to that
> > > > > specific path on an ordinary filesystem?
> >
> > I think this is opening up a barrage of questions that I'm not sure are
> > all that useful. The ability to only open regular files isn't intended to
> > defend against hung FUSE or NFS servers or other random Linux
> > special-sauce murder-suicide file descriptor traps. For a lot of those
> > we have O_PATH which can easily function with the new extension. A lot
> > of the other special-sauce files (most anonymous inode fds) cannot even
> > be reopened via e.g., /proc.
>
> On the flip side, /proc itself can certainly be opened. Should
> O_REGULAR be able to open the more magical /proc and /sys files? Are
> there any that are problematic?
If procfs's job isn't to provide problematic files to userspace I'm not
sure what it is. Joking aside, I think in general you are of course
right that procfs is full of files that under a very strict
interpretation of "regular file" should absolutely not count as a
regular file. sysfs probably as well and let's ignore debugfs and
tracefs and all the other magic filesystems or files.
In general, Linux has been so loosey-goosey with "regular file" for such
a long time that making OPENAT2_REGULAR come up with some strict
definition of "this is a regular file - no really, pinky-promise a
regular one" - is just doomed to fail.
The other problem is that we cannot reasonably determine what odd file
the user really wanted to defend against opening with OPENAT2_REGULAR.
A caller may really want to open /proc/kmsg and just be sure that
someone didn't overmount it with a fifo (systemd does that in containers
iirc).
My personal "hot take" is that adding an api built around a regular file
with immediate irreversible side-effects for the caller on VFS
syscall-based open [1] is a bug. Such broken semantics is what ioctl()s
are for.
[1]: I mean specifically open(), openat2() etc. I'm excluding all
dedicated APIs that return file descriptors that cannot be reopened
via regular lookup.
From my pov, what would help is if one had a flexible way to scope opens
to, e.g., a filesystem. But imo, that is not policy the kernel can
reasonably express at the syscall api layer - it would look fugly as
hell and how many other knobs would we have to add to satisfy all needs.
I think that is best left to an lsm hooking into security_file_open()
which can maintain a map of files and filesystems to allow or deny - a
bpf lsm can do this quite nicely.
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Florian Weimer @ 2026-03-09 17:39 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Christian Brauner, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrWjb+V-zrMT412MtmgDCx9y8simJBQ7+45C9MtdiSMnuw@mail.gmail.com>
* Andy Lutomirski:
> On the flip side, /proc itself can certainly be opened. Should
> O_REGULAR be able to open the more magical /proc and /sys files? Are
> there any that are problematic?
It seems reading from /proc/kmsg is destructive. The file doesn't have
an end, either. It's more like a character device. Apparently,
/sys/kernel/tracing/trace_pipe is similar in that regard. Maybe that's
sufficient reason for blocking access? Although the side effect does
not happen on open.
The other issue is the incorrect size reporting in stat, which affects
most (all?) files under /proc and /sys. Userspace already has to work
around that, though.
Thanks,
Florian
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-09 16:50 UTC (permalink / raw)
To: Christian Brauner
Cc: Jeff Layton, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260309-umsturz-herfallen-067eb2df7ec2@brauner>
On Mon, Mar 9, 2026 at 1:58 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sun, Mar 08, 2026 at 10:10:05AM -0700, Andy Lutomirski wrote:
> > On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > > >
> > > > > This flag indicates the path should be opened if it's a regular file.
> > > > > This is useful to write secure programs that want to avoid being
> > > > > tricked into opening device nodes with special semantics while thinking
> > > > > they operate on regular files. This is a requested feature from the
> > > > > uapi-group[1].
> > > > >
> > > >
> > > > I think this needs a lot more clarification as to what "regular"
> > > > means. If it's literally
> > > >
> > > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > > like FreeBSD, macOS.
> > > >
> > > > I think this needs more clarification as to what "regular" means,
> > > > since S_IFREG may not be sufficient. The UAPI group page says:
> > > >
> > > > Use-Case: this would be very useful to write secure programs that want
> > > > to avoid being tricked into opening device nodes with special
> > > > semantics while thinking they operate on regular files. This is
> > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > implementation of a naive web browser which is pointed to
> > > > file://dev/zero, not expecting an endless amount of data to read.
> > > >
> > > > What about procfs? What about sysfs? What about /proc/self/fd/17
> > > > where that fd is a memfd? What about files backed by non-"fast" disk
> > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > >
> > > > Are we concerned about blocking open? (open blocks as a matter of
> > > > course.) Are we concerned about open having strange side effects?
> > > > Are we concerned about write having strange side effects? Are we
> > > > concerned about cases where opening the file as root results in
> > > > elevated privilege beyond merely gaining the ability to write to that
> > > > specific path on an ordinary filesystem?
>
> I think this is opening up a barrage of questions that I'm not sure are
> all that useful. The ability to only open regular files isn't intended to
> defend against hung FUSE or NFS servers or other random Linux
> special-sauce murder-suicide file descriptor traps. For a lot of those
> we have O_PATH which can easily function with the new extension. A lot
> of the other special-sauce files (most anonymous inode fds) cannot even
> be reopened via e.g., /proc.
On the flip side, /proc itself can certainly be opened. Should
O_REGULAR be able to open the more magical /proc and /sys files? Are
there any that are problematic?
--Andy
^ permalink raw reply
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Darrick J. Wong @ 2026-03-09 16:33 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev, linux-api
In-Reply-To: <20260309052944.156054-2-joshi.k@samsung.com>
[cc linux-api because this is certainly an API definition]
On Mon, Mar 09, 2026 at 10:59:40AM +0530, Kanchan Joshi wrote:
> Wire up the userspace interface for write stream management via a new
> vfs ioctl 'FS_IOC_WRITE_STREAM'.
> Application communicates the intended operation using the 'op_flags'
> field of the passed 'struct fs_write_stream'.
> Valid flags are:
> FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
> FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
> FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
>
> Application should query the available streams by using
> FS_WRITE_STREAM_OP_GET_MAX first.
> If returned value is N, valid stream values for the file are 0 to N.
> Stream value 0 implies that no stream is set on the file.
> Setting a larger value than available streams is rejected.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
> include/uapi/linux/fs.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 70b2b661f42c..4d0805b52949 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -338,6 +338,18 @@ struct file_attr {
> /* Get logical block metadata capability details */
> #define FS_IOC_GETLBMD_CAP _IOWR(0x15, 2, struct logical_block_metadata_cap)
>
> +struct fs_write_stream {
> + __u32 op_flags; /* IN: operation flags */
> + __u32 stream_id; /* IN/OUT: stream value to assign/query */
> + __u32 max_streams; /* OUT: max streams values supported */
> + __u32 rsvd;
> +};
This isn't a very cohesive interface -- GET_MAX probably only needs
op_flags and max_streams, right? And GET/SET only use op_flags and
stream_id, right?
> +#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
> +#define FS_WRITE_STREAM_OP_GET (1 << 1)
> +#define FS_WRITE_STREAM_OP_SET (1 << 2)
> +
> +#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
EXT4_IOC_CHECKPOINT already took 'f' / 43. I /think/ there's no problem
because its argument is a u32 and ioctl definitions incorporate the
lower bits of the argument size but you might want to be careful
anyway.
--D
> /*
> * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
> *
> --
> 2.25.1
>
>
^ permalink raw reply
* Re: [PATCH v2] sched/deadline: document new sched_getattr() feature for retrieving current parameters for DEADLINE tasks
From: Jonathan Corbet @ 2026-03-09 16:17 UTC (permalink / raw)
To: Tommaso Cucinotta, Peter Zijlstra
Cc: Tommaso Cucinotta, linux-api, Juri Lelli, Shuah Khan,
Shashank Balaji, linux-doc, linux-kernel
In-Reply-To: <20260304102843.1373905-2-tommaso.cucinotta@santannapisa.it>
Tommaso Cucinotta <tommaso.cucinotta@gmail.com> writes:
> Document in Documentation/sched/sched-deadline.rst the new capability of
> sched_getattr() to retrieve, for DEADLINE tasks, the runtime left and absolute
> deadline (setting the flags syscall parameter to 1), in addition to the static
> parameters (obtained with flags=0).
>
> Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
> Acked-by: Juri Lelli <juri.lelli@redhat.com>
> ---
> Documentation/scheduler/sched-deadline.rst | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
Applied, thanks.
jon
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Christian Brauner @ 2026-03-09 8:57 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jeff Layton, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrVt7o+7JCMfTX3Vu9PANJJgR8hB5Z2THcXzam61kG9Gig@mail.gmail.com>
On Sun, Mar 08, 2026 at 10:10:05AM -0700, Andy Lutomirski wrote:
> On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > >
> > > > This flag indicates the path should be opened if it's a regular file.
> > > > This is useful to write secure programs that want to avoid being
> > > > tricked into opening device nodes with special semantics while thinking
> > > > they operate on regular files. This is a requested feature from the
> > > > uapi-group[1].
> > > >
> > >
> > > I think this needs a lot more clarification as to what "regular"
> > > means. If it's literally
> > >
> > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > like FreeBSD, macOS.
> > >
> > > I think this needs more clarification as to what "regular" means,
> > > since S_IFREG may not be sufficient. The UAPI group page says:
> > >
> > > Use-Case: this would be very useful to write secure programs that want
> > > to avoid being tricked into opening device nodes with special
> > > semantics while thinking they operate on regular files. This is
> > > particularly relevant as many device nodes (or even FIFOs) come with
> > > blocking I/O (or even blocking open()!) by default, which is not
> > > expected from regular files backed by “fast” disk I/O. Consider
> > > implementation of a naive web browser which is pointed to
> > > file://dev/zero, not expecting an endless amount of data to read.
> > >
> > > What about procfs? What about sysfs? What about /proc/self/fd/17
> > > where that fd is a memfd? What about files backed by non-"fast" disk
> > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > >
> > > Are we concerned about blocking open? (open blocks as a matter of
> > > course.) Are we concerned about open having strange side effects?
> > > Are we concerned about write having strange side effects? Are we
> > > concerned about cases where opening the file as root results in
> > > elevated privilege beyond merely gaining the ability to write to that
> > > specific path on an ordinary filesystem?
I think this is opening up a barrage of questions that I'm not sure are
all that useful. The ability to only open regular files isn't intended to
defend against hung FUSE or NFS servers or other random Linux
special-sauce murder-suicide file descriptor traps. For a lot of those
we have O_PATH which can easily function with the new extension. A lot
of the other special-sauce files (most anonymous inode fds) cannot even
be reopened via e.g., /proc.
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-08 17:10 UTC (permalink / raw)
To: Jeff Layton
Cc: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, brauner, jack, chuck.lever, alex.aring, arnd, adilger,
mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <801cf2c42b80d486726ea0a3774e52abcb158100.camel@kernel.org>
On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > >
> > > This flag indicates the path should be opened if it's a regular file.
> > > This is useful to write secure programs that want to avoid being
> > > tricked into opening device nodes with special semantics while thinking
> > > they operate on regular files. This is a requested feature from the
> > > uapi-group[1].
> > >
> >
> > I think this needs a lot more clarification as to what "regular"
> > means. If it's literally
> >
> > > A corresponding error code EFTYPE has been introduced. For example, if
> > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > like FreeBSD, macOS.
> >
> > I think this needs more clarification as to what "regular" means,
> > since S_IFREG may not be sufficient. The UAPI group page says:
> >
> > Use-Case: this would be very useful to write secure programs that want
> > to avoid being tricked into opening device nodes with special
> > semantics while thinking they operate on regular files. This is
> > particularly relevant as many device nodes (or even FIFOs) come with
> > blocking I/O (or even blocking open()!) by default, which is not
> > expected from regular files backed by “fast” disk I/O. Consider
> > implementation of a naive web browser which is pointed to
> > file://dev/zero, not expecting an endless amount of data to read.
> >
> > What about procfs? What about sysfs? What about /proc/self/fd/17
> > where that fd is a memfd? What about files backed by non-"fast" disk
> > I/O like something on a flaky USB stick or a network mount or FUSE?
> >
> > Are we concerned about blocking open? (open blocks as a matter of
> > course.) Are we concerned about open having strange side effects?
> > Are we concerned about write having strange side effects? Are we
> > concerned about cases where opening the file as root results in
> > elevated privilege beyond merely gaining the ability to write to that
> > specific path on an ordinary filesystem?
> >
>
> Above the use-case, it also says:
>
> "O_REGULAR (inspired by the existing O_DIRECTORY flag for open()),
> which opens a file only if it is of type S_IFREG."
>
> Since we allow programs to open a directory under /proc or /sys using
> O_DIRECTORY, I don't think we should do anything different here. To the
> VFS, all of the examples you gave above are S_IFREG "regular files",
> even if they are backed by something quite irregular.
That's certainly a valid and consistent way to define this, but is it useful?
--Andy
^ permalink raw reply
* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jeff Layton @ 2026-03-08 11:40 UTC (permalink / raw)
To: Andy Lutomirski, Dorjoy Chowdhury
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrXVBA9uGEUdQPEZ2MVdxjLwwcWi5kzhOr1NdOWSSRaROw@mail.gmail.com>
On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> >
> > This flag indicates the path should be opened if it's a regular file.
> > This is useful to write secure programs that want to avoid being
> > tricked into opening device nodes with special semantics while thinking
> > they operate on regular files. This is a requested feature from the
> > uapi-group[1].
> >
>
> I think this needs a lot more clarification as to what "regular"
> means. If it's literally
>
> > A corresponding error code EFTYPE has been introduced. For example, if
> > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > like FreeBSD, macOS.
>
> I think this needs more clarification as to what "regular" means,
> since S_IFREG may not be sufficient. The UAPI group page says:
>
> Use-Case: this would be very useful to write secure programs that want
> to avoid being tricked into opening device nodes with special
> semantics while thinking they operate on regular files. This is
> particularly relevant as many device nodes (or even FIFOs) come with
> blocking I/O (or even blocking open()!) by default, which is not
> expected from regular files backed by “fast” disk I/O. Consider
> implementation of a naive web browser which is pointed to
> file://dev/zero, not expecting an endless amount of data to read.
>
> What about procfs? What about sysfs? What about /proc/self/fd/17
> where that fd is a memfd? What about files backed by non-"fast" disk
> I/O like something on a flaky USB stick or a network mount or FUSE?
>
> Are we concerned about blocking open? (open blocks as a matter of
> course.) Are we concerned about open having strange side effects?
> Are we concerned about write having strange side effects? Are we
> concerned about cases where opening the file as root results in
> elevated privilege beyond merely gaining the ability to write to that
> specific path on an ordinary filesystem?
>
Above the use-case, it also says:
"O_REGULAR (inspired by the existing O_DIRECTORY flag for open()),
which opens a file only if it is of type S_IFREG."
Since we allow programs to open a directory under /proc or /sys using
O_DIRECTORY, I don't think we should do anything different here. To the
VFS, all of the examples you gave above are S_IFREG "regular files",
even if they are backed by something quite irregular.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply