Linux io-uring development


* Re: [PATCH 1/2] io_uring/net: support registered buffer for plain send and recv
From: Jens Axboe @ 2026-06-08  3:08 UTC (permalink / raw)
  To: Ming Lei, io-uring
In-Reply-To: <20260601095853.3670199-2-ming.lei@redhat.com>

On 6/1/26 3:58 AM, Ming Lei wrote:
> diff --git a/io_uring/net.c b/io_uring/net.c
> index f01f1d25e930..9c42c3dbccd7 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -431,6 +432,14 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	sr->flags = READ_ONCE(sqe->ioprio);
>  	if (sr->flags & ~SENDMSG_FLAGS)
>  		return -EINVAL;
> +	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
> +		/* registered buffer send only supported for plain IORING_OP_SEND */
> +		if (req->opcode != IORING_OP_SEND ||
> +		    (sr->flags & IORING_RECVSEND_BUNDLE) ||
> +		    (req->flags & REQ_F_BUFFER_SELECT))
> +			return -EINVAL;
> +		req->buf_index = READ_ONCE(sqe->buf_index);
> +	}

I think this should either reject IORING_SEND_VECTORIZED, or if there's
a use case for it, ensure that it actually works.

Outside of that, change seems straightforward to me.
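
The rejection suggested above could be sketched as follows. This is a
standalone model, not the kernel code: the MODEL_* names, their bit values,
and model_send_prep_check() are hypothetical stand-ins for the real uapi
flags and for the check in io_sendmsg_prep(), and the sketch assumes the
"reject" option rather than making vectorized sends work with registered
buffers.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder bit values for illustration only, not the real uapi values. */
#define MODEL_FIXED_BUF		(1U << 0)
#define MODEL_BUNDLE		(1U << 1)
#define MODEL_VECTORIZED	(1U << 2)

/* Hypothetical opcodes standing in for IORING_OP_SEND / IORING_OP_SENDMSG. */
enum model_opcode { MODEL_OP_SEND, MODEL_OP_SENDMSG };

/*
 * Model of the prep-time check: a registered-buffer send is only accepted
 * for a plain send, without bundles, vectorized sends, or buffer select.
 */
static int model_send_prep_check(enum model_opcode op, unsigned int flags,
				 int buffer_select)
{
	if (flags & MODEL_FIXED_BUF) {
		if (op != MODEL_OP_SEND ||
		    (flags & (MODEL_BUNDLE | MODEL_VECTORIZED)) ||
		    buffer_select)
			return -EINVAL;
	}
	return 0;
}
```

Requests that do not set the fixed-buffer flag pass through this check
unchanged, so a vectorized send on its own is not affected.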

-- 
Jens Axboe


* Re: [PATCH 0/2] io_uring/net: support registered buffer for plain send and recv
From: Ming Lei @ 2026-06-07 23:30 UTC (permalink / raw)
  To: Jens Axboe, io-uring
In-Reply-To: <20260601095853.3670199-1-ming.lei@redhat.com>

Hello Jens,

Ping...

Thanks,
Ming


* Re: [PATCH] test/recv-bundle-pbuf-len-poison: add regression test for pbuf len corruption
From: Jens Axboe @ 2026-06-07 22:16 UTC (permalink / raw)
  To: Nyakundi Emmanuel; +Cc: federico.brasili, io-uring, linux-kernel
In-Reply-To: <20260607221114.135950-1-nyariboemmanuel8@gmail.com>

On 6/7/26 4:10 PM, Nyakundi Emmanuel wrote:
> A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
> ring can persistently corrupt the buffer descriptor length. When the
> receive fails with -EAGAIN, the kernel writes the requested length into
> buf->len during buffer selection but never restores it on failure.
> 
> A later unrelated IORING_OP_READ using the same buffer group then
> consumes the corrupted length, returning fewer bytes than expected.
> 
> This test reproduces the issue as reported by Federico Brasili.

Thanks, but I already wrote one, which also tests the much more
important aspect of the kernel change - that the reported CQE
completion reports the right amount without truncating the
buffer length when no bytes have been transferred.

And once again, it's not _corrupting_ the buffer length. It's
shrinking it, which is unexpected and should not happen, but there's
no corruption taking place.

I'm dubious on how much AI koolaid was used in reproducing the
test case and report? That said, it is something we should fix,
as the kernel should not be changing the buffer length for this
case.

-- 
Jens Axboe


* Re: [PATCH] iouring: Fix min_timeout behaviour
From: Jens Axboe @ 2026-06-07 22:13 UTC (permalink / raw)
  To: Christian A. Ehrhardt; +Cc: Tip ten Brink, io-uring, linux-kernel
In-Reply-To: <20260606201120.1441447-1-lk@c--e.de>

On Sat, 06 Jun 2026 22:11:20 +0200, Christian A. Ehrhardt wrote:
> The wakeup condition if a min timeout is present and has
> expired is that at least _one_ CQE was posted. Thus set
> the cq_tail target to ->cq_min_tail + 1. Without this
> commit a spurious wakeup can result in a premature wakeup
> because io_should_wake() will return true even if _no_ CQE
> was posted at all.
> 
> [...]

Applied, thanks!

[1/1] iouring: Fix min_timeout behaviour
      commit: 29fe1bd01b99714f3136f922230a643c2742cda9
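
The wakeup condition described in the commit message can be illustrated
with a small standalone model. This is a sketch, not the kernel's
io_should_wake(): model_should_wake() and its parameters are hypothetical,
and it only assumes that the check is a wrap-safe comparison of the
current CQ tail against a target tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-safe check: true once the ring's CQ tail has reached the target.
 * `target` plays the role of the waiter's cq_tail value.
 */
static bool model_should_wake(uint32_t target, uint32_t ring_tail)
{
	return (int32_t)(ring_tail - target) >= 0;
}

/*
 * Target after the min timeout has expired: one CQE past the tail that
 * was sampled when the wait began (the "->cq_min_tail + 1" of the fix).
 */
static uint32_t model_min_timeout_target(uint32_t cq_min_tail)
{
	return cq_min_tail + 1;
}
```

With a target equal to cq_min_tail itself, the check is already true when
the tail has not moved at all, which matches the premature wakeup the
commit message describes. With cq_min_tail + 1, at least one posted CQE is
needed.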

Best regards,
-- 
Jens Axboe


* [PATCH] test/recv-bundle-pbuf-len-poison: add regression test for pbuf len corruption
From: Nyakundi Emmanuel @ 2026-06-07 22:10 UTC (permalink / raw)
  To: axboe; +Cc: federico.brasili, io-uring, linux-kernel, Nyakundi Emmanuel
In-Reply-To: <1fd2ea63-c128-4641-9565-dbafd97de612@kernel.dk>

A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
ring can persistently corrupt the buffer descriptor length. When the
receive fails with -EAGAIN, the kernel writes the requested length into
buf->len during buffer selection but never restores it on failure.

A later unrelated IORING_OP_READ using the same buffer group then
consumes the corrupted length, returning fewer bytes than expected.

This test reproduces the issue as reported by Federico Brasili.

Reported-by: Federico Brasili <federico.brasili@gmail.com>
Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/
Signed-off-by: Nyakundi Emmanuel <nyariboemmanuel8@gmail.com>
---
 test/recv-bundle-pbuf-len-poison.c | 146 +++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 test/recv-bundle-pbuf-len-poison.c

diff --git a/test/recv-bundle-pbuf-len-poison.c b/test/recv-bundle-pbuf-len-poison.c
new file mode 100644
index 00000000..90fafff4
--- /dev/null
+++ b/test/recv-bundle-pbuf-len-poison.c
@@ -0,0 +1,146 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Regression test for io_uring provided-buffer ring length corruption.
+ *
+ * A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
+ * ring can persistently shrink the user-visible buffer descriptor length.
+ * The modified length is not rolled back when the receive fails with
+ * -EAGAIN, and a later unrelated IORING_OP_READ from a pipe consumes
+ * the corrupted length.
+ *
+ * Reported-by: Federico Brasili <federico.brasili@gmail.com>
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/socket.h>
+
+#include "liburing.h"
+#include "helpers.h"
+
+#define BGID		8
+#define BUF_SIZE	4096
+#define NR_BUFS		2
+
+static int test(void)
+{
+	struct io_uring_buf_ring *br;
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	struct io_uring ring;
+	struct io_uring_buf *buf_entry;
+	int sockfd, pipefds[2], ret;
+	void *buf;
+	char pipe_data[BUF_SIZE];
+
+	ret = io_uring_queue_init(8, &ring, 0);
+	if (ret) {
+		fprintf(stderr, "queue init failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	if (posix_memalign(&buf, 4096, BUF_SIZE * NR_BUFS))
+		return T_EXIT_FAIL;
+
+	/* set up non-INC provided buffer ring with 2 buffers of BUF_SIZE */
+	br = io_uring_setup_buf_ring(&ring, NR_BUFS, BGID, 0, &ret);
+	if (!br) {
+		if (ret == -EINVAL)
+			return T_EXIT_SKIP;
+		fprintf(stderr, "buf ring setup failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	io_uring_buf_ring_add(br, buf, BUF_SIZE, 0, NR_BUFS - 1, 0);
+	io_uring_buf_ring_add(br, (char *)buf + BUF_SIZE, BUF_SIZE, 1, NR_BUFS - 1, 1);
+	io_uring_buf_ring_advance(br, NR_BUFS);
+
+	/* create an empty SOCK_DGRAM socket to trigger -EAGAIN */
+	sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
+	if (sockfd < 0) {
+		perror("socket");
+		return T_EXIT_FAIL;
+	}
+
+	/* submit RECV_BUNDLE on empty socket — expects -EAGAIN */
+	sqe = io_uring_get_sqe(&ring);
+	io_uring_prep_recv(sqe, sockfd, NULL, 1, MSG_DONTWAIT);
+	sqe->ioprio |= IORING_RECVSEND_BUNDLE;
+	sqe->flags  |= IOSQE_BUFFER_SELECT;
+	sqe->buf_group = BGID;
+	sqe->user_data  = 0x1111;
+	io_uring_submit(&ring);
+
+	ret = io_uring_wait_cqe(&ring, &cqe);
+	if (ret) {
+		fprintf(stderr, "wait cqe failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+	if (cqe->res != -EAGAIN) {
+		fprintf(stderr, "expected -EAGAIN, got %d\n", cqe->res);
+		io_uring_cqe_seen(&ring, cqe);
+		return T_EXIT_FAIL;
+	}
+	io_uring_cqe_seen(&ring, cqe);
+
+	/* check entry0.len — must still be BUF_SIZE after failed RECV */
+	buf_entry = &br->bufs[0];
+	if (buf_entry->len != BUF_SIZE) {
+		fprintf(stderr,
+			"FAIL: entry0.len corrupted after -EAGAIN RECV_BUNDLE: "
+			"got %u, expected %u\n",
+			buf_entry->len, BUF_SIZE);
+		return T_EXIT_FAIL;
+	}
+
+	/* now do a pipe READ using the same buffer group */
+	if (pipe(pipefds)) {
+		perror("pipe");
+		return T_EXIT_FAIL;
+	}
+
+	memset(pipe_data, 'A', BUF_SIZE);
+	if (write(pipefds[1], pipe_data, BUF_SIZE) != BUF_SIZE) {
+		fprintf(stderr, "pipe write failed\n");
+		return T_EXIT_FAIL;
+	}
+
+	sqe = io_uring_get_sqe(&ring);
+	io_uring_prep_read(sqe, pipefds[0], NULL, BUF_SIZE, 0);
+	sqe->flags    |= IOSQE_BUFFER_SELECT;
+	sqe->buf_group = BGID;
+	sqe->user_data  = 0x6666;
+	io_uring_submit(&ring);
+
+	ret = io_uring_wait_cqe(&ring, &cqe);
+	if (ret) {
+		fprintf(stderr, "wait read cqe failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+	if (cqe->res != BUF_SIZE) {
+		fprintf(stderr,
+			"FAIL: READ got %d bytes, expected %d — "
+			"pbuf len was poisoned by failed RECV_BUNDLE\n",
+			cqe->res, BUF_SIZE);
+		io_uring_cqe_seen(&ring, cqe);
+		return T_EXIT_FAIL;
+	}
+	io_uring_cqe_seen(&ring, cqe);
+
+	close(sockfd);
+	close(pipefds[0]);
+	close(pipefds[1]);
+	io_uring_queue_exit(&ring);
+	free(buf);
+	return T_EXIT_PASS;
+}
+
+int main(int argc, char *argv[])
+{
+	if (argc > 1)
+		return T_EXIT_SKIP;
+
+	return test();
+}
-- 
2.54.0



* [PATCH] io_uring/kbuf: don't truncate end buffer for bundles
From: Jens Axboe @ 2026-06-07 22:11 UTC (permalink / raw)
  To: io-uring; +Cc: Federico Brasili

If buffers have been peeked for a bundle receive, the kernel will
truncate the end buffer, if the available length is shorter than the
buffer itself. This is unnecessary, as applications iterating bundle
receives must always use the minimum size of the buffer length and the
remaining number of bytes in the bundle. The examples in liburing do
that as well, eg examples/proxy.c.

If the kernel does truncate this buffer AND the current transfer fails,
then the buffer will be left with a smaller size than what is otherwise
available.

Just remove the buffer truncation, as it's not necessary in the first
place.

Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/
Reported-by: Federico Brasili <federico.brasili@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 63061aa1cab9..926254b6898f 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -305,7 +305,6 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
 				arg->partial_map = 1;
 				if (iov != arg->iovs)
 					break;
-				WRITE_ONCE(buf->len, len);
 			}
 		}

-- 
Jens Axboe


* Re: [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Jens Axboe @ 2026-06-07 21:52 UTC (permalink / raw)
  To: Federico Brasili; +Cc: io-uring, linux-kernel
In-Reply-To: <36351bf5-fb6a-4712-ae27-5b907452bdab@kernel.dk>

On 6/7/26 3:38 PM, Jens Axboe wrote:
>> The reproducer runs unprivileged and demonstrates:
>>
>> 1. non-INC provided-buffer ring with entry0.len = 4096 and entry1.len = 4096
>> 2. IORING_OP_RECV + IOSQE_BUFFER_SELECT + IORING_RECVSEND_BUNDLE on an
>> empty SOCK_DGRAM socket
>> 3. CQE returns -EAGAIN, but entry0.len is changed from 4096 to 1
>> 4. a later unrelated IORING_OP_READ from a pipe using the same buffer
>> group returns 1 byte instead of 4096
>> 5. a second READ uses entry1 and returns 4096, so head/bid accounting
>> appears coherent in this repro
>>
>> I am not claiming privilege escalation from this. The demonstrated
>> issue is persistent provided-buffer descriptor length corruption after
>> a failed/no-data RECV_BUNDLE, affecting a later READ operation.
> 
> Right, I believe you already mentioned that in the first email. It's just
> a bug that can cause the app to (rightfully) get confused about the
> state of a buffer.
> 
> And it's not a corruption in the sense that something else writes
> to this buffer length field, the kernel is deliberately writing
> to that valid piece of memory. It just misses restoring it when
> the operation fails.

IOW, it's a consistency issue. Words like unprivileged are tossed around
here, but the app could've just written this memory without even needing
the kernel to do it; it's application memory. There's absolutely nothing
privileged going on here, kernel isn't touching anything that the
application couldn't just have done itself, without involving the
kernel. The kernel _should_ not do it for this case, that's the bug. And
from a quick look, the fix would just be to remove that buf->len
assignment in this case. For the normal case of eg wanting to read 32b
where the length would've been truncated to 32b in the buffer, it should
be fine to leave it at 4096 or whatever size it is. For bundles,
userspace must iterate the buffers when it gets a completion for X
bytes. But the iteration should always be:

	unsigned this_len = min(buf->len, left);

and hence it should not matter if buf->len remains at the untouched
length, for a truncated end buffer.
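
A slightly fuller sketch of that iteration, as a standalone model rather
than liburing code: struct model_buf and model_bundle_walk() are
hypothetical names, and a real application would walk the entries of its
own provided-buffer ring instead of a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one provided-buffer ring entry. */
struct model_buf {
	unsigned int len;	/* length published by the application */
};

/*
 * Walk the buffers covered by a bundle completion of `total` bytes,
 * starting at ring index `head`. `mask` is the ring size minus one.
 * Stores the bytes used per buffer in used[] and returns the number of
 * buffers consumed (at most max_used).
 */
static size_t model_bundle_walk(const struct model_buf *bufs, unsigned int mask,
				unsigned int head, unsigned int total,
				unsigned int *used, size_t max_used)
{
	unsigned int left = total;
	size_t nr = 0;

	assert(((mask + 1) & mask) == 0);	/* ring size is a power of two */

	while (left && nr < max_used) {
		const struct model_buf *buf = &bufs[(head + nr) & mask];
		/* never trust buf->len alone, clamp to what is left */
		unsigned int this_len = buf->len < left ? buf->len : left;

		if (!this_len)	/* zero-length entry, stop instead of spinning */
			break;
		used[nr++] = this_len;
		left -= this_len;
	}
	return nr;
}
```

Because the last buffer's share is clamped by `left`, the result is the
same whether or not the kernel shortened that entry's len beforehand.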

-- 
Jens Axboe


* Re: [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Jens Axboe @ 2026-06-07 21:39 UTC (permalink / raw)
  To: Nyakundi Emmanuel, federico.brasili; +Cc: io-uring, linux-kernel
In-Reply-To: <nyakundi-confirm-recvsend-bundle-20260607@gmail.com>

On 6/7/26 3:22 PM, Nyakundi Emmanuel wrote:
> On Sun, 7 Jun 2026, Federico Brasili wrote:
>> I found a reproducible io_uring provided-buffer ring issue on Ubuntu
>> kernel 7.0.0-22-generic.
>>
>> A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
>> ring can persistently shrink the user-visible buffer descriptor length.
> 
> Confirmed reproducible on:
> 
>   Linux archlinux 7.0.11-arch1-1 #1 SMP PREEMPT_DYNAMIC
>   Tue, 02 Jun 2026 18:26:58 +0000 x86_64
>   Arch Linux (rolling)
> 
> Output from your reproducer, run unprivileged:
> 
>   [INIT] entry0 len=4096 bid=0 entry1 len=4096 bid=1 tail=2
>   [STEP1] poison empty socket: BUNDLE len=1 expect -EAGAIN but entry0 len may truncate
>   [CQE1] res=-11 flags=0x0 user=0x1111
>   [AFTER1] entry0 len=1 entry1 len=4096 tail=2 changed_buf0=0 changed_buf1=0 guard_before=0 guard_after=0
>   [STEP2] wrote pipe bytes=4096, now IORING_OP_READ len=4096 after recv-BUNDLE poisoning
>   [CQE_READ] res=1 flags=0x1 user=0x6666
>   [AFTER_READ] entry0 len=1 entry1 len=4096 tail=2 changed_buf0=1 changed_buf1=0 guard_before=0 guard_after=0
>   [STEP3] wrote second pipe chunk bytes=4096, second IORING_OP_READ len=4096 without republish
>   [CQE_READ2] res=4096 flags=0x10001 user=0x7777
>   [AFTER_READ2] entry0 len=1 entry1 len=4096 tail=2 changed_buf0=1 changed_buf1=4096 guard_before=0 guard_after=0
> 
> entry0.len persistently corrupted 4096 -> 1 after -EAGAIN RECV_BUNDLE.
> Subsequent IORING_OP_READ consumed the poisoned length as reported.
> 
> This confirms the issue is not Ubuntu-specific and reproduces on a
> stock upstream-tracking kernel.

Which is entirely expected, it's just a generic kernel bug and I doubt
that ubuntu is shipping any specific patches here that aren't already
in stable or upstream.

-- 
Jens Axboe



* Re: [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Jens Axboe @ 2026-06-07 21:38 UTC (permalink / raw)
  To: Federico Brasili; +Cc: io-uring, linux-kernel
In-Reply-To: <CAAEr8jZDdiYB2vp9VJzSqq2J-GssH8GhrLYYn_2W2KAjYwDzSQ@mail.gmail.com>

On 6/7/26 2:08 PM, Federico Brasili wrote:
> Hi Jens,
> 
> Sure, attaching the minimal reproducer and the output from my Ubuntu
> 7.0.0-22-generic test system.

Great, thanks, I'll take a look. For the record, please don't top-post
replies. It makes a mess of conversations on the mailing list.

> The reproducer runs unprivileged and demonstrates:
> 
> 1. non-INC provided-buffer ring with entry0.len = 4096 and entry1.len = 4096
> 2. IORING_OP_RECV + IOSQE_BUFFER_SELECT + IORING_RECVSEND_BUNDLE on an
> empty SOCK_DGRAM socket
> 3. CQE returns -EAGAIN, but entry0.len is changed from 4096 to 1
> 4. a later unrelated IORING_OP_READ from a pipe using the same buffer
> group returns 1 byte instead of 4096
> 5. a second READ uses entry1 and returns 4096, so head/bid accounting
> appears coherent in this repro
> 
> I am not claiming privilege escalation from this. The demonstrated
> issue is persistent provided-buffer descriptor length corruption after
> a failed/no-data RECV_BUNDLE, affecting a later READ operation.

Right, I believe you already mentioned that in the first email. It's just
a bug that can cause the app to (rightfully) get confused about the
state of a buffer.

And it's not a corruption in the sense that something else writes
to this buffer length field, the kernel is deliberately writing
to that valid piece of memory. It just misses restoring it when
the operation fails.

-- 
Jens Axboe



* Re: [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Nyakundi Emmanuel @ 2026-06-07 21:22 UTC (permalink / raw)
  To: federico.brasili; +Cc: axboe, io-uring, linux-kernel, Nyakundi Emmanuel
In-Reply-To: <CAAEr8jZDdiYB2vp9VJzSqq2J-GssH8GhrLYYn_2W2KAjYwDzSQ@mail.gmail.com>

On Sun, 7 Jun 2026, Federico Brasili wrote:
> I found a reproducible io_uring provided-buffer ring issue on Ubuntu
> kernel 7.0.0-22-generic.
>
> A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
> ring can persistently shrink the user-visible buffer descriptor length.

Confirmed reproducible on:

  Linux archlinux 7.0.11-arch1-1 #1 SMP PREEMPT_DYNAMIC
  Tue, 02 Jun 2026 18:26:58 +0000 x86_64
  Arch Linux (rolling)

Output from your reproducer, run unprivileged:

  [INIT] entry0 len=4096 bid=0 entry1 len=4096 bid=1 tail=2
  [STEP1] poison empty socket: BUNDLE len=1 expect -EAGAIN but entry0 len may truncate
  [CQE1] res=-11 flags=0x0 user=0x1111
  [AFTER1] entry0 len=1 entry1 len=4096 tail=2 changed_buf0=0 changed_buf1=0 guard_before=0 guard_after=0
  [STEP2] wrote pipe bytes=4096, now IORING_OP_READ len=4096 after recv-BUNDLE poisoning
  [CQE_READ] res=1 flags=0x1 user=0x6666
  [AFTER_READ] entry0 len=1 entry1 len=4096 tail=2 changed_buf0=1 changed_buf1=0 guard_before=0 guard_after=0
  [STEP3] wrote second pipe chunk bytes=4096, second IORING_OP_READ len=4096 without republish
  [CQE_READ2] res=4096 flags=0x10001 user=0x7777
  [AFTER_READ2] entry0 len=1 entry1 len=4096 tail=2 changed_buf0=1 changed_buf1=4096 guard_before=0 guard_after=0

entry0.len persistently corrupted 4096 -> 1 after -EAGAIN RECV_BUNDLE.
Subsequent IORING_OP_READ consumed the poisoned length as reported.

This confirms the issue is not Ubuntu-specific and reproduces on a
stock upstream-tracking kernel.

Nyakundi Emmanuel


* Re: [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Federico Brasili @ 2026-06-07 20:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-kernel
In-Reply-To: <71417fb0-4060-4823-8e4f-f216ce0235d4@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 5881 bytes --]

Hi Jens,

Sure, attaching the minimal reproducer and the output from my Ubuntu
7.0.0-22-generic test system.

The reproducer runs unprivileged and demonstrates:

1. non-INC provided-buffer ring with entry0.len = 4096 and entry1.len = 4096
2. IORING_OP_RECV + IOSQE_BUFFER_SELECT + IORING_RECVSEND_BUNDLE on an
empty SOCK_DGRAM socket
3. CQE returns -EAGAIN, but entry0.len is changed from 4096 to 1
4. a later unrelated IORING_OP_READ from a pipe using the same buffer
group returns 1 byte instead of 4096
5. a second READ uses entry1 and returns 4096, so head/bid accounting
appears coherent in this repro

I am not claiming privilege escalation from this. The demonstrated
issue is persistent provided-buffer descriptor length corruption after
a failed/no-data RECV_BUNDLE, affecting a later READ operation.

Thanks,
Federico

Il giorno dom 7 giu 2026 alle ore 21:07 Jens Axboe <axboe@kernel.dk> ha scritto:
>
> On 6/7/26 5:41 AM, Federico Brasili wrote:
> > Hi,
> >
> > I found a reproducible io_uring provided-buffer ring issue on Ubuntu
> > kernel 7.0.0-22-generic.
> >
> > A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
> > ring can persistently shrink the user-visible buffer descriptor
> > length. The modified length is not rolled back when the receive fails
> > with -EAGAIN/no data, and a later unrelated io_uring operation, such
> > as IORING_OP_READ from a pipe, consumes the corrupted length.
> >
> > This is not a demonstrated privilege escalation. The demonstrated
> > impact is deterministic unprivileged provided-buffer ring metadata
> > corruption across unrelated io_uring operations.
> >
> > Tested kernel:
> >
> > Linux ubuntu 7.0.0-22-generic #22-Ubuntu SMP PREEMPT_DYNAMIC Mon May
> > 25 15:54:34 UTC 2026 x86_64 GNU/Linux
> >
> > Summary:
> >
> > Create an io_uring instance as an unprivileged user.
> >
> > Register a non-INC provided-buffer ring with two buffers:
> >
> > entry0.len = 4096
> >
> > entry1.len = 4096
> >
> > Submit IORING_OP_RECV with:
> >
> > IOSQE_BUFFER_SELECT
> >
> > IORING_RECVSEND_BUNDLE
> >
> > req_len = 1
> >
> > MSG_DONTWAIT
> >
> > empty AF_UNIX SOCK_DGRAM socket
> >
> > The receive fails with -EAGAIN, but entry0.len is changed from 4096 to 1.
> >
> > Submit a later unrelated IORING_OP_READ from a pipe using the same
> > provided-buffer group with req_len = 4096.
> >
> > The READ returns only 1 byte, because it uses the previously corrupted
> > entry0.len.
> >
> > A second READ then consumes entry1 normally and returns 4096 bytes,
> > showing that head/bid accounting remains coherent and the corruption
> > is localized to the poisoned descriptor.
> >
> > Observed output from clean unprivileged reproduction:
> >
> > [INIT] uid=1002 entry0.len=4096 entry1.len=4096 tail=2
> > [STEP1] RECV BUNDLE on empty socket, req_len=1, expected CQE=-EAGAIN
> > [CQE_RECV_BUNDLE] res=-11 flags=0x0 user=0x1111
> > [AFTER_RECV_BUNDLE] entry0.len=1 entry1.len=4096 changed_buf0=0
> > changed_buf1=0 guard_before=0 guard_after=0
> > [STEP2] write pipe bytes=4096, then IORING_OP_READ req_len=4096 using
> > same pbuf group
> > [CQE_READ1] res=1 flags=0x1 user=0x6666
> > [AFTER_READ1] entry0.len=1 entry1.len=4096 changed_buf0=1
> > changed_buf1=0 guard_before=0 guard_after=0
> > [STEP3] write second pipe bytes=4096, then second IORING_OP_READ
> > req_len=4096 without republish
> > [CQE_READ2] res=4096 flags=0x10001 user=0x7777
> > [AFTER_READ2] entry0.len=1 entry1.len=4096 changed_buf0=1
> > changed_buf1=4096 guard_before=0 guard_after=0
> > [RESULT] PASS: unprivileged RECV_BUNDLE -EAGAIN poisoned pbuf len and
> > later IORING_OP_READ consumed the corrupted len.
> >
> > Why this looks like a bug:
> >
> > The failed receive should not persistently alter the provided-buffer
> > descriptor in a way that affects future unrelated operations. In this
> > case, a no-data/-EAGAIN RECV_BUNDLE changes entry0.len from 4096 to 1,
> > and that corrupted length is later consumed by IORING_OP_READ from a
> > pipe.
> >
> > The suspected root cause is in the non-INC provided-buffer ring BUNDLE
> > selection path:
> >
> > io_ring_buffers_peek()
> > if (len > arg->max_len) {
> > len = arg->max_len;
> > if (!(bl->flags & IOBL_INC)) {
> > arg->partial_map = 1;
> > if (iov != arg->iovs)
> > break;
> > WRITE_ONCE(buf->len, len);
> > }
> > }
> >
> > The descriptor length is modified during buffer selection/peek before
> > the receive operation has completed successfully. If the receive later
> > fails with -EAGAIN/no data, the buffer is recycled but the modified
> > buf->len is not restored.
> >
> > Additional observations:
> >
> > The issue reproduces as an unprivileged user.
> >
> > The effect crosses io_uring operations: RECV affects a later READ.
> >
> > The effect crosses subsystems: socket receive affects pipe read.
> >
> > The second READ correctly uses entry1 and returns 4096 bytes, so this
> > does not appear to be a head/bid desync in the tested case.
> >
> > No kernel crash, OOB write, UAF, or privilege escalation has been demonstrated.
> >
> > Expected behavior:
> >
> > If IORING_RECVSEND_BUNDLE fails with -EAGAIN/no data, the
> > provided-buffer ring descriptor should not be persistently modified,
> > or the original len should be restored during recycle/rollback.
> >
> > Actual behavior:
> >
> > The failed BUNDLE receive leaves entry0.len shortened to the requested
> > length, and later unrelated operations using the same provided-buffer
> > group consume that corrupted length.
> >
> > I can provide the minimal C reproducer and full output if useful.
>
> Please do, no point in me recreating one for it. Then it can also get
> turned into a regression test for liburing. Reproducers also mean more
> than a thousand words in an email, it tells us exactly what is being run
> and what is going wrong. Or in some cases, what the wrong expectations
> are.
>
> --
> Jens Axboe

[-- Attachment #2: iouring_pbuf_reproducer_for_jens.tar.gz --]
[-- Type: application/x-gzip, Size: 2865 bytes --]


* Re: [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Jens Axboe @ 2026-06-07 19:07 UTC (permalink / raw)
  To: Federico Brasili, io-uring; +Cc: linux-kernel
In-Reply-To: <CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com>

On 6/7/26 5:41 AM, Federico Brasili wrote:
> Hi,
> 
> I found a reproducible io_uring provided-buffer ring issue on Ubuntu
> kernel 7.0.0-22-generic.
> 
> A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
> ring can persistently shrink the user-visible buffer descriptor
> length. The modified length is not rolled back when the receive fails
> with -EAGAIN/no data, and a later unrelated io_uring operation, such
> as IORING_OP_READ from a pipe, consumes the corrupted length.
> 
> This is not a demonstrated privilege escalation. The demonstrated
> impact is deterministic unprivileged provided-buffer ring metadata
> corruption across unrelated io_uring operations.
> 
> Tested kernel:
> 
> Linux ubuntu 7.0.0-22-generic #22-Ubuntu SMP PREEMPT_DYNAMIC Mon May
> 25 15:54:34 UTC 2026 x86_64 GNU/Linux
> 
> Summary:
> 
> Create an io_uring instance as an unprivileged user.
> 
> Register a non-INC provided-buffer ring with two buffers:
> 
> entry0.len = 4096
> 
> entry1.len = 4096
> 
> Submit IORING_OP_RECV with:
> 
> IOSQE_BUFFER_SELECT
> 
> IORING_RECVSEND_BUNDLE
> 
> req_len = 1
> 
> MSG_DONTWAIT
> 
> empty AF_UNIX SOCK_DGRAM socket
> 
> The receive fails with -EAGAIN, but entry0.len is changed from 4096 to 1.
> 
> Submit a later unrelated IORING_OP_READ from a pipe using the same
> provided-buffer group with req_len = 4096.
> 
> The READ returns only 1 byte, because it uses the previously corrupted
> entry0.len.
> 
> A second READ then consumes entry1 normally and returns 4096 bytes,
> showing that head/bid accounting remains coherent and the corruption
> is localized to the poisoned descriptor.
> 
> Observed output from clean unprivileged reproduction:
> 
> [INIT] uid=1002 entry0.len=4096 entry1.len=4096 tail=2
> [STEP1] RECV BUNDLE on empty socket, req_len=1, expected CQE=-EAGAIN
> [CQE_RECV_BUNDLE] res=-11 flags=0x0 user=0x1111
> [AFTER_RECV_BUNDLE] entry0.len=1 entry1.len=4096 changed_buf0=0
> changed_buf1=0 guard_before=0 guard_after=0
> [STEP2] write pipe bytes=4096, then IORING_OP_READ req_len=4096 using
> same pbuf group
> [CQE_READ1] res=1 flags=0x1 user=0x6666
> [AFTER_READ1] entry0.len=1 entry1.len=4096 changed_buf0=1
> changed_buf1=0 guard_before=0 guard_after=0
> [STEP3] write second pipe bytes=4096, then second IORING_OP_READ
> req_len=4096 without republish
> [CQE_READ2] res=4096 flags=0x10001 user=0x7777
> [AFTER_READ2] entry0.len=1 entry1.len=4096 changed_buf0=1
> changed_buf1=4096 guard_before=0 guard_after=0
> [RESULT] PASS: unprivileged RECV_BUNDLE -EAGAIN poisoned pbuf len and
> later IORING_OP_READ consumed the corrupted len.
> 
> Why this looks like a bug:
> 
> The failed receive should not persistently alter the provided-buffer
> descriptor in a way that affects future unrelated operations. In this
> case, a no-data/-EAGAIN RECV_BUNDLE changes entry0.len from 4096 to 1,
> and that corrupted length is later consumed by IORING_OP_READ from a
> pipe.
> 
> The suspected root cause is in the non-INC provided-buffer ring BUNDLE
> selection path:
> 
> io_ring_buffers_peek()
> if (len > arg->max_len) {
> len = arg->max_len;
> if (!(bl->flags & IOBL_INC)) {
> arg->partial_map = 1;
> if (iov != arg->iovs)
> break;
> WRITE_ONCE(buf->len, len);
> }
> }
> 
> The descriptor length is modified during buffer selection/peek before
> the receive operation has completed successfully. If the receive later
> fails with -EAGAIN/no data, the buffer is recycled but the modified
> buf->len is not restored.
> 
> Additional observations:
> 
> The issue reproduces as an unprivileged user.
> 
> The effect crosses io_uring operations: RECV affects a later READ.
> 
> The effect crosses subsystems: socket receive affects pipe read.
> 
> The second READ correctly uses entry1 and returns 4096 bytes, so this
> does not appear to be a head/bid desync in the tested case.
> 
> No kernel crash, OOB write, UAF, or privilege escalation has been demonstrated.
> 
> Expected behavior:
> 
> If IORING_RECVSEND_BUNDLE fails with -EAGAIN/no data, the
> provided-buffer ring descriptor should not be persistently modified,
> or the original len should be restored during recycle/rollback.
> 
> Actual behavior:
> 
> The failed BUNDLE receive leaves entry0.len shortened to the requested
> length, and later unrelated operations using the same provided-buffer
> group consume that corrupted length.
> 
> I can provide the minimal C reproducer and full output if useful.

Please do, no point in me recreating one for it. Then it can also get
turned into a regression test for liburing. Reproducers also mean more
than a thousand words in an email, it tells us exactly what is being run
and what is going wrong. Or in some cases, what the wrong expectations
are.

-- 
Jens Axboe


* [BUG io_uring] Failed RECVSEND_BUNDLE can persistently shrink non-INC pbuf ring len and affect later READ operations
From: Federico Brasili @ 2026-06-07 11:41 UTC (permalink / raw)
  To: io-uring; +Cc: linux-kernel

Hi,

I found a reproducible io_uring provided-buffer ring issue on Ubuntu
kernel 7.0.0-22-generic.

A failed IORING_RECVSEND_BUNDLE receive on a non-INC provided-buffer
ring can persistently shrink the user-visible buffer descriptor
length. The modified length is not rolled back when the receive fails
with -EAGAIN/no data, and a later unrelated io_uring operation, such
as IORING_OP_READ from a pipe, consumes the corrupted length.

This is not a demonstrated privilege escalation. The demonstrated
impact is deterministic unprivileged provided-buffer ring metadata
corruption across unrelated io_uring operations.

Tested kernel:

Linux ubuntu 7.0.0-22-generic #22-Ubuntu SMP PREEMPT_DYNAMIC Mon May
25 15:54:34 UTC 2026 x86_64 GNU/Linux

Summary:

Create an io_uring instance as an unprivileged user.

Register a non-INC provided-buffer ring with two buffers:

entry0.len = 4096

entry1.len = 4096

Submit IORING_OP_RECV with:

IOSQE_BUFFER_SELECT

IORING_RECVSEND_BUNDLE

req_len = 1

MSG_DONTWAIT

empty AF_UNIX SOCK_DGRAM socket

The receive fails with -EAGAIN, but entry0.len is changed from 4096 to 1.

Submit a later unrelated IORING_OP_READ from a pipe using the same
provided-buffer group with req_len = 4096.

The READ returns only 1 byte, because it uses the previously corrupted
entry0.len.

A second READ then consumes entry1 normally and returns 4096 bytes,
showing that head/bid accounting remains coherent and the corruption
is localized to the poisoned descriptor.

Observed output from clean unprivileged reproduction:

[INIT] uid=1002 entry0.len=4096 entry1.len=4096 tail=2
[STEP1] RECV BUNDLE on empty socket, req_len=1, expected CQE=-EAGAIN
[CQE_RECV_BUNDLE] res=-11 flags=0x0 user=0x1111
[AFTER_RECV_BUNDLE] entry0.len=1 entry1.len=4096 changed_buf0=0 changed_buf1=0 guard_before=0 guard_after=0
[STEP2] write pipe bytes=4096, then IORING_OP_READ req_len=4096 using same pbuf group
[CQE_READ1] res=1 flags=0x1 user=0x6666
[AFTER_READ1] entry0.len=1 entry1.len=4096 changed_buf0=1 changed_buf1=0 guard_before=0 guard_after=0
[STEP3] write second pipe bytes=4096, then second IORING_OP_READ req_len=4096 without republish
[CQE_READ2] res=4096 flags=0x10001 user=0x7777
[AFTER_READ2] entry0.len=1 entry1.len=4096 changed_buf0=1 changed_buf1=4096 guard_before=0 guard_after=0
[RESULT] PASS: unprivileged RECV_BUNDLE -EAGAIN poisoned pbuf len and later IORING_OP_READ consumed the corrupted len.

Why this looks like a bug:

The failed receive should not persistently alter the provided-buffer
descriptor in a way that affects future unrelated operations. In this
case, a no-data/-EAGAIN RECV_BUNDLE changes entry0.len from 4096 to 1,
and that corrupted length is later consumed by IORING_OP_READ from a
pipe.

The suspected root cause is in the non-INC provided-buffer ring BUNDLE
selection path:

io_ring_buffers_peek()
	if (len > arg->max_len) {
		len = arg->max_len;
		if (!(bl->flags & IOBL_INC)) {
			arg->partial_map = 1;
			if (iov != arg->iovs)
				break;
			WRITE_ONCE(buf->len, len);
		}
	}

The descriptor length is modified during buffer selection/peek before
the receive operation has completed successfully. If the receive later
fails with -EAGAIN/no data, the buffer is recycled but the modified
buf->len is not restored.
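
To make the sequence concrete, here is a minimal user-space model of it.
It is only a sketch under stated assumptions: toy_buf, toy_peek(),
toy_recv_eagain() and toy_read() are hypothetical names invented for
illustration, not kernel code, and the clamp is reduced to the
single-buffer case.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for one provided-buffer ring entry (hypothetical, not the kernel struct). */
struct toy_buf {
	uint32_t len;
};

/*
 * Model of the clamp in the peek step: if the buffer is longer than the
 * request, the published descriptor length is shrunk to the request size.
 * Returns the length selected for this request.
 */
uint32_t toy_peek(struct toy_buf *buf, uint32_t max_len)
{
	uint32_t len;

	assert(buf);
	len = buf->len;
	if (len > max_len) {
		len = max_len;
		buf->len = len;	/* persistent side effect on the descriptor */
	}
	return len;
}

/* Model of a receive that finds no data: the buffer is recycled, len is not restored. */
int toy_recv_eagain(struct toy_buf *buf, uint32_t req_len)
{
	(void)toy_peek(buf, req_len);
	return -11;	/* -EAGAIN */
}

/* A later read through the same entry can transfer at most the selected length. */
uint32_t toy_read(struct toy_buf *buf, uint32_t req_len, uint32_t avail)
{
	uint32_t len = toy_peek(buf, req_len);

	return avail < len ? avail : len;
}
```

In this model, starting from len = 4096, toy_recv_eagain(&buf, 1) leaves
buf.len at 1, and a following toy_read(&buf, 4096, 4096) returns 1, which
mirrors the CQE_READ1 result in the output above.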

Additional observations:

- The issue reproduces as an unprivileged user.

- The effect crosses io_uring operations: RECV affects a later READ.

- The effect crosses subsystems: socket receive affects pipe read.

- The second READ correctly uses entry1 and returns 4096 bytes, so this
  does not appear to be a head/bid desync in the tested case.

- No kernel crash, OOB write, UAF, or privilege escalation has been demonstrated.

Expected behavior:

If IORING_RECVSEND_BUNDLE fails with -EAGAIN/no data, the
provided-buffer ring descriptor should not be persistently modified,
or the original len should be restored during recycle/rollback.

Actual behavior:

The failed BUNDLE receive leaves entry0.len shortened to the requested
length, and later unrelated operations using the same provided-buffer
group consume that corrupted length.

I can provide the minimal C reproducer and full output if useful.

Thanks,
Federico

^ permalink raw reply

* Re: [PATCH] iouring: Fix min_timeout behaviour
From: Christian A. Ehrhardt @ 2026-06-07 10:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tip ten Brink, io-uring, linux-kernel
In-Reply-To: <9f94f066-ea36-443e-b989-cf920ff9d27e@kernel.dk>

On Sat, Jun 06, 2026 at 03:55:18PM -0600, Jens Axboe wrote:
> On 6/6/26 2:11 PM, Christian A. Ehrhardt wrote:
> > The wakeup condition if a min timeout is present and has
> > expired is that at least _one_ CQE was posted. Thus set
> > the cq_tail target to ->cq_min_tail + 1. Without this
> > commit a spurious wakeup can result in a premature wakeup
> > because io_should_wake() will return true even if _no_ CQE
> > was posted at all.
> > 
> > Tested by running the liburing testsuite with no regressions.
> > 
> > Additionally, tested by turning all calls to schedule() in
> > io_uring/wait.c into calls to schedule_timeout(1) to force
> > the spurious wakeups. With these spurious wakeups the
> > min-timeout.t test fails before and passes after this commit.
> 
> Either this or the test case is broken, with or without the change
> you sent for the test case. I'll take a look, but it's definitely
> not passing as-is.

I also tested with the zig reproducer from
	https://github.com/axboe/liburing/issues/1477
and with the spurious wakeups the reproducer shows the premature
wakeup without any CQE posted, too. It seems that the missing "+1"
is an oversight that got introduced between v1 and v2 of the commit
that fixed the above issue.
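
The effect of the missing "+1" can be shown with a small stand-alone
model. This is a sketch only: toy_should_wake() is a hypothetical
simplification that assumes the wake check is a signed distance between
the current CQ tail and the stored target, and it ignores every other
wake condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy version of the tail comparison: wake once the current CQ tail has
 * reached the target. The signed cast keeps the check correct across
 * 32-bit wraparound of the tail counter.
 */
bool toy_should_wake(uint32_t cq_tail_now, uint32_t target)
{
	int32_t dist = (int32_t)(cq_tail_now - target);

	return dist >= 0;
}
```

With the target set to cq_min_tail and no CQE posted since, the current
tail equals the target, the distance is 0, and any spurious wakeup passes
the check. With the target set to cq_min_tail + 1 the distance is -1 until
at least one CQE has been posted.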


Best regards,
Christian

^ permalink raw reply

* [PATCH v2 2/2] nfs: expose FMODE_NOWAIT for read-only files
From: Dylan Yudaken @ 2026-06-07  7:31 UTC (permalink / raw)
  To: trondmy, anna, linux-nfs; +Cc: axboe, io-uring, linux-kernel, Dylan Yudaken
In-Reply-To: <20260607073155.105314-1-dyudaken@gmail.com>

NFS O_DIRECT reads already (mostly) handle async requests, with the
exception of locking the inode for direct.
Handle async requests properly by using nfs_start_io_direct_nowait,
and then expose FMODE_NOWAIT since it's now supported for direct reads.

Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
---
 fs/nfs/direct.c | 12 ++++++++++--
 fs/nfs/file.c   | 16 +++++++++++++++-
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 48d89716193a..e626c72495e6 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -466,14 +466,22 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
 		goto out_release;
 	}
 	dreq->l_ctx = l_ctx;
-	if (!is_sync_kiocb(iocb))
+	if (!is_sync_kiocb(iocb)) {
 		dreq->iocb = iocb;
+	} else if (iocb->ki_flags & IOCB_NOWAIT) {
+		result = -EAGAIN;
+		nfs_direct_req_release(dreq);
+		goto out_release;
+	}
 
 	if (user_backed_iter(iter))
 		dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;
 
 	if (!swap) {
-		result = nfs_start_io_direct(inode);
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			result = nfs_start_io_direct_nowait(inode);
+		else
+			result = nfs_start_io_direct(inode);
 		if (result) {
 			/* release the reference that would usually be
 			 * consumed by nfs_direct_read_schedule_iovec()
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 25048a3c2364..a0d8f1c1cf10 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -72,8 +72,12 @@ nfs_file_open(struct inode *inode, struct file *filp)
 		return res;
 
 	res = nfs_open(inode, filp);
-	if (res == 0)
+	if (res == 0) {
 		filp->f_mode |= FMODE_CAN_ODIRECT;
+		/* flag NOWAIT on read-only files only */
+		if (!(filp->f_mode & FMODE_WRITE))
+			filp->f_mode |= FMODE_NOWAIT;
+	}
 	return res;
 }
 
@@ -166,6 +170,10 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return nfs_file_direct_read(iocb, to, false);
 
+	/* NOWAIT only supported on direct reads */
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EAGAIN;
+
 	dprintk("NFS: read(%pD2, %zu@%lu)\n",
 		iocb->ki_filp,
 		iov_iter_count(to), (unsigned long) iocb->ki_pos);
@@ -705,6 +713,12 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 
 	trace_nfs_file_write(iocb, from);
 
+	/*
+	 * FMODE_NOWAIT is not set for writable files
+	 */
+	if (WARN_ON_ONCE(iocb->ki_flags & IOCB_NOWAIT))
+		return -EAGAIN;
+
 	result = nfs_key_timeout_notify(file, inode);
 	if (result)
 		return result;
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 1/2] nfs: add nowait version of nfs_start_io_direct
From: Dylan Yudaken @ 2026-06-07  7:31 UTC (permalink / raw)
  To: trondmy, anna, linux-nfs; +Cc: axboe, io-uring, linux-kernel, Dylan Yudaken
In-Reply-To: <20260607073155.105314-1-dyudaken@gmail.com>

nfs_start_io_direct might block on existing operations to the same
inode. In order to support NOWAIT O_DIRECT reads, add a non-blocking
version of this nfs_start_io_direct that just returns -EAGAIN if locks
could not be taken.

Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
---
 fs/nfs/internal.h |  1 +
 fs/nfs/io.c       | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 18d46b0e71dd..0c9aca624353 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -532,6 +532,7 @@ extern void nfs_end_io_read(struct inode *inode);
 extern  __must_check int nfs_start_io_write(struct inode *inode);
 extern void nfs_end_io_write(struct inode *inode);
 extern __must_check int nfs_start_io_direct(struct inode *inode);
+extern __must_check int nfs_start_io_direct_nowait(struct inode *inode);
 extern void nfs_end_io_direct(struct inode *inode);
 
 static inline bool nfs_file_io_is_buffered(struct nfs_inode *nfsi)
diff --git a/fs/nfs/io.c b/fs/nfs/io.c
index 8337f0ae852d..2faf2003faf6 100644
--- a/fs/nfs/io.c
+++ b/fs/nfs/io.c
@@ -109,6 +109,16 @@ static void nfs_block_buffered(struct nfs_inode *nfsi, struct inode *inode)
 	}
 }
 
+static int nfs_block_buffered_nowait(struct nfs_inode *nfsi, struct inode *inode)
+{
+	if (!test_bit(NFS_INO_ODIRECT, &nfsi->flags)) {
+		if (inode->i_mapping->nrpages != 0)
+			return 1;
+		set_bit(NFS_INO_ODIRECT, &nfsi->flags);
+	}
+	return 0;
+}
+
 /**
  * nfs_start_io_direct - declare the file is being used for direct i/o
  * @inode: file inode
@@ -149,6 +159,37 @@ nfs_start_io_direct(struct inode *inode)
 	return 0;
 }
 
+/**
+ * nfs_start_io_direct_nowait - non-blocking variant of nfs_start_io_direct()
+ * @inode: file inode
+ *
+ * Try to declare that a direct I/O operation is about to start without
+ * blocking.
+ * Ensure all buffered I/O is blocked.
+ * If this could not be done without blocking then returns -EAGAIN.
+ */
+int
+nfs_start_io_direct_nowait(struct inode *inode)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	if (!down_read_trylock(&inode->i_rwsem))
+		return -EAGAIN;
+	if (test_bit(NFS_INO_ODIRECT, &nfsi->flags))
+		return 0;
+	up_read(&inode->i_rwsem);
+
+	/* Slow path: try to flip NFS_INO_ODIRECT without blocking. */
+	if (!down_write_trylock(&inode->i_rwsem))
+		return -EAGAIN;
+	if (nfs_block_buffered_nowait(nfsi, inode)) {
+		up_write(&inode->i_rwsem);
+		return -EAGAIN;
+	}
+	downgrade_write(&inode->i_rwsem);
+	return 0;
+}
+
 /**
  * nfs_end_io_direct - declare that the direct i/o operation is done
  * @inode: file inode
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 0/2] nfs: support FMODE_NOWAIT on O_DIRECT reads
From: Dylan Yudaken @ 2026-06-07  7:31 UTC (permalink / raw)
  To: trondmy, anna, linux-nfs; +Cc: axboe, io-uring, linux-kernel, Dylan Yudaken

I had noticed that io_uring always punts O_DIRECT NFS reads to a background thread
since the file does not advertise FMODE_NOWAIT.

I am not very familiar with the NFS codebase, but looking around suggests a simple change
to nfs_start_io_direct is all that is required to properly support this functionality.
On the request issue side, it seems everything in NFS is actually run in the background
(post this lock change), and the completion codepaths all look to have no similar locking
semantics.

I have restricted this to read-only files initially, as the code paths are simpler.

I unfortunately do not have the means to test the performance improvement, since even
without this change my local network is the bottleneck here.
However I do suspect that there are people that would want this fix ([1]).
Applying a similar patch on that GitHub issue did give performance gains.

To convince myself this works at all I traced io_uring events with and
without the patch.
Without this patch, a test app ([2]) issuing O_DIRECT io_uring reads goes through
io_uring_queue_async_work; with it, that call is skipped and the completion is queued into
io_uring directly from nfs_direct_read_completion.

Patch 1 here adds an unused nfs_start_io_direct_nowait which patch 2 uses in order to safely
advertise FMODE_NOWAIT.

v2: Suggestions from Sashiko:
* Handle file flags changing
* Do not use mapping_empty anymore as it was apparently racy

[1]: https://github.com/axboe/liburing/issues/1499
[2]: https://github.com/DylanZA/liburing/commit/264c06f1939dfd6b6bc4c967ada5960c4f4f2db3

Dylan Yudaken (2):
  nfs: add nowait version of nfs_start_io_direct
  nfs: expose FMODE_NOWAIT for read-only files

 fs/nfs/direct.c   | 12 ++++++++++--
 fs/nfs/file.c     | 16 +++++++++++++++-
 fs/nfs/internal.h |  1 +
 fs/nfs/io.c       | 41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 3 deletions(-)

base-commit: a2be31abc3fac6a20f662f6118815b9c40c371c9
-- 
2.50.1

^ permalink raw reply

* Re: [PATCH] iouring: Fix min_timeout behaviour
From: Jens Axboe @ 2026-06-06 21:55 UTC (permalink / raw)
  To: Christian A. Ehrhardt; +Cc: Tip ten Brink, io-uring, linux-kernel
In-Reply-To: <20260606201120.1441447-1-lk@c--e.de>

On 6/6/26 2:11 PM, Christian A. Ehrhardt wrote:
> The wakeup condition if a min timeout is present and has
> expired is that at least _one_ CQE was posted. Thus set
> the cq_tail target to ->cq_min_tail + 1. Without this
> commit a spurious wakeup can result in a premature wakeup
> because io_should_wake() will return true even if _no_ CQE
> was posted at all.
> 
> Tested by running the liburing testsuite with no regressions.
> 
> Additionally, tested by turning all calls to schedule() in
> io_uring/wait.c into calls to schedule_timeout(1) to force
> the spurious wakeups. With these spurious wakeups the
> min-timeout.t test fails before and passes after this commit.

Either this or the test case is broken, with or without the change
you sent for the test case. I'll take a look, but it's definitely
not passing as-is.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH] iouring: Fix min_timeout behaviour
From: Christian A. Ehrhardt @ 2026-06-06 20:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christian A. Ehrhardt, Tip ten Brink, io-uring, linux-kernel

The wakeup condition if a min timeout is present and has
expired is that at least _one_ CQE was posted. Thus set
the cq_tail target to ->cq_min_tail + 1. Without this
commit a spurious wakeup can result in a premature wakeup
because io_should_wake() will return true even if _no_ CQE
was posted at all.

Tested by running the liburing testsuite with no regressions.

Additionally, tested by turning all calls to schedule() in
io_uring/wait.c into calls to schedule_timeout(1) to force
the spurious wakeups. With these spurious wakeups the
min-timeout.t test fails before and passes after this commit.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tip ten Brink <tip@tenbrinkmeijs.com>
Fixes: e15cb2200b93 ("io_uring: fix min_wait wakeups for SQPOLL")
Cc: stable@vger.kernel.org
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
---
 io_uring/wait.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/wait.c b/io_uring/wait.c
index ec01e78a216d..d005ea17b35f 100644
--- a/io_uring/wait.c
+++ b/io_uring/wait.c
@@ -103,7 +103,7 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 	}

 	/* any generated CQE posted past this time should wake us up */
-	iowq->cq_tail = iowq->cq_min_tail;
+	iowq->cq_tail = iowq->cq_min_tail + 1;

 	hrtimer_update_function(&iowq->t, io_cqring_timer_wakeup);
 	hrtimer_set_expires(timer, iowq->timeout);
-- 
2.43.0

^ permalink raw reply related

* Re: [GIT PULL] io_uring fix for 7.1-rc7
From: pr-tracker-bot @ 2026-06-05 22:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring
In-Reply-To: <956f675b-1106-4e26-86ec-8592bafd99ad@kernel.dk>

The pull request you sent on Fri, 5 Jun 2026 13:37:03 -0600:

> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/io_uring-7.1-20260605

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c10130c234c81f4a7a143edbf413080235f8d8ce

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* [GIT PULL] io_uring fix for 7.1-rc7
From: Jens Axboe @ 2026-06-05 19:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

Hi Linus,

Just a single fix for a missing flag mask when multishot is used with
an incrementally consumed buffer ring, potentially leading to
application confusion because of lack of IORING_CQE_F_BUF_MORE
consistency.

Please pull!

The following changes since commit a88c02915d9c6160cfc7ab1b26ed64b2993e2b94:

  io_uring/tctx: set ->io_uring before publishing the tctx node (2026-05-24 12:01:15 -0600)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/io_uring-7.1-20260605

for you to fetch changes up to ed46f39c47eb5530a9c161481a2080d3a869cfaf:

  io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries (2026-06-05 05:20:25 -0600)

----------------------------------------------------------------
io_uring-7.1-20260605

----------------------------------------------------------------
Clément Léger (1):
      io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries

 io_uring/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
Jens Axboe


^ permalink raw reply

* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
From: Brian Foster @ 2026-06-05 15:55 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: hch, djwong, Eric Hagberg, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable
In-Reply-To: <CAFN_u7FrgM4Dzie2jjkLwWV8P0dvUG_Wwy3Q9B3-2HnnWiDu8w@mail.gmail.com>

On Thu, Jun 04, 2026 at 02:46:33PM -0400, Gregg Leventhal wrote:
> Hi all,
> 
> We're seeing silent data corruption -- a chunk of a buffered write being
> silently repeated at a later offset -- when using io_uring async buffered
> writes with O_APPEND on XFS. It reproduces on the longterm stable trees
> 6.1.y and 6.12.y under memory pressure, and is fixed in 6.18.y.
> 
...
> What fixes it
> -------------
> We did not bisect. We identified Brian Foster's "iomap: incremental
> per-operation iter advance" series as the likely relevant change,
> backported it to the affected kernel, and confirmed it makes the
> reproducer pass. The series was merged for v6.15:
> 
>   [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.18.y&id=30f530096166202cf70e1b7d1de5a8cdfba42af1
> 
> It reworks iomap_write_iter() to advance iter->pos/iter->len incrementally
> (iomap_iter_advance) and removes the iov_iter_revert/-EAGAIN handling, so
> retries resume from the correct offset. The buffered-write change is in
> "iomap: advance the iter directly on buffered writes" (d9dc477ff6a2), but
> it depends on the earlier infrastructure patches in the same series.
> 

Note: the correct hash for the buffered write change is 1a1a3b574b97
("iomap: advance the iter directly on buffered writes"). The hash
referenced above is the commit for the read path.

Thanks for the legwork here. I'm at least glad to see I accidentally
fixed something vs. breaking it for a change. ;) I am slightly wondering
if the fundamental issue here is splitting the append write into two
partial requests and whether that's racy wrt EOF, not necessarily the
pos tracking added by this patch.

For example, if we assume the same sort of append+nowait -> partial
write -> append retry loop via io_uring on the current code, but then
inject some other append write (or i_size change) between the two split
writes, wouldn't that result in a similar problem? I don't think we'd
rewrite the original data again, but maybe data ends up interleaved or
something.

But anyways looking back at that commit, I think the relevant behavior
change is that ki_pos update is made consistent with whatever partial
completion iomap_write_iter() may have performed. More specifically, the
older code doesn't update iter->pos until after a successful iomap
iteration (via iomap_iter()). This means that if we loop within
iomap_write_iter() one or more times before hitting an -EAGAIN, the
local pos update is lost and doesn't reflect back into the iomap_iter or
thus the assignment to iocb->ki_pos in the caller
(iomap_file_buffered_write()). Therefore, we revert the iov iter so it
is consistent with ki_pos at the start of the current iter.  However we
have an append write in this case and i_size is updated within
iomap_write_iter(), so the write retry will overwrite ki_pos to the
new/updated EOF IIUC (via generic_write_checks_count(), called from the
fs before calling into iomap) and result in the observed behavior of
rewriting some amount of data to a new file offset.

The updated code bumps iter->pos incrementally within
iomap_write_iter(). We don't need to revert the iov_iter anymore because
incremental progress will be reflected back to iocb->ki_pos via
iter->pos. However this also happens to be consistent with incremental
i_size updates within iomap_write_iter(), so (as long as we don't race,
I think) the retry should be consistent and pick up where the partial
write left off.
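
As a rough illustration of the difference, here is a toy user-space
model. It is a sketch under stated assumptions, not the iomap code:
toy_file, toy_append() and toy_append_with_retry() are hypothetical
names, the "file" is a small byte array, and the only thing modelled is
whether the partial progress of the first attempt is visible to the
retry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FILE_MAX 64

/* Hypothetical in-memory file: contents plus current size (EOF). */
struct toy_file {
	char data[TOY_FILE_MAX];
	size_t i_size;
};

/* Append n bytes at the current EOF, as an O_APPEND write would. */
void toy_append(struct toy_file *f, const char *src, size_t n)
{
	assert(f->i_size + n <= TOY_FILE_MAX);
	memcpy(f->data + f->i_size, src, n);
	f->i_size += n;
}

/*
 * One append of len bytes whose first attempt stops after `partial` bytes
 * and is then retried. If progress_reported is 0, the partial progress is
 * lost and the retry restarts from buffer offset 0 at the new EOF. If it
 * is 1, the retry resumes from where the first attempt stopped.
 * Returns the final file size.
 */
size_t toy_append_with_retry(struct toy_file *f, const char *buf, size_t len,
			     size_t partial, int progress_reported)
{
	size_t done;

	assert(partial <= len);
	toy_append(f, buf, partial);		/* first attempt: EOF moves */
	done = progress_reported ? partial : 0;
	toy_append(f, buf + done, len - done);	/* retry lands at the new EOF */
	return f->i_size;
}
```

Appending "ABCDEFGH" with a 4-byte partial first attempt gives
"ABCDABCDEFGH" (12 bytes) when the progress is lost and "ABCDEFGH"
(8 bytes) when it is carried over, which is the shape of the duplication
reported in this thread.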

One thing I might try here is to see if just deferring append writes to
!NOWAIT context avoids this problem because I wonder how sane that sort
of retry situation really is. I'm not quite sure what the expectations
are in that sort of case. Is that something that's easy to test? Of
course that wouldn't prevent this issue if other applications have this
same write pattern.

Another potential option for a stable-only fix might be to tweak the iomap
code to not update i_size for append (&& nowait?) writes until an
iter->pos update is imminent, but I think we'd need to be careful there
due to the pagecache_isize_extended() call. I think that's more for
non-append cases, but I'd have to take a closer look. Maybe we could
also replace that iov_iter revert with a hardcoded advance of the iomap
iter and emulate the same behavior as newer kernels. That seems
cleanest actually, but again needs some thought and testing...

Brian

> Detection in the reproducer (both silent)
> -----------------------------------------
>   1) final file size > sum of CQE byte counts the kernel reported.
>   2) the file is filled with a u64 "byte offset / 8" pattern, so on
>      readback element j must equal j; the first mismatch marks the start
>      of the duplicated copy (observed to be page-aligned).
> 
> Reproducer
> ----------
> Build: gcc -O2 -o repro_uring_dup repro_uring_dup.c -luring
> Run:   ./repro_uring_dup /path/on/xfs/repro [seconds] [file_target_mb]
> Needs the system under memory pressure to trigger; under those conditions
> it reproduces reliably. Source attached (repro_uring_dup.c).
> 
> Notes on stable
> ---------------
> The fix is a refactor with no Fixes: tag, and the buffered-write commit
> builds on the preceding patches in the series, so a single-commit
> cherry-pick into 6.1.y / 6.12.y doesn't look feasible. We're wondering
> whether a smaller, targeted fix would be more backportable for the active
> LTS trees -- e.g. ensuring the -EAGAIN retry path keeps the append
> position consistent with the reverted iov_iter so the already-committed
> range isn't re-appended -- but we'd defer to your judgment on whether that
> is sound or whether backporting the series as a unit is the better path.
> Given this is silent data corruption present since io_uring async buffered
> write support (~v6.0), we'd appreciate guidance on the right approach.
> 
> Happy to test patches and provide any additional detail.
> 
> Regards,
> Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg
> <ehagberg@janestreet.com>
> 
> References:
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.18.y&id=30f530096166202cf70e1b7d1de5a8cdfba42af1

> /*
>  * repro_uring_dup.c
>  *
>  * Reproducer for io_uring async buffered-write duplication on XFS.
>  * Issues large, variable-size, non-page-aligned buffered writev's appended
>  * to a file via io_uring with offset -1 ("use current position").
>  *
>  * Bug: when the inline IOCB_NOWAIT attempt does a partial-page short write
>  * (landing on a page boundary) and the page-aligned remainder is reissued on
>  * io-wq, a page-aligned, page-multiple sub-range of the remainder is written
>  * TWICE, while the CQE still reports the full requested byte count. Result:
>  * the file is larger than the bytes we were told succeeded, with a page-aligned
>  * duplicated chunk.
>  *
>  * Detection (both silent - no error is ever returned):
>  *   1) final file size > total bytes the kernel told us it wrote.
>  *   2) file is filled with a u64 "byte offset / 8" pattern, so on readback
>  *      element j must equal j; the first j where it doesn't is the start of the
>  *      duplicated copy (expected to be page-aligned).
>  *
>  * Build:  gcc -O2 -o repro_uring_dup repro_uring_dup.c -luring
>  * Run:    ./repro_uring_dup /path/on/xfs/repro [seconds] [file_target_mb]
>  */
> #define _GNU_SOURCE
> #include <liburing.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <time.h>
> #include <unistd.h>
> #include <errno.h>
> #include <sys/stat.h>
> #include <sys/uio.h>
> 
> #define QD 8
> #define MB (1024UL * 1024UL)
> #define MAXCHUNK (24UL * MB)
> #define MINCHUNK (1UL * MB)
> 
> /* 1: O_APPEND / offset -1 variant (corrupts).
>  * 0: no O_APPEND, explicit offset variant (does not corrupt). */
> static int use_append = 1;
> 
> static uint64_t now_ns(void) {
>   struct timespec ts;
>   clock_gettime(CLOCK_MONOTONIC, &ts);
>   return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
> }
> 
> /* Fill buf so the u64 at global byte offset (base+8*i) holds (base+8*i)/8. */
> static void fill_pattern(uint64_t *buf, uint64_t base_bytes, size_t len) {
>   uint64_t start_idx = base_bytes / 8;
>   size_t n = len / 8;
>   for (size_t i = 0; i < n; i++)
>     buf[i] = start_idx + i;
> }
> 
> /* One writev; loops over (legitimately) short *returned* results. */
> static void write_all(struct io_uring *ring, int fd, uint8_t *buf, size_t len,
>                       uint64_t expected) {
>   size_t done = 0;
>   while (done < len) {
>     struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
>     struct iovec iov = {.iov_base = buf + done, .iov_len = len - done};
>     long long off = use_append ? -1LL : (long long)(expected + done);
>     io_uring_prep_writev(sqe, fd, &iov, 1, (unsigned long long)off);
> 
>     int ret = io_uring_submit(ring);
>     if (ret < 0) {
>       fprintf(stderr, "submit: %s\n", strerror(-ret));
>       exit(1);
>     }
> 
>     struct io_uring_cqe *cqe;
>     ret = io_uring_wait_cqe(ring, &cqe);
>     if (ret < 0) {
>       fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
>       exit(1);
>     }
>     int res = cqe->res;
>     io_uring_cqe_seen(ring, cqe);
> 
>     if (res < 0) {
>       fprintf(stderr, "write: %s\n", strerror(-res));
>       exit(1);
>     }
>     if (res == 0) {
>       fprintf(stderr, "write returned 0\n");
>       exit(1);
>     }
>     done += (size_t)res;
>   }
> }
> 
> int main(int argc, char **argv) {
>   if (argc < 2) {
>     fprintf(stderr, "usage: %s <path-prefix-on-xfs> [seconds] [file_target_mb]\n",
>             argv[0]);
>     return 2;
>   }
>   const char *prefix = argv[1];
>   int seconds = (argc > 2) ? atoi(argv[2]) : 60;
>   uint64_t file_target = ((argc > 3) ? (uint64_t)atoll(argv[3]) : 48) * MB;
> 
>   srand((unsigned)(time(NULL) ^ getpid()));
> 
>   struct io_uring ring;
>   if (io_uring_queue_init(QD, &ring, 0)) {
>     perror("io_uring_queue_init");
>     return 1;
>   }
> 
>   uint8_t *buf = aligned_alloc(4096, MAXCHUNK);
>   if (!buf) {
>     perror("aligned_alloc");
>     return 1;
>   }
> 
>   static uint64_t rbuf[1 << 16];
>   uint64_t deadline = now_ns() + (uint64_t)seconds * 1000000000ULL;
>   long files = 0;
> 
>   while (now_ns() < deadline) {
>     char fn[8192];
>     snprintf(fn, sizeof fn, "%s.%ld", prefix, files);
>     int flags = O_WRONLY | O_CREAT | O_TRUNC | (use_append ? O_APPEND : 0);
>     int fd = open(fn, flags, 0644);
>     if (fd < 0) {
>       perror("open");
>       return 1;
>     }
> 
>     uint64_t expected = 0;
>     while (expected < file_target) {
>       size_t want = MINCHUNK + ((size_t)rand() % (MAXCHUNK - MINCHUNK));
>       want &= ~((size_t)7); /* 8-align; deliberately NOT page-aligned */
>       fill_pattern((uint64_t *)buf, expected, want);
>       write_all(&ring, fd, buf, want, expected);
>       expected += want; /* CQE reported full success */
>     }
>     close(fd);
> 
>     /* ---- verify ---- */
>     struct stat st;
>     if (stat(fn, &st)) {
>       perror("stat");
>       return 1;
>     }
> 
>     long long first_bad = -1;
>     uint64_t bad_val = 0;
>     int rfd = open(fn, O_RDONLY);
>     if (rfd < 0) {
>       perror("open ro");
>       return 1;
>     }
>     uint64_t idx = 0;
>     ssize_t r;
>     while ((r = read(rfd, rbuf, sizeof rbuf)) > 0) {
>       size_t cnt = (size_t)r / 8;
>       for (size_t i = 0; i < cnt; i++) {
>         if (rbuf[i] != idx) {
>           first_bad = (long long)(idx * 8);
>           bad_val = rbuf[i];
>           break;
>         }
>         idx++;
>       }
>       if (first_bad >= 0)
>         break;
>     }
>     close(rfd);
> 
>     int bug = ((uint64_t)st.st_size != expected) || (first_bad >= 0);
>     files++;
> 
>     if (bug) {
>       printf("\n*** CORRUPTION DETECTED in %s ***\n", fn);
>       printf("  bytes kernel said it wrote (sum of CQE results): %llu\n",
>              (unsigned long long)expected);
>       printf("  actual file size:                                %llu\n",
>              (unsigned long long)st.st_size);
>       printf("  extra (duplicated) bytes:                        %lld\n",
>              (long long)st.st_size - (long long)expected);
>       if (first_bad >= 0) {
>         printf("  first mismatching offset: %lld (0x%llx)  page_aligned=%s\n", first_bad,
>                (unsigned long long)first_bad, (first_bad % 4096 == 0) ? "YES" : "no");
>         printf("    expected u64 %llu but found %llu "
>                "(content from byte offset %llu reappeared here)\n",
>                (unsigned long long)(first_bad / 8), (unsigned long long)bad_val,
>                (unsigned long long)(bad_val * 8));
>       }
>       printf("  (file kept for inspection)\n");
>       io_uring_queue_exit(&ring);
>       return 0;
>     }
>     unlink(fn);
>     if (files % 20 == 0)
>       fprintf(stderr, "...%ld files clean\n", files);
>   }
> 
>   printf("No corruption in %d s (%ld files). Try more time, parallel instances, "
>          "or memory pressure.\n",
>          seconds, files);
>   io_uring_queue_exit(&ring);
>   free(buf);
>   return 0;
> }


^ permalink raw reply

* Re: [PATCH net-next v3 0/4] net: move .getsockopt away from __user buffers (update 1)
From: David Laight @ 2026-06-05 15:14 UTC (permalink / raw)
  To: Breno Leitao
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev, io-uring, bpf, netdev, Linus Torvalds,
	linux-kernel, kernel-team
In-Reply-To: <aiK94g9vphHls3x_@gmail.com>

On Fri, 5 Jun 2026 05:25:21 -0700
Breno Leitao <leitao@debian.org> wrote:

> On Wed, Apr 08, 2026 at 03:30:28AM -0700, Breno Leitao wrote:
> > Currently, the .getsockopt callback requires __user pointers:
> > 
> >   int (*getsockopt)(struct socket *sock, int level,
> >                     int optname, char __user *optval, int __user *optlen);
> > 
> > This prevents kernel callers (io_uring, BPF) from using getsockopt on
> > levels other than SOL_SOCKET, since they pass kernel pointers.
> > 
> > Following Linus' suggestion [0], this series introduces sockopt_t, a
> > type-safe wrapper around iov_iter, 

I'd have thought it would also have been better to use a wrapper function
instead of direct calls to copy_from_iter().
There is no need for most of the code to know there is an iov_iter hiding
inside sockopt_t.

-- David

> > and a getsockopt_iter callback that
> > works with both user and kernel buffers. AF_PACKET and CAN raw are
> > converted as initial users, with selftests covering the trickiest
> > conversion patterns.  
> 
> Quick update on this effort.
> 
> All proto_ops users have been converted to getsockopt_iter and submitted.
> 
> Most conversions are already in linux-next. Three remain:
> 
> 1) rds: Under review
>    https://lore.kernel.org/all/20260605-getsock_more-v2-3-80f38cdb8706@debian.org/
> 
> 2) smc: Submitted today. This is limited to UBUF for now
>    https://lore.kernel.org/all/20260605-getsockopt_smc-v1-1-65da62fa44c4@debian.org/
> 
> 3) CAN drivers: Reviewed and acked, pending Marc's merge
>    https://lore.kernel.org/all/f83e25e1-b9f5-4810-bbd6-fdb8d2a10c8e@hartkopp.net/
> 
> Once these are merged, I'll rename getsockopt_iter to getsockopt and
> remove the legacy path.
> 
> Next, I'll convert struct proto the same way to eliminate the remaining
> userspace optlen/optval pointers.
> 
> After that, io_uring getsockopt operations will be unblocked.
> 


^ permalink raw reply

* Re: [PATCH net-next v3 0/4] net: move .getsockopt away from __user buffers (update 1)
From: Breno Leitao @ 2026-06-05 12:25 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev
  Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team
In-Reply-To: <20260408-getsockopt-v3-0-061bb9cb355d@debian.org>

On Wed, Apr 08, 2026 at 03:30:28AM -0700, Breno Leitao wrote:
> Currently, the .getsockopt callback requires __user pointers:
> 
>   int (*getsockopt)(struct socket *sock, int level,
>                     int optname, char __user *optval, int __user *optlen);
> 
> This prevents kernel callers (io_uring, BPF) from using getsockopt on
> levels other than SOL_SOCKET, since they pass kernel pointers.
> 
> Following Linus' suggestion [0], this series introduces sockopt_t, a
> type-safe wrapper around iov_iter, and a getsockopt_iter callback that
> works with both user and kernel buffers. AF_PACKET and CAN raw are
> converted as initial users, with selftests covering the trickiest
> conversion patterns.

Quick update on this effort.

All proto_ops users have been converted to getsockopt_iter and submitted.

Most conversions are already in linux-next. Three remain:

1) rds: Under review
   https://lore.kernel.org/all/20260605-getsock_more-v2-3-80f38cdb8706@debian.org/

2) smc: Submitted today. This is limited to UBUF for now
   https://lore.kernel.org/all/20260605-getsockopt_smc-v1-1-65da62fa44c4@debian.org/

3) CAN drivers: Reviewed and acked, pending Marc's merge
   https://lore.kernel.org/all/f83e25e1-b9f5-4810-bbd6-fdb8d2a10c8e@hartkopp.net/

Once these are merged, I'll rename getsockopt_iter to getsockopt and
remove the legacy path.

Next, I'll convert struct proto the same way to eliminate the remaining
userspace optlen/optval pointers.

After that, io_uring getsockopt operations will be unblocked.

^ permalink raw reply

* Re: [PATCH] io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
From: Jens Axboe @ 2026-06-05 11:21 UTC (permalink / raw)
  To: io-uring, Clément Léger
In-Reply-To: <20260604160715.2482972-1-cleger@meta.com>


On Thu, 04 Jun 2026 09:07:13 -0700, Clément Léger wrote:
> When a bundle recv retries inside io_recv_finish(), the merge logic
> ORs the saved cflags from the previous iteration with the cflags
> returned by the new iteration:
>   cflags = req->cqe.flags | (cflags & CQE_F_MASK);
> 
> Bits listed in CQE_F_MASK are inherited from the new iteration, and
> all other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come
> from the saved cflags. Before this change CQE_F_MASK covered only
> IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE.
> 
> [...]
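
The merge expression quoted above can be tried out in userspace with the
sketch below. The DEMO_* constants are local stand-ins that mirror the uapi
CQE flag bit layout as I understand it (buffer ID in the upper 16 bits), and
DEMO_MASK_NEW assumes the fix adds IORING_CQE_F_BUF_MORE to CQE_F_MASK, as
the subject line suggests; the elided part of the commit message is not
reproduced. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring the uapi CQE flag layout (assumed values). */
#define DEMO_CQE_F_BUFFER		(1U << 0)
#define DEMO_CQE_F_MORE			(1U << 1)
#define DEMO_CQE_F_SOCK_NONEMPTY	(1U << 2)
#define DEMO_CQE_F_BUF_MORE		(1U << 4)
#define DEMO_CQE_BUFFER_SHIFT		16

/* Mask as described for the code before the change. */
#define DEMO_MASK_OLD	(DEMO_CQE_F_SOCK_NONEMPTY | DEMO_CQE_F_MORE)
/* Assumed mask after the change: BUF_MORE is inherited as well. */
#define DEMO_MASK_NEW	(DEMO_MASK_OLD | DEMO_CQE_F_BUF_MORE)

/* Build cflags for a selected buffer ID plus any extra flag bits. */
static uint32_t demo_buf_cflags(uint32_t bid, uint32_t extra)
{
	assert(bid <= 0xffff);
	return (bid << DEMO_CQE_BUFFER_SHIFT) | DEMO_CQE_F_BUFFER | extra;
}

/*
 * Same shape as: cflags = req->cqe.flags | (cflags & CQE_F_MASK);
 * Only bits in 'mask' are taken from the new iteration; everything
 * else, including the buffer ID, comes from the saved flags.
 */
static uint32_t demo_merge_cflags(uint32_t saved, uint32_t fresh,
				  uint32_t mask)
{
	return saved | (fresh & mask);
}
```

For example, merging saved flags for buffer ID 3 with new flags for buffer
ID 7 that carry BUF_MORE keeps buffer ID 3 under both masks, but the
BUF_MORE bit only survives the merge with DEMO_MASK_NEW.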

Applied, thanks!

[1/1] io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
      commit: ed46f39c47eb5530a9c161481a2080d3a869cfaf

Best regards,
-- 
Jens Axboe




^ permalink raw reply
