Linux io-uring development


* [syzbot] [io-uring?] INFO: task hung in io_sq_thread_park (4)
From: syzbot @ 2026-05-26  2:49 UTC (permalink / raw)
  To: axboe, io-uring, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    45255ea1ca09 Merge tag 'pm-7.1-rc5' of git://git.kernel.or..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=12030d36580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8d24a1331e060dda
dashboard link: https://syzkaller.appspot.com/bug?extid=4be91bcb08eab9a156da
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17c2db96580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/55e9065ee7f2/disk-45255ea1.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/f53a442e25dd/vmlinux-45255ea1.xz
kernel image: https://storage.googleapis.com/syzbot-assets/ab16a4623640/bzImage-45255ea1.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+4be91bcb08eab9a156da@syzkaller.appspotmail.com

INFO: task kworker/u8:2:36 blocked for more than 143 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:2    state:D stack:22696 pid:36    tgid:36    ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: iou_exit io_ring_exit_work
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7340
 __mutex_lock_common kernel/locking/mutex.c:726 [inline]
 __mutex_lock+0x7f7/0x1550 kernel/locking/mutex.c:820
 io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
 io_ring_exit_work+0x2dd/0x980 io_uring/io_uring.c:2359
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
INFO: task kworker/u8:5:139 blocked for more than 145 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:5    state:D stack:24120 pid:139   tgid:139   ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: iou_exit io_ring_exit_work
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7340
 __mutex_lock_common kernel/locking/mutex.c:726 [inline]
 __mutex_lock+0x7f7/0x1550 kernel/locking/mutex.c:820
 io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
 io_ring_exit_work+0x2dd/0x980 io_uring/io_uring.c:2359
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
INFO: task kworker/u8:9:5810 blocked for more than 146 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:9    state:D stack:24248 pid:5810  tgid:5810  ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: iou_exit io_ring_exit_work
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7340
 __mutex_lock_common kernel/locking/mutex.c:726 [inline]
 __mutex_lock+0x7f7/0x1550 kernel/locking/mutex.c:820
 io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
 io_ring_exit_work+0x2dd/0x980 io_uring/io_uring.c:2359
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Showing all locks held in the system:
3 locks held by kworker/0:0/9:
 #0: ffff88813fe43140 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88813fe43140 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900000e7c40 (rx_mode_work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900000e7c40 (rx_mode_work){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffffffff8fdd1400 (rtnl_mutex){+.+.}-{4:4}, at: netdev_rx_mode_work+0x19/0x3c0 net/core/dev_addr_lists.c:1312
3 locks held by kworker/u8:1/13:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000127c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000127c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88802856dc68 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
1 lock held by khungtaskd/31:
 #0: ffffffff8e95cca0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 #0: ffffffff8e95cca0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #0: ffffffff8e95cca0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x2e/0x180 kernel/locking/lockdep.c:6775
3 locks held by kworker/u8:2/36:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000ac7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000ac7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888024d77468 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:3/47:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000b77c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000b77c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888033183068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u9:0/50:
 #0: ffff888060790940 ((wq_completion)hci11){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff888060790940 ((wq_completion)hci11){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000ba7c40 ((work_completion)(&hdev->cmd_sync_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000ba7c40 ((work_completion)(&hdev->cmd_sync_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88807ff38ea0 (&hdev->req_lock){+.+.}-{4:4}, at: hci_cmd_sync_work+0x1d3/0x400 net/bluetooth/hci_sync.c:331
3 locks held by kworker/u8:4/58:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900015f7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900015f7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88802902ec68 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:5/139:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90002e17c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90002e17c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88807d43c068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:7/1145:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900053efc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900053efc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88807b88f068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:8/3333:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc9000e61fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc9000e61fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff8880578a1468 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
1 lock held by udevd/4987:
 #0: ffff8880b863aea0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x31/0x150 kernel/sched/core.c:652
2 locks held by getty/5374:
 #0: ffff8880362670a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x25/0x70 drivers/tty/tty_ldisc.c:243
 #1: ffffc9000322b2e8 (&ldata->atomic_read_lock){+.+.}-{4:4}, at: n_tty_read+0x45c/0x13a0 drivers/tty/n_tty.c:2211
3 locks held by kworker/u8:9/5810:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900038c7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900038c7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888075ffdc68 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
2 locks held by kworker/u8:10/5820:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900038e7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900038e7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
2 locks held by iou-sqp-6349/6354:
1 lock held by iou-sqp-7229/7232:
2 locks held by iou-sqp-7262/7266:
1 lock held by iou-sqp-7452/7455:
2 locks held by iou-sqp-7518/7521:
2 locks held by kworker/u8:11/7547:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90003f87c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90003f87c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
2 locks held by iou-sqp-7648/7649:
2 locks held by kworker/u8:12/7655:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90003b1fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90003b1fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
1 lock held by syz-executor/7715:
3 locks held by kworker/u8:13/7719:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc9000206fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc9000206fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888057817068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH] block: Add bvec_folio()
From: Christoph Hellwig @ 2026-05-26  6:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	io-uring, linux-mm, Leon Romanovsky
In-Reply-To: <ahROtyLcr567wM8l@casper.infradead.org>

On Mon, May 25, 2026 at 02:29:27PM +0100, Matthew Wilcox wrote:
> > So I'm not against the function per se, but the documentation must
> > explain the minefields it is stepping into a bit better.
> 
> Lower level drivers shouldn't be concerning themselves with folios.
> For a start, we can put non-folios (eg slab memory) into bvecs.

Well, that is a very good thing to put into the comment.  We can also
put them into high-level bvecs, so framing this as 'only use if you
know the memory is folios, which you can't unless you are the entity
who filled the bio' might be a good choice.



* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Jens Axboe @ 2026-05-26 15:58 UTC (permalink / raw)
  To: Christoph Hellwig, demiobenour
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jakub Kicinski, Simon Horman,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Jonathan Corbet,
	Shuah Khan, Eric Biggers, Ard Biesheuvel, linux-crypto,
	linux-kernel, io-uring, netdev, linux-perf-users, linux-doc,
	Toke Høiland-Jørgensen, linux-api
In-Reply-To: <ahQCZQNoyO8GQt3H@infradead.org>

On 5/25/26 2:03 AM, Christoph Hellwig wrote:
> On Sat, May 23, 2026 at 03:43:02PM -0400, Demi Marie Obenour via B4 Relay wrote:
>> From: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
>> It can be removed entirely at the cost of only supporting synchronous
>> operations.  This doesn't break userspace, which will silently block
>> (for a bounded amount of time) in io_submit instead of operating
>> asynchronously.
>>
>> This also makes struct msghdr smaller, helping every other caller of
>> sendmsg().
> 
> So we just had a discussion at LLC about how networking needs to support
> AIO better for zero copy.
> 
> The current TCP zerocopy implementation provides completion notification
> through the socket error code, which is freaking weird and doesn't
> integrate well with either io_uring or in-kernel callers.

We already have that via io_uring, and without needing msg_kiocb or the
(very) weird socket error code retrieving.

-- 
Jens Axboe


* Re: [PATCH 0/8] first zcrx updates for 7.2
From: Jens Axboe @ 2026-05-26 16:42 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov; +Cc: netdev
In-Reply-To: <cover.1779189667.git.asml.silence@gmail.com>


On Tue, 19 May 2026 12:44:26 +0100, Pavel Begunkov wrote:
> First batch of zcrx updates for 7.2. The main part is patches 5-8,
> which add notifications from zcrx to userspace via asynchronous
> CQE posting about events like allocation failures and copying, and
> statistics. It's accompanied by relevant query updates. Patches 1-4
> are general cleanups.
> 
> Bertie Tryner (1):
>   io_uring/zcrx: reorder fd allocation in zcrx_export()
> 
> [...]

Applied, thanks!

[1/8] io_uring/zcrx: make scrubbing more reliable
      commit: 74fc9a9b50d43ed473ea2449682000da43e17175
[2/8] io_uring/zcrx: poison pointers on unregistration
      commit: e57b44039bc54bbdf3d1511021458356858a4a12
[3/8] io_uring/zcrx: remove extra ifq close
      commit: 98f07b0f74b65284ebe0d021505b461d4be6bf07
[4/8] io_uring/zcrx: reorder fd allocation in zcrx_export()
      commit: 84f7d0931c42cb0690615a431738cf6913d265f2
[5/8] io_uring/zcrx: add ctx pointer to zcrx
      commit: 8503f2de11f7fe78a7fdb87746255c8d02897279
[6/8] io_uring/zcrx: notify user when out of buffers
      commit: 0719e10d826aa0ba4840917d0261986eaead9a51
[7/8] io_uring/zcrx: notify user on frag copy fallback
      commit: 255180f7034f48aa5b0c8df70228307394bddbb9
[8/8] io_uring/zcrx: add shared-memory notification statistics
      commit: 6935f631465f5f60205978a59228a26db4723d51

Best regards,
-- 
Jens Axboe





* [PATCH v3] io_uring: annotate remote tasks for kcoverage
From: Robert Femmer @ 2026-05-26 16:49 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, Dmitry Vyukov, Andrey Konovalov, kasan-dev,
	Robert Femmer
In-Reply-To: <CA+fCnZeE6-8NFXjguJJKc_=UuF-Puw8BdtiFcUhOd23y9pAKOw@mail.gmail.com>

Fuzzers use coverage information to guide generation of test cases
towards new or interesting code paths. Syzkaller, specifically, makes
use of kcoverage (CONFIG_KCOV). Coverage information is not collected for
kernel tasks unless annotated by kcov_remote_start and kcov_remote_stop.
This patch annotates io-uring's work queue and sqpoll tasks.

Depends-on: 20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com
Signed-off-by: Robert Femmer <robert@fmmr.tech>
---
 include/linux/io_uring_types.h | 2 ++
 io_uring/io-wq.c               | 4 ++++
 io_uring/io_uring.c            | 1 +
 io_uring/io_uring.h            | 2 ++
 io_uring/sqpoll.c              | 4 ++++
 5 files changed, 13 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 244392026c6d..b6590b2b350c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -504,6 +504,8 @@ struct io_ring_ctx {
 	struct io_mapped_region		ring_region;
 	/* used for optimised request parameter and wait argument passing  */
 	struct io_mapped_region		param_region;
+
+	struct kcov_common_handle_id	kcov_handle;
 };
 
 /*
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 8cc7b47d3089..9ade4c4f4983 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -639,6 +639,7 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 		/* handle a whole dependent link */
 		do {
 			struct io_wq_work *next_hashed, *linked;
+			struct io_kiocb *req;
 			unsigned int work_flags = atomic_read(&work->flags);
 			unsigned int hash = __io_wq_is_hashed(work_flags)
 				? __io_get_work_hash(work_flags)
@@ -649,7 +650,10 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 			if (do_kill &&
 			    (work_flags & IO_WQ_WORK_UNBOUND))
 				atomic_or(IO_WQ_WORK_CANCEL, &work->flags);
+			req = container_of(work, struct io_kiocb, work);
+			kcov_remote_start_common(req->ctx->kcov_handle);
 			io_wq_submit_work(work);
+			kcov_remote_stop();
 			io_assign_current_work(worker, NULL);
 
 			linked = io_wq_free_work(work);
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 103b6c88f252..89cb649944d9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -293,6 +293,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
 	io_napi_init(ctx);
 	mutex_init(&ctx->mmap_lock);
+	ctx->kcov_handle = kcov_common_handle();
 
 	return ctx;
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e612a66ee80e..7226fbbbf9f0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -7,6 +7,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/poll.h>
 #include <linux/io_uring_types.h>
+#include <linux/kcov.h>
 #include <uapi/linux/eventpoll.h>
 #include "alloc_cache.h"
 #include "io-wq.h"
@@ -581,4 +582,5 @@ static inline bool io_has_work(struct io_ring_ctx *ctx)
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
 	       io_local_work_pending(ctx);
 }
+
 #endif
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 46c12afec73e..c7b78ea98587 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -342,19 +342,23 @@ static int io_sq_thread(void *data)
 
 		cap_entries = !list_is_singular(&sqd->ctx_list);
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
+			kcov_remote_start_common(ctx->kcov_handle);
 			int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist);
 
 			if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
+			kcov_remote_stop();
 		}
 		if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
 			sqt_spin = true;
 
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
+			kcov_remote_start_common(ctx->kcov_handle);
 			if (io_napi(ctx)) {
 				io_sq_start_worktime(&ist);
 				io_napi_sqpoll_busy_poll(ctx);
 			}
+			kcov_remote_stop();
 		}
 
 		io_sq_update_worktime(sqd, &ist);
-- 
2.54.0
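
For readers unfamiliar with remote coverage, the bracketing pattern the patch applies can be modelled in userspace. This is only a sketch under stated assumptions: every `model_*` name below is hypothetical, the handle is reduced to a small integer, and the "no nesting" check is an assumption of the model rather than a statement about the kernel implementation. It shows one thing: coverage hit between start and stop is attributed to the ring's handle, and anything outside the bracket is dropped.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_HANDLES 8

/* Hypothetical model state: the handle the worker currently collects for. */
uint64_t model_active_handle;                 /* 0 means "not collecting" */
unsigned int model_hits[MODEL_MAX_HANDLES];   /* coverage hits per handle */

/* Stand-in for kcov_remote_start_common(): begin an annotated section. */
void model_remote_start(uint64_t handle)
{
	assert(handle != 0 && handle < MODEL_MAX_HANDLES);
	assert(model_active_handle == 0);     /* model assumption: no nesting */
	model_active_handle = handle;
}

/* Stand-in for kcov_remote_stop(): end the annotated section. */
void model_remote_stop(void)
{
	model_active_handle = 0;
}

/* Stand-in for a compiler-inserted coverage callback. */
void model_cover(void)
{
	if (model_active_handle)
		model_hits[model_active_handle]++;
}

/* Hypothetical ring context that carries its own coverage handle. */
struct model_ring {
	uint64_t kcov_handle;
};

/* A worker handling one request on behalf of a ring. */
void model_handle_work(struct model_ring *ring)
{
	model_cover();                        /* outside the bracket: dropped */
	model_remote_start(ring->kcov_handle);
	model_cover();                        /* attributed to the ring */
	model_cover();                        /* attributed to the ring */
	model_remote_stop();
	model_cover();                        /* outside the bracket: dropped */
}
```

In this model, each call to `model_handle_work()` adds exactly two hits to the ring's own counter, no matter which thread runs it. That is the property a fuzzer needs in order to tie worker-thread coverage back to the ring it created.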



* Re: [PATCH] block: Add bvec_folio()
From: Matthew Wilcox @ 2026-05-26 17:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <ahVBCtsodsM2FHis@infradead.org>

On Mon, May 25, 2026 at 11:43:22PM -0700, Christoph Hellwig wrote:
> On Mon, May 25, 2026 at 02:29:27PM +0100, Matthew Wilcox wrote:
> > > So I'm not against the function per se, but the documentation must
> > > explain the minefields it is stepping into a bit better.
> > 
> > Lower level drivers shouldn't be concerning themselves with folios.
> > For a start, we can put non-folios (eg slab memory) into bvecs.
> 
> Well, that is a very good thing to put into the comment.  We can also
> put them into high-level bvecs, so framing this as 'only use if you
> know the memory is folios, which you can't unless you are the entity
> who filled the bio' might be a good choice.

How about:

/**
 * bvec_folio - Return the first folio referenced by this bvec
 * @bv: bvec to access
 *
 * bvecs can contain non-folio memory, so this should only be called by
 * the creator of the bvec; drivers have no business looking at the owner
 * of the memory.  It may not even be the right interface for the caller
 * to use as bvecs can span multiple folios.  You may be better off using
 * something like bio_for_each_folio_all() which iterates over all folios.
 */
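
To make the "bvecs can span multiple folios" caveat concrete, here is a small userspace model. It is a sketch only: the `model_*` types and the flat byte-offset address space are hypothetical simplifications and not the kernel's data structures. It splits one contiguous range into per-folio pieces, which is roughly the view an all-folios iterator gives and which a "first folio" accessor cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a folio covers bytes [start, start + size). */
struct model_folio {
	size_t start;
	size_t size;
};

/* Hypothetical stand-in for a bvec: one contiguous byte range. */
struct model_bvec {
	size_t start;
	size_t len;
};

/* One per-folio piece of a bvec. */
struct model_chunk {
	size_t folio;   /* index into the folio array */
	size_t offset;  /* byte offset inside that folio */
	size_t len;     /* bytes of the bvec inside that folio */
};

/*
 * Split bv into per-folio chunks. folios[] must be sorted, gap-free and
 * must cover bv. Returns the number of chunks written, at most max_chunks.
 */
size_t model_split_by_folio(const struct model_folio *folios, size_t nr_folios,
			    struct model_bvec bv,
			    struct model_chunk *out, size_t max_chunks)
{
	size_t n = 0, pos = bv.start, left = bv.len;

	for (size_t i = 0; i < nr_folios && left && n < max_chunks; i++) {
		size_t end = folios[i].start + folios[i].size;
		size_t take;

		if (pos >= end)
			continue;               /* range starts after this folio */
		assert(pos >= folios[i].start); /* model assumes no gaps */

		take = end - pos;
		if (take > left)
			take = left;

		out[n].folio = i;
		out[n].offset = pos - folios[i].start;
		out[n].len = take;
		n++;

		pos += take;
		left -= take;
	}
	return n;
}
```

With folios of 4096, 16384 and 4096 bytes laid end to end, a 20480-byte range starting at byte 2048 splits into three chunks. Code that looked only at the first folio would account for just the first 2048 bytes.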


* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Jakub Kicinski @ 2026-05-27  1:40 UTC (permalink / raw)
  To: Demi Marie Obenour via B4 Relay
  Cc: demiobenour, Herbert Xu, David S. Miller, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, Jens Axboe,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel, linux-crypto, linux-kernel, io-uring, netdev,
	linux-perf-users, linux-doc
In-Reply-To: <20260523-af-alg-harden-v1-1-c76755c3a5c5@gmail.com>

On Sat, 23 May 2026 15:43:02 -0400 Demi Marie Obenour via B4 Relay
wrote:
> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
> It can be removed entirely at the cost of only supporting synchronous
> operations.  This doesn't break userspace, which will silently block
> (for a bounded amount of time) in io_submit instead of operating
> asynchronously.
> 
> This also makes struct msghdr smaller, helping every other caller of
> sendmsg().

Acked-by: Jakub Kicinski <kuba@kernel.org>


* Re: [PATCH] block: Add bvec_folio()
From: Christoph Hellwig @ 2026-05-27  6:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	io-uring, linux-mm, Leon Romanovsky
In-Reply-To: <ahXcsrxUFfzoVCOr@casper.infradead.org>

On Tue, May 26, 2026 at 06:47:30PM +0100, Matthew Wilcox wrote:
> How about:
> 
> /**
>  * bvec_folio - Return the first folio referenced by this bvec
>  * @bv: bvec to access
>  *
>  * bvecs can contain non-folio memory, so this should only be called by
>  * the creator of the bvec; drivers have no business looking at the owner
>  * of the memory.  It may not even be the right interface for the caller
>  * to use as bvecs can span multiple folios.  You may be better off using
>  * something like bio_for_each_folio_all() which iterates over all folios.
>  */

Sounds good, although I'd capitalize the first word in the sentence.
(Not that anyone should follow my spelling advice in general)



* [syzbot] [io-uring?] WARNING in io_wq_put_and_exit (2)
From: syzbot @ 2026-05-27  6:52 UTC (permalink / raw)
  To: axboe, io-uring, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    6a97c4d5262d Merge tag 'for-linus' of git://git.kernel.org..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17a41ea6580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e327ee9a867dd6b9
dashboard link: https://syzkaller.appspot.com/bug?extid=b0d54b9e81de55179e47
compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/5cc8b6debcbd/disk-6a97c4d5.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/649698f0231f/vmlinux-6a97c4d5.xz
kernel image: https://storage.googleapis.com/syzbot-assets/b297958f355b/bzImage-6a97c4d5.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+b0d54b9e81de55179e47@syzkaller.appspotmail.com

------------[ cut here ]------------
time_after(jiffies, warn_timeout)
WARNING: io_uring/io-wq.c:1370 at io_wq_exit_workers io_uring/io-wq.c:1370 [inline], CPU#1: syz.1.3756/22484
WARNING: io_uring/io-wq.c:1370 at io_wq_put_and_exit+0x36c/0x9d0 io_uring/io-wq.c:1399, CPU#1: syz.1.3756/22484
Modules linked in:
CPU: 1 UID: 0 PID: 22484 Comm: syz.1.3756 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
RIP: 0010:io_wq_exit_workers io_uring/io-wq.c:1370 [inline]
RIP: 0010:io_wq_put_and_exit+0x36c/0x9d0 io_uring/io-wq.c:1399
Code: 00 00 0f 85 7d 05 00 00 48 8b 15 9f 3c 4c 09 4d 89 e7 31 ff 49 29 d7 4c 89 fe e8 cf f6 13 fd 4d 85 ff 79 aa e8 e5 fb 13 fd 90 <0f> 0b 90 eb 9f e8 da fb 13 fd 4c 8d 63 08 48 b8 00 00 00 00 00 fc
RSP: 0018:ffffc900077af8c0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888033fdb000 RCX: ffffffff84f4b5f1
RDX: ffff888030d4ca00 RSI: ffffffff84f4b5fb RDI: ffff888030d4ca00
RBP: fffffbfff1c81e50 R08: 0000000000000007 R09: 0000000000000000
R10: ffffffffffffe7c3 R11: 0000000000000000 R12: 0000000100027def
R13: 0000000000001b58 R14: ffff888033fdb018 R15: ffffffffffffe7c3
FS:  00007ff97ac796c0(0000) GS:ffff88812446a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4b79c90b9b CR3: 0000000037235000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 io_uring_clean_tctx+0x114/0x180 io_uring/tctx.c:248
 io_uring_cancel_generic+0x7b9/0x810 io_uring/cancel.c:657
 io_uring_files_cancel include/linux/io_uring.h:20 [inline]
 do_exit+0x344/0x2af0 kernel/exit.c:916
 do_group_exit+0xd5/0x2a0 kernel/exit.c:1119
 get_signal+0x20ff/0x2210 kernel/signal.c:3037
 arch_do_signal_or_restart+0x91/0x7e0 arch/x86/kernel/signal.c:337
 __exit_to_user_mode_loop kernel/entry/common.c:64 [inline]
 exit_to_user_mode_loop+0x8b/0x4f0 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:230 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
 do_syscall_64+0x706/0x860 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff979d9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff97ac79028 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: 0000000000018ff8 RBX: 00007ff97a015fa0 RCX: 00007ff979d9ce59
RDX: 0000000000018ff8 RSI: 0000200000009b80 RDI: 0000000000000003
RBP: 00007ff979e32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ff97a016038 R14: 00007ff97a015fa0 R15: 00007ffc1552e588
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* improve the kmem_cache_alloc_bulk API
From: Christoph Hellwig @ 2026-05-27  7:02 UTC (permalink / raw)
  To: Vlastimil Babka, Harry Yoo, Andrew Morton
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Jesper Dangaard Brouer, linux-arm-msm, dri-devel, freedreno,
	linux-kernel, linux-mm, io-uring, kasan-dev, bpf, netdev

Hi all,

kmem_cache_alloc_bulk has a very unintuitive and undocumented return
value convention.  Fix that and add documentation.

Note that the few comments explaining it mention that the gfp flags
must allow "spinning".  That's not really a term used in the memory
allocator, is this supposed to mean "block" or "sleep"?

Diffstat:
 drivers/gpu/drm/msm/msm_iommu.c       |    6 +--
 drivers/gpu/drm/panthor/panthor_mmu.c |   12 ++-----
 include/linux/slab.h                  |    6 ++-
 io_uring/io_uring.c                   |   23 +++++--------
 lib/test_meminit.c                    |   19 +++++------
 mm/kasan/kasan_test_c.c               |    5 +-
 mm/kfence/kfence_test.c               |    9 ++---
 mm/slub.c                             |   58 ++++++++++++++++++----------------
 net/bpf/test_run.c                    |    7 +---
 net/core/skbuff.c                     |   23 +++++++------
 tools/include/linux/slab.h            |    2 -
 tools/testing/shared/linux.c          |   19 ++++-------
 12 files changed, 92 insertions(+), 97 deletions(-)


* [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: Christoph Hellwig @ 2026-05-27  7:02 UTC (permalink / raw)
  To: Vlastimil Babka, Harry Yoo, Andrew Morton
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Jesper Dangaard Brouer, linux-arm-msm, dri-devel, freedreno,
	linux-kernel, linux-mm, io-uring, kasan-dev, bpf, netdev
In-Reply-To: <20260527070239.2252948-1-hch@lst.de>

The kmem_cache_alloc_bulk return value is weird.  It returns the number
of allocated objects, but that must always be 0 or the requested number
based on the implementations and the handling in the callers, but that
assumption is not actually documented anywhere, which confuses automated
review tools.

Fix this by returning a bool if the allocation succeeded and adding a
kerneldoc comment explaining the API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/gpu/drm/msm/msm_iommu.c       |  6 +--
 drivers/gpu/drm/panthor/panthor_mmu.c | 12 +++---
 include/linux/slab.h                  |  6 ++-
 io_uring/io_uring.c                   | 23 +++++------
 lib/test_meminit.c                    | 19 +++++----
 mm/kasan/kasan_test_c.c               |  5 +--
 mm/kfence/kfence_test.c               |  9 +++--
 mm/slub.c                             | 58 +++++++++++++++------------
 net/bpf/test_run.c                    |  7 ++--
 net/core/skbuff.c                     | 24 ++++++-----
 tools/include/linux/slab.h            |  2 +-
 tools/testing/shared/linux.c          | 19 ++++-----
 12 files changed, 93 insertions(+), 97 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 058c71c82cf5..533104d71f6c 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -330,17 +330,15 @@ static int
 msm_iommu_pagetable_prealloc_allocate(struct msm_mmu *mmu, struct msm_mmu_prealloc *p)
 {
 	struct kmem_cache *pt_cache = get_pt_cache(mmu);
-	int ret;
 
 	p->pages = kvmalloc_objs(*p->pages, p->count);
 	if (!p->pages)
 		return -ENOMEM;
 
-	ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, p->count, p->pages);
-	if (ret != p->count) {
+	if (!kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, p->count, p->pages)) {
 		kfree(p->pages);
 		p->pages = NULL;
-		p->count = ret;
+		p->count = 0;
 		return -ENOMEM;
 	}
 
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index 75d98dad7b1d..b80d7e1d5123 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -1274,10 +1274,9 @@ static int panthor_vm_prepare_map_op_ctx(struct panthor_vm_op_ctx *op_ctx,
 		goto err_cleanup;
 	}
 
-	ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
-				    op_ctx->rsvd_page_tables.pages);
-	op_ctx->rsvd_page_tables.count = ret;
-	if (ret != pt_count) {
+	if (!kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
+			op_ctx->rsvd_page_tables.pages)) {
+		op_ctx->rsvd_page_tables.count = 0;
 		ret = -ENOMEM;
 		goto err_cleanup;
 	}
@@ -1328,9 +1327,8 @@ static int panthor_vm_prepare_unmap_op_ctx(struct panthor_vm_op_ctx *op_ctx,
 			goto err_cleanup;
 		}
 
-		ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
-					    op_ctx->rsvd_page_tables.pages);
-		if (ret != pt_count) {
+		if (!kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
+				op_ctx->rsvd_page_tables.pages)) {
 			ret = -ENOMEM;
 			goto err_cleanup;
 		}
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2b5ab488e96b..6a7b452d43a0 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -815,8 +815,10 @@ kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
  */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 
-int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
-#define kmem_cache_alloc_bulk(...)	alloc_hooks(kmem_cache_alloc_bulk_noprof(__VA_ARGS__))
+bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
+		size_t size, void **p);
+#define kmem_cache_alloc_bulk(...) \
+	alloc_hooks(kmem_cache_alloc_bulk_noprof(__VA_ARGS__))
 
 static __always_inline void kfree_bulk(size_t size, void **p)
 {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 103b6c88f252..bf847ca823f7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -978,29 +978,24 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
 {
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO;
 	void *reqs[IO_REQ_ALLOC_BATCH];
-	int ret;
-
-	ret = kmem_cache_alloc_bulk(req_cachep, gfp, ARRAY_SIZE(reqs), reqs);
+	int nr_reqs = ARRAY_SIZE(reqs);
 
 	/*
-	 * Bulk alloc is all-or-nothing. If we fail to get a batch,
-	 * retry single alloc to be on the safe side.
+	 * Bulk alloc is all-or-nothing. If we fail to get a batch, retry a
+	 * single allocation to be on the safe side.
 	 */
-	if (unlikely(ret <= 0)) {
+	if (!kmem_cache_alloc_bulk(req_cachep, gfp, nr_reqs, reqs)) {
 		reqs[0] = kmem_cache_alloc(req_cachep, gfp);
 		if (!reqs[0])
 			return false;
-		ret = 1;
+		nr_reqs = 1;
 	}
 
-	percpu_ref_get_many(&ctx->refs, ret);
-	ctx->nr_req_allocated += ret;
-
-	while (ret--) {
-		struct io_kiocb *req = reqs[ret];
+	percpu_ref_get_many(&ctx->refs, nr_reqs);
+	ctx->nr_req_allocated += nr_reqs;
 
-		io_req_add_to_cache(req, ctx);
-	}
+	while (nr_reqs--)
+		io_req_add_to_cache(reqs[nr_reqs], ctx);
 	return true;
 }
 
diff --git a/lib/test_meminit.c b/lib/test_meminit.c
index d028a6552cd6..8f178fdf80ff 100644
--- a/lib/test_meminit.c
+++ b/lib/test_meminit.c
@@ -229,14 +229,12 @@ static int __init do_kmem_cache_size(size_t size, bool want_ctor,
 	for (iter = 0; iter < 10; iter++) {
 		/* Do a test of bulk allocations */
 		if (!want_rcu && !want_ctor) {
-			int ret;
-
-			ret = kmem_cache_alloc_bulk(c, alloc_mask, BULK_SIZE, bulk_array);
-			if (!ret) {
+			if (!kmem_cache_alloc_bulk(c, alloc_mask, BULK_SIZE,
+					bulk_array)) {
 				fail = true;
 			} else {
 				int i;
-				for (i = 0; i < ret; i++)
+				for (i = 0; i < BULK_SIZE; i++)
 					fail |= check_buf(bulk_array[i], size, want_ctor, want_rcu, want_zero);
 				kmem_cache_free_bulk(c, ret, bulk_array);
 			}
@@ -354,17 +352,18 @@ static int __init do_kmem_cache_size_bulk(int size, int *total_failures)
 
 	c = kmem_cache_create("test_cache", size, size, 0, NULL);
 	for (iter = 0; (iter < maxiter) && !fail; iter++) {
-		num = kmem_cache_alloc_bulk(c, GFP_KERNEL, ARRAY_SIZE(objects),
-					    objects);
-		for (i = 0; i < num; i++) {
+		if (!kmem_cache_alloc_bulk(c, GFP_KERNEL, ARRAY_SIZE(objects),
+				objects))
+			continue;
+
+		for (i = 0; i < ARRAY_SIZE(objects); i++) {
 			bytes = count_nonzero_bytes(objects[i], size);
 			if (bytes)
 				fail = true;
 			fill_with_garbage(objects[i], size);
 		}
 
-		if (num)
-			kmem_cache_free_bulk(c, num, objects);
+		kmem_cache_free_bulk(c, num, objects);
 	}
 	kmem_cache_destroy(c);
 	*total_failures += fail;
diff --git a/mm/kasan/kasan_test_c.c b/mm/kasan/kasan_test_c.c
index 3f4ed29178b3..b9e167ed5be3 100644
--- a/mm/kasan/kasan_test_c.c
+++ b/mm/kasan/kasan_test_c.c
@@ -1225,14 +1225,13 @@ static void kmem_cache_bulk(struct kunit *test)
 	struct kmem_cache *cache;
 	size_t size = 200;
 	char *p[10];
-	bool ret;
 	int i;
 
 	cache = kmem_cache_create("test_cache", size, 0, 0, NULL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache);
 
-	ret = kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(p), (void **)&p);
-	if (!ret) {
+	if (!kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(p),
+			(void **)&p)) {
 		kunit_err(test, "Allocation failed: %s\n", __func__);
 		kmem_cache_destroy(cache);
 		return;
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index 10424cd25e5a..c472e66e7242 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -761,9 +761,10 @@ static void test_memcache_alloc_bulk(struct kunit *test)
 	timeout = jiffies + msecs_to_jiffies(100 * kfence_sample_interval);
 	do {
 		void *objects[100];
-		int i, num = kmem_cache_alloc_bulk(test_cache, GFP_ATOMIC, ARRAY_SIZE(objects),
-						   objects);
-		if (!num)
+		int i;
+
+		if (!kmem_cache_alloc_bulk(test_cache, GFP_ATOMIC,
+				ARRAY_SIZE(objects), objects))
 			continue;
 		for (i = 0; i < ARRAY_SIZE(objects); i++) {
 			if (is_kfence_address(objects[i])) {
@@ -771,7 +772,7 @@ static void test_memcache_alloc_bulk(struct kunit *test)
 				break;
 			}
 		}
-		kmem_cache_free_bulk(test_cache, num, objects);
+		kmem_cache_free_bulk(test_cache, ARRAY_SIZE(objects), objects);
 		/*
 		 * kmem_cache_alloc_bulk() disables interrupts, and calling it
 		 * in a tight loop may not give KFENCE a chance to switch the
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..d9790e7c17f6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4981,8 +4981,8 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
 	return ret;
 }
 
-static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
-				   size_t size, void **p);
+static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+		size_t size, void **p);
 
 /*
  * returns a sheaf that has at least the requested size
@@ -5154,9 +5154,8 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
 			return __prefill_sheaf_pfmemalloc(s, sheaf, gfp);
 
 		if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
-					     &sheaf->objects[sheaf->size])) {
+					     &sheaf->objects[sheaf->size]))
 			return -ENOMEM;
-		}
 		sheaf->size = sheaf->capacity;
 
 		return 0;
@@ -7289,9 +7288,8 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 	return refilled;
 }
 
-static inline
-int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
-			    void **p)
+static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+		size_t size, void **p)
 {
 	int i;
 
@@ -7312,30 +7310,42 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		stat_add(s, ALLOC_SLOWPATH, i);
 	}
 
-	return i;
+	return true;
 
 error:
 	__kmem_cache_free_bulk(s, i, p);
-	return 0;
-
+	return false;
 }
 
-/*
- * Note that interrupts must be enabled when calling this function and gfp
- * flags must allow spinning.
+/**
+ * kmem_cache_alloc_bulk - Allocate multiple objects
+ * @s:		The cache to allocate from
+ * @flags:	GFP_* flags. See kmalloc().
+ * @size:	Number of objects to allocate
+ * @p:		Array of allocated objects
+ *
+ * Allocate @size objects from @s and place them into @p.
+ *
+ * Interrupts must be enabled when calling this function and @flags must allow
+ * spinning.
+ *
+ * Unlike alloc_pages_bulk(), this function does not check for already allocated
+ * objects in @p, and thus the caller does not need to zero it.
+ *
+ * Return: %true if the allocation succeeded, or %false if it failed.
  */
-int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
-				 void **p)
+bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
+		size_t size, void **p)
 {
 	unsigned int i = 0;
 	void *kfence_obj;
 
 	if (!size)
-		return 0;
+		return false;
 
 	s = slab_pre_alloc_hook(s, flags);
 	if (unlikely(!s))
-		return 0;
+		return false;
 
 	/*
 	 * to make things simpler, only assume at most once kfence allocated
@@ -7352,18 +7362,18 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	i = alloc_from_pcs_bulk(s, flags, size, p);
-
 	if (i < size) {
 		/*
 		 * If we ran out of memory, don't bother with freeing back to
 		 * the percpu sheaves, we have bigger problems.
 		 */
-		if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
+		if (unlikely(!__kmem_cache_alloc_bulk(s, flags, size - i,
+				p + i))) {
 			if (i > 0)
 				__kmem_cache_free_bulk(s, i, p);
 			if (kfence_obj)
 				__kfence_free(kfence_obj);
-			return 0;
+			return false;
 		}
 	}
 
@@ -7382,12 +7392,8 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * memcg and kmem_cache debug support and memory initialization.
 	 * Done outside of the IRQ disabled fastpath loop.
 	 */
-	if (unlikely(!slab_post_alloc_hook(s, NULL, flags, size, p,
-		    slab_want_init_on_alloc(flags, s), s->object_size))) {
-		return 0;
-	}
-
-	return size;
+	return likely(slab_post_alloc_hook(s, NULL, flags, size, p,
+			slab_want_init_on_alloc(flags, s), s->object_size));
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2bc04feadfab..99ab9ddb05e3 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -243,12 +243,11 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 			   struct net_device *dev)
 {
 	gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
-	int i, n;
+	int i;
 	LIST_HEAD(list);
 
-	n = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp, nframes,
-				  (void **)skbs);
-	if (unlikely(n == 0)) {
+	if (!kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp, nframes,
+				   (void **)skbs)) {
 		for (i = 0; i < nframes; i++)
 			xdp_return_frame(frames[i]);
 		return -ENOMEM;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 44ac121cfccb..73045b688385 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -288,11 +288,11 @@ static inline struct sk_buff *napi_skb_cache_get(bool alloc)
 
 	local_lock_nested_bh(&napi_alloc_cache.bh_lock);
 	if (unlikely(!nc->skb_count)) {
-		if (alloc)
-			nc->skb_count = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
-						GFP_ATOMIC | __GFP_NOWARN,
-						NAPI_SKB_CACHE_BULK,
-						nc->skb_cache);
+		if (alloc && kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+						   GFP_ATOMIC | __GFP_NOWARN,
+						   NAPI_SKB_CACHE_BULK,
+						   nc->skb_cache))
+			nc->skb_count = NAPI_SKB_CACHE_BULK;
 		if (unlikely(!nc->skb_count)) {
 			local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
 			return NULL;
@@ -353,16 +353,18 @@ u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
 
 	/* No enough cached skbs. Try refilling the cache first */
 	bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count, NAPI_SKB_CACHE_BULK);
-	nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
-					       GFP_ATOMIC | __GFP_NOWARN, bulk,
-					       &nc->skb_cache[nc->skb_count]);
+	if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+				  GFP_ATOMIC | __GFP_NOWARN, bulk,
+				  &nc->skb_cache[nc->skb_count]))
+		nc->skb_count += bulk;
 	if (likely(nc->skb_count >= n))
 		goto get;
 
 	/* Still not enough. Bulk-allocate the missing part directly, zeroed */
-	n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
-				   GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
-				   n - nc->skb_count, &skbs[nc->skb_count]);
+	if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+				  GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
+				  n - nc->skb_count, &skbs[nc->skb_count]))
+		n = nc->skb_count;
 	if (likely(nc->skb_count >= n))
 		goto get;
 
diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index 6d8e9413d5a4..2e63c2e726aa 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -183,7 +183,7 @@ __kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 		default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
 
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
-int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
+bool kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
 struct slab_sheaf *
 kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 8c7257155958..e9c3bc9b3272 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -154,7 +154,7 @@ void kmem_cache_shrink(struct kmem_cache *cachep)
 {
 }
 
-int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
+bool kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **p)
 {
 	size_t i;
@@ -213,7 +213,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 		pthread_mutex_unlock(&cachep->lock);
 		if (cachep->callback)
 			cachep->exec_callback = true;
-		return 0;
+		return false;
 	}
 
 	for (i = 0; i < size; i++) {
@@ -224,7 +224,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			printf("Allocating %p from slab\n", p[i]);
 	}
 
-	return size;
+	return true;
 }
 
 struct kmem_cache *
@@ -271,8 +271,8 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 	sheaf->cache = s;
 	sheaf->capacity = capacity;
-	sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
-	if (!sheaf->size) {
+	sheaf->size = size;
+	if (!kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects)) {
 		free(sheaf);
 		return NULL;
 	}
@@ -284,7 +284,6 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
 		 struct slab_sheaf **sheafp, unsigned int size)
 {
 	struct slab_sheaf *sheaf = *sheafp;
-	int refill;
 
 	if (sheaf->size >= size)
 		return 0;
@@ -299,12 +298,10 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
 		return 0;
 	}
 
-	refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
-				       &sheaf->objects[sheaf->size]);
-	if (!refill)
+	if (!kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
+			&sheaf->objects[sheaf->size]))
 		return -ENOMEM;
-
-	sheaf->size += refill;
+	sheaf->size += (size - sheaf->size);
 	return 0;
 }
 
-- 
2.53.0



* Re: [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: bot+bpf-ci @ 2026-05-27  7:53 UTC (permalink / raw)
  To: hch, vbabka, harry, akpm
  Cc: hao.li, cl, rientjes, roman.gushchin, hawk, linux-arm-msm,
	dri-devel, freedreno, linux-kernel, linux-mm, io-uring, kasan-dev,
	bpf, netdev, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260527070239.2252948-2-hch@lst.de>

[-- Attachment #1: Type: text/plain, Size: 4512 bytes --]

> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index 75d98dad7b1d..b80d7e1d5123 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -1274,10 +1274,9 @@ static int panthor_vm_prepare_map_op_ctx(struct panthor_vm_op_ctx *op_ctx,
>  		goto err_cleanup;
>  	}
>
> -	ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
> -				    op_ctx->rsvd_page_tables.pages);
> -	op_ctx->rsvd_page_tables.count = ret;
> -	if (ret != pt_count) {
> +	if (!kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
> +			op_ctx->rsvd_page_tables.pages)) {
> +		op_ctx->rsvd_page_tables.count = 0;
>  		ret = -ENOMEM;
>  		goto err_cleanup;
>  	}

Does this omit setting op_ctx->rsvd_page_tables.count to pt_count on success?

Since op_ctx is initialized via memset previously, the count will remain 0 when
the allocation succeeds. This could lead to memory leaks when these page tables
are consumed or cleaned up later, since the cleanup code won't know about the
allocated page tables.

The original code set the count to ret on both success and failure. The
updated code only sets count = 0 in the failure path, but doesn't set
count = pt_count in the success path.

[ ... ]

> diff --git a/lib/test_meminit.c b/lib/test_meminit.c
> index 6298f66c964b..3727dee7c262 100644
> --- a/lib/test_meminit.c
> +++ b/lib/test_meminit.c
> @@ -229,14 +229,12 @@ static int __init do_kmem_cache_size(size_t size, bool want_ctor,
>  	for (iter = 0; iter < 10; iter++) {
>  		/* Do a test of bulk allocations */
>  		if (!want_rcu && !want_ctor) {
> -			int ret;
> -
> -			ret = kmem_cache_alloc_bulk(c, alloc_mask, BULK_SIZE, bulk_array);
> -			if (!ret) {
> +			if (!kmem_cache_alloc_bulk(c, alloc_mask, BULK_SIZE,
> +					bulk_array)) {
>  				fail = true;
>  			} else {
>  				int i;
> -				for (i = 0; i < ret; i++)
> +				for (i = 0; i < BULK_SIZE; i++)
>  					fail |= check_buf(bulk_array[i], size, want_ctor, want_rcu, want_zero);
>  				kmem_cache_free_bulk(c, ret, bulk_array);
>  			}

Will this cause a compilation error?

The local variable ret is removed from this scope, but it's still used in the
kmem_cache_free_bulk() call. This should fail to compile because ret is
undeclared.

The fix should probably replace ret with BULK_SIZE in the free call:

	kmem_cache_free_bulk(c, BULK_SIZE, bulk_array);


> @@ -354,17 +352,18 @@ static int __init do_kmem_cache_size_bulk(int size, int *total_failures)
>
>  	c = kmem_cache_create("test_cache", size, size, 0, NULL);
>  	for (iter = 0; (iter < maxiter) && !fail; iter++) {
> -		num = kmem_cache_alloc_bulk(c, GFP_KERNEL, ARRAY_SIZE(objects),
> -					    objects);
> -		for (i = 0; i < num; i++) {
> +		if (!kmem_cache_alloc_bulk(c, GFP_KERNEL, ARRAY_SIZE(objects),
> +				objects))
> +			continue;
> +
> +		for (i = 0; i < ARRAY_SIZE(objects); i++) {
>  			bytes = count_nonzero_bytes(objects[i], size);
>  			if (bytes)
>  				fail = true;
>  			fill_with_garbage(objects[i], size);
>  		}
>
> -		if (num)
> -			kmem_cache_free_bulk(c, num, objects);
> +		kmem_cache_free_bulk(c, num, objects);
>  	}
>  	kmem_cache_destroy(c);
>  	*total_failures += fail;

Is num uninitialized here?

It's no longer assigned the result of the allocation call. Passing it to
kmem_cache_free_bulk() will use indeterminate stack data, which could lead to
memory corruption. The fix should probably be:

	kmem_cache_free_bulk(c, ARRAY_SIZE(objects), objects);

[ ... ]

> diff --git a/mm/slub.c b/mm/slub.c
> index a2bf3756ca7d..d9790e7c17f6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c

[ ... ]

> @@ -7352,18 +7362,18 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>  	 */
>  	unsigned int i = 0;
>  	void *kfence_obj;
>
>  	if (!size)
> -		return 0;
> +		return false;

Does this incorrectly report an allocation failure when size is 0?

Previously, callers interpreting the return value against size (ret != size)
would treat 0 objects allocated for a request of 0 as a success. Returning
false here contradicts the typical kernel behavior where allocating 0 items is
a successful no-op.

This might cause callers to incorrectly abort with an error when they request
0 objects, which should be a valid no-op case.
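The size == 0 question above can be illustrated with a small userspace
model. This is only a sketch: every model_* name is hypothetical, the
can_alloc parameter stands in for "the allocator has memory", and nothing
here is the kernel implementation. It shows that under the old
count-returning convention the outcome of an empty request depended on how
the caller checked the result, while the bool convention forces one answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the old convention: returns 0 or @size, and 0 for size == 0. */
size_t model_old_bulk(size_t size, bool can_alloc)
{
	if (!size)
		return 0;
	return can_alloc ? size : 0;
}

/* Model of the patch as posted: an empty request reports false. */
bool model_new_bulk(size_t size, bool can_alloc)
{
	if (!size)
		return false;
	return can_alloc;
}

/* Hypothetical alternative: an empty request is a successful no-op. */
bool model_new_bulk_noop_ok(size_t size, bool can_alloc)
{
	return !size || can_alloc;
}

/* Old-API caller style 1: success means "got exactly what was asked for". */
bool model_old_caller_exact(size_t size)
{
	return model_old_bulk(size, true) == size;
}

/* Old-API caller style 2: success means "return value is non-zero". */
bool model_old_caller_nonzero(size_t size)
{
	return model_old_bulk(size, true) != 0;
}
```

In this model, model_old_caller_exact(0) reports success while
model_old_caller_nonzero(0) reports failure, so the old API was already
ambiguous for empty requests. Which behaviour the bool version should pick
is a decision for the patch author and maintainers, not something this
sketch settles.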


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26496962101


* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Christoph Hellwig @ 2026-05-27  8:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, demiobenour, Herbert Xu, David S. Miller,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	Jakub Kicinski, Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel, linux-crypto, linux-kernel, io-uring, netdev,
	linux-perf-users, linux-doc, Toke Høiland-Jørgensen,
	linux-api, David Howells
In-Reply-To: <92db3ff0-8f0b-4b61-a167-5004ffcf9025@kernel.dk>

On Tue, May 26, 2026 at 09:58:27AM -0600, Jens Axboe wrote:
> > The current TCP zerocopy implementation provides completion notification
> > through the socket error code, which is freaking weird and doesn't
> > integrate well with either io_uring or in-kernel callers.
> 
> We already have that via io_uring

Where?  And how do we make that available to in-kernel users like
storage protocols and network file systems, which really suffer from
the current MSG_SPLICE_PAGES semantics?

> , and without needing msg_kiocb or the

What do you think is the downside of using a kiocb here like for
everything else with async notifications?



* Re: [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: Jesper Dangaard Brouer @ 2026-05-27  8:51 UTC (permalink / raw)
  To: Christoph Hellwig, Vlastimil Babka, Harry Yoo, Andrew Morton
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	linux-arm-msm, dri-devel, freedreno, linux-kernel, linux-mm,
	io-uring, kasan-dev, bpf, netdev, Alexander Lobakin, Matt Fleming,
	kernel-team
In-Reply-To: <20260527070239.2252948-2-hch@lst.de>



On 27/05/2026 09.02, Christoph Hellwig wrote:
> The kmem_cache_alloc_bulk return value is weird.  It returns the number
> of allocated objects, but that must always be 0 or the requested number
> based on the implementations and the handling in the callers, but that
> assumption is not actually documented anywhere, which confuses automated
> review tools.
> 

As I remember it, this API behavior was requested by AKPM when I developed
kmem_cache_alloc_bulk.  I trusted AKPM's decision, but I cannot explain
why this choice was made.

I kept the netdev code usage below.  The current napi_skb_cache_get_bulk
has retry logic that assumes a partial bulk count can be returned (which
it cannot, as Hellwig explains).  Cc Alex/Olek: please review the changes
below, as you added this retry logic.


> Fix this by returning a bool if the allocation succeeded and adding a
> kerneldoc comment explaining the API.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/gpu/drm/msm/msm_iommu.c       |  6 +--
>   drivers/gpu/drm/panthor/panthor_mmu.c | 12 +++---
>   include/linux/slab.h                  |  6 ++-
>   io_uring/io_uring.c                   | 23 +++++------
>   lib/test_meminit.c                    | 19 +++++----
>   mm/kasan/kasan_test_c.c               |  5 +--
>   mm/kfence/kfence_test.c               |  9 +++--
>   mm/slub.c                             | 58 +++++++++++++++------------
>   net/bpf/test_run.c                    |  7 ++--
>   net/core/skbuff.c                     | 24 ++++++-----
>   tools/include/linux/slab.h            |  2 +-
>   tools/testing/shared/linux.c          | 19 ++++-----
>   12 files changed, 93 insertions(+), 97 deletions(-)
> 

[...]

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 44ac121cfccb..73045b688385 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -288,11 +288,11 @@ static inline struct sk_buff *napi_skb_cache_get(bool alloc)
>   
>   	local_lock_nested_bh(&napi_alloc_cache.bh_lock);
>   	if (unlikely(!nc->skb_count)) {
> -		if (alloc)
> -			nc->skb_count = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> -						GFP_ATOMIC | __GFP_NOWARN,
> -						NAPI_SKB_CACHE_BULK,
> -						nc->skb_cache);
> +		if (alloc && kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> +						   GFP_ATOMIC | __GFP_NOWARN,
> +						   NAPI_SKB_CACHE_BULK,
> +						   nc->skb_cache))
> +			nc->skb_count = NAPI_SKB_CACHE_BULK;
>   		if (unlikely(!nc->skb_count)) {
>   			local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
>   			return NULL;
> @@ -353,16 +353,18 @@ u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
>   
>   	/* No enough cached skbs. Try refilling the cache first */
>   	bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count, NAPI_SKB_CACHE_BULK);
> -	nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> -					       GFP_ATOMIC | __GFP_NOWARN, bulk,
> -					       &nc->skb_cache[nc->skb_count]);
> +	if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> +				  GFP_ATOMIC | __GFP_NOWARN, bulk,
> +				  &nc->skb_cache[nc->skb_count]))
> +		nc->skb_count += bulk;
>   	if (likely(nc->skb_count >= n))
>   		goto get;
>   
>   	/* Still not enough. Bulk-allocate the missing part directly, zeroed */
> -	n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> -				   GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
> -				   n - nc->skb_count, &skbs[nc->skb_count]);
> +	if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> +				  GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
> +				  n - nc->skb_count, &skbs[nc->skb_count]))
> +		n = nc->skb_count;
>   	if (likely(nc->skb_count >= n))
>   		goto get;
>   


* Re: improve the kmem_cache_alloc_bulk API
From: Vlastimil Babka (SUSE) @ 2026-05-27  9:11 UTC (permalink / raw)
  To: Christoph Hellwig, Harry Yoo, Andrew Morton
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Jesper Dangaard Brouer, linux-arm-msm, dri-devel, freedreno,
	linux-kernel, linux-mm, io-uring, kasan-dev, bpf, netdev
In-Reply-To: <20260527070239.2252948-1-hch@lst.de>

On 5/27/26 09:02, Christoph Hellwig wrote:
> Hi all,
> 
> kmem_cache_alloc_bulk has a very unintuitive and undocumented return
> value convention.  Fix that and add documentation.
> 
> Note that the few comments explaining it mention that the gfp flags
> must allow "spinning".  That's not really a term used in the memory
> allocator, is this supposed to mean "block" or "sleep"?

Page allocator now has alloc_pages_nolock() for when no spinning is
possible, and it uses ALLOC_TRYLOCK internally.

Slab has kmalloc_nolock() relying on that when it needs new pages.

In terms of gfp flags, such a context is currently indicated by the lack of
__GFP_KSWAPD_RECLAIM, whereas the lack of __GFP_DIRECT_RECLAIM only means "no
sleeping" - see gfpflags_allow_spinning(). Slab uses that helper internally as
there's no ALLOC_TRYLOCK, but there are also callers from memcg and stackdepot.

Like the rest of gfp flags it's far from ideal, maybe we'll figure out a
better design eventually.
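
To make the "no spinning" versus "no sleeping" distinction concrete, here
is a small userspace model. It is a sketch under stated assumptions: the
MODEL_* bit values and context names are invented for illustration and are
not the kernel's definitions, and the two predicates only follow the
description above. The real gfpflags_allow_spinning() may be implemented
differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this model only; not the kernel's definitions. */
#define MODEL_DIRECT_RECLAIM	(1u << 0)	/* caller may enter direct reclaim */
#define MODEL_KSWAPD_RECLAIM	(1u << 1)	/* caller may wake kswapd */

/* Rough stand-ins for three kinds of calling context, for illustration. */
#define MODEL_SLEEPABLE	(MODEL_DIRECT_RECLAIM | MODEL_KSWAPD_RECLAIM)
#define MODEL_ATOMIC	(MODEL_KSWAPD_RECLAIM)
#define MODEL_NOLOCK	(0u)

/* "May sleep": needs the direct reclaim bit. */
bool model_allow_sleeping(unsigned int flags)
{
	return (flags & MODEL_DIRECT_RECLAIM) != 0;
}

/* "May spin on locks": per the description above, needs the kswapd bit. */
bool model_allow_spinning(unsigned int flags)
{
	return (flags & MODEL_KSWAPD_RECLAIM) != 0;
}

/* Self-check covering the three contexts modelled here. */
void model_selfcheck(void)
{
	assert(model_allow_sleeping(MODEL_SLEEPABLE));
	assert(model_allow_spinning(MODEL_SLEEPABLE));
	assert(!model_allow_sleeping(MODEL_ATOMIC));
	assert(model_allow_spinning(MODEL_ATOMIC));
	assert(!model_allow_sleeping(MODEL_NOLOCK));
	assert(!model_allow_spinning(MODEL_NOLOCK));
}
```

In this model an atomic-style context may take spinning locks but may not
sleep, while a nolock-style context may do neither. That is the case the
"must allow spinning" wording in the comments appears to be about.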

> Diffstat:
>  drivers/gpu/drm/msm/msm_iommu.c       |    6 +--
>  drivers/gpu/drm/panthor/panthor_mmu.c |   12 ++-----
>  include/linux/slab.h                  |    6 ++-
>  io_uring/io_uring.c                   |   23 +++++--------
>  lib/test_meminit.c                    |   19 +++++------
>  mm/kasan/kasan_test_c.c               |    5 +-
>  mm/kfence/kfence_test.c               |    9 ++---
>  mm/slub.c                             |   58 ++++++++++++++++++----------------
>  net/bpf/test_run.c                    |    7 +---
>  net/core/skbuff.c                     |   23 +++++++------
>  tools/include/linux/slab.h            |    2 -
>  tools/testing/shared/linux.c          |   19 ++++-------
>  12 files changed, 92 insertions(+), 97 deletions(-)



* Re: [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: Vlastimil Babka (SUSE) @ 2026-05-27  9:38 UTC (permalink / raw)
  To: Christoph Hellwig, Harry Yoo, Andrew Morton
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Jesper Dangaard Brouer, linux-arm-msm, dri-devel, freedreno,
	linux-kernel, linux-mm, io-uring, kasan-dev, bpf, netdev
In-Reply-To: <20260527070239.2252948-2-hch@lst.de>

On 5/27/26 09:02, Christoph Hellwig wrote:
> The kmem_cache_alloc_bulk return value is weird.  It returns the number
> of allocated objects, but that must always be 0 or the requested number
> based on the implementations and the handling in the callers, but that
> assumption is not actually documented anywhere, which confuses automated
> review tools.
> 
> Fix this by returning a bool if the allocation succeeded and adding a
> kerneldoc comment explaining the API.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Would 0 / -ENOMEM be more like what people would expect? I guess both that
and bool are better than the current API.
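
For readers comparing the two conventions, here is a minimal userspace
sketch of an all-or-nothing bulk allocator with both spellings of the
result. Everything named demo_* is hypothetical and is not the slab
implementation; malloc() stands in for the per-object allocation, and
demo_budget exists only to inject failures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical failure injection: allocations left before demo_alloc() fails. */
static int demo_budget = -1;	/* -1 means "never fail" */

static void *demo_alloc(size_t size)
{
	if (demo_budget == 0)
		return NULL;
	if (demo_budget > 0)
		demo_budget--;
	return malloc(size);
}

/*
 * All-or-nothing bulk allocation: either every slot in p[0..nr-1] is
 * filled and true is returned, or nothing stays allocated and false is
 * returned.
 */
static bool demo_alloc_bulk(size_t size, size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		p[i] = demo_alloc(size);
		if (!p[i]) {
			/* Undo the partial work so callers never see 0 < count < nr. */
			while (i--) {
				free(p[i]);
				p[i] = NULL;
			}
			return false;
		}
	}
	return true;
}

/* The errno-style alternative: same information, different spelling. */
static int demo_alloc_bulk_errno(size_t size, size_t nr, void **p)
{
	return demo_alloc_bulk(size, nr, p) ? 0 : -ENOMEM;
}
```

Since the only possible failure is "out of memory", the two wrappers
carry the same information; the difference is purely in how call sites
read.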


^ permalink raw reply

* [PATCH] scsi: bsg: copy uring_cmd payload to prevent double-fetch from shared SQE
From: Rahul Chandelkar @ 2026-05-27 10:59 UTC (permalink / raw)
  To: rc, James E . J . Bottomley, Martin K . Petersen, Jens Axboe,
	FUJITA Tomonori
  Cc: linux-scsi, linux-block, io-uring, linux-kernel, stable

scsi_bsg_uring_cmd() and scsi_bsg_map_user_buffer() read bsg_uring_cmd
fields directly from the shared mmap'd io_uring submission ring via
io_uring_sqe128_cmd().  On the inline execution path, io_uring has not
yet copied the SQE to kernel memory, so a concurrent userspace thread
can modify fields between reads.

cmd->request_len is read for the bounds check, for the cmd_len
assignment, and for the copy_from_user length.  A racing thread can
change request_len between the bounds check (passes with <= 32) and
copy_from_user (uses the enlarged value), overflowing the 32-byte
scmd->cmnd[] buffer into subsequent struct scsi_cmnd fields.

scsi_bsg_map_user_buffer() independently re-derives its cmd pointer
from the same shared SQE, re-reading dout_xfer_len, din_xfer_len,
dout_xferp, and din_xferp, enabling direction confusion and buffer
length races.

Copy struct bsg_uring_cmd to a stack-local variable before use in both
functions.  The pointer variable 'cmd' is redirected to the local copy
so the rest of each function is unchanged.
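
The pattern can be illustrated outside the kernel with a minimal
sketch. struct demo_cmd, DEMO_CDB_MAX and the demo_* functions are
hypothetical stand-ins, not the bsg code, and the "race" is a callback
invoked between check and use so the outcome is deterministic.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CDB_MAX 32

/* Hypothetical stand-in for a command living in memory userspace can write. */
struct demo_cmd {
	uint32_t request_len;
	uint8_t request[64];
};

/* Called between check and use to play the role of the racing thread. */
typedef void (*demo_race_fn)(struct demo_cmd *shared);

/* Double fetch: request_len is read once for the check and again for use. */
static int demo_len_double_fetch(struct demo_cmd *shared, demo_race_fn race,
				 uint32_t *used_len)
{
	if (shared->request_len > DEMO_CDB_MAX)
		return -EINVAL;
	if (race)
		race(shared);
	*used_len = shared->request_len;	/* second fetch, may differ */
	return 0;
}

/* Snapshot: copy once, then validate and use only the private copy. */
static int demo_len_snapshot(struct demo_cmd *shared, demo_race_fn race,
			     uint8_t cdb[DEMO_CDB_MAX], uint32_t *used_len)
{
	struct demo_cmd local;

	memcpy(&local, shared, sizeof(local));
	if (local.request_len > DEMO_CDB_MAX)
		return -EINVAL;
	if (race)
		race(shared);	/* too late: local is no longer affected */
	memcpy(cdb, local.request, local.request_len);
	*used_len = local.request_len;
	return 0;
}

/* The "attacker": enlarge the length after the bounds check has passed. */
static void demo_enlarge(struct demo_cmd *shared)
{
	shared->request_len = 64;
}
```

With demo_enlarge() as the race callback, the double-fetch variant ends
up using a length of 64 after validating 16, while the snapshot variant
keeps using 16.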

Tested with KASAN on QEMU (virtio-scsi, 2 vCPUs).  Without this fix,
a two-thread race produces:

  BUG: KASAN: wild-memory-access in scsi_queue_rq+0x4a3/0x58a0
  Write of size 96 at addr dead000000001000 by task poc/67
  Call Trace:
   kasan_report+0xce/0x100
   __asan_memset+0x23/0x50
   scsi_queue_rq+0x4a3/0x58a0
   scsi_bsg_uring_cmd+0x942/0x1570
   io_uring_cmd+0x2f6/0x950
   io_issue_sqe+0xe5/0x22d0

Fixes: 7b6d3255e7f8 ("scsi: bsg: add io_uring passthrough handler")
Cc: stable@vger.kernel.org
Signed-off-by: Rahul Chandelkar <rc@rexion.ai>
---
 drivers/scsi/scsi_bsg.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_bsg.c b/drivers/scsi/scsi_bsg.c
index e80dec53174e..244740655eb0 100644
--- a/drivers/scsi/scsi_bsg.c
+++ b/drivers/scsi/scsi_bsg.c
@@ -78,13 +78,21 @@ static int scsi_bsg_map_user_buffer(struct request *req,
 				    struct io_uring_cmd *ioucmd,
 				    unsigned int issue_flags, gfp_t gfp_mask)
 {
-	const struct bsg_uring_cmd *cmd = io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd);
-	bool is_write = cmd->dout_xfer_len > 0;
-	u64 buf_addr = is_write ? cmd->dout_xferp : cmd->din_xferp;
-	unsigned long buf_len = is_write ? cmd->dout_xfer_len : cmd->din_xfer_len;
+	struct bsg_uring_cmd local_cmd;
+	const struct bsg_uring_cmd *cmd;
+	bool is_write;
+	u64 buf_addr;
+	unsigned long buf_len;
 	struct iov_iter iter;
 	int ret;
 
+	memcpy(&local_cmd, io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd),
+	       sizeof(local_cmd));
+	cmd = &local_cmd;
+	is_write = cmd->dout_xfer_len > 0;
+	buf_addr = is_write ? cmd->dout_xferp : cmd->din_xferp;
+	buf_len = is_write ? cmd->dout_xfer_len : cmd->din_xfer_len;
+
 	if (ioucmd->flags & IORING_URING_CMD_FIXED) {
 		ret = io_uring_cmd_import_fixed(buf_addr, buf_len,
 						is_write ? WRITE : READ,
@@ -104,13 +112,18 @@ static int scsi_bsg_uring_cmd(struct request_queue *q, struct io_uring_cmd *iouc
 			       unsigned int issue_flags, bool open_for_write)
 {
 	struct scsi_bsg_uring_cmd_pdu *pdu = scsi_bsg_uring_cmd_pdu(ioucmd);
-	const struct bsg_uring_cmd *cmd = io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd);
+	struct bsg_uring_cmd local_cmd;
+	const struct bsg_uring_cmd *cmd;
 	struct scsi_cmnd *scmd;
 	struct request *req;
 	blk_mq_req_flags_t blk_flags = 0;
 	gfp_t gfp_mask = GFP_KERNEL;
 	int ret;
 
+	memcpy(&local_cmd, io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd),
+	       sizeof(local_cmd));
+	cmd = &local_cmd;
+
 	if (cmd->protocol != BSG_PROTOCOL_SCSI ||
 	    cmd->subprotocol != BSG_SUB_PROTOCOL_SCSI_CMD)
 		return -EINVAL;
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: Christoph Hellwig @ 2026-05-27 12:20 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Christoph Hellwig, Harry Yoo, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Jesper Dangaard Brouer, linux-arm-msm, dri-devel, freedreno,
	linux-kernel, linux-mm, io-uring, kasan-dev, bpf, netdev
In-Reply-To: <f7f35169-f77d-4678-8797-a2ad00d89e6c@kernel.org>

On Wed, May 27, 2026 at 11:38:21AM +0200, Vlastimil Babka (SUSE) wrote:
> On 5/27/26 09:02, Christoph Hellwig wrote:
> > The kmem_cache_alloc_bulk return value is weird.  It returns the number
> > of allocated objects, but that must always be 0 or the requested number
> > based on the implementations and the handling in the callers, but that
> > assumption is not actually documented anywhere, which confuses automated
> > review tools.
> > 
> > Fix this by returning a bool if the allocation succeeded and adding a
> > kerneldoc comment explaining the API.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Would 0 / -ENOMEM be more like what people would expect? I guess both that
> and bool are better than the current API.

I find an errno return where the API could not return anything but the
specific error code a bit odd.  But even that would be a lot better
than the current version.

^ permalink raw reply

* Re: improve the kmem_cache_alloc_bulk API
From: Christoph Hellwig @ 2026-05-27 12:21 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Christoph Hellwig, Harry Yoo, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Jesper Dangaard Brouer, linux-arm-msm, dri-devel, freedreno,
	linux-kernel, linux-mm, io-uring, kasan-dev, bpf, netdev
In-Reply-To: <ee817070-cc7a-40d5-92a4-2bd8e9e65fbe@kernel.org>

On Wed, May 27, 2026 at 11:11:31AM +0200, Vlastimil Babka (SUSE) wrote:
> > value convention.  Fix that and add documentation.
> > 
> > Note that the few comments explaining it mention that the gfp flags
> > must allow "spinning".  That's not really a term used in the memory
> > allocator, is this supposed to mean "block" or "sleep"?
> 
> Page allocator now has alloc_pages_nolock() for when no spinning is
> possible, and it uses ALLOC_TRYLOCK internally.
> 
> Slab has kmalloc_nolock() relying on that when it needs new pages.

The comment long predates that, and it isn't expressed using gfp flags,
but by requiring separate functions so I somehow doubt that was meant.
But I could also not see why it would not support GFP_ATOMIC /
GFP_NOWAIT allocation, so I might just be confused.

^ permalink raw reply

* Re: [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: Alexander Lobakin @ 2026-05-27 13:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Christoph Hellwig
  Cc: Vlastimil Babka, Harry Yoo, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin, linux-arm-msm,
	dri-devel, freedreno, linux-kernel, linux-mm, io-uring, kasan-dev,
	bpf, netdev, Matt Fleming, kernel-team
In-Reply-To: <e4dcfbc8-2666-452c-90b2-25c4b2c50c9f@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Wed, 27 May 2026 10:51:42 +0200

> 
> 
> On 27/05/2026 09.02, Christoph Hellwig wrote:
>> The kmem_cache_alloc_bulk return value is weird.  It returns the number
>> of allocated objects, but that must always be 0 or the requested number
>> based on the implementations and the handling in the callers, but that
>> assumption is not actually documented anywhere, which confuses automated
>> review tools.
>>
> 
> I remember, this API behavior was requested by AKPM when I developed
> kmem_cache_alloc_bulk.  I trusted AKPM's decision, but I cannot explain
> why this choice was made.

I sorta remember that when I was reading this function, I also noticed
that it always returns only 2 possible values (0 or the requested
number), but didn't pay enough attention or it was already after I
introduced napi_skb_cache_get_bulk().

> 
> I kept the netdev code usage below. The current napi_skb_cache_get_bulk
> has retry logic that assumes that a partial bulk number can be
> returned (which it cannot, as Hellwig explains).  Cc Alex/Olek please
> review the changes below as you added this retry logic.

As far as I can see, the diff below doesn't introduce any functional
changes (but allows for a bit better compiler optimization). The logic
is still the same:

1) try to allocate non-zeroed skbs into the cache
2) if not enough, try to allocate zeroed skbs directly
3) if still not enough, return less than requested

The logic is still valid even if kmem_cache_alloc_bulk() returns bool --
we might have some skbs in the cache (but fewer than requested), and then
the first allocation attempt may fail while the second one succeeds (as it
allocates from a different (the zeroed) zone).
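
The three steps can be modelled in userspace roughly as follows. This is
a simplified sketch, not the skbuff code: demo_cache, demo_try_bulk()
and the failure mask are hypothetical, locking is omitted, and plain
malloc()/calloc() stand in for the slab cache.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_CACHE_SIZE	8
#define DEMO_CACHE_BULK	4

static void *demo_cache[DEMO_CACHE_SIZE];
static size_t demo_cached;

/* Hypothetical failure injection: bit i set makes the i-th bulk call fail. */
static unsigned int demo_fail_mask;
static unsigned int demo_call_nr;

/* All-or-nothing bulk allocation returning bool, as in the patch. */
static bool demo_try_bulk(size_t nr, bool zero, void **p)
{
	size_t i;

	if (demo_fail_mask & (1u << demo_call_nr++))
		return false;
	for (i = 0; i < nr; i++) {
		p[i] = zero ? calloc(1, 64) : malloc(64);
		if (!p[i]) {
			while (i--)
				free(p[i]);
			return false;
		}
	}
	return true;
}

/* Returns how many entries of out[] were filled (may be fewer than n). */
static size_t demo_get_bulk(void **out, size_t n)
{
	size_t total = n, bulk, i;

	if (demo_cached >= n)
		goto get;

	/* 1) try to refill the cache with non-zeroed objects */
	bulk = DEMO_CACHE_SIZE - demo_cached;
	if (bulk > DEMO_CACHE_BULK)
		bulk = DEMO_CACHE_BULK;
	if (demo_try_bulk(bulk, false, &demo_cache[demo_cached]))
		demo_cached += bulk;
	if (demo_cached >= n)
		goto get;

	/* 2) allocate the missing tail directly into out[], zeroed */
	if (demo_try_bulk(n - demo_cached, true, &out[demo_cached]))
		n = demo_cached;
	if (demo_cached >= n)
		goto get;

	/* 3) still short: hand out only what the cache holds */
	total = demo_cached;
	n = demo_cached;
get:
	for (i = 0; i < n; i++)
		out[i] = demo_cache[demo_cached - n + i];
	demo_cached -= n;
	return total;
}
```

Starting from an empty cache and asking for 6 objects, the model returns
6 when either bulk call succeeds in the right place, 4 when only the
refill succeeds, and 0 when both fail.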

> 
> 
>> Fix this by returning a bool if the allocation succeeded and adding a
>> kerneldoc comment explaining the API.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   drivers/gpu/drm/msm/msm_iommu.c       |  6 +--
>>   drivers/gpu/drm/panthor/panthor_mmu.c | 12 +++---
>>   include/linux/slab.h                  |  6 ++-
>>   io_uring/io_uring.c                   | 23 +++++------
>>   lib/test_meminit.c                    | 19 +++++----
>>   mm/kasan/kasan_test_c.c               |  5 +--
>>   mm/kfence/kfence_test.c               |  9 +++--
>>   mm/slub.c                             | 58 +++++++++++++++------------
>>   net/bpf/test_run.c                    |  7 ++--
>>   net/core/skbuff.c                     | 24 ++++++-----
>>   tools/include/linux/slab.h            |  2 +-
>>   tools/testing/shared/linux.c          | 19 ++++-----
>>   12 files changed, 93 insertions(+), 97 deletions(-)
>>
> 
> [...]
> 
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 44ac121cfccb..73045b688385 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -288,11 +288,11 @@ static inline struct sk_buff
>> *napi_skb_cache_get(bool alloc)
>>         local_lock_nested_bh(&napi_alloc_cache.bh_lock);
>>       if (unlikely(!nc->skb_count)) {
>> -        if (alloc)
>> -            nc->skb_count =
>> kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> -                        GFP_ATOMIC | __GFP_NOWARN,
>> -                        NAPI_SKB_CACHE_BULK,
>> -                        nc->skb_cache);
>> +        if (alloc && kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> +                           GFP_ATOMIC | __GFP_NOWARN,
>> +                           NAPI_SKB_CACHE_BULK,
>> +                           nc->skb_cache))
>> +            nc->skb_count = NAPI_SKB_CACHE_BULK;
>>           if (unlikely(!nc->skb_count)) {
>>               local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
>>               return NULL;
>> @@ -353,16 +353,18 @@ u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
>>         /* No enough cached skbs. Try refilling the cache first */
>>       bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count,
>> NAPI_SKB_CACHE_BULK);
>> -    nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> -                           GFP_ATOMIC | __GFP_NOWARN, bulk,
>> -                           &nc->skb_cache[nc->skb_count]);
>> +    if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> +                  GFP_ATOMIC | __GFP_NOWARN, bulk,
>> +                  &nc->skb_cache[nc->skb_count]))
>> +        nc->skb_count += bulk;
>>       if (likely(nc->skb_count >= n))
>>           goto get;
>>         /* Still not enough. Bulk-allocate the missing part directly,
>> zeroed */
>> -    n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> -                   GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
>> -                   n - nc->skb_count, &skbs[nc->skb_count]);
>> +    if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> +                  GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
>> +                  n - nc->skb_count, &skbs[nc->skb_count]))
>> +        n = nc->skb_count;

kmem_cache_alloc_bulk() allocates `n - nc->skb_count`, but here you
assign `nc->skb_count` to n.
Ah wait,

n -= n - nc->skb_count
n = n - (n - nc->skb_count)
n = n - n + nc->skb_count
n = nc->skb_count

Correct :D

>>       if (likely(nc->skb_count >= n))
>>           goto get;

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> # skbuff

Thanks,
Olek

^ permalink raw reply

* Re: improve the kmem_cache_alloc_bulk API
From: Vlastimil Babka (SUSE) @ 2026-05-27 14:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Harry Yoo, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Jesper Dangaard Brouer,
	linux-arm-msm, dri-devel, freedreno, linux-kernel, linux-mm,
	io-uring, kasan-dev, bpf, netdev
In-Reply-To: <20260527122148.GA6838@lst.de>

On 5/27/26 14:21, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 11:11:31AM +0200, Vlastimil Babka (SUSE) wrote:
>> > value convention.  Fix that and add documentation.
>> > 
>> > Note that the few comments explaining it mention that the gfp flags
>> > must allow "spinning".  That's not really a term used in the memory
>> > allocator, is this supposed to mean "block" or "sleep"?
>> 
>> Page allocator now has alloc_pages_nolock() for when no spinning is
>> possible, and it uses ALLOC_TRYLOCK internally.
>> 
>> Slab has kmalloc_nolock() relying on that when it needs new pages.
> 
> The comment long predates that, and it isn't expressed using gfp flags,

Do we both mean this comment?

-/* Note that interrupts must be enabled when calling this function. */
+/*
+ * Note that interrupts must be enabled when calling this function and gfp
+ * flags must allow spinning.
+ */
 int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
                                  void **p)

commit 46dea1744498 ("slab: refill sheaves from all nodes") from this January.
Previously it was just interrupts enabled.

> but by requiring separate functions so I somehow doubt that was meant.

Yeah, it's expressed by the _nolock variants. But slab propagates it internally
by the gfp flags, and since 46dea1744498 it affects kmem_cache_alloc_bulk().

> But I could also not see why it would not support GFP_ATOMIC /
> GFP_NOWAIT allocation, so I might just be confused.

Yeah those are supported because they can spin, just not sleep.
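
To make the spin/sleep distinction concrete, here is a small userspace
model. The DEMO_* bit values and mask compositions are assumptions made
for illustration and are not copied from the kernel headers; the model
treats a mask with neither reclaim bit as the "no spinning" context,
and whether the real gfpflags_allow_spinning() tests one bit or both
should be checked against include/linux/gfp.h.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define DEMO_GFP_HIGH		0x01u
#define DEMO_GFP_DIRECT_RECLAIM	0x02u
#define DEMO_GFP_KSWAPD_RECLAIM	0x04u
#define DEMO_GFP_IO		0x08u
#define DEMO_GFP_FS		0x10u
#define DEMO_GFP_NOWARN		0x20u

#define DEMO_GFP_RECLAIM \
	(DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Assumed compositions of the common masks. */
#define DEMO_GFP_KERNEL	(DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_ATOMIC	(DEMO_GFP_HIGH | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_NOWAIT	(DEMO_GFP_KSWAPD_RECLAIM | DEMO_GFP_NOWARN)
/* Stand-in for what the *_nolock() paths use internally: no reclaim bit. */
#define DEMO_GFP_NOLOCK	0u

/* May the allocation sleep (enter direct reclaim)? */
static bool demo_allow_sleeping(unsigned int gfp)
{
	return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}

/* May the allocation spin on locks? Only false when no reclaim bit is set. */
static bool demo_allow_spinning(unsigned int gfp)
{
	return (gfp & DEMO_GFP_RECLAIM) != 0;
}
```

Under these assumptions the atomic and nowait masks may spin but not
sleep, the kernel mask may do both, and only the nolock mask forbids
spinning.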

^ permalink raw reply

* Re: [PATCH] mm/slab: improve kmem_cache_alloc_bulk
From: Christoph Hellwig @ 2026-05-27 14:07 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Jesper Dangaard Brouer, Christoph Hellwig, Vlastimil Babka,
	Harry Yoo, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-arm-msm, dri-devel,
	freedreno, linux-kernel, linux-mm, io-uring, kasan-dev, bpf,
	netdev, Matt Fleming, kernel-team
In-Reply-To: <777c7229-4285-4c7c-9340-dfaebd2ab291@intel.com>

On Wed, May 27, 2026 at 03:56:38PM +0200, Alexander Lobakin wrote:
> >> -    n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> >> -                   GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
> >> -                   n - nc->skb_count, &skbs[nc->skb_count]);
> >> +    if (kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> >> +                  GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
> >> +                  n - nc->skb_count, &skbs[nc->skb_count]))
> >> +        n = nc->skb_count;
> 
> kmem_cache_alloc_bulk() allocates `n - nc->skb_count`, but here you
> assign `nc->skb_count` to n.
> Ah wait,
> 
> n -= n - nc->skb_count
> n = n - (n - nc->skb_count)
> n = n - n + nc->skb_count
> n = nc->skb_count
> 
> Correct :D

Exactly the steps I went through when writing this patch :)


^ permalink raw reply

* [PATCH] io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item
From: Runyu Xiao @ 2026-05-27 14:37 UTC (permalink / raw)
  To: axboe, io-uring; +Cc: linux-kernel, gregkh, jianhao.xu, Runyu Xiao, stable

Commit bdf0bf73006e ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work
run loop") fixed the obvious case where io_worker_handle_work() took one
exit-bit snapshot before draining pending work, but the fix stops one
level too early.

io_worker_handle_work() now re-checks IO_WQ_BIT_EXIT in its outer work
run loop, yet it still snapshots that bit once before processing a
whole dependent linked-work chain. If io_wq_exit_start() sets
IO_WQ_BIT_EXIT after the first linked item has started, the remaining
linked items can still reuse stale do_kill = false, skip
IO_WQ_WORK_CANCEL, and continue running after exit has begun.

That means the previous fix did not fully eliminate the exit-latency
problem; it only narrowed it to linked chains. A long or slow linked
chain can still keep io-wq exit waiting for work that should already
have been canceled.

The issue was found on Linux v6.18.21 by our static-analysis tool,
which flagged linked-work loops that snapshot shared exit state
outside per-item cancel decisions, and was then confirmed by manual
auditing of io_worker_handle_work(). It was later reproduced with a
QEMU no-device validation selftest that preserved the same contract:
a three-node unbound linked chain, an exit actor setting
IO_WQ_BIT_EXIT after work1, and slow post-exit linked work. With a
3000 ms delay injected into each post-exit item, the buggy path
spends about 6066 ms after exit running work2/work3, while the fixed
path cancels both and finishes in about 2 ms.

Re-check test_bit(IO_WQ_BIT_EXIT, &wq->state) for each iteration of the
dependent-link loop, right before deciding whether to cancel the
current work item. That closes the remaining stale-snapshot window and
prevents linked post-exit work from stretching shutdown latency.
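
The difference between the two placements of the check can be shown
with a small userspace model. struct demo_work, demo_exit and the
demo_* functions are hypothetical and are not the io-wq code; the exit
flag is set by the first work item itself so that the result is
deterministic.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_work {
	struct demo_work *next;	/* next item in the linked chain */
	bool sets_exit;		/* running this item triggers "exit" */
	bool ran;
	bool cancelled;
};

static atomic_bool demo_exit;	/* stand-in for IO_WQ_BIT_EXIT */

static void demo_run_one(struct demo_work *w, bool do_kill)
{
	if (do_kill) {
		w->cancelled = true;
		return;
	}
	w->ran = true;
	if (w->sets_exit)
		atomic_store(&demo_exit, true);
}

/* Exit state sampled once per chain: later items reuse a stale value. */
static void demo_run_chain_snapshot(struct demo_work *w)
{
	bool do_kill = atomic_load(&demo_exit);

	for (; w; w = w->next)
		demo_run_one(w, do_kill);
}

/* Exit state sampled per item: items after the exit point are cancelled. */
static void demo_run_chain_recheck(struct demo_work *w)
{
	for (; w; w = w->next)
		demo_run_one(w, atomic_load(&demo_exit));
}
```

For a three-item chain whose first item sets the flag, the snapshot
variant runs all three items, while the re-check variant runs the first
and cancels the other two.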

Build-tested by compiling io_uring/io-wq.o on x86_64 with the local
.config. No special hardware was required.

Fixes: bdf0bf73006e ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 io_uring/io-wq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 49a9c914b4e9..28d81398ebee 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -601,7 +601,6 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;

 	do {
-		bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
 		struct io_wq_work *work;

 		/*
@@ -637,6 +636,7 @@ static void io_worker_handle_work(struct io_wq_acct *acct,

 		/* handle a whole dependent link */
 		do {
+			bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
 			struct io_wq_work *next_hashed, *linked;
 			unsigned int work_flags = atomic_read(&work->flags);
 			unsigned int hash = __io_wq_is_hashed(work_flags)
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH] block: Add bvec_folio()
From: Matthew Wilcox @ 2026-05-27 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <ahaNpbG15d6StT9d@infradead.org>

On Tue, May 26, 2026 at 11:22:29PM -0700, Christoph Hellwig wrote:
> On Tue, May 26, 2026 at 06:47:30PM +0100, Matthew Wilcox wrote:
> > How about:
> > 
> > /**
> >  * bvec_folio - Return the first folio referenced by this bvec
> >  * @bv: bvec to access
> >  *
> >  * bvecs can contain non-folio memory, so this should only be called by
> >  * the creator of the bvec; drivers have no business looking at the owner
> >  * of the memory.  It may not even be the right interface for the caller
> >  * to use as bvecs can span multiple folios.  You may be better off using
> >  * something like bio_for_each_folio_all() which iterates over all folios.
> >  */
> 
> Sounds good, although I'd captialize the first word in the sentence.
> (Not that anyone should follow my spelling advice in general)

I don't know how to capitalise bvec.  Is it Bvec?  BVec?

Fortunately my wife is an expert, and many years ago taught me that if
you have a difficult grammar problem, don't fix it, avoid it.

 * A bvec can contain non-folio memory, so this should only be called by

^ permalink raw reply

* Re: [PATCH] io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item
From: Jens Axboe @ 2026-05-27 16:03 UTC (permalink / raw)
  To: Runyu Xiao, io-uring; +Cc: linux-kernel, gregkh, jianhao.xu, stable
In-Reply-To: <20260527143726.1272269-1-runyu.xiao@seu.edu.cn>

On 5/27/26 8:37 AM, Runyu Xiao wrote:
> Commit bdf0bf73006e ("io_uring/io-wq: check IO_WQ_BIT_EXIT inside work
> run loop") fixed the obvious case where io_worker_handle_work() took one
> exit-bit snapshot before draining pending work, but the fix stops one
> level too early.
> 
> io_worker_handle_work() now re-checks IO_WQ_BIT_EXIT in its outer work
> run loop, yet it still snapshots that bit once before processing a
> whole dependent linked-work chain. If io_wq_exit_start() sets
> IO_WQ_BIT_EXIT after the first linked item has started, the remaining
> linked items can still reuse stale do_kill = false, skip
> IO_WQ_WORK_CANCEL, and continue running after exit has begun.
> 
> That means the previous fix did not fully eliminate the exit-latency
> problem; it only narrowed it to linked chains. A long or slow linked
> chain can still keep io-wq exit waiting for work that should already
> have been canceled.
> 
> The issue was found on Linux v6.18.21 by our static-analysis tool,
> which flagged linked-work loops that snapshot shared exit state
> outside per-item cancel decisions, and was then confirmed by manual
> auditing of io_worker_handle_work(). It was later reproduced with a
> QEMU no-device validation selftest that preserved the same contract:
> a three-node unbound linked chain, an exit actor setting
> IO_WQ_BIT_EXIT after work1, and slow post-exit linked work. With a
> 3000 ms delay injected into each post-exit item, the buggy path
> spends about 6066 ms after exit running work2/work3, while the fixed
> path cancels both and finishes in about 2 ms.
> 
> Re-check test_bit(IO_WQ_BIT_EXIT, &wq->state) for each iteration of the
> dependent-link loop, right before deciding whether to cancel the
> current work item. That closes the remaining stale-snapshot window and
> prevents linked post-exit work from stretching shutdown latency.

I think this change makes sense to further cut down on the time, but you
need to send it in for the _upstream_ kernel, stable only does backports
of those. E.g. if you send this one for current -git and mark it as fixing
the correct upstream commit (not the stable one) and add CC stable, then
it'll wind up in stable as well.

-- 
Jens Axboe

^ permalink raw reply
