Linux io-uring development


* [syzbot] [io-uring?] INFO: task hung in io_sq_thread_park (4)
From: syzbot @ 2026-05-26  2:49 UTC
  To: axboe, io-uring, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    45255ea1ca09 Merge tag 'pm-7.1-rc5' of git://git.kernel.or..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=12030d36580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8d24a1331e060dda
dashboard link: https://syzkaller.appspot.com/bug?extid=4be91bcb08eab9a156da
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17c2db96580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/55e9065ee7f2/disk-45255ea1.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/f53a442e25dd/vmlinux-45255ea1.xz
kernel image: https://storage.googleapis.com/syzbot-assets/ab16a4623640/bzImage-45255ea1.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+4be91bcb08eab9a156da@syzkaller.appspotmail.com

INFO: task kworker/u8:2:36 blocked for more than 143 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:2    state:D stack:22696 pid:36    tgid:36    ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: iou_exit io_ring_exit_work
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7340
 __mutex_lock_common kernel/locking/mutex.c:726 [inline]
 __mutex_lock+0x7f7/0x1550 kernel/locking/mutex.c:820
 io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
 io_ring_exit_work+0x2dd/0x980 io_uring/io_uring.c:2359
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
INFO: task kworker/u8:5:139 blocked for more than 145 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:5    state:D stack:24120 pid:139   tgid:139   ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: iou_exit io_ring_exit_work
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7340
 __mutex_lock_common kernel/locking/mutex.c:726 [inline]
 __mutex_lock+0x7f7/0x1550 kernel/locking/mutex.c:820
 io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
 io_ring_exit_work+0x2dd/0x980 io_uring/io_uring.c:2359
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
INFO: task kworker/u8:9:5810 blocked for more than 146 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:9    state:D stack:24248 pid:5810  tgid:5810  ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: iou_exit io_ring_exit_work
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7340
 __mutex_lock_common kernel/locking/mutex.c:726 [inline]
 __mutex_lock+0x7f7/0x1550 kernel/locking/mutex.c:820
 io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
 io_ring_exit_work+0x2dd/0x980 io_uring/io_uring.c:2359
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Showing all locks held in the system:
3 locks held by kworker/0:0/9:
 #0: ffff88813fe43140 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88813fe43140 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900000e7c40 (rx_mode_work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900000e7c40 (rx_mode_work){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffffffff8fdd1400 (rtnl_mutex){+.+.}-{4:4}, at: netdev_rx_mode_work+0x19/0x3c0 net/core/dev_addr_lists.c:1312
3 locks held by kworker/u8:1/13:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000127c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000127c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88802856dc68 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
1 lock held by khungtaskd/31:
 #0: ffffffff8e95cca0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 #0: ffffffff8e95cca0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #0: ffffffff8e95cca0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x2e/0x180 kernel/locking/lockdep.c:6775
3 locks held by kworker/u8:2/36:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000ac7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000ac7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888024d77468 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:3/47:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000b77c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000b77c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888033183068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u9:0/50:
 #0: ffff888060790940 ((wq_completion)hci11){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff888060790940 ((wq_completion)hci11){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90000ba7c40 ((work_completion)(&hdev->cmd_sync_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90000ba7c40 ((work_completion)(&hdev->cmd_sync_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88807ff38ea0 (&hdev->req_lock){+.+.}-{4:4}, at: hci_cmd_sync_work+0x1d3/0x400 net/bluetooth/hci_sync.c:331
3 locks held by kworker/u8:4/58:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900015f7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900015f7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88802902ec68 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:5/139:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90002e17c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90002e17c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88807d43c068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:7/1145:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900053efc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900053efc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff88807b88f068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
3 locks held by kworker/u8:8/3333:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc9000e61fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc9000e61fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff8880578a1468 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
1 lock held by udevd/4987:
 #0: ffff8880b863aea0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x31/0x150 kernel/sched/core.c:652
2 locks held by getty/5374:
 #0: ffff8880362670a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x25/0x70 drivers/tty/tty_ldisc.c:243
 #1: ffffc9000322b2e8 (&ldata->atomic_read_lock){+.+.}-{4:4}, at: n_tty_read+0x45c/0x13a0 drivers/tty/n_tty.c:2211
3 locks held by kworker/u8:9/5810:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900038c7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900038c7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888075ffdc68 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56
2 locks held by kworker/u8:10/5820:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc900038e7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc900038e7c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
2 locks held by iou-sqp-6349/6354:
1 lock held by iou-sqp-7229/7232:
2 locks held by iou-sqp-7262/7266:
1 lock held by iou-sqp-7452/7455:
2 locks held by iou-sqp-7518/7521:
2 locks held by kworker/u8:11/7547:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90003f87c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90003f87c40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
2 locks held by iou-sqp-7648/7649:
2 locks held by kworker/u8:12/7655:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc90003b1fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc90003b1fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
1 lock held by syz-executor/7715:
3 locks held by kworker/u8:13/7719:
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
 #0: ffff88801af44940 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
 #1: ffffc9000206fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
 #1: ffffc9000206fc40 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
 #2: ffff888057817068 (&sqd->lock){+.+.}-{4:4}, at: io_sq_thread_park+0x44/0x140 io_uring/sqpoll.c:56


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH] block: Add bvec_folio()
From: Matthew Wilcox @ 2026-05-25 13:29 UTC
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <ahPm4h2gKgyEEuvV@infradead.org>

On Sun, May 24, 2026 at 11:06:26PM -0700, Christoph Hellwig wrote:
> > +/**
> > + * bvec_folio - Return the first folio referenced by this bvec
> > + * @bv: bvec to access
> > + *
> > + * bvecs can span multiple folios.  Unless you know that this
> > + * bvec does not, you may be better off using something like
> > + * bio_for_each_folio_all() which iterates over all folios.
> > + */
> > +static inline struct folio *bvec_folio(const struct bio_vec *bv)
> > +{
> > +	return page_folio(bv->bv_page);
> > +}
> 
> The comment here is confusing.  bio_for_each_folio_all is a helper that
> only works in the submitter side, and not for anything using the
> bvec_iter required for drivers or anything else sitting below a
> potential bio clone/split or using bvecs from an upper layer (like
> ITER_BVEC direct I/O).  Additionally bv_page can be a different
> page than the first page due to large bv_offset on split bios.
> 
> So I'm not against the function per se, but the documentation must
> explain the minefields it is stepping into a bit better.

Lower level drivers shouldn't be concerning themselves with folios.
For a start, we can put non-folios (eg slab memory) into bvecs.
I'm happy to clarify this comment further, but I don't understand
who's going to look at this function and need to have more explanation.


* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Christoph Hellwig @ 2026-05-25  8:03 UTC
  To: demiobenour
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel, linux-crypto, linux-kernel, io-uring, netdev,
	linux-perf-users, linux-doc, Toke Høiland-Jørgensen,
	linux-api
In-Reply-To: <20260523-af-alg-harden-v1-1-c76755c3a5c5@gmail.com>

On Sat, May 23, 2026 at 03:43:02PM -0400, Demi Marie Obenour via B4 Relay wrote:
> From: Demi Marie Obenour <demiobenour@gmail.com>
> 
> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
> It can be removed entirely at the cost of only supporting synchronous
> operations.  This doesn't break userspace, which will silently block
> (for a bounded amount of time) in io_submit instead of operating
> asynchronously.
> 
> This also makes struct msghdr smaller, helping every other caller of
> sendmsg().

So we just had a discussion at LLC about how networking needs to support
AIO better for zero copy.

The current TCP zerocopy implementation provides completion notification
through the socket error code, which is freaking weird and doesn't
integrate well with either io_uring or in-kernel callers.

So we really want to pass the iocb down into networking and have it
call ki_complete on completion, with something higher up in the stack
adding that to the error queue for the legacy user interface.

Now I'm not sure if we wouldn't be better off passing that iocb
explicitly instead of in a weird hidden way, but this seemed like
a good place to bring this up.
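
To make the idea concrete, a minimal userspace sketch in C of passing
the iocb down explicitly and completing it through its callback. All
sketch_ names and structs are hypothetical stand-ins; only the shape of
the ki_complete(iocb, ret) callback is borrowed from struct kiocb, and
nothing here is an existing networking API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kiocb; only the callback is modeled. */
struct sketch_iocb {
	void (*ki_complete)(struct sketch_iocb *iocb, long ret);
	long result;	/* filled in by the completion handler */
	int done;	/* set to 1 once the completion has fired */
};

/* Hypothetical zero-copy send state carried down through the stack. */
struct sketch_zc_send {
	struct sketch_iocb *iocb;	/* passed explicitly by the caller */
	long bytes;			/* bytes covered by this send */
};

/* Handler an io_uring-like or in-kernel caller might install. */
static void sketch_record_completion(struct sketch_iocb *iocb, long ret)
{
	iocb->result = ret;
	iocb->done = 1;
}

/*
 * Called once the stack no longer references the user pages. Instead of
 * queueing a notification on the socket error queue, it completes the
 * iocb directly.
 */
static void sketch_zc_pages_released(struct sketch_zc_send *zc)
{
	assert(zc->iocb && zc->iocb->ki_complete);
	zc->iocb->ki_complete(zc->iocb, zc->bytes);
}
```

A caller would set iocb.ki_complete = sketch_record_completion, hand the
iocb down in a struct sketch_zc_send, and see done flip to 1 when
sketch_zc_pages_released() runs. A legacy error-queue interface could
then be layered on top as just another ki_complete handler.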


* Re: [PATCH v3 04/10] block: introduce dma map backed bio type
From: Pavel Begunkov @ 2026-05-25  7:29 UTC
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
	Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
	William Power, Phil Cayton, Jason Gunthorpe
In-Reply-To: <20260520083043.GA18893@lst.de>

On 5/20/26 09:30, Christoph Hellwig wrote:
> On Mon, May 18, 2026 at 11:29:54AM +0100, Pavel Begunkov wrote:
>>>>    	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
>>>>    	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
>>>> +	BIO_DMABUF_MAP, /* Using premmaped dma buffers */
>>>
>>> Shouldn't this be a REQ_ flag as we should never mix and match bios with
>>> and without this flag in a single request?
>>
>> Do you mean adding both and propagating it from bio to req? submit_bio()
>> takes a bio, so we still need to set it there before it reaches blk-mq.
>> And there might be bio-based drivers using it in the future.
> 
> I think I forgot to reply to this, so let's do this now.
> 
> REQ_ is actually used by both bios and requests, so if you set it in
> bio->bi_opf it will automatically get propagated to the request, but
> it can also always be tested on the bio, including by bio-based
> drivers.

Ah yes, good point, thanks

-- 
Pavel Begunkov


* Re: [PATCH] block: Add bvec_folio()
From: Christoph Hellwig @ 2026-05-25  6:06 UTC
  To: Matthew Wilcox (Oracle)
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <20260522182122.2489391-1-willy@infradead.org>

> +/**
> + * bvec_folio - Return the first folio referenced by this bvec
> + * @bv: bvec to access
> + *
> + * bvecs can span multiple folios.  Unless you know that this
> + * bvec does not, you may be better off using something like
> + * bio_for_each_folio_all() which iterates over all folios.
> + */
> +static inline struct folio *bvec_folio(const struct bio_vec *bv)
> +{
> +	return page_folio(bv->bv_page);
> +}

The comment here is confusing.  bio_for_each_folio_all is a helper that
only works in the submitter side, and not for anything using the
bvec_iter required for drivers or anything else sitting below a
potential bio clone/split or using bvecs from an upper layer (like
ITER_BVEC direct I/O).  Additionally bv_page can be a different
page than the first page due to large bv_offset on split bios.

So I'm not against the function per se, but the documentation must
explain the minefields it is stepping into a bit better.
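
As a rough illustration of the bv_offset point, here is a userspace
sketch in C. It assumes the page holding the first data byte is found by
advancing bv_page by bv_offset / PAGE_SIZE. The struct, the sketch_
names and the fixed 4 KiB page size are hypothetical stand-ins, not the
kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; a real kernel build supplies PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Hypothetical model of a bio_vec, with bv_page reduced to a frame number. */
struct sketch_bvec {
	size_t first_pfn;	/* models bv_page */
	unsigned int len;	/* models bv_len, in bytes */
	unsigned int offset;	/* models bv_offset, may exceed one page */
};

/* Page frame that actually holds the first data byte of the segment. */
static size_t sketch_first_data_pfn(const struct sketch_bvec *bv)
{
	return bv->first_pfn + (bv->offset >> SKETCH_PAGE_SHIFT);
}

/* Byte offset of the first data byte inside that page. */
static unsigned int sketch_offset_in_page(const struct sketch_bvec *bv)
{
	return bv->offset & (SKETCH_PAGE_SIZE - 1);
}

/* Number of pages the data touches, counted from the first data page. */
static size_t sketch_pages_spanned(const struct sketch_bvec *bv)
{
	assert(bv->len > 0);
	return (sketch_offset_in_page(bv) + bv->len + SKETCH_PAGE_SIZE - 1)
		>> SKETCH_PAGE_SHIFT;
}
```

With first_pfn = 100, offset = 2 * 4096 + 512 and len = 4096, the first
data byte sits in page frame 102 at offset 512 and the data touches two
pages. A helper that looks only at bv_page would report frame 100, which
may belong to a different folio than the data.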


* Re: [PATCH] io_uring/tctx: set ->io_uring before publishing the tctx node
From: Jens Axboe @ 2026-05-24 18:08 UTC
  To: io-uring, Lim HyeonJun
In-Reply-To: <20260524110853.115634-1-shja0831@gmail.com>


On Sun, 24 May 2026 20:08:53 +0900, Lim HyeonJun wrote:
> io_register_iowq_max_workers() walks ctx->tctx_list under ctx->tctx_lock
> and dereferences each node's task->io_uring without a NULL check:
> 
> 	list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
> 		tctx = node->task->io_uring;
> 		if (WARN_ON_ONCE(!tctx->io_wq))
> 			continue;
> 		...
> 	}
> 
> [...]

Applied, thanks!

[1/1] io_uring/tctx: set ->io_uring before publishing the tctx node
      commit: a88c02915d9c6160cfc7ab1b26ed64b2993e2b94

Best regards,
-- 
Jens Axboe


* Re: [PATCH 5.15.y] io_uring: prevent opcode speculation
From: Sasha Levin @ 2026-05-24 12:09 UTC
  To: stable, Pavel Begunkov
  Cc: Sasha Levin, Jens Axboe, Li Zetao, Robert Garcia, io-uring,
	linux-kernel
In-Reply-To: <20260520062833.2563847-1-rob_garcia@163.com>

Queued for 5.15, thanks.

-- 
Thanks,
Sasha


* Re: [PATCH 6.1.y] io_uring: prevent opcode speculation
From: Sasha Levin @ 2026-05-24 12:09 UTC
  To: stable, Pavel Begunkov
  Cc: Sasha Levin, Jens Axboe, Li Zetao, Robert Garcia, io-uring,
	linux-kernel
In-Reply-To: <20260521054919.87373-1-rob_garcia@163.com>

Queued for 6.1, thanks.

-- 
Thanks,
Sasha


* [PATCH] io_uring/tctx: set ->io_uring before publishing the tctx node
From: Lim HyeonJun @ 2026-05-24 11:08 UTC
  To: Jens Axboe, io-uring; +Cc: Lim HyeonJun
In-Reply-To: <2af95968-bcb3-4ed5-9242-3f8358e71f9e@kernel.dk>

io_register_iowq_max_workers() walks ctx->tctx_list under ctx->tctx_lock
and dereferences each node's task->io_uring without a NULL check:

	list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
		tctx = node->task->io_uring;
		if (WARN_ON_ONCE(!tctx->io_wq))
			continue;
		...
	}

__io_uring_add_tctx_node() installs the node into ctx->tctx_list (via
io_tctx_install_node(), which does the list_add() under tctx_lock) and
only assigns current->io_uring = tctx afterwards. A task doing its first
io_uring operation on a shared ring therefore has a window in which its
node is already visible on ctx->tctx_list while node->task->io_uring is
still NULL. A concurrent IORING_REGISTER_IOWQ_MAX_WORKERS on the same
ring reads that NULL and dereferences tctx->io_wq:

  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  RIP: io_register_iowq_max_workers io_uring/register.c:423

Publish current->io_uring = tctx before installing the node, so any node
visible on ctx->tctx_list always has a valid task->io_uring. The
tctx_lock taken in io_tctx_install_node() orders this store before the
node becomes visible to other iterators. On the install/limits failure
paths the freshly allocated tctx is freed, so clear current->io_uring
there as well to avoid leaving a dangling pointer.

The bug reproduces on an SMP+KASAN build with a plain (non-SQPOLL) ring
shared across threads: a stream of fresh threads each do their first
io_uring_enter() while two threads spam IORING_REGISTER_IOWQ_MAX_WORKERS;
it GPFs within seconds.
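
The ordering rule can be modeled outside the kernel. The following is a
minimal userspace sketch in C, assuming a pthread mutex in place of
ctx->tctx_lock and a singly linked list in place of ctx->tctx_list. All
sketch_ types and functions are hypothetical and only mirror the shape
of the fix, not the actual io_uring code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct sketch_tctx { int io_wq; };			/* models io_uring_task */
struct sketch_task { struct sketch_tctx *io_uring; };	/* models task_struct */
struct sketch_node {
	struct sketch_task *task;
	struct sketch_node *next;
};
struct sketch_ctx {
	pthread_mutex_t lock;		/* models ctx->tctx_lock */
	struct sketch_node *head;	/* models ctx->tctx_list */
};

/*
 * Publish task->io_uring first, then make the node visible under the
 * lock. The unlock after the list insertion releases the earlier store,
 * so a reader that takes the same lock afterwards observes it.
 */
static void sketch_add_node(struct sketch_ctx *ctx, struct sketch_task *task,
			    struct sketch_tctx *tctx, struct sketch_node *node)
{
	task->io_uring = tctx;
	node->task = task;

	pthread_mutex_lock(&ctx->lock);
	node->next = ctx->head;
	ctx->head = node;
	pthread_mutex_unlock(&ctx->lock);
}

/* Reader: walks the list and dereferences every task->io_uring. */
static int sketch_count_with_io_wq(struct sketch_ctx *ctx)
{
	struct sketch_node *node;
	int n = 0;

	pthread_mutex_lock(&ctx->lock);
	for (node = ctx->head; node; node = node->next) {
		/* Invariant provided by the ordering in sketch_add_node(). */
		assert(node->task->io_uring != NULL);
		if (node->task->io_uring->io_wq)
			n++;
	}
	pthread_mutex_unlock(&ctx->lock);
	return n;
}
```

If the two steps in sketch_add_node() were swapped, the reader could run
between them and hit the assertion, which corresponds to the NULL
dereference seen in io_register_iowq_max_workers().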

Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Signed-off-by: Lim HyeonJun <shja0831@gmail.com>
---
 io_uring/tctx.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 6af62ca9baba..42b219b34aa8 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -139,12 +139,14 @@ static int io_tctx_install_node(struct io_ring_ctx *ctx,
 int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 {
 	struct io_uring_task *tctx = current->io_uring;
+	bool new_tctx = false;
 	int ret;

 	if (unlikely(!tctx)) {
 		tctx = io_uring_alloc_task_context(current, ctx);
 		if (IS_ERR(tctx))
 			return PTR_ERR(tctx);
+		new_tctx = true;

 		if (data_race(ctx->int_flags) & IO_RING_F_IOWQ_LIMITS_SET) {
 			unsigned int limits[2];
@@ -168,13 +170,15 @@ int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 	if (tctx->io_wq)
 		io_wq_set_exit_on_idle(tctx->io_wq, false);

-	ret = io_tctx_install_node(ctx, tctx);
-	if (!ret) {
+	if (new_tctx)
 		current->io_uring = tctx;
+
+	ret = io_tctx_install_node(ctx, tctx);
+	if (!ret)
 		return 0;
-	}
-	if (!current->io_uring) {
 err_free:
+	if (new_tctx) {
+		current->io_uring = NULL;
 		if (tctx->io_wq) {
 			io_wq_exit_start(tctx->io_wq);
 			io_wq_put_and_exit(tctx->io_wq);
-- 
2.53.0


* [PATCH 0/3] AF_ALG: Remove support for AIO and old-style drivers
From: Demi Marie Obenour via B4 Relay @ 2026-05-23 19:43 UTC
  To: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel
  Cc: linux-crypto, linux-kernel, io-uring, netdev, linux-perf-users,
	linux-doc, Demi Marie Obenour

AF_ALG is a deprecated API only useful for compatibility with existing
userspace.  It has had a lot of vulnerabilities, including the infamous
CopyFail.

Rip out support for offload drivers, which tend to be buggy.  Also rip
out support for AIO, which actually bloats the entire socket subsystem.

Only compile-tested.

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
---
Demi Marie Obenour (3):
      net: Remove support for AIO on sockets
      AF_ALG: Drop support for off-CPU cryptography
      AF_ALG: Document that it is *always* slower

 Documentation/crypto/userspace-if.rst          | 26 ++++++++--
 crypto/af_alg.c                                | 35 ++------------
 crypto/algif_aead.c                            | 43 ++++-------------
 crypto/algif_hash.c                            |  4 +-
 crypto/algif_rng.c                             |  4 +-
 crypto/algif_skcipher.c                        | 66 ++++++--------------------
 include/crypto/if_alg.h                        | 19 ++++++--
 include/linux/socket.h                         |  1 -
 io_uring/net.c                                 |  1 -
 net/compat.c                                   |  1 -
 net/socket.c                                   |  7 +--
 tools/perf/trace/beauty/include/linux/socket.h |  1 -
 12 files changed, 70 insertions(+), 138 deletions(-)
---
base-commit: 49e05bb00f2e8168695f7af4d694c39e1423e8a2
change-id: 20260502-af-alg-harden-900849451653

Best regards,
-- 
Demi Marie Obenour <demiobenour@gmail.com>


* [PATCH 2/3] AF_ALG: Drop support for off-CPU cryptography
From: Demi Marie Obenour via B4 Relay @ 2026-05-23 19:43 UTC
  To: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel
  Cc: linux-crypto, linux-kernel, io-uring, netdev, linux-perf-users,
	linux-doc, Demi Marie Obenour
In-Reply-To: <20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com>

From: Demi Marie Obenour <demiobenour@gmail.com>

AF_ALG is deprecated and exposed to unprivileged userspace.  Only
use the least buggy algorithm implementations: the pure software ones.

This removes one of the main advantages of AF_ALG, which is the
ability to use it with off-CPU accelerators.  However, using off-CPU
accelerators has huge overheads, both in performance and attack surface.
I have yet to see real-world, performance-critical workloads where using
an accelerator via AF_ALG is actually a win over doing cryptography in
userspace.

If using an off-CPU accelerator really does turn out to be a win, a new
API should be developed that is actually a good fit for it.

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
---
 Documentation/crypto/userspace-if.rst |  7 ++++++-
 crypto/af_alg.c                       |  2 +-
 crypto/algif_aead.c                   |  4 ++--
 crypto/algif_hash.c                   |  4 ++--
 crypto/algif_rng.c                    |  4 ++--
 crypto/algif_skcipher.c               |  4 ++--
 include/crypto/if_alg.h               | 14 +++++++++++++-
 7 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/Documentation/crypto/userspace-if.rst b/Documentation/crypto/userspace-if.rst
index ea1b1b3f4049fd4673528dc2a6234f6376a3489f..b31117d4415dda6ad6ca36275e615bec7df9552e 100644
--- a/Documentation/crypto/userspace-if.rst
+++ b/Documentation/crypto/userspace-if.rst
@@ -9,7 +9,8 @@ symmetric cipher, AEAD, and RNG algorithms that are implemented in kernel-mode
 code.
 
 AF_ALG is insecure and is deprecated. Originally added to the kernel in 2010,
-most kernel developers now consider it to be a mistake.
+most kernel developers now consider it to be a mistake. Support for hardware
+accelerators, which was the original purpose of AF_ALG, has been removed.
 
 AF_ALG continues to be supported only for backwards compatibility. On systems
 where no programs using AF_ALG remain, the support for it should be disabled by
@@ -59,6 +60,10 @@ Some of the examples include:
 - CVE-2013-7421
 - CVE-2011-4081
 
+Hardware accelerator drivers are frequently buggy. To reduce attack surface,
+AF_ALG now only provides access to algorithms implemented in software. This
+means that AF_ALG no longer fulfills its original purpose.
+
 It is recommended that, whenever possible, userspace programs be migrated to
 userspace crypto code (which again, is what is normally used anyway) and
 ``CONFIG_CRYPTO_USER_API_*`` be disabled.  On systems that use SELinux, SELinux
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 8ccf7a737cd6ca9a5d5bf47050c9afea0dfd61bf..cce000e8590e469927b5a5a0ceccfdf0ef54633d 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -181,7 +181,7 @@ static int alg_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int add
 	if (IS_ERR(type))
 		return PTR_ERR(type);
 
-	private = type->bind(sa->salg_name, sa->salg_feat, sa->salg_mask);
+	private = type->bind(sa->salg_name);
 	if (IS_ERR(private)) {
 		module_put(type->owner);
 		return PTR_ERR(private);
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 60f06597cb0b13036bc975641a0b02ea8a41ad03..787aac8aeb24eed128f08345ba730478113919b3 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -342,9 +342,9 @@ static struct proto_ops algif_aead_ops_nokey = {
 	.poll		=	af_alg_poll,
 };
 
-static void *aead_bind(const char *name, u32 type, u32 mask)
+static void *aead_bind(const char *name)
 {
-	return crypto_alloc_aead(name, type, mask);
+	return crypto_alloc_aead(name, 0, AF_ALG_CRYPTOAPI_MASK);
 }
 
 static void aead_release(void *private)
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 4d3dfc60a16a6d8b677d903d209df18d67202c98..5452ad6c15069c3cb0ff78fe58868fe7ce4b0fc3 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -380,9 +380,9 @@ static struct proto_ops algif_hash_ops_nokey = {
 	.accept		=	hash_accept_nokey,
 };
 
-static void *hash_bind(const char *name, u32 type, u32 mask)
+static void *hash_bind(const char *name)
 {
-	return crypto_alloc_ahash(name, type, mask);
+	return crypto_alloc_ahash(name, 0, AF_ALG_CRYPTOAPI_MASK);
 }
 
 static void hash_release(void *private)
diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
index a9fb492e929a70c94476f296f5f5e7c42f0313b7..4dfe7899f8fa4ce82d5f2236297230fb44bc35d6 100644
--- a/crypto/algif_rng.c
+++ b/crypto/algif_rng.c
@@ -197,7 +197,7 @@ static struct proto_ops __maybe_unused algif_rng_test_ops = {
 	.sendmsg	=	rng_test_sendmsg,
 };
 
-static void *rng_bind(const char *name, u32 type, u32 mask)
+static void *rng_bind(const char *name)
 {
 	struct rng_parent_ctx *pctx;
 	struct crypto_rng *rng;
@@ -206,7 +206,7 @@ static void *rng_bind(const char *name, u32 type, u32 mask)
 	if (!pctx)
 		return ERR_PTR(-ENOMEM);
 
-	rng = crypto_alloc_rng(name, type, mask);
+	rng = crypto_alloc_rng(name, 0, AF_ALG_CRYPTOAPI_MASK);
 	if (IS_ERR(rng)) {
 		kfree(pctx);
 		return ERR_CAST(rng);
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index 9dbccabd87b13920c27aff5a450a235cc6a27d59..df20bdfe1f1f4e453782dee3b743dd1939ab4c6c 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -307,9 +307,9 @@ static struct proto_ops algif_skcipher_ops_nokey = {
 	.poll		=	af_alg_poll,
 };
 
-static void *skcipher_bind(const char *name, u32 type, u32 mask)
+static void *skcipher_bind(const char *name)
 {
-	return crypto_alloc_skcipher(name, type, mask);
+	return crypto_alloc_skcipher(name, 0, AF_ALG_CRYPTOAPI_MASK);
 }
 
 static void skcipher_release(void *private)
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 62867daca47d76c9ea1a7ed233188788c5f6c3c0..7643ba954125aba0c06aaf19de087985325885ad 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -41,7 +41,7 @@ struct af_alg_control {
 };
 
 struct af_alg_type {
-	void *(*bind)(const char *name, u32 type, u32 mask);
+	void *(*bind)(const char *name);
 	void (*release)(void *private);
 	int (*setkey)(void *private, const u8 *key, unsigned int keylen);
 	int (*setentropy)(void *private, sockptr_t entropy, unsigned int len);
@@ -243,4 +243,16 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
 		    struct af_alg_async_req *areq, size_t maxsize,
 		    size_t *outlen);
 
+/*
+ * Mask used to disable unsupported algorithm implementations.
+ *
+ * This is the same as FSCRYPT_CRYPTOAPI_MASK in fs/crypto/fscrypt_private.h.
+ * In addition to the motivations there, this API is exposed to userspace
+ * that might not be fully trusted.
+ */
+#define AF_ALG_CRYPTOAPI_MASK                             \
+	(CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY | \
+	 CRYPTO_ALG_KERN_DRIVER_ONLY)
+
+
 #endif	/* _CRYPTO_IF_ALG_H */

-- 
2.54.0




* [PATCH 3/3] AF_ALG: Document that it is *always* slower
From: Demi Marie Obenour via B4 Relay @ 2026-05-23 19:43 UTC
  To: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel
  Cc: linux-crypto, linux-kernel, io-uring, netdev, linux-perf-users,
	linux-doc, Demi Marie Obenour
In-Reply-To: <20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com>

From: Demi Marie Obenour <demiobenour@gmail.com>

Without support for zero-copy or off-CPU offloads, AF_ALG is always
slower than software cryptography. Its only advantage is that it might
save code size. However, this is largely mitigated by lightweight
userspace cryptographic libraries.

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
---
 Documentation/crypto/userspace-if.rst | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/Documentation/crypto/userspace-if.rst b/Documentation/crypto/userspace-if.rst
index b31117d4415dda6ad6ca36275e615bec7df9552e..ab93300c8e04524469f284704c7c5ed582fdcbc0 100644
--- a/Documentation/crypto/userspace-if.rst
+++ b/Documentation/crypto/userspace-if.rst
@@ -28,8 +28,8 @@ functionality than that. It actually provides access to all software algorithms.
 
 This includes arbitrary compositions of different algorithms created via a
 complex template system, as well as algorithms that only make sense as internal
-implementation details of other algorithms. It also includes full zero-copy
-support, which is difficult for the kernel to implement securely.
+implementation details of other algorithms. In the past, it also included full
+zero-copy support, which was difficult for the kernel to implement securely.
 
 Ultimately, these algorithms are just math computations. They use the same
 instructions that userspace programs already have access to, just accessed in a
@@ -38,6 +38,21 @@ much more convoluted and less efficient way.
 Indeed, userspace code is nearly always what is being used anyway. These same
 algorithms are widely implemented in userspace crypto libraries.
 
+Even when zero-copy and off-CPU accelerators were supported, AF_ALG was usually
+much slower than optimized software cryptography in userspace. This was
+especially true for the small message sizes usually seen in performance-critical
+workloads. While it was possible to demonstrate performance wins for hashing
+large files on embedded devices, it is hard to imagine a situation where this
+would be performance-critical.
+
+Nowadays, AF_ALG no longer supports zero-copy or off-CPU accelerators.
+Therefore, it is *always* slower than an optimized userspace implementation,
+even for large messages. The only possible advantage left is that it avoids
+duplicating code between kernel and userspace. However, userspace
+implementations, especially hardware-accelerated ones, do not need to be large.
+Just because OpenSSL is huge does not mean that all userspace cryptography
+libraries are.
+
 Meanwhile, AF_ALG hasn't been withstanding modern vulnerability discovery tools
 such as syzbot and large language models. It receives a steady stream of CVEs.
 Some of the examples include:

-- 
2.54.0




* [PATCH 1/3] net: Remove support for AIO on sockets
From: Demi Marie Obenour via B4 Relay @ 2026-05-23 19:43 UTC
  To: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel
  Cc: linux-crypto, linux-kernel, io-uring, netdev, linux-perf-users,
	linux-doc, Demi Marie Obenour
In-Reply-To: <20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com>

From: Demi Marie Obenour <demiobenour@gmail.com>

The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
It can be removed entirely at the cost of only supporting synchronous
operations.  This doesn't break userspace, which will silently block
(for a bounded amount of time) in io_submit instead of operating
asynchronously.

This also makes struct msghdr smaller, helping every other caller of
sendmsg().

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
---
 crypto/af_alg.c                                | 33 +-------------
 crypto/algif_aead.c                            | 39 ++++------------
 crypto/algif_skcipher.c                        | 62 +++++---------------------
 include/crypto/if_alg.h                        |  5 +--
 include/linux/socket.h                         |  1 -
 io_uring/net.c                                 |  1 -
 net/compat.c                                   |  1 -
 net/socket.c                                   |  7 +--
 tools/perf/trace/beauty/include/linux/socket.h |  1 -
 9 files changed, 25 insertions(+), 125 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 48c53f488e0fd30818e72439fe0c0d7e4cee1432..8ccf7a737cd6ca9a5d5bf47050c9afea0dfd61bf 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1085,35 +1085,6 @@ void af_alg_free_resources(struct af_alg_async_req *areq)
 }
 EXPORT_SYMBOL_GPL(af_alg_free_resources);
 
-/**
- * af_alg_async_cb - AIO callback handler
- * @data: async request completion data
- * @err: if non-zero, error result to be returned via ki_complete();
- *       otherwise return the AIO output length via ki_complete().
- *
- * This handler cleans up the struct af_alg_async_req upon completion of the
- * AIO operation.
- *
- * The number of bytes to be generated with the AIO operation must be set
- * in areq->outlen before the AIO callback handler is invoked.
- */
-void af_alg_async_cb(void *data, int err)
-{
-	struct af_alg_async_req *areq = data;
-	struct sock *sk = areq->sk;
-	struct kiocb *iocb = areq->iocb;
-	unsigned int resultlen;
-
-	/* Buffer size written by crypto operation. */
-	resultlen = areq->outlen;
-
-	af_alg_free_resources(areq);
-	sock_put(sk);
-
-	iocb->ki_complete(iocb, err ? err : (int)resultlen);
-}
-EXPORT_SYMBOL_GPL(af_alg_async_cb);
-
 /**
  * af_alg_poll - poll system call handler
  * @file: file pointer
@@ -1154,8 +1125,8 @@ struct af_alg_async_req *af_alg_alloc_areq(struct sock *sk,
 	struct af_alg_ctx *ctx = alg_sk(sk)->private;
 	struct af_alg_async_req *areq;
 
-	/* Only one AIO request can be in flight. */
-	if (ctx->inflight)
+	/* Only one request can be in flight. */
+	if (WARN_ON_ONCE(ctx->inflight))
 		return ERR_PTR(-EBUSY);
 
 	areq = sock_kmalloc(sk, areqlen, GFP_KERNEL);
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index c6c2ce21895dd7df51dc825ed886ba7e1aa37130..60f06597cb0b13036bc975641a0b02ea8a41ad03 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -197,37 +197,14 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
 	aead_request_set_ad(&areq->cra_u.aead_req, ctx->aead_assoclen);
 	aead_request_set_tfm(&areq->cra_u.aead_req, tfm);
 
-	if (msg->msg_iocb && !is_sync_kiocb(msg->msg_iocb)) {
-		/* AIO operation */
-		sock_hold(sk);
-		areq->iocb = msg->msg_iocb;
-
-		/* Remember output size that will be generated. */
-		areq->outlen = outlen;
-
-		aead_request_set_callback(&areq->cra_u.aead_req,
-					  CRYPTO_TFM_REQ_MAY_SLEEP,
-					  af_alg_async_cb, areq);
-		err = ctx->enc ? crypto_aead_encrypt(&areq->cra_u.aead_req) :
-				 crypto_aead_decrypt(&areq->cra_u.aead_req);
-
-		/* AIO operation in progress */
-		if (err == -EINPROGRESS)
-			return -EIOCBQUEUED;
-
-		sock_put(sk);
-	} else {
-		/* Synchronous operation */
-		aead_request_set_callback(&areq->cra_u.aead_req,
-					  CRYPTO_TFM_REQ_MAY_SLEEP |
-					  CRYPTO_TFM_REQ_MAY_BACKLOG,
-					  crypto_req_done, &ctx->wait);
-		err = crypto_wait_req(ctx->enc ?
-				crypto_aead_encrypt(&areq->cra_u.aead_req) :
-				crypto_aead_decrypt(&areq->cra_u.aead_req),
-				&ctx->wait);
-	}
-
+	aead_request_set_callback(&areq->cra_u.aead_req,
+				  CRYPTO_TFM_REQ_MAY_SLEEP |
+				  CRYPTO_TFM_REQ_MAY_BACKLOG,
+				  crypto_req_done, &ctx->wait);
+	err = crypto_wait_req(ctx->enc ?
+			crypto_aead_encrypt(&areq->cra_u.aead_req) :
+			crypto_aead_decrypt(&areq->cra_u.aead_req),
+			&ctx->wait);
 
 free:
 	af_alg_free_resources(areq);
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index ba0a17fd95aca22aa58ebf510c7d9b5f0cea2c2e..9dbccabd87b13920c27aff5a450a235cc6a27d59 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -79,20 +79,6 @@ static int algif_skcipher_export(struct sock *sk, struct skcipher_request *req)
 	return err;
 }
 
-static void algif_skcipher_done(void *data, int err)
-{
-	struct af_alg_async_req *areq = data;
-	struct sock *sk = areq->sk;
-
-	if (err)
-		goto out;
-
-	err = algif_skcipher_export(sk, &areq->cra_u.skcipher_req);
-
-out:
-	af_alg_async_cb(data, err);
-}
-
 static int _skcipher_recvmsg(struct socket *sock, struct msghdr *msg,
 			     size_t ignored, int flags)
 {
@@ -171,43 +157,19 @@ static int _skcipher_recvmsg(struct socket *sock, struct msghdr *msg,
 		cflags |= CRYPTO_SKCIPHER_REQ_CONT;
 	}
 
-	if (msg->msg_iocb && !is_sync_kiocb(msg->msg_iocb)) {
-		/* AIO operation */
-		sock_hold(sk);
-		areq->iocb = msg->msg_iocb;
+	skcipher_request_set_callback(&areq->cra_u.skcipher_req,
+				      cflags |
+				      CRYPTO_TFM_REQ_MAY_SLEEP |
+				      CRYPTO_TFM_REQ_MAY_BACKLOG,
+				      crypto_req_done, &ctx->wait);
+	err = crypto_wait_req(ctx->enc ?
+		crypto_skcipher_encrypt(&areq->cra_u.skcipher_req) :
+		crypto_skcipher_decrypt(&areq->cra_u.skcipher_req),
+					 &ctx->wait);
 
-		/* Remember output size that will be generated. */
-		areq->outlen = len;
-
-		skcipher_request_set_callback(&areq->cra_u.skcipher_req,
-					      cflags |
-					      CRYPTO_TFM_REQ_MAY_SLEEP,
-					      algif_skcipher_done, areq);
-		err = ctx->enc ?
-			crypto_skcipher_encrypt(&areq->cra_u.skcipher_req) :
-			crypto_skcipher_decrypt(&areq->cra_u.skcipher_req);
-
-		/* AIO operation in progress */
-		if (err == -EINPROGRESS)
-			return -EIOCBQUEUED;
-
-		sock_put(sk);
-	} else {
-		/* Synchronous operation */
-		skcipher_request_set_callback(&areq->cra_u.skcipher_req,
-					      cflags |
-					      CRYPTO_TFM_REQ_MAY_SLEEP |
-					      CRYPTO_TFM_REQ_MAY_BACKLOG,
-					      crypto_req_done, &ctx->wait);
-		err = crypto_wait_req(ctx->enc ?
-			crypto_skcipher_encrypt(&areq->cra_u.skcipher_req) :
-			crypto_skcipher_decrypt(&areq->cra_u.skcipher_req),
-						 &ctx->wait);
-
-		if (!err)
-			err = algif_skcipher_export(
-				sk, &areq->cra_u.skcipher_req);
-	}
+	if (!err)
+		err = algif_skcipher_export(
+			sk, &areq->cra_u.skcipher_req);
 
 free:
 	af_alg_free_resources(areq);
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 0cc8fa749f68d2356789f72771c9e550b79e0b3d..62867daca47d76c9ea1a7ed233188788c5f6c3c0 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -80,7 +80,6 @@ struct af_alg_rsgl {
 
 /**
  * struct af_alg_async_req - definition of crypto request
- * @iocb:		IOCB for AIO operations
  * @sk:			Socket the request is associated with
  * @first_rsgl:		First RX SG
  * @last_rsgl:		Pointer to last RX SG
@@ -92,7 +91,6 @@ struct af_alg_rsgl {
  * @cra_u:		Cipher request
  */
 struct af_alg_async_req {
-	struct kiocb *iocb;
 	struct sock *sk;
 
 	struct af_alg_rsgl first_rsgl;
@@ -138,7 +136,7 @@ struct af_alg_async_req {
  * @write:		True if we are in the middle of a write.
  * @init:		True if metadata has been sent.
  * @len:		Length of memory allocated for this data structure.
- * @inflight:		Non-zero when AIO requests are in flight.
+ * @inflight:		Non-zero when requests are in flight, for debugging only.
  */
 struct af_alg_ctx {
 	struct list_head tsgl_list;
@@ -237,7 +235,6 @@ int af_alg_wait_for_data(struct sock *sk, unsigned flags, unsigned min);
 int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		   unsigned int ivsize);
 void af_alg_free_resources(struct af_alg_async_req *areq);
-void af_alg_async_cb(void *data, int err);
 __poll_t af_alg_poll(struct file *file, struct socket *sock,
 			 poll_table *wait);
 struct af_alg_async_req *af_alg_alloc_areq(struct sock *sk,
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ec4a0a0257939a5363c55bed3ccb20182965b2e3..3ffdfe184b23d0a739e095407e956885d116c299 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -89,7 +89,6 @@ struct msghdr {
 	bool		msg_get_inq : 1;/* return INQ after receive */
 	unsigned int	msg_flags;	/* flags on received message */
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
-	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
 	struct ubuf_info *msg_ubuf;
 	int (*sg_from_iter)(struct sk_buff *skb,
 			    struct iov_iter *from, size_t length);
diff --git a/io_uring/net.c b/io_uring/net.c
index 30cd22c0b934b97ce6e265756b24daca7d398361..22100933966af547dfe6a52e69fc6882b4197234 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -771,7 +771,6 @@ static int io_recvmsg_prep_setup(struct io_kiocb *req)
 		kmsg->msg.msg_control = NULL;
 		kmsg->msg.msg_get_inq = 1;
 		kmsg->msg.msg_controllen = 0;
-		kmsg->msg.msg_iocb = NULL;
 		kmsg->msg.msg_ubuf = NULL;
 
 		if (req->flags & REQ_F_BUFFER_SELECT)
diff --git a/net/compat.c b/net/compat.c
index 2c9bd0edac997bc8c6ebd1bc8b92d8437ff32ea4..d68cf9c3aad5f7f1de84edbfffcf99d71e89292a 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -75,7 +75,6 @@ int __get_compat_msghdr(struct msghdr *kmsg,
 	if (msg->msg_iovlen > UIO_MAXIOV)
 		return -EMSGSIZE;
 
-	kmsg->msg_iocb = NULL;
 	kmsg->msg_ubuf = NULL;
 	return 0;
 }
diff --git a/net/socket.c b/net/socket.c
index 22a412fdec079cf8fd829a15236de9daea09d2f2..9785363858cef0c4e6f0efc45b17c3d2add5a53c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1213,8 +1213,7 @@ static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
 	struct socket *sock = file->private_data;
-	struct msghdr msg = {.msg_iter = *to,
-			     .msg_iocb = iocb};
+	struct msghdr msg = {.msg_iter = *to};
 	ssize_t res;
 
 	if (file->f_flags & O_NONBLOCK || (iocb->ki_flags & IOCB_NOWAIT))
@@ -1235,8 +1234,7 @@ static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct socket *sock = file->private_data;
-	struct msghdr msg = {.msg_iter = *from,
-			     .msg_iocb = iocb};
+	struct msghdr msg = {.msg_iter = *from};
 	ssize_t res;
 
 	if (iocb->ki_pos != 0)
@@ -2612,7 +2610,6 @@ int __copy_msghdr(struct msghdr *kmsg,
 	if (msg->msg_iovlen > UIO_MAXIOV)
 		return -EMSGSIZE;
 
-	kmsg->msg_iocb = NULL;
 	kmsg->msg_ubuf = NULL;
 	return 0;
 }
diff --git a/tools/perf/trace/beauty/include/linux/socket.h b/tools/perf/trace/beauty/include/linux/socket.h
index ec715ad4bf25f5f759d2cab3c6b796fed84df932..2a0a50fd66f41589f2699f7288a143873ce1bba6 100644
--- a/tools/perf/trace/beauty/include/linux/socket.h
+++ b/tools/perf/trace/beauty/include/linux/socket.h
@@ -89,7 +89,6 @@ struct msghdr {
 	bool		msg_get_inq : 1;/* return INQ after receive */
 	unsigned int	msg_flags;	/* flags on received message */
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
-	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
 	struct ubuf_info *msg_ubuf;
 	int (*sg_from_iter)(struct sk_buff *skb,
 			    struct iov_iter *from, size_t length);

-- 
2.54.0




* Re: [PATCH AUTOSEL 7.0] io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
From: Sasha Levin @ 2026-05-23 15:06 UTC
  To: Jens Axboe
  Cc: patches, stable, Maoyi Xie, Pavel Begunkov, Maoyi Xie, io-uring,
	linux-kernel
In-Reply-To: <afe1ad86-3454-4092-88d0-bd9753a1b2c8@kernel.dk>

On Sat, May 23, 2026 at 08:55:43AM -0600, Jens Axboe wrote:
>On 5/23/26 8:45 AM, Sasha Levin wrote:
>> The volume of mails and patches makes it really difficult to give
>> prompt answers here. I have no idea if
>> 9cc6bac1bebf8310d2950d1411a91479e86d69a1 applies cleanly, whether I
>> need to ask for a backport, or whether I should just drop
>> 45d2b37a37ab9848 until I sit down and get to this batch of AUTOSEL
>> commits.
>
>If you can't handle basic replies when running AUTOSEL, then I don't
>think you should have that process in the first place.

You know, you're probably right. I'll just take a break from AUTOSEL for now.

-- 
Thanks,
Sasha


* Re: [PATCH AUTOSEL 7.0] io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
From: Jens Axboe @ 2026-05-23 14:55 UTC
  To: Sasha Levin
  Cc: patches, stable, Maoyi Xie, Pavel Begunkov, Maoyi Xie, io-uring,
	linux-kernel
In-Reply-To: <ahG9meYUQ-YLDwHN@laps>

On 5/23/26 8:45 AM, Sasha Levin wrote:
> On Sat, May 23, 2026 at 08:23:13AM -0600, Jens Axboe wrote:
>> On 5/20/26 5:40 AM, Jens Axboe wrote:
>>> On 5/20/26 5:18 AM, Sasha Levin wrote:
>>>> From: Maoyi Xie <maoyixie.tju@gmail.com>
>>>>
>>>> [ Upstream commit 45d2b37a37ab98484693533496395c610a2cab96 ]
>>>>
>>>> io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
>>>> timespec from the caller via ext_arg->ts. It arms an ABS mode
>>>> hrtimer in __io_cqring_wait_schedule(). The conversion path in
>>>> io_uring/wait.c parses ext_arg->ts inline rather than going
>>>> through io_parse_user_time(). It therefore does not pick up the
>>>> time namespace conversion added by the previous patch.
>>>
>>> Once again - If you auto-pick this one, please also do the other one in
>>> the series, 9cc6bac1bebf8310d2950d1411a91479e86d69a1. Makes no sense to
>>> do just one of them.
>>
>> And once again, no reply. What is going on with stable these days?
> 
> Jens, as I've mentioned in the previous mail, I handle the AUTOSEL
> mails weeks after I originally sent them out for reviews.

And you think that's working fine? I would suggest that's a terrible
process. How are maintainers supposed to deal with that? Patches x and y
are autoselected and an email is sent out. Maintainers react to that,
either saying "no don't pick X" or "if you pick Y, please also do Z".
The expectation would then be a reply that says "ok, doing that" or
whatever might be appropriate there. Instead, it's just silence. And now
I have to follow up MULTIPLE times to ensure the right thing is being
done. We're about 2 weeks into this particular incident, and
hilariously, I still have no idea what the state is on your end. Did it
get dropped? Did the other one I asked for get picked up? Nobody knows!

At least Greg actually promptly replies for the non-autosel stuff he
does. Which is the ONLY thing that makes Fixes tags and CC stable
actually work. The AUTOSEL stuff, it does not. When it happens to pick
the right patches, yeah all is good. But when there's a problem, the
process is terrible, as evidenced by this particular patch.

> The volume of mails and patches makes it really difficult to give
> prompt answers here. I have no idea if
> 9cc6bac1bebf8310d2950d1411a91479e86d69a1 applies cleanly, whether I
> need to ask for a backport, or whether I should just drop
> 45d2b37a37ab9848 until I sit down and get to this batch of AUTOSEL
> commits.

If you can't handle basic replies when running AUTOSEL, then I don't
think you should have that process in the first place.

> If this process doesn't work well for you, I'm happy to skip all
> non-stable-tagged commits for io_uring. This is supposed to be only a
> best effort attempt to catch commits that slipped through the cracks.

Please don't do AUTOSEL for any patches for any subsystem that I am a
maintainer or co-maintainer of. Until this part of the stable tree
process can be improved, it's a net negative.

-- 
Jens Axboe


* Re: [PATCH AUTOSEL 7.0] io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
From: Sasha Levin @ 2026-05-23 14:45 UTC
  To: Jens Axboe
  Cc: patches, stable, Maoyi Xie, Pavel Begunkov, Maoyi Xie, io-uring,
	linux-kernel
In-Reply-To: <8e853555-604e-46e5-8e25-a5f80b88e51c@kernel.dk>

On Sat, May 23, 2026 at 08:23:13AM -0600, Jens Axboe wrote:
>On 5/20/26 5:40 AM, Jens Axboe wrote:
>> On 5/20/26 5:18 AM, Sasha Levin wrote:
>>> From: Maoyi Xie <maoyixie.tju@gmail.com>
>>>
>>> [ Upstream commit 45d2b37a37ab98484693533496395c610a2cab96 ]
>>>
>>> io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
>>> timespec from the caller via ext_arg->ts. It arms an ABS mode
>>> hrtimer in __io_cqring_wait_schedule(). The conversion path in
>>> io_uring/wait.c parses ext_arg->ts inline rather than going
>>> through io_parse_user_time(). It therefore does not pick up the
>>> time namespace conversion added by the previous patch.
>>
>> Once again - If you auto-pick this one, please also do the other one in
>> the series, 9cc6bac1bebf8310d2950d1411a91479e86d69a1. Makes no sense to
>> do just one of them.
>
>And once again, no reply. What is going on with stable these days?

Jens, as I've mentioned in the previous mail, I handle the AUTOSEL mails weeks
after I originally sent them out for reviews.

The volume of mails and patches makes it really difficult to give prompt
answers here. I have no idea if 9cc6bac1bebf8310d2950d1411a91479e86d69a1
applies cleanly, whether I need to ask for a backport, or whether I should just
drop 45d2b37a37ab9848 until I sit down and get to this batch of AUTOSEL
commits.

If this process doesn't work well for you, I'm happy to skip all
non-stable-tagged commits for io_uring. This is supposed to be only a best
effort attempt to catch commits that slipped through the cracks.

-- 
Thanks,
Sasha


* Re: [PATCH AUTOSEL 7.0] io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
From: Jens Axboe @ 2026-05-23 14:23 UTC
  To: Sasha Levin, patches, stable
  Cc: Maoyi Xie, Pavel Begunkov, Maoyi Xie, io-uring, linux-kernel
In-Reply-To: <5a50c3f5-a5ef-4b2b-821c-5858d8b1ac13@kernel.dk>

On 5/20/26 5:40 AM, Jens Axboe wrote:
> On 5/20/26 5:18 AM, Sasha Levin wrote:
>> From: Maoyi Xie <maoyixie.tju@gmail.com>
>>
>> [ Upstream commit 45d2b37a37ab98484693533496395c610a2cab96 ]
>>
>> io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
>> timespec from the caller via ext_arg->ts. It arms an ABS mode
>> hrtimer in __io_cqring_wait_schedule(). The conversion path in
>> io_uring/wait.c parses ext_arg->ts inline rather than going
>> through io_parse_user_time(). It therefore does not pick up the
>> time namespace conversion added by the previous patch.
> 
> Once again - If you auto-pick this one, please also do the other one in
> the series, 9cc6bac1bebf8310d2950d1411a91479e86d69a1. Makes no sense to
> do just one of them.

And once again, no reply. What is going on with stable these days?

-- 
Jens Axboe



* Re: [bug] io_uring : NULL pointer deref in io_register_iowq_max_workers()
From: Jens Axboe @ 2026-05-23 13:54 UTC
  To: 시리얼, io-uring
In-Reply-To: <CACR30Wj7yEweYqJg4Ovrbr4s9a8EZRYD8FMAWhjWUv3XunrMFQ@mail.gmail.com>

On 5/23/26 6:00 AM, 시리얼 wrote:
> First, I'm not good at English, so my grammar might be weird.
> I also use a translator, so it may not look natural.
> 
> I found a NULL pointer dereference (general protection fault under
> KASAN) in io_register_iowq_max_workers() on 7.1.0-rc1. It is a race
> between the IORING_REGISTER_IOWQ_MAX_WORKERS propagation loop and a
> task installing its first io_uring task context (tctx) node on a
> shared ring. A small multithreaded reproducer triggers it reliably.
> 
> syzkaller log:
> 
> Oops: general protection fault, probably for non-canonical address
> 0xdffffc0000000003: 0000 [#1] SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> CPU: 1 UID: 0 PID: 230570 Comm: syz.1.42039 Not tainted 7.1.0-rc1 #1
> PREEMPT(full)
> Hardware name: QEMU Ubuntu 26.04 PC (i440FX + PIIX, 1996), BIOS
> 1.17.0-debian-1.17.0-1ubuntu1 04/01/2014
> RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
> RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
> RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
> Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
> 3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
> 3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
> RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
> RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
> RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
> R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
> FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000005ea81000 CR4: 0000000000352ef0
> Call Trace:
>  <TASK>
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0xff/0xf80 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7ff8543b85fd
> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48
> 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007ff8525edff8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
> RAX: ffffffffffffffda RBX: 00007ff854645fa0 RCX: 00007ff8543b85fd
> RDX: 0000200000000040 RSI: 0000000000000013 RDI: 0000000000000003
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007ffdb063f3d0 R14: 00007ff8525eece4 R15: 00007ffdb063f4c7
>  </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
> RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
> RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
> Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
> 3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
> 3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
> RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
> RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
> RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
> R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
> FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f1280e130b0 CR3: 000000005ea81000 CR4: 0000000000352ef0
> ----------------
> Code disassembly (best guess):
>    0: bd 68 09 00 00       mov    $0x968,%ebp
>    5: 48 89 fa             mov    %rdi,%rdx
>    8: 48 c1 ea 03           shr    $0x3,%rdx
>    c: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1)
>   11: 74 05                 je     0x18
>   13: e8 06 3a 40 01       call   0x1403a1e
>   18: 48 8b ad 68 09 00 00 mov    0x968(%rbp),%rbp
>   1f: 48 8d 7d 18           lea    0x18(%rbp),%rdi
>   23: 48 89 fa             mov    %rdi,%rdx
>   26: 48 c1 ea 03           shr    $0x3,%rdx
> * 2a: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1) <-- trapping instruction
>   2f: 74 05                 je     0x36
>   31: e8 e8 39 40 01       call   0x1403a1e
>   36: 48 8b 6d 18           mov    0x18(%rbp),%rbp
>   3a: 48 85 ed             test   %rbp,%rbp
>   3d: 0f                   .byte 0xf
>   3e: 85 ec                 test   %ebp,%esp
> 
> 
> <<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>
> 
> This bug is in io_register_iowq_max_workers():
> 
> mutex_lock(&ctx->tctx_lock);
> list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
>     tctx = node->task->io_uring;
>     if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
>         continue;
>     // skip
> }
> 
> propagates the limit to all registered users (non-SQPOLL path)
> 
> The node is published into ctx->tctx_list before node->task->io_uring
> is set (io_uring/tctx.c):
> 
> io_tctx_install_node():
>     node->task = current;
>     mutex_lock(&ctx->tctx_lock);
>     list_add(&node->ctx_node, &ctx->tctx_list);   // node visible
>     mutex_unlock(&ctx->tctx_lock); // lock dropped
> 
> __io_uring_add_tctx_node():
>     ret = io_tctx_install_node(ctx, tctx);
>     if (!ret)
>         current->io_uring = tctx;   // set AFTER, outside lock
> 
> There is a window where a node is on ctx->tctx_list while
> node->task->io_uring is still NULL (the task is doing its first
> io_uring op, tctx freshly allocated, not yet published). A concurrent
> IORING_REGISTER_IOWQ_MAX_WORKERS on the same ring takes
> ctx->tctx_lock, iterates, reads node->task->io_uring == NULL, and
> dereferences tctx->io_wq → GPF.
>
> The other two ctx->tctx_list consumers already guard this: cancel.c
> io_async_cancel_one() and io_uring_try_cancel_iowq() both do if (!tctx
> || !tctx->io_wq). io_register_iowq_max_workers() is the only consumer
> that omits the !tctx check, so this is simply a missing guard.
> 
> Reproducer
> 
> Plain (non-SQPOLL) ring shared across threads. A stream of fresh
> threads each do their first io_uring_enter() (hits the window) while
> two threads spam IORING_REGISTER_IOWQ_MAX_WORKERS. GPFs within
> seconds-to-minutes on SMP+KASAN.
> 
> #define _GNU_SOURCE
> #include <pthread.h>
> #include <string.h>
> #include <sys/syscall.h>
> #include <linux/io_uring.h>
> static int ring_fd;
> static long setup(unsigned e, struct io_uring_params *p){ return
> syscall(__NR_io_uring_setup, e, p); }
> static long enter(int fd, unsigned ts){ return
> syscall(__NR_io_uring_enter, fd, ts, 0, 0, (void*)0, (size_t)0); }
> static long reg(int fd, unsigned op, void *a, unsigned n){ return
> syscall(__NR_io_uring_register, fd, op, a, n); }
> static void *fresh(void *x){ enter(ring_fd, 1); return 0; }   // first op -> window
> static void *spam(void *x){ unsigned c[2]={1,1}; for(;;) reg(ring_fd,
> IORING_REGISTER_IOWQ_MAX_WORKERS, c, 2); return 0; }
> int main(void){
>     struct io_uring_params p; memset(&p,0,sizeof(p));
>     ring_fd = setup(8, &p);
>     pthread_t s; pthread_create(&s,0,spam,0); pthread_create(&s,0,spam,0);
>     for(;;){ pthread_t t[64];
>         for(int i=0;i<64;i++) pthread_create(&t[i],0,fresh,0);
>         for(int i=0;i<64;i++) pthread_join(t[i],0); }
> }
> 
> Reproduced on 7.1.0-rc1 with KASAN; the racy ordering predates the
> 2024 shadow-variable cleanup that last touched register.c:422.
> 
> Suggested fix
> 
> Either make io_register_iowq_max_workers() match its siblings:
> 
> before:
> 
> mutex_lock(&ctx->tctx_lock);
> list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
>     tctx = node->task->io_uring;
>     if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
>         continue;
>     // skip
> }
> 
> to:
> 
> mutex_lock(&ctx->tctx_lock);
> list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
>     tctx = node->task->io_uring;
>     if (!tctx || !tctx->io_wq)
>         continue;
>     // skip
> }
> 
> or close the window in __io_uring_add_tctx_node() by publishing
> current->io_uring = tctx before the node is added to ctx->tctx_list,
> so a listed node always has a valid task->io_uring.

Setting ->io_uring = tctx before adding to the list is, by far, the
better fix. Rather than just report it, do you want to submit an actual
patch for that? I can surely patch it up myself, but you could also just
send a patch for it.

-- 
Jens Axboe


* Re: [bug] io_uring : NULL pointer deref in io_register_iowq_max_workers()
From: 시리얼 @ 2026-05-23 12:23 UTC (permalink / raw)
  To: io-uring, axboe
In-Reply-To: <CACR30WjNLCzzFTvdtM-kHC5Ei+W6H6Jeq3WpVh-asm0Ju1zv_Q@mail.gmail.com>

Sorry, I made a mistake in my follow-up.

Before that reply, I had checked the current Linus tree and confirmed
that this code path was still present there. Later, I checked again in
my local directory with git, did not see it there, and sent the
correction too quickly.

I checked again, and the same ordering is in the current Linus tree as
well: the node is added to ctx->tctx_list before current->io_uring =
tctx, and io_register_iowq_max_workers() still dereferences
tctx->io_wq without a !tctx check.

Sorry for the confusion.


On Sat, May 23, 2026 at 9:10 PM, 시리얼 <shja0831@gmail.com> wrote:
>
> I checked again: the missing `!tctx` check in
> io_register_iowq_max_workers() is older, but the specific race window
> I described appears to come from the recent tctx refactor.
>
> On Sat, May 23, 2026 at 9:00 PM, 시리얼 <shja0831@gmail.com> wrote:
> >
> > First, I'm not good at English, so my grammar might be weird,
> > and I use a translator, so it may not look natural.
> >
> > I found a NULL-pointer dereference (general protection fault under
> > KASAN) in io_register_iowq_max_workers() on 7.1.0-rc1. It is a race
> > between the IORING_REGISTER_IOWQ_MAX_WORKERS propagation loop and a
> > task installing its first io_uring task context (tctx) node on a
> > shared ring. A small multithreaded reproducer triggers it reliably.
> >
> > syzkaller log:
> >
> > Oops: general protection fault, probably for non-canonical address
> > 0xdffffc0000000003: 0000 [#1] SMP KASAN NOPTI
> > KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> > CPU: 1 UID: 0 PID: 230570 Comm: syz.1.42039 Not tainted 7.1.0-rc1 #1
> > PREEMPT(full)
> > Hardware name: QEMU Ubuntu 26.04 PC (i440FX + PIIX, 1996), BIOS
> > 1.17.0-debian-1.17.0-1ubuntu1 04/01/2014
> > RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
> > RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
> > RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
> > Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
> > 3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
> > 3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
> > RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
> > RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
> > RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> > R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
> > R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
> > FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000000 CR3: 000000005ea81000 CR4: 0000000000352ef0
> > Call Trace:
> >  <TASK>
> >  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> >  do_syscall_64+0xff/0xf80 arch/x86/entry/syscall_64.c:94
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7ff8543b85fd
> > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48
> > 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
> > 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007ff8525edff8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
> > RAX: ffffffffffffffda RBX: 00007ff854645fa0 RCX: 00007ff8543b85fd
> > RDX: 0000200000000040 RSI: 0000000000000013 RDI: 0000000000000003
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
> > R13: 00007ffdb063f3d0 R14: 00007ff8525eece4 R15: 00007ffdb063f4c7
> >  </TASK>
> > Modules linked in:
> > ---[ end trace 0000000000000000 ]---
> > RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
> > RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
> > RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
> > Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
> > 3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
> > 3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
> > RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
> > RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
> > RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> > R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
> > R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
> > FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f1280e130b0 CR3: 000000005ea81000 CR4: 0000000000352ef0
> > ----------------
> > Code disassembly (best guess):
> >    0: bd 68 09 00 00       mov    $0x968,%ebp
> >    5: 48 89 fa             mov    %rdi,%rdx
> >    8: 48 c1 ea 03           shr    $0x3,%rdx
> >    c: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1)
> >   11: 74 05                 je     0x18
> >   13: e8 06 3a 40 01       call   0x1403a1e
> >   18: 48 8b ad 68 09 00 00 mov    0x968(%rbp),%rbp
> >   1f: 48 8d 7d 18           lea    0x18(%rbp),%rdi
> >   23: 48 89 fa             mov    %rdi,%rdx
> >   26: 48 c1 ea 03           shr    $0x3,%rdx
> > * 2a: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1) <-- trapping instruction
> >   2f: 74 05                 je     0x36
> >   31: e8 e8 39 40 01       call   0x1403a1e
> >   36: 48 8b 6d 18           mov    0x18(%rbp),%rbp
> >   3a: 48 85 ed             test   %rbp,%rbp
> >   3d: 0f                   .byte 0xf
> >   3e: 85 ec                 test   %ebp,%esp
> >
> >
> > <<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>
> >
> > This bug is in io_register_iowq_max_workers()
> >
> > mutex_lock(&ctx->tctx_lock);
> > list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
> >     tctx = node->task->io_uring;
> >     if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
> >         continue;
> >     // skip
> > }
> >
> > propagates the limit to all registered users (non-SQPOLL path)
> >
> > The node is published into ctx->tctx_list before node->task->io_uring
> > is set (io_uring/tctx.c):
> >
> > io_tctx_install_node():
> >     node->task = current;
> >     mutex_lock(&ctx->tctx_lock);
> >     list_add(&node->ctx_node, &ctx->tctx_list);   // node visible
> >     mutex_unlock(&ctx->tctx_lock); // lock dropped
> >
> > __io_uring_add_tctx_node():
> >     ret = io_tctx_install_node(ctx, tctx);
> >     if (!ret)
> >         current->io_uring = tctx;   // set AFTER, outside lock
> >
> > There is a window where a node is on ctx->tctx_list while
> > node->task->io_uring is still NULL (the task is doing its first
> > io_uring op, tctx freshly allocated, not yet published). A concurrent
> > IORING_REGISTER_IOWQ_MAX_WORKERS on the same ring takes
> > ctx->tctx_lock, iterates, reads node->task->io_uring == NULL, and
> > dereferences tctx->io_wq → GPF.
> >
> > The other two ctx->tctx_list consumers already guard this — cancel.c
> > io_async_cancel_one() and io_uring_try_cancel_iowq() both do if (!tctx
> > || !tctx->io_wq). io_register_iowq_max_workers() is the only consumer
> > that omits the !tctx check, so this is simply a missing guard.
> >
> > Reproducer
> >
> > Plain (non-SQPOLL) ring shared across threads. A stream of fresh
> > threads each do their first io_uring_enter() (hits the window) while
> > two threads spam IORING_REGISTER_IOWQ_MAX_WORKERS. GPFs within
> > seconds-to-minutes on SMP+KASAN.
> >
> > #define _GNU_SOURCE
> > #include <pthread.h>
> > #include <string.h>
> > #include <sys/syscall.h>
> > #include <linux/io_uring.h>
> > static int ring_fd;
> > static long setup(unsigned e, struct io_uring_params *p){ return
> > syscall(__NR_io_uring_setup, e, p); }
> > static long enter(int fd, unsigned ts){ return
> > syscall(__NR_io_uring_enter, fd, ts, 0, 0, (void*)0, (size_t)0); }
> > static long reg(int fd, unsigned op, void *a, unsigned n){ return
> > syscall(__NR_io_uring_register, fd, op, a, n); }
> > static void *fresh(void *x){ enter(ring_fd, 1); return 0; }   // first op -> window
> > static void *spam(void *x){ unsigned c[2]={1,1}; for(;;) reg(ring_fd,
> > IORING_REGISTER_IOWQ_MAX_WORKERS, c, 2); return 0; }
> > int main(void){
> >     struct io_uring_params p; memset(&p,0,sizeof(p));
> >     ring_fd = setup(8, &p);
> >     pthread_t s; pthread_create(&s,0,spam,0); pthread_create(&s,0,spam,0);
> >     for(;;){ pthread_t t[64];
> >         for(int i=0;i<64;i++) pthread_create(&t[i],0,fresh,0);
> >         for(int i=0;i<64;i++) pthread_join(t[i],0); }
> > }
> >
> > Reproduced on 7.1.0-rc1 with KASAN; the racy ordering predates the
> > 2024 shadow-variable cleanup that last touched register.c:422.
> >
> > Suggested fix
> >
> > Either make io_register_iowq_max_workers() match its siblings:
> >
> > before:
> >
> > mutex_lock(&ctx->tctx_lock);
> > list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
> >     tctx = node->task->io_uring;
> >     if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
> >         continue;
> >     // skip
> > }
> >
> > to:
> >
> > mutex_lock(&ctx->tctx_lock);
> > list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
> >     tctx = node->task->io_uring;
> >     if (!tctx || !tctx->io_wq)
> >         continue;
> >     // skip
> > }
> >
> > or close the window in __io_uring_add_tctx_node() by publishing
> > current->io_uring = tctx before the node is added to ctx->tctx_list,
> > so a listed node always has a valid task->io_uring.
> >
> >
> > LimHyeonJun


* Re: [bug] io_uring : NULL pointer deref in io_register_iowq_max_workers()
From: 시리얼 @ 2026-05-23 12:10 UTC (permalink / raw)
  To: io-uring, axboe
In-Reply-To: <CACR30Wj7yEweYqJg4Ovrbr4s9a8EZRYD8FMAWhjWUv3XunrMFQ@mail.gmail.com>

I checked again: the missing `!tctx` check in
io_register_iowq_max_workers() is older, but the specific race window
I described appears to come from the recent tctx refactor.

On Sat, May 23, 2026 at 9:00 PM, 시리얼 <shja0831@gmail.com> wrote:
>
> First, I'm not good at English, so my grammar might be weird,
> and I use a translator, so it may not look natural.
>
> I found a NULL-pointer dereference (general protection fault under
> KASAN) in io_register_iowq_max_workers() on 7.1.0-rc1. It is a race
> between the IORING_REGISTER_IOWQ_MAX_WORKERS propagation loop and a
> task installing its first io_uring task context (tctx) node on a
> shared ring. A small multithreaded reproducer triggers it reliably.
>
> syzkaller log:
>
> Oops: general protection fault, probably for non-canonical address
> 0xdffffc0000000003: 0000 [#1] SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> CPU: 1 UID: 0 PID: 230570 Comm: syz.1.42039 Not tainted 7.1.0-rc1 #1
> PREEMPT(full)
> Hardware name: QEMU Ubuntu 26.04 PC (i440FX + PIIX, 1996), BIOS
> 1.17.0-debian-1.17.0-1ubuntu1 04/01/2014
> RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
> RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
> RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
> Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
> 3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
> 3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
> RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
> RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
> RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
> R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
> FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000005ea81000 CR4: 0000000000352ef0
> Call Trace:
>  <TASK>
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0xff/0xf80 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7ff8543b85fd
> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48
> 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007ff8525edff8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
> RAX: ffffffffffffffda RBX: 00007ff854645fa0 RCX: 00007ff8543b85fd
> RDX: 0000200000000040 RSI: 0000000000000013 RDI: 0000000000000003
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007ffdb063f3d0 R14: 00007ff8525eece4 R15: 00007ffdb063f4c7
>  </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
> RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
> RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
> Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
> 3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
> 3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
> RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
> RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
> RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
> R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
> FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f1280e130b0 CR3: 000000005ea81000 CR4: 0000000000352ef0
> ----------------
> Code disassembly (best guess):
>    0: bd 68 09 00 00       mov    $0x968,%ebp
>    5: 48 89 fa             mov    %rdi,%rdx
>    8: 48 c1 ea 03           shr    $0x3,%rdx
>    c: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1)
>   11: 74 05                 je     0x18
>   13: e8 06 3a 40 01       call   0x1403a1e
>   18: 48 8b ad 68 09 00 00 mov    0x968(%rbp),%rbp
>   1f: 48 8d 7d 18           lea    0x18(%rbp),%rdi
>   23: 48 89 fa             mov    %rdi,%rdx
>   26: 48 c1 ea 03           shr    $0x3,%rdx
> * 2a: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1) <-- trapping instruction
>   2f: 74 05                 je     0x36
>   31: e8 e8 39 40 01       call   0x1403a1e
>   36: 48 8b 6d 18           mov    0x18(%rbp),%rbp
>   3a: 48 85 ed             test   %rbp,%rbp
>   3d: 0f                   .byte 0xf
>   3e: 85 ec                 test   %ebp,%esp
>
>
> <<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>
>
> This bug is in io_register_iowq_max_workers()
>
> mutex_lock(&ctx->tctx_lock);
> list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
>     tctx = node->task->io_uring;
>     if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
>         continue;
>     // skip
> }
>
> propagates the limit to all registered users (non-SQPOLL path)
>
> The node is published into ctx->tctx_list before node->task->io_uring
> is set (io_uring/tctx.c):
>
> io_tctx_install_node():
>     node->task = current;
>     mutex_lock(&ctx->tctx_lock);
>     list_add(&node->ctx_node, &ctx->tctx_list);   // node visible
>     mutex_unlock(&ctx->tctx_lock); // lock dropped
>
> __io_uring_add_tctx_node():
>     ret = io_tctx_install_node(ctx, tctx);
>     if (!ret)
>         current->io_uring = tctx;   // set AFTER, outside lock
>
> There is a window where a node is on ctx->tctx_list while
> node->task->io_uring is still NULL (the task is doing its first
> io_uring op, tctx freshly allocated, not yet published). A concurrent
> IORING_REGISTER_IOWQ_MAX_WORKERS on the same ring takes
> ctx->tctx_lock, iterates, reads node->task->io_uring == NULL, and
> dereferences tctx->io_wq → GPF.
>
> The other two ctx->tctx_list consumers already guard this — cancel.c
> io_async_cancel_one() and io_uring_try_cancel_iowq() both do if (!tctx
> || !tctx->io_wq). io_register_iowq_max_workers() is the only consumer
> that omits the !tctx check, so this is simply a missing guard.
>
> Reproducer
>
> Plain (non-SQPOLL) ring shared across threads. A stream of fresh
> threads each do their first io_uring_enter() (hits the window) while
> two threads spam IORING_REGISTER_IOWQ_MAX_WORKERS. GPFs within
> seconds-to-minutes on SMP+KASAN.
>
> #define _GNU_SOURCE
> #include <pthread.h>
> #include <string.h>
> #include <sys/syscall.h>
> #include <linux/io_uring.h>
> static int ring_fd;
> static long setup(unsigned e, struct io_uring_params *p){ return
> syscall(__NR_io_uring_setup, e, p); }
> static long enter(int fd, unsigned ts){ return
> syscall(__NR_io_uring_enter, fd, ts, 0, 0, (void*)0, (size_t)0); }
> static long reg(int fd, unsigned op, void *a, unsigned n){ return
> syscall(__NR_io_uring_register, fd, op, a, n); }
> static void *fresh(void *x){ enter(ring_fd, 1); return 0; }   // first op -> window
> static void *spam(void *x){ unsigned c[2]={1,1}; for(;;) reg(ring_fd,
> IORING_REGISTER_IOWQ_MAX_WORKERS, c, 2); return 0; }
> int main(void){
>     struct io_uring_params p; memset(&p,0,sizeof(p));
>     ring_fd = setup(8, &p);
>     pthread_t s; pthread_create(&s,0,spam,0); pthread_create(&s,0,spam,0);
>     for(;;){ pthread_t t[64];
>         for(int i=0;i<64;i++) pthread_create(&t[i],0,fresh,0);
>         for(int i=0;i<64;i++) pthread_join(t[i],0); }
> }
>
> Reproduced on 7.1.0-rc1 with KASAN; the racy ordering predates the
> 2024 shadow-variable cleanup that last touched register.c:422.
>
> Suggested fix
>
> Either make io_register_iowq_max_workers() match its siblings:
>
> before:
>
> mutex_lock(&ctx->tctx_lock);
> list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
>     tctx = node->task->io_uring;
>     if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
>         continue;
>     // skip
> }
>
> to:
>
> mutex_lock(&ctx->tctx_lock);
> list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
>     tctx = node->task->io_uring;
>     if (!tctx || !tctx->io_wq)
>         continue;
>     // skip
> }
>
> or close the window in __io_uring_add_tctx_node() by publishing
> current->io_uring = tctx before the node is added to ctx->tctx_list,
> so a listed node always has a valid task->io_uring.
>
>
> LimHyeonJun


* [bug] io_uring : NULL pointer deref in io_register_iowq_max_workers()
From: 시리얼 @ 2026-05-23 12:00 UTC (permalink / raw)
  To: io-uring, axboe

First, I'm not good at English, so my grammar might be weird,
and I use a translator, so it may not look natural.

I found a NULL-pointer dereference (general protection fault under
KASAN) in io_register_iowq_max_workers() on 7.1.0-rc1. It is a race
between the IORING_REGISTER_IOWQ_MAX_WORKERS propagation loop and a
task installing its first io_uring task context (tctx) node on a
shared ring. A small multithreaded reproducer triggers it reliably.

syzkaller log:

Oops: general protection fault, probably for non-canonical address
0xdffffc0000000003: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 1 UID: 0 PID: 230570 Comm: syz.1.42039 Not tainted 7.1.0-rc1 #1
PREEMPT(full)
Hardware name: QEMU Ubuntu 26.04 PC (i440FX + PIIX, 1996), BIOS
1.17.0-debian-1.17.0-1ubuntu1 04/01/2014
RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000005ea81000 CR4: 0000000000352ef0
Call Trace:
 <TASK>
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xff/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff8543b85fd
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff8525edff8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 00007ff854645fa0 RCX: 00007ff8543b85fd
RDX: 0000200000000040 RSI: 0000000000000013 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffdb063f3d0 R14: 00007ff8525eece4 R15: 00007ffdb063f4c7
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:io_register_iowq_max_workers io_uring/register.c:423 [inline]
RIP: 0010:__io_uring_register io_uring/register.c:865 [inline]
RIP: 0010:__do_sys_io_uring_register.cold+0xcae/0xe32 io_uring/register.c:1029
Code: bd 68 09 00 00 48 89 fa 48 c1 ea 03 42 80 3c 2a 00 74 05 e8 06
3a 40 01 48 8b ad 68 09 00 00 48 8d 7d 18 48 89 fa 48 c1 ea 03 <42> 80
3c 2a 00 74 05 e8 e8 39 40 01 48 8b 6d 18 48 85 ed 0f 85 ec
RSP: 0018:ffffc90002a9fd90 EFLAGS: 00010206
RAX: 1ffff11004c9f63a RBX: ffff888054ee4000 RCX: 0000000000000001
RDX: 0000000000000003 RSI: ffffffff81364a23 RDI: 0000000000000018
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc90002a9fd90 R11: 0000000080000000 R12: ffff8880264fb1c0
R13: dffffc0000000000 R14: 0000000000000013 R15: 0000000000000013
FS:  00007ff8525ee6c0(0000) GS:ffff8880d687a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1280e130b0 CR3: 000000005ea81000 CR4: 0000000000352ef0
----------------
Code disassembly (best guess):
   0: bd 68 09 00 00       mov    $0x968,%ebp
   5: 48 89 fa             mov    %rdi,%rdx
   8: 48 c1 ea 03           shr    $0x3,%rdx
   c: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1)
  11: 74 05                 je     0x18
  13: e8 06 3a 40 01       call   0x1403a1e
  18: 48 8b ad 68 09 00 00 mov    0x968(%rbp),%rbp
  1f: 48 8d 7d 18           lea    0x18(%rbp),%rdi
  23: 48 89 fa             mov    %rdi,%rdx
  26: 48 c1 ea 03           shr    $0x3,%rdx
* 2a: 42 80 3c 2a 00       cmpb   $0x0,(%rdx,%r13,1) <-- trapping instruction
  2f: 74 05                 je     0x36
  31: e8 e8 39 40 01       call   0x1403a1e
  36: 48 8b 6d 18           mov    0x18(%rbp),%rbp
  3a: 48 85 ed             test   %rbp,%rbp
  3d: 0f                   .byte 0xf
  3e: 85 ec                 test   %ebp,%esp


<<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>

This bug is in io_register_iowq_max_workers()

mutex_lock(&ctx->tctx_lock);
list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
    tctx = node->task->io_uring;
    if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
        continue;
    // skip
}

This loop propagates the limit to all registered users (non-SQPOLL path).

The node is published into ctx->tctx_list before node->task->io_uring
is set (io_uring/tctx.c):

io_tctx_install_node():
    node->task = current;
    mutex_lock(&ctx->tctx_lock);
    list_add(&node->ctx_node, &ctx->tctx_list);   // node visible
    mutex_unlock(&ctx->tctx_lock); // lock dropped

__io_uring_add_tctx_node():
    ret = io_tctx_install_node(ctx, tctx);
    if (!ret)
        current->io_uring = tctx;   // set AFTER, outside lock

There is a window where a node is on ctx->tctx_list while
node->task->io_uring is still NULL (the task is doing its first
io_uring op, tctx freshly allocated, not yet published). A concurrent
IORING_REGISTER_IOWQ_MAX_WORKERS on the same ring takes
ctx->tctx_lock, iterates, reads node->task->io_uring == NULL, and
dereferences tctx->io_wq → GPF.

The other two ctx->tctx_list consumers already guard this — cancel.c
io_async_cancel_one() and io_uring_try_cancel_iowq() both do if (!tctx
|| !tctx->io_wq). io_register_iowq_max_workers() is the only consumer
that omits the !tctx check, so this is simply a missing guard.

Reproducer

Plain (non-SQPOLL) ring shared across threads. A stream of fresh
threads each do their first io_uring_enter() (hits the window) while
two threads spam IORING_REGISTER_IOWQ_MAX_WORKERS. GPFs within
seconds-to-minutes on SMP+KASAN.

#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int ring_fd;

static long setup(unsigned e, struct io_uring_params *p)
{
    return syscall(__NR_io_uring_setup, e, p);
}
static long enter(int fd, unsigned ts)
{
    return syscall(__NR_io_uring_enter, fd, ts, 0, 0, (void *)0, (size_t)0);
}
static long reg(int fd, unsigned op, void *a, unsigned n)
{
    return syscall(__NR_io_uring_register, fd, op, a, n);
}

/* A fresh thread's first io_uring op installs its tctx node -> window. */
static void *fresh(void *x) { enter(ring_fd, 1); return 0; }

/* Keeps walking ctx->tctx_list via IORING_REGISTER_IOWQ_MAX_WORKERS. */
static void *spam(void *x)
{
    unsigned c[2] = {1, 1};
    for (;;) reg(ring_fd, IORING_REGISTER_IOWQ_MAX_WORKERS, c, 2);
    return 0;
}

int main(void)
{
    struct io_uring_params p;
    pthread_t s;
    memset(&p, 0, sizeof(p));
    ring_fd = setup(8, &p);
    if (ring_fd < 0) return 1;
    pthread_create(&s, 0, spam, 0); pthread_create(&s, 0, spam, 0);
    for (;;) {
        pthread_t t[64];
        for (int i = 0; i < 64; i++) pthread_create(&t[i], 0, fresh, 0);
        for (int i = 0; i < 64; i++) pthread_join(t[i], 0);
    }
}

Reproduced on 7.1.0-rc1 with KASAN; the racy ordering predates the
2024 shadow-variable cleanup that last touched register.c:422.

Suggested fix

Either make io_register_iowq_max_workers() match its siblings:

before:

mutex_lock(&ctx->tctx_lock);
list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
    tctx = node->task->io_uring;
    if (WARN_ON_ONCE(!tctx->io_wq)) // derefs tctx without NULL check
        continue;
    // skip
}

to:

mutex_lock(&ctx->tctx_lock);
list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
    tctx = node->task->io_uring;
    if (!tctx || !tctx->io_wq)
        continue;
    // skip
}

or close the window in __io_uring_add_tctx_node() by publishing
current->io_uring = tctx before the node is added to ctx->tctx_list,
so a listed node always has a valid task->io_uring.


LimHyeonJun


* Re: [GIT PULL] io_uring fixes for 7.1-rc5
From: pr-tracker-bot @ 2026-05-22 19:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring
In-Reply-To: <a2fc1873-e68c-45ad-a8db-c70eb2c9c5a8@kernel.dk>

The pull request you sent on Fri, 22 May 2026 09:50:57 -0600:

> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/io_uring-7.1-20260522

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/dbae42cfa618abc57f0bc3c28cc140292f4f7410

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* [PATCH] block: Add bvec_folio()
From: Matthew Wilcox (Oracle) @ 2026-05-22 18:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox (Oracle), linux-block, linux-kernel, io-uring,
	linux-mm, Leon Romanovsky

This is a simple helper which replaces page_folio(bvec->bv_page).
Minor improvement in readability, but the real motivation is to reduce
the number of references to bvec->bv_page so that it can be changed
with less work.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Leon Romanovsky <leon@kernel.org>
---

Hi Jens,

I have a pile of other patches which depend on this one, but they're
spread all over the kernel and don't really have anything in common
with each other.  Getting this in the next merge window will let me send
those patches next cycle.

 block/bio.c          |  6 +++---
 include/linux/bio.h  |  2 +-
 include/linux/bvec.h | 13 +++++++++++++
 io_uring/rsrc.c      |  2 +-
 mm/page_io.c         |  4 ++--
 5 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..85aab3140909 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1300,7 +1300,7 @@ static void bio_free_folios(struct bio *bio)
 	int i;
 
 	bio_for_each_bvec_all(bv, bio, i) {
-		struct folio *folio = page_folio(bv->bv_page);
+		struct folio *folio = bvec_folio(bv);
 
 		if (!is_zero_folio(folio))
 			folio_put(folio);
@@ -1409,7 +1409,7 @@ int bio_iov_iter_bounce(struct bio *bio, struct iov_iter *iter, size_t maxlen,
 
 static void bvec_unpin(struct bio_vec *bv, bool mark_dirty)
 {
-	struct folio *folio = page_folio(bv->bv_page);
+	struct folio *folio = bvec_folio(bv);
 	size_t nr_pages = (bv->bv_offset + bv->bv_len - 1) / PAGE_SIZE -
 			bv->bv_offset / PAGE_SIZE + 1;
 
@@ -1443,7 +1443,7 @@ static void bio_iov_iter_unbounce_read(struct bio *bio, bool is_error,
 			bvec_unpin(&bio->bi_io_vec[1 + i], mark_dirty);
 	}
 
-	folio_put(page_folio(bio->bi_io_vec[0].bv_page));
+	folio_put(bvec_folio(&bio->bi_io_vec[0]));
 }
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index dc17780d6c1e..6613ab4519bd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -283,7 +283,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
 		return;
 	}
 
-	fi->folio = page_folio(bvec->bv_page);
+	fi->folio = bvec_folio(bvec);
 	fi->offset = bvec->bv_offset +
 			PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
 	fi->_seg_count = bvec->bv_len;
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d36dd476feda..32846079b853 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -74,6 +74,19 @@ static inline void bvec_set_virt(struct bio_vec *bv, void *vaddr,
 	bvec_set_page(bv, virt_to_page(vaddr), len, offset_in_page(vaddr));
 }
 
+/**
+ * bvec_folio - Return the first folio referenced by this bvec
+ * @bv: bvec to access
+ *
+ * bvecs can span multiple folios.  Unless you know that this
+ * bvec does not, you may be better off using something like
+ * bio_for_each_folio_all() which iterates over all folios.
+ */
+static inline struct folio *bvec_folio(const struct bio_vec *bv)
+{
+	return page_folio(bv->bv_page);
+}
+
 struct bvec_iter {
 	/*
 	 * Current device address in 512 byte sectors. Only updated by the bio
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 650303626be6..5d792f70ec1e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -102,7 +102,7 @@ static void io_release_ubuf(void *priv)
 	unsigned int i;
 
 	for (i = 0; i < imu->nr_bvecs; i++) {
-		struct folio *folio = page_folio(imu->bvec[i].bv_page);
+		struct folio *folio = bvec_folio(&imu->bvec[i]);
 
 		unpin_user_folio(folio, 1);
 	}
diff --git a/mm/page_io.c b/mm/page_io.c
index 70cea9e24d2f..a59b73f8bdd9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -490,7 +490,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
 
 	if (ret == sio->len) {
 		for (p = 0; p < sio->pages; p++) {
-			struct folio *folio = page_folio(sio->bvec[p].bv_page);
+			struct folio *folio = bvec_folio(&sio->bvec[p]);
 
 			count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
 			count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
@@ -500,7 +500,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
 		count_vm_events(PSWPIN, sio->len >> PAGE_SHIFT);
 	} else {
 		for (p = 0; p < sio->pages; p++) {
-			struct folio *folio = page_folio(sio->bvec[p].bv_page);
+			struct folio *folio = bvec_folio(&sio->bvec[p]);
 
 			folio_unlock(folio);
 		}
-- 
2.47.3
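
The kernel-doc caveat in the patch ("bvecs can span multiple folios")
can be illustrated with a small userspace model. This is a sketch only:
every mock_* name is hypothetical, the structures are simplified
stand-ins for struct page, struct folio and struct bio_vec, and the
page count reuses the arithmetic visible in the bvec_unpin() context
above.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u

/* Simplified stand-ins; the real kernel structures look nothing like this. */
struct mock_folio { unsigned int nr_pages; };
struct mock_page { struct mock_folio *folio; };
struct mock_bio_vec {
	struct mock_page *bv_page;	/* first page of the segment */
	unsigned int bv_len;
	unsigned int bv_offset;
};

static struct mock_folio *mock_page_folio(const struct mock_page *page)
{
	return page->folio;
}

/* Mirrors the shape of the new helper: only the first folio is returned. */
static struct mock_folio *mock_bvec_folio(const struct mock_bio_vec *bv)
{
	return mock_page_folio(bv->bv_page);
}

/* Number of pages the segment touches. */
static size_t mock_bvec_nr_pages(const struct mock_bio_vec *bv)
{
	assert(bv->bv_len > 0);
	return (bv->bv_offset + bv->bv_len - 1) / MOCK_PAGE_SIZE -
	       bv->bv_offset / MOCK_PAGE_SIZE + 1;
}
```

With three pages where the first two belong to one folio and the third
to another, a two-page segment starting at the second page reports the
first folio only, even though its second page sits in a different one.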



* Re: [PATCH v2] io_uring: annotate remote tasks for kcoverage
From: Andrey Konovalov @ 2026-05-22 16:23 UTC (permalink / raw)
  To: Robert Femmer; +Cc: io-uring, Jens Axboe, Dmitry Vyukov, kasan-dev, Jann Horn
In-Reply-To: <20260520204303.558392-2-robert@fmmr.tech>

On Wed, May 20, 2026 at 10:44 PM Robert Femmer <robert@fmmr.tech> wrote:
>
> Fuzzers use coverage information to guide generation of test cases
> towards new or interesting code paths. Syzkaller, specifically, makes
> use of kcoverage (CONFIG_KCOV). Coverage information is not collected for
> kernel tasks unless annotated by kcov_remote_start and kcov_remote_stop.
> This patch annotates io-uring's work queue and sqpoll tasks.
>
> Signed-off-by: Robert Femmer <robert@fmmr.tech>
> ---
>  include/linux/io_uring_types.h |  4 ++++
>  io_uring/io-wq.c               |  4 ++++
>  io_uring/io_uring.c            |  3 +++
>  io_uring/io_uring.h            | 24 ++++++++++++++++++++++++
>  io_uring/sqpoll.c              |  4 ++++
>  5 files changed, 39 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 244392026c6d..b92b8e7169ea 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -504,6 +504,10 @@ struct io_ring_ctx {
>         struct io_mapped_region         ring_region;
>         /* used for optimised request parameter and wait argument passing  */
>         struct io_mapped_region         param_region;
> +
> +#ifdef CONFIG_KCOV
> +       u64                             kcov_handle;
> +#endif

Jann recently sent a patch that added kcov_common_handle_id. I think
you can base your code on it and use that helper struct here.

https://lore.kernel.org/all/20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com/

>  };
>
>  /*
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 8cc7b47d3089..16af75b1cfe0 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -639,6 +639,7 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>                 /* handle a whole dependent link */
>                 do {
>                         struct io_wq_work *next_hashed, *linked;
> +                       struct io_kiocb *req;
>                         unsigned int work_flags = atomic_read(&work->flags);
>                         unsigned int hash = __io_wq_is_hashed(work_flags)
>                                 ? __io_get_work_hash(work_flags)
> @@ -649,7 +650,10 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>                         if (do_kill &&
>                             (work_flags & IO_WQ_WORK_UNBOUND))
>                                 atomic_or(IO_WQ_WORK_CANCEL, &work->flags);
> +                       req = container_of(work, struct io_kiocb, work);
> +                       io_kcov_remote_start(req->ctx);

And also use kcov_remote_start_common() here.

>                         io_wq_submit_work(work);
> +                       io_kcov_remote_stop(req->ctx);
>                         io_assign_current_work(worker, NULL);
>
>                         linked = io_wq_free_work(work);
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 036145ee466c..f38b8eca6bbb 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -293,6 +293,9 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>         INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
>         io_napi_init(ctx);
>         mutex_init(&ctx->mmap_lock);
> +#ifdef CONFIG_KCOV
> +       ctx->kcov_handle = current->kcov_handle;
> +#endif
>
>         return ctx;
>
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index e612a66ee80e..881d43bd529c 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -7,6 +7,7 @@
>  #include <linux/resume_user_mode.h>
>  #include <linux/poll.h>
>  #include <linux/io_uring_types.h>
> +#include <linux/kcov.h>
>  #include <uapi/linux/eventpoll.h>
>  #include "alloc_cache.h"
>  #include "io-wq.h"
> @@ -581,4 +582,27 @@ static inline bool io_has_work(struct io_ring_ctx *ctx)
>         return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
>                io_local_work_pending(ctx);
>  }
> +
> +#ifdef CONFIG_KCOV
> +static inline void io_kcov_remote_start(struct io_ring_ctx *ctx)
> +{
> +       if (ctx->kcov_handle)
> +               kcov_remote_start(ctx->kcov_handle);
> +}
> +
> +static inline void io_kcov_remote_stop(struct io_ring_ctx *ctx)
> +{
> +       if (ctx->kcov_handle)
> +               kcov_remote_stop();
> +}
> +#else
> +static inline void io_kcov_remote_start(struct io_ring_ctx *ctx)
> +{
> +}
> +
> +static inline void io_kcov_remote_stop(struct io_ring_ctx *ctx)
> +{
> +}
> +#endif
> +
>  #endif
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index 46c12afec73e..8d2876e31acb 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -342,19 +342,23 @@ static int io_sq_thread(void *data)
>
>                 cap_entries = !list_is_singular(&sqd->ctx_list);
>                 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> +                       io_kcov_remote_start(ctx);
>                         int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist);
>
>                         if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
>                                 sqt_spin = true;
> +                       io_kcov_remote_stop(ctx);
>                 }
>                 if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
>                         sqt_spin = true;
>
>                 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> +                       io_kcov_remote_start(ctx);
>                         if (io_napi(ctx)) {
>                                 io_sq_start_worktime(&ist);
>                                 io_napi_sqpoll_busy_poll(ctx);
>                         }
> +                       io_kcov_remote_stop(ctx);
>                 }
>
>                 io_sq_update_worktime(sqd, &ist);
> --
> 2.54.0
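
As a rough userspace model of what the quoted patch does (capture a
handle when the ring is created, then bracket each unit of remote work
with start/stop), the sketch below may help. All mock_* names are
hypothetical; mock_remote_start() and mock_remote_stop() only imitate
the role of kcov_remote_start() and kcov_remote_stop() and share none
of their implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kcov remote section state. */
static uint64_t mock_active_handle;	/* 0 means no section is open */
static unsigned int mock_sections;	/* sections opened and closed */
static uint64_t mock_seen_handle;	/* handle observed by the work item */

static void mock_remote_start(uint64_t handle)
{
	assert(mock_active_handle == 0);	/* no nesting in this model */
	mock_active_handle = handle;
}

static void mock_remote_stop(void)
{
	if (mock_active_handle)
		mock_sections++;
	mock_active_handle = 0;
}

struct mock_ring_ctx { uint64_t kcov_handle; };

/* The creator's handle is captured once, at ring setup time. */
static void mock_ctx_init(struct mock_ring_ctx *ctx, uint64_t creator_handle)
{
	ctx->kcov_handle = creator_handle;
}

/* Example work item: records which handle was active while it ran. */
static void mock_record_work(void)
{
	mock_seen_handle = mock_active_handle;
}

/* A worker brackets the work only when the ring carries a handle. */
static void mock_run_work(struct mock_ring_ctx *ctx, void (*work)(void))
{
	if (ctx->kcov_handle)
		mock_remote_start(ctx->kcov_handle);
	work();
	if (ctx->kcov_handle)
		mock_remote_stop();
}
```

Work run for a ring created with handle 0 is not attributed to anyone,
while work for a ring created with a non-zero handle runs with that
handle active and closes the section afterwards.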


* [GIT PULL] io_uring fixes for 7.1-rc5
From: Jens Axboe @ 2026-05-22 15:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

Hi Linus,

A few fixes for io_uring that should go into the 7.1 kernel release.
This pull request contains:

- Fix for an issue with IORING_OP_NOP when using injected results

- Fix for an issue in IORING_OP_WAITID, where the info state was assumed
  to be cleared by the lower-level syscall handler, but in some cases it
  is not. Clear the data upfront, so that uninitialized data isn't
  copied back to userspace.

- Fix for a lockdep-reported issue, where IORING_OP_BIND enters file
  create and hence hits mnt_want_write(), which creates a three-part
  lockdep cycle between the super lock, io_uring's uring_lock, and the
  cred mutex.

- Fix a regression introduced in this cycle with how linked timeouts are
  deleted.

- Ensure that the ->opcode nospec indexing on the opcode issue side
  covers all the cases.
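
The "clear the data upfront" approach from the IORING_OP_WAITID item
can be sketched in userspace C as below. This is an illustration under
assumptions, not the actual fix: the mock_* names and the struct layout
are hypothetical, and a plain struct copy stands in for the copy to
userspace.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result structure; not the kernel's waitid info layout. */
struct mock_waitid_info { int pid; int status; int cause; };

/*
 * Hypothetical lower-level handler. On the "nothing to report" path it
 * returns early and leaves *info untouched.
 */
static int mock_lower_wait(struct mock_waitid_info *info, int have_event)
{
	if (!have_event)
		return 0;
	info->pid = 1234;
	info->status = 0;
	info->cause = 1;
	return 1;
}

/* Clear the result first, so every path copies out defined data. */
static int mock_waitid(struct mock_waitid_info *out, int have_event)
{
	struct mock_waitid_info info;
	int ret;

	assert(out != NULL);
	memset(&info, 0, sizeof(info));
	ret = mock_lower_wait(&info, have_event);
	*out = info;	/* stand-in for the copy back to userspace */
	return ret;
}
```

Without the memset(), the early-return path would copy whatever happened
to be on the stack. With it, the caller sees zeroes whenever the handler
fills nothing in.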

Please pull!


The following changes since commit f44d38a31f1802b7222adaea9ee69f9d280f698a:

  io_uring: validate user-controlled cq.head in io_cqe_cache_refill() (2026-05-13 21:44:57 -0600)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/io_uring-7.1-20260522

for you to fetch changes up to e97ff8b62d4690c69297f0f6de874f0564cc01a4:

  io_uring/nop: pass all errors to userspace (2026-05-21 11:10:56 -0600)

----------------------------------------------------------------
io_uring-7.1-20260522

----------------------------------------------------------------
Alexander A. Klimov (1):
      io_uring/nop: pass all errors to userspace

Heechan Kang (1):
      io_uring/waitid: clear waitid info before copying it to userspace

Jens Axboe (2):
      io_uring/net: punt IORING_OP_BIND async if it needs file create
      io_uring/timeout: splice timed out link in timeout handler

Michael Bommarito (1):
      io_uring: propagate array_index_nospec opcode into req->opcode

 io_uring/io_uring.c |  9 ++++-----
 io_uring/net.c      | 26 +++++++++++++++++++++++++-
 io_uring/nop.c      |  4 ++--
 io_uring/timeout.c  |  4 +++-
 io_uring/waitid.c   |  1 +
 5 files changed, 35 insertions(+), 9 deletions(-)

-- 
Jens Axboe


