* [BUG] RCU hang with io_uring nvme polling
From: Ben Carey @ 2026-06-26 15:09 UTC
  To: io-uring; +Cc: linux-kernel, axboe, stable, benjamin.james.carey3

From: benjamin.james.carey3@gmail.com

To whom it may concern,

I am working in a lab that researches the energy efficiency of I/O servicing and
completion mechanisms, and we have encountered an issue when using io_uring to
poll NVMe drives for I/O completions.

Description
===========

When using fio to run io_uring test benches for energy consumption analysis
on our lab server, we encounter hangs and RCU stalls as the fio numjobs
parameter increases.

This issue occurs in our workloads that poll for I/O completion. Specifically,
whenever numjobs exceeds the nvme.poll_queues module parameter, the job takes
much longer to complete or does not complete at all.
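Since the trigger appears to be numjobs exceeding nvme.poll_queues, a small C
check like the following could flag that configuration before a run. It is a
sketch: /sys/module/nvme/parameters/poll_queues is where we assume the nvme
module exposes the parameter, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer from a text attribute file (sysfs or module parameter).
 * Returns -1 if the file cannot be opened or does not start with an integer.
 * ASSUMPTION: the nvme module exposes poll_queues at
 * /sys/module/nvme/parameters/poll_queues.
 */
int read_int_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/*
 * Returns 1 if more fio jobs than polled queues are requested (the
 * configuration in which we see the hang), 0 if not, -1 on read error.
 */
int jobs_exceed_poll_queues(const char *param_path, int numjobs)
{
	int poll_queues = read_int_attr(param_path);

	assert(numjobs > 0);
	if (poll_queues < 0)
		return -1;
	return numjobs > poll_queues;
}
```

With nvme.poll_queues=1, as on our kernel command line,
jobs_exceed_poll_queues("/sys/module/nvme/parameters/poll_queues", 2) would
return 1, matching the failing fio job.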

Notably, this issue also occurs in a QEMU VM that mimics our setup. Reading
the kernel log (dmesg) through GDB, we get the following:

...
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-7): P1070
rcu: 	(detected by 7, t=252035 jiffies, g=1985, q=25149 ncpus=8)
task:fio             state:R  running task     stack:13296 pid:1070  tgid:1070  ppid:1068   task_flags:0x400140 flags:0x00080000
Call Trace:
...
? blk_hctx_poll+0x34/0x80
blk_mq_poll+0x2b/0x40
bio_poll+0x94/0x180
iocb_bio_iopoll+0x31/0x50
io_uring_classic_poll+0x20/0x40
io_do_iopoll+0x233/0x430
? io_issue_sqe+0x2f/0x560
? io_submit_sqes+0x270/0x820
__do_sys_io_uring_enter+0x228/0x770
? handle_softirqs+0xc7/0x250
__x64_sys_io_uring_enter+0x21/0x30
x64_sys_call+0x17c8/0x1dd0
do_syscall_64+0xe0/0x5a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
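For context on the trace: with nvme.poll_queues=1, polled requests from both
fio jobs go to the single poll queue, so one job's nvme_poll() call can reap
the other job's completions. The following toy C model (hypothetical names,
not kernel code, and not a diagnosis of this hang) shows the consequence: the
second job's poll returns 0 even though its request is already done, so a
poll loop has to exit on per-request completion state rather than on the
poll return value.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TAGS 8

/* Toy shared completion queue; all names here are hypothetical. */
struct toy_cq {
	int pending[MAX_TAGS];	/* tags posted by the "device", not yet reaped */
	int npending;
	bool done[MAX_TAGS];	/* per-request completion flag */
};

/* Device side: post a completion for request @tag. */
void toy_post(struct toy_cq *cq, int tag)
{
	assert(cq->npending < MAX_TAGS && tag >= 0 && tag < MAX_TAGS);
	cq->pending[cq->npending++] = tag;
}

/*
 * Poller side: reap every pending entry regardless of which submitter owns
 * it, mark each owner's request done, and return how many were reaped.
 */
int toy_poll(struct toy_cq *cq)
{
	int reaped = cq->npending;

	for (int i = 0; i < cq->npending; i++)
		cq->done[cq->pending[i]] = true;
	cq->npending = 0;
	return reaped;
}

/*
 * Poll until request @tag is done, giving up after @max_iters polls.
 * Returns the number of polls issued, or -1 if the budget ran out.
 * The exit test is the per-request flag, because another poller may
 * already have reaped this request's completion.
 */
int toy_wait(struct toy_cq *cq, int tag, int max_iters)
{
	for (int i = 0; i < max_iters; i++) {
		if (cq->done[tag])
			return i;
		toy_poll(cq);
	}
	return cq->done[tag] ? max_iters : -1;
}
```

In the model, a waiter whose request never completes keeps polling until
max_iters runs out; a real polling loop has no such budget.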

Expected behavior
=================

The fio job completes after the specified runtime.

Actual behavior
===============

The fio job never completes, the system becomes less responsive (when the
numbers of poll queues and jobs are high), and the RCU stall detector reports
stalls.

Observations
============

After some brief investigation, we found that the following function
(nvme_poll() in drivers/nvme/host/pci.c) is called as the q->mq_ops->poll
callback:

static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	bool found;

	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
	    !nvme_cqe_pending(nvmeq))
		return 0;

	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_poll_cq(nvmeq, iob);
	spin_unlock(&nvmeq->cq_poll_lock);

	return found;
}

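While the task is stuck in the stalled polling loop, this function always
returns 0, and it always calls the helper nvme_cqe_pending(). Because the ||
short-circuits, reaching nvme_cqe_pending() means NVMEQ_POLLED is set on the
queue, so the 0 comes either from nvme_cqe_pending() reporting no new
completion entry or from nvme_poll_cq() finding none.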
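For reference, nvme_cqe_pending() is a phase-tag test: it compares bit 0 of
the status field of the entry at cq_head with the queue's expected phase
(cq_phase), which the driver flips each time cq_head wraps. A 0 return
therefore means the entry at the head still carries the previous pass's
phase. The following userspace C model (hypothetical names, single-threaded,
not the driver's code) sketches that mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CQ_DEPTH 4

/* Hypothetical model of one NVMe completion queue. */
struct model_cq {
	uint16_t status[MODEL_CQ_DEPTH];	/* status field; bit 0 = phase tag */
	uint16_t head;		/* host: next entry to read */
	uint8_t phase;		/* host: phase value that marks a new entry */
	uint16_t tail;		/* device: next slot to fill */
	uint8_t dev_phase;	/* device: phase value it writes this pass */
};

void model_init(struct model_cq *cq)
{
	for (int i = 0; i < MODEL_CQ_DEPTH; i++)
		cq->status[i] = 0;	/* zeroed queue: phase 0 reads as "old" */
	cq->head = 0;
	cq->phase = 1;
	cq->tail = 0;
	cq->dev_phase = 1;
}

/* Device side: post one entry; queue-full backpressure is not modeled. */
void model_post(struct model_cq *cq, uint16_t status_code)
{
	cq->status[cq->tail] = (uint16_t)((status_code << 1) | cq->dev_phase);
	if (++cq->tail == MODEL_CQ_DEPTH) {
		cq->tail = 0;
		cq->dev_phase ^= 1;
	}
}

/* Host side: the same comparison nvme_cqe_pending() makes. */
bool model_cqe_pending(const struct model_cq *cq)
{
	return (cq->status[cq->head] & 1) == cq->phase;
}

/* Host side: consume the head entry, flipping the expected phase on wrap. */
void model_consume(struct model_cq *cq)
{
	assert(model_cqe_pending(cq));
	if (++cq->head == MODEL_CQ_DEPTH) {
		cq->head = 0;
		cq->phase ^= 1;
	}
}
```

After one full pass, the stale entry at slot 0 still has phase 1 while the
host now expects 0, so it is correctly reported as not pending until the
device overwrites it.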

Below are some details that may help in reproducing this issue.

Steps to reproduce
==================
From a running QEMU VM booted with the latest kernel (see the kernel command
line below):
1. Attach GDB to the running instance.
2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
3. Execute the fio job below.
4. After 1-2 minutes, observe RCU stalls.

Offending fio job
=================

fio --bs=1K --direct=1 --iodepth=1 --runtime=1 --rw=randread --time_based \
  --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
  --sqthread_poll=0 --numjobs=2 --name=job0 --output-format=json \
  --clocksource=clock_gettime --filename=/dev/nvme0n1
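To scale numjobs past the poll queue count systematically, the invocation
above can be generated with the job count as a parameter. This is a C sketch;
format_fio_cmd() is a hypothetical helper, and the option string is copied
from the job above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the fio job from this report with the given numjobs and device.
 * Returns the snprintf() result: the full command length, which is
 * >= size if the buffer was too small (the output is then truncated).
 */
int format_fio_cmd(char *buf, size_t size, const char *dev, int numjobs)
{
	assert(dev != NULL && strlen(dev) > 0);
	assert(numjobs > 0);
	return snprintf(buf, size,
		"fio --bs=1K --direct=1 --iodepth=1 --runtime=1 --rw=randread "
		"--time_based --ioengine=io_uring --hipri=1 --fixedbufs=0 "
		"--registerfiles=0 --sqthread_poll=0 --numjobs=%d --name=job0 "
		"--output-format=json --clocksource=clock_gettime "
		"--filename=%s",
		numjobs, dev);
}
```

Generating it for numjobs = 1, 2, 3 with nvme.poll_queues=1 covers job counts
both at and above the poll queue count.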

Kernel config
=============

Start with x86_64_defconfig.

The following options are enabled for ease of debugging with GDB and QEMU.

In "Kernel Hacking" do the following:
- Set "Compile-time checks and compiler options -> Debug options" to "Rely on
  the toolchain's implicit default DWARF version."
- Set "Compile-time checks and compiler options -> Provide GDB scripts for
  debugging" to Yes.
- Set "x86 Debugging -> Choose kernel unwinder" to "Frame pointer unwinder."

In "Processor types and features" do the following:
- Set "Randomize the address of the kernel image (KASLR)" to No.

The following options are enabled to support NVMe over PCIe.

In "Device Drivers" do the following:
- Set "PCI Support -> PCI Endpoint support" to Yes.
- Set "NVMe Support -> NVM Express block device" to Module.
- Set "NVMe Support -> NVMe Target Support" to Module.
- Set "NVMe Support -> NVMe PCI Endpoint Function target support" to Module.
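As far as we can tell, the menu choices above correspond to the Kconfig
symbols listed in the following C sketch, which checks a generated .config
for them. The symbol mapping is our assumption and should be double-checked
against the Kconfig help text; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the .config at @path contains the exact line @want, 0 if not,
 * -1 if the file cannot be opened. Lines are read in chunks of up to 255
 * characters.
 */
int config_has_line(const char *path, const char *want)
{
	char line[256];
	FILE *f = fopen(path, "r");
	int found = 0;

	assert(want != NULL);
	if (!f)
		return -1;
	while (fgets(line, sizeof line, f)) {
		line[strcspn(line, "\n")] = '\0';	/* strip trailing newline */
		if (strcmp(line, want) == 0) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}

/* ASSUMPTION: our mapping of the menu choices above to Kconfig symbols. */
static const char *const wanted[] = {
	"CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y",
	"CONFIG_GDB_SCRIPTS=y",
	"CONFIG_UNWINDER_FRAME_POINTER=y",
	"# CONFIG_RANDOMIZE_BASE is not set",
	"CONFIG_PCI_ENDPOINT=y",
	"CONFIG_BLK_DEV_NVME=m",
	"CONFIG_NVME_TARGET=m",
	"CONFIG_NVME_TARGET_PCI_EPF=m",
};

/* Print each wanted line missing from @path; return the count, or -1. */
int count_missing(const char *path)
{
	int missing = 0;

	for (size_t i = 0; i < sizeof wanted / sizeof wanted[0]; i++) {
		int r = config_has_line(path, wanted[i]);

		if (r < 0)
			return -1;
		if (!r) {
			fprintf(stderr, "missing: %s\n", wanted[i]);
			missing++;
		}
	}
	return missing;
}
```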

Kernel command line
===================

BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
ro nvme.poll_queues=1 nokaslr \
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

(nokaslr may be unnecessary, since KASLR is already disabled in the kernel
config above.)

QEMU command line
=================
qemu-system-x86_64 \
  -m 4G -enable-kvm -monitor stdio -s -S -smp 8 \
  -device nvme,serial=deadbeef,drive=nvm \
  -drive file=disk.img,index=0,media=disk,if=virtio \
  -drive file=nvme.img,index=1,media=disk,if=none,id=nvm \
  -chardev socket,path=/tmp/port1,server=on,wait=off,id=port1-char \
  -device virtio-serial \
  -device virtserialport,id=port1,chardev=port1-char,name=org.fedoraproject.port.0 \
  -net user,hostfwd=tcp::10022-:22,hostfwd=tcp::45455-:45455 \
  -net nic

We create disk.img and nvme.img with:
dd if=/dev/zero of=disk.img bs=4K count=5000000
dd if=/dev/zero of=nvme.img bs=4K count=2000000

We then format disk.img with ext4.

To install a test userspace, we download an Ubuntu Server 26.04 ISO, append
its filename as a parameter to the QEMU command line, and run the installer.


If there is a better mailing list for this report, please let me know. Please
also let me know if you need more details about reproducing this issue or
about the affected system, or if you have any other questions.

Best wishes,
Benjamin Carey


Thread overview: 8+ messages
2026-06-26 15:09 [BUG] RCU hang with io_uring nvme polling Ben Carey
2026-06-26 15:17 ` Jens Axboe
2026-06-26 16:05   ` Keith Busch
2026-06-26 16:06     ` Jens Axboe
2026-06-26 16:33       ` Keith Busch
2026-06-26 16:35         ` Jens Axboe
2026-06-26 16:48           ` Keith Busch
2026-06-26 17:41               ` Ben Carey
