[BUG] RCU hang with io_uring nvme polling

* [BUG] RCU hang with io_uring nvme polling
@ 2026-06-26 15:09 Ben Carey
  2026-06-26 15:17 ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Ben Carey @ 2026-06-26 15:09 UTC (permalink / raw)
  To: io-uring; +Cc: linux-kernel, axboe, stable, benjamin.james.carey3

From: benjamin.james.carey3@gmail.com

Hello, to whom it may concern.

I am working in a lab researching energy efficiency of I/O servicing and
completion mechanisms, and we have encountered an issue when using io_uring and
completing I/O requests while polling NVMe drives.

Description
===========

When using fio to run io_uring test benches for energy consumption analysis
on our lab server, we're encountering strange kernel locking behaviors as
numjobs increases.

This issue occurs on our workloads that poll for I/O completion. Specifically,
whenever the numjobs parameter exceeds the nvme.poll_queues parameter, the job
takes much longer to complete or doesn't complete at all.

Notably, this issue also occurs on a QEMU image mimicking our setup. Using GDB
to read dmesg output, we get the following:

...
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-7): P1070
rcu: 	(detected by 7, t=252035 jiffies, g=1985, q=25149 ncpus=8)
task:fio             state:R  running task     stack:13296 pid:1070  tgid:1070  ppid:1068   task_flags:0x400140 flags:0x00080000
Call Trace:
...
? blk_hctx_poll+0x34/0x80
blk_mq_poll+0x2b/0x40
bio_poll+0x94/0x180
iocb_bio_iopoll+0x31/0x50
io_uring_classic_poll+0x20/0x40
io_do_iopoll+0x233/0x430
? io_issue_sqe+0x2f/0x560
? io_submit_sqes+0x270/0x820
__do_sys_io_uring_enter+0x228/0x770
? handle_softirqs+0xc7/0x250
__x64_sys_io_uring_enter+0x21/0x30
x64_sys_call+0x17c8/0x1dd0
do_syscall_64+0xe0/0x5a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Expected behavior
=================

fio job completes after specified runtime.

Actual behavior
===============

The fio job never completes, the system becomes less responsive (if the number
of poll queues and jobs is high), and the RCU stall checker detects stalls.

Observations
============

After some minimal investigation we found this notable function being called as
the callback for q->mq_ops->poll:

static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	bool found;

	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
	    !nvme_cqe_pending(nvmeq))
		return 0;

	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_poll_cq(nvmeq, iob);
	spin_unlock(&nvmeq->cq_poll_lock);

	return found;
}

While the task is stuck in the loop that triggers the RCU stall, this function
always returns 0. It also always calls the helper function nvme_cqe_pending,
which means the NVMEQ_POLLED test passes each time.
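
The control flow behind that observation can be modeled in userspace. The
following is only a sketch: struct model_queue, model_cqe_pending and
model_poll are hypothetical names, not driver code, and the "found" result of
the locked path is assumed to be 1. It shows that, because of C's
short-circuit ||, the pending check only runs when the polled flag is set, and
that the function returns 0 without taking the lock when nothing is pending.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct nvme_queue; not the driver type. */
struct model_queue {
	bool polled;        /* models test_bit(NVMEQ_POLLED, &nvmeq->flags) */
	bool cqe_pending;   /* models the result of nvme_cqe_pending() */
	int pending_calls;  /* how many times the pending check actually ran */
	int lock_taken;     /* how many times the locked path was reached */
};

/* Counts its invocations so the short-circuit can be observed. */
bool model_cqe_pending(struct model_queue *q)
{
	q->pending_calls++;
	return q->cqe_pending;
}

/* Same early-return shape as nvme_poll() quoted above. */
int model_poll(struct model_queue *q)
{
	if (!q->polled || !model_cqe_pending(q))
		return 0;

	/* Stands in for spin_lock / nvme_poll_cq / spin_unlock. */
	q->lock_taken++;
	return 1;   /* assumption: the locked path finds a completion */
}
```

With polled set and nothing pending, model_poll() returns 0 after exactly one
pending check and never reaches the locked path, which matches the behavior
described above.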

Following this are some items that may help in reproducing this issue.

Steps to reproduce
==================

From a running QEMU image with the latest kernel:
1. Attach GDB to the running instance.
2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
3. Execute the fio job below.
4. After 1-2 minutes, observe RCU stalls.

Offending fio job
=================

fio --bs=1K --direct=1 --iodepth=1 --runtime=1 --rw=randread --time_based \
  --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
  --sqthread_poll=0 \
  --numjobs=2 --name=job0 --output-format=json --clocksource=clock_gettime \
  --filename=/dev/nvme0n1

Kernel config
=============

Start with x86_defconfig

The following options are enabled for ease of debugging with GDB and QEMU.

In "Kernel Hacking" do the following:
- Set "Compile-time checks and compiler options -> Debug options" to "Rely on
  the toolchain's implicit default DWARF version."
- Set "Compile-time checks and compiler options -> Provide GDB scripts for
  debugging" to Yes.
- Set "x86 Debugging -> Choose kernel unwinder" to "Frame pointer unwinder."

In "Processor types and features" do the following:
- Set "Randomize the address of the kernel image (KASLR)" to No.

The following options are enabled to support NVMe over PCIe.

In "Device Drivers" do the following:
- Set "PCI Support -> PCI Endpoint support" to Yes.
- Set "NVMe Support -> NVM Express block device" to Module.
- Set "NVMe Support -> NVMe Target Support" to Module.
- Set "NVMe Support -> NVMe PCI Endpoint Function target support" to Module.

Kernel command line
===================

BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
ro nvme.poll_queues=1 nokaslr \
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

(nokaslr may be unneeded.)

QEMU command line
=================

qemu-system-x86_64 \
  -m 4G -enable-kvm -monitor stdio -s -S -smp 8 \
  -device nvme,serial=deadbeef,drive=nvm \
  -drive file=disk.img,index=0,media=disk,if=virtio \
  -drive file=nvme.img,index=1,media=disk,if=none,id=nvm \
  -chardev socket,path=/tmp/port1,server=on,wait=off,id=port1-char \
  -device virtio-serial \
  -device virtserialport,id=port1,chardev=port1-char,name=org.fedoraproject.port.0 \
  -net user,hostfwd=tcp::10022-:22,hostfwd=tcp::45455-:45455 \
  -net nic

For us, disk.img and nvme.img are created via:
dd if=/dev/zero of=disk.img bs=4K count=5000000
dd if=/dev/zero of=nvme.img bs=4K count=2000000
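
For reference, the resulting image sizes follow from the dd arguments, since
bs=4K means 4096-byte blocks. The helper below is a hypothetical illustration
of that arithmetic only (image_bytes and DD_BLOCK_BYTES are made-up names):

```c
#include <stdint.h>

/* dd's "bs=4K" is 4 * 1024 = 4096 bytes per block. */
#define DD_BLOCK_BYTES 4096ULL

/* Total size in bytes of an image written with bs=4K and the given count. */
uint64_t image_bytes(uint64_t block_count)
{
	return DD_BLOCK_BYTES * block_count;
}
```

image_bytes(5000000) gives 20480000000 bytes (about 19.1 GiB) for disk.img,
and image_bytes(2000000) gives 8192000000 bytes (about 7.6 GiB) for nvme.img.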

We then format disk.img with ext4.

To install a test userspace, we download an ISO of Ubuntu Server 26.04 and
append its filename as a parameter to the QEMU command line for the install.

If you think there's a better mailing list to which this should be sent,
please let me know. Also, please let me know if you need other details about
how to reproduce this issue or about the system on which it appears, or if
you have any other questions.

Best wishes,
Benjamin Carey


* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 15:09 [BUG] RCU hang with io_uring nvme polling Ben Carey
@ 2026-06-26 15:17 ` Jens Axboe
  2026-06-26 16:05   ` Keith Busch
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2026-06-26 15:17 UTC (permalink / raw)
  To: Ben Carey, io-uring; +Cc: linux-kernel

On 6/26/26 9:09 AM, Ben Carey wrote:
> From: benjamin.james.carey3@gmail.com
> 
> Hello, whomever this may concern.
> 
> I am working in a lab researching energy efficiency of I/O servicing and
> completion mechanisms, and we have encountered an issue when using io_uring and
> completing I/O requests while polling NVMe drives.
> 
> Description
> ===========
> 
> When using fio to run io_uring test benches for energy consumption analysis
> on our lab server, we're encountering strange kernel locking behaviors as
> numjobs increases.
> 
> This issue occurs on our workloads the poll for I/O completion. Specifically,
> whenever the numjobs parameter scales to beyond the nvme.poll_queues
> parameter, the job takes much longer to complete or doesn't complete at all.
> 
> Notably, this issue occurs also on a QEMU image mimicking our setup. Using GDB
> to read dmesg output we get the following:
> 
> ...
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-7): P1070
> rcu: 	(detected by 7, t=252035 jiffies, g=1985, q=25149 ncpus=8)
> task:fio             state:R  running task     stack:13296 pid:1070  tgid:1070  ppid:1068   task_flags:0x400140 flags:0x00080000
> Call Trace:
> ...
> ? blk_hctx_poll+0x34/0x80
> blk_mq_poll+0x2b/0x40
> bio_poll+0x94/0x180
> iocb_bio_iopoll+0x31/0x50
> io_uring_classic_poll+0x20/0x40
> io_do_iopoll+0x233/0x430
> ? io_issue_sqe+0x2f/0x560
> ? io_submit_sqes+0x270/0x820
> __do_sys_io_uring_enter+0x228/0x770
> ? handle_softirqs+0xc7/0x250
> __x64_sys_io_uring_enter+0x21/0x30
> x64_sys_call+0x17c8/0x1dd0
> do_syscall_64+0xe0/0x5a0
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Expected behavior
> =================
> 
> fio job completes after specified runtime.
> 
> Actual behavior
> ===============
> 
> fio job never completes, system becomes less responsive (if the number of poll
> queues and jobs are high) and RCU stall checker detects stalls.
> 
> Observations
> ============
> 
> After some minimal investigation we found this notable function being called as
> the callback for q->mq_ops->poll:
> 
> static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
> {
> 	struct nvme_queue *nvmeq = hctx->driver_data;
> 	bool found;
> 
> 	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
> 	    !nvme_cqe_pending(nvmeq))
> 		return 0;
> 
> 	spin_lock(&nvmeq->cq_poll_lock);
> 	found = nvme_poll_cq(nvmeq, iob);
> 	spin_unlock(&nvmeq->cq_poll_lock);
> 
> 	return found;
> }
> 
> This function, when stuck on the RCU loop, always returns 0. It also always
> calls the helper function nvme_cqe_pending.
> 
> Following this are some items that may help in reproducing this issue.
> 
> Steps to reproduce
> ==================
> From a running QEMU image with the latest kernel:
> 1. Attach GDB to the running instance.
> 2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).

That's not how that works at all. You need to set up poll queues on the
nvme driver side, using the nvme.poll_queues=XX kernel parameter, or if
using nvme as a module, load the module with poll_queues=XX where XX is
the number of poll queues. You're not doing any polled IO as-is, and the
above should also have dumped a dmesg message about how that does
absolutely nothing.

That said, it should still work, just not doing polled IO. I'll take a
look sometime next week, OOO right now.

-- 
Jens Axboe


* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 15:17 ` Jens Axboe
@ 2026-06-26 16:05   ` Keith Busch
  2026-06-26 16:06     ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2026-06-26 16:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ben Carey, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 09:17:35AM -0600, Jens Axboe wrote:
> On 6/26/26 9:09 AM, Ben Carey wrote:
> > From a running QEMU image with the latest kernel:
> > 1. Attach GDB to the running instance.
> > 2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
> 
> That's not how that works at all. You need to setup poll queues on the
> nvme driver side, using the nvme.poll_queues=XX kernel parameter, or if
> using nvme as a module, load the module with poll_queues=XX where XX is
> the number of poll queues. You're not doing any polled IO as-is, and the
> above should also have dumped a dmesg message about how that does
> absolutely nothing.
> 
> That said, it should still work, just not doing polled IO. I'll take a
> look sometime next week, OOO right now.

Yeah, the sysfs attribute does nothing, but Ben mentioned they had the
correct kernel command line:

  BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
    ro nvme.poll_queues=1 nokaslr

So they did enable polling, but the "echo" step is just confusing and
unnecessary.

I tried out the test, and there does appear to be a problem here, so I'm
looking into it.


* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:05   ` Keith Busch
@ 2026-06-26 16:06     ` Jens Axboe
  2026-06-26 16:33       ` Keith Busch
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2026-06-26 16:06 UTC (permalink / raw)
  To: Keith Busch; +Cc: Ben Carey, io-uring, linux-kernel

On 6/26/26 10:05 AM, Keith Busch wrote:
> On Fri, Jun 26, 2026 at 09:17:35AM -0600, Jens Axboe wrote:
>> On 6/26/26 9:09 AM, Ben Carey wrote:
>>> From a running QEMU image with the latest kernel:
>>> 1. Attach GDB to the running instance.
>>> 2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
>>
>> That's not how that works at all. You need to setup poll queues on the
>> nvme driver side, using the nvme.poll_queues=XX kernel parameter, or if
>> using nvme as a module, load the module with poll_queues=XX where XX is
>> the number of poll queues. You're not doing any polled IO as-is, and the
>> above should also have dumped a dmesg message about how that does
>> absolutely nothing.
>>
>> That said, it should still work, just not doing polled IO. I'll take a
>> look sometime next week, OOO right now.
> 
> Yeah, the sysfs attribute does nothing, but Ben mentioned they had the
> correct kernel command line:
> 
>   BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
>     ro nvme.poll_queues=1 nokaslr
> 
> So they did enable polling, but the "echo" step is just confusing and
> unnecessary.
> 
> I tried out the test, and there does appear to be a problem here, so I'm
> looking into it.

Ah good catch, I missed that. Should've grepped! In general, IO should
either get polled, or if the device is misbehaving, then timeouts will
catch it. That said, haven't looked at the actual report yet, will do
so next week (unless you beat me to it...?)

-- 
Jens Axboe



* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:06     ` Jens Axboe
@ 2026-06-26 16:33       ` Keith Busch
  2026-06-26 16:35         ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2026-06-26 16:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ben Carey, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 10:06:49AM -0600, Jens Axboe wrote:
> Ah good catch, I missed that. Should've grepped! In general, IO should
> either get polled, or if the device is misbehaving, then timeouts will
> catch it. That said, haven't looked at the actual report yet, will do
> so next week (unless you beat me to it...?)

I'll give it a shot!

The test has 1 polling queue with 2 jobs dispatching. One of the jobs
polled the completions for both. The other job is polling for no reason
at all with nothing outstanding. The only thing that can break us out of
that loop now is need_resched(), but that appears to never return true.
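
A rough userspace model of that situation, as a sketch only: struct model_cq,
model_reap and model_poll_loop are hypothetical names, and the real kernel
loop has no spin budget. The max_spins bound exists only so the demo
terminates, and resched_after stands in for need_resched() becoming true.

```c
#include <stdbool.h>

/* Hypothetical model of one poll queue shared by two jobs. */
struct model_cq {
	int pending;   /* completions waiting on the shared queue */
};

/* Reap everything on the queue, no matter which job submitted it. */
int model_reap(struct model_cq *cq)
{
	int found = cq->pending;

	cq->pending = 0;
	return found;
}

/*
 * Spin until a completion is found or the need_resched() stand-in fires
 * once resched_after iterations have passed. A negative resched_after
 * models need_resched() never returning true.
 *
 * Returns the iteration index at which the loop exited, or -1 if the demo
 * budget ran out (the real loop would keep spinning).
 */
long model_poll_loop(struct model_cq *cq, long resched_after, long max_spins)
{
	for (long spins = 0; spins < max_spins; spins++) {
		if (model_reap(cq) > 0)
			return spins;
		if (resched_after >= 0 && spins >= resched_after)
			return spins;
	}
	return -1;
}
```

With two completions pending, the first caller exits immediately and leaves
the queue empty. A second caller with a negative resched_after then exhausts
the budget and returns -1, while any non-negative resched_after lets it exit.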


* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:33       ` Keith Busch
@ 2026-06-26 16:35         ` Jens Axboe
  2026-06-26 16:48           ` Keith Busch
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2026-06-26 16:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: Ben Carey, io-uring, linux-kernel

On 6/26/26 10:33 AM, Keith Busch wrote:
> On Fri, Jun 26, 2026 at 10:06:49AM -0600, Jens Axboe wrote:
>> Ah good catch, I missed that. Should've grepped! In general, IO should
>> either get polled, or if the device is misbehaving, then timeouts will
>> catch it. That said, haven't looked at the actual report yet, will do
>> so next week (unless you beat me to it...?)
> 
> I'll give it a shot!
> 
> The test has 1 polling queue with 2 jobs dispatching. One of the job's
> polled the completions for both. The other job is polling for no reason
> at all with nothing outstanding. The only thing that can break us out of
> that loop now is need_resched(), but that appears to never return true.

Yes, it's a bad configuration. I bet it's as simple as:

https://lore.kernel.org/linux-block/20260617155051.1266079-1-anuj20.g@samsung.com/

but in practice nobody should configure a single poll queue and run
multiple jobs, particularly not when the object is framed around "energy
efficiency" as this configuration is pretty much guaranteed to waste 2
cores, with most of the time going towards spinning on a lock rather
than doing potentially useful work.

-- 
Jens Axboe


* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:35         ` Jens Axboe
@ 2026-06-26 16:48           ` Keith Busch
       [not found]             ` <CA+KFGSoyCSRzgamm-38oyAtEsqd7wZZ8awL79P40x7a819EK4w@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2026-06-26 16:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ben Carey, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 10:35:56AM -0600, Jens Axboe wrote:
> Yes, it's a bad configuration. I bet it's as simple as:
> 
> https://lore.kernel.org/linux-block/20260617155051.1266079-1-anuj20.g@samsung.com/

Yep, that's definitely the same problem. Thanks, I hadn't seen that
thread yet.


* Re: [BUG] RCU hang with io_uring nvme polling
       [not found]             ` <CA+KFGSoyCSRzgamm-38oyAtEsqd7wZZ8awL79P40x7a819EK4w@mail.gmail.com>
@ 2026-06-26 17:41               ` Ben Carey
  0 siblings, 0 replies; 8+ messages in thread
From: Ben Carey @ 2026-06-26 17:41 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 12:48 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Jun 26, 2026 at 10:35:56AM -0600, Jens Axboe wrote:
> > Yes, it's a bad configuration. I bet it's as simple as:
> >
> > https://lore.kernel.org/linux-block/20260617155051.1266079-1-anuj20.g@samsung.com/
>
> Yep, that's definitely the same problem. Thanks, I hadn't seen that
> thread yet.

(Resending this, didn't enable plain-text mode)

Jens, Keith,

I appreciate you both for the quick responses.

After applying that patch to the kernel, the fio job ran, but most of the
I/Os completed after about 2 milliseconds (makes sense, given the offset).

Part of me wonders if there's a race between the NVMe driver and the hardware
controller, but I can't back this up right now.

Ben Carey

