[BUG] RCU hang with io_uring nvme polling

* [BUG] RCU hang with io_uring nvme polling
@ 2026-06-26 15:09 Ben Carey
  2026-06-26 15:17 ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Carey @ 2026-06-26 15:09 UTC (permalink / raw)
  To: io-uring; +Cc: linux-kernel, axboe, stable, benjamin.james.carey3

From: benjamin.james.carey3@gmail.com

Hello, whomever this may concern.

I am working in a lab researching energy efficiency of I/O servicing and
completion mechanisms, and we have encountered an issue when using io_uring and
completing I/O requests while polling NVMe drives.

Description
===========

When using fio to run io_uring test benches for energy consumption analysis
on our lab server, we're encountering strange kernel locking behaviors as
numjobs increases.

This issue occurs on our workloads that poll for I/O completion. Specifically,
whenever the numjobs parameter exceeds the nvme.poll_queues parameter, the job
takes much longer to complete or doesn't complete at all.

Notably, this issue occurs also on a QEMU image mimicking our setup. Using GDB
to read dmesg output we get the following:

...
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-7): P1070
rcu: 	(detected by 7, t=252035 jiffies, g=1985, q=25149 ncpus=8)
task:fio             state:R  running task     stack:13296 pid:1070  tgid:1070  ppid:1068   task_flags:0x400140 flags:0x00080000
Call Trace:
...
? blk_hctx_poll+0x34/0x80
blk_mq_poll+0x2b/0x40
bio_poll+0x94/0x180
iocb_bio_iopoll+0x31/0x50
io_uring_classic_poll+0x20/0x40
io_do_iopoll+0x233/0x430
? io_issue_sqe+0x2f/0x560
? io_submit_sqes+0x270/0x820
__do_sys_io_uring_enter+0x228/0x770
? handle_softirqs+0xc7/0x250
__x64_sys_io_uring_enter+0x21/0x30
x64_sys_call+0x17c8/0x1dd0
do_syscall_64+0xe0/0x5a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Expected behavior
=================

fio job completes after specified runtime.

Actual behavior
===============

fio job never completes, system becomes less responsive (if the number of poll
queues and jobs are high) and RCU stall checker detects stalls.

Observations
============

After some minimal investigation we found this notable function being called as
the callback for q->mq_ops->poll:

static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	bool found;

	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
	    !nvme_cqe_pending(nvmeq))
		return 0;

	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_poll_cq(nvmeq, iob);
	spin_unlock(&nvmeq->cq_poll_lock);

	return found;
}

This function, when stuck on the RCU loop, always returns 0. It also always
calls the helper function nvme_cqe_pending.

Following this are some items that may help in reproducing this issue.

Steps to reproduce
==================
From a running QEMU image with the latest kernel:
1. Attach GDB to the running instance.
2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
3. Execute the fio job below.
4. After 1-2 minutes, observe RCU stalls.

Offending fio job
=================

fio --bs=1K --direct=1 --iodepth=1 --runtime=1 --rw=randread --time_based \
  --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
    --sqthread_poll=0 \
  --numjobs=2 --name=job0 --output-format=json --clocksource=clock_gettime \
  --filename=/dev/nvme0n1

Kernel config
=============

Start with x86_defconfig

The following options are enabled for ease of debugging with GDB and QEMU.

In "Kernel Hacking" do the following:
- Set "Compile-time checks and compiler options -> Debug options" to "Rely on
  the toolchain's implicit default DWARF version."
- Set "Compile-time checks and compiler options -> Provide GDB scripts for
  debugging" to Yes.
- Set "x86 Debugging -> Choose kernel unwinder" to "Frame pointer unwinder."

In "Processor types and features" do the following:
- Set "Randomize the address of the kernel image (KASLR)" to No.

The following options are enabled to support NVMe over PCIe.

In "Device Drivers" do the following:
- Set "PCI Support -> PCI Endpoint support" to Yes.
- Set "NVMe Support -> NVM Express block device" to Module.
- Set "NVMe Support -> NVMe Target Support" to Module.
- Set "NVMe Support -> NVMe PCI Endpoint Function target support" to Module.

Kernel command line
===================

BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
ro nvme.poll_queues=1 nokaslr \
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

(nokaslr may be unneeded.)

QEMU command line
=================
qemu-system-x86_64 \
  -m 4G -enable-kvm -monitor stdio -s -S -smp 8 \
  -device nvme,serial=deadbeef,drive=nvm \
  -drive file=disk.img,index=0,media=disk,if=virtio \
  -drive file=nvme.img,index=1,media=disk,if=none,id=nvm \
  -chardev socket,path=/tmp/port1,server=on,wait=off,id=port1-char \
  -device virtio-serial \
  -device virtserialport,id=port1,chardev=port1-char,name=org.fedoraproject.port.0 \
  -net user,hostfwd=tcp::10022-:22,hostfwd=tcp::45455-:45455 \
  -net nic

For us, disk.img and nvme.img are created via:
dd if=/dev/zero of=disk.img bs=4K count=5000000
dd if=/dev/zero of=nvme.img bs=4K count=2000000

We then format disk.img with ext4.

To install a test userspace we download an ISO of Ubuntu Server 26.04 and
append the filename as a parameter to the QEMU task. After installing it.

If you all think there's a better mailing list to which this should be sent,
please let me know. Also, please let me know if there are other details about
how to reproduce this issue or the system on which this issue appears, or if
you have any other questions.

Best wishes,
Benjamin Carey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 15:09 [BUG] RCU hang with io_uring nvme polling Ben Carey
@ 2026-06-26 15:17 ` Jens Axboe
  2026-06-26 16:05   ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2026-06-26 15:17 UTC (permalink / raw)
  To: Ben Carey, io-uring; +Cc: linux-kernel

On 6/26/26 9:09 AM, Ben Carey wrote:
> From: benjamin.james.carey3@gmail.com
> 
> Hello, whomever this may concern.
> 
> I am working in a lab researching energy efficiency of I/O servicing and
> completion mechanisms, and we have encountered an issue when using io_uring and
> completing I/O requests while polling NVMe drives.
> 
> Description
> ===========
> 
> When using fio to run io_uring test benches for energy consumption analysis
> on our lab server, we're encountering strange kernel locking behaviors as
> numjobs increases.
> 
> This issue occurs on our workloads that poll for I/O completion. Specifically,
> whenever the numjobs parameter exceeds the nvme.poll_queues parameter, the job
> takes much longer to complete or doesn't complete at all.
> 
> Notably, this issue occurs also on a QEMU image mimicking our setup. Using GDB
> to read dmesg output we get the following:
> 
> ...
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-7): P1070
> rcu: 	(detected by 7, t=252035 jiffies, g=1985, q=25149 ncpus=8)
> task:fio             state:R  running task     stack:13296 pid:1070  tgid:1070  ppid:1068   task_flags:0x400140 flags:0x00080000
> Call Trace:
> ...
> ? blk_hctx_poll+0x34/0x80
> blk_mq_poll+0x2b/0x40
> bio_poll+0x94/0x180
> iocb_bio_iopoll+0x31/0x50
> io_uring_classic_poll+0x20/0x40
> io_do_iopoll+0x233/0x430
> ? io_issue_sqe+0x2f/0x560
> ? io_submit_sqes+0x270/0x820
> __do_sys_io_uring_enter+0x228/0x770
> ? handle_softirqs+0xc7/0x250
> __x64_sys_io_uring_enter+0x21/0x30
> x64_sys_call+0x17c8/0x1dd0
> do_syscall_64+0xe0/0x5a0
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Expected behavior
> =================
> 
> fio job completes after specified runtime.
> 
> Actual behavior
> ===============
> 
> fio job never completes, system becomes less responsive (if the number of poll
> queues and jobs are high) and RCU stall checker detects stalls.
> 
> Observations
> ============
> 
> After some minimal investigation we found this notable function being called as
> the callback for q->mq_ops->poll:
> 
> static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
> {
> 	struct nvme_queue *nvmeq = hctx->driver_data;
> 	bool found;
> 
> 	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
> 	    !nvme_cqe_pending(nvmeq))
> 		return 0;
> 
> 	spin_lock(&nvmeq->cq_poll_lock);
> 	found = nvme_poll_cq(nvmeq, iob);
> 	spin_unlock(&nvmeq->cq_poll_lock);
> 
> 	return found;
> }
> 
> This function, when stuck on the RCU loop, always returns 0. It also always
> calls the helper function nvme_cqe_pending.
> 
> Following this are some items that may help in reproducing this issue.
> 
> Steps to reproduce
> ==================
> From a running QEMU image with the latest kernel:
> 1. Attach GDB to the running instance.
> 2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).

That's not how that works at all. You need to setup poll queues on the
nvme driver side, using the nvme.poll_queues=XX kernel parameter, or if
using nvme as a module, load the module with poll_queues=XX where XX is
the number of poll queues. You're not doing any polled IO as-is, and the
above should also have dumped a dmesg message about how that does
absolutely nothing.

That said, it should still work, just not doing polled IO. I'll take a
look sometime next week, OOO right now.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 15:17 ` Jens Axboe
@ 2026-06-26 16:05   ` Keith Busch
  2026-06-26 16:06     ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2026-06-26 16:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ben Carey, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 09:17:35AM -0600, Jens Axboe wrote:
> On 6/26/26 9:09 AM, Ben Carey wrote:
> > From a running QEMU image with the latest kernel:
> > 1. Attach GDB to the running instance.
> > 2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
> 
> That's not how that works at all. You need to setup poll queues on the
> nvme driver side, using the nvme.poll_queues=XX kernel parameter, or if
> using nvme as a module, load the module with poll_queues=XX where XX is
> the number of poll queues. You're not doing any polled IO as-is, and the
> above should also have dumped a dmesg message about how that does
> absolutely nothing.
> 
> That said, it should still work, just not doing polled IO. I'll take a
> look sometime next week, OOO right now.

Yeah, the sysfs attribute does nothing, but Ben mentioned they had the
correct kernel command line:

  BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
    ro nvme.poll_queues=1 nokaslr

So they did enable polling, but the "echo" step is just confusing and
unnecessary.

I tried out the test, and there does appear to be a problem here, so I'm
looking into it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:05   ` Keith Busch
@ 2026-06-26 16:06     ` Jens Axboe
  2026-06-26 16:33       ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2026-06-26 16:06 UTC (permalink / raw)
  To: Keith Busch; +Cc: Ben Carey, io-uring, linux-kernel

On 6/26/26 10:05 AM, Keith Busch wrote:
> On Fri, Jun 26, 2026 at 09:17:35AM -0600, Jens Axboe wrote:
>> On 6/26/26 9:09 AM, Ben Carey wrote:
>>> From a running QEMU image with the latest kernel:
>>> 1. Attach GDB to the running instance.
>>> 2. Enable io polling via sysfs (echo 1 > /sys/block/nvme0n1/queue/io_poll).
>>
>> That's not how that works at all. You need to setup poll queues on the
>> nvme driver side, using the nvme.poll_queues=XX kernel parameter, or if
>> using nvme as a module, load the module with poll_queues=XX where XX is
>> the number of poll queues. You're not doing any polled IO as-is, and the
>> above should also have dumped a dmesg message about how that does
>> absolutely nothing.
>>
>> That said, it should still work, just not doing polled IO. I'll take a
>> look sometime next week, OOO right now.
> 
> Yeah, the sysfs attribute does nothing, but Ben mentioned they had the
> correct kernel command line:
> 
>   BOOT_IMAGE=/vmlinuz-7.1.0-g3996771b8f75 root=/dev/mapper/ubuntu--vg-ubuntu--lv \
>     ro nvme.poll_queues=1 nokaslr
> 
> So they did enable polling, but the "echo" step is just confusing and
> unnecessary.
> 
> I tried out the test, and there does appear to be a problem here, so I'm
> looking into it.

Ah good catch, I missed that. Should've grepped! In general, IO should
either get polled, or if the device is misbehaving, then timeouts will
catch it. That said, haven't looked at the actual report yet, will do
so next week (unless you beat me to it...?)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:06     ` Jens Axboe
@ 2026-06-26 16:33       ` Keith Busch
  2026-06-26 16:35         ` Jens Axboe
  2026-06-29 20:47         ` Ben Carey
  0 siblings, 2 replies; 15+ messages in thread
From: Keith Busch @ 2026-06-26 16:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ben Carey, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 10:06:49AM -0600, Jens Axboe wrote:
> Ah good catch, I missed that. Should've grepped! In general, IO should
> either get polled, or if the device is misbehaving, then timeouts will
> catch it. That said, haven't looked at the actual report yet, will do
> so next week (unless you beat me to it...?)

I'll give it a shot!

The test has 1 polling queue with 2 jobs dispatching. One of the jobs
polled the completions for both. The other job is polling for no reason
at all with nothing outstanding. The only thing that can break us out of
that loop now is need_resched(), but that appears to never return true.
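
To make the failure mode concrete, here is a minimal userspace sketch of the
situation described above. It is not kernel code: model_queue,
model_poll_once and model_poll_loop are hypothetical names, resched_after is
only a stand-in for need_resched(), and max_spins exists purely so the model
terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_queue {
	int pending;	/* completions waiting in the single shared poll queue */
};

struct poll_result {
	int found;	/* completions reaped by this poller */
	long spins;	/* loop iterations executed */
	bool stuck;	/* true if only the model's safety cap ended the loop */
};

/* Reap every pending completion, as one poller does on behalf of all jobs. */
static int model_poll_once(struct model_queue *q)
{
	int found = q->pending;

	q->pending = 0;
	return found;
}

/*
 * Spin until something is found or the modelled scheduler asks us to yield.
 * A negative resched_after means the yield request never arrives.
 */
static struct poll_result model_poll_loop(struct model_queue *q,
					  long resched_after, long max_spins)
{
	struct poll_result r = { 0, 0, false };

	assert(max_spins > 0);
	while (r.spins < max_spins) {
		r.spins++;
		r.found = model_poll_once(q);
		if (r.found > 0)
			return r;
		/* Stand-in for need_resched(). */
		if (resched_after >= 0 && r.spins >= resched_after)
			return r;
	}
	r.stuck = true;
	return r;
}
```

With two completions pending, the first caller reaps both in one pass. A
second caller with a negative resched_after then finds nothing and leaves
only through the safety cap, which is the analogue of the stall in the
report.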

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:33       ` Keith Busch
@ 2026-06-26 16:35         ` Jens Axboe
  2026-06-26 16:48           ` Keith Busch
  2026-06-29 20:47         ` Ben Carey
  1 sibling, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2026-06-26 16:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: Ben Carey, io-uring, linux-kernel

On 6/26/26 10:33 AM, Keith Busch wrote:
> On Fri, Jun 26, 2026 at 10:06:49AM -0600, Jens Axboe wrote:
>> Ah good catch, I missed that. Should've grepped! In general, IO should
>> either get polled, or if the device is misbehaving, then timeouts will
>> catch it. That said, haven't looked at the actual report yet, will do
>> so next week (unless you beat me to it...?)
> 
> I'll give it a shot!
> 
> The test has 1 polling queue with 2 jobs dispatching. One of the jobs
> polled the completions for both. The other job is polling for no reason
> at all with nothing outstanding. The only thing that can break us out of
> that loop now is need_resched(), but that appears to never return true.

Yes, it's a bad configuration. I bet it's as simple as:

https://lore.kernel.org/linux-block/20260617155051.1266079-1-anuj20.g@samsung.com/

but in practice nobody should configure a single poll queue and run
multiple jobs, particularly not when the object is framed around "energy
efficiency" as this configuration is pretty much guaranteed to waste 2
cores, with most of the time going towards spinning on a lock rather
than doing potentially useful work.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:35         ` Jens Axboe
@ 2026-06-26 16:48           ` Keith Busch
       [not found]             ` <CA+KFGSoyCSRzgamm-38oyAtEsqd7wZZ8awL79P40x7a819EK4w@mail.gmail.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2026-06-26 16:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ben Carey, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 10:35:56AM -0600, Jens Axboe wrote:
> Yes, it's a bad configuration. I bet it's as simple as:
> 
> https://lore.kernel.org/linux-block/20260617155051.1266079-1-anuj20.g@samsung.com/

Yep, that's definitely the same problem. Thanks, I hadn't seen that
thread yet.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
       [not found]             ` <CA+KFGSoyCSRzgamm-38oyAtEsqd7wZZ8awL79P40x7a819EK4w@mail.gmail.com>
@ 2026-06-26 17:41               ` Ben Carey
  2026-07-03 17:20               ` Ben Carey
  1 sibling, 0 replies; 15+ messages in thread
From: Ben Carey @ 2026-06-26 17:41 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 12:48 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Jun 26, 2026 at 10:35:56AM -0600, Jens Axboe wrote:
> > Yes, it's a bad configuration. I bet it's as simple as:
> >
> > https://lore.kernel.org/linux-block/20260617155051.1266079-1-anuj20.g@samsung.com/
>
> Yep, that's definitely the same problem. Thanks, I hadn't seen that
> thread yet.

(Resending this, didn't enable plain-text mode)

Jens, Keith,

I appreciate you both for the quick responses.

After putting that patch into the kernel, the fio job ran, but most of the
I/Os completed after about 2 milliseconds (makes sense, given the offset).

Part of me wonders if there's a race between the NVMe driver and the hardware
controller, but I can't back this up right now.

Ben Carey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-26 16:33       ` Keith Busch
  2026-06-26 16:35         ` Jens Axboe
@ 2026-06-29 20:47         ` Ben Carey
  2026-06-29 21:40           ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Ben Carey @ 2026-06-29 20:47 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 12:33 PM Keith Busch <kbusch@kernel.org> wrote:
> The test has 1 polling queue with 2 jobs dispatching. One of the jobs
> polled the completions for both. The other job is polling for no reason
> at all with nothing outstanding. The only thing that can break us out of
> that loop now is need_resched(), but that appears to never return true.

Inspired by this I tried to find a place where one thread polls on a job that's
already finished. I found that a race to io_iopoll_check causes one thread to
enter the polling loop when another has already finished on it. Putting
io_iopoll_check behind a spinlock seems to fix it, though I imagine a more
elegant fix is out there (reusing a different lock, not using expensive locks,
a smarter place to check for racing, etc.)

The diff is as follows:

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 214fdbd49..e4f76fa74 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -406,6 +406,7 @@ struct io_ring_ctx {
        } ____cacheline_aligned_in_smp;

        spinlock_t              completion_lock;
+       spinlock_t              cq_poll_lock;

        struct list_head        cq_overflow_list;

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4d7bcbb97..b65e2b11a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -272,6 +272,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
        init_waitqueue_head(&ctx->cq_wait);
        init_waitqueue_head(&ctx->poll_wq);
        spin_lock_init(&ctx->completion_lock);
+       spin_lock_init(&ctx->cq_poll_lock);
        raw_spin_lock_init(&ctx->timeout_lock);
        INIT_LIST_HEAD(&ctx->iopoll_list);
        INIT_LIST_HEAD(&ctx->defer_list);
@@ -1243,7 +1244,13 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
                        if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
                                break;
                }
-               ret = io_do_iopoll(ctx, !min_events);
+               if (spin_trylock(&ctx->cq_poll_lock)) {
+                       ret = io_do_iopoll(ctx, !min_events);
+                       spin_unlock(&ctx->cq_poll_lock);
+               } else {
+                       ret = 0;
+               }
+
                if (unlikely(ret < 0))
                        return ret;

Best wishes,
Ben Carey
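
For readers following along, the gating in the diff above can be sketched in
userspace C. This is only a model under stated assumptions: model_ring,
model_trylock, model_unlock and model_iopoll_step are hypothetical names, a
C11 atomic_flag plays the role of cq_poll_lock, and events_ready stands in
for whatever io_do_iopoll would return.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-ring state; not the kernel's io_ring_ctx. */
struct model_ring {
	atomic_flag poll_busy;	/* plays the role of cq_poll_lock */
	int polls_run;		/* how many times the poll body executed */
};

/* Try to become the poller; returns true on success, like spin_trylock. */
static bool model_trylock(struct model_ring *ring)
{
	return !atomic_flag_test_and_set_explicit(&ring->poll_busy,
						  memory_order_acquire);
}

static void model_unlock(struct model_ring *ring)
{
	atomic_flag_clear_explicit(&ring->poll_busy, memory_order_release);
}

/*
 * Model of the patched branch: poll only if nobody else on this ring is
 * polling, otherwise report zero events instead of entering the loop.
 */
static int model_iopoll_step(struct model_ring *ring, int events_ready)
{
	int ret = 0;

	assert(events_ready >= 0);
	if (model_trylock(ring)) {
		ring->polls_run++;
		ret = events_ready;
		model_unlock(ring);
	}
	return ret;
}
```

A ring would be set up as `struct model_ring ring = { ATOMIC_FLAG_INIT, 0 };`.
Because the flag lives inside each model_ring, two separate rings never
exclude one another, which is the limitation raised in the replies.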

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-29 20:47         ` Ben Carey
@ 2026-06-29 21:40           ` Keith Busch
  2026-06-29 22:04             ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2026-06-29 21:40 UTC (permalink / raw)
  To: Ben Carey; +Cc: Jens Axboe, io-uring, linux-kernel

On Mon, Jun 29, 2026 at 04:47:00PM -0400, Ben Carey wrote:
> On Fri, Jun 26, 2026 at 12:33 PM Keith Busch <kbusch@kernel.org> wrote:
> > The test has 1 polling queue with 2 jobs dispatching. One of the jobs
> > polled the completions for both. The other job is polling for no reason
> > at all with nothing outstanding. The only thing that can break us out of
> > that loop now is need_resched(), but that appears to never return true.
> 
> Inspired by this I tried to find a place where one thread polls on a job that's
> already finished. I found that a race to io_iopoll_check causes one thread to
> enter the polling loop when another has already finished on it. Putting
> io_iopoll_check behind a spinlock seems to fix it, though I imagine a more
> elegant fix is out there (reusing a different lock, not using expensive locks,
> a smarter place to check for racing, etc.)

I can see why that resolves your observation, but I don't think we can
do this. We're ultimately polling for a hardware event, and this layer
is too high a level for serializing these things. This could be a
multi-device or multi-queue backing storage where the completion
pollings should occur concurrently.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-06-29 21:40           ` Keith Busch
@ 2026-06-29 22:04             ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2026-06-29 22:04 UTC (permalink / raw)
  To: Ben Carey; +Cc: Jens Axboe, io-uring, linux-kernel

On Mon, Jun 29, 2026 at 03:40:52PM -0600, Keith Busch wrote:
> On Mon, Jun 29, 2026 at 04:47:00PM -0400, Ben Carey wrote:
>
> > Putting
> > io_iopoll_check behind a spinlock seems to fix it, though I imagine a more
> > elegant fix is out there (reusing a different lock, not using expensive locks,
> > a smarter place to check for racing, etc.)
> 
> I can see why that resolves your observation, but I don't think we can
> do this. We're ultimately polling for a hardware event, and this layer
> is too high a level for serializing these things.

It's also worse than that; your proposal serializes within an
io_uring_ctx, so two completely different applications could have the
exact same problem you discovered.

I don't necessarily like the accepted solution as it is time bound on
jiffies for an idle device, which is an eternity for low-latency
storage, but what else can we do? It's too expensive to check for a
specific IO or idle on each polling iteration. I guess we're expecting a
hi-pri application is constantly feeding the queue such that this is a
non-issue.
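
As a rough illustration of the trade-off described above, here is a
userspace sketch of a poll loop that gives up once a tick-based deadline
passes. Everything in it is an assumption made for the example: model_clock
and model_bounded_poll are hypothetical names, one tick stands in for one
jiffy, spins_per_tick is an arbitrary figure, and counter wraparound is
ignored.

```c
#include <assert.h>

/* Hypothetical model: one "tick" stands in for one jiffy. */
struct model_clock {
	unsigned long now;		/* current tick */
	unsigned long spins_per_tick;	/* poll iterations that fit in one tick */
};

/*
 * Poll an idle queue until the deadline passes.
 * Returns the number of iterations spent finding nothing.
 */
static unsigned long model_bounded_poll(struct model_clock *clk,
					unsigned long timeout_ticks)
{
	unsigned long deadline = clk->now + timeout_ticks;
	unsigned long spins = 0;

	assert(clk->spins_per_tick > 0);
	while (clk->now < deadline) {
		spins++;
		/* Advance the modelled clock once per spins_per_tick spins. */
		if (spins % clk->spins_per_tick == 0)
			clk->now++;
	}
	return spins;
}
```

In this model an idle poller always terminates, but it still burns
spins_per_tick iterations for every tick of the timeout before it does.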

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
       [not found]             ` <CA+KFGSoyCSRzgamm-38oyAtEsqd7wZZ8awL79P40x7a819EK4w@mail.gmail.com>
  2026-06-26 17:41               ` Ben Carey
@ 2026-07-03 17:20               ` Ben Carey
  2026-07-04 17:01                 ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Ben Carey @ 2026-07-03 17:20 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-kernel

On Fri, Jun 26, 2026 at 1:33 PM Ben Carey
<benjamin.james.carey3@gmail.com> wrote:
>
> After putting that patch into the kernel, the fio job ran, but most of the
> I/Os completed after about 2 milliseconds (makes sense, given the offset).
>
>
> Ben Carey

While testing the patch we decided to trace the amount of time a workload
spends in blk_hctx_poll. We found that, for a test case with 8 jobs running for
10 seconds, it spent ~71% of its runtime in that function alone. We ran this
test with an Intel Optane mounted with NVMe over PCIe target but have observed
similar behavior on a VM, measured by:

perf record -F 99 -a -g -- \
  fio --bs=1K --direct=1 --iodepth=1 --runtime=10 --rw=randread --time_based \
    --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
      --sqthread_poll=0 \
    --numjobs=8 --name=job0 --output-format=json --clocksource=clock_gettime \
    --filename=/dev/nvme0n1

Again, this was tested with nvme.poll_queues=1, but similar behavior occurs
with higher poll_queues, and also on a VM.

This bug seems to pollute our experimental results, and thus stands as
something needing to be fixed for us to continue our research. Do you all think
there's a different solution than the timeout?

Thanks again,
Ben Carey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-07-03 17:20               ` Ben Carey
@ 2026-07-04 17:01                 ` Keith Busch
  2026-07-04 19:35                   ` Ben Carey
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2026-07-04 17:01 UTC (permalink / raw)
  To: Ben Carey; +Cc: Jens Axboe, io-uring, linux-kernel

On Fri, Jul 03, 2026 at 01:20:24PM -0400, Ben Carey wrote:
> While testing the patch we decided to trace the amount of time a workload
> spends in blk_hctx_poll. We found that, for a test case with 8 jobs running for
> 10 seconds, it spent ~71% of its runtime in that function alone. We ran this
> test with an Intel Optane mounted with NVMe over PCIe target but have observed
> similar behavior on a VM, measured by:
> 
> perf record -F 99 -a -g -- \
>   fio --bs=1K --direct=1 --iodepth=1 --runtime=10 --rw=randread --time_based \
>     --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
>       --sqthread_poll=0 \
>     --numjobs=8 --name=job0 --output-format=json --clocksource=clock_gettime \
>     --filename=/dev/nvme0n1
> 
> Again, this was tested with nvme.poll_queues=1, but similar behavior occurs
> with higher poll_queues, and also on a VM.
> 
> This bug seems to pollute our experimental results, and thus stands as
> something needing to be fixed for us to continue our research. Do you all think
> there's a different solution than the timeout?

What exactly do you have in mind? Shouldn't you expect to spend most of
your CPU time in the polling loop? As long as you keep the queues busy,
there's something to poll, so blk_hctx_poll is exactly where you want to
see the software be in a perf report. Seeing a high poll CPU utilization
means the software is efficient compared to the hardware. If we spend
very little time in the polling loop, then either you have incredibly
quick hardware, and let's face it, Optane SSDs are EOL and a generation
behind on link speeds so that's not gonna get there anymore, or our
software dispatch stack has an inefficiency somewhere.

If you have many pollers competing against a very low utilized queue,
then I think you have an application level problem mismatched to the
feature.

If you want to spend less time in the poll loop, then set the hybrid
poll sleep time. It should result in less polling time, but it'll push
your average latency higher.

The only case where the jiffy timeout may show a problem is when you stop
dispatching, which should only affect the time to close the ring when it
lost the polling race with a peer on the last IO it is looking for, but
should not affect individual IO latency.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-07-04 17:01                 ` Keith Busch
@ 2026-07-04 19:35                   ` Ben Carey
  2026-07-04 19:38                     ` Ben Carey
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Carey @ 2026-07-04 19:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-kernel

On Sat, Jul 4, 2026 at 1:01 PM Keith Busch <kbusch@kernel.org> wrote:
>
> What exactly do you have in mind? Shouldn't you expect to spend most of
> your CPU time in the polling loop? As long as you keep the queues busy,
> there's something to poll, so blk_hctx_poll is exactly where you want to
> see the software be in a perf report. Seeing a high poll CPU utilization
> means the software is efficient compared to the hardware. If we spend
> very little time in the polling loop, then either you have incredibly
> quick hardware, and let's face it, Optane SSDs are EOL and a generation
> behind on link speeds so that's not gonna get there anymore, or our
> software dispatch stack has an inefficiency somewhere.

This is a fair point, and I think it also exposes some flaws with using perf
runtime reports to justify a fix.

I'm most definitely not qualified to suggest this as a passable alternative,
but when polling a tagset, is there a way to check if the tagset's been
completed by another thread? Maybe break out if, for each polled request,
request->state == MQ_RQ_COMPLETE? I'm unsure how to translate the parameters in
blk_hctx_poll into the set of requests being waited on.
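
A userspace sketch of the check being asked about, purely to pin down the
idea. The enum and model_all_complete are hypothetical and only mirror the
shape of the request states named above; how a poller would obtain the set
of requests it is waiting on is exactly the open question and is assumed
away here by passing an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the request states; not the kernel's enum. */
enum model_rq_state {
	MODEL_RQ_IDLE,
	MODEL_RQ_IN_FLIGHT,
	MODEL_RQ_COMPLETE,
};

/*
 * Return true when none of the awaited requests is still outstanding,
 * i.e. the poller has nothing left to wait for and could leave the loop.
 * An empty set counts as "nothing outstanding".
 */
static bool model_all_complete(const enum model_rq_state *states, size_t n)
{
	assert(states != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (states[i] != MODEL_RQ_COMPLETE)
			return false;
	}
	return true;
}
```

Note that this scan is linear in the number of awaited requests and would
run on every polling iteration, which is the cost concern raised earlier in
the thread.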

> If you have many pollers competing against a very low utilized queue,
> then I think you have an application level problem mismatched to the
> feature.

You're right, the test case above doesn't give a fair representation of the
issue.

> The only case where the jiffy timeout may show a problem is when you stop
> dispatching, which should only affect the time to close the ring when it
> lost the polling race with a peer on the last IO it is looking for, but
> should not affect individual IO latency.

We measured per-IO disk latency with a simple kernel patch and confirmed
that individual IO latency is not impacted by the timeout issue.

We've seen, however, that the timeout can occur a large number of times even
with high queue saturation. When running the fio job below we observed the
timeout 132757 times, which I'm concerned could negatively impact bandwidth.

fio --bs=128K --direct=1 --iodepth=256 --runtime=200 --rw=randread \
    --time_based \
  --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
    --sqthread_poll=0 \
  --numjobs=32 --name=job0 --output-format=json --clocksource=clock_gettime \
  --filename=/dev/nvme0n1
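
For scale, assuming the job ran for its full --runtime=200 seconds and the
timeouts were spread evenly, 132757 timeouts works out to about 664 per
second overall, or roughly 21 per second for each of the 32 jobs. The
helpers below are hypothetical and only restate that arithmetic; they say
nothing about how long each timeout lasts.

```c
#include <assert.h>

/* Average event rate over the run; runtime_s must be positive. */
static double events_per_second(unsigned long events, double runtime_s)
{
	assert(runtime_s > 0.0);
	return (double)events / runtime_s;
}

/* Average rate per job, assuming events are spread evenly across jobs. */
static double events_per_second_per_job(unsigned long events,
					double runtime_s, unsigned int jobs)
{
	assert(jobs > 0);
	return events_per_second(events, runtime_s) / (double)jobs;
}
```

Calling events_per_second(132757, 200.0) and
events_per_second_per_job(132757, 200.0, 32) reproduces the two figures.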

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] RCU hang with io_uring nvme polling
  2026-07-04 19:35                   ` Ben Carey
@ 2026-07-04 19:38                     ` Ben Carey
  0 siblings, 0 replies; 15+ messages in thread
From: Ben Carey @ 2026-07-04 19:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-kernel

On Sat, Jul 4, 2026 at 3:35 PM Ben Carey
<benjamin.james.carey3@gmail.com> wrote:
> We've seen, however, that the timeout can occur a large number of times even
> with high queue saturation. When running the fio job below we observed the
> timeout 132757 times, which I'm concerned could negatively impact bandwidth.
>
> fio --bs=128K --direct=1 --iodepth=256 --runtime=200 --rw=randread \
>     --time_based \
>   --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
>     --sqthread_poll=0 \
>   --numjobs=32 --name=job0 --output-format=json --clocksource=clock_gettime \
>   --filename=/dev/nvme0n1

I should note this was while running in a VM issuing IO to a virtual NVMe
device.

^ permalink raw reply	[flat|nested] 15+ messages in thread
