* [PATCH] nvme-pci: fix potential I/O hang when CQ is full
@ 2026-02-09 12:10 Junnan Zhang
2026-02-10 15:57 ` Christoph Hellwig
2026-02-11 9:47 ` Junnan Zhang
0 siblings, 2 replies; 5+ messages in thread
From: Junnan Zhang @ 2026-02-09 12:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Junnan Zhang, Shouxin Sun, Junnan Zhang, Qiliang Yuan,
Zhaolong Zhang, Yaxuan Liu, linux-nvme, linux-kernel
When an NVMe interrupt is triggered, the current implementation first
handles the CQE and then updates the CQ head. This opens a timing
window in which the lower layer can observe a full CQ, which in turn
leads to an I/O hang, as described below:
1. NVMe interrupt handling flow: nvme_handle_cqe -> nvme_pci_complete_rq
-> ... -> blk_mq_put_tag (when the request is not added to a completion
batch), which releases the tag and notifies the NVMe driver that it can
continue issuing commands.
2. The NVMe driver issues a new command while the CQ head has not yet
been updated.
3. The underlying layer finishes processing the new command immediately
and attempts to place it into the completion queue. It then detects that
the CQ is full and discards the command.
4. The NVMe interrupt flow from step 1 subsequently updates the CQ head.
The sequence diagram is as follows:
driver                  irq                    underlying(virtual/hardware)
------                  ------                 ------
1. Wait for tag
                        1. Read CQE            CQ is full, wait for head update
                        2. Handle CQE
                        3. Wake up tag
2. Get tag                 (blk_mq_put_tag)
3. Issue new cmd
                                               1. Process cmd
                                               2. Try write to CQ
                                               3. CQ is full, discard cmd!
                        4. Update CQ head
                           (LATE!)
4. Cmd timeout
In this scenario, the NVMe driver observes that the command never completes
and reports a hung task error.
[ 7128.239445] INFO: task kworker/u128:1:912 blocked for more than 122 seconds.
[ 7128.241536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7128.242675] task:kworker/u128:1 state:D stack: 0 pid: 912 ppid: 2 flags:0x00004000
[ 7128.243862] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 7128.244736] Call Trace:
[ 7128.245283] __schedule+0x2ea/0x640
[ 7128.245951] schedule+0x46/0xb0
[ 7128.246576] schedule_timeout+0x1a7/0x2b0
[ 7128.247364] ? __next_timer_interrupt+0x110/0x110
[ 7128.248281] io_schedule_timeout+0x4c/0x80
[ 7128.249129] wait_for_common_io.constprop.0+0x80/0xf0
[ 7128.250083] __nvme_disable_io_queues+0x14d/0x1a0 [nvme]
[ 7128.251084] ? nvme_del_queue_end+0x20/0x20 [nvme]
[ 7128.252016] nvme_dev_disable+0x20b/0x210 [nvme]
[ 7128.252908] nvme_remove+0x6d/0x1b0 [nvme]
[ 7128.253709] pci_device_remove+0x38/0xa0
[ 7128.254423] __device_release_driver+0x172/0x260
[ 7128.255189] device_release_driver+0x24/0x30
[ 7128.255937] pci_stop_bus_device+0x6c/0x90
[ 7128.256659] pci_stop_and_remove_bus_device+0xe/0x20
[ 7128.257594] disable_slot+0x49/0x90
[ 7128.258336] acpiphp_disable_and_eject_slot+0x15/0x90
[ 7128.259302] hotplug_event+0xc8/0x220
[ 7128.260080] ? acpiphp_post_dock_fixup+0xc0/0xc0
[ 7128.260993] acpiphp_hotplug_notify+0x20/0x40
[ 7128.261835] acpi_device_hotplug+0x8c/0x1d0
[ 7128.262702] acpi_hotplug_work_fn+0x3d/0x50
[ 7128.263532] process_one_work+0x1ad/0x350
[ 7128.264329] worker_thread+0x49/0x310
[ 7128.265114] ? rescuer_thread+0x370/0x370
[ 7128.265945] kthread+0xfb/0x140
[ 7128.266650] ? kthread_park+0x90/0x90
[ 7128.267435] ret_from_fork+0x1f/0x30
Reproducing method:
In a cloud-native environment, SPDK vfio-user emulates an NVMe disk for
VMs. The issue can be reliably reproduced by repeatedly attaching and
detaching the disk via a host shell script that calls virsh
attach-device and virsh detach-device. Once the issue is reproduced,
SPDK logs the following messages:
vfio_user.c:1856:post_completion: ERROR: cqid:0 full
(tail=6, head=7, inflight io: 2)
vfio_user.c:1116:fail_ctrlr: ERROR: failing controller
Correspondingly, the guest kernel reports a hung task error, displaying
the call stack detailed in the previous section.
Fix: update the CQ head first, then process the CQEs and clear the
bitmap, so that free slots are only advertised for further command
submission after the head has been updated.
Fixes: 324b494c2862 ("nvme-pci: Remove two-pass completions")
Signed-off-by: Shouxin Sun <sunshx@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn11@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Signed-off-by: Zhaolong Zhang <zhangzl68@chinatelecom.cn>
Signed-off-by: Yaxuan Liu <liuyx92@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn_dev@163.com>
---
drivers/nvme/host/pci.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d86f2565a92c..904f45761cd2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1438,7 +1438,9 @@ static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
 				struct io_comp_batch *iob)
 {
 	bool found = false;
+	u16 start, end;
 
+	start = nvmeq->cq_head;
 	while (nvme_cqe_pending(nvmeq)) {
 		found = true;
 		/*
@@ -1446,12 +1448,19 @@ static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
 		 * the cqe requires a full read memory barrier
 		 */
 		dma_rmb();
-		nvme_handle_cqe(nvmeq, iob, nvmeq->cq_head);
 		nvme_update_cq_head(nvmeq);
 	}
+	end = nvmeq->cq_head;
 
-	if (found)
+	if (found) {
 		nvme_ring_cq_doorbell(nvmeq);
+		while (start != end) {
+			nvme_handle_cqe(nvmeq, iob, start);
+			if (++start == nvmeq->q_depth)
+				start = 0;
+		}
+	}
+
 	return found;
 }
--
2.43.0
* Re: [PATCH] nvme-pci: fix potential I/O hang when CQ is full
2026-02-09 12:10 [PATCH] nvme-pci: fix potential I/O hang when CQ is full Junnan Zhang
@ 2026-02-10 15:57 ` Christoph Hellwig
2026-02-11 9:47 ` Junnan Zhang
1 sibling, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-02-10 15:57 UTC (permalink / raw)
To: Junnan Zhang
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Shouxin Sun, Junnan Zhang, Qiliang Yuan, Zhaolong Zhang,
Yaxuan Liu, linux-nvme, linux-kernel
We can't update the CQ head before consuming the CQEs, otherwise
the device can reuse them. And devices must not discard completions
when there is no completion queue entry, nvme does allow SQs and CQs
to be smaller than the number of outstanding commands.
* Re: [PATCH] nvme-pci: fix potential I/O hang when CQ is full
2026-02-09 12:10 [PATCH] nvme-pci: fix potential I/O hang when CQ is full Junnan Zhang
2026-02-10 15:57 ` Christoph Hellwig
@ 2026-02-11 9:47 ` Junnan Zhang
2026-02-11 12:27 ` Keith Busch
1 sibling, 1 reply; 5+ messages in thread
From: Junnan Zhang @ 2026-02-11 9:47 UTC (permalink / raw)
To: zhangjn_dev
Cc: axboe, hch, kbusch, linux-kernel, linux-nvme, liuyx92, sagi,
sunshx, yuanql9, zhangjn11, zhangzl68
On Tue, 10 Feb 2026 16:57:12 +0100, Christoph Hellwig wrote:
> We can't update the CQ head before consuming the CQEs, otherwise
> the device can reuse them. And devices must not discard completions
> when there is no completion queue entry, nvme does allow SQs and CQs
> to be smaller than the number of outstanding commands.
Updating the CQ head before consuming the CQE would not cause the device to
reuse these entries, as new commands can only be submitted by the driver after
the CQE is consumed. Therefore, the device does not have the opportunity
to reuse these entries.
Actually, the root cause of the issue is that the underlying device received
more commands from the NVMe driver than the queue depth (q_depth), leading
to a CQ full problem.
In my environment, the NVMe admin queue depth is 32, allowing a maximum of
32 commands to be processed concurrently. During the NVMe disk removal process,
the NVMe driver sends commands via the admin queue to delete all I/O queues.
When the NVMe driver has already submitted more than 32 commands, any additional
commands beyond 32 will wait for the previous ones to complete.
During NVMe interrupt handling, the current implementation first processes the
CQE and then updates the CQ head. The commands allocated by nvme_delete_queue
are not handled through the batch flow during the interrupt response. After a
CQE is consumed, the tag is released and the upper-layer NVMe driver is
notified; note that at this point the CQ head has not yet been updated, so the
previous completion is not yet fully retired. Upon receiving the notification,
the NVMe driver immediately submits a new command to the SQ. When the
underlying device completes that command and tries to write the result back to
the CQ while the CQ head is still not updated, the number of completions held
by the device exceeds the NVMe queue depth. Since there is no free slot in the
CQ for the completion, a CQ full error is reported.
The above process can be illustrated by the following diagram:
driver                  irq                    underlying(virtual/hardware)
------                  ------                 ------
1. Wait for tag
                        1. Read CQE            CQ is full, wait for head update
                        2. Handle CQE
                        3. Wake up tag
2. Get tag                 (blk_mq_put_tag)
3. Issue new cmd
                                               1. Process cmd
                                               2. Try write to CQ
                                               3. CQ is full, discard cmd!
                        4. Update CQ head
                           (LATE!)
4. Cmd timeout
Best regards,
Junnan Zhang
* Re: [PATCH] nvme-pci: fix potential I/O hang when CQ is full
2026-02-11 9:47 ` Junnan Zhang
@ 2026-02-11 12:27 ` Keith Busch
2026-02-12 9:42 ` Junnan Zhang
0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2026-02-11 12:27 UTC (permalink / raw)
To: Junnan Zhang
Cc: axboe, hch, linux-kernel, linux-nvme, liuyx92, sagi, sunshx,
yuanql9, zhangjn11, zhangzl68
On Wed, Feb 11, 2026 at 05:47:44PM +0800, Junnan Zhang wrote:
> On Tue, 10 Feb 2026 16:57:12 +0100, Christoph Hellwig wrote:
>
> > We can't update the CQ head before consuming the CQEs, otherwise
> > the device can reuse them. And devices must not discard completions
> > when there is no completion queue entry, nvme does allow SQs and CQs
> > to be smaller than the number of outstanding commands.
>
> Updating the CQ head before consuming the CQE would not cause the device to
> reuse these entries, as new commands can only be submitted by the driver after
> the CQE is consumed. Therefore, the device does not have the opportunity
> to reuse these entries.
That's just an artifact of how this host implementation constrains its
tag space. It's not a reflection of how the NVMe protocol fundamentally
works.
A full queue is not an error. It's a spec defined condition that the
submitter just has to deal with. The protocol was specifically made to
allow scenarios for dispatching more outstanding commands than the
queues can hold. Your controller is broken.
* Re: [PATCH] nvme-pci: fix potential I/O hang when CQ is full
2026-02-11 12:27 ` Keith Busch
@ 2026-02-12 9:42 ` Junnan Zhang
0 siblings, 0 replies; 5+ messages in thread
From: Junnan Zhang @ 2026-02-12 9:42 UTC (permalink / raw)
To: kbusch
Cc: axboe, hch, linux-kernel, linux-nvme, liuyx92, sagi, sunshx,
yuanql9, zhangjn11, zhangjn_dev, zhangzl68
On Wed, 11 Feb 2026 05:27:50 -0700, Keith Busch wrote:
> On Wed, Feb 11, 2026 at 05:47:44PM +0800, Junnan Zhang wrote:
> > On Tue, 10 Feb 2026 16:57:12 +0100, Christoph Hellwig wrote:
> >
> > > We can't update the CQ head before consuming the CQEs, otherwise
> > > the device can reuse them. And devices must not discard completions
> > > when there is no completion queue entry, nvme does allow SQs and CQs
> > > to be smaller than the number of outstanding commands.
> >
> > Updating the CQ head before consuming the CQE would not cause the device to
> > reuse these entries, as new commands can only be submitted by the driver after
> > the CQE is consumed. Therefore, the device does not have the opportunity
> > to reuse these entries.
>
> That's just an artifact of how this host implementation constrains its
> tag space. It's not a reflection of how the NVMe protocol fundamentally
> works.
>
> A full queue is not an error. It's a spec defined condition that the
> submitter just has to deal with. The protocol was specifically made to
> allow scenarios for dispatching more outstanding commands than the
> queues can hold. Your controller is broken.
Thank you very much. I understand your point. According to Section 3.3.1.2.1
Completion Queue Flow Control in the NVMe specification:
If there are no free slots in a Completion Queue, then the controller
shall not post status to that Completion Queue until slots become
available. In this case, the controller may stop processing additional
submission queue entries associated with the affected Completion Queue
until slots become available. The controller shall continue processing
for other Submission Queues not associated with the affected Completion
Queue.
Thus, a full queue is not an error. It is a condition defined by the specification
that the submitter must handle accordingly.
In practice, SPDK vfio-user also addresses this issue, as referenced
in the following link:
https://review.spdk.io/c/spdk/spdk/+/25473
During my testing with repeated NVMe drive attach and detach cycles, I observed
the following:
1. Latest kernel version 6.19 + unmodified SPDK: the issue occurs.
2. Latest kernel version 6.19 + modified SPDK: no issue.
3. Latest kernel version 6.19 with this NVMe patch applied + unmodified SPDK: no issue.
Test Environment:
A virtual machine uses SPDK vfio-user to pass through an emulated NVMe drive.
The VM has 64 vCPUs, and the backend supports an NVMe I/O queue depth of at
least 32. (Note: since the admin queue depth is 32, the issue only reproduces
when the queue depth is >= 32.) The issue occurs when repeatedly attaching and
detaching the drive on the host; reproducing it typically takes about 10
cycles. Each cycle consists of the following steps:
1. virsh attach-device <VM> <disk.xml>
2. sleep 1.5
3. virsh detach-device <VM> <disk.xml>
Given the third observation - that kernel 6.19 with this NVMe patch applied
plus unmodified SPDK shows no issues - I was wondering whether modifications
to the NVMe driver are necessary.
Your expert guidance would be greatly appreciated.
Best regards,
Junnan Zhang