All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset
@ 2026-05-27  7:53 guzebing
  2026-05-27  7:53 ` [RFC PATCH 1/1] " guzebing
  2026-05-27 13:19 ` [RFC PATCH 0/1] " Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: guzebing @ 2026-05-27  7:53 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel, Guzebing

From: Guzebing <guzebing@bytedance.com>

We have hit a case where an NVMe firmware activation made the controller
report a different CAP.MQES value after the following controller reset.
This was seen in production on a Memblaze PBlaze5 510 device:

  model:        P5510DS0384T00
  old firmware: 224005A0, CAP.MQES-derived queue depth 1024
  new firmware: 224005F0, CAP.MQES-derived queue depth 128

One way to hit this path is to activate the new firmware and then reset
the controller:

  nvme fw-download /dev/nvmeX -f fw_bin.tar
  nvme fw-activate /dev/nvmeX -s 2 -a 1
  nvme reset /dev/nvmeX

When the I/O queue depth derived from CAP.MQES became smaller after that
reset, the driver failed to recover any usable I/O queues.  In our
production kernel this was logged as:

  nvme nvme0: IO queues lost

The namespaces were then removed, so the corresponding block device
disappeared.  The opposite direction is less visible: if the CAP-derived
depth becomes larger, reset can complete without an error and the block
device can remain usable, but the live queue depth state is not updated
consistently.

The reason is that reset updates only part of the live queue depth state.
The nvme-pci reset path disables the controller, re-enables it, re-reads
CAP, and recalculates:

  dev->q_depth
  ctrl->sqsize

from the new CAP.MQES value.  Later, however, nvme_create_io_queues()
reuses the existing struct nvme_queue entries.  nvme_alloc_queue()
returns immediately when the queue already exists, so the old values
remain in:

  nvmeq->q_depth
  nvmeq->cq_dma_addr
  nvmeq->sq_dma_addr

Create CQ/SQ then requests queues with nvmeq->q_depth entries (encoded
in the command as nvmeq->q_depth - 1) and uses the old SQ/CQ DMA
addresses, not the newly computed dev->q_depth.  The blk-mq side also
keeps the old depth: the reset path updates the number of hardware
queues through blk_mq_update_nr_hw_queues(), but it does not resize the
existing tag set or update its queue_depth.

This explains the observed shrink failure and the expected grow case:

  * If the CAP-derived depth becomes smaller, the driver may try to create
    an I/O queue with the old larger nvmeq->q_depth.  A controller that
    now enforces the smaller CAP.MQES-derived limit can reject the Create
    CQ/SQ command.  If no I/O queues are recovered, nvme-pci removes the
    namespaces, so the block device disappears.

  * If the CAP-derived depth becomes larger, the old nvmeq->q_depth is
    still within the new controller limit.  Queue creation can therefore
    succeed and the device can remain usable, but the live state is
    inconsistent: dev->q_depth and ctrl->sqsize reflect the new capability
    while nvmeq queue resources and the blk-mq tag set still reflect the
    old depth.  The larger depth is not used until the controller is
    removed and probed again.

There are two broad ways to address this.

The direct fix would be to make reset recovery handle a changed live queue
depth.  That would require updating or rebuilding the nvmeq depth and
SQ/CQ DMA allocations, and resizing the block-layer depth state
consistently, including the blk-mq tag set, scheduler tags when present,
and queue->nr_requests.  That is broader than an nvme-pci-only change
and needs block layer review.

This RFC instead takes the smaller approach of detecting the reset-time
CAP.MQES change and making it visible.  If the live I/O queue depth
shrinks, reset recovery is failed before recreating I/O queues.  If it
grows, the driver warns and continues with the existing queue resources.

Feedback would be appreciated on whether this detection is useful on its
own, or whether nvme-pci should instead support full live queue-depth
resizing together with the required blk-mq changes.

Guzebing (1):
  nvme-pci: detect I/O queue depth changes after reset

 drivers/nvme/host/pci.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 1/1] nvme-pci: detect I/O queue depth changes after reset
  2026-05-27  7:53 [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset guzebing
@ 2026-05-27  7:53 ` guzebing
  2026-05-27 13:19 ` [RFC PATCH 0/1] " Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: guzebing @ 2026-05-27  7:53 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel, Guzebing

From: Guzebing <guzebing@bytedance.com>

Firmware activation may change the controller queue depth reported
through CAP.MQES. In the nvme-pci reset path, nvme_pci_enable()
rereads CAP and updates dev->q_depth, while existing struct nvme_queue
entries keep the old q_depth and SQ/CQ DMA addresses.

If the new depth is smaller than the existing nvmeq depth, reset recovery
would try to create I/O queues with a depth the controller no longer
accepts. Detect this before recreating I/O queues and fail the reset with
an explicit error; without this, the failure shows up later as lost I/O
queues and namespace removal.

If the new depth is larger, warn and continue with the existing queue
resources. The larger depth will not be used until the controller is
removed and probed again.

Signed-off-by: Guzebing <guzebing@bytedance.com>
---
 drivers/nvme/host/pci.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index db5fc9bf66272..4bc112f8a096e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3155,6 +3155,33 @@ static bool nvme_pci_update_nr_queues(struct nvme_dev *dev)
 	return true;
 }
 
+static int nvme_pci_check_reset_queue_depth(struct nvme_dev *dev)
+{
+	u32 nvmeq_q_depth;
+	u32 dev_q_depth = dev->q_depth;
+
+	if (dev->ctrl.queue_count <= 1)
+		return 0;
+
+	nvmeq_q_depth = dev->queues[1].q_depth;
+	if (nvmeq_q_depth == dev_q_depth)
+		return 0;
+
+	if (nvmeq_q_depth > dev_q_depth) {
+		dev_err(dev->ctrl.device,
+			"IO queue depth decreased after reset (%u -> %u); "
+			"live reset recovery is unsupported\n",
+			nvmeq_q_depth, dev_q_depth);
+		return -EIO;
+	}
+
+	dev_warn(dev->ctrl.device,
+		 "IO queue depth increased after reset (%u -> %u); "
+		 "remove and probe the controller again to use the new depth\n",
+		 nvmeq_q_depth, dev_q_depth);
+	return 0;
+}
+
 static int nvme_pci_enable(struct nvme_dev *dev)
 {
 	int result = -ENOMEM;
@@ -3371,6 +3398,9 @@ static void nvme_reset_work(struct work_struct *work)
 
 	mutex_lock(&dev->shutdown_lock);
 	result = nvme_pci_enable(dev);
+	if (result)
+		goto out_unlock;
+	result = nvme_pci_check_reset_queue_depth(dev);
 	if (result)
 		goto out_unlock;
 	nvme_unquiesce_admin_queue(&dev->ctrl);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset
  2026-05-27  7:53 [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset guzebing
  2026-05-27  7:53 ` [RFC PATCH 1/1] " guzebing
@ 2026-05-27 13:19 ` Christoph Hellwig
  2026-05-28  1:38   ` guzebing
  2026-05-28  3:03   ` Keith Busch
  1 sibling, 2 replies; 6+ messages in thread
From: Christoph Hellwig @ 2026-05-27 13:19 UTC (permalink / raw)
  To: guzebing; +Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, Guzebing

On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
> This RFC instead takes the smaller approach of detecting the reset-time
> CAP.MQES change and making it visible.  If the live I/O queue depth
> shrinks, reset recovery is failed before recreating I/O queues.  If it
> grows, the driver warns and continues with the existing queue resources.

Unlike the other version this at least sounds doable without creating
a complete mess.  So if we can live with this version that'd make me
much happier.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset
  2026-05-27 13:19 ` [RFC PATCH 0/1] " Christoph Hellwig
@ 2026-05-28  1:38   ` guzebing
  2026-05-28  3:03   ` Keith Busch
  1 sibling, 0 replies; 6+ messages in thread
From: guzebing @ 2026-05-28  1:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: kbusch, axboe, sagi, linux-nvme, linux-kernel, Guzebing


在 2026/5/27 21:19, Christoph Hellwig 写道:
> On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
>> This RFC instead takes the smaller approach of detecting the reset-time
>> CAP.MQES change and making it visible.  If the live I/O queue depth
>> shrinks, reset recovery is failed before recreating I/O queues.  If it
>> grows, the driver warns and continues with the existing queue resources.
> 
> Unlike the other version this at least sounds doable without creating
> a complete mess.  So if we can live with this version that'd make me
> much happier.

Thanks, Christoph.

Yes, that is the direction I would like to take here.

The goal of this RFC is to avoid the live queue-depth resize path for 
now and keep the reset recovery policy explicit: fail reset before 
recreating I/O queues if the CAP.MQES-derived depth shrinks, and warn 
but keep using the existing queue resources if it grows.

I will keep this lightweight approach unless there are objections, and 
will wait a bit for other comments before sending a non-RFC version.

Thanks,
Guzebing





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset
  2026-05-27 13:19 ` [RFC PATCH 0/1] " Christoph Hellwig
  2026-05-28  1:38   ` guzebing
@ 2026-05-28  3:03   ` Keith Busch
  2026-05-28  8:44     ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Keith Busch @ 2026-05-28  3:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: guzebing, axboe, sagi, linux-nvme, linux-kernel, Guzebing

On Wed, May 27, 2026 at 03:19:51PM +0200, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
> > This RFC instead takes the smaller approach of detecting the reset-time
> > CAP.MQES change and making it visible.  If the live I/O queue depth
> > shrinks, reset recovery is failed before recreating I/O queues.  If it
> > grows, the driver warns and continues with the existing queue resources.
> 
> Unlike the other version this at least sounds doable without creating
> a complete mess.  So if we can live with this version that'd make me
> much happier.

Abandoning the device here sounds pretty harsh when blk-mq already
supports user space decreasing the q-depth on a live queue. Why can't
the driver do the same thing?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset
  2026-05-28  3:03   ` Keith Busch
@ 2026-05-28  8:44     ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2026-05-28  8:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, guzebing, axboe, sagi, linux-nvme,
	linux-kernel, Guzebing

On Wed, May 27, 2026 at 09:03:02PM -0600, Keith Busch wrote:
> On Wed, May 27, 2026 at 03:19:51PM +0200, Christoph Hellwig wrote:
> > On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
> > > This RFC instead takes the smaller approach of detecting the reset-time
> > > CAP.MQES change and making it visible.  If the live I/O queue depth
> > > shrinks, reset recovery is failed before recreating I/O queues.  If it
> > > grows, the driver warns and continues with the existing queue resources.
> > 
> > Unlike the other version this at least sounds doable without creating
> > a complete mess.  So if we can live with this version that'd make me
> > much happier.
> 
> Abandoning the device here sounds pretty harsh when blk-mq already
> supports user space decreasing the q-depth on a live queue. Why can't
> the driver do the same thing?

Reducing capabilities with a firmware upgrade is a losing proposition,
there's just way too many things that can go wrong.

And reducing the queue depth is one of the haіriest operation in
blk-mq.  I wish we had never supported it, but allowing it with a
remove trigger sounds really bad.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-05-28  8:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-27  7:53 [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset guzebing
2026-05-27  7:53 ` [RFC PATCH 1/1] " guzebing
2026-05-27 13:19 ` [RFC PATCH 0/1] " Christoph Hellwig
2026-05-28  1:38   ` guzebing
2026-05-28  3:03   ` Keith Busch
2026-05-28  8:44     ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.