* [PATCH] nvme-pci: check for valid request when polling for completions
@ 2024-09-02 13:07 Hannes Reinecke
From: Hannes Reinecke @ 2024-09-02 13:07 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke
When polling for completions from the timeout handler we traverse
over _all_ cqes and then fetch the request via blk_mq_tag_to_rq().
Unfortunately that function always returns a request, even if that
request has already been completed.
So we need to check whether the command is still in flight before
attempting to complete it.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/pci.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec..fc4c616516d6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1037,6 +1037,13 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq,
 		return;
 	}
 
+	if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT) {
+		dev_warn(nvmeq->dev->ctrl.device,
+			 "stale completion id %d on queue %d\n",
+			 command_id, le16_to_cpu(cqe->sq_id));
+		return;
+	}
+
 	trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
 	if (!nvme_try_complete_req(req, cqe->status, cqe->result) &&
 	    !blk_mq_add_to_batch(req, iob, nvme_req(req)->status,
--
2.35.3
* Re: [PATCH] nvme-pci: check for valid request when polling for completions
From: Sagi Grimberg @ 2024-09-02 17:04 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 02/09/2024 16:07, Hannes Reinecke wrote:
> When polling for completions from the timeout handler we traverse
> over _all_ cqes and then fetch the request via blk_mq_tag_to_rq().
> Unfortunately that function always returns a request, even if that
> request has already been completed.
> So we need to check whether the command is still in flight before
> attempting to complete it.
Hannes, are you able to trigger a use-after-free here? That is the
inevitable outcome of completing an already completed request.
So why not call it out as such?
Was this seen in practice? Do you have a stack trace?
I ask because it makes the review a lot clearer.
* Re: [PATCH] nvme-pci: check for valid request when polling for completions
From: Hannes Reinecke @ 2024-09-03 6:25 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/2/24 19:04, Sagi Grimberg wrote:
>
> On 02/09/2024 16:07, Hannes Reinecke wrote:
>> When polling for completions from the timeout handler we traverse
>> over _all_ cqes and then fetch the request via blk_mq_tag_to_rq().
>> Unfortunately that function always returns a request, even if that
>> request has already been completed.
>> So we need to check whether the command is still in flight before
>> attempting to complete it.
>
> Hannes, are you able to trigger a use-after-free here? That is the
> inevitable outcome of completing an already completed request.
> So why not call it out as such?
>
Okay, will do.
> Was this seen in practice? Do you have a stack trace?
> I ask because it makes the review a lot clearer.
Here it is:
crash> bt
PID: 183 TASK: ffff8d3d06150000 CPU: 3 COMMAND: "kworker/3:1H"
#0 [ffff9e83c0427930] machine_kexec at ffffffff9de840f3
#1 [ffff9e83c0427988] __crash_kexec at ffffffff9df8736a
#2 [ffff9e83c0427a48] crash_kexec at ffffffff9df88264
#3 [ffff9e83c0427a58] oops_end at ffffffff9de42068
#4 [ffff9e83c0427a78] do_trap at ffffffff9de3e53a
#5 [ffff9e83c0427ac0] do_error_trap at ffffffff9de3e744
#6 [ffff9e83c0427b00] exc_invalid_op at ffffffff9e869bb3
#7 [ffff9e83c0427b20] asm_exc_invalid_op at ffffffff9ea00e4d
[exception RIP: __slab_free+698]
RIP: ffffffff9e124e2a RSP: ffff9e83c0427bd0 RFLAGS: 00010246
RAX: ffff8d3f51141800 RBX: 0000000000080006 RCX: ffff8d3f51141000
RDX: ffff8d3f51141000 RSI: fffffb0f4d445000 RDI: ffff8d3d00042d00
RBP: ffff9e83c0427c80 R8: 0000000000000001 R9: ffffffffc04ff496
R10: 0000000000000000 R11: ffff8d3f51141000 R12: fffffb0f4d445000
R13: ffff8d3f51141000 R14: ffff8d3d00042d00 R15: ffff8d3f51141000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff9e83c0427bf8] dev_printk_emit at ffffffff9e867920
#9 [ffff9e83c0427c88] nvme_pci_complete_rq at ffffffffc04ff496 [nvme]
#10 [ffff9e83c0427cb0] nvme_poll_irqdisable at ffffffffc0501496 [nvme]
#11 [ffff9e83c0427ce8] nvme_timeout at ffffffffc0503306 [nvme]
#12 [ffff9e83c0427d68] blk_mq_check_expired at ffffffff9e2faa8a
#13 [ffff9e83c0427d78] bt_iter at ffffffff9e301349
#14 [ffff9e83c0427db0] blk_mq_queue_tag_busy_iter at ffffffff9e301f35
#15 [ffff9e83c0427e60] blk_mq_timeout_work at ffffffff9e2fbe6e
#16 [ffff9e83c0427e98] process_one_work at ffffffff9ded1f64
#17 [ffff9e83c0427ed8] worker_thread at ffffffff9ded216d
#18 [ffff9e83c0427f10] kthread at ffffffff9ded9b54
The panic was triggered at:

# mm/slub.c
374 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
375 {
376 	unsigned long freeptr_addr = (unsigned long)object + s->offset;
377 
378 #ifdef CONFIG_SLAB_FREELIST_HARDENED
379 	BUG_ON(object == fp); /* naive detection of double free or corruption */
380 #endif

So the request was completed (and its resources freed) twice, and the
hardened freelist check caught the resulting double free.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH] nvme-pci: check for valid request when polling for completions
From: Keith Busch @ 2024-09-03 15:14 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig, linux-nvme
On Tue, Sep 03, 2024 at 08:25:08AM +0200, Hannes Reinecke wrote:
> On 9/2/24 19:04, Sagi Grimberg wrote:
> > On 02/09/2024 16:07, Hannes Reinecke wrote:
> > > When polling for completions from the timeout handler we traverse
> > > over _all_ cqes and then fetch the request via blk_mq_tag_to_rq().
> > > Unfortunately that function always returns a request, even if that
> > > request has already been completed.
> > > So we need to check whether the command is still in flight before
> > > attempting to complete it.
So the very same command was completed in some other context? We've
disabled the queue's interrupt here, so there should be no other context
that can concurrently complete it. The timeout poll is supposed to
look only at unseen cqes, not "all" of them. Is disable_irq() not a
sufficient barrier for accessing the cq head or something?
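
For reference, the unlocked path under discussion looks roughly like
this (a sketch reconstructed from the context of the diff in the
follow-up message, not a verbatim copy of the driver; the point being
that disable_irq() only quiesces this queue's own interrupt handler
and does not exclude another CPU polling the same completion queue):

static void nvme_poll_irqdisable(struct nvme_queue *nvmeq)
{
	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);

	WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));

	/* masks this queue's IRQ handler, nothing more */
	disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
	nvme_poll_cq(nvmeq, NULL);
	enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
}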
* Re: [PATCH] nvme-pci: check for valid request when polling for completions
From: Keith Busch @ 2024-09-03 20:07 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig, linux-nvme
On Tue, Sep 03, 2024 at 09:14:27AM -0600, Keith Busch wrote:
> On Tue, Sep 03, 2024 at 08:25:08AM +0200, Hannes Reinecke wrote:
> > On 9/2/24 19:04, Sagi Grimberg wrote:
> > > On 02/09/2024 16:07, Hannes Reinecke wrote:
> > > > When polling for completions from the timeout handler we traverse
> > > > over _all_ cqes and then fetch the request via blk_mq_tag_to_rq().
> > > > Unfortunately that function always returns a request, even if that
> > > > request has already been completed.
> > > > So we need to check whether the command is still in flight before
> > > > attempting to complete it.
>
> So the very same command was completed in some other context? We've
> disabled the queue's interrupt here, so there should be no other context
> that can concurrently complete it. The timeout poll is supposed to
> look only at unseen cqes, not "all" of them. Is disable_irq() not a
> sufficient barrier for accessing the cq head or something?
Ooo, I think I see a problem. Does your device have more than one
namespace? I think we need to lock this queue in that case because
the timeout work executes per-namespace, and we don't lock against
that today. If you do have a multi-namespace controller, does the below
fix your observation?
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec3..2c73ccd21afe3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1109,9 +1109,11 @@ static void nvme_poll_irqdisable(struct nvme_queue *nvmeq)
 
 	WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
 
+	spin_lock(&nvmeq->cq_poll_lock);
 	disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
 	nvme_poll_cq(nvmeq, NULL);
 	enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+	spin_unlock(&nvmeq->cq_poll_lock);
 }
 
 static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
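
To see why the lock matters: with more than one namespace there is one
timeout work per request_queue, so two timeout workers can call
nvme_poll_irqdisable() on the same nvme_queue concurrently. A minimal
userspace analogue of the race (purely illustrative; names like cq_head
and cq_poll_lock are borrowed from the kernel for clarity, none of this
is driver code):

#include <pthread.h>
#include <stdio.h>

static int cq[2] = { 42, 43 };	/* pretend CQEs */
static int cq_head;		/* shared consumer index */
static pthread_mutex_t cq_poll_lock = PTHREAD_MUTEX_INITIALIZER;

static void *poll_cq(void *name)
{
	/* Without this lock, both workers can load the same cq_head
	 * before either increments it, handing the same "CQE" to the
	 * completion path twice; in the kernel that is a double
	 * completion and ultimately the double free in the backtrace. */
	pthread_mutex_lock(&cq_poll_lock);
	int entry = cq[cq_head];	/* read... */
	cq_head++;			/* ...then advance: not atomic */
	pthread_mutex_unlock(&cq_poll_lock);
	printf("%s completed CQE %d\n", (const char *)name, entry);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, poll_cq, "worker1");
	pthread_create(&t2, NULL, poll_cq, "worker2");
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}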
* Re: [PATCH] nvme-pci: check for valid request when polling for completions
From: Daniel Wagner @ 2025-04-28 13:38 UTC (permalink / raw)
To: Keith Busch
Cc: Hannes Reinecke, Sagi Grimberg, Hannes Reinecke,
Christoph Hellwig, linux-nvme
Hi Keith,
On Tue, Sep 03, 2024 at 02:07:29PM -0600, Keith Busch wrote:
> On Tue, Sep 03, 2024 at 09:14:27AM -0600, Keith Busch wrote:
> > On Tue, Sep 03, 2024 at 08:25:08AM +0200, Hannes Reinecke wrote:
> > > On 9/2/24 19:04, Sagi Grimberg wrote:
> > > > On 02/09/2024 16:07, Hannes Reinecke wrote:
> > > > > When polling for completions from the timeout handler we traverse
> > > > > over _all_ cqes and then fetch the request via blk_mq_tag_to_rq().
> > > > > Unfortunately that function always returns a request, even if that
> > > > > request has already been completed.
> > > > > So we need to check whether the command is still in flight before
> > > > > attempting to complete it.
> >
> > So the very same command was completed in some other context? We've
> > disabled the queue's interrupt here, so there should be no other context
> > that can concurrently complete it. The timeout poll is supposed to
> > look only at unseen cqes, not "all" of them. Is disable_irq() not a
> > sufficient barrier for accessing the cq head or something?
>
> Ooo, I think I see a problem. Does your device have more than one
> namespace? I think we need to lock this queue in that case because
> the timeout work executes per-namespace, and we don't lock against
> that today. If you do have a multi-namespace controller, does the below
> fix your observation?
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6cd9395ba9ec3..2c73ccd21afe3 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1109,9 +1109,11 @@ static void nvme_poll_irqdisable(struct nvme_queue *nvmeq)
>  
>  	WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
>  
> +	spin_lock(&nvmeq->cq_poll_lock);
>  	disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>  	nvme_poll_cq(nvmeq, NULL);
>  	enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
> +	spin_unlock(&nvmeq->cq_poll_lock);
>  }
Ah sorry, this one got lost in my todo list.
We got positive feedback from our customer. This indeed fixes the
problem. Are you going to send out a proper patch, or do you want me to
do it?
Thanks,
Daniel