* [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
@ 2026-05-15 7:12 Chao Shi
2026-05-19 7:14 ` Christoph Hellwig
2026-05-20 20:26 ` Keith Busch
0 siblings, 2 replies; 8+ messages in thread
From: Chao Shi @ 2026-05-15 7:12 UTC (permalink / raw)
To: linux-nvme, Keith Busch
Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, Tatsuya Sasaki,
Maurizio Lombardi, linux-kernel, Sungwoo Kim, Dave Tian,
Weidong Zhu
nvme_keep_alive_work() always allocates with BLK_MQ_REQ_RESERVED, but
nvme_alloc_admin_tag_set() only sets reserved_tags for fabrics. Since
commit b58da2d270db ("nvme: update keep alive interval when kato is
modified"), userspace can start keep-alive on any transport via Set
Features (KATO), after which the allocation trips WARN_ON_ONCE() in
blk_mq_get_tag() and fails with -EWOULDBLOCK:

  nvme nvme0: keep-alive failed: -11

Per NVMe 2.0a section 5.27.1.12 and the transport binding wording,
PCIe MAY support KATO. Reserve one admin tag on all transports so
the host is ready when a controller accepts the feature. Fabrics
keeps two, the second being for the connect command.

A quirk-based approach was considered but no PCIe controller
documented to declare KAS != 0 was found (two enterprise SSDs tested
locally report KAS=0), so an allowlist has no entries today.

Link: https://lore.kernel.org/linux-nvme/20260428022911.1288485-1-coshi036@gmail.com/
Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")
Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct nvme_admin_cmd cmd = {0};
	int fd = open("/dev/nvme0", O_RDWR);

	if (fd < 0) { perror("open"); return 1; }
	cmd.opcode = 0x09;	/* SET_FEATURES */
	cmd.cdw10 = 0x0f;	/* Feature ID: KATO */
	cmd.cdw11 = 5;		/* KATO = 5 seconds */
	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("ioctl");
		return 1;
	}
	return 0;
}
Within ~kato/2 seconds after the program exits, dmesg shows:
nvme nvme0: keep alive interval updated from 0 ms to 5000 ms
WARNING: CPU: 0 PID: ... at block/blk-mq-tag.c:148 blk_mq_get_tag+...
nvme nvme0: keep-alive failed: -11
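The -11 above is -EWOULDBLOCK. As a rough illustration of why a reserved allocation cannot succeed when no tags are reserved, here is a minimal userspace sketch. It is a toy model, not the blk-mq implementation: `toy_tag_set`, `toy_get_tag` and the depth used in the usage note are hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: tags [0, reserved) form the reserved pool, the rest are normal. */
struct toy_tag_set {
	unsigned int depth;		/* total number of tags */
	unsigned int reserved;		/* tags set aside for reserved requests */
	unsigned int used_reserved;	/* reserved tags currently in flight */
	unsigned int used_normal;	/* normal tags currently in flight */
};

/*
 * Non-blocking allocation. Returns a tag number >= 0 on success, or
 * -EWOULDBLOCK when the selected pool has no free tag. With reserved == 0
 * a reserved request fails every time, which models the keep-alive failure.
 */
static int toy_get_tag(struct toy_tag_set *set, bool want_reserved)
{
	assert(set->reserved <= set->depth);

	if (want_reserved) {
		if (set->used_reserved >= set->reserved)
			return -EWOULDBLOCK;
		return (int)set->used_reserved++;
	}

	if (set->used_normal >= set->depth - set->reserved)
		return -EWOULDBLOCK;
	return (int)(set->reserved + set->used_normal++);
}
```

With `.reserved = 0` (the unpatched non-fabrics case in this model) a reserved request returns -EWOULDBLOCK immediately. With `.reserved = 1` the first reserved request gets tag 0 and normal requests start at tag 1.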
Changes since v1:
- Add spec citation (NVMe 2.0a 5.27.1.12 + transport binding wording)
clarifying that PCIe MAY support KATO.
- Discuss the quirk-based alternative suggested in v1 review and
note that no PCIe controller declaring KAS != 0 is documented
today (two enterprise SSDs tested locally report KAS=0).
- Add Link: to v1 thread.
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7bf228df6001..6db02ecde6d1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4850,8 +4850,13 @@ int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
 	set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+	/*
+	 * Reserve one tag for keep-alive, which is allocated with
+	 * BLK_MQ_REQ_RESERVED and can be enabled on any transport via the
+	 * KATO feature. Fabrics needs a second reserved tag for connect.
+	 */
+	set->reserved_tags = 1;
 	if (ctrl->ops->flags & NVME_F_FABRICS)
-		/* Reserved for fabric connect and keep alive */
 		set->reserved_tags = 2;
 	set->numa_node = ctrl->numa_node;
 	if (ctrl->ops->flags & NVME_F_BLOCKING)
--
2.43.0
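The effect of the hunk above on the admin tag budget can be sketched in isolation. This is only an illustration of the selection logic in the diff: `TOY_F_FABRICS` and both helper names are hypothetical stand-ins, and the queue depth is passed in as a parameter rather than taken from the driver.

```c
#include <assert.h>

/* Hypothetical stand-in for the driver's NVME_F_FABRICS transport flag. */
#define TOY_F_FABRICS (1u << 0)

/*
 * Reserved admin tags as chosen by the patch: one for keep-alive on every
 * transport, plus a second one for the connect command on fabrics.
 */
static unsigned int toy_reserved_admin_tags(unsigned int flags)
{
	return (flags & TOY_F_FABRICS) ? 2 : 1;
}

/* Tags left for ordinary admin commands after the reservation. */
static unsigned int toy_normal_admin_tags(unsigned int flags,
					  unsigned int queue_depth)
{
	unsigned int reserved = toy_reserved_admin_tags(flags);

	assert(reserved < queue_depth);
	return queue_depth - reserved;
}
```

The trade-off this makes visible is that non-fabrics transports give up one ordinary admin tag so that a reserved keep-alive allocation always has a tag to draw from.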
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-15 7:12 [PATCH v2] nvme: reserve a keep-alive admin tag for all transports Chao Shi
@ 2026-05-19 7:14 ` Christoph Hellwig
2026-05-20 20:26 ` Keith Busch
1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2026-05-19 7:14 UTC (permalink / raw)
To: Chao Shi
Cc: linux-nvme, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Jens Axboe, Tatsuya Sasaki, Maurizio Lombardi, linux-kernel,
Sungwoo Kim, Dave Tian, Weidong Zhu
On Fri, May 15, 2026 at 03:12:48AM -0400, Chao Shi wrote:
> A quirk-based approach was considered but no PCIe controller
> documented to declare KAS != 0 was found (two enterprise SSDs tested
> locally report KAS=0), so an allowlist has no entries today.
Quirking for spec allowed behavior sounds odd. If we care about
testing KA for PCIe it should be trivial to implement in nvmet-epf
in the kernel, but I'm not sure there is much of a point in that.
> Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):
Can you wire this up as a testcase in blktests?
The patch itself looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-15 7:12 [PATCH v2] nvme: reserve a keep-alive admin tag for all transports Chao Shi
2026-05-19 7:14 ` Christoph Hellwig
@ 2026-05-20 20:26 ` Keith Busch
2026-05-21 8:25 ` Christoph Hellwig
1 sibling, 1 reply; 8+ messages in thread
From: Keith Busch @ 2026-05-20 20:26 UTC (permalink / raw)
To: Chao Shi
Cc: linux-nvme, Christoph Hellwig, Sagi Grimberg, Jens Axboe,
Tatsuya Sasaki, Maurizio Lombardi, linux-kernel, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Fri, May 15, 2026 at 03:12:48AM -0400, Chao Shi wrote:
> Per NVMe 2.0a section 5.27.1.12 and the transport binding wording,
> PCIe MAY support KATO. Reserve one admin tag on all transports so
> the host is ready when a controller accepts the feature. Fabrics
> keeps two, the second being for the connect command.
>
> A quirk-based approach was considered but no PCIe controller
> documented to declare KAS != 0 was found (two enterprise SSDs tested
> locally report KAS=0), so an allowlist has no entries today.
I totally get it's optional for PCIe, but that also means it's the
host's option on whether it wants to use it, and there's no requirement
we have to. We just need the driver to react correctly when someone
tries to do it.
I am skeptical anyone would produce a PCIe device that supports it, but
let's say someone does: what is the use case motivating enabling this
optional feature in this driver? If it's just because the option is
there, then I think we can just reject the user command submitting the
feature for PCIe transports, like I earlier suggested. Requiring an
active command will just harm idle power states.
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-20 20:26 ` Keith Busch
@ 2026-05-21 8:25 ` Christoph Hellwig
2026-05-21 14:38 ` Keith Busch
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2026-05-21 8:25 UTC (permalink / raw)
To: Keith Busch
Cc: Chao Shi, linux-nvme, Christoph Hellwig, Sagi Grimberg,
Jens Axboe, Tatsuya Sasaki, Maurizio Lombardi, linux-kernel,
Sungwoo Kim, Dave Tian, Weidong Zhu
On Wed, May 20, 2026 at 02:26:13PM -0600, Keith Busch wrote:
> On Fri, May 15, 2026 at 03:12:48AM -0400, Chao Shi wrote:
> > Per NVMe 2.0a section 5.27.1.12 and the transport binding wording,
> > PCIe MAY support KATO. Reserve one admin tag on all transports so
> > the host is ready when a controller accepts the feature. Fabrics
> > keeps two, the second being for the connect command.
> >
> > A quirk-based approach was considered but no PCIe controller
> > documented to declare KAS != 0 was found (two enterprise SSDs tested
> > locally report KAS=0), so an allowlist has no entries today.
>
> I totally get it's optional for PCIe, but that also means it's the
> host's option on whether it wants to use it, and there's no requirement
> we have to. We just need the driver to react correctly when someone
> tries to do it.
>
> I am skeptical anyone would produce a PCIe device that supports it, but
> let's say someone does: what is the use case motivating enabling this
> optional feature in this driver? If it's just because the option is
> there, then I think we can just reject the user command submitting the
> feature for PCIe transports, like I earlier suggested. Requiring an
> active command will just harm idle power states.
I don't think that's quite the point. We'd have to add special
filtering to fix the reproducer. Compared to that just reserving
a tag and officially supporting the feature is much easier and a much
better story.
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-21 8:25 ` Christoph Hellwig
@ 2026-05-21 14:38 ` Keith Busch
2026-05-22 12:14 ` Christoph Hellwig
2026-05-22 15:33 ` Chao S
0 siblings, 2 replies; 8+ messages in thread
From: Keith Busch @ 2026-05-21 14:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chao Shi, linux-nvme, Sagi Grimberg, Jens Axboe, Tatsuya Sasaki,
Maurizio Lombardi, linux-kernel, Sungwoo Kim, Dave Tian,
Weidong Zhu
On Thu, May 21, 2026 at 10:25:49AM +0200, Christoph Hellwig wrote:
> On Wed, May 20, 2026 at 02:26:13PM -0600, Keith Busch wrote:
> > I am skeptical anyone would produce a PCIe device that supports it, but
> > let's say someone does: what is the use case motivating enabling this
> > optional feature in this driver? If it's just because the option is
> > there, then I think we can just reject the user command submitting the
> > feature for PCIe transports, like I earlier suggested. Requiring an
> > active command will just harm idle power states.
>
> I don't think that's quite the point. We'd have to add special
> filtering to fix the reproducer. Compared to that just reserving
> a tag and officially supporting the feature is much easier and a much
> better story.
This command is already special since we filter for it on the completion
side. We may want to selectively filter other Set Feature commands too.
For example, we don't want user space turning on Host Dispersed
Namespace Support, because this driver is not going to correctly react
to that one either.
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-21 14:38 ` Keith Busch
@ 2026-05-22 12:14 ` Christoph Hellwig
2026-05-22 15:32 ` Chao S
2026-05-22 15:33 ` Chao S
1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2026-05-22 12:14 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Chao Shi, linux-nvme, Sagi Grimberg,
Jens Axboe, Tatsuya Sasaki, Maurizio Lombardi, linux-kernel,
Sungwoo Kim, Dave Tian, Weidong Zhu
On Thu, May 21, 2026 at 08:38:29AM -0600, Keith Busch wrote:
> This command is already special since we filter for it on the completion
> side. We may want to selectively filter other Set Feature commands too.
> For example, we don't want user space turning on Host Dispersed
> Namespace Support, because this driver is not going to correctly react
> to that one either.
True. So maybe we should start filtering out all of these
"things will go wrong" commands. Chao, can you start on that for keep
alive? We can then extend it as needed.
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-22 12:14 ` Christoph Hellwig
@ 2026-05-22 15:32 ` Chao S
0 siblings, 0 replies; 8+ messages in thread
From: Chao S @ 2026-05-22 15:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-nvme, Sagi Grimberg, Jens Axboe,
Tatsuya Sasaki, Maurizio Lombardi, linux-kernel, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Fri, May 22, 2026, Christoph Hellwig wrote:
> True. So maybe we should start filtering out all of these
> "things will go wrong" commands. Chao, can you start on that for keep
> alive? We can then extend it as needed.
Sure, thank you so much for your help and comments!
Done. v3 adds the filter (nvme_passthru_cmd_allowed) in the
passthrough path, rejecting KATO Set Features on non-fabrics with
-EOPNOTSUPP, and is structured to extend to other features later.
The reserve-a-tag change from v1/v2 is dropped.
For the blktests testcase you asked about earlier: I will send that
separately against the blktests tree.
Sent as a new series:
https://lore.kernel.org/linux-nvme/20260522152807.2061501-1-coshi036@gmail.com/
Chao
* Re: [PATCH v2] nvme: reserve a keep-alive admin tag for all transports
2026-05-21 14:38 ` Keith Busch
2026-05-22 12:14 ` Christoph Hellwig
@ 2026-05-22 15:33 ` Chao S
1 sibling, 0 replies; 8+ messages in thread
From: Chao S @ 2026-05-22 15:33 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg, Jens Axboe,
Tatsuya Sasaki, Maurizio Lombardi, linux-kernel, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Thu, May 21, 2026, Keith Busch wrote:
> This command is already special since we filter for it on the completion
> side. We may want to selectively filter other Set Feature commands too.
> For example, we don't want user space turning on Host Dispersed
> Namespace Support, because this driver is not going to correctly react
> to that one either.
Thank you so much. I agree.
v3 implements this: a filter in the passthrough path
(nvme_passthru_cmd_allowed) that rejects Set Features commands the
driver is not prepared to handle, returning -EOPNOTSUPP. It starts
with KATO on non-fabrics and is structured so other features such as
Host Dispersed Namespace Support can be added as needed.
Thanks also for the idle-power-states point - that is now part of the
rationale in the commit message, since an active keep-alive on PCIe
would prevent deeper idle states for no benefit.
Sent as a new series:
https://lore.kernel.org/linux-nvme/20260522152807.2061501-1-coshi036@gmail.com/
Chao
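
For readers following along without the v3 series, the kind of passthrough filter described above might look roughly like the sketch below. This is not the v3 code: only the name nvme_passthru_cmd_allowed, the -EOPNOTSUPP return value, and the opcode and feature ID from the reproducer come from this thread. The `toy_` struct, the function signature and the `is_fabrics` parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ADMIN_SET_FEATURES	0x09	/* opcode used by the reproducer */
#define TOY_FEAT_KATO		0x0f	/* Keep Alive Timer feature ID */

/* Hypothetical, cut-down view of a user-submitted admin command. */
struct toy_admin_cmd {
	uint8_t opcode;
	uint32_t cdw10;
	uint32_t cdw11;
};

/*
 * Returns 0 if the command may be passed through, or -EOPNOTSUPP if it
 * would enable a feature the host driver is not prepared to handle.
 * Further feature IDs can be added as extra switch cases.
 */
static int toy_passthru_cmd_allowed(const struct toy_admin_cmd *cmd,
				    bool is_fabrics)
{
	assert(cmd != NULL);

	if (cmd->opcode != TOY_ADMIN_SET_FEATURES)
		return 0;

	/* The Feature Identifier occupies bits 7:0 of CDW10. */
	switch (cmd->cdw10 & 0xff) {
	case TOY_FEAT_KATO:
		return is_fabrics ? 0 : -EOPNOTSUPP;
	default:
		return 0;
	}
}
```

In this sketch the reproducer's command (opcode 0x09, cdw10 0x0f) is rejected when `is_fabrics` is false and allowed when it is true. Masking CDW10 keeps the check from being bypassed by setting the Save bit.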