From: Chao Shi <coshi036@gmail.com>
To: linux-nvme@lists.infradead.org, Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
Jens Axboe <axboe@kernel.dk>,
Tatsuya Sasaki <tatsuya6.sasaki@kioxia.com>,
Maurizio Lombardi <mlombard@arkamax.eu>,
linux-kernel@vger.kernel.org, Sungwoo Kim <iam@sung-woo.kim>,
Dave Tian <daveti@purdue.edu>, Weidong Zhu <weizhu@fiu.edu>
Subject: [PATCH v5] nvme: reject passthrough of driver-managed Set Features
Date: Sat, 23 May 2026 18:56:29 -0400 [thread overview]
Message-ID: <20260523225629.3964037-1-coshi036@gmail.com> (raw)
Since commit b58da2d270db ("nvme: update keep alive interval when kato
is modified"), userspace can start keep-alive on any transport via a
Set Features (KATO) passthrough command. nvme_keep_alive_work() then
allocates with BLK_MQ_REQ_RESERVED, but nvme_alloc_admin_tag_set()
only reserves admin tags for fabrics, so the allocation trips
WARN_ON_ONCE() in blk_mq_get_tag() and fails:
nvme nvme0: keep-alive failed: -11
More generally, several Set Features change controller state that the
driver manages itself and cannot react to correctly when set behind
its back from userspace. Reject these in nvme_cmd_allowed():
- KATO on non-fabrics (keep-alive is only armed for fabrics; on PCIe
it has no reserved tag and an active keep-alive harms idle power
states)
- Host Behavior Support, Host Memory Buffer, Number of Queues, and
Autonomous Power State Transition (all driver-managed)
Keep Alive on fabrics is unchanged. I/O commands are unaffected as the
check is confined to the admin path (ns == NULL).
Link: https://lore.kernel.org/linux-nvme/20260522162639.395802-1-coshi036@gmail.com/
Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")
Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
Reproducer for the keep-alive case (run as root on a PCIe NVMe device):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>
int main(void)
{
struct nvme_admin_cmd cmd = {0};
int fd = open("/dev/nvme0", O_RDWR);
if (fd < 0) { perror("open"); return 1; }
cmd.opcode = 0x09; /* SET_FEATURES */
cmd.cdw10 = 0x0f; /* Feature ID: KATO */
cmd.cdw11 = 5; /* KATO = 5 seconds */
if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
perror("ioctl");
return 1;
}
return 0;
}
On an unpatched kernel, within ~kato/2 seconds after the program exits,
dmesg shows:
nvme nvme0: keep alive interval updated from 0 ms to 5000 ms
WARNING: CPU: 0 PID: ... at block/blk-mq-tag.c:148 blk_mq_get_tag+...
nvme nvme0: keep-alive failed: -11
With this patch the ioctl fails with EACCES on non-fabrics.
Changes since v4:
- Fold the check into the existing nvme_cmd_allowed() instead of a
separate helper, and reject additional driver-managed Set Features
(Host Behavior, Host Memory Buffer, Number of Queues, Autonomous
Power State Transition) in the same switch (Keith Busch). The admin
vs I/O distinction is now structural: the switch lives in the
ns == NULL branch, so I/O commands (e.g. Dataset Management, which
shares opcode 0x09 with Set Features) are never inspected.
Changes since v3:
- Only inspect admin commands so a DSM I/O command is not wrongly
rejected (Keith Busch).
Changes since v2:
- Reject the KATO passthrough on non-fabrics instead of reserving an
admin tag for all transports (Keith Busch, Christoph Hellwig).
Changes since v1:
- v2 added a spec citation and quirk discussion, superseded by the
reject approach.
drivers/nvme/host/ioctl.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index a9c097dacad6..31784506e845 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -14,8 +14,9 @@ enum {
NVME_IOCTL_PARTITION = (1 << 1),
};
-static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
- unsigned int flags, bool open_for_write)
+static bool nvme_cmd_allowed(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+ struct nvme_command *c, unsigned int flags,
+ bool open_for_write)
{
u32 effects;
@@ -50,6 +51,26 @@ static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
case NVME_ID_CNS_CTRL:
return true;
}
+ } else if (c->common.opcode == nvme_admin_set_features) {
+ /*
+ * Reject Set Features that change controller state the
+ * driver manages itself; setting them behind the driver's
+ * back from userspace leaves it unable to react correctly.
+ * Keep Alive is only armed for fabrics - on other
+ * transports it has no reserved tag and harms idle power
+ * states.
+ */
+ switch (le32_to_cpu(c->features.fid) & 0xff) {
+ case NVME_FEAT_KATO:
+ if (ctrl->ops->flags & NVME_F_FABRICS)
+ break;
+ fallthrough;
+ case NVME_FEAT_HOST_BEHAVIOR:
+ case NVME_FEAT_HOST_MEM_BUF:
+ case NVME_FEAT_NUM_QUEUES:
+ case NVME_FEAT_AUTO_PST:
+ return false;
+ }
}
goto admin;
}
@@ -59,7 +80,7 @@ static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
* and marks this command as supported. If not reject unprivileged
* passthrough.
*/
- effects = nvme_command_effects(ns->ctrl, ns, c->common.opcode);
+ effects = nvme_command_effects(ctrl, ns, c->common.opcode);
if (!(effects & NVME_CMD_EFFECTS_CSUPP))
goto admin;
@@ -308,7 +329,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
c.common.cdw14 = cpu_to_le32(cmd.cdw14);
c.common.cdw15 = cpu_to_le32(cmd.cdw15);
- if (!nvme_cmd_allowed(ns, &c, 0, open_for_write))
+ if (!nvme_cmd_allowed(ctrl, ns, &c, 0, open_for_write))
return -EACCES;
if (cmd.timeout_ms)
@@ -355,7 +376,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
c.common.cdw14 = cpu_to_le32(cmd.cdw14);
c.common.cdw15 = cpu_to_le32(cmd.cdw15);
- if (!nvme_cmd_allowed(ns, &c, flags, open_for_write))
+ if (!nvme_cmd_allowed(ctrl, ns, &c, flags, open_for_write))
return -EACCES;
if (cmd.timeout_ms)
@@ -472,7 +493,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
- if (!nvme_cmd_allowed(ns, &c, 0, ioucmd->file->f_mode & FMODE_WRITE))
+ if (!nvme_cmd_allowed(ctrl, ns, &c, 0, ioucmd->file->f_mode & FMODE_WRITE))
return -EACCES;
d.metadata = READ_ONCE(cmd->metadata);
--
2.43.0
next reply other threads:[~2026-05-23 22:56 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-23 22:56 Chao Shi [this message]
2026-05-25 15:34 ` [PATCH v5] nvme: reject passthrough of driver-managed Set Features Tokunori Ikegami
2026-05-25 15:49 ` Keith Busch
2026-05-25 16:49 ` Tokunori Ikegami
2026-05-27 14:16 ` Christoph Hellwig
2026-05-27 14:32 ` Keith Busch
2026-05-28 8:43 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260523225629.3964037-1-coshi036@gmail.com \
--to=coshi036@gmail.com \
--cc=axboe@kernel.dk \
--cc=daveti@purdue.edu \
--cc=hch@lst.de \
--cc=iam@sung-woo.kim \
--cc=kbusch@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=mlombard@arkamax.eu \
--cc=sagi@grimberg.me \
--cc=tatsuya6.sasaki@kioxia.com \
--cc=weizhu@fiu.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox