* [PATCH v4 0/1] nvme-tcp: fix possible data corruption caused by premature queue
@ 2025-04-29 16:39 Michael Liang
2025-04-29 16:42 ` [PATCH v4 1/1] nvme-tcp: fix possible data corruption caused by premature queue removal and I/O failover Michael Liang
0 siblings, 1 reply; 3+ messages in thread
From: Michael Liang @ 2025-04-29 16:39 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Michael Liang, Mohamed Khalfella, Randy Jennings, linux-nvme,
linux-kernel
Changes in v4:
- Change subject and add "Fixes" tag per suggestion;
- Link to v3: https://lore.kernel.org/linux-nvme/20250424161438.g2fyo4ozvburf2rh@purestorage.com/
Michael Liang (1):
nvme-tcp: fix possible data corruption caused by premature queue
removal and I/O failover
drivers/nvme/host/tcp.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
--
2.34.1
* [PATCH v4 1/1] nvme-tcp: fix possible data corruption caused by premature queue removal and I/O failover
2025-04-29 16:39 [PATCH v4 0/1] nvme-tcp: fix possible data corruption caused by premature queue Michael Liang
@ 2025-04-29 16:42 ` Michael Liang
2025-04-30 13:12 ` Christoph Hellwig
0 siblings, 1 reply; 3+ messages in thread
From: Michael Liang @ 2025-04-29 16:42 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Michael Liang, Mohamed Khalfella, Randy Jennings, linux-nvme,
linux-kernel
This patch addresses a data corruption issue observed in nvme-tcp during
testing.
Issue description:
In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
I/Os are canceled almost immediately after the kernel socket is shut down.
These canceled I/Os are reported as host path errors, triggering a failover
that succeeds on a different path.
However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC's TX queue). From the
user-space application's perspective, the buffers associated with the I/O are
considered completed once it is acknowledged on the other path, and they may
be reused for new I/O requests.
Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target, ultimately
causing data corruption.
This data corruption can be reproduced by injecting delay on one path and
triggering an I/O timeout.
To prevent this issue, this change ensures that all inflight transmissions are
fully completed from the host's perspective before returning from queue
stop. To handle concurrent I/O timeouts from multiple namespaces under
the same controller, always wait in queue stop regardless of the queue's state.
This aligns with the queue-stopping behavior of the other NVMe fabric transports.
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Michael Liang <mliang@purestorage.com>
---
drivers/nvme/host/tcp.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 72d260201d8c..aba365f97cf6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1946,7 +1946,7 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
cancel_work_sync(&queue->io_work);
}
-static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+static void nvme_tcp_stop_queue_nowait(struct nvme_ctrl *nctrl, int qid)
{
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
@@ -1965,6 +1965,31 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
mutex_unlock(&queue->queue_lock);
}
+static void nvme_tcp_wait_queue(struct nvme_ctrl *nctrl, int qid)
+{
+ struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+ struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+ int timeout = 100;
+
+ while (timeout > 0) {
+ if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags) ||
+ !sk_wmem_alloc_get(queue->sock->sk))
+ return;
+ msleep(2);
+ timeout -= 2;
+ }
+ dev_warn(nctrl->device,
+ "qid %d: timeout draining sock wmem allocation expired\n",
+ qid);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+ nvme_tcp_stop_queue_nowait(nctrl, qid);
+ nvme_tcp_wait_queue(nctrl, qid);
+}
+
+
static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
{
write_lock_bh(&queue->sock->sk->sk_callback_lock);
@@ -2032,7 +2057,9 @@ static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
int i;
for (i = 1; i < ctrl->queue_count; i++)
- nvme_tcp_stop_queue(ctrl, i);
+ nvme_tcp_stop_queue_nowait(ctrl, i);
+ for (i = 1; i < ctrl->queue_count; i++)
+ nvme_tcp_wait_queue(ctrl, i);
}
static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
--
2.34.1
* Re: [PATCH v4 1/1] nvme-tcp: fix possible data corruption caused by premature queue removal and I/O failover
2025-04-29 16:42 ` [PATCH v4 1/1] nvme-tcp: fix possible data corruption caused by premature queue removal and I/O failover Michael Liang
@ 2025-04-30 13:12 ` Christoph Hellwig
0 siblings, 0 replies; 3+ messages in thread
From: Christoph Hellwig @ 2025-04-30 13:12 UTC (permalink / raw)
To: Michael Liang
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Mohamed Khalfella, Randy Jennings, linux-nvme, linux-kernel
Thanks,
applied to nvme-6.15 with a bit of commit log tweaking to avoid the
overly long lines.