Linux kernel -stable discussions
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Michael Liang <mliang@purestorage.com>,
	Mohamed Khalfella <mkhalfella@purestorage.com>,
	Randy Jennings <randyj@purestorage.com>,
	Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.15 34/55] nvme-tcp: fix premature queue removal and I/O failover
Date: Wed,  7 May 2025 20:39:35 +0200	[thread overview]
Message-ID: <20250507183800.413811673@linuxfoundation.org> (raw)
In-Reply-To: <20250507183759.048732653@linuxfoundation.org>

5.15-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michael Liang <mliang@purestorage.com>

[ Upstream commit 77e40bbce93059658aee02786a32c5c98a240a8a ]

This patch addresses a data corruption issue observed in nvme-tcp during
testing.

In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.

However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC’s TX queue). From the
user-space app's perspective, the buffer associated with the I/O is
considered completed since they're acked on the different path and may
be reused for new I/O requests.

Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.

We can reproduce this data corruption by injecting delay on one path and
triggering i/o timeout.

To prevent this issue, this change ensures that all inflight
transmissions are fully completed from host's perspective before
returning from queue stop. To handle concurrent I/O timeout from multiple
namespaces under the same controller, always wait in queue stop
regardless of queue's state.

This aligns with the behavior of queue stopping in other NVMe fabric
transports.

Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/tcp.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0fc5aba88bc15..99bf17f2dcfca 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1602,7 +1602,7 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	cancel_work_sync(&queue->io_work);
 }
 
-static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+static void nvme_tcp_stop_queue_nowait(struct nvme_ctrl *nctrl, int qid)
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
@@ -1613,6 +1613,31 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 	mutex_unlock(&queue->queue_lock);
 }
 
+static void nvme_tcp_wait_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	int timeout = 100;
+
+	while (timeout > 0) {
+		if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags) ||
+		    !sk_wmem_alloc_get(queue->sock->sk))
+			return;
+		msleep(2);
+		timeout -= 2;
+	}
+	dev_warn(nctrl->device,
+		 "qid %d: timeout draining sock wmem allocation expired\n",
+		 qid);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	nvme_tcp_stop_queue_nowait(nctrl, qid);
+	nvme_tcp_wait_queue(nctrl, qid);
+}
+
+
 static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
 {
 	write_lock_bh(&queue->sock->sk->sk_callback_lock);
@@ -1720,7 +1745,9 @@ static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
 	int i;
 
 	for (i = 1; i < ctrl->queue_count; i++)
-		nvme_tcp_stop_queue(ctrl, i);
+		nvme_tcp_stop_queue_nowait(ctrl, i);
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_wait_queue(ctrl, i);
 }
 
 static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl)
-- 
2.39.5




  parent reply	other threads:[~2025-05-07 18:42 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 18:39 [PATCH 5.15 00/55] 5.15.182-rc1 review Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 01/55] ALSA: usb-audio: Add second USB ID for Jabra Evolve 65 headset Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 02/55] drm/nouveau: Fix WARN_ON in nouveau_fence_context_kill() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 03/55] EDAC/altera: Test the correct error reg offset Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 04/55] EDAC/altera: Set DDR and SDMMC interrupt mask before registration Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 05/55] i2c: imx-lpi2c: Fix clock count when probe defers Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 06/55] arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 07/55] parisc: Fix double SIGFPE crash Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 08/55] amd-xgbe: Fix to ensure dependent features are toggled with RX checksum offload Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 09/55] mmc: renesas_sdhi: Fix error handling in renesas_sdhi_probe Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 10/55] wifi: brcm80211: fmac: Add error handling for brcmf_usb_dl_writeimage() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 11/55] dm-integrity: fix a warning on invalid table line Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 12/55] dm: always update the array size in realloc_argv on success Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 13/55] iommu/amd: Fix potential buffer overflow in parse_ivrs_acpihid Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 14/55] iommu/vt-d: Apply quirk_iommu_igfx for 8086:0044 (QM57/QS57) Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 15/55] tracing: Fix oob write in trace_seq_to_buffer() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 16/55] KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 17/55] net/sched: act_mirred: dont override retval if we already lost the skb Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 18/55] net/mlx5: E-Switch, Initialize MAC Address for Default GID Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 19/55] net/mlx5: E-switch, Fix error handling for enabling roce Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 20/55] net: ethernet: mtk-star-emac: separate tx/rx handling with two NAPIs Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 21/55] net: ethernet: mtk-star-emac: fix spinlock recursion issues on rx/tx poll Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 22/55] net: ethernet: mtk-star-emac: rearm interrupts in rx_poll only when advised Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 23/55] net_sched: drr: Fix double list add in class with netem as child qdisc Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 24/55] net_sched: hfsc: Fix a UAF vulnerability " Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 25/55] net_sched: ets: Fix double list add " Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 26/55] net_sched: qfq: " Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 27/55] ice: Refactor promiscuous functions Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 28/55] ice: Check VF VSI Pointer Value in ice_vc_add_fdir_fltr() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 29/55] net: dlink: Correct endianness handling of led_mode Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 30/55] net: ipv6: fix UDPv6 GSO segmentation with NAT Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 31/55] bnxt_en: Fix coredump logic to free allocated buffer Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 32/55] bnxt_en: Fix out-of-bound memcpy() during ethtool -w Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 33/55] bnxt_en: Fix ethtool -d byte order for 32-bit values Greg Kroah-Hartman
2025-05-07 18:39 ` Greg Kroah-Hartman [this message]
2025-05-07 18:39 ` [PATCH 5.15 35/55] net: lan743x: Fix memleak issue when GSO enabled Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 36/55] net: fec: ERR007885 Workaround for conventional TX Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 37/55] net: hns3: store rx VLAN tag offload state for VF Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 38/55] net: hns3: add support for external loopback test Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 39/55] net: hns3: fix an interrupt residual problem Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 40/55] net: hns3: fixed debugfs tm_qset size Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 41/55] net: hns3: defer calling ptp_clock_register() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 42/55] PCI: imx6: Skip controller_id generation logic for i.MX7D Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 43/55] of: module: add buffer overflow check in of_modalias() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 44/55] net: hns3: fix deadlock issue when externel_lb and reset are executed together Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 45/55] firmware: arm_scmi: Balance device refcount when destroying devices Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 46/55] ARM: dts: opos6ul: add ksz8081 phy properties Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 47/55] net: phy: microchip: force IRQ polling mode for lan88xx Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 48/55] Revert "drm/meson: vclk: fix calculation of 59.94 fractional rates" Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 49/55] irqchip/gic-v2m: Add const to of_device_id Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 50/55] irqchip/gic-v2m: Mark a few functions __init Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 51/55] irqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode() Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 52/55] serial: msm: Configure correct working mode before starting earlycon Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 53/55] riscv: uprobes: Add missing fence.i after building the XOL buffer Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 54/55] iommu/arm-smmu-v3: Use the new rb tree helpers Greg Kroah-Hartman
2025-05-07 18:39 ` [PATCH 5.15 55/55] iommu/arm-smmu-v3: Fix iommu_device_probe bug due to duplicated stream ids Greg Kroah-Hartman
2025-05-08  9:23 ` [PATCH 5.15 00/55] 5.15.182-rc1 review Pavel Machek
2025-05-08  9:44 ` Jon Hunter
2025-05-08 11:23   ` Greg Kroah-Hartman
2025-05-08 12:22     ` Jon Hunter
2025-05-08 11:05 ` Florian Fainelli
2025-05-08 15:01 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250507183800.413811673@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=hch@lst.de \
    --cc=mkhalfella@purestorage.com \
    --cc=mliang@purestorage.com \
    --cc=patches@lists.linux.dev \
    --cc=randyj@purestorage.com \
    --cc=sagi@grimberg.me \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox