* [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
@ 2026-04-20 11:49 Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-20 11:49 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, hare, sagi, chaitanyak, gjoyce, Nilay Shroff
Hi,
The NVMe/TCP host driver currently provisions I/O queues primarily based
on CPU availability rather than the capabilities and topology of the
underlying network interface.
On modern systems with many CPUs but fewer NIC hardware queues, this can
lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
resulting in increased lock contention, cacheline bouncing, and degraded
throughput.
This RFC proposes a set of changes to better align NVMe/TCP I/O queues
with NIC queue resources, and to expose queue/flow information to enable
more effective system-level tuning.
Key ideas
---------
1. Scale NVMe/TCP I/O queues based on NIC queue count
Instead of relying solely on CPU count, limit the number of I/O workers
to:
min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
2. Improve CPU locality
Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
to reduce cross-CPU traffic and improve cache locality.
3. Expose queue and flow information via debugfs
Export per-I/O queue information including:
- queue id (qid)
- CPU affinity
- TCP flow (src/dst IP and ports)
This enables userspace tools to configure:
- IRQ affinity
- RPS/XPS
- ntuple steering
- or any other scaling as deemed feasible
4. Provide infrastructure for extensible debugfs support in NVMe
Together, these changes allow better alignment of:
flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
Performance Evaluation
----------------------
Tests were conducted using fio over NVMe/TCP with the following parameters:
ioengine=io_uring
direct=1
bs=4k
numjobs=<#nic-queues>
iodepth=64
System:
CPUs: 72
NIC: 100G mlx5
Two configurations were evaluated.
Scenario 1: NIC queues < CPU count
----------------------------------
- CPUs: 72
- NIC queues: 32
                Baseline        Patched         Patched + tuning
randread        3141 MB/s       3228 MB/s       7509 MB/s
                (767k IOPS)     (788k IOPS)     (1833k IOPS)

randwrite       4510 MB/s       6172 MB/s       7518 MB/s
                (1101k IOPS)    (1507k IOPS)    (1836k IOPS)

randrw (read)   2156 MB/s       2560 MB/s       3932 MB/s
                (526k IOPS)     (625k IOPS)     (960k IOPS)

randrw (write)  2155 MB/s       2560 MB/s       3932 MB/s
                (526k IOPS)     (625k IOPS)     (960k IOPS)
Observation:
When CPU count exceeds NIC queue count, the baseline configuration
suffers from queue contention. The proposed changes provide modest
improvements on their own; combined with queue-aware tuning (IRQ
affinity, ntuple steering, and CPU alignment), they deliver roughly
1.5x-2.5x higher throughput.
Scenario 2: NIC queues == CPU count
-----------------------------------
- CPUs: 72
- NIC queues: 72
                Baseline        Patched + tuning
randread        4310 MB/s       7987 MB/s
                (1052k IOPS)    (1950k IOPS)

randwrite       7947 MB/s       7972 MB/s
                (1940k IOPS)    (1946k IOPS)

randrw (read)   3583 MB/s       4030 MB/s
                (875k IOPS)     (984k IOPS)

randrw (write)  3583 MB/s       4029 MB/s
                (875k IOPS)     (984k IOPS)
Observation:
When NIC queues are already aligned with CPU count, the baseline performs
well. The proposed changes maintain write performance (no regression) and
still improve read and mixed workloads due to better flow-to-CPU locality.
Notes on tuning
---------------
The "patched + tuning" configuration includes:
- aligning NVMe/TCP I/O workers with NIC queue count
- IRQ affinity configuration per RX queue
- ntuple-based flow steering
- CPU/queue affinity alignment
These tuning steps are enabled by the queue/flow information exposed through
this patchset.
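As an illustration, a minimal tuning sketch using the exported information
might look like the following (the disk name, interface name, IRQ number,
and CPU masks below are hypothetical and depend on the system):

  # read the per-queue flow/CPU mapping exported by this series
  cat /sys/kernel/debug/block/nvme0n1/io_queue_info

  # pin the IRQ of the matching NIC RX queue to the CPU that handles
  # the NVMe/TCP I/O queue
  echo <cpumask> > /proc/irq/<rxq_irq>/smp_affinity

  # and/or steer receive processing for that RX queue to the same CPU (RPS)
  echo <cpumask> > /sys/class/net/eth0/queues/rx-<n>/rps_cpus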
Discussion
----------
This RFC aims to start discussion around:
- Whether NVMe/TCP queue scaling should consider NIC queue topology
- How best to expose queue/flow information to userspace
- The role of userspace vs kernel in steering decisions
As usual, feedback/comments/suggestions are most welcome!
Reference to LSF/MM/BPF abstract: https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
Nilay Shroff (4):
nvme-tcp: optionally limit I/O queue count based on NIC queues
nvme-tcp: add a diagnostic message when NIC queues are underutilized
nvme: add debugfs helpers for NVMe drivers
nvme: expose queue information via debugfs
drivers/nvme/host/Makefile | 2 +-
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/debugfs.c | 162 +++++++++++++++++++++++++++
drivers/nvme/host/fabrics.c | 4 +
drivers/nvme/host/fabrics.h | 3 +
drivers/nvme/host/nvme.h | 12 ++
drivers/nvme/host/tcp.c | 211 +++++++++++++++++++++++++++++++++++-
7 files changed, 395 insertions(+), 2 deletions(-)
create mode 100644 drivers/nvme/host/debugfs.c
--
2.53.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
@ 2026-04-20 11:49 ` Nilay Shroff
2026-04-24 13:46 ` Christoph Hellwig
2026-04-24 22:10 ` Sagi Grimberg
2026-04-20 11:49 ` [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized Nilay Shroff
` (3 subsequent siblings)
4 siblings, 2 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-20 11:49 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, hare, sagi, chaitanyak, gjoyce, Nilay Shroff
NVMe-TCP currently provisions I/O queues primarily based on CPU
availability. On systems where the number of CPUs significantly exceeds
the number of NIC hardware queues, this can lead to multiple I/O queues
sharing the same NIC TX/RX queues, resulting in increased lock
contention, cacheline bouncing, and inter-processor interrupts (IPIs).
In such configurations, limiting the number of NVMe-TCP I/O queues to
the number of NIC hardware queues can improve performance by reducing
contention and improving locality. Aligning NVMe-TCP worker threads with
NIC queue topology may also help reduce tail latency.
Add a new transport option "match_hw_queues" to allow users to
optionally limit the number of NVMe-TCP I/O queues to the number of NIC
TX/RX queues. When enabled, the number of I/O queues is set to:
min(num_online_cpus, num_nic_queues)
This behavior is opt-in and does not change existing defaults.
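For example, the option can be passed through the fabrics interface like
any other transport option (target address and NQN below are placeholders):

  echo "transport=tcp,traddr=192.168.1.10,trsvcid=4420,nqn=nqn.2026-04.io.example:subsys1,match_hw_queues" \
      > /dev/nvme-fabrics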
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/fabrics.c | 4 ++
drivers/nvme/host/fabrics.h | 3 +
drivers/nvme/host/tcp.c | 120 +++++++++++++++++++++++++++++++++++-
3 files changed, 126 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index ac3d4f400601..62ae998825e1 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -709,6 +709,7 @@ static const match_table_t opt_tokens = {
{ NVMF_OPT_TLS, "tls" },
{ NVMF_OPT_CONCAT, "concat" },
#endif
+ { NVMF_OPT_MATCH_HW_QUEUES, "match_hw_queues" },
{ NVMF_OPT_ERR, NULL }
};
@@ -1064,6 +1065,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
}
opts->concat = true;
break;
+ case NVMF_OPT_MATCH_HW_QUEUES:
+ opts->match_hw_queues = true;
+ break;
default:
pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index caf5503d0833..e8e3a2672832 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -67,6 +67,7 @@ enum {
NVMF_OPT_KEYRING = 1 << 26,
NVMF_OPT_TLS_KEY = 1 << 27,
NVMF_OPT_CONCAT = 1 << 28,
+ NVMF_OPT_MATCH_HW_QUEUES = 1 << 29,
};
/**
@@ -106,6 +107,7 @@ enum {
* @disable_sqflow: disable controller sq flow control
* @hdr_digest: generate/verify header digest (TCP)
* @data_digest: generate/verify data digest (TCP)
+ * @match_hw_queues: limit controller IO queue count based on NIC queues (TCP)
* @nr_write_queues: number of queues for write I/O
* @nr_poll_queues: number of queues for polling I/O
* @tos: type of service
@@ -136,6 +138,7 @@ struct nvmf_ctrl_options {
bool disable_sqflow;
bool hdr_digest;
bool data_digest;
+ bool match_hw_queues;
unsigned int nr_write_queues;
unsigned int nr_poll_queues;
int tos;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 243dab830dc8..7102a7a54d78 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -16,6 +16,8 @@
#include <net/tls.h>
#include <net/tls_prot.h>
#include <net/handshake.h>
+#include <net/ip6_route.h>
+#include <linux/in6.h>
#include <linux/blk-mq.h>
#include <net/busy_poll.h>
#include <trace/events/sock.h>
@@ -1762,6 +1764,103 @@ static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
return ret;
}
+static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
+{
+ struct net_device *dev = NULL;
+
+ if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
+ dev = dev_get_by_name(&init_net, ctrl->opts->host_iface);
+ else {
+ struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
+
+ if (tctrl->addr.ss_family == AF_INET) {
+ struct rtable *rt;
+ struct flowi4 fl4 = {};
+ struct sockaddr_in *addr =
+ (struct sockaddr_in *)&tctrl->addr;
+
+ fl4.daddr = addr->sin_addr.s_addr;
+ if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
+ addr = (struct sockaddr_in *)&tctrl->src_addr;
+ fl4.saddr = addr->sin_addr.s_addr;
+ }
+ fl4.flowi4_proto = IPPROTO_TCP;
+
+ rt = ip_route_output_key(&init_net, &fl4);
+ if (IS_ERR(rt))
+ return NULL;
+
+ dev = dst_dev(&rt->dst);
+ /*
+ * Get reference to netdev as ip_rt_put() will
+ * release the netdev reference.
+ */
+ if (dev)
+ dev_hold(dev);
+
+ ip_rt_put(rt);
+
+ } else if (tctrl->addr.ss_family == AF_INET6) {
+ struct dst_entry *dst;
+ struct flowi6 fl6 = {};
+ struct sockaddr_in6 *addr6 =
+ (struct sockaddr_in6 *)&tctrl->addr;
+
+ fl6.daddr = addr6->sin6_addr;
+ if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
+ addr6 = (struct sockaddr_in6 *)&tctrl->src_addr;
+ fl6.saddr = addr6->sin6_addr;
+ }
+ fl6.flowi6_proto = IPPROTO_TCP;
+
+ dst = ip6_route_output(&init_net, NULL, &fl6);
+ if (dst->error) {
+ dst_release(dst);
+ return NULL;
+ }
+
+ dev = dst_dev(dst);
+ /*
+ * Get reference to netdev as dst_release() will
+ * release the netdev reference.
+ */
+ if (dev)
+ dev_hold(dev);
+
+ dst_release(dst);
+ }
+ }
+
+ return dev;
+}
+
+static void nvme_tcp_put_netdev(struct net_device *dev)
+{
+ if (dev)
+ dev_put(dev);
+}
+
+/*
+ * Returns number of active NIC queues (min of TX/RX), or 0 if device cannot
+ * be determined.
+ */
+static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
+{
+ struct net_device *dev;
+ int tx_queues, rx_queues;
+
+ dev = nvme_tcp_get_netdev(ctrl);
+ if (!dev)
+ return 0;
+
+ tx_queues = dev->real_num_tx_queues;
+ rx_queues = dev->real_num_rx_queues;
+
+ nvme_tcp_put_netdev(dev);
+
+ return min(tx_queues, rx_queues);
+}
+
static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
key_serial_t pskid)
{
@@ -2144,6 +2243,24 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
unsigned int nr_io_queues;
int ret;
+ if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
+ (ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
+ int nr_hw_queues;
+
+ nr_hw_queues = nvme_tcp_get_netdev_current_queue_count(ctrl);
+ if (nr_hw_queues <= 0)
+ goto init_queue;
+
+ ctrl->opts->nr_io_queues = min(nr_hw_queues, num_online_cpus());
+
+ if (ctrl->opts->nr_io_queues < num_online_cpus())
+ dev_info(ctrl->device,
+ "limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
+ ctrl->opts->nr_io_queues, nr_hw_queues,
+ num_online_cpus());
+ }
+
+init_queue:
nr_io_queues = nvmf_nr_io_queues(ctrl->opts);
ret = nvme_set_queue_count(ctrl, &nr_io_queues);
if (ret)
@@ -3019,7 +3136,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE | NVMF_OPT_TLS |
- NVMF_OPT_KEYRING | NVMF_OPT_TLS_KEY | NVMF_OPT_CONCAT,
+ NVMF_OPT_KEYRING | NVMF_OPT_TLS_KEY |
+ NVMF_OPT_CONCAT | NVMF_OPT_MATCH_HW_QUEUES,
.create_ctrl = nvme_tcp_create_ctrl,
};
--
2.53.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
@ 2026-04-20 11:49 ` Nilay Shroff
2026-04-24 22:15 ` Sagi Grimberg
2026-04-20 11:49 ` [RFC PATCH 3/4] nvme: add debugfs helpers for NVMe drivers Nilay Shroff
` (2 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2026-04-20 11:49 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, hare, sagi, chaitanyak, gjoyce, Nilay Shroff
Some systems may configure fewer NIC queues than supported by the
hardware. When the number of NVMe-TCP I/O queues is limited by the
number of active NIC queues, this can result in suboptimal performance.
Add a diagnostic message to warn when the configured NIC queue count
is lower than the maximum supported queue count, as reported by the
driver. This may help users identify configurations where increasing
the NIC queue count could improve performance.
This change is informational only and does not modify NIC configuration.
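If the message fires, the active and supported channel counts can be
inspected and adjusted with standard ethtool commands, for example
(interface name is hypothetical):

  ethtool -l eth0              # show maximum vs. currently configured counts
  ethtool -L eth0 combined 72  # raise the combined channel count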
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/tcp.c | 45 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 7102a7a54d78..9239495122fc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -11,6 +11,7 @@
#include <linux/crc32.h>
#include <linux/nvme-tcp.h>
#include <linux/nvme-keyring.h>
+#include <linux/ethtool.h>
#include <net/sock.h>
#include <net/tcp.h>
#include <net/tls.h>
@@ -20,6 +21,7 @@
#include <linux/in6.h>
#include <linux/blk-mq.h>
#include <net/busy_poll.h>
+#include <net/netdev_lock.h>
#include <trace/events/sock.h>
#include "nvme.h"
@@ -1861,6 +1863,35 @@ static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
return min(tx_queues, rx_queues);
}
+static int nvme_tcp_get_netdev_max_queue_count(struct nvme_ctrl *ctrl)
+{
+ struct net_device *dev;
+ struct ethtool_channels channels = {0};
+ int max = 0;
+
+ dev = nvme_tcp_get_netdev(ctrl);
+ if (!dev)
+ return 0;
+
+ rtnl_lock();
+ if (!dev->ethtool_ops || !dev->ethtool_ops->get_channels)
+ goto out;
+
+ netdev_lock_ops(dev);
+
+ dev->ethtool_ops->get_channels(dev, &channels);
+ if (channels.max_combined)
+ max = channels.max_combined;
+ else
+ max = min(channels.max_rx, channels.max_tx);
+
+ netdev_unlock_ops(dev);
+out:
+ rtnl_unlock();
+
+ return max;
+}
+
static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
key_serial_t pskid)
{
@@ -2245,19 +2276,27 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
(ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
- int nr_hw_queues;
+ int nr_hw_queues, max_hw_queues;
nr_hw_queues = nvme_tcp_get_netdev_current_queue_count(ctrl);
if (nr_hw_queues <= 0)
goto init_queue;
ctrl->opts->nr_io_queues = min(nr_hw_queues, num_online_cpus());
-
- if (ctrl->opts->nr_io_queues < num_online_cpus())
+ if (ctrl->opts->nr_io_queues < num_online_cpus()) {
dev_info(ctrl->device,
"limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
ctrl->opts->nr_io_queues, nr_hw_queues,
num_online_cpus());
+
+ max_hw_queues =
+ nvme_tcp_get_netdev_max_queue_count(ctrl);
+ if (max_hw_queues > nr_hw_queues)
+ dev_info(ctrl->device,
+ "NIC supports %u queues but only %u are configured; "
+ "consider increasing queue count for better perfromance\n",
+ max_hw_queues, nr_hw_queues);
+ }
}
init_queue:
--
2.53.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 3/4] nvme: add debugfs helpers for NVMe drivers
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized Nilay Shroff
@ 2026-04-20 11:49 ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 4/4] nvme: expose queue information via debugfs Nilay Shroff
2026-04-22 11:10 ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Hannes Reinecke
4 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-20 11:49 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, hare, sagi, chaitanyak, gjoyce, Nilay Shroff
Introduce helper APIs that allow NVMe drivers to register and unregister
debugfs entries, along with a reusable attribute structure for defining
new debugfs files.
The implementation uses seq_file interfaces to safely expose per-
namespace or per-path statistics, while supporting both simple show
callbacks and full seq_operations.
This will be used by subsequent patches to expose NVMe-TCP queue
and flow information for tuning the NVMe-TCP I/O workers and network
stack components.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/Makefile | 2 +-
drivers/nvme/host/debugfs.c | 111 ++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 10 ++++
3 files changed, 122 insertions(+), 1 deletion(-)
create mode 100644 drivers/nvme/host/debugfs.c
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 6414ec968f99..7962dfc3b2ad 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_NVME_FC) += nvme-fc.o
obj-$(CONFIG_NVME_TCP) += nvme-tcp.o
obj-$(CONFIG_NVME_APPLE) += nvme-apple.o
-nvme-core-y += core.o ioctl.o sysfs.o pr.o
+nvme-core-y += core.o ioctl.o sysfs.o pr.o debugfs.o
nvme-core-$(CONFIG_NVME_VERBOSE_ERRORS) += constants.o
nvme-core-$(CONFIG_TRACING) += trace.o
nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
new file mode 100644
index 000000000000..ee86138487d0
--- /dev/null
+++ b/drivers/nvme/host/debugfs.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 IBM Corporation
+ * Nilay Shroff <nilay@linux.ibm.com>
+ */
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+
+#include "nvme.h"
+
+struct nvme_debugfs_attr {
+ const char *name;
+ umode_t mode;
+ int (*show)(void *data, struct seq_file *m);
+ const struct seq_operations *seq_ops;
+};
+
+struct nvme_debugfs_ctx {
+ void *data;
+ struct nvme_debugfs_attr *attr;
+};
+
+static int nvme_debugfs_show(struct seq_file *m, void *v)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ void *data = ctx->data;
+ struct nvme_debugfs_attr *attr = ctx->attr;
+
+ return attr->show(data, m);
+}
+
+static int nvme_debugfs_open(struct inode *inode, struct file *file)
+{
+ void *data = inode->i_private;
+ struct nvme_debugfs_attr *attr = debugfs_get_aux(file);
+ struct nvme_debugfs_ctx *ctx;
+ struct seq_file *m;
+ int ret;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (WARN_ON_ONCE(!ctx))
+ return -ENOMEM;
+
+ ctx->data = data;
+ ctx->attr = attr;
+
+ if (attr->seq_ops) {
+ ret = seq_open(file, attr->seq_ops);
+ if (ret) {
+ kfree(ctx);
+ return ret;
+ }
+ m = file->private_data;
+ m->private = ctx;
+ return ret;
+ }
+
+ if (WARN_ON_ONCE(!attr->show)) {
+ kfree(ctx);
+ return -EPERM;
+ }
+
+ return single_open(file, nvme_debugfs_show, ctx);
+}
+
+static int nvme_debugfs_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_debugfs_attr *attr = ctx->attr;
+ int ret;
+
+ if (attr->seq_ops)
+ ret = seq_release(inode, file);
+ else
+ ret = single_release(inode, file);
+
+ kfree(ctx);
+ return ret;
+}
+
+static const struct file_operations nvme_debugfs_fops = {
+ .owner = THIS_MODULE,
+ .open = nvme_debugfs_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = nvme_debugfs_release,
+};
+
+static const struct nvme_debugfs_attr nvme_ns_debugfs_attrs[] = {
+ {},
+};
+
+static void nvme_debugfs_create_files(struct request_queue *q,
+ const struct nvme_debugfs_attr *attr, void *data)
+{
+ if (WARN_ON_ONCE(!q->debugfs_dir))
+ return;
+
+ for (; attr->name; attr++)
+ debugfs_create_file_aux(attr->name, attr->mode, q->debugfs_dir,
+ data, (void *)attr, &nvme_debugfs_fops);
+}
+
+void nvme_debugfs_register(struct gendisk *disk)
+{
+ nvme_debugfs_create_files(disk->queue, nvme_ns_debugfs_attrs,
+ disk->private_data);
+}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ccd5e05dac98..2f3f1d2d19b9 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -468,6 +468,16 @@ struct nvme_ctrl {
u16 awupf; /* 0's based value. */
};
+void nvme_debugfs_register(struct gendisk *disk);
+static inline void nvme_debugfs_unregister(struct gendisk *disk)
+{
+ /*
+ * Nothing to do for now. When the request queue is unregistered,
+ * all files under q->debugfs_dir are recursively deleted.
+ * This is just a placeholder; the compiler will optimize it out.
+ */
+}
+
static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
{
return READ_ONCE(ctrl->state);
--
2.53.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 4/4] nvme: expose queue information via debugfs
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
` (2 preceding siblings ...)
2026-04-20 11:49 ` [RFC PATCH 3/4] nvme: add debugfs helpers for NVMe drivers Nilay Shroff
@ 2026-04-20 11:49 ` Nilay Shroff
2026-04-24 22:23 ` Sagi Grimberg
2026-04-22 11:10 ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Hannes Reinecke
4 siblings, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2026-04-20 11:49 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, hare, sagi, chaitanyak, gjoyce, Nilay Shroff
Add a new debugfs attribute "io_queue_info" to expose per-queue
information for NVMe controllers. For NVMe-TCP, this includes the
CPU handling each I/O queue and the associated TCP flow (source and
destination address/port).
This information can be useful for understanding and tuning the
interaction between NVMe-TCP I/O queues and network stack components,
such as IRQ affinity, RPS/RFS, XPS, or NIC flow steering (ntuple).
The data is exported using seq_file interfaces to allow iteration
over all controller queues.
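Example output (disk name, addresses, and ports below are illustrative):

  # cat /sys/kernel/debug/block/nvme0n1/io_queue_info
  qid=1 cpu=4 src_ip=192.168.1.20 src_port=53180 dst_ip=192.168.1.10 dst_port=4420
  qid=2 cpu=8 src_ip=192.168.1.20 src_port=53182 dst_ip=192.168.1.10 dst_port=4420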
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 3 +++
drivers/nvme/host/debugfs.c | 53 ++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 2 ++
drivers/nvme/host/tcp.c | 52 ++++++++++++++++++++++++++++++++++++
4 files changed, 109 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e33af94c24b..1b0d13374d45 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4207,6 +4207,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups))
goto out_cleanup_ns_from_list;
+ nvme_debugfs_register(ns->disk);
+
if (!nvme_ns_head_multipath(ns->head))
nvme_add_ns_cdev(ns);
@@ -4285,6 +4287,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
nvme_mpath_remove_sysfs_link(ns);
+ nvme_debugfs_unregister(ns->disk);
del_gendisk(ns->disk);
mutex_lock(&ns->ctrl->namespaces_lock);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index ee86138487d0..68c40582fa97 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -22,6 +22,56 @@ struct nvme_debugfs_ctx {
struct nvme_debugfs_attr *attr;
};
+static void *nvme_io_queue_info_start(struct seq_file *m, loff_t *pos)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns *ns = ctx->data;
+ struct nvme_ctrl *ctrl = ns->ctrl;
+
+ nvme_get_ctrl(ctrl);
+ /*
+ * I/O queues start at index 1; index 0 is the admin queue.
+ */
+ return (++*pos < ctrl->queue_count) ? pos : NULL;
+}
+
+static void *nvme_io_queue_info_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns *ns = ctx->data;
+ struct nvme_ctrl *ctrl = ns->ctrl;
+
+ return (++*pos < ctrl->queue_count) ? pos : NULL;
+}
+
+static void nvme_io_queue_info_stop(struct seq_file *m, void *v)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns *ns = ctx->data;
+ struct nvme_ctrl *ctrl = ns->ctrl;
+
+ nvme_put_ctrl(ctrl);
+}
+
+static int nvme_io_queue_info_show(struct seq_file *m, void *v)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns *ns = ctx->data;
+ struct nvme_ctrl *ctrl = ns->ctrl;
+
+ if (ctrl->ops->print_io_queue_info)
+ return ctrl->ops->print_io_queue_info(m, ctrl, *(loff_t *)v);
+
+ return 0;
+}
+
+const struct seq_operations nvme_io_queue_info_seq_ops = {
+ .start = nvme_io_queue_info_start,
+ .next = nvme_io_queue_info_next,
+ .stop = nvme_io_queue_info_stop,
+ .show = nvme_io_queue_info_show
+};
+
static int nvme_debugfs_show(struct seq_file *m, void *v)
{
struct nvme_debugfs_ctx *ctx = m->private;
@@ -90,7 +140,8 @@ static const struct file_operations nvme_debugfs_fops = {
};
static const struct nvme_debugfs_attr nvme_ns_debugfs_attrs[] = {
- {},
+ {"io_queue_info", 0400, .seq_ops = &nvme_io_queue_info_seq_ops},
+ {}
};
static void nvme_debugfs_create_files(struct request_queue *q,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2f3f1d2d19b9..d7ff82971136 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -650,6 +650,8 @@ struct nvme_ctrl_ops {
void (*print_device_info)(struct nvme_ctrl *ctrl);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
unsigned long (*get_virt_boundary)(struct nvme_ctrl *ctrl, bool is_admin);
+ int (*print_io_queue_info)(struct seq_file *m, struct nvme_ctrl *ctrl,
+ int qid);
};
/*
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9239495122fc..6d06e984de47 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2723,6 +2723,57 @@ static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
kfree(ctrl);
}
+static int nvme_tcp_print_io_queue_info(struct seq_file *m,
+ struct nvme_ctrl *ctrl, int qid)
+{
+ int cpu;
+ struct sockaddr_storage src, dst;
+ struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
+ struct nvme_tcp_queue *queue = &tctrl->queues[qid];
+ int ret = -EINVAL;
+
+ if (!qid || qid >= ctrl->queue_count ||
+ !test_bit(NVME_TCP_Q_LIVE, &queue->flags))
+ return -EINVAL;
+
+ mutex_lock(&queue->queue_lock);
+ if (!queue->sock)
+ goto unlock;
+
+ ret = kernel_getsockname(queue->sock, (struct sockaddr *)&src);
+ if (ret <= 0)
+ goto unlock;
+
+ ret = kernel_getpeername(queue->sock, (struct sockaddr *)&dst);
+ if (ret <= 0)
+ goto unlock;
+
+ cpu = (queue->io_cpu == WORK_CPU_UNBOUND) ? -1 : queue->io_cpu;
+
+ if (src.ss_family == AF_INET) {
+ struct sockaddr_in *sip = (struct sockaddr_in *)&src;
+ struct sockaddr_in *dip = (struct sockaddr_in *)&dst;
+
+ seq_printf(m, "qid=%d cpu=%d src_ip=%pI4 src_port=%u dst_ip=%pI4 dst_port=%u\n",
+ qid, cpu,
+ &sip->sin_addr.s_addr, ntohs(sip->sin_port),
+ &dip->sin_addr.s_addr, ntohs(dip->sin_port));
+ ret = 0;
+ } else if (src.ss_family == AF_INET6) {
+ struct sockaddr_in6 *sip6 = (struct sockaddr_in6 *)&src;
+ struct sockaddr_in6 *dip6 = (struct sockaddr_in6 *)&dst;
+
+ seq_printf(m, "qid=%d cpu=%d src_ip=%pI6c src_port=%u dst_ip=%pI6c dst_port=%u\n",
+ qid, cpu,
+ &sip6->sin6_addr, ntohs(sip6->sin6_port),
+ &dip6->sin6_addr, ntohs(dip6->sin6_port));
+ ret = 0;
+ }
+unlock:
+ mutex_unlock(&queue->queue_lock);
+ return ret;
+}
+
static void nvme_tcp_set_sg_null(struct nvme_command *c)
{
struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
@@ -3023,6 +3074,7 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
.get_address = nvme_tcp_get_address,
.stop_ctrl = nvme_tcp_stop_ctrl,
.get_virt_boundary = nvmf_get_virt_boundary,
+ .print_io_queue_info = nvme_tcp_print_io_queue_info,
};
static bool
--
2.53.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
` (3 preceding siblings ...)
2026-04-20 11:49 ` [RFC PATCH 4/4] nvme: expose queue information via debugfs Nilay Shroff
@ 2026-04-22 11:10 ` Hannes Reinecke
2026-04-24 22:30 ` Sagi Grimberg
2026-04-27 6:13 ` Nilay Shroff
4 siblings, 2 replies; 17+ messages in thread
From: Hannes Reinecke @ 2026-04-22 11:10 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, sagi, chaitanyak, gjoyce
On 4/20/26 13:49, Nilay Shroff wrote:
> Hi,
>
> The NVMe/TCP host driver currently provisions I/O queues primarily based
> on CPU availability rather than the capabilities and topology of the
> underlying network interface.
>
> On modern systems with many CPUs but fewer NIC hardware queues, this can
> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
> resulting in increased lock contention, cacheline bouncing, and degraded
> throughput.
>
> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
> with NIC queue resources, and to expose queue/flow information to enable
> more effective system-level tuning.
>
> Key ideas
> ---------
>
> 1. Scale NVMe/TCP I/O queues based on NIC queue count
> Instead of relying solely on CPU count, limit the number of I/O workers
> to:
> min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>
> 2. Improve CPU locality
> Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
> to reduce cross-CPU traffic and improve cache locality.
>
> 3. Expose queue and flow information via debugfs
> Export per-I/O queue information including:
> - queue id (qid)
> - CPU affinity
> - TCP flow (src/dst IP and ports)
>
> This enables userspace tools to configure:
> - IRQ affinity
> - RPS/XPS
> - ntuple steering
> - or any other scaling as deemed feasible
>
> 4. Provide infrastructure for extensible debugfs support in NVMe
>
> Together, these changes allow better alignment of:
> flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
>
> Performance Evaluation
> ----------------------
> Tests were conducted using fio over NVMe/TCP with the following parameters:
> ioengine=io_uring
> direct=1
> bs=4k
> numjobs=<#nic-queues>
> iodepth=64
> System:
> CPUs: 72
> NIC: 100G mlx5
>
> Two configurations were evaluated.
>
> Scenario 1: NIC queues < CPU count
> ----------------------------------
> - CPUs: 72
> - NIC queues: 32
>
> Baseline Patched Patched + tuning
> randread 3141 MB/s 3228 MB/s 7509 MB/s
> (767k IOPS) (788k IOPS) (1833k IOPS)
>
> randwrite 4510 MB/s 6172 MB/s 7518 MB/s
> (1101k IOPS) (1507k IOPS) (1836k IOPS)
>
> randrw (read) 2156 MB/s 2560 MB/s 3932 MB/s
> (526k IOPS) (625k IOPS) (960k IOPS)
>
> randrw (write) 2155 MB/s 2560 MB/s 3932 MB/s
> (526k IOPS) (625k IOPS) (960k IOPS)
>
> Observation:
> When CPU count exceeds NIC queue count, the baseline configuration
> suffers from queue contention. The proposed changes provide modest
> improvements on their own, and when combined with queue-aware tuning
> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
> ~1.5x–2.5x throughput improvement.
>
> Scenario 2: NIC queues == CPU count
> -----------------------------------
>
> - CPUs: 72
> - NIC queues: 72
>
> Baseline Patched + tuning
> randread 4310 MB/s 7987 MB/s
> (1052k IOPS) (1950k IOPS)
>
> randwrite 7947 MB/s 7972 MB/s
> (1940k IOPS) (1946k IOPS)
>
> randrw (read) 3583 MB/s 4030 MB/s
> (875k IOPS) (984k IOPS)
>
> randrw (write) 3583 MB/s 4029 MB/s
> (875k IOPS) (984k IOPS)
>
> Observation:
> When NIC queues are already aligned with CPU count, the baseline performs
> well. The proposed changes maintain write performance (no regression) and
> still improve read and mixed workloads due to better flow-to-CPU locality.
>
> Notes on tuning
> ---------------
> The "patched + tuning" configuration includes:
> - aligning NVMe/TCP I/O workers with NIC queue count
> - IRQ affinity configuration per RX queue
> - ntuple-based flow steering
> - CPU/queue affinity alignment
>
> These tuning steps are enabled by the queue/flow information exposed through
> this patchset.
>
> Discussion
> ----------
> This RFC aims to start discussion around:
> - Whether NVMe/TCP queue scaling should consider NIC queue topology
> - How best to expose queue/flow information to userspace
> - The role of userspace vs kernel in steering decisions
>
> As usual, feedback/comment/suggestions are most welcome!
>
> Reference to LSF/MM/BPF abstarct: https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>
Weelll ... we have been debating this back and forth over recent years:
Should we check for hardware limitations for NVMe-over-Fabrics or not?
Initially it sounds appealing, and in fact I've worked on several
attempts myself. But in the end there are far more things which need
to be considered:
-> For networking, number of queues is not really telling us anything.
Most NICs have distinct RX and TX queues, and the number (of both!)
varies quite dramatically.
-> The number of queues does _not_ indicate that all queues are used
simultaneously. That is down to things like RSS and friends.
I gave a stab at configuring _that_ but it's patently horrible
trying to out-guess things for yourself.
-> It'll only work if you run directly on the NIC. As soon as there
is anything in between (qemu? Tunnelling?) you are out of luck.
So yeah, we should have a discussion here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
@ 2026-04-24 13:46 ` Christoph Hellwig
2026-04-27 7:37 ` Nilay Shroff
2026-04-24 22:10 ` Sagi Grimberg
1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2026-04-24 13:46 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-nvme, kbusch, hch, hare, sagi, chaitanyak, gjoyce
> In such configurations, limiting the number of NVMe-TCP I/O queues to
> the number of NIC hardware queues can improve performance by reducing
> contention and improving locality. Aligning NVMe-TCP worker threads with
> NIC queue topology may also help reduce tail latency.
Yes, this sounds useful.
>
> Add a new transport option "match_hw_queues" to allow users to
> optionally limit the number of NVMe-TCP I/O queues to the number of NIC
> TX/RX queues. When enabled, the number of I/O queues is set to:
>
> min(num_online_cpus, num_nic_queues)
>
> This behavior is opt-in and does not change existing defaults.
Any good reason for that? For PCI and RDMA we try to do the right
thing by default.
> +static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
> +{
> + struct net_device *dev = NULL;
> +
> + if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
> + dev = dev_get_by_name(&init_net, ctrl->opts->host_iface);
Return early here instead of the giant indentation for the new options.
> + else {
> + struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
> +
> + if (tctrl->addr.ss_family == AF_INET) {
And then split each address family into a helper. And to me those
look like something that should be in net/.
> +
> +/*
> + * Returns number of active NIC queues (min of TX/RX), or 0 if device cannot
> + * be determined.
> + */
> +static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
drop _current to make this a bit more readable?
> @@ -2144,6 +2243,24 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
> unsigned int nr_io_queues;
> int ret;
>
> + if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
> + (ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
The more readable formatting would be:
	if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
	    (ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
> + int nr_hw_queues;
> +
> + nr_hw_queues = nvme_tcp_get_netdev_current_queue_count(ctrl);
> + if (nr_hw_queues <= 0)
> + goto init_queue;
> +
> + ctrl->opts->nr_io_queues = min(nr_hw_queues, num_online_cpus());
> +
> + if (ctrl->opts->nr_io_queues < num_online_cpus())
> + dev_info(ctrl->device,
> + "limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
> + ctrl->opts->nr_io_queues, nr_hw_queues,
> + num_online_cpus());
> + }
And splitting this into a helper would help keeping the flow sane.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
2026-04-24 13:46 ` Christoph Hellwig
@ 2026-04-24 22:10 ` Sagi Grimberg
2026-04-27 11:57 ` Nilay Shroff
1 sibling, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2026-04-24 22:10 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, hare, chaitanyak, gjoyce
On 20/04/2026 14:49, Nilay Shroff wrote:
> NVMe-TCP currently provisions I/O queues primarily based on CPU
> availability. On systems where the number of CPUs significantly exceeds
> the number of NIC hardware queues, this can lead to multiple I/O queues
> sharing the same NIC TX/RX queues, resulting in increased lock
> contention, cacheline bouncing, and inter-processor interrupts (IPIs).
Yes, I agree it is very inefficient to create something like 192 queues
in practice.
Nevermind that this is pretty much never the case because real controllers
will limit the number of IO queues to something much lower than that,
in the majority of cases probably a handful or more.
Please note that this is very much in common with RDMA, so the
patch series should probably address both.
>
> In such configurations, limiting the number of NVMe-TCP I/O queues to
> the number of NIC hardware queues can improve performance by reducing
> contention and improving locality. Aligning NVMe-TCP worker threads with
> NIC queue topology may also help reduce tail latency.
As mentioned, from what I know, when using real nvmf arrays, the number
of queues will usually be much lower than both the cpu count and the NIC
hw queues.
>
> Add a new transport option "match_hw_queues" to allow users to
> optionally limit the number of NVMe-TCP I/O queues to the number of NIC
> TX/RX queues. When enabled, the number of I/O queues is set to:
>
> min(num_online_cpus, num_nic_queues)
>
> This behavior is opt-in and does not change existing defaults.
In my mind, there is no real reason for an opt-in. The opt-in should
probably be if the user actually wants to use num_online_cpus() worth of
queues (e.g. user explicitly asked for nr_io_queues).
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/fabrics.c | 4 ++
> drivers/nvme/host/fabrics.h | 3 +
> drivers/nvme/host/tcp.c | 120 +++++++++++++++++++++++++++++++++++-
> 3 files changed, 126 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index ac3d4f400601..62ae998825e1 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -709,6 +709,7 @@ static const match_table_t opt_tokens = {
> { NVMF_OPT_TLS, "tls" },
> { NVMF_OPT_CONCAT, "concat" },
> #endif
> + { NVMF_OPT_MATCH_HW_QUEUES, "match_hw_queues" },
> { NVMF_OPT_ERR, NULL }
> };
>
> @@ -1064,6 +1065,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
> }
> opts->concat = true;
> break;
> + case NVMF_OPT_MATCH_HW_QUEUES:
> + opts->match_hw_queues = true;
> + break;
> default:
> pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
> p);
> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> index caf5503d0833..e8e3a2672832 100644
> --- a/drivers/nvme/host/fabrics.h
> +++ b/drivers/nvme/host/fabrics.h
> @@ -67,6 +67,7 @@ enum {
> NVMF_OPT_KEYRING = 1 << 26,
> NVMF_OPT_TLS_KEY = 1 << 27,
> NVMF_OPT_CONCAT = 1 << 28,
> + NVMF_OPT_MATCH_HW_QUEUES = 1 << 29,
> };
No need for the above in my mind.
>
> /**
> @@ -106,6 +107,7 @@ enum {
> * @disable_sqflow: disable controller sq flow control
> * @hdr_digest: generate/verify header digest (TCP)
> * @data_digest: generate/verify data digest (TCP)
> + * @match_hw_queues: limit controller IO queue count based on NIC queues (TCP)
> * @nr_write_queues: number of queues for write I/O
> * @nr_poll_queues: number of queues for polling I/O
> * @tos: type of service
> @@ -136,6 +138,7 @@ struct nvmf_ctrl_options {
> bool disable_sqflow;
> bool hdr_digest;
> bool data_digest;
> + bool match_hw_queues;
> unsigned int nr_write_queues;
> unsigned int nr_poll_queues;
> int tos;
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 243dab830dc8..7102a7a54d78 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -16,6 +16,8 @@
> #include <net/tls.h>
> #include <net/tls_prot.h>
> #include <net/handshake.h>
> +#include <net/ip6_route.h>
> +#include <linux/in6.h>
> #include <linux/blk-mq.h>
> #include <net/busy_poll.h>
> #include <trace/events/sock.h>
> @@ -1762,6 +1764,103 @@ static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
> return ret;
> }
>
> +static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
> +{
> + struct net_device *dev = NULL;
> +
> + if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
> + dev = dev_get_by_name(&init_net, ctrl->opts->host_iface);
> + else {
> + struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
> +
> + if (tctrl->addr.ss_family == AF_INET) {
> + struct rtable *rt;
> + struct flowi4 fl4 = {};
> + struct sockaddr_in *addr =
> + (struct sockaddr_in *)&tctrl->addr;
> +
> + fl4.daddr = addr->sin_addr.s_addr;
> + if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
> + addr = (struct sockaddr_in *)&tctrl->src_addr;
> + fl4.saddr = addr->sin_addr.s_addr;
> + }
> + fl4.flowi4_proto = IPPROTO_TCP;
> +
> + rt = ip_route_output_key(&init_net, &fl4);
> + if (IS_ERR(rt))
> + return NULL;
> +
> + dev = dst_dev(&rt->dst);
> + /*
> + * Get reference to netdev as ip_rt_put() will
> + * release the netdev reference.
> + */
> + if (dev)
> + dev_hold(dev);
> +
> + ip_rt_put(rt);
> +
> + } else if (tctrl->addr.ss_family == AF_INET6) {
> + struct dst_entry *dst;
> + struct flowi6 fl6 = {};
> + struct sockaddr_in6 *addr6 =
> + (struct sockaddr_in6 *)&tctrl->addr;
> +
> + fl6.daddr = addr6->sin6_addr;
> + if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
> + addr6 = (struct sockaddr_in6 *)&tctrl->src_addr;
> + fl6.saddr = addr6->sin6_addr;
> + }
> + fl6.flowi6_proto = IPPROTO_TCP;
> +
> + dst = ip6_route_output(&init_net, NULL, &fl6);
> + if (dst->error) {
> + dst_release(dst);
> + return NULL;
> + }
> +
> + dev = dst_dev(dst);
> + /*
> + * Get reference to netdev as dst_release() will
> + * release the netdev reference.
> + */
> + if (dev)
> + dev_hold(dev);
> +
> + dst_release(dst);
> + }
> + }
This looks like a helper that should be outside of nvme-tcp.
Nothing specific to it here. Something like dev_get_by_dstaddr()
> +
> + return dev;
> +}
> +
> +static void nvme_tcp_put_netdev(struct net_device *dev)
> +{
> + if (dev)
> + dev_put(dev);
> +}
> +
> +/*
> + * Returns number of active NIC queues (min of TX/RX), or 0 if device cannot
> + * be determined.
> + */
> +static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
> +{
> + struct net_device *dev;
> + int tx_queues, rx_queues;
> +
> + dev = nvme_tcp_get_netdev(ctrl);
> + if (!dev)
> + return 0;
> +
> + tx_queues = dev->real_num_tx_queues;
> + rx_queues = dev->real_num_rx_queues;
I can see various ways this can go wrong with the variety of stacked
network devices. For example, for bonding this can easily diverge from
the slave devices' queues (in theory at least). Also, vlan/vxlan devices
will not represent the real hw queues iirc.

This is a good example of how nvme-tcp is different from the other
drivers. It sits on top of an abstraction layer, which prevents it from
reliably "getting it right". It may get it right *some* of the time, but
it can also get it wrong...
Maybe an explicit opt-in is warranted here... I would not be against
having this approach if an explicit opt-in is passed by the user, I
suppose.

btw, such an approach would be much more robust in nvme-rdma, which
does not sit behind this set of abstractions.
One thing that I will add is that nvme-tcp is likely to see *multiple*
controllers (HA fundamentals for nvmf arrays), so I think that improving
performance in that scenario would be much more impactful.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized
2026-04-20 11:49 ` [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized Nilay Shroff
@ 2026-04-24 22:15 ` Sagi Grimberg
2026-04-27 12:14 ` Nilay Shroff
0 siblings, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2026-04-24 22:15 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, hare, chaitanyak, gjoyce
On 20/04/2026 14:49, Nilay Shroff wrote:
> Some systems may configure fewer NIC queues than supported by the
> hardware. When the number of NVMe-TCP I/O queues is limited by the
> number of active NIC queues, this can result in suboptimal performance.
>
> Add a diagnostic message to warn when the configured NIC queue count
> is lower than the maximum supported queue count, as reported by the
> driver. This may help users identify configurations where increasing
> the NIC queue count could improve performance.
>
> This change is informational only and does not modify NIC configuration.
I don't think that we want this at all. It is not nvme-tcp's place to
print such a log message, especially not every time it connects to a
controller.

If you think you need this, create a userspace nvmf tool that
tests/validates a host.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 4/4] nvme: expose queue information via debugfs
2026-04-20 11:49 ` [RFC PATCH 4/4] nvme: expose queue information via debugfs Nilay Shroff
@ 2026-04-24 22:23 ` Sagi Grimberg
2026-04-27 12:12 ` Nilay Shroff
0 siblings, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2026-04-24 22:23 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, hare, chaitanyak, gjoyce
On 20/04/2026 14:49, Nilay Shroff wrote:
> Add a new debugfs attribute "io_queue_info" to expose per-queue
> information for NVMe controllers. For NVMe-TCP, this includes the
> CPU handling each I/O queue and the associated TCP flow (source and
> destination address/port).
>
> This information can be useful for understanding and tuning the
> interaction between NVMe-TCP I/O queues and network stack components,
> such as IRQ affinity, RPS/RFS, XPS, or NIC flow steering (ntuple).
>
> The data is exported using seq_file interfaces to allow iteration
> over all controller queues.
Don't really mind having this. Not sure who will actually go through
the process of mangling RFS/RPS/XPS based on this 5-tuple, but ok...
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
2026-04-22 11:10 ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Hannes Reinecke
@ 2026-04-24 22:30 ` Sagi Grimberg
2026-04-27 12:11 ` Nilay Shroff
2026-04-27 6:13 ` Nilay Shroff
1 sibling, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2026-04-24 22:30 UTC (permalink / raw)
To: Hannes Reinecke, Nilay Shroff, linux-nvme; +Cc: kbusch, hch, chaitanyak, gjoyce
On 22/04/2026 14:10, Hannes Reinecke wrote:
> On 4/20/26 13:49, Nilay Shroff wrote:
>> Hi,
>>
>> The NVMe/TCP host driver currently provisions I/O queues primarily based
>> on CPU availability rather than the capabilities and topology of the
>> underlying network interface.
>>
>> On modern systems with many CPUs but fewer NIC hardware queues, this can
>> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX
>> queue,
>> resulting in increased lock contention, cacheline bouncing, and degraded
>> throughput.
>>
>> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
>> with NIC queue resources, and to expose queue/flow information to enable
>> more effective system-level tuning.
>>
>> Key ideas
>> ---------
>>
>> 1. Scale NVMe/TCP I/O queues based on NIC queue count
>> Instead of relying solely on CPU count, limit the number of I/O
>> workers
>> to:
>> min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>>
>> 2. Improve CPU locality
>> Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ
>> affinity
>> to reduce cross-CPU traffic and improve cache locality.
>>
>> 3. Expose queue and flow information via debugfs
>> Export per-I/O queue information including:
>> - queue id (qid)
>> - CPU affinity
>> - TCP flow (src/dst IP and ports)
>>
>> This enables userspace tools to configure:
>> - IRQ affinity
>> - RPS/XPS
>> - ntuple steering
>> - or any other scaling as deemed feasible
>>
>> 4. Provide infrastructure for extensible debugfs support in NVMe
>>
>> Together, these changes allow better alignment of:
>> flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
>>
>> Performance Evaluation
>> ----------------------
>> Tests were conducted using fio over NVMe/TCP with the following
>> parameters:
>> ioengine=io_uring
>> direct=1
>> bs=4k
>> numjobs=<#nic-queues>
>> iodepth=64
>> System:
>> CPUs: 72
>> NIC: 100G mlx5
>>
>> Two configurations were evaluated.
>>
>> Scenario 1: NIC queues < CPU count
>> ----------------------------------
>> - CPUs: 72
>> - NIC queues: 32
>>
>> Baseline Patched Patched + tuning
>> randread 3141 MB/s 3228 MB/s 7509 MB/s
>> (767k IOPS) (788k IOPS) (1833k IOPS)
>>
>> randwrite 4510 MB/s 6172 MB/s 7518 MB/s
>> (1101k IOPS) (1507k IOPS) (1836k IOPS)
>>
>> randrw (read) 2156 MB/s 2560 MB/s 3932 MB/s
>> (526k IOPS) (625k IOPS) (960k IOPS)
>>
>> randrw (write) 2155 MB/s 2560 MB/s 3932 MB/s
>> (526k IOPS) (625k IOPS) (960k IOPS)
>>
>> Observation:
>> When CPU count exceeds NIC queue count, the baseline configuration
>> suffers from queue contention. The proposed changes provide modest
>> improvements on their own, and when combined with queue-aware tuning
>> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
>> ~1.5x–2.5x throughput improvement.
>>
>> Scenario 2: NIC queues == CPU count
>> -----------------------------------
>>
>> - CPUs: 72
>> - NIC queues: 72
>>
>> Baseline Patched + tuning
>> randread 4310 MB/s 7987 MB/s
>> (1052k IOPS) (1950k IOPS)
>>
>> randwrite 7947 MB/s 7972 MB/s
>> (1940k IOPS) (1946k IOPS)
>>
>> randrw (read) 3583 MB/s 4030 MB/s
>> (875k IOPS) (984k IOPS)
>>
>> randrw (write) 3583 MB/s 4029 MB/s
>> (875k IOPS) (984k IOPS)
>>
>> Observation:
>> When NIC queues are already aligned with CPU count, the baseline
>> performs
>> well. The proposed changes maintain write performance (no regression)
>> and
>> still improve read and mixed workloads due to better flow-to-CPU
>> locality.
>>
>> Notes on tuning
>> ---------------
>> The "patched + tuning" configuration includes:
>> - aligning NVMe/TCP I/O workers with NIC queue count
>> - IRQ affinity configuration per RX queue
>> - ntuple-based flow steering
>> - CPU/queue affinity alignment
>>
>> These tuning steps are enabled by the queue/flow information exposed
>> through
>> this patchset.
>>
>> Discussion
>> ----------
>> This RFC aims to start discussion around:
>> - Whether NVMe/TCP queue scaling should consider NIC queue topology
>> - How best to expose queue/flow information to userspace
>> - The role of userspace vs kernel in steering decisions
>>
>> As usual, feedback/comment/suggestions are most welcome!
>>
>> Reference to LSF/MM/BPF abstarct:
>> https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>>
>
> Weelll ... we have been debating this back and forth over recent years:
> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>
> Initially it sounds appealing, and in fact I've worked on several
> attempts myself. But in the end there are far more things which need
> to be considered:
> -> For networking, number of queues is not really telling us anything.
> Most NICs have distinct RX and TX queues, and the number (of both!)
> varies quite dramatically.
> -> The number of queues does _not_ indicate that all queues are used
> simultaneously. That is down to things like RSS and friends.
> I gave a stab at configuring _that_ but it's patently horrible
> trying to out-guess things for yourself.
> -> It'll only work if you run directly on the NIC. As soon as there
> is anything in between (qemu? Tunnelling?) you are out of luck.
>
> So yeah, we should have a discussion here.
TBH, I don't think that this is very useful. I mentioned some of the
reasons why on patch #1. But the main reason is that I think the majority
of the gains that you are showing come from the tuning - which is somewhat
unrelated to the driver, and which, TBH, I doubt anyone will actually do
in reality.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
2026-04-22 11:10 ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Hannes Reinecke
2026-04-24 22:30 ` Sagi Grimberg
@ 2026-04-27 6:13 ` Nilay Shroff
1 sibling, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-27 6:13 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: kbusch, hch, sagi, chaitanyak, gjoyce
On 4/22/26 4:40 PM, Hannes Reinecke wrote:
> On 4/20/26 13:49, Nilay Shroff wrote:
>> [cover letter snipped]
>
> Weelll ... we have been debating this back and forth over recent years:
> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>
> Initially it sounds appealing, and in fact I've worked on several
> attempts myself. But in the end there are far more things which need
> to be considered:
> -> For networking, number of queues is not really telling us anything.
> Most NICs have distinct RX and TX queues, and the number (of both!)
> varies quite dramatically.
The proposed I/O queue scaling follows a conservative approach based on
currently configured NIC queues:
    if (NIC exposes combined TX/RX queues):
        nr_hw_queues = min(num_online_cpus, combined_tx_rx);
    else:
        real_hw_queues = min(real_num_tx_queues, real_num_rx_queues);
        nr_hw_queues = min(num_online_cpus, real_hw_queues);
The intent here is not to model full NIC behavior, but to avoid obvious
over-subscription when the number of NVMe/TCP I/O workers significantly
exceeds the available NIC queues.
Also, this is not enabled by default. It is gated behind the
`match_hw_queues` fabric option, so existing setups are unaffected unless
explicitly enabled.
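For illustration, a minimal sketch of that heuristic, assuming the
combined-channel case is detected via the device's ethtool_ops (the actual
patch may detect it differently, and get_channels() requires the caller to
hold rtnl_lock()):

static unsigned int nvme_tcp_nic_queue_limit(struct net_device *dev)
{
        struct ethtool_channels ch = {};

        /* NICs with combined TX/RX channels report them here. */
        if (dev->ethtool_ops && dev->ethtool_ops->get_channels) {
                dev->ethtool_ops->get_channels(dev, &ch);
                if (ch.combined_count)
                        return min(num_online_cpus(), ch.combined_count);
        }

        /* Otherwise use the smaller of the real TX/RX queue counts. */
        return min3(num_online_cpus(), dev->real_num_tx_queues,
                    dev->real_num_rx_queues);
}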
> -> The number of queues does _not_ indicate that all queues are used
> simultaneously. That is down to things like RSS and friends.
> I gave a stab at configuring _that_ but it's patently horrible
> trying to out-guess things for yourself.
Agreed that queue count alone does not imply effective parallelism, as
traffic distribution depends on RSS/RPS/XPS.
This patchset does not attempt to infer or control how queues are used.
Instead, it treats the currently configured number of TX/RX queues as a
conservative upper bound for I/O worker scaling.
In addition, the patchset exposes queue, CPU, and flow information via
debugfs. This allows userspace to configure steering policies (IRQ
affinity, ntuple, RPS/XPS) based on actual system behavior. If the NIC
supports configuring n-tuple filters, it becomes possible to steer each
I/O flow to a unique queue. In fact, exporting each NVMe/TCP I/O flow and
its CPU via debugfs should make such n-tuple configuration straightforward,
which in turn helps align the TX queue, RX queue, IRQ, and TCP worker on
the same CPU.
> -> It'll only work if you run directly on the NIC. As soon as there
> is anything in between (qemu? Tunnelling?) you are out of luck.
>
Agreed that in environments where NIC topology does not reflect the
effective data path (e.g., certain QEMU or tunneling configurations),
this heuristic may not be meaningful. In such cases, users can simply
avoid enabling "match_hw_queues" and retain the existing behavior.
That said, there are also configurations using QEMU (e.g., vhost-net with
multiqueue, or VFIO passthrough) or SR-IOV where NIC queue topology is still
relevant, and this approach can provide benefit.
Overall, the goal here is:
- to avoid clear over-provisioning of I/O workers, and
- to expose sufficient information for userspace-driven tuning
(using RPS/XPS/n-tuple etc.)
> So yeah, we should have a discussion here.
>
Sure, looking forward to further discussion.
Thanks,
--Nilay
* Re: [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues
2026-04-24 13:46 ` Christoph Hellwig
@ 2026-04-27 7:37 ` Nilay Shroff
0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-27 7:37 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-nvme, kbusch, hare, sagi, chaitanyak, gjoyce
On 4/24/26 7:16 PM, Christoph Hellwig wrote:
>> In such configurations, limiting the number of NVMe-TCP I/O queues to
>> the number of NIC hardware queues can improve performance by reducing
>> contention and improving locality. Aligning NVMe-TCP worker threads with
>> NIC queue topology may also help reduce tail latency.
>
> Yes, this sounds useful.
>
>>
>> Add a new transport option "match_hw_queues" to allow users to
>> optionally limit the number of NVMe-TCP I/O queues to the number of NIC
>> TX/RX queues. When enabled, the number of I/O queues is set to:
>>
>> min(num_online_cpus, num_nic_queues)
>>
>> This behavior is opt-in and does not change existing defaults.
>
> Any good reason for that? For PCI and RDMA we try to do the right
> thing by default.
>
The only reason was that in certain complex topologies (for instance,
QEMU) it may not really be possible to get the real number of TX/RX
queues. In such situations, I thought we were better off not changing
the default behavior, hence I added the opt-in. But yes, I'd also love
to remove this option and find a better way to detect the cases where we
can't get the real number of TX/RX queues, and then automatically fall
back to creating as many I/O queues as there are online CPUs. I'll
explore whether that's possible.
>> +static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
>> +{
>> +        struct net_device *dev = NULL;
>> +
>> +        if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
>> +                dev = dev_get_by_name(&init_net, ctrl->opts->host_iface);
>
> Return early here instead of the giant indentation for the new options.
>
Yes okay, makes sense!
>> +        else {
>> +                struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
>> +
>> +                if (tctrl->addr.ss_family == AF_INET) {
>
> And then split each address family into a helper. And to me those
> look like something that should be in net/.
>
Hmm okay, I think if we want to add these helpers under net/ then they
should be in include/net/route.h and include/net/ip6_route.h for IPv4 and
IPv6, respectively.
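Something along these lines, as a rough sketch (the per-family helper names
below are placeholders, not the final interface):

static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
{
        struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);

        /* Early return for the explicit host interface case. */
        if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
                return dev_get_by_name(&init_net, ctrl->opts->host_iface);

        /* One helper per address family, as suggested. */
        if (tctrl->addr.ss_family == AF_INET)
                return nvme_tcp_get_netdev_v4(ctrl);
        if (tctrl->addr.ss_family == AF_INET6)
                return nvme_tcp_get_netdev_v6(ctrl);

        return NULL;
}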
>> +
>> +/*
>> + * Returns number of active NIC queues (min of TX/RX), or 0 if device cannot
>> + * be determined.
>> + */
>> +static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
>
> drop _current to make this a bit more readable?
>
Sure.
>> @@ -2144,6 +2243,24 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>          unsigned int nr_io_queues;
>>          int ret;
>>
>> +        if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
>> +                (ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
>
> The more readable formatting would be:
>
>         if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
>             (ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
>
Yep, I will change this.
>> +                int nr_hw_queues;
>> +
>> +                nr_hw_queues = nvme_tcp_get_netdev_current_queue_count(ctrl);
>> +                if (nr_hw_queues <= 0)
>> +                        goto init_queue;
>> +
>> +                ctrl->opts->nr_io_queues = min(nr_hw_queues, num_online_cpus());
>> +
>> +                if (ctrl->opts->nr_io_queues < num_online_cpus())
>> +                        dev_info(ctrl->device,
>> +                                "limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
>> +                                ctrl->opts->nr_io_queues, nr_hw_queues,
>> +                                num_online_cpus());
>> +        }
>
> And splitting this into a helper would help keeping the flow sane.
>
Alright, I will make it into a separate helper.
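A rough cut of that helper (names are placeholders; it also folds in the
_current rename suggested earlier in this thread):

static void nvme_tcp_limit_io_queues(struct nvme_ctrl *ctrl)
{
        int nr_hw_queues;

        if ((ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) ||
            !(ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES))
                return;

        nr_hw_queues = nvme_tcp_get_netdev_queue_count(ctrl);
        if (nr_hw_queues <= 0)
                return;

        ctrl->opts->nr_io_queues = min(nr_hw_queues, num_online_cpus());
        if (ctrl->opts->nr_io_queues < num_online_cpus())
                dev_info(ctrl->device,
                        "limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
                        ctrl->opts->nr_io_queues, nr_hw_queues,
                        num_online_cpus());
}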
Thanks,
--Nilay
* Re: [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues
2026-04-24 22:10 ` Sagi Grimberg
@ 2026-04-27 11:57 ` Nilay Shroff
0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-27 11:57 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, hare, chaitanyak, gjoyce
On 4/25/26 3:40 AM, Sagi Grimberg wrote:
>
>
> On 20/04/2026 14:49, Nilay Shroff wrote:
>> NVMe-TCP currently provisions I/O queues primarily based on CPU
>> availability. On systems where the number of CPUs significantly exceeds
>> the number of NIC hardware queues, this can lead to multiple I/O queues
>> sharing the same NIC TX/RX queues, resulting in increased lock
>> contention, cacheline bouncing, and inter-processor interrupts (IPIs).
>
> Yes, I agree it is very inefficient to create something like 192 queues in practice.
> Nevermind that this is pretty much never the case because real controllers
> will limit the number of IO queues to something much lower than that,
> in the majority of cases probably a handful or more.
>
Yes, it may not always be possible (or meaningful) to have a 1:1 mapping
between NVMe/TCP I/O queues and the real controller’s internal queues.
The intent here is not to enforce such a mapping, but rather to improve
host-side locality. Even when the controller limits the number of I/O queues,
a mismatch with NIC queue topology can still lead to multiple I/O workers
contending on the same TX/RX queues. This change tries to avoid such
over-subscription on the host side.
> Please note that this has very much in common with RDMA, so the
> patch series should probably address both.
>
Yes agreed — the model in RDMA is more tightly coupled to hardware resources.
It probably makes sense to consider that separately, as the problem space is
simpler and the mapping is more direct.
>>
>> In such configurations, limiting the number of NVMe-TCP I/O queues to
>> the number of NIC hardware queues can improve performance by reducing
>> contention and improving locality. Aligning NVMe-TCP worker threads with
>> NIC queue topology may also help reduce tail latency.
>
> As mentioned, from what I know, when using real nvmf arrays, the number of
> queues will usually be much lower than both the cpu count as well as the NIC hw
> queues.
>
>>
>> Add a new transport option "match_hw_queues" to allow users to
>> optionally limit the number of NVMe-TCP I/O queues to the number of NIC
>> TX/RX queues. When enabled, the number of I/O queues is set to:
>>
>> min(num_online_cpus, num_nic_queues)
>>
>> This behavior is opt-in and does not change existing defaults.
>
> In my mind, there is no real reason for an opt-in. The opt-in should
> probably be if the user actually wants to use num_online_cpus() worth of
> queues (e.g. user explicitly asked for nr_io_queues).
>>
Yes, that makes sense, but in certain complex topologies, such as a host
running inside QEMU or behind stacked network devices, it may not be
possible to find the real number of TX/RX queues exposed by the physical
NIC. For such cases we may not want to alter the existing behavior, hence
I chose to support opt-in.
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>  drivers/nvme/host/fabrics.c |   4 ++
>>  drivers/nvme/host/fabrics.h |   3 +
>>  drivers/nvme/host/tcp.c     | 120 +++++++++++++++++++++++++++++++++-
>>  3 files changed, 126 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>> index ac3d4f400601..62ae998825e1 100644
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -709,6 +709,7 @@ static const match_table_t opt_tokens = {
>>          { NVMF_OPT_TLS,                 "tls" },
>>          { NVMF_OPT_CONCAT,              "concat" },
>>  #endif
>> +        { NVMF_OPT_MATCH_HW_QUEUES,     "match_hw_queues" },
>>          { NVMF_OPT_ERR,                 NULL }
>>  };
>> @@ -1064,6 +1065,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>>                          }
>>                          opts->concat = true;
>>                          break;
>> +                case NVMF_OPT_MATCH_HW_QUEUES:
>> +                        opts->match_hw_queues = true;
>> +                        break;
>>                  default:
>>                          pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
>>                                  p);
>> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
>> index caf5503d0833..e8e3a2672832 100644
>> --- a/drivers/nvme/host/fabrics.h
>> +++ b/drivers/nvme/host/fabrics.h
>> @@ -67,6 +67,7 @@ enum {
>>          NVMF_OPT_KEYRING        = 1 << 26,
>>          NVMF_OPT_TLS_KEY        = 1 << 27,
>>          NVMF_OPT_CONCAT         = 1 << 28,
>> +        NVMF_OPT_MATCH_HW_QUEUES = 1 << 29,
>>  };
>
> No need for the above in my mind.
>
>> /**
>> @@ -106,6 +107,7 @@ enum {
>>   * @disable_sqflow: disable controller sq flow control
>>   * @hdr_digest: generate/verify header digest (TCP)
>>   * @data_digest: generate/verify data digest (TCP)
>> + * @match_hw_queues: limit controller IO queue count based on NIC queues (TCP)
>>   * @nr_write_queues: number of queues for write I/O
>>   * @nr_poll_queues: number of queues for polling I/O
>>   * @tos: type of service
>> @@ -136,6 +138,7 @@ struct nvmf_ctrl_options {
>>          bool disable_sqflow;
>>          bool hdr_digest;
>>          bool data_digest;
>> +        bool match_hw_queues;
>>          unsigned int nr_write_queues;
>>          unsigned int nr_poll_queues;
>>          int tos;
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 243dab830dc8..7102a7a54d78 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -16,6 +16,8 @@
>> #include <net/tls.h>
>> #include <net/tls_prot.h>
>> #include <net/handshake.h>
>> +#include <net/ip6_route.h>
>> +#include <linux/in6.h>
>> #include <linux/blk-mq.h>
>> #include <net/busy_poll.h>
>> #include <trace/events/sock.h>
>> @@ -1762,6 +1764,103 @@ static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
>>          return ret;
>>  }
>> +static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
>> +{
>> +        struct net_device *dev = NULL;
>> +
>> +        if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
>> +                dev = dev_get_by_name(&init_net, ctrl->opts->host_iface);
>> +        else {
>> +                struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
>> +
>> +                if (tctrl->addr.ss_family == AF_INET) {
>> +                        struct rtable *rt;
>> +                        struct flowi4 fl4 = {};
>> +                        struct sockaddr_in *addr =
>> +                                (struct sockaddr_in *)&tctrl->addr;
>> +
>> +                        fl4.daddr = addr->sin_addr.s_addr;
>> +                        if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
>> +                                addr = (struct sockaddr_in *)&tctrl->src_addr;
>> +                                fl4.saddr = addr->sin_addr.s_addr;
>> +                        }
>> +                        fl4.flowi4_proto = IPPROTO_TCP;
>> +
>> +                        rt = ip_route_output_key(&init_net, &fl4);
>> +                        if (IS_ERR(rt))
>> +                                return NULL;
>> +
>> +                        dev = dst_dev(&rt->dst);
>> +                        /*
>> +                         * Get reference to netdev as ip_rt_put() will
>> +                         * release the netdev reference.
>> +                         */
>> +                        if (dev)
>> +                                dev_hold(dev);
>> +
>> +                        ip_rt_put(rt);
>> +
>> +                } else if (tctrl->addr.ss_family == AF_INET6) {
>> +                        struct dst_entry *dst;
>> +                        struct flowi6 fl6 = {};
>> +                        struct sockaddr_in6 *addr6 =
>> +                                (struct sockaddr_in6 *)&tctrl->addr;
>> +
>> +                        fl6.daddr = addr6->sin6_addr;
>> +                        if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
>> +                                addr6 = (struct sockaddr_in6 *)&tctrl->src_addr;
>> +                                fl6.saddr = addr6->sin6_addr;
>> +                        }
>> +                        fl6.flowi6_proto = IPPROTO_TCP;
>> +
>> +                        dst = ip6_route_output(&init_net, NULL, &fl6);
>> +                        if (dst->error) {
>> +                                dst_release(dst);
>> +                                return NULL;
>> +                        }
>> +
>> +                        dev = dst_dev(dst);
>> +                        /*
>> +                         * Get reference to netdev as dst_release() will
>> +                         * release the netdev reference.
>> +                         */
>> +                        if (dev)
>> +                                dev_hold(dev);
>> +
>> +                        dst_release(dst);
>> +                }
>> +        }
>
> This looks like a helper that should be outside of nvme-tcp.
> Nothing specific to it here. Something like dev_get_by_dstaddr()
Yeah, in another thread Christoph also had a similar opinion, so I'd
add a helper under net/.
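For the IPv4 side, it would mostly be a matter of lifting the existing
lookup into a generic helper; roughly the sketch below (the name follows
your dev_get_by_dstaddr() suggestion, the exact signature is open):

/* IPv4 only; an IPv6 variant would mirror this with ip6_route_output(). */
struct net_device *dev_get_by_dstaddr(struct net *net, __be32 daddr,
                                      __be32 saddr)
{
        struct flowi4 fl4 = {
                .daddr = daddr,
                .saddr = saddr,
                .flowi4_proto = IPPROTO_TCP,
        };
        struct net_device *dev;
        struct rtable *rt;

        rt = ip_route_output_key(net, &fl4);
        if (IS_ERR(rt))
                return NULL;

        dev = dst_dev(&rt->dst);
        if (dev)
                dev_hold(dev);  /* keep the reference past ip_rt_put() */
        ip_rt_put(rt);

        return dev;
}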
>> +
>> +        return dev;
>> +}
>> +
>> +static void nvme_tcp_put_netdev(struct net_device *dev)
>> +{
>> +        if (dev)
>> +                dev_put(dev);
>> +}
>> +
>> +/*
>> + * Returns number of active NIC queues (min of TX/RX), or 0 if device cannot
>> + * be determined.
>> + */
>> +static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
>> +{
>> +        struct net_device *dev;
>> +        int tx_queues, rx_queues;
>> +
>> +        dev = nvme_tcp_get_netdev(ctrl);
>> +        if (!dev)
>> +                return 0;
>> +
>> +        tx_queues = dev->real_num_tx_queues;
>> +        rx_queues = dev->real_num_rx_queues;
>
> I can see various ways this can go wrong with the variety of stacked network
> devices. For example, for bonding, this can easily diverge from the slave
> devices' queues (in theory at least). Also, vlan/vxlan devices will not
> represent the real hw queues iirc.
>
> This is a good example of how nvme-tcp is different than the other drivers.
> It sits on top of an abstraction layer, which prevents it from "not getting it wrong".
> It may get it right *some* of the time, but it can also get it wrong...
>
> Maybe an explicit opt-in is warranted here...
> I would not be against having this approach in case an explicit opt-in is passed
> by the user I suppose.
>
Yes, so this is the _real_ reason why I proposed the opt-in.
> btw such an approach would be much more robust in nvme-rdma which
> does not see this set of abstractions.
>
> One thing that I will comment on in addition is that nvme-tcp is likely to
> see *multiple* controllers (HA fundamentals for nvmf arrays), so I think that
> improving performance in this scenario would be much more impactful.
Hmm, yes really good point. This patch focuses on improving per-controller
locality, but I agree that coordinating resource usage across controllers
sharing the same NIC could provide additional benefits. This seems like a
natural next step, thanks for highlighting this. I will address it in
a separate patch once the current work is accepted.
Thanks,
--Nilay
* Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
2026-04-24 22:30 ` Sagi Grimberg
@ 2026-04-27 12:11 ` Nilay Shroff
0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-27 12:11 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, linux-nvme
Cc: kbusch, hch, chaitanyak, gjoyce
On 4/25/26 4:00 AM, Sagi Grimberg wrote:
>
>
> On 22/04/2026 14:10, Hannes Reinecke wrote:
>> On 4/20/26 13:49, Nilay Shroff wrote:
>>> [cover letter snipped]
>>
>> Weelll ... we have been debating this back and forth over recent years:
>> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>>
>> Initially it sounds appealing, and in fact I've worked on several
>> attempts myself. But in the end there are far more things which need
>> to be considered:
>> -> For networking, number of queues is not really telling us anything.
>> Most NICs have distinct RX and TX queues, and the number (of both!)
>> varies quite dramatically.
>> -> The number of queues does _not_ indicate that all queues are used
>> simultaneously. That is down to things like RSS and friends.
>> I gave a stab at configuring _that_ but it's patently horrible
>> trying to out-guess things for yourself.
>> -> It'll only work if you run directly on the NIC. As soon as there
>> is anything in between (qemu? Tunnelling?) you are out of luck.
>>
>> So yeah, we should have a discussion here.
>
> TBH, I don't think that this is very useful. I mentioned some reasons why
> on patch #1.
>
> But the main reason is that I think that the majority of the gains that
> you are showing come from the tuning - which is somewhat unrelated to the
> driver, and TBH, I doubt anyone will actually do it in reality.
Even without additional tuning, aligning the NVMe/TCP I/O workers with
CPU and NIC queue locality already provides measurable performance
benefits (primarily visible in random write workloads, as shown in
Scenario 1).
The additional gains come from system-level tuning (e.g., XPS/RPS/RSS),
which further improves utilization of NIC queues and CPU locality.
That said, it is the patchset that enables this tuning, by exposing
queue/flow information and establishing better default alignment.
While such tuning may not be applied in all deployments, IMO it would be
commonly used in performance-sensitive environments where users aim to
fully utilize available NIC and CPU resources.
Thanks,
--Nilay
* Re: [RFC PATCH 4/4] nvme: expose queue information via debugfs
2026-04-24 22:23 ` Sagi Grimberg
@ 2026-04-27 12:12 ` Nilay Shroff
0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-27 12:12 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, hare, chaitanyak, gjoyce
On 4/25/26 3:53 AM, Sagi Grimberg wrote:
>
>
> On 20/04/2026 14:49, Nilay Shroff wrote:
>> Add a new debugfs attribute "io_queue_info" to expose per-queue
>> information for NVMe controllers. For NVMe-TCP, this includes the
>> CPU handling each I/O queue and the associated TCP flow (source and
>> destination address/port).
>>
>> This information can be useful for understanding and tuning the
>> interaction between NVMe-TCP I/O queues and network stack components,
>> such as IRQ affinity, RPS/RFS, XPS, or NIC flow steering (ntuple).
>>
>> The data is exported using seq_file interfaces to allow iteration
>> over all controller queues.
>
> Don't really mind having this. Not sure who will actually go through
> the process of mangling RFS/RPS/XPS based on this 5-tuple, but ok...
Yeah, it may not always be used, but I think for performance-sensitive
workloads, users would like to leverage this information for tuning the
I/O stack.
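For illustration, the seq_file side could be as simple as the sketch below
(field names and output layout here are illustrative, not necessarily what
the patch emits; error handling elided):

static int nvme_tcp_queue_info_show(struct seq_file *m, void *priv)
{
        struct nvme_tcp_ctrl *ctrl = m->private;
        int qid;

        /* qid 0 is the admin queue; report I/O queues only. */
        for (qid = 1; qid < ctrl->ctrl.queue_count; qid++) {
                struct nvme_tcp_queue *queue = &ctrl->queues[qid];
                struct sockaddr_storage src, dst;

                kernel_getsockname(queue->sock, (struct sockaddr *)&src);
                kernel_getpeername(queue->sock, (struct sockaddr *)&dst);
                seq_printf(m, "qid=%d cpu=%d src=%pISpc dst=%pISpc\n",
                           qid, queue->io_cpu, &src, &dst);
        }
        return 0;
}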
Thanks,
--Nilay
* Re: [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized
2026-04-24 22:15 ` Sagi Grimberg
@ 2026-04-27 12:14 ` Nilay Shroff
0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2026-04-27 12:14 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, hare, chaitanyak, gjoyce
On 4/25/26 3:45 AM, Sagi Grimberg wrote:
>
>
> On 20/04/2026 14:49, Nilay Shroff wrote:
>> Some systems may configure fewer NIC queues than supported by the
>> hardware. When the number of NVMe-TCP I/O queues is limited by the
>> number of active NIC queues, this can result in suboptimal performance.
>>
>> Add a diagnostic message to warn when the configured NIC queue count
>> is lower than the maximum supported queue count, as reported by the
>> driver. This may help users identify configurations where increasing
>> the NIC queue count could improve performance.
>>
>> This change is informational only and does not modify NIC configuration.
>
> I don't think that we want this at all. I don't think it is nvme-tcp's place
> to print such a log message at all. Especially not every time it connects
> to a controller.
>
> If you think you need to add this, create a userspace nvmf tool that
> tests/validates a host.
Okay, we can drop this from the kernel and perhaps add it to a userspace
tool such as nvme connect...
Thanks,
--Nilay
end of thread, other threads:[~2026-04-27 12:14 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
2026-04-24 13:46 ` Christoph Hellwig
2026-04-27 7:37 ` Nilay Shroff
2026-04-24 22:10 ` Sagi Grimberg
2026-04-27 11:57 ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized Nilay Shroff
2026-04-24 22:15 ` Sagi Grimberg
2026-04-27 12:14 ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 3/4] nvme: add debugfs helpers for NVMe drivers Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 4/4] nvme: expose queue information via debugfs Nilay Shroff
2026-04-24 22:23 ` Sagi Grimberg
2026-04-27 12:12 ` Nilay Shroff
2026-04-22 11:10 ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Hannes Reinecke
2026-04-24 22:30 ` Sagi Grimberg
2026-04-27 12:11 ` Nilay Shroff
2026-04-27 6:13 ` Nilay Shroff