From: Sagi Grimberg <sagi@grimberg.me>
To: Hannes Reinecke <hare@suse.de>,
Nilay Shroff <nilay@linux.ibm.com>,
linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, hch@lst.de, chaitanyak@nvidia.com,
gjoyce@linux.ibm.com
Subject: Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
Date: Sat, 25 Apr 2026 01:30:03 +0300
Message-ID: <faeb48dc-5672-465b-9650-558522dc4f65@grimberg.me>
In-Reply-To: <649036dd-b99f-4f60-93f4-16979e11f520@suse.de>

On 22/04/2026 14:10, Hannes Reinecke wrote:
> On 4/20/26 13:49, Nilay Shroff wrote:
>> Hi,
>>
>> The NVMe/TCP host driver currently provisions I/O queues primarily based
>> on CPU availability rather than the capabilities and topology of the
>> underlying network interface.
>>
>> On modern systems with many CPUs but fewer NIC hardware queues, this can
>> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
>> resulting in increased lock contention, cacheline bouncing, and degraded
>> throughput.
>>
>> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
>> with NIC queue resources, and to expose queue/flow information to enable
>> more effective system-level tuning.
>>
>> Key ideas
>> ---------
>>
>> 1. Scale NVMe/TCP I/O queues based on NIC queue count
>>    Instead of relying solely on CPU count, limit the number of I/O workers
>>    to:
>>      min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>>
>> 2. Improve CPU locality
>>    Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
>>    to reduce cross-CPU traffic and improve cache locality.
>>
>> 3. Expose queue and flow information via debugfs
>> Export per-I/O queue information including:
>> - queue id (qid)
>> - CPU affinity
>> - TCP flow (src/dst IP and ports)
>>
>> This enables userspace tools to configure:
>> - IRQ affinity
>> - RPS/XPS
>> - ntuple steering
>> - or any other scaling as deemed feasible
>>
>> 4. Provide infrastructure for extensible debugfs support in NVMe
>>
>> Together, these changes allow better alignment of:
>> flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
>>
>> Performance Evaluation
>> ----------------------
>> Tests were conducted using fio over NVMe/TCP with the following
>> parameters:
>>   ioengine=io_uring
>>   direct=1
>>   bs=4k
>>   numjobs=<#nic-queues>
>>   iodepth=64
>>
>> System:
>>   CPUs: 72
>>   NIC: 100G mlx5
>>
>> Two configurations were evaluated.
>>
>> Scenario 1: NIC queues < CPU count
>> ----------------------------------
>> - CPUs: 72
>> - NIC queues: 32
>>
>>                   Baseline        Patched         Patched + tuning
>> randread          3141 MB/s       3228 MB/s       7509 MB/s
>>                   (767k IOPS)     (788k IOPS)     (1833k IOPS)
>>
>> randwrite         4510 MB/s       6172 MB/s       7518 MB/s
>>                   (1101k IOPS)    (1507k IOPS)    (1836k IOPS)
>>
>> randrw (read)     2156 MB/s       2560 MB/s       3932 MB/s
>>                   (526k IOPS)     (625k IOPS)     (960k IOPS)
>>
>> randrw (write)    2155 MB/s       2560 MB/s       3932 MB/s
>>                   (526k IOPS)     (625k IOPS)     (960k IOPS)
>>
>> Observation:
>> When CPU count exceeds NIC queue count, the baseline configuration
>> suffers from queue contention. The proposed changes provide modest
>> improvements on their own, and when combined with queue-aware tuning
>> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
>> ~1.5x–2.5x throughput improvement.
>>
>> Scenario 2: NIC queues == CPU count
>> -----------------------------------
>>
>> - CPUs: 72
>> - NIC queues: 72
>>
>>                   Baseline        Patched + tuning
>> randread          4310 MB/s       7987 MB/s
>>                   (1052k IOPS)    (1950k IOPS)
>>
>> randwrite         7947 MB/s       7972 MB/s
>>                   (1940k IOPS)    (1946k IOPS)
>>
>> randrw (read)     3583 MB/s       4030 MB/s
>>                   (875k IOPS)     (984k IOPS)
>>
>> randrw (write)    3583 MB/s       4029 MB/s
>>                   (875k IOPS)     (984k IOPS)
>>
>> Observation:
>> When NIC queues are already aligned with CPU count, the baseline performs
>> well. The proposed changes maintain write performance (no regression) and
>> still improve read and mixed workloads due to better flow-to-CPU locality.
>>
>> Notes on tuning
>> ---------------
>> The "patched + tuning" configuration includes:
>> - aligning NVMe/TCP I/O workers with NIC queue count
>> - IRQ affinity configuration per RX queue
>> - ntuple-based flow steering
>> - CPU/queue affinity alignment
>>
>> These tuning steps are enabled by the queue/flow information exposed
>> through this patchset.
>>
>> Discussion
>> ----------
>> This RFC aims to start discussion around:
>> - Whether NVMe/TCP queue scaling should consider NIC queue topology
>> - How best to expose queue/flow information to userspace
>> - The role of userspace vs kernel in steering decisions
>>
>> As usual, feedback/comment/suggestions are most welcome!
>>
>> Reference to LSF/MM/BPF abstract:
>> https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>>
>
> Weelll ... we have been debating this back and forth over recent years:
> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>
> Initially it sounds appealing, and in fact I've worked on several
> attempts myself. But in the end there are far more things which need
> to be considered:
> -> For networking, number of queues is not really telling us anything.
> Most NICs have distinct RX and TX queues, and the number (of both!)
> varies quite dramatically.
> -> The number of queues does _not_ indicate that all queues are used
> simultaneously. That is down to things like RSS and friends.
> I gave a stab at configuring _that_ but it's patently horrible
> trying to out-guess things for yourself.
> -> It'll only work if you run directly on the NIC. As soon as there
> is anything in between (qemu? Tunnelling?) you are out of luck.
>
> So yeah, we should have a discussion here.

TBH, I don't think that this is very useful. I mentioned some of the reasons
why in my reply to patch #1.

But the main reason is that I think the majority of the gains that you are
showing come from the tuning, which is somewhat unrelated to the driver and
which, TBH, I doubt anyone will actually do in reality.
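
For reference, the split between "patches alone" and "patches plus tuning"
can be read straight off the Scenario 1 table quoted above. The following is
only a rough sketch of that arithmetic in Python: the throughput numbers are
copied from the cover letter, while the dictionary layout and the speedups()
helper are made up for illustration and are not part of the patchset.

```python
# Throughput in MB/s, copied from the Scenario 1 table
# (72 CPUs, 32 NIC queues). The key names are illustrative only.
scenario1 = {
    "randread":       {"baseline": 3141, "patched": 6172 - 2944, "tuned": 7509},
    "randwrite":      {"baseline": 4510, "patched": 6172, "tuned": 7518},
    "randrw (read)":  {"baseline": 2156, "patched": 2560, "tuned": 3932},
    "randrw (write)": {"baseline": 2155, "patched": 2560, "tuned": 3932},
}
# Keep the randread "patched" value explicit so it matches the table.
scenario1["randread"]["patched"] = 3228


def speedups(row):
    """Return (patched / baseline, tuned / baseline) for one workload."""
    return row["patched"] / row["baseline"], row["tuned"] / row["baseline"]


for name, row in scenario1.items():
    patched_x, tuned_x = speedups(row)
    print(f"{name:15s} patched {patched_x:.2f}x   patched+tuning {tuned_x:.2f}x")
```

On these numbers the patches alone give roughly 1.03x for randread and 1.37x
for randwrite, while the tuned configuration gives roughly 2.39x and 1.67x
over baseline.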

Thread overview: 17+ messages
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
2026-04-24 13:46 ` Christoph Hellwig
2026-04-27 7:37 ` Nilay Shroff
2026-04-24 22:10 ` Sagi Grimberg
2026-04-27 11:57 ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized Nilay Shroff
2026-04-24 22:15 ` Sagi Grimberg
2026-04-27 12:14 ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 3/4] nvme: add debugfs helpers for NVMe drivers Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 4/4] nvme: expose queue information via debugfs Nilay Shroff
2026-04-24 22:23 ` Sagi Grimberg
2026-04-27 12:12 ` Nilay Shroff
2026-04-22 11:10 ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Hannes Reinecke
2026-04-24 22:30 ` Sagi Grimberg [this message]
2026-04-27 12:11 ` Nilay Shroff
2026-04-27 6:13 ` Nilay Shroff