Linux-NVME Archive on lore.kernel.org
From: Hannes Reinecke <hare@suse.de>
To: Nilay Shroff <nilay@linux.ibm.com>, linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
	chaitanyak@nvidia.com, gjoyce@linux.ibm.com
Subject: Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
Date: Wed, 22 Apr 2026 13:10:13 +0200	[thread overview]
Message-ID: <649036dd-b99f-4f60-93f4-16979e11f520@suse.de> (raw)
In-Reply-To: <20260420115716.3071293-1-nilay@linux.ibm.com>

On 4/20/26 13:49, Nilay Shroff wrote:
> Hi,
> 
> The NVMe/TCP host driver currently provisions I/O queues primarily based
> on CPU availability rather than the capabilities and topology of the
> underlying network interface.
> 
> On modern systems with many CPUs but fewer NIC hardware queues, this can
> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
> resulting in increased lock contention, cacheline bouncing, and degraded
> throughput.
> 
> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
> with NIC queue resources, and to expose queue/flow information to enable
> more effective system-level tuning.
> 
> Key ideas
> ---------
> 
> 1. Scale NVMe/TCP I/O queues based on NIC queue count
>     Instead of relying solely on CPU count, limit the number of I/O workers
>     to:
>         min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
> 
> 2. Improve CPU locality
>     Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
>     to reduce cross-CPU traffic and improve cache locality.
> 
> 3. Expose queue and flow information via debugfs
>     Export per-I/O queue information including:
>         - queue id (qid)
>         - CPU affinity
>         - TCP flow (src/dst IP and ports)
> 
>     This enables userspace tools to configure:
>         - IRQ affinity
>         - RPS/XPS
>         - ntuple steering
>         - or any other scaling as deemed feasible
> 
> 4. Provide infrastructure for extensible debugfs support in NVMe
> 
> Together, these changes allow better alignment of:
>      flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
> 
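[Editorial note: the scaling rule in idea 1 can be illustrated with a small shell sketch. The values below are hardcoded examples (72 CPUs, 32 NIC TX queues, matching Scenario 1 later in this mail), not probed from a live system; on real hardware they would come from e.g. getconf(1) and /sys/class/net/<nic>/queues/.]

```shell
# Illustrative computation of the proposed I/O queue scaling:
#   nr_io_queues = min(num_online_cpus, netdev real_num_tx_queues)
# Example values are hardcoded; on a real system they would come from:
#   num_online_cpus=$(getconf _NPROCESSORS_ONLN)
#   real_num_tx_queues=$(ls -d /sys/class/net/<nic>/queues/tx-* | wc -l)
num_online_cpus=72
real_num_tx_queues=32

# Bash arithmetic ternary implements the min():
nr_io_queues=$(( num_online_cpus < real_num_tx_queues ? num_online_cpus : real_num_tx_queues ))
echo "nr_io_queues=$nr_io_queues"
```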
> Performance Evaluation
> ----------------------
> Tests were conducted using fio over NVMe/TCP with the following parameters:
>      ioengine=io_uring
>      direct=1
>      bs=4k
>      numjobs=<#nic-queues>
>      iodepth=64
> System:
>      CPUs: 72
>      NIC: 100G mlx5
> 
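[Editorial note: an approximate fio invocation matching the parameters above might look as follows. The target device (/dev/nvme1n1), rw mode, and runtime are assumptions for illustration, not taken from the original report.]

```shell
# Sketch of the fio run described above; device path, rw mode, and
# runtime are placeholder assumptions.
fio --name=nvmetcp-test \
    --filename=/dev/nvme1n1 \
    --ioengine=io_uring \
    --direct=1 \
    --bs=4k \
    --rw=randread \
    --numjobs=32 \
    --iodepth=64 \
    --runtime=60 --time_based --group_reporting
```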
> Two configurations were evaluated.
> 
> Scenario 1: NIC queues < CPU count
> ----------------------------------
> - CPUs: 72
> - NIC queues: 32
> 
>                  Baseline        Patched        Patched + tuning
> randread        3141 MB/s       3228 MB/s      7509 MB/s
>                  (767k IOPS)     (788k IOPS)    (1833k IOPS)
> 
> randwrite       4510 MB/s       6172 MB/s      7518 MB/s
>                  (1101k IOPS)    (1507k IOPS)   (1836k IOPS)
> 
> randrw (read)   2156 MB/s       2560 MB/s      3932 MB/s
>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
> 
> randrw (write)  2155 MB/s       2560 MB/s      3932 MB/s
>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
> 
> Observation:
> When CPU count exceeds NIC queue count, the baseline configuration
> suffers from queue contention. The proposed changes provide modest
> improvements on their own, and when combined with queue-aware tuning
> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
> ~1.5x–2.5x throughput improvement.
> 
> Scenario 2: NIC queues == CPU count
> -----------------------------------
> 
> - CPUs: 72
> - NIC queues: 72
> 
>                  Baseline                Patched + tuning
> randread        4310 MB/s               7987 MB/s
>                  (1052k IOPS)            (1950k IOPS)
> 
> randwrite       7947 MB/s               7972 MB/s
>                  (1940k IOPS)            (1946k IOPS)
> 
> randrw (read)   3583 MB/s               4030 MB/s
>                  (875k IOPS)             (984k IOPS)
> 
> randrw (write)  3583 MB/s               4029 MB/s
>                  (875k IOPS)             (984k IOPS)
> 
> Observation:
> When NIC queues are already aligned with CPU count, the baseline performs
> well. The proposed changes maintain write performance (no regression) and
> still improve read and mixed workloads due to better flow-to-CPU locality.
> 
> Notes on tuning
> ---------------
> The "patched + tuning" configuration includes:
>      - aligning NVMe/TCP I/O workers with NIC queue count
>      - IRQ affinity configuration per RX queue
>      - ntuple-based flow steering
>      - CPU/queue affinity alignment
> 
> These tuning steps are enabled by the queue/flow information exposed through
> this patchset.
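[Editorial note: the tuning steps above might look roughly like the sketch below. The interface name (eth0), IRQ number, CPU masks, queue index, and source port are placeholders; in practice the qid-to-flow mapping would be read from the per-queue debugfs files this series adds.]

```shell
# Hypothetical tuning sketch; eth0, IRQ 120, masks, and port numbers are
# placeholders. Real values would come from the per-queue debugfs info
# (qid, CPU affinity, TCP 4-tuple) exported by this series.

# Pin the IRQ of RX queue 0 to CPU 0:
echo 1 > /proc/irq/120/smp_affinity

# Steer the NVMe/TCP flow for one I/O queue (dst port 4420, known src
# port) to RX queue 0 with an ntuple filter:
ethtool -N eth0 flow-type tcp4 dst-port 4420 src-port 51234 action 0

# XPS: let CPU 0 transmit on TX queue 0:
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
```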
> 
> Discussion
> ----------
> This RFC aims to start discussion around:
>    - Whether NVMe/TCP queue scaling should consider NIC queue topology
>    - How best to expose queue/flow information to userspace
>    - The role of userspace vs kernel in steering decisions
> 
> As usual, feedback/comment/suggestions are most welcome!
> 
> Reference to LSF/MM/BPF abstract: https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
> 

Weelll ... we have been debating this back and forth over recent years:
Should we check for hardware limitations for NVMe-over-Fabrics or not?

Initially it sounds appealing, and in fact I've worked on several 
attempts myself. But in the end there are far more things which need
to be considered:
-> For networking, the number of queues doesn't really tell us anything.
    Most NICs have distinct RX and TX queues, and the counts (of both!)
    vary quite dramatically between devices.
-> The number of queues does _not_ mean that all queues are used
    simultaneously. That is down to things like RSS and friends.
    I took a stab at configuring _that_, but it's patently horrible
    trying to second-guess the stack yourself.
-> It'll only work if you run directly on the NIC. As soon as there
    is anything in between (qemu? Tunnelling?) you are out of luck.

So yeah, we should have a discussion here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



Thread overview: 17+ messages
2026-04-20 11:49 [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues Nilay Shroff
2026-04-24 13:46   ` Christoph Hellwig
2026-04-27  7:37     ` Nilay Shroff
2026-04-24 22:10   ` Sagi Grimberg
2026-04-27 11:57     ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 2/4] nvme-tcp: add a diagnostic message when NIC queues are underutilized Nilay Shroff
2026-04-24 22:15   ` Sagi Grimberg
2026-04-27 12:14     ` Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 3/4] nvme: add debugfs helpers for NVMe drivers Nilay Shroff
2026-04-20 11:49 ` [RFC PATCH 4/4] nvme: expose queue information via debugfs Nilay Shroff
2026-04-24 22:23   ` Sagi Grimberg
2026-04-27 12:12     ` Nilay Shroff
2026-04-22 11:10 ` Hannes Reinecke [this message]
2026-04-24 22:30   ` [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export Sagi Grimberg
2026-04-27 12:11     ` Nilay Shroff
2026-04-27  6:13   ` Nilay Shroff
