* [LSF/MM/BPF TOPIC] Topology-Aware NVMe-TCP I/O Queue Scaling and Worker Efficiency
@ 2026-02-15 17:06 Nilay Shroff
From: Nilay Shroff @ 2026-02-15 17:06 UTC (permalink / raw)
To: lsf-pc
Cc: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Hannes Reinecke
The NVMe-TCP host driver currently provisions I/O queues primarily based on CPU
availability rather than the capabilities and topology of the underlying network
interface. On modern systems with many CPUs but fewer NIC hardware queues, this
can lead to multiple NVMe-TCP I/O queues contending for the same transmit/receive
queue, increasing lock contention, cacheline bouncing, and tail latency.
This session explores making NVMe-TCP queue provisioning and execution more
network-aware. We propose aligning the number of NVMe-TCP I/O queues with the
number of NIC hardware TX/RX queues, and binding each I/O queue to CPUs that are
already affine to the corresponding NIC interrupt vectors. This aims to improve
cache locality and reduce cross-CPU wakeups in high-IOPS deployments.
We also examine the behavior of the NVMe-TCP I/O worker thread, which currently
operates under a fixed time budget (~1ms). In some workloads, the worker may
relinquish the CPU even when additional transmit/receive work is immediately
available. We propose exposing observability data such as per-worker I/O processing
counts, relinquish events, and CPU placement to better understand and potentially
tune this budget.
We plan to implement a proof-of-concept for these ideas ahead of the conference
and submit an RFC; if feasible, we will demonstrate the impact live using fio
workloads on a real system. This session seeks feedback on whether the NVMe-TCP host should
consider NIC topology when provisioning I/O queues, how tightly queue placement
should follow interrupt affinity, and whether additional observability or tunable
budgets for I/O workers would be useful. We also discuss potential interfaces
between networking and storage subsystems to support topology-aware queue scaling.
Thanks,
--Nilay
* Re: [LSF/MM/BPF TOPIC] Topology-Aware NVMe-TCP I/O Queue Scaling and Worker Efficiency
From: Chaitanya Kulkarni @ 2026-02-16 0:35 UTC (permalink / raw)
To: Nilay Shroff, lsf-pc
Cc: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Hannes Reinecke
On 2/15/26 09:06, Nilay Shroff wrote:
> The NVMe-TCP host driver currently provisions I/O queues primarily based on CPU
> availability rather than the capabilities and topology of the underlying network
> interface. On modern systems with many CPUs but fewer NIC hardware queues, this
> can lead to multiple NVMe-TCP I/O queues contending for the same transmit/receive
> queue, increasing lock contention, cacheline bouncing, and tail latency.
Can you share any performance work that you have done prior to the
LSF session?
-ck
* Re: [LSF/MM/BPF TOPIC] Topology-Aware NVMe-TCP I/O Queue Scaling and Worker Efficiency
From: Nilay Shroff @ 2026-02-16 6:49 UTC (permalink / raw)
To: Chaitanya Kulkarni, lsf-pc
Cc: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Hannes Reinecke
On 2/16/26 6:05 AM, Chaitanya Kulkarni wrote:
> On 2/15/26 09:06, Nilay Shroff wrote:
>
>> The NVMe-TCP host driver currently provisions I/O queues primarily based on CPU
>> availability rather than the capabilities and topology of the underlying network
>> interface. On modern systems with many CPUs but fewer NIC hardware queues, this
>> can lead to multiple NVMe-TCP I/O queues contending for the same transmit/receive
>> queue, increasing lock contention, cacheline bouncing, and tail latency.
>
> Can you share any performance work that you have done prior to the
> LSF session?
>
Yes — I’ve started prototyping the queue-scaling and CPU/IRQ-affinity changes and
have some early performance results from a local setup. These are still preliminary
and I’m continuing to expand testing, but the initial data looks promising enough
to motivate the discussion.
Test setup (current prototype):
- 32-CPU system
- NIC exposing 2 TX/RX queues
- fio (io_uring, direct=1, 32 jobs, iodepth=64)
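For reference, that fio configuration corresponds roughly to a job file like the one below; the device path and the 4k block size are illustrative assumptions, and the randwrite/randrw variants change only the rw= line.

```ini
; Sketch of the workload described above; /dev/nvme1n1 stands in for
; the NVMe-TCP namespace under test.
[global]
ioengine=io_uring
direct=1
numjobs=32
iodepth=64
bs=4k
runtime=60
time_based=1
group_reporting=1
filename=/dev/nvme1n1

[randread]
rw=randread
```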
Throughput results:
             Without patch    With patch
Randread:    263 MB/s         986 MB/s
Randwrite:   849 MB/s        1047 MB/s
Randrw:      R: 142 MB/s     R: 419 MB/s
             W: 142 MB/s     W: 419 MB/s
The largest gains appear in read-heavy and mixed workloads where multiple NVMe-TCP
queues were previously contending for a small number of NIC hardware queues. Aligning
I/O queue count with NIC queue count and improving CPU/IRQ locality significantly
reduced contention in this configuration.
I’m continuing to refine the prototype and expand testing across different queue
counts and NUMA layouts. My goal is to have more comprehensive data available ahead
of the LSFMM+BPF session, and if feasible I plan to demonstrate the impact live using
fio workloads.
Happy to share updated numbers as the work progresses.
Thanks,
--Nilay