Subject: Re: [RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
From: Nilay Shroff
Date: Mon, 27 Apr 2026 11:43:29 +0530
To: Hannes Reinecke, linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, chaitanyak@nvidia.com, gjoyce@linux.ibm.com
In-Reply-To: <649036dd-b99f-4f60-93f4-16979e11f520@suse.de>
References: <20260420115716.3071293-1-nilay@linux.ibm.com> <649036dd-b99f-4f60-93f4-16979e11f520@suse.de>
On 4/22/26 4:40 PM, Hannes Reinecke wrote:
> On 4/20/26 13:49, Nilay Shroff wrote:
>> Hi,
>>
>> The NVMe/TCP host driver currently provisions I/O queues primarily based
>> on CPU availability rather than the capabilities and topology of the
>> underlying network interface.
>>
>> On modern systems with many CPUs but fewer NIC hardware queues, this can
>> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
>> resulting in increased lock contention, cacheline bouncing, and degraded
>> throughput.
>>
>> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
>> with NIC queue resources, and to expose queue/flow information to enable
>> more effective system-level tuning.
>>
>> Key ideas
>> ---------
>>
>> 1. Scale NVMe/TCP I/O queues based on NIC queue count
>>     Instead of relying solely on CPU count, limit the number of I/O workers
>>     to:
>>         min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>>
>> 2. Improve CPU locality
>>     Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
>>     to reduce cross-CPU traffic and improve cache locality.
>>
>> 3.
>> Expose queue and flow information via debugfs
>>     Export per-I/O queue information including:
>>         - queue id (qid)
>>         - CPU affinity
>>         - TCP flow (src/dst IP and ports)
>>
>>     This enables userspace tools to configure:
>>         - IRQ affinity
>>         - RPS/XPS
>>         - ntuple steering
>>         - or any other scaling as deemed feasible
>>
>> 4. Provide infrastructure for extensible debugfs support in NVMe
>>
>> Together, these changes allow better alignment of:
>>      flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
>>
>> Performance Evaluation
>> ----------------------
>> Tests were conducted using fio over NVMe/TCP with the following parameters:
>>      ioengine=io_uring
>>      direct=1
>>      bs=4k
>>      numjobs=<#nic-queues>
>>      iodepth=64
>> System:
>>      CPUs: 72
>>      NIC: 100G mlx5
>>
>> Two configurations were evaluated.
>>
>> Scenario 1: NIC queues < CPU count
>> ----------------------------------
>> - CPUs: 72
>> - NIC queues: 32
>>
>>                  Baseline        Patched        Patched + tuning
>> randread        3141 MB/s       3228 MB/s      7509 MB/s
>>                  (767k IOPS)     (788k IOPS)    (1833k IOPS)
>>
>> randwrite       4510 MB/s       6172 MB/s      7518 MB/s
>>                  (1101k IOPS)    (1507k IOPS)   (1836k IOPS)
>>
>> randrw (read)   2156 MB/s       2560 MB/s      3932 MB/s
>>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
>>
>> randrw (write)  2155 MB/s       2560 MB/s      3932 MB/s
>>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
>>
>> Observation:
>> When CPU count exceeds NIC queue count, the baseline configuration
>> suffers from queue contention. The proposed changes provide modest
>> improvements on their own, and when combined with queue-aware tuning
>> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
>> ~1.5x–2.5x throughput improvement.
>>
>> Scenario 2: NIC queues == CPU count
>> -----------------------------------
>>
>> - CPUs: 72
>> - NIC queues: 72
>>
>>                  Baseline                Patched + tuning
>> randread        4310 MB/s               7987 MB/s
>>                  (1052k IOPS)            (1950k IOPS)
>>
>> randwrite       7947 MB/s               7972 MB/s
>>                  (1940k IOPS)            (1946k IOPS)
>>
>> randrw (read)   3583 MB/s               4030 MB/s
>>                  (875k IOPS)             (984k IOPS)
>>
>> randrw (write)  3583 MB/s               4029 MB/s
>>                  (875k IOPS)             (984k IOPS)
>>
>> Observation:
>> When NIC queues are already aligned with CPU count, the baseline performs
>> well. The proposed changes maintain write performance (no regression) and
>> still improve read and mixed workloads due to better flow-to-CPU locality.
>>
>> Notes on tuning
>> ---------------
>> The "patched + tuning" configuration includes:
>>      - aligning NVMe/TCP I/O workers with NIC queue count
>>      - IRQ affinity configuration per RX queue
>>      - ntuple-based flow steering
>>      - CPU/queue affinity alignment
>>
>> These tuning steps are enabled by the queue/flow information exposed through
>> this patchset.
>>
>> Discussion
>> ----------
>> This RFC aims to start discussion around:
>>    - Whether NVMe/TCP queue scaling should consider NIC queue topology
>>    - How best to expose queue/flow information to userspace
>>    - The role of userspace vs. kernel in steering decisions
>>
>> As usual, feedback/comments/suggestions are most welcome!
>>
>> Reference to LSF/MM/BPF abstract:
>> https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>>
>
> Weelll ... we have been debating this back and forth over recent years:
> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>
> Initially it sounds appealing, and in fact I've worked on several
> attempts myself.
> But in the end there are far more things which need
> to be considered:
> -> For networking, number of queues is not really telling us anything.
>    Most NICs have distinct RX and TX queues, and the number (of both!)
>    varies quite dramatically.

The proposed I/O queue scaling follows a conservative approach based on
the currently configured NIC queues:

    if (NIC exposes combined TX/RX queues):
        nr_hw_queues = min(num_online_cpus, combined_tx_rx)
    else:
        real_hw_queues = min(real_num_tx_queues, real_num_rx_queues)
        nr_hw_queues = min(num_online_cpus, real_hw_queues)

The intent here is not to model full NIC behavior, but to avoid obvious
over-subscription when the number of NVMe/TCP I/O workers significantly
exceeds the available NIC queues.

Also, this is not enabled by default. It is gated behind the
"match-hw-queues" fabric option, so existing setups are unaffected
unless it is explicitly enabled.

> -> The number of queues does _not_ indicate that all queues are used
>    simultaneously. That is down to things like RSS and friends.
>    I gave a stab at configuring _that_ but it's patently horrible
>    trying to out-guess things for yourself.

Agreed that queue count alone does not imply effective parallelism, as
traffic distribution depends on RSS/RPS/XPS. This patchset does not
attempt to infer or control how queues are used. Instead, it treats the
currently configured number of TX/RX queues as a conservative upper
bound for I/O worker scaling.

In addition, the patchset exposes queue, CPU, and flow information via
debugfs. This allows userspace to configure steering policies (IRQ
affinity, ntuple, RPS/XPS) based on actual system behavior. If the NIC
supports n-tuple filtering, it becomes possible to steer each I/O flow
to a unique queue.
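For illustration only, the scaling heuristic above can be sketched in a
few lines of Python. The function name and the queue counts below are
made-up stand-ins for the values the kernel would read from the netdev
(a combined count of 0 models a NIC that exposes distinct TX/RX queue
counts instead):

```python
def nr_io_queues(num_online_cpus, combined_queues,
                 real_num_tx_queues, real_num_rx_queues):
    """Conservative cap on NVMe/TCP I/O queues, per the heuristic above.

    combined_queues > 0 models a NIC exposing combined TX/RX queues;
    otherwise the smaller of the distinct TX and RX queue counts is
    taken as the effective hardware queue count.
    """
    if combined_queues > 0:
        return min(num_online_cpus, combined_queues)
    real_hw_queues = min(real_num_tx_queues, real_num_rx_queues)
    return min(num_online_cpus, real_hw_queues)

# Scenario 1 from the cover letter: 72 CPUs, 32 combined NIC queues.
print(nr_io_queues(72, 32, 0, 0))   # capped at 32 workers instead of 72

# Distinct TX/RX counts: the smaller side bounds the worker count.
print(nr_io_queues(72, 0, 64, 32))
```

The point of the sketch is just that the cap never exceeds either the
online CPU count or what the NIC actually has configured, which is the
"avoid obvious over-subscription" behavior described above.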
In fact, exporting each nvme-tcp I/O flow and its CPU information via
debugfs should be very useful for configuring n-tuple filters: each flow
can then be steered to a unique queue, which helps align TX, RX, IRQ,
and the TCP worker on the same CPU.

> -> It'll only work if you run directly on the NIC. As soon as there
>    is anything in between (qemu? Tunnelling?) you are out of luck.
>

Agreed that in environments where the NIC topology does not reflect the
effective data path (e.g., certain QEMU or tunneling configurations),
this heuristic may not be meaningful. In such cases, users can simply
avoid enabling "match-hw-queues" and retain the existing behavior.

That said, there are also QEMU configurations (e.g., vhost-net with
multiqueue, or VFIO passthrough) and SR-IOV setups where the NIC queue
topology is still relevant, and this approach can provide benefit.

Overall, the goal here is:
- to avoid clear over-provisioning of I/O workers, and
- to expose sufficient information for userspace-driven tuning
  (using RPS/XPS/n-tuple etc.)

> So yeah, we should have a discussion here.
>

Sure, looking forward to further discussion.

Thanks,
--Nilay