From: Nilay Shroff <nilay@linux.ibm.com>
To: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org, kbusch@kernel.org, hare@suse.de,
	sagi@grimberg.me, chaitanyak@nvidia.com, gjoyce@linux.ibm.com
Subject: Re: [RFC PATCH 1/4] nvme-tcp: optionally limit I/O queue count based on NIC queues
Date: Mon, 27 Apr 2026 13:07:55 +0530
Message-ID: <c5e8020f-1d2e-4609-a5fe-ff84c5653e7a@linux.ibm.com>
In-Reply-To: <20260424134620.GA17351@lst.de>

On 4/24/26 7:16 PM, Christoph Hellwig wrote:
>> In such configurations, limiting the number of NVMe-TCP I/O queues to
>> the number of NIC hardware queues can improve performance by reducing
>> contention and improving locality. Aligning NVMe-TCP worker threads with
>> NIC queue topology may also help reduce tail latency.
> 
> Yes, this sounds useful.
> 
>>
>> Add a new transport option "match_hw_queues" to allow users to
>> optionally limit the number of NVMe-TCP I/O queues to the number of NIC
>> TX/RX queues. When enabled, the number of I/O queues is set to:
>>
>>      min(num_online_cpus, num_nic_queues)
>>
>> This behavior is opt-in and does not change existing defaults.
> 
> Any good reason for that?  For PCI and RDMA we try to do the right
> thing by default.
> 
The only reason was that in certain complex topologies (for instance,
under QEMU) it may not be possible to get the real number of tx/rx
queues. In such situations I thought we're better off leaving this
feature opt-in, hence the option. But yes, I'd also love to remove the
option and find a better way to detect the cases where we can't get the
real number of tx/rx queues, so that we automatically fall back to
creating as many I/O queues as there are online CPUs. I'll explore this
and see if that's possible.

>> +static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
>> +{
>> +	struct net_device *dev = NULL;
>> +
>> +	if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
>> +		dev = dev_get_by_name(&init_net, ctrl->opts->host_iface);
> 
> Return early here instead of the giant indentation for the new options.
> 
Yes okay, makes sense!
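
Something along these lines, I think (nvme_tcp_route_netdev_v4/_v6 below are
just placeholder names for the per-family helpers you mention next):

static struct net_device *nvme_tcp_get_netdev(struct nvme_ctrl *ctrl)
{
	struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);

	if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
		return dev_get_by_name(&init_net, ctrl->opts->host_iface);

	/* no explicit interface: derive the netdev from the target address */
	if (tctrl->addr.ss_family == AF_INET)
		return nvme_tcp_route_netdev_v4(tctrl);
	if (tctrl->addr.ss_family == AF_INET6)
		return nvme_tcp_route_netdev_v6(tctrl);

	return NULL;
}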

>> +	else {
>> +		struct nvme_tcp_ctrl *tctrl = to_tcp_ctrl(ctrl);
>> +
>> +		if (tctrl->addr.ss_family == AF_INET) {
> 
> And then split each address family into a helper.  And to me those
> look like something that should be in net/.
> 
Hmm okay, I think if we want to add these helpers under net/ then they should
go in include/net/route.h and include/net/ip6_route.h for IPv4 and IPv6 respectively.
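
If we go down that path, the helpers could roughly look like the below, just
resolving the output device from the routing tables. This is only a sketch:
the names are placeholders, it only covers &init_net (matching the
dev_get_by_name() call above), and whether these stay driver-local or become
generic helpers under include/net/ is exactly the open question.

static struct net_device *nvme_tcp_route_netdev_v4(struct nvme_tcp_ctrl *tctrl)
{
	struct sockaddr_in *sin = (struct sockaddr_in *)&tctrl->addr;
	struct flowi4 fl4 = { .daddr = sin->sin_addr.s_addr };
	struct net_device *dev;
	struct rtable *rt;

	rt = ip_route_output_key(&init_net, &fl4);
	if (IS_ERR(rt))
		return NULL;

	/* take a reference on the output device before dropping the route */
	dev = rt->dst.dev;
	dev_hold(dev);
	ip_rt_put(rt);
	return dev;
}

static struct net_device *nvme_tcp_route_netdev_v6(struct nvme_tcp_ctrl *tctrl)
{
	struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)&tctrl->addr;
	struct flowi6 fl6 = { .daddr = sin6->sin6_addr };
	struct net_device *dev;
	struct dst_entry *dst;

	dst = ip6_route_output(&init_net, NULL, &fl6);
	if (dst->error) {
		dst_release(dst);
		return NULL;
	}

	dev = dst->dev;
	dev_hold(dev);
	dst_release(dst);
	return dev;
}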

>> +
>> +/*
>> + * Returns number of active NIC queues (min of TX/RX), or 0 if device cannot
>> + * be determined.
>> + */
>> +static int nvme_tcp_get_netdev_current_queue_count(struct nvme_ctrl *ctrl)
> 
> drop _current to make this a bit more readable?
> 
Sure.
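
With that rename, the helper would roughly become the below, assuming we keep
taking the min of the device's currently active TX/RX queue counts as the
comment above describes:

static int nvme_tcp_get_netdev_queue_count(struct nvme_ctrl *ctrl)
{
	struct net_device *dev = nvme_tcp_get_netdev(ctrl);
	int nr_queues;

	if (!dev)
		return 0;

	/* use the currently active queue counts, not the device maximums */
	nr_queues = min(dev->real_num_tx_queues, dev->real_num_rx_queues);
	dev_put(dev);
	return nr_queues;
}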

>> @@ -2144,6 +2243,24 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>   	unsigned int nr_io_queues;
>>   	int ret;
>>   
>> +	if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
>> +			(ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
> 
> The more readable formatting would be:
> 
> 	if (!(ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) &&
> 	    (ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES)) {
> 
Yep, I will change this.

>> +		int nr_hw_queues;
>> +
>> +		nr_hw_queues = nvme_tcp_get_netdev_current_queue_count(ctrl);
>> +		if (nr_hw_queues <= 0)
>> +			goto init_queue;
>> +
>> +		ctrl->opts->nr_io_queues = min(nr_hw_queues, num_online_cpus());
>> +
>> +		if (ctrl->opts->nr_io_queues < num_online_cpus())
>> +			dev_info(ctrl->device,
>> +				"limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
>> +				ctrl->opts->nr_io_queues, nr_hw_queues,
>> +				num_online_cpus());
>> +	}
> 
> And splitting this into a helper would help keeping the flow sane.
> 
Alright, I'll split this out into a separate helper.
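
Something like the below, keeping the logic from the hunk above (the helper
name is tentative, and I've used min_t() to avoid the signedness mismatch):

static void nvme_tcp_limit_io_queues_to_netdev(struct nvme_ctrl *ctrl)
{
	int nr_hw_queues;

	if ((ctrl->opts->mask & NVMF_OPT_NR_IO_QUEUES) ||
	    !(ctrl->opts->mask & NVMF_OPT_MATCH_HW_QUEUES))
		return;

	nr_hw_queues = nvme_tcp_get_netdev_queue_count(ctrl);
	if (nr_hw_queues <= 0)
		return;

	ctrl->opts->nr_io_queues = min_t(unsigned int, nr_hw_queues,
					 num_online_cpus());
	if (ctrl->opts->nr_io_queues < num_online_cpus())
		dev_info(ctrl->device,
			 "limiting I/O queues to %u (NIC queues %d, CPUs %u)\n",
			 ctrl->opts->nr_io_queues, nr_hw_queues,
			 num_online_cpus());
}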

Thanks,
--Nilay
