Re: [PATCH v3 3/4] net: mana: Allow irq_setup() to skip cpus for affinity

From: Yury Norov <yury.norov@gmail.com>
To: Michael Kelley <mhklinux@outlook.com>
Cc: "Shradha Gupta" <shradhagupta@linux.microsoft.com>,
	"Dexuan Cui" <decui@microsoft.com>,
	"Wei Liu" <wei.liu@kernel.org>,
	"Haiyang Zhang" <haiyangz@microsoft.com>,
	"K. Y. Srinivasan" <kys@microsoft.com>,
	"Andrew Lunn" <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Konstantin Taranov" <kotaranov@microsoft.com>,
	"Simon Horman" <horms@kernel.org>,
	"Leon Romanovsky" <leon@kernel.org>,
	"Maxim Levitsky" <mlevitsk@redhat.com>,
	"Erni Sri Satya Vennela" <ernis@linux.microsoft.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Nipun Gupta" <nipun.gupta@amd.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Jonathan Cameron" <Jonathan.Cameron@huwei.com>,
	"Anna-Maria Behnsen" <anna-maria@linutronix.de>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Long Li" <longli@microsoft.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"Rob Herring" <robh@kernel.org>,
	"Manivannan Sadhasivam" <manivannan.sadhasivam@linaro.org>,
	"Krzysztof Wilczyński" <kw@linux.com>,
	"Lorenzo Pieralisi" <lpieralisi@kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"Paul Rosswurm" <paulros@microsoft.com>,
	"Shradha Gupta" <shradhagupta@microsoft.com>
Subject: Re: [PATCH v3 3/4] net: mana: Allow irq_setup() to skip cpus for affinity
Date: Wed, 14 May 2025 12:55:00 -0400
Message-ID: <aCTK5PjV1n1EYOpi@yury>
In-Reply-To: <SN6PR02MB41577E2FAA79E2803C3384B0D491A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 14, 2025 at 04:53:34AM +0000, Michael Kelley wrote:
> > -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > +		     bool skip_first_cpu)
> >  {
> >  	const struct cpumask *next, *prev = cpu_none_mask;
> >  	cpumask_var_t cpus __free(free_cpumask_var);
> > @@ -1303,9 +1304,20 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> >  		while (weight > 0) {
> >  			cpumask_andnot(cpus, next, prev);
> >  			for_each_cpu(cpu, cpus) {
> > +				/*
> > +				 * if the CPU sibling set is to be skipped we
> > +				 * just move on to the next CPUs without len--
> > +				 */
> > +				if (unlikely(skip_first_cpu)) {
> > +					skip_first_cpu = false;
> > +					goto next_cpumask;
> > +				}
> > +
> >  				if (len-- == 0)
> >  					goto done;
> > +
> >  				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > +next_cpumask:
> >  				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> >  				--weight;
> >  			}
> 
> With a little bit of reordering of the code, you could avoid the need for the "next_cpumask"
> label and goto statement.  "continue" is usually cleaner than a "goto". Here's what I'm thinking:
> 
> 		for_each_cpu(cpu, cpus) {
> 			cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> 			--weight;

cpumask_andnot() is O(N), and before your change it ran only after the
'len-- == 0' check had passed, so we skipped it on the very last step.
Your version always runs it. I don't know how much that matters for
real workloads. Maybe Shradha can measure it...
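
To make the cost difference concrete, here is a minimal user-space sketch
that counts the mask-clearing steps under both orderings. It is not the
driver code: a plain unsigned int stands in for a cpumask, the topology is
assumed to be 8 CPUs with two siblings per core, skip_first_cpu is left out,
and the names spread_one_hop(), sibling_mask() and struct spread_stats are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* assumed toy topology: 8 CPUs, 2 siblings per core */

struct spread_stats {
	unsigned int assigned;		/* IRQs that received an affinity */
	unsigned int andnot_calls;	/* stand-in for cpumask_andnot() calls */
};

/* Mask of the core that 'cpu' belongs to (CPUs 2n and 2n+1 are siblings). */
static unsigned int sibling_mask(int cpu)
{
	return 0x3u << (cpu & ~1);
}

static int mask_weight(unsigned int mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/*
 * Walk a single hop. With andnot_first == false the siblings are cleared
 * after the len check (the ordering in the patch); with true they are
 * cleared before it (the "continue" based ordering).
 */
static struct spread_stats spread_one_hop(unsigned int hop_cpus,
					  unsigned int len, bool andnot_first)
{
	struct spread_stats st = { 0, 0 };
	int weight = mask_weight(hop_cpus);

	assert(hop_cpus < (1u << NR_CPUS));

	while (weight > 0) {
		unsigned int cpus = hop_cpus;

		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			if (!(cpus & (1u << cpu)))
				continue;

			if (andnot_first) {
				cpus &= ~sibling_mask(cpu);
				st.andnot_calls++;
				--weight;
			}

			if (st.assigned == len)
				return st;	/* out of IRQs: "goto done" */
			st.assigned++;

			if (!andnot_first) {
				cpus &= ~sibling_mask(cpu);
				st.andnot_calls++;
				--weight;
			}
		}
	}
	return st;
}
```

Under these assumptions, spread_one_hop(0xff, 3, false) performs 3 clearing
steps and spread_one_hop(0xff, 3, true) performs 4, so the reordered loop
pays for exactly one extra step, and only when the IRQs run out before the
hop is exhausted. With len == 8 both orderings perform 8.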

> 
> 			if (unlikely(skip_first_cpu)) {
> 				skip_first_cpu = false;
> 				continue;
> 			}
> 
> 			if (len-- == 0)
> 				goto done;
> 
> 			irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> 		}
> 
> I wish there were some comments in irq_setup() explaining the overall intention of
> the algorithm. I can see how the goal is to first assign CPUs that are local to the current
> NUMA node, and then expand outward to CPUs that are further away. And you want
> to *not* assign both siblings in a hyper-threaded core.

I wrote this function, so let me step in.

The intention is described in the corresponding commit message:

  Souradeep investigated that the driver performs faster if IRQs are
  spread on CPUs with the following heuristics:
  
  1. No more than one IRQ per CPU, if possible;
  2. NUMA locality is the second priority;
  3. Sibling dislocality is the last priority.
  
  Let's consider this topology:
  
  Node            0               1
  Core        0       1       2       3
  CPU       0   1   2   3   4   5   6   7
  
  The most performant IRQ distribution based on the above topology
  and heuristics may look like this:
  
  IRQ     Nodes   Cores   CPUs
  0       0       0       0-1
  1       0       1       2-3
  2       0       0       0-1
  3       0       1       2-3
  4       1       2       4-5
  5       1       3       6-7
  6       1       2       4-5
  7       1       3       6-7

> But I can't figure out what
> "weight" is trying to accomplish. Maybe this was discussed when the code first
> went in, but I can't remember now. :-(

The weight implements the heuristic discovered by Souradeep: NUMA
locality is preferred over sibling dislocality.

The outer for_each() loop resets the weight to the actual number of
CPUs in the hop. Then the inner for_each() loop decrements it once per
sibling group (core) while assigning the first IRQ to each group.

Now, because NUMA locality is more important, we should walk the
same set of siblings again and assign the 2nd IRQ; that is what the
middle while() loop does. We keep doing this until the number of IRQs
assigned on this hop equals the number of CPUs in the hop
(weight == 0). Then we switch to the next hop and do the same
thing.
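
The three nested loops described above can be sketched in user space as
follows. This is an illustration under stated assumptions, not the driver
code: the topology is the 2-node, 4-core, 8-CPU example from the commit
message as seen from node 0, a plain unsigned int stands in for a cpumask,
and hop_mask[], sibling_mask() and irq_spread() are hypothetical names.

```c
#include <assert.h>

#define NR_CPUS 8
#define NR_HOPS 2

/*
 * Assumed hop masks seen from node 0: hop 0 is the local node (CPUs 0-3),
 * hop 1 adds the remote node (CPUs 0-7). Each hop includes the previous one.
 */
static const unsigned int hop_mask[NR_HOPS] = { 0x0f, 0xff };

/* Mask of the core that 'cpu' belongs to (CPUs 2n and 2n+1 are siblings). */
static unsigned int sibling_mask(int cpu)
{
	return 0x3u << (cpu & ~1);
}

static int mask_weight(unsigned int mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/*
 * Fill affinity[i] with the CPU mask chosen for IRQ i, for at most 'len'
 * IRQs. Returns the number of IRQs that received an affinity.
 */
static unsigned int irq_spread(unsigned int *affinity, unsigned int len)
{
	unsigned int prev = 0, assigned = 0;

	assert(affinity);

	/* Outer loop: one iteration per NUMA hop, nearest first. */
	for (int hop = 0; hop < NR_HOPS; hop++) {
		unsigned int next = hop_mask[hop];
		/* weight = number of CPUs that are new in this hop */
		int weight = mask_weight(next & ~prev);

		/* Middle loop: revisit the same cores until weight is used up. */
		while (weight > 0) {
			unsigned int cpus = next & ~prev;

			/* Inner loop: one IRQ per core, then drop its siblings. */
			for (int cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpus & (1u << cpu)))
					continue;
				if (assigned == len)
					return assigned;
				affinity[assigned++] = sibling_mask(cpu);
				cpus &= ~sibling_mask(cpu);
				--weight;
			}
		}
		prev = next;
	}
	return assigned;
}
```

Calling irq_spread() with len == 8 on this toy topology yields the masks
0x03, 0x0c, 0x03, 0x0c for the first four IRQs (CPUs 0-1, 2-3, 0-1, 2-3 on
the local node) and 0x30, 0xc0, 0x30, 0xc0 for the next four, which matches
the distribution in the commit message table.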

Hope that helps.

Thanks,
Yury

Thread overview (20+ messages):
2025-05-09 10:12 [PATCH v3 0/4] Allow dyn MSI-X vector allocation of MANA Shradha Gupta
2025-05-09 10:13 ` [PATCH v3 1/4] PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations Shradha Gupta
2025-05-09 10:13 ` [PATCH v3 2/4] PCI: hv: Allow dynamic MSI-X vector allocation Shradha Gupta
2025-05-12  7:00   ` Manivannan Sadhasivam
2025-05-12  7:38     ` Shradha Gupta
2025-05-09 10:13 ` [PATCH v3 3/4] net: mana: Allow irq_setup() to skip cpus for affinity Shradha Gupta
2025-05-09 15:49   ` Yury Norov
2025-05-12  5:49     ` Shradha Gupta
2025-05-14  4:53   ` Michael Kelley
2025-05-14 16:55     ` Yury Norov [this message]
2025-05-14 17:26       ` Michael Kelley
2025-05-14 17:58         ` Yury Norov
2025-05-15  4:51           ` Shradha Gupta
2025-05-15  4:49         ` Shradha Gupta
2025-05-09 10:13 ` [PATCH v3 4/4] net: mana: Allocate MSI-X vectors dynamically Shradha Gupta
2025-05-14  5:04   ` Michael Kelley
2025-05-14 17:07     ` Yury Norov
2025-05-15  5:10       ` Shradha Gupta
2025-05-21 16:27   ` Michael Kelley
2025-05-22  9:17     ` Shradha Gupta
