public inbox for linux-kernel@vger.kernel.org
From: Dave Hansen <dave.hansen@intel.com>
To: Shaokun Zhang <zhangshaokun@hisilicon.com>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Cc: Yuqi Jin <jinyuqi@huawei.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	Juergen Gross <jgross@suse.com>,
	Paul Burton <paul.burton@mips.com>,
	Michal Hocko <mhocko@suse.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>
Subject: Re: [PATCH v7] lib: optimize cpumask_local_spread()
Date: Mon, 30 Nov 2020 09:08:38 -0800	[thread overview]
Message-ID: <b3122c82-e0fc-5bb8-82ec-43ae785f381f@intel.com> (raw)
In-Reply-To: <a3b8ab12-604b-1efe-f091-de782c3c8ed5@hisilicon.com>

>>>  {
>>> -	int cpu, hk_flags;
>>> +	static DEFINE_SPINLOCK(spread_lock);
>>> +	static bool used[MAX_NUMNODES];
>>
>> I thought I mentioned this last time.  How large is this array?  How
>> large would it be if it were a nodemask_t?  Would this be less code if
> 
> Apologies that I forgot to do it.
> 
>> you just dynamically allocated and freed the node mask instead of having
>> a spinlock and a memset?
> 
> Ok, but I think the spinlock is still needed; am I missing something?

There was no spinlock there before your patch.  You just need it to
protect the structures you declared static.  If you didn't have static
structures, you wouldn't need a lock.
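
To sketch the shape of what I mean (user-space C standing in for the
kernel nodemask/GFP machinery; the names here are illustrative, not the
real patch): allocating the scratch mask per call removes the shared
static state, and with it the need for any lock.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for nr_node_ids. */
#define NR_NODE_IDS 64

/*
 * Per-call scratch state: because 'used' lives on this call's heap
 * (or stack, if small enough), concurrent callers cannot stomp on
 * each other and no spinlock is required.
 */
int spread_sketch(int node)
{
	bool *used = calloc(NR_NODE_IDS, sizeof(*used));

	if (!used)
		return -1;	/* caller falls back however it likes */

	/* ... search nearest nodes, marking used[id] = true as we go ... */
	used[node] = true;

	free(used);
	return 0;
}
```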

>>> +	unsigned long flags;
>>> +	int cpu, hk_flags, j, id;
>>>  	const struct cpumask *mask;
>>>  
>>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>  				return cpu;
>>>  		}
>>>  	} else {
>>> -		/* NUMA first. */
>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>> -			if (i-- == 0)
>>> -				return cpu;
>>> +		spin_lock_irqsave(&spread_lock, flags);
>>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>>> +		/* select node according to the distance from local node */
>>> +		for (j = 0; j < nr_node_ids; j++) {
>>> +			id = find_nearest_node(node, used);
>>> +			if (id < 0)
>>> +				break;
>>
>> There's presumably an outer loop in a driver which is trying to bind a
>> bunch of interrupts to a bunch of CPUs.  We know there are on the order
>> of dozens of these interrupts.
>>
>> 	for_each_interrupt() // in the driver
>> 		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
>> 			// find_nearest_node():
>> 			for (i = 0; i < nr_node_ids; i++) {
>> 			for (i = 0; i < nr_node_ids; i++) {
>>
>> Does this worry anybody else?  I thought our upper limit on the number
>> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3), where the
>> worst case is hundreds of millions of loops?
> 
> If a real system has 1024 NUMA nodes, it is more worthwhile to find the
> nearest node rather than choose a random one.  And this is only called
> during I/O device initialization.  Comments are also given on this
> interface.

This doesn't really make me feel better.  An end user booting this on a
big system with a bunch of cards could see a minutes-long delay.  I can
also see funky stuff happening like if we have a ton of NUMA nodes and
few CPUs.

>> I don't want to prematurely optimize this, but that seems like something
>> that might just fall over on bigger systems.
>>
>> This also seems really wasteful if we have a bunch of memory-only nodes.
>>  Each of those will be found via find_nearest_node(), but then this loop:
> 
> Got it, all of this effort goes into choosing the nearest node for
> performance.  If we don't do it, I think someone will end up debugging
> this in the future.

If we're going to kick the can down the road for some poor sod to debug,
can we at least help them out with a warning?

Maybe we WARN_ONCE() after we fall back for more than 2 or 3 nodes.

But, I still don't think you've addressed my main concern: This is
horrifically inefficient searching for CPUs inside nodes that are known
to have no CPUs.
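
Something like this sketch is what I have in mind (user-space C; the
node_has_cpu[] array is a hypothetical stand-in for the kernel's
node_state(id, N_CPU) test): pre-mark memory-only nodes as 'used' so
the nearest-node search never spends a pass on a node that cannot
yield a CPU.

```c
#include <stdbool.h>

#define NR_NODE_IDS 8	/* illustrative */

/* Hypothetical stand-in for node_state(id, N_CPU). */
static const bool node_has_cpu[NR_NODE_IDS] = {
	true, false, true, false, false, true, false, false,
};

/*
 * Seed 'used' with every memory-only node already marked, and
 * return how many nodes the search actually has to visit.
 */
static int seed_used_mask(bool used[NR_NODE_IDS])
{
	int n = 0;
	int id;

	for (id = 0; id < NR_NODE_IDS; id++) {
		used[id] = !node_has_cpu[id];
		if (!used[id])
			n++;
	}
	return n;
}
```

With the mask seeded this way, the O(nr_node_ids) passes only cover
nodes that can actually satisfy the caller.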

>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>> +				if (i-- == 0) {
>>> +					spin_unlock_irqrestore(&spread_lock,
>>> +							       flags);
>>> +					return cpu;
>>> +				}
>>> +			used[id] = true;
>>>  		}
>>
>> Will just exit immediately because cpumask_of_node() is empty.
> 
> Yes, and this node's used[id] becomes true.
> 
>>
>> 'used', for instance, should start by setting 'true' for all nodes which
>> are not in N_CPUS.
> 
> No, because I used 'nr_node_ids', which covers all possible node ids to
> check.

I'm saying that it's wasteful to loop over and search in all the nodes.


Thread overview: 5+ messages
2020-11-18  2:54 [PATCH v7] lib: optimize cpumask_local_spread() Shaokun Zhang
2020-11-20 17:48 ` Dave Hansen
2020-11-27  2:29   ` Shaokun Zhang
2020-11-30 17:08     ` Dave Hansen [this message]
2020-12-11  2:25       ` Shaokun Zhang
