Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ingo Molnar <mingo@kernel.org>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
Date: Tue, 12 Mar 2024 11:57:49 +0100	[thread overview]
Message-ID: <ZfA1LRq1d2ueoSRm@gmail.com> (raw)
In-Reply-To: <41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com>


* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> > I think we should probably do something about this contention on this 
> > large system: especially if #2 'no work to be done' bailout is the 
> > common case.
> 
> 
> I have been thinking would it be right to move this balancing 
> trylock/atomic after should_we_balance(swb). This does reduce the number 
> of times this checked/updated significantly. Contention is still present. 
> That's possible at higher utilization when there are multiple NUMA 
> domains. one CPU in each NUMA domain can contend if their invocation is 
> aligned.

Note that it's not true contention: it simply means there's overlapping 
requests for the highest domains to be balanced, for which we only have a 
single thread of execution at a time, system-wide.

> That makes sense since, Right now a CPU takes lock, checks if it can 
>  balance, do balance if yes and then releases the lock. If the lock is 
>  taken after swb then also, CPU checks if it can balance,
> tries to take the lock and releases the lock if it did. If lock is 
> contended, it bails out of load_balance. That is the current behaviour as 
> well, or I am completely wrong.
> 
> Not sure in which scenarios that would hurt. we could do this after this 
> series. This may need wider functional testing to make sure we don't 
> regress badly in some cases. This is only an *idea* as of now.
> 
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system. 
> 6.8-rc6                                                                         
> -----------------------------------------                                       
> idle system:                                                                    
> 449 probe:rebalance_domains_L37                                                 
> 377 probe:rebalance_domains_L55                                                 
> stress-ng --cpu=$(nproc) -l 51     << 51% load                                               
> 88K probe:rebalance_domains_L37                                                 
> 77K probe:rebalance_domains_L55                                                 
> stress-ng --cpu=$(nproc) -l 100    << 100% load                                             
> 41K probe:rebalance_domains_L37                                                 
> 10K probe:rebalance_domains_L55                                                 
>                                                                                 
> +below patch                                                                          
> ----------------------------------------                                        
> idle system:                                                                    
> 462 probe:load_balance_L35                                                      
> 394 probe:load_balance_L274                                                     
> stress-ng --cpu=$(nproc) -l 51      << 51% load                                            
> 5K probe:load_balance_L35                       	<<-- almost 15x less                                
> 4K probe:load_balance_L274                                                      
> stress-ng --cpu=$(nproc) -l 100     << 100% load                                            
> 8K probe:load_balance_L35                                                       
> 3K probe:load_balance_L274 				<<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	struct rq *busiest;
>  	struct rq_flags rf;
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	int need_serialize;
>  	struct lb_env env = {
>  		.sd		= sd,
>  		.dst_cpu	= this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		goto out_balanced;
>  	}
> 
> +	need_serialize = sd->flags & SD_SERIALIZE;
> +	if (need_serialize) {
> +		if (!spin_trylock(&balancing))
> +			goto lockout;
> +	}
> +
>  	group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that 
it allows the should_we_balance() check to be executed in parallel, and 
will only try to take the NUMA-balancing flag if that function indicates an 
imbalance.

Since should_we_balance() isn't taking any locks AFAICS, this might be a 
valid approach. What might make sense is to instrument the percentage of 
NUMA-balancing flag-taking 'failures' vs. successful attempts - not 
necessarily the 'contention percentage'.

But another question is, why do we get here so frequently, so that the 
cumulative execution time of these SD_SERIAL rebalance passes exceeds that 
of 100% of single CPU time? Ie. a single CPU is basically continuously 
scanning the scheduler data structures for imbalances, right? That doesn't 
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of the SD_SERIAL pass so large 
that we exceed 100% CPU time?

Thanks,

	Ingo

next prev parent reply	other threads:[~2024-03-12 10:57 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-04  9:48 [PATCH -v3 0/9] sched/balancing: Misc updates & cleanups Ingo Molnar
2024-03-04  9:48 ` [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag Ingo Molnar
2024-03-05 10:50   ` Valentin Schneider
2024-03-08  9:48     ` Ingo Molnar
2024-03-05 11:11   ` Shrikanth Hegde
2024-03-08 11:23     ` Ingo Molnar
2024-03-08 14:48       ` Shrikanth Hegde
2024-03-12 10:57         ` Ingo Molnar [this message]
2024-03-21 12:12           ` Shrikanth Hegde
2024-03-04  9:48 ` [PATCH 2/9] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat() Ingo Molnar
2024-03-04 15:05   ` Shrikanth Hegde
2024-03-08  9:55     ` Ingo Molnar
2024-03-04  9:48 ` [PATCH 3/9] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions Ingo Molnar
2024-03-05 10:50   ` Valentin Schneider
2024-03-06 15:46   ` Vincent Guittot
2024-03-08  9:59     ` Ingo Molnar
2024-03-04  9:48 ` [PATCH 4/9] sched/balancing: Change comment formatting to not overlap Git conflict marker lines Ingo Molnar
2024-03-05 10:50   ` Valentin Schneider
2024-03-06 15:44   ` Vincent Guittot
2024-03-04  9:48 ` [PATCH 5/9] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK Ingo Molnar
2024-03-05 10:50   ` Valentin Schneider
2024-03-06 15:43   ` Vincent Guittot
2024-03-08 10:11     ` Ingo Molnar
2024-03-04  9:48 ` [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments Ingo Molnar
2024-03-05 10:50   ` Valentin Schneider
2024-03-06 16:17     ` Vincent Guittot
2024-03-08 10:15       ` Ingo Molnar
2024-03-08 11:57         ` Vincent Guittot
2024-03-08 16:45           ` Valentin Schneider
2024-03-04  9:48 ` [PATCH 7/9] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats' Ingo Molnar
2024-03-05 10:50   ` Valentin Schneider
2024-03-04  9:48 ` [PATCH 8/9] sched/balancing: Update comments in " Ingo Molnar
2024-03-05 10:51   ` Valentin Schneider
2024-03-04  9:48 ` [PATCH 9/9] sched/balancing: Rename run_rebalance_domains() => sched_balance_softirq() Ingo Molnar
2024-03-05 10:51   ` Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZfA1LRq1d2ueoSRm@gmail.com \
    --to=mingo@kernel.org \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=sshegde@linux.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.