public inbox for linux-kernel@vger.kernel.org
From: "Chen, Yu C" <yu.c.chen@intel.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	David Vernet <void@manifault.com>,
	"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	"Swapnil Sapkal" <swapnil.sapkal@amd.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>, <yu.c.chen@intel.com>,
	<yu.chen.surf@foxmail.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
Date: Tue, 18 Mar 2025 02:04:55 +0800	[thread overview]
Message-ID: <3915af7f-fe05-4a1b-a8b2-e9690827b916@intel.com> (raw)
In-Reply-To: <20250313093746.6760-8-kprateek.nayak@amd.com>

On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
> Allow update_sg_lb_stats() to retrieve the group stats cached in
> sg_lb_stats_prop saved by another CPU performing load balancing around
> the same time (same jiffy)
> 

If I understand correctly, we allow update_sg_lb_stats() to retrieve
cached sg stats if another CPU in the same sched group has done load
balancing within the last jiffy, i.e. 10 ms for CONFIG_HZ=100.

There are two roles: the writer, who updates the cached stats, and the
reader, who reads them. Are both the writer and the reader triggered
only during busy periodic load balancing? If so, consider that periodic
load balancing is usually triggered on only one CPU in each sched
domain (should_we_balance()), and that the balance interval grows with
the number of CPUs in that domain. I wonder whether 10 ms is too short
a window to find cached stats on a large LLC?
thanks,
Chenyu


> The current implementation, without invalidation of cached stats, has a
> few limitations: stats reuse is limited to busy load balancing, since
> stats can only be updated once per jiffy. Newidle balance can happen
> frequently and concurrently on many CPUs, which can result in readers
> seeing inconsistent values for the propagated stats.
> 
> For this iteration, the focus is on reducing the time taken for busy load
> balancing, allowing the busy CPU to resume running the task as quickly as
> possible.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 81 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 60517a732c10..3b402f294f0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>   	return check_cpu_capacity(rq, sd);
>   }
>   
> +static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
> +{
> +	/*
> +	 * Only under periodic load balancing can we ensure that no
> +	 * concurrent CPU modifies the stats being propagated upwards:
> +	 * should_we_balance() can allow multiple newidle balances to
> +	 * progress concurrently, and an idle -> busy transition during
> +	 * idle balance requires the stats to be recomputed since the
> +	 * idleness metrics change with migration.
> +	 */
> +	if (idle)
> +		return 0;
> +
> +	/*
> +	 * If individual groups are separate NUMA domains, migrations can cause
> +	 * preferred task statistics to change and will require recomputing of
> +	 * stats.
> +	 */
> +	if (sd->child && (sd->child->flags & SD_NUMA))
> +		return 0;
> +
> +	/*
> +	 * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
> +	 * domains. Skip caching stats for them.
> +	 */
> +	if (sd->flags & SD_ASYM_CPUCAPACITY)
> +		return 0;
> +
> +	/*
> +	 * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
> +	 * transition but for the time being, save some cycles during busy
> +	 * load balancing.
> +	 */
> +	return 1;
> +}
> +
> +static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
> +{
> +	struct sched_domain_shared *sg_share = group->shared;
> +	unsigned long current_jiffy = jiffies;
> +	struct sg_lb_stats_prop *lb_prop;
> +
> +	if (!sg_share)
> +		return 0;
> +
> +	lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
> +	if (!lb_prop)
> +		return 0;
> +
> +	/* Stale stats */
> +	if (READ_ONCE(lb_prop->last_update) != current_jiffy)
> +		return 0;
> +
> +	/*
> +	 * Pairs with the update to lb_prop->last_update to
> +	 * prevent readers from seeing an inconsistent value of
> +	 * the propagated stats from a concurrent update.
> +	 */
> +	smp_rmb();
> +	*sg_stats = lb_prop->sg_stats;
> +
> +	/*
> +	 * If stats were read within the same jiffy, the reader
> +	 * cannot have seen an inconsistent state since stats are
> +	 * only updated once per jiffy.
> +	 */
> +	return time_before_eq(jiffies, current_jiffy);
> +}
> +
>   /**
>    * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>    * @env: The load balancing environment.
> @@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   	int i, nr_running, local_group, sd_flags = env->sd->flags;
>   	bool balancing_at_rd = !env->sd->parent;
>   
> -	memset(sgs, 0, sizeof(*sgs));
> -
>   	local_group = group == sds->local;
>   
> +	/*
> +	 * If stats can be retrieved, we are doing busy load balancing.
> +	 * Skip right ahead to group_classify() since group_asym_packing and
> +	 * group_smt_balance are not possible under busy load balancing.
> +	 */
> +	if (can_retrieve_stats(env->sd, env->idle) &&
> +	    retrieve_cached_stats(group, sgs))
> +		goto group_classify;
> +
> +	memset(sgs, 0, sizeof(*sgs));
> +
>   	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   		struct rq *rq = cpu_rq(i);
>   		unsigned long load = cpu_load(rq);
> @@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   	if (!local_group && smt_balance(env, sgs, group))
>   		sgs->group_smt_balance = 1;
>   
> +group_classify:
>   	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>   
>   	/* Computing avg_load makes sense only when group is overloaded */

Thread overview: 24+ messages
2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 2/8] sched/topology: Introduce sg->shared K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
2025-03-17 18:07   ` Chen, Yu C
2025-03-19  6:51     ` K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
2025-03-17 18:04   ` Chen, Yu C [this message]
2025-03-19  6:42     ` K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 15/08] [DEBUG] tools/lib/perf: Extend schedstats v17 headers to include the new debug fields K Prateek Nayak
2025-03-17 17:25 ` [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy Peter Zijlstra
2025-03-17 18:23   ` Chen, Yu C
2025-03-21 10:04 ` Libo Chen
2025-03-24  3:58   ` K Prateek Nayak
