Re: [PATCH v3] sched/fair: Cache NUMA node statistics to avoid O(N) scanning


From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Qiliang Yuan <realwujing@gmail.com>
Cc: <bsegall@google.com>, <dietmar.eggemann@arm.com>,
	<juri.lelli@redhat.com>, <linux-kernel@vger.kernel.org>,
	<mgorman@suse.de>, <mingo@redhat.com>, <peterz@infradead.org>,
	<rostedt@goodmis.org>, <vincent.guittot@linaro.org>,
	<vschneid@redhat.com>, <yuanql9@chinatelecom.cn>
Subject: Re: [PATCH v3] sched/fair: Cache NUMA node statistics to avoid O(N) scanning
Date: Tue, 27 Jan 2026 08:55:07 +0530
Message-ID: <a269c33c-eaa3-4a06-aa27-062273e2e1c4@amd.com>
In-Reply-To: <20260126110250.1060512-1-realwujing@gmail.com>

Hello Qiliang,

On 1/26/2026 4:32 PM, Qiliang Yuan wrote:
> Optimize update_numa_stats() by leveraging pre-calculated node
> statistics cached during the load balancing process. This reduces the
> complexity of NUMA balancing overhead from O(CPUs_per_node) to O(1)
> when statistics for the source node are fresh.
> 
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
> ---

This is missing a changelog and the performance numbers that justify
the change.

>  kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e71302282671..070b61f65b6d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2094,6 +2094,17 @@ static inline int numa_idle_core(int idle_core, int cpu)
>   * borrows code and logic from update_sg_lb_stats but sharing a
>   * common implementation is impractical.
>   */
> +struct numa_stats_cache {
> +	unsigned long load;
> +	unsigned long runnable;
> +	unsigned long util;
> +	unsigned long nr_running;
> +	unsigned long capacity;
> +	unsigned long last_update;
> +};
> +
> +static struct numa_stats_cache node_stats_cache[MAX_NUMNODES];

MAX_NUMNODES can be a very large value. Why does this array need to
be allocated statically up front rather than dynamically during the
sched domain build?

Speaking of sched domains, partitioning the system can split a NUMA
domain across multiple partitions, which makes these numbers
partition-specific. Tasks running in one partition cannot use the
cached values from another partition.

If there is really a noticeable benefit, I would suggest using
the previous method to cache it somewhere in the sched domain
hierarchy - but only if there is a noticeable benefit.

> +
>  static void update_numa_stats(struct task_numa_env *env,
>  			      struct numa_stats *ns, int nid,
>  			      bool find_idle)
> @@ -2104,6 +2115,24 @@ static void update_numa_stats(struct task_numa_env *env,
>  	ns->idle_cpu = -1;
>  
>  	rcu_read_lock();
> +	/*
> +	 * Algorithmic Optimization: Avoid O(N) scan by using cached stats.
> +	 * Only applicable for the source node where we don't need to find
> +	 * an idle CPU.
> +	 */
> +	if (!find_idle && nid == env->src_nid) {
> +		struct numa_stats_cache *cache = &node_stats_cache[nid];
> +
> +		if (time_before(jiffies, cache->last_update + msecs_to_jiffies(10))) {
> +			ns->load = READ_ONCE(cache->load);
> +			ns->runnable = READ_ONCE(cache->runnable);
> +			ns->util = READ_ONCE(cache->util);
> +			ns->nr_running = READ_ONCE(cache->nr_running);
> +			ns->compute_capacity = READ_ONCE(cache->capacity);

So READ_ONCE()/WRITE_ONCE() do not solve the issue I was highlighting
in the last version. Say the following happens:

    CPU0                                            CPU1
    ====                                            ====

  update_numa_stats()
    /* Working on current numa_stats_cache */
    ns->load = READ_ONCE(cache->load);
    ns->runnable = READ_ONCE(cache->runnable);
    ... interrupted                               update_sg_lb_stats()
    ...                                           ... updates the entire numa_stats_cache
    ...
    ns->util = READ_ONCE(cache->util); /* Sees new data. */


Can this cause an issue? If not, please highlight in the commit log why
it is not an issue. There can be cases where we see util > capacity,
util > runnable, etc. which might lead to incorrect calculations later
on.

> +			goto skip_scan;
> +		}
> +	}
> +
>  	for_each_cpu(cpu, cpumask_of_node(nid)) {
>  		struct rq *rq = cpu_rq(cpu);
>  
> @@ -2124,6 +2153,8 @@ static void update_numa_stats(struct task_numa_env *env,
>  			idle_core = numa_idle_core(idle_core, cpu);
>  		}
>  	}
> +
> +skip_scan:
>  	rcu_read_unlock();
>  
>  	ns->weight = cpumask_weight(cpumask_of_node(nid));
> @@ -10488,6 +10519,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	if (sgs->group_type == group_overloaded)
>  		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
>  				sgs->group_capacity;
> +
> +	/* Algorithmic Optimization: Cache node stats for O(1) NUMA lookups */
> +	if (env->sd->flags & SD_NUMA) {

Also, you'll need to think about partitions here.

-- 
Thanks and Regards,
Prateek

Thread overview: 8+ messages
2026-01-22 16:16 [PATCH] sched/fair: Cache NUMA node statistics to avoid O(N) scanning Qiliang Yuan
2026-01-22 16:16 ` [PATCH] sched/numa: Optimize NUMA placement algorithm complexity from O(Nodes) to O(Active_Nodes) Qiliang Yuan
2026-01-23  1:39 ` [PATCH v2] sched/fair: Cache NUMA node statistics to avoid O(N) scanning Qiliang Yuan
2026-01-23  3:10   ` K Prateek Nayak
2026-01-26 11:02     ` [PATCH v3] " Qiliang Yuan
2026-01-26 15:30       ` kernel test robot
2026-01-26 16:23       ` kernel test robot
2026-01-27  3:25       ` K Prateek Nayak [this message]
