Re: [RFC patch v3 01/20] sched: Cache aware load-balancing


From: Tim Chen <tim.c.chen@linux.intel.com>
To: Shrikanth Hegde <sshegde@linux.ibm.com>, Chen Yu <yu.c.chen@intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Libo Chen <libo.chen@oracle.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	 Hillf Danton <hdanton@sina.com>, Len Brown <len.brown@intel.com>,
	linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>
Subject: Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
Date: Mon, 07 Jul 2025 12:57:10 -0700
Message-ID: <007752e0d8318f21dae5d19d4a0a86dc15e73c2e.camel@linux.intel.com>
In-Reply-To: <b4cce11c-5ed6-4366-92c6-9a6b34998295@linux.ibm.com>

On Fri, 2025-07-04 at 00:59 +0530, Shrikanth Hegde wrote:
> 
> Hi Tim, Chen,
> skimming through the series and will try to go through in coming days.
> 
> > 

Thanks for taking a look.  Some further comments on top of Chen Yu's response.

[snip]
> >   
> > +#ifdef CONFIG_SCHED_CACHE
> > +	struct callback_head		cache_work;
> > +#endif
> > +
> >   #ifdef CONFIG_RSEQ
> >   	struct rseq __user *rseq;
> >   	u32 rseq_len;
> > diff --git a/init/Kconfig b/init/Kconfig
> > index bf3a920064be..e2509127b6f9 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -953,6 +953,10 @@ config NUMA_BALANCING
> >   
> >   	  This system will be inactive on UMA systems.
> >   
> > +config SCHED_CACHE
> > +	bool "Cache aware scheduler"
> > +	default y
> > +
> 
> Should it depend on EXPERT?
> IMO this could add quite a bit of overhead and maybe n by default?
> 

We do have a SCHED_CACHE scheduler feature in the later patches,
so an admin who does not want to incur this overhead can turn the
feature off at run time.
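
To illustrate the distinction between the build-time option and the
run-time switch, here is a minimal user-space sketch. It is not code from
the series: MODEL_SCHED_CACHE, model_sched_cache_on and model_pick_cpu()
are hypothetical names that only model the idea of a compiled-in feature
gated by a run-time flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stands in for CONFIG_SCHED_CACHE=y at build time. */
#define MODEL_SCHED_CACHE 1

/* Hypothetical run-time switch, standing in for the scheduler feature bit. */
static bool model_sched_cache_on = true;

/*
 * Hypothetical placement helper: prefer the cache-hot CPU only when the
 * code is both compiled in and enabled at run time. A negative
 * cache_hot_cpu means "no preference recorded".
 */
static int model_pick_cpu(int prev_cpu, int cache_hot_cpu)
{
	assert(prev_cpu >= 0);
#if MODEL_SCHED_CACHE
	if (model_sched_cache_on && cache_hot_cpu >= 0)
		return cache_hot_cpu;
#endif
	return prev_cpu;
}
```

With the flag cleared, model_pick_cpu(3, 7) falls back to CPU 3. Note that
a run-time switch like this only skips the work; fields added under the
#ifdef (such as cache_work in the hunk above) are still present in the
structures.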

> >   config NUMA_BALANCING_DEFAULT_ENABLED
> >   	bool "Automatically enable NUMA aware memory/task placement"
> >   	default y
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 168681fc4b25..da1387823b9e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1332,6 +1332,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >   	if (mm_alloc_cid(mm, p))
> >   		goto fail_cid;
> >   

[snip]
> 
> > +
> > +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> > +{
> > +	lockdep_assert_held(&rq->cpu_epoch_lock);
> > +
> > +	unsigned long n, now = jiffies;
> > +	long delta = now - rq->cpu_epoch_next;
> > +
> > +	if (delta > 0) {
> > +		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
> > +		rq->cpu_epoch += n;
> > +		rq->cpu_epoch_next += n * EPOCH_PERIOD;
> > +		__shr_u64(&rq->cpu_runtime, n);
> 
> Another doubt i had, does this occupancy works when there is CPU bandwidth controller running?
> A 50% occupancy may have different meaning when CPU bandwidth is set to 50%?

The occupancy is used to compare tasks' occupancy within a process.  With the
bandwidth controller set to 50%, it just means that all tasks in the process will
run 50% less, but the relative occupancy ratio between tasks should still remain
the same.
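
As a rough worked example of that argument (a user-space sketch, not code
from the patch): if every task's runtime is scaled by the same quota, each
task's share of the process total is unchanged. The helper names and the
per-mille unit below are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: a task's share of the process's total runtime,
 * in per-mille (0..1000). Inputs are assumed small enough that
 * task_runtime * 1000 does not overflow 64 bits.
 */
static unsigned int occupancy_share_permille(uint64_t task_runtime,
					     uint64_t total_runtime)
{
	assert(task_runtime <= total_runtime);
	if (total_runtime == 0)
		return 0;
	return (unsigned int)(task_runtime * 1000 / total_runtime);
}

/*
 * Hypothetical model of a bandwidth limit: every task gets quota_pct
 * percent of the runtime it would otherwise have had.
 */
static uint64_t throttled_runtime(uint64_t runtime, unsigned int quota_pct)
{
	return runtime * quota_pct / 100;
}
```

For two tasks with runtimes 600 and 200 (total 800), the first task's share
is 750 per-mille. At a 50% quota the runtimes become 300 and 100 (total
400), and the share is still 750 per-mille.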

[snip]
> 
> > +	}
> > 
> > +
> > +static void task_cache_work(struct callback_head *work)
> > +{
> > +	struct task_struct *p = current;
> > +	struct mm_struct *mm = p->mm;
> > +	unsigned long m_a_occ = 0;
> > +	int cpu, m_a_cpu = -1;
> > +	cpumask_var_t cpus;
> > +
> > +	WARN_ON_ONCE(work != &p->cache_work);
> > +
> > +	work->next = work;
> > +
> > +	if (p->flags & PF_EXITING)
> > +		return;
> > +
> > +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> > +		return;
> > +
> > +	scoped_guard (cpus_read_lock) {
> > +		cpumask_copy(cpus, cpu_online_mask);
> > +
> 
> As pointed out already, this is going to be costly in multi NUMA systems. Any cross NUMA access of
> CPUs data is going to add overhead to system bus bandwidth and this happening at tick could be costly.
> 

We'll consider restricting the scan to the preferred NUMA node (if NUMA balancing
is running), which should greatly reduce the overhead.
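
A minimal user-space sketch of that idea, under stated assumptions: struct
cpu_sample and busiest_cpu() are hypothetical and only model "look at
per-CPU occupancy on one node instead of on every online CPU". The real
code would walk a per-node cpumask rather than filter an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU record: owning NUMA node and an occupancy value. */
struct cpu_sample {
	int node;
	unsigned long occ;
};

/*
 * Return the index of the CPU with the highest non-zero occupancy.
 * If node >= 0, only CPUs on that node are examined; if node < 0,
 * all CPUs are examined. *visited counts how many per-CPU records
 * were actually read. Returns -1 if no CPU has non-zero occupancy.
 */
static int busiest_cpu(const struct cpu_sample *cpus, size_t nr,
		       int node, size_t *visited)
{
	unsigned long best_occ = 0;
	int best = -1;
	size_t i;

	assert(cpus && visited);
	*visited = 0;

	for (i = 0; i < nr; i++) {
		/* Skip remote CPUs without touching their occupancy data. */
		if (node >= 0 && cpus[i].node != node)
			continue;
		(*visited)++;
		if (cpus[i].occ > best_occ) {
			best_occ = cpus[i].occ;
			best = (int)i;
		}
	}
	return best;
}
```

On a toy two-node layout with two CPUs per node, restricting the scan to
one node halves the number of records read. The trade-off is that a busier
CPU on the other node is no longer considered.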

> Also, taking cpu_read_lock does preempt_disable, this could add to large preemptoff?
> We need to measure the time it takes on large system. Will try and get back with that number

Tim
