Re: [RFC][PATCH] sched: Cache aware load-balancing


From: "Chen, Yu C" <yu.c.chen@intel.com>
To: Abel Wu <wuyun.abel@bytedance.com>,
	Peter Zijlstra <peterz@infradead.org>
Cc: <juri.lelli@redhat.com>, <vincent.guittot@linaro.org>,
	<dietmar.eggemann@arm.com>, <rostedt@goodmis.org>,
	<bsegall@google.com>, <mgorman@suse.de>, <vschneid@redhat.com>,
	<linux-kernel@vger.kernel.org>, <tim.c.chen@linux.intel.com>,
	<tglx@linutronix.de>, <mingo@kernel.org>,
	<gautham.shenoy@amd.com>, <kprateek.nayak@amd.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
Date: Sat, 29 Mar 2025 23:06:03 +0800
Message-ID: <143a63f6-3e1b-42fd-a4c8-8d2f6b7d3583@intel.com>
In-Reply-To: <93907416-dab4-4a3a-82b6-e37e4ee334db@bytedance.com>

On 3/28/2025 9:57 PM, Abel Wu wrote:
> Hi Peter,
> 
> On 3/25/25 8:09 PM, Peter Zijlstra wrote:
>>   struct mmu_gather;
>>   extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
>>   extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6e5c38718ff5..f8eafe440369 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1379,6 +1379,10 @@ struct task_struct {
>>       unsigned long            numa_pages_migrated;
>>   #endif /* CONFIG_NUMA_BALANCING */
>> +#ifdef CONFIG_SCHED_CACHE
>> +    struct callback_head        cache_work;
>> +#endif
> 
> IIUC this work updates stats for the whole mm, so it seems
> unnecessary for each task of the process to repeat the same thing.
> Hence it would be better to move this work to mm_struct.
> 

The per-task cache_work appears to be there to avoid queuing
task_cache_work() more than once on the task's task_works list; see
the check in task_tick_cache():
  if (work->next == work)
	task_work_add()
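
A minimal userspace sketch of that "self-pointing next means not queued" idiom may help here. It is only a model: struct work_queue, work_init_idle(), queue_once() and run_one() are hypothetical names for illustration, and the list handling is far simpler than the kernel's real task_work code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct callback_head. */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

/* Hypothetical singly linked pending list standing in for task->task_works. */
struct work_queue {
	struct callback_head *head;
};

/* Mark the work as idle: a self-pointing next means "not queued". */
static void work_init_idle(struct callback_head *work)
{
	work->next = work;
}

/* Queue the work only if it is idle; returns true if it was queued. */
static bool queue_once(struct work_queue *q, struct callback_head *work)
{
	if (work->next != work)
		return false;	/* already pending, skip the duplicate */

	work->next = q->head;	/* NULL or another entry, never work itself */
	q->head = work;
	return true;
}

/* Pop and run one item; it is re-armed first, as task_cache_work() does. */
static bool run_one(struct work_queue *q)
{
	struct callback_head *work = q->head;

	if (!work)
		return false;
	q->head = work->next;
	work->next = work;	/* back to idle before the callback runs */
	if (work->func)
		work->func(work);
	return true;
}
```

Because the sentinel lives in the callback_head itself, moving cache_work into mm_struct would need some other way to tell "already queued" apart per task.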

To make task_cache_work() exclusive, so that only one task in the
process does the calculation, maybe we could introduce a mechanism
similar to the one in task_numa_work(), something like this:

if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
	return;
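
To spell the idea out a bit more, here is a rough userspace model using C11 atomics in place of the kernel's try_cmpxchg(). struct mm_model, claim_cache_scan() and the period parameter are assumptions made for the sketch; cache_next_scan is the field name suggested above and does not exist in the posted patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mm_struct field suggested above. */
struct mm_model {
	_Atomic unsigned long cache_next_scan;
};

/*
 * Returns true for exactly one caller per scan period; that caller
 * would go on to do the per-mm occupancy calculation.
 */
static bool claim_cache_scan(struct mm_model *mm, unsigned long now,
			     unsigned long period)
{
	unsigned long calc = atomic_load(&mm->cache_next_scan);

	/* Wrap-safe "now is before calc" test, like time_before(). */
	if ((long)(now - calc) < 0)
		return false;	/* not due yet */

	/* Only the thread that moves the deadline forward wins. */
	return atomic_compare_exchange_strong(&mm->cache_next_scan,
					      &calc, now + period);
}
```

Threads that lose the compare-and-exchange simply return, so at most one of them walks the CPUs in each period.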

>> +
>> +static inline
>> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>> +{
>> +    struct mm_struct *mm = p->mm;
>> +    struct mm_sched *pcpu_sched;
>> +    unsigned long epoch;
>> +
>> +    /*
>> +     * init_task and kthreads don't be having no mm
>> +     */
>> +    if (!mm || !mm->pcpu_sched)
>> +        return;
>> +
>> +    pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
>> +
>> +    scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>> +        __update_mm_sched(rq, pcpu_sched);
>> +        pcpu_sched->runtime += delta_exec;
>> +        rq->cpu_runtime += delta_exec;
>> +        epoch = rq->cpu_epoch;
>> +    }
>> +
>> +    /*
>> +     * If this task hasn't hit task_cache_work() for a while, invalidate
>> +     * its preferred state.
>> +     */
>> +    if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
>> +        mm->mm_sched_cpu = -1;
>> +        pcpu_sched->occ = -1;
>> +    }
> 
> This seems too late. account_mm_sched() is called when p is runnable,
> so if the whole process sleeps for a while before being woken up,
> ttwu will take the out-dated value.
> 

Yup, there seems to be a problem. It would be better if we could reset
mm_sched_cpu to -1 after the last thread of the process falls asleep.
But if all the threads are sleeping, does it matter that ttwu tries to
enqueue the first newly-woken thread on an out-dated, idle
mm_sched_cpu? I guess it would not be a serious problem, because most
of the process's cache footprint has probably been evicted during the
long sleep.
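
If the stale hint did turn out to matter, one option would be to apply the same epoch comparison on the wakeup side before trusting mm_sched_cpu. The sketch below is a userspace model of that check only: fresh_hint_cpu() and struct mm_hint are hypothetical, EPOCH_OLD_ASSUMED is a made-up threshold (the real EPOCH_OLD is defined by the patch), and which rq's cpu_epoch the wakeup path should read is left open.

```c
#include <assert.h>

/* Assumed threshold for the sketch; the real EPOCH_OLD comes from the patch. */
#define EPOCH_OLD_ASSUMED 5UL

struct mm_hint {
	unsigned long mm_sched_epoch;	/* epoch of the last task_cache_work() run */
	int mm_sched_cpu;		/* preferred CPU, -1 if none */
};

/* Hypothetical helper: return the hint only if it is still fresh. */
static int fresh_hint_cpu(const struct mm_hint *mm, unsigned long cpu_epoch)
{
	if (cpu_epoch - mm->mm_sched_epoch > EPOCH_OLD_ASSUMED)
		return -1;	/* stale: fall back to the normal wakeup path */
	return mm->mm_sched_cpu;
}
```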

>> +
>> +static void task_cache_work(struct callback_head *work)
>> +{
>> +    struct task_struct *p = current;
>> +    struct mm_struct *mm = p->mm;
>> +    unsigned long m_a_occ = 0;
>> +    int cpu, m_a_cpu = -1;
>> +    cpumask_var_t cpus;
>> +
>> +    WARN_ON_ONCE(work != &p->cache_work);
>> +
>> +    work->next = work;
>> +
>> +    if (p->flags & PF_EXITING)
>> +        return;
>> +
>> +    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>> +        return;
>> +
>> +    scoped_guard (cpus_read_lock) {
>> +        cpumask_copy(cpus, cpu_online_mask);
>> +
>> +        for_each_cpu(cpu, cpus) {
>> +            /* XXX sched_cluster_active */
>> +            struct sched_domain *sd = per_cpu(sd_llc, cpu);
>> +            unsigned long occ, m_occ = 0, a_occ = 0;
>> +            int m_cpu = -1, nr = 0, i;
>> +
>> +            for_each_cpu(i, sched_domain_span(sd)) {
>> +                occ = fraction_mm_sched(cpu_rq(i),
>> +                            per_cpu_ptr(mm->pcpu_sched, i));
>> +                a_occ += occ;
>> +                if (occ > m_occ) {
>> +                    m_occ = occ;
>> +                    m_cpu = i;
>> +                }
> 
> This could cause task stacking on the hint cpu, since the hint
> is updated less frequently than wakeups happen.
> 

SIS would replace the prev CPU with this hint (cached) CPU and use it
as the starting point when searching for an idle CPU, so ideally it
should not cause task stacking. But there is a race: multiple wakeup
paths might find the same cached "idle" CPU and queue wakees on it.
This usually happens with frequent context switches (wakeups) of
short-duration tasks.
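
A toy model of that race, under the assumption that the search just scans the LLC starting at the hint and reads an idle snapshot without claiming anything. struct llc_model, NR_CPUS_MODEL and pick_idle_from_hint() are hypothetical; the real select_idle_sibling() path is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8	/* model size, an assumption for the sketch */

/* Snapshot of which CPUs look idle; wakers read it without claiming a CPU. */
struct llc_model {
	bool idle[NR_CPUS_MODEL];
};

/* Scan the LLC starting at the hint, wrapping around; -1 if nothing is idle. */
static int pick_idle_from_hint(const struct llc_model *llc, int hint)
{
	assert(hint >= 0 && hint < NR_CPUS_MODEL);

	for (int n = 0; n < NR_CPUS_MODEL; n++) {
		int cpu = (hint + n) % NR_CPUS_MODEL;

		if (llc->idle[cpu])
			return cpu;
	}
	return -1;
}
```

Two wakers calling pick_idle_from_hint() with the same hint, before either wakee is enqueued and the idle state changes, both get the same CPU back. That is the stacking window; once the first wakee is accounted for, the next search moves on to another idle CPU.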


> And although the occupancy heuristic looks reasonable, IMHO it
> doesn't make much sense to compare between cpus as they share
> the LLC, and a non-hint cpu with warmer L1/L2$ in the same LLC as
> the hint cpu seems preferable.
> 
> Do you think it would be appropriate to only hint on the hottest
> LLC? Then tasks can hopefully be woken up on the 'right' LLC, on the
> premise that it wouldn't cause much imbalance between LLCs.
> 
> I will do some tests and return with more feedback.
> 

Finding an idle CPU in the wakee's hottest LLC seems plausible.
The benchmark data might indicate the proper way.
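
For reference, picking only the hottest LLC could look roughly like the userspace sketch below. It assumes a flattened view (one occupancy value and one LLC id per CPU) and compares per-LLC averages; hottest_llc() and its parameters are hypothetical, and the averaging is my assumption based on the a_occ and nr variables in the quoted loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view: per-CPU occupancy plus each CPU's LLC id. */
static int hottest_llc(const unsigned long *occ, const int *llc_id,
		       size_t nr_cpus, int nr_llcs)
{
	unsigned long best_avg = 0;
	int best = -1;

	for (int llc = 0; llc < nr_llcs; llc++) {
		unsigned long sum = 0, nr = 0;

		for (size_t i = 0; i < nr_cpus; i++) {
			if (llc_id[i] == llc) {
				sum += occ[i];
				nr++;
			}
		}
		/* Compare averages so a larger LLC does not win by size alone. */
		if (nr && sum / nr > best_avg) {
			best_avg = sum / nr;
			best = llc;
		}
	}
	return best;	/* -1 when no LLC has any occupancy */
}
```

The wakeup path would then look for an idle CPU anywhere inside the returned LLC instead of starting from one specific hint CPU.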

thanks,
Chenyu

