From: David Vernet <void@manifault.com>
To: Abel Wu <wuyun.abel@bytedance.com>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com,
kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com,
tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS
Date: Wed, 12 Jul 2023 17:16:57 -0500 [thread overview]
Message-ID: <20230712221657.GF12207@maniforge> (raw)
In-Reply-To: <93260dd9-818a-7f98-e030-635e0dc8cad8@bytedance.com>
On Wed, Jul 12, 2023 at 06:47:26PM +0800, Abel Wu wrote:
> Hi David, interesting patch!
Hi Abel,
Thanks for the helpful review!
> On 7/11/23 4:03 AM, David Vernet wrote:
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 1292d38d66cc..5c05a3da3d50 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -770,6 +770,8 @@ struct task_struct {
> > unsigned long wakee_flip_decay_ts;
> > struct task_struct *last_wakee;
> > + struct list_head shared_runq_node;
> > +
> > /*
> > * recent_used_cpu is initially set as the last CPU used by a task
> > * that wakes affine another task. Waker/wakee relationships can
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1451f5aa82ac..3ad437d4ea3d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4503,6 +4503,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > #ifdef CONFIG_SMP
> > p->wake_entry.u_flags = CSD_TYPE_TTWU;
> > p->migration_pending = NULL;
> > + INIT_LIST_HEAD(&p->shared_runq_node);
> > #endif
> > init_sched_mm_cid(p);
> > }
> > @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)
> > init_sched_rt_class();
> > init_sched_dl_class();
> > + init_sched_fair_class_late();
> > sched_smp_initialized = true;
> > }
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f7967be7646c..ff2491387201 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -139,18 +139,163 @@ static int __init setup_sched_thermal_decay_shift(char *str)
> > }
> > __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
> > +/**
> > + * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
> > + * waking tasks.
> > + *
> > + * WHAT
> > + * ====
> > + *
> > + * This structure enables the scheduler to be more aggressively work
> > + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> > + * pulled from when another core in the LLC is going to go idle.
> > + *
> > + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
> > + * Waking tasks are enqueued in a shared_runq at the end of
> > + * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
> > + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
> > + * to being pulled from the shared_runq, in which case they're simply dequeued
> > + * from the shared_runq. A waking task is only enqueued to a shared_runq when
> > + * it was _not_ manually migrated to the current runqueue by
> > + * select_task_rq_fair().
> > + *
> > + * There is currently no task-stealing between shared_runqs in different LLCs,
> > + * which means that shared_runq is not fully work conserving. This could be
> > + * added at a later time, with tasks likely only being stolen across
> > + * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
> > + *
> > + * HOW
> > + * ===
> > + *
> > + * An shared_runq is comprised of a list, and a spinlock for synchronization.
> > + * Given that the critical section for a shared_runq is typically a fast list
> > + * operation, and that the shared_runq is localized to a single LLC, the
> > + * spinlock will typically only be contended on workloads that do little else
> > + * other than hammer the runqueue.
>
> Would there be scalability issues on large LLCs?
See the next patch in the series [0] where we shard the per-LLC shared
runqueues to avoid contention.
[0]: https://lore.kernel.org/lkml/20230710200342.358255-7-void@manifault.com/
>
> > + *
> > + * WHY
> > + * ===
> > + *
> > + * As mentioned above, the main benefit of shared_runq is that it enables more
> > + * aggressive work conservation in the scheduler. This can benefit workloads
> > + * that benefit more from CPU utilization than from L1/L2 cache locality.
> > + *
> > + * shared_runqs are segmented across LLCs both to avoid contention on the
> > + * shared_runq spinlock by minimizing the number of CPUs that could contend on
> > + * it, as well as to strike a balance between work conservation, and L3 cache
> > + * locality.
> > + */
> > +struct shared_runq {
> > + struct list_head list;
> > + spinlock_t lock;
> > +} ____cacheline_aligned;
> > +
> > #ifdef CONFIG_SMP
> > +static struct shared_runq *rq_shared_runq(struct rq *rq)
> > +{
> > + return rq->cfs.shared_runq;
> > +}
> > +
> > +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +{
> > + unsigned long flags;
> > + struct task_struct *p;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + if (list_empty(&shared_runq->list))
> > + return NULL;
> > +
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> > + shared_runq_node);
> > + if (p && is_cpu_allowed(p, cpu_of(rq)))
> > + list_del_init(&p->shared_runq_node);
> > + else
> > + p = NULL;
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +
> > + return p;
> > +}
> > +
> > +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_add_tail(&p->shared_runq_node, &shared_runq->list);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +}
> > +
> > static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
> > int enq_flags)
> > -{}
> > +{
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared runqueue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
> > +
> > + shared_runq_push_task(rq, p);
> > +}
> > static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> > {
> > - return 0;
> > + struct task_struct *p = NULL;
> > + struct rq *src_rq;
> > + struct rq_flags src_rf;
> > + int ret;
> > +
> > + p = shared_runq_pop_task(rq);
> > + if (!p)
> > + return 0;
> > +
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
>
> It would be better use the rq_unlock(rq, rf) for simplicity.
> But it's absolutely OK if you want to keep as it is to be
> correspond with the below lock&repin part :)
Yeah, personally I'd prefer to keep the style the consistent between
here and below.
> > +
> > + src_rq = task_rq_lock(p, &src_rf);
> > +
> > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
>
> IMHO it should be:
>
> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
Agreed, will change in v3.
> > + update_rq_clock(src_rq);
> > + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> > + }
> > +
> > + if (src_rq->cpu != rq->cpu)
>
> Why not just 'if (src_rq != rq)'?
Ack, will change.
>
> > + ret = 1;
> > + else
> > + ret = -1;
>
> What about making @ret default to -1 and changing to 1 right
> after move_queued_task()? Both for better readability and align
> with the behavior of newidle_balance().
Yep. This aligns with Gautham's suggestion in [1] as well. Will combine
your input for v3.
[1]: https://lore.kernel.org/lkml/ZK5BdysC0lxKQ%2FgE@BLR-5CG11610CF.amd.com/
>
> > +
> > + task_rq_unlock(src_rq, p, &src_rf);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
>
> By making it looks more ugly, we can save some cycles..
>
> if (src_rq != rq) {
> task_rq_unlock(src_rq, p, &src_rf);
> } else {
> rq_unpin_lock(src_rq, src_rf);
> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
> rq_repin_lock(rq, rf);
> }
I like it. Thanks for the suggestion. I'll apply for v3.
> > +
> > + return ret;
> > }
> > static void shared_runq_dequeue_task(struct task_struct *p)
> > -{}
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + if (!list_empty(&p->shared_runq_node)) {
> > + shared_runq = rq_shared_runq(task_rq(p));
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_del_init(&p->shared_runq_node);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > + }
> > +}
> > /*
> > * For asym packing, by default the lower numbered CPU has higher priority.
> > @@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
> > #endif /* SMP */
> > }
> > +
> > +__init void init_sched_fair_class_late(void)
> > +{
> > +#ifdef CONFIG_SMP
> > + int i;
> > + struct shared_runq *shared_runq;
> > + struct rq *rq;
> > + struct rq *llc_rq;
> > +
> > + for_each_possible_cpu(i) {
> > + if (per_cpu(sd_llc_id, i) == i) {
> > + llc_rq = cpu_rq(i);
> > +
> > + shared_runq = kzalloc_node(sizeof(struct shared_runq),
> > + GFP_KERNEL, cpu_to_node(i));
> > + INIT_LIST_HEAD(&shared_runq->list);
> > + spin_lock_init(&shared_runq->lock);
> > + llc_rq->cfs.shared_runq = shared_runq;
> > + }
> > + }
> > +
> > + for_each_possible_cpu(i) {
> > + rq = cpu_rq(i);
> > + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> > +
> > + if (rq == llc_rq)
> > + continue;
> > + rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
> > + }
> > +#endif /* SMP */
> > +}
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 187ad5da5ef6..8b573dfaba33 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -576,6 +576,7 @@ struct cfs_rq {
> > #endif
> > #ifdef CONFIG_SMP
> > + struct shared_runq *shared_runq;
>
> I would suggest moving shared_runq into shared LLC sched domain,
> which is pointed by sd_llc_shared, as you might finally put the
> retrieval of shared_runq under RCU critical sections to handle
> domain re-building cases.
Yep, I have plans to take add support for domain re-building in v3.
I'll see whether it makes sense to put them into the shared LLC sched
domain, but at first glance it seems like a sane choice.
>
> Best Regards,
> Abel
Thanks for the review.
- David
>
> > /*
> > * CFS load tracking
> > */
> > @@ -2440,6 +2441,7 @@ extern void update_max_interval(void);
> > extern void init_sched_dl_class(void);
> > extern void init_sched_rt_class(void);
> > extern void init_sched_fair_class(void);
> > +extern void init_sched_fair_class_late(void);
> > extern void reweight_task(struct task_struct *p, int prio);
next prev parent reply other threads:[~2023-07-12 22:17 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-10 20:03 [PATCH v2 0/7] sched: Implement shared runqueue in CFS David Vernet
2023-07-10 20:03 ` [PATCH v2 1/7] sched: Expose move_queued_task() from core.c David Vernet
2023-07-10 20:03 ` [PATCH v2 2/7] sched: Move is_cpu_allowed() into sched.h David Vernet
2023-07-10 20:03 ` [PATCH v2 3/7] sched: Check cpu_active() earlier in newidle_balance() David Vernet
2023-07-10 20:03 ` [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls David Vernet
2023-07-11 9:45 ` Peter Zijlstra
2023-07-11 16:19 ` David Vernet
2023-07-12 8:39 ` Abel Wu
2023-07-12 21:34 ` David Vernet
2023-07-10 20:03 ` [PATCH v2 5/7] sched: Implement shared runqueue in CFS David Vernet
2023-07-11 10:18 ` Peter Zijlstra
2023-07-11 16:26 ` David Vernet
2023-07-12 6:00 ` Gautham R. Shenoy
2023-07-12 19:13 ` David Vernet
2023-07-12 10:47 ` Abel Wu
2023-07-12 22:16 ` David Vernet [this message]
2023-07-13 3:43 ` Abel Wu
2023-07-13 4:05 ` David Vernet
2023-07-13 7:58 ` Aaron Lu
2023-07-13 8:29 ` Peter Zijlstra
2023-07-10 20:03 ` [PATCH v2 6/7] sched: Shard per-LLC shared runqueues David Vernet
2023-07-11 10:49 ` Peter Zijlstra
2023-07-11 19:57 ` David Vernet
2023-07-12 10:06 ` Gautham R. Shenoy
2023-07-12 12:22 ` Peter Zijlstra
2023-07-10 20:03 ` [PATCH v2 7/7] sched: Move shared_runq to __{enqueue,dequeue}_entity() David Vernet
2023-07-11 10:51 ` Peter Zijlstra
2023-07-11 16:30 ` David Vernet
2023-07-11 11:42 ` [PATCH v2 0/7] sched: Implement shared runqueue in CFS Peter Zijlstra
2023-07-11 21:33 ` David Vernet
2023-07-21 9:12 ` Gautham R. Shenoy
2023-07-25 20:22 ` David Vernet
2023-08-02 6:32 ` Gautham R. Shenoy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230712221657.GF12207@maniforge \
--to=void@manifault.com \
--cc=aaron.lu@intel.com \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=clm@meta.com \
--cc=dietmar.eggemann@arm.com \
--cc=gautham.shenoy@amd.com \
--cc=juri.lelli@redhat.com \
--cc=kernel-team@meta.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=roman.gushchin@linux.dev \
--cc=rostedt@goodmis.org \
--cc=tj@kernel.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=wuyun.abel@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox