From: David Vernet <void@manifault.com>
To: Abel Wu <wuyun.abel@bytedance.com>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com,
kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com,
tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS
Date: Wed, 12 Jul 2023 17:16:57 -0500 [thread overview]
Message-ID: <20230712221657.GF12207@maniforge> (raw)
In-Reply-To: <93260dd9-818a-7f98-e030-635e0dc8cad8@bytedance.com>
On Wed, Jul 12, 2023 at 06:47:26PM +0800, Abel Wu wrote:
> Hi David, interesting patch!
Hi Abel,
Thanks for the helpful review!
> On 7/11/23 4:03 AM, David Vernet wrote:
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 1292d38d66cc..5c05a3da3d50 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -770,6 +770,8 @@ struct task_struct {
> > unsigned long wakee_flip_decay_ts;
> > struct task_struct *last_wakee;
> > + struct list_head shared_runq_node;
> > +
> > /*
> > * recent_used_cpu is initially set as the last CPU used by a task
> > * that wakes affine another task. Waker/wakee relationships can
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1451f5aa82ac..3ad437d4ea3d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4503,6 +4503,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > #ifdef CONFIG_SMP
> > p->wake_entry.u_flags = CSD_TYPE_TTWU;
> > p->migration_pending = NULL;
> > + INIT_LIST_HEAD(&p->shared_runq_node);
> > #endif
> > init_sched_mm_cid(p);
> > }
> > @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)
> > init_sched_rt_class();
> > init_sched_dl_class();
> > + init_sched_fair_class_late();
> > sched_smp_initialized = true;
> > }
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f7967be7646c..ff2491387201 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -139,18 +139,163 @@ static int __init setup_sched_thermal_decay_shift(char *str)
> > }
> > __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
> > +/**
> > + * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
> > + * waking tasks.
> > + *
> > + * WHAT
> > + * ====
> > + *
> > + * This structure enables the scheduler to be more aggressively work
> > + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> > + * pulled from when another core in the LLC is going to go idle.
> > + *
> > + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
> > + * Waking tasks are enqueued in a shared_runq at the end of
> > + * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
> > + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
> > + * to being pulled from the shared_runq, in which case they're simply dequeued
> > + * from the shared_runq. A waking task is only enqueued to a shared_runq when
> > + * it was _not_ manually migrated to the current runqueue by
> > + * select_task_rq_fair().
> > + *
> > + * There is currently no task-stealing between shared_runqs in different LLCs,
> > + * which means that shared_runq is not fully work conserving. This could be
> > + * added at a later time, with tasks likely only being stolen across
> > + * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
> > + *
> > + * HOW
> > + * ===
> > + *
> > + * An shared_runq is comprised of a list, and a spinlock for synchronization.
> > + * Given that the critical section for a shared_runq is typically a fast list
> > + * operation, and that the shared_runq is localized to a single LLC, the
> > + * spinlock will typically only be contended on workloads that do little else
> > + * other than hammer the runqueue.
>
> Would there be scalability issues on large LLCs?
See the next patch in the series [0] where we shard the per-LLC shared
runqueues to avoid contention.
[0]: https://lore.kernel.org/lkml/20230710200342.358255-7-void@manifault.com/
>
> > + *
> > + * WHY
> > + * ===
> > + *
> > + * As mentioned above, the main benefit of shared_runq is that it enables more
> > + * aggressive work conservation in the scheduler. This can benefit workloads
> > + * that benefit more from CPU utilization than from L1/L2 cache locality.
> > + *
> > + * shared_runqs are segmented across LLCs both to avoid contention on the
> > + * shared_runq spinlock by minimizing the number of CPUs that could contend on
> > + * it, as well as to strike a balance between work conservation, and L3 cache
> > + * locality.
> > + */
> > +struct shared_runq {
> > + struct list_head list;
> > + spinlock_t lock;
> > +} ____cacheline_aligned;
> > +
> > #ifdef CONFIG_SMP
> > +static struct shared_runq *rq_shared_runq(struct rq *rq)
> > +{
> > + return rq->cfs.shared_runq;
> > +}
> > +
> > +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +{
> > + unsigned long flags;
> > + struct task_struct *p;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + if (list_empty(&shared_runq->list))
> > + return NULL;
> > +
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> > + shared_runq_node);
> > + if (p && is_cpu_allowed(p, cpu_of(rq)))
> > + list_del_init(&p->shared_runq_node);
> > + else
> > + p = NULL;
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +
> > + return p;
> > +}
> > +
> > +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_add_tail(&p->shared_runq_node, &shared_runq->list);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +}
> > +
> > static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
> > int enq_flags)
> > -{}
> > +{
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared runqueue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
> > +
> > + shared_runq_push_task(rq, p);
> > +}
> > static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> > {
> > - return 0;
> > + struct task_struct *p = NULL;
> > + struct rq *src_rq;
> > + struct rq_flags src_rf;
> > + int ret;
> > +
> > + p = shared_runq_pop_task(rq);
> > + if (!p)
> > + return 0;
> > +
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
>
> It would be better use the rq_unlock(rq, rf) for simplicity.
> But it's absolutely OK if you want to keep as it is to be
> correspond with the below lock&repin part :)
Yeah, personally I'd prefer to keep the style the consistent between
here and below.
> > +
> > + src_rq = task_rq_lock(p, &src_rf);
> > +
> > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
>
> IMHO it should be:
>
> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
Agreed, will change in v3.
> > + update_rq_clock(src_rq);
> > + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> > + }
> > +
> > + if (src_rq->cpu != rq->cpu)
>
> Why not just 'if (src_rq != rq)'?
Ack, will change.
>
> > + ret = 1;
> > + else
> > + ret = -1;
>
> What about making @ret default to -1 and changing to 1 right
> after move_queued_task()? Both for better readability and align
> with the behavior of newidle_balance().
Yep. This aligns with Gautham's suggestion in [1] as well. Will combine
your input for v3.
[1]: https://lore.kernel.org/lkml/ZK5BdysC0lxKQ%2FgE@BLR-5CG11610CF.amd.com/
>
> > +
> > + task_rq_unlock(src_rq, p, &src_rf);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
>
> By making it looks more ugly, we can save some cycles..
>
> if (src_rq != rq) {
> task_rq_unlock(src_rq, p, &src_rf);
> } else {
> rq_unpin_lock(src_rq, src_rf);
> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
> rq_repin_lock(rq, rf);
> }
I like it. Thanks for the suggestion. I'll apply for v3.
> > +
> > + return ret;
> > }
> > static void shared_runq_dequeue_task(struct task_struct *p)
> > -{}
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + if (!list_empty(&p->shared_runq_node)) {
> > + shared_runq = rq_shared_runq(task_rq(p));
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_del_init(&p->shared_runq_node);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > + }
> > +}
> > /*
> > * For asym packing, by default the lower numbered CPU has higher priority.
> > @@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
> > #endif /* SMP */
> > }
> > +
> > +__init void init_sched_fair_class_late(void)
> > +{
> > +#ifdef CONFIG_SMP
> > + int i;
> > + struct shared_runq *shared_runq;
> > + struct rq *rq;
> > + struct rq *llc_rq;
> > +
> > + for_each_possible_cpu(i) {
> > + if (per_cpu(sd_llc_id, i) == i) {
> > + llc_rq = cpu_rq(i);
> > +
> > + shared_runq = kzalloc_node(sizeof(struct shared_runq),
> > + GFP_KERNEL, cpu_to_node(i));
> > + INIT_LIST_HEAD(&shared_runq->list);
> > + spin_lock_init(&shared_runq->lock);
> > + llc_rq->cfs.shared_runq = shared_runq;
> > + }
> > + }
> > +
> > + for_each_possible_cpu(i) {
> > + rq = cpu_rq(i);
> > + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> > +
> > + if (rq == llc_rq)
> > + continue;
> > + rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
> > + }
> > +#endif /* SMP */
> > +}
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 187ad5da5ef6..8b573dfaba33 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -576,6 +576,7 @@ struct cfs_rq {
> > #endif
> > #ifdef CONFIG_SMP
> > + struct shared_runq *shared_runq;
>
> I would suggest moving shared_runq into shared LLC sched domain,
> which is pointed by sd_llc_shared, as you might finally put the
> retrieval of shared_runq under RCU critical sections to handle
> domain re-building cases.
Yep, I have plans to take add support for domain re-building in v3.
I'll see whether it makes sense to put them into the shared LLC sched
domain, but at first glance it seems like a sane choice.
>
> Best Regards,
> Abel
Thanks for the review.
- David
>
> > /*
> > * CFS load tracking
> > */
> > @@ -2440,6 +2441,7 @@ extern void update_max_interval(void);
> > extern void init_sched_dl_class(void);
> > extern void init_sched_rt_class(void);
> > extern void init_sched_fair_class(void);
> > +extern void init_sched_fair_class_late(void);
> > extern void reweight_task(struct task_struct *p, int prio);
next prev parent reply other threads:[~2023-07-12 22:17 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-10 20:03 [PATCH v2 0/7] sched: Implement shared runqueue in CFS David Vernet
2023-07-10 20:03 ` [PATCH v2 1/7] sched: Expose move_queued_task() from core.c David Vernet
2023-07-10 20:03 ` [PATCH v2 2/7] sched: Move is_cpu_allowed() into sched.h David Vernet
2023-07-10 20:03 ` [PATCH v2 3/7] sched: Check cpu_active() earlier in newidle_balance() David Vernet
2023-07-10 20:03 ` [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls David Vernet
2023-07-11 9:45 ` Peter Zijlstra
2023-07-11 16:19 ` David Vernet
2023-07-12 8:39 ` Abel Wu
2023-07-12 21:34 ` David Vernet
2023-07-10 20:03 ` [PATCH v2 5/7] sched: Implement shared runqueue in CFS David Vernet
2023-07-11 10:18 ` Peter Zijlstra
2023-07-11 16:26 ` David Vernet
2023-07-12 6:00 ` Gautham R. Shenoy
2023-07-12 19:13 ` David Vernet
2023-07-12 10:47 ` Abel Wu
2023-07-12 22:16 ` David Vernet [this message]
2023-07-13 3:43 ` Abel Wu
2023-07-13 4:05 ` David Vernet
2023-07-13 7:58 ` Aaron Lu
2023-07-13 8:29 ` Peter Zijlstra
2023-07-10 20:03 ` [PATCH v2 6/7] sched: Shard per-LLC shared runqueues David Vernet
2023-07-11 10:49 ` Peter Zijlstra
2023-07-11 19:57 ` David Vernet
2023-07-12 10:06 ` Gautham R. Shenoy
2023-07-12 12:22 ` Peter Zijlstra
2023-07-10 20:03 ` [PATCH v2 7/7] sched: Move shared_runq to __{enqueue,dequeue}_entity() David Vernet
2023-07-11 10:51 ` Peter Zijlstra
2023-07-11 16:30 ` David Vernet
2023-07-11 11:42 ` [PATCH v2 0/7] sched: Implement shared runqueue in CFS Peter Zijlstra
2023-07-11 21:33 ` David Vernet
2023-07-21 9:12 ` Gautham R. Shenoy
2023-07-25 20:22 ` David Vernet
2023-08-02 6:32 ` Gautham R. Shenoy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230712221657.GF12207@maniforge \
--to=void@manifault.com \
--cc=aaron.lu@intel.com \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=clm@meta.com \
--cc=dietmar.eggemann@arm.com \
--cc=gautham.shenoy@amd.com \
--cc=juri.lelli@redhat.com \
--cc=kernel-team@meta.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=roman.gushchin@linux.dev \
--cc=rostedt@goodmis.org \
--cc=tj@kernel.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=wuyun.abel@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.