From: Peter Zijlstra <peterz@infradead.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Don <joshdon@google.com>,
Vineeth Pillai <vineethrp@google.com>,
Steven Rostedt <rostedt@goodmis.org>,
"Chen, Tim C" <tim.c.chen@intel.com>,
"Brown, Len" <len.brown@intel.com>,
LKML <linux-kernel@vger.kernel.org>,
"AubreyLi@google.com" <aubrey.intel@gmail.com>,
aubrey.li@linux.intel.com, Aaron Lu <aaron.lwe@gmail.com>,
"Hyser,Chris" <chris.hyser@oracle.com>,
Don Hiatt <dhiatt@digitalocean.com>,
ricardo.neri@intel.com, vincent.guittot@linaro.org,
joelaf@google.com
Subject: Re: [RFC] High latency with core scheduling
Date: Sat, 18 Dec 2021 01:01:02 +0100 [thread overview]
Message-ID: <20211218000102.GK16608@worktop.programming.kicks-ass.net> (raw)
In-Reply-To: <Ybvcu5RIwV+Vko09@google.com>
On Thu, Dec 16, 2021 at 07:41:31PM -0500, Joel Fernandes wrote:
> One of the issues we see is that the core rbtree is static when nothing in
> the tree goes to sleep or wakes up. This can cause the same task in the core
> rbtree to be repeatedly picked in pick_task().
>
> The below diff seems to improve the situation, could you please take a look?
> If it seems sane, we can go ahead and make it a formal patch to at least fix
> one of the known issues.
>
> The patch is simple, just remove the currently running task from the core rb
> tree as its vruntime is not really static. Add it back on preemption.
> note: This is against a 5.4 kernel, but the code is about the same and it's an RFC.
I think you'll find there are significant differences...
> note: The issue does not seem to happen without CGroups involved so perhaps
> something is wonky in cfs_prio_less() still. Peter?
that's weird... but it's also 00h30 am, so who knows :-)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c023a9a0c4ae..3c588ad05ab6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -200,7 +200,7 @@ static inline void dump_scrb(struct rb_node *root, int lvl, char *buf, int size)
> dump_scrb(root->rb_right, lvl+1, buf, size);
> }
>
> -static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> {
> struct rb_node *parent, **node;
> struct task_struct *node_task;
> @@ -212,6 +212,9 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> if (!p->core_cookie)
> return;
>
> + if (sched_core_enqueued(p))
> + return;
Are you actually hitting that? It feels wrong.
> node = &rq->core_tree.rb_node;
> parent = *node;
>
> @@ -232,7 +235,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> rb_insert_color(&p->core_node, &rq->core_tree);
> }
>
> -static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> {
> rq->core->core_task_seq++;
>
> @@ -4745,6 +4748,18 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
> return class_pick;
>
> cookie_pick = sched_core_find(rq, cookie);
> +
> + /*
> + * Currently running process might not be in the runqueue if fair class.
> + * If it is of the same cookie as cookie_pick and has more priority,
> + * then select it.
> + */
> + if (rq != this_rq() && !is_task_rq_idle(cookie_pick) && !is_task_rq_idle(rq->curr) &&
> + cookie_pick->core_cookie == rq->curr->core_cookie &&
> + prio_less(cookie_pick, rq->curr, in_fi)) {
guys, this indent style kills infants.
> + cookie_pick = rq->curr;
> + }
This is the part that doesn't apply... we completely rewrote the pick
loop. I think you're looking at a change in sched_core_find() now.
Basically it should check rq->curr against whatever it finds in the
core_tree, right?
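Something like the completely untested below, that is -- glossing over the
in_fi argument and over whether ->curr is still relevant by the time we get
here, and going from memory of what's in the tree now:

static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
{
        struct task_struct *p;
        struct rb_node *node;

        node = rb_find_first((void *)cookie, &rq->core_tree, rb_sched_core_cmp);
        /*
         * The idle task always matches any cookie!
         */
        if (!node)
                return idle_sched_class.pick_task(rq);

        p = __node_2_sc(node);

        /*
         * With your scheme current is not in the tree, so compare it
         * against whatever the tree gave us and keep the better one.
         */
        if (rq->curr != rq->idle && rq->curr->on_rq &&
            rq->curr->core_cookie == cookie &&
            prio_less(p, rq->curr, false))
                p = rq->curr;

        return p;
}

But don't take my word for it; check it against the current pick loop first.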
> +
> /*
> * If class > max && class > cookie, it is the highest priority task on
> * the core (so far) and it must be selected, otherwise we must go with
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 86cc67dd38e9..820c5cf4ecc1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1936,15 +1936,33 @@ struct sched_class {
> #endif
> };
>
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p);
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p);
> +
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> {
> WARN_ON_ONCE(rq->curr != prev);
> prev->sched_class->put_prev_task(rq, prev);
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(rq) && READ_ONCE(prev->state) != TASK_DEAD && prev->core_cookie && prev->on_rq) {
That TASK_DEAD thing is weird... do_task_dead() goes something like:
set_special_state(TASK_DEAD)
schedule()
  deactivate_task(prev)
    prev->on_rq = 0;
    dequeue_task()
      sched_core_dequeue() /* also wrong, see below */
      prev->sched_class->dequeue_task()
  ...
  next = pick_next_task(..,prev,..);
    put_prev_task()
      if (... && prev->on_rq /* false */)
        sched_core_enqueue()
Notably, the sched_core_dequeue() in dequeue_task() shouldn't happen
either, because prev is current and as such shouldn't have been enqueued
in the first place.
> + sched_core_enqueue(rq, prev);
> + }
> +#endif
> }
>
> static inline void set_next_task(struct rq *rq, struct task_struct *next)
> {
> next->sched_class->set_next_task(rq, next, false);
> +#ifdef CONFIG_SCHED_CORE
> + /*
> + * This task is going to run next and its vruntime will change.
> + * Remove it from core rbtree so as to not confuse the ordering
> + * in the rbtree when its vrun changes.
> + */
> + if (sched_core_enabled(rq) && next->core_cookie && next->on_rq) {
> + sched_core_dequeue(rq, next);
> + }
> +#endif
Anyway... *ouch* at the additional rb-tree ops, but I think you're right
about needing this :/
Just please, think through the whole enqueue/dequeue thing, because even
for an RFC this seems overly sloppy.
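FWIW, per the trace above prev->on_rq is already 0 by the time
put_prev_task() runs for a dying task, so the on_rq test alone should
cover the TASK_DEAD case; roughly (equally untested):

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
        WARN_ON_ONCE(rq->curr != prev);
        prev->sched_class->put_prev_task(rq, prev);
#ifdef CONFIG_SCHED_CORE
        /*
         * Stick the (no longer running) task back into the core tree;
         * a dying task has on_rq == 0 here, so no TASK_DEAD check needed.
         */
        if (sched_core_enabled(rq) && prev->core_cookie &&
            task_on_rq_queued(prev))
                sched_core_enqueue(rq, prev);
#endif
}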