All of lore.kernel.org
 help / color / mirror / Atom feed
From: Julien Desfossez <jdesfossez@digitalocean.com>
To: Aaron Lu <aaron.lu@linux.alibaba.com>
Cc: Vineeth Remanan Pillai <vpillai@digitalocean.com>,
	Phil Auld <pauld@redhat.com>,
	Nishanth Aravamudan <naravamudan@digitalocean.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	mingo@kernel.org, tglx@linutronix.de, pjt@google.com,
	torvalds@linux-foundation.org, linux-kernel@vger.kernel.org,
	subhra.mazumdar@oracle.com, fweisbec@gmail.com,
	keescook@chromium.org, kerrnel@google.com,
	Aaron Lu <aaron.lwe@gmail.com>,
	Aubrey Li <aubrey.intel@gmail.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [RFC PATCH v2 00/17] Core scheduling v2
Date: Mon, 6 May 2019 15:39:37 -0400	[thread overview]
Message-ID: <20190506193937.GA10264@sinkpad> (raw)
In-Reply-To: <20190429035320.GB128241@aaronlu>

On 29-Apr-2019 11:53:21 AM, Aaron Lu wrote:
> On Tue, Apr 23, 2019 at 06:45:27PM +0000, Vineeth Remanan Pillai wrote:
> > >> - Processes with different tags can still share the core
> > 
> > > I may have missed something... Could you explain this statement?
> > 
> > > This, to me, is the whole point of the patch series. If it's not
> > > doing this then ... what?
> > 
> > What I meant was, the patch needs some more work to be accurate.
> > There are some race conditions where the core violation can still
> > happen. In our testing, we saw around 1 to 5% of the time being
> > shared with incompatible processes. One example of this happening
> > is as follows(let cpu 0 and 1 be siblings):
> > - cpu 0 selects a process with a cookie
> > - cpu 1 selects a higher priority process without cookie
> > - Selection process restarts for cpu 0 and it might select a
> >   process with cookie but with lesser priority.
> > - Since it is lesser priority, the logic in pick_next_task
> >   doesn't compare again for the cookie(trusts pick_task) and
> >   proceeds.
> > 
> > This is one of the scenarios that we saw from traces, but there
> > might be other race conditions as well. Fix seems a little
> > involved and We are working on that.
> 
> This is what I have used to make sure no two unmatched tasks being
> scheduled on the same core: (on top of v1, I thinks it's easier to just
> show the diff instead of commenting on various places of the patches :-)

We imported this fix in v2 and made some small changes and optimizations
(with and without Peter’s fix from https://lkml.org/lkml/2019/4/26/658)
and in both cases, the performance problem where the core can end up
idle with tasks in its runqueues came back.

This is pretty easy to reproduce with a multi-file disk write benchmark.

Here is the patch based on your changes applied on v2 (on top of Peter’s
fix):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07f3f0c..e09fa25 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3653,6 +3653,13 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 }
 
 // XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
 static struct task_struct *
 pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
 {
@@ -3660,19 +3667,36 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	unsigned long cookie = rq->core->core_cookie;
 
 	class_pick = class->pick_task(rq);
-	if (!cookie)
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    core_prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If there is a cooke match here, return early.
+	 */
+	if (class_pick->core_cookie == cookie)
 		return class_pick;
 
 	cookie_pick = sched_core_find(rq, cookie);
-	if (!class_pick)
-		return cookie_pick;
 
 	/*
 	 * If class > max && class > cookie, it is the highest priority task on
 	 * the core (so far) and it must be selected, otherwise we must go with
 	 * the cookie pick in order to satisfy the constraint.
 	 */
-	if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
+	if (cpu_prio_less(cookie_pick, class_pick) &&
+	    (!max || core_prio_less(max, class_pick)))
 		return class_pick;
 
 	return cookie_pick;
@@ -3742,8 +3766,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		rq_i->core_pick = NULL;
 
-		if (i != cpu)
+		if (i != cpu) {
 			update_rq_clock(rq_i);
+
+			/*
+			 * If a sibling is idle, we can initiate an
+			 * unconstrained pick.
+			 */
+			if (is_idle_task(rq_i->curr) && prev_cookie)
+				prev_cookie = 0UL;
+		}
 	}
 
 	/*
@@ -3820,12 +3852,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
-			 * the slate and start over.
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
 			 *
 			 * NOTE: this is a linear max-filter and is thus bounded
 			 * in execution time.
 			 */
-			if (!max || core_prio_less(max, p)) {
+			if (!max || !cookie_match(max, p)) {
 				struct task_struct *old_max = max;
 
 				rq->core->core_cookie = p->core_cookie;
@@ -3833,7 +3867,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
 
-				if (old_max && !cookie_match(old_max, p)) {
+				if (old_max) {
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -3879,6 +3913,23 @@ next_class:;
 
 	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
+	/* make sure we didn't break L1TF */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+		if (i == cpu)
+			continue;
+
+		if (likely(cookie_match(next, rq_i->core_pick)))
+			continue;
+
+		trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+			     rq_i->cpu, rq_i->core_pick->comm,
+			     rq_i->core_pick->pid,
+			     rq_i->core_pick->core_cookie,
+			     rq_i->core->core_cookie);
+		WARN_ON_ONCE(1);
+	}
+
 done:
 	set_next_task(rq, next);
 	return next;


  reply	other threads:[~2019-05-06 19:39 UTC|newest]

Thread overview: 109+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-23 16:18 [RFC PATCH v2 00/17] Core scheduling v2 Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 01/17] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 02/17] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 03/17] sched: Wrap rq::lock access Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 04/17] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 05/17] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 06/17] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 07/17] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 08/17] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 09/17] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
2019-04-26 14:02   ` Peter Zijlstra
2019-04-26 16:10     ` Vineeth Remanan Pillai
2019-04-29  5:38   ` Aaron Lu
2019-04-23 16:18 ` [RFC PATCH v2 10/17] sched: Core-wide rq->lock Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
2019-04-24  0:08   ` Tim Chen
2019-04-24 20:43     ` Vineeth Remanan Pillai
2019-04-24 22:12       ` Tim Chen
2019-04-25 14:35       ` Phil Auld
2019-05-22 19:52         ` Vineeth Remanan Pillai
2019-04-24  0:17   ` Tim Chen
2019-04-24 20:43     ` Vineeth Remanan Pillai
2019-04-29  3:36   ` Aaron Lu
2019-05-10 13:06     ` Peter Zijlstra
2019-04-29  6:15   ` Aaron Lu
2019-05-01 23:27     ` Tim Chen
2019-05-03  0:06       ` Tim Chen
2019-05-08 15:49         ` Aubrey Li
2019-05-08 18:19           ` Subhra Mazumdar
2019-05-08 18:37             ` Subhra Mazumdar
2019-05-09  0:01               ` Aubrey Li
2019-05-09  0:25                 ` Subhra Mazumdar
2019-05-09  1:38                   ` Aubrey Li
2019-05-09  2:14                     ` Subhra Mazumdar
2019-05-09 15:10                       ` Aubrey Li
2019-05-09 17:50                         ` Subhra Mazumdar
2019-05-10  0:09                           ` Tim Chen
2019-04-23 16:18 ` [RFC PATCH v2 12/17] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
2019-04-25 14:26   ` Phil Auld
2019-04-26 14:13     ` Peter Zijlstra
2019-04-26 14:19       ` Phil Auld
2019-05-10 15:12   ` Julien Desfossez
2019-04-23 16:18 ` [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
2019-04-29  7:13   ` Aaron Lu
2019-05-18 15:37   ` Aubrey Li
2019-05-20 13:04     ` Phil Auld
2019-05-20 14:04       ` Vineeth Pillai
2019-05-21  8:19         ` Aubrey Li
2019-05-21 13:24           ` Vineeth Pillai
2019-04-23 16:18 ` [RFC PATCH v2 14/17] sched/fair: Add a few assertions Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 15/17] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
2019-04-23 23:46   ` Aubrey Li
2019-04-24 14:03     ` Vineeth Remanan Pillai
2019-04-24 14:05     ` Vineeth Remanan Pillai
2019-04-23 16:18 ` [RFC PATCH v2 16/17] sched: Wake up sibling if it has something to run Vineeth Remanan Pillai
2019-04-26 15:03   ` Peter Zijlstra
2019-04-29 12:36     ` Julien Desfossez
2019-04-23 16:18 ` [RFC PATCH v2 17/17] sched: Debug bits Vineeth Remanan Pillai
2019-05-17 17:18   ` Aubrey Li
2019-04-23 18:02 ` [RFC PATCH v2 00/17] Core scheduling v2 Phil Auld
2019-04-23 18:45   ` Vineeth Remanan Pillai
2019-04-29  3:53     ` Aaron Lu
2019-05-06 19:39       ` Julien Desfossez [this message]
2019-05-08  2:30         ` Aaron Lu
2019-05-08 17:49           ` Julien Desfossez
2019-05-09  2:11             ` Aaron Lu
2019-05-15 21:36               ` Vineeth Remanan Pillai
2019-04-23 23:25 ` Aubrey Li
2019-04-24 11:19   ` Vineeth Remanan Pillai
2019-05-15 21:39     ` Vineeth Remanan Pillai
2019-04-24 13:13 ` Aubrey Li
2019-04-24 14:00   ` Julien Desfossez
2019-04-25  3:15     ` Aubrey Li
2019-04-25  9:55       ` Ingo Molnar
2019-04-25 14:46         ` Mel Gorman
2019-04-25 18:53           ` Ingo Molnar
2019-04-25 18:59             ` Thomas Gleixner
2019-04-25 19:34               ` Ingo Molnar
2019-04-25 21:31             ` Mel Gorman
2019-04-26  8:42               ` Ingo Molnar
2019-04-26 10:43                 ` Mel Gorman
2019-04-26 18:37                   ` Subhra Mazumdar
2019-04-26 19:49                     ` Mel Gorman
2019-04-26  9:45               ` Ingo Molnar
2019-04-26 10:19                 ` Mel Gorman
2019-04-27  9:06                   ` Ingo Molnar
2019-04-26  9:51               ` Ingo Molnar
2019-04-26 14:15             ` Phil Auld
2019-04-26  2:18         ` Aubrey Li
2019-04-26  9:51           ` Ingo Molnar
2019-04-27  3:51         ` Aubrey Li
2019-04-27  9:17           ` Ingo Molnar
2019-04-27 14:04             ` Aubrey Li
2019-04-27 14:21               ` Ingo Molnar
2019-04-27 15:54                 ` Aubrey Li
2019-04-28  9:33                   ` Ingo Molnar
2019-04-28 10:29                     ` Aubrey Li
2019-04-28 12:17                       ` Ingo Molnar
2019-04-29  2:17                         ` Li, Aubrey
2019-04-29  6:14                           ` Ingo Molnar
2019-04-29 13:25                             ` Li, Aubrey
2019-04-29 15:39                               ` Phil Auld
2019-04-30  1:24                                 ` Aubrey Li
2019-04-29 16:00                               ` Ingo Molnar
2019-04-30  1:34                                 ` Aubrey Li
2019-04-30  4:42                                   ` Ingo Molnar
2019-05-18  0:58                                     ` Li, Aubrey
2019-05-18  1:08                                       ` Li, Aubrey
2019-04-25 14:36 ` Julien Desfossez

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190506193937.GA10264@sinkpad \
    --to=jdesfossez@digitalocean.com \
    --cc=aaron.lu@linux.alibaba.com \
    --cc=aaron.lwe@gmail.com \
    --cc=aubrey.intel@gmail.com \
    --cc=fweisbec@gmail.com \
    --cc=keescook@chromium.org \
    --cc=kerrnel@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=naravamudan@digitalocean.com \
    --cc=pauld@redhat.com \
    --cc=pawan.kumar.gupta@linux.intel.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=subhra.mazumdar@oracle.com \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=valentin.schneider@arm.com \
    --cc=vpillai@digitalocean.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.