public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Li, Aubrey" <aubrey.li@linux.intel.com>,
	vincent.guittot@linaro.org, linux-kernel@vger.kernel.org,
	mingo@redhat.com, juri.lelli@redhat.com,
	valentin.schneider@arm.com, qais.yousef@arm.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, tim.c.chen@linux.intel.com,
	benbjiang@gmail.com
Subject: Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
Date: Fri, 8 Jan 2021 10:27:38 +0000	[thread overview]
Message-ID: <20210108102738.GB3592@techsingularity.net> (raw)
In-Reply-To: <20201215075911.GA3040@hirez.programming.kicks-ass.net>

On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > We compute the average cost of the total scan, but then use it as a
> > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > properly computing a per-cpu scan cost.
> > > 
> > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > not account that cost at all.
> > 
> > I'm a bit worried this may introduce a regression under heavy load.
> > The overhead of adding another cpu_clock() and calculation becomes 
> > significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.
> It's plain broken batshit.
> 
> We calculate the total scanning time (irrespective of how many CPUs we
> touched), and then use that calculate the number of cpus to scan. That's
> just daft.
> 
> After this patch we calculate the avg cost of scanning 1 cpu and use
> that to calculate how many cpus to scan. Which is coherent and sane.
> 
> Maybe it can be improved, but that's a completely different thing.

In general I agree with everything you said so lets talk about the last
sentence.

1. avg_scan_cost is now based on the average scan cost of a rq but
   avg_idle is still scaled to the domain size. This is a bit problematic
   because it's comparing scan cost of a single rq with the estimated
   average idle time of a domain. As a result, the scan depth can be much
   larger than it was before the patch and led to some regressions.

2. Accounting for the scan cost of success makes sense but there is a
   big difference between a scan that finds an idle CPU and one that fails.
   For failures, the scan cost is wasted CPU time where as a success
   means that an uncontested CPU is used. This can cause a search to be
   truncated earlier than it should be when the domain is lightly loaded.

The patch below attempts to deal with both of those points. The
remaining difficulty is the "fuzz" factor which is necessary to bring
large avg_idle values into a reasonable range for comparing against
avg_scan_cost. The max avg_idle value can be anything but generally the
ceiling is 2*sysctl_sched_migration_cost. I didn't quickly spot a good way
of mapping avg_idle to a range between 0 and sd->span_weight.  The obvious
one was (weight*avg_idle/(2*sched_migration_cost)) but it did not work very
well as avg_scan_cost accounts for the full cost of accessing a remote rq.
However, this could be revisited later after this series is sorted out.

Anyway, for the enumerated concerns, how about this on top for your
first patch? It seemed to work well for a few workloads on 3 machines
at least and I'd like to nail this down before considering the remaining
patches. The first run indicated that the first patch offset some of the
benefits of the other patches.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a2f208ada8..1e04a250e8ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6156,7 +6156,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	if (sched_feat(SIS_PROP)) {
-		u64 avg_cost, avg_idle, span_avg;
+		u64 avg_cost, avg_idle;
 
 		/*
 		 * Due to large variance we need a large fuzz factor;
@@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		 */
 		avg_idle = this_rq()->avg_idle / 512;
 		avg_cost = this_sd->avg_scan_cost + 1;
-
-		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
+		nr = div_u64(avg_idle, avg_cost);
+		if (nr < 4)
 			nr = 4;
 
 		time = cpu_clock(this);
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+			/* Adjust cost of a successful scan */
+			loops <<= 2;
+
 			break;
+		}
 
-		if (loops >= nr) {
+		if (++loops >= nr) {
 			cpu = -1;
 			break;
 		}
-		loops++;
 	}
 
 	if (sched_feat(SIS_PROP)) {

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2021-01-08 10:28 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
2020-12-14 16:48 ` [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
2020-12-15  3:36   ` Li, Aubrey
2020-12-15  7:59     ` Peter Zijlstra
2020-12-15 11:45       ` Mel Gorman
2020-12-15 12:13       ` Li, Aubrey
2021-01-08 10:27       ` Mel Gorman [this message]
2021-01-08 13:01         ` Qais Yousef
2021-01-08 13:47           ` Mel Gorman
2021-01-08 13:41         ` Vincent Guittot
2021-01-08 14:40           ` Mel Gorman
2021-01-08 15:10             ` Vincent Guittot
2021-01-08 16:14               ` Mel Gorman
2021-01-11 14:36                 ` Vincent Guittot
2021-01-11 15:58                   ` Mel Gorman
2021-01-08 19:45               ` Peter Zijlstra
2021-01-09 14:12                 ` Mel Gorman
2021-01-11 14:39                 ` Vincent Guittot
2021-01-08 19:49               ` Peter Zijlstra
2021-01-11 14:52                 ` Vincent Guittot
2021-01-08 20:21         ` Peter Zijlstra
2021-01-09 13:59           ` Mel Gorman
2020-12-14 16:48 ` [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
2020-12-23 13:31   ` Vincent Guittot
2020-12-14 16:48 ` [RFC][PATCH 3/5] sched/fair: Remove select_idle_smt() Peter Zijlstra
2020-12-14 16:48 ` [RFC][PATCH 4/5] sched/fair: Merge select_idle_core/cpu() Peter Zijlstra
2020-12-14 16:48 ` [RFC][PATCH 5/5] sched/fair: SIS_PROP the idle core scan Peter Zijlstra
2020-12-16 12:59 ` [RFC][PATCH 0/5] select_idle_sibling() wreckage Li, Aubrey
2020-12-16 18:07   ` Vincent Guittot
2020-12-23 13:23     ` Vincent Guittot
2021-01-04 15:40       ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210108102738.GB3592@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=aubrey.li@linux.intel.com \
    --cc=benbjiang@gmail.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=qais.yousef@arm.com \
    --cc=rostedt@goodmis.org \
    --cc=tim.c.chen@linux.intel.com \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox