* [PATCH RFC v2] Fix race window during idle cpu selection. @ 2017-12-05 19:09 Atish Patra 2017-12-05 19:09 ` [PATCH RFC v2] sched: Minimize the idle cpu selection race window Atish Patra 0 siblings, 1 reply; 3+ messages in thread From: Atish Patra @ 2017-12-05 19:09 UTC (permalink / raw) To: linux-kernel; +Cc: joelaf, jbacik, mingo, peterz, efault, urezki, atish.patra This patch tries to fix a possible race window that lets multiple tasks wakeup on the same cpu. schbench, uperf & Jackbench (Android) mostly showed improvements while none of the other standard benchmarks have shown any regression. changes from v1->v2: 1. Added the performance numbers. 2. Did not include the debug patch. It can still be found at https://lkml.org/lkml/2017/10/31/16 Atish Patra (1): sched: Minimize the idle cpu selection race window. kernel/sched/core.c | 4 ++++ kernel/sched/fair.c | 12 +++++++++--- kernel/sched/idle_task.c | 1 + kernel/sched/sched.h | 1 + 4 files changed, 15 insertions(+), 3 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 3+ messages in thread
* [PATCH RFC v2] sched: Minimize the idle cpu selection race window. 2017-12-05 19:09 [PATCH RFC v2] Fix race window during idle cpu selection Atish Patra @ 2017-12-05 19:09 ` Atish Patra 2018-02-07 13:59 ` Matt Fleming 0 siblings, 1 reply; 3+ messages in thread From: Atish Patra @ 2017-12-05 19:09 UTC (permalink / raw) To: linux-kernel; +Cc: joelaf, jbacik, mingo, peterz, efault, urezki, atish.patra Currently, multiple tasks can wakeup on same cpu from select_idle_sibiling() path in case they wakeup simulatenously and last ran on the same llc. This happens because an idle cpu is not updated until idle task is scheduled out. Any task waking during that period may potentially select that cpu for a wakeup candidate. Introduce a per cpu variable that is set as soon as a cpu is selected for wakeup for any task. This prevents from other tasks to select the same cpu again. Note: This does not close the race window but minimizes it to accessing the per-cpu variable. If two wakee tasks access the per cpu variable at the same time, they may select the same cpu again. But it minimizes the race window considerably. Here are some performance numbers: Hardware config: 20 core (40 hyperthreaded cpus) x86 box. uperf config:ping pong test on loopback interface with message size = 8k Baseline (4.14) Baseline +pcpu Mean stdev Mean stdev Improvement(%) 1 9.056 0.02 8.966 0.083 -0.993 2 17.664 0.13 17.448 0.303 -1.222 4 32.03 0.22 31.972 0.129 -0.181 8 58.198 0.31 58.588 0.198 0.670 16 101.018 0.67 100.056 0.455 -0.952 32 148.1 15.41 164.494 2.312 11.069 64 203.66 1.16 203.042 1.348 -0.3073 128 197.12 1.04 194.722 1.174 -1.2165 schbench config: message threads = 2; time = 180s, worker thread = 19 Baseline(4.14) Base+pcpu Max Latency Mean stdev Mean stdev Improvement(%) 16457 2046.51 12985.4 48.46 21.0949747828 Latency in usec(lower is better) Mean stdev Mean stdev Improvement(%) 50% 63 4.774 58 4 7.936 75% 81.6 1.2 79.4 0.8 2.696 90% 92.8 0.979 90.6 0.489 2.370 95% 102.6 1.624 99 0.894 3.508 99% 126.25 4.573 120.4 3.872 4.63 99.5% 2712.2 4772.883 133.6 5.571 95.074 Reported by Joel: Jackbench(Android) count 16304 (@60 fps, 4.5 minutes) Without patch With patch mean 5.196633 4.429641 (+14.75%) std 2.030054 2.310025 25% 5.606810 1.991017 (+64.48%) 50% 5.824013 5.716631 (+1.84%) 75% 5.987102 5.932751 (+0.90%) 95% 6.461230 6.301318 (+2.47%) 99% 9.828959 9.697076 (+1.34%) The details are at: https://lkml.org/lkml/2017/11/22/5 Suggested-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Atish Patra <atish.patra@oracle.com> --- kernel/sched/core.c | 4 ++++ kernel/sched/fair.c | 12 +++++++++--- kernel/sched/idle_task.c | 1 + kernel/sched/sched.h | 1 + 4 files changed, 15 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2288a14..d9d501c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3896,6 +3896,7 @@ int task_prio(const struct task_struct *p) return p->prio - MAX_RT_PRIO; } +DEFINE_PER_CPU(int, claim_wakeup); /** * idle_cpu - is a given CPU idle currently? * @cpu: the processor in question. @@ -3917,6 +3918,9 @@ int idle_cpu(int cpu) return 0; #endif + if (per_cpu(claim_wakeup, cpu)) + return 0; + return 1; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 13393bb..885023a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6077,8 +6077,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int idle = false; } - if (idle) + if (idle) { + per_cpu(claim_wakeup, core) = 1; return core; + } } /* @@ -6102,8 +6104,10 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t for_each_cpu(cpu, cpu_smt_mask(target)) { if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; - if (idle_cpu(cpu)) + if (idle_cpu(cpu)) { + per_cpu(claim_wakeup, cpu) = 1; return cpu; + } } return -1; @@ -6165,8 +6169,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; - if (idle_cpu(cpu)) + if (idle_cpu(cpu)) { + per_cpu(claim_wakeup, cpu) = 1; break; + } } time = local_clock() - time; diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c index 0c00172..64d6495 100644 --- a/kernel/sched/idle_task.c +++ b/kernel/sched/idle_task.c @@ -28,6 +28,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf { put_prev_task(rq, prev); update_idle_core(rq); + this_cpu_write(claim_wakeup, 0); schedstat_inc(rq->sched_goidle); return rq->idle; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8aa24b4..5f70b98 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1068,6 +1068,7 @@ DECLARE_PER_CPU(int, sd_llc_id); DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DECLARE_PER_CPU(struct sched_domain *, sd_numa); DECLARE_PER_CPU(struct sched_domain *, sd_asym); +DECLARE_PER_CPU(int, claim_wakeup); struct sched_group_capacity { atomic_t ref; -- 2.7.4 ^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH RFC v2] sched: Minimize the idle cpu selection race window. 2017-12-05 19:09 ` [PATCH RFC v2] sched: Minimize the idle cpu selection race window Atish Patra @ 2018-02-07 13:59 ` Matt Fleming 0 siblings, 0 replies; 3+ messages in thread From: Matt Fleming @ 2018-02-07 13:59 UTC (permalink / raw) To: Atish Patra; +Cc: linux-kernel, joelaf, jbacik, mingo, peterz, efault, urezki On Tue, 05 Dec, at 01:09:07PM, Atish Patra wrote: > Currently, multiple tasks can wakeup on same cpu from > select_idle_sibiling() path in case they wakeup simulatenously > and last ran on the same llc. This happens because an idle cpu > is not updated until idle task is scheduled out. Any task waking > during that period may potentially select that cpu for a wakeup > candidate. > > Introduce a per cpu variable that is set as soon as a cpu is > selected for wakeup for any task. This prevents from other tasks > to select the same cpu again. Note: This does not close the race > window but minimizes it to accessing the per-cpu variable. If two > wakee tasks access the per cpu variable at the same time, they may > select the same cpu again. But it minimizes the race window > considerably. > > Here are some performance numbers: I ran this patch through some tests here on the SUSE performance grid and there's a definite regression for Mike's personal favourite benchmark, tbench. Here are the results: vanilla 4.15-rc9 on the left, -rc9 plus this patch on the right. tbench4 4.15.0-rc9 4.15.0-rc9 vanillasched-minimize-idle-cpu-window Min mb/sec-1 484.50 ( 0.00%) 463.03 ( -4.43%) Min mb/sec-2 961.43 ( 0.00%) 959.35 ( -0.22%) Min mb/sec-4 1789.60 ( 0.00%) 1760.21 ( -1.64%) Min mb/sec-8 3518.51 ( 0.00%) 3471.47 ( -1.34%) Min mb/sec-16 5521.12 ( 0.00%) 5409.77 ( -2.02%) Min mb/sec-32 7268.61 ( 0.00%) 7491.29 ( 3.06%) Min mb/sec-64 14413.45 ( 0.00%) 14347.69 ( -0.46%) Min mb/sec-128 13501.84 ( 0.00%) 13413.82 ( -0.65%) Min mb/sec-192 13237.02 ( 0.00%) 13231.43 ( -0.04%) Hmean mb/sec-1 505.20 ( 0.00%) 485.81 ( -3.84%) Hmean mb/sec-2 973.12 ( 0.00%) 970.67 ( -0.25%) Hmean mb/sec-4 1835.22 ( 0.00%) 1788.54 ( -2.54%) Hmean mb/sec-8 3529.35 ( 0.00%) 3487.20 ( -1.19%) Hmean mb/sec-16 5531.16 ( 0.00%) 5437.43 ( -1.69%) Hmean mb/sec-32 7627.96 ( 0.00%) 8021.26 ( 5.16%) Hmean mb/sec-64 14441.20 ( 0.00%) 14395.08 ( -0.32%) Hmean mb/sec-128 13620.40 ( 0.00%) 13569.17 ( -0.38%) Hmean mb/sec-192 13265.26 ( 0.00%) 13263.98 ( -0.01%) Max mb/sec-1 510.30 ( 0.00%) 489.89 ( -4.00%) Max mb/sec-2 989.45 ( 0.00%) 976.10 ( -1.35%) Max mb/sec-4 1845.65 ( 0.00%) 1795.50 ( -2.72%) Max mb/sec-8 3574.03 ( 0.00%) 3547.56 ( -0.74%) Max mb/sec-16 5556.99 ( 0.00%) 5564.80 ( 0.14%) Max mb/sec-32 7678.18 ( 0.00%) 8098.63 ( 5.48%) Max mb/sec-64 14463.07 ( 0.00%) 14437.58 ( -0.18%) Max mb/sec-128 13659.67 ( 0.00%) 13602.65 ( -0.42%) Max mb/sec-192 13612.01 ( 0.00%) 13832.98 ( 1.62%) There's a nice little performance bump around the 32-client mark. Incidentally, my test machine has 2 NUMA nodes with 24 cpus (12 cores, 2 threads) each. So 32 clients is the point at which things no longer fit on a single node. It doesn't look like the regression is caused by the schedule() path being slightly longer (i.e. it's not a latency issue) because schbench results show improvements for the low-end: schbench 4.15.0-rc9 4.15.0-rc9 vanillasched-minimize-idle-cpu-window Lat 50.00th-qrtle-1 46.00 ( 0.00%) 36.00 ( 21.74%) Lat 75.00th-qrtle-1 49.00 ( 0.00%) 37.00 ( 24.49%) Lat 90.00th-qrtle-1 52.00 ( 0.00%) 38.00 ( 26.92%) Lat 95.00th-qrtle-1 56.00 ( 0.00%) 41.00 ( 26.79%) Lat 99.00th-qrtle-1 61.00 ( 0.00%) 46.00 ( 24.59%) Lat 99.50th-qrtle-1 63.00 ( 0.00%) 48.00 ( 23.81%) Lat 99.90th-qrtle-1 77.00 ( 0.00%) 64.00 ( 16.88%) Lat 50.00th-qrtle-2 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-2 47.00 ( 0.00%) 46.00 ( 2.13%) Lat 90.00th-qrtle-2 50.00 ( 0.00%) 49.00 ( 2.00%) Lat 95.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%) Lat 99.00th-qrtle-2 58.00 ( 0.00%) 57.00 ( 1.72%) Lat 99.50th-qrtle-2 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 99.90th-qrtle-2 72.00 ( 0.00%) 69.00 ( 4.17%) Lat 50.00th-qrtle-4 46.00 ( 0.00%) 45.00 ( 2.17%) Lat 75.00th-qrtle-4 49.00 ( 0.00%) 48.00 ( 2.04%) Lat 90.00th-qrtle-4 52.00 ( 0.00%) 51.00 ( 1.92%) Lat 95.00th-qrtle-4 55.00 ( 0.00%) 53.00 ( 3.64%) Lat 99.00th-qrtle-4 61.00 ( 0.00%) 59.00 ( 3.28%) Lat 99.50th-qrtle-4 63.00 ( 0.00%) 61.00 ( 3.17%) Lat 99.90th-qrtle-4 69.00 ( 0.00%) 74.00 ( -7.25%) Lat 50.00th-qrtle-8 48.00 ( 0.00%) 50.00 ( -4.17%) Lat 75.00th-qrtle-8 52.00 ( 0.00%) 54.00 ( -3.85%) Lat 90.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%) Lat 95.00th-qrtle-8 57.00 ( 0.00%) 61.00 ( -7.02%) Lat 99.00th-qrtle-8 64.00 ( 0.00%) 68.00 ( -6.25%) Lat 99.50th-qrtle-8 67.00 ( 0.00%) 72.00 ( -7.46%) Lat 99.90th-qrtle-8 81.00 ( 0.00%) 81.00 ( 0.00%) Lat 50.00th-qrtle-16 50.00 ( 0.00%) 47.00 ( 6.00%) Lat 75.00th-qrtle-16 59.00 ( 0.00%) 57.00 ( 3.39%) Lat 90.00th-qrtle-16 66.00 ( 0.00%) 65.00 ( 1.52%) Lat 95.00th-qrtle-16 69.00 ( 0.00%) 68.00 ( 1.45%) Lat 99.00th-qrtle-16 76.00 ( 0.00%) 75.00 ( 1.32%) Lat 99.50th-qrtle-16 79.00 ( 0.00%) 79.00 ( 0.00%) Lat 99.90th-qrtle-16 86.00 ( 0.00%) 89.00 ( -3.49%) Lat 50.00th-qrtle-23 52.00 ( 0.00%) 52.00 ( 0.00%) Lat 75.00th-qrtle-23 65.00 ( 0.00%) 65.00 ( 0.00%) Lat 90.00th-qrtle-23 75.00 ( 0.00%) 74.00 ( 1.33%) Lat 95.00th-qrtle-23 81.00 ( 0.00%) 79.00 ( 2.47%) Lat 99.00th-qrtle-23 95.00 ( 0.00%) 90.00 ( 5.26%) Lat 99.50th-qrtle-23 12624.00 ( 0.00%) 1050.00 ( 91.68%) Lat 99.90th-qrtle-23 15184.00 ( 0.00%) 13872.00 ( 8.64%) If you'd like to run these tests on your own machines they're all available at https://github.com/gormanm/mmtests.git. ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2018-02-07 13:59 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-12-05 19:09 [PATCH RFC v2] Fix race window during idle cpu selection Atish Patra 2017-12-05 19:09 ` [PATCH RFC v2] sched: Minimize the idle cpu selection race window Atish Patra 2018-02-07 13:59 ` Matt Fleming
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox