* [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
@ 2016-08-05 8:21 Giovanni Gherdovich
2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich
0 siblings, 1 reply; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05 8:21 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
Giovanni Gherdovich
Following Peter Zijlstra's review, these are the differences with respect to v1:
* inclusion of the appropriate header file linux/prefetch.h
* factored the calls to prefetch() out into a separate function
* introduction of the local variable curr as a form of manual common
subexpression elimination (CSE)
* fixed Signed-off-by chain
* added a comment explaining why the prefetches are needed
Giovanni Gherdovich (1):
sched/cputime: Mitigate performance regression in
times()/clock_gettime()
kernel/sched/core.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
--
2.6.6
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-05 8:21 [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
@ 2016-08-05 8:21 ` Giovanni Gherdovich
2016-08-10 11:26 ` Ingo Molnar
2016-08-10 18:00 ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
0 siblings, 2 replies; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05 8:21 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
Giovanni Gherdovich

Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency") fixed a problem whereby clock_nanosleep() followed by
clock_gettime() could allow a task to wake early. It addressed the problem
by calling the scheduling class's update_curr() when the cputimer starts.

Said change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.

This patch mitigates the performance loss by prefetching data into the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.

Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.

pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.48        3.06 ( 11.83%)
      5           3.33        3.25 (  2.40%)
      8           3.37        3.26 (  3.30%)
     12           3.32        3.37 ( -1.60%)
     21           4.01        3.90 (  2.74%)
     30           3.63        3.36 (  7.41%)
     48           3.71        3.11 ( 16.27%)
     79           3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.65        3.25 ( 11.03%)
      5           3.45        3.17 (  7.92%)
      8           3.52        3.22 (  8.69%)
     12           3.29        3.36 ( -2.04%)
     21           4.07        3.92 (  3.78%)
     30           3.87        3.40 ( 12.17%)
     48           3.79        3.16 ( 16.61%)
     79           3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)

pound_times and pound_clock_gettime are two benchmarks included in the
MMTests framework. They launch a given number of threads which repeatedly
call times() or clock_gettime(). The results above can be reproduced by
cloning MMTests from github.com and running the "poundtime" workload:

    $ git clone https://github.com/gormanm/mmtests.git
    $ cd mmtests
    $ cp configs/config-global-dhp__workload_poundtime config
    $ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime" measuring the kernel currently running on
the machine. Once a new kernel is installed and the machine rebooted,
running again

    $ cd mmtests
    $ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output
with

    $ cd mmtests/work/log
    $ ../../compare-kernels.sh

The table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clarity.

The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The offending lines were
found to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c.

This patch prefetches the data from memory just before update_curr() is
called in the affected execution path.

A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of thread counts used above (a positive
gain is an improvement):

    threads   cycles before                 cycles after                  gain
      2        19,699,563,964 +-1.19%        17,358,917,517 +-1.85%      11.88%
      5        47,401,089,566 +-2.96%        45,103,730,829 +-0.97%       4.85%
      8        80,923,501,004 +-3.01%        71,419,385,977 +-0.77%      11.74%
     12       112,326,485,473 +-0.47%       110,371,524,403 +-0.47%       1.74%
     21       193,455,574,299 +-0.72%       180,120,667,904 +-0.36%       6.89%
     30       315,073,519,013 +-1.64%       271,222,225,950 +-1.29%      13.92%
     48       321,969,515,332 +-1.48%       273,353,977,321 +-1.16%      15.10%
     79       337,866,003,422 +-0.97%       289,462,481,538 +-1.05%      14.33%
    110       338,712,691,920 +-0.78%       290,574,233,170 +-0.77%      14.21%
    128       348,384,794,006 +-0.50%       292,691,648,206 +-0.66%      15.99%

A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):

    threads   L1 misses/total*100     L1 misses/total*100     gain
              before                  after
      2        7.43 +-4.90%            7.36 +-4.70%            0.94%
      5       13.09 +-4.74%           13.52 +-3.73%           -3.28%
      8       13.79 +-5.61%           12.90 +-3.27%            6.45%
     12       11.57 +-2.44%            8.71 +-1.40%           24.72%
     21       12.39 +-3.92%            9.97 +-1.84%           19.53%
     30       13.91 +-2.53%           11.73 +-2.28%           15.67%
     48       13.71 +-1.59%           12.32 +-1.97%           10.14%
     79       14.44 +-0.66%           13.40 +-1.06%            7.20%
    110       15.86 +-0.50%           14.46 +-0.59%            8.83%
    128       16.51 +-0.32%           15.06 +-0.78%            8.78%

As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).

pound_clock_gettime:

    threads   parent of        6e998916dfe3         4.7-rc7
              6e998916dfe3     itself
      2       2.23             3.68 ( -64.56%)      3.48 (-55.48%)
      5       2.83             3.78 ( -33.42%)      3.33 (-17.43%)
      8       2.84             4.31 ( -52.12%)      3.37 (-18.76%)
     12       3.09             3.61 ( -16.74%)      3.32 ( -7.17%)
     21       3.14             4.63 ( -47.36%)      4.01 (-27.71%)
     30       3.28             5.75 ( -75.37%)      3.63 (-10.80%)
     48       3.02             6.05 (-100.56%)      3.71 (-22.99%)
     79       2.88             6.30 (-118.90%)      3.75 (-30.26%)
    110       2.95             6.46 (-119.00%)      3.81 (-29.24%)
    128       3.05             6.42 (-110.08%)      3.88 (-27.04%)

pound_times:

    threads   parent of        6e998916dfe3         4.7-rc7
              6e998916dfe3     itself
      2       2.27             3.73 ( -64.71%)      3.65 (-61.14%)
      5       2.78             3.77 ( -35.56%)      3.45 (-23.98%)
      8       2.79             4.41 ( -57.71%)      3.52 (-26.05%)
     12       3.02             3.56 ( -17.94%)      3.29 ( -9.08%)
     21       3.10             4.61 ( -48.74%)      4.07 (-31.34%)
     30       3.33             5.75 ( -72.53%)      3.87 (-16.01%)
     48       2.96             6.06 (-105.04%)      3.79 (-28.10%)
     79       2.88             6.24 (-116.83%)      3.88 (-34.81%)
    110       2.98             6.37 (-114.08%)      3.90 (-31.12%)
    128       3.10             6.35 (-104.61%)      4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lpthread -lrt
    done

==== BEGIN pound_times.c ====

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

/* Each thread calls times() in a loop and checks that utime never goes back. */
void *pound (void *threadid)
{
	struct tms end;
	int oldutime = 0;
	int utime;
	int i;
	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		times(&end);
		utime = ((int)end.tms_utime - (int)start.tms_utime);
		if (oldutime > utime) {
			printf("utime decreased, was %d, now %d!\n", oldutime, utime);
		}
		oldutime = utime;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long i;
	times(&start);
	for (i = 0; i < NUM_THREADS; i++) {
		pthread_create (&th[i], NULL, pound, (void *)i);
	}
	/* Exit only the main thread; the process ends when all workers finish. */
	pthread_exit(NULL);
	return 0;
}

==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>

/* Each thread reads the process CPU-time clock in a loop. */
void *pound (void *threadid)
{
	struct timespec ts;
	int rc, i;
	unsigned long prev = 0, this = 0;

	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		if (rc < 0)
			perror("clock_gettime");
		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
		/* The timewarp check is deliberately disabled by the "0 &&". */
		if (0 && this < prev)
			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
		prev = this;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long rc, i;
	pid_t pgid;

	for (i = 0; i < NUM_THREADS; i++) {
		rc = pthread_create(&th[i], NULL, pound, (void *)i);
		/* pthread_create() returns an error number and does not set errno. */
		if (rc != 0)
			fprintf(stderr, "pthread_create: %s\n", strerror(rc));
	}

	pthread_exit(NULL);
	return 0;
}

==== END pound_clock_gettime.c ====

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
---
 kernel/sched/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..4500421 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/prefetch.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2965,6 +2966,23 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
+ * The function fair_sched_class.update_curr accesses the struct curr
+ * and its field curr->exec_start; when called from task_sched_runtime,
+ * we observe a high rate of cache misses in practice.
+ * Prefetching this data results in improved performance.
+ */
+static inline void prefetch_curr_exec_start(struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+#else
+	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
+#endif
+	prefetch(curr);
+	prefetch(&curr->exec_start);
+}
+
+/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -2998,6 +3016,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
-- 
2.6.6

^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich
@ 2016-08-10 11:26 ` Ingo Molnar
2016-08-10 13:02 ` Giovanni Gherdovich
2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-10 18:00 ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
1 sibling, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2016-08-10 11:26 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
linux-kernel, Mel Gorman

* Giovanni Gherdovich <ggherdovich@suse.cz> wrote:

> Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
> inconsistency") fixed a problem whereby clock_nanosleep() followed by
> clock_gettime() could allow a task to wake early. It addressed the problem
> by calling the scheduling class's update_curr() when the cputimer starts.
>
> Said change induced a considerable performance regression on the syscalls
> times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
> debuggers and applications that monitor their own performance that
> accidentally depend on the performance of these specific calls.
>
> This patch mitigates the performance loss by prefetching data into the CPU
> cache, as stalls due to cache misses appear to be where most time is spent
> in our benchmarks.
>
> Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
> box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
> variable number of threads, from 2 to 4*num_cpus; the results are in
> seconds and correspond to the average of 10 runs; the percentage gain is
> computed with (before-after)/before so a positive value is an improvement
> (it's faster). The improvement varies between a few percent for 5-20
> threads and more than 10% for 2 or >20 threads.
>
> pound_clock_gettime:
>
>     threads       4.7-rc7     patched 4.7-rc7
>     [num]         [secs]      [secs (percent)]
>       2           3.48        3.06 ( 11.83%)
>       5           3.33        3.25 (  2.40%)
>       8           3.37        3.26 (  3.30%)
>      12           3.32        3.37 ( -1.60%)
>      21           4.01        3.90 (  2.74%)
>      30           3.63        3.36 (  7.41%)
>      48           3.71        3.11 ( 16.27%)
>      79           3.75        3.16 ( 15.74%)
>     110           3.81        3.25 ( 14.80%)
>     128           3.88        3.31 ( 14.76%)

Nice detective work! I'm wondering, where do we stand if compared with a
pre-6e998916dfe3 kernel?

I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
make the result correct, just see what the performance gap is, roughly.

If there's still a significant gap then it might make sense to optimize this some
more.

Thanks,

Ingo

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-10 11:26 ` Ingo Molnar
@ 2016-08-10 13:02 ` Giovanni Gherdovich
2016-08-12 12:10 ` Stanislaw Gruszka
1 sibling, 0 replies; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-10 13:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
linux-kernel, Mel Gorman

Hello Ingo,

thank you for your reply.

Ingo Molnar <mingo@kernel.org>
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?

The data follows. A considerable part of the performance loss is recovered;
something is still on the table.

"3.18-pre-bug" is the parent of 6e998916dfe3, i.e. 6e998916dfe3^1
"3.18-bug" is the revision 6e998916dfe3 itself.

Figures are in seconds. Percentages refer to 3.18-pre-bug, negative = worse.

times()
threads   3.18-pre-bug      3.18-bug           4.7.0-rc7         4.7.0-rc7-patched
  2       2.27 (  0.00%)    3.73 ( -64.71%)    3.65 (-61.14%)    3.06 (-35.16%)
  5       2.78 (  0.00%)    3.77 ( -35.56%)    3.45 (-23.98%)    3.25 (-16.79%)
  8       2.79 (  0.00%)    4.41 ( -57.71%)    3.52 (-26.05%)    3.26 (-16.53%)
 12       3.02 (  0.00%)    3.56 ( -17.94%)    3.29 ( -9.08%)    3.37 (-11.74%)
 21       3.10 (  0.00%)    4.61 ( -48.74%)    4.07 (-31.34%)    3.90 (-25.89%)
 30       3.33 (  0.00%)    5.75 ( -72.53%)    3.87 (-16.01%)    3.36 ( -0.81%)
 48       2.96 (  0.00%)    6.06 (-105.04%)    3.79 (-28.10%)    3.11 ( -5.14%)
 79       2.88 (  0.00%)    6.24 (-116.83%)    3.88 (-34.81%)    3.16 ( -9.84%)
110       2.98 (  0.00%)    6.37 (-114.08%)    3.90 (-31.12%)    3.25 ( -9.07%)
128       3.10 (  0.00%)    6.35 (-104.61%)    4.00 (-28.87%)    3.31 ( -6.57%)

clock_gettime()
threads   3.18-pre-bug      3.18-bug           4.7.0-rc7         4.7.0-rc7-patched
  2       2.23 (  0.00%)    3.68 ( -64.56%)    3.48 (-55.48%)    3.25 (-45.41%)
  5       2.83 (  0.00%)    3.78 ( -33.42%)    3.33 (-17.43%)    3.17 (-12.03%)
  8       2.84 (  0.00%)    4.31 ( -52.12%)    3.37 (-18.76%)    3.22 (-13.43%)
 12       3.09 (  0.00%)    3.61 ( -16.74%)    3.32 ( -7.17%)    3.36 ( -8.47%)
 21       3.14 (  0.00%)    4.63 ( -47.36%)    4.01 (-27.71%)    3.92 (-24.68%)
 30       3.28 (  0.00%)    5.75 ( -75.37%)    3.63 (-10.80%)    3.40 ( -3.69%)
 48       3.02 (  0.00%)    6.05 (-100.56%)    3.71 (-22.99%)    3.16 ( -4.64%)
 79       2.88 (  0.00%)    6.30 (-118.90%)    3.75 (-30.26%)    3.28 (-13.93%)
110       2.95 (  0.00%)    6.46 (-119.00%)    3.81 (-29.24%)    3.38 (-14.69%)
128       3.05 (  0.00%)    6.42 (-110.08%)    3.88 (-27.04%)    3.38 (-10.70%)

Regards,
Giovanni Gherdovich

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-10 11:26 ` Ingo Molnar
2016-08-10 13:02 ` Giovanni Gherdovich
@ 2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-15 7:49 ` Giovanni Gherdovich
2016-08-15 9:13 ` Wanpeng Li
1 sibling, 2 replies; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-12 12:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra, Mike Galbraith,
linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 3629 bytes --]

Hi

On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?
>
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> make the result correct, just see what the performance gap is, roughly.
>
> If there's still a significant gap then it might make sense to optimize this some
> more.

I measured the performance of a (partial) revert on 4.7 using the MMTests
instructions from Giovanni, and also tested another possible fix (draft
version):

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	(void) task_sched_runtime(tsk);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += t->se.sum_exec_runtime;
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
---

The MMTests benchmark results are below (the full compare-kernels.sh output
is in the attachment):

vanilla-4.7       revert            prefetch          patch
4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)

5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)

The patch works because we update sum_exec_runtime on the current thread,
which ensures we see the proper sum_exec_runtime value on different CPUs. I
tested it with the reproducers from commits 6e998916dfe32 and d670ec13178d0;
the patch did not break them. I'm going to run some other tests.

The patch is a draft version for early review. task_sched_runtime() will be
simplified (since it is only called on the current thread) and possibly split
into two functions: one that calls update_curr() and another that returns
sum_exec_runtime (ensuring it is consistent on 32-bit arches).

Stanislaw

[-- Attachment #2: compare.txt --]
[-- Type: text/plain, Size: 27653 bytes --]

poundtime
                                      vanilla               rever             prefetc                 mas
                                          4.7              revert            prefetch                mask
Min      real-pound_clock_gettime-2     4.38 (    0.00%)    2.73 (   37.67%)    3.62 (   17.35%)    1.19 (   72.83%)
Min      real-pound_clock_gettime-5     5.40 (    0.00%)    4.76 (   11.85%)    4.49 (   16.85%)    0.99 (   81.67%)
Min      real-pound_clock_gettime-8     5.83 (    0.00%)    4.88 (   16.30%)    5.91 (   -1.37%)    0.88 (   84.91%)
Min      real-pound_clock_gettime-12    6.55 (    0.00%)    4.87 (   25.65%)    5.98 (    8.70%)    0.84 (   87.18%)
Min      real-pound_clock_gettime-21    7.11 (    0.00%)    5.10 (   28.27%)    6.63 (    6.75%)    0.85 (   88.05%)
Min      real-pound_clock_gettime-30    7.56 (    0.00%)    5.20 (   31.22%)    7.08 (    6.35%)    0.87 (   88.49%)
Min      real-pound_clock_gettime-48    7.78 (    0.00%)    5.24 (   32.65%)    7.20 (    7.46%)    0.92 (   88.17%)
Min      real-pound_clock_gettime-79    7.89 (    0.00%)    5.23 (   33.71%)    7.20 (    8.75%)    1.00 (   87.33%)
Min      real-pound_clock_gettime-96    7.88 (    0.00%)    5.24 (   33.50%)    7.29 (    7.49%)    1.09 (   86.17%)
Min      real-pound_times-2             4.87 (    0.00%)    3.19 (   34.50%)    4.00 (   17.86%)    2.06 (   57.70%)
Min      real-pound_times-5             5.59 (    0.00%)    3.91 (   30.05%)    4.61 (   17.53%)    3.61 (   35.42%)
Min      real-pound_times-8             5.74 (    0.00%)    4.88 (   14.98%)    5.80 (   -1.05%)    3.56 (   37.98%)
Min      real-pound_times-12            6.44 (    0.00%)    4.90 (   23.91%)    6.00 (    6.83%)    3.52 (   45.34%)
Min      real-pound_times-21            7.11 (    0.00%)    5.11 (   28.13%)    6.61 (    7.03%)    3.59 (   49.51%)
Min      real-pound_times-30            7.60 (    0.00%)    5.20 (   31.58%)    7.03 (    7.50%)    3.54 (   53.42%)
Min      real-pound_times-48            7.80 (    0.00%)    5.24 (   32.82%)    7.20 (    7.69%)    3.61 (   53.72%)
Min      real-pound_times-79            7.92 (    0.00%)    5.24 (   33.84%)    7.31 (    7.70%)    3.61 (   54.42%)
Min      real-pound_times-96            7.94 (    0.00%)    5.24 (   34.01%)    7.29 (    8.19%)    3.58 (   54.91%)
Min      syst-pound_clock_gettime-2     8.54 (    0.00%)    4.89 (   42.74%)    6.98 (   18.27%)    2.16 (   74.71%)
Min      syst-pound_clock_gettime-5    26.57 (    0.00%)   23.29 (   12.34%)   22.09 (   16.86%)    4.47 (   83.18%)
Min      syst-pound_clock_gettime-8    45.82 (    0.00%)   38.02 (   17.02%)   46.61 (   -1.72%)    6.44 (   85.95%)
Min      syst-pound_clock_gettime-12   77.23 (    0.00%)   56.61 (   26.70%)   69.25 (   10.33%)    9.34 (   87.91%)
Min      syst-pound_clock_gettime-21  147.44 (    0.00%)  103.97 (   29.48%)  134.76 (    8.60%)   15.12 (   89.74%)
Min      syst-pound_clock_gettime-30  176.07 (    0.00%)  117.81 (   33.09%)  162.77 (    7.55%)   15.95 (   90.94%)
Min      syst-pound_clock_gettime-48  182.93 (    0.00%)  119.92 (   34.44%)  168.06 (    8.13%)   19.82 (   89.17%)
Min      syst-pound_clock_gettime-79  186.13 (    0.00%)  123.31 (   33.75%)  170.34 (    8.48%)   22.90 (   87.70%)
Min      syst-pound_clock_gettime-96  187.05 (    0.00%)  124.22 (   33.59%)  172.67 (    7.69%)   25.19 (   86.53%)
Min      syst-pound_times-2             9.55 (    0.00%)    6.22 (   34.87%)    7.80 (   18.32%)    3.90 (   59.16%)
Min      syst-pound_times-5            27.68 (    0.00%)   19.24 (   30.49%)   22.76 (   17.77%)   17.56 (   36.56%)
Min      syst-pound_times-8            45.11 (    0.00%)   38.75 (   14.10%)   45.15 (   -0.09%)   27.77 (   38.44%)
Min      syst-pound_times-12           76.60 (    0.00%)   56.89 (   25.73%)   71.06 (    7.23%)   41.64 (   45.64%)
Min      syst-pound_times-21          145.25 (    0.00%)  102.48 (   29.45%)  136.15 (    6.27%)   72.98 (   49.76%)
Min      syst-pound_times-30          175.03 (    0.00%)  118.89 (   32.07%)  161.32 (    7.83%)   79.91 (   54.34%)
Min      syst-pound_times-48          183.61 (    0.00%)  121.06 (   34.07%)  167.26 (    8.90%)   83.24 (   54.66%)
Min      syst-pound_times-79          187.18 (    0.00%)  123.24 (   34.16%)  173.22 (    7.46%)   84.36 (   54.93%)
Min      syst-pound_times-96          188.88 (    0.00%)  124.04 (   34.33%)  173.52 (    8.13%)   83.02 (   56.05%)
Amean    real-pound_clock_gettime-2     4.74 (    0.00%)    3.04 (   35.93%)    4.09 (   13.81%)    1.30 (   72.59%)
Amean    real-pound_clock_gettime-5     5.49 (    0.00%)    5.00 (    8.97%)    5.34 (    2.72%)    1.03 (   81.16%)
Amean    real-pound_clock_gettime-8     6.12 (    0.00%)    4.91 (   19.73%)    5.97 (    2.40%)    0.90 (   85.27%)
Amean    real-pound_clock_gettime-12    6.68 (    0.00%)    4.90 (   26.66%)    6.02 (    9.75%)    0.88 (   86.89%)
Amean    real-pound_clock_gettime-21    7.21 (    0.00%)    5.13 (   28.85%)    6.70 (    7.09%)    0.87 (   87.91%)
Amean    real-pound_clock_gettime-30    7.66 (    0.00%)    5.22 (   31.80%)    7.17 (    6.39%)    0.92 (   88.01%)
Amean    real-pound_clock_gettime-48    7.91 (    0.00%)    5.36 (   32.22%)    7.30 (    7.72%)    0.95 (   87.97%)
Amean    real-pound_clock_gettime-79    7.95 (    0.00%)    5.35 (   32.73%)    7.34 (    7.66%)    1.06 (   86.66%)
Amean    real-pound_clock_gettime-96    8.00 (    0.00%)    5.33 (   33.31%)    7.38 (    7.73%)    1.13 (   85.82%)
Amean    real-pound_times-2             5.61 (    0.00%)    3.55 (   36.76%)    4.53 (   19.23%)    2.29 (   59.28%)
Amean    real-pound_times-5             5.66 (    0.00%)    4.32 (   23.79%)    4.75 (   16.18%)    3.65 (   35.46%)
Amean    real-pound_times-8             5.98 (    0.00%)    4.97 (   16.87%)    5.96 (    0.35%)    3.62 (   39.40%)
Amean    real-pound_times-12            6.58 (    0.00%)    4.94 (   24.93%)    6.04 (    8.32%)    3.63 (   44.89%)
Amean    real-pound_times-21            7.19 (    0.00%)    5.18 (   28.01%)    6.68 (    7.13%)    3.65 (   49.22%)
Amean    real-pound_times-30            7.67 (    0.00%)    5.27 (   31.29%)    7.16 (    6.63%)    3.62 (   52.76%)
Amean    real-pound_times-48            7.88 (    0.00%)    5.36 (   31.98%)    7.28 (    7.58%)    3.65 (   53.71%)
Amean    real-pound_times-79            7.99 (    0.00%)    5.39 (   32.52%)    7.40 (    7.42%)    3.65 (   54.25%)
Amean    real-pound_times-96            8.01 (    0.00%)    5.35 (   33.20%)    7.36 (    8.09%)    3.64 (   54.49%)
Amean    syst-pound_clock_gettime-2     9.22 (    0.00%)    5.45 (   40.95%)    7.90 (   14.32%)    2.34 (   74.66%)
Amean    syst-pound_clock_gettime-5    27.03 (    0.00%)   24.21 (   10.40%)   26.24 (    2.90%)    4.73 (   82.48%)
Amean    syst-pound_clock_gettime-8    48.33 (    0.00%)   38.40 (   20.55%)   47.11 (    2.52%)    6.64 (   86.25%)
Amean    syst-pound_clock_gettime-12   78.93 (    0.00%)   57.30 (   27.41%)   71.04 (   10.00%)    9.69 (   87.72%)
Amean    syst-pound_clock_gettime-21  149.27 (    0.00%)  105.34 (   29.43%)  138.19 (    7.42%)   16.50 (   88.95%)
Amean    syst-pound_clock_gettime-30  178.36 (    0.00%)  119.83 (   32.82%)  166.75 (    6.51%)   18.67 (   89.53%)
Amean    syst-pound_clock_gettime-48  185.77 (    0.00%)  124.80 (   32.82%)  171.14 (    7.88%)   21.12 (   88.63%)
Amean    syst-pound_clock_gettime-79  188.17 (    0.00%)  126.34 (   32.86%)  173.99 (    7.53%)   24.07 (   87.21%)
Amean    syst-pound_clock_gettime-96  190.24 (    0.00%)  126.63 (   33.44%)  175.32 (    7.84%)   26.12 (   86.27%)
Amean    syst-pound_times-2            11.02 (    0.00%)    6.91 (   37.27%)    8.85 (   19.68%)    4.36 (   60.45%)
Amean    syst-pound_times-5            27.99 (    0.00%)   21.31 (   23.88%)   23.42 (   16.32%)   17.95 (   35.87%)
Amean    syst-pound_times-8            47.33 (    0.00%)   39.27 (   17.04%)   47.16 (    0.35%)   28.56 (   39.66%)
Amean    syst-pound_times-12           78.24 (    0.00%)   58.26 (   25.55%)   71.55 (    8.55%)   42.78 (   45.32%)
Amean    syst-pound_times-21          148.75 (    0.00%)  106.28 (   28.55%)  138.22 (    7.08%)   74.25 (   50.09%)
Amean    syst-pound_times-30          177.74 (    0.00%)  121.16 (   31.83%)  166.70 (    6.21%)   81.82 (   53.96%)
Amean    syst-pound_times-48          184.85 (    0.00%)  125.37 (   32.18%)  170.87 (    7.56%)   84.20 (   54.45%)
Amean    syst-pound_times-79          189.50 (    0.00%)  127.45 (   32.74%)  175.58 (    7.34%)   86.01 (   54.61%)
Amean    syst-pound_times-96          190.56 (    0.00%)  127.11 (   33.30%)  175.08 (    8.12%)   86.03 (   54.85%)
Stddev   real-pound_clock_gettime-2     0.25 (    0.00%)    0.27 (   -7.76%)    0.41 (  -65.62%)    0.10 (   60.73%)
Stddev   real-pound_clock_gettime-5     0.07 (    0.00%)    0.09 (  -35.10%)    0.51 ( -674.46%)    0.05 (   26.28%)
Stddev   real-pound_clock_gettime-8     0.28 (    0.00%)    0.02 (   92.09%)    0.04 (   86.10%)    0.02 (   93.65%)
Stddev   real-pound_clock_gettime-12    0.08 (    0.00%)    0.02 (   78.31%)    0.04 (   52.02%)    0.02 (   78.95%)
Stddev   real-pound_clock_gettime-21    0.06 (    0.00%)    0.02 (   68.54%)    0.11 (  -70.01%)    0.01 (   78.27%)
Stddev   real-pound_clock_gettime-30    0.05 (    0.00%)    0.01 (   75.00%)    0.10 (  -98.93%)    0.04 (   20.82%)
Stddev   real-pound_clock_gettime-48    0.09 (    0.00%)    0.19 ( -106.51%)    0.08 (   15.24%)    0.04 (   58.70%)
Stddev   real-pound_clock_gettime-79    0.03 (    0.00%)    0.10 ( -191.56%)    0.08 ( -138.02%)    0.04 (  -21.18%)
Stddev   real-pound_clock_gettime-96    0.05 (    0.00%)    0.08 (  -56.69%)    0.07 (  -21.04%)    0.04 (   31.40%)
Stddev   real-pound_times-2             0.55 (    0.00%)    0.25 (   53.82%)    0.38 (   30.80%)    0.14 (   74.19%)
Stddev   real-pound_times-5             0.06 (    0.00%)    0.28 ( -358.77%)    0.13 ( -108.26%)    0.03 (   54.64%)
Stddev   real-pound_times-8             0.25 (    0.00%)    0.04 (   83.52%)    0.06 (   76.99%)    0.06 (   76.94%)
Stddev   real-pound_times-12            0.09 (    0.00%)    0.05 (   41.52%)    0.02 (   77.55%)    0.04 (   51.60%)
Stddev   real-pound_times-21            0.06 (    0.00%)    0.15 ( -141.91%)    0.11 (  -74.22%)    0.03 (   48.73%)
Stddev   real-pound_times-30            0.06 (    0.00%)    0.14 ( -129.04%)    0.10 (  -66.59%)    0.04 (   30.36%)
Stddev   real-pound_times-48            0.05 (    0.00%)    0.13 ( -151.20%)    0.07 (  -37.30%)    0.02 (   54.64%)
Stddev   real-pound_times-79            0.04 (    0.00%)    0.11 ( -205.48%)    0.07 (  -97.82%)    0.03 (   28.17%)
Stddev   real-pound_times-96            0.05 (    0.00%)    0.05 (   -1.83%)    0.04 (   24.17%)    0.04 (   20.00%)
Stddev   syst-pound_clock_gettime-2     0.47 (    0.00%)    0.45 (    4.96%)    0.79 (  -66.33%)    0.18 (   61.36%)
Stddev   syst-pound_clock_gettime-5     0.32 (    0.00%)    0.39 (  -20.09%)    2.49 ( -666.63%)    0.25 (   21.71%)
Stddev   syst-pound_clock_gettime-8     2.25 (    0.00%)    0.26 (   88.54%)    0.40 (   82.10%)    0.17 (   92.55%)
Stddev   syst-pound_clock_gettime-12    1.23 (    0.00%)    0.43 (   64.59%)    0.73 (   40.82%)    0.19 (   84.58%)
Stddev   syst-pound_clock_gettime-21    1.15 (    0.00%)    1.06 (    7.62%)    2.64 ( -129.56%)    0.66 (   42.45%)
Stddev   syst-pound_clock_gettime-30    1.34 (    0.00%)    1.26 (    6.25%)    2.69 (  -99.81%)    1.58 (  -17.86%)
Stddev   syst-pound_clock_gettime-48    2.52 (    0.00%)    4.85 (  -92.44%)    2.12 (   15.94%)    1.08 (   57.23%)
Stddev   syst-pound_clock_gettime-79    1.22 (    0.00%)    2.51 ( -105.82%)    1.99 (  -62.56%)    0.96 (   21.62%)
Stddev   syst-pound_clock_gettime-96    1.54 (    0.00%)    2.21 (  -43.34%)    1.74 (  -12.67%)    0.80 (   48.24%)
Stddev   syst-pound_times-2             1.09 (    0.00%)    0.50 (   53.61%)    0.76 (   30.43%)    0.28 (   74.11%)
Stddev   syst-pound_times-5             0.30 (    0.00%)    1.41 ( -367.82%)    0.65 ( -115.62%)    0.21 (   29.66%)
Stddev   syst-pound_times-8             2.12 (    0.00%)    0.27 (   87.24%)    0.71 (   66.44%)    0.55 (   73.94%)
Stddev   syst-pound_times-12            1.03 (    0.00%)    0.74 (   27.70%)    0.37 (   64.41%)    0.47 (   54.81%)
Stddev   syst-pound_times-21            1.60 (    0.00%)    3.07 (  -92.49%)    2.30 (  -43.99%)    0.93 (   41.93%)
Stddev   syst-pound_times-30            1.75 (    0.00%)    3.05 (  -74.55%)    2.84 (  -62.67%)    1.17 (   32.95%)
Stddev   syst-pound_times-48            0.79 (    0.00%)    3.36 ( -327.41%)    2.51 ( -219.14%)    0.51 (   34.63%)
Stddev   syst-pound_times-79            1.08 (    0.00%)    2.77 ( -156.12%)    1.84 (  -70.34%)    0.86 (   20.82%)
Stddev   syst-pound_times-96            1.19 (    0.00%)    1.35 (  -13.61%)    1.01 (   15.16%)    1.29 (   -8.56%)
CoeffVar real-pound_clock_gettime-2     5.19 (    0.00%)    8.73 (  -68.19%)    9.97 (  -92.16%)    7.43 (  -43.23%)
CoeffVar real-pound_clock_gettime-5     1.19 (    0.00%)    1.77 (  -48.40%)    9.49 ( -696.07%)    4.66 ( -291.28%)
CoeffVar real-pound_clock_gettime-8     4.53 (    0.00%)    0.45 (   90.14%)    0.64 (   85.76%)    1.95 (   56.89%)
CoeffVar real-pound_clock_gettime-12    1.24 (    0.00%)    0.37 (   70.42%)    0.66 (   46.83%)    2.00 (  -60.60%)
CoeffVar real-pound_clock_gettime-21    0.88 (    0.00%)    0.39 (   55.78%)    1.61 (  -82.98%)    1.58 (  -79.84%)
CoeffVar real-pound_clock_gettime-30    0.68 (    0.00%)    0.25 (   63.35%)    1.44 ( -112.50%)    4.49 ( -560.29%)
CoeffVar real-pound_clock_gettime-48    1.18 (    0.00%)    3.61 ( -204.68%)    1.09 (    8.14%)    4.06 ( -243.23%)
CoeffVar real-pound_clock_gettime-79    0.43 (    0.00%)    1.85 ( -333.44%)    1.10 ( -157.77%)    3.87 ( -808.42%)
CoeffVar real-pound_clock_gettime-96    0.68 (    0.00%)    1.59 ( -134.97%)    0.89 (  -31.18%)    3.28 ( -383.77%)
CoeffVar real-pound_times-2             9.79 (    0.00%)    7.15 (   26.98%)    8.39 (   14.33%)    6.21 (   36.61%)
CoeffVar real-pound_times-5             1.06 (    0.00%)    6.39 ( -501.98%)    2.64 ( -148.46%)    0.75 (   29.71%)
CoeffVar real-pound_times-8             4.24 (    0.00%)    0.84 (   80.17%)    0.98 (   76.91%)    1.61 (   61.95%)
CoeffVar real-pound_times-12            1.29 (    0.00%)    1.01 (   22.11%)    0.32 (   75.51%)    1.14 (   12.18%)
CoeffVar real-pound_times-21            0.87 (    0.00%)    2.91 ( -236.03%)    1.63 (  -87.60%)    0.87 (   -0.97%)
CoeffVar real-pound_times-30            0.78 (    0.00%)    2.62 ( -233.35%)    1.40 (  -78.41%)    1.16 (  -47.41%)
CoeffVar real-pound_times-48            0.65 (    0.00%)    2.40 ( -269.32%)    0.97 (  -48.56%)    0.64 (    2.00%)
CoeffVar real-pound_times-79            0.45 (    0.00%)    2.03 ( -352.70%)    0.96 ( -113.68%)    0.71 (  -57.00%)
CoeffVar real-pound_times-96            0.61 (    0.00%)    0.93 (  -52.43%)    0.50 (   17.50%)    1.07 (  -75.79%)
CoeffVar syst-pound_clock_gettime-2     5.12 (    0.00%)    8.25 (  -60.95%)    9.95 (  -94.12%)    7.81 (  -52.47%)
CoeffVar syst-pound_clock_gettime-5     1.20 (    0.00%)    1.61 (  -34.04%)    9.48 ( -689.57%)    5.37 ( -346.99%)
CoeffVar syst-pound_clock_gettime-8     4.66 (    0.00%)    0.67 (   85.58%)    0.86 (   81.64%)    2.53 (   45.79%)
CoeffVar syst-pound_clock_gettime-12    1.56 (    0.00%)    0.76 (   51.21%)    1.02 (   34.25%)    1.95 (  -25.60%)
CoeffVar syst-pound_clock_gettime-21    0.77 (    0.00%)    1.01 (  -30.89%)    1.91 ( -147.96%)    4.01 ( -420.63%)
CoeffVar syst-pound_clock_gettime-30    0.75 (    0.00%)    1.05 (  -39.54%)    1.61 ( -113.72%)    8.48 (-1026.12%)
CoeffVar syst-pound_clock_gettime-48    1.36 (    0.00%)    3.89 ( -186.46%)    1.24 (    8.75%)    5.11 ( -276.18%)
CoeffVar syst-pound_clock_gettime-79    0.65 (    0.00%)    1.99 ( -206.55%)    1.14 (  -75.81%)    3.98 ( -512.73%)
CoeffVar syst-pound_clock_gettime-96    0.81 (    0.00%)    1.74 ( -115.35%)    0.99 (  -22.26%)    3.05 ( -277.01%)
CoeffVar syst-pound_times-2             9.86 (    0.00%)    7.29 (   26.04%)    8.54 (   13.39%)    6.45 (   34.55%)
CoeffVar syst-pound_times-5             1.08 (    0.00%)    6.62 ( -514.60%)    2.78 ( -157.67%)    1.18 (   -9.68%)
CoeffVar syst-pound_times-8             4.48 (    0.00%)    0.69 (   84.62%)    1.51 (   66.32%)    1.94 (   56.81%)
CoeffVar syst-pound_times-12            1.32 (    0.00%)    1.28 (    2.89%)    0.51 (   61.08%)    1.09 (   17.35%)
CoeffVar syst-pound_times-21            1.07 (    0.00%)    2.89 ( -169.42%)    1.66 (  -54.96%)    1.25 (  -16.34%)
CoeffVar syst-pound_times-30            0.98 (    0.00%)    2.52 ( -156.06%)    1.71 (  -73.43%)    1.43 (  -45.64%)
CoeffVar syst-pound_times-48            0.43 (    0.00%)    2.68 ( -530.20%)    1.47 ( -245.25%)    0.61 (  -43.50%)
CoeffVar syst-pound_times-79            0.57 (    0.00%)    2.17 ( -280.81%)    1.05 (  -83.84%)    1.00 (  -74.46%)
CoeffVar syst-pound_times-96            0.63 (    0.00%)    1.07 (  -70.33%)    0.58 (    7.66%)    1.50 ( -140.44%)
Max      real-pound_clock_gettime-2     5.10 (    0.00%)    3.56 (   30.20%)    4.98 (    2.35%)    1.47 (   71.18%)
Max      real-pound_clock_gettime-5     5.59 (    0.00%)    5.10 (    8.77%)    6.00 (   -7.33%)    1.17 (   79.07%)
Max      real-pound_clock_gettime-8     6.82 (    0.00%)    4.95 (   27.42%)    6.02 (   11.73%)    0.93 (   86.36%)
Max      real-pound_clock_gettime-12    6.82 (    0.00%)    4.93 (   27.71%)    6.13 (   10.12%)    0.90 (   86.80%)
Max      real-pound_clock_gettime-21    7.33 (    0.00%)    5.17 (   29.47%)    7.01 (    4.37%)    0.89 (   87.86%)
Max      real-pound_clock_gettime-30    7.71 (    0.00%)    5.24 (   32.04%)    7.38 (    4.28%)    1.00 (   87.03%)
Max      real-pound_clock_gettime-48    8.11 (    0.00%)    5.86 (   27.74%)    7.47 (    7.89%)    1.05 (   87.05%)
Max      real-pound_clock_gettime-79    8.03 (    0.00%)    5.53 (   31.13%)    7.48 (    6.85%)    1.13 (   85.93%)
Max      real-pound_clock_gettime-96    8.05 (    0.00%)    5.55 (   31.06%)    7.51 (    6.71%)    1.21 (   84.97%)
Max      real-pound_times-2             6.66 (    0.00%)    3.89 (   41.59%)    5.23 (   21.47%)    2.56 (   61.56%)
Max      real-pound_times-5             5.77 (    0.00%)    4.96 (   14.04%)    5.01 (   13.17%)    3.69 (   36.05%)
Max      real-pound_times-8             6.42 (    0.00%)    5.04 (   21.50%)    6.02 (    6.23%)    3.72 (   42.06%)
Max      real-pound_times-12            6.69 (    0.00%)    5.07 (   24.22%)    6.07 (    9.27%)    3.67 (   45.14%)
Max      real-pound_times-21            7.32 (    0.00%)    5.63 (   23.09%)    7.00 (    4.37%)    3.68 (   49.73%)
Max      real-pound_times-30            7.78 (    0.00%)    5.68 (   26.99%)    7.36 (    5.40%)    3.66 (   52.96%)
Max      real-pound_times-48            7.98 (    0.00%)    5.58 (   30.08%)    7.41 (    7.14%)    3.68 (   53.88%)
Max      real-pound_times-79            8.05 (    0.00%)    5.61 (   30.31%)    7.53 (    6.46%)    3.69 (   54.16%)
Max      real-pound_times-96            8.08 (    0.00%)    5.42 (   32.92%)    7.42 (    8.17%)    3.71 (   54.08%)
Max      syst-pound_clock_gettime-2     9.91 (    0.00%)    6.30 (   36.43%)    9.64 (    2.72%)    2.68 (   72.96%)
Max      syst-pound_clock_gettime-5    27.53 (    0.00%)   24.74 (   10.13%)   29.35 (   -6.61%)    5.43 (   80.28%)
Max      syst-pound_clock_gettime-8    53.96 (    0.00%)   38.82 (   28.06%)   47.75 (   11.51%)    6.99 (   87.05%)
Max      syst-pound_clock_gettime-12   81.09 (    0.00%)   57.99 (   28.49%)   71.93 (   11.30%)   10.04 (   87.62%)
Max      syst-pound_clock_gettime-21  151.50 (    0.00%)  107.03 (   29.35%)  145.33 (    4.07%)   17.48 (   88.46%)
Max      syst-pound_clock_gettime-30  179.94 (    0.00%)  121.68 (   32.38%)  172.10 (    4.36%)   21.29 (   88.17%)
Max      syst-pound_clock_gettime-48  191.29 (    0.00%)  136.82 (   28.48%)  174.84 (    8.60%)   23.80 (   87.56%)
Max      syst-pound_clock_gettime-79  190.22 (    0.00%)  130.28 (   31.51%)  177.26 (    6.81%)   25.71 (   86.48%)
Max      syst-pound_clock_gettime-96  192.02 (    0.00%)  132.27 (   31.12%)  178.26 (    7.17%)   27.66 (   85.60%)
Max      syst-pound_times-2            13.10 (    0.00%)    7.57 (   42.21%)   10.21 (   22.06%)    4.89 (   62.67%)
Max      syst-pound_times-5            28.56 (    0.00%)
24.55 ( 14.04%) 24.80 ( 13.17%) 18.20 ( 36.27%) Max syst-pound_times-8 50.89 ( 0.00%) 39.54 ( 22.30%) 47.78 ( 6.11%) 29.45 ( 42.13%) Max syst-pound_times-12 79.85 ( 0.00%) 59.80 ( 25.11%) 72.21 ( 9.57%) 43.27 ( 45.81%) Max syst-pound_times-21 151.33 ( 0.00%) 115.02 ( 23.99%) 144.60 ( 4.45%) 75.85 ( 49.88%) Max syst-pound_times-30 180.79 ( 0.00%) 130.12 ( 28.03%) 171.98 ( 4.87%) 83.31 ( 53.92%) Max syst-pound_times-48 186.61 ( 0.00%) 130.89 ( 29.86%) 174.40 ( 6.54%) 84.85 ( 54.53%) Max syst-pound_times-79 190.96 ( 0.00%) 133.09 ( 30.30%) 179.58 ( 5.96%) 87.17 ( 54.35%) Max syst-pound_times-96 192.42 ( 0.00%) 128.95 ( 32.99%) 177.09 ( 7.97%) 87.82 ( 54.36%) vanilla rever prefetc mas 4.7 revert prefetch mask User 54.91 73.30 56.08 47.56 System 21115.14 14616.16 19553.36 6360.52 Elapsed 1247.71 890.24 1149.26 409.20 vanilla rever prefetc mas 4.7 revert prefetch mask Minor Faults 291321 267632 324632 274236 Major Faults 196 272 279 279 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Allocation stalls 0 0 0 0 DMA allocs 0 0 0 0 DMA32 allocs 12836 11773 23439 21745 Normal allocs 252492 245667 302327 270404 Movable allocs 0 0 0 0 Direct pages scanned 0 0 0 0 Kswapd pages scanned 0 0 0 0 Kswapd pages reclaimed 0 0 0 0 Direct pages reclaimed 0 0 0 0 Kswapd efficiency 100% 100% 100% 100% Kswapd velocity 0.000 0.000 0.000 0.000 Direct efficiency 100% 100% 100% 100% Direct velocity 0.000 0.000 0.000 0.000 Percentage direct scans 0% 0% 0% 0% Zone normal velocity 0.000 0.000 0.000 0.000 Zone dma32 velocity 0.000 0.000 0.000 0.000 Zone dma velocity 0.000 0.000 0.000 0.000 Page writes by reclaim 0.000 0.000 0.000 0.000 Page writes file 0 0 0 0 Page writes anon 0 0 0 0 Page reclaim immediate 0 0 0 0 Sector Reads 24440 38464 144944 143876 Sector Writes 569300 12712 16036 6956 Page rescued immediate 0 0 0 0 Slabs scanned 0 0 0 0 Direct inode steals 0 0 0 0 Kswapd inode steals 0 0 0 0 Kswapd skipped wait 0 0 0 0 THP fault alloc 0 0 0 0 THP collapse alloc 0 0 0 0 THP splits 0 0 0 0 THP fault 
fallback 0 0 0 0 THP collapse fail 0 0 0 0 Compaction stalls 0 0 0 0 Compaction success 0 0 0 0 Compaction failures 0 0 0 0 Page migrate success 11177 10858 14598 9857 Page migrate failure 0 2 1 1 Compaction pages isolated 0 0 0 0 Compaction migrate scanned 0 0 0 0 Compaction free scanned 0 0 0 0 Compaction cost 11 11 15 10 NUMA alloc hit 237281 229068 296261 263464 NUMA alloc miss 7 5 5 6 NUMA interleave hit 0 0 0 0 NUMA alloc local 237281 229068 296261 263464 NUMA base PTE updates 25433 20398 35883 22264 NUMA huge PMD updates 0 0 0 0 NUMA page range updates 25433 20398 35883 22264 NUMA hint faults 23242 18097 31026 17002 NUMA hint local faults 10012 6038 14657 6903 NUMA hint local percent 43 33 47 40 NUMA pages migrated 11177 10858 14598 9857 AutoNUMA cost 116% 90% 155% 85% ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-12 12:10 ` Stanislaw Gruszka
@ 2016-08-15 7:49 ` Giovanni Gherdovich
2016-08-15 8:33 ` Mel Gorman
2016-08-15 9:13 ` Wanpeng Li
1 sibling, 1 reply; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-15 7:49 UTC (permalink / raw)
To: Stanislaw Gruszka, Ingo Molnar
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel, Mel Gorman

Hello Stanislaw,

On Fri, 2016-08-12 at 14:10 +0200, Stanislaw Gruszka wrote:
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
>
> +	(void) task_sched_runtime(tsk);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  			task_cputime(t, &utime, &stime);
>  			times->utime += utime;
>  			times->stime += stime;
> -			times->sum_exec_runtime += task_sched_runtime(t);
> +			times->sum_exec_runtime += t->se.sum_exec_runtime;
>  		}
>  		/* If lockless access failed, take the lock. */
>  		nextseq = 1;
> ---
> mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
>
> vanila-4.7 revert prefetch patch
> 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
> 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
> 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
> 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
> 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
> 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
> 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
> 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
> 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
> 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
> 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
> 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
> 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
> 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
> 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
> 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
> 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
>
> Patch works because we we update sum_exec_runtime on current thread
> what assure we see proper sum_exec_runtime value on different CPUs. I
> tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
> patch did not break them. I'm going to run some other test.
>
> Patch is draft version for early review, task_sched_runtime() will be
> simplified (since it's called only current thread) and possibly split
> into two functions: one that call update_curr() and other that return
> sum_exec_runtime (assure it's consistent on 32 bit arches).
>
> Stanislaw

Thank you for having a look at this. Your patch performs very well,
even better than the pre-6e998916dfe3 numbers I was aiming for.

I confirm your results on my test machine (Sandy Bridge, 32 cores, 2
NUMA nodes). I didn't apply on the very latest 4.8-rc but used what I
had handy for comparison (i.e. 4.7-rc7 and the parent of 6e998916dfe3).
As I said, my measurements match yours (my tables follow); looks like
your diff cures the problem while mine cures the symptoms.

clock_gettime():
threads  4.7-rc7  3.18-rc3             4.7-rc7 + prefetch  4.7-rc7 + Stanislaw
                  (pre-6e998916dfe3)
2        3.48     2.23 ( 35.68%)       3.06 ( 11.83%)      1.08 ( 68.81%)
5        3.33     2.83 ( 14.84%)       3.25 (  2.40%)      0.71 ( 78.55%)
8        3.37     2.84 ( 15.80%)       3.26 (  3.30%)      0.56 ( 83.49%)
12       3.32     3.09 (  6.69%)       3.37 ( -1.60%)      0.42 ( 87.28%)
21       4.01     3.14 ( 21.70%)       3.90 (  2.74%)      0.35 ( 91.35%)
30       3.63     3.28 (  9.75%)       3.36 (  7.41%)      0.28 ( 92.23%)
48       3.71     3.02 ( 18.69%)       3.11 ( 16.27%)      0.39 ( 89.39%)
79       3.75     2.88 ( 23.23%)       3.16 ( 15.74%)      0.46 ( 87.76%)
110      3.81     2.95 ( 22.62%)       3.25 ( 14.80%)      0.56 ( 85.41%)
128      3.88     3.05 ( 21.28%)       3.31 ( 14.76%)      0.62 ( 84.10%)

times():
threads  4.7-rc7  3.18-rc3             4.7-rc7 + prefetch  4.7-rc7 + Stanislaw
                  (pre-6e998916dfe3)
2        3.65     2.27 ( 37.94%)       3.25 ( 11.03%)      1.62 ( 55.71%)
5        3.45     2.78 ( 19.34%)       3.17 (  7.92%)      2.33 ( 32.28%)
8        3.52     2.79 ( 20.66%)       3.22 (  8.69%)      2.06 ( 41.44%)
12       3.29     3.02 (  8.33%)       3.36 ( -2.04%)      2.00 ( 39.18%)
21       4.07     3.10 ( 23.86%)       3.92 (  3.78%)      2.07 ( 49.18%)
30       3.87     3.33 ( 13.80%)       3.40 ( 12.17%)      1.89 ( 51.12%)
48       3.79     2.96 ( 21.94%)       3.16 ( 16.61%)      1.69 ( 55.46%)
79       3.88     2.88 ( 25.82%)       3.28 ( 15.42%)      1.60 ( 58.81%)
110      3.90     2.98 ( 23.73%)       3.38 ( 13.35%)      1.73 ( 55.61%)
128      4.00     3.10 ( 22.40%)       3.38 ( 15.45%)      1.66 ( 58.52%)

Regards,
Giovanni

^ permalink raw reply [flat|nested] 14+ messages in thread
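For readers recomputing the percentages in the tables above: the patch description defines the gain as (before-after)/before, so a positive value means the candidate kernel is faster. A minimal sketch of that formula follows; the helper name percent_gain is made up for illustration and is not part of MMTests or the kernel.

```c
#include <assert.h>

/*
 * Percentage gain as defined in this thread: (before - after) / before.
 * "before" and "after" are elapsed times in seconds; a positive result
 * means "after" is faster. percent_gain is a hypothetical helper name.
 */
static double percent_gain(double before, double after)
{
	assert(before > 0.0);	/* a zero baseline would divide by zero */
	return (before - after) / before * 100.0;
}
```

Feeding the rounded figures printed in the tables into this helper gives values close to, but not always identical with, the printed percentages, presumably because the tables were generated from unrounded means.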
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 7:49 ` Giovanni Gherdovich
@ 2016-08-15 8:33 ` Mel Gorman
2016-08-15 9:19 ` Stanislaw Gruszka
0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-08-15 8:33 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Stanislaw Gruszka, Ingo Molnar, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> >
> > vanila-4.7 revert prefetch patch
> > 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
> > 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
> > 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
> > 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
> > 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
> > 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
> > 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
> > 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
> > 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
> > 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
> > 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
> > 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
> > 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
> > 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
> > 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
> > 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
> > 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
> >
> > Patch works because we we update sum_exec_runtime on current thread
> > what assure we see proper sum_exec_runtime value on different CPUs. I
> > tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
> > patch did not break them. I'm going to run some other test.
> >
> > Patch is draft version for early review, task_sched_runtime() will be
> > simplified (since it's called only current thread) and possibly split
> > into two functions: one that call update_curr() and other that return
> > sum_exec_runtime (assure it's consistent on 32 bit arches).
> >
> > Stanislaw
>

Is this really equivalent though? It updates one task instead of all
tasks in the group and there is no guarantee that tsk == current.

Glancing at it, it should monotonically increase but it looks like it
would calculate stale data.

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 8:33 ` Mel Gorman
@ 2016-08-15 9:19 ` Stanislaw Gruszka
2016-08-15 9:58 ` Mel Gorman
0 siblings, 1 reply; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 9:19 UTC (permalink / raw)
To: Mel Gorman
Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 09:33:49AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > > mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> > >
> > > vanila-4.7 revert prefetch patch
> > > 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
> > > 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
> > > 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
> > > 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
> > > 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
> > > 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
> > > 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
> > > 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
> > > 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
> > > 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
> > > 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
> > > 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
> > > 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
> > > 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
> > > 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
> > > 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
> > > 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
> > >
> > > Patch works because we we update sum_exec_runtime on current thread
> > > what assure we see proper sum_exec_runtime value on different CPUs. I
> > > tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
> > > patch did not break them. I'm going to run some other test.
> > >
> > > Patch is draft version for early review, task_sched_runtime() will be
> > > simplified (since it's called only current thread) and possibly split
> > > into two functions: one that call update_curr() and other that return
> > > sum_exec_runtime (assure it's consistent on 32 bit arches).
> > >
> > > Stanislaw
> >
>
> Is this really equivalent though? It updates one task instead of all
> tasks in the group and there is no guarantee that tsk == current.

Oh, my intention was to update runtime on current.

> Glancing at it, it should monotonically increase but it looks like it
> would calculate stale data.

Yes, until the next tick on a CPU, the patch does not count the partial
runtime of a thread running on that CPU. However, that was the behaviour
before commit d670ec13178d0; that is how the old
thread_group_sched_runtime() function worked:

 /*
- * Return sum_exec_runtime for the thread group.
- * In case the task is currently running, return the sum plus current's
- * pending runtime that have not been accounted yet.
- *
- * Note that the thread group might have other running tasks as well,
- * so the return value not includes other pending runtime that other
- * running tasks might have.
- */
-unsigned long long thread_group_sched_runtime(struct task_struct *p)

Stanislaw

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:19 ` Stanislaw Gruszka
@ 2016-08-15 9:58 ` Mel Gorman
2016-08-15 10:29 ` Stanislaw Gruszka
0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-08-15 9:58 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > Is this really equivalent though? It updates one task instead of all
> > tasks in the group and there is no guarantee that tsk == current.
>
> Oh, my intention was to update runtime on current.
>

Ok, so minimally that would need addressing. However, then I would worry
that two tasks in a group calling the function at the same time would
see different results because each of them updated a different task.
Such a situation is inherently race-prone anyway but it's a large enough
functional difference to be worth calling out.

Minimally, I don't think such a patch is a replacement for Giovanni's
which is functionally equivalent to the current code but could be layered
on top if it is proven to be ok.

> > Glancing at it, it should monotonically increase but it looks like it
> > would calculate stale data.
>
> Yes, until the next tick on a CPU, the patch does not count partial
> runtime of thread running on that CPU. However that was the behaviour
> before commit d670ec13178d0 - that how old thread_group_sched_runtime()
> function worked:
>

Sure, but does this patch not reintroduce the "SMP wobble" and the
problem of "the diff of 'process' should always be >= the diff of
'thread'" ?

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:58 ` Mel Gorman
@ 2016-08-15 10:29 ` Stanislaw Gruszka
0 siblings, 0 replies; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 10:29 UTC (permalink / raw)
To: Mel Gorman
Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 10:58:04AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > > Is this really equivalent though? It updates one task instead of all
> > > tasks in the group and there is no guarantee that tsk == current.
> >
> > Oh, my intention was to update runtime on current.
> >
>
> Ok, so minimally that would need addressing. However, then I would worry
> that two tasks in a group calling the function at the same time would
> see different results because each of them updated a different task.
> Such a situation is inherently race-prone anyway but it's a large enough
> functional difference to be worth calling out.

It races because we don't know which thread will call clock_gettime()
first. But once that happens, the second thread will see the updated
runtime value from the first thread, as we call update_curr() for it
with task_rq_lock (change from commit 6e998916dfe3).

> Minimally, I don't think such a patch is a replacement for Giovanni's
> which is functionally equivalent to the current code but could be layered
> on top if it is proven to be ok.

I agree. I wanted to post my patch on top of Giovanni's.

> > > Glancing at it, it should monotonically increase but it looks like it
> > > would calculate stale data.
> >
> > Yes, until the next tick on a CPU, the patch does not count partial
> > runtime of thread running on that CPU. However that was the behaviour
> > before commit d670ec13178d0 - that how old thread_group_sched_runtime()
> > function worked:
> >
>
> Sure, but does this patch not reintroduce the "SMP wobble" and the
> problem of "the diff of 'process' should always be >= the diff of
> 'thread'" ?

It should not reintroduce that problem, also because of the change from
commit 6e998916dfe3. When a thread reads sum_exec_runtime it also updates
that value, then the process reads the updated value. I ran the test case
from the "SMP wobble" commit and the problem does not happen in my tests.

Perhaps I should post the patch with a descriptive changelog and things
would be clearer ...

Stanislaw

^ permalink raw reply [flat|nested] 14+ messages in thread
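The disagreement in this sub-thread (refresh every thread's runtime before summing versus refresh only the caller and read the others as they are) can be illustrated with a userspace toy model. This is only a sketch under loud assumptions: struct toy_thread, toy_update() and both group_runtime_* functions are invented names, "pending" stands in for runtime accrued since the last update_curr(), and nothing here models locking, ticks, or the real kernel data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a thread; NOT the kernel's struct task_struct. */
struct toy_thread {
	uint64_t sum_exec_runtime;	/* runtime already accounted (ns) */
	uint64_t pending;		/* runtime accrued since last update (ns) */
};

/* Toy analogue of update_curr(): fold pending runtime into the sum. */
static void toy_update(struct toy_thread *t)
{
	t->sum_exec_runtime += t->pending;
	t->pending = 0;
}

/* Strategy A: refresh every thread, then sum (no stale contributions). */
static uint64_t group_runtime_update_all(struct toy_thread *th, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		toy_update(&th[i]);
		total += th[i].sum_exec_runtime;
	}
	return total;
}

/*
 * Strategy B: refresh only the calling thread, read the others as is.
 * Cheaper, but pending runtime of the other threads is not counted yet.
 */
static uint64_t group_runtime_update_caller(struct toy_thread *th, size_t n,
					    size_t caller)
{
	uint64_t total = 0;

	assert(caller < n);
	toy_update(&th[caller]);
	for (size_t i = 0; i < n; i++)
		total += th[i].sum_exec_runtime;
	return total;
}
```

With three toy threads holding (100, 200, 300) accounted and (10, 20, 30) pending nanoseconds, strategy B called from thread 0 reports 610 while strategy A reports 660. Successive strategy B calls from different threads never go backwards in this model, which matches the "monotonic but stale" characterisation above; whether that holds in the kernel is exactly what the thread is debating.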
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-15 7:49 ` Giovanni Gherdovich
@ 2016-08-15 9:13 ` Wanpeng Li
2016-08-15 9:21 ` Stanislaw Gruszka
1 sibling, 1 reply; 14+ messages in thread
From: Wanpeng Li @ 2016-08-15 9:13 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel@vger.kernel.org, Mel Gorman

2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> Hi
>
> On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> Nice detective work! I'm wondering, where do we stand if compared with a
>> pre-6e998916dfe3 kernel?
>>
>> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
>> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
>> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
>> make the result correct, just see what the performance gap is, roughly.
>>
>> If there's still a significant gap then it might make sense to optimize this some
>> more.
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
>
> +	(void) task_sched_runtime(tsk);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  			task_cputime(t, &utime, &stime);
>  			times->utime += utime;
>  			times->stime += stime;
> -			times->sum_exec_runtime += task_sched_runtime(t);
> +			times->sum_exec_runtime += t->se.sum_exec_runtime;

If this will not have updated stats for other threads?

Regards,
Wanpeng Li

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:13 ` Wanpeng Li
@ 2016-08-15 9:21 ` Stanislaw Gruszka
2016-08-15 9:28 ` Wanpeng Li
0 siblings, 1 reply; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 9:21 UTC (permalink / raw)
To: Wanpeng Li
Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel@vger.kernel.org, Mel Gorman

On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> > Hi
> >
> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> >> Nice detective work! I'm wondering, where do we stand if compared with a
> >> pre-6e998916dfe3 kernel?
> >>
> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> >> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> >> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> >> make the result correct, just see what the performance gap is, roughly.
> >>
> >> If there's still a significant gap then it might make sense to optimize this some
> >> more.
> >
> > I measured (partial) revert performance on 4.7 using mmtest instructions
> > from Giovanni and also tested some other possible fix (draft version):
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 75f98c5..54fdf6d 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >  	unsigned int seq, nextseq;
> >  	unsigned long flags;
> >
> > +	(void) task_sched_runtime(tsk);
> > +
> >  	rcu_read_lock();
> >  	/* Attempt a lockless read on the first round. */
> >  	nextseq = 0;
> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >  			task_cputime(t, &utime, &stime);
> >  			times->utime += utime;
> >  			times->stime += stime;
> > -			times->sum_exec_runtime += task_sched_runtime(t);
> > +			times->sum_exec_runtime += t->se.sum_exec_runtime;
>
> If this will not have updated stats for other threads?

No, until tick/sched() on CPUs running threads.

Stanislaw

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:21 ` Stanislaw Gruszka
@ 2016-08-15 9:28 ` Wanpeng Li
0 siblings, 0 replies; 14+ messages in thread
From: Wanpeng Li @ 2016-08-15 9:28 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel@vger.kernel.org, Mel Gorman

2016-08-15 17:21 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
>> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
>> > Hi
>> >
>> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> >> Nice detective work! I'm wondering, where do we stand if compared with a
>> >> pre-6e998916dfe3 kernel?
>> >>
>> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
>> >> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
>> >> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
>> >> make the result correct, just see what the performance gap is, roughly.
>> >>
>> >> If there's still a significant gap then it might make sense to optimize this some
>> >> more.
>> >
>> > I measured (partial) revert performance on 4.7 using mmtest instructions
>> > from Giovanni and also tested some other possible fix (draft version):
>> >
>> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> > index 75f98c5..54fdf6d 100644
>> > --- a/kernel/sched/cputime.c
>> > +++ b/kernel/sched/cputime.c
>> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >  	unsigned int seq, nextseq;
>> >  	unsigned long flags;
>> >
>> > +	(void) task_sched_runtime(tsk);
>> > +
>> >  	rcu_read_lock();
>> >  	/* Attempt a lockless read on the first round. */
>> >  	nextseq = 0;
>> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >  			task_cputime(t, &utime, &stime);
>> >  			times->utime += utime;
>> >  			times->stime += stime;
>> > -			times->sum_exec_runtime += task_sched_runtime(t);
>> > +			times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>> If this will not have updated stats for other threads?
>
> No, until tick/sched() on CPUs running threads.

Yeah, I think this change will result in not updated stats for other
threads if they are running and before next update_curr() is called.

Regards,
Wanpeng Li

^ permalink raw reply [flat|nested] 14+ messages in thread
* [tip:sched/core] sched/cputime: Mitigate performance regression in times()/clock_gettime() 2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich 2016-08-10 11:26 ` Ingo Molnar @ 2016-08-10 18:00 ` tip-bot for Giovanni Gherdovich 1 sibling, 0 replies; 14+ messages in thread From: tip-bot for Giovanni Gherdovich @ 2016-08-10 18:00 UTC (permalink / raw) To: linux-tip-commits Cc: hpa, ggherdovich, tglx, sgruszka, mgorman, mingo, torvalds, mgalbraith, linux-kernel, peterz Commit-ID: 6075620b0590eaf22f10ce88833eb20a57f760d6 Gitweb: http://git.kernel.org/tip/6075620b0590eaf22f10ce88833eb20a57f760d6 Author: Giovanni Gherdovich <ggherdovich@suse.cz> AuthorDate: Fri, 5 Aug 2016 10:21:56 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Wed, 10 Aug 2016 13:32:56 +0200 sched/cputime: Mitigate performance regression in times()/clock_gettime() Commit: 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") fixed a problem whereby clock_nanosleep() followed by clock_gettime() could allow a task to wake early. It addressed the problem by calling the scheduling classes update_curr() when the cputimer starts. Said change induced a considerable performance regression on the syscalls times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some debuggers and applications that monitor their own performance that accidentally depend on the performance of these specific calls. This patch mitigates the performace loss by prefetching data in the CPU cache, as stalls due to cache misses appear to be where most time is spent in our benchmarks. Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated with a variable number of threads, from 2 to 4*num_cpus; the results are in seconds and correspond to the average of 10 runs; the percentage gain is computed with (before-after)/before so a positive value is an improvement (it's faster). 
The improvement varies between a few percents for 5-20 threads and more than 10% for 2 or >20 threads. pound_clock_gettime: threads 4.7-rc7 patched 4.7-rc7 [num] [secs] [secs (percent)] 2 3.48 3.06 ( 11.83%) 5 3.33 3.25 ( 2.40%) 8 3.37 3.26 ( 3.30%) 12 3.32 3.37 ( -1.60%) 21 4.01 3.90 ( 2.74%) 30 3.63 3.36 ( 7.41%) 48 3.71 3.11 ( 16.27%) 79 3.75 3.16 ( 15.74%) 110 3.81 3.25 ( 14.80%) 128 3.88 3.31 ( 14.76%) pound_times: threads 4.7-rc7 patched 4.7-rc7 [num] [secs] [secs (percent)] 2 3.65 3.25 ( 11.03%) 5 3.45 3.17 ( 7.92%) 8 3.52 3.22 ( 8.69%) 12 3.29 3.36 ( -2.04%) 21 4.07 3.92 ( 3.78%) 30 3.87 3.40 ( 12.17%) 48 3.79 3.16 ( 16.61%) 79 3.88 3.28 ( 15.42%) 110 3.90 3.38 ( 13.35%) 128 4.00 3.38 ( 15.45%) pound_clock_gettime and pound_clock_gettime are two benchmarks included in the MMTests framework. They launch a given number of threads which repeatedly call times() or clock_gettimes(). The results above can be reproduced with cloning MMTests from github.com and running the "poundtime" workload: $ git clone https://github.com/gormanm/mmtests.git $ cd mmtests $ cp configs/config-global-dhp__workload_poundtime config $ ./run-mmtests.sh --run-monitor $(uname -r) The above will run "poundtime" measuring the kernel currently running on the machine; Once a new kernel is installed and the machine rebooted, running again $ cd mmtests $ ./run-mmtests.sh --run-monitor $(uname -r) will produce results to compare with. A comparison table will be output with: $ cd mmtests/work/log $ ../../compare-kernels.sh the table will contain a lot of entries; grepping for "Amean" (as in "arithmetic mean") will give the tables presented above. The source code for the two benchmarks is reported at the end of this changelog for clairity. The cache misses addressed by this patch were found using a combination of `perf top`, `perf record` and `perf annotate`. 
The incriminated lines were found to be struct sched_entity *curr = cfs_rq->curr; and delta_exec = now - curr->exec_start; in the function update_curr() from kernel/sched/fair.c. This patch prefetches the data from memory just before update_curr is called in the interested execution path. A comparison of the total number of cycles before and after the patch follows; the data is obtained using `perf stat -r 10 -ddd <program>` running over the same sequence of number of threads used above (a positive gain is an improvement): threads cycles before cycles after gain 2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88% 5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85% 8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74% 12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74% 21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89% 30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92% 48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10% 79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33% 110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21% 128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99% A comparison of cache miss vs total cache loads ratios, before and after the patch (again from the `perf stat -r 10 -ddd <program>` tables): threads L1 misses/total*100 L1 misses/total*100 gain before after 2 7.43 +-4.90% 7.36 +-4.70% 0.94% 5 13.09 +-4.74% 13.52 +-3.73% -3.28% 8 13.79 +-5.61% 12.90 +-3.27% 6.45% 12 11.57 +-2.44% 8.71 +-1.40% 24.72% 21 12.39 +-3.92% 9.97 +-1.84% 19.53% 30 13.91 +-2.53% 11.73 +-2.28% 15.67% 48 13.71 +-1.59% 12.32 +-1.97% 10.14% 79 14.44 +-0.66% 13.40 +-1.06% 7.20% 110 15.86 +-0.50% 14.46 +-0.59% 8.83% 128 16.51 +-0.32% 15.06 +-0.78% 8.78% As a final note, the following shows the evolution of performance figures in the "poundtime" benchmark and pinpoints commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a major source of degradation, mostly unaddressed to this day (figures 
expressed in seconds).

pound_clock_gettime:

    threads   parent of       6e998916dfe3        4.7-rc7
              6e998916dfe3    itself
      2       2.23            3.68 ( -64.56%)     3.48 (-55.48%)
      5       2.83            3.78 ( -33.42%)     3.33 (-17.43%)
      8       2.84            4.31 ( -52.12%)     3.37 (-18.76%)
      12      3.09            3.61 ( -16.74%)     3.32 ( -7.17%)
      21      3.14            4.63 ( -47.36%)     4.01 (-27.71%)
      30      3.28            5.75 ( -75.37%)     3.63 (-10.80%)
      48      3.02            6.05 (-100.56%)     3.71 (-22.99%)
      79      2.88            6.30 (-118.90%)     3.75 (-30.26%)
      110     2.95            6.46 (-119.00%)     3.81 (-29.24%)
      128     3.05            6.42 (-110.08%)     3.88 (-27.04%)

pound_times:

    threads   parent of       6e998916dfe3        4.7-rc7
              6e998916dfe3    itself
      2       2.27            3.73 ( -64.71%)     3.65 (-61.14%)
      5       2.78            3.77 ( -35.56%)     3.45 (-23.98%)
      8       2.79            4.41 ( -57.71%)     3.52 (-26.05%)
      12      3.02            3.56 ( -17.94%)     3.29 ( -9.08%)
      21      3.10            4.61 ( -48.74%)     4.07 (-31.34%)
      30      3.33            5.75 ( -72.53%)     3.87 (-16.01%)
      48      2.96            6.06 (-105.04%)     3.79 (-28.10%)
      79      2.88            6.24 (-116.83%)     3.88 (-34.81%)
      110     2.98            6.37 (-114.08%)     3.90 (-31.12%)
      128     3.10            6.35 (-104.61%)     4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lpthread -lrt
    done

==== BEGIN pound_times.c ====

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

void *pound (void *threadid)
{
	struct tms end;
	int oldutime = 0;
	int utime;
	int i;

	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		times(&end);
		utime = ((int)end.tms_utime - (int)start.tms_utime);
		if (oldutime > utime) {
			printf("utime decreased, was %d, now %d!\n", oldutime, utime);
		}
		oldutime = utime;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long i;

	times(&start);
	for (i = 0; i < NUM_THREADS; i++) {
		pthread_create (&th[i], NULL, pound, (void *)i);
	}
	pthread_exit(NULL);
	return 0;
}

==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>

void *pound (void *threadid)
{
	struct timespec ts;
	int rc, i;
	unsigned long prev = 0, this = 0;

	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		if (rc < 0)
			perror("clock_gettime");
		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
		if (0 && this < prev)
			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
		prev = this;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long rc, i;
	pid_t pgid;

	for (i = 0; i < NUM_THREADS; i++) {
		rc = pthread_create(&th[i], NULL, pound, (void *)i);
		if (rc < 0)
			perror("pthread_create");
	}

	pthread_exit(NULL);
	return 0;
}

==== END pound_clock_gettime.c ====

Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5c883fe..2a906f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/prefetch.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2972,6 +2973,23 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
+ * The function fair_sched_class.update_curr accesses the struct curr
+ * and its field curr->exec_start; when called from task_sched_runtime(),
+ * we observe a high rate of cache misses in practice.
+ * Prefetching this data results in improved performance.
+ */
+static inline void prefetch_curr_exec_start(struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+#else
+	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
+#endif
+	prefetch(curr);
+	prefetch(&curr->exec_start);
+}
+
+/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -3005,6 +3023,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
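
For readers who want to experiment with the idea outside the kernel,
here is a minimal userspace sketch of the same pattern: issue the
prefetch hints first, do some unrelated work, then read the field. It
is an illustration only and not part of the patch. The types
"struct entity" and "struct runqueue" and the function
account_runtime() are hypothetical stand-ins for the scheduler
structures, and the GCC/Clang builtin __builtin_prefetch() is used here
as an assumed substitute for the kernel's prefetch() helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the scheduler structures; not kernel code. */
struct entity {
	uint64_t exec_start;	/* timestamp of the last accounting update */
	uint64_t sum_exec;	/* accumulated runtime */
};

struct runqueue {
	struct entity *curr;	/* entity currently running */
	uint64_t clock;		/* last clock value seen */
};

/* Hint the CPU to start loading curr and its exec_start field. */
static inline void prefetch_curr_exec_start(const struct runqueue *rq)
{
	const struct entity *curr = rq->curr;

	__builtin_prefetch(curr);
	__builtin_prefetch(&curr->exec_start);
}

/* Same shape as the patched path: prefetch, other work, then the load. */
static uint64_t account_runtime(struct runqueue *rq, uint64_t now)
{
	struct entity *curr;
	uint64_t delta_exec;

	assert(rq && rq->curr);

	prefetch_curr_exec_start(rq);	/* issue the hints early */
	rq->clock = now;		/* unrelated work overlaps the fetch */

	curr = rq->curr;
	delta_exec = now - curr->exec_start;	/* data is needed here */
	curr->exec_start = now;
	curr->sum_exec += delta_exec;
	return delta_exec;
}
```

A prefetch is only a hint and never changes the computed result, so
whether it helps has to be measured on the workload in question, for
instance with `perf stat` as done above.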