[PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock

* [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
@ 2016-08-05  8:21 Giovanni Gherdovich
  2016-08-05  8:21 ` [PATCH 1/1] " Giovanni Gherdovich
  0 siblings, 1 reply; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05  8:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
	Giovanni Gherdovich

As per Peter Zijlstra's review, these are the differences with respect to
v1:

* included the appropriate header file, linux/prefetch.h
* factored the calls to prefetch() out into a separate function
* introduced the local variable curr as a form of common
  subexpression elimination (CSE)
* fixed the Signed-off-by chain
* added a comment explaining why the prefetches are needed
Giovanni Gherdovich (1):
  sched/cputime: Mitigate performance regression in
    times()/clock_gettime()

 kernel/sched/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

-- 
2.6.6

* [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-05  8:21 [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
@ 2016-08-05  8:21 ` Giovanni Gherdovich
  2016-08-10 11:26   ` Ingo Molnar
  2016-08-10 18:00   ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
  0 siblings, 2 replies; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05  8:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
	Giovanni Gherdovich

Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency") fixed a problem whereby clock_nanosleep() followed by
clock_gettime() could allow a task to wake early. It addressed the problem
by calling the scheduling class's update_curr() when the cputimer starts.

Said change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.

This patch mitigates the performance loss by prefetching data into the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.

Here is the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement is more than 10% for 2 threads and for 48
or more threads, and mostly below 10% in between; the 12-thread case shows
a small regression (about 2%).

pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.48        3.06 ( 11.83%)
      5           3.33        3.25 (  2.40%)
      8           3.37        3.26 (  3.30%)
     12           3.32        3.37 ( -1.60%)
     21           4.01        3.90 (  2.74%)
     30           3.63        3.36 (  7.41%)
     48           3.71        3.11 ( 16.27%)
     79           3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.65        3.25 ( 11.03%)
      5           3.45        3.17 (  7.92%)
      8           3.52        3.22 (  8.69%)
     12           3.29        3.36 ( -2.04%)
     21           4.07        3.92 (  3.78%)
     30           3.87        3.40 ( 12.17%)
     48           3.79        3.16 ( 16.61%)
     79           3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)
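
The percentage column in the two tables above follows the
(before-after)/before formula. As a minimal sketch, the helper below
recomputes it from the rounded seconds shown in the tables. The name
percent_gain is made up for illustration and is not part of MMTests. The
recomputed value is close to, but not always identical with, the printed
percentage, presumably because the printed values were derived from
unrounded means.

```c
#include <assert.h>

/*
 * Illustrative helper (hypothetical name, not from MMTests):
 * relative gain in percent, positive when "after" is faster.
 */
double percent_gain(double before, double after)
{
	assert(before > 0.0);	/* a zero baseline has no meaningful gain */
	return (before - after) / before * 100.0;
}
```

For the 2-thread pound_clock_gettime row, percent_gain(3.48, 3.06)
evaluates to roughly 12.07, against the 11.83% printed in the table.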

pound_times and pound_clock_gettime are two benchmarks included in the
MMTests framework. They launch a given number of threads which repeatedly
call times() or clock_gettime(). The results above can be reproduced by
cloning MMTests from github.com and running the "poundtime" workload:

$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime" measuring the kernel currently running on
the machine; once a new kernel is installed and the machine rebooted,
running again

$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output
with

$ cd mmtests/work/log
$ ../../compare-kernels.sh

The table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clarity.

The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The offending lines were
found to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr() is called in the
affected execution path.
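
The call order described above can be sketched in user space. This is
only an illustration under stated assumptions: struct entity, struct
runqueue, update_curr_model() and task_runtime_model() are simplified,
hypothetical stand-ins rather than the kernel's definitions, and the
GCC/Clang builtin __builtin_prefetch() is used in place of the kernel's
prefetch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the scheduler structures. */
struct entity {
	uint64_t exec_start;		/* when the current slice began */
	uint64_t sum_exec_runtime;	/* runtime accounted so far */
};

struct runqueue {
	struct entity *curr;		/* entity currently running */
	uint64_t clock;
};

/* Hint the CPU to start loading curr and curr->exec_start early. */
void prefetch_curr_exec_start(const struct runqueue *rq)
{
	const struct entity *curr = rq->curr;

	__builtin_prefetch(curr);
	__builtin_prefetch(&curr->exec_start);
}

/* Simplified model of update_curr(): account time since exec_start. */
void update_curr_model(struct runqueue *rq)
{
	struct entity *curr = rq->curr;
	uint64_t delta_exec = rq->clock - curr->exec_start;

	curr->exec_start = rq->clock;
	curr->sum_exec_runtime += delta_exec;
}

/* Same order as the patched path: prefetch, update the clock, account. */
uint64_t task_runtime_model(struct runqueue *rq, uint64_t now)
{
	assert(rq->curr != NULL);	/* the sketch assumes a running entity */
	prefetch_curr_exec_start(rq);
	rq->clock = now;		/* stands in for update_rq_clock() */
	update_curr_model(rq);
	return rq->curr->sum_exec_runtime;
}
```

The prefetch does not change the result; it only gives the memory
subsystem a head start while the clock update runs, so whether it helps
depends on the workload and the hardware.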

A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):

  threads   cycles before                 cycles after                gain

    2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
    5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
    8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
   12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
   21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
   30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
   48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
   79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
  110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
  128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%

A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):

  threads   L1 misses/total*100     L1 misses/total*100            gain
		         before                   after
      2           7.43  +-4.90%           7.36  +-4.70%           0.94%
      5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
      8          13.79  +-5.61%          12.90  +-3.27%           6.45%
     12          11.57  +-2.44%           8.71  +-1.40%          24.72%
     21          12.39  +-3.92%           9.97  +-1.84%          19.53%
     30          13.91  +-2.53%          11.73  +-2.28%          15.67%
     48          13.71  +-1.59%          12.32  +-1.97%          10.14%
     79          14.44  +-0.66%          13.40  +-1.06%           7.20%
    110          15.86  +-0.50%          14.46  +-0.59%           8.83%
    128          16.51  +-0.32%          15.06  +-0.78%           8.78%

As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).

pound_clock_gettime:

  threads   parent of         6e998916dfe3        4.7-rc7
	    6e998916dfe3            itself
    2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
    5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
    8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
    12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
    21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
    30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
    48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
    79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
    110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
    128      3.05          6.42 (-110.08%)        3.88 (-27.04%)

pound_times:

  threads   parent of         6e998916dfe3        4.7-rc7
	    6e998916dfe3            itself
    2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
    5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
    8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
    12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
    21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
    30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
    48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
    79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
    110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
    128      3.10          6.35 (-104.61%)        4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
    gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lpthread -lrt
done

==== BEGIN pound_times.c ====

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

void *pound (void *threadid)
{
  struct tms end;
  int oldutime = 0;
  int utime;
  int i;
  for (i = 0; i < 5000000 / NUM_THREADS; i++) {
          times(&end);
          utime = ((int)end.tms_utime - (int)start.tms_utime);
          if (oldutime > utime) {
            printf("utime decreased, was %d, now %d!\n", oldutime, utime);
          }
          oldutime = utime;
  }
  pthread_exit(NULL);
}

int main()
{
  pthread_t th[NUM_THREADS];
  long i;
  times(&start);
  for (i = 0; i < NUM_THREADS; i++) {
    pthread_create (&th[i], NULL, pound, (void *)i);
  }
  pthread_exit(NULL);
  return 0;
}
==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>

void *pound (void *threadid)
{
	struct timespec ts;
	int rc, i;
	unsigned long prev = 0, this = 0;

	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		if (rc < 0)
			perror("clock_gettime");
		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
		if (0 && this < prev)
			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
		prev = this;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long rc, i;
	pid_t pgid;

	for (i = 0; i < NUM_THREADS; i++) {
		rc = pthread_create(&th[i], NULL, pound, (void *)i);
		if (rc < 0)
			perror("pthread_create");
	}

	pthread_exit(NULL);
	return 0;
}
==== END pound_clock_gettime.c ====

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
---
 kernel/sched/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..4500421 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/prefetch.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2965,6 +2966,23 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
+ * The function fair_sched_class.update_curr accesses the struct curr
+ * and its field curr->exec_start; when called from task_sched_runtime,
+ * we observe a high rate of cache misses in practice.
+ * Prefetching this data results in improved performance.
+ */
+static inline void prefetch_curr_exec_start(struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+#else
+	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
+#endif
+	prefetch(curr);
+	prefetch(&curr->exec_start);
+}
+
+/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -2998,6 +3016,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
-- 
2.6.6

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-05  8:21 ` [PATCH 1/1] " Giovanni Gherdovich
@ 2016-08-10 11:26   ` Ingo Molnar
  2016-08-10 13:02     ` Giovanni Gherdovich
  2016-08-12 12:10     ` Stanislaw Gruszka
  2016-08-10 18:00   ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
  1 sibling, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2016-08-10 11:26 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
	linux-kernel, Mel Gorman


* Giovanni Gherdovich <ggherdovich@suse.cz> wrote:

> Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
> inconsistency") fixed a problem whereby clock_nanosleep() followed by
> clock_gettime() could allow a task to wake early. It addressed the problem
> by calling the scheduling class's update_curr() when the cputimer starts.
> 
> Said change induced a considerable performance regression on the syscalls
> times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
> debuggers and applications that monitor their own performance that
> accidentally depend on the performance of these specific calls.
> 
> This patch mitigates the performance loss by prefetching data into the CPU
> cache, as stalls due to cache misses appear to be where most time is spent
> in our benchmarks.
> 
> Here is the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
> box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
> variable number of threads, from 2 to 4*num_cpus; the results are in
> seconds and correspond to the average of 10 runs; the percentage gain is
> computed with (before-after)/before so a positive value is an improvement
> (it's faster). The improvement is more than 10% for 2 threads and for 48
> or more threads, and mostly below 10% in between; the 12-thread case shows
> a small regression (about 2%).
> 
> pound_clock_gettime:
> 
>     threads       4.7-rc7     patched 4.7-rc7
>     [num]         [secs]      [secs (percent)]
>       2           3.48        3.06 ( 11.83%)
>       5           3.33        3.25 (  2.40%)
>       8           3.37        3.26 (  3.30%)
>      12           3.32        3.37 ( -1.60%)
>      21           4.01        3.90 (  2.74%)
>      30           3.63        3.36 (  7.41%)
>      48           3.71        3.11 ( 16.27%)
>      79           3.75        3.16 ( 15.74%)
>     110           3.81        3.25 ( 14.80%)
>     128           3.88        3.31 ( 14.76%)

Nice detective work! I'm wondering, where do we stand if compared with a 
pre-6e998916dfe3 kernel?

I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I 
suspect v3.17 does not run easily on a recent distro. Could you attempt to revert 
the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to 
make the result correct, just see what the performance gap is, roughly.

If there's still a significant gap then it might make sense to optimize this some 
more.

Thanks,

	Ingo

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-10 11:26   ` Ingo Molnar
@ 2016-08-10 13:02     ` Giovanni Gherdovich
  2016-08-12 12:10     ` Stanislaw Gruszka
  1 sibling, 0 replies; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-10 13:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
	linux-kernel, Mel Gorman

Hello Ingo,

thank you for your reply.

Ingo Molnar <mingo@kernel.org> wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a 
> pre-6e998916dfe3 kernel?

The data follows. A considerable part of the performance loss is recovered;
something is still on the table.

"3.18-pre-bug" is the parent of 6e998916dfe3, i.e. 6e998916dfe3^1
"3.18-bug" is the revision 6e998916dfe3 itself.
Figures are in seconds. Percentages refer to 3.18-pre-bug, negative = worse.


times()

threads    3.18-pre-bug          3.18-bug              4.7.0-rc7             4.7.0-rc7-patched

2          2.27 (  0.00%)        3.73 (-64.71%)        3.65 (-61.14%)        3.06 (-35.16%)
5          2.78 (  0.00%)        3.77 (-35.56%)        3.45 (-23.98%)        3.25 (-16.79%)
8          2.79 (  0.00%)        4.41 (-57.71%)        3.52 (-26.05%)        3.26 (-16.53%)
12         3.02 (  0.00%)        3.56 (-17.94%)        3.29 ( -9.08%)        3.37 (-11.74%)
21         3.10 (  0.00%)        4.61 (-48.74%)        4.07 (-31.34%)        3.90 (-25.89%)
30         3.33 (  0.00%)        5.75 (-72.53%)        3.87 (-16.01%)        3.36 ( -0.81%)
48         2.96 (  0.00%)        6.06 (-105.04%)       3.79 (-28.10%)        3.11 ( -5.14%)
79         2.88 (  0.00%)        6.24 (-116.83%)       3.88 (-34.81%)        3.16 ( -9.84%)
110        2.98 (  0.00%)        6.37 (-114.08%)       3.90 (-31.12%)        3.25 ( -9.07%)
128        3.10 (  0.00%)        6.35 (-104.61%)       4.00 (-28.87%)        3.31 ( -6.57%)


clock_gettime()

threads    3.18-pre-bug          3.18-bug              4.7.0-rc7             4.7.0-rc7-patched

2          2.23 (  0.00%)        3.68 (-64.56%)        3.48 (-55.48%)        3.25 (-45.41%)
5          2.83 (  0.00%)        3.78 (-33.42%)        3.33 (-17.43%)        3.17 (-12.03%)
8          2.84 (  0.00%)        4.31 (-52.12%)        3.37 (-18.76%)        3.22 (-13.43%)
12         3.09 (  0.00%)        3.61 (-16.74%)        3.32 ( -7.17%)        3.36 ( -8.47%)
21         3.14 (  0.00%)        4.63 (-47.36%)        4.01 (-27.71%)        3.92 (-24.68%)
30         3.28 (  0.00%)        5.75 (-75.37%)        3.63 (-10.80%)        3.40 ( -3.69%)
48         3.02 (  0.00%)        6.05 (-100.56%)       3.71 (-22.99%)        3.16 ( -4.64%)
79         2.88 (  0.00%)        6.30 (-118.90%)       3.75 (-30.26%)        3.28 (-13.93%)
110        2.95 (  0.00%)        6.46 (-119.00%)       3.81 (-29.24%)        3.38 (-14.69%)
128        3.05 (  0.00%)        6.42 (-110.08%)       3.88 (-27.04%)        3.38 (-10.70%)


Regards,
Giovanni Gherdovich

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-10 11:26   ` Ingo Molnar
  2016-08-10 13:02     ` Giovanni Gherdovich
@ 2016-08-12 12:10     ` Stanislaw Gruszka
  2016-08-15  7:49       ` Giovanni Gherdovich
  2016-08-15  9:13       ` Wanpeng Li
  1 sibling, 2 replies; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-12 12:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra, Mike Galbraith,
	linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 3629 bytes --]

Hi

On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a 
> pre-6e998916dfe3 kernel?
> 
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I 
> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert 
> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to 
> make the result correct, just see what the performance gap is, roughly.
> 
> If there's still a significant gap then it might make sense to optimize this some 
> more.

I measured (partial) revert performance on 4.7 using mmtest instructions
from Giovanni and also tested some other possible fix (draft version):

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	(void) task_sched_runtime(tsk);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += t->se.sum_exec_runtime;
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
---
MMTests benchmark results are below (Amean rows; the full compare-kernels.sh
output is in the attachment):

                              vanilla-4.7           revert                prefetch              patch
real-pound_clock_gettime-2    4.74 (  0.00%)        3.04 ( 35.93%)        4.09 ( 13.81%)        1.30 ( 72.59%)
real-pound_clock_gettime-5    5.49 (  0.00%)        5.00 (  8.97%)        5.34 (  2.72%)        1.03 ( 81.16%)
real-pound_clock_gettime-8    6.12 (  0.00%)        4.91 ( 19.73%)        5.97 (  2.40%)        0.90 ( 85.27%)
real-pound_clock_gettime-12   6.68 (  0.00%)        4.90 ( 26.66%)        6.02 (  9.75%)        0.88 ( 86.89%)
real-pound_clock_gettime-21   7.21 (  0.00%)        5.13 ( 28.85%)        6.70 (  7.09%)        0.87 ( 87.91%)
real-pound_clock_gettime-30   7.66 (  0.00%)        5.22 ( 31.80%)        7.17 (  6.39%)        0.92 ( 88.01%)
real-pound_clock_gettime-48   7.91 (  0.00%)        5.36 ( 32.22%)        7.30 (  7.72%)        0.95 ( 87.97%)
real-pound_clock_gettime-79   7.95 (  0.00%)        5.35 ( 32.73%)        7.34 (  7.66%)        1.06 ( 86.66%)
real-pound_clock_gettime-96   8.00 (  0.00%)        5.33 ( 33.31%)        7.38 (  7.73%)        1.13 ( 85.82%)
real-pound_times-2            5.61 (  0.00%)        3.55 ( 36.76%)        4.53 ( 19.23%)        2.29 ( 59.28%)
real-pound_times-5            5.66 (  0.00%)        4.32 ( 23.79%)        4.75 ( 16.18%)        3.65 ( 35.46%)
real-pound_times-8            5.98 (  0.00%)        4.97 ( 16.87%)        5.96 (  0.35%)        3.62 ( 39.40%)
real-pound_times-12           6.58 (  0.00%)        4.94 ( 24.93%)        6.04 (  8.32%)        3.63 ( 44.89%)
real-pound_times-21           7.19 (  0.00%)        5.18 ( 28.01%)        6.68 (  7.13%)        3.65 ( 49.22%)
real-pound_times-30           7.67 (  0.00%)        5.27 ( 31.29%)        7.16 (  6.63%)        3.62 ( 52.76%)
real-pound_times-48           7.88 (  0.00%)        5.36 ( 31.98%)        7.28 (  7.58%)        3.65 ( 53.71%)
real-pound_times-79           7.99 (  0.00%)        5.39 ( 32.52%)        7.40 (  7.42%)        3.65 ( 54.25%)

The patch works because we update sum_exec_runtime on the current thread,
which ensures that we see the proper sum_exec_runtime value on different
CPUs. I tested it with the reproducers from commits 6e998916dfe32 and
d670ec13178d0; the patch did not break them. I'm going to run some other
tests.

The patch is a draft version for early review. task_sched_runtime() will
be simplified (since it's called only on the current thread) and possibly
split into two functions: one that calls update_curr() and another that
returns sum_exec_runtime (ensuring it's consistent on 32-bit arches).
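
To make the idea of the draft concrete, here is a rough user-space model
of the two summing strategies. It is a sketch under simplifying
assumptions, not kernel code: struct thread_model and all function names
are hypothetical, locking is ignored, and the "pending" field stands in
for runtime that update_curr() would fold into sum_exec_runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread accounting state, not the kernel's task_struct. */
struct thread_model {
	uint64_t sum_exec_runtime;	/* runtime already accounted */
	uint64_t pending;		/* runtime not yet folded in */
	unsigned int updates;		/* how often the costly update ran */
};

/* Stand-in for task_sched_runtime(): fold pending time, return the total. */
uint64_t sched_runtime_model(struct thread_model *t)
{
	t->sum_exec_runtime += t->pending;
	t->pending = 0;
	t->updates++;
	return t->sum_exec_runtime;
}

/* Behaviour before the draft: every thread in the group is updated. */
uint64_t group_runtime_update_all(struct thread_model *threads, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += sched_runtime_model(&threads[i]);
	return total;
}

/* Draft behaviour: update only the caller, read stored values of the rest. */
uint64_t group_runtime_update_current(struct thread_model *threads, size_t n,
				      size_t current)
{
	uint64_t total = 0;

	assert(current < n);
	(void)sched_runtime_model(&threads[current]);
	for (size_t i = 0; i < n; i++)
		total += threads[i].sum_exec_runtime;
	return total;
}
```

In this model the second variant runs the expensive update once per call
instead of once per thread, at the price of not folding in the pending
runtime of the other threads.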

Stanislaw

[-- Attachment #2: compare.txt --]
[-- Type: text/plain, Size: 27653 bytes --]


poundtime
                                                         vanilla                       rever                     prefetc                         mas
                                                             4.7                      revert                    prefetch                        mask
Min      real-pound_clock_gettime-2         4.38 (  0.00%)        2.73 ( 37.67%)        3.62 ( 17.35%)        1.19 ( 72.83%)
Min      real-pound_clock_gettime-5         5.40 (  0.00%)        4.76 ( 11.85%)        4.49 ( 16.85%)        0.99 ( 81.67%)
Min      real-pound_clock_gettime-8         5.83 (  0.00%)        4.88 ( 16.30%)        5.91 ( -1.37%)        0.88 ( 84.91%)
Min      real-pound_clock_gettime-12        6.55 (  0.00%)        4.87 ( 25.65%)        5.98 (  8.70%)        0.84 ( 87.18%)
Min      real-pound_clock_gettime-21        7.11 (  0.00%)        5.10 ( 28.27%)        6.63 (  6.75%)        0.85 ( 88.05%)
Min      real-pound_clock_gettime-30        7.56 (  0.00%)        5.20 ( 31.22%)        7.08 (  6.35%)        0.87 ( 88.49%)
Min      real-pound_clock_gettime-48        7.78 (  0.00%)        5.24 ( 32.65%)        7.20 (  7.46%)        0.92 ( 88.17%)
Min      real-pound_clock_gettime-79        7.89 (  0.00%)        5.23 ( 33.71%)        7.20 (  8.75%)        1.00 ( 87.33%)
Min      real-pound_clock_gettime-96        7.88 (  0.00%)        5.24 ( 33.50%)        7.29 (  7.49%)        1.09 ( 86.17%)
Min      real-pound_times-2                 4.87 (  0.00%)        3.19 ( 34.50%)        4.00 ( 17.86%)        2.06 ( 57.70%)
Min      real-pound_times-5                 5.59 (  0.00%)        3.91 ( 30.05%)        4.61 ( 17.53%)        3.61 ( 35.42%)
Min      real-pound_times-8                 5.74 (  0.00%)        4.88 ( 14.98%)        5.80 ( -1.05%)        3.56 ( 37.98%)
Min      real-pound_times-12                6.44 (  0.00%)        4.90 ( 23.91%)        6.00 (  6.83%)        3.52 ( 45.34%)
Min      real-pound_times-21                7.11 (  0.00%)        5.11 ( 28.13%)        6.61 (  7.03%)        3.59 ( 49.51%)
Min      real-pound_times-30                7.60 (  0.00%)        5.20 ( 31.58%)        7.03 (  7.50%)        3.54 ( 53.42%)
Min      real-pound_times-48                7.80 (  0.00%)        5.24 ( 32.82%)        7.20 (  7.69%)        3.61 ( 53.72%)
Min      real-pound_times-79                7.92 (  0.00%)        5.24 ( 33.84%)        7.31 (  7.70%)        3.61 ( 54.42%)
Min      real-pound_times-96                7.94 (  0.00%)        5.24 ( 34.01%)        7.29 (  8.19%)        3.58 ( 54.91%)
Min      syst-pound_clock_gettime-2         8.54 (  0.00%)        4.89 ( 42.74%)        6.98 ( 18.27%)        2.16 ( 74.71%)
Min      syst-pound_clock_gettime-5        26.57 (  0.00%)       23.29 ( 12.34%)       22.09 ( 16.86%)        4.47 ( 83.18%)
Min      syst-pound_clock_gettime-8        45.82 (  0.00%)       38.02 ( 17.02%)       46.61 ( -1.72%)        6.44 ( 85.95%)
Min      syst-pound_clock_gettime-12       77.23 (  0.00%)       56.61 ( 26.70%)       69.25 ( 10.33%)        9.34 ( 87.91%)
Min      syst-pound_clock_gettime-21      147.44 (  0.00%)      103.97 ( 29.48%)      134.76 (  8.60%)       15.12 ( 89.74%)
Min      syst-pound_clock_gettime-30      176.07 (  0.00%)      117.81 ( 33.09%)      162.77 (  7.55%)       15.95 ( 90.94%)
Min      syst-pound_clock_gettime-48      182.93 (  0.00%)      119.92 ( 34.44%)      168.06 (  8.13%)       19.82 ( 89.17%)
Min      syst-pound_clock_gettime-79      186.13 (  0.00%)      123.31 ( 33.75%)      170.34 (  8.48%)       22.90 ( 87.70%)
Min      syst-pound_clock_gettime-96      187.05 (  0.00%)      124.22 ( 33.59%)      172.67 (  7.69%)       25.19 ( 86.53%)
Min      syst-pound_times-2                 9.55 (  0.00%)        6.22 ( 34.87%)        7.80 ( 18.32%)        3.90 ( 59.16%)
Min      syst-pound_times-5                27.68 (  0.00%)       19.24 ( 30.49%)       22.76 ( 17.77%)       17.56 ( 36.56%)
Min      syst-pound_times-8                45.11 (  0.00%)       38.75 ( 14.10%)       45.15 ( -0.09%)       27.77 ( 38.44%)
Min      syst-pound_times-12               76.60 (  0.00%)       56.89 ( 25.73%)       71.06 (  7.23%)       41.64 ( 45.64%)
Min      syst-pound_times-21              145.25 (  0.00%)      102.48 ( 29.45%)      136.15 (  6.27%)       72.98 ( 49.76%)
Min      syst-pound_times-30              175.03 (  0.00%)      118.89 ( 32.07%)      161.32 (  7.83%)       79.91 ( 54.34%)
Min      syst-pound_times-48              183.61 (  0.00%)      121.06 ( 34.07%)      167.26 (  8.90%)       83.24 ( 54.66%)
Min      syst-pound_times-79              187.18 (  0.00%)      123.24 ( 34.16%)      173.22 (  7.46%)       84.36 ( 54.93%)
Min      syst-pound_times-96              188.88 (  0.00%)      124.04 ( 34.33%)      173.52 (  8.13%)       83.02 ( 56.05%)
Amean    real-pound_clock_gettime-2         4.74 (  0.00%)        3.04 ( 35.93%)        4.09 ( 13.81%)        1.30 ( 72.59%)
Amean    real-pound_clock_gettime-5         5.49 (  0.00%)        5.00 (  8.97%)        5.34 (  2.72%)        1.03 ( 81.16%)
Amean    real-pound_clock_gettime-8         6.12 (  0.00%)        4.91 ( 19.73%)        5.97 (  2.40%)        0.90 ( 85.27%)
Amean    real-pound_clock_gettime-12        6.68 (  0.00%)        4.90 ( 26.66%)        6.02 (  9.75%)        0.88 ( 86.89%)
Amean    real-pound_clock_gettime-21        7.21 (  0.00%)        5.13 ( 28.85%)        6.70 (  7.09%)        0.87 ( 87.91%)
Amean    real-pound_clock_gettime-30        7.66 (  0.00%)        5.22 ( 31.80%)        7.17 (  6.39%)        0.92 ( 88.01%)
Amean    real-pound_clock_gettime-48        7.91 (  0.00%)        5.36 ( 32.22%)        7.30 (  7.72%)        0.95 ( 87.97%)
Amean    real-pound_clock_gettime-79        7.95 (  0.00%)        5.35 ( 32.73%)        7.34 (  7.66%)        1.06 ( 86.66%)
Amean    real-pound_clock_gettime-96        8.00 (  0.00%)        5.33 ( 33.31%)        7.38 (  7.73%)        1.13 ( 85.82%)
Amean    real-pound_times-2                 5.61 (  0.00%)        3.55 ( 36.76%)        4.53 ( 19.23%)        2.29 ( 59.28%)
Amean    real-pound_times-5                 5.66 (  0.00%)        4.32 ( 23.79%)        4.75 ( 16.18%)        3.65 ( 35.46%)
Amean    real-pound_times-8                 5.98 (  0.00%)        4.97 ( 16.87%)        5.96 (  0.35%)        3.62 ( 39.40%)
Amean    real-pound_times-12                6.58 (  0.00%)        4.94 ( 24.93%)        6.04 (  8.32%)        3.63 ( 44.89%)
Amean    real-pound_times-21                7.19 (  0.00%)        5.18 ( 28.01%)        6.68 (  7.13%)        3.65 ( 49.22%)
Amean    real-pound_times-30                7.67 (  0.00%)        5.27 ( 31.29%)        7.16 (  6.63%)        3.62 ( 52.76%)
Amean    real-pound_times-48                7.88 (  0.00%)        5.36 ( 31.98%)        7.28 (  7.58%)        3.65 ( 53.71%)
Amean    real-pound_times-79                7.99 (  0.00%)        5.39 ( 32.52%)        7.40 (  7.42%)        3.65 ( 54.25%)
Amean    real-pound_times-96                8.01 (  0.00%)        5.35 ( 33.20%)        7.36 (  8.09%)        3.64 ( 54.49%)
Amean    syst-pound_clock_gettime-2         9.22 (  0.00%)        5.45 ( 40.95%)        7.90 ( 14.32%)        2.34 ( 74.66%)
Amean    syst-pound_clock_gettime-5        27.03 (  0.00%)       24.21 ( 10.40%)       26.24 (  2.90%)        4.73 ( 82.48%)
Amean    syst-pound_clock_gettime-8        48.33 (  0.00%)       38.40 ( 20.55%)       47.11 (  2.52%)        6.64 ( 86.25%)
Amean    syst-pound_clock_gettime-12       78.93 (  0.00%)       57.30 ( 27.41%)       71.04 ( 10.00%)        9.69 ( 87.72%)
Amean    syst-pound_clock_gettime-21      149.27 (  0.00%)      105.34 ( 29.43%)      138.19 (  7.42%)       16.50 ( 88.95%)
Amean    syst-pound_clock_gettime-30      178.36 (  0.00%)      119.83 ( 32.82%)      166.75 (  6.51%)       18.67 ( 89.53%)
Amean    syst-pound_clock_gettime-48      185.77 (  0.00%)      124.80 ( 32.82%)      171.14 (  7.88%)       21.12 ( 88.63%)
Amean    syst-pound_clock_gettime-79      188.17 (  0.00%)      126.34 ( 32.86%)      173.99 (  7.53%)       24.07 ( 87.21%)
Amean    syst-pound_clock_gettime-96      190.24 (  0.00%)      126.63 ( 33.44%)      175.32 (  7.84%)       26.12 ( 86.27%)
Amean    syst-pound_times-2                11.02 (  0.00%)        6.91 ( 37.27%)        8.85 ( 19.68%)        4.36 ( 60.45%)
Amean    syst-pound_times-5                27.99 (  0.00%)       21.31 ( 23.88%)       23.42 ( 16.32%)       17.95 ( 35.87%)
Amean    syst-pound_times-8                47.33 (  0.00%)       39.27 ( 17.04%)       47.16 (  0.35%)       28.56 ( 39.66%)
Amean    syst-pound_times-12               78.24 (  0.00%)       58.26 ( 25.55%)       71.55 (  8.55%)       42.78 ( 45.32%)
Amean    syst-pound_times-21              148.75 (  0.00%)      106.28 ( 28.55%)      138.22 (  7.08%)       74.25 ( 50.09%)
Amean    syst-pound_times-30              177.74 (  0.00%)      121.16 ( 31.83%)      166.70 (  6.21%)       81.82 ( 53.96%)
Amean    syst-pound_times-48              184.85 (  0.00%)      125.37 ( 32.18%)      170.87 (  7.56%)       84.20 ( 54.45%)
Amean    syst-pound_times-79              189.50 (  0.00%)      127.45 ( 32.74%)      175.58 (  7.34%)       86.01 ( 54.61%)
Amean    syst-pound_times-96              190.56 (  0.00%)      127.11 ( 33.30%)      175.08 (  8.12%)       86.03 ( 54.85%)
Stddev   real-pound_clock_gettime-2         0.25 (  0.00%)        0.27 ( -7.76%)        0.41 (-65.62%)        0.10 ( 60.73%)
Stddev   real-pound_clock_gettime-5         0.07 (  0.00%)        0.09 (-35.10%)        0.51 (-674.46%)        0.05 ( 26.28%)
Stddev   real-pound_clock_gettime-8         0.28 (  0.00%)        0.02 ( 92.09%)        0.04 ( 86.10%)        0.02 ( 93.65%)
Stddev   real-pound_clock_gettime-12        0.08 (  0.00%)        0.02 ( 78.31%)        0.04 ( 52.02%)        0.02 ( 78.95%)
Stddev   real-pound_clock_gettime-21        0.06 (  0.00%)        0.02 ( 68.54%)        0.11 (-70.01%)        0.01 ( 78.27%)
Stddev   real-pound_clock_gettime-30        0.05 (  0.00%)        0.01 ( 75.00%)        0.10 (-98.93%)        0.04 ( 20.82%)
Stddev   real-pound_clock_gettime-48        0.09 (  0.00%)        0.19 (-106.51%)        0.08 ( 15.24%)        0.04 ( 58.70%)
Stddev   real-pound_clock_gettime-79        0.03 (  0.00%)        0.10 (-191.56%)        0.08 (-138.02%)        0.04 (-21.18%)
Stddev   real-pound_clock_gettime-96        0.05 (  0.00%)        0.08 (-56.69%)        0.07 (-21.04%)        0.04 ( 31.40%)
Stddev   real-pound_times-2                 0.55 (  0.00%)        0.25 ( 53.82%)        0.38 ( 30.80%)        0.14 ( 74.19%)
Stddev   real-pound_times-5                 0.06 (  0.00%)        0.28 (-358.77%)        0.13 (-108.26%)        0.03 ( 54.64%)
Stddev   real-pound_times-8                 0.25 (  0.00%)        0.04 ( 83.52%)        0.06 ( 76.99%)        0.06 ( 76.94%)
Stddev   real-pound_times-12                0.09 (  0.00%)        0.05 ( 41.52%)        0.02 ( 77.55%)        0.04 ( 51.60%)
Stddev   real-pound_times-21                0.06 (  0.00%)        0.15 (-141.91%)        0.11 (-74.22%)        0.03 ( 48.73%)
Stddev   real-pound_times-30                0.06 (  0.00%)        0.14 (-129.04%)        0.10 (-66.59%)        0.04 ( 30.36%)
Stddev   real-pound_times-48                0.05 (  0.00%)        0.13 (-151.20%)        0.07 (-37.30%)        0.02 ( 54.64%)
Stddev   real-pound_times-79                0.04 (  0.00%)        0.11 (-205.48%)        0.07 (-97.82%)        0.03 ( 28.17%)
Stddev   real-pound_times-96                0.05 (  0.00%)        0.05 ( -1.83%)        0.04 ( 24.17%)        0.04 ( 20.00%)
Stddev   syst-pound_clock_gettime-2         0.47 (  0.00%)        0.45 (  4.96%)        0.79 (-66.33%)        0.18 ( 61.36%)
Stddev   syst-pound_clock_gettime-5         0.32 (  0.00%)        0.39 (-20.09%)        2.49 (-666.63%)        0.25 ( 21.71%)
Stddev   syst-pound_clock_gettime-8         2.25 (  0.00%)        0.26 ( 88.54%)        0.40 ( 82.10%)        0.17 ( 92.55%)
Stddev   syst-pound_clock_gettime-12        1.23 (  0.00%)        0.43 ( 64.59%)        0.73 ( 40.82%)        0.19 ( 84.58%)
Stddev   syst-pound_clock_gettime-21        1.15 (  0.00%)        1.06 (  7.62%)        2.64 (-129.56%)        0.66 ( 42.45%)
Stddev   syst-pound_clock_gettime-30        1.34 (  0.00%)        1.26 (  6.25%)        2.69 (-99.81%)        1.58 (-17.86%)
Stddev   syst-pound_clock_gettime-48        2.52 (  0.00%)        4.85 (-92.44%)        2.12 ( 15.94%)        1.08 ( 57.23%)
Stddev   syst-pound_clock_gettime-79        1.22 (  0.00%)        2.51 (-105.82%)        1.99 (-62.56%)        0.96 ( 21.62%)
Stddev   syst-pound_clock_gettime-96        1.54 (  0.00%)        2.21 (-43.34%)        1.74 (-12.67%)        0.80 ( 48.24%)
Stddev   syst-pound_times-2                 1.09 (  0.00%)        0.50 ( 53.61%)        0.76 ( 30.43%)        0.28 ( 74.11%)
Stddev   syst-pound_times-5                 0.30 (  0.00%)        1.41 (-367.82%)        0.65 (-115.62%)        0.21 ( 29.66%)
Stddev   syst-pound_times-8                 2.12 (  0.00%)        0.27 ( 87.24%)        0.71 ( 66.44%)        0.55 ( 73.94%)
Stddev   syst-pound_times-12                1.03 (  0.00%)        0.74 ( 27.70%)        0.37 ( 64.41%)        0.47 ( 54.81%)
Stddev   syst-pound_times-21                1.60 (  0.00%)        3.07 (-92.49%)        2.30 (-43.99%)        0.93 ( 41.93%)
Stddev   syst-pound_times-30                1.75 (  0.00%)        3.05 (-74.55%)        2.84 (-62.67%)        1.17 ( 32.95%)
Stddev   syst-pound_times-48                0.79 (  0.00%)        3.36 (-327.41%)        2.51 (-219.14%)        0.51 ( 34.63%)
Stddev   syst-pound_times-79                1.08 (  0.00%)        2.77 (-156.12%)        1.84 (-70.34%)        0.86 ( 20.82%)
Stddev   syst-pound_times-96                1.19 (  0.00%)        1.35 (-13.61%)        1.01 ( 15.16%)        1.29 ( -8.56%)
CoeffVar real-pound_clock_gettime-2         5.19 (  0.00%)        8.73 (-68.19%)        9.97 (-92.16%)        7.43 (-43.23%)
CoeffVar real-pound_clock_gettime-5         1.19 (  0.00%)        1.77 (-48.40%)        9.49 (-696.07%)        4.66 (-291.28%)
CoeffVar real-pound_clock_gettime-8         4.53 (  0.00%)        0.45 ( 90.14%)        0.64 ( 85.76%)        1.95 ( 56.89%)
CoeffVar real-pound_clock_gettime-12        1.24 (  0.00%)        0.37 ( 70.42%)        0.66 ( 46.83%)        2.00 (-60.60%)
CoeffVar real-pound_clock_gettime-21        0.88 (  0.00%)        0.39 ( 55.78%)        1.61 (-82.98%)        1.58 (-79.84%)
CoeffVar real-pound_clock_gettime-30        0.68 (  0.00%)        0.25 ( 63.35%)        1.44 (-112.50%)        4.49 (-560.29%)
CoeffVar real-pound_clock_gettime-48        1.18 (  0.00%)        3.61 (-204.68%)        1.09 (  8.14%)        4.06 (-243.23%)
CoeffVar real-pound_clock_gettime-79        0.43 (  0.00%)        1.85 (-333.44%)        1.10 (-157.77%)        3.87 (-808.42%)
CoeffVar real-pound_clock_gettime-96        0.68 (  0.00%)        1.59 (-134.97%)        0.89 (-31.18%)        3.28 (-383.77%)
CoeffVar real-pound_times-2                 9.79 (  0.00%)        7.15 ( 26.98%)        8.39 ( 14.33%)        6.21 ( 36.61%)
CoeffVar real-pound_times-5                 1.06 (  0.00%)        6.39 (-501.98%)        2.64 (-148.46%)        0.75 ( 29.71%)
CoeffVar real-pound_times-8                 4.24 (  0.00%)        0.84 ( 80.17%)        0.98 ( 76.91%)        1.61 ( 61.95%)
CoeffVar real-pound_times-12                1.29 (  0.00%)        1.01 ( 22.11%)        0.32 ( 75.51%)        1.14 ( 12.18%)
CoeffVar real-pound_times-21                0.87 (  0.00%)        2.91 (-236.03%)        1.63 (-87.60%)        0.87 ( -0.97%)
CoeffVar real-pound_times-30                0.78 (  0.00%)        2.62 (-233.35%)        1.40 (-78.41%)        1.16 (-47.41%)
CoeffVar real-pound_times-48                0.65 (  0.00%)        2.40 (-269.32%)        0.97 (-48.56%)        0.64 (  2.00%)
CoeffVar real-pound_times-79                0.45 (  0.00%)        2.03 (-352.70%)        0.96 (-113.68%)        0.71 (-57.00%)
CoeffVar real-pound_times-96                0.61 (  0.00%)        0.93 (-52.43%)        0.50 ( 17.50%)        1.07 (-75.79%)
CoeffVar syst-pound_clock_gettime-2         5.12 (  0.00%)        8.25 (-60.95%)        9.95 (-94.12%)        7.81 (-52.47%)
CoeffVar syst-pound_clock_gettime-5         1.20 (  0.00%)        1.61 (-34.04%)        9.48 (-689.57%)        5.37 (-346.99%)
CoeffVar syst-pound_clock_gettime-8         4.66 (  0.00%)        0.67 ( 85.58%)        0.86 ( 81.64%)        2.53 ( 45.79%)
CoeffVar syst-pound_clock_gettime-12        1.56 (  0.00%)        0.76 ( 51.21%)        1.02 ( 34.25%)        1.95 (-25.60%)
CoeffVar syst-pound_clock_gettime-21        0.77 (  0.00%)        1.01 (-30.89%)        1.91 (-147.96%)        4.01 (-420.63%)
CoeffVar syst-pound_clock_gettime-30        0.75 (  0.00%)        1.05 (-39.54%)        1.61 (-113.72%)        8.48 (-1026.12%)
CoeffVar syst-pound_clock_gettime-48        1.36 (  0.00%)        3.89 (-186.46%)        1.24 (  8.75%)        5.11 (-276.18%)
CoeffVar syst-pound_clock_gettime-79        0.65 (  0.00%)        1.99 (-206.55%)        1.14 (-75.81%)        3.98 (-512.73%)
CoeffVar syst-pound_clock_gettime-96        0.81 (  0.00%)        1.74 (-115.35%)        0.99 (-22.26%)        3.05 (-277.01%)
CoeffVar syst-pound_times-2                 9.86 (  0.00%)        7.29 ( 26.04%)        8.54 ( 13.39%)        6.45 ( 34.55%)
CoeffVar syst-pound_times-5                 1.08 (  0.00%)        6.62 (-514.60%)        2.78 (-157.67%)        1.18 ( -9.68%)
CoeffVar syst-pound_times-8                 4.48 (  0.00%)        0.69 ( 84.62%)        1.51 ( 66.32%)        1.94 ( 56.81%)
CoeffVar syst-pound_times-12                1.32 (  0.00%)        1.28 (  2.89%)        0.51 ( 61.08%)        1.09 ( 17.35%)
CoeffVar syst-pound_times-21                1.07 (  0.00%)        2.89 (-169.42%)        1.66 (-54.96%)        1.25 (-16.34%)
CoeffVar syst-pound_times-30                0.98 (  0.00%)        2.52 (-156.06%)        1.71 (-73.43%)        1.43 (-45.64%)
CoeffVar syst-pound_times-48                0.43 (  0.00%)        2.68 (-530.20%)        1.47 (-245.25%)        0.61 (-43.50%)
CoeffVar syst-pound_times-79                0.57 (  0.00%)        2.17 (-280.81%)        1.05 (-83.84%)        1.00 (-74.46%)
CoeffVar syst-pound_times-96                0.63 (  0.00%)        1.07 (-70.33%)        0.58 (  7.66%)        1.50 (-140.44%)
Max      real-pound_clock_gettime-2         5.10 (  0.00%)        3.56 ( 30.20%)        4.98 (  2.35%)        1.47 ( 71.18%)
Max      real-pound_clock_gettime-5         5.59 (  0.00%)        5.10 (  8.77%)        6.00 ( -7.33%)        1.17 ( 79.07%)
Max      real-pound_clock_gettime-8         6.82 (  0.00%)        4.95 ( 27.42%)        6.02 ( 11.73%)        0.93 ( 86.36%)
Max      real-pound_clock_gettime-12        6.82 (  0.00%)        4.93 ( 27.71%)        6.13 ( 10.12%)        0.90 ( 86.80%)
Max      real-pound_clock_gettime-21        7.33 (  0.00%)        5.17 ( 29.47%)        7.01 (  4.37%)        0.89 ( 87.86%)
Max      real-pound_clock_gettime-30        7.71 (  0.00%)        5.24 ( 32.04%)        7.38 (  4.28%)        1.00 ( 87.03%)
Max      real-pound_clock_gettime-48        8.11 (  0.00%)        5.86 ( 27.74%)        7.47 (  7.89%)        1.05 ( 87.05%)
Max      real-pound_clock_gettime-79        8.03 (  0.00%)        5.53 ( 31.13%)        7.48 (  6.85%)        1.13 ( 85.93%)
Max      real-pound_clock_gettime-96        8.05 (  0.00%)        5.55 ( 31.06%)        7.51 (  6.71%)        1.21 ( 84.97%)
Max      real-pound_times-2                 6.66 (  0.00%)        3.89 ( 41.59%)        5.23 ( 21.47%)        2.56 ( 61.56%)
Max      real-pound_times-5                 5.77 (  0.00%)        4.96 ( 14.04%)        5.01 ( 13.17%)        3.69 ( 36.05%)
Max      real-pound_times-8                 6.42 (  0.00%)        5.04 ( 21.50%)        6.02 (  6.23%)        3.72 ( 42.06%)
Max      real-pound_times-12                6.69 (  0.00%)        5.07 ( 24.22%)        6.07 (  9.27%)        3.67 ( 45.14%)
Max      real-pound_times-21                7.32 (  0.00%)        5.63 ( 23.09%)        7.00 (  4.37%)        3.68 ( 49.73%)
Max      real-pound_times-30                7.78 (  0.00%)        5.68 ( 26.99%)        7.36 (  5.40%)        3.66 ( 52.96%)
Max      real-pound_times-48                7.98 (  0.00%)        5.58 ( 30.08%)        7.41 (  7.14%)        3.68 ( 53.88%)
Max      real-pound_times-79                8.05 (  0.00%)        5.61 ( 30.31%)        7.53 (  6.46%)        3.69 ( 54.16%)
Max      real-pound_times-96                8.08 (  0.00%)        5.42 ( 32.92%)        7.42 (  8.17%)        3.71 ( 54.08%)
Max      syst-pound_clock_gettime-2         9.91 (  0.00%)        6.30 ( 36.43%)        9.64 (  2.72%)        2.68 ( 72.96%)
Max      syst-pound_clock_gettime-5        27.53 (  0.00%)       24.74 ( 10.13%)       29.35 ( -6.61%)        5.43 ( 80.28%)
Max      syst-pound_clock_gettime-8        53.96 (  0.00%)       38.82 ( 28.06%)       47.75 ( 11.51%)        6.99 ( 87.05%)
Max      syst-pound_clock_gettime-12       81.09 (  0.00%)       57.99 ( 28.49%)       71.93 ( 11.30%)       10.04 ( 87.62%)
Max      syst-pound_clock_gettime-21      151.50 (  0.00%)      107.03 ( 29.35%)      145.33 (  4.07%)       17.48 ( 88.46%)
Max      syst-pound_clock_gettime-30      179.94 (  0.00%)      121.68 ( 32.38%)      172.10 (  4.36%)       21.29 ( 88.17%)
Max      syst-pound_clock_gettime-48      191.29 (  0.00%)      136.82 ( 28.48%)      174.84 (  8.60%)       23.80 ( 87.56%)
Max      syst-pound_clock_gettime-79      190.22 (  0.00%)      130.28 ( 31.51%)      177.26 (  6.81%)       25.71 ( 86.48%)
Max      syst-pound_clock_gettime-96      192.02 (  0.00%)      132.27 ( 31.12%)      178.26 (  7.17%)       27.66 ( 85.60%)
Max      syst-pound_times-2                13.10 (  0.00%)        7.57 ( 42.21%)       10.21 ( 22.06%)        4.89 ( 62.67%)
Max      syst-pound_times-5                28.56 (  0.00%)       24.55 ( 14.04%)       24.80 ( 13.17%)       18.20 ( 36.27%)
Max      syst-pound_times-8                50.89 (  0.00%)       39.54 ( 22.30%)       47.78 (  6.11%)       29.45 ( 42.13%)
Max      syst-pound_times-12               79.85 (  0.00%)       59.80 ( 25.11%)       72.21 (  9.57%)       43.27 ( 45.81%)
Max      syst-pound_times-21              151.33 (  0.00%)      115.02 ( 23.99%)      144.60 (  4.45%)       75.85 ( 49.88%)
Max      syst-pound_times-30              180.79 (  0.00%)      130.12 ( 28.03%)      171.98 (  4.87%)       83.31 ( 53.92%)
Max      syst-pound_times-48              186.61 (  0.00%)      130.89 ( 29.86%)      174.40 (  6.54%)       84.85 ( 54.53%)
Max      syst-pound_times-79              190.96 (  0.00%)      133.09 ( 30.30%)      179.58 (  5.96%)       87.17 ( 54.35%)
Max      syst-pound_times-96              192.42 (  0.00%)      128.95 ( 32.99%)      177.09 (  7.97%)       87.82 ( 54.36%)

             vanilla       rever     prefetc         mas
                 4.7      revert    prefetch        mask
User           54.91       73.30       56.08       47.56
System      21115.14    14616.16    19553.36     6360.52
Elapsed      1247.71      890.24     1149.26      409.20

                               vanilla       rever     prefetc         mas
                                   4.7      revert    prefetch        mask
Minor Faults                    291321      267632      324632      274236
Major Faults                       196         272         279         279
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Allocation stalls                    0           0           0           0
DMA allocs                           0           0           0           0
DMA32 allocs                     12836       11773       23439       21745
Normal allocs                   252492      245667      302327      270404
Movable allocs                       0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Zone normal velocity             0.000       0.000       0.000       0.000
Zone dma32 velocity              0.000       0.000       0.000       0.000
Zone dma velocity                0.000       0.000       0.000       0.000
Page writes by reclaim           0.000       0.000       0.000       0.000
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Sector Reads                     24440       38464      144944      143876
Sector Writes                   569300       12712       16036        6956
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                      0           0           0           0
THP collapse alloc                   0           0           0           0
THP splits                           0           0           0           0
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success             11177       10858       14598        9857
Page migrate failure                 0           2           1           1
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                     11          11          15          10
NUMA alloc hit                  237281      229068      296261      263464
NUMA alloc miss                      7           5           5           6
NUMA interleave hit                  0           0           0           0
NUMA alloc local                237281      229068      296261      263464
NUMA base PTE updates            25433       20398       35883       22264
NUMA huge PMD updates                0           0           0           0
NUMA page range updates          25433       20398       35883       22264
NUMA hint faults                 23242       18097       31026       17002
NUMA hint local faults           10012        6038       14657        6903
NUMA hint local percent             43          33          47          40
NUMA pages migrated              11177       10858       14598        9857
AutoNUMA cost                     116%         90%        155%         85%

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-12 12:10     ` Stanislaw Gruszka
@ 2016-08-15  7:49       ` Giovanni Gherdovich
  2016-08-15  8:33         ` Mel Gorman
  2016-08-15  9:13       ` Wanpeng Li
  1 sibling, 1 reply; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-15  7:49 UTC (permalink / raw)
  To: Stanislaw Gruszka, Ingo Molnar
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel,
	Mel Gorman

Hello Stanislaw,

On Fri, 2016-08-12 at 14:10 +0200, Stanislaw Gruszka wrote:
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
>  
> +	(void) task_sched_runtime(tsk);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  			task_cputime(t, &utime, &stime);
>  			times->utime += utime;
>  			times->stime += stime;
> -			times->sum_exec_runtime += task_sched_runtime(t);
> +			times->sum_exec_runtime += t->se.sum_exec_runtime;
>  		}
>  		/* If lockless access failed, take the lock. */
>  		nextseq = 1;
> ---
> mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> 
> vanilla-4.7           revert                prefetch              patch
> 4.74 (  0.00%)        3.04 ( 35.93%)        4.09 ( 13.81%)        1.30 ( 72.59%)
> 5.49 (  0.00%)        5.00 (  8.97%)        5.34 (  2.72%)        1.03 ( 81.16%)
> 6.12 (  0.00%)        4.91 ( 19.73%)        5.97 (  2.40%)        0.90 ( 85.27%)
> 6.68 (  0.00%)        4.90 ( 26.66%)        6.02 (  9.75%)        0.88 ( 86.89%)
> 7.21 (  0.00%)        5.13 ( 28.85%)        6.70 (  7.09%)        0.87 ( 87.91%)
> 7.66 (  0.00%)        5.22 ( 31.80%)        7.17 (  6.39%)        0.92 ( 88.01%)
> 7.91 (  0.00%)        5.36 ( 32.22%)        7.30 (  7.72%)        0.95 ( 87.97%)
> 7.95 (  0.00%)        5.35 ( 32.73%)        7.34 (  7.66%)        1.06 ( 86.66%)
> 8.00 (  0.00%)        5.33 ( 33.31%)        7.38 (  7.73%)        1.13 ( 85.82%)
> 5.61 (  0.00%)        3.55 ( 36.76%)        4.53 ( 19.23%)        2.29 ( 59.28%)
> 5.66 (  0.00%)        4.32 ( 23.79%)        4.75 ( 16.18%)        3.65 ( 35.46%)
> 5.98 (  0.00%)        4.97 ( 16.87%)        5.96 (  0.35%)        3.62 ( 39.40%)
> 6.58 (  0.00%)        4.94 ( 24.93%)        6.04 (  8.32%)        3.63 ( 44.89%)
> 7.19 (  0.00%)        5.18 ( 28.01%)        6.68 (  7.13%)        3.65 ( 49.22%)
> 7.67 (  0.00%)        5.27 ( 31.29%)        7.16 (  6.63%)        3.62 ( 52.76%)
> 7.88 (  0.00%)        5.36 ( 31.98%)        7.28 (  7.58%)        3.65 ( 53.71%)
> 7.99 (  0.00%)        5.39 ( 32.52%)        7.40 (  7.42%)        3.65 ( 54.25%)
> 
> The patch works because we update sum_exec_runtime on the current thread,
> which ensures we see the proper sum_exec_runtime value on different CPUs. I
> tested it with the reproducers from commits 6e998916dfe32 and d670ec13178d0;
> the patch did not break them. I'm going to run some other tests.
> 
> The patch is a draft version for early review; task_sched_runtime() will be
> simplified (since it's called only on the current thread) and possibly split
> into two functions: one that calls update_curr() and another that returns
> sum_exec_runtime (ensuring it's consistent on 32-bit arches).
> 
> Stanislaw

Thank you for having a look at this.
Your patch performs very well, even better than the pre-6e998916dfe3
numbers I was aiming for. I confirm your results on my test machine
(Sandy Bridge, 32 cores, 2 NUMA nodes).
I didn't apply it to the very latest 4.8-rc but used what I had handy for
comparison (i.e. 4.7-rc7 and the parent of 6e998916dfe3).
As I said, my measurements match yours (my tables follow); looks like
your diff cures the problem while mine cures the symptoms.

clock_gettime():

threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + Stanislaw
                       (pre-6e998916dfe3)
2          3.48        2.23 ( 35.68%)        3.06 ( 11.83%)        1.08 ( 68.81%)
5          3.33        2.83 ( 14.84%)        3.25 (  2.40%)        0.71 ( 78.55%)
8          3.37        2.84 ( 15.80%)        3.26 (  3.30%)        0.56 ( 83.49%)
12         3.32        3.09 (  6.69%)        3.37 ( -1.60%)        0.42 ( 87.28%)
21         4.01        3.14 ( 21.70%)        3.90 (  2.74%)        0.35 ( 91.35%)
30         3.63        3.28 (  9.75%)        3.36 (  7.41%)        0.28 ( 92.23%)
48         3.71        3.02 ( 18.69%)        3.11 ( 16.27%)        0.39 ( 89.39%)
79         3.75        2.88 ( 23.23%)        3.16 ( 15.74%)        0.46 ( 87.76%)
110        3.81        2.95 ( 22.62%)        3.25 ( 14.80%)        0.56 ( 85.41%)
128        3.88        3.05 ( 21.28%)        3.31 ( 14.76%)        0.62 ( 84.10%)

times():

threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + Stanislaw
                       (pre-6e998916dfe3)
2          3.65        2.27 ( 37.94%)        3.25 ( 11.03%)        1.62 ( 55.71%)
5          3.45        2.78 ( 19.34%)        3.17 (  7.92%)        2.33 ( 32.28%)
8          3.52        2.79 ( 20.66%)        3.22 (  8.69%)        2.06 ( 41.44%)
12         3.29        3.02 (  8.33%)        3.36 ( -2.04%)        2.00 ( 39.18%)
21         4.07        3.10 ( 23.86%)        3.92 (  3.78%)        2.07 ( 49.18%)
30         3.87        3.33 ( 13.80%)        3.40 ( 12.17%)        1.89 ( 51.12%)
48         3.79        2.96 ( 21.94%)        3.16 ( 16.61%)        1.69 ( 55.46%)
79         3.88        2.88 ( 25.82%)        3.28 ( 15.42%)        1.60 ( 58.81%)
110        3.90        2.98 ( 23.73%)        3.38 ( 13.35%)        1.73 ( 55.61%)
128        4.00        3.10 ( 22.40%)        3.38 ( 15.45%)        1.66 ( 58.52%)
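
For readers checking the tables: the bracketed figures appear to be the
relative gain over the 4.7-rc7 baseline for a "lower is better" metric. A
minimal sketch of that arithmetic follows; the function name gain_percent()
is hypothetical, and the formula is an assumption inferred from the numbers
rather than taken from the mmtests sources. Because the printed means are
rounded, recomputing from the table only matches to within a fraction of a
percent.

```c
#include <assert.h>

/*
 * Relative gain over a baseline for a "lower is better" metric, in percent.
 * Positive means less time than the baseline, negative means more.
 * Assumption: this mirrors how the bracketed figures were derived.
 */
double gain_percent(double baseline, double value)
{
	assert(baseline > 0.0);
	return (baseline - value) / baseline * 100.0;
}
```

For the first clock_gettime() row, gain_percent(3.48, 1.08) comes out close
to the 68.81% shown in the table.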


Regards,
Giovanni

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-15  7:49       ` Giovanni Gherdovich
@ 2016-08-15  8:33         ` Mel Gorman
  2016-08-15  9:19           ` Stanislaw Gruszka
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-08-15  8:33 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Stanislaw Gruszka, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> > 
> > vanilla-4.7           revert                prefetch              patch
> > 4.74 (  0.00%)        3.04 ( 35.93%)        4.09 ( 13.81%)        1.30 ( 72.59%)
> > 5.49 (  0.00%)        5.00 (  8.97%)        5.34 (  2.72%)        1.03 ( 81.16%)
> > 6.12 (  0.00%)        4.91 ( 19.73%)        5.97 (  2.40%)        0.90 ( 85.27%)
> > 6.68 (  0.00%)        4.90 ( 26.66%)        6.02 (  9.75%)        0.88 ( 86.89%)
> > 7.21 (  0.00%)        5.13 ( 28.85%)        6.70 (  7.09%)        0.87 ( 87.91%)
> > 7.66 (  0.00%)        5.22 ( 31.80%)        7.17 (  6.39%)        0.92 ( 88.01%)
> > 7.91 (  0.00%)        5.36 ( 32.22%)        7.30 (  7.72%)        0.95 ( 87.97%)
> > 7.95 (  0.00%)        5.35 ( 32.73%)        7.34 (  7.66%)        1.06 ( 86.66%)
> > 8.00 (  0.00%)        5.33 ( 33.31%)        7.38 (  7.73%)        1.13 ( 85.82%)
> > 5.61 (  0.00%)        3.55 ( 36.76%)        4.53 ( 19.23%)        2.29 ( 59.28%)
> > 5.66 (  0.00%)        4.32 ( 23.79%)        4.75 ( 16.18%)        3.65 ( 35.46%)
> > 5.98 (  0.00%)        4.97 ( 16.87%)        5.96 (  0.35%)        3.62 ( 39.40%)
> > 6.58 (  0.00%)        4.94 ( 24.93%)        6.04 (  8.32%)        3.63 ( 44.89%)
> > 7.19 (  0.00%)        5.18 ( 28.01%)        6.68 (  7.13%)        3.65 ( 49.22%)
> > 7.67 (  0.00%)        5.27 ( 31.29%)        7.16 (  6.63%)        3.62 ( 52.76%)
> > 7.88 (  0.00%)        5.36 ( 31.98%)        7.28 (  7.58%)        3.65 ( 53.71%)
> > 7.99 (  0.00%)        5.39 ( 32.52%)        7.40 (  7.42%)        3.65 ( 54.25%)
> > 
> > The patch works because we update sum_exec_runtime on the current thread,
> > which ensures we see the proper sum_exec_runtime value on different CPUs. I
> > tested it with the reproducers from commits 6e998916dfe32 and d670ec13178d0;
> > the patch did not break them. I'm going to run some other tests.
> > 
> > The patch is a draft version for early review; task_sched_runtime() will be
> > simplified (since it's called only on the current thread) and possibly split
> > into two functions: one that calls update_curr() and another that returns
> > sum_exec_runtime (ensuring it's consistent on 32-bit arches).
> > 
> > Stanislaw
> 

Is this really equivalent though? It updates one task instead of all
tasks in the group and there is no guarantee that tsk == current.
Glancing at it, it should monotonically increase but it looks like it
would calculate stale data.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-15  8:33         ` Mel Gorman
@ 2016-08-15  9:19           ` Stanislaw Gruszka
  2016-08-15  9:58             ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15  9:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 09:33:49AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > > mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> > > 
> > > vanilla-4.7           revert                prefetch              patch
> > > 4.74 (  0.00%)        3.04 ( 35.93%)        4.09 ( 13.81%)        1.30 ( 72.59%)
> > > 5.49 (  0.00%)        5.00 (  8.97%)        5.34 (  2.72%)        1.03 ( 81.16%)
> > > 6.12 (  0.00%)        4.91 ( 19.73%)        5.97 (  2.40%)        0.90 ( 85.27%)
> > > 6.68 (  0.00%)        4.90 ( 26.66%)        6.02 (  9.75%)        0.88 ( 86.89%)
> > > 7.21 (  0.00%)        5.13 ( 28.85%)        6.70 (  7.09%)        0.87 ( 87.91%)
> > > 7.66 (  0.00%)        5.22 ( 31.80%)        7.17 (  6.39%)        0.92 ( 88.01%)
> > > 7.91 (  0.00%)        5.36 ( 32.22%)        7.30 (  7.72%)        0.95 ( 87.97%)
> > > 7.95 (  0.00%)        5.35 ( 32.73%)        7.34 (  7.66%)        1.06 ( 86.66%)
> > > 8.00 (  0.00%)        5.33 ( 33.31%)        7.38 (  7.73%)        1.13 ( 85.82%)
> > > 5.61 (  0.00%)        3.55 ( 36.76%)        4.53 ( 19.23%)        2.29 ( 59.28%)
> > > 5.66 (  0.00%)        4.32 ( 23.79%)        4.75 ( 16.18%)        3.65 ( 35.46%)
> > > 5.98 (  0.00%)        4.97 ( 16.87%)        5.96 (  0.35%)        3.62 ( 39.40%)
> > > 6.58 (  0.00%)        4.94 ( 24.93%)        6.04 (  8.32%)        3.63 ( 44.89%)
> > > 7.19 (  0.00%)        5.18 ( 28.01%)        6.68 (  7.13%)        3.65 ( 49.22%)
> > > 7.67 (  0.00%)        5.27 ( 31.29%)        7.16 (  6.63%)        3.62 ( 52.76%)
> > > 7.88 (  0.00%)        5.36 ( 31.98%)        7.28 (  7.58%)        3.65 ( 53.71%)
> > > 7.99 (  0.00%)        5.39 ( 32.52%)        7.40 (  7.42%)        3.65 ( 54.25%)
> > > 
> > > The patch works because we update sum_exec_runtime on the current thread,
> > > which ensures we see the proper sum_exec_runtime value on different CPUs. I
> > > tested it with the reproducers from commits 6e998916dfe32 and d670ec13178d0;
> > > the patch did not break them. I'm going to run some other tests.
> > > 
> > > The patch is a draft version for early review; task_sched_runtime() will be
> > > simplified (since it's called only on the current thread) and possibly split
> > > into two functions: one that calls update_curr() and another that returns
> > > sum_exec_runtime (ensuring it's consistent on 32-bit arches).
> > > 
> > > Stanislaw
> > 
> 
> Is this really equivalent though? It updates one task instead of all
> tasks in the group and there is no guarantee that tsk == current.

Oh, my intention was to update runtime on current.

> Glancing at it, it should monotonically increase but it looks like it
> would calculate stale data.

Yes, until the next tick on a CPU, the patch does not count the partial
runtime of a thread running on that CPU. However, that was the behaviour
before commit d670ec13178d0; that is how the old thread_group_sched_runtime()
function worked:

 /*
- * Return sum_exec_runtime for the thread group.
- * In case the task is currently running, return the sum plus current's
- * pending runtime that have not been accounted yet.
- *
- * Note that the thread group might have other running tasks as well,
- * so the return value not includes other pending runtime that other
- * running tasks might have.
- */
-unsigned long long thread_group_sched_runtime(struct task_struct *p)
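
To make the difference concrete, here is a user-space toy model. It is only
a sketch: struct toy_task and the toy_*() helpers are hypothetical stand-ins
and not kernel code. Each thread has an accounted sum plus a pending delta
that has not been folded in since the last tick. Updating only the caller
leaves the other running threads' pending time out of the total, while
updating every thread includes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a thread; not the kernel's task_struct. */
struct toy_task {
	uint64_t sum_exec_runtime;	/* ns accounted at the last update */
	uint64_t pending;		/* ns run since then, not yet accounted */
};

/* Models the effect of update_curr(): fold the pending delta into the sum. */
void toy_update_curr(struct toy_task *t)
{
	t->sum_exec_runtime += t->pending;
	t->pending = 0;
}

/* Caller-only variant: update the calling thread, then read stored sums. */
uint64_t toy_group_runtime_caller_only(struct toy_task *grp, size_t n,
				       size_t caller)
{
	uint64_t sum = 0;

	assert(caller < n);
	toy_update_curr(&grp[caller]);
	for (size_t i = 0; i < n; i++)
		sum += grp[i].sum_exec_runtime;
	return sum;
}

/* Update-all variant: account every thread's pending delta before summing. */
uint64_t toy_group_runtime_all(struct toy_task *grp, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		toy_update_curr(&grp[i]);
		sum += grp[i].sum_exec_runtime;
	}
	return sum;
}
```

With two threads at {100, pending 10} and {200, pending 30}, the caller-only
variant called from the first thread yields 310, and the update-all variant
yields 340; the 30 ns gap is the other thread's unaccounted runtime.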

Stanislaw

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-15  9:19           ` Stanislaw Gruszka
@ 2016-08-15  9:58             ` Mel Gorman
  2016-08-15 10:29               ` Stanislaw Gruszka
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-08-15  9:58 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > Is this really equivalent though? It updates one task instead of all
> > tasks in the group and there is no guarantee that tsk == current.
> 
> Oh, my intention was to update runtime on current.
> 

Ok, so minimally that would need addressing. However, then I would worry
that two tasks in a group calling the function at the same time would
see different results because each of them updated a different task.
Such a situation is inherently race-prone anyway but it's a large enough
functional difference to be worth calling out.

Minimally, I don't think such a patch is a replacement for Giovanni's
which is functionally equivalent to the current code but could be layered
on top if it is proven to be ok.

> > Glancing at it, it should monotonically increase but it looks like it
> > would calculate stale data.
> 
> Yes, until the next tick on a CPU, the patch does not count the partial
> runtime of the thread running on that CPU. However, that was the behaviour
> before commit d670ec13178d0; that is how the old thread_group_sched_runtime()
> function worked:
> 

Sure, but does this patch not reintroduce the "SMP wobble" and the
problem of "the diff of 'process' should always be >= the diff of
'thread'" ?
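
The invariant quoted above can be written down as a small check. This is
only a sketch: the struct and function names are hypothetical, and it
assumes both clocks are monotonic and that each pair of readings is taken
at roughly the same instant (for example with clock_gettime() on
CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID).

```c
#include <stdbool.h>
#include <stdint.h>

/* One pair of CPU-time readings, in nanoseconds (hypothetical type). */
struct cpu_sample {
	uint64_t process_ns;	/* CPU time of the whole process */
	uint64_t thread_ns;	/* CPU time of the calling thread */
};

/*
 * A thread is part of its process, so between two samples the process
 * clock must advance at least as much as the thread clock.
 * Assumes neither clock went backwards between the two samples.
 */
static bool process_diff_covers_thread_diff(struct cpu_sample first,
					    struct cpu_sample second)
{
	uint64_t process_diff = second.process_ns - first.process_ns;
	uint64_t thread_diff = second.thread_ns - first.thread_ns;

	return process_diff >= thread_diff;
}
```

A reproducer would sample both clocks in a loop and report any pair of
consecutive samples for which this check fails.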

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-15  9:58             ` Mel Gorman
@ 2016-08-15 10:29               ` Stanislaw Gruszka
  0 siblings, 0 replies; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 10:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel

On Mon, Aug 15, 2016 at 10:58:04AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > > Is this really equivalent though? It updates one task instead of all
> > > tasks in the group and there is no guarantee that tsk == current.
> > 
> > Oh, my intention was to update runtime on current.
> > 
> 
> Ok, so minimally that would need addressing. However, then I would worry
> that two tasks in a group calling the function at the same time would
> see different results because each of them updated a different task.
> Such a situation is inherently race-prone anyway but it's a large enough
> functional difference to be worth calling out.

It races because we don't know which thread will call clock_gettime()
first. But once that happens, the second thread will see the updated runtime
value from the first thread, as we call update_curr() for it with
task_rq_lock held (a change from commit 6e998916dfe3).

> Minimally, I don't think such a patch is a replacement for Giovanni's
> which is functionally equivalent to the current code but could be layered
> on top if it is proven to be ok.

I agree. I wanted to post my patch on top of Giovanni's.

> > > Glancing at it, it should monotonically increase but it looks like it
> > > would calculate stale data.
> > 
> > Yes, until the next tick on a CPU, the patch does not count the partial
> > runtime of the thread running on that CPU. However, that was the behaviour
> > before commit d670ec13178d0; that is how the old thread_group_sched_runtime()
> > function worked:
> > 
> 
> Sure, but does this patch not reintroduce the "SMP wobble" and the
> problem of "the diff of 'process' should always be >= the diff of
> 'thread'" ?

It should not reintroduce that problem, also because of the change from
commit 6e998916dfe3. When a thread reads sum_exec_runtime it also
updates that value, so the process then reads the updated value. I ran the
test case from the "SMP wobble" commit and the problem does not happen
in my tests.

Perhaps I should post the patch with a descriptive changelog and things
would be clearer ...

Stanislaw

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-12 12:10     ` Stanislaw Gruszka
  2016-08-15  7:49       ` Giovanni Gherdovich
@ 2016-08-15  9:13       ` Wanpeng Li
  2016-08-15  9:21         ` Stanislaw Gruszka
  1 sibling, 1 reply; 14+ messages in thread
From: Wanpeng Li @ 2016-08-15  9:13 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel@vger.kernel.org, Mel Gorman

2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> Hi
>
> On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> Nice detective work! I'm wondering, where do we stand if compared with a
>> pre-6e998916dfe3 kernel?
>>
>> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
>> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
>> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
>> make the result correct, just see what the performance gap is, roughly.
>>
>> If there's still a significant gap then it might make sense to optimize this some
>> more.
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>         unsigned int seq, nextseq;
>         unsigned long flags;
>
> +       (void) task_sched_runtime(tsk);
> +
>         rcu_read_lock();
>         /* Attempt a lockless read on the first round. */
>         nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>                         task_cputime(t, &utime, &stime);
>                         times->utime += utime;
>                         times->stime += stime;
> -                       times->sum_exec_runtime += task_sched_runtime(t);
> +                       times->sum_exec_runtime += t->se.sum_exec_runtime;

Will this not have updated stats for other threads?

Regards,
Wanpeng Li

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-15  9:13       ` Wanpeng Li
@ 2016-08-15  9:21         ` Stanislaw Gruszka
  2016-08-15  9:28           ` Wanpeng Li
  0 siblings, 1 reply; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15  9:21 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel@vger.kernel.org, Mel Gorman

On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> > Hi
> >
> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> >> Nice detective work! I'm wondering, where do we stand if compared with a
> >> pre-6e998916dfe3 kernel?
> >>
> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> >> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> >> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> >> make the result correct, just see what the performance gap is, roughly.
> >>
> >> If there's still a significant gap then it might make sense to optimize this some
> >> more.
> >
> > I measured (partial) revert performance on 4.7 using mmtest instructions
> > from Giovanni and also tested some other possible fix (draft version):
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 75f98c5..54fdf6d 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >         unsigned int seq, nextseq;
> >         unsigned long flags;
> >
> > +       (void) task_sched_runtime(tsk);
> > +
> >         rcu_read_lock();
> >         /* Attempt a lockless read on the first round. */
> >         nextseq = 0;
> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >                         task_cputime(t, &utime, &stime);
> >                         times->utime += utime;
> >                         times->stime += stime;
> > -                       times->sum_exec_runtime += task_sched_runtime(t);
> > +                       times->sum_exec_runtime += t->se.sum_exec_runtime;
> 
> > Will this not have updated stats for other threads?

No, not until the next tick/sched() on the CPUs running those threads.

Stanislaw

* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-15  9:21         ` Stanislaw Gruszka
@ 2016-08-15  9:28           ` Wanpeng Li
  0 siblings, 0 replies; 14+ messages in thread
From: Wanpeng Li @ 2016-08-15  9:28 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra,
	Mike Galbraith, linux-kernel@vger.kernel.org, Mel Gorman

2016-08-15 17:21 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
>> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
>> > Hi
>> >
>> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> >> Nice detective work! I'm wondering, where do we stand if compared with a
>> >> pre-6e998916dfe3 kernel?
>> >>
>> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
>> >> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
>> >> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
>> >> make the result correct, just see what the performance gap is, roughly.
>> >>
>> >> If there's still a significant gap then it might make sense to optimize this some
>> >> more.
>> >
>> > I measured (partial) revert performance on 4.7 using mmtest instructions
>> > from Giovanni and also tested some other possible fix (draft version):
>> >
>> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> > index 75f98c5..54fdf6d 100644
>> > --- a/kernel/sched/cputime.c
>> > +++ b/kernel/sched/cputime.c
>> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >         unsigned int seq, nextseq;
>> >         unsigned long flags;
>> >
>> > +       (void) task_sched_runtime(tsk);
>> > +
>> >         rcu_read_lock();
>> >         /* Attempt a lockless read on the first round. */
>> >         nextseq = 0;
>> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >                         task_cputime(t, &utime, &stime);
>> >                         times->utime += utime;
>> >                         times->stime += stime;
>> > -                       times->sum_exec_runtime += task_sched_runtime(t);
>> > +                       times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>> Will this not have updated stats for other threads?
>
> No, not until the next tick/sched() on the CPUs running those threads.

Yeah, I think this change will result in stale stats for other
threads if they are running, until the next update_curr() is called.
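
To make the difference concrete, here is a toy userspace model of the two
behaviours discussed in this subthread. It is only a sketch under
simplifying assumptions: struct thread_model and all function names are
hypothetical, there is no locking, and the real accounting in
kernel/sched/ is considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of per-thread runtime accounting; not the kernel's sched_entity. */
struct thread_model {
	uint64_t sum_exec_runtime;	/* runtime accounted so far, in ns */
	uint64_t exec_start;		/* start of the pending slice, in ns */
	int running;			/* non-zero if currently on a CPU */
};

/* Fold the pending slice into the sum, in the spirit of update_curr(). */
static void update_curr_model(struct thread_model *t, uint64_t now)
{
	if (t->running) {
		t->sum_exec_runtime += now - t->exec_start;
		t->exec_start = now;
	}
}

/* Draft-patch behaviour: refresh only the caller, read the others as-is. */
static uint64_t group_runtime_caller_only(struct thread_model *grp, size_t n,
					  size_t caller, uint64_t now)
{
	uint64_t total = 0;

	update_curr_model(&grp[caller], now);
	for (size_t i = 0; i < n; i++)
		total += grp[i].sum_exec_runtime;
	return total;
}

/* Behaviour before the draft patch: refresh every thread while summing. */
static uint64_t group_runtime_all(struct thread_model *grp, size_t n,
				  uint64_t now)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		update_curr_model(&grp[i], now);
		total += grp[i].sum_exec_runtime;
	}
	return total;
}
```

With two running threads that each have 1000 ns accounted and a slice
pending since t=500, a read at t=800 yields 2300 ns in the caller-only
variant and 2600 ns when every thread is refreshed; the missing 300 ns is
the other thread's pending slice.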

Regards,
Wanpeng Li

* [tip:sched/core] sched/cputime: Mitigate performance regression in times()/clock_gettime()
  2016-08-05  8:21 ` [PATCH 1/1] " Giovanni Gherdovich
  2016-08-10 11:26   ` Ingo Molnar
@ 2016-08-10 18:00   ` tip-bot for Giovanni Gherdovich
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Giovanni Gherdovich @ 2016-08-10 18:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, ggherdovich, tglx, sgruszka, mgorman, mingo, torvalds,
	mgalbraith, linux-kernel, peterz

Commit-ID:  6075620b0590eaf22f10ce88833eb20a57f760d6
Gitweb:     http://git.kernel.org/tip/6075620b0590eaf22f10ce88833eb20a57f760d6
Author:     Giovanni Gherdovich <ggherdovich@suse.cz>
AuthorDate: Fri, 5 Aug 2016 10:21:56 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Aug 2016 13:32:56 +0200

sched/cputime: Mitigate performance regression in times()/clock_gettime()

Commit:

  6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
class's update_curr() when the cputimer starts.

Said change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.

This patch mitigates the performance loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.

Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.
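
For reference, the gain formula can be written as a small helper. This is
only an illustrative sketch: the name percent_gain is hypothetical and not
part of MMTests, and recomputing from the rounded figures in the tables
may differ slightly from the printed percentages.

```c
/*
 * Hypothetical helper: relative gain in percent, (before - after) / before.
 * A positive result means "after" took less time, i.e. it is faster.
 * before_secs must be non-zero.
 */
static double percent_gain(double before_secs, double after_secs)
{
	return (before_secs - after_secs) / before_secs * 100.0;
}
```

For example, going from 4.0 seconds to 3.0 seconds is a gain of 25%, while
going from 2.0 seconds to 3.0 seconds is a gain of -50% (a regression).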

pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.48        3.06 ( 11.83%)
      5           3.33        3.25 (  2.40%)
      8           3.37        3.26 (  3.30%)
     12           3.32        3.37 ( -1.60%)
     21           4.01        3.90 (  2.74%)
     30           3.63        3.36 (  7.41%)
     48           3.71        3.11 ( 16.27%)
     79           3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.65        3.25 ( 11.03%)
      5           3.45        3.17 (  7.92%)
      8           3.52        3.22 (  8.69%)
     12           3.29        3.36 ( -2.04%)
     21           4.07        3.92 (  3.78%)
     30           3.87        3.40 ( 12.17%)
     48           3.79        3.16 ( 16.61%)
     79           3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)

pound_times and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettime(). The results above can be
reproduced by cloning MMTests from github.com and running the "poundtime"
workload:

  $ git clone https://github.com/gormanm/mmtests.git
  $ cd mmtests
  $ cp configs/config-global-dhp__workload_poundtime config
  $ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime" measuring the kernel currently running on
the machine; once a new kernel is installed and the machine rebooted,
running again

  $ cd mmtests
  $ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output
with:

  $ cd mmtests/work/log
  $ ../../compare-kernels.sh

The table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clarity.

The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr() is called in the
execution path of interest.
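
The idea can be illustrated outside the kernel with a minimal userspace
sketch. It assumes a GCC- or Clang-compatible compiler and uses
__builtin_prefetch, which is assumed here to play a role similar to the
kernel's prefetch(); the struct and function names are simplified,
hypothetical stand-ins and not the kernel definitions.

```c
#include <stdint.h>

/* Simplified stand-ins for the structures named above (not kernel code). */
struct entity_model {
	uint64_t exec_start;
};

struct cfs_rq_model {
	struct entity_model *curr;
};

/*
 * Ask the CPU to start pulling the entity and its exec_start field into
 * cache ahead of use. __builtin_prefetch is only a hint: it does not
 * change program semantics. cfs_rq->curr must point to a valid entity.
 */
static void prefetch_curr_model(const struct cfs_rq_model *cfs_rq)
{
	const struct entity_model *curr = cfs_rq->curr;

	__builtin_prefetch(curr);
	__builtin_prefetch(&curr->exec_start);
}

/* The two loads flagged by perf annotate, in miniature. */
static uint64_t delta_exec_model(const struct cfs_rq_model *cfs_rq,
				 uint64_t now)
{
	const struct entity_model *curr = cfs_rq->curr;

	return now - curr->exec_start;
}
```

The hint only pays off if independent work runs between the prefetch and
the later loads, so that the memory latency overlaps with that work.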

A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):

  threads   cycles before                 cycles after                gain

    2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
    5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
    8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
   12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
   21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
   30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
   48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
   79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
  110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
  128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%

A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):

  threads   L1 misses/total*100     L1 misses/total*100            gain
		         before                   after
      2           7.43  +-4.90%           7.36  +-4.70%           0.94%
      5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
      8          13.79  +-5.61%          12.90  +-3.27%           6.45%
     12          11.57  +-2.44%           8.71  +-1.40%          24.72%
     21          12.39  +-3.92%           9.97  +-1.84%          19.53%
     30          13.91  +-2.53%          11.73  +-2.28%          15.67%
     48          13.71  +-1.59%          12.32  +-1.97%          10.14%
     79          14.44  +-0.66%          13.40  +-1.06%           7.20%
    110          15.86  +-0.50%          14.46  +-0.59%           8.83%
    128          16.51  +-0.32%          15.06  +-0.78%           8.78%

As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).

pound_clock_gettime:

  threads   parent of         6e998916dfe3        4.7-rc7
	    6e998916dfe3            itself
    2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
    5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
    8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
    12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
    21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
    30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
    48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
    79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
    110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
    128      3.05          6.42 (-110.08%)        3.88 (-27.04%)

pound_times:

  threads   parent of         6e998916dfe3        4.7-rc7
	    6e998916dfe3            itself
    2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
    5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
    8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
    12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
    21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
    30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
    48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
    79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
    110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
    128      3.10          6.35 (-104.61%)        4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

  NR_THREADS=42
  for FILE in pound_times pound_clock_gettime; do
      gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lpthread -lrt
  done

==== BEGIN pound_times.c ====

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

void *pound (void *threadid)
{
  struct tms end;
  int oldutime = 0;
  int utime;
  int i;
  for (i = 0; i < 5000000 / NUM_THREADS; i++) {
          times(&end);
          utime = ((int)end.tms_utime - (int)start.tms_utime);
          if (oldutime > utime) {
            printf("utime decreased, was %d, now %d!\n", oldutime, utime);
          }
          oldutime = utime;
  }
  pthread_exit(NULL);
}

int main()
{
  pthread_t th[NUM_THREADS];
  long i;
  times(&start);
  for (i = 0; i < NUM_THREADS; i++) {
    pthread_create (&th[i], NULL, pound, (void *)i);
  }
  pthread_exit(NULL);
  return 0;
}
==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <sys/types.h>

void *pound (void *threadid)
{
	struct timespec ts;
	int rc, i;
	unsigned long prev = 0, this = 0;

	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		if (rc < 0)
			perror("clock_gettime");
		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
		if (0 && this < prev)
			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
		prev = this;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long rc, i;
	pid_t pgid;

	for (i = 0; i < NUM_THREADS; i++) {
		rc = pthread_create(&th[i], NULL, pound, (void *)i);
		if (rc < 0)
			perror("pthread_create");
	}

	pthread_exit(NULL);
	return 0;
}
==== END pound_clock_gettime.c ====

Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5c883fe..2a906f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/prefetch.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2972,6 +2973,23 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
+ * The function fair_sched_class.update_curr accesses the struct curr
+ * and its field curr->exec_start; when called from task_sched_runtime(),
+ * we observe a high rate of cache misses in practice.
+ * Prefetching this data results in improved performance.
+ */
+static inline void prefetch_curr_exec_start(struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+#else
+	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
+#endif
+	prefetch(curr);
+	prefetch(&curr->exec_start);
+}
+
+/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -3005,6 +3023,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}

Thread overview: 14+ messages
2016-08-05  8:21 [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
2016-08-05  8:21 ` [PATCH 1/1] " Giovanni Gherdovich
2016-08-10 11:26   ` Ingo Molnar
2016-08-10 13:02     ` Giovanni Gherdovich
2016-08-12 12:10     ` Stanislaw Gruszka
2016-08-15  7:49       ` Giovanni Gherdovich
2016-08-15  8:33         ` Mel Gorman
2016-08-15  9:19           ` Stanislaw Gruszka
2016-08-15  9:58             ` Mel Gorman
2016-08-15 10:29               ` Stanislaw Gruszka
2016-08-15  9:13       ` Wanpeng Li
2016-08-15  9:21         ` Stanislaw Gruszka
2016-08-15  9:28           ` Wanpeng Li
2016-08-10 18:00   ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
