* [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
@ 2016-07-26 14:07 Giovanni Gherdovich
2016-07-26 14:43 ` kbuild test robot
2016-08-02 10:37 ` Peter Zijlstra
0 siblings, 2 replies; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-07-26 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
Giovanni Gherdovich
Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency") fixed a problem whereby clock_nanosleep() followed by
clock_gettime() could allow a task to wake early. It addressed the problem
by calling the scheduling class's update_curr() when the cputimer starts.

That change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and self-monitoring applications that accidentally depend on the
performance of these specific calls.

This patch mitigates the performance loss by prefetching data into the CPU
cache, as stalls due to cache misses appear to be where most of the time is
spent in our benchmarks.
Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed as (before-after)/before, so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.
pound_clock_gettime:

    threads     4.7-rc7     patched 4.7-rc7
    [num]       [secs]      [secs (percent)]
      2          3.48        3.06 ( 11.83%)
      5          3.33        3.25 (  2.40%)
      8          3.37        3.26 (  3.30%)
     12          3.32        3.37 ( -1.60%)
     21          4.01        3.90 (  2.74%)
     30          3.63        3.36 (  7.41%)
     48          3.71        3.11 ( 16.27%)
     79          3.75        3.16 ( 15.74%)
    110          3.81        3.25 ( 14.80%)
    128          3.88        3.31 ( 14.76%)
pound_times:

    threads     4.7-rc7     patched 4.7-rc7
    [num]       [secs]      [secs (percent)]
      2          3.65        3.25 ( 11.03%)
      5          3.45        3.17 (  7.92%)
      8          3.52        3.22 (  8.69%)
     12          3.29        3.36 ( -2.04%)
     21          4.07        3.92 (  3.78%)
     30          3.87        3.40 ( 12.17%)
     48          3.79        3.16 ( 16.61%)
     79          3.88        3.28 ( 15.42%)
    110          3.90        3.38 ( 13.35%)
    128          4.00        3.38 ( 15.45%)
pound_times and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettime(). The results above can be
reproduced by cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above runs "poundtime" on the kernel currently running on the machine;
once a new kernel is installed and the machine rebooted, running again

$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table is then generated
with

$ cd mmtests/work/log
$ ../../compare-kernels.sh

The table contains a lot of entries; grepping for "Amean" (as in
"arithmetic mean") gives the tables presented above. The source code
of the two benchmarks is reported at the end of this changelog for
clarity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines turned
out to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch
prefetches that data from memory just before update_curr() is called on
the execution path of interest.
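
For context, the top of update_curr() reads roughly as follows (abridged
from kernel/sched/fair.c around v4.7; only the part relevant to the
reported misses is shown):

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;       /* first reported miss */
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;            /* second reported miss */
        if (unlikely((s64)delta_exec <= 0))
                return;

        curr->exec_start = now;
        /* ... runtime accounting and vruntime update follow ... */
}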
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`,
running over the same sequence of thread counts used above (a positive
gain is an improvement):
    threads   cycles before               cycles after                gain
      2        19,699,563,964 +-1.19%      17,358,917,517 +-1.85%     11.88%
      5        47,401,089,566 +-2.96%      45,103,730,829 +-0.97%      4.85%
      8        80,923,501,004 +-3.01%      71,419,385,977 +-0.77%     11.74%
     12       112,326,485,473 +-0.47%     110,371,524,403 +-0.47%      1.74%
     21       193,455,574,299 +-0.72%     180,120,667,904 +-0.36%      6.89%
     30       315,073,519,013 +-1.64%     271,222,225,950 +-1.29%     13.92%
     48       321,969,515,332 +-1.48%     273,353,977,321 +-1.16%     15.10%
     79       337,866,003,422 +-0.97%     289,462,481,538 +-1.05%     14.33%
    110       338,712,691,920 +-0.78%     290,574,233,170 +-0.77%     14.21%
    128       348,384,794,006 +-0.50%     292,691,648,206 +-0.66%     15.99%
A comparison of the ratio of L1 cache misses to total L1 loads, before and
after the patch (again from the `perf stat -r 10 -ddd <program>` tables):
    threads   L1 misses/total*100    L1 misses/total*100     gain
              (before)               (after)
      2        7.43 +-4.90%            7.36 +-4.70%           0.94%
      5       13.09 +-4.74%           13.52 +-3.73%          -3.28%
      8       13.79 +-5.61%           12.90 +-3.27%           6.45%
     12       11.57 +-2.44%            8.71 +-1.40%          24.72%
     21       12.39 +-3.92%            9.97 +-1.84%          19.53%
     30       13.91 +-2.53%           11.73 +-2.28%          15.67%
     48       13.71 +-1.59%           12.32 +-1.97%          10.14%
     79       14.44 +-0.66%           13.40 +-1.06%           7.20%
    110       15.86 +-0.50%           14.46 +-0.59%           8.83%
    128       16.51 +-0.32%           15.06 +-0.78%           8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:

    threads   parent of 6e998916dfe3   6e998916dfe3 itself   4.7-rc7
      2        2.23                     3.68 ( -64.56%)       3.48 ( -55.48%)
      5        2.83                     3.78 ( -33.42%)       3.33 ( -17.43%)
      8        2.84                     4.31 ( -52.12%)       3.37 ( -18.76%)
     12        3.09                     3.61 ( -16.74%)       3.32 (  -7.17%)
     21        3.14                     4.63 ( -47.36%)       4.01 ( -27.71%)
     30        3.28                     5.75 ( -75.37%)       3.63 ( -10.80%)
     48        3.02                     6.05 (-100.56%)       3.71 ( -22.99%)
     79        2.88                     6.30 (-118.90%)       3.75 ( -30.26%)
    110        2.95                     6.46 (-119.00%)       3.81 ( -29.24%)
    128        3.05                     6.42 (-110.08%)       3.88 ( -27.04%)
pound_times:

    threads   parent of 6e998916dfe3   6e998916dfe3 itself   4.7-rc7
      2        2.27                     3.73 ( -64.71%)       3.65 ( -61.14%)
      5        2.78                     3.77 ( -35.56%)       3.45 ( -23.98%)
      8        2.79                     4.41 ( -57.71%)       3.52 ( -26.05%)
     12        3.02                     3.56 ( -17.94%)       3.29 (  -9.08%)
     21        3.10                     4.61 ( -48.74%)       4.07 ( -31.34%)
     30        3.33                     5.75 ( -72.53%)       3.87 ( -16.01%)
     48        2.96                     6.06 (-105.04%)       3.79 ( -28.10%)
     79        2.88                     6.24 (-116.83%)       3.88 ( -34.81%)
    110        2.98                     6.37 (-114.08%)       3.90 ( -31.12%)
    128        3.10                     6.35 (-104.61%)       4.00 ( -28.87%)
The source code of the two benchmarks follows. To compile the two:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
    done
==== BEGIN pound_times.c ====
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

void *pound(void *threadid)
{
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                times(&end);
                utime = ((int)end.tms_utime - (int)start.tms_utime);
                /* user time reported by times() must never go backwards */
                if (oldutime > utime) {
                        printf("utime decreased, was %d, now %d!\n", oldutime, utime);
                }
                oldutime = utime;
        }
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long i;

        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create(&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>

void *pound(void *threadid)
{
        struct timespec ts;
        int rc, i;
        unsigned long prev = 0, this = 0;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
                if (rc < 0)
                        perror("clock_gettime");
                this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
                /* timewarp check, disabled (note the "0 &&") in this version */
                if (0 && this < prev)
                        printf("%lu ns timewarp at iteration %d\n", prev - this, i);
                prev = this;
        }
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long rc, i;
        pid_t pgid;

        for (i = 0; i < NUM_THREADS; i++) {
                rc = pthread_create(&th[i], NULL, pound, (void *)i);
                if (rc < 0)
                        perror("pthread_create");
        }
        pthread_exit(NULL);
        return 0;
}
==== END pound_clock_gettime.c ====
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..0ef1e69 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2998,6 +2998,10 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+		prefetch((&p->se)->cfs_rq->curr);
+		prefetch(&(&p->se)->cfs_rq->curr->exec_start);
+#endif
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
--
2.6.6
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-07-26 14:07 [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
@ 2016-07-26 14:43 ` kbuild test robot
2016-07-27 11:41 ` Giovanni Gherdovich
2016-08-02 10:37 ` Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: kbuild test robot @ 2016-07-26 14:43 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: kbuild-all, Ingo Molnar, Peter Zijlstra, Mike Galbraith,
Stanislaw Gruszka, linux-kernel, Mel Gorman, Giovanni Gherdovich
Hi,
[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.7 next-20160726]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/sched-cputime-Mitigate-performance-regression-in-times-clock_gettime/20160726-221216
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=mips
All errors (new ones prefixed by >>):
kernel/sched/core.c: In function 'task_sched_runtime':
>> kernel/sched/core.c:3009:3: error: implicit declaration of function 'prefetch' [-Werror=implicit-function-declaration]
prefetch((&p->se)->cfs_rq->curr);
^
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/core.o] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [kernel/sched/] Error 2
vim +/prefetch +3009 kernel/sched/core.c
3003 * Must be ->curr _and_ ->on_rq. If dequeued, we would
3004 * project cycles that may never be accounted to this
3005 * thread, breaking clock_gettime().
3006 */
3007 if (task_current(rq, p) && task_on_rq_queued(p)) {
3008 #if defined(CONFIG_FAIR_GROUP_SCHED)
> 3009 prefetch((&p->se)->cfs_rq->curr);
3010 prefetch(&(&p->se)->cfs_rq->curr->exec_start);
3011 #endif
3012 update_rq_clock(rq);
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-07-26 14:43 ` kbuild test robot
@ 2016-07-27 11:41 ` Giovanni Gherdovich
0 siblings, 0 replies; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-07-27 11:41 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: kbuild-all, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Tue, 2016-07-26 at 22:43 +0800, kbuild test robot wrote:
> Hi,
>
> [auto build test ERROR on tip/sched/core]
> [also build test ERROR on v4.7 next-20160726]
> [if your patch is applied to the wrong git tree, please drop us a
> note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/sched-cputime-Mitigate-performance-regression-in-times-clock_gettime/20160726-221216
> config: mips-allyesconfig (attached as .config)
> compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=mips
>
> All errors (new ones prefixed by >>):
>
> kernel/sched/core.c: In function 'task_sched_runtime':
> >> kernel/sched/core.c:3009:3: error: implicit declaration of function 'prefetch' [-Werror=implicit-function-declaration]
>       prefetch((&p->se)->cfs_rq->curr);
>       ^
> cc1: some warnings being treated as errors
> make[2]: *** [kernel/sched/core.o] Error 1
> make[2]: Target '__build' not remade because of errors.
> make[1]: *** [kernel/sched/] Error 2
>
> vim +/prefetch +3009 kernel/sched/core.c
>
>   3003		 * Must be ->curr _and_ ->on_rq. If dequeued, we would
>   3004		 * project cycles that may never be accounted to this
>   3005		 * thread, breaking clock_gettime().
>   3006		 */
>   3007		if (task_current(rq, p) && task_on_rq_queued(p)) {
>   3008	#if defined(CONFIG_FAIR_GROUP_SCHED)
> > 3009			prefetch((&p->se)->cfs_rq->curr);
>   3010			prefetch(&(&p->se)->cfs_rq->curr->exec_start);
>   3011	#endif
>   3012			update_rq_clock(rq);
>
> ---
> 0-DAY kernel test infrastructure              Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all     Intel Corporation
As the robot pointed out, I forgot the appropriate header file. On x86
it compiled because the declaration must have come in via an indirect
header inclusion.

I will wait a few days for more comments, then I'll send a v2
addressing the missing #include with the following change:
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fde4014..97c196d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/prefetch.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
Regards,
Giovanni Gherdovich
SUSE Labs
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-07-26 14:07 [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
2016-07-26 14:43 ` kbuild test robot
@ 2016-08-02 10:37 ` Peter Zijlstra
2016-08-02 13:26 ` Mike Galbraith
2016-08-02 22:04 ` Giovanni Gherdovich
1 sibling, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2016-08-02 10:37 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman
On Tue, Jul 26, 2016 at 04:07:14PM +0200, Giovanni Gherdovich wrote:
> Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
SoB chain is borken. Either Mike wrote the patch in which case you're
missing a From: Mike header someplace, or you wrote it and Mike needs to
be a Ack/Reviewed or somesuch.
> ---
> kernel/sched/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51d7105..0ef1e69 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2998,6 +2998,10 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> * thread, breaking clock_gettime().
> */
> if (task_current(rq, p) && task_on_rq_queued(p)) {
> +#if defined(CONFIG_FAIR_GROUP_SCHED)
This here wants a comment on why we're doing this. Because I'm sure that
if someone were to read this code in a few weeks they'd go WTF!?
Also, is there a possibility of manual CSE we should do?
> + prefetch((&p->se)->cfs_rq->curr);
> + prefetch(&(&p->se)->cfs_rq->curr->exec_start);
> +#endif
> update_rq_clock(rq);
> p->sched_class->update_curr(rq);
> }
> --
> 2.6.6
>
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-02 10:37 ` Peter Zijlstra
@ 2016-08-02 13:26 ` Mike Galbraith
2016-08-02 22:04 ` Giovanni Gherdovich
1 sibling, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2016-08-02 13:26 UTC (permalink / raw)
To: Peter Zijlstra, Giovanni Gherdovich
Cc: Ingo Molnar, Stanislaw Gruszka, linux-kernel, Mel Gorman
On Tue, 2016-08-02 at 12:37 +0200, Peter Zijlstra wrote:
> On Tue, Jul 26, 2016 at 04:07:14PM +0200, Giovanni Gherdovich wrote:
>
> > Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>
> SoB chain is borken. Either Mike wrote the patch in which case you're
> missing a From: Mike header someplace, or you wrote it and Mike needs to
> be a Ack/Reviewed or somesuch.
Hohum, I told him to just take the credit (blame) if he intended to
submit, but he didn't listen ;-) It was a demonstration that a much
bigger/invasive patch was doing the same thing in an indirect way.
-Mike
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-02 10:37 ` Peter Zijlstra
2016-08-02 13:26 ` Mike Galbraith
@ 2016-08-02 22:04 ` Giovanni Gherdovich
2016-08-03 10:02 ` Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-08-02 22:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
Hello Peter,
thank you for your reply.
On Tue, 2016-08-02 at 12:37 +0200, Peter Zijlstra wrote:
> On Tue, Jul 26, 2016 at 04:07:14PM +0200, Giovanni Gherdovich wrote:
>
> > Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>
> SoB chain is borken. Either Mike wrote the patch in which case you're
> missing a From: Mike header someplace, or you wrote it and Mike needs
> to be a Ack/Reviewed or somesuch.
Right. As Mike already explained, this patch is the result of him
correcting a much more involved/complicated solution I prepared to
solve the problem. I will put the "From: Mike" in v2.
>
> > ---
> > kernel/sched/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 51d7105..0ef1e69 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2998,6 +2998,10 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> > * thread, breaking clock_gettime().
> > */
> > if (task_current(rq, p) && task_on_rq_queued(p)) {
> > +#if defined(CONFIG_FAIR_GROUP_SCHED)
>
> This here wants a comment on why we're doing this. Because I'm sure
> that if someone were to read this code in a few weeks they'd go
> WTF!?
I had that config variable set in the machine I was testing on, and
thought that for some reason it was related to my observations. I will
repeat the experiment without it, and if I obtain the same results I
will drop the conditional. Otherwise I will motivate its necessity.
I will submit a v2 early next week, rebasing the patch on the
forthcoming 4.8-rc1 tag and updating the experimental data.
>
> Also, is there a possibility of manual CSE we should do?
>
> > + prefetch((&p->se)->cfs_rq->curr);
> > + prefetch(&(&p->se)->cfs_rq->curr->exec_start);
> > +#endif
> > update_rq_clock(rq);
> > p->sched_class->update_curr(rq);
> > }
Good point. I verified that GCC 4.8.5 already does it on its own, no hint
needed. This is the alternative code, with the CSE made explicit, that I
compiled:
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..5d676db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2998,6 +2998,11 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+		struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+		prefetch(curr);
+		prefetch(&curr->exec_start);
+#endif
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
I post below the snippets of generated code with and without CSE that
I got running 'disassemble /m task_sched_runtime' in gdb; you'll see
they're identical. If you prefer the explicit hint I'll include it in
v2, but it's probably safe to say it isn't needed.
Regards,
Giovanni
with CSE: -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
3001	#if defined(CONFIG_FAIR_GROUP_SCHED)
3002		struct sched_entity *curr = (&p->se)->cfs_rq->curr;
   <+117>:	mov    0x1d0(%rbx),%rdx
   <+124>:	mov    0x38(%rdx),%rdx
3003		prefetch(curr);
3004		prefetch(&curr->exec_start);
3005	#endif
3006		update_rq_clock(rq);
3007		p->sched_class->update_curr(rq);
   <+144>:	mov    0x58(%rbx),%rdx
   <+148>:	mov    %rax,%rdi
   <+151>:	mov    %rax,-0x20(%rbp)
   <+155>:	callq  *0xb0(%rdx)
   <+161>:	mov    -0x20(%rbp),%rax
   <+165>:	jmp    <task_sched_runtime+66>
   <+167>:	mov    %rax,%rdi
   <+170>:	mov    %rax,-0x20(%rbp)
   <+174>:	callq  <update_rq_clock>
   <+179>:	mov    -0x20(%rbp),%rax
   <+183>:	jmp    <task_sched_runtime+144>
   :		nopl   0x0(%rax)
3008	}
3009		ns = p->se.sum_exec_runtime;
   <+66>:	mov    0xc8(%rbx),%r12
3010		task_rq_unlock(rq, p, &rf);
3011
3012		return ns;
   <+103>:	mov    %r12,%rax
w/o CSE: -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
3001	#if defined(CONFIG_FAIR_GROUP_SCHED)
3002		prefetch((&p->se)->cfs_rq->curr);
   <+117>:	mov    0x1d0(%rbx),%rdx
   <+124>:	mov    0x38(%rdx),%rdx
3003		prefetch(&(&p->se)->cfs_rq->curr->exec_start);
3004	#endif
3005		update_rq_clock(rq);
3006		p->sched_class->update_curr(rq);
   <+144>:	mov    0x58(%rbx),%rdx
   <+148>:	mov    %rax,%rdi
   <+151>:	mov    %rax,-0x20(%rbp)
   <+155>:	callq  *0xb0(%rdx)
   <+161>:	mov    -0x20(%rbp),%rax
   <+165>:	jmp    <task_sched_runtime+66>
   <+167>:	mov    %rax,%rdi
   <+170>:	mov    %rax,-0x20(%rbp)
   <+174>:	callq  <update_rq_clock>
   <+179>:	mov    -0x20(%rbp),%rax
   <+183>:	jmp    <task_sched_runtime+144>
   :		nopl   0x0(%rax)
3007	}
3008		ns = p->se.sum_exec_runtime;
   <+66>:	mov    0xc8(%rbx),%r12
3009		task_rq_unlock(rq, p, &rf);
3010
3011		return ns;
   <+103>:	mov    %r12,%rax
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-02 22:04 ` Giovanni Gherdovich
@ 2016-08-03 10:02 ` Peter Zijlstra
2016-08-03 10:34 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-08-03 10:02 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Wed, Aug 03, 2016 at 12:04:52AM +0200, Giovanni Gherdovich wrote:
> > > +#if defined(CONFIG_FAIR_GROUP_SCHED)
> >
> > This here wants a comment on why we're doing this. Because I'm sure
> > that if someone were to read this code in a few weeks they'd go
> > WTF!?
>
> I had that config variable set in the machine I was testing on, and
> thought that for some reason it was related to my observations. I will
> repeat the experiment without it, and if I obtain the same results I
> will drop the conditional. Otherwise I will motivate its necessity.
>
No, I meant we want a comment here explaining the reason for these
prefetches.
You'll need the #ifdef because se->cfs_rq doesn't exist otherwise.
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2998,6 +2998,11 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> * thread, breaking clock_gettime().
> */
> if (task_current(rq, p) && task_on_rq_queued(p)) {
> +#if defined(CONFIG_FAIR_GROUP_SCHED)
> + struct sched_entity *curr = (&p->se)->cfs_rq->curr;
> + prefetch(curr);
> + prefetch(&curr->exec_start);
> +#endif
> update_rq_clock(rq);
> p->sched_class->update_curr(rq);
> }
> -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
>
> I post below the snippets of generated code with and without CSE that
> I got running 'disassemble /m task_sched_runtime' in gdb; you'll see
> they're identical. If you prefer the explicit hint I'll include it in
> v2, but it's probably safe to say it isn't needed.
I much prefer the manual CSE, it's much more readable.
Also, maybe pull the whole thing into a helper function with a
descriptive name, like:
/*
 * XXX comment on why this is needed goes here...
 */
static inline void prefetch_curr_exec_start(struct task_struct *p)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *curr = (&p->se)->cfs_rq->curr;

	prefetch(curr);
	prefetch(&curr->exec_start);
#endif
}
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-03 10:02 ` Peter Zijlstra
@ 2016-08-03 10:34 ` Peter Zijlstra
2016-08-05 7:58 ` Giovanni Gherdovich
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-08-03 10:34 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Wed, Aug 03, 2016 at 12:02:54PM +0200, Peter Zijlstra wrote:
> /*
>  * XXX comment on why this is needed goes here...
>  */
> static inline void prefetch_curr_exec_start(struct task_struct *p)
> {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> 	struct sched_entity *curr = (&p->se)->cfs_rq->curr;

#else
	struct sched_entity *curr = task_rq(p)->cfs_rq->curr;
#endif

Would be the 'right' thing for !cgroup stuffs

>
> 	prefetch(curr);
> 	prefetch(&curr->exec_start);
> }
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-03 10:34 ` Peter Zijlstra
@ 2016-08-05 7:58 ` Giovanni Gherdovich
0 siblings, 0 replies; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05 7:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Wed, 2016-08-03 at 12:02 +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 12:04:52AM +0200, Giovanni Gherdovich wrote:
>
> > > > +#if defined(CONFIG_FAIR_GROUP_SCHED)
> > >
> > > This here wants a comment on why we're doing this. Because I'm
> > > sure that if someone were to read this code in a few weeks
> > > they'd go WTF!?
> >
> > I had that config variable set in the machine I was testing on, and
> > thought that for some reason it was related to my observations. I
> > will repeat the experiment without it, and if I obtain the same
> > results I will drop the conditional. Otherwise I will motivate its
> > necessity.
> >
>
> No, I meant we want a comment here explaining the reason for these
> prefetches.
>
> You'll need the #ifdef because se->cfs_rq doesn't exist otherwise.
Oh! Thanks for pointing that out.
> >
> > I post below the snippets of generated code with and without CSE
> > that I got running 'disassemble /m task_sched_runtime' in gdb;
> > you'll see they're identical. If you prefer the explicit hint I'll
> > include it in v2, but it's probably safe to say it isn't needed.
>
> I much prefer the manual CSE, its much more readable.
Ok.
>
> Also, maybe pull the whole thing into a helper function with a
> descriptive name, like:
>
> /*
> * XXX comment on why this is needed goes here...
> */
> static inline void prefetch_curr_exec_start(struct task_struct *p)
> {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> struct sched_entity *curr = (&p->se)->cfs_rq->curr;
>
> prefetch(curr);
> prefetch(&curr->exec_start);
> #endif
> }
Ok.
On Wed, 2016-08-03 at 12:34 +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 12:02:54PM +0200, Peter Zijlstra wrote:
> > /*
> > * XXX comment on why this is needed goes here...
> > */
> > static inline void prefetch_curr_exec_start(struct task_struct *p)
> > {
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > struct sched_entity *curr = (&p->se)->cfs_rq->curr;
> #else
> struct sched_entity *curr = task_rq(p)->cfs_rq->curr;
> #endif
>
> Would be the 'right' thing for !cgroup stuffs
Ok, thanks. With a bit of help I figured out that you actually meant to
write

#else
	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
#endif

Will add that.
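
For reference, folding both suggestions together, the v2 helper would look
roughly like this (a sketch of the plan above, not the code as actually
submitted):

/*
 * Prefetch the data that update_curr() is about to read (cfs_rq->curr and
 * curr->exec_start), which profiling showed to be cold in cache on this
 * path.
 */
static inline void prefetch_curr_exec_start(struct task_struct *p)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
#else
	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
#endif
	prefetch(curr);
	prefetch(&curr->exec_start);
}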
V2 is on its way; I previously said I would put "From: Mike" in the patch
header, but I have since discussed it with Mike and we agreed that the
patch will be "From: Giovanni" with a "Suggested-by: Mike" tag.
Regards,
Giovanni