* [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
@ 2016-07-26 14:07 Giovanni Gherdovich
2016-07-26 14:43 ` kbuild test robot
2016-08-02 10:37 ` Peter Zijlstra
0 siblings, 2 replies; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-07-26 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
Giovanni Gherdovich
Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency") fixed a problem whereby clock_nanosleep() followed by
clock_gettime() could allow a task to wake early. It addressed the problem
by calling the scheduling class's update_curr() when the cputimer starts.

That change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and self-monitoring applications that accidentally depend on the
performance of these specific calls.

This patch mitigates the performance loss by prefetching data into the CPU
cache, as stalls due to cache misses appear to be where most of the time is
spent in our benchmarks.
Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed as (before-after)/before, so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.
pound_clock_gettime:

    threads     4.7-rc7     patched 4.7-rc7
    [num]       [secs]      [secs (percent)]
      2          3.48        3.06 ( 11.83%)
      5          3.33        3.25 (  2.40%)
      8          3.37        3.26 (  3.30%)
     12          3.32        3.37 ( -1.60%)
     21          4.01        3.90 (  2.74%)
     30          3.63        3.36 (  7.41%)
     48          3.71        3.11 ( 16.27%)
     79          3.75        3.16 ( 15.74%)
    110          3.81        3.25 ( 14.80%)
    128          3.88        3.31 ( 14.76%)
pound_times:

    threads     4.7-rc7     patched 4.7-rc7
    [num]       [secs]      [secs (percent)]
      2          3.65        3.25 ( 11.03%)
      5          3.45        3.17 (  7.92%)
      8          3.52        3.22 (  8.69%)
     12          3.29        3.36 ( -2.04%)
     21          4.07        3.92 (  3.78%)
     30          3.87        3.40 ( 12.17%)
     48          3.79        3.16 ( 16.61%)
     79          3.88        3.28 ( 15.42%)
    110          3.90        3.38 ( 13.35%)
    128          4.00        3.38 ( 15.45%)
pound_times and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettime(). The results above can be
reproduced by cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above runs "poundtime" on the kernel currently running on the machine;
once a new kernel is installed and the machine rebooted, running again

$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table is then generated
with

$ cd mmtests/work/log
$ ../../compare-kernels.sh

The table contains a lot of entries; grepping for "Amean" (as in
"arithmetic mean") gives the tables presented above. The source code
of the two benchmarks is reported at the end of this changelog for
clarity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines turned
out to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch
prefetches that data from memory just before update_curr() is called on
the execution path of interest.
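
For context, the top of update_curr() reads roughly as follows (abridged
from kernel/sched/fair.c around v4.7; only the part relevant to the
reported misses is shown):

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;       /* first reported miss */
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;            /* second reported miss */
        if (unlikely((s64)delta_exec <= 0))
                return;

        curr->exec_start = now;
        /* ... runtime accounting and vruntime update follow ... */
}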
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`,
running over the same sequence of thread counts used above (a positive
gain is an improvement):
    threads   cycles before               cycles after                gain
      2        19,699,563,964 +-1.19%      17,358,917,517 +-1.85%     11.88%
      5        47,401,089,566 +-2.96%      45,103,730,829 +-0.97%      4.85%
      8        80,923,501,004 +-3.01%      71,419,385,977 +-0.77%     11.74%
     12       112,326,485,473 +-0.47%     110,371,524,403 +-0.47%      1.74%
     21       193,455,574,299 +-0.72%     180,120,667,904 +-0.36%      6.89%
     30       315,073,519,013 +-1.64%     271,222,225,950 +-1.29%     13.92%
     48       321,969,515,332 +-1.48%     273,353,977,321 +-1.16%     15.10%
     79       337,866,003,422 +-0.97%     289,462,481,538 +-1.05%     14.33%
    110       338,712,691,920 +-0.78%     290,574,233,170 +-0.77%     14.21%
    128       348,384,794,006 +-0.50%     292,691,648,206 +-0.66%     15.99%
A comparison of the ratio of L1 cache misses to total L1 loads, before and
after the patch (again from the `perf stat -r 10 -ddd <program>` tables):
    threads   L1 misses/total*100    L1 misses/total*100     gain
              (before)               (after)
      2        7.43 +-4.90%            7.36 +-4.70%           0.94%
      5       13.09 +-4.74%           13.52 +-3.73%          -3.28%
      8       13.79 +-5.61%           12.90 +-3.27%           6.45%
     12       11.57 +-2.44%            8.71 +-1.40%          24.72%
     21       12.39 +-3.92%            9.97 +-1.84%          19.53%
     30       13.91 +-2.53%           11.73 +-2.28%          15.67%
     48       13.71 +-1.59%           12.32 +-1.97%          10.14%
     79       14.44 +-0.66%           13.40 +-1.06%           7.20%
    110       15.86 +-0.50%           14.46 +-0.59%           8.83%
    128       16.51 +-0.32%           15.06 +-0.78%           8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:

    threads   parent of 6e998916dfe3   6e998916dfe3 itself   4.7-rc7
      2        2.23                     3.68 ( -64.56%)       3.48 ( -55.48%)
      5        2.83                     3.78 ( -33.42%)       3.33 ( -17.43%)
      8        2.84                     4.31 ( -52.12%)       3.37 ( -18.76%)
     12        3.09                     3.61 ( -16.74%)       3.32 (  -7.17%)
     21        3.14                     4.63 ( -47.36%)       4.01 ( -27.71%)
     30        3.28                     5.75 ( -75.37%)       3.63 ( -10.80%)
     48        3.02                     6.05 (-100.56%)       3.71 ( -22.99%)
     79        2.88                     6.30 (-118.90%)       3.75 ( -30.26%)
    110        2.95                     6.46 (-119.00%)       3.81 ( -29.24%)
    128        3.05                     6.42 (-110.08%)       3.88 ( -27.04%)
pound_times:

    threads   parent of 6e998916dfe3   6e998916dfe3 itself   4.7-rc7
      2        2.27                     3.73 ( -64.71%)       3.65 ( -61.14%)
      5        2.78                     3.77 ( -35.56%)       3.45 ( -23.98%)
      8        2.79                     4.41 ( -57.71%)       3.52 ( -26.05%)
     12        3.02                     3.56 ( -17.94%)       3.29 (  -9.08%)
     21        3.10                     4.61 ( -48.74%)       4.07 ( -31.34%)
     30        3.33                     5.75 ( -72.53%)       3.87 ( -16.01%)
     48        2.96                     6.06 (-105.04%)       3.79 ( -28.10%)
     79        2.88                     6.24 (-116.83%)       3.88 ( -34.81%)
    110        2.98                     6.37 (-114.08%)       3.90 ( -31.12%)
    128        3.10                     6.35 (-104.61%)       4.00 ( -28.87%)
The source code of the two benchmarks follows. To compile the two:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
    done
==== BEGIN pound_times.c ====
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

void *pound(void *threadid)
{
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                times(&end);
                utime = ((int)end.tms_utime - (int)start.tms_utime);
                /* user time reported by times() must never go backwards */
                if (oldutime > utime) {
                        printf("utime decreased, was %d, now %d!\n", oldutime, utime);
                }
                oldutime = utime;
        }
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long i;

        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create(&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>

void *pound(void *threadid)
{
        struct timespec ts;
        int rc, i;
        unsigned long prev = 0, this = 0;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
                if (rc < 0)
                        perror("clock_gettime");
                this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
                /* timewarp check, disabled (note the "0 &&") in this version */
                if (0 && this < prev)
                        printf("%lu ns timewarp at iteration %d\n", prev - this, i);
                prev = this;
        }
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long rc, i;
        pid_t pgid;

        for (i = 0; i < NUM_THREADS; i++) {
                rc = pthread_create(&th[i], NULL, pound, (void *)i);
                if (rc < 0)
                        perror("pthread_create");
        }
        pthread_exit(NULL);
        return 0;
}
==== END pound_clock_gettime.c ====
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..0ef1e69 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2998,6 +2998,10 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+		prefetch((&p->se)->cfs_rq->curr);
+		prefetch(&(&p->se)->cfs_rq->curr->exec_start);
+#endif
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
--
2.6.6
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-07-26 14:07 [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
@ 2016-07-26 14:43 ` kbuild test robot
2016-07-27 11:41 ` Giovanni Gherdovich
2016-08-02 10:37 ` Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: kbuild test robot @ 2016-07-26 14:43 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: kbuild-all, Ingo Molnar, Peter Zijlstra, Mike Galbraith,
Stanislaw Gruszka, linux-kernel, Mel Gorman, Giovanni Gherdovich
Hi,
[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.7 next-20160726]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/sched-cputime-Mitigate-performance-regression-in-times-clock_gettime/20160726-221216
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=mips
All errors (new ones prefixed by >>):
kernel/sched/core.c: In function 'task_sched_runtime':
>> kernel/sched/core.c:3009:3: error: implicit declaration of function 'prefetch' [-Werror=implicit-function-declaration]
prefetch((&p->se)->cfs_rq->curr);
^
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/core.o] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [kernel/sched/] Error 2
vim +/prefetch +3009 kernel/sched/core.c
3003 * Must be ->curr _and_ ->on_rq. If dequeued, we would
3004 * project cycles that may never be accounted to this
3005 * thread, breaking clock_gettime().
3006 */
3007 if (task_current(rq, p) && task_on_rq_queued(p)) {
3008 #if defined(CONFIG_FAIR_GROUP_SCHED)
> 3009 prefetch((&p->se)->cfs_rq->curr);
3010 prefetch(&(&p->se)->cfs_rq->curr->exec_start);
3011 #endif
3012 update_rq_clock(rq);
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-07-26 14:43 ` kbuild test robot
@ 2016-07-27 11:41 ` Giovanni Gherdovich
0 siblings, 0 replies; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-07-27 11:41 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: kbuild-all, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Tue, 2016-07-26 at 22:43 +0800, kbuild test robot wrote:
> Hi,
>
> [auto build test ERROR on tip/sched/core]
> [also build test ERROR on v4.7 next-20160726]
> [if your patch is applied to the wrong git tree, please drop us a
> note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/sched-cputime-Mitigate-performance-regression-in-times-clock_gettime/20160726-221216
> config: mips-allyesconfig (attached as .config)
> compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=mips
>
> All errors (new ones prefixed by >>):
>
> kernel/sched/core.c: In function 'task_sched_runtime':
> >> kernel/sched/core.c:3009:3: error: implicit declaration of function 'prefetch' [-Werror=implicit-function-declaration]
>       prefetch((&p->se)->cfs_rq->curr);
>       ^
> cc1: some warnings being treated as errors
> make[2]: *** [kernel/sched/core.o] Error 1
> make[2]: Target '__build' not remade because of errors.
> make[1]: *** [kernel/sched/] Error 2
>
> vim +/prefetch +3009 kernel/sched/core.c
>
>   3003		 * Must be ->curr _and_ ->on_rq. If dequeued, we would
>   3004		 * project cycles that may never be accounted to this
>   3005		 * thread, breaking clock_gettime().
>   3006		 */
>   3007		if (task_current(rq, p) && task_on_rq_queued(p)) {
>   3008	#if defined(CONFIG_FAIR_GROUP_SCHED)
> > 3009			prefetch((&p->se)->cfs_rq->curr);
>   3010			prefetch(&(&p->se)->cfs_rq->curr->exec_start);
>   3011	#endif
>   3012			update_rq_clock(rq);
>
> ---
> 0-DAY kernel test infrastructure              Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all     Intel Corporation
As the robot pointed out, I forgot the appropriate header file. On x86
it compiled because the declaration must have come in via an indirect
header inclusion.

I will wait a few days for more comments, then I'll send a v2
addressing the missing #include with the following change:
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fde4014..97c196d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/prefetch.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
Regards,
Giovanni Gherdovich
SUSE Labs
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-07-26 14:07 [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
2016-07-26 14:43 ` kbuild test robot
@ 2016-08-02 10:37 ` Peter Zijlstra
2016-08-02 13:26 ` Mike Galbraith
2016-08-02 22:04 ` Giovanni Gherdovich
1 sibling, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2016-08-02 10:37 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman
On Tue, Jul 26, 2016 at 04:07:14PM +0200, Giovanni Gherdovich wrote:
> Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
SoB chain is borken. Either Mike wrote the patch in which case you're
missing a From: Mike header someplace, or you wrote it and Mike needs to
be a Ack/Reviewed or somesuch.
> ---
> kernel/sched/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51d7105..0ef1e69 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2998,6 +2998,10 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> * thread, breaking clock_gettime().
> */
> if (task_current(rq, p) && task_on_rq_queued(p)) {
> +#if defined(CONFIG_FAIR_GROUP_SCHED)
This here wants a comment on why we're doing this. Because I'm sure that
if someone were to read this code in a few weeks they'd go WTF!?
Also, is there a possibility of manual CSE we should do?
> + prefetch((&p->se)->cfs_rq->curr);
> + prefetch(&(&p->se)->cfs_rq->curr->exec_start);
> +#endif
> update_rq_clock(rq);
> p->sched_class->update_curr(rq);
> }
> --
> 2.6.6
>
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-02 10:37 ` Peter Zijlstra
@ 2016-08-02 13:26 ` Mike Galbraith
2016-08-02 22:04 ` Giovanni Gherdovich
1 sibling, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2016-08-02 13:26 UTC (permalink / raw)
To: Peter Zijlstra, Giovanni Gherdovich
Cc: Ingo Molnar, Stanislaw Gruszka, linux-kernel, Mel Gorman
On Tue, 2016-08-02 at 12:37 +0200, Peter Zijlstra wrote:
> On Tue, Jul 26, 2016 at 04:07:14PM +0200, Giovanni Gherdovich wrote:
>
> > Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>
> SoB chain is borken. Either Mike wrote the patch in which case you're
> missing a From: Mike header someplace, or you wrote it and Mike needs to
> be a Ack/Reviewed or somesuch.
Hohum, I told him to just take the credit (blame) if he intended to
submit, but he didn't listen ;-) It was a demonstration that a much
bigger/invasive patch was doing the same thing in an indirect way.
-Mike
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-02 10:37 ` Peter Zijlstra
2016-08-02 13:26 ` Mike Galbraith
@ 2016-08-02 22:04 ` Giovanni Gherdovich
2016-08-03 10:02 ` Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-08-02 22:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
Hello Peter,
thank you for your reply.
On Tue, 2016-08-02 at 12:37 +0200, Peter Zijlstra wrote:
> On Tue, Jul 26, 2016 at 04:07:14PM +0200, Giovanni Gherdovich wrote:
>
> > Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>
> SoB chain is borken. Either Mike wrote the patch in which case you're
> missing a From: Mike header someplace, or you wrote it and Mike needs
> to be a Ack/Reviewed or somesuch.
Right. As Mike already explained, this patch is the result of him
correcting a much more involved/complicated solution I prepared to
solve the problem. I will put the "From: Mike" in v2.
>
> > ---
> > kernel/sched/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 51d7105..0ef1e69 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2998,6 +2998,10 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> > * thread, breaking clock_gettime().
> > */
> > if (task_current(rq, p) && task_on_rq_queued(p)) {
> > +#if defined(CONFIG_FAIR_GROUP_SCHED)
>
> This here wants a comment on why we're doing this. Because I'm sure
> that if someone were to read this code in a few weeks they'd go
> WTF!?
I had that config variable set in the machine I was testing on, and
thought that for some reason it was related to my observations. I will
repeat the experiment without it, and if I obtain the same results I
will drop the conditional. Otherwise I will motivate its necessity.
I will submit a v2 early next week, rebasing the patch on the
forthcoming 4.8-rc1 tag and updating the experimental data.
>
> Also, is there a possibility of manual CSE we should do?
>
> > + prefetch((&p->se)->cfs_rq->curr);
> > + prefetch(&(&p->se)->cfs_rq->curr->exec_start);
> > +#endif
> > update_rq_clock(rq);
> > p->sched_class->update_curr(rq);
> > }
Good point. I verified that GCC 4.8.5 already does it on its own, no hint
needed. This is the alternative code, with the CSE made explicit, that I
compiled:
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..5d676db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2998,6 +2998,11 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * thread, breaking clock_gettime().
 	 */
 	if (task_current(rq, p) && task_on_rq_queued(p)) {
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+		struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+		prefetch(curr);
+		prefetch(&curr->exec_start);
+#endif
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
I post below the snippets of generated code with and without CSE that
I got running 'disassemble /m task_sched_runtime' in gdb; you'll see
they're identical. If you prefer the explicit hint I'll include it in
v2, but it's probably safe to say it isn't needed.
Regards,
Giovanni
with CSE: -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
3001	#if defined(CONFIG_FAIR_GROUP_SCHED)
3002		struct sched_entity *curr = (&p->se)->cfs_rq->curr;
   <+117>:	mov    0x1d0(%rbx),%rdx
   <+124>:	mov    0x38(%rdx),%rdx
3003		prefetch(curr);
3004		prefetch(&curr->exec_start);
3005	#endif
3006		update_rq_clock(rq);
3007		p->sched_class->update_curr(rq);
   <+144>:	mov    0x58(%rbx),%rdx
   <+148>:	mov    %rax,%rdi
   <+151>:	mov    %rax,-0x20(%rbp)
   <+155>:	callq  *0xb0(%rdx)
   <+161>:	mov    -0x20(%rbp),%rax
   <+165>:	jmp    <task_sched_runtime+66>
   <+167>:	mov    %rax,%rdi
   <+170>:	mov    %rax,-0x20(%rbp)
   <+174>:	callq  <update_rq_clock>
   <+179>:	mov    -0x20(%rbp),%rax
   <+183>:	jmp    <task_sched_runtime+144>
   :		nopl   0x0(%rax)
3008	}
3009		ns = p->se.sum_exec_runtime;
   <+66>:	mov    0xc8(%rbx),%r12
3010		task_rq_unlock(rq, p, &rf);
3011
3012		return ns;
   <+103>:	mov    %r12,%rax
w/o CSE: -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
3001	#if defined(CONFIG_FAIR_GROUP_SCHED)
3002		prefetch((&p->se)->cfs_rq->curr);
   <+117>:	mov    0x1d0(%rbx),%rdx
   <+124>:	mov    0x38(%rdx),%rdx
3003		prefetch(&(&p->se)->cfs_rq->curr->exec_start);
3004	#endif
3005		update_rq_clock(rq);
3006		p->sched_class->update_curr(rq);
   <+144>:	mov    0x58(%rbx),%rdx
   <+148>:	mov    %rax,%rdi
   <+151>:	mov    %rax,-0x20(%rbp)
   <+155>:	callq  *0xb0(%rdx)
   <+161>:	mov    -0x20(%rbp),%rax
   <+165>:	jmp    <task_sched_runtime+66>
   <+167>:	mov    %rax,%rdi
   <+170>:	mov    %rax,-0x20(%rbp)
   <+174>:	callq  <update_rq_clock>
   <+179>:	mov    -0x20(%rbp),%rax
   <+183>:	jmp    <task_sched_runtime+144>
   :		nopl   0x0(%rax)
3007	}
3008		ns = p->se.sum_exec_runtime;
   <+66>:	mov    0xc8(%rbx),%r12
3009		task_rq_unlock(rq, p, &rf);
3010
3011		return ns;
   <+103>:	mov    %r12,%rax
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-02 22:04 ` Giovanni Gherdovich
@ 2016-08-03 10:02 ` Peter Zijlstra
2016-08-03 10:34 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-08-03 10:02 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Wed, Aug 03, 2016 at 12:04:52AM +0200, Giovanni Gherdovich wrote:
> > > +#if defined(CONFIG_FAIR_GROUP_SCHED)
> >
> > This here wants a comment on why we're doing this. Because I'm sure
> > that if someone were to read this code in a few weeks they'd go
> > WTF!?
>
> I had that config variable set in the machine I was testing on, and
> thought that for some reason it was related to my observations. I will
> repeat the experiment without it, and if I obtain the same results I
> will drop the conditional. Otherwise I will motivate its necessity.
>
No, I meant we want a comment here explaining the reason for these
prefetches.
You'll need the #ifdef because se->cfs_rq doesn't exist otherwise.
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2998,6 +2998,11 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> * thread, breaking clock_gettime().
> */
> if (task_current(rq, p) && task_on_rq_queued(p)) {
> +#if defined(CONFIG_FAIR_GROUP_SCHED)
> + struct sched_entity *curr = (&p->se)->cfs_rq->curr;
> + prefetch(curr);
> + prefetch(&curr->exec_start);
> +#endif
> update_rq_clock(rq);
> p->sched_class->update_curr(rq);
> }
> -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
>
> I post below the snippets of generated code with and without CSE that
> I got running 'disassemble /m task_sched_runtime' in gdb; you'll see
> they're identical. If you prefer the explicit hint I'll include it in
> v2, but it's probably safe to say it isn't needed.
I much prefer the manual CSE, it's much more readable.
Also, maybe pull the whole thing into a helper function with a
descriptive name, like:
/*
 * XXX comment on why this is needed goes here...
 */
static inline void prefetch_curr_exec_start(struct task_struct *p)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *curr = (&p->se)->cfs_rq->curr;

	prefetch(curr);
	prefetch(&curr->exec_start);
#endif
}
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-03 10:02 ` Peter Zijlstra
@ 2016-08-03 10:34 ` Peter Zijlstra
2016-08-05 7:58 ` Giovanni Gherdovich
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-08-03 10:34 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Wed, Aug 03, 2016 at 12:02:54PM +0200, Peter Zijlstra wrote:
> /*
>  * XXX comment on why this is needed goes here...
>  */
> static inline void prefetch_curr_exec_start(struct task_struct *p)
> {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> 	struct sched_entity *curr = (&p->se)->cfs_rq->curr;

#else
	struct sched_entity *curr = task_rq(p)->cfs_rq->curr;
#endif

Would be the 'right' thing for !cgroup stuffs

>
> 	prefetch(curr);
> 	prefetch(&curr->exec_start);
> }
* Re: [PATCH] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-03 10:34 ` Peter Zijlstra
@ 2016-08-05 7:58 ` Giovanni Gherdovich
0 siblings, 0 replies; 9+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05 7:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Stanislaw Gruszka, linux-kernel,
Mel Gorman, mgorman
On Wed, 2016-08-03 at 12:02 +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 12:04:52AM +0200, Giovanni Gherdovich wrote:
>
> > > > +#if defined(CONFIG_FAIR_GROUP_SCHED)
> > >
> > > This here wants a comment on why we're doing this. Because I'm
> > > sure that if someone were to read this code in a few weeks
> > > they'd go WTF!?
> >
> > I had that config variable set in the machine I was testing on, and
> > thought that for some reason it was related to my observations. I
> > will repeat the experiment without it, and if I obtain the same
> > results I will drop the conditional. Otherwise I will motivate its
> > necessity.
> >
>
> No, I meant we want a comment here explaining the reason for these
> prefetches.
>
> You'll need the #ifdef because se->cfs_rq doesn't exist otherwise.
Oh! Thanks for pointing that out.
> >
> > I post below the snippets of generated code with and without CSE
> > that I got running 'disassemble /m task_sched_runtime' in gdb;
> > you'll see they're identical. If you prefer the explicit hint I'll
> > include it in v2, but it's probably safe to say it isn't needed.
>
> I much prefer the manual CSE, its much more readable.
Ok.
>
> Also, maybe pull the whole thing into a helper function with a
> descriptive name, like:
>
> /*
> * XXX comment on why this is needed goes here...
> */
> static inline void prefetch_curr_exec_start(struct task_struct *p)
> {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> struct sched_entity *curr = (&p->se)->cfs_rq->curr;
>
> prefetch(curr);
> prefetch(&curr->exec_start);
> #endif
> }
Ok.
On Wed, 2016-08-03 at 12:34 +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 12:02:54PM +0200, Peter Zijlstra wrote:
> > /*
> > * XXX comment on why this is needed goes here...
> > */
> > static inline void prefetch_curr_exec_start(struct task_struct *p)
> > {
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > struct sched_entity *curr = (&p->se)->cfs_rq->curr;
> #else
> struct sched_entity *curr = task_rq(p)->cfs_rq->curr;
> #endif
>
> Would be the 'right' thing for !cgroup stuffs
Ok, thanks. With a bit of help I figured out that you actually meant to
write

#else
	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
#endif

Will add that.
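
For reference, folding both suggestions together, the v2 helper would look
roughly like this (a sketch of the plan above, not the code as actually
submitted):

/*
 * Prefetch the data that update_curr() is about to read (cfs_rq->curr and
 * curr->exec_start), which profiling showed to be cold in cache on this
 * path.
 */
static inline void prefetch_curr_exec_start(struct task_struct *p)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
#else
	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
#endif
	prefetch(curr);
	prefetch(&curr->exec_start);
}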
V2 is on its way; I previously said I would put "From: Mike" in the patch
header, but I have since discussed it with Mike and we agreed that the
patch will be "From: Giovanni" with a "Suggested-by: Mike" tag.
Regards,
Giovanni