* measuring system wide CPU usage ignoring idle process
From: Milian Wolff @ 2017-11-20 14:00 UTC
To: linux-perf-users
Cc: acme, namhyung, Jiri Olsa

Hey all,

colleagues of mine just brought this inconvenient perf stat behavior to my
attention:

$ perf stat -a -e cpu-clock,task-clock,cycles,instructions sleep 1

 Performance counter stats for 'system wide':

       4004.501439      cpu-clock (msec)          #    4.000 CPUs utilized
       4004.526474      task-clock (msec)         #    4.000 CPUs utilized
       945,906,029      cycles                    #    0.236 GHz
       461,861,241      instructions              #    0.49  insn per cycle

       1.001247082 seconds time elapsed

This shows that cpu-clock and task-clock are also incremented for the idle
process. Is there some trick to exclude that time, such that the CPU
utilization drops below 100% when doing `perf stat -a`? Or should one ignore
these clock measurements for system wide stats and only look at the
cycles/instructions etc.?

This does go somewhat in the direction of
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
anyway, so I'm not opposed to this.

Thanks

--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

* Re: measuring system wide CPU usage ignoring idle process
From: Jiri Olsa @ 2017-11-20 14:29 UTC
To: Milian Wolff
Cc: linux-perf-users, acme, namhyung

On Mon, Nov 20, 2017 at 03:00:46PM +0100, Milian Wolff wrote:
> Hey all,
>
> colleagues of mine just brought this inconvenient perf stat behavior to my
> attention:
>
> $ perf stat -a -e cpu-clock,task-clock,cycles,instructions sleep 1
>
>  Performance counter stats for 'system wide':
>
>        4004.501439      cpu-clock (msec)          #    4.000 CPUs utilized
>        4004.526474      task-clock (msec)         #    4.000 CPUs utilized
>        945,906,029      cycles                    #    0.236 GHz
>        461,861,241      instructions              #    0.49  insn per cycle
>
>        1.001247082 seconds time elapsed
>
> This shows that cpu-clock and task-clock are also incremented for the idle
> process. Is there some trick to exclude that time, such that the CPU
> utilization drops below 100% when doing `perf stat -a`?

I don't think it's the idle process you see, I think it's the managing
overhead before the 'sleep 1' task actually goes to sleep

there's some user space code before it gets into the sleep syscall,
and there's some possible kernel scheduling/syscall/irq code with
events already enabled and counting

in the following 3 sessions you can see the counts are pretty much
the same regardless of the sleeping time:

[jolsa@krava perf]$ sudo ./perf stat -e cycles:u,cycles:k sleep 1

 Performance counter stats for 'sleep 1':

           316,478      cycles:u
           594,468      cycles:k

       1.000813330 seconds time elapsed

[jolsa@krava perf]$ sudo ./perf stat -e cycles:u,cycles:k sleep 5

 Performance counter stats for 'sleep 5':

           339,287      cycles:u
           665,888      cycles:k

       5.001004575 seconds time elapsed

[jolsa@krava perf]$ sudo ./perf stat -e cycles:u,cycles:k sleep 10

 Performance counter stats for 'sleep 10':

           314,507      cycles:u
           658,764      cycles:k

      10.001117596 seconds time elapsed

jirka

* Re: measuring system wide CPU usage ignoring idle process
From: Milian Wolff @ 2017-11-20 20:24 UTC
To: Jiri Olsa
Cc: linux-perf-users, acme, namhyung

On Montag, 20. November 2017 15:29:08 CET Jiri Olsa wrote:
> On Mon, Nov 20, 2017 at 03:00:46PM +0100, Milian Wolff wrote:
> > [...]
> > This shows that cpu-clock and task-clock are also incremented for the idle
> > process. Is there some trick to exclude that time, such that the CPU
> > utilization drops below 100% when doing `perf stat -a`?
>
> I don't think it's the idle process you see, I think it's the managing
> overhead before the 'sleep 1' task actually goes to sleep
>
> there's some user space code before it gets into the sleep syscall,
> and there's some possible kernel scheduling/syscall/irq code with
> events already enabled and counting

Sorry for being unclear: I was talking about the task-clock and cpu-clock
values, which you omitted from your measurements below. My example also shows
that the counts for cycles and instructions are fine. But the cpu-clock and
task-clock are useless as they always sum up to essentially
`$nproc*$runtime`. What I'm hoping for are fractional values for the
"N CPUs utilized".

> in the following 3 sessions you can see the counts are pretty much
> the same regardless of the sleeping time:
>
> [...]

--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

* Re: measuring system wide CPU usage ignoring idle process
From: Jiri Olsa @ 2017-11-20 23:44 UTC
To: Milian Wolff
Cc: linux-perf-users, acme, namhyung

On Mon, Nov 20, 2017 at 09:24:42PM +0100, Milian Wolff wrote:
> On Montag, 20. November 2017 15:29:08 CET Jiri Olsa wrote:
> > [...]
> > I don't think it's the idle process you see, I think it's the managing
> > overhead before the 'sleep 1' task actually goes to sleep
>
> Sorry for being unclear: I was talking about the task-clock and cpu-clock
> values, which you omitted from your measurements below. My example also shows
> that the counts for cycles and instructions are fine. But the cpu-clock and
> task-clock are useless as they always sum up to essentially
> `$nproc*$runtime`. What I'm hoping for are fractional values for the
> "N CPUs utilized".

ugh my bad.. anyway, by using -a you create cpu counters
which never unschedule, so those times will be the same
as the 'sleep 1' run length

but not sure now how to get the real utilization.. will check

jirka

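Until something like that exists in perf itself, one workaround that needs no
kernel or tool changes is to derive utilization from the idle time the kernel
already accounts in /proc/stat, rather than from cpu-clock. Below is a minimal
sketch, assuming the conventional /proc/stat column order (user nice system
idle iowait irq softirq steal) and treating idle plus iowait as idle; error
handling is kept to a minimum:

#include <stdio.h>
#include <unistd.h>

/* Sum the first eight jiffies columns of the aggregate "cpu" line and
 * report how many of them were spent idle (including iowait). */
static int read_cpu_times(unsigned long long *total, unsigned long long *idle)
{
	unsigned long long v[8] = { 0 };
	FILE *f = fopen("/proc/stat", "r");
	int i;

	if (!f)
		return -1;
	if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]) != 8) {
		fclose(f);
		return -1;
	}
	fclose(f);

	*total = 0;
	for (i = 0; i < 8; i++)
		*total += v[i];
	*idle = v[3] + v[4];	/* idle + iowait */
	return 0;
}

int main(void)
{
	unsigned long long t0, i0, t1, i1;
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	if (read_cpu_times(&t0, &i0))
		return 1;
	sleep(1);		/* or run the workload to be measured here */
	if (read_cpu_times(&t1, &i1))
		return 1;

	/* fraction of available CPU time that was not idle */
	double busy = 1.0 - (double)(i1 - i0) / (double)(t1 - t0);
	printf("%.3f CPUs utilized\n", busy * ncpus);
	return 0;
}

Sampled before and after a workload, this yields the fractional
"N CPUs utilized" value that `perf stat -a` cannot currently report.
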
* Re: measuring system wide CPU usage ignoring idle process
From: Milian Wolff @ 2017-11-23 13:40 UTC
To: Jiri Olsa
Cc: linux-perf-users, acme, namhyung

On Tuesday, November 21, 2017 12:44:38 AM CET Jiri Olsa wrote:
> On Mon, Nov 20, 2017 at 09:24:42PM +0100, Milian Wolff wrote:
> > [...]
> > Sorry for being unclear: I was talking about the task-clock and cpu-clock
> > values, which you omitted from your measurements below. [...] What I'm
> > hoping for are fractional values for the "N CPUs utilized".
>
> ugh my bad.. anyway, by using -a you create cpu counters
> which never unschedule, so those times will be the same
> as the 'sleep 1' run length
>
> but not sure now how to get the real utilization.. will check

Hey jirka,

did you have a chance to check the above? I'd be really interested in knowing
whether there is an existing workaround. If not, would it be feasible to patch
perf to get the desired behavior? I'd be willing to look into this. This would
probably require changes on the kernel side though, or how could this be
fixed?

Thanks

--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

* Re: measuring system wide CPU usage ignoring idle process
From: Jiri Olsa @ 2017-11-23 14:09 UTC
To: Milian Wolff
Cc: linux-perf-users, acme, namhyung, Ingo Molnar, Alexander Shishkin, Peter Zijlstra

On Thu, Nov 23, 2017 at 02:40:36PM +0100, Milian Wolff wrote:
> [...]
> did you have a chance to check the above? I'd be really interested in knowing
> whether there is an existing workaround. If not, would it be feasible to patch
> perf to get the desired behavior? I'd be willing to look into this. This would
> probably require changes on the kernel side though, or how could this be
> fixed?

hi,
I haven't found any good way yet.. I ended up with the following
patch to allow attaching counters to the idle process, which got
me the count/behaviour you need (with a few tools changes in
my perf/idle branch)

but I'm not sure it's the best idea ;-) there might
be a better way.. CC-ing Ingo, Peter and Alexander

thanks
jirka

---
 include/uapi/linux/perf_event.h |  1 +
 kernel/events/core.c            | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 362493a2f950..9e48598d1f1d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -947,6 +947,7 @@ enum perf_callchain_context {
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
 #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
 #define PERF_FLAG_FD_CLOEXEC		(1UL << 3) /* O_CLOEXEC */
+#define PERF_FLAG_PID_IDLE		(1UL << 4) /* attach to idle process */
 
 #if defined(__LITTLE_ENDIAN_BITFIELD)
 union perf_mem_data_src {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 799bb352d99f..529b07aecea7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -346,7 +346,8 @@ static void event_function_local(struct perf_event *event, event_f func, void *d
 #define PERF_FLAG_ALL	(PERF_FLAG_FD_NO_GROUP |\
 			 PERF_FLAG_FD_OUTPUT  |\
 			 PERF_FLAG_PID_CGROUP |\
-			 PERF_FLAG_FD_CLOEXEC)
+			 PERF_FLAG_FD_CLOEXEC |\
+			 PERF_FLAG_PID_IDLE)
 
 /*
  * branch priv levels that need permission checks
@@ -9898,6 +9899,9 @@ SYSCALL_DEFINE5(perf_event_open,
 	if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
 		return -EINVAL;
 
+	if ((flags & PERF_FLAG_PID_IDLE) && (pid == -1 || cpu == -1))
+		return -EINVAL;
+
 	if (flags & PERF_FLAG_FD_CLOEXEC)
 		f_flags |= O_CLOEXEC;
 
@@ -9917,7 +9921,13 @@ SYSCALL_DEFINE5(perf_event_open,
 	}
 
 	if (pid != -1 && !(flags & PERF_FLAG_PID_CGROUP)) {
-		task = find_lively_task_by_vpid(pid);
+		if (flags & PERF_FLAG_PID_IDLE) {
+			task = idle_task(cpu);
+			get_task_struct(task);
+		} else {
+			task = find_lively_task_by_vpid(pid);
+		}
+
 		if (IS_ERR(task)) {
 			err = PTR_ERR(task);
 			goto err_group_fd;
--
2.13.6

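For illustration, a sketch of how a tool might consume the proposed interface.
This is hypothetical: PERF_FLAG_PID_IDLE only exists with the patch above
applied, and the flag value and the pid/cpu requirements are taken from that
patch. A system-wide view would need one such counter per CPU:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#ifndef PERF_FLAG_PID_IDLE
#define PERF_FLAG_PID_IDLE (1UL << 4)	/* from the patch above, not upstream */
#endif

int main(void)
{
	struct perf_event_attr attr;
	uint64_t idle_ns = 0;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_TASK_CLOCK;

	/* The patch requires pid != -1 and cpu != -1; with PERF_FLAG_PID_IDLE
	 * set, the pid value itself is ignored and the target becomes
	 * idle_task(cpu), here the idle task of CPU 0. */
	fd = syscall(__NR_perf_event_open, &attr, 0, /* cpu */ 0, -1,
		     PERF_FLAG_PID_IDLE);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	sleep(1);

	/* task-clock counts nanoseconds the idle task spent on-CPU */
	if (read(fd, &idle_ns, sizeof(idle_ns)) == sizeof(idle_ns))
		printf("CPU 0 was idle for %.3fs of the last second\n",
		       idle_ns / 1e9);
	close(fd);
	return 0;
}

perf stat could then subtract such idle counts from cpu-clock to print a
fractional utilization, which is presumably what the tools-side changes on
the mentioned perf/idle branch wire up.
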
* Re: measuring system wide CPU usage ignoring idle process
From: Jiri Olsa @ 2017-11-23 14:21 UTC
To: Milian Wolff
Cc: linux-perf-users, acme, namhyung, Ingo Molnar, Alexander Shishkin, Peter Zijlstra

On Thu, Nov 23, 2017 at 03:09:31PM +0100, Jiri Olsa wrote:
> [...]
> hi,
> I haven't found any good way yet.. I ended up with the following
> patch to allow attaching counters to the idle process, which got
> me the count/behaviour you need (with a few tools changes in
> my perf/idle branch)
>
> but I'm not sure it's the best idea ;-) there might
> be a better way..

CC-ing Ingo, Peter and Alexander

also I was thinking we might add an 'idle' line into perf top ;-)
shouldn't be that hard once we have the counter

jirka

* Re: measuring system wide CPU usage ignoring idle process
From: Arnaldo Carvalho de Melo @ 2017-11-23 14:42 UTC
To: Jiri Olsa
Cc: Milian Wolff, linux-perf-users, namhyung, Ingo Molnar, Alexander Shishkin, Peter Zijlstra

Em Thu, Nov 23, 2017 at 03:21:00PM +0100, Jiri Olsa escreveu:
> On Thu, Nov 23, 2017 at 03:09:31PM +0100, Jiri Olsa wrote:
> > On Thu, Nov 23, 2017 at 02:40:36PM +0100, Milian Wolff wrote:
> > > On Tuesday, November 21, 2017 12:44:38 AM CET Jiri Olsa wrote:
> > > > On Mon, Nov 20, 2017 at 09:24:42PM +0100, Milian Wolff wrote:
> > > > > [...]
> > > > > Sorry for being unclear: I was talking about the task-clock and
> > > > > cpu-clock values, which you omitted from your measurements below.
> > > > > [...] What I'm hoping for are fractional values for the
> > > > > "N CPUs utilized".
> > > >
> > > > ugh my bad.. anyway, by using -a you create cpu counters
> > > > which never unschedule, so those times will be the same
> > > > as the 'sleep 1' run length

Humm, what role perf_event_attr.exclude_idle has here?

> > > > but not sure now how to get the real utilization.. will check
> > >
> > > did you have a chance to check the above? I'd be really interested in
> > > knowing whether there is an existing workaround. [...]
> >
> > hi,
> > I haven't found any good way yet.. I ended up with the following
> > patch to allow attaching counters to the idle process, which got
> > me the count/behaviour you need (with a few tools changes in
> > my perf/idle branch)
> >
> > but I'm not sure it's the best idea ;-) there might
> > be a better way.. CC-ing Ingo, Peter and Alexander
>
> also I was thinking we might add an 'idle' line into perf top ;-)
> shouldn't be that hard once we have the counter

Humm... What is wrong with perf_event_attr.exclude_idle? :-)

From include/uapi/linux/perf_event.h:

	exclude_idle   :  1, /* don't count when idle */

But it is not being set:

[root@jouet ~]# perf stat -vv -a -e cpu-clock,task-clock,cycles,instructions sleep 1
Using CPUID GenuineIntel-6-3D
intel_pt default config: tsc,pt,branch
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  sample_type                      IDENTIFIER
  read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
  disabled                         1
  inherit                          1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 3
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 7
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x1
  sample_type                      IDENTIFIER
  read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
  disabled                         1
  inherit                          1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 8
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 11
------------------------------------------------------------
perf_event_attr:
  size                             112
  sample_type                      IDENTIFIER
  read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
  disabled                         1
  inherit                          1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 12
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 13
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 14
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 15
------------------------------------------------------------
perf_event_attr:
  size                             112
  config                           0x1
  sample_type                      IDENTIFIER
  read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
  disabled                         1
  inherit                          1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 16
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 17
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 18
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 19
cpu-clock: 0: 1001547771 1001547617 1001547617
cpu-clock: 1: 1001552938 1001552742 1001552742
cpu-clock: 2: 1001555120 1001554407 1001554407
cpu-clock: 3: 1001563889 1001563570 1001563570
cpu-clock: 4006219718 4006218336 4006218336
task-clock: 0: 1001603894 1001603894 1001603894
task-clock: 1: 1001616140 1001616140 1001616140
task-clock: 2: 1001617338 1001617338 1001617338
task-clock: 3: 1001621998 1001621998 1001621998
task-clock: 4006459370 4006459370 4006459370
cycles: 0: 71757776 1001642926 1001642926
cycles: 1: 23188411 1001651335 1001651335
cycles: 2: 24665622 1001654878 1001654878
cycles: 3: 79907293 1001659590 1001659590
cycles: 199519102 4006608729 4006608729
instructions: 0: 40314068 1001677791 1001677791
instructions: 1: 13525409 1001682314 1001682314
instructions: 2: 14247277 1001682655 1001682655
instructions: 3: 23286057 1001685112 1001685112
instructions: 91372811 4006727872 4006727872

 Performance counter stats for 'system wide':

       4006.219718      cpu-clock (msec)          #    3.999 CPUs utilized
       4006.459370      task-clock (msec)         #    3.999 CPUs utilized
       199,519,102      cycles                    #    0.050 GHz
        91,372,811      instructions              #    0.46  insn per cycle

       1.001749823 seconds time elapsed

[root@jouet ~]#

So then I tried the patch at the end of this message, but it doesn't seem to
affect software counters such as cpu-clock and task-clock:

[root@jouet ~]# perf stat --no-idle -a -e cpu-clock,task-clock,cycles,instructions sleep 1m

 Performance counter stats for 'system wide':

     240005.027025      cpu-clock (msec)          #    4.000 CPUs utilized
     240005.150119      task-clock (msec)         #    4.000 CPUs utilized
     2,658,680,286      cycles                    #    0.011 GHz
     1,109,111,339      instructions              #    0.42  insn per cycle

      60.001361214 seconds time elapsed

[root@jouet ~]# perf stat --idle -a -e cpu-clock,task-clock,cycles,instructions sleep 1m

 Performance counter stats for 'system wide':

     240006.825047      cpu-clock (msec)          #    4.000 CPUs utilized
     240006.964995      task-clock (msec)         #    4.000 CPUs utilized
     2,784,702,480      cycles                    #    0.012 GHz
     1,210,285,863      instructions              #    0.43  insn per cycle

      60.001806963 seconds time elapsed

[root@jouet ~]#
[root@jouet ~]# perf stat -vv --no-idle -a -e cpu-clock,task-clock,cycles,instructions sleep 1 |& grep exclude_idle
  exclude_idle                     1
  exclude_idle                     1
  exclude_idle                     1
  exclude_idle                     1
[root@jouet ~]# perf stat -vv -a -e cpu-clock,task-clock,cycles,instructions sleep 1 |& grep exclude_idle
[root@jouet ~]# perf stat --idle -vv -a -e cpu-clock,task-clock,cycles,instructions sleep 1 |& grep exclude_idle
[root@jouet ~]#

Time to look at the kernel...

- Arnaldo

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 59af5a8419e2..32860537e114 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -144,6 +144,7 @@ typedef int (*aggr_get_id_t)(struct cpu_map *m, int cpu);
 
 static int			run_count			=  1;
 static bool			no_inherit			= false;
+static bool			idle				= true;
 static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
@@ -237,6 +238,7 @@ static int create_perf_stat_counter(struct perf_evsel *evsel)
 		attr->read_format |= PERF_FORMAT_ID|PERF_FORMAT_GROUP;
 
 	attr->inherit = !no_inherit;
+	attr->exclude_idle = !idle;
 
 	/*
 	 * Some events get initialized with sample_(period/type) set,
@@ -1890,6 +1892,7 @@ static const struct option stat_options[] = {
 	OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
 		     "monitor specified metrics or metric groups (separated by ,)",
 		     parse_metric_groups),
+	OPT_BOOLEAN(0, "idle", &idle, "Measure when idle"),
 	OPT_END()
 };

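The same observation can be reproduced without perf: a minimal standalone
sketch that opens one system-wide cpu-clock counter per CPU with exclude_idle
set (this needs root or a permissive /proc/sys/kernel/perf_event_paranoid) and
sums the counts. On a mostly idle machine, the sum still coming out at roughly
one second per CPU shows the bit being ignored in counting mode for this event:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	int fd[ncpus];
	uint64_t total = 0;
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_CPU_CLOCK;
		attr.exclude_idle = 1;	/* the bit under discussion */

		fd[cpu] = perf_event_open(&attr, -1, cpu, -1, 0);
		if (fd[cpu] < 0) {
			perror("perf_event_open");
			return 1;
		}
	}

	sleep(1);

	for (cpu = 0; cpu < ncpus; cpu++) {
		uint64_t count = 0;

		if (read(fd[cpu], &count, sizeof(count)) != sizeof(count))
			return 1;
		total += count;
		close(fd[cpu]);
	}

	/* cpu-clock counts nanoseconds; if exclude_idle were honoured,
	 * this would print well below the number of (mostly idle) CPUs */
	printf("%.3f CPUs utilized\n", total / 1e9);
	return 0;
}
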
* Re: measuring system wide CPU usage ignoring idle process
From: Jiri Olsa @ 2017-11-23 15:12 UTC
To: Arnaldo Carvalho de Melo
Cc: Milian Wolff, linux-perf-users, namhyung, Ingo Molnar, Alexander Shishkin, Peter Zijlstra

On Thu, Nov 23, 2017 at 11:42:20AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Nov 23, 2017 at 03:21:00PM +0100, Jiri Olsa escreveu:
> > [...]
>
> Humm, what role perf_event_attr.exclude_idle has here?

it's used for omitting samples from the idle process.. but looks like it's
enforced for software clock events

AFAICS it's not used in counting mode

jirka

* Re: measuring system wide CPU usage ignoring idle process
From: Arnaldo Carvalho de Melo @ 2017-11-23 18:59 UTC
To: Jiri Olsa
Cc: Milian Wolff, linux-perf-users, namhyung, Ingo Molnar, Alexander Shishkin, Peter Zijlstra

Em Thu, Nov 23, 2017 at 04:12:05PM +0100, Jiri Olsa escreveu:
> On Thu, Nov 23, 2017 at 11:42:20AM -0300, Arnaldo Carvalho de Melo wrote:
> > [...]
> > Humm, what role perf_event_attr.exclude_idle has here?
>
> it's used for omitting samples from the idle process.. but looks like it's
> enforced for software clock events

looks like it is NOT enforced?

> AFAICS it's not used in counting mode

But it should? I think it should, as we see from Milian's use case.

PeterZ sent a patch, I guess we should continue from there :-)

- Arnaldo

* Re: measuring system wide CPU usage ignoring idle process
From: Jiri Olsa @ 2017-11-24 8:14 UTC
To: Arnaldo Carvalho de Melo
Cc: Milian Wolff, linux-perf-users, namhyung, Ingo Molnar, Alexander Shishkin, Peter Zijlstra

On Thu, Nov 23, 2017 at 03:59:41PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Nov 23, 2017 at 04:12:05PM +0100, Jiri Olsa escreveu:
> > [...]
> > it's used for omitting samples from the idle process.. but looks like it's
> > enforced for software clock events
>
> looks like it is NOT enforced?

yea.. NOT ;-)

> > AFAICS it's not used in counting mode
>
> But it should? I think it should, as we see from Milian's use case.
>
> PeterZ sent a patch, I guess we should continue from there :-)

right

jirka

* Re: measuring system wide CPU usage ignoring idle process
From: Peter Zijlstra @ 2017-11-23 15:15 UTC
To: Arnaldo Carvalho de Melo
Cc: Jiri Olsa, Milian Wolff, linux-perf-users, namhyung, Ingo Molnar, Alexander Shishkin

On Thu, Nov 23, 2017 at 11:42:20AM -0300, Arnaldo Carvalho de Melo wrote:
> What is wrong with perf_event_attr.exclude_idle? :-)

Neither task- nor cpu-clock actually implement that..

Something like the _completely_untested_ below might cure that for
cpu-clock. I have the nagging feeling we actually already account the
idle time _somewhere_, but I couldn't remember and was too lazy to go
find -- but someone should if this were to become an actual patch.

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a59fe11558a4..5386d551b373 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8900,6 +8908,10 @@ static void cpu_clock_event_update(struct perf_event *event)
 	u64 now;
 
 	now = local_clock();
+
+	if (event->attr.exclude_idle)
+		now -= idle_task(event->oncpu)->se.sum_exec_runtime;
+
 	prev = local64_xchg(&event->hw.prev_count, now);
 	local64_add(now - prev, &event->count);
 }
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..419c620510c6 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -27,9 +27,14 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
+	struct task_struct *idle = rq->idle;
+
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+
+	idle->se.exec_start = rq_clock_task(rq);
+
 	return rq->idle;
 }
 
@@ -48,6 +53,17 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+	struct task_struct *idle = rq->idle;
+	u64 delta, now;
+
+	now = rq_clock_task(rq);
+	delta = now - idle->se.exec_start;
+	if (unlikely((s64)delta < 0))
+		delta = 0;
+
+	idle->se.sum_exec_runtime += delta;
+	idle->se.exec_start = now;
+
 	rq_last_tick_reset(rq);
 }
 
@@ -57,6 +73,9 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
 
 static void set_curr_task_idle(struct rq *rq)
 {
+	struct task_struct *idle = rq->idle;
+
+	idle->se.exec_start = rq_clock_task(rq);
 }
 
 static void switched_to_idle(struct rq *rq, struct task_struct *p)

* Re: measuring system wide CPU usage ignoring idle process
From: Arnaldo Carvalho de Melo @ 2018-04-17 13:41 UTC
To: Stephane Eranian
Cc: Peter Zijlstra, Jiri Olsa, Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Alexander Shishkin, Linux Kernel Mailing List

Em Thu, Nov 23, 2017 at 04:15:36PM +0100, Peter Zijlstra escreveu:
> On Thu, Nov 23, 2017 at 11:42:20AM -0300, Arnaldo Carvalho de Melo wrote:
> > What is wrong with perf_event_attr.exclude_idle? :-)
>
> Neither task- nor cpu-clock actually implement that..
>
> Something like the _completely_untested_ below might cure that for
> cpu-clock. I have the nagging feeling we actually already account the
> idle time _somewhere_, but I couldn't remember and was too lazy to go
> find -- but someone should if this were to become an actual patch.

Stephane, this was the thread,

- Arnaldo