* perf record per-thread vs per-cpu behavior
From: Josh Hunt @ 2021-12-08 23:33 UTC
  To: linux-perf-users

Hi

I'm seeing what I believe to be incorrect behavior when running perf 
record -p and was hoping I could get some clarification from someone 
here. We have an application with around 800 threads, and when I run 
perf record -p <pid> on it, the number of samples returned seems 
impossible given the number of CPUs, the sampling frequency, and the 
duration of the profile.

I'm running on a 5.4 LTS kernel and the machine has 40 hw threads.

Running with -p <pid> I see:
# perf record -p 7297 -s -F 49 --call-graph lbr -- sleep 5
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 11.798 MB perf.data (58430 samples) ]

and I've confirmed there are actually 58k samples in the data file, but 
the number of samples should be closer to 5 s * 49 Hz * 40 CPUs = 9800.
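
One way to sanity-check that count (I'm assuming a stock perf here) is 
to count the PERF_RECORD_SAMPLE entries in the raw dump, e.g.:

# perf report -D -i perf.data | grep -c PERF_RECORD_SAMPLE

which should give roughly the same number.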

Running with -a I see something closer to what I would expect:

# perf record -a -s -F 49 --call-graph lbr -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.405 MB perf.data (9745 samples) ]

Running with -p on applications with fewer threads seems to produce 
something closer to what I would expect since the pid being profiled 
isn't on cpu 100% of the time:

# perf record -p 11629 -s -F 49 --call-graph lbr -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.279 MB perf.data (1324 samples) ]

It's very possible I'm doing something wrong or don't understand things 
:) but if anyone has any insights into the above I'd appreciate your help!

I should also say that the resulting reports look different and it has 
me questioning the validity of the -p data.

I'm happy to provide more info if needed, and thanks for any guidance 
you can provide.

Josh


* Re: perf record per-thread vs per-cpu behavior
From: Jiri Olsa @ 2021-12-09 17:52 UTC
  To: Josh Hunt; +Cc: linux-perf-users

On Wed, Dec 08, 2021 at 03:33:18PM -0800, Josh Hunt wrote:
> Hi
> 
> I'm seeing what I believe to be incorrect behavior when running perf record
> -p and was hoping I could get some clarification from someone here. We have
> an application with around 800 threads, and when I run perf record -p
> <pid> on it, the number of samples returned seems impossible given the
> number of CPUs, the sampling frequency, and the duration of the profile.
> 
> I'm running on a 5.4 LTS kernel and the machine has 40 hw threads.
> 
> Running with -p <pid> I see:
> # perf record -p 7297 -s -F 49 --call-graph lbr -- sleep 5
> [ perf record: Woken up 32 times to write data ]
> [ perf record: Captured and wrote 11.798 MB perf.data (58430 samples) ]
> 
> and I've confirmed there are actually 58k samples in the data file, but the
> number of samples should be closer to 5 s * 49 Hz * 40 CPUs = 9800.
> 
> Running with -a I see something closer to what I would expect:
> 
> # perf record -a -s -F 49 --call-graph lbr -- sleep 5
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 3.405 MB perf.data (9745 samples) ]
> 
> Running with -p on applications with fewer threads seems to produce
> something closer to what I would expect since the pid being profiled isn't
> on cpu 100% of the time:
> 
> # perf record -p 11629 -s -F 49 --call-graph lbr -- sleep 5
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.279 MB perf.data (1324 samples) ]
> 
> It's very possible I'm doing something wrong or don't understand things :)
> but if anyone has any insights into the above I'd appreciate your help!

hi,
so with -p <pid> you are also monitoring all of the ~800 threads

and IIRC the way -F works is by gradually adjusting the counter's period
towards the value that produces the desired frequency

so I think you get an initial burst of samples whenever a thread gets
scheduled in (and there are 800 of them), which skews the expected count,
like:

# ./perf script
                                           value:
 sched-messaging 281355 413544.604702:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281355 413544.604707:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281355 413544.604709:        195 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281355 413544.604711:     108360 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281355 413544.604758:   60054506 cycles:  ffffffff81653ced __list_del_entry_valid+0x2d ([kernel.kallsyms])
 sched-messaging 281341 413544.604778:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281341 413544.604780:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281341 413544.604782:        566 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281341 413544.604784:     357999 cycles:  ffffffff81c53ae7 rcu_nmi_exit+0x17 ([kernel.kallsyms])
 sched-messaging 281467 413544.604823:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281467 413544.604825:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281467 413544.604827:        645 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281467 413544.604829:     484508 cycles:  ffffffff81e015c6 nmi_restore+0x25 ([kernel.kallsyms])
 sched-messaging 281467 413544.605025:  345122294 cycles:  ffffffff81312f3a memcg_slab_post_alloc_hook+0x17a ([kernel.kallsyms])
 sched-messaging 281346 413544.605262:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281346 413544.605266:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281346 413544.605268:        260 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281346 413544.605270:     124558 cycles:  ffffffff8106f098 native_write_msr+0x8 ([kernel.kallsyms])
 sched-messaging 281345 413544.605306:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281345 413544.605308:          1 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281345 413544.605310:        518 cycles:  ffffffff8106f094 native_write_msr+0x4 ([kernel.kallsyms])
 sched-messaging 281345 413544.605312:     344137 cycles:  ffffffff81c53ce2 irqentry_nmi_exit+0x32 ([kernel.kallsyms])
 ...

check the 'value:' column
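
that's the sample period; it's not in the default perf script output,
but assuming your perf has the 'period' field, something like this
should show it:

# perf script -F comm,tid,time,period,event,ip,sym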

with system-wide monitoring you don't see this issue and the counters
stabilize quickly, because they're not being scheduled in and out all
the time
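
fwiw the adjustment is roughly this idea (just a simplified sketch of
what perf_calculate_period()/perf_adjust_period() in kernel/events/core.c
are doing, not the actual kernel code, and ignoring the overflow
handling the kernel has to do):

#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* estimate the period that should produce 'freq' samples per second,
 * given that 'count' events were counted over the last 'nsec' ns */
uint64_t calc_period(uint64_t freq, uint64_t count, uint64_t nsec)
{
        uint64_t period;

        if (!nsec || !freq)
                return 1;

        /* the event rate is count * NSEC_PER_SEC / nsec events/sec,
         * so sampling every (rate / freq) events gives ~freq samples/sec */
        period = (count * NSEC_PER_SEC) / (freq * nsec);

        return period ? period : 1;
}

and IIRC in freq mode each event starts with a period of 1 and the
estimate is only updated while the task is running, so with ~800 threads
each of them goes through that ramp-up from a tiny period, which is
where the extra samples come from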

jirka

> 
> I should also say that the resulting reports look different and it has me
> questioning the validity of the -p data.
> 
> I'm happy to provide more info if needed, and thanks for any guidance you
> can provide.
> 
> Josh
> 


