* About tracing clues
From: Prasad Pandit @ 2022-08-19 10:06 UTC
To: linux-trace-users; +Cc: rostedt

Hello,

* While running oslat(8) tests in a guest running on RHEL kernel-rt
  host, I'm seeing occasional peculiar latency spikes as below
===
# trace-cmd record -p nop --poll -M 1E -e all /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30
oslat V 2.30
INFO: debugfs mountpoint: /sys/kernel/debug/tracing/
Total runtime:       43200 seconds
Thread priority:     SCHED_FIFO:1
CPU list:            1-4
CPU for main thread: 0
Workload:            no
Workload mem:        0 (KiB)
Preheat cores:       4

Pre-heat for 1 seconds...
Test starts...
/home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us!
Stopping the test.
...
   <idle>-0     [002]   265.041487: tlb_flush:           pages=0 reason=flush on task switch (0)
    oslat-16074 [002]   265.041489: print:               tracing_mark_write: /home/test/rt-tests/oslat: Starting thread on CPU 2.
    oslat-16074 [002]   265.041490: kfree:               (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil)
    oslat-16074 [002]   265.041490: sys_exit:            NR 1 = 53
    oslat-16074 [002]   265.041490: sys_exit_write:      0x35
    oslat-16074 [002]   265.041490: user_enter:
    oslat-16074 [002]   300.576710: user_exit:
    oslat-16074 [002]   300.576734: irq_handler_entry:   irq=21 name=virtio3
    oslat-16074 [002]   300.576734: irq_handler_exit:    irq=21 ret=handled
    oslat-16074 [002]   300.576734: sched_waking:        comm=irq/21-virtio3 pid=219 prio=49 target_cpu=002
    oslat-16074 [002]   300.576736: sched_migrate_task:  comm=irq/21-virtio3 pid=219 prio=49 orig_cpu=2 dest_cpu=0
    oslat-16074 [002]   300.576740: write_msr:           830, value fd
    oslat-16074 [002]   300.576740: sched_wakeup:        irq/21-virtio3:219 [49] success=1 CPU:000
    oslat-16074 [002]   300.576742: read_msr:            819, value 4
    oslat-16074 [002]   300.576754: irq_matrix_alloc:    bit=38 cpu=0 online=1 avl=195 alloc=6 managed=1 online_maps=6 global_avl=1178, global_rsvd=10, total_alloc=28
    oslat-16074 [002]   300.576755: vector_update:       irq=21 vector=38 cpu=0 prev_vector=34 prev_cpu=2
    oslat-16074 [002]   300.576755: vector_alloc:        irq=21 vector=38 reserved=0 ret=0
    oslat-16074 [002]   300.576756: vector_config:       irq=21 vector=38 cpu=0 apicdest=0x00000000
    oslat-16074 [002]   300.576773: user_enter:
    oslat-16074 [002]   300.576778: user_exit:
    oslat-16074 [002]   300.576778: sys_enter:           NR 1 (3, 7fcbb683c300, 65, 0, 0, 7fcbb683b816)
    oslat-16074 [002]   300.576778: sys_enter_write:     fd: 0x00000003, buf: 0x7fcbb683c300, count: 0x00000065
    oslat-16074 [002]   300.576780: print:               tracing_mark_write: /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us! Stopping the test.
    oslat-16074 [002]   300.576780: kfree:               (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil)
    oslat-16074 [002]   300.576781: sys_exit:            NR 1 = 101
    oslat-16074 [002]   300.576781: sys_exit_write:      0x65
    oslat-16074 [002]   300.576781: user_enter:
===

* Looking at these traces it is difficult to see/conclude the source of
  the said (67us) latency spike.
* Other logs I have for the same host/guest show _call_function_entry/exit
  or timer softirq entries in traces.
* Peculiar is that these spikes seem to occur when oslat(8) is invoked
  after the guest boot-up. Subsequent oslat(8) sessions in the same guest
  run over 12 hrs without any sign of latency spike.

* Is there a way to find further clues or conclusively figure out the
  source of such occasional latency spikes?

Thank you.
---
 - P J P
* Re: About tracing clues 2022-08-19 10:06 About tracing clues Prasad Pandit @ 2022-08-19 17:50 ` Steven Rostedt 2022-08-20 9:21 ` Prasad Pandit 0 siblings, 1 reply; 5+ messages in thread From: Steven Rostedt @ 2022-08-19 17:50 UTC (permalink / raw) To: Prasad Pandit, Daniel Bristot de Oliveira; +Cc: linux-trace-users On Fri, 19 Aug 2022 15:36:03 +0530 Prasad Pandit <ppandit@redhat.com> wrote: > Hello, > > * While running oslat(8) tests in a guest running on RHEL kernel-rt > host, I'm seeing occasional peculiar latency spikes as below > === > # trace-cmd record -p nop --poll -M 1E -e all > /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T > 30 Hmm you need to turn off line wraps when pasting in output like this. Because the line wraps make it very difficult to follow. /me fixes it: > === > # trace-cmd record -p nop --poll -M 1E -e all /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30 > INFO: debugfs mountpoint: /sys/kernel/debug/tracing/ > Total runtime: 43200 seconds > Thread priority: SCHED_FIFO:1 > CPU list: 1-4 > CPU for main thread: 0 > Workload: no > Workload mem: 0 (KiB) > Preheat cores: 4 > > Pre-heat for 1 seconds... > Test starts... > /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 > with 67 us! > Stopping the test. > ... > <idle>-0 [002] 265.041487: tlb_flush: pages=0 reason=flush on task switch (0) > oslat-16074 [002] 265.041489: print: tracing_mark_write: /home/test/rt-tests/oslat: Starting thread on CPU 2. > oslat-16074 [002] 265.041490: kfree: (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil) > oslat-16074 [002] 265.041490: sys_exit: NR 1 = 53 > oslat-16074 [002] 265.041490: sys_exit_write: 0x35 > oslat-16074 [002] 265.041490: user_enter: > oslat-16074 [002] 300.576710: user_exit: Here's where it exits the user app. Hmm, what's the 24us delay here? > oslat-16074 [002] 300.576734: irq_handler_entry: irq=21 name=virtio3 > oslat-16074 [002] 300.576734: irq_handler_exit: irq=21 ret=handled > oslat-16074 [002] 300.576734: sched_waking: comm=irq/21-virtio3 pid=219 prio=49 target_cpu=002 > oslat-16074 [002] 300.576736: sched_migrate_task: comm=irq/21-virtio3 pid=219 prio=49 orig_cpu=2 dest_cpu=0 > oslat-16074 [002] 300.576740: write_msr: 830, value fd > oslat-16074 [002] 300.576740: sched_wakeup: irq/21-virtio3:219 [49] success=1 CPU:000 > oslat-16074 [002] 300.576742: read_msr: 819, value 4 > oslat-16074 [002] 300.576754: irq_matrix_alloc: bit=38 cpu=0 online=1 avl=195 alloc=6 managed=1 online_maps=6 global_avl=1178, global_rsvd=10, total_alloc=28 > oslat-16074 [002] 300.576755: vector_update: irq=21 vector=38 cpu=0 prev_vector=34 prev_cpu=2 > oslat-16074 [002] 300.576755: vector_alloc: irq=21 vector=38 reserved=0 ret=0 > oslat-16074 [002] 300.576756: vector_config: irq=21 vector=38 cpu=0 apicdest=0x00000000 And here's where it enters. > oslat-16074 [002] 300.576773: user_enter: > oslat-16074 [002] 300.576778: user_exit: > oslat-16074 [002] 300.576778: sys_enter: NR 1 (3, 7fcbb683c300, 65, 0, 0, 7fcbb683b816) > oslat-16074 [002] 300.576778: sys_enter_write: fd: 0x00000003, buf: 0x7fcbb683c300, count: 0x00000065 > oslat-16074 [002] 300.576780: print: tracing_mark_write: /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us! Stopping the test. 
> oslat-16074 [002] 300.576780: kfree: (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil) > oslat-16074 [002] 300.576781: sys_exit: NR 1 = 101 > oslat-16074 [002] 300.576781: sys_exit_write: 0x65 > oslat-16074 [002] 300.576781: user_enter: > === > > * Looking at these traces it is difficult to see/conclude the source > of the said (67us) latency spike. > * Other logs I have for the same host/guest show > _call_function_entry/exit or timer softirq entries in traces. > * Peculiar is that these spikes seem to occur when oslat(8) is invoked > after the guest boot-up. > Subsequent oslat(8) sessions in the same guest run over 12 hrs > without any sign of latency spike. > > * Is there a way to find further clues or conclusively figure out the > source of such occasional latency spikes? > Since this is run on a guest, the host can obviously cause noise. If you have the latest trace-cmd (at least 3.1.2) and install that on both the host and guest, you can do a trace to get more info. You can even start trace-cmd on the guest and kick off the host. But you need to have vsockets enabled on the guest. To get the vsocket CID id, you can run the agent: # trace-cmd agent listening on @3:823 The @3 is the CID (823 is the port). Hit Ctrl-C on to kill the agent. Then on the host, run: # trace-cmd agent -P 3 Where 3 is the CID of the guest you want to connect the host (the same number from the 'trace-cmd agent'. That is, if it was @4:823, then use '-P 4'). The host agent above should also have a: listening on @2:823 or something. You'll need the host's CID too. Then on the guest, run: [ You do not need -p nop if you have -e used ] # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30 And then when it ends, you'll have two files. One that is trace.dat and one that is trace-host.dat. If everything worked well, the two would also be in sync with each other. Doing: # trace-cmd report -i trace-host.dat -i trace.dat Will show the host events happening with the guest events. If the host caused the latency, you will definitely see it. If you have the latest kernelshark, you can run: kernelshark trace-host.dat -a trace.dat And then select "Plots" -> "KVM Combo Plots" and in the popup, select the box for "all". Then you'll see plots for each virtual CPU and the host thread that represents it. -- Steve ^ permalink raw reply [flat|nested] 5+ messages in thread
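As a reference point for the CID numbers above, the local CID can also be
queried directly, outside of trace-cmd. The following is a minimal C sketch,
assuming the standard virtio-vsock setup where the kernel exposes /dev/vsock
and the <linux/vm_sockets.h> UAPI header; it is an illustration, not code
taken from trace-cmd:

    /* Print the local vsock CID, i.e. the N in "listening on @N:823". */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/vm_sockets.h>

    int main(void)
    {
            unsigned int cid = 0;
            int fd = open("/dev/vsock", O_RDONLY);

            if (fd < 0) {                /* no vsock device configured at all */
                    perror("open /dev/vsock");
                    return 1;
            }
            if (ioctl(fd, IOCTL_VM_SOCKETS_GET_LOCAL_CID, &cid) < 0) {
                    perror("IOCTL_VM_SOCKETS_GET_LOCAL_CID");
                    close(fd);
                    return 1;
            }
            printf("local CID: %u\n", cid);
            close(fd);
            return 0;
    }

On the host this would typically print 2 (VMADDR_CID_HOST), matching the
"listening on @2:823" above; in a guest it prints whatever CID was assigned
to the guest's vsock device, if one is configured.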
* Re: About tracing clues
From: Prasad Pandit @ 2022-08-20 9:21 UTC
To: Steven Rostedt; +Cc: Daniel Bristot de Oliveira, linux-trace-users

Hello Steven,

On Fri, 19 Aug 2022 at 23:29, Steven Rostedt <rostedt@goodmis.org> wrote:
> Hmm, you need to turn off line wraps when pasting in output like this,
> because the line wraps make it very difficult to follow.

* True, I should've done that; I'll remember to do it.

> To get the vsocket CID, you can run the agent:
>
>   # trace-cmd agent
>   listening on @3:823
>
> The @3 is the CID (823 is the port).
>
> Hit Ctrl-C to kill the agent.
>
> Then on the host, run:
>
>   # trace-cmd agent -P 3
>
> Where 3 is the CID of the guest you want the host to connect to (the same
> number from 'trace-cmd agent'; that is, if it was @4:823, then use '-P 4').
> The host agent above should also have a:
>
>   listening on @2:823
>
> or something. You'll need the host's CID too.
>
> Then on the guest, run:
>
> [ You do not need -p nop if you have -e used ]
>
>   # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30
>
> And then when it ends, you'll have two files: one is trace.dat and one is
> trace-host.dat. If everything worked well, the two will also be in sync
> with each other. Doing:
>
>   # trace-cmd report -i trace-host.dat -i trace.dat
>
> will show the host events happening with the guest events. If the host
> caused the latency, you will definitely see it.
>
> If you have the latest kernelshark, you can run:
>
>   kernelshark trace-host.dat -a trace.dat
>
> and then select "Plots" -> "KVM Combo Plots" and in the popup, select the
> box for "all". Then you'll see plots for each virtual CPU and the host
> thread that represents it.

* I see, thank you so much for these interesting details. I'll try these
  steps and get back to you soon.

Thank you.
---
 - P J P
* Re: About tracing clues
From: Prasad Pandit @ 2022-08-23 19:24 UTC
To: Steven Rostedt; +Cc: Daniel Bristot de Oliveira, linux-trace-users

Hello Steve,

On Sat, 20 Aug 2022 at 14:51, Prasad Pandit <ppandit@redhat.com> wrote:
>> # trace-cmd agent

* For me, `trace-cmd agent` in the guest shows => listening on @4294967295:823

>> # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30

* Running the above command with --proxy 4294967295:823 throws an error =>
  guest 4294967295 not found

* Looking through the 'trace-record.c' source, IIUC:
  - The out-of-int-range CID value (=4294967295) causes the atoi(3) calls in
    parse_guest_name() to return -1, so 'cid' is set to -1.
  - Later in parse_guest_name(), guest->name is returned without setting the
    '*res' addrinfo parameter.

        guest = trace_get_guest(*cid, gname);
        if (guest) {
                *cid = guest->cid;
                return guest->name;     <==
        }

* ie. cid==-1 and result==NULL leads to the above error via
  => die("guest %s not found", optarg);

* Instead of 'return guest->name', if we set 'gname = guest->name',
  parse_guest_name() sets the '*res' addrinfo struct and returns gname.
  That helps to avoid the above error, but then another error shows up:
  ==> Failed to connect to host host:823

* If I run the above command with --name 4294967295, then it breaks with
  ==> Failed to connect to host 4294967295:823

* Not sure if the out-of-range CID value itself is erroneous OR the
  parse_guest_name() function needs some fixing OR maybe both. Wdyt...?

Thank you.
---
 - P J P
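Prasad's point about atoi(3) collapsing the CID to -1 can be reproduced in
isolation. A small sketch, assuming a typical 64-bit Linux/glibc toolchain
where atoi() behaves like (int)strtol(); strictly, the out-of-int-range
conversion is implementation-defined, so the exact result may vary:

    /* atoi("4294967295") does not fit in an int; on common 64-bit glibc
     * systems the value wraps to -1, which is what parse_guest_name()
     * then sees as the CID. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            const char *cid_arg = "4294967295";   /* from --proxy 4294967295:823 */
            printf("atoi(\"%s\") = %d\n", cid_arg, atoi(cid_arg));  /* -1 here */
            return 0;
    }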
* Re: About tracing clues
From: Steven Rostedt @ 2022-08-23 19:30 UTC
To: Prasad Pandit; +Cc: Daniel Bristot de Oliveira, linux-trace-users

On Wed, 24 Aug 2022 00:54:48 +0530
Prasad Pandit <ppandit@redhat.com> wrote:

> Hello Steve,
>
> On Sat, 20 Aug 2022 at 14:51, Prasad Pandit <ppandit@redhat.com> wrote:
> >> # trace-cmd agent
>
> * For me, `trace-cmd agent` in the guest shows => listening on @4294967295:823

I need to fix that to be more meaningful. That 4294967295 is (unsigned)-1.
In other words, the creation of the vsocket failed. It means that you do
not have vsockets defined for this guest, or you are running inside a
container on the guest that prevents you from accessing the vsockets.

> >> # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30
>
> * Running the above command with --proxy 4294967295:823 throws an error =>
>   guest 4294967295 not found

The rest will not work because the vsocket failed.

-- Steve
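The (unsigned)-1 observation is the other half of the atoi() sketch earlier:
the failed CID of -1 is simply printed as an unsigned value. A tiny
demonstration, not trace-cmd code; the port 823 is just the value seen in
the thread:

    #include <stdio.h>

    int main(void)
    {
            int cid = -1;                           /* vsocket creation failed */
            printf("listening on @%u:823\n", (unsigned int)cid);
            /* prints: listening on @4294967295:823 */
            return 0;
    }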