* About tracing clues
From: Prasad Pandit @ 2022-08-19 10:06 UTC
To: linux-trace-users; +Cc: rostedt

Hello,

* While running oslat(8) tests in a guest running on RHEL kernel-rt
  host, I'm seeing occasional peculiar latency spikes as below
===
# trace-cmd record -p nop --poll -M 1E -e all /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30
oslat V 2.30
INFO: debugfs mountpoint: /sys/kernel/debug/tracing/
Total runtime:       43200 seconds
Thread priority:     SCHED_FIFO:1
CPU list:            1-4
CPU for main thread: 0
Workload:            no
Workload mem:        0 (KiB)
Preheat cores:       4

Pre-heat for 1 seconds...
Test starts...
/home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us!
Stopping the test.
...
   <idle>-0     [002]   265.041487: tlb_flush:           pages=0 reason=flush on task switch (0)
    oslat-16074 [002]   265.041489: print:               tracing_mark_write: /home/test/rt-tests/oslat: Starting thread on CPU 2.
    oslat-16074 [002]   265.041490: kfree:               (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil)
    oslat-16074 [002]   265.041490: sys_exit:            NR 1 = 53
    oslat-16074 [002]   265.041490: sys_exit_write:      0x35
    oslat-16074 [002]   265.041490: user_enter:
    oslat-16074 [002]   300.576710: user_exit:
    oslat-16074 [002]   300.576734: irq_handler_entry:   irq=21 name=virtio3
    oslat-16074 [002]   300.576734: irq_handler_exit:    irq=21 ret=handled
    oslat-16074 [002]   300.576734: sched_waking:        comm=irq/21-virtio3 pid=219 prio=49 target_cpu=002
    oslat-16074 [002]   300.576736: sched_migrate_task:  comm=irq/21-virtio3 pid=219 prio=49 orig_cpu=2 dest_cpu=0
    oslat-16074 [002]   300.576740: write_msr:           830, value fd
    oslat-16074 [002]   300.576740: sched_wakeup:        irq/21-virtio3:219 [49] success=1 CPU:000
    oslat-16074 [002]   300.576742: read_msr:            819, value 4
    oslat-16074 [002]   300.576754: irq_matrix_alloc:    bit=38 cpu=0 online=1 avl=195 alloc=6 managed=1 online_maps=6 global_avl=1178, global_rsvd=10, total_alloc=28
    oslat-16074 [002]   300.576755: vector_update:       irq=21 vector=38 cpu=0 prev_vector=34 prev_cpu=2
    oslat-16074 [002]   300.576755: vector_alloc:        irq=21 vector=38 reserved=0 ret=0
    oslat-16074 [002]   300.576756: vector_config:       irq=21 vector=38 cpu=0 apicdest=0x00000000
    oslat-16074 [002]   300.576773: user_enter:
    oslat-16074 [002]   300.576778: user_exit:
    oslat-16074 [002]   300.576778: sys_enter:           NR 1 (3, 7fcbb683c300, 65, 0, 0, 7fcbb683b816)
    oslat-16074 [002]   300.576778: sys_enter_write:     fd: 0x00000003, buf: 0x7fcbb683c300, count: 0x00000065
    oslat-16074 [002]   300.576780: print:               tracing_mark_write: /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us! Stopping the test.
    oslat-16074 [002]   300.576780: kfree:               (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil)
    oslat-16074 [002]   300.576781: sys_exit:            NR 1 = 101
    oslat-16074 [002]   300.576781: sys_exit_write:      0x65
    oslat-16074 [002]   300.576781: user_enter:
===

* Looking at these traces it is difficult to see/conclude the source of
  the said (67us) latency spike.
* Other logs I have for the same host/guest show _call_function_entry/exit
  or timer softirq entries in traces.
* Peculiar is that these spikes seem to occur when oslat(8) is invoked
  after the guest boot-up. Subsequent oslat(8) sessions in the same guest
  run over 12 hrs without any sign of latency spike.

* Is there a way to find further clues or conclusively figure out the
  source of such occasional latency spikes?

Thank you.
---
 - P J P
* Re: About tracing clues 2022-08-19 10:06 About tracing clues Prasad Pandit @ 2022-08-19 17:50 ` Steven Rostedt 2022-08-20 9:21 ` Prasad Pandit 0 siblings, 1 reply; 5+ messages in thread From: Steven Rostedt @ 2022-08-19 17:50 UTC (permalink / raw) To: Prasad Pandit, Daniel Bristot de Oliveira; +Cc: linux-trace-users On Fri, 19 Aug 2022 15:36:03 +0530 Prasad Pandit <ppandit@redhat.com> wrote: > Hello, > > * While running oslat(8) tests in a guest running on RHEL kernel-rt > host, I'm seeing occasional peculiar latency spikes as below > === > # trace-cmd record -p nop --poll -M 1E -e all > /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T > 30 Hmm you need to turn off line wraps when pasting in output like this. Because the line wraps make it very difficult to follow. /me fixes it: > === > # trace-cmd record -p nop --poll -M 1E -e all /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30 > INFO: debugfs mountpoint: /sys/kernel/debug/tracing/ > Total runtime: 43200 seconds > Thread priority: SCHED_FIFO:1 > CPU list: 1-4 > CPU for main thread: 0 > Workload: no > Workload mem: 0 (KiB) > Preheat cores: 4 > > Pre-heat for 1 seconds... > Test starts... > /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 > with 67 us! > Stopping the test. > ... > <idle>-0 [002] 265.041487: tlb_flush: pages=0 reason=flush on task switch (0) > oslat-16074 [002] 265.041489: print: tracing_mark_write: /home/test/rt-tests/oslat: Starting thread on CPU 2. > oslat-16074 [002] 265.041490: kfree: (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil) > oslat-16074 [002] 265.041490: sys_exit: NR 1 = 53 > oslat-16074 [002] 265.041490: sys_exit_write: 0x35 > oslat-16074 [002] 265.041490: user_enter: > oslat-16074 [002] 300.576710: user_exit: Here's where it exits the user app. Hmm, what's the 24us delay here? > oslat-16074 [002] 300.576734: irq_handler_entry: irq=21 name=virtio3 > oslat-16074 [002] 300.576734: irq_handler_exit: irq=21 ret=handled > oslat-16074 [002] 300.576734: sched_waking: comm=irq/21-virtio3 pid=219 prio=49 target_cpu=002 > oslat-16074 [002] 300.576736: sched_migrate_task: comm=irq/21-virtio3 pid=219 prio=49 orig_cpu=2 dest_cpu=0 > oslat-16074 [002] 300.576740: write_msr: 830, value fd > oslat-16074 [002] 300.576740: sched_wakeup: irq/21-virtio3:219 [49] success=1 CPU:000 > oslat-16074 [002] 300.576742: read_msr: 819, value 4 > oslat-16074 [002] 300.576754: irq_matrix_alloc: bit=38 cpu=0 online=1 avl=195 alloc=6 managed=1 online_maps=6 global_avl=1178, global_rsvd=10, total_alloc=28 > oslat-16074 [002] 300.576755: vector_update: irq=21 vector=38 cpu=0 prev_vector=34 prev_cpu=2 > oslat-16074 [002] 300.576755: vector_alloc: irq=21 vector=38 reserved=0 ret=0 > oslat-16074 [002] 300.576756: vector_config: irq=21 vector=38 cpu=0 apicdest=0x00000000 And here's where it enters. > oslat-16074 [002] 300.576773: user_enter: > oslat-16074 [002] 300.576778: user_exit: > oslat-16074 [002] 300.576778: sys_enter: NR 1 (3, 7fcbb683c300, 65, 0, 0, 7fcbb683b816) > oslat-16074 [002] 300.576778: sys_enter_write: fd: 0x00000003, buf: 0x7fcbb683c300, count: 0x00000065 > oslat-16074 [002] 300.576780: print: tracing_mark_write: /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us! Stopping the test. 
> oslat-16074 [002] 300.576780: kfree: (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil) > oslat-16074 [002] 300.576781: sys_exit: NR 1 = 101 > oslat-16074 [002] 300.576781: sys_exit_write: 0x65 > oslat-16074 [002] 300.576781: user_enter: > === > > * Looking at these traces it is difficult to see/conclude the source > of the said (67us) latency spike. > * Other logs I have for the same host/guest show > _call_function_entry/exit or timer softirq entries in traces. > * Peculiar is that these spikes seem to occur when oslat(8) is invoked > after the guest boot-up. > Subsequent oslat(8) sessions in the same guest run over 12 hrs > without any sign of latency spike. > > * Is there a way to find further clues or conclusively figure out the > source of such occasional latency spikes? > Since this is run on a guest, the host can obviously cause noise. If you have the latest trace-cmd (at least 3.1.2) and install that on both the host and guest, you can do a trace to get more info. You can even start trace-cmd on the guest and kick off the host. But you need to have vsockets enabled on the guest. To get the vsocket CID id, you can run the agent: # trace-cmd agent listening on @3:823 The @3 is the CID (823 is the port). Hit Ctrl-C on to kill the agent. Then on the host, run: # trace-cmd agent -P 3 Where 3 is the CID of the guest you want to connect the host (the same number from the 'trace-cmd agent'. That is, if it was @4:823, then use '-P 4'). The host agent above should also have a: listening on @2:823 or something. You'll need the host's CID too. Then on the guest, run: [ You do not need -p nop if you have -e used ] # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30 And then when it ends, you'll have two files. One that is trace.dat and one that is trace-host.dat. If everything worked well, the two would also be in sync with each other. Doing: # trace-cmd report -i trace-host.dat -i trace.dat Will show the host events happening with the guest events. If the host caused the latency, you will definitely see it. If you have the latest kernelshark, you can run: kernelshark trace-host.dat -a trace.dat And then select "Plots" -> "KVM Combo Plots" and in the popup, select the box for "all". Then you'll see plots for each virtual CPU and the host thread that represents it. -- Steve ^ permalink raw reply [flat|nested] 5+ messages in thread
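As a reference point for the CID numbers above, the local CID can also be
queried directly, outside of trace-cmd. The following is a minimal C sketch,
assuming the standard virtio-vsock setup where the kernel exposes /dev/vsock
and the <linux/vm_sockets.h> UAPI header; it is an illustration, not code
taken from trace-cmd:

    /* Print the local vsock CID, i.e. the N in "listening on @N:823". */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/vm_sockets.h>

    int main(void)
    {
            unsigned int cid = 0;
            int fd = open("/dev/vsock", O_RDONLY);

            if (fd < 0) {                /* no vsock device configured at all */
                    perror("open /dev/vsock");
                    return 1;
            }
            if (ioctl(fd, IOCTL_VM_SOCKETS_GET_LOCAL_CID, &cid) < 0) {
                    perror("IOCTL_VM_SOCKETS_GET_LOCAL_CID");
                    close(fd);
                    return 1;
            }
            printf("local CID: %u\n", cid);
            close(fd);
            return 0;
    }

On the host this would typically print 2 (VMADDR_CID_HOST), matching the
"listening on @2:823" above; in a guest it prints whatever CID was assigned
to the guest's vsock device, if one is configured.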
* Re: About tracing clues
From: Prasad Pandit @ 2022-08-20 9:21 UTC
To: Steven Rostedt; +Cc: Daniel Bristot de Oliveira, linux-trace-users

Hello Steven,

On Fri, 19 Aug 2022 at 23:29, Steven Rostedt <rostedt@goodmis.org> wrote:
> Hmm, you need to turn off line wraps when pasting in output like this,
> because the line wraps make it very difficult to follow.

* True, I should've done that; I'll remember to do it.

> To get the vsocket CID, you can run the agent:
>
>   # trace-cmd agent
>   listening on @3:823
>
> The @3 is the CID (823 is the port).
>
> Hit Ctrl-C to kill the agent.
>
> Then on the host, run:
>
>   # trace-cmd agent -P 3
>
> Where 3 is the CID of the guest you want the host to connect to (the same
> number from 'trace-cmd agent'; that is, if it was @4:823, then use '-P 4').
> The host agent above should also have a:
>
>   listening on @2:823
>
> or something. You'll need the host's CID too.
>
> Then on the guest, run:
>
> [ You do not need -p nop if you have -e used ]
>
>   # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30
>
> And then when it ends, you'll have two files: one is trace.dat and one is
> trace-host.dat. If everything worked well, the two will also be in sync
> with each other. Doing:
>
>   # trace-cmd report -i trace-host.dat -i trace.dat
>
> will show the host events happening with the guest events. If the host
> caused the latency, you will definitely see it.
>
> If you have the latest kernelshark, you can run:
>
>   kernelshark trace-host.dat -a trace.dat
>
> and then select "Plots" -> "KVM Combo Plots" and in the popup, select the
> box for "all". Then you'll see plots for each virtual CPU and the host
> thread that represents it.

* I see, thank you so much for these interesting details. I'll try these
  steps and get back to you soon.

Thank you.
---
 - P J P
* Re: About tracing clues
From: Prasad Pandit @ 2022-08-23 19:24 UTC
To: Steven Rostedt; +Cc: Daniel Bristot de Oliveira, linux-trace-users

Hello Steve,

On Sat, 20 Aug 2022 at 14:51, Prasad Pandit <ppandit@redhat.com> wrote:
>> # trace-cmd agent

* For me, `trace-cmd agent` in the guest shows => listening on @4294967295:823

>> # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30

* Running the above command with --proxy 4294967295:823 throws an error =>
  guest 4294967295 not found

* Looking through the 'trace-record.c' source, IIUC:
  - The out-of-int-range CID value (=4294967295) causes the atoi(3) calls in
    parse_guest_name() to return -1, so 'cid' is set to -1.
  - Later in parse_guest_name(), guest->name is returned without setting the
    '*res' addrinfo parameter.

        guest = trace_get_guest(*cid, gname);
        if (guest) {
                *cid = guest->cid;
                return guest->name;     <==
        }

* ie. cid==-1 and result==NULL leads to the above error via
  => die("guest %s not found", optarg);

* Instead of 'return guest->name', if we set 'gname = guest->name',
  parse_guest_name() sets the '*res' addrinfo struct and returns gname.
  That helps to avoid the above error, but then another error shows up:
  ==> Failed to connect to host host:823

* If I run the above command with --name 4294967295, then it breaks with
  ==> Failed to connect to host 4294967295:823

* Not sure if the out-of-range CID value itself is erroneous OR the
  parse_guest_name() function needs some fixing OR maybe both. Wdyt...?

Thank you.
---
 - P J P
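Prasad's point about atoi(3) collapsing the CID to -1 can be reproduced in
isolation. A small sketch, assuming a typical 64-bit Linux/glibc toolchain
where atoi() behaves like (int)strtol(); strictly, the out-of-int-range
conversion is implementation-defined, so the exact result may vary:

    /* atoi("4294967295") does not fit in an int; on common 64-bit glibc
     * systems the value wraps to -1, which is what parse_guest_name()
     * then sees as the CID. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            const char *cid_arg = "4294967295";   /* from --proxy 4294967295:823 */
            printf("atoi(\"%s\") = %d\n", cid_arg, atoi(cid_arg));  /* -1 here */
            return 0;
    }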
* Re: About tracing clues
From: Steven Rostedt @ 2022-08-23 19:30 UTC
To: Prasad Pandit; +Cc: Daniel Bristot de Oliveira, linux-trace-users

On Wed, 24 Aug 2022 00:54:48 +0530
Prasad Pandit <ppandit@redhat.com> wrote:

> Hello Steve,
>
> On Sat, 20 Aug 2022 at 14:51, Prasad Pandit <ppandit@redhat.com> wrote:
> >> # trace-cmd agent
>
> * For me, `trace-cmd agent` in the guest shows => listening on @4294967295:823

I need to fix that to be more meaningful. That 4294967295 is (unsigned)-1.
In other words, the creation of the vsocket failed. It means that you do
not have vsockets defined for this guest, or you are running inside a
container on the guest that prevents you from accessing the vsockets.

> >> # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30
>
> * Running the above command with --proxy 4294967295:823 throws an error =>
>   guest 4294967295 not found

The rest will not work because the vsocket failed.

-- Steve
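The (unsigned)-1 observation is the other half of the atoi() sketch earlier:
the failed CID of -1 is simply printed as an unsigned value. A tiny
demonstration, not trace-cmd code; the port 823 is just the value seen in
the thread:

    #include <stdio.h>

    int main(void)
    {
            int cid = -1;                           /* vsocket creation failed */
            printf("listening on @%u:823\n", (unsigned int)cid);
            /* prints: listening on @4294967295:823 */
            return 0;
    }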