* Scheduling latency outliers observed with perf sched record
From: Daniel Dao @ 2025-03-14 11:59 UTC (permalink / raw)
To: Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa,
Alexei Starovoitov, Andrii Nakryiko, Song Liu, kernel-team,
Jesper Dangaard Brouer, linux-perf-users, linux-kernel
Hi,
At Cloudflare, with both kernel 6.6.82 and 6.12.17, we frequently observe
scheduling latency outliers with `perf sched record`. For example:
 -----------------------------------------------------------------------------------------------------------------------------------------
  Task                    | Runtime ms | Count | Avg delay ms    | Max delay ms    | Max delay start            | Max delay end            |
 -----------------------------------------------------------------------------------------------------------------------------------------
  Pingora HTTP Pr:3133982 |  0.080 ms  |     1 | avg: 881.614 ms | max: 881.614 ms | max start: 312881.567210 s | max end: 312882.448824 s |
The scheduling timeline of the outlier is as follows:

             time    cpu  task name                          wait time  sch delay   run time  state
                          [tid/pid]                             (msec)     (msec)     (msec)
  ---------------  ------  --------------------------------  ---------  ---------  ---------  -----
    312881.567210  [0058]  Pingora HTTP Pr[3045283/3045241]             awakened: Pingora HTTP Pr[3133982/3045241]
    312882.448807  [0015]  Pingora HTTP Pr[3045278/3045241]             awakened: Pingora HTTP Pr[3133982/3045241]
    312882.448861  [0118]  Pingora HTTP Pr[3133982/3045241]             awakened: Pingora HTTP Pr[3045278/3045241]
    312882.448911  [0118]  Pingora HTTP Pr[3133982/3045241]      0.000    881.614      0.087  S
According to `perf sched record`, the task received several wakeups before it
finally executed on CPU 118. The timeline on CPU 118 is as follows:
    312882.448824  [0118]  <idle>                                0.024      0.024      0.416  I
    312882.448861  [0118]  Pingora HTTP Pr[3133982/3045241]             awakened: Pingora HTTP Pr[3045278/3045241]
    312882.448911  [0118]  Pingora HTTP Pr[3133982/3045241]      0.000    881.614      0.087  S
This is very strange and seems impossible: the task was woken up earlier but
was not able to run, even though the CPU had idle time.
In production we usually do not use `perf sched record`; instead we use an
eBPF program similar to
https://github.com/bpftrace/bpftrace/blob/master/tools/runqlat.bt:
tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
	@qtime[args.pid] = nsecs;
}

tracepoint:sched:sched_switch {
	if (args.prev_state == TASK_RUNNING) {
		@qtime[args.prev_pid] = nsecs;
	}

	$ns = @qtime[args.next_pid];
	if ($ns) {
		@usecs = hist((nsecs - $ns) / 1000);
	}
	delete(@qtime, args.next_pid);
}
We do not observe any outliers with this program. We figured that the
difference from perf sched lies in how supposedly repeated wakeups are
handled: the eBPF script takes the last wakeup, while `perf sched latency`
takes the first wakeup. Once we adjusted our bpftrace script to take the
first wakeup, we observed the same outliers.
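To make the difference concrete, here is a minimal, self-contained sketch (not our production tooling; the helper name and event values are illustrative, using the timestamps from the timeline above):

```python
# Sketch: how "first wakeup" vs "last wakeup" accounting changes the
# measured scheduling delay for a run of repeated wakeups.

def sched_delay(wakeups, switch_in, use_first):
    """Delay between a chosen wakeup and the moment the task runs."""
    anchor = wakeups[0] if use_first else wakeups[-1]
    return switch_in - anchor

# Two wakeups for the same task, then it finally runs on CPU 118.
# Timestamps in seconds, taken from the timeline above.
wakeups = [312881.567210, 312882.448807]
switch_in = 312882.448824

first = sched_delay(wakeups, switch_in, use_first=True)   # perf sched latency
last = sched_delay(wakeups, switch_in, use_first=False)   # original runqlat.bt

print(f"first-wakeup delay: {first * 1000:.3f} ms")  # the ~881 ms outlier
print(f"last-wakeup delay:  {last * 1000:.3f} ms")   # tiny, no outlier
```

Anchoring on the first wakeup reproduces the 881.614 ms outlier; anchoring on the last one hides it.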
Looking into this further, we adjusted the same bpftrace script to also print
out `sum_exec_runtime`, to make sure that the process didn't actually run,
i.e. that we did not miss any sched_switch events. The adjusted script is as
follows:
#define TASK_RUNNING 0
#define THRESHOLD 100

rt:sched_wakeup, rt:sched_wakeup_new
{
	$task = (struct task_struct *)arg0;
	$ns = @qtime[$task->pid];
	$sum_exec_runtime = @sum_exec_runtime[$task->pid];
	if ($ns > 0) {
		$ms = ((nsecs - $ns) / 1000000);
		if ($ms > THRESHOLD) {
			printf("pid=%d t=%ld delay_ms=%d sum_exec_runtime=%ld prev_sum_exec_runtime=%ld cached_sum_exec_runtime=%ld\n",
			    $task->pid, nsecs, $ms,
			    $task->se.sum_exec_runtime,
			    $task->se.prev_sum_exec_runtime,
			    $sum_exec_runtime);
		}
		if ($task->se.sum_exec_runtime == $sum_exec_runtime) {
			// Repeated wakeup: the task has not run since the
			// last wakeup and we have not missed any sched_switch
			// events, so do not reset the queue time.
			return;
		}
	}
	@qtime[$task->pid] = nsecs;
	@sum_exec_runtime[$task->pid] = $task->se.sum_exec_runtime;
}
rt:sched_switch
{
	$prev = (struct task_struct *)arg1;
	$next = (struct task_struct *)arg2;
	$prev_state = arg3;
	if ($prev_state == TASK_RUNNING) {
		@qtime[$prev->pid] = nsecs;
		@sum_exec_runtime[$prev->pid] = $prev->se.sum_exec_runtime;
	}

	$ns = @qtime[$next->pid];
	if ($ns) {
		@usecs = hist((nsecs - $ns) / 1000);
	}
	delete(@qtime[$next->pid]);
	delete(@sum_exec_runtime[$next->pid]);
}
The script gives us the following data:

pid=2197474 t=298353700427405 delay_ms=1157 sum_exec_runtime=1663205188 prev_sum_exec_runtime=1663167058 cached_sum_exec_runtime=1663167058
pid=2196692 t=298354187452899 delay_ms=1644 sum_exec_runtime=1630867065 prev_sum_exec_runtime=1630824755 cached_sum_exec_runtime=1630824755
pid=3290499 t=298354606215376 delay_ms=1482 sum_exec_runtime=26316767196 prev_sum_exec_runtime=26316654306 cached_sum_exec_runtime=26316654306
We can see that the script detected a repeated wakeup (i.e. the program did
not observe a sched_switch), yet the task's exec runtime had changed, so the
task must have run in between.
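The detection logic above can be restated in plain code (a simplified model of the bpftrace maps, with hypothetical names; real pids and runtimes come from the tracepoints):

```python
# Sketch of the repeated-wakeup check from the bpftrace script above:
# on a wakeup, compare the task's current sum_exec_runtime with the
# value cached at the previous wakeup. If they differ, the task must
# have run, meaning the tracer missed a sched_switch for it.

def classify_wakeup(cached_runtime, current_runtime):
    if cached_runtime is None:
        return "first wakeup"
    if current_runtime == cached_runtime:
        return "repeated wakeup, task did not run"
    return "task ran in between: a sched_switch was missed"

# One of the outliers reported above: cached and current runtimes
# differ, so the task ran without us observing a sched_switch.
print(classify_wakeup(1663167058, 1663205188))
```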
Am I understanding correctly that this means we must have missed a
sched_switch tracepoint event? We didn't see any miss counters reported for
our bpf program by bpftool. The `perf sched record` command also didn't
report any missing events, yet I think it must have missed some of these
sched_switch events as well.

Is there anything else that we are missing here?
Daniel.
* Re: Scheduling latency outliers observed with perf sched record
From: Namhyung Kim @ 2025-03-19 22:39 UTC (permalink / raw)
To: Daniel Dao
Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
Alexei Starovoitov, Andrii Nakryiko, Song Liu, kernel-team,
Jesper Dangaard Brouer, linux-perf-users, linux-kernel
Hello,
On Fri, Mar 14, 2025 at 11:59:23AM +0000, Daniel Dao wrote:
> Hi,
>
> At Cloudflare with both kernel 6.6.82 and 6.12.17, we frequently
> observe scheduling latency outliers with `perf sched record`.
>
> [...]
>
> Am I understanding correctly that this means that we must have missed
> a sched_switch tracepoint event ? We didn't see any
> miss counters for our bpf program reported by bpftool. The `perf sched
> record` command also didn't report any missing events,
> and I think it must have also missed some of these sched_switch events.
>
> Is there anything else that we are missing here ?
perf can miss some events when the ring buffer is full. There should be
a warning about that. You can also check whether the data has LOST
records by running this command:

  $ perf report --stats | grep LOST
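As a quick illustration of that check (a hypothetical sketch only; the exact line format printed by `perf report --stats` varies between perf versions, so the sample text below is an assumption, not real output):

```python
import re

def count_lost(stats_text):
    """Sum the counts on all LOST* event lines of a stats dump."""
    total = 0
    for line in stats_text.splitlines():
        m = re.match(r"\s*LOST\w* events:\s+(\d+)", line)
        if m:
            total += int(m.group(1))
    return total

# Hypothetical `perf report --stats` output, for illustration only.
sample = """
Aggregated stats:
           TOTAL events:     123456
            LOST events:         42
     LOST_SAMPLES events:          0
"""
print("lost records:", count_lost(sample))
```

A non-zero total would explain sched_switch events disappearing from the recorded data.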
Thanks,
Namhyung