* Scheduling latency outliers observed with perf sched record
From: Daniel Dao @ 2025-03-14 11:59 UTC (permalink / raw)
To: Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa,
Alexei Starovoitov, Andrii Nakryiko, Song Liu, kernel-team,
Jesper Dangaard Brouer, linux-perf-users, linux-kernel
Hi,
At Cloudflare, with both kernel 6.6.82 and 6.12.17, we frequently observe
scheduling latency outliers with `perf sched record`. For example:
 -----------------------------------------------------------------------------------------------------------------------------------------
  Task                    | Runtime ms | Count | Avg delay ms    | Max delay ms    | Max delay start            | Max delay end            |
 -----------------------------------------------------------------------------------------------------------------------------------------
  Pingora HTTP Pr:3133982 |  0.080 ms  |     1 | avg: 881.614 ms | max: 881.614 ms | max start: 312881.567210 s | max end: 312882.448824 s |
The scheduling timeline of the outlier is as follows:

             time    cpu  task name                          wait time  sch delay   run time  state
                          [tid/pid]                             (msec)     (msec)     (msec)
  ---------------  ------  --------------------------------  ---------  ---------  ---------  -----
    312881.567210  [0058]  Pingora HTTP Pr[3045283/3045241]             awakened: Pingora HTTP Pr[3133982/3045241]
    312882.448807  [0015]  Pingora HTTP Pr[3045278/3045241]             awakened: Pingora HTTP Pr[3133982/3045241]
    312882.448861  [0118]  Pingora HTTP Pr[3133982/3045241]             awakened: Pingora HTTP Pr[3045278/3045241]
    312882.448911  [0118]  Pingora HTTP Pr[3133982/3045241]      0.000    881.614      0.087  S
According to `perf sched record`, the task received several wakeups before it
finally executed on CPU 118. The timeline on CPU 118 is as follows:
    312882.448824  [0118]  <idle>                                0.024      0.024      0.416  I
    312882.448861  [0118]  Pingora HTTP Pr[3133982/3045241]             awakened: Pingora HTTP Pr[3045278/3045241]
    312882.448911  [0118]  Pingora HTTP Pr[3133982/3045241]      0.000    881.614      0.087  S
This is very strange and seems impossible: the task was woken up earlier but
was not able to run, even though the CPU had idle time.
In production we usually do not use `perf sched record`; instead we use an
eBPF program similar to
https://github.com/bpftrace/bpftrace/blob/master/tools/runqlat.bt:
tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
	@qtime[args.pid] = nsecs;
}

tracepoint:sched:sched_switch {
	if (args.prev_state == TASK_RUNNING) {
		@qtime[args.prev_pid] = nsecs;
	}

	$ns = @qtime[args.next_pid];
	if ($ns) {
		@usecs = hist((nsecs - $ns) / 1000);
	}
	delete(@qtime, args.next_pid);
}
We do not observe any outliers with this program. We figured that the
difference from perf sched lies in how supposedly repeated wakeups are
handled: the eBPF script takes the last wakeup, while `perf sched latency`
takes the first wakeup. Once we adjusted our bpftrace script to take the
first wakeup, we observed the same outliers.
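To make the difference concrete, here is a minimal, self-contained sketch (not our production tooling; the helper name and event values are illustrative, using the timestamps from the timeline above):

```python
# Sketch: how "first wakeup" vs "last wakeup" accounting changes the
# measured scheduling delay for a run of repeated wakeups.

def sched_delay(wakeups, switch_in, use_first):
    """Delay between a chosen wakeup and the moment the task runs."""
    anchor = wakeups[0] if use_first else wakeups[-1]
    return switch_in - anchor

# Two wakeups for the same task, then it finally runs on CPU 118.
# Timestamps in seconds, taken from the timeline above.
wakeups = [312881.567210, 312882.448807]
switch_in = 312882.448824

first = sched_delay(wakeups, switch_in, use_first=True)   # perf sched latency
last = sched_delay(wakeups, switch_in, use_first=False)   # original runqlat.bt

print(f"first-wakeup delay: {first * 1000:.3f} ms")  # the ~881 ms outlier
print(f"last-wakeup delay:  {last * 1000:.3f} ms")   # tiny, no outlier
```

Anchoring on the first wakeup reproduces the 881.614 ms outlier; anchoring on the last one hides it.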
Looking into this further, we adjusted the same bpftrace script to also print
out `sum_exec_runtime`, to make sure that the process didn't actually run,
i.e. that we did not miss any sched_switch events. The adjusted script is as
follows:
#define TASK_RUNNING 0
#define THRESHOLD 100

rt:sched_wakeup, rt:sched_wakeup_new
{
	$task = (struct task_struct *)arg0;
	$ns = @qtime[$task->pid];
	$sum_exec_runtime = @sum_exec_runtime[$task->pid];
	if ($ns > 0) {
		$ms = ((nsecs - $ns) / 1000000);
		if ($ms > THRESHOLD) {
			printf("pid=%d t=%ld delay_ms=%d sum_exec_runtime=%ld prev_sum_exec_runtime=%ld cached_sum_exec_runtime=%ld\n",
			    $task->pid, nsecs, $ms,
			    $task->se.sum_exec_runtime,
			    $task->se.prev_sum_exec_runtime,
			    $sum_exec_runtime);
		}
		if ($task->se.sum_exec_runtime == $sum_exec_runtime) {
			// Repeated wakeup: the task has not run since the
			// last wakeup and we have not missed any sched_switch
			// events, so do not reset the queue time.
			return;
		}
	}
	@qtime[$task->pid] = nsecs;
	@sum_exec_runtime[$task->pid] = $task->se.sum_exec_runtime;
}
rt:sched_switch
{
	$prev = (struct task_struct *)arg1;
	$next = (struct task_struct *)arg2;
	$prev_state = arg3;
	if ($prev_state == TASK_RUNNING) {
		@qtime[$prev->pid] = nsecs;
		@sum_exec_runtime[$prev->pid] = $prev->se.sum_exec_runtime;
	}

	$ns = @qtime[$next->pid];
	if ($ns) {
		@usecs = hist((nsecs - $ns) / 1000);
	}
	delete(@qtime[$next->pid]);
	delete(@sum_exec_runtime[$next->pid]);
}
The script gives us the following data:

pid=2197474 t=298353700427405 delay_ms=1157 sum_exec_runtime=1663205188 prev_sum_exec_runtime=1663167058 cached_sum_exec_runtime=1663167058
pid=2196692 t=298354187452899 delay_ms=1644 sum_exec_runtime=1630867065 prev_sum_exec_runtime=1630824755 cached_sum_exec_runtime=1630824755
pid=3290499 t=298354606215376 delay_ms=1482 sum_exec_runtime=26316767196 prev_sum_exec_runtime=26316654306 cached_sum_exec_runtime=26316654306
We can see that the script detected a repeated wakeup (i.e. the program did
not observe a sched_switch), yet the task's exec runtime had changed, so the
task must have run in between.
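The detection logic above can be restated in plain code (a simplified model of the bpftrace maps, with hypothetical names; real pids and runtimes come from the tracepoints):

```python
# Sketch of the repeated-wakeup check from the bpftrace script above:
# on a wakeup, compare the task's current sum_exec_runtime with the
# value cached at the previous wakeup. If they differ, the task must
# have run, meaning the tracer missed a sched_switch for it.

def classify_wakeup(cached_runtime, current_runtime):
    if cached_runtime is None:
        return "first wakeup"
    if current_runtime == cached_runtime:
        return "repeated wakeup, task did not run"
    return "task ran in between: a sched_switch was missed"

# One of the outliers reported above: cached and current runtimes
# differ, so the task ran without us observing a sched_switch.
print(classify_wakeup(1663167058, 1663205188))
```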
Am I understanding correctly that this means we must have missed a
sched_switch tracepoint event? We didn't see any miss counters reported for
our bpf program by bpftool. The `perf sched record` command also didn't
report any missing events, yet I think it must have missed some of these
sched_switch events as well.

Is there anything else that we are missing here?
Daniel.
* Re: Scheduling latency outliers observed with perf sched record
From: Namhyung Kim @ 2025-03-19 22:39 UTC (permalink / raw)
To: Daniel Dao
Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
Alexei Starovoitov, Andrii Nakryiko, Song Liu, kernel-team,
Jesper Dangaard Brouer, linux-perf-users, linux-kernel
Hello,
On Fri, Mar 14, 2025 at 11:59:23AM +0000, Daniel Dao wrote:
> Hi,
>
> At Cloudflare with both kernel 6.6.82 and 6.12.17, we frequently
> observe scheduling latency outliers with `perf sched record`.
>
> [...]
>
> Am I understanding correctly that this means that we must have missed
> a sched_switch tracepoint event ? We didn't see any
> miss counters for our bpf program reported by bpftool. The `perf sched
> record` command also didn't report any missing events,
> and I think it must have also missed some of these sched_switch events.
>
> Is there anything else that we are missing here ?
perf can miss some events when the ring buffer is full. There should be
a warning about that. You can also check whether the data has LOST
records by running this command:

  $ perf report --stats | grep LOST
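As a quick illustration of that check (a hypothetical sketch only; the exact line format printed by `perf report --stats` varies between perf versions, so the sample text below is an assumption, not real output):

```python
import re

def count_lost(stats_text):
    """Sum the counts on all LOST* event lines of a stats dump."""
    total = 0
    for line in stats_text.splitlines():
        m = re.match(r"\s*LOST\w* events:\s+(\d+)", line)
        if m:
            total += int(m.group(1))
    return total

# Hypothetical `perf report --stats` output, for illustration only.
sample = """
Aggregated stats:
           TOTAL events:     123456
            LOST events:         42
     LOST_SAMPLES events:          0
"""
print("lost records:", count_lost(sample))
```

A non-zero total would explain sched_switch events disappearing from the recorded data.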
Thanks,
Namhyung