Date: Wed, 19 Mar 2025 15:39:09 -0700
From: Namhyung Kim
To: Daniel Dao
Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Alexei Starovoitov, Andrii Nakryiko, Song Liu, kernel-team,
	Jesper Dangaard Brouer, linux-perf-users@vger.kernel.org, linux-kernel
Subject: Re: Scheduling latency outliers observed with perf sched record

Hello,

On Fri, Mar 14, 2025 at 11:59:23AM +0000, Daniel Dao wrote:
> Hi,
>
> At Cloudflare, on both kernel 6.6.82 and 6.12.17, we frequently
> observe scheduling latency outliers with `perf sched record`.
> For example:
>
> -------------------------------------------------------------------------------------------------------------------------------------------
>  Task                     | Runtime ms | Count | Avg delay ms    | Max delay ms    | Max delay start            | Max delay end            |
> -------------------------------------------------------------------------------------------------------------------------------------------
>  Pingora HTTP Pr:3133982  |   0.080 ms |     1 | avg: 881.614 ms | max: 881.614 ms | max start: 312881.567210 s | max end: 312882.448824 s |
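
(The summary above and the timelines below match the output formats of
`perf sched latency` and `perf sched timehist -w`; the exact invocations
are not shown in the thread, but a plausible recipe is:)

  $ perf sched record -- sleep 60    # capture sched_switch/wakeup events
  $ perf sched latency               # per-task delay summary, as above
  $ perf sched timehist -w           # per-event timeline incl. wakeups, as below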
>
> The scheduling timeline of the outlier is as follows:
>
>            time    cpu  task name                        wait time  sch delay  run time
>                         [tid/pid]                           (msec)     (msec)    (msec)  state
>  --------------- ------ -------------------------------  ---------  ---------  --------  -----
>   312881.567210 [0058]  Pingora HTTP Pr[3045283/3045241  awakened: Pingora HTTP Pr[3133982/3045241
>   312882.448807 [0015]  Pingora HTTP Pr[3045278/3045241  awakened: Pingora HTTP Pr[3133982/3045241
>   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241  awakened: Pingora HTTP Pr[3045278/3045241
>   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614     0.087  S
>
> According to `perf sched record`, the task received several wakeups
> before it finally executed on CPU 118. The timeline on CPU 118 is as
> follows:
>
>   312882.448824 [0118]                                       0.024      0.024     0.416  I
>   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241  awakened: Pingora HTTP Pr[3045278/3045241
>   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614     0.087  S
>
> This is very strange and seems impossible: the task had already been
> woken up, yet it could not run even though the CPU had idle time.
>
> In production we usually do not use `perf sched record`; instead we run
> an eBPF program similar to
> https://github.com/bpftrace/bpftrace/blob/master/tools/runqlat.bt:
>
> #define TASK_RUNNING 0
>
> tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
>     // a task becomes runnable: remember when it was queued
>     @qtime[args.pid] = nsecs;
> }
>
> tracepoint:sched:sched_switch {
>     // a task preempted while still runnable re-enters the queue
>     if (args.prev_state == TASK_RUNNING) {
>         @qtime[args.prev_pid] = nsecs;
>     }
>
>     // histogram of queued-to-running latency for the incoming task
>     $ns = @qtime[args.next_pid];
>     if ($ns) {
>         @usecs = hist((nsecs - $ns) / 1000);
>     }
>     delete(@qtime, args.next_pid);
> }
>
> We do not observe any outliers with this program. We figured out that
> the difference from perf lies in how repeated wakeups are handled: the
> eBPF script takes the last wakeup, while `perf sched latency` takes the
> first one. Once we adjusted our bpftrace script to take the first
> wakeup, we observed the same outliers.
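
(The "first wakeup" adjustment itself is not shown in the thread; a
minimal sketch of what such a wakeup probe might look like — only stamp
the queue time when no wakeup is already pending:)

  tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
      // keep the FIRST pending wakeup; a repeated wakeup for a task that
      // has not been switched in yet leaves the old timestamp in place
      if (@qtime[args.pid] == 0) {
          @qtime[args.pid] = nsecs;
      }
  }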
> Looking into this further, we adjusted the same bpftrace script to also
> print out `sum_exec_runtime`, to make sure that the process didn't
> actually run, i.e. that we did not miss any sched_switch events. The
> adjusted script is as follows:
>
> #define TASK_RUNNING 0
> #define THRESHOLD 100
>
> rt:sched_wakeup, rt:sched_wakeup_new
> {
>     $task = (struct task_struct *)arg0;
>     $ns = @qtime[$task->pid];
>     $sum_exec_runtime = @sum_exec_runtime[$task->pid];
>     if ($ns > 0) {
>         $ms = (nsecs - $ns) / 1000000;
>         if ($ms > THRESHOLD) {
>             printf("pid=%d t=%ld delay_ms=%d sum_exec_runtime=%ld prev_sum_exec_runtime=%ld cached_sum_exec_runtime=%ld\n",
>                    $task->pid, nsecs, $ms,
>                    $task->se.sum_exec_runtime,
>                    $task->se.prev_sum_exec_runtime,
>                    $sum_exec_runtime);
>         }
>         if ($task->se.sum_exec_runtime == $sum_exec_runtime) {
>             // Repeated wakeup: the task has not run since the last
>             // recorded wakeup and we have not missed any sched_switch
>             // events, so do not reset the queue timestamp.
>             return;
>         }
>     }
>     @qtime[$task->pid] = nsecs;
>     @sum_exec_runtime[$task->pid] = $task->se.sum_exec_runtime;
> }
>
> rt:sched_switch
> {
>     $prev = (struct task_struct *)arg1;
>     $next = (struct task_struct *)arg2;
>     $prev_state = arg3;
>     if ($prev_state == TASK_RUNNING) {
>         @qtime[$prev->pid] = nsecs;
>         @sum_exec_runtime[$prev->pid] = $prev->se.sum_exec_runtime;
>     }
>
>     $ns = @qtime[$next->pid];
>     if ($ns) {
>         @usecs = hist((nsecs - $ns) / 1000);
>     }
>     delete(@qtime[$next->pid]);
>     delete(@sum_exec_runtime[$next->pid]);
> }
>
> The script gives us the following data:
>
> pid=2197474 t=298353700427405 delay_ms=1157 sum_exec_runtime=1663205188 prev_sum_exec_runtime=1663167058 cached_sum_exec_runtime=1663167058
> pid=2196692 t=298354187452899 delay_ms=1644 sum_exec_runtime=1630867065 prev_sum_exec_runtime=1630824755 cached_sum_exec_runtime=1630824755
> pid=3290499 t=298354606215376 delay_ms=1482 sum_exec_runtime=26316767196 prev_sum_exec_runtime=26316654306 cached_sum_exec_runtime=26316654306
>
> We can see that the script detected a repeated wakeup (i.e. the program
> did not observe a sched_switch for the task in between), yet the task's
> `sum_exec_runtime` has changed.
>
> Am I understanding correctly that this means we must have missed a
> sched_switch tracepoint event? We didn't see any miss counters for our
> BPF program reported by bpftool. The `perf sched record` command also
> didn't report any missing events, yet I think it must have missed some
> of these sched_switch events as well.
>
> Is there anything else that we are missing here?

perf can miss some events when the ring buffer is full.  There should be
a warning about that.  Also you can check whether the data has LOST
records by running this command:

  $ perf report --stats | grep LOST

Thanks,
Namhyung
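
(A hedged follow-up to the suggestion above: `-m`/`--mmap-pages` is the
standard `perf record` option for enlarging the ring buffer, and `perf
sched record` passes extra options through to `perf record`; the buffer
size and workload below are only illustrative:)

  $ perf report --stats -i perf.data | grep LOST

  # if LOST records show up, re-record with a larger ring buffer, e.g.:
  $ perf sched record -m 4096 -- sleep 10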