* [PATCH] perf: Timehist account sch delay for scheduled out running
@ 2024-06-13 18:59 Fernand Sieber
2024-06-15 20:50 ` Madadi Vineeth Reddy
0 siblings, 1 reply; 3+ messages in thread
From: Fernand Sieber @ 2024-06-13 18:59 UTC (permalink / raw)
Cc: linux-perf-users, linux-kernel, Fernand Sieber, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin
When using perf timehist, sch delay is only computed for a waking task,
not for a pre empted task. This patches addresses this problem.
Example of `perf timehist` report before the patch for `stress` task
competing with each other.
First column is wait time, second column sch delay, third column
runtime.
```
1.492060 [0000] s stress[81] 1.999 0.000 2.000 R next: stress[83]
1.494060 [0000] s stress[83] 2.000 0.000 2.000 R next: stress[81]
1.496060 [0000] s stress[81] 2.000 0.000 2.000 R next: stress[83]
1.498060 [0000] s stress[83] 2.000 0.000 1.999 R next: stress[81]
```
After the patch, it looks like this (note that all wait time is now sch
delay instead of zero):
```
1.492060 [0000] s stress[81] 1.999 1.999 2.000 R next: stress[83]
1.494060 [0000] s stress[83] 2.000 2.000 2.000 R next: stress[81]
1.496060 [0000] s stress[81] 2.000 2.000 2.000 R next: stress[83]
1.498060 [0000] s stress[83] 2.000 2.000 1.999 R next: stress[81]
```
In timehist:
* wait time represents the duration waiting for any system resource
* sch delay represents the duration waiting for cpu system resources
This is based on perf comments (dt_wait = wait time, dt_delay = sch
delay):
```
/*
* Explanation of delta-time stats:
*
* t = time of current schedule out event
* tprev = time of previous sched out event
* also time of schedule-in event for current task
* last_time = time of last sched change event for current task
* (i.e, time process was last scheduled out)
* ready_to_run = time of wakeup for current task
*
* -----|------------|------------|------------|------
* last ready tprev t
* time to run
*
* |-------- dt_wait --------|
* |- dt_delay -|-- dt_run --|
*
* dt_run = run time of current task
* dt_wait = time between last schedule out event for task and tprev
* represents time spent off the cpu
* dt_delay = time between wakeup and schedule-in of task
*/
```
The problem with the current logic is that last time is only set when
waking a task. Therefore it is not set for a pre empted task. To fix
this, we set last time to the current sample time if a scheduled out
task (on the switch tracepoint) is in state running.
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
tools/perf/builtin-sched.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 5977c49ae2c7..7422c930abaf 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -2659,7 +2659,10 @@ static int timehist_sched_change_event(struct perf_tool *tool,
tr->last_state = state;
/* sched out event for task so reset ready to run time */
- tr->ready_to_run = 0;
+ if (state == 'R')
+ tr->ready_to_run = t;
+ else
+ tr->ready_to_run = 0;
}
evsel__save_time(evsel, sample->time, sample->cpu);
--
2.40.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] perf: Timehist account sch delay for scheduled out running
2024-06-13 18:59 [PATCH] perf: Timehist account sch delay for scheduled out running Fernand Sieber
@ 2024-06-15 20:50 ` Madadi Vineeth Reddy
2024-06-18 8:44 ` Sieber, Fernand
0 siblings, 1 reply; 3+ messages in thread
From: Madadi Vineeth Reddy @ 2024-06-15 20:50 UTC (permalink / raw)
To: Fernand Sieber
Cc: linux-perf-users, linux-kernel, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim,
Madadi Vineeth Reddy
Hi Fernand,
On 14/06/24 00:29, Fernand Sieber wrote:
> When using perf timehist, sch delay is only computed for a waking task,
> not for a pre empted task. This patches addresses this problem.
>
> Example of `perf timehist` report before the patch for `stress` task
> competing with each other.
>
> First column is wait time, second column sch delay, third column
> runtime.
> ```
> 1.492060 [0000] s stress[81] 1.999 0.000 2.000 R next: stress[83]
> 1.494060 [0000] s stress[83] 2.000 0.000 2.000 R next: stress[81]
> 1.496060 [0000] s stress[81] 2.000 0.000 2.000 R next: stress[83]
> 1.498060 [0000] s stress[83] 2.000 0.000 1.999 R next: stress[81]
> ```
>
> After the patch, it looks like this (note that all wait time is now sch
> delay instead of zero):
> ```
> 1.492060 [0000] s stress[81] 1.999 1.999 2.000 R next: stress[83]
> 1.494060 [0000] s stress[83] 2.000 2.000 2.000 R next: stress[81]
> 1.496060 [0000] s stress[81] 2.000 2.000 2.000 R next: stress[83]
> 1.498060 [0000] s stress[83] 2.000 2.000 1.999 R next: stress[81]
> ```
>
> In timehist:
> * wait time represents the duration waiting for any system resource
> * sch delay represents the duration waiting for cpu system resources
According to the documentation(perf-sched man page), scheduling delay is defined
as the "time between wakeup and actually running."
When a task switches out, if its state is 'R' , then the task would switch-in
again without waking and wakeup in between.Therefore, that duration is not
currently calculated as scheduling delay, which aligns with the man page
description.
I agree that the time a task is preempted involuntarily and is waiting to run
could be considered part of the scheduling delay. The definition is somewhat
vague in this regard.
If you decide to proceed with this change, please ensure the man page is updated
accordingly.
Regarding wait time, the man page defines it as the "time between sched-out and
the next sched-in events for the task." This measurement does not depend on the
task state and includes time spent on a voluntary switch.
Thanks and Regards
Madadi Vineeth Reddy
>
> This is based on perf comments (dt_wait = wait time, dt_delay = sch
> delay):
> ```
> /*
> * Explanation of delta-time stats:
> *
> * t = time of current schedule out event
> * tprev = time of previous sched out event
> * also time of schedule-in event for current task
> * last_time = time of last sched change event for current task
> * (i.e, time process was last scheduled out)
> * ready_to_run = time of wakeup for current task
> *
> * -----|------------|------------|------------|------
> * last ready tprev t
> * time to run
> *
> * |-------- dt_wait --------|
> * |- dt_delay -|-- dt_run --|
> *
> * dt_run = run time of current task
> * dt_wait = time between last schedule out event for task and tprev
> * represents time spent off the cpu
> * dt_delay = time between wakeup and schedule-in of task
> */
> ```
>
> The problem with the current logic is that last time is only set when
> waking a task. Therefore it is not set for a pre empted task. To fix
> this, we set last time to the current sample time if a scheduled out
> task (on the switch tracepoint) is in state running.
>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
> tools/perf/builtin-sched.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 5977c49ae2c7..7422c930abaf 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -2659,7 +2659,10 @@ static int timehist_sched_change_event(struct perf_tool *tool,
> tr->last_state = state;
>
> /* sched out event for task so reset ready to run time */
> - tr->ready_to_run = 0;
> + if (state == 'R')
> + tr->ready_to_run = t;
> + else
> + tr->ready_to_run = 0;
> }
>
> evsel__save_time(evsel, sample->time, sample->cpu);
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] perf: Timehist account sch delay for scheduled out running
2024-06-15 20:50 ` Madadi Vineeth Reddy
@ 2024-06-18 8:44 ` Sieber, Fernand
0 siblings, 0 replies; 3+ messages in thread
From: Sieber, Fernand @ 2024-06-18 8:44 UTC (permalink / raw)
To: 20240613185906.31082-1-sieberf@amazon.com
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Alexander Shishkin, Namhyung Kim
Hi Maladi,
Ok, thanks for the feedback. I will submit another revision with the updated man page then. In my opinion it makes more sense to measure the scheduling delay as the runnable delay (wakeup and pre empted), as we need to consider both cases when benchmarking scheduling policies. However if there is/was a specific reason to restrict scheduling delay in timehist as the measure of wakeup delay only please let me know.
--Fernand
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2024-06-18 8:44 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-13 18:59 [PATCH] perf: Timehist account sch delay for scheduled out running Fernand Sieber
2024-06-15 20:50 ` Madadi Vineeth Reddy
2024-06-18 8:44 ` Sieber, Fernand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).