* [PATCH] sched/rt: fix incorrect schedstats for rt thread @ 2026-01-08 3:13 Dengjun Su 2026-01-08 11:16 ` Peter Zijlstra 0 siblings, 1 reply; 5+ messages in thread From: Dengjun Su @ 2026-01-08 3:13 UTC (permalink / raw) To: linux-kernel Cc: dengjun.su, mike.zhang, haiqiang.gong, peijun.huang, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Matthias Brugger, AngeloGioacchino Del Regno, linux-arm-kernel, linux-mediatek For RT thread, only 'set_next_task_rt' will call 'update_stats_wait_end_rt' to update schedstats information. However, during the RT migration process, 'update_stats_wait_start_rt' will be called twice, which will cause the values of wait_max and wait_sum to be incorrect. The specific output as follows: $ cat /proc/6046/task/6046/sched | grep wait wait_start : 0.000000 wait_max : 496717.080029 wait_sum : 7921540.776553 Add 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt' to update schedstats information when dequeue_task. Signed-off-by: Dengjun Su <dengjun.su@mediatek.com> --- kernel/sched/rt.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index f1867fe8e5c5..12f2efddca9f 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1302,13 +1302,18 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se, int flags) { struct task_struct *p = NULL; + struct rq *rq = rq_of_rt_rq(rt_rq); if (!schedstat_enabled()) return; - if (rt_entity_is_task(rt_se)) + if (rt_entity_is_task(rt_se)) { p = rt_task_of(rt_se); + if (p != rq->curr) + update_stats_wait_end_rt(rt_rq, rt_se); + } + if ((flags & DEQUEUE_SLEEP) && p) { unsigned int state; -- 2.43.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] sched/rt: fix incorrect schedstats for rt thread 2026-01-08 3:13 [PATCH] sched/rt: fix incorrect schedstats for rt thread Dengjun Su @ 2026-01-08 11:16 ` Peter Zijlstra 2026-01-09 7:24 ` Dengjun Su 0 siblings, 1 reply; 5+ messages in thread From: Peter Zijlstra @ 2026-01-08 11:16 UTC (permalink / raw) To: Dengjun Su Cc: linux-kernel, mike.zhang, haiqiang.gong, peijun.huang, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Matthias Brugger, AngeloGioacchino Del Regno, linux-arm-kernel, linux-mediatek On Thu, Jan 08, 2026 at 11:13:07AM +0800, Dengjun Su wrote: > For RT thread, only 'set_next_task_rt' will call > 'update_stats_wait_end_rt' to update schedstats information. > However, during the RT migration process, > 'update_stats_wait_start_rt' will be called twice, which > will cause the values of wait_max and wait_sum to be incorrect. Right, that looses time. Also note that I think dl has the same issue. > The specific output as follows: > $ cat /proc/6046/task/6046/sched | grep wait > wait_start : 0.000000 > wait_max : 496717.080029 > wait_sum : 7921540.776553 > > Add 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt' to > update schedstats information when dequeue_task. This needs a few more words on why this is correct -- notably it took me a little time to find the 'task_on_rq_migrating()' case in __update_stats_wait_end() which makes this not actually 'end'. But then the corresponding clause in __update_stats_wait_start() gives me a headache: 'wait_start > prev_wait_start' I mean, wtf. Should that not equally be using task_on_rq_migrating() ? Can you please take a hard look at all that and fix up things all-round? ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] sched/rt: fix incorrect schedstats for rt thread 2026-01-08 11:16 ` Peter Zijlstra @ 2026-01-09 7:24 ` Dengjun Su 2026-01-12 16:38 ` Peter Zijlstra 0 siblings, 1 reply; 5+ messages in thread From: Dengjun Su @ 2026-01-09 7:24 UTC (permalink / raw) To: peterz Cc: angelogioacchino.delregno, bsegall, dengjun.su, dietmar.eggemann, haiqiang.gong, juri.lelli, linux-arm-kernel, linux-kernel, linux-mediatek, matthias.bgg, mgorman, mike.zhang, mingo, peijun.huang, rostedt, vincent.guittot, vschneid On Thu, 2026-01-08 at 12:16 +0100, Peter Zijlstra wrote: > On Thu, Jan 08, 2026 at 11:13:07AM +0800, Dengjun Su wrote: > > For RT thread, only 'set_next_task_rt' will call > > 'update_stats_wait_end_rt' to update schedstats information. > > However, during the RT migration process, > > 'update_stats_wait_start_rt' will be called twice, which > > will cause the values of wait_max and wait_sum to be incorrect. > > Right, that looses time. Also note that I think dl has the same > issue. Hi Peter, Thanks for the feedback. Yes, sorry for miss dl class, I will update it in V2. > > > The specific output as follows: > > $ cat /proc/6046/task/6046/sched | grep wait > > wait_start : 0.000000 > > wait_max : 496717.080029 > > wait_sum : 7921540.776553 > > > > Add 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt' to > > update schedstats information when dequeue_task. > > This needs a few more words on why this is correct -- notably it took > me > a little time to find the 'task_on_rq_migrating()' case in > __update_stats_wait_end() which makes this not actually 'end'. > > But then the corresponding clause in __update_stats_wait_start() > gives > me a headache: > > 'wait_start > prev_wait_start' > > I mean, wtf. Should that not equally be using task_on_rq_migrating() > ? > > Can you please take a hard look at all that and fix up things > all-round? > A complete schedstats information update flow of migrate should be __update_stats_wait_start() [enter queue A, stage 1] -> __update_stats_wait_end() [leave queue A, stage 2] -> __update_stats_wait_start() [enter queue B, stage 3] -> __update_stats_wait_end() [start running on queue B, stage 4] Stage 1: prev_wait_start is 0, and in the end, wait_start records the time of entering the queue. Stage 2: task_on_rq_migrating(p) is true, and wait_start is updated to the waiting time on queue A. Stage 3: prev_wait_start is the waiting time on queue A, wait_start is the time of entering queue B, and wait_start is expected to be greater than prev_wait_start. Under this condition, wait_start is updated to (the moment of entering queue B) - (the waiting time on queue A). Stage 4: the final wait time = (time when starting to run on queue B) - (time of entering queue B) + (waiting time on queue A) = waiting time on queue B + waiting time on queue A. The current problem is that stage 2 does not call __update_stats_wait_end to update wait_start, which causes the final computed wait time = waiting time on queue B + the moment of entering queue A, leading to incorrect wait_max and wait_sum. For __update_stats_wait_end(), task_on_rq_migrating(p) is needed to distinguish between stage 2 and stage 4 because they involve different processing flows, but for __update_stats_wait_start(), it is not necessary to distinguish between stage 1 and stage 3. As for adding the condition wait_start > prev_wait_start, I think it is more like a mechanism to prevent statistical deviations caused by time inconsistencies. Thanks ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] sched/rt: fix incorrect schedstats for rt thread 2026-01-09 7:24 ` Dengjun Su @ 2026-01-12 16:38 ` Peter Zijlstra 2026-01-14 11:55 ` Dengjun Su 0 siblings, 1 reply; 5+ messages in thread From: Peter Zijlstra @ 2026-01-12 16:38 UTC (permalink / raw) To: Dengjun Su Cc: angelogioacchino.delregno, bsegall, dietmar.eggemann, haiqiang.gong, juri.lelli, linux-arm-kernel, linux-kernel, linux-mediatek, matthias.bgg, mgorman, mike.zhang, mingo, peijun.huang, rostedt, vincent.guittot, vschneid On Fri, Jan 09, 2026 at 03:24:47PM +0800, Dengjun Su wrote: > For __update_stats_wait_end(), task_on_rq_migrating(p) is needed to > distinguish between stage 2 and stage 4 because they involve different > processing flows, but for __update_stats_wait_start(), it is not necessary > to distinguish between stage 1 and stage 3. > > As for adding the condition wait_start > prev_wait_start, I think it is > more like a mechanism to prevent statistical deviations caused by time > inconsistencies. It looks like nonsense to me.. since you have a test-case, could you see what this does for you? Specifically: - it ensures that when not in a migration, prev_wait_start must be 0 - it unconditionally subtracts; unsigned types are defined to wrap nicely (2s complement) and it all should work just fine. --- diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c index d1c9429a4ac5..144b23029327 100644 --- a/kernel/sched/stats.c +++ b/kernel/sched/stats.c @@ -12,8 +12,10 @@ void __update_stats_wait_start(struct rq *rq, struct task_struct *p, wait_start = rq_clock(rq); prev_wait_start = schedstat_val(stats->wait_start); - if (p && likely(wait_start > prev_wait_start)) + if (p) { + WARN_ON_ONCE(!task_on_rq_migrating(p) && prev_wait_start); wait_start -= prev_wait_start; + } __schedstat_set(stats->wait_start, wait_start); } ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] sched/rt: fix incorrect schedstats for rt thread 2026-01-12 16:38 ` Peter Zijlstra @ 2026-01-14 11:55 ` Dengjun Su 0 siblings, 0 replies; 5+ messages in thread From: Dengjun Su @ 2026-01-14 11:55 UTC (permalink / raw) To: peterz Cc: angelogioacchino.delregno, bsegall, dengjun.su, dietmar.eggemann, haiqiang.gong, juri.lelli, linux-arm-kernel, linux-kernel, linux-mediatek, matthias.bgg, mgorman, mike.zhang, mingo, peijun.huang, rostedt, vincent.guittot, vschneid On Mon, 2026-01-12 at 17:38 +0100, Peter Zijlstra wrote: > On Fri, Jan 09, 2026 at 03:24:47PM +0800, Dengjun Su wrote: > > > For __update_stats_wait_end(), task_on_rq_migrating(p) is needed to > > distinguish between stage 2 and stage 4 because they involve > > different > > processing flows, but for __update_stats_wait_start(), it is not > > necessary > > to distinguish between stage 1 and stage 3. > > > > As for adding the condition wait_start > prev_wait_start, I think > > it is > > more like a mechanism to prevent statistical deviations caused by > > time > > inconsistencies. > > It looks like nonsense to me.. since you have a test-case, could you > see > what this does for you? > > Specifically: > > - it ensures that when not in a migration, prev_wait_start must be 0 > > - it unconditionally subtracts; unsigned types are defined to wrap > nicely (2s complement) and it all should work just fine. > > --- > > diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c > index d1c9429a4ac5..144b23029327 100644 > --- a/kernel/sched/stats.c > +++ b/kernel/sched/stats.c > @@ -12,8 +12,10 @@ void __update_stats_wait_start(struct rq *rq, > struct task_struct *p, > wait_start = rq_clock(rq); > prev_wait_start = schedstat_val(stats->wait_start); > > - if (p && likely(wait_start > prev_wait_start)) > + if (p) { > + WARN_ON_ONCE(!task_on_rq_migrating(p) && > prev_wait_start); > wait_start -= prev_wait_start; > + } > > __schedstat_set(stats->wait_start, wait_start); > } Hi Peter, I have confirm in my current test case, adding this change or not have no impact. I think this is reasonable, task_on_rq_migrating(p) should be true in my test case. Base on original patch that adds this, I think the logic of __update_stats_wait_start is: 1. If migrating and the time is updated normally, it records the waiting time on the previous queue. 2. If time is abnormal, it updates wait_start to the current time. For migrating, it will discards the previously counted waiting time on the previous queue (perhaps this statistical value itself is abnormal). According to the current modification, if there is a time abnormal, wait_start will not be updated to current time. Partial fragments of the original patch are as follows. --- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9a5e60f..53ec4d4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -738,27 +738,41 @@ static void update_curr_fair(struct rq *rq) } static inline void -update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se) +update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se, + bool migrating) { - schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq))); + schedstat_set(se->statistics.wait_start, + migrating && + likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ? + rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start : + rq_clock(rq_of(cfs_rq))); } ^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-01-14 11:56 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-01-08 3:13 [PATCH] sched/rt: fix incorrect schedstats for rt thread Dengjun Su 2026-01-08 11:16 ` Peter Zijlstra 2026-01-09 7:24 ` Dengjun Su 2026-01-12 16:38 ` Peter Zijlstra 2026-01-14 11:55 ` Dengjun Su
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox