* [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator @ 2026-01-14 15:43 Johannes Weiner 2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner 2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra 0 siblings, 2 replies; 9+ messages in thread From: Johannes Weiner @ 2026-01-14 15:43 UTC (permalink / raw) To: Peter Zijlstra, Suren Baghdasaryan, Ingo Molnar Cc: Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel In the aggregator, catch races between state snooping and task state conclusions explicitly by checking for sample underflows; then move the clock reads out of the reader's seqcount protection. This shrinks the critical section and allows switching the scheduler side to looser (cheaper) clock sourcing in the next patch. Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- kernel/sched/psi.c | 34 +++++++++++++++++++++++++++------- 1 file changed, 27 insertions(+), 7 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 59fdb7ebbf22..4b7bf8eb46c2 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu, /* Snapshot a coherent view of the CPU state */ do { seq = psi_read_begin(cpu); - now = cpu_clock(cpu); memcpy(times, groupc->times, sizeof(groupc->times)); state_mask = groupc->state_mask; state_start = groupc->state_start; @@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu, memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); } while (psi_read_retry(cpu, seq)); + if (state_mask) + now = cpu_clock(cpu); + /* Calculate state time deltas against the previous snapshot */ for (s = 0; s < NR_PSI_STATES; s++) { u32 delta; @@ -308,7 +310,22 @@ static void get_recent_times(struct 
psi_group *group, int cpu, if (state_mask & (1 << s)) times[s] += now - state_start; + /* + * This snooping ahead can obviously race with the + * state concluding on the cpu. If we previously + * snooped to a time past where the state concludes, + * times[s] can now be behind times_prev[s]. + * + * time_after32() would be the obvious choice, but + * S32_MAX is right around two seconds, which is the + * aggregation interval; if the aggregator gets + * delayed, there would be a risk of dismissing + * genuinely large samples. Use a larger margin. + */ delta = times[s] - groupc->times_prev[aggregator][s]; + if (delta > psi_period + (psi_period >> 1)) + delta = 0; + groupc->times_prev[aggregator][s] = times[s]; times[s] = delta; @@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) void psi_task_change(struct task_struct *task, int clear, int set) { - int cpu = task_cpu(task); + int cpu; u64 now; if (!task->pid) return; + cpu = task_cpu(task); + now = cpu_clock(cpu); + psi_flags_change(task, clear, set); psi_write_begin(cpu); - now = cpu_clock(cpu); for_each_group(group, task_psi_group(task)) psi_group_change(group, cpu, clear, set, now, true); psi_write_end(cpu); @@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, { struct psi_group *common = NULL; int cpu = task_cpu(prev); - u64 now; + u64 now = cpu_clock(cpu); psi_write_begin(cpu); - now = cpu_clock(cpu); if (next->pid) { psi_flags_change(next, 0, TSK_ONCPU); @@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, psi_group_change(group, cpu, clear, set, now, wake_clock); } } + psi_write_end(cpu); } @@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st return; rq->psi_irq_time = irq; - psi_write_begin(cpu); now = cpu_clock(cpu); + psi_write_begin(cpu); for_each_group(group, task_psi_group(curr)) { if (!group->enabled) continue; @@ -1234,8 +1253,9 
@@ void psi_cgroup_restart(struct psi_group *group) guard(rq_lock_irq)(cpu_rq(cpu)); - psi_write_begin(cpu); now = cpu_clock(cpu); + + psi_write_begin(cpu); psi_group_change(group, cpu, 0, 0, now, true); psi_write_end(cpu); } -- 2.52.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes 2026-01-14 15:43 [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Johannes Weiner @ 2026-01-14 15:43 ` Johannes Weiner 2026-02-02 20:41 ` Peter Zijlstra 2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra 1 sibling, 1 reply; 9+ messages in thread From: Johannes Weiner @ 2026-01-14 15:43 UTC (permalink / raw) To: Peter Zijlstra, Suren Baghdasaryan, Ingo Molnar Cc: Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel In the hottest psi paths, the scheduler already caches the cpu_clock() call for the event in rq->clock. Now that the clocks between state changes and pressure aggregation don't need to be synchronized inside the seqcount section anymore, use the cheaper rq_clock(). Add update_rq_clock() calls to the few places where psi is entered without the rq already locked. schbench -n 0 (no think ops): Before: average rps: 204408.50 After: average rps: 204755.90 2.67% -0.54% [kernel.kallsyms] [k] sched_clock_noinstr Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- kernel/sched/core.c | 3 ++- kernel/sched/psi.c | 12 +++++++----- kernel/sched/stats.h | 1 + 3 files changed, 10 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 60afadb6eede..27d880a3af28 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5517,11 +5517,12 @@ void sched_tick(void) sched_clock_tick(); rq_lock(rq, &rf); + update_rq_clock(rq); + donor = rq->donor; psi_account_irqtime(rq, donor, NULL); - update_rq_clock(rq); hw_pressure = arch_scale_hw_pressure(cpu_of(rq)); update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure); diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 4b7bf8eb46c2..4a0a83f1d1dc 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -932,7 +932,7 @@ void psi_task_change(struct 
task_struct *task, int clear, int set) return; cpu = task_cpu(task); - now = cpu_clock(cpu); + now = rq_clock(cpu_rq(cpu)); psi_flags_change(task, clear, set); @@ -947,7 +947,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, { struct psi_group *common = NULL; int cpu = task_cpu(prev); - u64 now = cpu_clock(cpu); + u64 now = rq_clock(cpu_rq(cpu)); psi_write_begin(cpu); @@ -1026,9 +1026,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st { int cpu = task_cpu(curr); struct psi_group_cpu *groupc; + u64 irq, now; s64 delta; - u64 irq; - u64 now; if (static_branch_likely(&psi_disabled) || !irqtime_enabled()) return; @@ -1046,7 +1045,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st return; rq->psi_irq_time = irq; - now = cpu_clock(cpu); + now = rq_clock(rq); psi_write_begin(cpu); for_each_group(group, task_psi_group(curr)) { @@ -1089,6 +1088,7 @@ void psi_memstall_enter(unsigned long *flags) * race with CPU migration. */ rq = this_rq_lock_irq(&rf); + update_rq_clock(rq); current->in_memstall = 1; psi_task_change(current, 0, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING); @@ -1119,6 +1119,7 @@ void psi_memstall_leave(unsigned long *flags) * race with CPU migration. 
*/ rq = this_rq_lock_irq(&rf); + update_rq_clock(rq); current->in_memstall = 0; psi_task_change(current, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING, 0); @@ -1187,6 +1188,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) } rq = task_rq_lock(task, &rf); + update_rq_clock(rq); /* * We may race with schedule() dropping the rq lock between diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index c903f1a42891..cb67a81e92b6 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -210,6 +210,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) struct rq *rq; rq = __task_rq_lock(p, &rf); + update_rq_clock(rq); psi_task_change(p, p->psi_flags, 0); __task_rq_unlock(rq, p, &rf); } -- 2.52.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes
From: Peter Zijlstra @ 2026-02-02 20:41 UTC
To: Johannes Weiner
Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel

On Wed, Jan 14, 2026 at 10:43:17AM -0500, Johannes Weiner wrote:
> In the hottest psi paths, the scheduler already caches the cpu_clock()
> call for the event in rq->clock. Now that the clocks between state
> changes and pressure aggregation don't need to be synchronized inside
> the seqcount section anymore, use the cheaper rq_clock().
>
> Add update_rq_clock() calls to the few places where psi is entered
> without the rq already locked.

Just to be clear, rq->clock is not a cache of cpu_clock(). rq->clock
discards all backwards motion (which obviously should never happen, but
if it does, the clocks go out of sync).

So if you use rq->clock, you must use it for all and not mix with
cpu_clock().

I *think* the patch does that, but I've not double checked.
* Re: [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes
From: Johannes Weiner @ 2026-02-03 16:34 UTC
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel

On Mon, Feb 02, 2026 at 09:41:37PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 14, 2026 at 10:43:17AM -0500, Johannes Weiner wrote:
> > In the hottest psi paths, the scheduler already caches the cpu_clock()
> > call for the event in rq->clock. Now that the clocks between state
> > changes and pressure aggregation don't need to be synchronized inside
> > the seqcount section anymore, use the cheaper rq_clock().
> >
> > Add update_rq_clock() calls to the few places where psi is entered
> > without the rq already locked.
>
> Just to be clear, rq->clock is not a cache of cpu_clock(). rq->clock
> discards all backwards motion (which obviously should never happen, but
> if it does, the clocks go out of sync).
>
> So if you use rq->clock, you must use it for all and not mix with
> cpu_clock().
>
> I *think* the patch does that, but I've not double checked.

Ah no, it does mix them :(

Yeah, I'm using rq_clock() consistently on the scheduler side to
accumulate the times of concluded states:

	state_start = rq_clock()
	...
	state_time = rq_clock() - state_start

However, the aggregator side still uses

	state_time += cpu_clock() - state_start

to incorporate currently active state. If they don't have the same
base, this won't work.

Doing the full lock and update_rq_clock() from the aggregator sounds
quite heavy-handed. How about using sched_clock_cpu() directly and
doing the backwards motion check by hand?

	local_irq_save()
	now = sched_clock_cpu(cpu)
	local_irq_restore()
	...
	if (state_mask & (1 << s) && now > state_start)
		times[s] += now - state_start
* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator 2026-01-14 15:43 [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Johannes Weiner 2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner @ 2026-01-28 10:35 ` Peter Zijlstra 2026-01-28 18:56 ` Johannes Weiner 1 sibling, 1 reply; 9+ messages in thread From: Peter Zijlstra @ 2026-01-28 10:35 UTC (permalink / raw) To: Johannes Weiner Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel On Wed, Jan 14, 2026 at 10:43:16AM -0500, Johannes Weiner wrote: > In the aggregator, catch races between state snooping and task state > conclusions explicitly by checking for sample underflows; then move > the clock reads out of the reader's seqcount protection. I'm utterly failing to make sense of this -- what?! > This shrinks the critical section and allows switching the scheduler > side to looser (cheaper) clock sourcing in the next patch. 
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> > Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > kernel/sched/psi.c | 34 +++++++++++++++++++++++++++------- > 1 file changed, 27 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index 59fdb7ebbf22..4b7bf8eb46c2 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu, > /* Snapshot a coherent view of the CPU state */ > do { > seq = psi_read_begin(cpu); > - now = cpu_clock(cpu); > memcpy(times, groupc->times, sizeof(groupc->times)); > state_mask = groupc->state_mask; > state_start = groupc->state_start; > @@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu, > memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); > } while (psi_read_retry(cpu, seq)); > > + if (state_mask) > + now = cpu_clock(cpu); > + So this can be later... which results in a larger value > /* Calculate state time deltas against the previous snapshot */ > for (s = 0; s < NR_PSI_STATES; s++) { > u32 delta; > @@ -308,7 +310,22 @@ static void get_recent_times(struct psi_group *group, int cpu, > if (state_mask & (1 << s)) > times[s] += now - state_start; > which makes times[s] larger than it should be > + /* > + * This snooping ahead can obviously race with the > + * state concluding on the cpu. If we previously > + * snooped to a time past where the state concludes, > + * times[s] can now be behind times_prev[s]. > + * > + * time_after32() would be the obvious choice, but > + * S32_MAX is right around two seconds, which is the > + * aggregation interval; if the aggregator gets > + * delayed, there would be a risk of dismissing > + * genuinely large samples. Use a larger margin. 
> + */ > delta = times[s] - groupc->times_prev[aggregator][s]; > + if (delta > psi_period + (psi_period >> 1)) > + delta = 0; > + This seems to check if times_prev is larger than times; confused again. > groupc->times_prev[aggregator][s] = times[s]; It updates times_prev irrespectively. Storing a potentially larger value. > > times[s] = delta; And stores the delta, which can be larger than it should be? > @@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) > > void psi_task_change(struct task_struct *task, int clear, int set) > { > - int cpu = task_cpu(task); > + int cpu; > u64 now; > > if (!task->pid) > return; > > + cpu = task_cpu(task); > + now = cpu_clock(cpu); > + So this clock is earlier, or smaller. > psi_flags_change(task, clear, set); > > psi_write_begin(cpu); > - now = cpu_clock(cpu); > for_each_group(group, task_psi_group(task)) > psi_group_change(group, cpu, clear, set, now, true); > psi_write_end(cpu); > @@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > { > struct psi_group *common = NULL; > int cpu = task_cpu(prev); > - u64 now; > + u64 now = cpu_clock(cpu); Same. > psi_write_begin(cpu); > - now = cpu_clock(cpu); > > if (next->pid) { > psi_flags_change(next, 0, TSK_ONCPU); > @@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > psi_group_change(group, cpu, clear, set, now, wake_clock); > } > } > + > psi_write_end(cpu); > } > > @@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st > return; > rq->psi_irq_time = irq; > > - psi_write_begin(cpu); > now = cpu_clock(cpu); Same. > + psi_write_begin(cpu); > for_each_group(group, task_psi_group(curr)) { > if (!group->enabled) > continue; > @@ -1234,8 +1253,9 @@ void psi_cgroup_restart(struct psi_group *group) > > guard(rq_lock_irq)(cpu_rq(cpu)); > > - psi_write_begin(cpu); > now = cpu_clock(cpu); And same again. 
> + > + psi_write_begin(cpu); > psi_group_change(group, cpu, 0, 0, now, true); > psi_write_end(cpu); > } For all these we call psi_group_change(), which calls record_times() which then sets ->state_start to a smaller value. Resulting in times above to be larger still. So this inflates delta and leaves me utterly confused. Not only does the Changelog here not explain anything, this also very much needs comments in the code, because the next time someone is going to be reading this, they'll break their WTF'o'meter and probably the next one they get too. /me stomps off searching for where the heck he left his pile of spare WTF'o'meters.. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator 2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra @ 2026-01-28 18:56 ` Johannes Weiner 2026-01-28 20:17 ` Peter Zijlstra 0 siblings, 1 reply; 9+ messages in thread From: Johannes Weiner @ 2026-01-28 18:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel On Wed, Jan 28, 2026 at 11:35:14AM +0100, Peter Zijlstra wrote: > On Wed, Jan 14, 2026 at 10:43:16AM -0500, Johannes Weiner wrote: > > In the aggregator, catch races between state snooping and task state > > conclusions explicitly by checking for sample underflows; then move > > the clock reads out of the reader's seqcount protection. > > I'm utterly failing to make sense of this -- what?! > > > This shrinks the critical section and allows switching the scheduler > > side to looser (cheaper) clock sourcing in the next patch. 
> > > > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> > > Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > --- > > kernel/sched/psi.c | 34 +++++++++++++++++++++++++++------- > > 1 file changed, 27 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > > index 59fdb7ebbf22..4b7bf8eb46c2 100644 > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu, > > /* Snapshot a coherent view of the CPU state */ > > do { > > seq = psi_read_begin(cpu); > > - now = cpu_clock(cpu); > > memcpy(times, groupc->times, sizeof(groupc->times)); > > state_mask = groupc->state_mask; > > state_start = groupc->state_start; > > @@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu, > > memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); > > } while (psi_read_retry(cpu, seq)); > > > > + if (state_mask) > > + now = cpu_clock(cpu); > > + > > So this can be later... which results in a larger value Correct. > > /* Calculate state time deltas against the previous snapshot */ > > for (s = 0; s < NR_PSI_STATES; s++) { > > u32 delta; > > @@ -308,7 +310,22 @@ static void get_recent_times(struct psi_group *group, int cpu, > > if (state_mask & (1 << s)) > > times[s] += now - state_start; > > > > which makes times[s] larger than it should be Let me re-iterate what each side does, so we're on the same page. --- On the write side, psi_group_change() tracks task state changes from the scheduler. Whenever a state concludes, the nanoseconds spent in it are accumulated in groupc->times[s]. This is increasing monotonically. The read side (get_recent_times()) runs periodically to extract state time deltas for calculating the pressure avgs. 
For this it remembers how much state time it previously sampled in ->times_prev[s]: delta = groupc->times[s] - groupc->times_prev[s] groupc->times_prev[s] += delta If we had unlimited space, we would make these cumulators 64 bit and be done. But we can only use 32 bit, and states can last longer than the 4s we can hold in 32 bit. To deal with this, we do two things: 1) reader runs at least every 2 seconds 2) in addition to concluded states it also incorporates samples of the *active state*. When a >4s state concludes and the upper bits are lost in the ->times overflow, we don't care. We already sampled it all in <= 2s deltas: delta = groupc->times[s] - groupc->times_prev[s] if (state_mask & (1 << s)) delta += now - state_start groupc->times_prev[s] += delta You can see ->times_prev[s] pulling ahead of ->times[s] while the state is active. But once the state concludes, the scheduler side will add `now - state_start` to ->times[s] to catch it up. --- As long as the reader and writer have the same understanding of `now', this works: The `now - state_start` that the scheduler records when the state concludes will never be less than what the reader thought it would be. IOW, times_prev[s] is guaranteed to stay within 2 seconds behind what the scheduler already recorded (->times[s]), and will record (now - state_start). That mutual understanding of "now" is expensive, so I'm trying to remove it. This necessarily opens a race: the reader could sample a live state to a time beyond what the scheduler will use to conclude that state. The result is that one reader run can oversample slightly, and the next run will see a negative delta. Here is how I'm proposing to handle it: > > + /* > > + * This snooping ahead can obviously race with the > > + * state concluding on the cpu. If we previously > > + * snooped to a time past where the state concludes, > > + * times[s] can now be behind times_prev[s]. 
> > + * > > + * time_after32() would be the obvious choice, but > > + * S32_MAX is right around two seconds, which is the > > + * aggregation interval; if the aggregator gets > > + * delayed, there would be a risk of dismissing > > + * genuinely large samples. Use a larger margin. > > + */ > > delta = times[s] - groupc->times_prev[aggregator][s]; > > + if (delta > psi_period + (psi_period >> 1)) > > + delta = 0; > > + > > This seems to check if times_prev is larger than times; confused again. In the last round, the reader oversampled. That means times_prev[] slipped ahead of the scheduler reality (times[]). The actual accounting error caused by this should be acceptable, owed to the slightly different clock reads of the two racing threads. Certainly, there is no meaningful additional sample time in *this* round, so we fix the delta to 0. [ On the overflow itself: In a simpler world, this would be: if (((signed)times[s] - times_prev[s]) < 0) delta = 0 but I hope the explanation in the comment makes sense: Reader runs every ~2s so we expect legitimate state samples up to ~2s. The simple version would consider everything above ~2s nonsense. That's too tight. So I consider everything above 3s nonsense instead, giving us a 1s headroom for any delays in reader scheduling. ] > > groupc->times_prev[aggregator][s] = times[s]; > > It updates times_prev irrespectively. Storing a potentially larger > value. Right, this is on purpose. Once the delta is extracted and processed, we need to update the reader to where the scheduler is, as per usual. But there is now a second case: 1) Existing case. The scheduler accumulated state time the reader hasn't seen yet, so times_prev[] is behind times[]. After delta extraction, the assignment catches the reader up to the scheduler. 2) New case. The previous reader run used a `now' too far in the future and oversampled. times_prev[] is *ahead* of times[]. The assignment *rewinds* the reader back to the scheduler's reality. 
> > times[s] = delta; > > And stores the delta, which can be larger than it should be? We filtered out bogus deltas, so it should be valid or 0. > > @@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) > > > > void psi_task_change(struct task_struct *task, int clear, int set) > > { > > - int cpu = task_cpu(task); > > + int cpu; > > u64 now; > > > > if (!task->pid) > > return; > > > > + cpu = task_cpu(task); > > + now = cpu_clock(cpu); > > + > > So this clock is earlier, or smaller. > > > psi_flags_change(task, clear, set); > > > > psi_write_begin(cpu); > > - now = cpu_clock(cpu); > > for_each_group(group, task_psi_group(task)) > > psi_group_change(group, cpu, clear, set, now, true); > > psi_write_end(cpu); > > @@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > > { > > struct psi_group *common = NULL; > > int cpu = task_cpu(prev); > > - u64 now; > > + u64 now = cpu_clock(cpu); > > Same. > > > psi_write_begin(cpu); > > - now = cpu_clock(cpu); > > > > if (next->pid) { > > psi_flags_change(next, 0, TSK_ONCPU); > > @@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > > psi_group_change(group, cpu, clear, set, now, wake_clock); > > } > > } > > + > > psi_write_end(cpu); > > } > > > > @@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st > > return; > > rq->psi_irq_time = irq; > > > > - psi_write_begin(cpu); > > now = cpu_clock(cpu); > > Same. > > > + psi_write_begin(cpu); > > for_each_group(group, task_psi_group(curr)) { > > if (!group->enabled) > > continue; > > @@ -1234,8 +1253,9 @@ void psi_cgroup_restart(struct psi_group *group) > > > > guard(rq_lock_irq)(cpu_rq(cpu)); > > > > - psi_write_begin(cpu); > > now = cpu_clock(cpu); > > And same again. 
> > > + > > + psi_write_begin(cpu); > > psi_group_change(group, cpu, 0, 0, now, true); > > psi_write_end(cpu); > > } > > For all these we call psi_group_change(), which calls record_times() > which then sets ->state_start to a smaller value. Resulting in times > above to be larger still. I think there are two considerations. We use that clock value to both start and stop the clock on a state. So while we use a timestamp from slightly earlier in the scheduling event, we do it on both ends. This should not affect long-term accuracy. Then there is reader coherency. Before, state_start would match the exact time at which the reader could see the state go live inside the state_mask. After the patch, the reader could see a state whose state_start is slightly in the past. But that's the common case for the reader anyway? Since it rarely runs in the same nanosecond in which the state change occurs. Am I missing something? > So this inflates delta and leaves me utterly confused. > > Not only does the Changelog here not explain anything, this also very > much needs comments in the code, because the next time someone is going > to be reading this, they'll break their WTF'o'meter and probably the > next one they get too. Fair enough, it's tricky. I'll do my best to capture all the above into the changelog and code comments. But let's try to get on the same page first. That should also help identify which parts exactly need the most help. Thanks ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator 2026-01-28 18:56 ` Johannes Weiner @ 2026-01-28 20:17 ` Peter Zijlstra 2026-02-03 15:38 ` Johannes Weiner 0 siblings, 1 reply; 9+ messages in thread From: Peter Zijlstra @ 2026-01-28 20:17 UTC (permalink / raw) To: Johannes Weiner Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote: > That mutual understanding of "now" is expensive, so I'm trying to > remove it. This necessarily opens a race: the reader could sample a > live state to a time beyond what the scheduler will use to conclude > that state. The result is that one reader run can oversample slightly, > and the next run will see a negative delta. Right, exactly the problem from: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") > Here is how I'm proposing to handle it: > > > > + /* > > > + * This snooping ahead can obviously race with the > > > + * state concluding on the cpu. If we previously > > > + * snooped to a time past where the state concludes, > > > + * times[s] can now be behind times_prev[s]. > > > + * > > > + * time_after32() would be the obvious choice, but > > > + * S32_MAX is right around two seconds, which is the > > > + * aggregation interval; if the aggregator gets > > > + * delayed, there would be a risk of dismissing > > > + * genuinely large samples. Use a larger margin. > > > + */ > > > delta = times[s] - groupc->times_prev[aggregator][s]; > > > + if (delta > psi_period + (psi_period >> 1)) > > > + delta = 0; > > > + > > > > This seems to check if times_prev is larger than times; confused again. > > In the last round, the reader oversampled. That means times_prev[] > slipped ahead of the scheduler reality (times[]). > > The actual accounting error caused by this should be acceptable, owed > to the slightly different clock reads of the two racing threads. 
>
> Certainly, there is no meaningful additional sample time in *this*
> round, so we fix the delta to 0.

OK, so you don't care about the fact that consistent races can inflate
the number beyond actuality. You just want to get rid of backward
motion?

The argument being something like: Since it's a decaying average, and
the per-time accumulated error is relatively minor, it doesn't affect
the over-all outcome much.

Because if we just consider it as free-running counters, the
accumulation error is unbounded. But since it is time averaged, the
error becomes insignificant.

> > > groupc->times_prev[aggregator][s] = times[s];
> >
> > It updates times_prev irrespectively. Storing a potentially larger
> > value.
>
> Right, this is on purpose. Once the delta is extracted and processed,
> we need to update the reader to where the scheduler is, as per
> usual. But there is now a second case:
>
> 1) Existing case. The scheduler accumulated state time the reader
>    hasn't seen yet, so times_prev[] is behind times[]. After delta
>    extraction, the assignment catches the reader up to the scheduler.
>
> 2) New case. The previous reader run used a `now' too far in the
>    future and oversampled. times_prev[] is *ahead* of times[]. The
>    assignment *rewinds* the reader back to the scheduler's reality.

Well, not quite: that too-large delta has already 'escaped' and been
accumulated. This way you ensure a second race will again result in a
too-large delta being accumulated, rather than the next state being
accounted slightly short -- which would compensate for the previous
one being accounted slightly larger.

That is, afaict 2) ensures you are consistently oversampling but never
undersampling.

> > > times[s] = delta;
> >
> > And stores the delta, which can be larger than it should be?
>
> We filtered out bogus deltas, so it should be valid or 0.

Semantically a too-large value is equally bogus to a negative value.
It's just that negative numbers are not expected and wreck things down
the chain.
> > For all these we call psi_group_change(), which calls record_times()
> > which then sets ->state_start to a smaller value. Resulting in times
> > above to be larger still.
>
> I think there are two considerations.
>
> We use that clock value to both start and stop the clock on a
> state. So while we use a timestamp from slightly earlier in the
> scheduling event, we do it on both ends. This should not affect
> long-term accuracy.

If we only consider these timestamps, sure. But due to the whole
accumulation mess in between you get unbounded drift (on the
accumulator -- pre-averaging).

> Then there is reader coherency. Before, state_start would match the
> exact time at which the reader could see the state go live inside the
> state_mask. After the patch, the reader could see a state whose
> state_start is slightly in the past. But that's the common case for
> the reader anyway? Since it rarely runs in the same nanosecond in
> which the state change occurs.
>
> Am I missing something?

So currently, with things being inside the locks, if we sample early
we miss a bit. If we sample late, we see the exact time.

With the update time being early, we go back to 3840cbe24cf0, and can
see a too-long period in both cases. But because you're then also
using a late clock on read, the error is larger still.

If you are measuring time from a start to 'now', and the actual period
is (10 characters) like so:

	.x.|--------|.y.

Then, if you move the start to x (earlier), your period becomes longer
(12 characters). If you then also move now to y (later) you get an
ever larger error (14 characters).

I mean, it all probably doesn't matter, but it's there.

> > So this inflates delta and leaves me utterly confused.
> >
> > Not only does the Changelog here not explain anything, this also very
> > much needs comments in the code, because the next time someone is going
> > to be reading this, they'll break their WTF'o'meter and probably the
> > next one they get too.
> Fair enough, it's tricky.
>
> I'll do my best to capture all the above into the changelog and code
> comments. But let's try to get on the same page first. That should
> also help identify which parts exactly need the most help.

Mostly I wasn't sure on which errors you care about and which you
don't.

As I understand it now, you *only* care about not having negative
values in the accumulation chain because that's otherwise unsigned and
negatives show up as large numbers and things go 'funny'.

You do not care about long term drift in the pure accumulator -- since
it's all time averaged?

^ permalink raw reply	[flat|nested] 9+ messages in thread
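[Editor's note: the "negatives show up as large numbers and things go 'funny'" failure mode is plain unsigned wraparound. A minimal standalone illustration (not kernel code; the values are made up): when the reader's times_prev[] has slipped ahead of the scheduler's times[], the u32 subtraction does not go negative -- it wraps to an enormous positive sample.]

```c
#include <assert.h>
#include <stdint.h>

/* reader ran 100ns "ahead" of the scheduler in the previous round */
static uint32_t bogus_delta(void)
{
	uint32_t times = 1000;		/* scheduler's concluded state time */
	uint32_t times_prev = 1100;	/* reader oversampled last round */

	/* wraps modulo 2^32: ~4.29e9, not -100 */
	return times - times_prev;
}
```

This multi-second garbage sample is what immediately shifts even the long-term averages, and is what the margin filter discussed below the patch hunk is catching.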
* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-01-28 20:17           ` Peter Zijlstra
@ 2026-02-03 15:38             ` Johannes Weiner
  2026-02-03 21:02               ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread

From: Johannes Weiner @ 2026-02-03 15:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Wed, Jan 28, 2026 at 09:17:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:
> > That mutual understanding of "now" is expensive, so I'm trying to
> > remove it. This necessarily opens a race: the reader could sample a
> > live state to a time beyond what the scheduler will use to conclude
> > that state. The result is that one reader run can oversample slightly,
> > and the next run will see a negative delta.
>
> Right, exactly the problem from:
>
>   3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")

Yep.

> > Here is how I'm proposing to handle it:
>
> > > > +		/*
> > > > +		 * This snooping ahead can obviously race with the
> > > > +		 * state concluding on the cpu. If we previously
> > > > +		 * snooped to a time past where the state concludes,
> > > > +		 * times[s] can now be behind times_prev[s].
> > > > +		 *
> > > > +		 * time_after32() would be the obvious choice, but
> > > > +		 * S32_MAX is right around two seconds, which is the
> > > > +		 * aggregation interval; if the aggregator gets
> > > > +		 * delayed, there would be a risk of dismissing
> > > > +		 * genuinely large samples. Use a larger margin.
> > > > +		 */
> > > >  		delta = times[s] - groupc->times_prev[aggregator][s];
> > > > +		if (delta > psi_period + (psi_period >> 1))
> > > > +			delta = 0;
> > > > +
> > >
> > > This seems to check if times_prev is larger than times; confused again.
> >
> > In the last round, the reader oversampled. That means times_prev[]
> > slipped ahead of the scheduler reality (times[]).
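[Editor's note: a userspace sketch of the margin check quoted above. Names are approximated -- PSI_PERIOD stands in for psi_period (the 2s aggregation interval); the kernel's actual constants and surrounding code differ. The point of the 1.5x-period margin over time_after32()/S32_MAX: a wraparound from an earlier oversample lands near U32_MAX and gets zeroed, while a genuinely large sample from a delayed aggregator (here 2.5s, which is above S32_MAX) survives.]

```c
#include <assert.h>
#include <stdint.h>

#define PSI_PERIOD 2000000000u	/* 2s aggregation interval, in ns */

static uint32_t filter_delta(uint32_t times, uint32_t times_prev)
{
	uint32_t delta = times - times_prev;

	/*
	 * time_after32() would test against S32_MAX (~2.1s), too close
	 * to the 2s aggregation interval: a delayed aggregator could
	 * legitimately accumulate that much. 1.5x the period leaves
	 * headroom while still catching underflow wraparounds.
	 */
	if (delta > PSI_PERIOD + (PSI_PERIOD >> 1))
		delta = 0;
	return delta;
}
```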
> > The actual accounting error caused by this should be acceptable, owed
> > to the slightly different clock reads of the two racing threads.
> >
> > Certainly, there is no meaningful additional sample time in *this*
> > round, so we fix the delta to 0.
>
> OK, so you don't care about the fact that consistent races can inflate
> the number beyond actuality. You just want to get rid of backward
> motion?

Yes. The backward motion is what caused the multi-second errors that
immediately shift even the long-term averages. Those are blatantly
noticeable and cause practical issues. And notably, these were the ONLY
issue anyone reported against 3840cbe24cf0 in three years.

Given the decaying averages, and the recency windows anyone really
cares about, a smaller fudge factor is much less of a concern.

It's largely self-correcting too: if you have ongoing/recurring
pressure periods, oversampling of live state is corrected by subsequent
smaller samples against the "actual" time the scheduler ended up
recording.

The actual underflows were quite rare. The reproducer had to run the
aggregation every microsecond for extended periods to trigger a race
where the oversampled error is less than state time in the subsequent
sampling period:

             v detect live state
             |                 v now() -> record N + error
             |                 |   v underflow
  aggregator |                 |   |
  scheduler  |---------------|
                             ^ now() -> record N

We *could* look into remembering oversampling debt accurately and
incorporate deficits into the samples that go into total=. But we
couldn't do it for the avg= because that also encodes recency. And that
discrepancy between the two would likely be worse interface-wise than a
fudge factor affecting both the same.

> The argument being something like: Since it's a decaying average, and
> the per-time accumulated error is relatively minor, it doesn't affect
> the overall outcome much.
>
> Because if we just consider it as free-running counters, the
> accumulation error is unbounded.
> But since it is time averaged, the error becomes insignificant.

Yes.

> > > >		groupc->times_prev[aggregator][s] = times[s];
> > >
> > > It updates times_prev irrespectively. Storing a potentially larger
> > > value.
> >
> > Right, this is on purpose. Once the delta is extracted and processed,
> > we need to update the reader to where the scheduler is, as per
> > usual. But there is now a second case:
> >
> > 1) Existing case. The scheduler accumulated state time the reader
> >    hasn't seen yet, so times_prev[] is behind times[]. After delta
> >    extraction, the assignment catches the reader up to the scheduler.
> >
> > 2) New case. The previous reader run used a `now' too far in the
> >    future and oversampled. times_prev[] is *ahead* of times[]. The
> >    assignment *rewinds* the reader back to the scheduler's reality.
>
> Well, not quite, that too large delta has already 'escaped' and been
> accumulated. This way you ensure a second race will again result in a
> too large delta being accumulated, rather than the next state being
> accounted slightly short -- which would compensate for the previous one
> being accounted slightly larger.
>
> That is, afaict, 2) ensures you are consistently oversampling but never
> undersampling.

True. But because of the importance of recency in that data, there is a
limit to how much we could backcorrect later samples.

> > > >			times[s] = delta;
> > >
> > > And stores the delta, which can be larger than it should be?
> >
> > We filtered out bogus deltas, so it should be valid or 0.
>
> Semantically a too large value is equally bogus to a negative value.
>
> It's just that negative numbers are not expected and wreck things down
> the chain.

Yes.

> > > For all these we call psi_group_change(), which calls record_times()
> > > which then sets ->state_start to a smaller value. Resulting in times
> > > above to be larger still.
> >
> > I think there are two considerations.
> > We use that clock value to both start and stop the clock on a
> > state. So while we use a timestamp from slightly earlier in the
> > scheduling event, we do it on both ends. This should not affect
> > long-term accuracy.
>
> If we only consider these timestamps, sure. But due to the whole
> accumulation mess in between you get unbounded drift (on the
> accumulator -- pre-averaging).
>
> > Then there is reader coherency. Before, state_start would match the
> > exact time at which the reader could see the state go live inside the
> > state_mask. After the patch, the reader could see a state whose
> > state_start is slightly in the past. But that's the common case for
> > the reader anyway? Since it rarely runs in the same nanosecond in
> > which the state change occurs.
> >
> > Am I missing something?
>
> So currently, with things being inside the locks, if we sample early we
> miss a bit. If we sample late, we see the exact time.
>
> With the update time being early, we go back to 3840cbe24cf0, and can
> see a too long period in both cases.
>
> But because you're then also using a late clock on read, the error is
> larger still.
>
> If you are measuring time from a start to 'now', and the actual period
> is (10 characters) like so:
>
>   .x.|--------|.y.
>
> Then, if you move the start to x (earlier), your period becomes longer
> (12 characters). If you then also move now to y (later), you get an
> ever larger error (14 characters).
>
> I mean, it all probably doesn't matter, but it's there.

Right. I think we can live with it, but need it documented in the code
comments to maintain a proper mental picture of the implementation
tradeoffs and their consequences.

> > > So this inflates delta and leaves me utterly confused.
> > > Not only does the Changelog here not explain anything, this also very
> > > much needs comments in the code, because the next time someone is going
> > > to be reading this, they'll break their WTF'o'meter and probably the
> > > next one they get too.
> >
> > Fair enough, it's tricky.
> >
> > I'll do my best to capture all the above into the changelog and code
> > comments. But let's try to get on the same page first. That should
> > also help identify which parts exactly need the most help.
>
> Mostly I wasn't sure on which errors you care about and which you don't.

Ok, fair enough.

> As I understand it now, you *only* care about not having negative values
> in the accumulation chain because that's otherwise unsigned and
> negatives show up as large numbers and things go 'funny'.
>
> You do not care about long term drift in the pure accumulator -- since
> it's all time averaged?

Yep. Thanks for taking a look.

^ permalink raw reply	[flat|nested] 9+ messages in thread
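[Editor's note: tying the thread's conclusions together, here is a toy model of the reader loop. It is illustrative only -- the real get_recent_times() also extrapolates live states under the seqcount retry loop and tracks per-state arrays. It walks the agreed-upon behavior: an oversampled round, the zeroed underflow on the next round together with the unconditional times_prev rewind, and the one-sided residual Peter points out (the escaped error stays, but the counter never moves backward).]

```c
#include <assert.h>
#include <stdint.h>

#define MARGIN (2000000000u + (2000000000u >> 1))	/* 1.5x psi_period */

struct toy_reader {
	uint32_t prev;	/* times_prev[]: last snapshot consumed */
	uint32_t total;	/* accumulated pressure time */
};

/* one aggregator pass over a (possibly oversampled) times snapshot */
static void toy_aggregate(struct toy_reader *r, uint32_t times)
{
	uint32_t delta = times - r->prev;

	if (delta > MARGIN)	/* wraparound from a prior oversample */
		delta = 0;
	r->total += delta;
	r->prev = times;	/* unconditional: rewinds after an oversample */
}
```

Walking through the race: round 1 snoops a live state and extrapolates it to 1050 although the scheduler will conclude it at 1000; round 2 sees the scheduler's 1000, underflows, records 0 and rewinds prev; round 3 sees a clean 1300. The reader ends at total = 1350 against the scheduler's 1300 -- the 50ns oversample has 'escaped' and persists, but no backward motion ever reaches the averages.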
* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-02-03 15:38             ` Johannes Weiner
@ 2026-02-03 21:02               ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread

From: Johannes Weiner @ 2026-02-03 21:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Tue, Feb 03, 2026 at 10:38:09AM -0500, Johannes Weiner wrote:
> Yes. The backward motion is what caused the multi-second errors that
> immediately shift even the long-term averages. Those are blatantly
> noticeable and cause practical issues. And notably, these were the ONLY
> issue anyone reported against 3840cbe24cf0 in three years.

Grr. "[...] reported against df77430639c9 ("psi: Reduce calls to
sched_clock() in psi") in three years", of course.

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-02-03 21:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-14 15:43 [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Johannes Weiner
2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner
2026-02-02 20:41   ` Peter Zijlstra
2026-02-03 16:34     ` Johannes Weiner
2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra
2026-01-28 18:56   ` Johannes Weiner
2026-01-28 20:17     ` Peter Zijlstra
2026-02-03 15:38       ` Johannes Weiner
2026-02-03 21:02         ` Johannes Weiner