public inbox for linux-kernel@vger.kernel.org
* [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
@ 2026-01-14 15:43 Johannes Weiner
  2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner
  2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Weiner @ 2026-01-14 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Suren Baghdasaryan, Ingo Molnar
  Cc: Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel

In the aggregator, catch races between state snooping and task state
conclusions explicitly by checking for sample underflows; then move
the clock reads out of the reader's seqcount protection.

This shrinks the critical section and allows switching the scheduler
side to looser (cheaper) clock sourcing in the next patch.

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 59fdb7ebbf22..4b7bf8eb46c2 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu,
 	/* Snapshot a coherent view of the CPU state */
 	do {
 		seq = psi_read_begin(cpu);
-		now = cpu_clock(cpu);
 		memcpy(times, groupc->times, sizeof(groupc->times));
 		state_mask = groupc->state_mask;
 		state_start = groupc->state_start;
@@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu,
 			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
 	} while (psi_read_retry(cpu, seq));
 
+	if (state_mask)
+		now = cpu_clock(cpu);
+
 	/* Calculate state time deltas against the previous snapshot */
 	for (s = 0; s < NR_PSI_STATES; s++) {
 		u32 delta;
@@ -308,7 +310,22 @@ static void get_recent_times(struct psi_group *group, int cpu,
 		if (state_mask & (1 << s))
 			times[s] += now - state_start;
 
+		/*
+		 * This snooping ahead can obviously race with the
+		 * state concluding on the cpu. If we previously
+		 * snooped to a time past where the state concludes,
+		 * times[s] can now be behind times_prev[s].
+		 *
+		 * time_after32() would be the obvious choice, but
+		 * S32_MAX is right around two seconds, which is the
+		 * aggregation interval; if the aggregator gets
+		 * delayed, there would be a risk of dismissing
+		 * genuinely large samples. Use a larger margin.
+		 */
 		delta = times[s] - groupc->times_prev[aggregator][s];
+		if (delta > psi_period + (psi_period >> 1))
+			delta = 0;
+
 		groupc->times_prev[aggregator][s] = times[s];
 
 		times[s] = delta;
@@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
-	int cpu = task_cpu(task);
+	int cpu;
 	u64 now;
 
 	if (!task->pid)
 		return;
 
+	cpu = task_cpu(task);
+	now = cpu_clock(cpu);
+
 	psi_flags_change(task, clear, set);
 
 	psi_write_begin(cpu);
-	now = cpu_clock(cpu);
 	for_each_group(group, task_psi_group(task))
 		psi_group_change(group, cpu, clear, set, now, true);
 	psi_write_end(cpu);
@@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
-	u64 now;
+	u64 now = cpu_clock(cpu);
 
 	psi_write_begin(cpu);
-	now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
+
 	psi_write_end(cpu);
 }
 
@@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 		return;
 	rq->psi_irq_time = irq;
 
-	psi_write_begin(cpu);
 	now = cpu_clock(cpu);
 
+	psi_write_begin(cpu);
 	for_each_group(group, task_psi_group(curr)) {
 		if (!group->enabled)
 			continue;
@@ -1234,8 +1253,9 @@ void psi_cgroup_restart(struct psi_group *group)
 
 		guard(rq_lock_irq)(cpu_rq(cpu));
 
-		psi_write_begin(cpu);
 		now = cpu_clock(cpu);
+
+		psi_write_begin(cpu);
 		psi_group_change(group, cpu, 0, 0, now, true);
 		psi_write_end(cpu);
 	}
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes
  2026-01-14 15:43 [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Johannes Weiner
@ 2026-01-14 15:43 ` Johannes Weiner
  2026-02-02 20:41   ` Peter Zijlstra
  2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2026-01-14 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Suren Baghdasaryan, Ingo Molnar
  Cc: Chengming Zhou, Dietmar Eggemann, John Stultz, linux-kernel

In the hottest psi paths, the scheduler already caches the cpu_clock()
call for the event in rq->clock. Now that the clocks between state
changes and pressure aggregation don't need to be synchronized inside
the seqcount section anymore, use the cheaper rq_clock().

Add update_rq_clock() calls to the few places where psi is entered
without the rq already locked.

schbench -n 0 (no think ops):
Before: average rps: 204408.50
 After: average rps: 204755.90

     2.67%     -0.54%  [kernel.kallsyms]         [k] sched_clock_noinstr

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c  |  3 ++-
 kernel/sched/psi.c   | 12 +++++++-----
 kernel/sched/stats.h |  1 +
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60afadb6eede..27d880a3af28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5517,11 +5517,12 @@ void sched_tick(void)
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
+	update_rq_clock(rq);
+
 	donor = rq->donor;
 
 	psi_account_irqtime(rq, donor, NULL);
 
-	update_rq_clock(rq);
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
 	update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4b7bf8eb46c2..4a0a83f1d1dc 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -932,7 +932,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 		return;
 
 	cpu = task_cpu(task);
-	now = cpu_clock(cpu);
+	now = rq_clock(cpu_rq(cpu));
 
 	psi_flags_change(task, clear, set);
 
@@ -947,7 +947,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
-	u64 now = cpu_clock(cpu);
+	u64 now = rq_clock(cpu_rq(cpu));
 
 	psi_write_begin(cpu);
 
@@ -1026,9 +1026,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 {
 	int cpu = task_cpu(curr);
 	struct psi_group_cpu *groupc;
+	u64 irq, now;
 	s64 delta;
-	u64 irq;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
@@ -1046,7 +1045,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 		return;
 	rq->psi_irq_time = irq;
 
-	now = cpu_clock(cpu);
+	now = rq_clock(rq);
 
 	psi_write_begin(cpu);
 	for_each_group(group, task_psi_group(curr)) {
@@ -1089,6 +1088,7 @@ void psi_memstall_enter(unsigned long *flags)
 	 * race with CPU migration.
 	 */
 	rq = this_rq_lock_irq(&rf);
+	update_rq_clock(rq);
 
 	current->in_memstall = 1;
 	psi_task_change(current, 0, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
@@ -1119,6 +1119,7 @@ void psi_memstall_leave(unsigned long *flags)
 	 * race with CPU migration.
 	 */
 	rq = this_rq_lock_irq(&rf);
+	update_rq_clock(rq);
 
 	current->in_memstall = 0;
 	psi_task_change(current, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING, 0);
@@ -1187,6 +1188,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	}
 
 	rq = task_rq_lock(task, &rf);
+	update_rq_clock(rq);
 
 	/*
 	 * We may race with schedule() dropping the rq lock between
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c903f1a42891..cb67a81e92b6 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -210,6 +210,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
 		struct rq *rq;
 
 		rq = __task_rq_lock(p, &rf);
+		update_rq_clock(rq);
 		psi_task_change(p, p->psi_flags, 0);
 		__task_rq_unlock(rq, p, &rf);
 	}
-- 
2.52.0



* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-01-14 15:43 [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Johannes Weiner
  2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner
@ 2026-01-28 10:35 ` Peter Zijlstra
  2026-01-28 18:56   ` Johannes Weiner
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-01-28 10:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Wed, Jan 14, 2026 at 10:43:16AM -0500, Johannes Weiner wrote:
> In the aggregator, catch races between state snooping and task state
> conclusions explicitly by checking for sample underflows; then move
> the clock reads out of the reader's seqcount protection.

I'm utterly failing to make sense of this -- what?!

> This shrinks the critical section and allows switching the scheduler
> side to looser (cheaper) clock sourcing in the next patch.
> 
> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev>
> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  kernel/sched/psi.c | 34 +++++++++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 59fdb7ebbf22..4b7bf8eb46c2 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu,
>  	/* Snapshot a coherent view of the CPU state */
>  	do {
>  		seq = psi_read_begin(cpu);
> -		now = cpu_clock(cpu);
>  		memcpy(times, groupc->times, sizeof(groupc->times));
>  		state_mask = groupc->state_mask;
>  		state_start = groupc->state_start;
> @@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu,
>  			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
>  	} while (psi_read_retry(cpu, seq));
>  
> +	if (state_mask)
> +		now = cpu_clock(cpu);
> +

So this can be later... which results in a larger value

>  	/* Calculate state time deltas against the previous snapshot */
>  	for (s = 0; s < NR_PSI_STATES; s++) {
>  		u32 delta;
> @@ -308,7 +310,22 @@ static void get_recent_times(struct psi_group *group, int cpu,
>  		if (state_mask & (1 << s))
>  			times[s] += now - state_start;
>  

which makes times[s] larger than it should be

> +		/*
> +		 * This snooping ahead can obviously race with the
> +		 * state concluding on the cpu. If we previously
> +		 * snooped to a time past where the state concludes,
> +		 * times[s] can now be behind times_prev[s].
> +		 *
> +		 * time_after32() would be the obvious choice, but
> +		 * S32_MAX is right around two seconds, which is the
> +		 * aggregation interval; if the aggregator gets
> +		 * delayed, there would be a risk of dismissing
> +		 * genuinely large samples. Use a larger margin.
> +		 */
>  		delta = times[s] - groupc->times_prev[aggregator][s];
> +		if (delta > psi_period + (psi_period >> 1))
> +			delta = 0;
> +

This seems to check if times_prev is larger than times; confused again.

>  		groupc->times_prev[aggregator][s] = times[s];

It updates times_prev irrespectively. Storing a potentially larger
value.

>  
>  		times[s] = delta;

And stores the delta, which can be larger than it should be?

> @@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
>  
>  void psi_task_change(struct task_struct *task, int clear, int set)
>  {
> -	int cpu = task_cpu(task);
> +	int cpu;
>  	u64 now;
>  
>  	if (!task->pid)
>  		return;
>  
> +	cpu = task_cpu(task);
> +	now = cpu_clock(cpu);
> +

So this clock is earlier, or smaller.

>  	psi_flags_change(task, clear, set);
>  
>  	psi_write_begin(cpu);
> -	now = cpu_clock(cpu);
>  	for_each_group(group, task_psi_group(task))
>  		psi_group_change(group, cpu, clear, set, now, true);
>  	psi_write_end(cpu);
> @@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  {
>  	struct psi_group *common = NULL;
>  	int cpu = task_cpu(prev);
> -	u64 now;
> +	u64 now = cpu_clock(cpu);

Same.

>  	psi_write_begin(cpu);
> -	now = cpu_clock(cpu);
>  
>  	if (next->pid) {
>  		psi_flags_change(next, 0, TSK_ONCPU);
> @@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  				psi_group_change(group, cpu, clear, set, now, wake_clock);
>  		}
>  	}
> +
>  	psi_write_end(cpu);
>  }
>  
> @@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>  		return;
>  	rq->psi_irq_time = irq;
>  
> -	psi_write_begin(cpu);
>  	now = cpu_clock(cpu);

Same.

> +	psi_write_begin(cpu);
>  	for_each_group(group, task_psi_group(curr)) {
>  		if (!group->enabled)
>  			continue;
> @@ -1234,8 +1253,9 @@ void psi_cgroup_restart(struct psi_group *group)
>  
>  		guard(rq_lock_irq)(cpu_rq(cpu));
>  
> -		psi_write_begin(cpu);
>  		now = cpu_clock(cpu);

And same again.

> +
> +		psi_write_begin(cpu);
>  		psi_group_change(group, cpu, 0, 0, now, true);
>  		psi_write_end(cpu);
>  	}

For all these we call psi_group_change(), which calls record_times()
which then sets ->state_start to a smaller value. Resulting in times
above to be larger still.

So this inflates delta and leaves me utterly confused.

Not only does the Changelog here not explain anything, this also very
much needs comments in the code, because the next time someone is going
to be reading this, they'll break their WTF'o'meter and probably the
next one they get too.

/me stomps off searching for where the heck he left his pile of spare
WTF'o'meters..


* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra
@ 2026-01-28 18:56   ` Johannes Weiner
  2026-01-28 20:17     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2026-01-28 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Wed, Jan 28, 2026 at 11:35:14AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 14, 2026 at 10:43:16AM -0500, Johannes Weiner wrote:
> > In the aggregator, catch races between state snooping and task state
> > conclusions explicitly by checking for sample underflows; then move
> > the clock reads out of the reader's seqcount protection.
> 
> I'm utterly failing to make sense of this -- what?!
> 
> > This shrinks the critical section and allows switching the scheduler
> > side to looser (cheaper) clock sourcing in the next patch.
> > 
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev>
> > Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  kernel/sched/psi.c | 34 +++++++++++++++++++++++++++-------
> >  1 file changed, 27 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 59fdb7ebbf22..4b7bf8eb46c2 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >  	/* Snapshot a coherent view of the CPU state */
> >  	do {
> >  		seq = psi_read_begin(cpu);
> > -		now = cpu_clock(cpu);
> >  		memcpy(times, groupc->times, sizeof(groupc->times));
> >  		state_mask = groupc->state_mask;
> >  		state_start = groupc->state_start;
> > @@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >  			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
> >  	} while (psi_read_retry(cpu, seq));
> >  
> > +	if (state_mask)
> > +		now = cpu_clock(cpu);
> > +
> 
> So this can be later... which results in a larger value

Correct.

> >  	/* Calculate state time deltas against the previous snapshot */
> >  	for (s = 0; s < NR_PSI_STATES; s++) {
> >  		u32 delta;
> > @@ -308,7 +310,22 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >  		if (state_mask & (1 << s))
> >  			times[s] += now - state_start;
> >  
> 
> which makes times[s] larger than it should be

Let me reiterate what each side does, so we're on the same page.

---

On the write side, psi_group_change() tracks task state changes from
the scheduler. Whenever a state concludes, the nanoseconds spent in it
are accumulated in groupc->times[s]. This is increasing monotonically.

The read side (get_recent_times()) runs periodically to extract state
time deltas for calculating the pressure avgs. For this it remembers
how much state time it previously sampled in ->times_prev[s]:

	delta = groupc->times[s] - groupc->times_prev[s]
	groupc->times_prev[s] += delta

If we had unlimited space, we would make these cumulators 64 bit and
be done. But we can only use 32 bit, and states can last longer than
the 4s we can hold in 32 bit. To deal with this, we do two things: 1)
reader runs at least every 2 seconds 2) in addition to concluded
states it also incorporates samples of the *active state*.

When a >4s state concludes and the upper bits are lost in the ->times
overflow, we don't care. We already sampled it all in <= 2s deltas:

	delta = groupc->times[s] - groupc->times_prev[s]
	if (state_mask & (1 << s))
		delta += now - state_start
	groupc->times_prev[s] += delta

You can see ->times_prev[s] pulling ahead of ->times[s] while the
state is active. But once the state concludes, the scheduler side will
add `now - state_start` to ->times[s] to catch it up.

---

As long as the reader and writer have the same understanding of `now',
this works: The `now - state_start` that the scheduler records when
the state concludes will never be less than what the reader thought it
would be. IOW, times_prev[s] is guaranteed to stay within 2 seconds
of both what the scheduler has already recorded (->times[s]) and what
it will record as (now - state_start).

That mutual understanding of "now" is expensive, so I'm trying to
remove it. This necessarily opens a race: the reader could sample a
live state to a time beyond what the scheduler will use to conclude
that state. The result is that one reader run can oversample slightly,
and the next run will see a negative delta.

Here is how I'm proposing to handle it:

> > +		/*
> > +		 * This snooping ahead can obviously race with the
> > +		 * state concluding on the cpu. If we previously
> > +		 * snooped to a time past where the state concludes,
> > +		 * times[s] can now be behind times_prev[s].
> > +		 *
> > +		 * time_after32() would be the obvious choice, but
> > +		 * S32_MAX is right around two seconds, which is the
> > +		 * aggregation interval; if the aggregator gets
> > +		 * delayed, there would be a risk of dismissing
> > +		 * genuinely large samples. Use a larger margin.
> > +		 */
> >  		delta = times[s] - groupc->times_prev[aggregator][s];
> > +		if (delta > psi_period + (psi_period >> 1))
> > +			delta = 0;
> > +
> 
> This seems to check if times_prev is larger than times; confused again.

In the last round, the reader oversampled. That means times_prev[]
slipped ahead of the scheduler reality (times[]).

The actual accounting error caused by this should be acceptable, owing
to the slightly different clock reads of the two racing threads.

Certainly, there is no meaningful additional sample time in *this*
round, so we fix the delta to 0.

[ On the overflow itself: In a simpler world, this would be:

		if (((signed)times[s] - times_prev[s]) < 0)
			delta = 0

  but I hope the explanation in the comment makes sense: Reader runs
  every ~2s so we expect legitimate state samples up to ~2s. The
  simple version would consider everything above ~2s nonsense. That's
  too tight. So I consider everything above 3s nonsense instead,
  giving us a 1s headroom for any delays in reader scheduling. ]

> >  		groupc->times_prev[aggregator][s] = times[s];
> 
> It updates times_prev irrespectively. Storing a potentially larger
> value.

Right, this is on purpose. Once the delta is extracted and processed,
we need to update the reader to where the scheduler is, as per
usual. But there is now a second case:

1) Existing case. The scheduler accumulated state time the reader
   hasn't seen yet, so times_prev[] is behind times[]. After delta
   extraction, the assignment catches the reader up to the scheduler.

2) New case. The previous reader run used a `now' too far in the
   future and oversampled. times_prev[] is *ahead* of times[]. The
   assignment *rewinds* the reader back to the scheduler's reality.

> >  		times[s] = delta;
> 
> And stores the delta, which can be larger than it should be?

We filtered out bogus deltas, so it should be valid or 0.

> > @@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
> >  
> >  void psi_task_change(struct task_struct *task, int clear, int set)
> >  {
> > -	int cpu = task_cpu(task);
> > +	int cpu;
> >  	u64 now;
> >  
> >  	if (!task->pid)
> >  		return;
> >  
> > +	cpu = task_cpu(task);
> > +	now = cpu_clock(cpu);
> > +
> 
> So this clock is earlier, or smaller.
> 
> >  	psi_flags_change(task, clear, set);
> >  
> >  	psi_write_begin(cpu);
> > -	now = cpu_clock(cpu);
> >  	for_each_group(group, task_psi_group(task))
> >  		psi_group_change(group, cpu, clear, set, now, true);
> >  	psi_write_end(cpu);
> > @@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> >  {
> >  	struct psi_group *common = NULL;
> >  	int cpu = task_cpu(prev);
> > -	u64 now;
> > +	u64 now = cpu_clock(cpu);
> 
> Same.
> 
> >  	psi_write_begin(cpu);
> > -	now = cpu_clock(cpu);
> >  
> >  	if (next->pid) {
> >  		psi_flags_change(next, 0, TSK_ONCPU);
> > @@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> >  				psi_group_change(group, cpu, clear, set, now, wake_clock);
> >  		}
> >  	}
> > +
> >  	psi_write_end(cpu);
> >  }
> >  
> > @@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> >  		return;
> >  	rq->psi_irq_time = irq;
> >  
> > -	psi_write_begin(cpu);
> >  	now = cpu_clock(cpu);
> 
> Same.
> 
> > +	psi_write_begin(cpu);
> >  	for_each_group(group, task_psi_group(curr)) {
> >  		if (!group->enabled)
> >  			continue;
> > @@ -1234,8 +1253,9 @@ void psi_cgroup_restart(struct psi_group *group)
> >  
> >  		guard(rq_lock_irq)(cpu_rq(cpu));
> >  
> > -		psi_write_begin(cpu);
> >  		now = cpu_clock(cpu);
> 
> And same again.
> 
> > +
> > +		psi_write_begin(cpu);
> >  		psi_group_change(group, cpu, 0, 0, now, true);
> >  		psi_write_end(cpu);
> >  	}
> 
> For all these we call psi_group_change(), which calls record_times()
> which then sets ->state_start to a smaller value. Resulting in times
> above to be larger still.

I think there are two considerations.

We use that clock value to both start and stop the clock on a
state. So while we use a timestamp from slightly earlier in the
scheduling event, we do it on both ends. This should not affect
long-term accuracy.

Then there is reader coherency. Before, state_start would match the
exact time at which the reader could see the state go live inside the
state_mask. After the patch, the reader could see a state whose
state_start is slightly in the past. But that's the common case for
the reader anyway? Since it rarely runs in the same nanosecond in
which the state change occurs.

Am I missing something?

> So this inflates delta and leaves me utterly confused.
> 
> Not only does the Changelog here not explain anything, this also very
> much needs comments in the code, because the next time someone is going
> to be reading this, they'll break their WTF'o'meter and probably the
> next one they get too.

Fair enough, it's tricky.

I'll do my best to capture all the above into the changelog and code
comments. But let's try to get on the same page first. That should
also help identify which parts exactly need the most help.

Thanks


* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-01-28 18:56   ` Johannes Weiner
@ 2026-01-28 20:17     ` Peter Zijlstra
  2026-02-03 15:38       ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-01-28 20:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:


> That mutual understanding of "now" is expensive, so I'm trying to
> remove it. This necessarily opens a race: the reader could sample a
> live state to a time beyond what the scheduler will use to conclude
> that state. The result is that one reader run can oversample slightly,
> and the next run will see a negative delta.

Right, exactly the problem from:

 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")

> Here is how I'm proposing to handle it:
> 
> > > +		/*
> > > +		 * This snooping ahead can obviously race with the
> > > +		 * state concluding on the cpu. If we previously
> > > +		 * snooped to a time past where the state concludes,
> > > +		 * times[s] can now be behind times_prev[s].
> > > +		 *
> > > +		 * time_after32() would be the obvious choice, but
> > > +		 * S32_MAX is right around two seconds, which is the
> > > +		 * aggregation interval; if the aggregator gets
> > > +		 * delayed, there would be a risk of dismissing
> > > +		 * genuinely large samples. Use a larger margin.
> > > +		 */
> > >  		delta = times[s] - groupc->times_prev[aggregator][s];
> > > +		if (delta > psi_period + (psi_period >> 1))
> > > +			delta = 0;
> > > +
> > 
> > This seems to check if times_prev is larger than times; confused again.
> 
> In the last round, the reader oversampled. That means times_prev[]
> slipped ahead of the scheduler reality (times[]).
> 
> The actual accounting error caused by this should be acceptable, owing
> to the slightly different clock reads of the two racing threads.
> 
> Certainly, there is no meaningful additional sample time in *this*
> round, so we fix the delta to 0.

OK, so you don't care about the fact that consistent races can inflate
the number beyond actuality. You just want to get rid of backward
motion?

The argument being something like: Since its a decaying average, and the
per-time accumulated error is relatively minor, it doesn't affect the
over-all outcome much.

Because if we just consider it as free-running counters, the
accumulation error is unbounded. But since it is time averaged, the error
becomes insignificant.

> > >  		groupc->times_prev[aggregator][s] = times[s];
> > 
> > It updates times_prev irrespectively. Storing a potentially larger
> > value.
> 
> Right, this is on purpose. Once the delta is extracted and processed,
> we need to update the reader to where the scheduler is, as per
> usual. But there is now a second case:
> 
> 1) Existing case. The scheduler accumulated state time the reader
>    hasn't seen yet, so times_prev[] is behind times[]. After delta
>    extraction, the assignment catches the reader up to the scheduler.
> 
> 2) New case. The previous reader run used a `now' too far in the
>    future and oversampled. times_prev[] is *ahead* of times[]. The
>    assignment *rewinds* the reader back to the scheduler's reality.

Well, not quite, that too large delta has already 'escaped' and been
accumulated. This way you ensure a second race will again result in a
too large delta being accumulated, rather than the next state being
accounted slightly short -- which would compensate for the previous one
being accounted slightly larger.

That is, afaict, 2) ensures you are consistently oversampling but never
undersampling.

> > >  		times[s] = delta;
> > 
> > And stores the delta, which can be larger than it should be?
> 
> We filtered out bogus deltas, so it should be valid or 0.

Semantically a too large value is equally bogus to a negative value.

Its just that negative numbers are not expected and wreck things down
the chain.

> > For all these we call psi_group_change(), which calls record_times()
> > which then sets ->state_start to a smaller value. Resulting in times
> > above to be larger still.
> 
> I think there are two considerations.
> 
> We use that clock value to both start and stop the clock on a
> state. So while we use a timestamp from slightly earlier in the
> scheduling event, we do it on both ends. This should not affect
> long-term accuracy.

If we only consider these timestamps, sure. But due to the whole
accumulation mess in between you get unbounded drift (on the accumulator
-- pre-averaging).

> Then there is reader coherency. Before, state_start would match the
> exact time at which the reader could see the state go live inside the
> state_mask. After the patch, the reader could see a state whose
> state_start is slightly in the past. But that's the common case for
> the reader anyway? Since it rarely runs in the same nanosecond in
> which the state change occurs.
> 
> Am I missing something?

So currently, with things being inside the locks, if we sample early we
miss a bit. If we sample late, we see the exact time.

With the update time being early, we go back to 3840cbe24cf0, and can
see a too long period in both cases.

But because you're then also using a late clock on read, the error is
larger still.

If you are measuring time from a start to 'now', and the actual period
is (10 characters) like so:

   .x.|--------|.y.

Then, if you move the start to x (earlier), your period becomes longer
(12 characters). If you then also move now to y (later) you get an
ever larger error (14 characters).

I mean, it all probably doesn't matter, but its there.

> > So this inflates delta and leaves me utterly confused.
> > 
> > Not only does the Changelog here not explain anything, this also very
> > much needs comments in the code, because the next time someone is going
> > to be reading this, they'll break their WTF'o'meter and probably the
> > next one they get too.
> 
> Fair enough, it's tricky.
> 
> I'll do my best to capture all the above into the changelog and code
> comments. But let's try to get on the same page first. That should
> also help identify which parts exactly need the most help.

Mostly I wasn't sure on which errors you care about and which you don't.

As I understand it now, you *only* care about not having negative values
in the accumulation chain because that's otherwise unsigned and
negatives show up as large numbers and things go 'funny'.

You do not care about long term drift in the pure accumulator -- since
its all time averaged?



* Re: [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes
  2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner
@ 2026-02-02 20:41   ` Peter Zijlstra
  2026-02-03 16:34     ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-02-02 20:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Wed, Jan 14, 2026 at 10:43:17AM -0500, Johannes Weiner wrote:
> In the hottest psi paths, the scheduler already caches the cpu_clock()
> call for the event in rq->clock. Now that the clocks between state
> changes and pressure aggregation don't need to be synchronized inside
> the seqcount section anymore, use the cheaper rq_clock().
> 
> Add update_rq_clock() calls to the few places where psi is entered
> without the rq already locked.

Just to be clear, rq->clock is not a cache of cpu_clock(). rq->clock
discards all backwards motion (which obviously should never happen, but
if it does, the clocks go out of sync).

So if you use rq->clock, you must use it for all and not mix with
cpu_clock().

I *think* the patch does that, but I've not double checked.


* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-01-28 20:17     ` Peter Zijlstra
@ 2026-02-03 15:38       ` Johannes Weiner
  2026-02-03 21:02         ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2026-02-03 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Wed, Jan 28, 2026 at 09:17:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:
> > That mutual understanding of "now" is expensive, so I'm trying to
> > remove it. This necessarily opens a race: the reader could sample a
> > live state to a time beyond what the scheduler will use to conclude
> > that state. The result is that one reader run can oversample slightly,
> > and the next run will see a negative delta.
> 
> Right, exactly the problem from:
> 
>  3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")

Yep.

> > Here is how I'm proposing to handle it:
> > 
> > > > +		/*
> > > > +		 * This snooping ahead can obviously race with the
> > > > +		 * state concluding on the cpu. If we previously
> > > > +		 * snooped to a time past where the state concludes,
> > > > +		 * times[s] can now be behind times_prev[s].
> > > > +		 *
> > > > +		 * time_after32() would be the obvious choice, but
> > > > +		 * S32_MAX is right around two seconds, which is the
> > > > +		 * aggregation interval; if the aggregator gets
> > > > +		 * delayed, there would be a risk of dismissing
> > > > +		 * genuinely large samples. Use a larger margin.
> > > > +		 */
> > > >  		delta = times[s] - groupc->times_prev[aggregator][s];
> > > > +		if (delta > psi_period + (psi_period >> 1))
> > > > +			delta = 0;
> > > > +
> > > 
> > > This seems to check if times_prev is larger than times; confused again.
> > 
> > In the last round, the reader oversampled. That means times_prev[]
> > slipped ahead of the scheduler reality (times[]).
> > 
> > The actual accounting error caused by this should be acceptable, owed
> > to the slightly different clock reads of the two racing threads.
> > 
> > Certainly, there is no meaningful additional sample time in *this*
> > round, so we fix the delta to 0.
> 
> OK, so you don't care about the fact that consistent races can inflate
> the number beyond actuality. You just want to get rid of backward
> motion?

Yes. The backward motion is what caused the multi-second errors that
immediately shift even the long-term averages. Those are blatantly
noticeable and cause practical issues. And notably, this was the ONLY
issue anyone reported against 3840cbe24cf0 in three years.

Given the decaying averages, and the recency windows anyone really
cares about, a smaller fudge factor is much less of a concern.

It's largely self-correcting too: if you have ongoing/recurring
pressure periods, oversampling of live state is corrected by
subsequent smaller samples against the "actual" time the scheduler
ended up recording. The actual underflows were quite rare. The
reproducer had to run the aggregation every microsecond for extended
periods to trigger a race where the oversampled error exceeds the
state time accrued in the subsequent sampling period:

                             v detect live state
                             | v now() -> record N + error
                             | |     v underflow
  aggregator                 | |     |
  scheduler   |---------------|
                              ^ now() -> record N

We *could* look into remembering oversampling debt accurately and
incorporate deficits into the samples that go into total=. But we
couldn't do it for the avg= because that also encodes recency. And
that discrepancy between the two would likely be worse interface-wise
than a fudge factor affecting both the same.

> The argument being something like: Since its a decaying average, and the
> per-time accumulated error is relatively minor, it doesn't affect the
> over-all outcome much.
> 
> Because if we just consider it as free-running counters, the
> accumulation error is unbound. But since it is time averaged, the error
> becomes insignificant.

Yes.

> > > >  		groupc->times_prev[aggregator][s] = times[s];
> > > 
> > > It updates times_prev irrespectively. Storing a potentially larger
> > > value.
> > 
> > Right, this is on purpose. Once the delta is extracted and processed,
> > we need to update the reader to where the scheduler is, as per
> > usual. But there is now a second case:
> > 
> > 1) Existing case. The scheduler accumulated state time the reader
> >    hasn't seen yet, so times_prev[] is behind times[]. After delta
> >    extraction, the assignment catches the reader up to the scheduler.
> > 
> > 2) New case. The previous reader run used a `now' too far in the
> >    future and oversampled. times_prev[] is *ahead* of times[]. The
> >    assignment *rewinds* the reader back to the scheduler's reality.
> 
> Well, not quite, that too large delta has already 'escaped' and been
> accumulated. This way you ensure a second race will again result in a
> too large delta being accumulated, rather than the next state being
> accounted slightly short -- which would compensate for the previous one
> being accounted slightly larger.
> 
> That is, afaict, 2) ensures you consistently oversample but never
> undersample.

True. But because of the importance of recency in that data, there is
a limit to how much we could backcorrect later samples.

> > > >  		times[s] = delta;
> > > 
> > > And stores the delta, which can be larger than it should be?
> > 
> > We filtered out bogus deltas, so it should be valid or 0.
> 
> Semantically a too large value is equally bogus to a negative value.
> 
> Its just that negative numbers are not expected and wreck things down
> the chain.

Yes.

> > > For all these we call psi_group_change(), which calls record_times()
> > > which then sets ->state_start to a smaller value. Resulting in times
> > > above to be larger still.
> > 
> > I think there are two considerations.
> > 
> > We use that clock value to both start and stop the clock on a
> > state. So while we use a timestamp from slightly earlier in the
> > scheduling event, we do it on both ends. This should not affect
> > long-term accuracy.
> 
> If we only consider these timestamps, sure. But due to the whole
> accumulation mess in between you get unbounded drift (on the accumulator
> -- pre-averaging).
> 
> > Then there is reader coherency. Before, state_start would match the
> > exact time at which the reader could see the state go live inside the
> > state_mask. After the patch, the reader could see a state whose
> > state_start is slightly in the past. But that's the common case for
> > the reader anyway? Since it rarely runs in the same nanosecond in
> > which the state change occurs.
> > 
> > Am I missing something?
> 
> So currently, with things being inside the locks, if we sample early we
> miss a bit. If we sample late, we see the exact time.
> 
> With the update time being early, we go back to 3840cbe24cf0, and can
> see a too long period in both cases.
> 
> But because you're then also using a late clock on read, the error is
> larger still.
> 
> If you are measuring time from a start to 'now', and the actual period
> is (10 characters) like so:
> 
>    .x.|--------|.y.
> 
> Then, if you move the start to x (earlier), your period becomes longer
> (12 characters). If you then also move now to y (later) you get an
> even larger error (14 characters).
> 
> I mean, it all probably doesn't matter, but it's there.

Right. I think we can live with it, but need it documented in the code
comments to maintain a proper mental picture of the implementation
tradeoffs and their consequences.

> > > So this inflates delta and leaves me utterly confused.
> > > 
> > > Not only does the Changelog here not explain anything, this also very
> > > much needs comments in the code, because the next time someone is going
> > > to be reading this, they'll break their WTF'o'meter and probably the
> > > next one they get too.
> > 
> > Fair enough, it's tricky.
> > 
> > I'll do my best to capture all the above into the changelog and code
> > comments. But let's try to get on the same page first. That should
> > also help identify which parts exactly need the most help.
> 
> Mostly I wasn't sure on which errors you care about and which you don't.

Ok, fair enough.

> As I understand it now, you *only* care about not having negative values
> in the accumulation chain because that's otherwise unsigned and
> negatives show up as large numbers and things go 'funny'.
> 
> You do not care about long-term drift in the pure accumulator -- since
> it's all time averaged?

Yep.

Thanks for taking a look.


* Re: [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes
  2026-02-02 20:41   ` Peter Zijlstra
@ 2026-02-03 16:34     ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2026-02-03 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Mon, Feb 02, 2026 at 09:41:37PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 14, 2026 at 10:43:17AM -0500, Johannes Weiner wrote:
> > In the hottest psi paths, the scheduler already caches the cpu_clock()
> > call for the event in rq->clock. Now that the clocks between state
> > changes and pressure aggregation don't need to be synchronized inside
> > the seqcount section anymore, use the cheaper rq_clock().
> > 
> > Add update_rq_clock() calls to the few places where psi is entered
> > without the rq already locked.
> 
> Just to be clear, rq->clock is not a cache of cpu_clock(). rq->clock
> discards all backwards motion (which obviously should never happen, but
> if it does, the clocks go out of sync).
> 
> So if you use rq->clock, you must use it for all and not mix with
> cpu_clock().
> 
> I *think* the patch does that, but I've not double checked.

Ah no, it does mix them :(

Yeah I'm using rq_clock() consistently on the scheduler side to
accumulate the times of concluded states.

	state_start = rq_clock()
	...
	state_time = rq_clock() - state_start

However, the aggregator side still uses

	state_time += cpu_clock() - state_start

to incorporate currently active state. If they don't have the same
base, this won't work.

Doing the full lock and update_rq_clock() from the aggregator sounds
quite heavy handed. How about using sched_clock_cpu() directly and
doing the backwards motion check by hand?

	local_irq_save()
	now = sched_clock_cpu(cpu)
	local_irq_restore()

	...

	if (state_mask & (1 << s) && now > state_start)
		times[s] += now - state_start


* Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
  2026-02-03 15:38       ` Johannes Weiner
@ 2026-02-03 21:02         ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2026-02-03 21:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel

On Tue, Feb 03, 2026 at 10:38:09AM -0500, Johannes Weiner wrote:
> Yes. The backward motion is what caused the multi-second errors that
> immediately shift even the long-term averages. Those are blatantly
> noticeable and cause practical issues. And notably, this was the ONLY
> issue anyone reported against 3840cbe24cf0 in three years.

Grr.

"[...] reported against df77430639c9 ("psi: Reduce calls to
ched_clock() in psi") in three years", of course.


end of thread, other threads:[~2026-02-03 21:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-14 15:43 [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Johannes Weiner
2026-01-14 15:43 ` [PATCH resend 2/2] sched: psi: use rq_clock() during task state changes Johannes Weiner
2026-02-02 20:41   ` Peter Zijlstra
2026-02-03 16:34     ` Johannes Weiner
2026-01-28 10:35 ` [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator Peter Zijlstra
2026-01-28 18:56   ` Johannes Weiner
2026-01-28 20:17     ` Peter Zijlstra
2026-02-03 15:38       ` Johannes Weiner
2026-02-03 21:02         ` Johannes Weiner
