* Re: [Bug 219194] New: PSI full-stall duration reports nonsensically large value when reading pressure file too frequently
  [not found] <bug-219194-14699@https.bugzilla.kernel.org/>
@ 2024-08-26 18:19 ` Johannes Weiner
  2024-08-27 12:18   ` [PATCH] sched: psi: fix bogus pressure spikes from aggregation race Johannes Weiner
  0 siblings, 1 reply; 3+ messages in thread

From: Johannes Weiner @ 2024-08-26 18:19 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: Suren Baghdasaryan, Peter Zijlstra, linux-kernel

Hi Brandon,

On Mon, Aug 26, 2024 at 12:49:34AM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219194
>
>             Bug ID: 219194
>            Summary: PSI full-stall duration reports nonsensically large
>                     value when reading pressure file too frequently
>            Product: Other
>            Version: 2.5
>           Hardware: AMD
>                 OS: Linux
>             Status: NEW
>           Severity: normal
>           Priority: P3
>          Component: Other
>           Assignee: other_other@kernel-bugs.osdl.org
>           Reporter: brandon@buildbuddy.io
>                 CC: hannes@cmpxchg.org
>         Regression: No
>
> Created attachment 306775
>   --> https://bugzilla.kernel.org/attachment.cgi?id=306775&action=edit
> Minimal reproducible example
>
> The cumulative full-stall duration ("total" field) reported in
> /sys/fs/cgroup/.../cpu.pressure is occasionally nonsensically large - a
> value much longer than the total execution duration of the only process
> that was run in the cgroup.
>
> Interestingly, the nonsensical value that it reports is often exactly
> 4.294967 seconds. This value looks suspiciously like UINT32_MAX
> (4294967295), which made me think that this is a bug in the PSI code
> rather than something I was doing wrong.

Thanks for reporting this. Your reproducer was immensely helpful in
tracking this down, much appreciated. With the below patch, I can no
longer trigger it. If at all possible, could you try it on your end?

It's a race condition between the aggregator (reader) and a task
changing state.
Specifically, it's an issue with the live state snooping: the
aggregator will add cpu_clock() - state_start to the sample of any
currently active states that have not been recorded yet. This lets
the aggregator get ahead of the recorded task state, but we assume
this is okay because, as the state concludes, the delta we snooped
ahead will be incorporated back into the recorded times.

But there is a bug: the path that concludes the state change reads
cpu_clock() *before* acquiring the seq write lock. This means an
aggregator can race and snoop ahead to a later clock value than the
one at which the state will conclude. The next aggregation will see a
negative delta and record a UINT_MAX - <race time> sample:

    psi_group_change()                get_recent_times()
      now = cpu_clock() (X)
      <a few cycles pass>
                                        read_seqcount_begin()
                                        now = cpu_clock() (X+42)
                                        memcpy(times, groupc->times)
                                        !read_seqcount_retry()
                                        /* Snoops live state up to X+42 */
                                        if (state_mask & (1 << s))
                                            times[s] += now - state_start
                                        delta = times[s] - groupc->times_prev[s]
                                        groupc->times_prev[s] = times[s]
                                        /* times_prev up to X+42 */
      write_seqcount_begin()
      /* concludes state ending at X */
      groupc->times[s] += now - state_start
      groupc->state_mask &= ~(1 << s)
      write_seqcount_end()
                                        read_seqcount_begin()
                                        memcpy(times, groupc->times)
                                        !read_seqcount_retry()
                                        if (state_mask & (1 << s))
                                            ; /* nope, state concluded */
                                        delta = times[s] - groupc->times_prev[s]
                                        /* times_prev ahead, sample is UINT_MAX-42 */

The solution is to move the clock read of the state change into the
locked section.
This way, live snooping cannot get ahead of the time at which the
state will conclude:

---

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 020d58967d4e..84dad1511d1e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 }
 
 static void psi_group_change(struct psi_group *group, int cpu,
-			     unsigned int clear, unsigned int set, u64 now,
+			     unsigned int clear, unsigned int set,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
+	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
+	now = cpu_clock(cpu);
 
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
@@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	now = cpu_clock(cpu);
-
 	group = task_psi_group(task);
 	do {
-		psi_group_change(group, cpu, clear, set, now, true);
+		psi_group_change(group, cpu, clear, set, true);
 	} while ((group = group->parent));
 }
 
@@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				break;
 		}
 
-		psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		psi_group_change(group, cpu, 0, TSK_ONCPU, true);
 	} while ((group = group->parent));
 }
 
@@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	do {
 		if (group == common)
 			break;
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, wake_clock);
 	} while ((group = group->parent));
 
 	/*
@@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 		clear &= ~TSK_ONCPU;
 		for (; group; group = group->parent)
-			psi_group_change(group, cpu, clear, set, now, wake_clock);
+			psi_group_change(group, cpu, clear, set, wake_clock);
 	}
 }
 
@@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	u64 now, irq;
 	s64 delta;
+	u64 irq;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
@@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	now = cpu_clock(cpu);
 	irq = irq_time_read(cpu);
 	delta = (s64)(irq - rq->psi_irq_time);
 	if (delta < 0)
@@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	rq->psi_irq_time = irq;
 
 	do {
+		u64 now;
+
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
 		write_seqcount_begin(&groupc->seq);
+		now = cpu_clock(cpu);
 
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
@@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
 	for_each_possible_cpu(cpu) {
 		struct rq *rq = cpu_rq(cpu);
 		struct rq_flags rf;
-		u64 now;
 
 		rq_lock_irq(rq, &rf);
-		now = cpu_clock(cpu);
-		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_group_change(group, cpu, 0, 0, true);
 		rq_unlock_irq(rq, &rf);
 	}
 }

^ permalink raw reply related	[flat|nested] 3+ messages in thread
* [PATCH] sched: psi: fix bogus pressure spikes from aggregation race
  2024-08-26 18:19 ` [Bug 219194] New: PSI full-stall duration reports nonsensically large value when reading pressure file too frequently Johannes Weiner
@ 2024-08-27 12:18 ` Johannes Weiner
  2024-08-29  8:27   ` Chengming Zhou
  0 siblings, 1 reply; 3+ messages in thread

From: Johannes Weiner @ 2024-08-27 12:18 UTC (permalink / raw)
  To: bugzilla-daemon
  Cc: Suren Baghdasaryan, Peter Zijlstra, Shakeel Butt, linux-kernel

Sending a proper patch. Peter, can you please take this?

---

Brandon reports sporadic, nonsensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.
Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 020d58967d4e..84dad1511d1e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 }
 
 static void psi_group_change(struct psi_group *group, int cpu,
-			     unsigned int clear, unsigned int set, u64 now,
+			     unsigned int clear, unsigned int set,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
+	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
+	now = cpu_clock(cpu);
 
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
@@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	now = cpu_clock(cpu);
-
 	group = task_psi_group(task);
 	do {
-		psi_group_change(group, cpu, clear, set, now, true);
+		psi_group_change(group, cpu, clear, set, true);
 	} while ((group = group->parent));
 }
 
@@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				break;
 		}
 
-		psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		psi_group_change(group, cpu, 0, TSK_ONCPU, true);
 	} while ((group = group->parent));
 }
 
@@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	do {
 		if (group == common)
 			break;
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, wake_clock);
 	} while ((group = group->parent));
 
 	/*
@@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 		clear &= ~TSK_ONCPU;
 		for (; group; group = group->parent)
-			psi_group_change(group, cpu, clear, set, now, wake_clock);
+			psi_group_change(group, cpu, clear, set, wake_clock);
 	}
 }
 
@@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	u64 now, irq;
 	s64 delta;
+	u64 irq;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
@@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	now = cpu_clock(cpu);
 	irq = irq_time_read(cpu);
 	delta = (s64)(irq - rq->psi_irq_time);
 	if (delta < 0)
@@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	rq->psi_irq_time = irq;
 
 	do {
+		u64 now;
+
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
 		write_seqcount_begin(&groupc->seq);
+		now = cpu_clock(cpu);
 
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
@@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
 	for_each_possible_cpu(cpu) {
 		struct rq *rq = cpu_rq(cpu);
 		struct rq_flags rf;
-		u64 now;
 
 		rq_lock_irq(rq, &rf);
-		now = cpu_clock(cpu);
-		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_group_change(group, cpu, 0, 0, true);
 		rq_unlock_irq(rq, &rf);
 	}
 }
-- 
2.46.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread
* Re: [PATCH] sched: psi: fix bogus pressure spikes from aggregation race
  2024-08-27 12:18 ` [PATCH] sched: psi: fix bogus pressure spikes from aggregation race Johannes Weiner
@ 2024-08-29  8:27 ` Chengming Zhou
  0 siblings, 0 replies; 3+ messages in thread

From: Chengming Zhou @ 2024-08-29 8:27 UTC (permalink / raw)
  To: Johannes Weiner, bugzilla-daemon
  Cc: Suren Baghdasaryan, Peter Zijlstra, Shakeel Butt, linux-kernel

On 2024/8/27 20:18, Johannes Weiner wrote:
> Sending a proper patch. Peter, can you please take this?
>
> ---
>
> Brandon reports sporadic, nonsensical spikes in cumulative pressure
> time (total=) when reading cpu.pressure at a high rate. This is due to
> a race condition between reader aggregation and tasks changing states.
>
> While it affects all states and all resources captured by PSI, in
> practice it most likely triggers with CPU pressure, since scheduling
> events are so frequent compared to other resource events.
>
> The race context is the live snooping of ongoing stalls during a
> pressure read. The read aggregates per-cpu records for stalls that
> have concluded, but will also incorporate ad-hoc the duration of any
> active state that hasn't been recorded yet. This is important to get
> timely measurements of ongoing stalls. Those ad-hoc samples are
> calculated on-the-fly up to the current time on that CPU; since the
> stall hasn't concluded, it's expected that this is the minimum amount
> of stall time that will enter the per-cpu records once it does.
>
> The problem is that the path that concludes the state uses a CPU clock
> read that is not synchronized against aggregators; the clock is read
> outside of the seqlock protection. This allows aggregators to race and
> snoop a stall with a longer duration than will actually be recorded.
>
> With the recorded stall time being less than the last snapshot
> remembered by the aggregator, a subsequent sample will underflow and
> observe a bogus delta value, resulting in an erratic jump in pressure.
>
> Fix this by moving the clock read of the state change into the seqlock
> protection. This ensures no aggregation can snoop live stalls past the
> time that's recorded when the state concludes.
>
> Reported-by: Brandon Duffany <brandon@buildbuddy.io>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
> Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
> Cc: stable@vger.kernel.org
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Good catch!

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Maybe another solution is to check whether the race happened on the
read side, and then return the delta as 0, right?

Thanks.

> ---
>  kernel/sched/psi.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 020d58967d4e..84dad1511d1e 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
>  }
>  
>  static void psi_group_change(struct psi_group *group, int cpu,
> -			     unsigned int clear, unsigned int set, u64 now,
> +			     unsigned int clear, unsigned int set,
>  			     bool wake_clock)
>  {
>  	struct psi_group_cpu *groupc;
>  	unsigned int t, m;
>  	u32 state_mask;
> +	u64 now;
>  
>  	lockdep_assert_rq_held(cpu_rq(cpu));
>  	groupc = per_cpu_ptr(group->pcpu, cpu);
> @@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	 * SOME and FULL time these may have resulted in.
>  	 */
>  	write_seqcount_begin(&groupc->seq);
> +	now = cpu_clock(cpu);
>  
>  	/*
>  	 * Start with TSK_ONCPU, which doesn't have a corresponding
> @@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
>  {
>  	int cpu = task_cpu(task);
>  	struct psi_group *group;
> -	u64 now;
>  
>  	if (!task->pid)
>  		return;
>  
>  	psi_flags_change(task, clear, set);
>  
> -	now = cpu_clock(cpu);
> -
>  	group = task_psi_group(task);
>  	do {
> -		psi_group_change(group, cpu, clear, set, now, true);
> +		psi_group_change(group, cpu, clear, set, true);
>  	} while ((group = group->parent));
>  }
>  
> @@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  {
>  	struct psi_group *group, *common = NULL;
>  	int cpu = task_cpu(prev);
> -	u64 now = cpu_clock(cpu);
>  
>  	if (next->pid) {
>  		psi_flags_change(next, 0, TSK_ONCPU);
> @@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  				break;
>  		}
>  
> -		psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
> +		psi_group_change(group, cpu, 0, TSK_ONCPU, true);
>  	} while ((group = group->parent));
>  }
>  
> @@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  	do {
>  		if (group == common)
>  			break;
> -		psi_group_change(group, cpu, clear, set, now, wake_clock);
> +		psi_group_change(group, cpu, clear, set, wake_clock);
>  	} while ((group = group->parent));
>  
>  	/*
> @@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  	if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
>  		clear &= ~TSK_ONCPU;
>  		for (; group; group = group->parent)
> -			psi_group_change(group, cpu, clear, set, now, wake_clock);
> +			psi_group_change(group, cpu, clear, set, wake_clock);
>  	}
>  }
>  
> @@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>  	int cpu = task_cpu(curr);
>  	struct psi_group *group;
>  	struct psi_group_cpu *groupc;
> -	u64 now, irq;
>  	s64 delta;
> +	u64 irq;
>  
>  	if (static_branch_likely(&psi_disabled))
>  		return;
> @@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>  	if (prev && task_psi_group(prev) == group)
>  		return;
>  
> -	now = cpu_clock(cpu);
>  	irq = irq_time_read(cpu);
>  	delta = (s64)(irq - rq->psi_irq_time);
>  	if (delta < 0)
> @@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>  	rq->psi_irq_time = irq;
>  
>  	do {
> +		u64 now;
> +
>  		if (!group->enabled)
>  			continue;
>  
>  		groupc = per_cpu_ptr(group->pcpu, cpu);
>  
>  		write_seqcount_begin(&groupc->seq);
> +		now = cpu_clock(cpu);
>  
>  		record_times(groupc, now);
>  		groupc->times[PSI_IRQ_FULL] += delta;
> @@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
>  	for_each_possible_cpu(cpu) {
>  		struct rq *rq = cpu_rq(cpu);
>  		struct rq_flags rf;
> -		u64 now;
>  
>  		rq_lock_irq(rq, &rf);
> -		now = cpu_clock(cpu);
> -		psi_group_change(group, cpu, 0, 0, now, true);
> +		psi_group_change(group, cpu, 0, 0, true);
>  		rq_unlock_irq(rq, &rf);
>  	}
>  }

^ permalink raw reply	[flat|nested] 3+ messages in thread
end of thread, other threads:[~2024-08-29 8:27 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <bug-219194-14699@https.bugzilla.kernel.org/>
2024-08-26 18:19 ` [Bug 219194] New: PSI full-stall duration reports nonsensically large value when reading pressure file too frequently Johannes Weiner
2024-08-27 12:18 ` [PATCH] sched: psi: fix bogus pressure spikes from aggregation race Johannes Weiner
2024-08-29 8:27 ` Chengming Zhou