From: Chengming Zhou <chengming.zhou@linux.dev>
To: Johannes Weiner <hannes@cmpxchg.org>, bugzilla-daemon@kernel.org
Cc: Suren Baghdasaryan <surenb@google.com>,
Peter Zijlstra <peterz@infradead.org>,
Shakeel Butt <shakeelb@google.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched: psi: fix bogus pressure spikes from aggregation race
Date: Thu, 29 Aug 2024 16:27:06 +0800
Message-ID: <b6bf22e9-0d01-4e6b-bd06-4eaef73e76ce@linux.dev>
In-Reply-To: <20240827121851.GB438928@cmpxchg.org>
On 2024/8/27 20:18, Johannes Weiner wrote:
> Sending a proper patch. Peter, can you please take this?
>
> ---
>
> Brandon reports sporadic, non-sensical spikes in cumulative pressure
> time (total=) when reading cpu.pressure at a high rate. This is due to
> a race condition between reader aggregation and tasks changing states.
>
> While it affects all states and all resources captured by PSI, in
> practice it most likely triggers with CPU pressure, since scheduling
> events are so frequent compared to other resource events.
>
> The race context is the live snooping of ongoing stalls during a
> pressure read. The read aggregates per-cpu records for stalls that
> have concluded, but will also incorporate ad-hoc the duration of any
> active state that hasn't been recorded yet. This is important to get
> timely measurements of ongoing stalls. Those ad-hoc samples are
> calculated on-the-fly up to the current time on that CPU; since the
> stall hasn't concluded, it's expected that this is the minimum amount
> of stall time that will enter the per-cpu records once it does.
>
> The problem is that the path that concludes the state uses a CPU clock
> read that is not synchronized against aggregators; the clock is read
> outside of the seqlock protection. This allows aggregators to race and
> snoop a stall with a longer duration than will actually be recorded.
>
> With the recorded stall time being less than the last snapshot
> remembered by the aggregator, a subsequent sample will underflow and
> observe a bogus delta value, resulting in an erratic jump in pressure.
>
> Fix this by moving the clock read of the state change into the seqlock
> protection. This ensures no aggregation can snoop live stalls past the
> time that's recorded when the state concludes.
>
> Reported-by: Brandon Duffany <brandon@buildbuddy.io>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
> Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
> Cc: stable@vger.kernel.org
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Good catch!

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Maybe another solution would be to detect the race on the read side
and return a delta of 0 in that case, right?
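
Something roughly like this in get_recent_times() (an untested
sketch, only to illustrate the idea; the clamp and the skipped
times_prev update are my guess at how it could look):

	delta = times[s] - groupc->times_prev[aggregator][s];
	/*
	 * A racing state change may have recorded less stall
	 * time than a previous read already snooped; clamp to
	 * zero instead of letting the u32 delta underflow.
	 */
	if ((s32)delta < 0)
		delta = 0;
	else
		groupc->times_prev[aggregator][s] = times[s];

Though your fix looks better to me, since it corrects the record at
the source instead of special-casing every reader.
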
Thanks.
> ---
> kernel/sched/psi.c | 26 ++++++++++++--------------
> 1 file changed, 12 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 020d58967d4e..84dad1511d1e 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> }
>
> static void psi_group_change(struct psi_group *group, int cpu,
> - unsigned int clear, unsigned int set, u64 now,
> + unsigned int clear, unsigned int set,
> bool wake_clock)
> {
> struct psi_group_cpu *groupc;
> unsigned int t, m;
> u32 state_mask;
> + u64 now;
>
> lockdep_assert_rq_held(cpu_rq(cpu));
> groupc = per_cpu_ptr(group->pcpu, cpu);
> @@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
> * SOME and FULL time these may have resulted in.
> */
> write_seqcount_begin(&groupc->seq);
> + now = cpu_clock(cpu);
>
> /*
> * Start with TSK_ONCPU, which doesn't have a corresponding
> @@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
> {
> int cpu = task_cpu(task);
> struct psi_group *group;
> - u64 now;
>
> if (!task->pid)
> return;
>
> psi_flags_change(task, clear, set);
>
> - now = cpu_clock(cpu);
> -
> group = task_psi_group(task);
> do {
> - psi_group_change(group, cpu, clear, set, now, true);
> + psi_group_change(group, cpu, clear, set, true);
> } while ((group = group->parent));
> }
>
> @@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> {
> struct psi_group *group, *common = NULL;
> int cpu = task_cpu(prev);
> - u64 now = cpu_clock(cpu);
>
> if (next->pid) {
> psi_flags_change(next, 0, TSK_ONCPU);
> @@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> break;
> }
>
> - psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
> + psi_group_change(group, cpu, 0, TSK_ONCPU, true);
> } while ((group = group->parent));
> }
>
> @@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> do {
> if (group == common)
> break;
> - psi_group_change(group, cpu, clear, set, now, wake_clock);
> + psi_group_change(group, cpu, clear, set, wake_clock);
> } while ((group = group->parent));
>
> /*
> @@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
> clear &= ~TSK_ONCPU;
> for (; group; group = group->parent)
> - psi_group_change(group, cpu, clear, set, now, wake_clock);
> + psi_group_change(group, cpu, clear, set, wake_clock);
> }
> }
> }
> @@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> int cpu = task_cpu(curr);
> struct psi_group *group;
> struct psi_group_cpu *groupc;
> - u64 now, irq;
> s64 delta;
> + u64 irq;
>
> if (static_branch_likely(&psi_disabled))
> return;
> @@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> if (prev && task_psi_group(prev) == group)
> return;
>
> - now = cpu_clock(cpu);
> irq = irq_time_read(cpu);
> delta = (s64)(irq - rq->psi_irq_time);
> if (delta < 0)
> @@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> rq->psi_irq_time = irq;
>
> do {
> + u64 now;
> +
> if (!group->enabled)
> continue;
>
> groupc = per_cpu_ptr(group->pcpu, cpu);
>
> write_seqcount_begin(&groupc->seq);
> + now = cpu_clock(cpu);
>
> record_times(groupc, now);
> groupc->times[PSI_IRQ_FULL] += delta;
> @@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
> for_each_possible_cpu(cpu) {
> struct rq *rq = cpu_rq(cpu);
> struct rq_flags rf;
> - u64 now;
>
> rq_lock_irq(rq, &rf);
> - now = cpu_clock(cpu);
> - psi_group_change(group, cpu, 0, 0, now, true);
> + psi_group_change(group, cpu, 0, 0, true);
> rq_unlock_irq(rq, &rf);
> }
> }