public inbox for linux-kernel@vger.kernel.org
From: Chengming Zhou <chengming.zhou@linux.dev>
To: Johannes Weiner <hannes@cmpxchg.org>, bugzilla-daemon@kernel.org
Cc: Suren Baghdasaryan <surenb@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Shakeel Butt <shakeelb@google.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched: psi: fix bogus pressure spikes from aggregation race
Date: Thu, 29 Aug 2024 16:27:06 +0800	[thread overview]
Message-ID: <b6bf22e9-0d01-4e6b-bd06-4eaef73e76ce@linux.dev> (raw)
In-Reply-To: <20240827121851.GB438928@cmpxchg.org>

On 2024/8/27 20:18, Johannes Weiner wrote:
> Sending a proper patch. Peter, can you please take this?
> 
> ---
> 
> Brandon reports sporadic, non-sensical spikes in cumulative pressure
> time (total=) when reading cpu.pressure at a high rate. This is due to
> a race condition between reader aggregation and tasks changing states.
> 
> While it affects all states and all resources captured by PSI, in
> practice it most likely triggers with CPU pressure, since scheduling
> events are so frequent compared to other resource events.
> 
> The race context is the live snooping of ongoing stalls during a
> pressure read. The read aggregates per-cpu records for stalls that
> have concluded, but will also incorporate ad-hoc the duration of any
> active state that hasn't been recorded yet. This is important to get
> timely measurements of ongoing stalls. Those ad-hoc samples are
> calculated on-the-fly up to the current time on that CPU; since the
> stall hasn't concluded, it's expected that this is the minimum amount
> of stall time that will enter the per-cpu records once it does.
> 
> The problem is that the path that concludes the state uses a CPU clock
> read that is not synchronized against aggregators; the clock is read
> outside of the seqlock protection. This allows aggregators to race and
> snoop a stall with a longer duration than will actually be recorded.
> 
> With the recorded stall time being less than the last snapshot
> remembered by the aggregator, a subsequent sample will underflow and
> observe a bogus delta value, resulting in an erratic jump in pressure.
> 
> Fix this by moving the clock read of the state change into the seqlock
> protection. This ensures no aggregation can snoop live stalls past the
> time that's recorded when the state concludes.
> 
> Reported-by: Brandon Duffany <brandon@buildbuddy.io>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
> Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
> Cc: stable@vger.kernel.org
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Good catch!

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Maybe another solution would be to detect on the read side that the race
has happened, and return a delta of 0 in that case, right?

Thanks.

> ---
>   kernel/sched/psi.c | 26 ++++++++++++--------------
>   1 file changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 020d58967d4e..84dad1511d1e 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
>   }
>   
>   static void psi_group_change(struct psi_group *group, int cpu,
> -			     unsigned int clear, unsigned int set, u64 now,
> +			     unsigned int clear, unsigned int set,
>   			     bool wake_clock)
>   {
>   	struct psi_group_cpu *groupc;
>   	unsigned int t, m;
>   	u32 state_mask;
> +	u64 now;
>   
>   	lockdep_assert_rq_held(cpu_rq(cpu));
>   	groupc = per_cpu_ptr(group->pcpu, cpu);
> @@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>   	 * SOME and FULL time these may have resulted in.
>   	 */
>   	write_seqcount_begin(&groupc->seq);
> +	now = cpu_clock(cpu);
>   
>   	/*
>   	 * Start with TSK_ONCPU, which doesn't have a corresponding
> @@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
>   {
>   	int cpu = task_cpu(task);
>   	struct psi_group *group;
> -	u64 now;
>   
>   	if (!task->pid)
>   		return;
>   
>   	psi_flags_change(task, clear, set);
>   
> -	now = cpu_clock(cpu);
> -
>   	group = task_psi_group(task);
>   	do {
> -		psi_group_change(group, cpu, clear, set, now, true);
> +		psi_group_change(group, cpu, clear, set, true);
>   	} while ((group = group->parent));
>   }
>   
> @@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>   {
>   	struct psi_group *group, *common = NULL;
>   	int cpu = task_cpu(prev);
> -	u64 now = cpu_clock(cpu);
>   
>   	if (next->pid) {
>   		psi_flags_change(next, 0, TSK_ONCPU);
> @@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>   				break;
>   			}
>   
> -			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
> +			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
>   		} while ((group = group->parent));
>   	}
>   
> @@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>   		do {
>   			if (group == common)
>   				break;
> -			psi_group_change(group, cpu, clear, set, now, wake_clock);
> +			psi_group_change(group, cpu, clear, set, wake_clock);
>   		} while ((group = group->parent));
>   
>   		/*
> @@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>   		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
>   			clear &= ~TSK_ONCPU;
>   			for (; group; group = group->parent)
> -				psi_group_change(group, cpu, clear, set, now, wake_clock);
> +				psi_group_change(group, cpu, clear, set, wake_clock);
>   		}
>   	}
>   }
> @@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>   	int cpu = task_cpu(curr);
>   	struct psi_group *group;
>   	struct psi_group_cpu *groupc;
> -	u64 now, irq;
>   	s64 delta;
> +	u64 irq;
>   
>   	if (static_branch_likely(&psi_disabled))
>   		return;
> @@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>   	if (prev && task_psi_group(prev) == group)
>   		return;
>   
> -	now = cpu_clock(cpu);
>   	irq = irq_time_read(cpu);
>   	delta = (s64)(irq - rq->psi_irq_time);
>   	if (delta < 0)
> @@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>   	rq->psi_irq_time = irq;
>   
>   	do {
> +		u64 now;
> +
>   		if (!group->enabled)
>   			continue;
>   
>   		groupc = per_cpu_ptr(group->pcpu, cpu);
>   
>   		write_seqcount_begin(&groupc->seq);
> +		now = cpu_clock(cpu);
>   
>   		record_times(groupc, now);
>   		groupc->times[PSI_IRQ_FULL] += delta;
> @@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
>   	for_each_possible_cpu(cpu) {
>   		struct rq *rq = cpu_rq(cpu);
>   		struct rq_flags rf;
> -		u64 now;
>   
>   		rq_lock_irq(rq, &rf);
> -		now = cpu_clock(cpu);
> -		psi_group_change(group, cpu, 0, 0, now, true);
> +		psi_group_change(group, cpu, 0, 0, true);
>   		rq_unlock_irq(rq, &rf);
>   	}
>   }


Thread overview: 4+ messages
     [not found] <bug-219194-14699@https.bugzilla.kernel.org/>
2024-08-26 18:19 ` [Bug 219194] New: PSI full-stall duration reports nonsensically large value when reading pressure file too frequently Johannes Weiner
2024-08-27 12:18   ` [PATCH] sched: psi: fix bogus pressure spikes from aggregation race Johannes Weiner
2024-08-29  8:27     ` Chengming Zhou [this message]
2024-10-03 11:29 Johannes Weiner
