From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-172.mta0.migadu.com (out-172.mta0.migadu.com [91.218.175.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A9D1518C327 for ; Thu, 29 Aug 2024 08:27:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724920043; cv=none; b=mUIrqEh8w2fSDj9blVZdXTmP/kP7t+C4xw+20ei8zRYXMXV083ebn0VleCG++KvqOiYpixGGqVnL7NA8hmLTxeggPmEWxpGNwx/TP/lVl20Ii1UXYRoE4VRcZHtwEENbILNvTbhF4Pz1/SeyzoGmB0mHqI+tY1N1rMUyVxvBm7M= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724920043; c=relaxed/simple; bh=q8RkWOZsduoG5UpfLU23VXt2DfOskuBrY0U7O8AaMVY=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=apicFnVH9RZejl0+qhHD0Z3Axf5PpulmM3zuuEKahd7ekNcDX8QisdNX7BNAGuPaEEjP8EpFSnOGDpIxuBD1D1XyIx0b5P0cpadGbqNl/MZ75wuGaWXFyfV5CSs8mogqjEqUlGdT8D+XIf8V/udFJZ0/yOlQEHeLStgOHkhHfrg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=HLY00qBJ; arc=none smtp.client-ip=91.218.175.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="HLY00qBJ" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1724920033; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=z4IX81RoSKRpORtKKtqo9Xgk/NT1kzOnZwSCBhIj1Ng=; b=HLY00qBJ6iPhp2xVu0Y9eupsVbgG6O7AjHJ495bKmIS7vShSXXSwXIJDq2laxzFRx/LVhQ toplZU6gnnHWzNKQh45m4ClYJDYNVcYSVv1w3/1eAX+k8eJ01p++HAJx0qEiawKIq8b2jC lbMALXq5vLx/30c10htibuyIPgzhD9Y= Date: Thu, 29 Aug 2024 16:27:06 +0800 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH] sched: psi: fix bogus pressure spikes from aggregation race To: Johannes Weiner , bugzilla-daemon@kernel.org Cc: Suren Baghdasaryan , Peter Zijlstra , Shakeel Butt , linux-kernel@vger.kernel.org References: <20240826181900.GA3214@cmpxchg.org> <20240827121851.GB438928@cmpxchg.org> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Chengming Zhou In-Reply-To: <20240827121851.GB438928@cmpxchg.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT On 2024/8/27 20:18, Johannes Weiner wrote: > Sending a proper patch. Peter, can you please take this? > > --- > > Brandon reports sporadic, non-sensical spikes in cumulative pressure > time (total=) when reading cpu.pressure at a high rate. This is due to > a race condition between reader aggregation and tasks changing states. > > While it affects all states and all resources captured by PSI, in > practice it most likely triggers with CPU pressure, since scheduling > events are so frequent compared to other resource events. > > The race context is the live snooping of ongoing stalls during a > pressure read. The read aggregates per-cpu records for stalls that > have concluded, but will also incorporate ad-hoc the duration of any > active state that hasn't been recorded yet. This is important to get > timely measurements of ongoing stalls. Those ad-hoc samples are > calculated on-the-fly up to the current time on that CPU; since the > stall hasn't concluded, it's expected that this is the minimum amount > of stall time that will enter the per-cpu records once it does. > > The problem is that the path that concludes the state uses a CPU clock > read that is not synchronized against aggregators; the clock is read > outside of the seqlock protection. This allows aggregators to race and > snoop a stall with a longer duration than will actually be recorded. > > With the recorded stall time being less than the last snapshot > remembered by the aggregator, a subsequent sample will underflow and > observe a bogus delta value, resulting in an erratic jump in pressure. > > Fix this by moving the clock read of the state change into the seqlock > protection. This ensures no aggregation can snoop live stalls past the > time that's recorded when the state concludes. > > Reported-by: Brandon Duffany > Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194 > Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi") > Cc: stable@vger.kernel.org > Signed-off-by: Johannes Weiner Good catch! Reviewed-by: Chengming Zhou Maybe another solution is to check the race happened on the read side, then return delta as 0, right? Thanks. > --- > kernel/sched/psi.c | 26 ++++++++++++-------------- > 1 file changed, 12 insertions(+), 14 deletions(-) > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index 020d58967d4e..84dad1511d1e 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now) > } > > static void psi_group_change(struct psi_group *group, int cpu, > - unsigned int clear, unsigned int set, u64 now, > + unsigned int clear, unsigned int set, > bool wake_clock) > { > struct psi_group_cpu *groupc; > unsigned int t, m; > u32 state_mask; > + u64 now; > > lockdep_assert_rq_held(cpu_rq(cpu)); > groupc = per_cpu_ptr(group->pcpu, cpu); > @@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu, > * SOME and FULL time these may have resulted in. > */ > write_seqcount_begin(&groupc->seq); > + now = cpu_clock(cpu); > > /* > * Start with TSK_ONCPU, which doesn't have a corresponding > @@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set) > { > int cpu = task_cpu(task); > struct psi_group *group; > - u64 now; > > if (!task->pid) > return; > > psi_flags_change(task, clear, set); > > - now = cpu_clock(cpu); > - > group = task_psi_group(task); > do { > - psi_group_change(group, cpu, clear, set, now, true); > + psi_group_change(group, cpu, clear, set, true); > } while ((group = group->parent)); > } > > @@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > { > struct psi_group *group, *common = NULL; > int cpu = task_cpu(prev); > - u64 now = cpu_clock(cpu); > > if (next->pid) { > psi_flags_change(next, 0, TSK_ONCPU); > @@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > break; > } > > - psi_group_change(group, cpu, 0, TSK_ONCPU, now, true); > + psi_group_change(group, cpu, 0, TSK_ONCPU, true); > } while ((group = group->parent)); > } > > @@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > do { > if (group == common) > break; > - psi_group_change(group, cpu, clear, set, now, wake_clock); > + psi_group_change(group, cpu, clear, set, wake_clock); > } while ((group = group->parent)); > > /* > @@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { > clear &= ~TSK_ONCPU; > for (; group; group = group->parent) > - psi_group_change(group, cpu, clear, set, now, wake_clock); > + psi_group_change(group, cpu, clear, set, wake_clock); > } > } > } > @@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st > int cpu = task_cpu(curr); > struct psi_group *group; > struct psi_group_cpu *groupc; > - u64 now, irq; > s64 delta; > + u64 irq; > > if (static_branch_likely(&psi_disabled)) > return; > @@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st > if (prev && task_psi_group(prev) == group) > return; > > - now = cpu_clock(cpu); > irq = irq_time_read(cpu); > delta = (s64)(irq - rq->psi_irq_time); > if (delta < 0) > @@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st > rq->psi_irq_time = irq; > > do { > + u64 now; > + > if (!group->enabled) > continue; > > groupc = per_cpu_ptr(group->pcpu, cpu); > > write_seqcount_begin(&groupc->seq); > + now = cpu_clock(cpu); > > record_times(groupc, now); > groupc->times[PSI_IRQ_FULL] += delta; > @@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group) > for_each_possible_cpu(cpu) { > struct rq *rq = cpu_rq(cpu); > struct rq_flags rf; > - u64 now; > > rq_lock_irq(rq, &rf); > - now = cpu_clock(cpu); > - psi_group_change(group, cpu, 0, 0, now, true); > + psi_group_change(group, cpu, 0, 0, true); > rq_unlock_irq(rq, &rf); > } > }