All of lore.kernel.org
 help / color / mirror / Atom feed
From: Beata Michalska <beata.michalska@arm.com>
To: Chris Mason <clm@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, linux-kernel@vger.kernel.org,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
Date: Wed, 16 Jul 2025 08:53:01 +0200	[thread overview]
Message-ID: <aHdMTdPeQ1F6f-x9@arm.com> (raw)
In-Reply-To: <0d86c527-27a7-44d5-9ddc-f9a153f67b4d@meta.com>

On Tue, Jul 15, 2025 at 03:11:14PM -0400, Chris Mason wrote:
> On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> > Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
> > pressure spikes from aggregation race") caused a regression for him on
> > a high context switch rate benchmark (schbench) due to the now
> > repeating cpu_clock() calls.
> > 
> > In particular the problem is that get_recent_times() will extrapolate
> > the current state to 'now'. But if an update uses a timestamp from
> > before the start of the update, it is possible to get two reads
> > with inconsistent results. It is effectively back-dating an update.
> > 
> > (note that this all hard-relies on the clock being synchronized across
> > CPUs -- if this is not the case, all bets are off).
> > 
> > Combine this problem with the fact that there are per-group-per-cpu
> > seqcounts, the commit in question pushed the clock read into the group
> > iteration, causing tree-depth cpu_clock() calls. On architectures
> > where cpu_clock() has appreciable overhead, this hurts.
> > 
> > Instead move to a per-cpu seqcount, which allows us to have a single
> > clock read for all group updates, increasing internal consistency and
> > lowering update overhead. This comes at the cost of a longer update
> > side (proportional to the tree depth) which can cause the read side to
> > retry more often.
> > 
> > Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
> > Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>,
> > Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net 
> > ---
> >  include/linux/psi_types.h |    6 --
> >  kernel/sched/psi.c        |  121 +++++++++++++++++++++++++---------------------
> >  2 files changed, 68 insertions(+), 59 deletions(-)
> > 
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -84,11 +84,9 @@ enum psi_aggregators {
> >  struct psi_group_cpu {
> >  	/* 1st cacheline updated by the scheduler */
> >  
> > -	/* Aggregator needs to know of concurrent changes */
> > -	seqcount_t seq ____cacheline_aligned_in_smp;
> > -
> >  	/* States of the tasks belonging to this group */
> > -	unsigned int tasks[NR_PSI_TASK_COUNTS];
> > +	unsigned int tasks[NR_PSI_TASK_COUNTS]
> > +			____cacheline_aligned_in_smp;
> >  
> >  	/* Aggregate pressure state derived from the tasks */
> >  	u32 state_mask;
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -176,6 +176,28 @@ struct psi_group psi_system = {
> >  	.pcpu = &system_group_pcpu,
> >  };
> >  
> > +static DEFINE_PER_CPU(seqcount_t, psi_seq);
> 
> [ ... ]
> 
> > @@ -186,7 +208,7 @@ static void group_init(struct psi_group
> >  
> >  	group->enabled = true;
> >  	for_each_possible_cpu(cpu)
> > -		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > +		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
> >  	group->avg_last_update = sched_clock();
> >  	group->avg_next_update = group->avg_last_update + psi_period;
> >  	mutex_init(&group->avgs_lock);
> 
> I'm not sure if someone mentioned this already, but testing the
> series I got a bunch of softlockups in get_recent_times()
> that randomly jumped from CPU to CPU.

... was beaten to it. I can observe the same.
> 
> This fixed it for me, but reading it now I'm wondering
> if we want to seqcount_init() unconditionally even when PSI
> is off.  
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 2024c1d36402d..979a447bc281f 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -207,8 +207,6 @@ static void group_init(struct psi_group *group)
>         int cpu;
> 
>         group->enabled = true;
> -       for_each_possible_cpu(cpu)
> -               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>         group->avg_last_update = sched_clock();
>         group->avg_next_update = group->avg_last_update + psi_period;
>         mutex_init(&group->avgs_lock);
> @@ -231,6 +229,7 @@ static void group_init(struct psi_group *group)
> 
>  void __init psi_init(void)
>  {
> +       int cpu;
>         if (!psi_enable) {
>                 static_branch_enable(&psi_disabled);
>                 static_branch_disable(&psi_cgroups_enabled);
> @@ -241,6 +240,8 @@ void __init psi_init(void)
>                 static_branch_disable(&psi_cgroups_enabled);
> 
>         psi_period = jiffies_to_nsecs(PSI_FREQ);
> +       for_each_possible_cpu(cpu)
> +               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>         group_init(&psi_system);
>  }
> 
> 
Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.

---
BR
Beata

  parent reply	other threads:[~2025-07-16  6:53 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage Peter Zijlstra
2025-07-15 19:11   ` Chris Mason
2025-07-16  6:06     ` K Prateek Nayak
2025-07-16  6:53     ` Beata Michalska [this message]
2025-07-16 10:40       ` Peter Zijlstra
2025-07-16 14:54         ` Johannes Weiner
2025-07-16 16:27         ` Chris Mason
2025-07-23  4:16         ` Aithal, Srikanth
2025-07-25  5:13         ` K Prateek Nayak
2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
2025-07-02 16:12   ` Juri Lelli
2025-07-10 12:46   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-07-14 22:56   ` [PATCH v2 02/12] " Mel Gorman
2025-07-15 14:55     ` Chris Mason
2025-07-16 18:19       ` Mel Gorman
2025-07-30  9:34   ` Geert Uytterhoeven
2025-07-30  9:46     ` Juri Lelli
2025-07-30 10:05       ` Geert Uytterhoeven
2025-08-05 22:03   ` Chris Bainbridge
2025-08-05 23:04     ` Chris Bainbridge
2025-09-15 22:29   ` John Stultz
2025-09-16  4:18     ` John Stultz
2025-09-16  5:28       ` [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation John Stultz
2025-09-16  8:51         ` Juri Lelli
2025-09-16 11:01           ` Peter Zijlstra
2025-09-16 12:52             ` Juri Lelli
2025-09-16 14:30               ` Peter Zijlstra
2025-09-16 17:35             ` John Stultz
2025-09-16 21:30               ` Peter Zijlstra
2025-09-17  3:29                 ` John Stultz
2025-09-17  9:34                   ` Peter Zijlstra
2025-09-17 12:26                     ` Peter Zijlstra
2025-09-17 13:56                       ` Juri Lelli
2025-09-17 17:30                         ` Peter Zijlstra
2025-09-18  8:37                           ` Juri Lelli
2025-09-18  9:04                             ` Peter Zijlstra
2025-09-18  9:42                               ` Juri Lelli
2025-09-17 19:29                       ` John Stultz
2025-09-18  6:56                       ` [tip: sched/urgent] sched/deadline: Fix dl_server behaviour tip-bot2 for Peter Zijlstra
2025-09-25  7:55                       ` tip-bot2 for Peter Zijlstra
2025-09-18  6:56             ` [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck tip-bot2 for Peter Zijlstra
2025-09-18 14:46               ` Dietmar Eggemann
2025-09-22 21:57               ` Marek Szyprowski
2025-09-22 23:46                 ` John Stultz
2025-09-23  6:31                   ` Marek Szyprowski
2025-09-23  7:25                 ` Peter Zijlstra
2025-09-23  7:52                   ` Marek Szyprowski
2025-09-23 22:02                 ` Peter Zijlstra
2025-09-29 15:19                   ` Marek Szyprowski
     [not found]                   ` <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>
2025-10-09  8:35                     ` Krzysztof Kozlowski
2025-10-09  9:26                     ` Peter Zijlstra
2025-10-09 11:42                       ` Marek Szyprowski
2025-09-25  7:55             ` tip-bot2 for Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
2025-07-10 16:47   ` Vincent Guittot
2025-07-14 22:59   ` Mel Gorman
2025-07-02 11:49 ` [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable() Peter Zijlstra
2025-07-10 16:48   ` Vincent Guittot
2025-07-14 23:00   ` Mel Gorman
2025-07-02 11:49 ` [PATCH v2 05/12] sched: Add ttwu_queue controls Peter Zijlstra
2025-07-10 16:51   ` Vincent Guittot
2025-07-14 23:14   ` Mel Gorman
2025-07-02 11:49 ` [PATCH v2 06/12] sched: Introduce ttwu_do_migrate() Peter Zijlstra
2025-07-10 16:51   ` Vincent Guittot
2025-07-02 11:49 ` [PATCH v2 07/12] psi: Split psi_ttwu_dequeue() Peter Zijlstra
2025-07-17 23:59   ` Chris Mason
2025-07-18 18:02     ` Steven Rostedt
2025-07-02 11:49 ` [PATCH v2 08/12] sched: Re-arrange __ttwu_queue_wakelist() Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 09/12] sched: Clean up ttwu comments Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending() Peter Zijlstra
2025-07-10 16:51   ` Vincent Guittot
2025-07-02 11:49 ` [PATCH v2 11/12] sched: Change ttwu_runnable() vs sched_delayed Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
2025-07-03 16:00   ` Phil Auld
2025-07-03 16:47     ` Peter Zijlstra
2025-07-03 17:11       ` Phil Auld
2025-07-14 13:57         ` Phil Auld
2025-07-04  6:13       ` K Prateek Nayak
2025-07-04  7:59         ` Peter Zijlstra
2025-07-08 12:44   ` Dietmar Eggemann
2025-07-08 18:57     ` Peter Zijlstra
2025-07-08 21:02     ` Peter Zijlstra
2025-07-23  5:42   ` Shrikanth Hegde
2025-07-02 15:27 ` [PATCH v2 00/12] sched: Address schbench regression Chris Mason
2025-07-07  9:05 ` Shrikanth Hegde
2025-07-07  9:11   ` Peter Zijlstra
2025-07-07  9:38     ` Shrikanth Hegde
2025-07-16 13:46       ` Phil Auld
2025-07-17 17:25         ` Phil Auld
2025-07-07 18:19   ` Shrikanth Hegde
2025-07-08 19:02     ` Peter Zijlstra
2025-07-09 16:46       ` Shrikanth Hegde
2025-07-14 17:54       ` Shrikanth Hegde
2025-07-21 19:37       ` Shrikanth Hegde
2025-07-22 20:20         ` Chris Mason
2025-07-24 18:23           ` Chris Mason
2025-07-08 15:09   ` Chris Mason
2025-07-08 17:29     ` Shrikanth Hegde
2025-07-17 13:04 ` Beata Michalska
2025-07-17 16:57   ` Beata Michalska

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aHdMTdPeQ1F6f-x9@arm.com \
    --to=beata.michalska@arm.com \
    --cc=bsegall@google.com \
    --cc=clm@meta.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.