From: Phil Auld <pauld@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Waiman Long <longman@redhat.com>, Ingo Molnar <mingo@kernel.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] sched: Fix nr_uninterruptible race causing increasing load average
Date: Thu, 8 Jul 2021 09:25:45 -0400
Message-ID: <YOb82exzMcrOxfHa@lorien.usersys.redhat.com>
In-Reply-To: <YOaoomJAS2FzXi7I@hirez.programming.kicks-ass.net>

Hi Peter,

On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
> On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > ("sched: Fix loadavg accounting race") causes increasing values of load
> > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > CPU not always seeing the write to task->sched_contributes_to_load in
> > __schedule(). Missing that we fail to decrement nr_uninterruptible when
> > waking up a task which incremented nr_uninterruptible when it slept.
> > 
> > The rq->lock serialization is insufficient across different rq->locks.
> > 
> > Add smp_wmb() to schedule and smp_rmb() before the read in
> > ttwu_do_activate().
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ca80df205ce..ced7074716eb 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> >  
> >  	lockdep_assert_held(&rq->lock);
> >  
> > +	/* Pairs with smp_wmb in __schedule() */
> > +	smp_rmb();
> >  	if (p->sched_contributes_to_load)
> >  		rq->nr_uninterruptible--;
> >  
> 
> Is this really needed ?! (this question is a big fat clue the comment is
> insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> and hence the p->sched_contributed_to_load must already happen after.
>

Yes, it is needed.  We've got idle power systems with a load average of 530.21.
calc_load_tasks is 530, and the sum of both nr_uninterruptible and
calc_load_active across all the runqueues is 530. Basically a monotonically
non-decreasing load average. With the patch this no longer happens.

We need the write to sched_contributes_to_load to "happen before" so that
it's seen on the other cpu on wakeup.
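
To make the accounting failure concrete, here is a toy userspace model of it.
This is only a sketch and not kernel code: toy_rq, toy_task, toy_sleep(),
toy_wake() and toy_leak() are made-up names, and the "stale read" is simulated
by passing false rather than produced by real memory reordering. It shows why
only the sum across runqueues matters (a task can sleep on one rq and be woken
on another) and how each missed decrement permanently adds one to that sum.

```c
/* Toy model only: not kernel code. Every name here is illustrative. */
#include <assert.h>
#include <stdbool.h>

#define NR_RQ 2

struct toy_rq {
	long nr_uninterruptible;
};

struct toy_task {
	bool contributes_to_load;	/* written on the CPU that puts the task to sleep */
};

/* Sleep path: record the contribution and bump the local counter. */
static void toy_sleep(struct toy_rq *rq, struct toy_task *p)
{
	p->contributes_to_load = true;
	rq->nr_uninterruptible++;
}

/*
 * Wakeup path: 'seen' is the flag value the waking CPU observes.
 * Passing false models a stale read of the flag.
 */
static void toy_wake(struct toy_rq *rq, bool seen)
{
	if (seen)
		rq->nr_uninterruptible--;
}

/* Individual counters may go negative; only the sum is meaningful. */
static long toy_sum(const struct toy_rq *rqs)
{
	long sum = 0;

	for (int i = 0; i < NR_RQ; i++)
		sum += rqs[i].nr_uninterruptible;
	return sum;
}

/*
 * Run 'cycles' sleep/wake pairs, sleeping on rq 0 and waking on rq 1.
 * Every 'stale_every'-th wakeup misses the flag (0 means never).
 * Returns the summed count left behind.
 */
long toy_leak(int cycles, int stale_every)
{
	struct toy_rq rqs[NR_RQ] = { {0}, {0} };
	struct toy_task p = { false };

	assert(cycles >= 0 && stale_every >= 0);

	for (int i = 1; i <= cycles; i++) {
		bool stale = stale_every && (i % stale_every == 0);

		toy_sleep(&rqs[0], &p);
		toy_wake(&rqs[1], stale ? false : p.contributes_to_load);
	}
	return toy_sum(rqs);
}
```

With no stale reads the sum returns to zero however many cycles run. Any stale
read leaves a residue that nothing later removes, which matches a count that
only ever grows on an otherwise idle machine.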

> > @@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
> >  				!(prev_state & TASK_NOLOAD) &&
> >  				!(prev->flags & PF_FROZEN);
> >  
> > +			/*
> > +			 * Make sure the previous write is ordered before p->on_rq etc so
> > +			 * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
> > +			 */
> > +			smp_wmb();
> >  			if (prev->sched_contributes_to_load)
> >  				rq->nr_uninterruptible++;
> 
> That comment is terrible, look at all the other barrier comments around
> there for clues; in effect you're worrying about:
> 
> 	p->sched_contributes_to_load = X	R1 = p->on_rq
> 	WMB					RMB
> 	p->on_rq = Y				R2 = p->sched_contributes_to_load
> 
> Right?

The only way I can see that decrement being missed is if the write to
sched_contributes_to_load is not being seen on the wakeup cpu.

Before the previous patch the _state condition was checked again on the
wakeup cpu, and that check is ordered.
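
For reference, the two-column pattern quoted above is the usual
message-passing shape. Below is a minimal userspace sketch of it using C11
atomics and pthreads. The assumptions are mine: data_x and flag_y stand in for
p->sched_contributes_to_load and p->on_rq, run_once() is a hypothetical
helper, and the C11 release/acquire fences are only stand-ins that are at
least as strong as smp_wmb()/smp_rmb(), not the kernel primitives themselves.
It illustrates the pairing only; it does not settle whether the existing
LOAD-ACQUIRE on p->on_rq already provides the read side.

```c
/* Userspace sketch of the message-passing pattern; not kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int data_x;	/* stand-in for p->sched_contributes_to_load */
static atomic_int flag_y;	/* stand-in for p->on_rq */

static void *writer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&data_x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* plays the role of WMB */
	atomic_store_explicit(&flag_y, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)
{
	int *r2 = arg;

	/* R1: wait until the flag store has been observed. */
	while (atomic_load_explicit(&flag_y, memory_order_relaxed) == 0)
		;
	atomic_thread_fence(memory_order_acquire);	/* plays the role of RMB */
	/* R2: with both fences in place this load must observe 1. */
	*r2 = atomic_load_explicit(&data_x, memory_order_relaxed);
	return NULL;
}

/* Run one writer/reader pair and return the value the reader saw for data_x. */
int run_once(void)
{
	pthread_t w, r;
	int r2 = -1;
	int rc;

	atomic_store(&data_x, 0);
	atomic_store(&flag_y, 0);

	rc = pthread_create(&r, NULL, reader, &r2);
	assert(rc == 0);
	rc = pthread_create(&w, NULL, writer, NULL);
	assert(rc == 0);

	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return r2;
}
```

Build with -pthread. With both fences present run_once() returns 1. Remove
either fence and the C11 model permits the reader to see the flag but still
read 0 for data_x, which is the outcome the patch is worried about.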

> 
> 
> Bah bah bah.. I so detest having to add barriers here for silly
> accounting. Let me think about this a little.
> 
> 

Thanks,
Phil
