linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Paul Gortmaker <paul.gortmaker@windriver.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-rt-users@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH] sched/rt: don't try to balance rt_runtime when it is futile
Date: Wed, 14 May 2014 15:11:00 -0400	[thread overview]
Message-ID: <20140514191100.GA24155@windriver.com> (raw)
In-Reply-To: <20140514154459.GE4570@linux.vnet.ibm.com>

[Added Frederic to Cc: since we are now talking nohz stuff]

[Re: [PATCH] sched/rt: don't try to balance rt_runtime when it is futile] On 14/05/2014 (Wed 08:44) Paul E. McKenney wrote:

> On Wed, May 14, 2014 at 11:08:35AM -0400, Paul Gortmaker wrote:
> > As of the old commit ac086bc22997a2be24fc40fc8d46522fe7e03d11
> > ("sched: rt-group: smp balancing") the concept of borrowing per
> > cpu rt_runtime from one core to another was introduced.
> > 
> > However, this prevents the RT throttling message from ever being
> > emitted when someone does a common (but mistaken) attempt at
> > using too much CPU in RT context.  Consider the following test:
> > 
> >   echo "main() {for(;;);}" > full_load.c
> >   gcc full_load.c -o full_load
> >   taskset -c 1 ./full_load &
> >   chrt -r -p 80 `pidof full_load`
> > 
> > When run on x86_64 defconfig, what happens is as follows:
> > 
> > -task runs on core1 for 95% of an rt_period as documented in
> >  the file Documentation/scheduler/sched-rt-group.txt
> > 
> > -at 95%, the code in balance_runtime sees this threshold and
> >  calls do_balance_runtime()
> > 
> > -do_balance_runtime sees that core 1 is in need, and does this:
> > 	---------------
> >         if (rt_rq->rt_runtime + diff > rt_period)
> >                 diff = rt_period - rt_rq->rt_runtime;
> >         iter->rt_runtime -= diff;
> >         rt_rq->rt_runtime += diff;
> > 	---------------
> >  which extends core1's rt_runtime by 5%, making it 100% of rt_period
> >  by stealing 5% from core0 (or possibly some other core).
> > 
> > However, the next time core1's rt_rq enters sched_rt_runtime_exceeded(),
> > we hit this near the top of that function:
> > 	---------------
> >         if (runtime >= sched_rt_period(rt_rq))
> >                 return 0;
> > 	---------------
> > and hence we'll _never_ look at/set any of the throttling checks and
> > messages in sched_rt_runtime_exceeded().  Instead, we will happily
> > plod along for CONFIG_RCU_CPU_STALL_TIMEOUT seconds, at which point
> > the RCU subsystem will get angry and trigger an NMI in response to
> > what it rightly sees as a WTF situation.
> 
> In theory, one way of making RCU OK with an RT usermode CPU hog is to
> build with Frederic's CONFIG_NO_HZ_FULL=y.  This will cause RCU to see
> CPUs having a single runnable usermode task as idle, preventing the RCU
> CPU stall warning.  This does work well for mainline kernel in the lab.

Agreed; wanting to test that locally for myself meant moving to a more
modern machine, as the older PentiumD doesn't support NO_HZ_FULL.  But
on the newer box (dual socket six cores in each) I found the stall
harder to trigger w/o going back to using the threadirqs boot arg as
used in the earlier lkml post referenced below. (Why?  Not sure...)

Once I did that though (boot vanilla linux-next with threadirqs) I
confirmed what you said; i.e. that we would reliably get a stall with
the defconfig of NOHZ_IDLE=y but not with NOHZ_FULL=y (and hence also
RCU_USER_QS=y).

> 
> In practice, not sure how much testing CONFIG_NO_HZ_FULL=y has received
> for -rt kernels in production environments.
> 
> But leaving practice aside for the moment...
> 

[...]

> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ea4d500..698aac9 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -774,6 +774,15 @@ static int balance_runtime(struct rt_rq *rt_rq)
> >  	if (!sched_feat(RT_RUNTIME_SHARE))
> >  		return more;
> > 
> > +	/*
> > +	 * Stealing from another core won't help us at all if
> > +	 * we have nothing to migrate over there, or only one
> > +	 * task that is running up all the rt_time.  In fact it
> > +	 * will just inhibit the throttling message in that case.
> > +	 */
> > +	if (!rt_rq->rt_nr_migratory || rt_rq->rt_nr_total == 1)
> 
> How about something like the following to take NO_HZ_FULL into account?
> 
> +	if ((!rt_rq->rt_nr_migratory || rt_rq->rt_nr_total == 1) &&
> +	    !tick_nohz_full_cpu(cpu))

Yes, I think special casing nohz_full can make sense, but maybe not
exactly here in balance_runtime?  Since the underlying reasoning doesn't
change on nohz_full ; if only one task is present, or nothing can
migrate, then the call to do_balance_runtime is largely useless - we'll
walk possibly all cpus in search of an rt_rq to steal from, and what we
steal, we can't use - so we've artificially crippled the other rt_rq for
nothing other than to artifically inflate our rt_runtime and thus allow
100% usage.

Given that, perhaps a separate change to sched_rt_runtime_exceeded()
that works out the CPU from the rt_rq, and returns zero if it is a
nohz_full cpu?  Does that make sense?  Then the nohz_full people won't
get the throttling message even if they go 100%.

Paul.
--

> 
> 							Thanx, Paul
> 
> > +		return more;
> > +
> >  	if (rt_rq->rt_time > rt_rq->rt_runtime) {
> >  		raw_spin_unlock(&rt_rq->rt_runtime_lock);
> >  		more = do_balance_runtime(rt_rq);
> > -- 
> > 1.8.2.3
> > 
> 

  reply	other threads:[~2014-05-14 19:11 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-14 15:08 [PATCH] sched/rt: don't try to balance rt_runtime when it is futile Paul Gortmaker
2014-05-14 15:44 ` Paul E. McKenney
2014-05-14 19:11   ` Paul Gortmaker [this message]
2014-05-14 19:27     ` Paul E. McKenney
2014-05-15  2:49     ` Mike Galbraith
2014-05-15 14:09       ` Paul Gortmaker
2014-11-27  9:17       ` Wanpeng Li
2014-11-27 15:31         ` Mike Galbraith
2014-11-27 11:36     ` Wanpeng Li
2014-05-15  3:18   ` Mike Galbraith
2014-05-15 14:45     ` Paul E. McKenney
2014-05-15 17:27       ` Mike Galbraith
2014-05-18  4:22     ` Mike Galbraith
2014-05-18  5:20       ` Paul E. McKenney
2014-05-18  8:36         ` Mike Galbraith
2014-05-18 15:58           ` Paul E. McKenney
2014-05-19  2:44             ` Mike Galbraith
2014-05-19  5:34               ` Paul E. McKenney
2014-05-20 14:53                 ` Frederic Weisbecker
2014-05-20 15:53                   ` Paul E. McKenney
2014-05-20 16:24                     ` Frederic Weisbecker
2014-05-20 16:36                       ` Peter Zijlstra
2014-05-20 17:20                       ` Paul E. McKenney
2014-05-21  4:29                         ` Mike Galbraith
2014-05-21  4:18                     ` Mike Galbraith
2014-05-21 12:03                       ` Paul E. McKenney
2014-05-21  3:52                   ` Mike Galbraith
2014-05-19 10:54     ` Peter Zijlstra
2014-05-19 12:40 ` Peter Zijlstra
2014-05-22 19:40   ` Paul Gortmaker
2014-11-27 11:21 ` Wanpeng Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140514191100.GA24155@windriver.com \
    --to=paul.gortmaker@windriver.com \
    --cc=fweisbec@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).