From: Wanpeng Li <kernellwp@gmail.com>
To: Paul Gortmaker <paul.gortmaker@windriver.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
Ingo Molnar <mingo@redhat.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH] sched/rt: don't try to balance rt_runtime when it is futile
Date: Thu, 27 Nov 2014 19:36:31 +0800 [thread overview]
Message-ID: <54770CBF.3040309@gmail.com> (raw)
In-Reply-To: <20140514191100.GA24155@windriver.com>
Hi Paul,
On 5/15/14, 3:11 AM, Paul Gortmaker wrote:
> [Added Frederic to Cc: since we are now talking nohz stuff]
>
> [Re: [PATCH] sched/rt: don't try to balance rt_runtime when it is futile] On 14/05/2014 (Wed 08:44) Paul E. McKenney wrote:
>
>> On Wed, May 14, 2014 at 11:08:35AM -0400, Paul Gortmaker wrote:
>>> As of the old commit ac086bc22997a2be24fc40fc8d46522fe7e03d11
>>> ("sched: rt-group: smp balancing") the concept of borrowing per
>>> cpu rt_runtime from one core to another was introduced.
>>>
>>> However, this prevents the RT throttling message from ever being
>>> emitted when someone does a common (but mistaken) attempt at
>>> using too much CPU in RT context. Consider the following test:
>>>
>>> echo "main() {for(;;);}" > full_load.c
>>> gcc full_load.c -o full_load
>>> taskset -c 1 ./full_load &
>>> chrt -r -p 80 `pidof full_load`
>>>
>>> When run on x86_64 defconfig, what happens is as follows:
>>>
>>> -task runs on core1 for 95% of an rt_period as documented in
>>> the file Documentation/scheduler/sched-rt-group.txt
>>>
>>> -at 95%, the code in balance_runtime sees this threshold and
>>> calls do_balance_runtime()
>>>
>>> -do_balance_runtime sees that core 1 is in need, and does this:
>>> ---------------
>>> if (rt_rq->rt_runtime + diff > rt_period)
>>> diff = rt_period - rt_rq->rt_runtime;
>>> iter->rt_runtime -= diff;
>>> rt_rq->rt_runtime += diff;
>>> ---------------
>>> which extends core1's rt_runtime by 5%, making it 100% of rt_period
>>> by stealing 5% from core0 (or possibly some other core).
>>>
>>> However, the next time core1's rt_rq enters sched_rt_runtime_exceeded(),
>>> we hit this near the top of that function:
>>> ---------------
>>> if (runtime >= sched_rt_period(rt_rq))
>>> return 0;
>>> ---------------
>>> and hence we'll _never_ look at/set any of the throttling checks and
>>> messages in sched_rt_runtime_exceeded(). Instead, we will happily
>>> plod along for CONFIG_RCU_CPU_STALL_TIMEOUT seconds, at which point
>>> the RCU subsystem will get angry and trigger an NMI in response to
>>> what it rightly sees as a WTF situation.
>> In theory, one way of making RCU OK with an RT usermode CPU hog is to
>> build with Frederic's CONFIG_NO_HZ_FULL=y. This will cause RCU to see
>> CPUs having a single runnable usermode task as idle, preventing the RCU
>> CPU stall warning. This does work well for mainline kernel in the lab.
> Agreed; wanting to test that locally for myself meant moving to a more
> modern machine, as the older PentiumD doesn't support NO_HZ_FULL. But
Could you point out which hw feature support NO_HZ_FULL? How to check it
through cpuid?
Regards,
Wanpeng Li
> on the newer box (dual socket six cores in each) I found the stall
> harder to trigger w/o going back to using the threadirqs boot arg as
> used in the earlier lkml post referenced below. (Why? Not sure...)
>
> Once I did that though (boot vanilla linux-next with threadirqs) I
> confirmed what you said; i.e. that we would reliably get a stall with
> the defconfig of NOHZ_IDLE=y but not with NOHZ_FULL=y (and hence also
> RCU_USER_QS=y).
>
>> In practice, not sure how much testing CONFIG_NO_HZ_FULL=y has received
>> for -rt kernels in production environments.
>>
>> But leaving practice aside for the moment...
>>
> [...]
>
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ea4d500..698aac9 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -774,6 +774,15 @@ static int balance_runtime(struct rt_rq *rt_rq)
>>> if (!sched_feat(RT_RUNTIME_SHARE))
>>> return more;
>>>
>>> + /*
>>> + * Stealing from another core won't help us at all if
>>> + * we have nothing to migrate over there, or only one
>>> + * task that is running up all the rt_time. In fact it
>>> + * will just inhibit the throttling message in that case.
>>> + */
>>> + if (!rt_rq->rt_nr_migratory || rt_rq->rt_nr_total == 1)
>> How about something like the following to take NO_HZ_FULL into account?
>>
>> + if ((!rt_rq->rt_nr_migratory || rt_rq->rt_nr_total == 1) &&
>> + !tick_nohz_full_cpu(cpu))
> Yes, I think special casing nohz_full can make sense, but maybe not
> exactly here in balance_runtime? Since the underlying reasoning doesn't
> change on nohz_full ; if only one task is present, or nothing can
> migrate, then the call to do_balance_runtime is largely useless - we'll
> walk possibly all cpus in search of an rt_rq to steal from, and what we
> steal, we can't use - so we've artificially crippled the other rt_rq for
> nothing other than to artifically inflate our rt_runtime and thus allow
> 100% usage.
>
> Given that, perhaps a separate change to sched_rt_runtime_exceeded()
> that works out the CPU from the rt_rq, and returns zero if it is a
> nohz_full cpu? Does that make sense? Then the nohz_full people won't
> get the throttling message even if they go 100%.
>
> Paul.
> --
>
>> Thanx, Paul
>>
>>> + return more;
>>> +
>>> if (rt_rq->rt_time > rt_rq->rt_runtime) {
>>> raw_spin_unlock(&rt_rq->rt_runtime_lock);
>>> more = do_balance_runtime(rt_rq);
>>> --
>>> 1.8.2.3
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
next prev parent reply other threads:[~2014-11-27 11:36 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-05-14 15:08 [PATCH] sched/rt: don't try to balance rt_runtime when it is futile Paul Gortmaker
2014-05-14 15:44 ` Paul E. McKenney
2014-05-14 19:11 ` Paul Gortmaker
2014-05-14 19:27 ` Paul E. McKenney
2014-05-15 2:49 ` Mike Galbraith
2014-05-15 14:09 ` Paul Gortmaker
2014-11-27 9:17 ` Wanpeng Li
2014-11-27 15:31 ` Mike Galbraith
2014-11-27 11:36 ` Wanpeng Li [this message]
2014-05-15 3:18 ` Mike Galbraith
2014-05-15 14:45 ` Paul E. McKenney
2014-05-15 17:27 ` Mike Galbraith
2014-05-18 4:22 ` Mike Galbraith
2014-05-18 5:20 ` Paul E. McKenney
2014-05-18 8:36 ` Mike Galbraith
2014-05-18 15:58 ` Paul E. McKenney
2014-05-19 2:44 ` Mike Galbraith
2014-05-19 5:34 ` Paul E. McKenney
2014-05-20 14:53 ` Frederic Weisbecker
2014-05-20 15:53 ` Paul E. McKenney
2014-05-20 16:24 ` Frederic Weisbecker
2014-05-20 16:36 ` Peter Zijlstra
2014-05-20 17:20 ` Paul E. McKenney
2014-05-21 4:29 ` Mike Galbraith
2014-05-21 4:18 ` Mike Galbraith
2014-05-21 12:03 ` Paul E. McKenney
2014-05-21 3:52 ` Mike Galbraith
2014-05-19 10:54 ` Peter Zijlstra
2014-05-19 12:40 ` Peter Zijlstra
2014-05-22 19:40 ` Paul Gortmaker
2014-11-27 11:21 ` Wanpeng Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54770CBF.3040309@gmail.com \
--to=kernellwp@gmail.com \
--cc=fweisbec@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rt-users@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=paul.gortmaker@windriver.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).