From: Peter Zijlstra <peterz@infradead.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
"Ingo Molnar" <mingo@kernel.org>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Clark Williams" <williams@redhat.com>,
linux-rt-users <linux-rt-users@vger.kernel.org>,
"Mike Galbraith" <umgwanakikbuti@gmail.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
"Jörn Engel" <joern@purestorage.com>
Subject: Re: [PATCH v5] sched/rt: Use IPI to trigger RT task push migration instead of pulling
Date: Fri, 20 Mar 2015 11:31:20 +0100
Message-ID: <20150320103120.GZ23123@twins.programming.kicks-ass.net>
In-Reply-To: <20150318144946.2f3cc982@gandalf.local.home>
On Wed, Mar 18, 2015 at 02:49:46PM -0400, Steven Rostedt wrote:
>
> When debugging the latencies on a 40 core box, where we hit 300 to
> 500 microsecond latencies, I found there was a huge contention on the
> runqueue locks.
>
> Investigating it further, running ftrace, I found that it was due to
> the pulling of RT tasks.
>
> The test that was run was the following:
>
> cyclictest --numa -p95 -m -d0 -i100
>
> This created a thread on each CPU that would set its wakeup at intervals
> of 100 microseconds. The -d0 means that all the threads had the same
> interval (100us). Each thread sleeps for 100us and wakes up and measures
> its latencies.
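For readers unfamiliar with cyclictest, the per-thread measurement it performs can be approximated as below. This is a minimal sketch, not cyclictest's source: it assumes absolute-deadline sleeps with POSIX clock_nanosleep() on CLOCK_MONOTONIC, and it leaves out what -p95 (SCHED_FIFO priority 95), -m (mlockall) and --numa add. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance *t by ns nanoseconds, keeping tv_nsec in [0, 1s). */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec++;
    }
}

/* a - b in nanoseconds. */
static int64_t timespec_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(a->tv_sec - b->tv_sec) * NSEC_PER_SEC
           + (a->tv_nsec - b->tv_nsec);
}

/*
 * Hypothetical helper: sleep until deadlines spaced interval_ns apart and
 * return the worst wakeup latency seen (actual wakeup minus deadline).
 */
static int64_t measure_max_latency_ns(long interval_ns, int loops)
{
    struct timespec next, now;
    int64_t max_lat = 0;

    assert(interval_ns > 0 && interval_ns < NSEC_PER_SEC);
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < loops; i++) {
        timespec_add_ns(&next, interval_ns);
        /* Absolute deadline: lateness in one loop does not shift the next. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) == EINTR)
            ; /* interrupted by a signal: sleep again toward the same deadline */
        clock_gettime(CLOCK_MONOTONIC, &now);
        int64_t lat = timespec_diff_ns(&now, &next);
        if (lat > max_lat)
            max_lat = lat;
    }
    return max_lat;
}
```

Calling measure_max_latency_ns(100000, 10000) corresponds roughly to one cyclictest thread with -i100 running for one second.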
>
> cyclictest is maintained at:
> git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
>
> What happened was that another RT task would be scheduled on one of the
> CPUs running our test while the test threads on the other CPUs went to
> sleep and scheduled idle. This caused the "pull" operation to execute on
> all these CPUs. Each one of them saw the RT task that was overloaded on
> the CPU whose test was still running, and each one tried to grab that
> task in a thundering herd way.
>
> To grab the task, each thread would do a double rq lock grab, grabbing
> its own lock as well as the rq lock of the overloaded CPU. As the sched
> domains on this box were rather flat for its size, I saw up to 12 CPUs
> block on this lock at once. This caused a ripple effect with the
> rq locks, especially since the taking was done via a double rq lock, which
> means that several of the CPUs had their own rq locks held while trying
> to take this rq lock. As these locks were blocked, any wakeups or load
> balancing on these CPUs would also block on these locks, and the wait
> time escalated.
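The lock pattern behind the ripple effect can be sketched in user space. This is a hypothetical model, not kernel code: each runqueue is reduced to a pthread mutex plus a task count, and both locks are taken lowest address first, the same ordering idea the kernel's double_rq_lock() uses so that two CPUs pulling from each other cannot deadlock. Whichever lock sorts first is held while the second is awaited, which is how a CPU stuck on the overloaded rq can also be sitting on its own rq lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-CPU runqueue. */
struct toy_rq {
    pthread_mutex_t lock;
    int nr_running;
};

/* Take both locks in a fixed global order (lowest address first). */
static void toy_double_rq_lock(struct toy_rq *a, struct toy_rq *b)
{
    if (a == b) {
        pthread_mutex_lock(&a->lock);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        struct toy_rq *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);   /* held while we wait for the next one */
    pthread_mutex_lock(&b->lock);
}

static void toy_double_rq_unlock(struct toy_rq *a, struct toy_rq *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/* Pull one task from an overloaded src (more than one task) into dst. */
static int toy_pull_one(struct toy_rq *dst, struct toy_rq *src)
{
    int moved = 0;

    toy_double_rq_lock(dst, src);
    if (src->nr_running > 1) {
        src->nr_running--;
        dst->nr_running++;
        moved = 1;
    }
    toy_double_rq_unlock(dst, src);
    return moved;
}
```

With a dozen CPUs calling toy_pull_one() against the same src at once, every one of them serializes on src->lock, and those whose own lock sorted first keep it held the whole time.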
>
> I've tried various methods to lessen the load, but things like an
> atomic counter to only let one CPU grab the task won't work, because
> the task may have a limited affinity, and we may pick the wrong
> CPU to take that lock and do the pull, to only find out that the
> CPU we picked isn't in the task's affinity.
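The failure mode of a single-puller counter can be shown with a small simulation. Everything here is hypothetical: the counter is a C11 atomic_int claimed with compare-and-exchange, a CPU mask is a 64-bit bitmap, and CPUs arrive in a fixed order. The point is that the winner of the claim is not necessarily a CPU the task may run on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical CPU mask: one bit per CPU, enough for a 40-CPU box. */
typedef uint64_t toy_cpumask;

/* 0 = nobody is pulling yet, 1 = a CPU has claimed the pull. */
static atomic_int pull_claimed = 0;

/* Returns 1 if the caller won the right to take the remote rq lock. */
static int try_claim_pull(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&pull_claimed, &expected, 1);
}

static int task_allowed_on(toy_cpumask affinity, int cpu)
{
    assert(cpu >= 0 && cpu < 64);
    return (int)((affinity >> cpu) & 1u);
}

/*
 * CPUs try to claim the pull in arrival order; only the first succeeds.
 * Returns the CPU that gets the task, or -1 if the winner turns out to be
 * outside the task's affinity (the pull fails and nobody else tried).
 */
static int simulate_single_puller(const int *arrival, int n, toy_cpumask affinity)
{
    int winner = -1;

    atomic_store(&pull_claimed, 0);
    for (int i = 0; i < n; i++) {
        if (try_claim_pull())   /* succeeds exactly once */
            winner = arrival[i];
    }
    assert(winner >= 0);
    return task_allowed_on(affinity, winner) ? winner : -1;
}
```

With arrivals {2, 5, 6} and a task pinned to CPUs 5 and 6, CPU 2 wins the claim and the pull fails, even though CPUs 5 and 6 could have taken the task.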
>
> Instead of doing the PULL, I now have the CPUs that want the pull to
> send over an IPI to the overloaded CPU, and let that CPU pick what
> CPU to push the task to. No more need to grab the rq lock, and the
> push/pull algorithm still works fine.
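With push, the decision moves to the overloaded CPU, which already owns the task and can check its affinity without grabbing another CPU's rq lock just to look. A heavily simplified sketch of choosing the push target follows, in the spirit of the kernel's find_lowest_rq() but not its implementation: the function name is hypothetical, priorities use the user-visible convention (higher number wins, 0 for idle or non-RT), and the CPU mask is a 64-bit bitmap.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_CPUS 64

/*
 * cpu_prio[i] is the priority currently running on CPU i. Return the CPU
 * that the overloaded CPU 'self' should push a task of priority task_prio
 * to: an allowed CPU running the lowest priority that the task would still
 * preempt. Returns -1 if no CPU qualifies.
 */
static int pick_push_target(const int *cpu_prio, int ncpus, int self,
                            uint64_t affinity, int task_prio)
{
    int best = -1;

    assert(ncpus > 0 && ncpus <= TOY_MAX_CPUS);
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == self || !((affinity >> cpu) & 1u))
            continue;       /* skip ourselves and CPUs outside the affinity */
        if (cpu_prio[cpu] >= task_prio)
            continue;       /* task would not preempt what runs there */
        if (best < 0 || cpu_prio[cpu] < cpu_prio[best])
            best = cpu;     /* prefer the least important current task */
    }
    return best;
}
```

For CPU priorities {95, 0, 50, 95} and a priority-90 task on CPU 0, the idle CPU 1 is chosen if allowed; if the task is restricted to CPUs 2 and 3, CPU 2 is chosen; if it may only run on CPU 3, there is no target.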
>
> With this patch, the latency dropped to just 150us over a 20 hour run.
> Without the patch, the huge latencies would trigger in seconds.
>
> I've created a new sched feature called RT_PUSH_IPI, which is enabled
> by default.
>
> When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
> and having the pulling CPU do the work is used. When RT_PUSH_IPI
> is enabled, the IPI is sent to the overloaded CPU to do a push.
>
> To enable or disable this at run time:
>
> # mount -t debugfs nodev /sys/kernel/debug
> # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
> or
> # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
>
> Update: The original patch would send an IPI to all CPUs in the RT overload
> list. But that could theoretically cause the reverse issue. That is, there
> could be lots of overloaded RT queues and one CPU lowers its priority. It would
> then send an IPI to all the overloaded RT queues and they could then all try
> to grab the rq lock of the CPU lowering its priority, and then we have the
> same problem.
>
> The latest design sends out only one IPI to the first overloaded CPU. It tries to
> push any tasks that it can, and then looks for the next overloaded CPU that can
> push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
> tasks with priorities higher than the source CPU's are covered. In case the
> source CPU lowers its priority again, a flag is set to tell the IPI traversal to
> restart with the first RT overloaded CPU after the source CPU.
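The chained IPI traversal described above can be modeled as a single-threaded simulation. All names are hypothetical and the model is deliberately simplified: each overloaded CPU has at most one pushable task, it always pushes that task to the source CPU, and an "IPI" is just a loop iteration. Priorities use higher number = more important. The walk starts after the source CPU, visits only overloaded CPUs whose pushable task outranks what the source now runs, and stops when it wraps back to the source; a restart flag sends it back to the start if the source lowers its priority again mid-walk.

```c
#include <assert.h>

#define TOY_NCPUS 8

struct toy_cpu {
    int overloaded;    /* CPU is in the RT-overload set */
    int pushable_prio; /* priority of its pushable task, 0 if none */
    int curr_prio;     /* priority of what it is running now */
};

/* Next overloaded CPU after 'from' (wrapping, stopping at src) that can help src. */
static int next_push_cpu(const struct toy_cpu *cpus, int src, int from)
{
    for (int i = 1; i < TOY_NCPUS; i++) {
        int cpu = (from + i) % TOY_NCPUS;
        if (cpu == src)
            return -1;  /* wrapped around: every candidate is covered */
        if (cpus[cpu].overloaded && cpus[cpu].pushable_prio > cpus[src].curr_prio)
            return cpu;
    }
    return -1;
}

/*
 * Run the chain for src. If drop_again_at > 0, src lowers its priority to
 * drop_to right after that many IPIs, which sets the restart flag.
 * Returns the number of IPIs sent.
 */
static int run_push_chain(struct toy_cpu *cpus, int src, int drop_again_at, int drop_to)
{
    int ipis = 0, restart = 0;
    int cpu = next_push_cpu(cpus, src, src);    /* only one IPI at a time */

    while (cpu >= 0) {
        ipis++;
        /* "IPI handler" on cpu: push its task to src if it still outranks src. */
        if (cpus[cpu].pushable_prio > cpus[src].curr_prio) {
            cpus[src].curr_prio = cpus[cpu].pushable_prio;
            cpus[cpu].pushable_prio = 0;
            cpus[cpu].overloaded = 0;
        }
        if (ipis == drop_again_at) {
            cpus[src].curr_prio = drop_to;  /* e.g. the task on src blocked */
            restart = 1;
        }
        if (restart) {
            restart = 0;
            cpu = next_push_cpu(cpus, src, src);  /* start over after src */
        } else {
            cpu = next_push_cpu(cpus, src, cpu);  /* continue the walk */
        }
    }
    return ipis;
}
```

For example, with CPUs 2, 5 and 6 overloaded with pushable tasks of priority 50, 70 and 40 and CPU 0 as an idle source, the chain sends two IPIs (to CPUs 2 and 5) and skips CPU 6, whose task does not outrank the priority-70 task CPU 0 received. Compared with one IPI per overloaded CPU, only CPUs that can actually help the source are interrupted.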
>
> Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
OK, queued it. Do we want to look into making the same change for
deadline once this has settled?