From: David Ahern <david.ahern@oracle.com>
To: Peter Zijlstra <peterz@infradead.org>,
Mike Galbraith <efault@gmx.de>, Ingo Molnar <mingo@kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: NMI watchdog triggering during load_balance
Date: Thu, 05 Mar 2015 21:05:28 -0700
Message-ID: <54F92788.6010007@oracle.com>
Hi Peter/Mike/Ingo:
I've been banging my head against this wall for a week now and am hoping
you or someone could shed some light on the problem.
On larger systems (256 to 1024 cpus) there are several use cases (e.g.,
http://www.cs.virginia.edu/stream/) that regularly trigger the NMI
watchdog with the stack trace:
Call Trace:
@ [000000000045d3d0] double_rq_lock+0x4c/0x68
@ [00000000004699c4] load_balance+0x278/0x740
@ [00000000008a7b88] __schedule+0x378/0x8e4
@ [00000000008a852c] schedule+0x68/0x78
@ [000000000042c82c] cpu_idle+0x14c/0x18c
@ [00000000008a3a50] after_lock_tlb+0x1b4/0x1cc
Capturing data for all CPUs, I tend to see load_balance-related stack
traces on 700-800 cpus, with a few hundred blocked on _raw_spin_trylock_bh.
I originally thought it was a deadlock in the rq locking, but if I bump
the watchdog timeout the system eventually recovers (after 10-30+
seconds of unresponsiveness) so it does not seem likely to be a deadlock.
This particular system has 1024 cpus:
# lscpu
Architecture: sparc64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Big Endian
CPU(s): 1024
On-line CPU(s) list: 0-1023
Thread(s) per core: 8
Core(s) per socket: 4
Socket(s): 32
NUMA node(s): 4
NUMA node0 CPU(s): 0-255
NUMA node1 CPU(s): 256-511
NUMA node2 CPU(s): 512-767
NUMA node3 CPU(s): 768-1023
and there are 4 scheduling domains. An example of the domain debug
output (condensed for the email):
CPU970 attaching sched-domain:
domain 0: span 968-975 level SIBLING
groups: 8 single CPU groups
domain 1: span 968-975 level MC
groups: 1 group with 8 cpus
domain 2: span 768-1023 level CPU
groups: 32 groups with 8 cpus per group
domain 3: span 0-1023 level NODE
groups: 4 groups with 256 cpus per group
On an idle system (20 or so non-kernel threads such as mingetty, udev,
...) perf top shows the task scheduler is consuming significant time:
PerfTop: 136580 irqs/sec kernel:99.9% exact: 0.0% [1000Hz cycles], (all, 1024 CPUs)
-----------------------------------------------------------------------
20.22% [kernel] [k] find_busiest_group
16.00% [kernel] [k] find_next_bit
6.37% [kernel] [k] ktime_get_update_offsets
5.70% [kernel] [k] ktime_get
...
This is a 2.6.39 kernel (yes, a relatively old one); 3.8 shows similar
symptoms. 3.18 is much better.
From what I can tell load balancing is happening non-stop and there is
heavy contention in the run queue locks. I instrumented the rq locking
and under load (e.g., the stream test) regularly see single rq locks held
continuously for 2-3 seconds (e.g., at the end of the stream run which
has 1024 threads and the process is terminating).
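To illustrate what I mean by instrumenting the rq locking, something
along these lines (a simplified sketch, not the actual patch; the
lock_acquired_ns field added to struct rq and the 1 second threshold are
made up for the example):

/* Sketch only -- would live in kernel/sched.c next to the rq lock helpers. */
static inline void rq_lock_timed(struct rq *rq)
{
	raw_spin_lock(&rq->lock);
	rq->lock_acquired_ns = sched_clock();	/* hypothetical new field */
}

static inline void rq_unlock_timed(struct rq *rq)
{
	u64 held = sched_clock() - rq->lock_acquired_ns;

	if (held > NSEC_PER_SEC)		/* flag holds longer than 1s */
		trace_printk("cpu %d: rq lock held %llu ns\n",
			     cpu_of(rq), held);
	raw_spin_unlock(&rq->lock);
}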
I have been staring at and instrumenting the scheduling code for days.
It seems like the balancing of domains is regularly lining up on all or
almost all CPUs, and the NODE domain causes the most damage since it
scans all cpus (i.e., in rebalance_domains() each domain pass triggers a
call to load_balance on all cpus at the same time). Just in random
snapshots during a stream test I have seen one pass through
rebalance_domains take > 17 seconds (custom tracepoints to tag start and
end).
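For reference, the per-cpu path I am describing is roughly the following
(heavily condensed from kernel/sched_fair.c of this era; a sketch, not
the exact code):

/* Condensed sketch of the softirq rebalance path, not the exact code. */
static void rebalance_domains(int cpu, enum cpu_idle_type idle)
{
	struct rq *rq = cpu_rq(cpu);
	struct sched_domain *sd;
	int balance = 1;

	for_each_domain(cpu, sd) {	/* SIBLING -> MC -> CPU -> NODE */
		if (!(sd->flags & SD_LOAD_BALANCE))
			continue;

		if (time_after_eq(jiffies,
				  sd->last_balance + sd->balance_interval)) {
			/*
			 * find_busiest_group() walks every group in this
			 * domain; at the NODE level that means all 1024
			 * cpus -- and every cpu runs this same loop.
			 */
			if (load_balance(cpu, rq, sd, idle, &balance))
				idle = CPU_NOT_IDLE;
			sd->last_balance = jiffies;
		}
	}
}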
Since each domain is a superset of the lower one, each pass through
load_balance regularly repeats the processing of the previous domain
(e.g., the NODE domain repeats the cpus in the CPU domain). Multiply
that across 1024 cpus and it seems like a lot of duplication.
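As a rough back-of-the-envelope using the domain layout above: one full
pass on a single cpu looks at roughly 8 (SIBLING) + 8 (MC) + 256 (CPU) +
1024 (NODE) = 1296 cpu entries, and with all 1024 cpus doing the same
thing that is on the order of 1.3 million cpu examinations per round,
most of them over the same per-cpu data.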
Does that make sense or am I off in the weeds?
Thanks,
David