From: David Ahern <david.ahern@oracle.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: NMI watchdog triggering during load_balance
Date: Fri, 06 Mar 2015 11:37:11 -0700
Message-ID: <54F9F3D7.1030905@oracle.com>
In-Reply-To: <1425665511.7562.36.camel@gmx.de>

On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic.  I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong, as
you were pointing out. There should be 16 NUMA domains (4 memory
controllers per socket and 4 sockets). There should be 8 sibling cores.
I will look into why that is not getting set up properly and what we can
do about fixing it.
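
The numbers above can be cross-checked with a few lines of C. This is only a sketch of the arithmetic: the struct and function names are hypothetical, and it assumes one NUMA node per memory controller with CPUs divided evenly among the nodes.

```c
/* Hypothetical description of the box; the field values come from the thread. */
struct topo {
    unsigned sockets;
    unsigned cores_per_socket;
    unsigned threads_per_core;
    unsigned mem_ctrls_per_socket;
};

/* Logical CPUs: sockets x cores x hardware threads. */
static unsigned total_cpus(const struct topo *t)
{
    return t->sockets * t->cores_per_socket * t->threads_per_core;
}

/* Assumption: each memory controller forms its own NUMA node. */
static unsigned numa_nodes(const struct topo *t)
{
    return t->sockets * t->mem_ctrls_per_socket;
}

/* Assumption: CPUs are spread evenly across the NUMA nodes. */
static unsigned cpus_per_node(const struct topo *t)
{
    return total_cpus(t) / numa_nodes(t);
}
```

With { 4, 32, 8, 4 } this gives 1024 logical CPUs, 16 NUMA nodes and 64 CPUs per node, i.e. 8 cores of 8 threads each behind every memory controller under the even-split assumption.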

--

But, I do not understand how the wrong topology is causing the NMI 
watchdog to trigger. In the end there are still N domains, M groups per 
domain and P cpus per group. Doesn't the balancing walk over all of them 
irrespective of physical topology?
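
To make the question concrete, here is a deliberately crude model of one full walk up the domain hierarchy. It is not the kernel's algorithm: the function name and the level layout are hypothetical, and it assumes that balancing at a given level inspects every CPU in that level's span once.

```c
/*
 * Hypothetical model: groups_per_level[i] is the number of groups at
 * level i (level 0 is the lowest). Each group at level i covers the
 * whole span of level i-1; at level 0 each group covers
 * cpus_in_lowest_group CPUs.
 *
 * Returns the number of per-CPU inspections for one walk from the
 * lowest level to the top, assuming every CPU in a level's span is
 * looked at once per level.
 */
static unsigned long cpus_scanned(const unsigned *groups_per_level,
                                  unsigned levels,
                                  unsigned cpus_in_lowest_group)
{
    unsigned long span = cpus_in_lowest_group;
    unsigned long total = 0;

    for (unsigned i = 0; i < levels; i++) {
        span *= groups_per_level[i];  /* CPUs covered at this level */
        total += span;                /* one inspection per covered CPU */
    }
    return total;
}
```

Under this model a single flat level of 1024 one-CPU groups costs 1024 inspections, while an 8 x 8 x 16 hierarchy (threads, cores per node, nodes) costs 8 + 64 + 1024 = 1096. The top-level span dominates either way; the model says nothing about lock hold times or about how often each level is balanced.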

Here's another data point that jelled this morning while explaining the
problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC: <cpumask_next_and+0x44/0x6c>
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7: <double_rq_lock+0x4c/0x68>
Call Trace:
  [000000000045dc30] double_rq_lock+0x4c/0x68
  [000000000046a23c] load_balance+0x278/0x740
  [00000000008aa178] __schedule+0x378/0x8e4
  [00000000008aab1c] schedule+0x68/0x78
  [00000000004718ac] do_exit+0x798/0x7c0
  [000000000047195c] do_group_exit+0x88/0xc0
  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
  [000000000042cbc0] do_signal+0x70/0x5e4
  [000000000042d14c] do_notify_resume+0x18/0x50
  [00000000004049c4] __handle_signal+0xc/0x2c


For example, the stream program has 1024 threads (1 for each CPU). If you
ctrl-c the program or wait for it to terminate, that's when it trips. Other
workloads that routinely trip it are make -j N for some N (e.g.,
'make -j 128' on a 256 cpu system): 10 seconds later, oops, stop that
build, ctrl-c ... boom with the above stack trace.

Code-wise ... and this is still present in 3.18 and 3.20:

schedule()
- __schedule()
   + irqs disabled: raw_spin_lock_irq(&rq->lock);

      pick_next_task
      - idle_balance()

   + irqs enabled:
     different task: context_switch(rq, prev, next)
                     --> finish_lock_switch eventually
     same task: raw_spin_unlock_irq(&rq->lock)


For 2.6.39 it's the invocation of idle_balance which is triggering load 
balancing with IRQs disabled. That's when the NMI watchdog trips.
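
Why a long IRQs-off window trips the watchdog can be shown with a simplified model. The sketch below is not the kernel's watchdog code: all names are hypothetical, and it assumes the NMI handler declares a lockup when a per-CPU interrupt counter has not advanced for a given number of consecutive NMI checks.

```c
#include <stdbool.h>

/* Hypothetical per-CPU watchdog state. */
struct wd_cpu {
    unsigned long irq_count;   /* bumped by the ordinary timer interrupt */
    unsigned long last_seen;   /* value observed at the previous NMI check */
    unsigned stalled_checks;   /* consecutive NMI checks without progress */
};

/* Ordinary timer interrupt: cannot run while IRQs are disabled. */
static void timer_tick(struct wd_cpu *c)
{
    c->irq_count++;
}

/*
 * NMI check: still runs while IRQs are disabled. Returns true once the
 * counter has been stuck for 'threshold' consecutive checks.
 */
static bool nmi_check(struct wd_cpu *c, unsigned threshold)
{
    if (c->irq_count == c->last_seen)
        return ++c->stalled_checks >= threshold;

    /* Progress was made: remember it and start counting again. */
    c->last_seen = c->irq_count;
    c->stalled_checks = 0;
    return false;
}
```

In this model, a CPU that sits in load balancing with IRQs disabled stops calling timer_tick(), so repeated calls to nmi_check() eventually return true even though the CPU is still executing code.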

I'll pound on 3.18 and see if I can reproduce something similar there.

David
