From: Peter Zijlstra <peterz@infradead.org>
To: David Ahern <david.ahern@oracle.com>
Cc: Mike Galbraith <efault@gmx.de>, Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: NMI watchdog triggering during load_balance
Date: Sat, 7 Mar 2015 10:36:47 +0100
Message-ID: <20150307093647.GP23367@worktop.ger.corp.intel.com>
In-Reply-To: <54F9F3D7.1030905@oracle.com>

On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting set up properly and what we can do about fixing it.

So we changed the numa topology setup a while back; see commit
cb83b629bae0 ("sched/numa: Rewrite the CONFIG_NUMA sched domain
support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; so for regular load balancing only the first CPU in the
domain will iterate up.

So if you have 4 'nodes' only 4 CPUs will iterate the entire machine,
not all 1024.
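
A toy sketch of the point above, not kernel code: it assumes 1024 CPUs split
evenly into 4 'nodes' and a simplified rule that only the first CPU of each
node balances at the machine-wide level. All names here are hypothetical. In
the kernel the decision is made per balance attempt (as far as I know by
should_we_balance() in kernel/sched/fair.c, which prefers the first idle CPU
of the group), so treat this only as an illustration of the counting.

```c
#include <assert.h>

#define NR_CPUS        1024
#define NR_NODES       4
#define CPUS_PER_NODE  (NR_CPUS / NR_NODES)

/* First CPU of the node-level span that contains @cpu (toy model). */
int first_cpu_of_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpu / CPUS_PER_NODE) * CPUS_PER_NODE;
}

/* Toy rule: only the first CPU of a node iterates up to the top level. */
int balances_machine_wide(int cpu)
{
	return cpu == first_cpu_of_node(cpu);
}

/* Number of CPUs that ever walk the whole machine under the toy rule. */
int count_machine_wide_balancers(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		n += balances_machine_wide(cpu);
	return n;
}
```

Under these assumptions count_machine_wide_balancers() returns NR_NODES, i.e.
4 rather than 1024.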



> Call Trace:
>  [000000000045dc30] double_rq_lock+0x4c/0x68
>  [000000000046a23c] load_balance+0x278/0x740
>  [00000000008aa178] __schedule+0x378/0x8e4
>  [00000000008aab1c] schedule+0x68/0x78
>  [00000000004718ac] do_exit+0x798/0x7c0
>  [000000000047195c] do_group_exit+0x88/0xc0
>  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
>  [000000000042cbc0] do_signal+0x70/0x5e4
>  [000000000042d14c] do_notify_resume+0x18/0x50
>  [00000000004049c4] __handle_signal+0xc/0x2c
> 
> 
> For example the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'), 10 seconds later oops stop that build, ctrl-c
> ... boom with the above stack trace.
> 
> Code wise ... and this is still present in 3.18 and 3.20:
> 
> schedule()
> - __schedule()
>   + irqs disabled: raw_spin_lock_irq(&rq->lock);
> 
>      pick_next_task
>      - idle_balance()

> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE, only domains with that
set will get iterated.
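
A minimal sketch of that gating, under stated assumptions: struct toy_domain,
TOY_SD_BALANCE_NEWIDLE (with a made-up bit value) and TOY_MAX_LEVELS are
hypothetical stand-ins, not the kernel's struct sched_domain or its real flag
values. It only shows that a walk up the domain hierarchy skips every level
that lacks the flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SD_BALANCE_NEWIDLE  0x1u	/* hypothetical bit value */
#define TOY_MAX_LEVELS          16	/* arbitrary sanity bound */

struct toy_domain {
	unsigned int flags;
	struct toy_domain *parent;	/* NULL at the top level */
};

/*
 * Walk from the lowest domain upwards and count the levels a newly idle
 * CPU would try to pull work from: only those with the NEWIDLE flag set.
 */
int count_newidle_levels(const struct toy_domain *sd)
{
	int levels = 0, n = 0;

	for (; sd; sd = sd->parent) {
		levels++;
		assert(levels <= TOY_MAX_LEVELS);	/* guard against a looped chain */
		if (!(sd->flags & TOY_SD_BALANCE_NEWIDLE))
			continue;	/* level is skipped entirely */
		n++;
	}
	return n;
}
```

With a three-level chain where only the two lowest levels carry the flag, the
walk touches two levels and never balances across the top one.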

I suppose you could try something like the below on 3.18.

It disables SD_BALANCE_NEWIDLE on all 'distant' nodes; but first check
how your fixed NUMA topology looks and whether you trigger that case at
all.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}
 
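For illustration only, the effect of the hunk above can be modelled in
userspace C. The TOY_* names and bit values are hypothetical, and
TOY_RECLAIM_DISTANCE is set to 30 on the assumption that the generic default
applies; an architecture may define a different value.

```c
#include <assert.h>

#define TOY_RECLAIM_DISTANCE    30	/* assumed generic default */

/* Hypothetical bit values, not the kernel's. */
#define TOY_SD_BALANCE_NEWIDLE  0x1u
#define TOY_SD_BALANCE_EXEC     0x2u
#define TOY_SD_BALANCE_FORK     0x4u
#define TOY_SD_WAKE_AFFINE      0x8u

/*
 * Flags for a NUMA level at the given node distance: levels farther away
 * than the reclaim distance lose exec, fork, newidle and wake-affine
 * balancing, mirroring the patched condition.
 */
unsigned int toy_numa_level_flags(unsigned int flags, int distance)
{
	assert(distance >= 0);
	if (distance > TOY_RECLAIM_DISTANCE) {
		flags &= ~(TOY_SD_BALANCE_EXEC |
			   TOY_SD_BALANCE_FORK |
			   TOY_SD_BALANCE_NEWIDLE |
			   TOY_SD_WAKE_AFFINE);
	}
	return flags;
}
```

Note the comparison is strict: a level exactly at the reclaim distance keeps
all four flags, so the change only matters if the topology reports distances
above that threshold.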

