From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932393AbbCFShr (ORCPT <rfc822;w@1wt.eu>);
	Fri, 6 Mar 2015 13:37:47 -0500
Received: from userp1040.oracle.com ([156.151.31.81]:32725 "EHLO
	userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752067AbbCFShp (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 6 Mar 2015 13:37:45 -0500
Message-ID: <54F9F3D7.1030905@oracle.com>
Date: Fri, 06 Mar 2015 11:37:11 -0700
From: David Ahern <david.ahern@oracle.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Mike Galbraith <efault@gmx.de>
CC: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>,
        LKML <linux-kernel@vger.kernel.org>
Subject: Re: NMI watchdog triggering during load_balance
References: <54F92788.6010007@oracle.com> <1425617559.16821.36.camel@gmx.de>	 <54F9C155.3050309@oracle.com> <1425665511.7562.36.camel@gmx.de>
In-Reply-To: <1425665511.7562.36.camel@gmx.de>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Source-IP: acsinet22.oracle.com [141.146.126.238]
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic.  I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT,your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong, as
you were pointing out. There should be 16 NUMA domains (4 memory
controllers per socket and 4 sockets). There should be 8 sibling cores.
I will look into why that is not getting set up properly and what we can
do about fixing it.
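
As a rough sanity check of those counts, here is a minimal sketch of the
arithmetic. It assumes the cores of a socket are split evenly across its
memory controllers, which is not stated above; the function and key names
are made up for illustration.

```python
# Hardware counts taken from the description above.
SOCKETS = 4
CORES_PER_SOCKET = 32
THREADS_PER_CORE = 8
MEM_CONTROLLERS_PER_SOCKET = 4


def expected_topology(sockets=SOCKETS,
                      cores_per_socket=CORES_PER_SOCKET,
                      threads_per_core=THREADS_PER_CORE,
                      mcs_per_socket=MEM_CONTROLLERS_PER_SOCKET):
    """Return the counts one would expect if each memory controller
    forms its own NUMA node (assumption: cores split evenly per node)."""
    cores_per_node = cores_per_socket // mcs_per_socket
    return {
        "total_cpus": sockets * cores_per_socket * threads_per_core,
        "numa_nodes": sockets * mcs_per_socket,
        "cores_per_node": cores_per_node,
        "cpus_per_node": cores_per_node * threads_per_core,
    }


if __name__ == "__main__":
    for name, value in expected_topology().items():
        print(f"{name}: {value}")
```

Under that assumption each NUMA node holds 8 cores, i.e. 64 hardware
threads.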

--

But, I do not understand how the wrong topology is causing the NMI 
watchdog to trigger. In the end there are still N domains, M groups per 
domain and P cpus per group. Doesn't the balancing walk over all of them 
irrespective of physical topology?

Here's another data point that jelled this morning while explaining the
problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC: <cpumask_next_and+0x44/0x6c>
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7: <double_rq_lock+0x4c/0x68>
Call Trace:
  [000000000045dc30] double_rq_lock+0x4c/0x68
  [000000000046a23c] load_balance+0x278/0x740
  [00000000008aa178] __schedule+0x378/0x8e4
  [00000000008aab1c] schedule+0x68/0x78
  [00000000004718ac] do_exit+0x798/0x7c0
  [000000000047195c] do_group_exit+0x88/0xc0
  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
  [000000000042cbc0] do_signal+0x70/0x5e4
  [000000000042d14c] do_notify_resume+0x18/0x50
  [00000000004049c4] __handle_signal+0xc/0x2c


For example, the stream program has 1024 threads (1 for each CPU). If you
ctrl-c the program or wait for it to terminate, that's when it trips. Other
workloads that routinely trip it are parallel builds: start 'make -j N'
for some N (e.g., 'make -j 128' on a 256-cpu system), decide 10 seconds
later to stop that build, ctrl-c ... boom, with the above stack trace.

Code-wise (and this is still present in 3.18 and 3.20):

schedule()
- __schedule()
   + irqs disabled: raw_spin_lock_irq(&rq->lock);

      pick_next_task
      - idle_balance()

   + irqs enabled:
     different task: context_switch(rq, prev, next)
                     --> finish_lock_switch eventually
     or same task:   raw_spin_unlock_irq(&rq->lock)


For 2.6.39 it's the invocation of idle_balance which is triggering load 
balancing with IRQs disabled. That's when the NMI watchdog trips.
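
To make the IRQ-disabled window in the outline above concrete, here is a
toy model, not kernel code. It only replays the order of calls from the
outline and records whether IRQs are enabled at each step; ToyCpu and
toy_schedule are invented names, and the real code paths do much more
(and differ between kernel versions).

```python
class ToyCpu:
    """Tracks a single IRQ-enabled flag and a trace of (event, irqs_enabled)."""

    def __init__(self):
        self.irqs_enabled = True
        self.trace = []

    def log(self, event):
        self.trace.append((event, self.irqs_enabled))


def toy_schedule(cpu, runqueue_empty, switch_task):
    """Replay the call order from the outline above (illustrative only)."""
    cpu.irqs_enabled = False            # raw_spin_lock_irq(&rq->lock)
    cpu.log("raw_spin_lock_irq")

    cpu.log("pick_next_task")
    if runqueue_empty:
        cpu.log("idle_balance")         # load balancing runs in this window

    if switch_task:
        cpu.log("context_switch")
        cpu.irqs_enabled = True         # finish_lock_switch, eventually
        cpu.log("finish_lock_switch")
    else:
        cpu.irqs_enabled = True         # raw_spin_unlock_irq(&rq->lock)
        cpu.log("raw_spin_unlock_irq")
    return cpu.trace


if __name__ == "__main__":
    for event, irqs in toy_schedule(ToyCpu(), runqueue_empty=True,
                                    switch_task=True):
        print(f"{event:22s} irqs {'on' if irqs else 'off'}")
```

In this model everything logged between the lock and the unlock runs with
IRQs off, so any time spent in the balancing step counts against the
watchdog.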

I'll pound on 3.18 and see if I can reproduce something similar there.

David