From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932393AbbCFShr (ORCPT ); Fri, 6 Mar 2015 13:37:47 -0500
Received: from userp1040.oracle.com ([156.151.31.81]:32725 "EHLO
	userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752067AbbCFShp (ORCPT );
	Fri, 6 Mar 2015 13:37:45 -0500
Message-ID: <54F9F3D7.1030905@oracle.com>
Date: Fri, 06 Mar 2015 11:37:11 -0700
From: David Ahern
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Mike Galbraith
CC: Peter Zijlstra, Ingo Molnar, LKML
Subject: Re: NMI watchdog triggering during load_balance
References: <54F92788.6010007@oracle.com> <1425617559.16821.36.camel@gmx.de> <54F9C155.3050309@oracle.com> <1425665511.7562.36.camel@gmx.de>
In-Reply-To: <1425665511.7562.36.camel@gmx.de>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Source-IP: acsinet22.oracle.com [141.146.126.238]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic. I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong,
as you were pointing out. There should be 16 NUMA domains (4 memory
controllers per socket and 4 sockets). There should be 8 sibling cores.
I will look into why that is not getting set up properly and what we can
do about fixing it.

--

But, I do not understand how the wrong topology is causing the NMI
watchdog to trigger. In the end there are still N domains, M groups per
domain and P cpus per group. Doesn't the balancing walk over all of them
irrespective of physical topology?

Here's another data point that jelled this morning while explaining the
problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC:
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7:
Call Trace:
 [000000000045dc30] double_rq_lock+0x4c/0x68
 [000000000046a23c] load_balance+0x278/0x740
 [00000000008aa178] __schedule+0x378/0x8e4
 [00000000008aab1c] schedule+0x68/0x78
 [00000000004718ac] do_exit+0x798/0x7c0
 [000000000047195c] do_group_exit+0x88/0xc0
 [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
 [000000000042cbc0] do_signal+0x70/0x5e4
 [000000000042d14c] do_notify_resume+0x18/0x50
 [00000000004049c4] __handle_signal+0xc/0x2c

For example the stream program has 1024 threads (1 for each CPU).
If you ctrl-c the program or wait for it to terminate, that's when it
trips. Other workloads that routinely trip it are make -j N, N some
number (e.g., on a 256 cpu system 'make -j 128'), 10 seconds later oops
stop that build, ctrl-c ... boom with the above stack trace.

Code wise ... and this is still present in 3.18 and 3.20:

schedule()
- __schedule()
  + irqs disabled: raw_spin_lock_irq(&rq->lock);

    pick_next_task
    - idle_balance()

  + irqs enabled:
    different task: context_switch(rq, prev, next)
                    --> finish_lock_switch eventually
    same task: raw_spin_unlock_irq(&rq->lock)
or

For 2.6.39 it's the invocation of idle_balance which is triggering load
balancing with IRQs disabled. That's when the NMI watchdog trips.

I'll pound on 3.18 and see if I can reproduce something similar there.

David