From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1755125AbbCFPLf (ORCPT ); Fri, 6 Mar 2015 10:11:35 -0500
Received: from userp1040.oracle.com ([156.151.31.81]:46761 "EHLO
        userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1754218AbbCFPLd (ORCPT );
        Fri, 6 Mar 2015 10:11:33 -0500
Message-ID: <54F9C381.3000305@oracle.com>
Date: Fri, 06 Mar 2015 08:10:57 -0700
From: David Ahern
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Peter Zijlstra
CC: Mike Galbraith, Ingo Molnar, LKML
Subject: Re: NMI watchdog triggering during load_balance
References: <54F92788.6010007@oracle.com> <20150306090731.GY21418@twins.programming.kicks-ass.net>
In-Reply-To: <20150306090731.GY21418@twins.programming.kicks-ass.net>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Source-IP: acsinet21.oracle.com [141.146.126.237]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/6/15 2:07 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Since each domain is a superset of the lower one each pass through
>> load_balance regularly repeats the processing of the previous domain (e.g.,
>> NODE domain repeats the cpus in the CPU domain). Then multiplying that
>> across 1024 cpus and it seems like a lot of duplication.
>
> It is, _but_ each domain has an interval, bigger domains _should_ load
> balance at a bigger interval (iow lower frequency), and all this is
> lockless data gathering, so reusing stuff from the previous round could
> be quite stale indeed.
>

Yes and I have twiddled the intervals. The defaults for min_interval and
max_interval (msec):

        min_interval  max_interval
SMT          1             2
MC           1             4
CPU          1             4
NODE         8            32

Increasing those values (e.g. moving NODE to 50 and 100) drops idle time
cpu usage but does not solve the fundamental problem -- under load the
balancing of domains seems to be lining up and the system comes to a halt
in load balancing frenzy.

David
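
P.S. A rough way to picture the "lining up" is the toy model below. It is
only a sketch and not the kernel's rebalance logic: it assumes every domain
rebalances on a fixed interval and that all of them start at t=0, which the
real scheduler does not guarantee. The names coincidence_period and
domains_due are made up for illustration; the interval values are the
defaults listed above.

```python
from functools import reduce
from math import gcd

# Default (min_interval, max_interval) in msec, as listed in the message.
DEFAULT_INTERVALS = {
    "SMT": (1, 2),
    "MC": (1, 4),
    "CPU": (1, 4),
    "NODE": (8, 32),
}


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def coincidence_period(intervals_ms):
    """Period (msec) at which every domain is due on the same tick.

    Toy assumption: fixed intervals, all starting at t=0.
    """
    return reduce(lcm, intervals_ms, 1)


def domains_due(intervals, t_ms, which=0):
    """Domains due at time t_ms; which=0 uses min_interval, 1 uses max_interval."""
    return [name for name, pair in intervals.items() if t_ms % pair[which] == 0]


if __name__ == "__main__":
    # Second entry is the NODE change tried above (50 and 100 msec).
    for label, table in (
        ("defaults", DEFAULT_INTERVALS),
        ("NODE at 50/100", dict(DEFAULT_INTERVALS, NODE=(50, 100))),
    ):
        mins = [lo for lo, _ in table.values()]
        maxs = [hi for _, hi in table.values()]
        print(label, "min:", coincidence_period(mins), "ms",
              "max:", coincidence_period(maxs), "ms")
```

In this model the defaults put all four domains on the same tick every 8 ms
(min_interval) or 32 ms (max_interval), and moving NODE to 50 and 100 only
stretches that to 50 ms and 100 ms. The coincidences get rarer but never go
away, which is at least consistent with what I am seeing.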