From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752279AbbE1IaA (ORCPT ); Thu, 28 May 2015 04:30:00 -0400
Received: from cantor2.suse.de ([195.135.220.15]:54485 "EHLO mx2.suse.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1753775AbbE1I3c (ORCPT ); Thu, 28 May 2015 04:29:32 -0400
Date: Thu, 28 May 2015 09:29:27 +0100
From: Mel Gorman
To: riel@redhat.com
Cc: linux-kernel@vger.kernel.org, jhladky@redhat.com, peterz@infradead.org, mingo@kernel.org, dedekind1@gmail.com
Subject: Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination
Message-ID: <20150528082927.GD13750@suse.de>
References: <1432753468-7785-1-git-send-email-riel@redhat.com> <1432753468-7785-3-git-send-email-riel@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <1432753468-7785-3-git-send-email-riel@redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 27, 2015 at 03:04:28PM -0400, riel@redhat.com wrote:
> From: Rik van Riel
>
> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
> preferred node") fixes an issue where workloads would never converge
> on a fully loaded (or overloaded) system.
>
> However, it introduces a regression on less than fully loaded systems,
> where workloads converge on a few NUMA nodes, instead of properly staying
> spread out across the whole system. This leads to a reduction in available
> memory bandwidth, and usable CPU cache, with predictable performance problems.
>
> The root cause appears to be an interaction between the load balancer and
> NUMA balancing, where the short term load represented by the load balancer
> differs from the long term load the NUMA balancing code would like to base
> its decisions on.
>
> Simply reverting a43455a1 would re-introduce the non-convergence of
> workloads on fully loaded systems, so that is not a good option. As
> an aside, the check done before a43455a1 only applied to a task's
> preferred node, not to other candidate nodes in the system, so the
> converge-on-too-few-nodes problem still happens, just to a lesser
> degree.
>
> Instead, try to compensate for the impedance mismatch between the
> load balancer and NUMA balancing by only ever considering a lesser
> loaded node as a destination for NUMA balancing, regardless of
> whether the task is trying to move to the preferred node, or to another
> node.
>
> This patch also addresses the issue that a system with a single runnable
> thread would never migrate that thread to near its memory, introduced by
> 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced").
>
> A test where the main thread creates a large memory area, and spawns
> a worker thread to iterate over the memory (placed on another node
> by select_task_rq_fair), after which the main thread goes to sleep
> and waits for the worker thread to loop over all the memory now sees
> the worker thread migrated to where the memory is, instead of having
> all the memory migrated over like before.
>
> Jirka has run a number of performance tests on several systems:
> single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
> system, with higher gains on systems with more cores per socket.
> Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
> little or no changes with the revert of 095bebf61a46 and this patch.
>
> Signed-off-by: Rik van Riel
> Reported-by: Artem Bityutski
> Reported-by: Jirka Hladky
> Tested-by: Jirka Hladky

Acked-by: Mel Gorman

-- 
Mel Gorman
SUSE Labs
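
For readers following the changelog, the rule it describes ("only ever
consider a lesser loaded node as a destination", for the preferred node
and for any other candidate alike) can be illustrated with a small
userspace sketch. This is not the code from the patch. The names
struct node_stats, dst_is_less_busy() and pick_numa_dst() are
hypothetical, scaling load by per-node capacity is an assumption, and
choosing the least busy fallback is a simplification (the changelog
does not say how eligible candidates are ranked).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node summary; not the kernel's own statistics struct. */
struct node_stats {
	unsigned long load;     /* aggregate runnable load on the node */
	unsigned long capacity; /* compute capacity of the node, assumed > 0 */
};

/*
 * True only if dst is strictly less busy than src. Load is compared
 * relative to capacity (an assumption of this sketch); cross-multiplying
 * avoids a division:
 *   src->load / src->capacity > dst->load / dst->capacity
 */
static bool dst_is_less_busy(const struct node_stats *src,
			     const struct node_stats *dst)
{
	return src->load * dst->capacity > dst->load * src->capacity;
}

/*
 * Apply the same filter to every candidate, preferred or not.
 * Returns the preferred node if it is less busy than the source,
 * otherwise the least busy eligible node, otherwise -1 (stay put).
 */
static int pick_numa_dst(const struct node_stats *nodes, size_t nr,
			 size_t src, size_t preferred)
{
	int best = -1;
	size_t i;

	assert(src < nr && preferred < nr);

	if (preferred != src &&
	    dst_is_less_busy(&nodes[src], &nodes[preferred]))
		return (int)preferred;

	for (i = 0; i < nr; i++) {
		if (i == src || !dst_is_less_busy(&nodes[src], &nodes[i]))
			continue;
		/* Keep the candidate that is less busy than the current best. */
		if (best < 0 ||
		    dst_is_less_busy(&nodes[(size_t)best], &nodes[i]))
			best = (int)i;
	}
	return best;
}
```

With three equally sized nodes carrying loads 300, 100 and 400, a task
on the first node whose preferred node is the third would not be moved
there, since that node is busier; only the second node qualifies as a
destination.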