From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752279AbbE1IaA (ORCPT <rfc822;w@1wt.eu>);
	Thu, 28 May 2015 04:30:00 -0400
Received: from cantor2.suse.de ([195.135.220.15]:54485 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753775AbbE1I3c (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 28 May 2015 04:29:32 -0400
Date: Thu, 28 May 2015 09:29:27 +0100
From: Mel Gorman <mgorman@suse.de>
To: riel@redhat.com
Cc: linux-kernel@vger.kernel.org, jhladky@redhat.com, peterz@infradead.org,
        mingo@kernel.org, dedekind1@gmail.com
Subject: Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa
 balancing destination
Message-ID: <20150528082927.GD13750@suse.de>
References: <1432753468-7785-1-git-send-email-riel@redhat.com>
 <1432753468-7785-3-git-send-email-riel@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <1432753468-7785-3-git-send-email-riel@redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 27, 2015 at 03:04:28PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
> preferred node") fixes an issue where workloads would never converge
> on a fully loaded (or overloaded) system.
> 
> However, it introduces a regression on less than fully loaded systems,
> where workloads converge on a few NUMA nodes, instead of properly staying
> spread out across the whole system. This leads to a reduction in available
> memory bandwidth, and usable CPU cache, with predictable performance problems.
> 
> The root cause appears to be an interaction between the load balancer and
> NUMA balancing, where the short-term load seen by the load balancer
> differs from the long-term load the NUMA balancing code would like to base
> its decisions on.
> 
> Simply reverting a43455a1 would re-introduce the non-convergence of
> workloads on fully loaded systems, so that is not a good option. As
> an aside, the check done before a43455a1 only applied to a task's
> preferred node, not to other candidate nodes in the system, so the
> converge-on-too-few-nodes problem still happens, just to a lesser
> degree.
> 
> Instead, try to compensate for the impedance mismatch between the
> load balancer and NUMA balancing by only ever considering a less
> loaded node as a destination for NUMA balancing, regardless of
> whether the task is trying to move to the preferred node or to another
> node.
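
For readers following along, the rule described above can be sketched
roughly as below. This is only an illustration under stated assumptions,
not the code in the patch: struct node_load, dst_less_busy(),
pick_numa_destination() and the per-node score[] array are hypothetical
names made up for the example, and the real scheduler keeps its
statistics in different structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node summary; not the kernel's own data structure. */
struct node_load {
	unsigned long load;      /* runnable load currently on the node */
	unsigned long capacity;  /* compute capacity of the node */
};

/*
 * True if dst is less busy than src relative to capacity.
 * Cross-multiplied so that no division is needed.
 */
static bool dst_less_busy(const struct node_load *src,
			  const struct node_load *dst)
{
	return dst->load * src->capacity < src->load * dst->capacity;
}

/*
 * Pick the candidate with the best (assumed) NUMA score among nodes
 * that are less busy than the source node. The same filter applies to
 * every candidate, whether or not it is the preferred node.
 * Returns the node id, or -1 if no node qualifies.
 */
static int pick_numa_destination(const struct node_load *nodes,
				 const long *score, size_t nr_nodes,
				 size_t src)
{
	int best = -1;
	long best_score = 0;

	for (size_t nid = 0; nid < nr_nodes; nid++) {
		if (nid == src)
			continue;
		/* Never consider an equally or more loaded node. */
		if (!dst_less_busy(&nodes[src], &nodes[nid]))
			continue;
		if (best < 0 || score[nid] > best_score) {
			best = (int)nid;
			best_score = score[nid];
		}
	}
	return best;
}
```

With loads of 800, 900 and 200 on three nodes of equal capacity and node 0
as the source, only node 2 qualifies, even if node 1 has the higher score.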
> 
> This patch also addresses the issue that a system with a single runnable
> thread would never migrate that thread to near its memory, introduced by
> 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced").
> 
> One test case: the main thread creates a large memory area and spawns
> a worker thread to iterate over the memory (the worker is placed on
> another node by select_task_rq_fair), then goes to sleep and waits for
> the worker thread to loop over all the memory. With this patch, the
> worker thread is migrated to where the memory is, instead of all the
> memory being migrated over as before.
> 
> Jirka has run a number of performance tests on several systems:
> single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
> system, with higher gains on systems with more cores per socket.
> Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
> little or no changes with the revert of 095bebf61a46 and this patch.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Artem Bityutski <dedekind1@gmail.com>
> Reported-by: Jirka Hladky <jhladky@redhat.com>
> Tested-by: Jirka Hladky <jhladky@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs