From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <pzijlstr@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <dhillf@gmail.com>,
	Andrew Jones <drjones@redhat.com>, Dan Smith <danms@us.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>, Christoph Lameter <cl@linux.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Mike Galbraith <efault@gmx.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH 00/33] AutoNUMA27
Date: Fri, 12 Oct 2012 09:46:27 +0100
Message-ID: <20121012084627.GS3317@csn.ul.ie>
In-Reply-To: <20121012014553.GD1818@redhat.com>

On Fri, Oct 12, 2012 at 03:45:53AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> > So after getting through the full review of it, there wasn't anything
> > I could not stand. I think it's *very* heavy on some of the paths like
> > the idle balancer which I was not keen on and the fault paths are also
> > quite heavy.  I think the weight on some of these paths can be reduced
> > but not to 0 if the objectives to autonuma are to be met.
> > 
> > I'm not fully convinced that the task exchange is actually necessary or
> > beneficial because it somewhat assumes that there is a symmetry between CPU
> > and memory balancing that may not be true. The fact that it only considers
> 
> The problem is that without an active task exchange and no explicit
> call to stop_one_cpu*, there's no way to migrate a currently running
> task and clearly we need that. We can indefinitely wait hoping the
> task goes to sleep and leaves the CPU idle, or that a couple of other
> tasks start and trigger load balance events.
> 

Stick that in a comment, although I still don't fully see why the actual
exchange is necessary and why you cannot just move the current task to
the remote CPU's runqueue. Maybe it's something to do with them converging
faster if you do an exchange. I'll figure it out eventually.

> We must move tasks even if all cpus are in a steady rq->nr_running ==
> 1 state and there's no other scheduler balance event that could
> possibly attempt to move tasks around in such a steady state.
> 

I see. Just because there is a 1:1 mapping between tasks and CPUs does
not mean that it has converged from a NUMA perspective. The idle balancer
could be moving a task to an idle CPU that is poor from a NUMA point of
view. Better integration with the load balancer and caching on a per-NUMA
basis both the best and worst converged processes might help but I'm
hand-waving.
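
A toy userspace sketch of the steady-state point, not code from the series.
The struct and function names are made up for illustration, and "converged"
is simplified to mean a task running on the node that holds most of its
memory:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: exactly one task per CPU. */
struct toy_task {
	int cpu_node;	/* node of the CPU the task currently runs on */
	int mem_node;	/* node holding most of the task's memory */
};

/* Count the tasks that run on the node holding their memory. */
static int toy_converged(const struct toy_task *t, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (t[i].cpu_node == t[i].mem_node)
			count++;
	return count;
}

/*
 * Exchange the CPUs of two running tasks. Both CPUs still run exactly
 * one task afterwards, so an nr_running-based balancer sees no change.
 */
static void toy_exchange(struct toy_task *a, struct toy_task *b)
{
	int tmp;

	assert(a && b);
	tmp = a->cpu_node;
	a->cpu_node = b->cpu_node;
	b->cpu_node = tmp;
}
```

With two tasks that each run on the other's memory node, the load is
perfectly balanced yet nothing is converged, and a single exchange fixes
both tasks without ever leaving a CPU idle or doubly loaded.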

> Of course one could hack the active idle balancing so that it does the
> active NUMA balancing action, but that would be a purely artificial
> complication: it would add unnecessary delay and it would provide no
> benefit whatsoever.
> 
> Why don't we dump the active idle balancing too, and we hack the load
> balancing to do the active idle balancing as well? Of course then the
> two will be more integrated. But it'll be a mess and slower and
> there's a good reason why they exist as totally separated pieces of
> code working in parallel.
> 

I'm not 100% convinced they have to be separate but you have thought about
this a hell of a lot more than I have and I'm a scheduling dummy.

For example, to me it seems that if the load balancer was going to move a
task to an idle CPU on a remote node, it could also check if it would be
more or less converged before moving and reject the balancing if it would
be less converged after the move. This increases the search cost in the
load balancer but is not necessarily any worse than what happens currently.
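
Roughly what I have in mind, as a userspace sketch rather than a patch
against the load balancer. The per-node fault counters and every name here
are assumptions for illustration; the real statistics in the series may
look quite different:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_NODES 4

/* Hypothetical per-task stats: recent NUMA hinting faults per node. */
struct toy_task_stats {
	unsigned long faults[TOY_MAX_NODES];
};

/*
 * Return true if moving the task from src_node to dst_node would leave
 * it at least as converged as before, i.e. the destination accounts for
 * at least as many of its recent faults as the source does.
 */
static bool toy_numa_allows_move(const struct toy_task_stats *s,
				 int src_node, int dst_node)
{
	assert(src_node >= 0 && src_node < TOY_MAX_NODES);
	assert(dst_node >= 0 && dst_node < TOY_MAX_NODES);

	if (src_node == dst_node)
		return true;	/* same node, NUMA placement unchanged */
	return s->faults[dst_node] >= s->faults[src_node];
}
```

The check itself is two array lookups per candidate. The extra cost would
come from the balancer having to try further candidates when one is
rejected.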

> We can integrate it more, but in my view the result would be worse and
> more complicated. Last but not the least messing the idle balancing
> code to do an active NUMA balancing action (somehow invoking
> stop_one_cpu* in the steady state described above) would force even
> cellphones and UP kernels to deal with NUMA code somehow.
> 

hmm...

> > tasks that are currently running feels a bit random but examining all tasks
> > that recently ran on the node would be far too expensive so there is no
> 
> So far this seems a good tradeoff. Nothing will prevent us to scan
> deeper into the runqueues later if we find a way to do that efficiently.
> 

I don't think there is an efficient way to do that but I'm hoping
caching an exchange candidate on a per-NUMA basis could reduce the cost
while still converging reasonably quickly.
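
A sketch of what such a cache might look like, again in userspace with
hypothetical names. The "gain" score is an assumption standing in for
whatever metric would rank candidates, and locking and invalidation are
ignored entirely:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 4

struct toy_candidate {
	int pid;		/* 0 means the slot is empty */
	int preferred_node;	/* node the task would rather run on */
	unsigned int gain;	/* hypothetical benefit of moving it there */
};

/* One cached exchange candidate per node. */
static struct toy_candidate toy_cache[TOY_MAX_NODES];

/* Remember a task running on 'node' if it beats the cached candidate. */
static void toy_cache_update(int node, int pid, int preferred_node,
			     unsigned int gain)
{
	struct toy_candidate *c;

	assert(node >= 0 && node < TOY_MAX_NODES);
	c = &toy_cache[node];
	if (c->pid == 0 || gain > c->gain) {
		c->pid = pid;
		c->preferred_node = preferred_node;
		c->gain = gain;
	}
}

/*
 * A task on my_node wants to run on want_node. Return the pid of the
 * cached task on want_node if it wants to come the other way, else 0.
 */
static int toy_cache_lookup(int my_node, int want_node)
{
	const struct toy_candidate *c;

	assert(want_node >= 0 && want_node < TOY_MAX_NODES);
	c = &toy_cache[want_node];
	if (c->pid != 0 && c->preferred_node == my_node)
		return c->pid;
	return 0;
}
```

The lookup is one slot per node instead of a walk over every runqueue on
the node. The price is that the cached task may no longer be the best
candidate, or may no longer be running there at all.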

> > good answer. You are caught between a rock and a hard place and either
> > direction you go is wrong for different reasons. You need something more
> 
> I think you described the problem perfectly ;).
> 
> > frequent than scans (because it'll converge too slowly) but doing it from
> > the balancer misses some tasks and may run too frequently and it's unclear
> > how it affects the current load balancer decisions. I don't have a good
> > alternative solution for this but ideally it would be better integrated with
> > the existing scheduler when there is more data on what those scheduling
> > decisions should be. That will only come from a wide range of testing and
> > the inevitable bug reports.
> > 
> > That said, this is concentrating on the problems without considering the
> > situations where it would work very well.  I think it'll come down to HPC
> > and anything jitter-sensitive will hate this while workloads like JVM,
> > virtualisation or anything that uses a lot of memory without caring about
> > placement will love it. It's not perfect but it's better than incurring
> > the cost of remote access unconditionally.
> 
> Full agreement.
> 
> Your detailed full review was very appreciated, thanks!
> 

You're welcome.

-- 
Mel Gorman
SUSE Labs
