linux-kernel.vger.kernel.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	mingo@kernel.org, oleg@redhat.com, pjt@google.com,
	akpm@linux-foundation.org, torvalds@linux-foundation.org,
	tglx@linutronix.de, Lee.Schermerhorn@hp.com,
	linux-kernel@vger.kernel.org, Petr Holasek <pholasek@redhat.com>
Subject: Re: [PATCH 00/19] sched-numa rewrite
Date: Fri, 17 Aug 2012 20:08:47 +0200
Message-ID: <20120817180847.GE10129@redhat.com>
In-Reply-To: <5022B356.9060902@redhat.com>

Hi,

On Wed, Aug 08, 2012 at 02:43:34PM -0400, Rik van Riel wrote:
> While the sched-numa code is relatively small and clean, the
> current version does not seem to offer a significant
> performance improvement over not having it, and in one of
> the tests performance actually regresses vs. mainline.

sched-numa is small, true, but I dispute that it is clean. It relies
on lots of hacks, it has a worse NUMA hinting page fault
implementation, it has no runtime disable tweak, it has no config
option, and it's very intrusive in the scheduler and MM code, so it
would be very hard to back out if a better solution emerged in the
future.

> On the other hand, the autonuma code is pretty large and
> hard to understand, but it does provide a significant
> speedup on each of the tests.

The AutoNUMA code is certainly pretty large, but it is totally self
contained. 90% of it is in isolated files that can be deleted and
won't even get built if CONFIG_AUTONUMA=n. The other common code
changes can be wiped out by following the build errors after dropping
the include files with CONFIG_AUTONUMA=n, should a better solution
emerge in the future.
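
As a rough illustration of the "won't even get built" property, the
sketch below shows the usual compile-out pattern for an optional
feature. Only the CONFIG_AUTONUMA name comes from this mail;
numa_hint_hook() and common_path() are hypothetical names invented
for the example, not functions from the AutoNUMA patch set.

```c
#include <assert.h>

/* Leave CONFIG_AUTONUMA undefined to mimic CONFIG_AUTONUMA=n. */
/* #define CONFIG_AUTONUMA 1 */

#ifdef CONFIG_AUTONUMA
/* Hypothetical hook: in a real tree this would live in its own file
 * that is only compiled when the option is enabled. */
static inline int numa_hint_hook(int cpu)
{
    return cpu % 2;
}
#else
/* With the option off, the hook collapses to a no-op stub, so common
 * code keeps compiling and carries no extra work. */
static inline int numa_hint_hook(int cpu)
{
    (void)cpu;
    return 0;
}
#endif

/* Hypothetical common-code caller: the only trace of the feature here
 * is a single call that the compiler can drop when the stub is used. */
int common_path(int cpu)
{
    assert(cpu >= 0);
    return 100 + numa_hint_hook(cpu);
}
```

With the option undefined, common_path() behaves as if the feature
never existed; deleting the stub as well would turn every remaining
call site into a build error, which is the kind of trail described
above.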

I think it's important that whatever is merged is self contained and
easy to back out in the future, especially if the code that is not
self contained is full of hacks like the big/small mode or a random
number generator generating part of the "input".

I applied the fix for sched-numa rewrite/v2 posted on lkml but I still
get lockups when running the autonuma-benchmark on the 8 node system;
I could never complete the first numa01 test. I provided stack traces
off list to debug it.

So for now I updated the pdf with only the autonuma23 results for the
8 node system. I had to bump the autonuma version to 23 and repeat
all benchmarks because of a one liner s/kalloc/kzalloc/ change needed
to successfully boot autonuma on the 8 node system (which boots with
ram not zeroed out).
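
To see why a zeroing allocation matters when RAM is not cleared at
boot, here is a small userspace sketch. It is not kernel code: the
arena, arena_alloc(), arena_zalloc() and struct node_stats are all
made up for the example, and the 0xAA fill merely stands in for
whatever stale bytes the machine happens to boot with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 64

/* Stand-in for RAM that was not cleared before boot. */
static unsigned char arena[ARENA_SIZE];
static size_t arena_used;

/* Fill the arena with a nonzero pattern to mimic stale contents. */
void arena_poison(void)
{
    memset(arena, 0xAA, sizeof(arena));
    arena_used = 0;
}

/* Non-zeroing allocation: returns whatever bytes are already there. */
void *arena_alloc(size_t n)
{
    void *p;

    if (n > ARENA_SIZE - arena_used)
        return NULL;
    p = arena + arena_used;
    arena_used += n;
    return p;
}

/* Zeroing allocation: clears the bytes before handing them out. */
void *arena_zalloc(size_t n)
{
    void *p = arena_alloc(n);

    if (p)
        memset(p, 0, n);
    return p;
}

/* Hypothetical structure whose fields are assumed to start at zero. */
struct node_stats {
    unsigned char faults;
    unsigned char initialized;
};
```

After arena_poison(), a struct node_stats obtained from arena_alloc()
starts with garbage in its fields, while one from arena_zalloc()
starts at zero. Code that silently assumes zeroed fields only works
with the first allocator on machines whose memory happens to be clear.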

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120817.pdf

I didn't include the convergence charts for 3.6-rc1 on the 8 node
system because they're equal to the ones on the 2 node system and
they would only waste pdf real estate.

From the numa02_SMT charts I suspect something may not be perfect in
the active load idle balancing of CFS. The imperfection is likely lost
in the noise, and without the convergence charts showing the exact
memory distribution across the nodes it would be hard to notice it.

numa01 on the 8 node system is quite a pathological case, and it
shows the heavy NUMA false sharing/relation there is when 2 processes
each cross 4 nodes and touch all memory in a loop. The smooth async
memory migration of that pathological case still doesn't hurt, despite
some small migration that keeps going in the background forever (this
is why async migration providing smooth behavior is quite important).
numa01 is a very different load on 2 nodes vs 8 nodes (on 2 nodes it
can converge 100% and it will stop the memory migrations altogether).

Near the end of the tests (the X axis is time) you'll notice some
divergence. That happens because some threads complete sooner (the
threads of the node that had all ram local at startup will certainly
always complete faster than the others). The reason for that
divergence is that it falls into the _SMT case to fill all idle cores.

I also noticed on the 8 node system some repetition of the task
migrations invoked by sched_autonuma_balance() that I intend to
optimize away in future versions (it is only visible after enabling
the debug mode). Fixing it will save a small amount of CPU. What
happens is that the idle load balancing invoked by the CPU that
becomes idle after the task migration sometimes grabs the migrated
task and puts it back in its original position, so the migration has
to be repeated at the next invocation of sched_autonuma_balance().
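
The ping-pong described above can be reproduced with a toy model. The
sketch below is only an illustration: it is not the scheduler code,
and every name in it (struct sim_task, numa_balance(), idle_balance(),
count_numa_migrations()) is invented. The respect_numa flag is a
hypothetical guard added to show the effect of idle balancing leaving
a freshly placed task alone; it is not the fix planned in this mail.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_task {
    int cpu;            /* CPU the task currently runs on */
    int preferred_cpu;  /* CPU on the node holding most of its memory */
    bool numa_moved;    /* set when the NUMA balancer placed the task */
};

/* One NUMA balancing pass; returns 1 if the task had to be migrated. */
static int numa_balance(struct sim_task *t)
{
    if (t->cpu == t->preferred_cpu)
        return 0;
    t->cpu = t->preferred_cpu;
    t->numa_moved = true;
    return 1;
}

/* Idle balancing run by the CPU the task just left. In this toy model
 * it always pulls the task back unless told to respect NUMA placement. */
static void idle_balance(struct sim_task *t, int idle_cpu, bool respect_numa)
{
    if (respect_numa && t->numa_moved)
        return;
    t->cpu = idle_cpu;
    t->numa_moved = false;
}

/* Count how many NUMA migrations are needed over a number of passes. */
int count_numa_migrations(int rounds, bool respect_numa)
{
    struct sim_task t = { .cpu = 0, .preferred_cpu = 4, .numa_moved = false };
    int migrations = 0;

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        int origin = t.cpu;

        if (numa_balance(&t)) {
            migrations++;
            /* The origin CPU went idle, so it tries to pull work. */
            idle_balance(&t, origin, respect_numa);
        }
    }
    return migrations;
}
```

With respect_numa false the same migration is repeated on every pass;
with it true the task is migrated once and stays put. The real
behavior is only occasional ("sometimes grabs"), so the model
exaggerates the effect to make it visible.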


Thread overview: 53+ messages
2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
2012-07-31 19:12 ` [PATCH 01/19] task_work: Remove dependency on sched.h Peter Zijlstra
2012-07-31 20:52   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
2012-07-31 20:52   ` Rik van Riel
2012-08-09 21:41   ` Andrea Arcangeli
2012-08-10  0:50     ` Andi Kleen
2012-07-31 19:12 ` [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
2012-07-31 20:52   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 04/19] mm, thp: Preserve pgprot across huge page split Peter Zijlstra
2012-07-31 20:53   ` Rik van Riel
2012-08-09 21:42   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure Peter Zijlstra
2012-07-31 20:55   ` Rik van Riel
2012-08-09 21:43   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
2012-07-31 21:04   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
2012-07-31 21:06   ` Rik van Riel
2012-08-09 21:44   ` Andrea Arcangeli
2012-10-01  9:36   ` Michael Kerrisk
2012-10-01  9:45     ` Ingo Molnar
2012-07-31 19:12 ` [PATCH 08/19] mm/mpol: Check for misplaced page Peter Zijlstra
2012-07-31 21:13   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page() Peter Zijlstra
2012-07-31 21:16   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
2012-07-31 21:24   ` Rik van Riel
2012-08-09 21:44   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 11/19] sched, mm: Introduce tsk_home_node() Peter Zijlstra
2012-07-31 21:30   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 12/19] mm/mpol: Make mempolicy home-node aware Peter Zijlstra
2012-07-31 21:33   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 13/19] sched: Introduce sched_feat_numa() Peter Zijlstra
2012-07-31 21:34   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 14/19] sched: Make find_busiest_queue() a method Peter Zijlstra
2012-07-31 21:34   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 15/19] sched: Implement home-node awareness Peter Zijlstra
2012-07-31 21:52   ` Rik van Riel
2012-08-09 21:51   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 16/19] sched, numa: NUMA home-node selection code Peter Zijlstra
2012-07-31 21:52   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 17/19] sched, numa: Detect big processes Peter Zijlstra
2012-07-31 21:53   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
2012-07-31 21:56   ` Rik van Riel
2012-08-08 21:35   ` Peter Zijlstra
2012-08-09 21:57   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 19/19] mm, numa: retry failed page migrations Peter Zijlstra
2012-08-02 20:40   ` Christoph Lameter
2012-08-08 17:17 ` [PATCH 00/19] sched-numa rewrite Andrea Arcangeli
2012-08-08 18:43   ` Rik van Riel
2012-08-17 18:08     ` Andrea Arcangeli [this message]
