linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	Paul Turner <pjt@google.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Mike Galbraith <efault@gmx.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Dan Smith <danms@us.ibm.com>,
	Bharata B Rao <bharata.rao@gmail.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC] AutoNUMA alpha6
Date: Fri, 16 Mar 2012 19:25:11 +0100	[thread overview]
Message-ID: <20120316182511.GJ24602@redhat.com> (raw)
In-Reply-To: <20120316144028.036474157@chello.nl>

On Fri, Mar 16, 2012 at 03:40:28PM +0100, Peter Zijlstra wrote:
> And a few numbers...

Could you try my two trivial benchmarks I sent on lkml too? That
should take less time than the effort you did to add those performance
numbers to perf. I use those benchmarks as a regression test for my
code. They exercise a more complex scenario than "sleep 2" so
supposedly the results will be more interesting.

You find both programs in this link:

http://lists.openwall.net/linux-kernel/2012/01/27/9

These are my results.

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120126.pdf

I happened to have released the autonuma source yesterday on my git
tree:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/autonuma

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=30ed50adf6cfe85f7feb12c4279359ec52f5f2cd;hp=c03cf0621ed5941f7a9c1e0a343d4df30dbfb7a1

It's a big monlithic patch, but I'll split it.

THP native migration isn't complete yet so it degrades more than it
should when comparing THP autonuma vs hard bind THP. But that can be
done later and it'll benefit move_pages or any other userland hard
binding too, not just the autonuma kernel side. I guess you need this
feature too.

The scanning rate must be tuned, it's possibly too fast as default
because all my benchmarks tends to be short lived. There's already
lots of tuning available in /sys/kernel/mm/autonuma .

There's lots of other tuning to facilitate testing the different
algorithms. By default the numa balancing decisions will keep the
process stuck in its own node, unless there's an idle cpu, but there's
a model to let it escape the node for load balancing/fairness reasons
(to be closer to the stock scheduler) by setting load_balance_strict
to zero (default is 1).

There's also a knuma_scand/working_set tweak to scan the working set
and not all memory (so only care about what's hot, if the app has a
ton of memory that isn't using in some node, that won't be accounted
anymore in the memory migration and CPU migration decisions).

There's no syscall or hint userland can give.

The migration doesn't happen during page fault. There's proper
knuma_migrated daemon per node. The daemon has per-node array of
page-lists. Then knuma_migrated0 is waken with some hysteresis, and
will pick pages that wants to go from node1 to node0, from node 2 to
node0 etc.. and it'll pick them in round robin fascion across all
nodes. That stops when the node0 is out of memory and the cache would
be shrunk or in most cases when there aren't more pages to
migrate. One of the missing features is to start balancing cache
around but I'll add that later and I've already reserved one slot in
the pgdat for that. All other numa_migratedN also runs, so we're
guaranteed to make progress when process A going from node0 to node1,
and process B going from node1 to node0.

All memory that isn't shared is migrated, that includes mapped
pagecache.

The basic logic is scheduler following the memory and memory following
CPU, until things converge.

I'm skeptical in general that any NUMA hinting syscall will be used by
anything except qemu and that's what motivated my design. Hopefully in
the future CPU vendors will provide us a better way to track memory
locality than what I'm doing right now in software. The cost is almost
unmeasurable (even if you disable the pmd mode). I'm afraid with virt
the cost could be higher because of the vmexists but virt is long
lived and a slower scanning rate for the memory layout info should be
ok.

Here also huge amount of improvements are possible. Hopefully it's not
too intrusive either.

I also wrote a threaded userland tool that can render visually at
>20frames per sec the status of the memory and shows the memory
migration (the ones I found were on python and with >8G of ram they
just can't deliver). I was going to try to make it per-process instead
of global before releasing it, that may give another speedup (or
slowdown I don't know for sure). It'll help explain what the code does
and see it in action. But for us echo 1 >/sys/kernel/mm/autonuma/debug
may be enough. Still the visual thing is cool and if done generically
it would be interesting. Ideally once it goes per process it should
show which CPU the process is running on too, not just where the
process memory is.

 arch/x86/include/asm/paravirt.h      |    2 -
 arch/x86/include/asm/pgtable.h       |   51 ++-
 arch/x86/include/asm/pgtable_types.h |   22 +-
 arch/x86/kernel/cpu/amd.c            |    4 +-
 arch/x86/kernel/cpu/common.c         |    4 +-
 arch/x86/kernel/setup_percpu.c       |    1 +
 arch/x86/mm/gup.c                    |    2 +-
 arch/x86/mm/numa.c                   |    9 +-
 fs/exec.c                            |    3 +
 include/asm-generic/pgtable.h        |   13 +
 include/linux/autonuma.h             |   41 +
 include/linux/autonuma_flags.h       |   62 ++
 include/linux/autonuma_sched.h       |   61 ++
 include/linux/autonuma_types.h       |   54 ++
 include/linux/huge_mm.h              |    7 +-
 include/linux/kthread.h              |    1 +
 include/linux/mm_types.h             |   29 +
 include/linux/mmzone.h               |    6 +
 include/linux/sched.h                |    4 +
 kernel/exit.c                        |    1 +
 kernel/fork.c                        |   36 +-
 kernel/kthread.c                     |   23 +
 kernel/sched/Makefile                |    3 +-
 kernel/sched/core.c                  |   13 +-
 kernel/sched/fair.c                  |   55 ++-
 kernel/sched/numa.c                  |  322 ++++++++
 kernel/sched/sched.h                 |   12 +
 mm/Kconfig                           |   13 +
 mm/Makefile                          |    1 +
 mm/autonuma.c                        | 1465 ++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                     |   32 +-
 mm/memcontrol.c                      |    2 +-
 mm/memory.c                          |   36 +-
 mm/mempolicy.c                       |   15 +-
 mm/mmu_context.c                     |    2 +
 mm/page_alloc.c                      |   19 +
 36 files changed, 2376 insertions(+), 50 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2012-03-16 18:25 UTC|newest]

Thread overview: 152+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
2012-07-06 10:32   ` Johannes Weiner
2012-07-06 14:48     ` Minchan Kim
2012-07-06 15:02       ` Peter Zijlstra
2012-07-06 14:54   ` Kyungmin Park
2012-07-06 15:00     ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY Peter Zijlstra
2012-03-23 11:50   ` Mel Gorman
2012-07-06 16:38     ` Rik van Riel
2012-07-06 20:04       ` Lee Schermerhorn
2012-07-06 20:27         ` Rik van Riel
2012-07-09 11:48       ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP Peter Zijlstra
2012-07-06 18:40   ` Rik van Riel
2012-03-16 14:40 ` [RFC][PATCH 05/26] mm, mpol: Check for misplaced page Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 06/26] mm: Migrate " Peter Zijlstra
2012-04-03 17:32   ` Dan Smith
2012-03-16 14:40 ` [RFC][PATCH 07/26] mm: Handle misplaced anon pages Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 08/26] mm, mpol: Simplify do_mbind() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 09/26] sched, mm: Introduce tsk_home_node() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware Peter Zijlstra
2012-03-16 18:34   ` Christoph Lameter
2012-03-16 21:12     ` Peter Zijlstra
2012-03-19 13:53       ` Christoph Lameter
2012-03-19 14:05         ` Peter Zijlstra
2012-03-19 15:16           ` Christoph Lameter
2012-03-19 15:23             ` Peter Zijlstra
2012-03-19 15:31               ` Christoph Lameter
2012-03-19 17:09                 ` Peter Zijlstra
2012-03-19 17:28                   ` Peter Zijlstra
2012-03-19 19:06                   ` Christoph Lameter
2012-03-19 20:28                   ` Lee Schermerhorn
2012-03-19 21:21                     ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 11/26] mm, mpol: Lazy migrate a process/vma Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment Peter Zijlstra
2012-06-15 18:16   ` Tony Luck
2012-06-20 19:12     ` [PATCH] sched: Fix build problems when CONFIG_NUMA=y and CONFIG_SMP=n Luck, Tony
2012-03-16 14:40 ` [RFC][PATCH 13/26] sched: Implement home-node awareness Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
2012-07-07 18:26   ` Rik van Riel
2012-07-09 12:05     ` Peter Zijlstra
2012-07-09 12:23     ` Peter Zijlstra
2012-07-09 12:40       ` Peter Zijlstra
2012-07-09 14:50         ` Rik van Riel
2012-07-08 18:35   ` Rik van Riel
2012-07-09 12:25     ` Peter Zijlstra
2012-07-09 14:54       ` Rik van Riel
2012-07-12 22:02   ` Rik van Riel
2012-07-13 14:45     ` Don Morris
2012-07-14 16:20       ` Rik van Riel
2012-03-16 14:40 ` [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks Peter Zijlstra
2012-03-19 12:16   ` Srivatsa S. Bhat
2012-03-19 12:19     ` Peter Zijlstra
2012-03-19 12:27       ` Srivatsa S. Bhat
2012-03-16 14:40 ` [RFC][PATCH 16/26] sched, numa: Abstract the numa_entity Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 17/26] srcu: revert1 Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 18/26] srcu: revert2 Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 19/26] srcu: Implement call_srcu() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 20/26] mm, mpol: Introduce vma_dup_policy() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 21/26] mm, mpol: Introduce vma_put_policy() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 22/26] mm, mpol: Split and explose some mempolicy functions Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 23/26] sched, numa: Introduce sys_numa_{t,m}bind() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 24/26] mm, mpol: Implement numa_group RSS accounting Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities Peter Zijlstra
2012-07-08 18:34   ` Rik van Riel
2012-07-09 12:26     ` Peter Zijlstra
2012-07-09 14:53       ` Rik van Riel
2012-07-09 14:55         ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 26/26] sched, numa: A few debug bits Peter Zijlstra
2012-03-16 18:25 ` Andrea Arcangeli [this message]
2012-03-19 18:47   ` [RFC] AutoNUMA alpha6 Peter Zijlstra
2012-03-19 19:02     ` Andrea Arcangeli
2012-03-20 23:41   ` Dan Smith
2012-03-21  1:00     ` Andrea Arcangeli
2012-03-21  2:12     ` Andrea Arcangeli
2012-03-21  4:01       ` Dan Smith
2012-03-21 12:49         ` Andrea Arcangeli
2012-03-21 22:05           ` Dan Smith
2012-03-21 22:52             ` Andrea Arcangeli
2012-03-21 23:13               ` Dan Smith
2012-03-21 23:41                 ` Andrea Arcangeli
2012-03-22  0:17               ` Andrea Arcangeli
2012-03-22 13:58                 ` Dan Smith
2012-03-22 14:27                   ` Andrea Arcangeli
2012-03-22 18:49                     ` Andrea Arcangeli
2012-03-22 18:56                       ` Dan Smith
2012-03-22 19:11                         ` Andrea Arcangeli
2012-03-23 14:15                         ` Andrew Theurer
2012-03-23 16:01                           ` Andrea Arcangeli
2012-03-25 13:30                         ` Andrea Arcangeli
2012-03-21  7:12       ` Ingo Molnar
2012-03-21 12:08         ` Andrea Arcangeli
2012-03-21  7:53     ` Ingo Molnar
2012-03-21 12:17       ` Andrea Arcangeli
2012-03-19  9:57 ` [RFC][PATCH 00/26] sched/numa Avi Kivity
2012-03-19 11:12   ` Peter Zijlstra
2012-03-19 11:30     ` Peter Zijlstra
2012-03-19 11:39     ` Peter Zijlstra
2012-03-19 11:42     ` Avi Kivity
2012-03-19 11:59       ` Peter Zijlstra
2012-03-19 12:07         ` Avi Kivity
2012-03-19 12:09       ` Peter Zijlstra
2012-03-19 12:16         ` Avi Kivity
2012-03-19 20:03           ` Peter Zijlstra
2012-03-20 10:18             ` Avi Kivity
2012-03-20 10:48               ` Peter Zijlstra
2012-03-20 10:52                 ` Avi Kivity
2012-03-20 11:07                   ` Peter Zijlstra
2012-03-20 11:48                     ` Avi Kivity
2012-03-19 12:20       ` Peter Zijlstra
2012-03-19 12:24         ` Avi Kivity
2012-03-19 15:44           ` Avi Kivity
2012-03-19 13:40       ` Andrea Arcangeli
2012-03-19 20:06         ` Peter Zijlstra
2012-03-19 13:04     ` Andrea Arcangeli
2012-03-19 13:26       ` Peter Zijlstra
2012-03-19 13:57         ` Andrea Arcangeli
2012-03-19 14:06           ` Avi Kivity
2012-03-19 14:30             ` Andrea Arcangeli
2012-03-19 18:42               ` Peter Zijlstra
2012-03-20 22:18                 ` Rik van Riel
2012-03-21 16:50                   ` Andrea Arcangeli
2012-04-02 16:34                   ` Pekka Enberg
2012-04-02 16:55                     ` Rik van Riel
2012-04-02 16:54                       ` Pekka Enberg
2012-04-02 17:12                         ` Pekka Enberg
2012-04-02 17:23                           ` Pekka Enberg
2012-03-19 14:07           ` Peter Zijlstra
2012-03-19 14:34             ` Andrea Arcangeli
2012-03-19 18:41               ` Peter Zijlstra
2012-03-19 19:13           ` Peter Zijlstra
2012-03-19 14:07         ` Andrea Arcangeli
2012-03-19 19:05           ` Peter Zijlstra
2012-03-19 13:26       ` Peter Zijlstra
2012-03-19 14:16         ` Andrea Arcangeli
2012-03-19 13:29       ` Peter Zijlstra
2012-03-19 14:19         ` Andrea Arcangeli
2012-03-19 13:39       ` Peter Zijlstra
2012-03-19 14:20         ` Andrea Arcangeli
2012-03-19 20:17           ` Christoph Lameter
2012-03-19 20:28             ` Ingo Molnar
2012-03-19 20:43               ` Christoph Lameter
2012-03-19 21:34                 ` Ingo Molnar
2012-03-20  0:05               ` Linus Torvalds
2012-03-20  7:31                 ` Ingo Molnar
2012-03-21 22:53 ` Nish Aravamudan
2012-03-22  9:45   ` Peter Zijlstra
2012-03-22 10:34     ` Ingo Molnar
2012-03-24  1:41     ` Nish Aravamudan
2012-03-26 11:42       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120316182511.GJ24602@redhat.com \
    --to=aarcange@redhat.com \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=bharata.rao@gmail.com \
    --cc=danms@us.ibm.com \
    --cc=efault@gmx.de \
    --cc=hannes@cmpxchg.org \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=pjt@google.com \
    --cc=riel@redhat.com \
    --cc=suresh.b.siddha@intel.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).