[PATCH 00/33] Latest numa/core release, v17

* [PATCH 00/33] Latest numa/core release, v17
@ 2012-11-22 22:49 Ingo Molnar
  2012-11-22 22:49 ` [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
                   ` (34 more replies)
  0 siblings, 35 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This release mainly addresses one of the regressions Linus
(rightfully) complained about: the "4x JVM" SPECjbb run.

[ Note to testers: if possible please still run with
  CONFIG_TRANSPARENT_HUGEPAGE=y, to avoid the
  !THP regression that is still not fully fixed.
  It will be fixed next. ]

The new 4x JVM results on a 4-node, 32-CPU, 64 GB RAM system
(240-second run, 4 JVM instances with 8 warehouses each):

     spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
     spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
     spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
     spec4.txt:           throughput =     171383.52 SPECjbb2005 bops
    
This is close to (but does not yet completely match) the hard-binding
performance figures.
 
Mainline has the following 4x JVM performance:
    
     spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
     spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
     spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
     spec4.txt:           throughput =     157873.86 SPECjbb2005 bops

This result is achieved through the following patches:

  sched: Introduce staged average NUMA faults
  sched: Track groups of shared tasks
  sched: Use the best-buddy 'ideal cpu' in balancing decisions
  sched, mm, mempolicy: Add per task mempolicy
  sched: Average the fault stats longer
  sched: Use the ideal CPU to drive active balancing
  sched: Add hysteresis to p->numa_shared
  sched: Track shared task's node groups and interleave their memory allocations

These patches make increasing use of the shared/private access
pattern distinction between tasks.

Automatic, task group accurate interleaving of memory is the
most important new placement optimization feature in -v17.

It works by first implementing a task CPU placement feature:

    Using our shared/private distinction to allow the separate
    handling of 'private' versus 'shared' workloads, we enable
    the active-balancing of them:
    
     - private tasks, via the sched_update_ideal_cpu_private() function,
       try to 'spread' across the system as evenly as possible.
    
     - shared-access tasks that also share their mm (threads), via the
       sched_update_ideal_cpu_shared() function, try to 'compress'
       with other shared tasks on as few nodes as possible.
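
    The 'spread' versus 'compress' distinction can be sketched in a few
    lines of userspace C. This is only an illustration of the two goals,
    not the logic of sched_update_ideal_cpu_private() or
    sched_update_ideal_cpu_shared(); the toy_* names, the per-node
    counters and NR_NODES are all assumptions made up for the example.

```c
#include <assert.h>

#define NR_NODES 4	/* assumption: matches the 4-node test system */

/* 'Spread': a private task prefers the node running the fewest tasks. */
static int toy_pick_node_private(const int nr_running[NR_NODES])
{
	int node, best = 0;

	for (node = 1; node < NR_NODES; node++)
		if (nr_running[node] < nr_running[best])
			best = node;
	return best;
}

/* 'Compress': a shared task prefers the node hosting most of its group. */
static int toy_pick_node_shared(const int group_tasks[NR_NODES])
{
	int node, best = 0;

	for (node = 1; node < NR_NODES; node++)
		if (group_tasks[node] > group_tasks[best])
			best = node;
	return best;
}
```

    Ties go to the lowest-numbered node in both helpers, which is an
    arbitrary choice of the sketch.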
    
As tasks are tracked as distinct groups of 'shared access pattern'
tasks, they are compressed towards as few nodes as possible. While
the scheduler performs this compression, a mempolicy node mask can
be constructed almost for free - and in turn be used for the memory
allocations of the tasks.

There are two notable special cases of the interleaving:

     - if a group of shared tasks fits on a single node. In this case
       the interleaving happens on a single bit, a single node and thus
       turns into nice node-local allocations.
    
     - if a large group spans the whole system: in this case the node
       masks will cover the whole system, and all memory gets evenly
       interleaved and available RAM bandwidth gets utilized. This is
       preferable to allocating memory asymmetrically and overloading
       certain CPU links and running into their bandwidth limitations.
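
Both special cases fall out of plain round-robin over a node mask. A
minimal userspace sketch, assuming a 4-node system and a hypothetical
helper next_interleave_node() (this is not the kernel's mempolicy code):

```c
#include <assert.h>

#define MAX_NODES 4	/* assumption: the 4-node test system from above */

/*
 * Hypothetical helper: return the next node set in 'mask' after 'prev',
 * wrapping around. 'mask' must have at least one valid bit set, and
 * 'prev' is a node number, or -1 before the first allocation.
 */
static int next_interleave_node(unsigned int mask, int prev)
{
	int i;

	assert(mask & ((1u << MAX_NODES) - 1));
	assert(prev >= -1 && prev < MAX_NODES);

	for (i = 1; i <= MAX_NODES; i++) {
		int node = (prev + i) % MAX_NODES;

		if (mask & (1u << node))
			return node;
	}
	return prev;	/* not reached with a non-empty mask */
}
```

With a single-bit mask every call returns the same node, so allocations
stay node-local. With all bits set the calls cycle through every node in
turn, which spreads allocations evenly.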

"Private" and non-NUMA tasks on the other hand are not affected and
continue to do efficient node-local allocations.

With this approach we avoid most of the 'threading means shared access
patterns' heuristics that AutoNUMA uses, by automatically separating
out threads that have a private working set and not binding them to
the other threads forcibly.

The thread group heuristics are not completely eliminated though, as
can be seen in the "sched: Use the ideal CPU to drive active balancing"
patch. It's not hard-coded into the design in any case and could be
extended to other task group information: the automatic NUMA balancing
of cgroups for example.
 
Thanks,

    Ingo

-------------------->

Andrea Arcangeli (1):
  numa, mm: Support NUMA hinting page faults from gup/gup_fast

Ingo Molnar (14):
  mm: Optimize the TLB flush of sys_mprotect() and change_protection()
    users
  sched, mm, numa: Create generic NUMA fault infrastructure, with
    architectures overrides
  sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
  sched, numa, mm: Interleave shared tasks
  sched: Implement NUMA scanning backoff
  sched: Improve convergence
  sched: Introduce staged average NUMA faults
  sched: Track groups of shared tasks
  sched: Use the best-buddy 'ideal cpu' in balancing decisions
  sched, mm, mempolicy: Add per task mempolicy
  sched: Average the fault stats longer
  sched: Use the ideal CPU to drive active balancing
  sched: Add hysteresis to p->numa_shared
  sched: Track shared task's node groups and interleave their memory
    allocations

Mel Gorman (1):
  mm/migration: Improve migrate_misplaced_page()

Peter Zijlstra (11):
  mm: Count the number of pages affected in change_protection()
  sched, numa, mm: Add last_cpu to page flags
  sched: Make find_busiest_queue() a method
  sched, numa, mm: Describe the NUMA scheduling problem formally
  mm/migrate: Introduce migrate_misplaced_page()
  sched, numa, mm, arch: Add variable locality exception
  sched, numa, mm: Add the scanning page fault machinery
  sched: Add adaptive NUMA affinity support
  sched: Implement constant, per task Working Set Sampling (WSS) rate
  sched, numa, mm: Count WS scanning against present PTEs, not virtual
    memory ranges
  sched: Implement slow start for working set sampling

Rik van Riel (6):
  mm/generic: Only flush the local TLB in ptep_set_access_flags()
  x86/mm: Only do a local tlb flush in ptep_set_access_flags()
  x86/mm: Introduce pte_accessible()
  mm: Only flush the TLB when clearing an accessible pte
  x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
  sched, numa, mm: Add credits for NUMA placement

 CREDITS                                  |    1 +
 Documentation/scheduler/numa-problem.txt |  236 +++++
 arch/sh/mm/Kconfig                       |    1 +
 arch/x86/Kconfig                         |    2 +
 arch/x86/include/asm/pgtable.h           |    6 +
 arch/x86/mm/pgtable.c                    |    8 +-
 include/asm-generic/pgtable.h            |   59 ++
 include/linux/huge_mm.h                  |   12 +
 include/linux/hugetlb.h                  |    8 +-
 include/linux/init_task.h                |    8 +
 include/linux/mempolicy.h                |   47 +-
 include/linux/migrate.h                  |    7 +
 include/linux/mm.h                       |   99 +-
 include/linux/mm_types.h                 |   50 +
 include/linux/mmzone.h                   |   14 +-
 include/linux/page-flags-layout.h        |   83 ++
 include/linux/sched.h                    |   54 +-
 init/Kconfig                             |   81 ++
 kernel/bounds.c                          |    4 +
 kernel/sched/core.c                      |  105 ++-
 kernel/sched/fair.c                      | 1464 ++++++++++++++++++++++++++----
 kernel/sched/features.h                  |   13 +
 kernel/sched/sched.h                     |   39 +-
 kernel/sysctl.c                          |   45 +-
 mm/Makefile                              |    1 +
 mm/huge_memory.c                         |  163 ++++
 mm/hugetlb.c                             |   10 +-
 mm/internal.h                            |    5 +-
 mm/memcontrol.c                          |    7 +-
 mm/memory.c                              |  105 ++-
 mm/mempolicy.c                           |  175 +++-
 mm/migrate.c                             |  106 ++-
 mm/mprotect.c                            |   69 +-
 mm/numa.c                                |   73 ++
 mm/pgtable-generic.c                     |    9 +-
 35 files changed, 2818 insertions(+), 351 deletions(-)
 create mode 100644 Documentation/scheduler/numa-problem.txt
 create mode 100644 include/linux/page-flags-layout.h
 create mode 100644 mm/numa.c

-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 02/33] x86/mm: Only do a local tlb flush " Ingo Molnar
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Michel Lespinasse

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags() is only ever used to upgrade
access permissions to a page - i.e. to make them less restrictive.

That means the only negative side effect of not flushing remote
TLBs in this function is that other CPUs may incur spurious page
faults, if they happen to access the same address, and still have
a PTE with the old permissions cached in their TLB caches.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault() to actually flush the TLB entry.
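
The argument above can be played through with a toy model. The sketch
below is a userspace simulation, not kernel code: struct toy_system,
toy_set_access_flags() and toy_write() are made-up names, and each CPU
caches exactly one translation.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* arbitrary, for illustration only */

struct toy_tlb_entry {
	bool valid;
	bool writable;
};

struct toy_system {
	bool pte_writable;			/* the in-memory PTE */
	struct toy_tlb_entry tlb[NR_CPUS];	/* one cached entry per CPU */
};

/* Upgrade the PTE and invalidate only the calling CPU's cached entry. */
static void toy_set_access_flags(struct toy_system *sys, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	sys->pte_writable = true;
	sys->tlb[cpu].valid = false;	/* local flush only, no IPIs */
}

/*
 * Perform a write on 'cpu' and return the number of page faults taken,
 * or -1 for a genuine fault (the PTE itself forbids the write).
 * A fault against a PTE that already permits the write is spurious:
 * the handler drops the stale local entry and the access is retried.
 */
static int toy_write(struct toy_system *sys, int cpu)
{
	int faults = 0;

	assert(cpu >= 0 && cpu < NR_CPUS);
	for (;;) {
		struct toy_tlb_entry *e = &sys->tlb[cpu];

		if (!e->valid) {	/* TLB miss: refill from the PTE */
			e->valid = true;
			e->writable = sys->pte_writable;
		}
		if (e->writable)
			return faults;

		faults++;
		if (!sys->pte_writable)
			return -1;	/* genuine fault, out of scope here */
		e->valid = false;	/* the step the spurious-fault fixup must do */
	}
}
```

A remote CPU holding a stale read-only entry takes exactly one spurious
fault and then proceeds. If the last line of the loop is removed, the
same CPU faults forever, which is the infinite loop described above.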

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
[ Changelog massage. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write 
+ * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
  * instructs the caller to do things like update__mmu_cache.  This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }
-- 
1.7.11.7


* [PATCH 02/33] x86/mm: Only do a local tlb flush in ptep_set_access_flags()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
  2012-11-22 22:49 ` [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 03/33] x86/mm: Introduce pte_accessible() Ingo Molnar
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Michel Lespinasse

From: Rik van Riel <riel@redhat.com>

Because we only ever upgrade a PTE when calling ptep_set_access_flags(),
it is safe to skip flushing entries on remote TLBs.

The worst that can happen is a spurious page fault on other CPUs, which
would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.11.7


* [PATCH 03/33] x86/mm: Introduce pte_accessible()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
  2012-11-22 22:49 ` [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
  2012-11-22 22:49 ` [PATCH 02/33] x86/mm: Only do a local tlb flush " Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Rik van Riel <riel@redhat.com>

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page.  This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Fill in this method for x86 and provide a safe (but slower) method
on other architectures.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixed-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h | 6 ++++++
 include/asm-generic/pgtable.h  | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..5fe03aa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -407,6 +407,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+	return pte_flags(a) & _PAGE_PRESENT;
+}
+
 static inline int pte_hidden(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef pte_accessible
+# define pte_accessible(pte)		((void)(pte),1)
+#endif
+
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif
-- 
1.7.11.7


* [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (2 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 03/33] x86/mm: Introduce pte_accessible() Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
                   ` (30 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Rik van Riel <riel@redhat.com>

If ptep_clear_flush() is called to clear a page table entry that is
not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.
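
The intent is that an entry the hardware could never have cached needs
no flush when it is cleared. A toy userspace model of that decision,
with illustrative flag values and made-up toy_* names (the real change
is the diff below):

```c
#include <assert.h>

/* Illustrative flag values, not copied from the kernel headers. */
#define TOY_PAGE_PRESENT	0x001UL
#define TOY_PAGE_PROTNONE	0x100UL

struct toy_pte { unsigned long flags; };

/* Mirrors the idea of pte_accessible(): could the CPU have cached this? */
static int toy_pte_accessible(struct toy_pte pte)
{
	return (pte.flags & TOY_PAGE_PRESENT) != 0;
}

/*
 * Clear the entry, return the old value, and bump *flushes only when a
 * TLB flush would actually be needed.
 */
static struct toy_pte toy_clear_flush(struct toy_pte *ptep, unsigned int *flushes)
{
	struct toy_pte old;

	assert(ptep && flushes);
	old = *ptep;
	ptep->flags = 0;
	if (toy_pte_accessible(old))
		(*flushes)++;
	return old;
}
```

Clearing a present entry counts one flush; clearing an entry that only
has the PROTNONE bit set counts none.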

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pte_t pte;
 	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	flush_tlb_page(vma, address);
+	if (pte_accessible(pte))
+		flush_tlb_page(vma, address);
 	return pte;
 }
 #endif
-- 
1.7.11.7


* [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (3 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 06/33] mm: Count the number of pages affected in change_protection() Ingo Molnar
                   ` (29 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Michel Lespinasse

From: Rik van Riel <riel@redhat.com>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this.  However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert - or a CPU model specific quirk could
be added to retain this optimization on most CPUs.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
[ Applied changelog massage and moved this last in the series,
  to create bisection distance. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.11.7


* [PATCH 06/33] mm: Count the number of pages affected in change_protection()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (4 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

This will be used for three purposes:

 - to optimize mprotect()

 - to speed up working set scanning for working set areas that
   have not been touched

 - to more accurately scan per real working set

No change in functionality from this patch.
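
The counting scheme is simple: each level of the page table walk returns
the number of pages it touched and the caller sums the results, with a
huge entry counting as all the base pages it covers. A reduced userspace
sketch with a two-level toy table (struct toy_pmd and the TOY_* sizes
are assumptions for the example, not kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTRS_PER_PTE	8	/* tiny table, for illustration */
#define TOY_HPAGE_NR		8	/* base pages covered by one huge entry */

struct toy_pmd {
	int huge;				/* maps one huge page directly */
	int present[TOY_PTRS_PER_PTE];		/* otherwise: per-PTE present bits */
};

/* Leaf level: count the PTEs whose protection would be changed. */
static unsigned long toy_change_pte_range(const struct toy_pmd *pmd)
{
	unsigned long pages = 0;
	int i;

	for (i = 0; i < TOY_PTRS_PER_PTE; i++)
		if (pmd->present[i])
			pages++;
	return pages;
}

/* Upper level: sum the counts returned by the level below. */
static unsigned long toy_change_pmd_range(const struct toy_pmd *pmds, size_t nr)
{
	unsigned long pages = 0;
	size_t i;

	assert(pmds || nr == 0);
	for (i = 0; i < nr; i++) {
		if (pmds[i].huge)
			pages += TOY_HPAGE_NR;
		else
			pages += toy_change_pte_range(&pmds[i]);
	}
	return pages;
}
```

A caller can compare the returned count against the size of the range
to tell a densely populated area from a mostly untouched one.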

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/hugetlb.h |  8 +++++--
 include/linux/mm.h      |  3 +++
 mm/hugetlb.c            | 10 +++++++--
 mm/mprotect.c           | 58 +++++++++++++++++++++++++++++++++++++------------
 4 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..06e691b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 				pud_t *pud, int write);
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
 #else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
 
-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot)
+{
+	return 0;
+}
 
 static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..8d86d5a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1078,6 +1078,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long old_len, unsigned long new_len,
 			       unsigned long flags, unsigned long new_addr);
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end, pgprot_t newprot,
+			      int dirty_accountable);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..712895e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
 	return i ? i : -EFAULT;
 }
 
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
+	unsigned long pages = 0;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
-		if (huge_pmd_unshare(mm, &address, ptep))
+		if (huge_pmd_unshare(mm, &address, ptep)) {
+			pages++;
 			continue;
+		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
 			pte = huge_ptep_get_and_clear(mm, address, ptep);
 			pte = pte_mkhuge(pte_modify(pte, newprot));
 			set_huge_pte_at(mm, address, ptep, pte);
+			pages++;
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+	return pages << h->order;
 }
 
 int hugetlb_reserve_pages(struct inode *inode,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..1e265be 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,12 +35,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 }
 #endif
 
-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
+	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -60,6 +61,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
+			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -72,18 +74,22 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				set_pte_at(mm, addr, pte,
 					swp_entry_to_pte(entry));
 			}
+			pages++;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
+
+	return pages;
 }
 
-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
+	unsigned long pages = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -91,35 +97,42 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot))
+			else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+				pages += HPAGE_PMD_NR;
 				continue;
+			}
 			/* fall through */
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+		pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
+
+	return pages;
 }
 
-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
+	unsigned long pages = 0;
 
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(vma, pud, addr, next, newprot,
+		pages += change_pmd_range(vma, pud, addr, next, newprot,
 				 dirty_accountable);
 	} while (pud++, addr = next, addr != end);
+
+	return pages;
 }
 
-static void change_protection(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -127,6 +140,7 @@ static void change_protection(struct vm_area_struct *vma,
 	pgd_t *pgd;
 	unsigned long next;
 	unsigned long start = addr;
+	unsigned long pages = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
@@ -135,10 +149,30 @@ static void change_protection(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(vma, pgd, addr, next, newprot,
+		pages += change_pud_range(vma, pgd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
+
 	flush_tlb_range(vma, start, end);
+
+	return pages;
+}
+
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, pgprot_t newprot,
+		       int dirty_accountable)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long pages;
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	if (is_vm_hugetlb_page(vma))
+		pages = hugetlb_change_protection(vma, start, end, newprot);
+	else
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+
+	return pages;
 }
 
 int
@@ -213,12 +247,8 @@ success:
 		dirty_accountable = 1;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
-	else
-		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	perf_event_mmap(vma);
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (5 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 06/33] mm: Count the number of pages affected in change_protection() Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Reuse the NUMA code's 'modified page protections' count that
change_protection() computes and skip the TLB flush if there
are no changes to a range that sys_mprotect() modifies.

Given that mprotect() already optimizes the same-flags case
I expected this optimization to trigger predominantly on
CONFIG_NUMA_BALANCING=y kernels - but even with that feature
disabled it triggers rather often.

There are two reasons for that:

1)

sys_mprotect() already optimizes the same-flag case:

        if (newflags == oldflags) {
                *pprev = vma;
                return 0;
        }

This test works in many cases, but it is too fine-grained in
others, where it differentiates between protection values that the
underlying PTE format makes no distinction about, such as
PROT_EXEC == PROT_READ on x86.

2)

Even where a vma flag change does necessitate a modification of
the pagetables under the pte format, there might be no pagetables
yet to modify: they might not be instantiated yet.

During a regular desktop bootup this optimization hits a couple
of hundred times. During a Java test I measured thousands of
hits.

So this optimization improves sys_mprotect() in general, not just
CONFIG_NUMA_BALANCING=y kernels.

[ We could further increase the efficiency of this optimization if
  change_pte_range() and change_huge_pmd() were a bit smarter about
  recognizing exact-same-value protection masks - when the hardware
  can do that safely. This would probably further speed up mprotect(). ]
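
As a rough user-space illustration of the idea (not the kernel
code itself), the sketch below walks an array of toy page-table
entries, counts the present ones it rewrites, and only performs the
simulated flush when that count is non-zero. All toy_* names, the
bit layout and the flush counter are assumptions made up for this
example:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a page-table entry (hypothetical layout). */
typedef uint32_t toy_pte_t;

#define TOY_PTE_PRESENT 0x1u	/* entry is instantiated */
#define TOY_PROT_MASK   0x6u	/* two made-up protection bits */

/* Counts how often the simulated TLB flush was issued. */
static unsigned long toy_flush_count;

static void toy_flush_tlb_range(void)
{
	toy_flush_count++;
}

/*
 * Rewrite the protection bits of every present entry and return
 * the number of entries touched. Like the patch, flush only when
 * at least one entry was actually modified.
 */
static unsigned long toy_change_protection(toy_pte_t *ptes, size_t n,
					   uint32_t newprot)
{
	unsigned long pages = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(ptes[i] & TOY_PTE_PRESENT))
			continue;	/* nothing instantiated here yet */
		ptes[i] = (ptes[i] & ~TOY_PROT_MASK) | (newprot & TOY_PROT_MASK);
		pages++;
	}

	/* Only flush the TLB if we actually modified any entries: */
	if (pages)
		toy_flush_tlb_range();

	return pages;
}
```

A range without any instantiated entries returns 0 and leaves
toy_flush_count untouched, which corresponds to case 2) above.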

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e265be..7c3628a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -153,7 +153,9 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);

-	flush_tlb_range(vma, start, end);
+	/* Only flush the TLB if we actually modified any entries: */
+	if (pages)
+		flush_tlb_range(vma, start, end);

 	return pages;
 }
-- 
1.7.11.7


* [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (6 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides Ingo Molnar
                   ` (26 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Introduce a per-page last_cpu field and fold it into the struct
page::flags field whenever possible.

The unlikely/rare 32bit NUMA configs will likely grow the page-frame.

[ Completely dropping 32bit support for CONFIG_NUMA_BALANCING would simplify
  things, but it would also remove the warning if we grow enough 64bit
  only page-flags to push the last-cpu out. ]
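
To illustrate the folding, here is a minimal user-space sketch of
packing a small CPU number into a flags word and exchanging it with
a compare-and-swap retry loop. It uses C11 atomics instead of the
kernel's cmpxchg(), and the toy_* names, the field width and the
shift value are assumptions chosen for the example, not the values
the kernel would compute:

```c
#include <stdatomic.h>

/* Assumed layout for this sketch only. */
#define TOY_LAST_CPU_WIDTH	6	/* room for 64 CPUs */
#define TOY_LAST_CPU_PGSHIFT	20	/* bit position inside the word */
#define TOY_LAST_CPU_MASK	((1UL << TOY_LAST_CPU_WIDTH) - 1)

struct toy_page {
	_Atomic unsigned long flags;
};

/* Read the CPU number currently stored in the flags word. */
static int toy_page_last_cpu(struct toy_page *page)
{
	unsigned long flags = atomic_load(&page->flags);

	return (int)((flags >> TOY_LAST_CPU_PGSHIFT) & TOY_LAST_CPU_MASK);
}

/*
 * Store a new CPU number and return the previous one. The other
 * bits of the word are preserved; if another thread changes the
 * word in between, the compare-and-swap fails, old_flags is
 * refreshed with the current value and the loop retries.
 */
static int toy_page_xchg_last_cpu(struct toy_page *page, int cpu)
{
	unsigned long old_flags = atomic_load(&page->flags);
	unsigned long new_flags;

	do {
		new_flags = old_flags &
			~(TOY_LAST_CPU_MASK << TOY_LAST_CPU_PGSHIFT);
		new_flags |= ((unsigned long)cpu & TOY_LAST_CPU_MASK)
			<< TOY_LAST_CPU_PGSHIFT;
	} while (!atomic_compare_exchange_weak(&page->flags,
					       &old_flags, new_flags));

	return (int)((old_flags >> TOY_LAST_CPU_PGSHIFT) & TOY_LAST_CPU_MASK);
}
```

The retry loop is what allows the field to share a word with
unrelated flag bits that may be updated concurrently.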

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h                | 90 +++++++++++++++++++++------------------
 include/linux/mm_types.h          |  5 +++
 include/linux/mmzone.h            | 14 +-----
 include/linux/page-flags-layout.h | 83 ++++++++++++++++++++++++++++++++++++
 kernel/bounds.c                   |  4 ++
 mm/memory.c                       |  4 ++
 6 files changed, 146 insertions(+), 54 deletions(-)
 create mode 100644 include/linux/page-flags-layout.h

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8d86d5a..5fc1d46 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,50 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out.  The first is for the normal case, without
- * sparsemem.  The second is for sparsemem when there is
- * plenty of space for node and section.  The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH		SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH		0
-#endif
-
-#define ZONES_WIDTH		ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH		NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH		0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPU] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there.  This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_CPU_PGOFF		(ZONES_PGOFF - LAST_CPU_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -634,6 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_CPU_PGSHIFT	(LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -655,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_CPU_MASK		((1UL << LAST_CPU_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -693,6 +656,51 @@ static inline int page_to_nid(const struct page *page)
 }
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+	return xchg(&page->_last_cpu, cpu);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+	return page->_last_cpu;
+}
+#else
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+	unsigned long old_flags, flags;
+	int last_cpu;
+
+	do {
+		old_flags = flags = page->flags;
+		last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+
+		flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
+		flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
+	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+	return last_cpu;
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+	return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+}
+#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_NUMA_BALANCING */
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+	return page_to_nid(page);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+	return page_to_nid(page);
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..7e9f758 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -175,6 +176,10 @@ struct page {
 	 */
 	void *shadow;
 #endif
+
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+	int _last_cpu;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..7e116ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -15,7 +15,7 @@
 #include <linux/seqlock.h>
 #include <linux/nodemask.h>
 #include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
 #include <linux/atomic.h>
 #include <asm/page.h>
 
@@ -318,16 +318,6 @@ enum zone_type {
  * match the requested limits. See gfp_zone() in include/linux/gfp.h
  */
 
-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -1030,8 +1020,6 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
  * PA_SECTION_SHIFT		physical address to/from section number
  * PFN_SECTION_SHIFT		pfn to/from section number
  */
-#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
 #define PA_SECTION_SHIFT	(SECTION_SIZE_BITS)
 #define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
 
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
new file mode 100644
index 0000000..b258132
--- /dev/null
+++ b/include/linux/page-flags-layout.h
@@ -0,0 +1,83 @@
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/* 
+ * SECTION_SHIFT    		#bits space required to store a section #
+ */
+#define SECTIONS_SHIFT         (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out.  The first
+ * (and second) is for the normal case, without sparsemem. The third is for
+ * sparsemem when there is plenty of space for node and section. The last is
+ * when we have run out of space and have to fall back to an alternate (slower)
+ * way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |            ... | FLAGS |
+ *     "      plus space for last_cpu:|       NODE     | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |            ... | FLAGS |
+ *     "      plus space for last_cpu:| SECTION | NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse no space for node:  | SECTION |     ZONE    |            ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH		SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH		0
+#endif
+
+#define ZONES_WIDTH		ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH		NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH		0
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+#define LAST_CPU_SHIFT	NR_CPUS_BITS
+#else
+#define LAST_CPU_SHIFT	0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPU_WIDTH	LAST_CPU_SHIFT
+#else
+#define LAST_CPU_WIDTH	0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if its in
+ * there.  This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPU_WIDTH == 0
+#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
 #include <linux/mmzone.h>
 #include <linux/kbuild.h>
 #include <linux/page_cgroup.h>
+#include <linux/log2.h>
 
 void foo(void)
 {
@@ -17,5 +18,8 @@ void foo(void)
 	DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
 	DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
 	DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
 	/* End of constants */
 }
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..24d3a4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -67,6 +67,10 @@
 
 #include "internal.h"
 
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#endif
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
-- 
1.7.11.7


* [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (7 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 10/33] sched: Make find_busiest_queue() a method Ingo Molnar
                   ` (25 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This patch is based on patches written by multiple people:

   Hugh Dickins <hughd@google.com>
   Johannes Weiner <hannes@cmpxchg.org>
   Peter Zijlstra <a.p.zijlstra@chello.nl>

Of the "mm/mpol: Create special PROT_NONE infrastructure" patch
and its variants.

I have reworked the code so significantly that I had to
drop the acks and signoffs.

In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant. We can then use the
resulting 'spurious' protection faults to drive our migrations.

Pages that already had an effective PROT_NONE mapping will not
generate these 'spurious' faults, for the simple reason that we
cannot distinguish them by their protection bits, see pte_numa().

This isn't a problem since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) mappings aren't used or are rare enough for us to
not care about their placement.

Architectures can set the CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT Kconfig
variable, in which case they get the PROT_NONE variant. Alternatively
they can provide the basic primitives themselves:

  bool pte_numa(struct vm_area_struct *vma, pte_t pte);
  pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
  bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
  pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot);
  unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);

[ This non-generic angle is untested though. ]
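
To make the 'cannot distinguish them by their protection bits'
point concrete, here is a toy user-space model of the PROT_NONE
variant. It is a sketch under made-up assumptions (a 3-bit
protection field, toy_* names, PROT_NONE encoded as 0) and is not
meant to describe what any architecture actually implements:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy pte: low 3 bits are protection, the rest is the frame number. */
typedef uint32_t toy_pte_t;

#define TOY_PROT_MASK	0x7u
#define TOY_PROT_NONE	0x0u

/* Toy vma: only the regular protection of the mapping matters here. */
struct toy_vma {
	uint32_t page_prot;
};

/* Replace the protection bits, keep the frame number. */
static toy_pte_t toy_pte_modify(toy_pte_t pte, uint32_t prot)
{
	return (pte & ~TOY_PROT_MASK) | (prot & TOY_PROT_MASK);
}

/*
 * A pte counts as a NUMA entry when it carries the PROT_NONE
 * protection although the vma's regular protection is different.
 * If the vma itself is PROT_NONE the two cases look identical,
 * so such a pte is never reported as a NUMA entry.
 */
static bool toy_pte_numa(const struct toy_vma *vma, toy_pte_t pte)
{
	if (pte == toy_pte_modify(pte, vma->page_prot))
		return false;

	return pte == toy_pte_modify(pte, TOY_PROT_NONE);
}

/* Turn a pte into a NUMA entry by dropping all access rights. */
static toy_pte_t toy_pte_mknuma(const struct toy_vma *vma, toy_pte_t pte)
{
	(void)vma;	/* unused in this toy model */
	return toy_pte_modify(pte, TOY_PROT_NONE);
}
```

In this model a fault handler would restore access with
toy_pte_modify(pte, vma->page_prot), mirroring the 'change back to
regular protection' step in the fault path.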

Original-Idea-by: Rik van Riel <riel@redhat.com>
Also-From: Johannes Weiner <hannes@cmpxchg.org>
Also-From: Hugh Dickins <hughd@google.com>
Also-From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/pgtable.h |  55 ++++++++++++++
 include/linux/huge_mm.h       |  12 ++++
 include/linux/mempolicy.h     |   6 ++
 include/linux/migrate.h       |   5 ++
 include/linux/mm.h            |   5 ++
 include/linux/sched.h         |   2 +
 init/Kconfig                  |  22 ++++++
 mm/Makefile                   |   1 +
 mm/huge_memory.c              | 162 ++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h                 |   5 +-
 mm/memcontrol.c               |   7 +-
 mm/memory.c                   |  85 ++++++++++++++++++++--
 mm/migrate.c                  |   2 +-
 mm/mprotect.c                 |   7 --
 mm/numa.c                     |  73 +++++++++++++++++++
 15 files changed, 430 insertions(+), 19 deletions(-)
 create mode 100644 mm/numa.c

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..d03d0a8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -537,6 +537,61 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 }
 
 /*
+ * Is this pte used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern bool pte_numa(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_numa
+static inline bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+	return false;
+}
+# endif
+#endif
+
+/*
+ * Turn a pte into a NUMA entry:
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_mknuma
+static inline pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+	return pte;
+}
+# endif
+#endif
+
+/*
+ * Is this pmd used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+#else
+# ifndef pmd_numa
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+	return false;
+}
+# endif
+#endif
+
+/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+	return newprot;
+}
+#endif
+
+/*
  * This is a noop if Transparent Hugepage Support is not built into
  * the kernel. Otherwise it is equivalent to
  * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..7f5a552 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -197,4 +197,16 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_NUMA_BALANCING_HUGEPAGE
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags, pmd_t orig_pmd);
+#else
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					 unsigned long address, pmd_t *pmd,
+					 unsigned int flags, pmd_t orig_pmd)
+{
+}
+#endif
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..f329306 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -324,4 +324,10 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 }
 
 #endif /* CONFIG_NUMA */
+
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..afd9af1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -64,4 +64,9 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define fail_migrate_page NULL
 
 #endif /* CONFIG_MIGRATION */
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5fc1d46..246375c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1559,6 +1559,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern unsigned long
+change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+#endif
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1581a0..a0a2808 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1575,6 +1575,8 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+static inline void task_numa_fault(int node, int cpu, int pages) { }
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..f36c83d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,28 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# Helper Kconfig switches to express compound feature dependencies
+# and thus make the .h/.c code more readable:
+#
+config NUMA_BALANCING_HUGEPAGE
+	bool
+	default y
+	depends on NUMA_BALANCING
+	depends on TRANSPARENT_HUGEPAGE
+
+config ARCH_USES_NUMA_GENERIC_PGPROT
+	bool
+	default y
+	depends on ARCH_WANTS_NUMA_GENERIC_PGPROT
+	depends on NUMA_BALANCING
+
+config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+	bool
+	default y
+	depends on ARCH_USES_NUMA_GENERIC_PGPROT
+	depends on TRANSPARENT_HUGEPAGE
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..26f7574 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT) += numa.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..814e3ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -725,6 +726,165 @@ out:
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Handle a NUMA fault: check whether we should migrate and
+ * mark it accessible again.
+ */
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   unsigned int flags, pmd_t entry)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct mem_cgroup *memcg = NULL;
+	struct page *new_page;
+	struct page *page = NULL;
+	int last_cpu;
+	int node = -1;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry)))
+		goto unlock;
+
+	if (unlikely(pmd_trans_splitting(entry))) {
+		spin_unlock(&mm->page_table_lock);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return;
+	}
+
+	page = pmd_page(entry);
+	if (page) {
+		int page_nid = page_to_nid(page);
+
+		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+		last_cpu = page_last_cpu(page);
+
+		get_page(page);
+		/*
+		 * Note that migrating pages shared by others is safe, since
+		 * get_user_pages() or GUP fast would have to fault this page
+		 * present before they could proceed, and we are holding the
+		 * pagetable lock here and are mindful of pmd races below.
+		 */
+		node = mpol_misplaced(page, vma, haddr);
+		if (node != -1 && node != page_nid)
+			goto migrate;
+	}
+
+fixup:
+	/* change back to regular protection */
+	entry = pmd_modify(entry, vma->vm_page_prot);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
+	spin_unlock(&mm->page_table_lock);
+	if (page) {
+		task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
+		put_page(page);
+	}
+	return;
+
+migrate:
+	spin_unlock(&mm->page_table_lock);
+
+	lock_page(page);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+		unlock_page(page);
+		put_page(page);
+		return;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_pages_node(node,
+	    (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+	if (!new_page)
+		goto alloc_fail;
+
+	if (isolate_lru_page(page)) {	/* Does an implicit get_page() */
+		put_page(new_page);
+		goto alloc_fail;
+	}
+
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
+
+	/* anon mapping, we can simply copy page->mapping to the new page: */
+	new_page->mapping = page->mapping;
+	new_page->index = page->index;
+
+	migrate_page_copy(new_page, page);
+
+	WARN_ON(PageLRU(new_page));
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+
+		/* Reverse changes made by migrate_page_copy() */
+		if (TestClearPageActive(new_page))
+			SetPageActive(page);
+		if (TestClearPageUnevictable(new_page))
+			SetPageUnevictable(page);
+		mlock_migrate_page(page, new_page);
+
+		unlock_page(new_page);
+		put_page(new_page);		/* Free it */
+
+		unlock_page(page);
+		putback_lru_page(page);
+		put_page(page);			/* Drop the local reference */
+		return;
+	}
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+	entry = mk_pmd(new_page, vma->vm_page_prot);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = pmd_mkhuge(entry);
+
+	page_add_new_anon_rmap(new_page, vma, haddr);
+
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
+	spin_unlock(&mm->page_table_lock);
+
+	task_numa_fault(node, last_cpu, HPAGE_PMD_NR);
+
+	unlock_page(new_page);
+	unlock_page(page);
+	put_page(page);			/* Drop the rmap reference */
+	put_page(page);			/* Drop the LRU isolation reference */
+	put_page(page);			/* Drop the local reference */
+	return;
+
+alloc_fail:
+	unlock_page(page);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		put_page(page);
+		page = NULL;
+		goto unlock;
+	}
+	goto fixup;
+}
+#endif
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
@@ -1363,6 +1523,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(vma, *pmd))
+				entry = pte_mknuma(vma, entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..b84d571 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -212,11 +212,12 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 {
 	if (TestClearPageMlocked(page)) {
 		unsigned long flags;
+		int nr_pages = hpage_nr_pages(page);
 
 		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
+		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		SetPageMlocked(newpage);
-		__inc_zone_page_state(newpage, NR_MLOCK);
+		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
 		local_irq_restore(flags);
 	}
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
diff --git a/mm/memory.c b/mm/memory.c
index 24d3a4a..b9bb15c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3437,6 +3438,69 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pmd_t *pmd,
+			unsigned int flags, pte_t entry)
+{
+	struct page *page = NULL;
+	int node, page_nid = -1;
+	int last_cpu = -1;
+	spinlock_t *ptl;
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, entry)))
+		goto out_unlock;
+
+	page = vm_normal_page(vma, address, entry);
+	if (page) {
+		get_page(page);
+		page_nid = page_to_nid(page);
+		last_cpu = page_last_cpu(page);
+		node = mpol_misplaced(page, vma, address);
+		if (node != -1 && node != page_nid)
+			goto migrate;
+	}
+
+out_pte_upgrade_unlock:
+	flush_cache_page(vma, address, pte_pfn(entry));
+
+	ptep_modify_prot_start(mm, address, ptep);
+	entry = pte_modify(entry, vma->vm_page_prot);
+	ptep_modify_prot_commit(mm, address, ptep, entry);
+
+	/* No TLB flush needed because we upgraded the PTE */
+
+	update_mmu_cache(vma, address, ptep);
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+out:
+	if (page) {
+		task_numa_fault(page_nid, last_cpu, 1);
+		put_page(page);
+	}
+
+	return 0;
+
+migrate:
+	pte_unmap_unlock(ptep, ptl);
+
+	if (!migrate_misplaced_page(page, node)) {
+		page_nid = node;
+		goto out;
+	}
+
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_same(*ptep, entry)) {
+		put_page(page);
+		page = NULL;
+		goto out_unlock;
+	}
+
+	goto out_pte_upgrade_unlock;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3475,6 +3539,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(vma, entry))
+		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3539,13 +3606,16 @@ retry:
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
-		int ret;
+		int ret = 0;
 
 		barrier();
-		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+		if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(vma, orig_pmd)) {
+				do_huge_pmd_numa_page(mm, vma, address, pmd,
+						      flags, orig_pmd);
+			}
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3555,12 +3625,13 @@ retry:
 				 */
 				if (unlikely(ret & VM_FAULT_OOM))
 					goto retry;
-				return ret;
 			}
-			return 0;
+
+			return ret;
 		}
 	}
 
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..4ba45f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -407,7 +407,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	if (PageHuge(page))
+	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
 		copy_highpage(newpage, page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7c3628a..6ff2d5e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,13 +28,6 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
-	return newprot;
-}
-#endif
-
 static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
diff --git a/mm/numa.c b/mm/numa.c
new file mode 100644
index 0000000..8d18800
--- /dev/null
+++ b/mm/numa.c
@@ -0,0 +1,73 @@
+/*
+ * Generic NUMA page table entry support. This code reuses
+ * PROT_NONE: an architecture can choose to use its own
+ * implementation, by setting CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
+ * and not setting CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT.
+ */
+#include <linux/mm.h>
+
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+	/*
+	 * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+	 */
+	vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+
+	return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+	/*
+	 * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+	 * "normal" vma->vm_page_prot protections.  Genuine PROT_NONE
+	 * VMAs should never get here, because the fault handling code
+	 * will notice that the VMA has no read or write permissions.
+	 *
+	 * This means we cannot get 'special' PROT_NONE faults from genuine
+	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+	 * tracking.
+	 *
+	 * Neither case is really interesting for our current use though so we
+	 * don't care.
+	 */
+	if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+		return false;
+
+	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
+pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+	return pte_modify(pte, vma_prot_none(vma));
+}
+
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+	/*
+	 * See pte_numa() above
+	 */
+	if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+		return false;
+
+	return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+#endif
+
+/*
+ * The scheduler uses this function to mark a range of virtual
+ * memory inaccessible to user-space, for the purposes of probing
+ * the composition of the working set.
+ *
+ * The resulting page faults will be demultiplexed into:
+ *
+ *    mm/memory.c::do_numa_page()
+ *    mm/huge_memory.c::do_huge_pmd_numa_page()
+ *
+ * This generic version simply uses PROT_NONE.
+ */
+unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	return change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
-- 
1.7.11.7
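
The two pte_same() comparisons in pte_numa() above can be modelled in user space. The following is only a toy sketch: the `toy_` names, the 32-bit entry layout and the three-bit protection mask are assumptions made for illustration, and unlike vma_prot_none() the toy PROT_NONE value is a constant rather than being derived from the VMA flags.

```c
#include <stdint.h>

/* Hypothetical layout: the low three bits hold the protection, the rest the frame. */
#define TOY_PROT_MASK	0x7u
#define TOY_PROT_NONE	0x0u

typedef uint32_t toy_pte_t;

/* Replace only the protection bits, like pte_modify() does for a real pte. */
static toy_pte_t toy_pte_modify(toy_pte_t pte, uint32_t prot)
{
	return (pte & ~TOY_PROT_MASK) | (prot & TOY_PROT_MASK);
}

/* Mark an entry for NUMA hinting by downgrading it to PROT_NONE. */
static toy_pte_t toy_pte_mknuma(toy_pte_t pte)
{
	return toy_pte_modify(pte, TOY_PROT_NONE);
}

/*
 * An entry counts as a NUMA hinting entry only if it differs from what the
 * VMA's normal protection would give and matches the PROT_NONE form.
 */
static int toy_pte_numa(uint32_t vma_prot, toy_pte_t pte)
{
	if (pte == toy_pte_modify(pte, vma_prot))
		return 0;

	return pte == toy_pte_modify(pte, TOY_PROT_NONE);
}
```

In this model a genuine PROT_NONE mapping (vma_prot equal to TOY_PROT_NONE) is rejected by the first comparison, which matches the comment in pte_numa() about such VMAs never producing 'special' faults.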

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* [PATCH 10/33] sched: Make find_busiest_queue() a method
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (8 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

It's a bit awkward, but it was the least painful means of modifying the
queue selection. It is used in a later patch to conditionally use a random
queue.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-lfpez319yryvdhwqfqrh99f2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)
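
As a rough illustration of the "method" idea, selecting the queue through a function pointer stored in the environment structure lets a caller swap the policy without touching the call site. The sketch below is stand-alone user-space C; `toy_rq`, `toy_env` and both policies are hypothetical stand-ins, not the kernel's struct rq, struct lb_env or any later patch in this series.

```c
#include <stddef.h>

/* Hypothetical stand-in for a run-queue. */
struct toy_rq {
	int cpu;
	unsigned long load;
};

/* Hypothetical stand-in for the balancing environment. */
struct toy_env {
	struct toy_rq *rqs;
	size_t nr_rqs;
	/* The "method": callers pick the queue through this pointer. */
	struct toy_rq *(*find_busiest_queue)(struct toy_env *env);
};

/* Default policy: the queue with the highest load, or NULL if all are idle. */
static struct toy_rq *busiest_by_load(struct toy_env *env)
{
	struct toy_rq *busiest = NULL;
	unsigned long max_load = 0;
	size_t i;

	for (i = 0; i < env->nr_rqs; i++) {
		if (env->rqs[i].load > max_load) {
			max_load = env->rqs[i].load;
			busiest = &env->rqs[i];
		}
	}
	return busiest;
}

/* Alternative policy that could be plugged in: the first non-idle queue. */
static struct toy_rq *first_nonidle(struct toy_env *env)
{
	size_t i;

	for (i = 0; i < env->nr_rqs; i++)
		if (env->rqs[i].load > 0)
			return &env->rqs[i];
	return NULL;
}
```

A caller would set `.find_busiest_queue = busiest_by_load` in the initializer and invoke `env.find_busiest_queue(&env)`, which is the same shape as the load_balance() change in the diff below.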

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..511fbb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3600,6 +3600,9 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *);
 };
 
 /*
@@ -4779,13 +4782,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
 
 	struct lb_env env = {
-		.sd		= sd,
-		.dst_cpu	= this_cpu,
-		.dst_rq		= this_rq,
-		.dst_grpmask    = sched_group_cpus(sd->groups),
-		.idle		= idle,
-		.loop_break	= sched_nr_migrate_break,
-		.cpus		= cpus,
+		.sd		    = sd,
+		.dst_cpu	    = this_cpu,
+		.dst_rq		    = this_rq,
+		.dst_grpmask        = sched_group_cpus(sd->groups),
+		.idle		    = idle,
+		.loop_break	    = sched_nr_migrate_break,
+		.cpus		    = cpus,
+		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4804,7 +4808,7 @@ redo:
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(&env, group);
+	busiest = env.find_busiest_queue(&env, group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
-- 
1.7.11.7


* [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (9 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 10/33] sched: Make find_busiest_queue() a method Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	H. Peter Anvin, Mike Galbraith

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjEld16Pb@git.kernel.org
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
 1 file changed, 230 insertions(+)
 create mode 100644 Documentation/scheduler/numa-problem.txt

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 0000000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].  
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+       because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly 
+      restrict p/s above to be the working-set. (It also makes explicit the 
+      requirement for <C0,M0> to change upon a change in the working set.)
+
+      Doing this does have the nice property that it lets you use your frequency
+      measurement as a weak-ordering for the benefit a task would receive when
+      we can't fit everything.
+
+      e.g. task1 has working set 10mb, f=90%
+           task2 has working set 90mb, f=10%
+
+      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+      from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+  C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+  T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+       on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+      usual systems given factors like Haswell's enormous 35mb l3
+      cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+  L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+  L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+  L_n,m(C0) =< L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+      Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+      the "worst" partition we should accept; but having it gives us a useful 
+      bound on how much we can reasonably adjust L_n/L_m at a Pareto point to 
+      favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+  min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+   particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+   traffic, the more complicated solution could pick another Pareto point using
+   an aggregate objective function such that we balance the loss of work
+   efficiency against the gain of running, we'd want to more or less suggest
+   there to be a fixed bound on the error from the Pareto line for any
+   such solution.
+
+References:
+
+  http://en.wikipedia.org/wiki/Mathematical_optimization
+  http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+  min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 -- 
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+however remove 'M' since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+  problems and assumptions. It should work well for tasks without significant
+  shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s] they
+don't count repeat accesses and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's' 
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+  the decaying avg includes the old accesses and therefore has a measure of repeat
+  accesses.
+
+  Rik also argued that the sample frequency is too low to get accurate access
+  frequency measurements, I'm not entirely convinced, even at low sample
+  frequencies the avg elapsed time 'e' over multiple samples should still
+  give us a fair approximation of the avg access frequency 'a'.
+
+  So doing both b&c has a fair chance of working and allowing us to distinguish
+  between important and less important memory accesses.
+
+  Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms, this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+       is very rare for there to be a 50/50 split in memory, lacking a perfect
+       split, the smaller will move towards the larger. In case of the perfect
+       split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
-- 
1.7.11.7
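
The definitions in numa-problem.txt above can be evaluated on a small task set. This is a user-space sketch only: `toy_task` and the function names are hypothetical, the shared bandwidth values 'bs_i,k' are supplied by hand rather than measured, and only the reduced traffic term from 2b and the 'a = I / e' estimate from 3b are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task record; field names follow the symbols in the text. */
struct toy_task {
	int cpu;		/* c_i */
	int node;		/* n_i */
	double weight;		/* w_i */
	const double *bs;	/* bs_i,k: shared bandwidth towards node k [pages/s] */
};

/* L_n = 1/P_n * \Sum_i w_i for all c_i = n */
static double cpu_load(const struct toy_task *t, size_t nr, int cpu,
		       double capacity)
{
	double sum = 0.0;
	size_t i;

	assert(capacity > 0.0);
	for (i = 0; i < nr; i++)
		if (t[i].cpu == cpu)
			sum += t[i].weight;
	return sum / capacity;
}

/* L_n,m = | L_n - L_m | */
static double load_diff(double l_n, double l_m)
{
	return l_n > l_m ? l_n - l_m : l_m - l_n;
}

/* Reduced traffic term of 2b: T_k,l = \Sum_i bs_i,l for every n_i = k, l != k */
static double shared_traffic(const struct toy_task *t, size_t nr, int k, int l)
{
	double sum = 0.0;
	size_t i;

	if (k == l)
		return 0.0;
	for (i = 0; i < nr; i++)
		if (t[i].node == k)
			sum += t[i].bs[l];
	return sum;
}

/* 3b: estimated accesses 'a = I / e' for marking interval I and elapsed time e */
static double estimated_accesses(double interval, double elapsed)
{
	assert(elapsed > 0.0);
	return interval / elapsed;
}
```

With two tasks of weight 1024 and 2048 on a CPU of capacity 1 and one task of weight 1024 on a CPU of capacity 2, the loads come out as 3072 and 512, so L_n,m is 2560; a placement C that lowers this difference is preferred by the fairness goal.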


* [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (10 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Andrea Arcangeli <aarcange@redhat.com>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)
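
The flag handling described above can be sketched outside the kernel. The helper names below are hypothetical; FOLL_NUMA uses the value this patch adds, while the FOLL_FORCE value is assumed from include/linux/mm.h of this era and should be checked against the tree being used.

```c
/* FOLL_NUMA as added by this patch; FOLL_FORCE value is an assumption. */
#define FOLL_FORCE	0x10
#define FOLL_NUMA	0x200

/*
 * Mirrors the __get_user_pages() hunk: request NUMA hinting faults unless
 * the caller forces access, since forced access may touch PROT_NONE ranges.
 */
static unsigned int gup_effective_flags(unsigned int gup_flags)
{
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;
	return gup_flags;
}

/*
 * Mirrors the follow_page() checks: returns 1 when the lookup should give
 * up so that the caller goes through handle_mm_fault() and retries.
 */
static int follow_page_wants_fault(unsigned int flags, int entry_is_numa)
{
	return (flags & FOLL_NUMA) && entry_is_numa;
}
```

A plain follow_page() caller such as KSM never passes through the first helper, so FOLL_NUMA stays clear and the lookup returns the page even when the entry is marked for NUMA hinting.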

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 246375c..f39a628 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1585,6 +1585,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index b9bb15c..23ad2eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1522,6 +1522,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+		goto no_page_table;
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
 			split_huge_page_pmd(mm, pmd);
@@ -1551,6 +1553,8 @@ split_fallthrough:
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto no_page;
+	if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 
@@ -1702,6 +1706,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
 	vm_flags &= (gup_flags & FOLL_FORCE) ?
 			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+	/*
+	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+	 * would be called on PROT_NONE ranges. We must never invoke
+	 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+	 * page faults would unprotect the PROT_NONE ranges if
+	 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+	 * bitflag. So to avoid that, don't set FOLL_NUMA if
+	 * FOLL_FORCE is set.
+	 */
+	if (!(gup_flags & FOLL_FORCE))
+		gup_flags |= FOLL_NUMA;
+
 	i = 0;
 
 	do {
-- 
1.7.11.7


* [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (11 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 14/33] mm/migration: Improve migrate_misplaced_page() Ingo Molnar
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add migrate_misplaced_page() which deals with migrating pages from
faults.

This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required due to having to look up
the page.

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-es03i8ne7xee0981brw40fl5@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/migrate.h      |  4 ++-
 include/linux/migrate_mode.h |  3 ++
 mm/migrate.c                 | 79 +++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 77 insertions(+), 9 deletions(-)
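
The reference accounting that MIGRATE_FAULT changes in migrate_page_move_mapping() can be summarized as a small pure function. This is a sketch under assumptions: the `toy_` enum and the function name are hypothetical, and the two integer arguments stand in for the `mapping` pointer and page_has_private().

```c
/* Hypothetical mirror of enum migrate_mode after this patch. */
enum toy_migrate_mode {
	TOY_MIGRATE_ASYNC,
	TOY_MIGRATE_SYNC_LIGHT,
	TOY_MIGRATE_SYNC,
	TOY_MIGRATE_FAULT,
};

/*
 * Page count that migration expects to see before it proceeds:
 *   +1 when called from the fault path, which holds its own reference
 *   +1 for an anonymous page without a mapping
 *   +2 (+1 if the page has private data) for a page with a mapping
 */
static int expected_page_count(enum toy_migrate_mode mode, int has_mapping,
			       int has_private)
{
	int expected = 0;

	if (mode == TOY_MIGRATE_FAULT)
		expected++;

	if (!has_mapping)
		return expected + 1;

	return expected + 2 + (has_private ? 1 : 0);
}
```

Any other count means someone else holds a reference, in which case the diff below returns -EAGAIN and the fault path simply tries again on a later fault.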

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index afd9af1..72665c9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,10 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
-#endif /* CONFIG_MIGRATION */
 static inline
 int migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+
+#endif /* CONFIG_MIGRATION */
 #endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..40b37dc 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ *	this path has an extra reference count
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_FAULT,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4ba45f4..b89062d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC) {
+	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
+	if (mode == MIGRATE_FAULT) {
+		/*
+		 * MIGRATE_FAULT has an extra reference on the page and
+		 * otherwise acts like ASYNC, no point in delaying the
+		 * fault, we'll try again next time.
+		 */
+		expected_count++;
+	}
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		if (page_count(page) != 1)
+		expected_count += 1;
+		if (page_count(page) != expected_count)
 			return -EAGAIN;
 		return 0;
 	}
@@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	expected_count = 2 + page_has_private(page);
+	expected_count += 2 + page_has_private(page);
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if (mode == MIGRATE_ASYNC && head &&
+	if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
 			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping,
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked now
 	 */
-	if (mode != MIGRATE_ASYNC)
+	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
 		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
+		if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
 			goto out;
 
 		/*
@@ -1403,4 +1413,57 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
  	}
  	return err;
 }
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	struct address_space *mapping = page_mapping(page);
+	int page_lru = page_is_file_cache(page);
+	struct page *newpage;
+	int ret = -EAGAIN;
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	/*
+	 * Never wait for allocations just to migrate on fault, but don't dip
+	 * into reserves. And, only accept pages from the specified node. No
+	 * sense migrating to a different "misplaced" page!
+	 */
+	if (mapping)
+		gfp = mapping_gfp_mask(mapping);
+	gfp &= ~__GFP_WAIT;
+	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+	newpage = alloc_pages_node(node, gfp, 0);
+	if (!newpage) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (isolate_lru_page(page)) {
+		ret = -EBUSY;
+		goto put_new;
+	}
+
+	inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+	/*
+	 * A page that has been migrated has all references removed and will be
+	 * freed. A page that has not been migrated will have kepts its
+	 * references and be restored.
+	 */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	putback_lru_page(page);
+put_new:
+	/*
+	 * Move the new page to the LRU. If migration was not successful
+	 * then this will free the page.
+	 */
+	putback_lru_page(newpage);
+out:
+	return ret;
+}
+
+#endif /* CONFIG_NUMA */
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 14/33] mm/migration: Improve migrate_misplaced_page()
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (12 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Mel Gorman <mgorman@suse.de>

Fix, improve and clean up migrate_misplaced_page() to
reuse migrate_pages() and to check for zone watermarks
to make sure we don't overload the node.
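
The watermark idea can be modelled in plain userspace C. This is only a
sketch under stated assumptions: struct zone_model, its fields and
node_can_accept() are hypothetical stand-ins for the kernel's struct zone,
zone_watermark_ok() and the new migrate_balanced_pgdat() helper, and the
free-page accounting is far simpler than the real thing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
	unsigned long present_pages;	/* 0 means the zone is not populated */
	unsigned long free_pages;	/* pages currently free in the zone */
	unsigned long high_wmark;	/* high watermark, in pages */
	bool all_unreclaimable;		/* reclaim has given up on this zone */
};

/*
 * Return true if at least one usable zone of the node would still be
 * above its high watermark after giving up nr_migrate_pages pages.
 * Zones are scanned from the highest index down, as in the patch.
 */
static bool node_can_accept(const struct zone_model *zones, size_t nr_zones,
			    unsigned long nr_migrate_pages)
{
	size_t z = nr_zones;

	while (z-- > 0) {
		const struct zone_model *zone = &zones[z];

		/* Skip empty zones and zones that reclaim cannot help. */
		if (!zone->present_pages || zone->all_unreclaimable)
			continue;

		/* Enough headroom: migrating here should not wake kswapd. */
		if (zone->free_pages > zone->high_wmark + nr_migrate_pages)
			return true;
	}
	return false;
}
```

With a check of this shape, a nearly full destination node simply rejects
the migration and the page stays where it is until a later fault retries.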

This was originally based on Peter's patch "mm/migrate: Introduce
migrate_misplaced_page()" but borrows extremely heavily from Andrea's
"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection".

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/1353064973-26082-14-git-send-email-mgorman@suse.de
[ Adapted to the numa/core tree. Kept Mel's patch separate to retain
  the original authorship. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/migrate_mode.h |   3 -
 mm/memory.c                  |  13 ++--
 mm/migrate.c                 | 143 +++++++++++++++++++++++++++----------------
 3 files changed, 95 insertions(+), 64 deletions(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 40b37dc..ebf3d89 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,14 +6,11 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
- * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
- *	this path has an extra reference count
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
-	MIGRATE_FAULT,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index 23ad2eb..52ad29d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3492,28 +3492,25 @@ out_pte_upgrade_unlock:
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
-out:
+
 	if (page) {
 		task_numa_fault(page_nid, last_cpu, 1);
 		put_page(page);
 	}
-
+out:
 	return 0;
 
 migrate:
 	pte_unmap_unlock(ptep, ptl);
 
-	if (!migrate_misplaced_page(page, node)) {
-		page_nid = node;
+	if (migrate_misplaced_page(page, node)) {
 		goto out;
 	}
+	page = NULL;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_same(*ptep, entry)) {
-		put_page(page);
-		page = NULL;
+	if (!pte_same(*ptep, entry))
 		goto out_unlock;
-	}
 
 	goto out_pte_upgrade_unlock;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index b89062d..16a4709 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
+	if (mode != MIGRATE_ASYNC) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -282,19 +282,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 0;
 	void **pslot;
 
-	if (mode == MIGRATE_FAULT) {
-		/*
-		 * MIGRATE_FAULT has an extra reference on the page and
-		 * otherwise acts like ASYNC, no point in delaying the
-		 * fault, we'll try again next time.
-		 */
-		expected_count++;
-	}
-
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		expected_count += 1;
-		if (page_count(page) != expected_count)
+		if (page_count(page) != 1)
 			return -EAGAIN;
 		return 0;
 	}
@@ -304,7 +294,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	expected_count += 2 + page_has_private(page);
+	expected_count = 2 + page_has_private(page);
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -323,7 +313,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
+	if (mode == MIGRATE_ASYNC && head &&
 			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -531,7 +521,7 @@ int buffer_migrate_page(struct address_space *mapping,
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked now
 	 */
-	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
+	if (mode != MIGRATE_ASYNC)
 		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
@@ -697,7 +687,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
+		if (!force || mode == MIGRATE_ASYNC)
 			goto out;
 
 		/*
@@ -1415,55 +1405,102 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
 }
 
 /*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks which is a bit crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
  * Attempt to migrate a misplaced page to the specified destination
- * node.
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
  */
 int migrate_misplaced_page(struct page *page, int node)
 {
-	struct address_space *mapping = page_mapping(page);
-	int page_lru = page_is_file_cache(page);
-	struct page *newpage;
-	int ret = -EAGAIN;
-	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+	int isolated = 0;
+	LIST_HEAD(migratepages);
 
 	/*
-	 * Never wait for allocations just to migrate on fault, but don't dip
-	 * into reserves. And, only accept pages from the specified node. No
-	 * sense migrating to a different "misplaced" page!
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
 	 */
-	if (mapping)
-		gfp = mapping_gfp_mask(mapping);
-	gfp &= ~__GFP_WAIT;
-	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
-
-	newpage = alloc_pages_node(node, gfp, 0);
-	if (!newpage) {
-		ret = -ENOMEM;
+	if (page_mapcount(page) != 1)
 		goto out;
-	}
 
-	if (isolate_lru_page(page)) {
-		ret = -EBUSY;
-		goto put_new;
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * callers reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
 	}
 
-	inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
-	ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
-	/*
-	 * A page that has been migrated has all references removed and will be
-	 * freed. A page that has not been migrated will have kepts its
-	 * references and be restored.
-	 */
-	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
-	putback_lru_page(page);
-put_new:
-	/*
-	 * Move the new page to the LRU. If migration was not successful
-	 * then this will free the page.
-	 */
-	putback_lru_page(newpage);
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
 out:
-	return ret;
+	return isolated;
 }
 
 #endif /* CONFIG_NUMA */
-- 
1.7.11.7

* [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (13 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 14/33] mm/migration: Improve migrate_misplaced_page() Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

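Some architectures, such as SuperH, (ab)use NUMA to represent different
memory regions that are all CPU-local but have different latencies.

Add the ARCH_WANTS_NUMA_VARIABLE_LOCALITY Kconfig symbol so that such
architectures can flag this, and select it from SuperH's NUMA option.

The naming comes from Mel Gorman.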

Named-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/sh/mm/Kconfig | 1 +
 init/Kconfig       | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..5d2a4df 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
 config NUMA
 	bool "Non Uniform Memory Access (NUMA) Support"
 	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+	select ARCH_WANTS_NUMA_VARIABLE_LOCALITY
 	default n
 	help
 	  Some SH systems have many various memories scattered around
diff --git a/init/Kconfig b/init/Kconfig
index f36c83d..b8a4a58 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -718,6 +718,13 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
 	depends on ARCH_USES_NUMA_GENERIC_PGPROT
 	depends on TRANSPARENT_HUGEPAGE
 
+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+	bool
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
-- 
1.7.11.7

* [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (14 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Rik van Riel <riel@redhat.com>

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
  probably be rewritten once we figure out the final details of
  what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20121018171928.24d06af4@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
----
This is against tip.git numa/core
---
 CREDITS             | 1 +
 kernel/sched/fair.c | 3 +++
 mm/memory.c         | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
 D: Author of lil (Linux Interrupt Latency benchmark)
 D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
 D: VM hacker
+D: NUMA task placement
 D: Various other kernel hacks
 S: Imola 40026
 S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 511fbb8..8af0208 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
  *
  *  Adaptive scheduling granularity, math enhancements by Peter Zijlstra
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ *  NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ *  CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
  */
 
 #include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 52ad29d..1f733dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
  *		(Gerhard.Wichert@pdb.siemens.de)
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
  */
 
 #include <linux/kernel_stat.h>
-- 
1.7.11.7

* [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (15 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Allow architectures to opt in to the adaptive affinity NUMA balancing code.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 init/Kconfig | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index b8a4a58..cf3e79c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -725,6 +725,13 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
 config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
 	bool
 
+#
+# For architectures that want to enable the PROT_NONE driven,
+# NUMA-affine scheduler balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+	bool
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
-- 
1.7.11.7

* [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (16 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-12-04  0:56   ` [patch] mm, mempolicy: Introduce spinlock to read shared policy tree David Rientjes
  2012-11-22 22:49 ` [PATCH 19/33] sched: Add adaptive NUMA affinity support Ingo Molnar
                   ` (16 subsequent siblings)
  34 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add the NUMA working set scanning/hinting page fault machinery,
with no policy yet.

The numa_migration_target() function is a derivative of
Andrea Arcangeli's code in the AutoNUMA tree - including the
comment above the function.

An additional enhancement is that instead of node granular faults
we are recording CPU granular faults, which allows us to make
a distinction between:

 - 'private' tasks (who access pages that have been accessed
   from the same CPU before - i.e. by the same task)

 - and 'shared' tasks: who access pages that tend to have been
   accessed by another CPU, i.e. another task.

We later use this fault metric to do enhanced task and
memory placement.
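
The private/shared distinction can be sketched in userspace C. This is a
hypothetical model, not code from this series: struct page_model,
struct task_model, record_fault() and classify() are invented names,
counting a first touch as private is an assumption, and the simple
majority rule only stands in for whatever policy the later patches apply.
The -1/0/1 return values follow the encoding documented for
task_numa_shared() in this patch.

```c
/* Hypothetical model: each page remembers the CPU that last faulted on it. */
struct page_model {
	int last_cpu;			/* -1 until the first hinting fault */
};

/* Hypothetical per-task counters of private vs shared hinting faults. */
struct task_model {
	unsigned long faults_private;
	unsigned long faults_shared;
};

/*
 * Account one hinting fault taken by a task running on this_cpu.
 * A page last touched from the same CPU counts as private, a page last
 * touched from another CPU counts as shared. A first touch is counted
 * as private (an assumption of this sketch).
 */
static void record_fault(struct task_model *t, struct page_model *page,
			 int this_cpu)
{
	if (page->last_cpu == -1 || page->last_cpu == this_cpu)
		t->faults_private++;
	else
		t->faults_shared++;

	/* Remember who touched the page for the next fault. */
	page->last_cpu = this_cpu;
}

/*
 * -1: no faults recorded yet (treated as a non-NUMA task)
 *  0: dominantly 'private' working set
 *  1: dominantly 'shared' working set
 */
static int classify(const struct task_model *t)
{
	if (!t->faults_private && !t->faults_shared)
		return -1;
	return t->faults_shared > t->faults_private ? 1 : 0;
}
```

Recording the CPU rather than the node is what makes this possible: two
tasks on the same node would look identical at node granularity.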

[ The earliest versions had the mpol_misplaced() function from
  Lee Schermerhorn - this was heavily modified later on. ]

Many thanks to everyone involved for these great ideas and code.

Written-by: Andrea Arcangeli <aarcange@redhat.com>
Also-written-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
[ split it out of the main policy patch - as suggested by Mel Gorman ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |   8 +++
 include/linux/mempolicy.h |   6 +-
 include/linux/mm_types.h  |   4 ++
 include/linux/sched.h     |  41 ++++++++++++--
 init/Kconfig              |  73 +++++++++++++++++++-----
 kernel/sched/core.c       |  15 +++++
 kernel/sysctl.c           |  31 ++++++++++-
 mm/huge_memory.c          |   1 +
 mm/mempolicy.c            | 137 ++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 294 insertions(+), 22 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..ed98982 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_NUMA_BALANCING
+# define INIT_TASK_NUMA(tsk)						\
+	.numa_shared = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f329306..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
@@ -323,11 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 	return 0;
 }
 
-#endif /* CONFIG_NUMA */
-
 static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
 				 unsigned long address)
 {
 	return -1; /* no node preference */
 }
+
+#endif /* CONFIG_NUMA */
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7e9f758..48760e9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,6 +403,10 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_next_scan;
+	int numa_scan_seq;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a0a2808..418d405 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,18 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	int numa_shared;
+	int numa_max_node;
+	int numa_scan_seq;
+	int numa_migrate_seq;
+	unsigned int numa_scan_period;
+	u64 node_stamp;			/* migration stamp  */
+	unsigned long numa_weight;
+	unsigned long *numa_faults;
+	struct callback_head numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
+
 	struct rcu_head rcu;
 
 	/*
@@ -1575,7 +1587,25 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#ifdef CONFIG_NUMA_BALANCING
+extern void task_numa_fault(int node, int cpu, int pages);
+#else
 static inline void task_numa_fault(int node, int cpu, int pages) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/*
+ * -1: non-NUMA task
+ *  0: NUMA task with a dominantly 'private' working set
+ *  1: NUMA task with a dominantly 'shared' working set
+ */
+static inline int task_numa_shared(struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	return p->numa_shared;
+#else
+	return -1;
+#endif
+}
 
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
@@ -2014,6 +2044,10 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_sched_numa_scan_period_min;
+extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
@@ -2024,18 +2058,17 @@ extern unsigned int sysctl_sched_shares_window;
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
 static inline unsigned int get_sysctl_timer_migration(void)
 {
 	return sysctl_timer_migration;
 }
-#else
+#else /* CONFIG_SCHED_DEBUG */
 static inline unsigned int get_sysctl_timer_migration(void)
 {
 	return 1;
 }
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
diff --git a/init/Kconfig b/init/Kconfig
index cf3e79c..9511f0d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -697,6 +697,65 @@ config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
 #
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+	bool
+
+#
+# For architectures that want to enable NUMA-affine scheduling
+# and memory placement:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+	bool
+
+#
+# For architectures that want to reuse the PROT_NONE bits
+# to implement NUMA protection bits:
+#
+config ARCH_WANTS_NUMA_GENERIC_PGPROT
+	bool
+
+config NUMA_BALANCING
+	bool "NUMA-optimizing scheduler"
+	default n
+	depends on ARCH_SUPPORTS_NUMA_BALANCING
+	depends on !ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+	depends on SMP && NUMA && MIGRATION
+	help
+	  This option enables NUMA-aware, transparent, automatic
+	  placement optimizations of memory, tasks and task groups.
+
+	  The optimizations work by (transparently) runtime sampling the
+	  workload sharing relationship between threads and processes
+	  of long-run workloads, and scheduling them based on these
+	  measured inter-task relationships (or the lack thereof).
+
+	  ("Long-run" means several seconds of CPU runtime at least.)
+
+	  Tasks that predominantly perform their own processing, without
+	  interacting with other tasks much will be independently balanced
+	  to a CPU and their working set memory will migrate to that CPU/node.
+
+	  Tasks that share a lot of data with each other will be attempted to
+	  be scheduled on as few nodes as possible, with their memory
+	  following them there and being distributed between those nodes.
+
+	  This optimization can improve the performance of long-run CPU-bound
+	  workloads by 10% or more. The sampling and migration has a small
+	  but nonzero cost, so if your NUMA workload is already perfectly
+	  placed (for example by use of explicit CPU and memory bindings,
+	  or because the stock scheduler does a good job already) then you
+	  probably don't need this feature.
+
+	  [ On non-NUMA systems this feature will not be active. You can query
+	    whether your system is a NUMA system via looking at the output of
+	    "numactl --hardware". ]
+
+	  Say N if unsure.
+
+#
 # Helper Kconfig switches to express compound feature dependencies
 # and thus make the .h/.c code more readable:
 #
@@ -718,20 +777,6 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
 	depends on ARCH_USES_NUMA_GENERIC_PGPROT
 	depends on TRANSPARENT_HUGEPAGE
 
-#
-# For architectures that (ab)use NUMA to represent different memory regions
-# all cpu-local but of different latencies, such as SuperH.
-#
-config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
-	bool
-
-#
-# For architectures that want to enable the PROT_NONE driven,
-# NUMA-affine scheduler balancing logic:
-#
-config ARCH_SUPPORTS_NUMA_BALANCING
-	bool
-
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..3611f5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1544,6 +1544,21 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_scan_seq = 0;
+	}
+
+	p->numa_shared = -1;
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = 2;
+	p->numa_faults = NULL;
+	p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+	p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b0fa5ad..7736b9e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+#ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
 static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.procname	= "sched_numa_scan_period_min_ms",
+		.data		= &sysctl_sched_numa_scan_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_numa_scan_period_max_ms",
+		.data		= &sysctl_sched_numa_scan_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_numa_settle_count",
+		.data		= &sysctl_sched_numa_settle_count,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 814e3ea..92e101f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1456,6 +1456,7 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
+		page_xchg_last_cpu(page_tail, page_last_cpu(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..318043a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2175,6 +2175,143 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults then sample this probability. The chance of
+ * getting the same result twice in a row, given that the samples are
+ * fully independent, is P(p)^2, provided our sample period is
+ * sufficiently short compared to the usage pattern.
+ *
+ * This quadratic squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ *
+ * Return the best node ID this page should be on, or -1 if it should
+ * stay where it is.
+ */
+static int
+numa_migration_target(struct page *page, int page_nid,
+		      struct task_struct *p, int this_cpu,
+		      int cpu_last_access)
+{
+	int nid_last_access;
+	int this_nid;
+
+	if (task_numa_shared(p) < 0)
+		return -1;
+
+	/*
+	 * Possibly migrate towards the current node, depends on
+	 * task_numa_placement() and access details.
+	 */
+	nid_last_access = cpu_to_node(cpu_last_access);
+	this_nid = cpu_to_node(this_cpu);
+
+	if (nid_last_access != this_nid) {
+		/*
+		 * 'Access miss': the page got last accessed from a remote node.
+		 */
+		return -1;
+	}
+	/*
+	 * 'Access hit': the page got last accessed from our node.
+	 *
+	 * Migrate the page if needed.
+	 */
+
+	/* The page is already on this node: */
+	if (page_nid == this_nid)
+		return -1;
+
+	return this_nid;
+}
+
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ * @multi  - use multi-stage node binding
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	int best_nid = -1, page_nid;
+	int cpu_last_access, this_cpu;
+	struct mempolicy *pol;
+	unsigned long pgoff;
+	struct zone *zone;
+
+	BUG_ON(!vma);
+
+	this_cpu = raw_smp_processor_id();
+	page_nid = page_to_nid(page);
+
+	cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(task_numa_shared(current) >= 0))
+		goto out_keep_page;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		best_nid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			best_nid = numa_migration_target(page, page_nid, current, this_cpu, cpu_last_access);
+		else
+			best_nid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(page_nid, pol->v.nodes))
+			goto out_keep_page;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		best_nid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+
+out_keep_page:
+	mpol_cond_put(pol);
+
+	return best_nid;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-11-22 22:49 ` [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
@ 2012-12-04  0:56   ` David Rientjes
  2012-12-20 18:34     ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: David Rientjes @ 2012-12-04  0:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Sasha was fuzzing with trinity and reported the following problem:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
2 locks held by trinity-main/6361:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
 #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
Pid: 6361, comm: trinity-main Tainted: G        W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
Call Trace:
 [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
 [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
 [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
 [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
 [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
 [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
 [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0

do_numa_page() calls the new mpol_misplaced() function introduced by
"sched, numa, mm: Add the scanning page fault machinery" in the page fault
path while holding mm->page_table_lock, and mpol_shared_policy_lookup()
then ends up trying to take the shared policy mutex, which can sleep.

The fix is to protect the shared policy tree with both a spinlock and 
mutex; both must be held to modify the tree, but only one is required to 
read the tree.  This allows sp_lookup() to grab the spinlock for read.
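
As a rough illustration of that locking rule, here is a userspace sketch.
It is not the kernel code: pthread_mutex_t and a C11 atomic_flag stand in
for the kernel mutex and spinlock, and struct policy_tree, tree_update()
and the two tree_read_*() helpers are hypothetical names. Writers take
both locks; a reader takes whichever one its context allows.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct shared_policy. */
struct policy_tree {
	pthread_mutex_t mutex;	/* sleeping lock, held across an update */
	atomic_flag lock;	/* spinlock, held around each tree change */
	int value;		/* stands in for the rb-tree contents */
};

static void spin_acquire(atomic_flag *f)
{
	while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
		;	/* busy-wait; never sleeps */
}

static void spin_release(atomic_flag *f)
{
	atomic_flag_clear_explicit(f, memory_order_release);
}

static void tree_init(struct policy_tree *t)
{
	assert(t != NULL);
	pthread_mutex_init(&t->mutex, NULL);
	atomic_flag_clear(&t->lock);
	t->value = 0;
}

/* Writers hold BOTH locks: the mutex first, then the spinlock. */
static void tree_update(struct policy_tree *t, int v)
{
	pthread_mutex_lock(&t->mutex);
	spin_acquire(&t->lock);
	t->value = v;
	spin_release(&t->lock);
	pthread_mutex_unlock(&t->mutex);
}

/* A reader that must not sleep holds only the spinlock. */
static int tree_read_atomic(struct policy_tree *t)
{
	int v;

	spin_acquire(&t->lock);
	v = t->value;
	spin_release(&t->lock);
	return v;
}

/* A reader that may sleep can hold only the mutex instead. */
static int tree_read_sleepable(struct policy_tree *t)
{
	int v;

	pthread_mutex_lock(&t->mutex);
	v = t->value;
	pthread_mutex_unlock(&t->mutex);
	return v;
}
```

Because every writer holds both locks, either lock alone is enough to
keep writers out, which is why a reader can pick one.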

[rientjes@google.com: wrote changelog]
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |   23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
+	spinlock_t lock;
 	struct mutex mutex;
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2090,12 +2090,20 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
  *
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp->lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock; lookups can hold
+ * either one to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() takes neither, so that its callers can choose which to hold.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds sp->lock and/or sp->mutex */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2134,6 +2142,7 @@ static void sp_insert(struct shared_policy *sp, struct sp_node *new)
 	struct rb_node *parent = NULL;
 	struct sp_node *nd;
 
+	spin_lock(&sp->lock);
 	while (*p) {
 		parent = *p;
 		nd = rb_entry(parent, struct sp_node, nd);
@@ -2146,6 +2155,7 @@ static void sp_insert(struct shared_policy *sp, struct sp_node *new)
 	}
 	rb_link_node(&new->nd, parent, p);
 	rb_insert_color(&new->nd, &sp->root);
+	spin_unlock(&sp->lock);
 	pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
 		 new->policy ? new->policy->mode : 0);
 }
@@ -2159,13 +2169,13 @@ mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
 
 	if (!sp->root.rb_node)
 		return NULL;
-	mutex_lock(&sp->mutex);
+	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
 	}
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
 	return pol;
 }
 
@@ -2178,8 +2188,10 @@ static void sp_free(struct sp_node *n)
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
+	spin_lock(&sp->lock);
 	rb_erase(&n->nd, &sp->root);
 	sp_free(n);
+	spin_unlock(&sp->lock);
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2264,6 +2276,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 	int ret;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
+	spin_lock_init(&sp->lock);
 	mutex_init(&sp->mutex);
 
 	if (mpol) {


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-04  0:56   ` [patch] mm, mempolicy: Introduce spinlock to read shared policy tree David Rientjes
@ 2012-12-20 18:34     ` Linus Torvalds
  2012-12-20 22:55       ` David Rientjes
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2012-12-20 18:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Peter Zijlstra,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin

Going through some old emails before -rc1 release..

What is the status of this patch? The patch that is reported to cause
the problem hasn't been merged, but that mpol_misplaced() thing did
happen in commit 771fb4d806a9. And it looks like it's called from
numa_migrate_prep() under the pte map lock. Or am I missing something?
See commit 9532fec118d ("mm: numa: Migrate pages handled during a
pmd_numa hinting fault").

Am I missing something? Mel, please take another look.

I despise these kinds of dual-locking models, and am wondering if we
can't have *just* the spinlock?

            Linus

On Mon, Dec 3, 2012 at 4:56 PM, David Rientjes <rientjes@google.com> wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> Sasha was fuzzing with trinity and reported the following problem:
>
> BUG: sleeping function called from invalid context at kernel/mutex.c:269
> in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
> 2 locks held by trinity-main/6361:
>  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
>  #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
> Pid: 6361, comm: trinity-main Tainted: G        W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
> Call Trace:
>  [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
>  [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
>  [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
>  [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
>  [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
>  [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
>  [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
>
> do_numa_page() calls the new mpol_misplaced() function introduced by
> "sched, numa, mm: Add the scanning page fault machinery" in the page fault
> path while holding mm->page_table_lock and then
> mpol_shared_policy_lookup() ends up trying to take the shared policy
> mutex.
>
> The fix is to protect the shared policy tree with both a spinlock and
> mutex; both must be held to modify the tree, but only one is required to
> read the tree.  This allows sp_lookup() to grab the spinlock for read.
>
> [rientjes@google.com: wrote changelog]
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Tested-by: Sasha Levin <levinsasha928@gmail.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/mempolicy.h |    1 +
>  mm/mempolicy.c            |   23 ++++++++++++++++++-----
>  2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -133,6 +133,7 @@ struct sp_node {
>
>  struct shared_policy {
>         struct rb_root root;
> +       spinlock_t lock;
>         struct mutex mutex;
>  };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2090,12 +2090,20 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>   *
>   * Remember policies even when nobody has shared memory mapped.
>   * The policies are kept in Red-Black tree linked from the inode.
> - * They are protected by the sp->lock spinlock, which should be held
> - * for any accesses to the tree.
> + *
> + * The rb-tree is locked using both a mutex and a spinlock. Every modification
> + * to the tree must hold both the mutex and the spinlock, lookups can hold
> + * either to observe a stable tree.
> + *
> + * In particular, sp_insert() and sp_delete() take the spinlock, whereas
> + * sp_lookup() doesn't, this so users have choice.
> + *
> + * shared_policy_replace() and mpol_free_shared_policy() take the mutex
> + * and call sp_insert(), sp_delete().
>   */
>
>  /* lookup first element intersecting start-end */
> -/* Caller holds sp->mutex */
> +/* Caller holds either sp->lock and/or sp->mutex */
>  static struct sp_node *
>  sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
>  {
> @@ -2134,6 +2142,7 @@ static void sp_insert(struct shared_policy *sp, struct sp_node *new)
>         struct rb_node *parent = NULL;
>         struct sp_node *nd;
>
> +       spin_lock(&sp->lock);
>         while (*p) {
>                 parent = *p;
>                 nd = rb_entry(parent, struct sp_node, nd);
> @@ -2146,6 +2155,7 @@ static void sp_insert(struct shared_policy *sp, struct sp_node *new)
>         }
>         rb_link_node(&new->nd, parent, p);
>         rb_insert_color(&new->nd, &sp->root);
> +       spin_unlock(&sp->lock);
>         pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
>                  new->policy ? new->policy->mode : 0);
>  }
> @@ -2159,13 +2169,13 @@ mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
>
>         if (!sp->root.rb_node)
>                 return NULL;
> -       mutex_lock(&sp->mutex);
> +       spin_lock(&sp->lock);
>         sn = sp_lookup(sp, idx, idx+1);
>         if (sn) {
>                 mpol_get(sn->policy);
>                 pol = sn->policy;
>         }
> -       mutex_unlock(&sp->mutex);
> +       spin_unlock(&sp->lock);
>         return pol;
>  }
>
> @@ -2178,8 +2188,10 @@ static void sp_free(struct sp_node *n)
>  static void sp_delete(struct shared_policy *sp, struct sp_node *n)
>  {
>         pr_debug("deleting %lx-l%lx\n", n->start, n->end);
> +       spin_lock(&sp->lock);
>         rb_erase(&n->nd, &sp->root);
>         sp_free(n);
> +       spin_unlock(&sp->lock);
>  }
>
>  static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
> @@ -2264,6 +2276,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
>         int ret;
>
>         sp->root = RB_ROOT;             /* empty tree == default mempolicy */
> +       spin_lock_init(&sp->lock);
>         mutex_init(&sp->mutex);
>
>         if (mpol) {


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-20 18:34     ` Linus Torvalds
@ 2012-12-20 22:55       ` David Rientjes
  2012-12-21 13:47         ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: David Rientjes @ 2012-12-20 22:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Peter Zijlstra,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

On Thu, 20 Dec 2012, Linus Torvalds wrote:

> Going through some old emails before -rc1 release..
> 
> What is the status of this patch? The patch that is reported to cause
> the problem hasn't been merged, but that mpol_misplaced() thing did
> happen in commit 771fb4d806a9. And it looks like it's called from
> numa_migrate_prep() under the pte map lock. Or am I missing something?

Andrew pinged both Ingo and me about it privately two weeks ago.  It
probably doesn't trigger right now because there's no pte_mknuma() on 
shared pages (yet) but will eventually be needed for correctness.  So it's 
not required for -rc1 as it sits in the tree today but will be needed 
later (and hopefully not forgotten about until Sasha fuzzes again).

> See commit 9532fec118d ("mm: numa: Migrate pages handled during a
> pmd_numa hinting fault").
> 
> Am I missing something? Mel, please take another look.
> 
> I despise these kinds of dual-locking models, and am wondering if we
> can't have *just* the spinlock?
> 

Adding KOSAKI to the cc.

This is probably worth discussing now to see if we can't revert 
b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()"), keep it 
only as a spinlock as you suggest, and do what KOSAKI suggested in 
http://marc.info/?l=linux-kernel&m=133940650731255 instead.  I don't think 
it's worth trying to optimize this path at the cost of having both a 
spinlock and mutex.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-20 22:55       ` David Rientjes
@ 2012-12-21 13:47         ` Mel Gorman
  2012-12-21 16:53           ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2012-12-21 13:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

On Thu, Dec 20, 2012 at 02:55:22PM -0800, David Rientjes wrote:
> On Thu, 20 Dec 2012, Linus Torvalds wrote:
> 
> > Going through some old emails before -rc1 release..
> > 
> > What is the status of this patch? The patch that is reported to cause
> > the problem hasn't been merged, but that mpol_misplaced() thing did
> > happen in commit 771fb4d806a9. And it looks like it's called from
> > numa_migrate_prep() under the pte map lock. Or am I missing something?
> 
> Andrew pinged both Ingo and me about it privately two weeks ago.  It
> probably doesn't trigger right now because there's no pte_mknuma() on 
> shared pages (yet) but will eventually be needed for correctness.

Specifically it is very unlikely to hit because of the page_mapcount()
checks that are made before setting pte_numa. I guess it is still possible
to trigger if just one process is mapping the shared area.

> So it's 
> not required for -rc1 as it sits in the tree today but will be needed 
> later (and hopefully not forgotten about until Sasha fuzzes again).
> 

Indeed.

> > See commit 9532fec118d ("mm: numa: Migrate pages handled during a
> > pmd_numa hinting fault").
> > 
> > Am I missing something? Mel, please take another look.
> > 
> > I despise these kinds of dual-locking models, and am wondering if we
> > can't have *just* the spinlock?
> > 
> 
> Adding KOSAKI to the cc.
> 
> This is probably worth discussing now to see if we can't revert 
> b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()"), keep it 
> only as a spinlock as you suggest, and do what KOSAKI suggested in 
> http://marc.info/?l=linux-kernel&m=133940650731255 instead.  I don't think 
> it's worth trying to optimize this path at the cost of having both a 
> spinlock and mutex.

Jeez, I'm still not keen on that approach for the reasons that are explained
in the changelog for b22d127a39dd.

The reported problem is due to the PTL being held for get_vma_policy()
during hinting fault handling but it's not actually necessary once the page
count has been elevated. If it was just PTEs we were dealing with, we could
just drop the PTL before calling mpol_misplaced() but the handling of PMDs
complicates that. A patch that simply dropped the PTL unconditionally looks
tidy but it then forces do_pmd_numa_page() to reacquire the PTL even if
the page was properly placed and 512 release/acquires of the PTL could suck.

That leads to this third *ugly* option that conditionally drops the lock
and it's up to the caller to figure out what happened. Fooling around with
how it conditionally releases the lock results in different sorts of ugly.
We now have three ugly sister patches for this. Who wants to be Cinderella?
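
For readers who want the shape of that conditional release in isolation,
here is a minimal userspace sketch. It is not the patch itself: a
pthread mutex stands in for the PTL, and table_lock, prep_may_release()
and fault_path() are hypothetical names. The callee is entered with the
lock held and reports through an out-parameter whether it dropped it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page table lock. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Entered with table_lock held. If the work ahead might sleep, the
 * lock is dropped here and *released tells the caller about it.
 */
static int prep_may_release(bool might_sleep, bool *released)
{
	*released = false;

	if (might_sleep) {
		pthread_mutex_unlock(&table_lock);
		*released = true;
	}

	/* The part that may sleep would run here. */
	return might_sleep ? 1 : -1;
}

static int fault_path(bool might_sleep)
{
	bool released;
	int ret;

	pthread_mutex_lock(&table_lock);
	ret = prep_may_release(might_sleep, &released);

	/* The caller unlocks only if the callee did not already do so. */
	if (!released)
		pthread_mutex_unlock(&table_lock);

	return ret;
}
```

Every call site has to branch on the flag before it touches the lock
again, so the lock state after the call depends on a runtime value.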

---8<---
mm: numa: Release the PTL if calling vm_ops->get_policy during NUMA hinting faults

Sasha was fuzzing with trinity and reported the following problem:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
2 locks held by trinity-main/6361:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
 #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
Pid: 6361, comm: trinity-main Tainted: G        W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
Call Trace:
 [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
 [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
 [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
 [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
 [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
 [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
 [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0

This was triggered by a different version of automatic NUMA balancing but
in theory the current version is vulnerable to the same problem.

do_numa_page
  -> numa_migrate_prep
    -> mpol_misplaced
      -> get_vma_policy
        -> shmem_get_policy

It's very unlikely this will happen as shared pages are not marked
pte_numa -- see the page_mapcount() check in change_pte_range() -- but
it is possible. There are a couple of ways this can be handled. Peter
Zijlstra and David Rientjes had a patch that introduced a dual-locking
model where lookups can use a spinlock but dual-locking like this is
tricky. A second approach is to partially revert b22d127a (mempolicy:
fix a race in shared_policy_replace) and go back to Kosaki's original
approach at http://marc.info/?l=linux-kernel&m=133940650731255 to only
use a spinlock for shared policies.

This patch is a third approach that is a different type of ugly. It drops
the PTL in numa_migrate_prep() if vm_ops->get_policy exists after the page
has been pinned and it's up to the caller to reacquire if necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/memory.c |   34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e0a9b0c..82d0b20 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3431,15 +3431,29 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
-int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int current_nid)
+static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+				unsigned long addr, int current_nid,
+				pte_t *ptep, spinlock_t *ptl, bool *released)
 {
+	*released = false;
+
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	/*
+	 * This is UGLY. If the vma has a get_policy ops then it is possible
+	 * it needs to allocate GFP_KERNEL which is not safe with the PTL
+	 * held. In this case we have to release the PTL and it's up to the
+	 * caller to reacquire it if necessary.
+	 */
+	if (vma->vm_ops && vma->vm_ops->get_policy) {
+		pte_unmap_unlock(ptep, ptl);
+		*released = true;
+	}
+
 	return mpol_misplaced(page, vma, addr);
 }
 
@@ -3451,6 +3465,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int current_nid = -1;
 	int target_nid;
 	bool migrated = false;
+	bool released_ptl;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3479,8 +3494,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
-	pte_unmap_unlock(ptep, ptl);
+	target_nid = numa_migrate_prep(page, vma, addr, current_nid,
+					ptep, ptl, &released_ptl);
+	if (!released_ptl)
+		pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
 		/*
 		 * Account for the fault against the current node if it not
@@ -3513,6 +3530,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	bool released_ptl;
 	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
@@ -3567,14 +3585,18 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		curr_nid = local_nid;
 		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
+					       page_to_nid(page),
+					       pte, ptl, &released_ptl);
 		if (target_nid == -1) {
+			if (released_ptl)
+				pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 			put_page(page);
 			continue;
 		}
 
 		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
+		if (!released_ptl)
+			pte_unmap_unlock(pte, ptl);
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 13:47         ` Mel Gorman
@ 2012-12-21 16:53           ` Linus Torvalds
  2012-12-21 18:21             ` Hugh Dickins
  2012-12-21 19:58             ` Mel Gorman
  0 siblings, 2 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-12-21 16:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

On Fri, Dec 21, 2012 at 5:47 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Dec 20, 2012 at 02:55:22PM -0800, David Rientjes wrote:
>>
>> This is probably worth discussing now to see if we can't revert
>> b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()"), keep it
>> only as a spinlock as you suggest, and do what KOSAKI suggested in
>> http://marc.info/?l=linux-kernel&m=133940650731255 instead.  I don't think
>> it's worth trying to optimize this path at the cost of having both a
>> spinlock and mutex.
>
> Jeez, I'm still not keen on that approach for the reasons that are explained
> in the changelog for b22d127a39dd.

Christ, Mel.

Your reasons in b22d127a39dd are weak as hell, and then you come up
with *THIS* shit instead:

> That leads to this third *ugly* option that conditionally drops the lock
> and it's up to the caller to figure out what happened. Fooling around with
> how it conditionally releases the lock results in different sorts of ugly.
> We now have three ugly sister patches for this. Who wants to be Cinderella?
>
> ---8<---
> mm: numa: Release the PTL if calling vm_ops->get_policy during NUMA hinting faults

Heck no. In fact, not a f*cking way in hell. Look yourself in the
mirror, Mel. This patch is ugly, and *guaranteed* to result in subtle
locking issues, and then you have the *gall* to quote the "uhh, that's
a bit ugly due to some trivial duplication" thing in commit
b22d127a39dd.

Reverting commit b22d127a39dd and just having a "ok, if we need to
allocate, then drop the lock, allocate, re-get the lock, and see if we
still need the new allocation" is *beautiful* code compared to the
diseased abortion you just posted.

Seriously. Conditional locking is error-prone, and about a million
times worse than the trivial fix that Kosaki suggested.
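
A rough userspace sketch of the "drop the lock, allocate, re-get the lock,
and see if we still need the new allocation" pattern follows. It is not
Kosaki's patch: struct table, table_init(), table_add() and the pthread
mutex are stand-ins assumed purely for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical list guarded by one lock; every name here is illustrative. */
struct node {
	int key;
	struct node *next;
};

struct table {
	pthread_mutex_t lock;
	struct node *head;
};

void table_init(struct table *t)
{
	pthread_mutex_init(&t->lock, NULL);
	t->head = NULL;
}

/* Caller holds t->lock. */
static struct node *find_locked(struct table *t, int key)
{
	struct node *n;

	for (n = t->head; n; n = n->next)
		if (n->key == key)
			return n;
	return NULL;
}

/* Returns 0 if key is present afterwards, -1 if the allocation failed. */
int table_add(struct table *t, int key)
{
	struct node *spare = NULL;

restart:
	pthread_mutex_lock(&t->lock);
	if (!find_locked(t, key)) {
		if (!spare) {
			/* Need memory: drop the lock, allocate, re-check. */
			pthread_mutex_unlock(&t->lock);
			spare = malloc(sizeof(*spare));
			if (!spare)
				return -1;
			goto restart;
		}
		spare->key = key;
		spare->next = t->head;
		t->head = spare;
		spare = NULL;	/* now owned by the table */
	}
	assert(find_locked(t, key));
	pthread_mutex_unlock(&t->lock);

	/* Non-NULL only if someone else added key while we were allocating. */
	free(spare);
	return 0;
}
```

The lock is always taken and released on the same straight-line path, and
the only extra cost is the occasional wasted allocation when the re-check
finds the work already done.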

                         Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 16:53           ` Linus Torvalds
@ 2012-12-21 18:21             ` Hugh Dickins
  2012-12-21 21:51               ` Linus Torvalds
  2012-12-21 19:58             ` Mel Gorman
  1 sibling, 1 reply; 55+ messages in thread
From: Hugh Dickins @ 2012-12-21 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, David Rientjes, Ingo Molnar,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Thomas Gleixner, Johannes Weiner, Sasha Levin,
	KOSAKI Motohiro

On Fri, 21 Dec 2012, Linus Torvalds wrote:
> On Fri, Dec 21, 2012 at 5:47 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Dec 20, 2012 at 02:55:22PM -0800, David Rientjes wrote:
> >>
> >> This is probably worth discussing now to see if we can't revert
> >> b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()"), keep it
> >> only as a spinlock as you suggest, and do what KOSAKI suggested in
> >> http://marc.info/?l=linux-kernel&m=133940650731255 instead.  I don't think
> >> it's worth trying to optimize this path at the cost of having both a
> >> spinlock and mutex.
> >
> > Jeez, I'm still not keen on that approach for the reasons that are explained
> > in the changelog for b22d127a39dd.
> 
> Christ, Mel.
> 
> Your reasons in b22d127a39dd are weak as hell, and then you come up
> with *THIS* shit instead:
> 
> > That leads to this third *ugly* option that conditionally drops the lock
> > and it's up to the caller to figure out what happened. Fooling around with
> > how it conditionally releases the lock results in different sorts of ugly.
> > We now have three ugly sister patches for this. Who wants to be Cinderella?
> >
> > ---8<---
> > mm: numa: Release the PTL if calling vm_ops->get_policy during NUMA hinting faults
> 
> Heck no. In fact, not a f*cking way in hell. Look yourself in the
> mirror, Mel. This patch is ugly, and *guaranteed* to result in subtle
> locking issues, and then you have the *gall* to quote the "uhh, that's
> a bit ugly due to some trivial duplication" thing in commit
> b22d127a39dd.
> 
> Reverting commit b22d127a39dd and just having a "ok, if we need to
> allocate, then drop the lock, allocate, re-get the lock, and see if we
> still need the new allocation" is *beautiful* code compared to the
> diseased abortion you just posted.
> 
> Seriously. Conditional locking is error-prone, and about a million
> times worse than the trivial fix that Kosaki suggested.

I'm picking up a vibe that you don't entirely like Mel's approach.

I've an unsubstantiated suspicion that it's also incomplete as is.
Although at first I thought huge_memory.c does not need a similar
mod, because THPages are anonymous and cannot come from tmpfs,
I now wonder about a MAP_PRIVATE mapping from tmpfs - for better
or for worse, anon pages there are subject to the same mempolicy
as the shared file pages, and I don't see what prevents khugepaged
from gathering those into THPages.  But it didn't happen when I
tried, so maybe I'm just missing what prevents it.

I don't understand David's and Mel's remarks about the "shared pages"
check making Sasha's warning unlikely: page_mapcount has nothing to do
with whether a page belongs to shm/shmem/tmpfs, and it's easy enough
to reproduce Sasha's warning on the current git tree.  "mount -o
remount,mpol=local /tmp" or something like that is useful in testing.

I wish wish wish I had time to spend on this today, but I don't.
And I've not looked to see (let alone tested) whether it's easy
to revert Mel's mutex then add in Kosaki's patch (which I didn't
look at so have no opinion on).

Shall we go for Peter/David's mutex+spinlock for rc1 - I assume
they both tested that - with a promise to do better in rc2?

What I wanted to try is separate the get_vma_policy() out from
mpol_misplaced(), and have the various callsites do that first
outside the page table lock, passing it in to mpol_misplaced.
But that doesn't work (efficiently) unless it also returns the
range that that policy is valid for, so we don't have to (drop
lock and) call it on every pte.  I cannot do that for rc1, and
perhaps it's irrelevant if Kosaki's patch is preferred.
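
To make the "also returns the range" idea concrete, here is a small
userspace sketch. It is only a model under assumptions: struct policy_range,
lookup_policy() and count_matching() are hypothetical names, the table is a
sorted array rather than the rbtree, and "policy" is just an integer.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in: a policy id valid for page offsets [start, end). */
struct policy_range {
	unsigned long start, end;
	int policy;
};

int lookups;	/* counts how often the "expensive" lookup runs */

/*
 * Look up the policy covering idx in a sorted, non-overlapping table and
 * report the range it is valid for. Offsets in a gap get policy -1
 * (default) with a range that spans the whole gap.
 */
struct policy_range lookup_policy(const struct policy_range *tab, size_t n,
				  unsigned long idx)
{
	struct policy_range r = { 0, ULONG_MAX, -1 };
	size_t i;

	lookups++;
	for (i = 0; i < n; i++) {
		if (idx < tab[i].start) {
			r.end = tab[i].start;	/* gap ends at the next entry */
			return r;
		}
		if (idx < tab[i].end)
			return tab[i];
		r.start = tab[i].end;		/* gap starts after this entry */
	}
	return r;
}

/* Walk [first, last) and count the pages whose policy equals want. */
unsigned long count_matching(const struct policy_range *tab, size_t n,
			     unsigned long first, unsigned long last, int want)
{
	struct policy_range cur = { 0, 0, -1 };	/* empty: forces a lookup */
	unsigned long idx, hits = 0;

	assert(first <= last);
	for (idx = first; idx < last; idx++) {
		/* Only look up again once we leave the cached range. */
		if (idx < cur.start || idx >= cur.end)
			cur = lookup_policy(tab, n, idx);
		if (cur.policy == want)
			hits++;
	}
	return hits;
}
```

With two table entries covering offsets 0-3 and 8-11, a walk over 16 pages
does four lookups instead of sixteen, which is the saving that would let the
lookup sit outside the page table lock.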

(Perhaps I should confess I've another reason to come here for
rc2: that "+ info->vfs_inode.i_ino" we recently added for better
interleave distribution in shmem_alloc_page: I think NUMA placement
faults will be fighting shmem_alloc_page's choices because that
offset is not exposed.)
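
A toy illustration of that suspected mismatch, under loud assumptions:
interleave is reduced to "offset modulo number of nodes" (the kernel walks a
nodemask instead), and alloc_node(), expected_node() and
count_disagreements() are made-up names, not kernel functions.

```c
#include <assert.h>

/* Simplified interleave: node = offset modulo the number of nodes. */
static int interleave_node(unsigned long offset, int nr_nodes)
{
	assert(nr_nodes > 0);
	return (int)(offset % (unsigned long)nr_nodes);
}

/* Node the allocation side picks: page index biased by the inode number. */
int alloc_node(unsigned long index, unsigned long ino, int nr_nodes)
{
	return interleave_node(index + ino, nr_nodes);
}

/* Node a placement check expects if it only sees the page index. */
int expected_node(unsigned long index, int nr_nodes)
{
	return interleave_node(index, nr_nodes);
}

/* Count the pages in [0, nr_pages) where the two sides disagree. */
unsigned long count_disagreements(unsigned long ino, unsigned long nr_pages,
				  int nr_nodes)
{
	unsigned long index, bad = 0;

	for (index = 0; index < nr_pages; index++)
		if (alloc_node(index, ino, nr_nodes) !=
		    expected_node(index, nr_nodes))
			bad++;
	return bad;
}
```

In this model every page disagrees unless the inode number happens to be a
multiple of the node count, so each hinting fault would want to move the
page away from where the allocator deliberately put it.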

Hugh

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 18:21             ` Hugh Dickins
@ 2012-12-21 21:51               ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-12-21 21:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, David Rientjes, Ingo Molnar,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Thomas Gleixner, Johannes Weiner, Sasha Levin,
	KOSAKI Motohiro

On Fri, Dec 21, 2012 at 10:21 AM, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 21 Dec 2012, Linus Torvalds wrote:
>>
>> compared to the diseased abortion you just posted.
>
> I'm picking up a vibe that you don't entirely like Mel's approach.

Good job. I was a bit nervous that I was being too subtle.

> I don't understand David's and Mel's remarks about the "shared pages"
> check making Sasha's warning unlikely: page_mapcount has nothing to do
> with whether a page belongs to shm/shmem/tmpfs, and it's easy enough
> to reproduce Sasha's warning on the current git tree.  "mount -o
> remount,mpol=local /tmp" or something like that is useful in testing.

I think that Mel and David may talk about the mutex actually blocking
(not just the debug message possibly triggering).

> I wish wish wish I had time to spend on this today, but I don't.
> And I've not looked to see (let alone tested) whether it's easy
> to revert Mel's mutex then add in Kosaki's patch (which I didn't
> look at so have no opinion on).

I don't actually have Kosaki's patch either, just the description of
it. We've done that kind of "preallocate before taking the lock"
before, though.

> Shall we go for Peter/David's mutex+spinlock for rc1 - I assume
> they both tested that - with a promise to do better in rc2?

Well, if the plan is to fix it for rc2, then there is no point in
putting a workaround in now, since actually hitting the problem (as
opposed to seeing the warning) is presumably much harder.

               Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 16:53           ` Linus Torvalds
  2012-12-21 18:21             ` Hugh Dickins
@ 2012-12-21 19:58             ` Mel Gorman
  2012-12-21 22:02               ` Linus Torvalds
  1 sibling, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2012-12-21 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

On Fri, Dec 21, 2012 at 08:53:33AM -0800, Linus Torvalds wrote:
> > Jeez, I'm still not keen on that approach for the reasons that are explained
> > in the changelog for b22d127a39dd.
> 
> Christ, Mel.
> 
> Your reasons in b22d127a39dd are weak as hell, and then you come up
> with *THIS* shit instead:
> 

The complaint about duplicated code was based on the fact that the mempolicy
code was a complete mess and duplicating code did not help. I'll accept
that it's weak.

> > That leads to this third *ugly* option that conditionally drops the lock
> > and it's up to the caller to figure out what happened. Fooling around with
> > how it conditionally releases the lock results in different sorts of ugly.
> > We now have three ugly sister patches for this. Who wants to be Cinderella?
> >
> > ---8<---
> > mm: numa: Release the PTL if calling vm_ops->get_policy during NUMA hinting faults
> 
> Heck no. In fact, not a f*cking way in hell. Look yourself in the
> mirror, Mel.

I could do with a shave, a glass of wine and a holiday in that order.

> This patch is ugly, and *guaranteed* to result in subtle
> locking issues, and then you have the *gall* to quote the "uhh, that's
> a bit ugly due to some trivial duplication" thing in commit
> b22d127a39dd.
> 

No argument.

> Reverting commit b22d127a39dd and just having a "ok, if we need to
> allocate, then drop the lock, allocate, re-get the lock, and see if we
> still need the new allocation" is *beautiful* code compared to the
> diseased abortion you just posted.
> 
> Seriously. Conditional locking is error-prone, and about a million
> times worse than the trivial fix that Kosaki suggested.
> 

Kosaki's patch does not fix the actual problem with NUMA hinting
faults. Converting to a spinlock is nice but we'd still hold the PTL at
the time sp_alloc is called and potentially allocating GFP_KERNEL with a
spinlock held.

At the risk of making your head explode, here is another patch.  It does
the conversion to spinlock as Kosaki originally did. It's unnecessary for
the actual problem at hand but I felt that avoiding it would piss you off
more. The actual fix is changing how the PTL is handled by the NUMA hinting
fault handler. It's still conditionally locking, but only at the one location
where it matters, and there it'll be obvious. As before, we could
unconditionally unlock but then a PMD fault potentially releases/acquires
the PTL a large number of times.

This survived a trinity fuzz test for mbind running for 5 minutes and
autonumabench. CONFIG_SLUB_DEBUG, CONFIG_DEBUG_MUTEXES, CONFIG_DEBUG_SPINLOCK
and CONFIG_NUMA_BALANCING were enabled.  Unfortunately, as part of the
same test I also checked slabinfo and I see that shared_policy_node is
continually increasing, indicating that it's leaking again. This also
happens with current git so it's another regression.

---8<---
mm: mempolicy: Convert shared_policy mutex to spinlock and do not hold PTL across a shmem vm_ops->get_policy()

Sasha was fuzzing with trinity and reported the following problem:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
2 locks held by trinity-main/6361:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
 #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
Pid: 6361, comm: trinity-main Tainted: G        W
3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
Call Trace:
 [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
 [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
 [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
 [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
 [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
 [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
 [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0

This was triggered by a different version of automatic NUMA balancing but
in theory the current version is vulnerable to the same problem.

do_numa_page
  -> numa_migrate_prep
    -> mpol_misplaced
      -> get_vma_policy
        -> shmem_get_policy

It's very unlikely this will happen as shared pages are not marked
pte_numa -- see the page_mapcount() check in change_pte_range() -- but
it is possible.

To address this, this patch is in two parts. First it restores sp->lock as
originally implemented by Kosaki Motohiro. This is not actually necessary at
this point but the related flames were such that I felt that hand-waving at
it would result in a second kick in the arse.

The second part alters how PTL is acquired and released during a NUMA
hinting fault.  numa_migrate_prep() only takes a reference to the page and
the caller calls mpol_misplaced(). It is up to the caller how to handle
the PTL. In the case of do_numa_page(), it just releases it. For PMDs,
it will hold the PTL if there is no vm_ops->get_policy(). Otherwise
it will release the PTL and reacquire it if necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h |    2 +-
 mm/memory.c               |   27 +++++++++++++-----
 mm/mempolicy.c            |   68 ++++++++++++++++++++++++++++++++-------------
 3 files changed, 69 insertions(+), 28 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 9adc270..cc51d17 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -123,7 +123,7 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
-	struct mutex mutex;
+	spinlock_t lock;
 };
 
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
diff --git a/mm/memory.c b/mm/memory.c
index e0a9b0c..d8c2a5c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3431,7 +3431,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
-int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+static void numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 				unsigned long addr, int current_nid)
 {
 	get_page(page);
@@ -3439,8 +3439,6 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-
-	return mpol_misplaced(page, vma, addr);
 }
 
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -3479,8 +3477,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+	numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
+
+	target_nid = mpol_misplaced(page, vma, addr);
 	if (target_nid == -1) {
 		/*
 		 * Account for the fault against the current node if it not
@@ -3513,6 +3513,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	bool policy_vma = (vma->vm_ops && vma->vm_ops->get_policy);
 	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
@@ -3566,15 +3567,27 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
-		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
+		numa_migrate_prep(page, vma, addr, page_to_nid(page));
+
+		/*
+		 * If there is a possibility that mpol_misplaced will need
+		 * to allocate for a shared memory policy then we have to
+		 * release the PTL now and reacquire later if necessary.
+		 */
+		if (policy_vma)
+			pte_unmap_unlock(pte, ptl);
+
+		target_nid = mpol_misplaced(page, vma, addr);
 		if (target_nid == -1) {
+			if (policy_vma)
+				pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 			put_page(page);
 			continue;
 		}
 
 		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
+		if (!policy_vma)
+			pte_unmap_unlock(pte, ptl);
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d1b315e..ed8ebbf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2132,7 +2132,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds sp->lock */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2196,13 +2196,13 @@ mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
 
 	if (!sp->root.rb_node)
 		return NULL;
-	mutex_lock(&sp->mutex);
+	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
 	}
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
 	return pol;
 }
 
@@ -2328,6 +2328,14 @@ static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 	sp_free(n);
 }
 
+static void sp_node_init(struct sp_node *node, unsigned long start,
+			unsigned long end, struct mempolicy *pol)
+{
+	node->start = start;
+	node->end = end;
+	node->policy = pol;
+}
+
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
 				struct mempolicy *pol)
 {
@@ -2344,10 +2352,7 @@ static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
 		return NULL;
 	}
 	newpol->flags |= MPOL_F_SHARED;
-
-	n->start = start;
-	n->end = end;
-	n->policy = newpol;
+	sp_node_init(n, start, end, newpol);
 
 	return n;
 }
@@ -2357,9 +2362,12 @@ static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
 				 unsigned long end, struct sp_node *new)
 {
 	struct sp_node *n;
+	struct sp_node *n_new = NULL;
+	struct mempolicy *mpol_new = NULL;
 	int ret = 0;
 
-	mutex_lock(&sp->mutex);
+restart:
+	spin_lock(&sp->lock);
 	n = sp_lookup(sp, start, end);
 	/* Take care of old policies in the same range. */
 	while (n && n->start < end) {
@@ -2372,14 +2380,16 @@ static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
 		} else {
 			/* Old policy spanning whole new range. */
 			if (n->end > end) {
-				struct sp_node *new2;
-				new2 = sp_alloc(end, n->end, n->policy);
-				if (!new2) {
-					ret = -ENOMEM;
-					goto out;
-				}
+				if (!n_new)
+					goto alloc_new;
+
+				*mpol_new = *n->policy;
+				atomic_set(&mpol_new->refcnt, 1);
+				sp_node_init(n_new, end, n->end, mpol_new);
+				sp_insert(sp, n_new);
 				n->end = start;
-				sp_insert(sp, new2);
+				n_new = NULL;
+				mpol_new = NULL;
 				break;
 			} else
 				n->end = start;
@@ -2390,9 +2400,27 @@ static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
 	}
 	if (new)
 		sp_insert(sp, new);
-out:
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
+	ret = 0;
+
+err_out:
+	if (mpol_new)
+		mpol_put(mpol_new);
+	if (n_new)
+		kmem_cache_free(sn_cache, n_new);
+		
 	return ret;
+
+alloc_new:
+	spin_unlock(&sp->lock);
+	ret = -ENOMEM;
+	n_new = kmem_cache_alloc(sn_cache, GFP_KERNEL);
+	if (!n_new)
+		goto err_out;
+	mpol_new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	if (!mpol_new)
+		goto err_out;
+	goto restart;
 }
 
 /**
@@ -2410,7 +2438,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 	int ret;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
-	mutex_init(&sp->mutex);
+	spin_lock_init(&sp->lock);
 
 	if (mpol) {
 		struct vm_area_struct pvma;
@@ -2476,14 +2504,14 @@ void mpol_free_shared_policy(struct shared_policy *p)
 
 	if (!p->root.rb_node)
 		return;
-	mutex_lock(&p->mutex);
+	spin_lock(&p->lock);
 	next = rb_first(&p->root);
 	while (next) {
 		n = rb_entry(next, struct sp_node, nd);
 		next = rb_next(&n->nd);
 		sp_delete(p, n);
 	}
-	mutex_unlock(&p->mutex);
+	spin_unlock(&p->lock);
 }
 
 #ifdef CONFIG_NUMA_BALANCING

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 19:58             ` Mel Gorman
@ 2012-12-21 22:02               ` Linus Torvalds
  2012-12-21 23:10                 ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2012-12-21 22:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

On Fri, Dec 21, 2012 at 11:58 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> Kosaki's patch does not fix the actual problem with NUMA hinting
> faults. Converting to a spinlock is nice but we'd still hold the PTL at
> the time sp_alloc is called and potentially allocating GFP_KERNEL with a
> spinlock held.

The problem I saw reported - and the problem that the "mutex+spinlock"
patch was fixing - wasn't actually sp_alloc(), but just sp_lookup()
through mpol_shared_policy_lookup().

And converting that to a spinlock would definitely fix it - taking
that spinlock quickly for the lookup while holding the pt lock is
fine.
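
The rule behind the warning can be modelled in a few lines of userspace C.
This is a single-threaded toy, not the kernel's implementation: the toy_*
helpers, atomic_depth and violations are invented for the sketch and merely
mimic what the might_sleep() check in the backtrace reports.

```c
#include <assert.h>

/* Toy, single-threaded model; none of these are the kernel's primitives. */
static int atomic_depth;	/* > 0 while any toy spinlock is held */
static int violations;		/* "sleeping function called from invalid context" */

struct toy_spinlock { int held; };
struct toy_mutex { int held; };

static void toy_spin_lock(struct toy_spinlock *l)
{
	assert(!l->held);
	l->held = 1;
	atomic_depth++;		/* sleeping is forbidden from here on */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
	assert(l->held);
	l->held = 0;
	atomic_depth--;
}

static void toy_mutex_lock(struct toy_mutex *m)
{
	if (atomic_depth > 0)	/* a mutex may sleep, so this is a bug */
		violations++;
	assert(!m->held);
	m->held = 1;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/* Policy lookup under the page table lock, tree guarded by a mutex. */
int lookup_with_mutex(void)
{
	struct toy_spinlock ptl = { 0 };
	struct toy_mutex tree_mutex = { 0 };

	violations = 0;
	toy_spin_lock(&ptl);
	toy_mutex_lock(&tree_mutex);	/* flagged: may sleep while atomic */
	toy_mutex_unlock(&tree_mutex);
	toy_spin_unlock(&ptl);
	return violations;
}

/* Same lookup with the tree guarded by a spinlock: nesting is allowed. */
int lookup_with_spinlock(void)
{
	struct toy_spinlock ptl = { 0 };
	struct toy_spinlock tree_lock = { 0 };

	violations = 0;
	toy_spin_lock(&ptl);
	toy_spin_lock(&tree_lock);
	toy_spin_unlock(&tree_lock);
	toy_spin_unlock(&ptl);
	return violations;
}
```

The mutex variant is flagged once and the spinlock variant not at all, and
no allocation is involved in either case.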

Now, if we have to call sp_alloc() too at some point, that's
different, but that wouldn't be helped by the "mutex+spinlock" patch
(that started this thread) anyway.

> At the risk of making your head explode, here is another patch.

So I don't hate this patch, but I don't see the point of your games in
do_pmd_numa_page(). I'm not seeing the allocation in mpol_misplaced(),
and that wasn't what the original report was.

The backtrace you quote is literally *only* about the fact that you
cannot take a mutex inside a spinlock. No allocation, just a lookup.

So where's the sp_alloc()?

                 Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 22:02               ` Linus Torvalds
@ 2012-12-21 23:10                 ` Mel Gorman
  2012-12-22  0:36                   ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2012-12-21 23:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

On Fri, Dec 21, 2012 at 02:02:04PM -0800, Linus Torvalds wrote:
> On Fri, Dec 21, 2012 at 11:58 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > Kosaki's patch does not fix the actual problem with NUMA hinting
> > faults. Converting to a spinlock is nice but we'd still hold the PTL at
> > the time sp_alloc is called and potentially allocating GFP_KERNEL with a
> > spinlock held.
> 
> The problem I saw reported - and the problem that the "mutex+spinlock"
> patch was fixing - wasn't actually sp_alloc(), but just sp_lookup()
> through mpol_shared_policy_lookup().
> 
> And converting that to a spinlock would definitely fix it - taking
> that spinlock quickly for the lookup while holding the pt lock is
> fine.
> 

Yes, I realised when walking to the shop afterwards that sp_alloc()
should never be called from this path as we're only reading the policy,
no modifications. Kosaki's patch on its own is enough.

> So I don't hate this patch, but I don't see the point of your games in
> do_pmd_numa_page(). I'm not seeing the allocation in mpol_misplaced(),
> and that wasn't what the original report was.
> 

They are unnecessary. This passed the same set of tests. We're still leaking
shared_policy_node objects, a regression introduced at some point, but
unfortunately I'm not going to get the chance to debug that before the new
year.

---8<---
mm: mempolicy: Convert shared_policy mutex to spinlock

Sasha was fuzzing with trinity and reported the following problem:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
2 locks held by trinity-main/6361:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
 #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
Pid: 6361, comm: trinity-main Tainted: G        W
3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
Call Trace:
 [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
 [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
 [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
 [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
 [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
 [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
 [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0

This was triggered by a different version of automatic NUMA balancing but
in theory the current version is vulnerable to the same problem.

do_numa_page
  -> numa_migrate_prep
    -> mpol_misplaced
      -> get_vma_policy
        -> shmem_get_policy

It's very unlikely this will happen as shared pages are not marked
pte_numa -- see the page_mapcount() check in change_pte_range() -- but
it is possible.

To address this, this patch restores sp->lock as originally implemented
by Kosaki Motohiro. In the path where get_vma_policy() is called, it
should not be calling sp_alloc() so it is not necessary to treat the PTL
specially.

From: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h |    2 +-
 mm/mempolicy.c            |   68 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 9adc270..cc51d17 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -123,7 +123,7 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
-	struct mutex mutex;
+	spinlock_t lock;
 };
 
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d1b315e..ed8ebbf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2132,7 +2132,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds sp->lock */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2196,13 +2196,13 @@ mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
 
 	if (!sp->root.rb_node)
 		return NULL;
-	mutex_lock(&sp->mutex);
+	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
 	}
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
 	return pol;
 }
 
@@ -2328,6 +2328,14 @@ static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 	sp_free(n);
 }
 
+static void sp_node_init(struct sp_node *node, unsigned long start,
+			unsigned long end, struct mempolicy *pol)
+{
+	node->start = start;
+	node->end = end;
+	node->policy = pol;
+}
+
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
 				struct mempolicy *pol)
 {
@@ -2344,10 +2352,7 @@ static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
 		return NULL;
 	}
 	newpol->flags |= MPOL_F_SHARED;
-
-	n->start = start;
-	n->end = end;
-	n->policy = newpol;
+	sp_node_init(n, start, end, newpol);
 
 	return n;
 }
@@ -2357,9 +2362,12 @@ static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
 				 unsigned long end, struct sp_node *new)
 {
 	struct sp_node *n;
+	struct sp_node *n_new = NULL;
+	struct mempolicy *mpol_new = NULL;
 	int ret = 0;
 
-	mutex_lock(&sp->mutex);
+restart:
+	spin_lock(&sp->lock);
 	n = sp_lookup(sp, start, end);
 	/* Take care of old policies in the same range. */
 	while (n && n->start < end) {
@@ -2372,14 +2380,16 @@ static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
 		} else {
 			/* Old policy spanning whole new range. */
 			if (n->end > end) {
-				struct sp_node *new2;
-				new2 = sp_alloc(end, n->end, n->policy);
-				if (!new2) {
-					ret = -ENOMEM;
-					goto out;
-				}
+				if (!n_new)
+					goto alloc_new;
+
+				*mpol_new = *n->policy;
+				atomic_set(&mpol_new->refcnt, 1);
+				sp_node_init(n_new, end, n->end, mpol_new);
+				sp_insert(sp, n_new);
 				n->end = start;
-				sp_insert(sp, new2);
+				n_new = NULL;
+				mpol_new = NULL;
 				break;
 			} else
 				n->end = start;
@@ -2390,9 +2400,27 @@ static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
 	}
 	if (new)
 		sp_insert(sp, new);
-out:
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
+	ret = 0;
+
+err_out:
+	if (mpol_new)
+		mpol_put(mpol_new);
+	if (n_new)
+		kmem_cache_free(sn_cache, n_new);
+		
 	return ret;
+
+alloc_new:
+	spin_unlock(&sp->lock);
+	ret = -ENOMEM;
+	n_new = kmem_cache_alloc(sn_cache, GFP_KERNEL);
+	if (!n_new)
+		goto err_out;
+	mpol_new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	if (!mpol_new)
+		goto err_out;
+	goto restart;
 }
 
 /**
@@ -2410,7 +2438,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 	int ret;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
-	mutex_init(&sp->mutex);
+	spin_lock_init(&sp->lock);
 
 	if (mpol) {
 		struct vm_area_struct pvma;
@@ -2476,14 +2504,14 @@ void mpol_free_shared_policy(struct shared_policy *p)
 
 	if (!p->root.rb_node)
 		return;
-	mutex_lock(&p->mutex);
+	spin_lock(&p->lock);
 	next = rb_first(&p->root);
 	while (next) {
 		n = rb_entry(next, struct sp_node, nd);
 		next = rb_next(&n->nd);
 		sp_delete(p, n);
 	}
-	mutex_unlock(&p->mutex);
+	spin_unlock(&p->lock);
 }
 
 #ifdef CONFIG_NUMA_BALANCING

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-21 23:10                 ` Mel Gorman
@ 2012-12-22  0:36                   ` Linus Torvalds
  2013-01-02 19:43                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2012-12-22  0:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Sasha Levin, KOSAKI Motohiro

Ok, this looks fine to me, but I'd like to get a sign-off from Kosaki
too, and I guess it's really not all that urgent, so I can do the -rc1
release tonight without worrying about it, knowing that a fix is at
least pending, and that nobody is likely to actually ever hit the
problem in practice anyway.

             Linus

On Fri, Dec 21, 2012 at 3:10 PM, Mel Gorman <mgorman@suse.de> wrote:
>
> mm: mempolicy: Convert shared_policy mutex to spinlock

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] mm, mempolicy: Introduce spinlock to read shared policy tree
  2012-12-22  0:36                   ` Linus Torvalds
@ 2013-01-02 19:43                     ` KOSAKI Motohiro
  0 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2013-01-02 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, David Rientjes, Ingo Molnar,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Sasha Levin

> Ok, this looks fine to me, but I'd like to get a sign-off from Kosaki
> too, and I guess it's really not all that urgent, so I can do the -rc1
> release tonight without worrying about it, knowing that a fix is at
> least pending, and that nobody is likely to actually ever hit the
> problem in practice anyway.

Sorry for the long silence. I had stomach trouble and was not actively
developing last year. I apologize for this.

Anyway, I ran the basic mempolicy tests again and have not seen any
failure. Thus

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


Thank you.


* [PATCH 19/33] sched: Add adaptive NUMA affinity support
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (17 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-26 20:32   ` Sasha Levin
  2012-11-22 22:49 ` [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
                   ` (15 subsequent siblings)
  34 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

The principal ideas behind this patch are the fundamental
difference between shared and privately used memory and the very
strong desire to only rely on per-task behavioral state for
scheduling decisions.

We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.

To approximate the above strict definition we recognise that
task placement is dominantly per cpu and thus using cpu granular
page access state is a natural fit. Thus we introduce
page::last_cpu as the cpu that last accessed a page.

Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i' reflecting the amount of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' come from the same cpu are assumed private and the others
are shared.

[ This means that we will start evaluating this state when the
  task has not migrated for at least 2 scans, see NUMA_SETTLE ]
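
The private/shared classification above can be sketched in plain
userspace C. This is only an illustration modelled on the accounting in
task_numa_fault() from this patch; NR_NODES, struct fault_stats and
record_fault() are hypothetical names, not kernel symbols.

```c
#define NR_NODES 4	/* hypothetical node count for this sketch */

struct fault_stats {
	/* slot 2*node + 0 counts shared faults, 2*node + 1 private faults */
	unsigned long faults[2 * NR_NODES];
};

/*
 * Account 'pages' faulting pages that live on 'node'. The fault is
 * treated as private when the faulting cpu is also the cpu that
 * touched the page last time, and as shared otherwise.
 */
static void record_fault(struct fault_stats *fs, int node,
			 int this_cpu, int last_cpu, int pages)
{
	int priv = (this_cpu == last_cpu);

	fs->faults[2 * node + priv] += pages;
}
```

Interleaving the two counters per node keeps both vectors in a single
allocation, which is also how the patch lays out p->numa_faults.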

Using these vectors we can compute the total number of
shared/private pages of this task and determine which dominates.

[ Note that for shared tasks we only see '1/n' of the total number
  of shared pages, because the other tasks will take the other
  faults; where 'n' is the number of tasks sharing the memory.
  So for an equal comparison we should divide total private by
  'n' as well, but we don't have 'n' so we pick 2. ]
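
As a rough sketch, the comparison described in the note reduces to a
single test on the two totals. It follows the expression used in
task_numa_placement() in this patch; task_is_shared() is a hypothetical
helper name used only for this example.

```c
/*
 * Decide whether a task counts as 'shared'. The private total is
 * divided by the assumed sharer count of 2, because a task only sees
 * roughly 1/n of the faults on memory it shares with n-1 other tasks.
 */
static int task_is_shared(unsigned long total_shared,
			  unsigned long total_private)
{
	return total_shared >= total_private / 2;
}
```

With 10 shared and 20 private faults the task is classified as shared;
with 9 shared and 20 private faults it stays private.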

We can also compute which node holds most of our memory; running
on this node will be called 'ideal placement'. (As per previous
patches we will prefer to pull memory towards wherever we run.)
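
A minimal sketch of how the 'ideal' node could be picked from the
interleaved fault counters, assuming the layout faults[2*node + priv].
It follows the loop in task_numa_placement() from this patch, including
the halving of each counter so that older samples decay; pick_max_node()
is a hypothetical name.

```c
/*
 * Return the node with the most faults (shared + private), or -1 if
 * no faults were recorded. Ties go to the lowest-numbered node. Every
 * counter is halved after it has been read so that history decays.
 */
static int pick_max_node(unsigned long *faults, int nr_nodes)
{
	unsigned long max_faults = 0;
	int node, max_node = -1;

	for (node = 0; node < nr_nodes; node++) {
		unsigned long sum = faults[2 * node] + faults[2 * node + 1];

		faults[2 * node] /= 2;
		faults[2 * node + 1] /= 2;

		if (sum > max_faults) {
			max_faults = sum;
			max_node = node;
		}
	}
	return max_node;
}
```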

We change the load-balancer to prefer moving tasks in order of:

  1) !numa tasks and numa tasks in the direction of more faults
  2) allow !ideal tasks getting worse in the direction of faults
  3) allow private tasks to get worse
  4) allow shared tasks to get worse

This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases the interconnect bandwidth since not all memory can
follow.
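
The ordering above can be read as a set of passes, each one admitting
one more class of task. The sketch below is a deliberate simplification
under stated assumptions: it ignores cache hotness and the step-wise
'faults down' check that can_migrate_numa_task() also applies, and
may_move(), the enum and the 'pass' parameter are hypothetical names for
this example only.

```c
#include <stdbool.h>

enum task_kind { TASK_NOT_NUMA = -1, TASK_PRIVATE = 0, TASK_SHARED = 1 };

/*
 * pass 0: only !numa tasks and moves that improve locality
 * pass 1: additionally tasks that are not on their ideal node
 * pass 2: additionally private tasks
 * pass 3: additionally shared tasks (everything)
 */
static bool may_move(int kind, bool improves, bool ideal, int pass)
{
	if (kind == TASK_NOT_NUMA || improves)
		return true;
	if (pass < 1)
		return false;
	if (!ideal)
		return true;
	if (pass < 2)
		return false;
	if (kind == TASK_PRIVATE)
		return true;
	return pass >= 3;
}
```

A shared task sitting on its ideal node is therefore the last thing to
be moved, which matches the intent of keeping shared tasks together.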

We also add an extra 'lateral' force to the load balancer that
perturbs the state when otherwise 'fairly' balanced. This
ensures we don't get 'stuck' in a state which is fair but
undesired from a memory location POV (see can_do_numa_run()).

Lastly, we allow shared tasks to defeat the default spreading of
tasks such that, when possible, they can aggregate on a single
node.

Shared tasks aggregate for the very simple reason that there has
to be a single node that holds most of their memory and a second
most, etc., and tasks want to move up the faults ladder.
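
Moving 'up the faults ladder' can be sketched as a strict comparison of
per-node fault totals, in the spirit of task_faults_up() in this patch.
The array of per-node totals and the name moves_up_ladder() are
assumptions made for this example.

```c
/*
 * Return 1 if moving from src_node to dst_node takes the task to a
 * node where it has strictly more recorded faults, 0 otherwise.
 * node_faults[i] is the total (shared + private) fault count on node i.
 */
static int moves_up_ladder(const unsigned long *node_faults,
			   int src_node, int dst_node)
{
	if (src_node == dst_node)
		return 0;	/* same node, nothing to gain */

	return node_faults[dst_node] > node_faults[src_node];
}
```

Because the comparison is strict, two nodes with equal fault counts do
not attract tasks from each other, which avoids pointless ping-ponging
in this simplified model.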

Enable it on x86. A number of other architectures are
most likely fine too - but they should enable and test this
feature explicitly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/scheduler/numa-problem.txt |  20 +-
 arch/x86/Kconfig                         |   2 +
 include/linux/sched.h                    |   1 +
 kernel/sched/core.c                      |  53 +-
 kernel/sched/fair.c                      | 975 +++++++++++++++++++++++++------
 kernel/sched/features.h                  |   8 +
 kernel/sched/sched.h                     |  38 +-
 7 files changed, 900 insertions(+), 197 deletions(-)

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
index a5d2fee..7f133e3 100644
--- a/Documentation/scheduler/numa-problem.txt
+++ b/Documentation/scheduler/numa-problem.txt
@@ -133,6 +133,8 @@ XXX properties of this M vs a potential optimal
 
  2b) migrate memory towards 'n_i' using 2 samples.
 
+XXX include the statistical babble on double sampling somewhere near
+
 This separates pages into those that will migrate and those that will not due
 to the two samples not matching. We could consider the first to be of 'p_i'
 (private) and the second to be of 's_i' (shared).
@@ -142,7 +144,17 @@ This interpretation can be motivated by the previously observed property that
 's_i' (shared). (here we loose the need for memory limits again, since it
 becomes indistinguishable from shared).
 
-XXX include the statistical babble on double sampling somewhere near
+ 2c) use cpu samples instead of node samples
+
+The problem with sampling on node granularity is that one loses 's_i' for
+the local node, since one cannot distinguish between two accesses from the
+same node.
+
+By increasing the granularity to per-cpu we gain the ability to have both an
+'s_i' and 'p_i' per node. Since we do all task placement per-cpu as well this
+seems like a natural match. The line where we overcommit cpus is where we lose
+granularity again, but when we lose overcommit we naturally spread tasks.
+Therefore it should work out nicely.
 
 This reduces the problem further; we loose 'M' as per 2a, it further reduces
 the 'T_k,l' (interconnect traffic) term to only include shared (since per the
@@ -150,12 +162,6 @@ above all private will be local):
 
   T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
 
-[ more or less matches the state of sched/numa and describes its remaining
-  problems and assumptions. It should work well for tasks without significant
-  shared memory usage between tasks. ]
-
-Possible future directions:
-
 Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
 can evaluate it;
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..95646fe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
 	def_bool y
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_NUMA_BALANCING
+	select ARCH_WANTS_NUMA_GENERIC_PGPROT
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 418d405..bb12cc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3611f5f..7b58366 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1800,6 +1800,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
@@ -5510,7 +5511,9 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_id);
 
-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
 {
 	struct sched_domain *sd;
 	int id = cpu;
@@ -5521,6 +5524,15 @@ static void update_top_cache_domain(int cpu)
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_id, cpu) = id;
+
+	for_each_domain(cpu, sd) {
+		if (cpumask_equal(sched_domain_span(sd),
+				  cpumask_of_node(cpu_to_node(cpu))))
+			goto got_node;
+	}
+	sd = NULL;
+got_node:
+	rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
 }
 
 /*
@@ -5563,7 +5575,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
-	update_top_cache_domain(cpu);
+	update_domain_cache(cpu);
 }
 
 /* cpus with isolated domains */
@@ -5985,6 +5997,37 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Change a task's NUMA state - called from the placement tick.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->numa_shared = shared;
+	p->numa_max_node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
@@ -6030,6 +6073,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
@@ -6884,7 +6928,6 @@ void __init sched_init(void)
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
-		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
 		rq->idle_stamp = 0;
@@ -6892,6 +6935,10 @@ void __init sched_init(void)
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
+#ifdef CONFIG_NUMA_BALANCING
+		rq->nr_shared_running = 0;
+#endif
+
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_flags = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8af0208..f3aeaac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,6 +29,9 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
 
 #include <trace/events/sched.h>
 
@@ -774,6 +777,235 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 /**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits is to maintain compute (task) and data
+ * (memory) locality.
+ *
+ * We keep a faults vector per task and use periodic fault scans to try and
+ * establish a task<->page relation. This assumes the task<->page relation is a
+ * compute<->data relation, this is false for things like virt. and n:m
+ * threading solutions but it's the best we can do given the information we
+ * have.
+ *
+ * We try and migrate such that we increase along the order provided by this
+ * vector while maintaining fairness.
+ *
+ * Tasks start out with their numa status unset (-1) this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	if (task_numa_shared(p) != -1) {
+		p->numa_weight = task_h_load(p);
+		rq->nr_numa_running++;
+		rq->nr_shared_running += task_numa_shared(p);
+		rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+		rq->numa_weight += p->numa_weight;
+	}
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (task_numa_shared(p) != -1) {
+		rq->nr_numa_running--;
+		rq->nr_shared_running -= task_numa_shared(p);
+		rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+		rq->numa_weight -= p->numa_weight;
+	}
+}
+
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+	p->numa_migrate_seq = 0;
+}
+
+static void task_numa_placement(struct task_struct *p)
+{
+	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+	unsigned long total[2] = { 0, 0 };
+	unsigned long faults, max_faults = 0;
+	int node, priv, shared, max_node = -1;
+
+	if (p->numa_scan_seq == seq)
+		return;
+
+	p->numa_scan_seq = seq;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		faults = 0;
+		for (priv = 0; priv < 2; priv++) {
+			faults += p->numa_faults[2*node + priv];
+			total[priv] += p->numa_faults[2*node + priv];
+			p->numa_faults[2*node + priv] /= 2;
+		}
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_node = node;
+		}
+	}
+
+	if (max_node != p->numa_max_node)
+		sched_setnuma(p, max_node, task_numa_shared(p));
+
+	p->numa_migrate_seq++;
+	if (sched_feat(NUMA_SETTLE) &&
+	    p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+		return;
+
+	/*
+	 * Note: shared is spread across multiple tasks and in the future
+	 * we might want to consider a different equation below to reduce
+	 * the impact of a little private memory accesses.
+	 */
+	shared = (total[0] >= total[1] / 2);
+	if (shared != task_numa_shared(p)) {
+		sched_setnuma(p, p->numa_max_node, shared);
+		p->numa_migrate_seq = 0;
+	}
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int last_cpu, int pages)
+{
+	struct task_struct *p = current;
+	int priv = (task_cpu(p) == last_cpu);
+
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
+
+	task_numa_placement(p);
+	p->numa_faults[2*node + priv] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work; /* protect against double add */
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal scan/migration frequency..
+	 */
+	migrate = mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	ACCESS_ONCE(mm->numa_scan_seq)++;
+	{
+		struct vm_area_struct *vma;
+
+		down_write(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!vma_migratable(vma))
+				continue;
+			change_prot_numa(vma, vma->vm_start, vma->vm_end);
+		}
+		up_write(&mm->mmap_sem);
+	}
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+
+		/*
+		 * We are comparing runtime to wall clock time here, which
+		 * puts a maximum scan frequency limit on the task work.
+		 *
+		 * This, together with the limits in task_numa_work() filters
+		 * us from over-sampling if there are many threads: if all
+		 * threads happen to come in at the same time we don't create a
+		 * spike in overhead.
+		 *
+		 * We also avoid multiple threads scanning at once in parallel to
+		 * each other.
+		 */
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
+#else /* !CONFIG_NUMA_BALANCING: */
+#ifdef CONFIG_SMP
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)	{ }
+#endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)	{ }
+static inline void task_tick_numa(struct rq *rq, struct task_struct *curr)	{ }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu)	{ }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/**************************************************
  * Scheduling class queueing methods:
  */
 
@@ -784,9 +1016,13 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_enqueue(rq, task_of(se));
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
+#endif /* CONFIG_SMP */
 	cfs_rq->nr_running++;
 }
 
@@ -796,8 +1032,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
 		list_del_init(&se->group_node);
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -3177,20 +3415,8 @@ unlock:
 	return new_cpu;
 }
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
 #ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
- * cfs_rq_of(p) references at time of call are still valid and identify the
- * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
- * other assumptions, including the state of rq->lock, should be made.
- */
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -3206,7 +3432,27 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
+#else
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu) { }
 #endif
+
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+	migrate_task_rq_entity(p, next_cpu);
+	task_numa_migrate(p, next_cpu);
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -3580,7 +3826,10 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_SOME_PINNED	0x04
+#define LBF_NUMA_RUN	0x08
+#define LBF_NUMA_SHARED	0x10
+#define LBF_KEEP_SHARED	0x20
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3599,6 +3848,8 @@ struct lb_env {
 	struct cpumask		*cpus;
 
 	unsigned int		flags;
+	unsigned int		failed;
+	unsigned int		iteration;
 
 	unsigned int		loop;
 	unsigned int		loop_break;
@@ -3620,11 +3871,87 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	check_preempt_curr(env->dst_rq, p, 0);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+
+static inline unsigned long task_node_faults(struct task_struct *p, int node)
+{
+	return p->numa_faults[2*node] + p->numa_faults[2*node + 1];
+}
+
+static int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+	int src_node, dst_node, node, down_node = -1;
+	unsigned long faults, src_faults, max_faults = 0;
+
+	if (!sched_feat_numa(NUMA_FAULTS_DOWN) || !p->numa_faults)
+		return 1;
+
+	src_node = cpu_to_node(env->src_cpu);
+	dst_node = cpu_to_node(env->dst_cpu);
+
+	if (src_node == dst_node)
+		return 1;
+
+	src_faults = task_node_faults(p, src_node);
+
+	for (node = 0; node < nr_node_ids; node++) {
+		if (node == src_node)
+			continue;
+
+		faults = task_node_faults(p, node);
+
+		if (faults > max_faults && faults <= src_faults) {
+			max_faults = faults;
+			down_node = node;
+		}
+	}
+
+	if (down_node == dst_node)
+		return 1; /* move towards the next node down */
+
+	return 0; /* stay here */
+}
+
+static int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+	unsigned long src_faults, dst_faults;
+	int src_node, dst_node;
+
+	if (!sched_feat_numa(NUMA_FAULTS_UP) || !p->numa_faults)
+		return 0; /* can't say it improved */
+
+	src_node = cpu_to_node(env->src_cpu);
+	dst_node = cpu_to_node(env->dst_cpu);
+
+	if (src_node == dst_node)
+		return 0; /* pointless, don't do that */
+
+	src_faults = task_node_faults(p, src_node);
+	dst_faults = task_node_faults(p, dst_node);
+
+	if (dst_faults > src_faults)
+		return 1; /* move to dst */
+
+	return 0; /* stay where we are */
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+	return 0;
+}
+
+static inline int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+	return 0;
+}
+#endif
+
 /*
  * Is this task likely cache-hot:
  */
 static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
@@ -3647,80 +3974,153 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
-	delta = now - p->se.exec_start;
+	delta = env->src_rq->clock_task - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
 /*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ * We do not migrate tasks that cannot be migrated to this CPU
+ * due to cpus_allowed.
+ *
+ * NOTE: this function has env-> side effects, to help the balancing
+ *       of pinned tasks.
  */
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static bool can_migrate_pinned_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot = 0;
+	int new_dst_cpu;
+
+	if (cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
+		return true;
+
+	schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
 	/*
-	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * Remember if this task can be migrated to any other cpu in
+	 * our sched_group. We may want to revisit it if we couldn't
+	 * meet load balance goals by pulling other tasks on src_cpu.
+	 *
+	 * Also avoid computing new_dst_cpu if we have already computed
+	 * one in current iteration.
 	 */
-	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
-		int new_dst_cpu;
-
-		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+	if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+		return false;
 
-		/*
-		 * Remember if this task can be migrated to any other cpu in
-		 * our sched_group. We may want to revisit it if we couldn't
-		 * meet load balance goals by pulling other tasks on src_cpu.
-		 *
-		 * Also avoid computing new_dst_cpu if we have already computed
-		 * one in current iteration.
-		 */
-		if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
-			return 0;
-
-		new_dst_cpu = cpumask_first_and(env->dst_grpmask,
-						tsk_cpus_allowed(p));
-		if (new_dst_cpu < nr_cpu_ids) {
-			env->flags |= LBF_SOME_PINNED;
-			env->new_dst_cpu = new_dst_cpu;
-		}
-		return 0;
+	new_dst_cpu = cpumask_first_and(env->dst_grpmask, tsk_cpus_allowed(p));
+	if (new_dst_cpu < nr_cpu_ids) {
+		env->flags |= LBF_SOME_PINNED;
+		env->new_dst_cpu = new_dst_cpu;
 	}
+	return false;
+}
 
-	/* Record that we found atleast one task that could run on dst_cpu */
-	env->flags &= ~LBF_ALL_PINNED;
+/*
+ * We cannot (easily) migrate tasks that are currently running:
+ */
+static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
+{
+	if (!task_running(env->src_rq, p))
+		return true;
 
-	if (task_running(env->src_rq, p)) {
-		schedstat_inc(p, se.statistics.nr_failed_migrations_running);
-		return 0;
-	}
+	schedstat_inc(p, se.statistics.nr_failed_migrations_running);
+	return false;
+}
 
+/*
+ * Can we migrate a NUMA task? The rules are rather involved:
+ */
+static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
+{
 	/*
-	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * iteration:
+	 *   0		   -- only allow improvement, or !numa
+	 *   1		   -- + worsen !ideal
+	 *   2                         priv
+	 *   3                         shared (everything)
+	 *
+	 * NUMA_HOT_DOWN:
+	 *   1 .. nodes    -- allow getting worse by step
+	 *   nodes+1	   -- punt, everything goes!
+	 *
+	 * LBF_NUMA_RUN    -- numa only, only allow improvement
+	 * LBF_NUMA_SHARED -- shared only
+	 *
+	 * LBF_KEEP_SHARED -- do not touch shared tasks
 	 */
 
-	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
-	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
-		if (tsk_cache_hot) {
-			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
-			schedstat_inc(p, se.statistics.nr_forced_migrations);
-		}
-#endif
-		return 1;
+	/* a numa run can only move numa tasks about to improve things */
+	if (env->flags & LBF_NUMA_RUN) {
+		if (task_numa_shared(p) < 0)
+			return false;
+		/* can only pull shared tasks */
+		if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
+			return false;
+	} else {
+		if (task_numa_shared(p) < 0)
+			goto try_migrate;
 	}
 
-	if (tsk_cache_hot) {
-		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
-		return 0;
-	}
-	return 1;
+	/* can not move shared tasks */
+	if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
+		return false;
+
+	if (task_faults_up(p, env))
+		return true; /* memory locality beats cache hotness */
+
+	if (env->iteration < 1)
+		return false;
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
+		goto demote;
+#endif
+
+	if (env->iteration < 2)
+		return false;
+
+	if (task_numa_shared(p) == 0) /* private */
+		goto demote;
+
+	if (env->iteration < 3)
+		return false;
+
+demote:
+	if (env->iteration < 5)
+		return task_faults_down(p, env);
+
+try_migrate:
+	if (env->failed > env->sd->cache_nice_tries)
+		return true;
+
+	return !task_hot(p, env);
+}
+
+/*
+ * can_migrate_task() - may task p from runqueue rq be migrated to this_cpu?
+ */
+static int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+	if (!can_migrate_pinned_task(p, env))
+		return false;
+
+	/* Record that we found at least one task that could run on dst_cpu */
+	env->flags &= ~LBF_ALL_PINNED;
+
+	if (!can_migrate_running_task(p, env))
+		return false;
+
+	if (env->sd->flags & SD_NUMA)
+		return can_migrate_numa_task(p, env);
+
+	if (env->failed > env->sd->cache_nice_tries)
+		return true;
+
+	if (!task_hot(p, env))
+		return true;
+
+	schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
+
+	return false;
 }
 
 /*
@@ -3735,6 +4135,7 @@ static int move_one_task(struct lb_env *env)
 	struct task_struct *p, *n;
 
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3742,6 +4143,7 @@ static int move_one_task(struct lb_env *env)
 			continue;
 
 		move_task(p, env);
+
 		/*
 		 * Right now, this is only the second place move_task()
 		 * is called, so we can safely collect move_task()
@@ -3753,8 +4155,6 @@ static int move_one_task(struct lb_env *env)
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
@@ -3766,7 +4166,6 @@ static const unsigned int sched_nr_migrate_break = 32;
  */
 static int move_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
 	unsigned long load;
 	int pulled = 0;
@@ -3774,8 +4173,8 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
+	while (!list_empty(&env->src_rq->cfs_tasks)) {
+		p = list_first_entry(&env->src_rq->cfs_tasks, struct task_struct, se.group_node);
 
 		env->loop++;
 		/* We've more or less seen every task there is, call it quits */
@@ -3786,7 +4185,7 @@ static int move_tasks(struct lb_env *env)
 		if (env->loop > env->loop_break) {
 			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
-			break;
+			goto out;
 		}
 
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3794,7 +4193,7 @@ static int move_tasks(struct lb_env *env)
 
 		load = task_h_load(p);
 
-		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
+		if (sched_feat(LB_MIN) && load < 16 && !env->failed)
 			goto next;
 
 		if ((load / 2) > env->imbalance)
@@ -3814,7 +4213,7 @@ static int move_tasks(struct lb_env *env)
 		 * the critical section.
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+			goto out;
 #endif
 
 		/*
@@ -3822,13 +4221,13 @@ static int move_tasks(struct lb_env *env)
 		 * weighted load.
 		 */
 		if (env->imbalance <= 0)
-			break;
+			goto out;
 
 		continue;
 next:
-		list_move_tail(&p->se.group_node, tasks);
+		list_move_tail(&p->se.group_node, &env->src_rq->cfs_tasks);
 	}
-
+out:
 	/*
 	 * Right now, this is one of only two places move_task() is called,
 	 * so we can safely collect move_task() stats here rather than
@@ -3953,17 +4352,18 @@ static inline void update_blocked_averages(int cpu)
 static inline void update_h_load(long cpu)
 {
 }
-
+#ifdef CONFIG_SMP
 static unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.load.weight;
 }
 #endif
+#endif
 
 /********** Helpers for find_busiest_group ************************/
 /*
  * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
+ *		during load balancing.
  */
 struct sd_lb_stats {
 	struct sched_group *busiest; /* Busiest group in this sd */
@@ -3976,7 +4376,7 @@ struct sd_lb_stats {
 	unsigned long this_load;
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
+	unsigned int  this_has_capacity;
 	unsigned int  this_idle_cpus;
 
 	/* Statistics of the busiest group */
@@ -3985,10 +4385,28 @@ struct sd_lb_stats {
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
+	unsigned int  busiest_has_capacity;
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long this_numa_running;
+	unsigned long this_numa_weight;
+	unsigned long this_shared_running;
+	unsigned long this_ideal_running;
+	unsigned long this_group_capacity;
+
+	struct sched_group *numa;
+	unsigned long numa_load;
+	unsigned long numa_nr_running;
+	unsigned long numa_numa_running;
+	unsigned long numa_shared_running;
+	unsigned long numa_ideal_running;
+	unsigned long numa_numa_weight;
+	unsigned long numa_group_capacity;
+	unsigned int  numa_has_capacity;
+#endif
 };
 
 /*
@@ -4004,6 +4422,13 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long sum_ideal_running;
+	unsigned long sum_numa_running;
+	unsigned long sum_numa_weight;
+#endif
+	unsigned long sum_shared_running;	/* 0 on non-NUMA */
 };
 
 /**
@@ -4032,6 +4457,151 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+
+static inline bool pick_numa_rand(int n)
+{
+	return !(get_random_int() % n);
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+	sgs->sum_ideal_running += rq->nr_ideal_running;
+	sgs->sum_shared_running += rq->nr_shared_running;
+	sgs->sum_numa_running += rq->nr_numa_running;
+	sgs->sum_numa_weight += rq->numa_weight;
+}
+
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+			  struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+			  int local_group)
+{
+	if (!(sd->flags & SD_NUMA))
+		return;
+
+	if (local_group) {
+		sds->this_numa_running   = sgs->sum_numa_running;
+		sds->this_numa_weight    = sgs->sum_numa_weight;
+		sds->this_shared_running = sgs->sum_shared_running;
+		sds->this_ideal_running  = sgs->sum_ideal_running;
+		sds->this_group_capacity = sgs->group_capacity;
+
+	} else if (sgs->sum_numa_running - sgs->sum_ideal_running) {
+		if (!sds->numa || pick_numa_rand(sd->span_weight / sg->group_weight)) {
+			sds->numa = sg;
+			sds->numa_load		 = sgs->avg_load;
+			sds->numa_nr_running     = sgs->sum_nr_running;
+			sds->numa_numa_running   = sgs->sum_numa_running;
+			sds->numa_shared_running = sgs->sum_shared_running;
+			sds->numa_ideal_running  = sgs->sum_ideal_running;
+			sds->numa_numa_weight    = sgs->sum_numa_weight;
+			sds->numa_has_capacity	 = sgs->group_has_capacity;
+			sds->numa_group_capacity = sgs->group_capacity;
+		}
+	}
+}
+
+static struct rq *
+find_busiest_numa_queue(struct lb_env *env, struct sched_group *sg)
+{
+	struct rq *rq, *busiest = NULL;
+	int cpu;
+
+	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
+		rq = cpu_rq(cpu);
+
+		if (!rq->nr_numa_running)
+			continue;
+
+		if (!(rq->nr_numa_running - rq->nr_ideal_running))
+			continue;
+
+		if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+			continue;
+
+		if (!busiest || pick_numa_rand(sg->group_weight))
+			busiest = rq;
+	}
+
+	return busiest;
+}
+
+static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	/*
+	 * if we're overloaded; don't pull when:
+	 *   - the other guy isn't
+	 *   - imbalance would become too great
+	 */
+	if (!sds->this_has_capacity) {
+		if (sds->numa_has_capacity)
+			return false;
+	}
+
+	/*
+	 * pull if we got easy trade
+	 */
+	if (sds->this_nr_running - sds->this_numa_running)
+		return true;
+
+	/*
+	 * If we got capacity allow stacking up on shared tasks.
+	 */
+	if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+		env->flags |= LBF_NUMA_SHARED;
+		return true;
+	}
+
+	/*
+	 * pull if we could possibly trade
+	 */
+	if (sds->this_numa_running - sds->this_ideal_running)
+		return true;
+
+	return false;
+}
+
+/*
+ * Introduce some controlled imbalance to perturb the state so that it can
+ * improve. This should be tightly controlled/co-ordinated with
+ * can_migrate_task().
+ */
+static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	if (!sds->numa || !sds->numa_numa_running)
+		return 0;
+
+	if (!can_do_numa_run(env, sds))
+		return 0;
+
+	env->flags |= LBF_NUMA_RUN;
+	env->flags &= ~LBF_KEEP_SHARED;
+	env->imbalance = sds->numa_numa_weight / sds->numa_numa_running;
+	sds->busiest = sds->numa;
+	env->find_busiest_queue = find_busiest_numa_queue;
+
+	return 1;
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+			  struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+			  int local_group)
+{
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	return 0;
+}
+#endif
+
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -4245,6 +4815,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+
+		update_sg_numa_stats(sgs, rq);
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4336,6 +4909,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	return false;
 }
 
+static void update_src_keep_shared(struct lb_env *env, bool keep_shared)
+{
+	env->flags &= ~LBF_KEEP_SHARED;
+	if (keep_shared)
+		env->flags |= LBF_KEEP_SHARED;
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -4368,6 +4948,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += sg->sgp->power;
 
+#ifdef CONFIG_NUMA_BALANCING
 		/*
 		 * In case the child domain prefers tasks go to siblings
 		 * first, lower the sg capacity to one so that we'll try
@@ -4378,8 +4959,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
-			sgs.group_capacity = min(sgs.group_capacity, 1UL);
+		if (0 && prefer_sibling && !local_group && sds->this_has_capacity) {
+			sgs.group_capacity = clamp_val(sgs.sum_shared_running,
+					1UL, sgs.group_capacity);
+		}
+#endif
 
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
@@ -4398,8 +4982,13 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->busiest_has_capacity = sgs.group_has_capacity;
 			sds->busiest_group_weight = sgs.group_weight;
 			sds->group_imb = sgs.group_imb;
+
+			update_src_keep_shared(env,
+				sgs.sum_shared_running <= sgs.group_capacity);
 		}
 
+		update_sd_numa_stats(env->sd, sg, sds, &sgs, local_group);
+
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4652,14 +5241,14 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 * don't try and pull any tasks.
 	 */
 	if (sds.this_load >= sds.max_load)
-		goto out_balanced;
+		goto out_imbalanced;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
 	if (sds.this_load >= sds.avg_load)
-		goto out_balanced;
+		goto out_imbalanced;
 
 	if (env->idle == CPU_IDLE) {
 		/*
@@ -4685,9 +5274,18 @@ force_balance:
 	calculate_imbalance(env, &sds);
 	return sds.busiest;
 
+out_imbalanced:
+	/* if we've got capacity allow for secondary placement preference */
+	if (!sds.this_has_capacity)
+		goto ret;
+
 out_balanced:
+	if (check_numa_busiest_group(env, &sds))
+		return sds.busiest;
+
 ret:
 	env->imbalance = 0;
+
 	return NULL;
 }
 
@@ -4723,6 +5321,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
 			continue;
 
+		if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+			continue;
+
 		/*
 		 * For the load comparisons with the other cpu's, consider
 		 * the weighted_cpuload() scaled with the cpu power, so that
@@ -4749,25 +5350,40 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 /* Working cpumask for load_balance and load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
 
-static int need_active_balance(struct lb_env *env)
-{
-	struct sched_domain *sd = env->sd;
-
-	if (env->idle == CPU_NEWLY_IDLE) {
+static int active_load_balance_cpu_stop(void *data);
 
+static void update_sd_failed(struct lb_env *env, int ld_moved)
+{
+	if (!ld_moved) {
+		schedstat_inc(env->sd, lb_failed[env->idle]);
 		/*
-		 * ASYM_PACKING needs to force migrate tasks from busy but
-		 * higher numbered CPUs in order to pack all tasks in the
-		 * lowest numbered CPUs.
+		 * Increment the failure counter only on periodic balance.
+		 * We do not want newidle balance, which can be very
+		 * frequent, pollute the failure counter causing
+		 * excessive cache_hot migrations and active balances.
 		 */
-		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
-			return 1;
-	}
-
-	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
+		if (env->idle != CPU_NEWLY_IDLE && !(env->flags & LBF_NUMA_RUN))
+			env->sd->nr_balance_failed++;
+	} else
+		env->sd->nr_balance_failed = 0;
 }
 
-static int active_load_balance_cpu_stop(void *data);
+/*
+ * See can_migrate_numa_task()
+ */
+static int lb_max_iteration(struct lb_env *env)
+{
+	if (!(env->sd->flags & SD_NUMA))
+		return 0;
+
+	if (env->flags & LBF_NUMA_RUN)
+		return 0; /* NUMA_RUN may only improve */
+
+	if (sched_feat_numa(NUMA_FAULTS_DOWN))
+		return 5; /* nodes^2 would suck */
+
+	return 3;
+}
 
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
@@ -4793,6 +5409,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.loop_break	    = sched_nr_migrate_break,
 		.cpus		    = cpus,
 		.find_busiest_queue = find_busiest_queue,
+		.failed             = sd->nr_balance_failed,
+		.iteration          = 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4816,6 +5434,8 @@ redo:
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
 	}
+	env.src_rq  = busiest;
+	env.src_cpu = busiest->cpu;
 
 	BUG_ON(busiest == env.dst_rq);
 
@@ -4895,92 +5515,72 @@ more_balance:
 		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
-		if (unlikely(env.flags & LBF_ALL_PINNED)) {
-			cpumask_clear_cpu(cpu_of(busiest), cpus);
-			if (!cpumask_empty(cpus)) {
-				env.loop = 0;
-				env.loop_break = sched_nr_migrate_break;
-				goto redo;
-			}
-			goto out_balanced;
+		if (unlikely(env.flags & LBF_ALL_PINNED))
+			goto out_pinned;
+
+		if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+			env.iteration++;
+			env.loop = 0;
+			goto more_balance;
 		}
 	}
 
-	if (!ld_moved) {
-		schedstat_inc(sd, lb_failed[idle]);
+	if (!ld_moved && idle != CPU_NEWLY_IDLE) {
+		raw_spin_lock_irqsave(&busiest->lock, flags);
+
 		/*
-		 * Increment the failure counter only on periodic balance.
-		 * We do not want newidle balance, which can be very
-		 * frequent, pollute the failure counter causing
-		 * excessive cache_hot migrations and active balances.
+		 * Don't kick the active_load_balance_cpu_stop,
+		 * if the curr task on busiest cpu can't be
+		 * moved to this_cpu
 		 */
-		if (idle != CPU_NEWLY_IDLE)
-			sd->nr_balance_failed++;
-
-		if (need_active_balance(&env)) {
-			raw_spin_lock_irqsave(&busiest->lock, flags);
-
-			/* don't kick the active_load_balance_cpu_stop,
-			 * if the curr task on busiest cpu can't be
-			 * moved to this_cpu
-			 */
-			if (!cpumask_test_cpu(this_cpu,
-					tsk_cpus_allowed(busiest->curr))) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
-							    flags);
-				env.flags |= LBF_ALL_PINNED;
-				goto out_one_pinned;
-			}
-
-			/*
-			 * ->active_balance synchronizes accesses to
-			 * ->active_balance_work.  Once set, it's cleared
-			 * only after active load balance is finished.
-			 */
-			if (!busiest->active_balance) {
-				busiest->active_balance = 1;
-				busiest->push_cpu = this_cpu;
-				active_balance = 1;
-			}
+		if (!cpumask_test_cpu(this_cpu, tsk_cpus_allowed(busiest->curr))) {
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
-
-			if (active_balance) {
-				stop_one_cpu_nowait(cpu_of(busiest),
-					active_load_balance_cpu_stop, busiest,
-					&busiest->active_balance_work);
-			}
-
-			/*
-			 * We've kicked active balancing, reset the failure
-			 * counter.
-			 */
-			sd->nr_balance_failed = sd->cache_nice_tries+1;
+			env.flags |= LBF_ALL_PINNED;
+			goto out_pinned;
 		}
-	} else
-		sd->nr_balance_failed = 0;
 
-	if (likely(!active_balance)) {
-		/* We were unbalanced, so reset the balancing interval */
-		sd->balance_interval = sd->min_interval;
-	} else {
 		/*
-		 * If we've begun active balancing, start to back off. This
-		 * case may not be covered by the all_pinned logic if there
-		 * is only 1 task on the busy runqueue (because we don't call
-		 * move_tasks).
+		 * ->active_balance synchronizes accesses to
+		 * ->active_balance_work.  Once set, it's cleared
+		 * only after active load balance is finished.
 		 */
-		if (sd->balance_interval < sd->max_interval)
-			sd->balance_interval *= 2;
+		if (!busiest->active_balance) {
+			busiest->active_balance	= 1;
+			busiest->ab_dst_cpu	= this_cpu;
+			busiest->ab_flags	= env.flags;
+			busiest->ab_failed	= env.failed;
+			busiest->ab_idle	= env.idle;
+			active_balance		= 1;
+		}
+		raw_spin_unlock_irqrestore(&busiest->lock, flags);
+
+		if (active_balance) {
+			stop_one_cpu_nowait(cpu_of(busiest),
+					active_load_balance_cpu_stop, busiest,
+					&busiest->ab_work);
+		}
 	}
 
-	goto out;
+	if (!active_balance)
+		update_sd_failed(&env, ld_moved);
+
+	sd->balance_interval = sd->min_interval;
+out:
+	return ld_moved;
+
+out_pinned:
+	cpumask_clear_cpu(cpu_of(busiest), cpus);
+	if (!cpumask_empty(cpus)) {
+		env.loop = 0;
+		env.loop_break = sched_nr_migrate_break;
+		goto redo;
+	}
 
 out_balanced:
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;
 
-out_one_pinned:
 	/* tune up the balancing interval */
 	if (((env.flags & LBF_ALL_PINNED) &&
 			sd->balance_interval < MAX_PINNED_INTERVAL) ||
@@ -4988,8 +5588,8 @@ out_one_pinned:
 		sd->balance_interval *= 2;
 
 	ld_moved = 0;
-out:
-	return ld_moved;
+
+	goto out;
 }
 
 /*
@@ -5060,7 +5660,7 @@ static int active_load_balance_cpu_stop(void *data)
 {
 	struct rq *busiest_rq = data;
 	int busiest_cpu = cpu_of(busiest_rq);
-	int target_cpu = busiest_rq->push_cpu;
+	int target_cpu = busiest_rq->ab_dst_cpu;
 	struct rq *target_rq = cpu_rq(target_cpu);
 	struct sched_domain *sd;
 
@@ -5098,17 +5698,23 @@ static int active_load_balance_cpu_stop(void *data)
 			.sd		= sd,
 			.dst_cpu	= target_cpu,
 			.dst_rq		= target_rq,
-			.src_cpu	= busiest_rq->cpu,
+			.src_cpu	= busiest_cpu,
 			.src_rq		= busiest_rq,
-			.idle		= CPU_IDLE,
+			.flags		= busiest_rq->ab_flags,
+			.failed		= busiest_rq->ab_failed,
+			.idle		= busiest_rq->ab_idle,
 		};
+		env.iteration = lb_max_iteration(&env);
 
 		schedstat_inc(sd, alb_count);
 
-		if (move_one_task(&env))
+		if (move_one_task(&env)) {
 			schedstat_inc(sd, alb_pushed);
-		else
+			update_sd_failed(&env, 1);
+		} else {
 			schedstat_inc(sd, alb_failed);
+			update_sd_failed(&env, 0);
+		}
 	}
 	rcu_read_unlock();
 	double_unlock_balance(busiest_rq, target_rq);
@@ -5508,6 +6114,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	}
 
 	update_rq_runnable_avg(rq, 1);
+
+	if (sched_feat_numa(NUMA) && nr_node_ids > 1)
+		task_tick_numa(rq, curr);
 }
 
 /*
@@ -5902,9 +6511,7 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e68e69a..a432eb8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,3 +66,11 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_NUMA_BALANCING
+/* Do the working set probing faults: */
+SCHED_FEAT(NUMA,             true)
+SCHED_FEAT(NUMA_FAULTS_UP,   true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_SETTLE,      true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5eca173..bb9475c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/slab.h>
 
 #include "cpupri.h"
 
@@ -420,17 +421,29 @@ struct rq {
 	unsigned long cpu_power;
 
 	unsigned char idle_balance;
-	/* For active balancing */
 	int post_schedule;
+
+	/* For active balancing */
 	int active_balance;
-	int push_cpu;
-	struct cpu_stop_work active_balance_work;
+	int ab_dst_cpu;
+	int ab_flags;
+	int ab_failed;
+	int ab_idle;
+	struct cpu_stop_work ab_work;
+
 	/* cpu of this runqueue: */
 	int cpu;
 	int online;
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_weight;
+	unsigned long nr_numa_running;
+	unsigned long nr_ideal_running;
+#endif
+	unsigned long nr_shared_running;	/* 0 on non-NUMA */
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -501,6 +514,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
@@ -544,6 +569,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);
 
 extern int group_balance_cpu(struct sched_group *sg);
 
@@ -663,6 +689,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_NUMA_BALANCING
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 19/33] sched: Add adaptive NUMA affinity support
  2012-11-22 22:49 ` [PATCH 19/33] sched: Add adaptive NUMA affinity support Ingo Molnar
@ 2012-11-26 20:32   ` Sasha Levin
  0 siblings, 0 replies; 55+ messages in thread
From: Sasha Levin @ 2012-11-26 20:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Dave Jones

Hi all,

On 11/22/2012 05:49 PM, Ingo Molnar wrote:
> +static void task_numa_placement(struct task_struct *p)
> +{
> +	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);

I was fuzzing with trinity on my fake numa setup, and discovered that this can
be called for task_structs with p->mm == NULL, which would cause things like:

[ 1140.001957] BUG: unable to handle kernel NULL pointer dereference at 00000000000006d0
[ 1140.010037] IP: [<ffffffff81157627>] task_numa_placement+0x27/0x1a0
[ 1140.015020] PGD 9b002067 PUD 9fb3c067 PMD 14a89067 PTE 5a4098bf040
[ 1140.015020] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1140.015020] Dumping ftrace buffer:
[ 1140.015020]    (ftrace buffer empty)
[ 1140.015020] CPU 1
[ 1140.015020] Pid: 3179, comm: ksmd Tainted: G        W    3.7.0-rc6-next-20121126-sasha-00015-gb04382b-dirty #200
[ 1140.015020] RIP: 0010:[<ffffffff81157627>]  [<ffffffff81157627>] task_numa_placement+0x27/0x1a0
[ 1140.015020] RSP: 0018:ffff8800bfae5b08  EFLAGS: 00010292
[ 1140.015020] RAX: 0000000000000000 RBX: ffff8800bfaeb000 RCX: 0000000000000001
[ 1140.015020] RDX: ffff880007c00000 RSI: 000000000000000e RDI: ffff8800bfaeb000
[ 1140.015020] RBP: ffff8800bfae5b38 R08: ffff8800bf805e00 R09: ffff880000369000
[ 1140.015020] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000e
[ 1140.015020] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000064
[ 1140.015020] FS:  0000000000000000(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
[ 1140.015020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1140.015020] CR2: 00000000000006d0 CR3: 0000000097b18000 CR4: 00000000000406e0
[ 1140.015020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1140.015020] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1140.015020] Process ksmd (pid: 3179, threadinfo ffff8800bfae4000, task ffff8800bfaeb000)
[ 1140.015020] Stack:
[ 1140.015020]  0000000000000000 0000000000000000 000000000000000e ffff8800bfaeb000
[ 1140.015020]  000000000000000e 0000000000000004 ffff8800bfae5b88 ffffffff8115a577
[ 1140.015020]  ffff8800bfae5b68 ffffffff00000001 ffff88000c1d0068 ffffea0000ec1000
[ 1140.015020] Call Trace:
[ 1140.015020]  [<ffffffff8115a577>] task_numa_fault+0xb7/0xd0
[ 1140.015020]  [<ffffffff81230d96>] do_numa_page.isra.42+0x1b6/0x270
[ 1140.015020]  [<ffffffff8126fe08>] ? mem_cgroup_count_vm_event+0x178/0x1a0
[ 1140.015020]  [<ffffffff812333f4>] handle_pte_fault+0x174/0x220
[ 1140.015020]  [<ffffffff819e7ad9>] ? __const_udelay+0x29/0x30
[ 1140.015020]  [<ffffffff81234780>] handle_mm_fault+0x320/0x350
[ 1140.015020]  [<ffffffff81256845>] break_ksm+0x65/0xc0
[ 1140.015020]  [<ffffffff81256b4d>] break_cow+0x5d/0x80
[ 1140.015020]  [<ffffffff81258442>] cmp_and_merge_page+0x122/0x1e0
[ 1140.015020]  [<ffffffff81258565>] ksm_do_scan+0x65/0xa0
[ 1140.015020]  [<ffffffff8125860f>] ksm_scan_thread+0x6f/0x2d0
[ 1140.015020]  [<ffffffff8113b990>] ? abort_exclusive_wait+0xb0/0xb0
[ 1140.015020]  [<ffffffff812585a0>] ? ksm_do_scan+0xa0/0xa0
[ 1140.015020]  [<ffffffff8113a723>] kthread+0xe3/0xf0
[ 1140.015020]  [<ffffffff8113a640>] ? __kthread_bind+0x40/0x40
[ 1140.015020]  [<ffffffff83c8813c>] ret_from_fork+0x7c/0xb0
[ 1140.015020]  [<ffffffff8113a640>] ? __kthread_bind+0x40/0x40
[ 1140.015020] Code: 00 00 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb 48 83 ec 18 48 c7 45 d0 00 00 00 00 48 8b 87 a0 04 00 00 48
c7 45 d8 00 00 00 00 <8b> 80 d0 06 00 00 39 87 d4 15 00 00 0f 84 57 01 00 00 89 87 d4
[ 1140.015020] RIP  [<ffffffff81157627>] task_numa_placement+0x27/0x1a0
[ 1140.015020]  RSP <ffff8800bfae5b08>
[ 1140.015020] CR2: 00000000000006d0
[ 1140.660568] ---[ end trace 9f1fd31243556513 ]---
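
The faulting instruction in the Code: line (<8b> 80 d0 06 00 00, a load
from 0x6d0(%rax)), together with RAX = 0 and CR2 = 0x6d0, is consistent
with p->mm being NULL for the ksmd kernel thread. A minimal userspace
sketch of a guard for that case follows. The struct definitions are mock
stand-ins with only the fields used here, the function name
task_numa_placement_sketch() is hypothetical, and this is not necessarily
how the issue was fixed in the tree:

```c
#include <stddef.h>

/* Mock stand-ins for the kernel structures; only the fields used here. */
struct mm_struct {
	int numa_scan_seq;
};

struct task_struct {
	struct mm_struct *mm;	/* NULL for kernel threads such as ksmd */
	int numa_scan_seq;
};

/*
 * Hypothetical guard: returns 1 if placement was re-evaluated,
 * 0 if the call was skipped.
 */
int task_numa_placement_sketch(struct task_struct *p)
{
	int seq;

	/* A task without an mm has no working set to place. */
	if (!p->mm)
		return 0;

	seq = p->mm->numa_scan_seq;

	/* Nothing to do until a new scan pass has completed. */
	if (p->numa_scan_seq == seq)
		return 0;

	p->numa_scan_seq = seq;
	return 1;
}
```

With such a check, a fault taken by a kernel thread on behalf of another
mm (as in the break_ksm() path above) would be skipped rather than
dereferencing a NULL pointer.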

In exchange for this bug report, I have a couple of questions about this NUMA code which I wasn't
able to answer myself :)

 - In this case, would it mean that KSM may run on one node, but scan the memory of a different node?
 - If yes, we should migrate KSM to each node we scan, right? Or possibly start a dedicated KSM
thread for each NUMA node?
 - Is there a class of per-numa threads in the works?


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (18 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 19/33] sched: Add adaptive NUMA affinity support Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
                   ` (14 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - it creates performance problems for tasks with very
   large working sets.

 - it over-samples processes with large address spaces
   that only very rarely execute.

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (proportional to the CPU
time the task has executed). If the offset reaches the last
mapped address of the mm then it starts over at the first
address.
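
The rotating offset can be sketched in userspace as follows. This is a
simplified model, not the kernel code: struct range stands in for a VMA,
struct scan_state and scan_pass() are hypothetical names, and the
HPAGE_SIZE alignment and vma_migratable() check of the real loop are
left out:

```c
#include <stddef.h>

struct range {
	unsigned long start, end;	/* [start, end), sorted by address */
};

struct scan_state {
	unsigned long offset;		/* next address to scan */
	int seq;			/* bumped on every wrap-around */
};

/* Scan up to 'length' bytes from the saved offset; returns bytes covered. */
unsigned long scan_pass(struct scan_state *st, const struct range *map,
			size_t n, unsigned long length)
{
	unsigned long offset = st->offset, scanned = 0;
	size_t i = 0;

	/* Like find_vma(): the first range that ends above the offset. */
	while (i < n && map[i].end <= offset)
		i++;

	/* Ran past the last mapped address: start over at the first one. */
	if (i == n) {
		st->seq++;
		offset = 0;
		i = 0;
	}

	for (; i < n && scanned < length; i++) {
		unsigned long end;

		/* Skip any hole in front of this range. */
		if (offset < map[i].start)
			offset = map[i].start;

		/* Clamp the window to the end of this range. */
		end = offset + (length - scanned);
		if (end > map[i].end)
			end = map[i].end;

		scanned += end - offset;
		offset = end;
	}

	st->offset = offset;
	return scanned;
}
```

Each call covers at most 'length' bytes of mapped address space and
leaves the offset where the next call should resume, so unmapped holes
do not count against the budget.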

The per-task nature of the working set sampling functionality
in this tree allows such constant rate, per task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.
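
As a worked example of these numbers, assuming the defaults this patch
sets (a scan size of 256 MB and a scan period between 100 and 1600
msecs), the per-task sampling rate is bounded between 2560 MB and 160 MB
per second of executed CPU time. The helper name below is hypothetical:

```c
/*
 * MB scanned per second of executed CPU time, for a given scan size
 * and scan period. Returns 0 for a zero period to avoid dividing by 0.
 */
unsigned long scan_rate_mb_per_sec(unsigned long scan_size_mb,
				   unsigned long period_ms)
{
	if (!period_ms)
		return 0;

	return scan_size_mb * 1000UL / period_ms;
}
```

scan_rate_mb_per_sec(256, 100) gives the upper bound and
scan_rate_mb_per_sec(256, 1600) the lower bound.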

As tasks mature and their working set converges, the
sampling rate slows down to just a trickle: 256 MB per 1.6
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing tasks and does not over-sample on overloaded
systems.

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-wt5b48o2226ec63784i58s3j@git.kernel.org
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm_types.h |  1 +
 include/linux/sched.h    |  1 +
 kernel/sched/fair.c      | 41 +++++++++++++++++++++++++++++------------
 kernel/sysctl.c          |  7 +++++++
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 48760e9..5995652 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned long numa_next_scan;
+	unsigned long numa_scan_offset;
 	int numa_scan_seq;
 #endif
 	struct uprobes_state uprobes_state;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb12cc3..3372aac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2047,6 +2047,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
 extern unsigned int sysctl_sched_numa_scan_period_min;
 extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
 extern unsigned int sysctl_sched_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3aeaac..151a3cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -825,8 +825,9 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 /*
  * numa task sample period in ms: 5s
  */
-unsigned int sysctl_sched_numa_scan_period_min = 5000;
-unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+unsigned int sysctl_sched_numa_scan_period_min = 100;
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256;   /* MB */
 
 /*
  * Wait for the 2-sample stuff to settle before migrating again
@@ -912,6 +913,9 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -938,18 +942,31 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
+	offset = mm->numa_scan_offset;
+	length = sysctl_sched_numa_scan_size;
+	length <<= 20;
 
-		down_write(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_prot_numa(vma, vma->vm_start, vma->vm_end);
-		}
-		up_write(&mm->mmap_sem);
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		ACCESS_ONCE(mm->numa_scan_seq)++;
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_numa(vma, offset, end);
+
+		offset = end;
 	}
+	mm->numa_scan_offset = offset;
+	up_write(&mm->mmap_sem);
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7736b9e..a14b8a4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_numa_scan_size_mb",
+		.data		= &sysctl_sched_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_numa_settle_count",
 		.data		= &sysctl_sched_numa_settle_count,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.11.7


* [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (19 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 22/33] sched: Implement slow start for working set sampling Ingo Molnar
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.
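
As a rough illustration of the accounting change (a userspace toy, not the
kernel code), the sketch below charges the scan budget only for pages that
are actually present. The names scan_present(), present[] and budget are
hypothetical and exist only for this example; the real code relies on the
return value of change_prot_numa().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: present[i] says whether page i of a range is mapped.
 * Walks forward from 'start' and returns the index one past the last
 * page visited, stopping once 'budget' present pages have been
 * accounted or the range is exhausted.
 * Hypothetical helper, not a kernel function.
 */
static size_t scan_present(const bool *present, size_t npages,
			   size_t start, long budget)
{
	size_t i = start;

	assert(start <= npages);

	while (i < npages && budget > 0) {
		if (present[i])
			budget--;	/* only mapped pages cost budget */
		i++;
	}
	return i;
}
```

With a sparse mapping the scan therefore covers more virtual address space
per pass than a pure range-based budget would, while touching about the
same number of real pages.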

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 151a3cd..da28315 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,8 +914,8 @@ void task_numa_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
-	unsigned long offset, end;
-	long length;
+	unsigned long start, end;
+	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -942,30 +942,35 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	offset = mm->numa_scan_offset;
-	length = sysctl_sched_numa_scan_size;
-	length <<= 20;
+	start = mm->numa_scan_offset;
+	pages = sysctl_sched_numa_scan_size;
+	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages)
+		return;
 
 	down_write(&mm->mmap_sem);
-	vma = find_vma(mm, offset);
+	vma = find_vma(mm, start);
 	if (!vma) {
 		ACCESS_ONCE(mm->numa_scan_seq)++;
-		offset = 0;
+		start = 0;
 		vma = mm->mmap;
 	}
-	for (; vma && length > 0; vma = vma->vm_next) {
+	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma))
 			continue;
 
-		offset = max(offset, vma->vm_start);
-		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
-		length -= end - offset;
-
-		change_prot_numa(vma, offset, end);
-
-		offset = end;
+		do {
+			start = max(start, vma->vm_start);
+			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+			end = min(end, vma->vm_end);
+			pages -= change_prot_numa(vma, start, end);
+			start = end;
+			if (pages <= 0)
+				goto out;
+		} while (end != vma->vm_end);
 	}
-	mm->numa_scan_offset = offset;
+out:
+	mm->numa_scan_offset = start;
 	up_write(&mm->mmap_sem);
 }
 
-- 
1.7.11.7


* [PATCH 22/33] sched: Implement slow start for working set sampling
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (20 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 23/33] sched, numa, mm: Interleave shared tasks Ingo Molnar
                   ` (12 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ Note that before the constant per-task WSS sampling rate patch
  the initial scan would happen much later still; in effect, that
  patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they are better off sticking to the
node they were started on. As tasks mature and rebalance to other CPUs
and nodes, their NUMA placement has to change too, and it starts to
matter more and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed                                          ( +-  0.40% )
   45.154231752 seconds time elapsed                                          ( +-  0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed                                          ( +-  0.30% )
   46.343168745 seconds time elapsed                                          ( +-  0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed                                          ( +-  0.25% )
   45.160866532 seconds time elapsed                                          ( +-  0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
   +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
   +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
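
The slow-start idea can be sketched in userspace roughly as follows. This
is a simplified model under stated assumptions: a single millisecond clock
is used (the kernel compares task runtime against the period), and
struct toy_task, toy_fork() and toy_tick() are hypothetical names. The
1000 ms and 100 ms values mirror the defaults in this patch.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	SCAN_DELAY_MS      = 1000,	/* initial delay before the first scan */
	SCAN_PERIOD_MIN_MS = 100,	/* regular period after the first scan */
};

struct toy_task {
	unsigned long long stamp_ms;	/* when the current period started */
	unsigned int period_ms;		/* current scan period */
};

/* A new task starts out with the long initial delay as its period. */
static void toy_fork(struct toy_task *t, unsigned long long now_ms)
{
	t->stamp_ms = now_ms;
	t->period_ms = SCAN_DELAY_MS;
}

/* Returns true when a scan should be queued at time now_ms. */
static bool toy_tick(struct toy_task *t, unsigned long long now_ms)
{
	assert(now_ms >= t->stamp_ms);

	if (now_ms - t->stamp_ms <= t->period_ms)
		return false;

	t->stamp_ms += t->period_ms;		/* advance by one full period */
	t->period_ms = SCAN_PERIOD_MIN_MS;	/* regular rate from now on */
	return true;
}
```

A task that exits within the first second is never scanned in this model,
which is the behaviour the kbuild and hackbench numbers above depend on.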

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-vn7p3ynbwqt3qqewhdlvjltc@git.kernel.org
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  2 +-
 kernel/sched/fair.c   | 16 ++++++++++------
 kernel/sysctl.c       |  7 +++++++
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3372aac..8f65323 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2045,6 +2045,7 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_sched_numa_scan_delay;
 extern unsigned int sysctl_sched_numa_scan_period_min;
 extern unsigned int sysctl_sched_numa_scan_period_max;
 extern unsigned int sysctl_sched_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b58366..af0602f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1556,7 +1556,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = 2;
 	p->numa_faults = NULL;
-	p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+	p->numa_scan_period = sysctl_sched_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da28315..8f0e6ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,11 +823,12 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 }
 
 /*
- * numa task sample period in ms: 5s
+ * Scan @scan_size MB every @scan_period after an initial @scan_delay.
  */
-unsigned int sysctl_sched_numa_scan_period_min = 100;
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;
-unsigned int sysctl_sched_numa_scan_size = 256;   /* MB */
+unsigned int sysctl_sched_numa_scan_delay = 1000;	/* ms */
+unsigned int sysctl_sched_numa_scan_period_min = 100;	/* ms */
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
+unsigned int sysctl_sched_numa_scan_size = 256;		/* MB */
 
 /*
  * Wait for the 2-sample stuff to settle before migrating again
@@ -938,10 +939,12 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+	next_scan = now + msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
+	current->numa_scan_period += jiffies_to_msecs(2);
+
 	start = mm->numa_scan_offset;
 	pages = sysctl_sched_numa_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
@@ -998,7 +1001,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
 	if (now - curr->node_stamp > period) {
-		curr->node_stamp = now;
+		curr->node_stamp += period;
+		curr->numa_scan_period = sysctl_sched_numa_scan_period_min;
 
 		/*
 		 * We are comparing runtime to wall clock time here, which
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a14b8a4..6d2fe5b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
+		.procname	= "sched_numa_scan_delay_ms",
+		.data		= &sysctl_sched_numa_scan_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_numa_scan_period_min_ms",
 		.data		= &sysctl_sched_numa_scan_period_min,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.11.7


* [PATCH 23/33] sched, numa, mm: Interleave shared tasks
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (21 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 22/33] sched: Implement slow start for working set sampling Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 24/33] sched: Implement NUMA scanning backoff Ingo Molnar
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Interleave tasks that are 'shared' - i.e. whose memory access patterns
indicate that they are intensively sharing memory with other tasks.

If such a task ends up converging then it switches back into the lazy
node-local policy.

Build-Bug-Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mempolicy.c | 56 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 318043a..02890f2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -111,12 +111,30 @@ enum zone_type policy_zone = 0;
 /*
  * run-time system-wide default policy => local allocation
  */
-static struct mempolicy default_policy = {
-	.refcnt = ATOMIC_INIT(1), /* never free it */
-	.mode = MPOL_PREFERRED,
-	.flags = MPOL_F_LOCAL,
+
+static struct mempolicy default_policy_local = {
+	.refcnt		= ATOMIC_INIT(1), /* never free it */
+	.mode		= MPOL_PREFERRED,
+	.flags		= MPOL_F_LOCAL,
+};
+
+/*
+ * .v.nodes is set by numa_policy_init():
+ */
+static struct mempolicy default_policy_shared = {
+	.refcnt			= ATOMIC_INIT(1), /* never free it */
+	.mode			= MPOL_INTERLEAVE,
+	.flags			= 0,
 };
 
+static struct mempolicy *default_policy(void)
+{
+	if (task_numa_shared(current) == 1)
+		return &default_policy_shared;
+
+	return &default_policy_local;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -789,7 +807,7 @@ out:
 static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 {
 	nodes_clear(*nodes);
-	if (p == &default_policy)
+	if (p == default_policy())
 		return;
 
 	switch (p->mode) {
@@ -864,7 +882,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		return -EINVAL;
 
 	if (!pol)
-		pol = &default_policy;	/* indicates default behavior */
+		pol = default_policy();	/* indicates default behavior */
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
@@ -880,7 +898,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			goto out;
 		}
 	} else {
-		*policy = pol == &default_policy ? MPOL_DEFAULT :
+		*policy = pol == default_policy() ? MPOL_DEFAULT :
 						pol->mode;
 		/*
 		 * Internal mempolicy flags must be masked off before exposing
@@ -1568,7 +1586,7 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 		}
 	}
 	if (!pol)
-		pol = &default_policy;
+		pol = default_policy();
 	return pol;
 }
 
@@ -1974,7 +1992,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	unsigned int cpuset_mems_cookie;
 
 	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
-		pol = &default_policy;
+		pol = default_policy();
 
 retry_cpuset:
 	cpuset_mems_cookie = get_mems_allowed();
@@ -2255,7 +2273,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	int best_nid = -1, page_nid;
 	int cpu_last_access, this_cpu;
 	struct mempolicy *pol;
-	unsigned long pgoff;
 	struct zone *zone;
 
 	BUG_ON(!vma);
@@ -2271,13 +2288,22 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
+	{
+		int shift;
+
 		BUG_ON(addr >= vma->vm_end);
 		BUG_ON(addr < vma->vm_start);
 
-		pgoff = vma->vm_pgoff;
-		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
-		best_nid = offset_il_node(pol, vma, pgoff);
+#ifdef CONFIG_HUGETLB_PAGE
+		if (transparent_hugepage_enabled(vma) || vma->vm_flags & VM_HUGETLB)
+			shift = HPAGE_SHIFT;
+		else
+#endif
+			shift = PAGE_SHIFT;
+
+		best_nid = interleave_nid(pol, vma, addr, shift);
 		break;
+	}
 
 	case MPOL_PREFERRED:
 		if (pol->flags & MPOL_F_LOCAL)
@@ -2492,6 +2518,8 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	default_policy_shared.v.nodes = node_online_map;
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
@@ -2712,7 +2740,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
 	 */
 	VM_BUG_ON(maxlen < strlen("interleave") + strlen("relative") + 16);
 
-	if (!pol || pol == &default_policy)
+	if (!pol || pol == default_policy())
 		mode = MPOL_DEFAULT;
 	else
 		mode = pol->mode;
-- 
1.7.11.7


* [PATCH 24/33] sched: Implement NUMA scanning backoff
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (22 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 23/33] sched, numa, mm: Interleave shared tasks Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 25/33] sched: Improve convergence Ingo Molnar
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Back off slowly from scanning, up to sysctl_sched_numa_scan_period_max
(1.6 seconds). Scan faster again if we were forced to switch to
another node.

This makes sure that workloads in equilibrium don't get scanned as often
as workloads that are still converging.
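
A minimal sketch of the policy described above, assuming the default
100 ms minimum and 1600 ms maximum periods. It models the behaviour stated
in this changelog rather than the diff line by line, and
scan_period_update() and its switched_node argument are hypothetical names:

```c
#include <assert.h>

enum {
	PERIOD_MIN_MS = 100,
	PERIOD_MAX_MS = 1600,
};

/* Double the period, clamped to the maximum. */
static unsigned int scan_period_backoff(unsigned int period_ms)
{
	unsigned int doubled = period_ms * 2;

	return doubled < PERIOD_MAX_MS ? doubled : PERIOD_MAX_MS;
}

/*
 * A task that had to switch nodes goes back to the fastest scan
 * rate; a task that stayed put scans less and less often.
 */
static unsigned int scan_period_update(unsigned int period_ms,
					int switched_node)
{
	assert(period_ms >= PERIOD_MIN_MS && period_ms <= PERIOD_MAX_MS);

	if (switched_node)
		return PERIOD_MIN_MS;

	return scan_period_backoff(period_ms);
}
```

Starting from 100 ms, a settled task reaches the 1.6 second ceiling after
four consecutive back-off steps (200, 400, 800, 1600).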

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 6 ++++++
 kernel/sched/fair.c | 8 +++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af0602f..ec3cc74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6024,6 +6024,12 @@ void sched_setnuma(struct task_struct *p, int node, int shared)
 	if (on_rq)
 		enqueue_task(rq, p, 0);
 	task_rq_unlock(rq, p, &flags);
+
+	/*
+	 * Reset the scanning period. If the task converges
+	 * on this node then we'll back off again:
+	 */
+	p->numa_scan_period = sysctl_sched_numa_scan_period_min;
 }
 
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f0e6ba..59fea2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -865,8 +865,10 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	if (max_node != p->numa_max_node)
+	if (max_node != p->numa_max_node) {
 		sched_setnuma(p, max_node, task_numa_shared(p));
+		goto out_backoff;
+	}
 
 	p->numa_migrate_seq++;
 	if (sched_feat(NUMA_SETTLE) &&
@@ -882,7 +884,11 @@ static void task_numa_placement(struct task_struct *p)
 	if (shared != task_numa_shared(p)) {
 		sched_setnuma(p, p->numa_max_node, shared);
 		p->numa_migrate_seq = 0;
+		goto out_backoff;
 	}
+	return;
+out_backoff:
+	p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
 }
 
 /*
-- 
1.7.11.7


* [PATCH 25/33] sched: Improve convergence
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (23 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 24/33] sched: Implement NUMA scanning backoff Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 26/33] sched: Introduce staged average NUMA faults Ingo Molnar
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

 - break out of can_do_numa_run() earlier if we can make no progress
 - don't flip between siblings that often
 - turn on bidirectional fault balancing
 - improve the flow in task_numa_work()

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 46 ++++++++++++++++++++++++++++++++--------------
 kernel/sched/features.h |  2 +-
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59fea2e..9c46b45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -917,12 +917,12 @@ void task_numa_fault(int node, int last_cpu, int pages)
  */
 void task_numa_work(struct callback_head *work)
 {
+	long pages_total, pages_left, pages_changed;
 	unsigned long migrate, next_scan, now = jiffies;
+	unsigned long start0, start, end;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
-	unsigned long start, end;
-	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -951,35 +951,42 @@ void task_numa_work(struct callback_head *work)
 
 	current->numa_scan_period += jiffies_to_msecs(2);
 
-	start = mm->numa_scan_offset;
-	pages = sysctl_sched_numa_scan_size;
-	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
-	if (!pages)
+	start0 = start = end = mm->numa_scan_offset;
+	pages_total = sysctl_sched_numa_scan_size;
+	pages_total <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages_total)
 		return;
 
+	pages_left	= pages_total;
+
 	down_write(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (!vma) {
 		ACCESS_ONCE(mm->numa_scan_seq)++;
-		start = 0;
-		vma = mm->mmap;
+		end = 0;
+		vma = find_vma(mm, end);
 	}
 	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma))
 			continue;
 
 		do {
-			start = max(start, vma->vm_start);
-			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+			start = max(end, vma->vm_start);
+			end = ALIGN(start + (pages_left << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
-			start = end;
-			if (pages <= 0)
+			pages_changed = change_prot_numa(vma, start, end);
+
+			WARN_ON_ONCE(pages_changed > pages_total);
+			BUG_ON(pages_changed < 0);
+
+			pages_left -= pages_changed;
+			if (pages_left <= 0)
 				goto out;
 		} while (end != vma->vm_end);
 	}
 out:
-	mm->numa_scan_offset = start;
+	mm->numa_scan_offset = end;
+
 	up_write(&mm->mmap_sem);
 }
 
@@ -3306,6 +3313,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int i;
 
 	/*
+	 * For NUMA tasks constant, reliable placement is more important
+	 * than flipping tasks between siblings:
+	 */
+	if (task_numa_shared(p) >= 0)
+		return target;
+
+	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
@@ -4581,6 +4595,10 @@ static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
 	 * If we got capacity allow stacking up on shared tasks.
 	 */
 	if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+		/* There's no point in trying to move if all are here already: */
+		if (sds->numa_shared_running == sds->this_shared_running)
+			return false;
+
 		env->flags |= LBF_NUMA_SHARED;
 		return true;
 	}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a432eb8..b75a10d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -71,6 +71,6 @@ SCHED_FEAT(LB_MIN, false)
 /* Do the working set probing faults: */
 SCHED_FEAT(NUMA,             true)
 SCHED_FEAT(NUMA_FAULTS_UP,   true)
-SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_FAULTS_DOWN, true)
 SCHED_FEAT(NUMA_SETTLE,      true)
 #endif
-- 
1.7.11.7


* [PATCH 26/33] sched: Introduce staged average NUMA faults
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (24 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 25/33] sched: Improve convergence Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 27/33] sched: Track groups of shared tasks Ingo Molnar
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

The current way of building the p->numa_faults[2][node] faults
statistics has a sampling artifact:

The continuous and immediate nature of propagating new fault
stats to the numa_faults array creates a 'pulsating' dynamic,
that starts at the average value at the beginning of the scan,
increases monotonically until we finish the scan to about twice
the average, and then drops back to half of its value due to
the running average.

Since we rely on these values to balance tasks, the pulsating
nature resulted in false migrations and general noise in the
stats.

To solve this, introduce buffering of the current scan via
p->numa_faults_curr[]. The array is co-allocated with the
p->numa_faults[] array for efficiency reasons, but it is otherwise an
ordinary separate array.

At the end of the scan we propagate the latest stats into the
average stats value. Most of the balancing code stays unmodified.

The cost of this change is that we delay the effects of the latest
round of faults by 1 scan - but using the partial faults info was
creating artifacts.

This instantly stabilized the page fault stats and improved
numa02-alike workloads by making them faster to converge.
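
The staging scheme can be illustrated with a small userspace model. The
struct and function names below (toy_faults, toy_fault(), toy_scan_done())
and the fixed slot count are assumptions made for the example; only the
"add the buffered value, then halve" step is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_SLOTS 4	/* stands in for 2 * nr_node_ids */

struct toy_faults {
	unsigned long avg[TOY_NR_SLOTS];	/* running average, read by balancing */
	unsigned long curr[TOY_NR_SLOTS];	/* faults of the scan in progress */
};

/* Fault path: only the staging buffer is touched. */
static void toy_fault(struct toy_faults *f, int idx, unsigned long pages)
{
	assert(idx >= 0 && idx < TOY_NR_SLOTS);
	f->curr[idx] += pages;
}

/* End of scan: fold the buffered counts into the average in one step. */
static void toy_scan_done(struct toy_faults *f)
{
	for (size_t i = 0; i < TOY_NR_SLOTS; i++) {
		f->avg[i] = (f->avg[i] + f->curr[i]) / 2;
		f->curr[i] = 0;
	}
}
```

Between two toy_scan_done() calls the avg[] values do not move, so a
reader never sees a partially accumulated scan.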

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 20 +++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f65323..92b41b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1511,6 +1511,7 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	unsigned long numa_weight;
 	unsigned long *numa_faults;
+	unsigned long *numa_faults_curr;
 	struct callback_head numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c46b45..1ab11be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -852,12 +852,26 @@ static void task_numa_placement(struct task_struct *p)
 
 	p->numa_scan_seq = seq;
 
+	/*
+	 * Update the fault average with the result of the latest
+	 * scan:
+	 */
 	for (node = 0; node < nr_node_ids; node++) {
 		faults = 0;
 		for (priv = 0; priv < 2; priv++) {
-			faults += p->numa_faults[2*node + priv];
-			total[priv] += p->numa_faults[2*node + priv];
-			p->numa_faults[2*node + priv] /= 2;
+			unsigned int new_faults;
+			unsigned int idx;
+
+			idx = 2*node + priv;
+			new_faults = p->numa_faults_curr[idx];
+			p->numa_faults_curr[idx] = 0;
+
+			/* Keep a simple running average: */
+			p->numa_faults[idx] += new_faults;
+			p->numa_faults[idx] /= 2;
+
+			faults += p->numa_faults[idx];
+			total[priv] += p->numa_faults[idx];
 		}
 		if (faults > max_faults) {
 			max_faults = faults;
-- 
1.7.11.7


* [PATCH 27/33] sched: Track groups of shared tasks
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (25 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 26/33] sched: Introduce staged average NUMA faults Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

To be able to cluster memory-related tasks more efficiently, introduce
a new metric that tracks the 'best' buddy task.

Track our "memory buddies": the tasks we actively share memory with.

Firstly we establish the identity of some other task that we are
sharing memory with by looking at rq[page::last_cpu].curr - i.e.
we check the task that is running on that CPU right now.

This is not entirely correct, as this task might have been scheduled in
or migrated there since - but statistically there will be correlation to
the tasks that we share memory with, and correlation is all we need.

We map out the relation itself by picking the highest-address task
that is below our own task address, per working set scan
iteration.

This creates a natural ordering relation between groups of tasks:

    t1 < t2 < t3 < t4

    t1->memory_buddy == NULL
    t2->memory_buddy == t1
    t3->memory_buddy == t2
    t4->memory_buddy == t3

The load-balancer can then use this information to speed up NUMA
convergence, by moving such tasks together if capacity and load
constraints allow it.

(This is all statistical so there are no preemption or locking worries.)
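
As an illustration of the ordering rule above, here is a minimal
userspace sketch (not the kernel code). 'struct task',
note_shared_fault() and the field names are hypothetical stand-ins for
the task_struct fields added below; the sketch only shows how "highest
address below our own" selects one candidate buddy per scan and how
fault counts accumulate against it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct task_struct: only what the sketch needs. */
struct task {
	struct task *buddy_curr;		/* candidate buddy for this scan */
	unsigned long buddy_faults_curr;	/* shared faults counted against it */
};

/*
 * Account one shared fault of 'pages' pages that 'self' took on memory
 * last touched by 'other'.
 *
 * Only a task at a lower address than 'self' can become the candidate,
 * and a candidate is only replaced by one at a higher address. Over a
 * scan this converges on the highest-address task below 'self'.
 */
static void note_shared_fault(struct task *self, struct task *other,
			      unsigned long pages)
{
	assert(self != NULL);

	if (other == NULL)
		return;

	if (other != self->buddy_curr) {
		/* Order buddies by address: ignore tasks at or above us. */
		if ((uintptr_t)other >= (uintptr_t)self)
			return;

		/* Keep the current candidate if it sits at a higher address. */
		if ((uintptr_t)self->buddy_curr > (uintptr_t)other)
			return;

		/* New candidate: restart its fault count. */
		self->buddy_curr = other;
		self->buddy_faults_curr = 0;
	}

	self->buddy_faults_curr += pages;
}
```

With four tasks laid out in an array (so t[0] < t[1] < t[2] < t[3] by
address), faults taken by t[2] against t[0] and then t[1] end with t[1]
as the candidate, while faults against t[3] are ignored.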

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |   5 ++
 kernel/sched/core.c   |   5 ++
 kernel/sched/fair.c   | 144 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 92b41b4..be73297 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1513,6 +1513,11 @@ struct task_struct {
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_work;
+
+	struct task_struct *shared_buddy, *shared_buddy_curr;
+	unsigned long shared_buddy_faults, shared_buddy_faults_curr;
+	int ideal_cpu, ideal_cpu_curr;
+
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec3cc74..39cf991 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,6 +1558,11 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
+
+	p->shared_buddy = NULL;
+	p->shared_buddy_faults = 0;
+	p->ideal_cpu = -1;
+	p->ideal_cpu_curr = -1;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ab11be..67f7fd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,6 +840,43 @@ static void task_numa_migrate(struct task_struct *p, int next_cpu)
 	p->numa_migrate_seq = 0;
 }
 
+/*
+ * Called for every full scan - here we consider switching to a new
+ * shared buddy, if the one we found during this scan is good enough:
+ */
+static void shared_fault_full_scan_done(struct task_struct *p)
+{
+	/*
+	 * If we have a new maximum rate buddy task then pick it
+	 * as our new best friend:
+	 */
+	if (p->shared_buddy_faults_curr > p->shared_buddy_faults) {
+		WARN_ON_ONCE(!p->shared_buddy_curr);
+		p->shared_buddy			= p->shared_buddy_curr;
+		p->shared_buddy_faults		= p->shared_buddy_faults_curr;
+		p->ideal_cpu			= p->ideal_cpu_curr;
+
+		goto clear_buddy;
+	}
+	/*
+	 * If the new buddy is lower rate than the previous average
+	 * fault rate then don't switch buddies yet but lower the average by
+	 * averaging in the new rate, with a 1/4 weight.
+	 *
+	 * Eventually, if the current buddy is not a buddy anymore
+	 * then we'll switch away from it: a higher rate buddy will
+	 * replace it.
+	 */
+	p->shared_buddy_faults *= 3;
+	p->shared_buddy_faults += p->shared_buddy_faults_curr;
+	p->shared_buddy_faults /= 4;
+
+clear_buddy:
+	p->shared_buddy_curr		= NULL;
+	p->shared_buddy_faults_curr	= 0;
+	p->ideal_cpu_curr		= -1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -852,6 +889,8 @@ static void task_numa_placement(struct task_struct *p)
 
 	p->numa_scan_seq = seq;
 
+	shared_fault_full_scan_done(p);
+
 	/*
 	 * Update the fault average with the result of the latest
 	 * scan:
@@ -906,23 +945,122 @@ out_backoff:
 }
 
 /*
+ * Track our "memory buddies": the tasks we actively share memory with.
+ *
+ * Firstly we establish the identity of some other task that we are
+ * sharing memory with by looking at rq[page::last_cpu].curr - i.e.
+ * we check the task that is running on that CPU right now.
+ *
+ * This is not entirely correct, as this task might have been scheduled
+ * in or migrated there since - but statistically there will be a correlation
+ * to the tasks that we share memory with, and correlation is all we need.
+ *
+ * We map out the relation itself by filtering for the highest-address
+ * task that is below our own task address, per working set scan
+ * iteration.
+ *
+ * This creates a natural ordering relation between groups of tasks:
+ *
+ *     t1 < t2 < t3 < t4
+ *
+ *     t1->memory_buddy == NULL
+ *     t2->memory_buddy == t1
+ *     t3->memory_buddy == t2
+ *     t4->memory_buddy == t3
+ *
+ * The load-balancer can then use this information to speed up NUMA
+ * convergence, by moving such tasks together if capacity and load
+ * constraints allow it.
+ *
+ * (This is all statistical so there are no preemption or locking worries.)
+ */
+static void shared_fault_tick(struct task_struct *this_task, int node, int last_cpu, int pages)
+{
+	struct task_struct *last_task;
+	struct rq *last_rq;
+	int last_node;
+	int this_node;
+	int this_cpu;
+
+	last_node = cpu_to_node(last_cpu);
+	this_cpu  = raw_smp_processor_id();
+	this_node = cpu_to_node(this_cpu);
+
+	/* Ignore private memory access faults: */
+	if (last_cpu == this_cpu)
+		return;
+
+	/*
+	 * Ignore accesses from foreign nodes to our memory.
+	 *
+	 * Yet still recognize tasks accessing a third node - i.e. one that is
+	 * remote to both of them.
+	 */
+	if (node != this_node)
+		return;
+
+	/* We are in a shared fault - see which task we relate to: */
+	last_rq = cpu_rq(last_cpu);
+	last_task = last_rq->curr;
+
+	/* Task might be gone from that runqueue already: */
+	if (!last_task || last_task == last_rq->idle)
+		return;
+
+	if (last_task == this_task->shared_buddy_curr)
+		goto out_hit;
+
+	/* Order our memory buddies by address: */
+	if (last_task >= this_task)
+		return;
+
+	if (this_task->shared_buddy_curr > last_task)
+		return;
+
+	/* New shared buddy! */
+	this_task->shared_buddy_curr = last_task;
+	this_task->shared_buddy_faults_curr = 0;
+	this_task->ideal_cpu_curr = last_rq->cpu;
+
+out_hit:
+	/*
+	 * Give threads that we share a process with an advantage,
+	 * but don't stop the discovery of process level sharing
+	 * either:
+	 */
+	if (this_task->mm == last_task->mm)
+		pages *= 2;
+
+	this_task->shared_buddy_faults_curr += pages;
+}
+
+/*
  * Got a PROT_NONE fault for a page on @node.
  */
 void task_numa_fault(int node, int last_cpu, int pages)
 {
 	struct task_struct *p = current;
 	int priv = (task_cpu(p) == last_cpu);
+	int idx = 2*node + priv;
 
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+		int entries = 2*nr_node_ids;
+		int size = sizeof(*p->numa_faults) * entries;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		p->numa_faults = kzalloc(2*size, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+		/*
+		 * For efficiency reasons we allocate ->numa_faults[]
+		 * and ->numa_faults_curr[] at once and split the
+		 * buffer we get. They are separate otherwise.
+		 */
+		p->numa_faults_curr = p->numa_faults + entries;
 	}
 
+	p->numa_faults_curr[idx] += pages;
+	shared_fault_tick(p, node, last_cpu, pages);
 	task_numa_placement(p);
-	p->numa_faults[2*node + priv] += pages;
 }
 
 /*
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (26 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 27/33] sched: Track groups of shared tasks Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy Ingo Molnar
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Now that we have a notion of (one of the) best CPUs we interrelate
with in terms of memory usage, use that information to improve
can_migrate_task() balancing decisions: allow the migration to
occur even if we are locally cache-hot, if we are on another node
and want to migrate towards our best buddy's node.

( Note that this is not hard affinity - if imbalance persists long
  enough then the scheduler will eventually balance tasks anyway,
  to maximize CPU utilization. )
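
The decision described above can be sketched in userspace roughly as
follows. This is an illustration, not the kernel code:
ideal_cpu_verdict(), cpu_to_node_sketch() and the fixed
4-CPUs-per-node topology are hypothetical. Only the 1000-fault
threshold and the "same node as the ideal CPU" comparison are taken
from the can_migrate_task() hunk below.

```c
#include <assert.h>

/* From the patch: the buddy relation must exceed 1000 shared faults. */
#define BUDDY_FAULTS_THRESHOLD	1000UL

/* Hypothetical topology for the sketch: 4 CPUs per node. */
static int cpu_to_node_sketch(int cpu)
{
	assert(cpu >= 0);
	return cpu / 4;
}

/*
 * Returns  1: allow the migration (destination is on the ideal node),
 *          0: refuse it (destination is on some other node),
 *         -1: no opinion, fall through to the regular checks
 *             (no ideal CPU known, or the buddy relation is too weak).
 */
static int ideal_cpu_verdict(int ideal_cpu, unsigned long buddy_faults,
			     int dst_cpu)
{
	if (ideal_cpu < 0 || buddy_faults <= BUDDY_FAULTS_THRESHOLD)
		return -1;

	return cpu_to_node_sketch(ideal_cpu) == cpu_to_node_sketch(dst_cpu);
}
```

For example, a task whose ideal CPU is 5 (node 1 in this sketch) with
2000 buddy faults may be pulled to CPU 6 but not to CPU 1; with only
500 buddy faults the sketch expresses no preference either way.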

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 35 ++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |  2 ++
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 67f7fd2..24a5588 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,6 +840,14 @@ static void task_numa_migrate(struct task_struct *p, int next_cpu)
 	p->numa_migrate_seq = 0;
 }
 
+static int task_ideal_cpu(struct task_struct *p)
+{
+	if (!sched_feat(IDEAL_CPU))
+		return -1;
+
+	return p->ideal_cpu;
+}
+
 /*
  * Called for every full scan - here we consider switching to a new
  * shared buddy, if the one we found during this scan is good enough:
@@ -1028,7 +1036,7 @@ out_hit:
 	 * but don't stop the discovery of process level sharing
 	 * either:
 	 */
-	if (this_task->mm == last_task->mm)
+	if (sched_feat(IDEAL_CPU_THREAD_BIAS) && this_task->mm == last_task->mm)
 		pages *= 2;
 
 	this_task->shared_buddy_faults_curr += pages;
@@ -1189,6 +1197,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 }
 #else /* !CONFIG_NUMA_BALANCING: */
 #ifdef CONFIG_SMP
+static inline int task_ideal_cpu(struct task_struct *p)				{ return -1; }
 static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)	{ }
 #endif
 static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)	{ }
@@ -4064,6 +4073,7 @@ struct lb_env {
 static void move_task(struct task_struct *p, struct lb_env *env)
 {
 	deactivate_task(env->src_rq, p, 0);
+
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
@@ -4242,15 +4252,17 @@ static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
 	 *
 	 * LBF_NUMA_RUN    -- numa only, only allow improvement
 	 * LBF_NUMA_SHARED -- shared only
+	 * LBF_NUMA_IDEAL  -- ideal only
 	 *
 	 * LBF_KEEP_SHARED -- do not touch shared tasks
 	 */
 
 	/* a numa run can only move numa tasks about to improve things */
 	if (env->flags & LBF_NUMA_RUN) {
-		if (task_numa_shared(p) < 0)
+		if (task_numa_shared(p) < 0 && task_ideal_cpu(p) < 0)
 			return false;
-		/* can only pull shared tasks */
+
+		/* If we are only allowed to pull shared tasks: */
 		if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
 			return false;
 	} else {
@@ -4307,6 +4319,23 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (!can_migrate_running_task(p, env))
 		return false;
 
+#ifdef CONFIG_NUMA_BALANCING
+	/* If we are only allowed to pull ideal tasks: */
+	if ((task_ideal_cpu(p) >= 0) && (p->shared_buddy_faults > 1000)) {
+		int ideal_node;
+		int dst_node;
+
+		BUG_ON(env->dst_cpu < 0);
+
+		ideal_node = cpu_to_node(p->ideal_cpu);
+		dst_node = cpu_to_node(env->dst_cpu);
+
+		if (ideal_node == dst_node)
+			return true;
+		return false;
+	}
+#endif
+
 	if (env->sd->flags & SD_NUMA)
 		return can_migrate_numa_task(p, env);
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index b75a10d..737d2c8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,6 +66,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(IDEAL_CPU,			true)
+SCHED_FEAT(IDEAL_CPU_THREAD_BIAS,	false)
 
 #ifdef CONFIG_NUMA_BALANCING
 /* Do the working set probing faults: */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (27 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 30/33] sched: Average the fault stats longer Ingo Molnar
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

We are going to make use of it in the NUMA code: each thread will
converge not just to a group of related tasks, but to a specific
group of memory nodes as well.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mempolicy.h | 39 +--------------------------------------
 include/linux/mm_types.h  | 40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h     |  3 ++-
 kernel/sched/core.c       |  7 +++++++
 mm/mempolicy.c            | 16 +++-------------
 5 files changed, 53 insertions(+), 52 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index c511e25..f44b7f3 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -6,11 +6,11 @@
 #define _LINUX_MEMPOLICY_H 1
 
 
+#include <linux/mm_types.h>
 #include <linux/mmzone.h>
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
-#include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
 
@@ -19,43 +19,6 @@ struct mm_struct;
 #ifdef CONFIG_NUMA
 
 /*
- * Describe a memory policy.
- *
- * A mempolicy can be either associated with a process or with a VMA.
- * For VMA related allocations the VMA policy is preferred, otherwise
- * the process policy is used. Interrupts ignore the memory policy
- * of the current process.
- *
- * Locking policy for interlave:
- * In process context there is no locking because only the process accesses
- * its own state. All vma manipulation is somewhat protected by a down_read on
- * mmap_sem.
- *
- * Freeing policy:
- * Mempolicy objects are reference counted.  A mempolicy will be freed when
- * mpol_put() decrements the reference count to zero.
- *
- * Duplicating policy objects:
- * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
- * to the new storage.  The reference count of the new object is initialized
- * to 1, representing the caller of mpol_dup().
- */
-struct mempolicy {
-	atomic_t refcnt;
-	unsigned short mode; 	/* See MPOL_* above */
-	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
-	union {
-		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave/bind */
-		/* undefined for default */
-	} v;
-	union {
-		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
-		nodemask_t user_nodemask;	/* nodemask passed by user */
-	} w;
-};
-
-/*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
  */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5995652..cd2be76 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
+#include <linux/nodemask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -203,6 +204,45 @@ struct page_frag {
 
 typedef unsigned long __nocast vm_flags_t;
 
+#ifdef CONFIG_NUMA
+/*
+ * Describe a memory policy.
+ *
+ * A mempolicy can be either associated with a process or with a VMA.
+ * For VMA related allocations the VMA policy is preferred, otherwise
+ * the process policy is used. Interrupts ignore the memory policy
+ * of the current process.
+ *
+ * Locking policy for interlave:
+ * In process context there is no locking because only the process accesses
+ * its own state. All vma manipulation is somewhat protected by a down_read on
+ * mmap_sem.
+ *
+ * Freeing policy:
+ * Mempolicy objects are reference counted.  A mempolicy will be freed when
+ * mpol_put() decrements the reference count to zero.
+ *
+ * Duplicating policy objects:
+ * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
+ * to the new storage.  The reference count of the new object is initialized
+ * to 1, representing the caller of mpol_dup().
+ */
+struct mempolicy {
+	atomic_t refcnt;
+	unsigned short mode; 	/* See MPOL_* above */
+	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
+	union {
+		short 		 preferred_node; /* preferred */
+		nodemask_t	 nodes;		/* interleave/bind */
+		/* undefined for default */
+	} v;
+	union {
+		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
+		nodemask_t user_nodemask;	/* nodemask passed by user */
+	} w;
+};
+#endif
+
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
  * conditions.  These are held in a global tree and are pinned by the VMAs that
diff --git a/include/linux/sched.h b/include/linux/sched.h
index be73297..696492e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1516,7 +1516,8 @@ struct task_struct {
 
 	struct task_struct *shared_buddy, *shared_buddy_curr;
 	unsigned long shared_buddy_faults, shared_buddy_faults_curr;
-	int ideal_cpu, ideal_cpu_curr;
+	int ideal_cpu, ideal_cpu_curr, ideal_cpu_candidate;
+	struct mempolicy numa_policy;
 
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39cf991..794efa0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <uapi/linux/mempolicy.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -1563,6 +1564,12 @@ static void __sched_fork(struct task_struct *p)
 	p->shared_buddy_faults = 0;
 	p->ideal_cpu = -1;
 	p->ideal_cpu_curr = -1;
+	atomic_set(&p->numa_policy.refcnt, 1);
+	p->numa_policy.mode = MPOL_INTERLEAVE;
+	p->numa_policy.flags = 0;
+	p->numa_policy.v.preferred_node = 0;
+	p->numa_policy.v.nodes = node_online_map;
+
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02890f2..d71a93d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,20 +118,12 @@ static struct mempolicy default_policy_local = {
 	.flags		= MPOL_F_LOCAL,
 };
 
-/*
- * .v.nodes is set by numa_policy_init():
- */
-static struct mempolicy default_policy_shared = {
-	.refcnt			= ATOMIC_INIT(1), /* never free it */
-	.mode			= MPOL_INTERLEAVE,
-	.flags			= 0,
-};
-
 static struct mempolicy *default_policy(void)
 {
+#ifdef CONFIG_NUMA_BALANCING
 	if (task_numa_shared(current) == 1)
-		return &default_policy_shared;
-
+		return &current->numa_policy;
+#endif
 	return &default_policy_local;
 }
 
@@ -2518,8 +2510,6 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
-	default_policy_shared.v.nodes = node_online_map;
-
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 30/33] sched: Average the fault stats longer
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (28 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 31/33] sched: Use the ideal CPU to drive active balancing Ingo Molnar
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

We will rely on the per CPU fault statistics and its
shared/private derivative even more in the future, so
stabilize this metric even better.

The staged updates introduced in commit:

   sched: Introduce staged average NUMA faults

already stabilized this key metric significantly, but in
real workloads it was still reacting to temporary load
balancing transients too quickly.

Slow down by weighting the average. The weighting value was
found via experimentation.
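
To make the effect of the weighting concrete, here is a small
userspace sketch of the two update rules from the hunk below. The
helper names avg_halving() and avg_weighted() are made up for
illustration; the arithmetic is the same integer arithmetic the patch
uses.

```c
#include <assert.h>

/* Old rule: the latest scan gets a 1/2 weight. */
static unsigned long avg_halving(unsigned long avg, unsigned long new_faults)
{
	return (avg + new_faults) / 2;
}

/* New rule: the latest scan gets a 1/8 weight, history gets 7/8. */
static unsigned long avg_weighted(unsigned long avg, unsigned long new_faults)
{
	return (avg * 7 + new_faults) / 8;
}

/*
 * Count how many consecutive zero-fault scans it takes for an average
 * starting at 'start' to drop to 'target' or below under the new rule.
 */
static int scans_until_at_most(unsigned long start, unsigned long target)
{
	int scans = 0;

	assert(target < start);

	while (start > target) {
		start = avg_weighted(start, 0);
		scans++;
	}
	return scans;
}
```

Starting from an average of 800 faults, one scan with zero faults
drops the old average to 400 but the new one only to 700; the new
average needs six such scans to fall to 400 or below, so a short load
balancing transient moves the metric far less.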

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24a5588..a5f3ad7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,8 +914,8 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults_curr[idx] = 0;
 
 			/* Keep a simple running average: */
-			p->numa_faults[idx] += new_faults;
-			p->numa_faults[idx] /= 2;
+			p->numa_faults[idx] = p->numa_faults[idx]*7 + new_faults;
+			p->numa_faults[idx] /= 8;
 
 			faults += p->numa_faults[idx];
 			total[priv] += p->numa_faults[idx];
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 31/33] sched: Use the ideal CPU to drive active balancing
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (29 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 30/33] sched: Average the fault stats longer Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 32/33] sched: Add hysteresis to p->numa_shared Ingo Molnar
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Use our shared/private distinction to allow the separate
handling of 'private' versus 'shared' workloads, which enables
the active-balancing of them:

 - private tasks, via the sched_update_ideal_cpu_private() function,
   try to 'spread' across the system as evenly as possible.

 - shared-access tasks that also share their mm (threads), via the
   sched_update_ideal_cpu_shared() function, try to 'compress'
   with other shared tasks on as few nodes as possible.

   [ We'll be able to extend this grouping beyond threads in the
     future, by using the existing p->shared_buddy directed graph. ]

Introduce the sched_rebalance_to() primitive to trigger active rebalancing.

The result of this patch is 2-3 times faster convergence and
much more stable convergence points.
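
A rough userspace sketch of the 'spread' side for private tasks,
assuming a hypothetical array of per-node idle-CPU counts instead of
walking the runqueues. pick_spread_node() is a made-up name; the
"more than one extra idle CPU" margin mirrors the
max_idles <= this_idles+1 check in sched_update_ideal_cpu_private()
below, and the curr_buddy reservation is left out for brevity.

```c
#include <assert.h>

/*
 * Pick the node a private task should move to.
 *
 * idle_cpus[n] is the number of idle CPUs on node n. The task only
 * moves if the idlest node has at least two more idle CPUs than its
 * current node; otherwise it stays on this_node.
 */
static int pick_spread_node(const int *idle_cpus, int nr_nodes, int this_node)
{
	int best_node = -1;
	int max_idles = 0;

	assert(this_node >= 0 && this_node < nr_nodes);

	for (int node = 0; node < nr_nodes; node++) {
		/* Strictly greater: the first node wins ties. */
		if (idle_cpus[node] > max_idles) {
			max_idles = idle_cpus[node];
			best_node = node;
		}
	}

	/* No idle CPU anywhere, or the target is not idle enough: stay. */
	if (best_node < 0 || max_idles <= idle_cpus[this_node] + 1)
		return this_node;

	return best_node;
}
```

With idle counts { 1, 0, 5, 2 }, a task on node 0 or node 3 would be
steered to node 2; with { 2, 3 }, a task on node 0 stays put because
the gain is only one idle CPU.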

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |   1 +
 kernel/sched/core.c     |  19 ++++
 kernel/sched/fair.c     | 244 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |   7 +-
 kernel/sched/sched.h    |   1 +
 5 files changed, 257 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 696492e..8bc3a03 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2020,6 +2020,7 @@ task_sched_runtime(struct task_struct *task);
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
 extern void sched_exec(void);
+extern void sched_rebalance_to(int dest_cpu);
 #else
 #define sched_exec()   {}
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 794efa0..93f2561 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2596,6 +2596,22 @@ unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 }
 
+/*
+ * sched_rebalance_to()
+ *
+ * Active load-balance to a target CPU.
+ */
+void sched_rebalance_to(int dest_cpu)
+{
+	struct task_struct *p = current;
+	struct migration_arg arg = { p, dest_cpu };
+
+	if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+		return;
+
+	stop_one_cpu(raw_smp_processor_id(), migration_cpu_stop, &arg);
+}
+
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat);
@@ -4753,6 +4769,9 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 done:
 	ret = 1;
 fail:
+#ifdef CONFIG_NUMA_BALANCING
+	rq_dest->curr_buddy = NULL;
+#endif
 	double_rq_unlock(rq_src, rq_dest);
 	raw_spin_unlock(&p->pi_lock);
 	return ret;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5f3ad7..8aa4b36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -848,6 +848,181 @@ static int task_ideal_cpu(struct task_struct *p)
 	return p->ideal_cpu;
 }
 
+
+static int sched_update_ideal_cpu_shared(struct task_struct *p)
+{
+	int full_buddies;
+	int max_buddies;
+	int target_cpu;
+	int ideal_cpu;
+	int this_cpu;
+	int this_node;
+	int best_node;
+	int buddies;
+	int node;
+	int cpu;
+
+	if (!sched_feat(PUSH_SHARED_BUDDIES))
+		return -1;
+
+	ideal_cpu = -1;
+	best_node = -1;
+	max_buddies = 0;
+	this_cpu = task_cpu(p);
+	this_node = cpu_to_node(this_cpu);
+
+	for_each_online_node(node) {
+		full_buddies = cpumask_weight(cpumask_of_node(node));
+
+		buddies = 0;
+		target_cpu = -1;
+
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			struct task_struct *curr;
+			struct rq *rq;
+
+			WARN_ON_ONCE(cpu_to_node(cpu) != node);
+
+			rq = cpu_rq(cpu);
+
+			/*
+			 * Don't take the rq lock for scalability,
+			 * we only rely on rq->curr statistically:
+			 */
+			curr = ACCESS_ONCE(rq->curr);
+
+			if (curr == p) {
+				buddies += 1;
+				continue;
+			}
+
+			/* Pick up idle tasks immediately: */
+			if (curr == rq->idle && !rq->curr_buddy)
+				target_cpu = cpu;
+
+			/* Leave alone non-NUMA tasks: */
+			if (task_numa_shared(curr) < 0)
+				continue;
+
+			/* Private task is an easy target: */
+			if (task_numa_shared(curr) == 0) {
+				if (!rq->curr_buddy)
+					target_cpu = cpu;
+				continue;
+			}
+			if (curr->mm != p->mm) {
+				/* Leave alone different groups on their ideal node: */
+				if (cpu_to_node(curr->ideal_cpu) == node)
+					continue;
+				if (!rq->curr_buddy)
+					target_cpu = cpu;
+				continue;
+			}
+
+			buddies++;
+		}
+		WARN_ON_ONCE(buddies > full_buddies);
+
+		/* Don't go to a node that is already at full capacity: */
+		if (buddies == full_buddies)
+			continue;
+
+		if (!buddies)
+			continue;
+
+		if (buddies > max_buddies && target_cpu != -1) {
+			max_buddies = buddies;
+			best_node = node;
+			WARN_ON_ONCE(target_cpu == -1);
+			ideal_cpu = target_cpu;
+		}
+	}
+
+	WARN_ON_ONCE(best_node == -1 && ideal_cpu != -1);
+	WARN_ON_ONCE(best_node != -1 && ideal_cpu == -1);
+
+	this_node = cpu_to_node(this_cpu);
+
+	/* If we'd stay within this node then stay put: */
+	if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node)
+		ideal_cpu = this_cpu;
+
+	return ideal_cpu;
+}
+
+static int sched_update_ideal_cpu_private(struct task_struct *p)
+{
+	int full_idles;
+	int this_idles;
+	int max_idles;
+	int target_cpu;
+	int ideal_cpu;
+	int best_node;
+	int this_node;
+	int this_cpu;
+	int idles;
+	int node;
+	int cpu;
+
+	if (!sched_feat(PUSH_PRIVATE_BUDDIES))
+		return -1;
+
+	ideal_cpu = -1;
+	best_node = -1;
+	max_idles = 0;
+	this_idles = 0;
+	this_cpu = task_cpu(p);
+	this_node = cpu_to_node(this_cpu);
+
+	for_each_online_node(node) {
+		full_idles = cpumask_weight(cpumask_of_node(node));
+
+		idles = 0;
+		target_cpu = -1;
+
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			struct rq *rq;
+
+			WARN_ON_ONCE(cpu_to_node(cpu) != node);
+
+			rq = cpu_rq(cpu);
+			if (rq->curr == rq->idle) {
+				if (!rq->curr_buddy)
+					target_cpu = cpu;
+				idles++;
+			}
+		}
+		WARN_ON_ONCE(idles > full_idles);
+
+		if (node == this_node)
+			this_idles = idles;
+
+		if (!idles)
+			continue;
+
+		if (idles > max_idles && target_cpu != -1) {
+			max_idles = idles;
+			best_node = node;
+			WARN_ON_ONCE(target_cpu == -1);
+			ideal_cpu = target_cpu;
+		}
+	}
+
+	WARN_ON_ONCE(best_node == -1 && ideal_cpu != -1);
+	WARN_ON_ONCE(best_node != -1 && ideal_cpu == -1);
+
+	/* If the target is not idle enough, skip: */
+	if (max_idles <= this_idles+1)
+		ideal_cpu = this_cpu;
+		
+	/* If we'd stay within this node then stay put: */
+	if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node)
+		ideal_cpu = this_cpu;
+
+	return ideal_cpu;
+}
+
+
 /*
  * Called for every full scan - here we consider switching to a new
  * shared buddy, if the one we found during this scan is good enough:
@@ -862,7 +1037,6 @@ static void shared_fault_full_scan_done(struct task_struct *p)
 		WARN_ON_ONCE(!p->shared_buddy_curr);
 		p->shared_buddy			= p->shared_buddy_curr;
 		p->shared_buddy_faults		= p->shared_buddy_faults_curr;
-		p->ideal_cpu			= p->ideal_cpu_curr;
 
 		goto clear_buddy;
 	}
@@ -891,14 +1065,13 @@ static void task_numa_placement(struct task_struct *p)
 	unsigned long total[2] = { 0, 0 };
 	unsigned long faults, max_faults = 0;
 	int node, priv, shared, max_node = -1;
+	int this_node;
 
 	if (p->numa_scan_seq == seq)
 		return;
 
 	p->numa_scan_seq = seq;
 
-	shared_fault_full_scan_done(p);
-
 	/*
 	 * Update the fault average with the result of the latest
 	 * scan:
@@ -926,10 +1099,7 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	if (max_node != p->numa_max_node) {
-		sched_setnuma(p, max_node, task_numa_shared(p));
-		goto out_backoff;
-	}
+	shared_fault_full_scan_done(p);
 
 	p->numa_migrate_seq++;
 	if (sched_feat(NUMA_SETTLE) &&
@@ -942,14 +1112,55 @@ static void task_numa_placement(struct task_struct *p)
 	 * the impact of a little private memory accesses.
 	 */
 	shared = (total[0] >= total[1] / 2);
-	if (shared != task_numa_shared(p)) {
-		sched_setnuma(p, p->numa_max_node, shared);
+	if (shared)
+		p->ideal_cpu = sched_update_ideal_cpu_shared(p);
+	else
+		p->ideal_cpu = sched_update_ideal_cpu_private(p);
+
+	if (p->ideal_cpu >= 0) {
+		/* Filter migrations a bit - the same target twice in a row is picked: */
+		if (p->ideal_cpu == p->ideal_cpu_candidate) {
+			max_node = cpu_to_node(p->ideal_cpu);
+		} else {
+			p->ideal_cpu_candidate = p->ideal_cpu;
+			max_node = -1;
+		}
+	} else {
+		if (max_node < 0)
+			max_node = p->numa_max_node;
+	}
+
+	if (shared != task_numa_shared(p) || (max_node != -1 && max_node != p->numa_max_node)) {
+
 		p->numa_migrate_seq = 0;
-		goto out_backoff;
+		/*
+		 * Fix up node migration fault statistics artifact, as we
+		 * migrate to another node we'll soon bring over our private
+		 * working set - generating 'shared' faults as that happens.
+		 * To counter-balance this effect, move this node's private
+		 * stats to the new node.
+		 */
+		if (max_node != -1 && p->numa_max_node != -1 && max_node != p->numa_max_node) {
+			int idx_oldnode = p->numa_max_node*2 + 1;
+			int idx_newnode = max_node*2 + 1;
+
+			p->numa_faults[idx_newnode] += p->numa_faults[idx_oldnode];
+			p->numa_faults[idx_oldnode] = 0;
+		}
+		sched_setnuma(p, max_node, shared);
+	} else {
+		/* node unchanged, back off: */
+		p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
+	}
+
+	this_node = cpu_to_node(task_cpu(p));
+
+	if (max_node >= 0 && p->ideal_cpu >= 0 && max_node != this_node) {
+		struct rq *rq = cpu_rq(p->ideal_cpu);
+
+		rq->curr_buddy = p;
+		sched_rebalance_to(p->ideal_cpu);
 	}
-	return;
-out_backoff:
-	p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
 }
 
 /*
@@ -1051,6 +1262,8 @@ void task_numa_fault(int node, int last_cpu, int pages)
 	int priv = (task_cpu(p) == last_cpu);
 	int idx = 2*node + priv;
 
+	WARN_ON_ONCE(last_cpu == -1 || node == -1);
+
 	if (unlikely(!p->numa_faults)) {
 		int entries = 2*nr_node_ids;
 		int size = sizeof(*p->numa_faults) * entries;
@@ -3545,6 +3758,11 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
+#ifdef CONFIG_NUMA_BALANCING
+	if (sched_feat(WAKE_ON_IDEAL_CPU) && p->ideal_cpu >= 0)
+		return p->ideal_cpu;
+#endif
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 737d2c8..c868a66 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -68,11 +68,14 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(IDEAL_CPU,			true)
 SCHED_FEAT(IDEAL_CPU_THREAD_BIAS,	false)
+SCHED_FEAT(PUSH_PRIVATE_BUDDIES,	true)
+SCHED_FEAT(PUSH_SHARED_BUDDIES,		true)
+SCHED_FEAT(WAKE_ON_IDEAL_CPU,		false)
 
 #ifdef CONFIG_NUMA_BALANCING
 /* Do the working set probing faults: */
 SCHED_FEAT(NUMA,             true)
-SCHED_FEAT(NUMA_FAULTS_UP,   true)
-SCHED_FEAT(NUMA_FAULTS_DOWN, true)
+SCHED_FEAT(NUMA_FAULTS_UP,   false)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
 SCHED_FEAT(NUMA_SETTLE,      true)
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb9475c..810a1a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -441,6 +441,7 @@ struct rq {
 	unsigned long numa_weight;
 	unsigned long nr_numa_running;
 	unsigned long nr_ideal_running;
+	struct task_struct *curr_buddy;
 #endif
 	unsigned long nr_shared_running;	/* 0 on non-NUMA */
 
-- 
1.7.11.7
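
For readers following the task_numa_placement() hunk above, the
"same target twice in a row" filter can be sketched in isolation as
below. This is a minimal user-space illustration under stated
assumptions, not the kernel code: struct placement_state and
filter_ideal_cpu() are hypothetical names, and only the ideal_cpu and
ideal_cpu_candidate fields are taken from the patch. The fallback to
the fault-based max_node when no ideal CPU exists is left out.

```c
/* Hypothetical stand-in for the two task_struct fields used by the patch. */
struct placement_state {
	int ideal_cpu;           /* latest pick of the balancer, -1 if none */
	int ideal_cpu_candidate; /* pick remembered from an earlier scan */
};

/*
 * Returns the CPU to act on, or -1 if the pick should not be acted on yet.
 * A new target is only accepted once it matches the remembered candidate,
 * i.e. when the same CPU was picked on two scans in a row.
 */
static int filter_ideal_cpu(struct placement_state *st, int new_ideal_cpu)
{
	st->ideal_cpu = new_ideal_cpu;

	if (new_ideal_cpu < 0)
		return -1;	/* no ideal CPU this round, candidate is kept */

	if (new_ideal_cpu == st->ideal_cpu_candidate)
		return new_ideal_cpu;	/* confirmed by the previous scan */

	/* First sighting: remember it, but do not migrate yet. */
	st->ideal_cpu_candidate = new_ideal_cpu;
	return -1;
}
```

With this shape a single noisy scan cannot trigger a migration on its
own; the cost is that a genuine change of target takes one extra scan
period to take effect.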

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 32/33] sched: Add hysteresis to p->numa_shared
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (30 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 31/33] sched: Use the ideal CPU to drive active balancing Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:49 ` [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations Ingo Molnar
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Make p->numa_shared flip-flop less around unstable equilibria:
instead of switching on any small change, require a significant
move in either direction before changing between the 'dominantly
shared accesses' and 'dominantly private accesses' NUMA status.
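
As an illustration only, the thresholds used by the hunk below can be
written as a stand-alone function. numa_shared_update() is a
hypothetical name; reading total[0] as the shared fault count and
total[1] as the private fault count is an assumption based on the
"idx = 2*node + priv" indexing in task_numa_fault().

```c
/*
 * prev: -1 = not classified yet, 0 = private, 1 = shared.
 * Returns the new classification (0 or 1).
 */
static int numa_shared_update(int prev, unsigned long shared_faults,
			      unsigned long private_faults)
{
	/* No history yet: a plain majority decides. */
	if (prev < 0)
		return shared_faults >= private_faults;

	/* Was private: shared faults must reach twice the private faults. */
	if (prev == 0)
		return shared_faults >= private_faults * 2 ? 1 : 0;

	/* Was shared: private faults must reach twice the shared faults. */
	return shared_faults * 2 <= private_faults ? 0 : 1;
}
```

Between the two thresholds (a shared:private ratio between 1:2 and
2:1) the previous state is simply kept, which is what damps the
flip-flopping. Unlike the patch, this sketch treats every positive
prev value as "shared".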

Suggested-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8aa4b36..ab4a7130 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1111,7 +1111,20 @@ static void task_numa_placement(struct task_struct *p)
 	 * we might want to consider a different equation below to reduce
 	 * the impact of a little private memory accesses.
 	 */
-	shared = (total[0] >= total[1] / 2);
+	shared = p->numa_shared;
+
+	if (shared < 0) {
+		shared = (total[0] >= total[1]);
+	} else if (shared == 0) {
+		/* If it was private before, make it harder to become shared: */
+		if (total[0] >= total[1]*2)
+			shared = 1;
+	} else if (shared == 1) {
+		/* If it was shared before, make it harder to become private: */
+		if (total[0]*2 <= total[1])
+			shared = 0;
+	}
+
 	if (shared)
 		p->ideal_cpu = sched_update_ideal_cpu_shared(p);
 	else
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (31 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 32/33] sched: Add hysteresis to p->numa_shared Ingo Molnar
@ 2012-11-22 22:49 ` Ingo Molnar
  2012-11-22 22:53 ` [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
  2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
  34 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This patch shows the power of the shared/private distinction: in
the shared tasks' active balancing function (sched_update_ideal_cpu_shared())
we are able to build, for each shared task, a mask of the nodes that
it and its buddies occupy at the moment. The task's memory allocations
are then interleaved across the nodes in that mask.

Private tasks on the other hand are not affected and continue to do
efficient node-local allocations.

There are two important special cases:

 - if a group of shared tasks fits on a single node: in this case
   the interleaving happens on a single bit, a single node, and thus
   turns into nice node-local allocations.

 - if a large group spans the whole system: in this case the node
   masks will cover the whole system, all memory gets evenly
   interleaved and the available RAM bandwidth gets utilized. This is
   preferable to allocating memory asymmetrically and overloading
   certain CPU links and running into their bandwidth limitations.
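
The two special cases can be seen in a small user-space sketch. It is
a rough model under stated assumptions, not the kernel's mempolicy
code: build_node_mask() and next_interleave_node() are hypothetical
helpers, the mask is a plain 64-bit integer rather than a nodemask_t,
and round-robin is assumed as the interleaving order.

```c
#include <stdint.h>

#define MAX_NODES 64

/* Set one bit for every node that currently hosts at least one buddy. */
static uint64_t build_node_mask(const int *buddies_per_node, int nr_nodes)
{
	uint64_t mask = 0;

	for (int node = 0; node < nr_nodes && node < MAX_NODES; node++) {
		if (buddies_per_node[node] > 0)
			mask |= (uint64_t)1 << node;
	}
	return mask;
}

/*
 * Round-robin interleave: return the next node in the mask after
 * prev_node (wrapping around), or -1 if the mask is empty.
 * prev_node may be -1 for the very first allocation;
 * nr_nodes must be between 1 and MAX_NODES.
 */
static int next_interleave_node(uint64_t mask, int prev_node, int nr_nodes)
{
	for (int step = 1; step <= nr_nodes; step++) {
		int node = (prev_node + step) % nr_nodes;

		if (mask & ((uint64_t)1 << node))
			return node;
	}
	return -1;
}
```

With a single bit set, every call returns the same node, so the
"interleaving" degenerates into node-local allocation. With all bits
set, consecutive calls walk over every node in turn and spread the
allocations evenly.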

This patch, in combination with the private/shared buddies patch,
optimizes the "4x JVM", "single JVM" and "2x JVM" SPECjbb workloads
on a 4-node system so that they get almost perfect memory placement.

For example a 4-JVM workload on a 4-node, 32-CPU system has
this performance (8 SPECjbb warehouses per JVM):

 spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
 spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
 spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
 spec4.txt:           throughput =     171383.52 SPECjbb2005 bops

Which is close to the hard binding performance figures, while
previously it would regress compared to mainline.

Mainline has the following 4x JVM performance:

 spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
 spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
 spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
 spec4.txt:           throughput =     157873.86 SPECjbb2005 bops

So the patch brings a speedup of roughly 11% on average.
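
As a sanity check on the figures above, the relative speedup can be
computed from the mean of the four per-JVM results in each table.
The sketch below is only arithmetic on the numbers quoted in this
mail; mean() and speedup_percent() are hypothetical helper names.

```c
#include <stddef.h>

/* Per-JVM throughput in SPECjbb2005 bops, copied from the tables above. */
static const double numa_core_bops[4] = {
	177460.44, 176175.08, 175053.91, 171383.52
};
static const double mainline_bops[4] = {
	157839.25, 156969.15, 157571.59, 157873.86
};

/* Arithmetic mean of n values; returns 0.0 for an empty array. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / (double)n : 0.0;
}

/*
 * Relative speedup of the new results over the old ones, in percent.
 * The old results must have a non-zero mean.
 */
static double speedup_percent(const double *new_bops,
			      const double *old_bops, size_t n)
{
	return (mean(new_bops, n) / mean(old_bops, n) - 1.0) * 100.0;
}
```

Calling speedup_percent(numa_core_bops, mainline_bops, 4) gives a
value a little above 11. Comparing individual JVMs instead of the
means gives roughly 8.5% to 12.5%, depending on which pair is picked.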

This placement idea came while discussing interleaving strategies
with Christoph Lameter.

Suggested-by: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4a7130..5cc3620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -922,6 +922,10 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
 			buddies++;
 		}
 		WARN_ON_ONCE(buddies > full_buddies);
+		if (buddies)
+			node_set(node, p->numa_policy.v.nodes);
+		else
+			node_clear(node, p->numa_policy.v.nodes);
 
 		/* Don't go to a node that is already at full capacity: */
 		if (buddies == full_buddies)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/33] Latest numa/core release, v17
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (32 preceding siblings ...)
  2012-11-22 22:49 ` [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations Ingo Molnar
@ 2012-11-22 22:53 ` Ingo Molnar
  2012-11-23  6:47   ` Zhouping Liu
  2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
  34 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-11-22 22:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins



* Ingo Molnar <mingo@kernel.org> wrote:

> This release mainly addresses one of the regressions Linus
> (rightfully) complained about: the "4x JVM" SPECjbb run.
> 
> [ Note to testers: if possible please still run with
>   CONFIG_TRANSPARENT_HUGEPAGES=y enabled, to avoid the
>   !THP regression that is still not fully fixed.
>   It will be fixed next. ]

I forgot to include the Git link:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/33] Latest numa/core release, v17
  2012-11-22 22:53 ` [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
@ 2012-11-23  6:47   ` Zhouping Liu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhouping Liu @ 2012-11-23  6:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On 11/23/2012 06:53 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>> This release mainly addresses one of the regressions Linus
>> (rightfully) complained about: the "4x JVM" SPECjbb run.
>>
>> [ Note to testers: if possible please still run with
>>    CONFIG_TRANSPARENT_HUGEPAGES=y enabled, to avoid the
>>    !THP regression that is still not fully fixed.
>>    It will be fixed next. ]
> I forgot to include the Git link:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

Hi Ingo, I tested the latest tip/master tree on a 2-node system with
8 processors, with CONFIG_TRANSPARENT_HUGEPAGE disabled,
and found some issues:

One is that the command `stress -i 20 -m 30 -v` caused some hung tasks:

------------- snip ---------------------------
[ 1726.278382] Node 0 DMA free:15880kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15332kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 1726.366825] lowmem_reserve[]: 0 1957 3973 3973
[ 1726.388610] Node 0 DMA32 free:10856kB min:2796kB low:3492kB high:4192kB active_anon:1479384kB inactive_anon:498788kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2004184kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:20kB slab_reclaimable:140kB slab_unreclaimable:96kB kernel_stack:8kB pagetables:80kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3502066 all_unreclaimable? yes
[ 1726.490163] lowmem_reserve[]: 0 0 2016 2016
[ 1726.515235] Node 0 Normal free:2880kB min:2880kB low:3600kB high:4320kB active_anon:1453776kB inactive_anon:490196kB active_file:748kB inactive_file:1140kB unevictable:3740kB isolated(anon):0kB isolated(file):0kB present:2064384kB mlocked:3492kB dirty:0kB writeback:0kB mapped:2748kB shmem:2116kB slab_reclaimable:9260kB slab_unreclaimable:35880kB kernel_stack:1184kB pagetables:3308kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3437106 all_unreclaimable? yes
[ 1726.629591] lowmem_reserve[]: 0 0 0 0
[ 1726.657650] Node 1 Normal free:5748kB min:5760kB low:7200kB high:8640kB active_anon:3383776kB inactive_anon:682376kB active_file:8kB inactive_file:340kB unevictable:8kB isolated(anon):384kB isolated(file):0kB present:4128768kB mlocked:8kB dirty:0kB writeback:0kB mapped:20kB shmem:12kB slab_reclaimable:9364kB slab_unreclaimable:13728kB kernel_stack:880kB pagetables:12492kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8696274 all_unreclaimable? yes
[ 1726.782732] lowmem_reserve[]: 0 0 0 0
[ 1726.814748] Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 3*64kB 2*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15880kB
[ 1726.854951] Node 0 DMA32: 12*4kB 25*8kB 21*16kB 19*32kB 15*64kB 8*128kB 4*256kB 1*512kB 0*1024kB 1*2048kB 1*4096kB = 10856kB
[ 1726.896378] Node 0 Normal: 556*4kB 11*8kB 6*16kB 7*32kB 2*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2888kB
[ 1726.937928] Node 1 Normal: 392*4kB 22*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 5856kB
[ 1726.979519] 481162 total pagecache pages
[ 1727.013687] 479540 pages in swap cache
[ 1727.047898] Swap cache stats: add 2176770, delete 1697230, find 701489/781867
[ 1727.085709] Free swap  = 4371040kB
[ 1727.119839] Total swap = 8175612kB
[ 1727.187872] 2097136 pages RAM
[ 1727.221789] 56226 pages reserved
[ 1727.256273] 1721904 pages shared
[ 1727.290826] 1549555 pages non-shared
[ 1727.325708] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 1727.366026] [  303]     0   303     9750      184      23      500         -1000 systemd-udevd
[ 1727.407347] [  448]     0   448     6558      178      13       56         -1000 auditd
[ 1727.447970] [  516]     0   516    61311      172      22      105             0 rsyslogd
[ 1727.488827] [  519]     0   519    24696       35      15       35             0 iscsiuio
[ 1727.529839] [  529]    81   529     8201      246      27       93          -900 dbus-daemon
[ 1727.571553] [  530]     0   530     1264       93       9       16             0 iscsid
[ 1727.612795] [  531]     0   531     1389      843       9        0           -17 iscsid
[ 1727.654069] [  538]     0   538     1608      157      10       29             0 mcelog
[ 1727.695659] [  543]     0   543     5906      138      16       49             0 atd
[ 1727.736954] [  551]     0   551    30146      211      22      123             0 crond
[ 1727.778589] [  569]     0   569    26612      223      83      185             0 login
[ 1727.820932] [  587]     0   587    86085      334      84      139             0 NetworkManager
[ 1727.863388] [  643]     0   643    21931      200      65      171          -900 modem-manager
[ 1727.906015] [  644]     0   644     5812      183      16       74             0 bluetoothd
[ 1727.948629] [  672]     0   672   118364      228     179     1079             0 libvirtd
[ 1727.991230] [  691]     0   691    20059      189      40      198         -1000 sshd
[ 1728.033700] [  996]     0   996    28950      196      17      516             0 bash
[ 1728.076541] [ 1056]     0  1056     1803      112      10       22             0 stress
[ 1728.119495] [ 1057]     0  1057     1803       27      10       21             0 stress
[ 1728.162616] [ 1058]     0  1058    67340    54961     118       12             0 stress
[ 1728.205521] [ 1059]     0  1059     1803       27      10       21             0 stress
[ 1728.248414] [ 1060]     0  1060    67340    12209      35       12             0 stress
[ 1728.291529] [ 1061]     0  1061     1803       27      10       21             0 stress
[ 1728.335147] [ 1062]     0  1062    67340    23519      57       12             0 stress
[ 1728.378537] [ 1063]     0  1063     1803       27      10       21             0 stress
[ 1728.421877] [ 1064]     0  1064    67340     7673     138    57944             0 stress
[ 1728.465325] [ 1065]     0  1065     1803       27      10       21             0 stress
[ 1728.508955] [ 1066]     0  1066    67340       90      11       12             0 stress
[ 1728.553187] [ 1067]     0  1067     1803       27      10       21             0 stress
[ 1728.597554] [ 1068]     0  1068    67340    58628     126       12             0 stress
[ 1728.640668] [ 1069]     0  1069     1803       27      10       21             0 stress
[ 1728.683676] [ 1070]     0  1070    67340    59802     128       12             0 stress
[ 1728.726534] [ 1071]     0  1071     1803       27      10       21             0 stress
[ 1728.769082] [ 1072]     0  1072    67340     5924     138    59693             0 stress
[ 1728.811455] [ 1073]     0  1073     1803       27      10       21             0 stress
[ 1728.852798] [ 1074]     0  1074    67340    65103     138       14             0 stress
[ 1728.892605] [ 1075]     0  1075     1803       27      10       21             0 stress
[ 1728.931191] [ 1076]     0  1076    67340    60077     128       13             0 stress
[ 1728.969491] [ 1077]     0  1077     1803       27      10       21             0 stress
[ 1729.006394] [ 1078]     0  1078    67340    13262     138    52355             0 stress
[ 1729.042189] [ 1079]     0  1079     1803       27      10       21             0 stress
[ 1729.076890] [ 1080]     0  1080    67340    38640      87       12             0 stress
[ 1729.111443] [ 1081]     0  1081     1803       27      10       21             0 stress
[ 1729.144638] [ 1082]     0  1082    67340     8238     138    57379             0 stress
[ 1729.176403] [ 1083]     0  1083     1803       27      10       21             0 stress
[ 1729.206905] [ 1084]     0  1084    67340    55392     119       12             0 stress
[ 1729.237086] [ 1085]     0  1085     1803       27      10       21             0 stress
[ 1729.265883] [ 1086]     0  1086    67340     4169     138    61447             0 stress
[ 1729.293362] [ 1087]     0  1087     1803       27      10       21             0 stress
[ 1729.319405] [ 1088]     0  1088    67340    16042      42       12             0 stress
[ 1729.345380] [ 1089]     0  1089     1803       27      10       21             0 stress
[ 1729.370934] [ 1090]     0  1090    67340     1223      13       12             0 stress
[ 1729.395553] [ 1091]     0  1091     1803       27      10       21             0 stress
[ 1729.419544] [ 1092]     0  1092    67340     8318     138    57298             0 stress
[ 1729.443863] [ 1093]     0  1093     1803       27      10       21             0 stress
[ 1729.467471] [ 1094]     0  1094    67340     2342      16       12             0 stress
[ 1729.491074] [ 1095]     0  1095     1803       27      10       21             0 stress
[ 1729.514194] [ 1096]     0  1096    67340    59017     126       12             0 stress
[ 1729.536998] [ 1097]     0  1097    67340    36245      82       12             0 stress
[ 1729.559710] [ 1098]     0  1098    67340    57050     122       12             0 stress
[ 1729.582264] [ 1099]     0  1099    67340    29239      68       12             0 stress
[ 1729.604895] [ 1100]     0  1100    67340    30815      71       12             0 stress
[ 1729.627532] [ 1101]     0  1101    67340     6881     138    58735             0 stress
[ 1729.650016] [ 1102]     0  1102    67340    37447      84       12             0 stress
[ 1729.672130] [ 1103]     0  1103    67340     6891      24       12             0 stress
[ 1729.693897] [ 1104]     0  1104    67340    35463      80       12             0 stress
[ 1729.715565] [ 1105]     0  1105    67340    11843      34       12             0 stress
[ 1729.736992] [ 1106]     0  1106    67340    10279     138    55338             0 stress
[ 1729.758383] [ 1198]     0  1198    88549     5957     185     6739          -900 setroubleshootd
[ 1729.780776] [ 2309]     0  2309     3243      192      12        0             0 systemd-cgroups
[ 1729.803176] [ 2312]     0  2312     3243      179      12        0             0 systemd-cgroups
[ 1729.825560] [ 2314]     0  2314     3243      209      11        0             0 systemd-cgroups
[ 1729.847848] [ 2315]     0  2315     3243      165      11        0             0 systemd-cgroups
[ 1729.870223] [ 2317]     0  2317     1736       41       6        0             0 systemd-cgroups
[ 1729.892316] [ 2319]     0  2319     2688       46       7        0             0 systemd-cgroups
[ 1729.914310] [ 2320]     0  2320      681       34       4        0             0 systemd-cgroups
[ 1729.936223] [ 2321]     0  2321       42        1       2        0             0 systemd-cgroups
[ 1729.957811] Out of memory: Kill process 516 (rsyslogd) score 0 or sacrifice child
[ 1729.978407] Killed process 516 (rsyslogd) total-vm:245244kB, anon-rss:0kB, file-rss:688kB
[ 1923.469572] INFO: task kworker/4:2:232 blocked for more than 120 seconds.
[ 1923.490002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1923.511856] kworker/4:2     D ffff88027fc14080     0   232      2 0x00000000
[ 1923.533216]  ffff88027977db88 0000000000000046 ffff8802797b5040 ffff88027977dfd8
[ 1923.556100]  ffff88027977dfd8 ffff88027977dfd8 ffff880179063580 ffff8802797b5040
[ 1923.578569]  ffff88027977db68 ffff88027977dd10 7fffffffffffffff ffff8802797b5040
[ 1923.601034] Call Trace:
[ 1923.618253]  [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1923.638372]  [<ffffffff815e6114>] schedule_timeout+0x1f4/0x2b0
[ 1923.659539]  [<ffffffff815e79f0>] wait_for_common+0x120/0x170
[ 1923.680664]  [<ffffffff81096920>] ? try_to_wake_up+0x2d0/0x2d0
[ 1923.702224]  [<ffffffff815e7b3d>] wait_for_completion+0x1d/0x20
[ 1923.723820]  [<ffffffff8107b0b9>] call_usermodehelper_fns+0x1d9/0x200
[ 1923.746167]  [<ffffffff810d0b32>] cgroup_release_agent+0xe2/0x180
[ 1923.768233]  [<ffffffff8107e638>] process_one_work+0x148/0x490
[ 1923.790179]  [<ffffffff810d0a50>] ? init_root_id+0xb0/0xb0
[ 1923.811797]  [<ffffffff8107f16e>] worker_thread+0x15e/0x450
[ 1923.833733]  [<ffffffff8107f010>] ? busy_worker_rebind_fn+0x110/0x110
[ 1923.856703]  [<ffffffff81084350>] kthread+0xc0/0xd0
[ 1923.878067]  [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
[ 1923.901388]  [<ffffffff815f12ac>] ret_from_fork+0x7c/0xb0
[ 1923.923544]  [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
[ 1923.947290] INFO: task rs:main Q:Reg:534 blocked for more than 120 seconds.
[ 1923.971646] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1923.997338] rs:main Q:Reg   D ffff88027fc54080     0   534      1 0x00000084
[ 1924.022565]  ffff880279913930 0000000000000082 ffff880279b29ac0 ffff880279913fd8
[ 1924.049053]  ffff880279913fd8 ffff880279913fd8 ffff880279b2b580 ffff880279b29ac0
[ 1924.075230]  000000000003b55f ffff880279b29ac0 ffff880279bedde0 ffffffffffffffff
[ 1924.101476] Call Trace:
[ 1924.122766]  [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1924.146894]  [<ffffffff815e8915>] rwsem_down_failed_common+0xb5/0x140
[ 1924.173018]  [<ffffffff815e89d5>] rwsem_down_read_failed+0x15/0x17
[ 1924.198572]  [<ffffffff812eb0c4>] call_rwsem_down_read_failed+0x14/0x30
[ 1924.224975]  [<ffffffff810fecf3>] ? taskstats_exit+0x383/0x420
[ 1924.250579]  [<ffffffff812e9e5f>] ? __get_user_8+0x1f/0x29
[ 1924.275797]  [<ffffffff815e6df4>] ? down_read+0x24/0x2b
[ 1924.300955]  [<ffffffff815eca16>] __do_page_fault+0x1c6/0x4e0
[ 1924.326726]  [<ffffffff811797ef>] ? alloc_pages_current+0xcf/0x140
[ 1924.353121]  [<ffffffff8118345e>] ? new_slab+0x20e/0x310
[ 1924.378729]  [<ffffffff815ecd3e>] do_page_fault+0xe/0x10
[ 1924.404424]  [<ffffffff815e9358>] page_fault+0x28/0x30
[ 1924.429991]  [<ffffffff810fecf3>] ? taskstats_exit+0x383/0x420
[ 1924.456539]  [<ffffffff812e9e5f>] ? __get_user_8+0x1f/0x29
[ 1924.482746]  [<ffffffff810c047d>] ? exit_robust_list+0x5d/0x160
[ 1924.509604]  [<ffffffff810feca9>] ? taskstats_exit+0x339/0x420
[ 1924.536514]  [<ffffffff8105e5d7>] mm_release+0x147/0x160
[ 1924.563242]  [<ffffffff81065186>] exit_mm+0x26/0x120
[ 1924.589384]  [<ffffffff81066787>] do_exit+0x167/0x8d0
[ 1924.615888]  [<ffffffff810be75b>] ? futex_wait+0x13b/0x2c0
[ 1924.642809]  [<ffffffff81183060>] ? kmem_cache_free+0x20/0x160
[ 1924.670213]  [<ffffffff8106733f>] do_group_exit+0x3f/0xa0
[ 1924.697387]  [<ffffffff81075eca>] get_signal_to_deliver+0x1ca/0x5e0
[ 1924.726012]  [<ffffffff8101437f>] do_signal+0x3f/0x610
[ 1924.753260]  [<ffffffff810c06ad>] ? do_futex+0x12d/0x580
[ 1924.780688]  [<ffffffff810149f0>] do_notify_resume+0x80/0xb0
[ 1924.808394]  [<ffffffff815f1612>] int_signal+0x12/0x17
[ 1924.835765] INFO: task rsyslogd:536 blocked for more than 120 seconds.
[ 1924.864967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1924.896012] rsyslogd        D ffff88017bc14080     0   536      1 0x00000086
[ 1924.926767]  ffff8802791b7c38 0000000000000082 ffff88027a0a0000 ffff8802791b7fd8
[ 1924.958361]  ffff8802791b7fd8 ffff8802791b7fd8 ffff880279b29ac0 ffff88027a0a0000
[ 1924.990164]  ffff8802791b7c28 ffff88027a0a0000 ffff880279bedde0 ffffffffffffffff
[ 1925.021929] Call Trace:
[ 1925.048178]  [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1925.077075]  [<ffffffff815e8915>] rwsem_down_failed_common+0xb5/0x140
[ 1925.107125]  [<ffffffff815e89b3>] rwsem_down_write_failed+0x13/0x20
[ 1925.136665]  [<ffffffff812eb0f3>] call_rwsem_down_write_failed+0x13/0x20
[ 1925.166430]  [<ffffffff815e6dc2>] ? down_write+0x32/0x40
[ 1925.194797]  [<ffffffff8109ae0b>] task_numa_work+0xeb/0x270
[ 1925.223254]  [<ffffffff81081047>] task_work_run+0xa7/0xe0
[ 1925.251609]  [<ffffffff810762bb>] get_signal_to_deliver+0x5bb/0x5e0
[ 1925.280624]  [<ffffffff8115e619>] ? handle_mm_fault+0x149/0x210
[ 1925.309222]  [<ffffffff8101437f>] do_signal+0x3f/0x610
[ 1925.337048]  [<ffffffff8117cea1>] ? change_prot_numa+0x51/0x60
[ 1925.365604]  [<ffffffff8109aef6>] ? task_numa_work+0x1d6/0x270
[ 1925.394295]  [<ffffffff810149f0>] do_notify_resume+0x80/0xb0
[ 1925.422677]  [<ffffffff815e917c>] retint_signal+0x48/0x8c
[ 1925.450753] INFO: task NetworkManager:587 blocked for more than 120 seconds.
[ 1925.480882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1925.512086] NetworkManager  D ffff88027fc54080     0   587      1 0x00000080
[ 1925.542751]  ffff880279fb7dc8 0000000000000086 ffff880279f29ac0 ffff880279fb7fd8
[ 1925.574086]  ffff880279fb7fd8 ffff880279fb7fd8 ffff8801752c1ac0 ffff880279f29ac0
[ 1925.605328]  ffffea0005e82c40 ffff880279f29ac0 ffff8801793cd860 ffffffffffffffff
[ 1925.636585] Call Trace:
[ 1925.662557]  [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1925.691518]  [<ffffffff815e8915>] rwsem_down_failed_common+0xb5/0x140
[ 1925.721815]  [<ffffffff8115e619>] ? handle_mm_fault+0x149/0x210
[ 1925.751814]  [<ffffffff815e89b3>] rwsem_down_write_failed+0x13/0x20
[ 1925.781475]  [<ffffffff812eb0f3>] call_rwsem_down_write_failed+0x13/0x20
[ 1925.811421]  [<ffffffff815e6dc2>] ? down_write+0x32/0x40
[ 1925.839679]  [<ffffffff8109ae0b>] task_numa_work+0xeb/0x270
[ 1925.868180]  [<ffffffff810e307c>] ? __audit_syscall_exit+0x3ec/0x450
[ 1925.897544]  [<ffffffff81081047>] task_work_run+0xa7/0xe0
[ 1925.925674]  [<ffffffff810149e1>] do_notify_resume+0x71/0xb0
[ 1925.954159]  [<ffffffff815e917c>] retint_signal+0x48/0x8c
[ 2045.984126] INFO: task kworker/4:2:232 blocked for more than 120 seconds.
[ 2046.013951] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2046.045339] kworker/4:2     D ffff88027fc14080     0   232      2 0x00000000
[ 2046.075873]  ffff88027977db88 0000000000000046 ffff8802797b5040 ffff88027977dfd8
[ 2046.107357]  ffff88027977dfd8 ffff88027977dfd8 ffff880179063580 ffff8802797b5040
[ 2046.138576]  ffff88027977db68 ffff88027977dd10 7fffffffffffffff ffff8802797b5040
[ 2046.169696] Call Trace:
[ 2046.195647]  [<ffffffff815e7b69>] schedule+0x29/0x70
[ 2046.224276]  [<ffffffff815e6114>] schedule_timeout+0x1f4/0x2b0
[ 2046.253957]  [<ffffffff815e79f0>] wait_for_common+0x120/0x170
[ 2046.283497]  [<ffffffff81096920>] ? try_to_wake_up+0x2d0/0x2d0
[ 2046.313349]  [<ffffffff815e7b3d>] wait_for_completion+0x1d/0x20
[ 2046.343061]  [<ffffffff8107b0b9>] call_usermodehelper_fns+0x1d9/0x200
[ 2046.373636]  [<ffffffff810d0b32>] cgroup_release_agent+0xe2/0x180
[ 2046.403705]  [<ffffffff8107e638>] process_one_work+0x148/0x490
[ 2046.433485]  [<ffffffff810d0a50>] ? init_root_id+0xb0/0xb0
[ 2046.462879]  [<ffffffff8107f16e>] worker_thread+0x15e/0x450
[ 2046.492398]  [<ffffffff8107f010>] ? busy_worker_rebind_fn+0x110/0x110
[ 2046.522898]  [<ffffffff81084350>] kthread+0xc0/0xd0
[ 2046.551810]  [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
[ 2046.582425]  [<ffffffff815f12ac>] ret_from_fork+0x7c/0xb0
[ 2046.612099]  [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
----------------------- snip ----------------------------

The other is that oom02 (LTP: testcases/kernel/mem/oom/oom02), which
is designed to try to hog memory on one node, made the oom-killer kill
unexpected processes, and oom02 always hung until it was killed
manually.

------------ snip -----------------------
[12449.554508] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[12449.554524] [  303]     0   303     9750      296      23      414         -1000 systemd-udevd
[12449.554527] [  448]     0   448     6558      224      13       35         -1000 auditd
[12449.554532] [  519]     0   519    24696       82      15       20             0 iscsiuio
[12449.554534] [  529]    81   529     8201      389      27       35          -900 dbus-daemon
[12449.554536] [  530]     0   530     1264       94       9       15             0 iscsid
[12449.554538] [  531]     0   531     1389      876       9        0           -17 iscsid
[12449.554540] [  538]     0   538     1608      158      10       28             0 mcelog
[12449.554542] [  543]     0   543     5906      138      16       48             0 atd
[12449.554543] [  551]     0   551    30146      301      22       56             0 crond
[12449.554545] [  569]     0   569    26612      224      83      184             0 login
[12449.554547] [  587]     0   587   108127      917      91       45             0 NetworkManager
[12449.554549] [  643]     0   643    21931      290      65      120          -900 modem-manager
[12449.554551] [  644]     0   644     5812      222      16       61             0 bluetoothd
[12449.554553] [  672]     0   672   118388      469     179      935             0 libvirtd
[12449.554555] [  691]     0   691    20059      229      40      177         -1000 sshd
[12449.554556] [  996]     0   996    28975      654      18      186             0 bash
[12449.554558] [ 1198]     0  1198    88549     5958     185     6738          -900 setroubleshootd
[12449.554563] [ 2332]     0  2332    74107     3729     152        0             0 systemd-journal
[12449.554564] [ 2335]     0  2335     8222      388      20        0             0 systemd-logind
[12449.554567] [20909]     0 20909    61311      456      22        0             0 rsyslogd
[12449.554572] [12818]     0 12818   788031     6273      24        0             0 oom02
[12449.554574] [13193]     0 13193    23143     3749      46        0             0 dhclient
[12449.554576] [13217]     0 13217    33221     1188      67        0             0 sshd
[12449.554577] [13221]     0 13221    28949      879      19        0             0 bash
[12449.554579] [13308]     0 13308    23143     3749      45        0             0 dhclient
[12449.554581] [13388]     0 13388    36864     1063      45 
0             0 vim
[12449.554583] Out of memory: Kill process 13193 (dhclient) score 1 or 
sacrifice child
[12449.554584] Killed process 13193 (dhclient) total-vm:92572kB, 
anon-rss:11976kB, file-rss:3020kB
[12451.878812] oom02 invoked oom-killer: gfp_mask=0x280da, order=0, 
oom_score_adj=0
[12451.878815] oom02 cpuset=/ mems_allowed=0-1
[12451.878818] Pid: 12818, comm: oom02 Tainted: G        W 
3.7.0-rc6numacorev17+ #5
[12451.878819] Call Trace:
[12451.878829]  [<ffffffff810d92d1>] ? 
cpuset_print_task_mems_allowed+0x91/0xa0
[12451.878834]  [<ffffffff815dd3b6>] dump_header.isra.12+0x70/0x19b
[12451.878838]  [<ffffffff812e37d3>] ? ___ratelimit+0xa3/0x120
[12451.878843]  [<ffffffff81139c0d>] oom_kill_process+0x1cd/0x320
[12451.878848]  [<ffffffff8106d3c5>] ? has_ns_capability_noaudit+0x15/0x20
[12451.878850]  [<ffffffff8113a377>] out_of_memory+0x447/0x480
[12451.878853]  [<ffffffff8113ff5c>] __alloc_pages_nodemask+0x94c/0x960
[12451.878858]  [<ffffffff8117b1a6>] alloc_pages_vma+0xb6/0x190
[12451.878861]  [<ffffffff8115e094>] handle_pte_fault+0x8f4/0xb90
[12451.878865]  [<ffffffff810fa237>] ? call_rcu_sched+0x17/0x20
[12451.878868]  [<ffffffff8118ed82>] ? put_filp+0x52/0x60
[12451.878870]  [<ffffffff8115e619>] handle_mm_fault+0x149/0x210
[12451.878873]  [<ffffffff815ec9c2>] __do_page_fault+0x172/0x4e0
[12451.878875]  [<ffffffff81183060>] ? kmem_cache_free+0x20/0x160
[12451.878878]  [<ffffffff81198396>] ? final_putname+0x26/0x50
[12451.878880]  [<ffffffff815ecd3e>] do_page_fault+0xe/0x10
[12451.878883]  [<ffffffff815e9358>] page_fault+0x28/0x30
--------------- snip ------------------

-------------- snip ------------------
[12451.956997] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[12451.957010] [  303]     0   303     9750      296      23      414         -1000 systemd-udevd
[12451.957013] [  448]     0   448     6558      224      13       35         -1000 auditd
[12451.957017] [  519]     0   519    24696       82      15       20             0 iscsiuio
[12451.957019] [  529]    81   529     8201      389      27       35          -900 dbus-daemon
[12451.957021] [  530]     0   530     1264       94       9       15             0 iscsid
[12451.957022] [  531]     0   531     1389      876       9        0           -17 iscsid
[12451.957024] [  538]     0   538     1608      158      10       28             0 mcelog
[12451.957026] [  543]     0   543     5906      138      16       48             0 atd
[12451.957028] [  551]     0   551    30146      301      22       56             0 crond
[12451.957029] [  569]     0   569    26612      224      83      184             0 login
[12451.957031] [  587]     0   587   110177      922      92       45             0 NetworkManager
[12451.957033] [  643]     0   643    21931      290      65      120          -900 modem-manager
[12451.957035] [  644]     0   644     5812      222      16       61             0 bluetoothd
[12451.957036] [  672]     0   672   118388      469     179      935             0 libvirtd
[12451.957038] [  691]     0   691    20059      229      40      177         -1000 sshd
[12451.957040] [  996]     0   996    28975      654      18      186             0 bash
[12451.957042] [ 1198]     0  1198    88549     5958     185     6738          -900 setroubleshootd
[12451.957045] [ 2332]     0  2332    74107     3897     152        0             0 systemd-journal
[12451.957047] [ 2335]     0  2335     8222      388      20        0             0 systemd-logind
[12451.957049] [20909]     0 20909    61311      456      22        0             0 rsyslogd
[12451.957052] [12818]     0 12818   788031     6273      24        0             0 oom02
[12451.957054] [13217]     0 13217    33221     1188      67        0             0 sshd
[12451.957055] [13221]     0 13221    28949      879      19        0             0 bash
[12451.957057] [13388]     0 13388    36864     1063      45        0             0 vim
[12451.957059] [13410]     0 13410    33476      510      58        0          -900 nm-dispatcher.a
[12451.957061] Out of memory: Kill process 2335 (systemd-logind) score 0 or sacrifice child
-------------------- snip ----------------------------

Also, I found that the oom-killer performed badly on the numa/core tree; you
can use LTP: testcases/kernel/mem/oom/oom* to verify it.

Please let me know if you need more details or any other testing work.

Thanks,
Zhouping



* Comparison between three trees (was: Latest numa/core release, v17)
  2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
                   ` (33 preceding siblings ...)
  2012-11-22 22:53 ` [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
@ 2012-11-23 17:32 ` Mel Gorman
  2012-11-25  8:47   ` Hillf Danton
                     ` (3 more replies)
  34 siblings, 4 replies; 55+ messages in thread
From: Mel Gorman @ 2012-11-23 17:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Paul Turner, Lee Schermerhorn,
	Christoph Lameter, Rik van Riel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Johannes Weiner, Hugh Dickins

Warning: This is an insanely long mail and there is a lot of data here. Get
	coffee or something.

This is another round of comparisons between the latest released versions
of each of three automatic numa balancing trees that are out there.

From the series "Automatic NUMA Balancing V5", the kernels tested were

stats-v5r1	Patches 1-10. TLB optimisations, migration stats
thpmigrate-v5r1	Patches 1-37. Basic placement policy, PMD handling, THP migration etc.
adaptscan-v5r1	Patches 1-38. Heavy handed PTE scan reduction
delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new node

If I just say balancenuma, I mean the "delaystart-v5r1" kernel. The other
kernels are included so you can see the impact the scan rate adaption
patch has and what that might mean for a placement policy using a proper
feedback mechanism.

The other two kernels were

numacore-20121123 It was no longer clear what the deltas between releases and
	the dependencies might be so I just pulled tip/master on November
	23rd, 2012. An earlier pull had serious difficulties and the patch
	responsible has been dropped since. This is not a like-with-like
	comparison as the tree contains numerous other patches but it's
	the best available given the timeframe

autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
	branch with Hugh's THP migration patch on top. Hopefully Andrea
	and Hugh will not mind but I took the liberty of publishing the
	result as the mm-autonuma-v28fastr4-mels-rebase branch in
	git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git

I'm treating stats-v5r1 as the baseline as it has the same TLB optimisations
shared between balancenuma and numacore. As I write this I realise this may
not be fair to autonuma depending on how it avoids flushing the TLB. I'm
not digging into that right now, Andrea might comment.

All of these tests were run unattended via MMTests. Any errors in the
methodology would be applied evenly to all kernels tested. There were
monitors running but *not* profiling for the reported figures. All tests
were actually run in pairs, with and without profiling but none of the
profiles are included, nor have I looked at any of them yet.  The heaviest
active monitor reads numa_maps every 10 seconds; it is read only once per
address space and the result is reused by all threads. This will affect peak values
because it means the monitors contend on some of the same locks the PTE
scanner does for example. If time permits, I'll run a no-monitor set.

Let's start with the usual autonumabench.

AUTONUMA BENCH
                                          3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                                 rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
User    NUMA01               75064.91 (  0.00%)    24837.09 ( 66.91%)    31651.70 ( 57.83%)    54454.75 ( 27.46%)    58561.99 ( 21.98%)    56747.85 ( 24.40%)
User    NUMA01_THEADLOCAL    62045.39 (  0.00%)    17582.23 ( 71.66%)    17173.01 ( 72.32%)    16906.80 ( 72.75%)    17813.47 ( 71.29%)    18021.32 ( 70.95%)
User    NUMA02                6921.18 (  0.00%)     2088.16 ( 69.83%)     2226.35 ( 67.83%)     2065.29 ( 70.16%)     2049.90 ( 70.38%)     2098.25 ( 69.68%)
User    NUMA02_SMT            2924.84 (  0.00%)     1006.42 ( 65.59%)     1069.26 ( 63.44%)      987.17 ( 66.25%)      995.65 ( 65.96%)     1000.24 ( 65.80%)
System  NUMA01                  48.75 (  0.00%)     1138.62 (-2235.63%)      249.25 (-411.28%)      696.82 (-1329.37%)      273.76 (-461.56%)      271.95 (-457.85%)
System  NUMA01_THEADLOCAL       46.05 (  0.00%)      480.03 (-942.41%)       92.40 (-100.65%)      156.85 (-240.61%)      135.24 (-193.68%)      122.13 (-165.21%)
System  NUMA02                   1.73 (  0.00%)       24.84 (-1335.84%)        7.73 (-346.82%)        8.74 (-405.20%)        6.35 (-267.05%)        9.02 (-421.39%)
System  NUMA02_SMT              18.34 (  0.00%)       11.02 ( 39.91%)        3.74 ( 79.61%)        3.31 ( 81.95%)        3.53 ( 80.75%)        3.55 ( 80.64%)
Elapsed NUMA01                1666.60 (  0.00%)      585.34 ( 64.88%)      749.72 ( 55.02%)     1234.33 ( 25.94%)     1321.51 ( 20.71%)     1269.96 ( 23.80%)
Elapsed NUMA01_THEADLOCAL     1391.37 (  0.00%)      392.39 ( 71.80%)      381.56 ( 72.58%)      370.06 ( 73.40%)      396.18 ( 71.53%)      397.63 ( 71.42%)
Elapsed NUMA02                 176.41 (  0.00%)       50.78 ( 71.21%)       53.35 ( 69.76%)       48.89 ( 72.29%)       50.66 ( 71.28%)       50.34 ( 71.46%)
Elapsed NUMA02_SMT             163.88 (  0.00%)       48.09 ( 70.66%)       49.54 ( 69.77%)       46.83 ( 71.42%)       48.29 ( 70.53%)       47.63 ( 70.94%)
CPU     NUMA01                4506.00 (  0.00%)     4437.00 (  1.53%)     4255.00 (  5.57%)     4468.00 (  0.84%)     4452.00 (  1.20%)     4489.00 (  0.38%)
CPU     NUMA01_THEADLOCAL     4462.00 (  0.00%)     4603.00 ( -3.16%)     4524.00 ( -1.39%)     4610.00 ( -3.32%)     4530.00 ( -1.52%)     4562.00 ( -2.24%)
CPU     NUMA02                3924.00 (  0.00%)     4160.00 ( -6.01%)     4187.00 ( -6.70%)     4241.00 ( -8.08%)     4058.00 ( -3.41%)     4185.00 ( -6.65%)
CPU     NUMA02_SMT            1795.00 (  0.00%)     2115.00 (-17.83%)     2165.00 (-20.61%)     2114.00 (-17.77%)     2068.00 (-15.21%)     2107.00 (-17.38%)

numacore is the best at running the adverse numa01 workload. autonuma does
respectably and balancenuma does not cope with this case. It improves on the
baseline but it does not know how to interleave for this type of workload.
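
The bracketed percentages in the table above appear to be gains relative
to the stats-v5r1 baseline, with a positive value meaning less time was
used. A minimal Python sketch under that assumption (the helper name
pct_gain is hypothetical and is not part of MMTests) reproduces the
Elapsed NUMA01 column:

```python
def pct_gain(baseline, value):
    """Percentage improvement over baseline for a lower-is-better metric."""
    return (baseline - value) / baseline * 100.0


# Elapsed NUMA01 seconds copied from the table above.
baseline = 1666.60  # rc6-stats-v5r1
elapsed = {
    "numacore": 585.34,
    "autonuma": 749.72,
    "balancenuma": 1269.96,  # the delaystart kernel
}

gains = {name: pct_gain(baseline, secs) for name, secs in elapsed.items()}
for name, gain in gains.items():
    print(f"{name:12s} {gain:6.2f}%")
```

The same formula with the sign flipped would apply to higher-is-better
metrics such as throughput.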

For the other workloads that are friendlier to NUMA, the three trees
are roughly comparable in terms of elapsed time. There are no multiple runs
because they take too long, but there is a strong chance we are within the noise
of each other for the other workloads.

Where we differ is in system CPU usage. In all cases, numacore uses more
system CPU. It is likely it is compensating better for this overhead
with better placement. With this higher overhead it ends up with a tie
on everything except the adverse workload. Take NUMA01_THREADLOCAL as
an example -- numacore uses roughly 4 times more system CPU than either
autonuma or balancenuma. autonuma's cost could be hidden in kernel threads
but that's not true for balancenuma.

MMTests Statistics: duration
                           3.7.0                   3.7.0                   3.7.0                   3.7.0                   3.7.0                   3.7.0
                  rc6-stats-v5r1   rc6-numacore-20121123  rc6-autonuma-v28fastr4     rc6-thpmigrate-v5r1      rc6-adaptscan-v5r1     rc6-delaystart-v5r4
User                   274653.21                92676.27               107399.17               130223.93               142154.84               146804.10
System                   1329.11                 5364.97                 1093.69                 2773.99                 1453.79                 1814.66
Elapsed                  6827.56                 2781.35                 3046.92                 3508.55                 3757.51                 3843.07

The differences in overall elapsed time come down to how well numa01 is
handled. There are large differences in the system CPU time. numacore is
using roughly three times the system CPU of balancenuma and almost five
times that of autonuma.
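
The system CPU gap can be read straight off the duration table. The sketch
below simply divides the numacore total by the others; the dictionary and
variable names are illustrative only.

```python
# Total system CPU seconds from the duration table above.
system_cpu = {
    "numacore": 5364.97,
    "autonuma": 1093.69,
    "balancenuma": 1814.66,  # the delaystart kernel
}

# How many times more system CPU numacore used than each other tree.
ratios = {
    name: system_cpu["numacore"] / secs
    for name, secs in system_cpu.items()
    if name != "numacore"
}

for name, ratio in ratios.items():
    print(f"numacore / {name}: {ratio:.2f}x")
```

As noted earlier, autonuma's figure may be flattering because part of its
cost could be hidden in kernel threads.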

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                        195440      172116      168284      169788      167656      168860
Page Outs                       355400      238756      247740      246860      264276      269304
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                  42264       29117       37284       47486       32077       34343
THP collapse alloc                  23           1         809          23          26          22
THP splits                           5           1          47           6           5           4
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0      523123      180790      209771
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0         543         187         217
NUMA PTE updates                     0           0           0   842347410   295302723   301160396
NUMA hint faults                     0           0           0     6924258     3277126     3189624
NUMA hint local faults               0           0           0     3757418     1824546     1872917
NUMA pages migrated                  0           0           0      523123      180790      209771
AutoNUMA cost                        0           0           0       40527       18456       18060

Not much to usefully interpret here other than noting we generally avoid
splitting THP. For balancenuma, note what the scan adaption does to the
number of PTE updates and the number of faults incurred. A policy may
not necessarily like this. It depends on its requirements but if it wants
higher PTE scan rates it will have to compensate for it.
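
To make the effect of the scan adaption concrete, the sketch below takes
the three balancenuma columns of the vmstat table and derives two ratios:
the share of NUMA hint faults that were local, and the PTE update count
relative to thpmigrate-v5r1. The choice of these two ratios is mine and is
only one way of reading the counters.

```python
# (NUMA PTE updates, NUMA hint faults, NUMA hint local faults)
# copied from the vmstat table above.
vmstat = {
    "thpmigrate": (842347410, 6924258, 3757418),
    "adaptscan": (295302723, 3277126, 1824546),
    "delaystart": (301160396, 3189624, 1872917),
}

base_updates = vmstat["thpmigrate"][0]

summary = {}
for kernel, (updates, faults, local) in vmstat.items():
    summary[kernel] = {
        # Share of hinting faults that hit memory already on the local node.
        "local_pct": 100.0 * local / faults,
        # PTE updates as a percentage of the thpmigrate kernel's count.
        "updates_vs_thpmigrate_pct": 100.0 * updates / base_updates,
    }

for kernel, stats in summary.items():
    print(kernel, {key: round(val, 2) for key, val in stats.items()})
```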

Next is the specjbb. There are 4 separate configurations

multi JVM, THP
multi JVM, no THP
single JVM, THP
single JVM, no THP

SPECJBB: Multi JVMs (one per node, 4 nodes), THP is enabled
                          3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                 rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
Mean   1      30969.75 (  0.00%)     28318.75 ( -8.56%)     31542.00 (  1.85%)     30427.75 ( -1.75%)     31192.25 (  0.72%)     31216.75 (  0.80%)
Mean   2      62036.50 (  0.00%)     57323.50 ( -7.60%)     66167.25 (  6.66%)     62900.25 (  1.39%)     61826.75 ( -0.34%)     62239.00 (  0.33%)
Mean   3      90075.50 (  0.00%)     86045.25 ( -4.47%)     96151.25 (  6.75%)     91035.75 (  1.07%)     89128.25 ( -1.05%)     90692.25 (  0.68%)
Mean   4     116062.50 (  0.00%)     91439.25 (-21.22%)    125072.75 (  7.76%)    116103.75 (  0.04%)    115819.25 ( -0.21%)    117047.75 (  0.85%)
Mean   5     136056.00 (  0.00%)     97558.25 (-28.30%)    150854.50 ( 10.88%)    138629.75 (  1.89%)    138712.25 (  1.95%)    139477.00 (  2.51%)
Mean   6     153827.50 (  0.00%)    128628.25 (-16.38%)    175849.50 ( 14.32%)    157472.75 (  2.37%)    158780.00 (  3.22%)    158780.25 (  3.22%)
Mean   7     151946.00 (  0.00%)    136447.25 (-10.20%)    181675.50 ( 19.57%)    160388.25 (  5.56%)    160378.75 (  5.55%)    162787.50 (  7.14%)
Mean   8     155941.50 (  0.00%)    136351.25 (-12.56%)    185131.75 ( 18.72%)    158613.00 (  1.71%)    159683.25 (  2.40%)    164054.25 (  5.20%)
Mean   9     146191.50 (  0.00%)    125132.00 (-14.41%)    184833.50 ( 26.43%)    155988.50 (  6.70%)    157664.75 (  7.85%)    161319.00 ( 10.35%)
Mean   10    139189.50 (  0.00%)     98594.50 (-29.17%)    179948.50 ( 29.28%)    150341.75 (  8.01%)    152771.00 (  9.76%)    155530.25 ( 11.74%)
Mean   11    133561.75 (  0.00%)    105967.75 (-20.66%)    175904.50 ( 31.70%)    144335.75 (  8.07%)    146147.00 (  9.42%)    146832.50 (  9.94%)
Mean   12    123752.25 (  0.00%)    138392.25 ( 11.83%)    169482.50 ( 36.95%)    140328.50 ( 13.39%)    138498.50 ( 11.92%)    142362.25 ( 15.04%)
Mean   13    123578.50 (  0.00%)    103236.50 (-16.46%)    166714.75 ( 34.91%)    136745.25 ( 10.65%)    138469.50 ( 12.05%)    140699.00 ( 13.85%)
Mean   14    123812.00 (  0.00%)    113250.00 ( -8.53%)    164406.00 ( 32.79%)    138061.25 ( 11.51%)    134047.25 (  8.27%)    139790.50 ( 12.91%)
Mean   15    123499.25 (  0.00%)    130577.50 (  5.73%)    162517.00 ( 31.59%)    133598.50 (  8.18%)    132651.50 (  7.41%)    134423.00 (  8.85%)
Mean   16    118595.75 (  0.00%)    127494.50 (  7.50%)    160836.25 ( 35.62%)    129305.25 (  9.03%)    131355.75 ( 10.76%)    132424.25 ( 11.66%)
Mean   17    115374.75 (  0.00%)    121443.50 (  5.26%)    157091.00 ( 36.16%)    127538.50 ( 10.54%)    128536.00 ( 11.41%)    128923.75 ( 11.74%)
Mean   18    120981.00 (  0.00%)    119649.00 ( -1.10%)    155978.75 ( 28.93%)    126031.00 (  4.17%)    127277.00 (  5.20%)    131032.25 (  8.31%)
Stddev 1       1256.20 (  0.00%)      1649.69 (-31.32%)      1042.80 ( 16.99%)      1004.74 ( 20.02%)      1125.79 ( 10.38%)       965.75 ( 23.12%)
Stddev 2        894.02 (  0.00%)      1299.83 (-45.39%)       153.62 ( 82.82%)      1757.03 (-96.53%)      1089.32 (-21.84%)       370.16 ( 58.60%)
Stddev 3       1354.13 (  0.00%)      3221.35 (-137.89%)       452.26 ( 66.60%)      1169.99 ( 13.60%)      1387.57 ( -2.47%)       629.10 ( 53.54%)
Stddev 4       1505.56 (  0.00%)      9559.15 (-534.92%)       597.48 ( 60.32%)      1046.60 ( 30.48%)      1285.40 ( 14.62%)      1320.74 ( 12.28%)
Stddev 5        513.85 (  0.00%)     20854.29 (-3958.43%)       416.34 ( 18.98%)       760.85 (-48.07%)      1118.27 (-117.62%)      1382.28 (-169.00%)
Stddev 6       1393.16 (  0.00%)     11554.27 (-729.36%)      1225.46 ( 12.04%)      1190.92 ( 14.52%)      1662.55 (-19.34%)      1814.39 (-30.24%)
Stddev 7       1645.51 (  0.00%)      7300.33 (-343.65%)      1690.25 ( -2.72%)      2517.46 (-52.99%)      1882.02 (-14.37%)      2393.67 (-45.47%)
Stddev 8       4853.40 (  0.00%)     10303.35 (-112.29%)      1724.63 ( 64.47%)      4280.27 ( 11.81%)      6680.41 (-37.64%)      1453.35 ( 70.05%)
Stddev 9       4366.96 (  0.00%)      9683.51 (-121.74%)      3443.47 ( 21.15%)      7360.20 (-68.54%)      4560.06 ( -4.42%)      3269.18 ( 25.14%)
Stddev 10      4840.11 (  0.00%)      7402.77 (-52.95%)      5808.63 (-20.01%)      4639.55 (  4.14%)      1221.58 ( 74.76%)      3911.11 ( 19.19%)
Stddev 11      5208.04 (  0.00%)     12657.33 (-143.03%)     10003.74 (-92.08%)      8961.02 (-72.06%)      3754.61 ( 27.91%)      4138.30 ( 20.54%)
Stddev 12      5015.66 (  0.00%)     14749.87 (-194.08%)     14862.62 (-196.32%)      4554.52 (  9.19%)      7436.76 (-48.27%)      3902.07 ( 22.20%)
Stddev 13      3348.23 (  0.00%)     13349.42 (-298.70%)     15333.50 (-357.96%)      5121.75 (-52.97%)      6893.45 (-105.88%)      3633.54 ( -8.52%)
Stddev 14      2816.30 (  0.00%)      3878.71 (-37.72%)     15707.34 (-457.73%)      1296.47 ( 53.97%)      4760.04 (-69.02%)      1540.51 ( 45.30%)
Stddev 15      2592.17 (  0.00%)       777.61 ( 70.00%)     17317.35 (-568.06%)      3572.43 (-37.82%)      5510.05 (-112.57%)      2227.21 ( 14.08%)
Stddev 16      4163.07 (  0.00%)      1239.57 ( 70.22%)     16770.00 (-302.83%)      3858.12 (  7.33%)      2947.70 ( 29.19%)      3332.69 ( 19.95%)
Stddev 17      5959.34 (  0.00%)      1602.88 ( 73.10%)     16890.90 (-183.44%)      4770.68 ( 19.95%)      4398.91 ( 26.18%)      3340.67 ( 43.94%)
Stddev 18      3040.65 (  0.00%)       857.66 ( 71.79%)     19296.90 (-534.63%)      6344.77 (-108.67%)      4183.68 (-37.59%)      1278.14 ( 57.96%)
TPut   1     123879.00 (  0.00%)    113275.00 ( -8.56%)    126168.00 (  1.85%)    121711.00 ( -1.75%)    124769.00 (  0.72%)    124867.00 (  0.80%)
TPut   2     248146.00 (  0.00%)    229294.00 ( -7.60%)    264669.00 (  6.66%)    251601.00 (  1.39%)    247307.00 ( -0.34%)    248956.00 (  0.33%)
TPut   3     360302.00 (  0.00%)    344181.00 ( -4.47%)    384605.00 (  6.75%)    364143.00 (  1.07%)    356513.00 ( -1.05%)    362769.00 (  0.68%)
TPut   4     464250.00 (  0.00%)    365757.00 (-21.22%)    500291.00 (  7.76%)    464415.00 (  0.04%)    463277.00 ( -0.21%)    468191.00 (  0.85%)
TPut   5     544224.00 (  0.00%)    390233.00 (-28.30%)    603418.00 ( 10.88%)    554519.00 (  1.89%)    554849.00 (  1.95%)    557908.00 (  2.51%)
TPut   6     615310.00 (  0.00%)    514513.00 (-16.38%)    703398.00 ( 14.32%)    629891.00 (  2.37%)    635120.00 (  3.22%)    635121.00 (  3.22%)
TPut   7     607784.00 (  0.00%)    545789.00 (-10.20%)    726702.00 ( 19.57%)    641553.00 (  5.56%)    641515.00 (  5.55%)    651150.00 (  7.14%)
TPut   8     623766.00 (  0.00%)    545405.00 (-12.56%)    740527.00 ( 18.72%)    634452.00 (  1.71%)    638733.00 (  2.40%)    656217.00 (  5.20%)
TPut   9     584766.00 (  0.00%)    500528.00 (-14.41%)    739334.00 ( 26.43%)    623954.00 (  6.70%)    630659.00 (  7.85%)    645276.00 ( 10.35%)
TPut   10    556758.00 (  0.00%)    394378.00 (-29.17%)    719794.00 ( 29.28%)    601367.00 (  8.01%)    611084.00 (  9.76%)    622121.00 ( 11.74%)
TPut   11    534247.00 (  0.00%)    423871.00 (-20.66%)    703618.00 ( 31.70%)    577343.00 (  8.07%)    584588.00 (  9.42%)    587330.00 (  9.94%)
TPut   12    495009.00 (  0.00%)    553569.00 ( 11.83%)    677930.00 ( 36.95%)    561314.00 ( 13.39%)    553994.00 ( 11.92%)    569449.00 ( 15.04%)
TPut   13    494314.00 (  0.00%)    412946.00 (-16.46%)    666859.00 ( 34.91%)    546981.00 ( 10.65%)    553878.00 ( 12.05%)    562796.00 ( 13.85%)
TPut   14    495248.00 (  0.00%)    453000.00 ( -8.53%)    657624.00 ( 32.79%)    552245.00 ( 11.51%)    536189.00 (  8.27%)    559162.00 ( 12.91%)
TPut   15    493997.00 (  0.00%)    522310.00 (  5.73%)    650068.00 ( 31.59%)    534394.00 (  8.18%)    530606.00 (  7.41%)    537692.00 (  8.85%)
TPut   16    474383.00 (  0.00%)    509978.00 (  7.50%)    643345.00 ( 35.62%)    517221.00 (  9.03%)    525423.00 ( 10.76%)    529697.00 ( 11.66%)
TPut   17    461499.00 (  0.00%)    485774.00 (  5.26%)    628364.00 ( 36.16%)    510154.00 ( 10.54%)    514144.00 ( 11.41%)    515695.00 ( 11.74%)
TPut   18    483924.00 (  0.00%)    478596.00 ( -1.10%)    623915.00 ( 28.93%)    504124.00 (  4.17%)    509108.00 (  5.20%)    524129.00 (  8.31%)

numacore is not handling the multi JVM case well with numerous regressions
for lower numbers of threads. It starts improving as it gets closer to the
expected peak of 12 warehouses for this configuration. There are also large
variances between the different JVMs' throughput but note again that this
improves as the number of warehouses increases.

autonuma generally does very well in terms of throughput but the variance
between JVMs is massive.

balancenuma does reasonably well and improves upon the baseline kernel. It's
no longer regressing for small numbers of warehouses and is basically the
same as mainline. As the number of warehouses increases, it shows some
performance improvement and the variances are not too bad.
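
For reading the table above: the TPut rows look like the per-JVM Mean
summed over the 4 JVMs, which is why both carry identical percentages.
That relationship is inferred from the numbers rather than stated by the
report; the sketch below checks it on a few baseline rows.

```python
JVMS = 4  # one JVM per node on the 4-node test machine

# (warehouses, Mean, TPut) for rc6-stats-v5r1, copied from the table above.
rows = [
    (1, 30969.75, 123879.00),
    (2, 62036.50, 248146.00),
    (12, 123752.25, 495009.00),
    (18, 120981.00, 483924.00),
]

# Rebuild the TPut column from the per-JVM mean.
reconstructed = {warehouses: mean * JVMS for warehouses, mean, _ in rows}

for warehouses, _, tput in rows:
    print(warehouses, reconstructed[warehouses], reconstructed[warehouses] == tput)
```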

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4        rc6-thpmigrate-v5r1         rc6-adaptscan-v5r1        rc6-delaystart-v5r4
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        495009.00 (  0.00%)        553569.00 ( 11.83%)        677930.00 ( 36.95%)        561314.00 ( 13.39%)        553994.00 ( 11.92%)        569449.00 ( 15.04%)
 Actual Warehouse             8.00 (  0.00%)            12.00 ( 50.00%)             8.00 (  0.00%)             7.00 (-12.50%)             7.00 (-12.50%)             8.00 (  0.00%)
 Actual Peak Bops        623766.00 (  0.00%)        553569.00 (-11.25%)        740527.00 ( 18.72%)        641553.00 (  2.85%)        641515.00 (  2.85%)        656217.00 (  5.20%)
 SpecJBB Bops            261413.00 (  0.00%)        262783.00 (  0.52%)        349854.00 ( 33.83%)        286648.00 (  9.65%)        286412.00 (  9.56%)        292202.00 ( 11.78%)
 SpecJBB Bops/JVM         65353.00 (  0.00%)         65696.00 (  0.52%)         87464.00 ( 33.83%)         71662.00 (  9.65%)         71603.00 (  9.56%)         73051.00 ( 11.78%)

Note the peak numbers for numacore. The peak performance regresses 11.25%
from the baseline kernel. However as it improves with the number of
warehouses, specjbb reports that it sees a 0.52% gain because it's using a
range of peak values.

autonuma sees an 18.72% performance gain at its peak and a 33.83% gain in
its specjbb score.

balancenuma does reasonably well with a 5.2% gain at its peak and 11.78% on its
overall specjbb score.
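
The "SpecJBB Bops" row is consistent with SPECjbb2005-style scoring, where
throughput is averaged over N to 2N warehouses (N being the expected peak,
12 here) and warehouse counts that were never run count as zero. That
reading is an assumption inferred from the figures, not something the
report states, and the helper name specjbb_score is hypothetical. A sketch
using the TPut rows for 12 to 18 warehouses:

```python
EXPECTED_PEAK = 12  # "Expctd Warehouse" in the peaks table

# TPut for warehouses 12..18, copied from the multi JVM table above.
tput = {
    "stats-v5r1": [495009, 494314, 495248, 493997, 474383, 461499, 483924],
    "numacore": [553569, 412946, 453000, 522310, 509978, 485774, 478596],
}


def specjbb_score(points, expected_peak=EXPECTED_PEAK):
    """Average over expected_peak..2*expected_peak warehouses inclusive.

    Warehouse counts that were not run (19..24 here) add nothing to the
    sum but still count in the divisor.
    """
    slots = expected_peak + 1
    return sum(points) / slots


scores = {kernel: specjbb_score(points) for kernel, points in tput.items()}
gain = (scores["numacore"] - scores["stats-v5r1"]) / scores["stats-v5r1"] * 100.0

for kernel, score in scores.items():
    print(f"{kernel:12s} {score:.2f}")
print(f"numacore gain over baseline: {gain:.2f}%")
```

Under this reading, numacore's regressions below 12 warehouses never enter
the score, which would explain a small overall gain despite the lower
actual peak.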

MMTests Statistics: duration
                           3.7.0                   3.7.0                   3.7.0                   3.7.0                   3.7.0                   3.7.0
                  rc6-stats-v5r1   rc6-numacore-20121123  rc6-autonuma-v28fastr4     rc6-thpmigrate-v5r1      rc6-adaptscan-v5r1     rc6-delaystart-v5r4
User                   204146.61               197898.85               203957.74               203331.16               203747.52               203740.33
System                    314.90                 6106.94                  444.09                 1278.71                  703.78                  688.21
Elapsed                  5029.18                 5041.34                 5009.46                 5022.41                 5024.73                 5021.80

Note the system CPU usage. numacore is using roughly 9 times as much system
CPU as balancenuma and almost 14 times as much as autonuma (usual disclaimer
about threads).

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                        164712      164556      160492      164020      160552      164364
Page Outs                       509132      236136      430444      511088      471208      252540
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                 105761       91276       94593      111724      106169       99366
THP collapse alloc                 114         111        1059         119         114         115
THP splits                         605         379         575         517         570         592
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0     1031293      476756      398109
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0        1070         494         413
NUMA PTE updates                     0           0           0  1089136813   514718304   515300823
NUMA hint faults                     0           0           0     9147497     4661092     4580385
NUMA hint local faults               0           0           0     3005415     1332898     1599021
NUMA pages migrated                  0           0           0     1031293      476756      398109
AutoNUMA cost                        0           0           0       53381       26917       26516

The main takeaway here is that there were THP allocations and all the
trees split THPs at roughly the same rate overall. Migration stats are
not available for numacore or autonuma and the migration stats available
for balancenuma here are not reliable because it's not accounting for THP
properly. This is fixed, but not in the V5 tree released.


SPECJBB: Multi JVMs (one per node, 4 nodes), THP is disabled
                          3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                 rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
Mean   1      25269.25 (  0.00%)     21623.50 (-14.43%)     25937.75 (  2.65%)     25138.00 ( -0.52%)     25539.25 (  1.07%)     25193.00 ( -0.30%)
Mean   2      53467.00 (  0.00%)     38412.00 (-28.16%)     56598.75 (  5.86%)     50813.00 ( -4.96%)     52803.50 ( -1.24%)     52637.50 ( -1.55%)
Mean   3      77112.50 (  0.00%)     57653.25 (-25.23%)     83762.25 (  8.62%)     75274.25 ( -2.38%)     76097.00 ( -1.32%)     76324.25 ( -1.02%)
Mean   4      99928.75 (  0.00%)     68468.50 (-31.48%)    108700.75 (  8.78%)     97444.75 ( -2.49%)     99426.75 ( -0.50%)     99767.25 ( -0.16%)
Mean   5     119616.75 (  0.00%)     77222.25 (-35.44%)    132572.75 ( 10.83%)    117350.00 ( -1.90%)    118417.25 ( -1.00%)    118298.50 ( -1.10%)
Mean   6     133944.75 (  0.00%)     89222.75 (-33.39%)    154110.25 ( 15.06%)    133565.75 ( -0.28%)    135268.75 (  0.99%)    137512.50 (  2.66%)
Mean   7     137063.00 (  0.00%)     94944.25 (-30.73%)    159535.25 ( 16.40%)    136744.75 ( -0.23%)    139218.25 (  1.57%)    138919.25 (  1.35%)
Mean   8     130814.25 (  0.00%)     98367.25 (-24.80%)    162045.75 ( 23.87%)    137088.25 (  4.80%)    139649.50 (  6.75%)    138273.00 (  5.70%)
Mean   9     124815.00 (  0.00%)     99183.50 (-20.54%)    162337.75 ( 30.06%)    135275.50 (  8.38%)    137494.50 ( 10.16%)    137386.25 ( 10.07%)
Mean   10    123741.00 (  0.00%)     91926.25 (-25.71%)    158733.00 ( 28.28%)    131418.00 (  6.20%)    132662.00 (  7.21%)    132379.25 (  6.98%)
Mean   11    116966.25 (  0.00%)     95283.00 (-18.54%)    155065.50 ( 32.57%)    125246.00 (  7.08%)    124420.25 (  6.37%)    128132.00 (  9.55%)
Mean   12    106682.00 (  0.00%)     92286.25 (-13.49%)    149946.25 ( 40.55%)    118489.50 ( 11.07%)    119624.25 ( 12.13%)    121050.75 ( 13.47%)
Mean   13    106395.00 (  0.00%)    103168.75 ( -3.03%)    146355.50 ( 37.56%)    118143.75 ( 11.04%)    116799.25 (  9.78%)    121032.25 ( 13.76%)
Mean   14    104384.25 (  0.00%)    105417.75 (  0.99%)    145206.50 ( 39.11%)    119562.75 ( 14.54%)    117898.75 ( 12.95%)    114255.25 (  9.46%)
Mean   15    103699.00 (  0.00%)    103878.75 (  0.17%)    142139.75 ( 37.07%)    115845.50 ( 11.71%)    117527.25 ( 13.33%)    109329.50 (  5.43%)
Mean   16    100955.00 (  0.00%)    103582.50 (  2.60%)    139864.00 ( 38.54%)    113216.75 ( 12.15%)    114046.50 ( 12.97%)    108669.75 (  7.64%)
Mean   17     99528.25 (  0.00%)    101783.25 (  2.27%)    138544.50 ( 39.20%)    112736.50 ( 13.27%)    115917.00 ( 16.47%)    113464.50 ( 14.00%)
Mean   18     97694.00 (  0.00%)     99978.75 (  2.34%)    138034.00 ( 41.29%)    108930.00 ( 11.50%)    114137.50 ( 16.83%)    114161.25 ( 16.86%)
Stddev 1        898.91 (  0.00%)       754.70 ( 16.04%)       815.97 (  9.23%)       786.81 ( 12.47%)       756.10 ( 15.89%)      1061.69 (-18.11%)
Stddev 2        676.51 (  0.00%)      2726.62 (-303.04%)       946.10 (-39.85%)      1591.35 (-135.23%)       968.21 (-43.12%)       919.08 (-35.86%)
Stddev 3        629.58 (  0.00%)      1975.98 (-213.86%)      1403.79 (-122.97%)       291.72 ( 53.66%)      1181.68 (-87.69%)       701.90 (-11.49%)
Stddev 4        363.04 (  0.00%)      2867.55 (-689.87%)      1810.59 (-398.73%)      1288.56 (-254.94%)      1757.87 (-384.21%)      2050.94 (-464.94%)
Stddev 5        437.02 (  0.00%)      1159.08 (-165.22%)      2352.89 (-438.39%)      1148.94 (-162.90%)      1294.70 (-196.26%)       861.14 (-97.05%)
Stddev 6       1484.12 (  0.00%)      1777.97 (-19.80%)      1045.24 ( 29.57%)       860.24 ( 42.04%)      1703.57 (-14.79%)      1367.56 (  7.85%)
Stddev 7       3856.79 (  0.00%)       857.26 ( 77.77%)      1369.61 ( 64.49%)      1517.99 ( 60.64%)      2676.34 ( 30.61%)      1818.15 ( 52.86%)
Stddev 8       4910.41 (  0.00%)      2751.82 ( 43.96%)      1765.69 ( 64.04%)      5022.25 ( -2.28%)      3113.14 ( 36.60%)      3958.06 ( 19.39%)
Stddev 9       2107.95 (  0.00%)      2348.33 (-11.40%)      1764.06 ( 16.31%)      2932.34 (-39.11%)      6568.79 (-211.62%)      7450.20 (-253.43%)
Stddev 10      2012.98 (  0.00%)      1332.65 ( 33.80%)      3297.73 (-63.82%)      4649.56 (-130.98%)      2703.19 (-34.29%)      4193.34 (-108.31%)
Stddev 11      5263.81 (  0.00%)      3810.66 ( 27.61%)      5676.52 ( -7.84%)      1647.81 ( 68.70%)      4683.05 ( 11.03%)      3702.45 ( 29.66%)
Stddev 12      4316.09 (  0.00%)       731.69 ( 83.05%)      9685.19 (-124.40%)      2202.13 ( 48.98%)      2520.73 ( 41.60%)      3572.75 ( 17.22%)
Stddev 13      4116.97 (  0.00%)      4217.04 ( -2.43%)      9249.57 (-124.67%)      3042.07 ( 26.11%)      1705.18 ( 58.58%)       464.36 ( 88.72%)
Stddev 14      4711.12 (  0.00%)       925.12 ( 80.36%)     10672.49 (-126.54%)      1597.01 ( 66.10%)      1983.88 ( 57.89%)      1513.32 ( 67.88%)
Stddev 15      4582.30 (  0.00%)       909.35 ( 80.16%)     11033.47 (-140.78%)      1966.56 ( 57.08%)       420.63 ( 90.82%)      1049.66 ( 77.09%)
Stddev 16      3805.96 (  0.00%)       743.92 ( 80.45%)     10353.28 (-172.03%)      1493.18 ( 60.77%)      2524.84 ( 33.66%)      2030.46 ( 46.65%)
Stddev 17      4560.83 (  0.00%)      1130.10 ( 75.22%)      9902.66 (-117.12%)      1709.65 ( 62.51%)      2449.37 ( 46.30%)      1259.00 ( 72.40%)
Stddev 18      4503.57 (  0.00%)      1418.91 ( 68.49%)     12143.74 (-169.65%)      1334.37 ( 70.37%)      1693.93 ( 62.39%)       975.71 ( 78.33%)
TPut   1     101077.00 (  0.00%)     86494.00 (-14.43%)    103751.00 (  2.65%)    100552.00 ( -0.52%)    102157.00 (  1.07%)    100772.00 ( -0.30%)
TPut   2     213868.00 (  0.00%)    153648.00 (-28.16%)    226395.00 (  5.86%)    203252.00 ( -4.96%)    211214.00 ( -1.24%)    210550.00 ( -1.55%)
TPut   3     308450.00 (  0.00%)    230613.00 (-25.23%)    335049.00 (  8.62%)    301097.00 ( -2.38%)    304388.00 ( -1.32%)    305297.00 ( -1.02%)
TPut   4     399715.00 (  0.00%)    273874.00 (-31.48%)    434803.00 (  8.78%)    389779.00 ( -2.49%)    397707.00 ( -0.50%)    399069.00 ( -0.16%)
TPut   5     478467.00 (  0.00%)    308889.00 (-35.44%)    530291.00 ( 10.83%)    469400.00 ( -1.90%)    473669.00 ( -1.00%)    473194.00 ( -1.10%)
TPut   6     535779.00 (  0.00%)    356891.00 (-33.39%)    616441.00 ( 15.06%)    534263.00 ( -0.28%)    541075.00 (  0.99%)    550050.00 (  2.66%)
TPut   7     548252.00 (  0.00%)    379777.00 (-30.73%)    638141.00 ( 16.40%)    546979.00 ( -0.23%)    556873.00 (  1.57%)    555677.00 (  1.35%)
TPut   8     523257.00 (  0.00%)    393469.00 (-24.80%)    648183.00 ( 23.87%)    548353.00 (  4.80%)    558598.00 (  6.75%)    553092.00 (  5.70%)
TPut   9     499260.00 (  0.00%)    396734.00 (-20.54%)    649351.00 ( 30.06%)    541102.00 (  8.38%)    549978.00 ( 10.16%)    549545.00 ( 10.07%)
TPut   10    494964.00 (  0.00%)    367705.00 (-25.71%)    634932.00 ( 28.28%)    525672.00 (  6.20%)    530648.00 (  7.21%)    529517.00 (  6.98%)
TPut   11    467865.00 (  0.00%)    381132.00 (-18.54%)    620262.00 ( 32.57%)    500984.00 (  7.08%)    497681.00 (  6.37%)    512528.00 (  9.55%)
TPut   12    426728.00 (  0.00%)    369145.00 (-13.49%)    599785.00 ( 40.55%)    473958.00 ( 11.07%)    478497.00 ( 12.13%)    484203.00 ( 13.47%)
TPut   13    425580.00 (  0.00%)    412675.00 ( -3.03%)    585422.00 ( 37.56%)    472575.00 ( 11.04%)    467197.00 (  9.78%)    484129.00 ( 13.76%)
TPut   14    417537.00 (  0.00%)    421671.00 (  0.99%)    580826.00 ( 39.11%)    478251.00 ( 14.54%)    471595.00 ( 12.95%)    457021.00 (  9.46%)
TPut   15    414796.00 (  0.00%)    415515.00 (  0.17%)    568559.00 ( 37.07%)    463382.00 ( 11.71%)    470109.00 ( 13.33%)    437318.00 (  5.43%)
TPut   16    403820.00 (  0.00%)    414330.00 (  2.60%)    559456.00 ( 38.54%)    452867.00 ( 12.15%)    456186.00 ( 12.97%)    434679.00 (  7.64%)
TPut   17    398113.00 (  0.00%)    407133.00 (  2.27%)    554178.00 ( 39.20%)    450946.00 ( 13.27%)    463668.00 ( 16.47%)    453858.00 ( 14.00%)
TPut   18    390776.00 (  0.00%)    399915.00 (  2.34%)    552136.00 ( 41.29%)    435720.00 ( 11.50%)    456550.00 ( 16.83%)    456645.00 ( 16.86%)

numacore regresses badly without THP on multi JVM configurations. Note
that once again it improves as the number of warehouses increases. SpecJBB
reports based on peaks so this will be missed if only the peak figures
are quoted in other benchmark reports.

autonuma again does pretty well although its variance between JVMs is nuts.
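
One rough way to put a number on that spread is the relative standard
deviation (Stddev divided by Mean) at a given warehouse count. The sketch
below uses the 18-warehouse rows copied from the table above; the helper
name is illustrative and is not part of MMTests.

```python
def relative_stddev(mean, stddev):
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * stddev / mean


# (Mean 18, Stddev 18) copied from the table above.
rows_18 = {
    "rc6-stats-v5r1": (97694.00, 4503.57),
    "rc6-autonuma-v28fastr4": (138034.00, 12143.74),
    "rc6-delaystart-v5r4": (114161.25, 975.71),
}

for kernel, (mean, stddev) in rows_18.items():
    print(f"{kernel:24s} {relative_stddev(mean, stddev):5.2f}%")
```

On these rows autonuma's spread between JVMs is roughly double the
baseline's in relative terms, while delaystart's is under 1%.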

Without THP, balancenuma shows small regressions for small numbers of
warehouses but recovers to show decent performance gains. Note that the gains
vary a lot between warehouses because it's completely at the mercy of the
default scheduler decisions which are getting no hints about NUMA placement.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4        rc6-thpmigrate-v5r1         rc6-adaptscan-v5r1        rc6-delaystart-v5r4
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        426728.00 (  0.00%)        369145.00 (-13.49%)        599785.00 ( 40.55%)        473958.00 ( 11.07%)        478497.00 ( 12.13%)        484203.00 ( 13.47%)
 Actual Warehouse             7.00 (  0.00%)            14.00 (100.00%)             9.00 ( 28.57%)             8.00 ( 14.29%)             8.00 ( 14.29%)             7.00 (  0.00%)
 Actual Peak Bops        548252.00 (  0.00%)        421671.00 (-23.09%)        649351.00 ( 18.44%)        548353.00 (  0.02%)        558598.00 (  1.89%)        555677.00 (  1.35%)
 SpecJBB Bops            221334.00 (  0.00%)        218491.00 ( -1.28%)        307720.00 ( 39.03%)        248285.00 ( 12.18%)        251062.00 ( 13.43%)        246759.00 ( 11.49%)
 SpecJBB Bops/JVM         55334.00 (  0.00%)         54623.00 ( -1.28%)         76930.00 ( 39.03%)         62071.00 ( 12.18%)         62766.00 ( 13.43%)         61690.00 ( 11.49%)

numacore regresses from the peak by 23.09% and the specjbb overall score is down 1.28%.

autonuma does well with an 18.44% gain on the peak and 39.03% overall.

balancenuma does reasonably well - 1.35% gain at the peak and 11.49%
gain overall.
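
The "SpecJBB Bops" row can be reproduced from the TPut rows, assuming the
SPECjbb2005 convention that the score is the average throughput from the
expected peak N up to 2N warehouses, with points that were not run counted
as zero. That convention is an assumption here rather than something the
report states, but with N = 12 and a run that stopped at 18 warehouses it
matches the baseline and numacore figures once the fraction is dropped.
The function name is illustrative.

```python
def specjbb_score(tput, expected_peak):
    """Average TPut over expected_peak..2*expected_peak warehouses.

    Warehouse counts missing from `tput` contribute zero (assumed rule).
    """
    points = range(expected_peak, 2 * expected_peak + 1)
    return sum(tput.get(w, 0.0) for w in points) / len(points)


# TPut 12..18 copied from the multi JVM, THP disabled table.
baseline = {12: 426728, 13: 425580, 14: 417537, 15: 414796,
            16: 403820, 17: 398113, 18: 390776}
numacore = {12: 369145, 13: 412675, 14: 421671, 15: 415515,
            16: 414330, 17: 407133, 18: 399915}

print("baseline", int(specjbb_score(baseline, 12)))  # 221334
print("numacore", int(specjbb_score(numacore, 12)))  # 218491
```

Under that assumption only warehouses 12 to 18 feed the overall score,
which would explain why numacore is down just 1.28% overall despite the
20-35% regressions at low warehouse counts.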

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User       203906.38   167709.64   203858.75   200055.62   202076.09   201985.74
System        577.16    31263.34      692.24     4114.76     2129.71     2177.70
Elapsed      5030.84     5067.85     5009.06     5019.25     5026.83     5017.79

numacore's system CPU usage is nuts.

autonuma's is ok (kernel threads blah blah)

balancenuma's is higher than I'd like. I want to describe it as "not crazy"
but it probably is to everybody else.

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                        157624      164396      165024      163492      164776      163348
Page Outs                       322264      391416      271880      491668      401644      523684
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      2           2           3           2           1           3
THP collapse alloc                   0           0           9           0           0           5
THP splits                           0           0           0           0           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0   100618401    47601498    49370903
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0      104441       49410       51246
NUMA PTE updates                     0           0           0   783430956   381926529   389134805
NUMA hint faults                     0           0           0   730273702   352415076   360742428
NUMA hint local faults               0           0           0   191790656    92208827    93522412
NUMA pages migrated                  0           0           0   100618401    47601498    49370903
AutoNUMA cost                        0           0           0     3658764     1765653     1807374

First take-away is the lack of THP activity.

Here the stats balancenuma reports are useful because we're only dealing
with base pages. balancenuma migrates 38MB/second which is really high. Note
what the scan rate adaption did to that figure. Without scan rate adaption
it's at 78MB/second on average which is nuts. Average migration rate is
something we should keep an eye on.
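
The MB/second figures follow from the "NUMA pages migrated" row and the
Elapsed time in the duration table. A minimal sketch, assuming 4K base
pages (reasonable with THP disabled on x86-64, but an assumption all the
same) and MB meaning 2^20 bytes; the function name is illustrative.

```python
PAGE_SIZE = 4096  # bytes; assumes 4K base pages


def migration_rate_mb_per_sec(pages_migrated, elapsed_seconds,
                              page_size=PAGE_SIZE):
    """Average migration bandwidth over the whole run, in MB/second."""
    return pages_migrated * page_size / elapsed_seconds / (1024 * 1024)


# (NUMA pages migrated, Elapsed seconds) copied from the tables above.
runs = {
    "rc6-thpmigrate-v5r1": (100618401, 5019.25),
    "rc6-adaptscan-v5r1": (47601498, 5026.83),
    "rc6-delaystart-v5r4": (49370903, 5017.79),
}

for kernel, (pages, elapsed) in runs.items():
    rate = migration_rate_mb_per_sec(pages, elapsed)
    print(f"{kernel:22s} {rate:6.1f} MB/sec")
```

This is an average over the full run, so it says nothing about bursts;
peak migration rates could be considerably higher.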

From here, we're onto the single JVM configuration. I suspect
this is tested much more commonly but note that it behaves very
differently to the multi JVM configuration as explained by Andrea
(http://choon.net/forum/read.php?21,1599976,page=4).

A concern with the single JVM results as reported here is the maximum
number of warehouses. In the Multi JVM configuration, the expected peak
was 12 warehouses so I ran up to 18 so that the tests could complete in a
reasonable amount of time. The expected peak for a single JVM is 48 (the
number of CPUs) but the configuration file was derived from the multi JVM
configuration so it was restricted to running up to 18 warehouses. Again,
the reason was so it would complete in a reasonable amount of time but
specjbb does not give a score for this type of configuration and I am
only reporting on the 1-18 warehouses it ran for. I've reconfigured the
4 specjbb configs to run a full config and it'll run over the weekend.

SPECJBB: Single JVMs (one per node, 4 nodes), THP is enabled

SPECJBB BOPS
                        3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
               rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
TPut 1      26802.00 (  0.00%)     22808.00 (-14.90%)     24482.00 ( -8.66%)     25723.00 ( -4.03%)     24387.00 ( -9.01%)     25940.00 ( -3.22%)
TPut 2      57720.00 (  0.00%)     51245.00 (-11.22%)     55018.00 ( -4.68%)     55498.00 ( -3.85%)     55259.00 ( -4.26%)     55581.00 ( -3.71%)
TPut 3      86940.00 (  0.00%)     79172.00 ( -8.93%)     87705.00 (  0.88%)     86101.00 ( -0.97%)     86894.00 ( -0.05%)     86875.00 ( -0.07%)
TPut 4     117203.00 (  0.00%)    107315.00 ( -8.44%)    117382.00 (  0.15%)    116282.00 ( -0.79%)    116322.00 ( -0.75%)    115263.00 ( -1.66%)
TPut 5     145375.00 (  0.00%)    121178.00 (-16.64%)    145802.00 (  0.29%)    142378.00 ( -2.06%)    144947.00 ( -0.29%)    144211.00 ( -0.80%)
TPut 6     169232.00 (  0.00%)    157796.00 ( -6.76%)    173409.00 (  2.47%)    171066.00 (  1.08%)    173341.00 (  2.43%)    169861.00 (  0.37%)
TPut 7     195468.00 (  0.00%)    169834.00 (-13.11%)    197201.00 (  0.89%)    197536.00 (  1.06%)    198347.00 (  1.47%)    198047.00 (  1.32%)
TPut 8     217863.00 (  0.00%)    169975.00 (-21.98%)    222559.00 (  2.16%)    224901.00 (  3.23%)    226268.00 (  3.86%)    218354.00 (  0.23%)
TPut 9     240679.00 (  0.00%)    197498.00 (-17.94%)    245997.00 (  2.21%)    250022.00 (  3.88%)    253838.00 (  5.47%)    250264.00 (  3.98%)
TPut 10    261454.00 (  0.00%)    204909.00 (-21.63%)    269551.00 (  3.10%)    275125.00 (  5.23%)    274658.00 (  5.05%)    274155.00 (  4.86%)
TPut 11    281079.00 (  0.00%)    230118.00 (-18.13%)    281588.00 (  0.18%)    304383.00 (  8.29%)    297198.00 (  5.73%)    299131.00 (  6.42%)
TPut 12    302007.00 (  0.00%)    275511.00 ( -8.77%)    313281.00 (  3.73%)    327826.00 (  8.55%)    325324.00 (  7.72%)    325372.00 (  7.74%)
TPut 13    319139.00 (  0.00%)    293501.00 ( -8.03%)    332581.00 (  4.21%)    352389.00 ( 10.42%)    340169.00 (  6.59%)    351215.00 ( 10.05%)
TPut 14    321069.00 (  0.00%)    312088.00 ( -2.80%)    337911.00 (  5.25%)    376198.00 ( 17.17%)    370669.00 ( 15.45%)    366491.00 ( 14.15%)
TPut 15    345851.00 (  0.00%)    283856.00 (-17.93%)    369104.00 (  6.72%)    389772.00 ( 12.70%)    392963.00 ( 13.62%)    389254.00 ( 12.55%)
TPut 16    346868.00 (  0.00%)    317127.00 ( -8.57%)    380930.00 (  9.82%)    420331.00 ( 21.18%)    412974.00 ( 19.06%)    408575.00 ( 17.79%)
TPut 17    357755.00 (  0.00%)    349624.00 ( -2.27%)    387635.00 (  8.35%)    441223.00 ( 23.33%)    426558.00 ( 19.23%)    435985.00 ( 21.87%)
TPut 18    357467.00 (  0.00%)    360056.00 (  0.72%)    399487.00 ( 11.75%)    464603.00 ( 29.97%)    442907.00 ( 23.90%)    453011.00 ( 26.73%)

numacore is not doing well here for low numbers of warehouses. However,
note that by 18 warehouses it had drawn level and the expected peak is 48
warehouses. The specjbb reported figure would be using the higher numbers
of warehouses. I'll run a full range over the weekend and report back. If
time permits, I'll also run a "monitors disabled" run in case the read of
numa_maps every 10 seconds is crippling it.

autonuma did reasonably well and was showing larger gains towards the 18
warehouses mark.

balancenuma regressed a little initially but was doing quite well by 18
warehouses. 

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4        rc6-thpmigrate-v5r1         rc6-adaptscan-v5r1        rc6-delaystart-v5r4
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)
 Actual Warehouse                   17.00 (  0.00%)                   18.00 (  5.88%)                   18.00 (  5.88%)                   18.00 (  5.88%)                   18.00 (  5.88%)                   18.00 (  5.88%)
 Actual Peak Bops               357755.00 (  0.00%)               360056.00 (  0.64%)               399487.00 ( 11.66%)               464603.00 ( 29.87%)               442907.00 ( 23.80%)               453011.00 ( 26.63%)
 SpecJBB Bops                        0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)
 SpecJBB Bops/JVM                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)                    0.00 (  0.00%)

Note that numacore's peak was 0.64% higher than the baseline and for a
higher number of warehouses so it was scaling better.

autonuma was 11.66% higher at the peak which was also at 18 warehouses.

balancenuma was at 26.63% and was still scaling at 18 warehouses.

The fact that the peak and maximum number of warehouses are the same
reinforces that this test needs to be rerun all the way up to 48 warehouses.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User        10450.16    10006.88    10441.26    10421.00    10441.47    10447.30
System        115.84      549.28      107.70      167.83      129.14      142.34
Elapsed      1196.56     1228.13     1187.23     1196.37     1198.64     1198.75

numacore's system CPU usage is very high.

autonuma's is lower than baseline -- usual thread disclaimers.

balancenuma system CPU usage is also a bit high but it's not crazy.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                        164228      164452      164436      163868      164440      164052
Page Outs                       173972      132016      247080      257988      123724      255716
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                  55438       46676       52240       48118       57618       53194
THP collapse alloc                  56           8         323          54          28          19
THP splits                          96          30         106          80          91          86
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0      253855      111066       58659
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0         263         115          60
NUMA PTE updates                     0           0           0   142021619    62920560    64394112
NUMA hint faults                     0           0           0     2314850     1258884     1019745
NUMA hint local faults               0           0           0     1249300      756763      569808
NUMA pages migrated                  0           0           0      253855      111066       58659
AutoNUMA cost                        0           0           0       12573        6736        5550

THP was in use - collapses and splits in evidence.

For balancenuma, note how adaptscan affected the PTE scan rates. The
impact on the system CPU usage is obvious too -- fewer PTE scans means
fewer faults, fewer migrations etc. Obviously there needs to be enough
of these faults to actually do the NUMA balancing but there comes a point
where there are diminishing returns.
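
To see the effect of adaptscan in one place, the counters can be expressed
relative to the thpmigrate column. The sketch below copies the values from
the vmstat and duration tables above; the dictionary keys and helper names
are made up for illustration.

```python
# Values copied from the single JVM, THP enabled tables above.
stats = {
    "rc6-thpmigrate-v5r1": {"pte_updates": 142021619, "hint_faults": 2314850,
                            "hint_local": 1249300, "migrated": 253855,
                            "system": 167.83},
    "rc6-adaptscan-v5r1": {"pte_updates": 62920560, "hint_faults": 1258884,
                           "hint_local": 756763, "migrated": 111066,
                           "system": 129.14},
}


def relative_to(base, other):
    """Return each counter in `other` as a fraction of the same counter in `base`."""
    return {key: other[key] / base[key] for key in base}


def local_fault_pct(s):
    """Percentage of NUMA hinting faults that were already node-local."""
    return 100.0 * s["hint_local"] / s["hint_faults"]


base = stats["rc6-thpmigrate-v5r1"]
adapt = stats["rc6-adaptscan-v5r1"]

for key, frac in relative_to(base, adapt).items():
    print(f"{key:12s} {frac:5.2f}x of thpmigrate")
print(f"local faults: {local_fault_pct(base):.1f}% -> {local_fault_pct(adapt):.1f}%")
```

On these numbers adaptscan does well under half the PTE updates and
migrations while the share of local hinting faults goes up, not down.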

SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled

                        3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
               rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
TPut 1      20890.00 (  0.00%)     18720.00 (-10.39%)     21127.00 (  1.13%)     20376.00 ( -2.46%)     20806.00 ( -0.40%)     20698.00 ( -0.92%)
TPut 2      48259.00 (  0.00%)     38121.00 (-21.01%)     47920.00 ( -0.70%)     47085.00 ( -2.43%)     48594.00 (  0.69%)     48094.00 ( -0.34%)
TPut 3      73203.00 (  0.00%)     60057.00 (-17.96%)     73630.00 (  0.58%)     70241.00 ( -4.05%)     73418.00 (  0.29%)     74016.00 (  1.11%)
TPut 4      98694.00 (  0.00%)     73669.00 (-25.36%)     98929.00 (  0.24%)     96721.00 ( -2.00%)     96797.00 ( -1.92%)     97930.00 ( -0.77%)
TPut 5     122563.00 (  0.00%)     98786.00 (-19.40%)    118969.00 ( -2.93%)    118045.00 ( -3.69%)    121553.00 ( -0.82%)    122781.00 (  0.18%)
TPut 6     144095.00 (  0.00%)    114485.00 (-20.55%)    145328.00 (  0.86%)    141713.00 ( -1.65%)    142589.00 ( -1.05%)    143771.00 ( -0.22%)
TPut 7     166457.00 (  0.00%)    112416.00 (-32.47%)    163503.00 ( -1.77%)    166971.00 (  0.31%)    166788.00 (  0.20%)    165188.00 ( -0.76%)
TPut 8     191067.00 (  0.00%)    122996.00 (-35.63%)    189477.00 ( -0.83%)    183090.00 ( -4.17%)    187710.00 ( -1.76%)    192157.00 (  0.57%)
TPut 9     210634.00 (  0.00%)    141200.00 (-32.96%)    209639.00 ( -0.47%)    207968.00 ( -1.27%)    215216.00 (  2.18%)    214222.00 (  1.70%)
TPut 10    234121.00 (  0.00%)    129508.00 (-44.68%)    231221.00 ( -1.24%)    221553.00 ( -5.37%)    219998.00 ( -6.03%)    227193.00 ( -2.96%)
TPut 11    257885.00 (  0.00%)    131232.00 (-49.11%)    256568.00 ( -0.51%)    252734.00 ( -2.00%)    258433.00 (  0.21%)    260534.00 (  1.03%)
TPut 12    271751.00 (  0.00%)    154763.00 (-43.05%)    277319.00 (  2.05%)    277154.00 (  1.99%)    265747.00 ( -2.21%)    262285.00 ( -3.48%)
TPut 13    297457.00 (  0.00%)    119716.00 (-59.75%)    296068.00 ( -0.47%)    289716.00 ( -2.60%)    276527.00 ( -7.04%)    293199.00 ( -1.43%)
TPut 14    319074.00 (  0.00%)    129730.00 (-59.34%)    311604.00 ( -2.34%)    308798.00 ( -3.22%)    316807.00 ( -0.71%)    275748.00 (-13.58%)
TPut 15    337859.00 (  0.00%)    177494.00 (-47.47%)    329288.00 ( -2.54%)    300463.00 (-11.07%)    305116.00 ( -9.69%)    287814.00 (-14.81%)
TPut 16    356396.00 (  0.00%)    145173.00 (-59.27%)    355616.00 ( -0.22%)    342598.00 ( -3.87%)    364077.00 (  2.16%)    339649.00 ( -4.70%)
TPut 17    373925.00 (  0.00%)    176956.00 (-52.68%)    368589.00 ( -1.43%)    360917.00 ( -3.48%)    366043.00 ( -2.11%)    345586.00 ( -7.58%)
TPut 18    388373.00 (  0.00%)    150100.00 (-61.35%)    372873.00 ( -3.99%)    389062.00 (  0.18%)    386779.00 ( -0.41%)    370871.00 ( -4.51%)

balancenuma suffered here. It is very likely that it was not able to handle
faults at a PMD level due to the lack of THP and I would expect that the
pages within a PMD boundary are not on the same node so pmd_numa is not
set. This results in its worst case of always having to deal with PTE
faults. Further, it must be migrating many or almost all of these because
the adaptscan patch made no difference. This is a worst-case scenario for
balancenuma. The scan rates later will indicate if that was the case.

autonuma did ok in that it was roughly comparable with mainline. Small
regressions.

I do not know how to describe numacore's figures. Let's go with "not great".
Maybe it would have gotten better if it ran all the way up to 48 warehouses
or maybe the numa_maps reading is really kicking it harder than it kicks
autonuma or balancenuma. There is also the possibility that there is some
other patch in tip/master that is causing the problems.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4        rc6-thpmigrate-v5r1         rc6-adaptscan-v5r1        rc6-delaystart-v5r4
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)
 Actual Warehouse            18.00 (  0.00%)            15.00 (-16.67%)            18.00 (  0.00%)            18.00 (  0.00%)            18.00 (  0.00%)            18.00 (  0.00%)
 Actual Peak Bops        388373.00 (  0.00%)        177494.00 (-54.30%)        372873.00 ( -3.99%)        389062.00 (  0.18%)        386779.00 ( -0.41%)        370871.00 ( -4.51%)
 SpecJBB Bops                 0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)
 SpecJBB Bops/JVM             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)             0.00 (  0.00%)

numacore regressed 54.30% at its actual peak of 15 warehouses, which was
also fewer warehouses than the baseline kernel's peak of 18.

autonuma and balancenuma both peaked at 18 warehouses (the maximum number
the test ran) so they were still scaling ok but autonuma regressed 3.99%
while balancenuma regressed 4.51%.
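
For reference, the percentages in these tables look like plain relative
changes against the rc6-stats-v5r1 baseline, with the sign flipped for
time-based metrics so that negative always means a regression. That
convention is inferred from the numbers rather than taken from the mmtests
source, and the helper name relative_change is made up for illustration.
A minimal sketch that reproduces the peak figures quoted above:

```python
def relative_change(baseline, value, higher_is_better=True):
    """Percentage change versus baseline; negative means a regression.

    The sign convention is an assumption inferred from the tables:
    throughput is higher-is-better, elapsed time is lower-is-better.
    """
    delta = value - baseline if higher_is_better else baseline - value
    return 100.0 * delta / baseline


# "Actual Peak Bops" row from the SPECJBB PEAKS table.
baseline_peak = 388373.00  # rc6-stats-v5r1
peaks = {
    "numacore": 177494.00,
    "autonuma": 372873.00,
    "balancenuma (delaystart)": 370871.00,
}

for name, bops in peaks.items():
    print(f"{name:26s} {relative_change(baseline_peak, bops):7.2f}%")
```

The same helper with higher_is_better=False covers the time-based tables,
for example the kernbench elapsed means (41.60 vs 42.63 seconds).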

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User        10405.85     7284.62    10826.33    10084.82    10134.62    10026.65
System        331.48     2505.16      432.62      506.52      538.50      529.03
Elapsed      1202.48     1242.71     1197.09     1204.03     1202.98     1201.74

numacore's system CPU usage was very high.

autonuma's and balancenuma's were both higher than I'd like.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                        163780      164588      193572      163984      164068      164416
Page Outs                       137692      130984      265672      230884      188836      117192
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      1           1           4           2           2           2
THP collapse alloc                   0           0          12           0           0           0
THP splits                           0           0           0           0           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0     7816428     5725511     6869488
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0        8113        5943        7130
NUMA PTE updates                     0           0           0    66123797    53516623    60445811
NUMA hint faults                     0           0           0    63047742    51160357    58406746
NUMA hint local faults               0           0           0    18265709    14490652    16584428
NUMA pages migrated                  0           0           0     7816428     5725511     6869488
AutoNUMA cost                        0           0           0      315850      256285      292587

For balancenuma the scan rates are interesting. Note that adaptscan made
very little difference to the number of PTEs updated. This very strongly
implies that the scan rate is not being reduced as many of the NUMA faults
are resulting in a migration. This could be hit with a hammer by always
decreasing the scan rate on every fault but it would be a really really
blunt hammer.
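
To put rough numbers on that, the counters in the vmstat table can be
reduced to a few ratios per balancenuma kernel: how many PTE updates turned
into hinting faults, how many of those faults were local, and how many ended
in a migration. This is only a sketch over the figures copied from the
table; the dictionary keys are illustrative labels, not the kernel's
counter names.

```python
# Counters copied from the vmstat table above (single JVMs, THP disabled).
# The keys are illustrative labels, not the kernel's vmstat field names.
counters = {
    "rc6-thpmigrate-v5r1": dict(pte_updates=66123797, hint_faults=63047742,
                                local_faults=18265709, migrated=7816428),
    "rc6-adaptscan-v5r1": dict(pte_updates=53516623, hint_faults=51160357,
                               local_faults=14490652, migrated=5725511),
    "rc6-delaystart-v5r4": dict(pte_updates=60445811, hint_faults=58406746,
                                local_faults=16584428, migrated=6869488),
}


def ratios(c):
    """Return (faults per PTE update, local fraction, migrated fraction)."""
    return (c["hint_faults"] / c["pte_updates"],
            c["local_faults"] / c["hint_faults"],
            c["migrated"] / c["hint_faults"])


for kernel, c in counters.items():
    per_update, local, migrated = ratios(c)
    print(f"{kernel:20s} faults/update={per_update:.2f} "
          f"local={local:.2f} migrated={migrated:.2f}")
```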

As before, note that there was no THP activity because it was disabled.

Finally, the following are just rudimentary tests to check some basics.

KERNBENCH
                               3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                      rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
User    min        1296.38 (  0.00%)     1310.16 ( -1.06%)     1296.52 ( -0.01%)     1297.53 ( -0.09%)     1298.35 ( -0.15%)     1299.53 ( -0.24%)
User    mean       1298.86 (  0.00%)     1311.49 ( -0.97%)     1299.73 ( -0.07%)     1300.50 ( -0.13%)     1301.56 ( -0.21%)     1301.42 ( -0.20%)
User    stddev        1.65 (  0.00%)        0.90 ( 45.15%)        2.68 (-62.37%)        3.47 (-110.63%)        2.19 (-33.06%)        1.59 (  3.45%)
User    max        1301.52 (  0.00%)     1312.87 ( -0.87%)     1303.09 ( -0.12%)     1306.88 ( -0.41%)     1304.60 ( -0.24%)     1304.05 ( -0.19%)
System  min         118.74 (  0.00%)      129.74 ( -9.26%)      122.34 ( -3.03%)      121.82 ( -2.59%)      121.21 ( -2.08%)      119.43 ( -0.58%)
System  mean        119.34 (  0.00%)      130.24 ( -9.14%)      123.20 ( -3.24%)      122.15 ( -2.35%)      121.52 ( -1.83%)      120.17 ( -0.70%)
System  stddev        0.42 (  0.00%)        0.49 (-14.52%)        0.56 (-30.96%)        0.25 ( 41.66%)        0.43 ( -0.96%)        0.56 (-31.84%)
System  max         120.00 (  0.00%)      131.07 ( -9.22%)      123.88 ( -3.23%)      122.53 ( -2.11%)      122.36 ( -1.97%)      120.83 ( -0.69%)
Elapsed min          40.42 (  0.00%)       41.42 ( -2.47%)       40.55 ( -0.32%)       41.43 ( -2.50%)       40.66 ( -0.59%)       40.09 (  0.82%)
Elapsed mean         41.60 (  0.00%)       42.63 ( -2.48%)       41.65 ( -0.13%)       42.27 ( -1.62%)       41.57 (  0.06%)       41.12 (  1.13%)
Elapsed stddev        0.72 (  0.00%)        0.82 (-13.62%)        0.80 (-10.77%)        0.65 (  9.93%)        0.86 (-19.29%)        0.64 ( 11.92%)
Elapsed max          42.41 (  0.00%)       43.90 ( -3.51%)       42.79 ( -0.90%)       43.03 ( -1.46%)       42.76 ( -0.83%)       41.87 (  1.27%)
CPU     min        3341.00 (  0.00%)     3279.00 (  1.86%)     3319.00 (  0.66%)     3298.00 (  1.29%)     3319.00 (  0.66%)     3392.00 ( -1.53%)
CPU     mean       3409.80 (  0.00%)     3382.40 (  0.80%)     3417.00 ( -0.21%)     3365.60 (  1.30%)     3424.00 ( -0.42%)     3457.00 ( -1.38%)
CPU     stddev       63.50 (  0.00%)       66.38 ( -4.53%)       70.01 (-10.25%)       50.19 ( 20.97%)       74.58 (-17.45%)       56.25 ( 11.42%)
CPU     max        3514.00 (  0.00%)     3479.00 (  1.00%)     3516.00 ( -0.06%)     3426.00 (  2.50%)     3506.00 (  0.23%)     3546.00 ( -0.91%)

numacore has improved a lot here. It only regressed 2.48% which is an
improvement over earlier releases.

autonuma and balancenuma both show some system CPU overhead but averaged
over the multiple runs, it's not very obvious.


MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User         7821.05     7900.01     7829.89     7837.23     7840.19     7835.43
System        735.84      802.86      758.93      753.98      749.44      740.47
Elapsed       298.72      305.17      298.52      300.67      296.84      296.20

System CPU overhead is a bit more obvious here. balancenuma adds 5ish
seconds (0.62%). autonuma adds around 23 seconds (3.04%). numacore adds
67 seconds (8.34%).
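
The arithmetic behind those figures is simple enough to check by hand. The
sketch below derives the extra system CPU seconds from the duration table
and expresses them both against the baseline and against the patched
kernel's own total. The quoted percentages appear to match the latter to
within rounding; that is an observation from the numbers, not something the
tables state. The function name overhead is illustrative.

```python
# System CPU seconds from the kernbench duration table above.
baseline = 735.84  # rc6-stats-v5r1
system_seconds = {
    "numacore": 802.86,
    "autonuma": 758.93,
    "balancenuma (delaystart)": 740.47,
}


def overhead(base, patched):
    """Return (extra seconds, % of baseline, % of patched total)."""
    extra = patched - base
    return extra, 100.0 * extra / base, 100.0 * extra / patched


for name, secs in system_seconds.items():
    extra, pct_base, pct_patched = overhead(baseline, secs)
    print(f"{name:26s} +{extra:6.2f}s  "
          f"{pct_base:5.2f}% of baseline  {pct_patched:5.2f}% of patched")
```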

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                           156           0          28         148           8          16
Page Outs                      1519504     1740760     1460708     1548820     1510256     1548792
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                    323         351         365         374         378         316
THP collapse alloc                  22           1       10071          30           7          28
THP splits                           4           2         151           5           1           7
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0      558483       50325      100470
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0         579          52         104
NUMA PTE updates                     0           0           0   109735841    86018422    65125719
NUMA hint faults                     0           0           0    68484623    53110294    40259527
NUMA hint local faults               0           0           0    65051361    50701491    37787066
NUMA pages migrated                  0           0           0      558483       50325      100470
AutoNUMA cost                        0           0           0      343201      266154      201755

And you can see where balancenuma's system CPU overhead is coming from. Despite
the fact that most of the processes are short-lived, they are still living
longer than 1 second and being scheduled on another node which triggers
the PTE scanner.

Note how adaptscan affects the number of PTE updates as it reduces the scan rate.

Note too how delaystart reduces it further because PTE scanning is postponed
until the task is scheduled on a new node.
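
The size of those two steps can be read straight off the NUMA PTE updates
row. A small sketch, using thpmigrate as the reference point (the helper
name reduction_pct is made up for illustration):

```python
# "NUMA PTE updates" row from the kernbench vmstat table above.
pte_updates = {
    "rc6-thpmigrate-v5r1": 109735841,
    "rc6-adaptscan-v5r1": 86018422,
    "rc6-delaystart-v5r4": 65125719,
}


def reduction_pct(reference, value):
    """Percentage by which value is lower than reference."""
    return 100.0 * (reference - value) / reference


reference = pte_updates["rc6-thpmigrate-v5r1"]
for kernel, updates in pte_updates.items():
    print(f"{kernel:20s} {updates:>11d}  "
          f"{reduction_pct(reference, updates):5.1f}% fewer than thpmigrate")
```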

AIM9
                                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
Min    page_test   337620.00 (  0.00%)   382584.94 ( 13.32%)   274380.00 (-18.73%)   386013.33 ( 14.33%)   367068.62 (  8.72%)   389186.67 ( 15.27%)
Min    brk_test   3189200.00 (  0.00%)  3130446.37 ( -1.84%)  3036200.00 ( -4.80%)  3261733.33 (  2.27%)  2729513.66 (-14.41%)  3232266.67 (  1.35%)
Min    exec_test      263.16 (  0.00%)      270.49 (  2.79%)      275.97 (  4.87%)      263.49 (  0.13%)      262.32 ( -0.32%)      263.33 (  0.06%)
Min    fork_test     1489.36 (  0.00%)     1533.86 (  2.99%)     1754.15 ( 17.78%)     1503.66 (  0.96%)     1500.66 (  0.76%)     1484.69 ( -0.31%)
Mean   page_test   376537.21 (  0.00%)   407175.97 (  8.14%)   369202.58 ( -1.95%)   408484.43 (  8.48%)   401734.17 (  6.69%)   419007.65 ( 11.28%)
Mean   brk_test   3217657.48 (  0.00%)  3223631.95 (  0.19%)  3142007.48 ( -2.35%)  3301305.55 (  2.60%)  2815992.93 (-12.48%)  3270913.07 (  1.66%)
Mean   exec_test      266.09 (  0.00%)      275.19 (  3.42%)      280.30 (  5.34%)      268.35 (  0.85%)      265.03 ( -0.40%)      268.45 (  0.89%)
Mean   fork_test     1521.05 (  0.00%)     1569.47 (  3.18%)     1844.55 ( 21.27%)     1526.62 (  0.37%)     1531.56 (  0.69%)     1529.75 (  0.57%)
Stddev page_test    26593.06 (  0.00%)    11327.52 (-57.40%)    35313.32 ( 32.79%)    11484.61 (-56.81%)    15098.72 (-43.22%)    12553.59 (-52.79%)
Stddev brk_test     14591.07 (  0.00%)    51911.60 (255.78%)    42645.66 (192.27%)    22593.16 ( 54.84%)    41088.23 (181.60%)    26548.94 ( 81.95%)
Stddev exec_test        2.18 (  0.00%)        2.83 ( 29.93%)        3.47 ( 59.06%)        2.90 ( 33.05%)        2.01 ( -7.84%)        3.42 ( 56.74%)
Stddev fork_test       22.76 (  0.00%)       18.41 (-19.10%)       68.22 (199.75%)       20.41 (-10.34%)       20.20 (-11.23%)       28.56 ( 25.48%)
Max    page_test   407320.00 (  0.00%)   421940.00 (  3.59%)   398026.67 ( -2.28%)   421940.00 (  3.59%)   426755.50 (  4.77%)   438146.67 (  7.57%)
Max    brk_test   3240200.00 (  0.00%)  3321800.00 (  2.52%)  3227733.33 ( -0.38%)  3337666.67 (  3.01%)  2863933.33 (-11.61%)  3321852.10 (  2.52%)
Max    exec_test      269.97 (  0.00%)      281.96 (  4.44%)      287.81 (  6.61%)      272.67 (  1.00%)      268.82 ( -0.43%)      273.67 (  1.37%)
Max    fork_test     1554.82 (  0.00%)     1601.33 (  2.99%)     1926.91 ( 23.93%)     1565.62 (  0.69%)     1559.39 (  0.29%)     1583.50 (  1.84%)

This has much improved in general.

page_test is looking generally good on average although the large variances
make it a bit unreliable. brk_test is looking ok too. autonuma regressed
but with the large variances it is within the noise. exec_test and fork_test
both look fine.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User            0.14        2.83        2.87        2.73        2.79        2.80
System          0.24        0.72        0.75        0.72        0.71        0.71
Elapsed       721.97      724.55      724.52      724.36      725.08      724.54


System CPU overhead is noticeable again but it's not really a factor for this load.

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                          7252        7180        7176        7416        7672        7168
Page Outs                        72684       74080       74844       73980       74472       74844
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      0          15           0          36          18          19
THP collapse alloc                   0           0           0           0           0           2
THP splits                           0           0           0           0           0           1
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0          75         842         581
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0           0           0           0
NUMA PTE updates                     0           0           0    40740052    41937943     1669018
NUMA hint faults                     0           0           0       20273       17880        9628
NUMA hint local faults               0           0           0       15901       15562        7259
NUMA pages migrated                  0           0           0          75         842         581
AutoNUMA cost                        0           0           0         386         382          59

The evidence is there that the load is active enough to trigger automatic
NUMA migration activity even though the processes are all small. For
balancenuma, being scheduled on a new node is enough.

HACKBENCH PIPES
                         3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
Procs 1       0.0537 (  0.00%)      0.0282 ( 47.58%)      0.0233 ( 56.73%)      0.0400 ( 25.56%)      0.0220 ( 59.06%)      0.0269 ( 50.02%)
Procs 4       0.0755 (  0.00%)      0.0710 (  5.96%)      0.0540 ( 28.48%)      0.0721 (  4.54%)      0.0679 ( 10.07%)      0.0684 (  9.36%)
Procs 8       0.0795 (  0.00%)      0.0933 (-17.39%)      0.1032 (-29.87%)      0.0859 ( -8.08%)      0.0736 (  7.35%)      0.0954 (-20.11%)
Procs 12      0.1002 (  0.00%)      0.1069 ( -6.62%)      0.1760 (-75.56%)      0.1051 ( -4.88%)      0.0809 ( 19.26%)      0.0926 (  7.68%)
Procs 16      0.1086 (  0.00%)      0.1282 (-18.07%)      0.1695 (-56.08%)      0.1380 (-27.07%)      0.1055 (  2.85%)      0.1239 (-14.13%)
Procs 20      0.1455 (  0.00%)      0.1450 (  0.37%)      0.3690 (-153.54%)      0.1276 ( 12.36%)      0.1588 ( -9.12%)      0.1464 ( -0.56%)
Procs 24      0.1548 (  0.00%)      0.1638 ( -5.82%)      0.4010 (-158.99%)      0.1648 ( -6.41%)      0.1575 ( -1.69%)      0.1621 ( -4.69%)
Procs 28      0.1995 (  0.00%)      0.2089 ( -4.72%)      0.3936 (-97.31%)      0.1829 (  8.33%)      0.2057 ( -3.09%)      0.1942 (  2.66%)
Procs 32      0.2030 (  0.00%)      0.2352 (-15.86%)      0.3780 (-86.21%)      0.2189 ( -7.85%)      0.2011 (  0.92%)      0.2207 ( -8.71%)
Procs 36      0.2323 (  0.00%)      0.2502 ( -7.70%)      0.4813 (-107.14%)      0.2449 ( -5.41%)      0.2492 ( -7.27%)      0.2250 (  3.16%)
Procs 40      0.2708 (  0.00%)      0.2734 ( -0.97%)      0.6089 (-124.84%)      0.2832 ( -4.57%)      0.2822 ( -4.20%)      0.2658 (  1.85%)

Everyone is a bit all over the place here and autonuma is consistent with the
last results in that it's hurting hackbench pipes results. With such large
differences on each thread number it's difficult to draw any conclusion
here. I'd have to dig into the data more and see what's happening but
system CPU can be a proxy measure so onwards...


MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User           57.28       61.04       61.94       61.00       59.64       58.88
System       1849.51     2011.94     1873.74     1918.32     1864.12     1916.33
Elapsed        96.56      100.27      145.82       97.88       96.59       98.28

Yep, system CPU usage is up. Highest in numacore, balancenuma is adding a
chunk as well. autonuma appears to add less but the usual thread comment
applies.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                            24          24          24          24          24          24
Page Outs                         1668        1772        2284        1752        2072        1756
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      0           5           0           6           6           0
THP collapse alloc                   0           0           0           2           0           5
THP splits                           0           0           0           0           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0           2           0          28
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0           0           0           0
NUMA PTE updates                     0           0           0       54736        1061       42752
NUMA hint faults                     0           0           0        2247         518          71
NUMA hint local faults               0           0           0          29           1           0
NUMA pages migrated                  0           0           0           2           0          28
AutoNUMA cost                        0           0           0          11           2           0

And here is the evidence again: balancenuma, at least, is triggering the
migration logic while running hackbench. It may be that as the thread
count grows, a task simply becomes more likely to be scheduled on another
node, so the migration logic starts up even though the workload is not
memory intensive.

I could avoid firing the PTE scanner if the process's RSS is low, I guess,
but that feels hacky.
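
To make the idea concrete, here is a minimal sketch of such an RSS check.
It is a userspace model in Python, not kernel code, and both the function
name and the threshold are invented for illustration; nothing like it is
in the series.

```python
# Hypothetical model of "skip the PTE scanner for small tasks".
# MIN_SCAN_RSS_PAGES is an invented threshold, not a value from the series.
MIN_SCAN_RSS_PAGES = 4096  # 16 MB with 4 KB pages; illustrative only


def should_scan_ptes(rss_pages: int, min_rss_pages: int = MIN_SCAN_RSS_PAGES) -> bool:
    """Return True if a task's resident set is large enough to be worth scanning."""
    return rss_pages >= min_rss_pages


# A hackbench-style task with a tiny resident set is skipped,
# while a task with a large resident set is still scanned.
print(should_scan_ptes(rss_pages=300))      # False
print(should_scan_ptes(rss_pages=500_000))  # True
```

Any fixed cut-off like this is arbitrary, which is much of why the
approach feels hacky.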

HACKBENCH SOCKETS
                         3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
Procs 1       0.0220 (  0.00%)      0.0240 ( -9.09%)      0.0276 (-25.34%)      0.0228 ( -3.83%)      0.0282 (-28.18%)      0.0207 (  6.11%)
Procs 4       0.0535 (  0.00%)      0.0490 (  8.35%)      0.0888 (-66.12%)      0.0467 ( 12.70%)      0.0442 ( 17.27%)      0.0494 (  7.52%)
Procs 8       0.0716 (  0.00%)      0.0726 ( -1.33%)      0.1665 (-132.54%)      0.0718 ( -0.25%)      0.0700 (  2.19%)      0.0701 (  2.09%)
Procs 12      0.1026 (  0.00%)      0.0975 (  4.99%)      0.1290 (-25.73%)      0.0981 (  4.34%)      0.0946 (  7.76%)      0.0967 (  5.71%)
Procs 16      0.1272 (  0.00%)      0.1268 (  0.25%)      0.3193 (-151.05%)      0.1229 (  3.35%)      0.1224 (  3.78%)      0.1270 (  0.11%)
Procs 20      0.1487 (  0.00%)      0.1537 ( -3.40%)      0.1793 (-20.57%)      0.1550 ( -4.25%)      0.1519 ( -2.17%)      0.1579 ( -6.18%)
Procs 24      0.1794 (  0.00%)      0.1797 ( -0.16%)      0.4423 (-146.55%)      0.1851 ( -3.19%)      0.1807 ( -0.71%)      0.1904 ( -6.15%)
Procs 28      0.2165 (  0.00%)      0.2156 (  0.44%)      0.5012 (-131.50%)      0.2147 (  0.85%)      0.2126 (  1.82%)      0.2194 ( -1.34%)
Procs 32      0.2344 (  0.00%)      0.2458 ( -4.89%)      0.7008 (-199.00%)      0.2498 ( -6.60%)      0.2449 ( -4.50%)      0.2528 ( -7.86%)
Procs 36      0.2623 (  0.00%)      0.2752 ( -4.92%)      0.7469 (-184.73%)      0.2852 ( -8.72%)      0.2762 ( -5.30%)      0.2826 ( -7.72%)
Procs 40      0.2921 (  0.00%)      0.3030 ( -3.72%)      0.7753 (-165.46%)      0.3085 ( -5.61%)      0.3046 ( -4.28%)      0.3182 ( -8.94%)

Mix of gains and losses, except for autonuma, which takes a hammering
(up to 199% slower at 32 procs).
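
For anyone reading the percentage column: it appears to be the gain over
the first (baseline) column for a lower-is-better metric. The sketch below
recomputes it under that assumption. The helper name is made up, and since
the table prints rounded times, recomputed values can differ from the
printed ones in the last digits.

```python
def gain_pct(baseline: float, value: float) -> float:
    """Percentage gain over baseline when lower values are better.

    Negative results are regressions. This formula is an assumption about
    how the report derives its percentage column.
    """
    return (baseline - value) / baseline * 100.0


# "Procs 1" row of the hackbench sockets table: baseline vs. numacore.
print(f"{gain_pct(0.0220, 0.0240):.2f}%")  # -9.09%

# "Procs 32" row: baseline vs. autonuma (the table prints -199.00%).
print(f"{gain_pct(0.2344, 0.7008):.1f}%")
```

For higher-is-better rows such as Faults/sec the sign flips, that is,
(value - baseline) / baseline.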

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                rc6-        rc6-        rc6-        rc6-        rc6-        rc6-
              stats-   numacore-   autonuma- thpmigrate-  adaptscan- delaystart-
                v5r1    20121123   v28fastr4        v5r1        v5r1        v5r4
User           39.43       38.44       48.79       41.48       39.54       42.47
System       2249.41     2273.39     2678.90     2285.03     2218.08     2302.44
Elapsed       104.91      105.83      173.39      105.50      104.38      106.55

Less system CPU overhead from numacore here (about 1% over the baseline).
autonuma adds a lot (about 19%). balancenuma is adding more than it should.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                                  rc6-        rc6-        rc6-        rc6-        rc6-        rc6-
                                stats-   numacore-   autonuma- thpmigrate-  adaptscan- delaystart-
                                  v5r1    20121123   v28fastr4        v5r1        v5r1        v5r4
Page Ins                             4           4           4           4           4           4
Page Outs                         1952        2104        2812        1796        1952        2264
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      0           0           0           6           0           0
THP collapse alloc                   0           0           1           0           0           0
THP splits                           0           0           0           0           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0         328         513          19
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0           0           0           0
NUMA PTE updates                     0           0           0       21522       22448       21376
NUMA hint faults                     0           0           0        1082         546          52
NUMA hint local faults               0           0           0         217           0          31
NUMA pages migrated                  0           0           0         328         513          19
AutoNUMA cost                        0           0           0           5           2           0

Again the PTE scanners are in there. They will not help hackbench figures.
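
As a rough way to read the NUMA counters above, the sketch below works out
what fraction of hinting faults were local for the three kernels that
report any. The numbers are copied from the table; the helper name and the
idea of using this ratio as a summary are mine, not something MMTests
reports.

```python
# NUMA counters from the hackbench sockets vmstat table
# (the three kernels with non-zero values).
stats = {
    "rc6-thpmigrate-v5r1": {"hint_faults": 1082, "local": 217, "migrated": 328},
    "rc6-adaptscan-v5r1": {"hint_faults": 546, "local": 0, "migrated": 513},
    "rc6-delaystart-v5r4": {"hint_faults": 52, "local": 31, "migrated": 19},
}


def local_ratio(hint_faults: int, local: int) -> float:
    """Fraction of NUMA hinting faults that were node-local (0.0 if none)."""
    return local / hint_faults if hint_faults else 0.0


for kernel, s in stats.items():
    ratio = local_ratio(s["hint_faults"], s["local"])
    print(f"{kernel}: {ratio:.1%} local, {s['migrated']} pages migrated")
```

Either way the absolute counts are tiny, which fits a workload that is not
memory intensive.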

PAGE FAULT TEST
                              3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                     rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
System     1        8.0195 (  0.00%)       8.2535 ( -2.92%)       8.0495 ( -0.37%)      37.7675 (-370.95%)      38.0265 (-374.18%)       7.9775 (  0.52%)
System     2        8.0095 (  0.00%)       8.0905 ( -1.01%)       8.1415 ( -1.65%)      12.0595 (-50.56%)      11.4145 (-42.51%)       7.9900 (  0.24%)
System     3        8.1025 (  0.00%)       8.1725 ( -0.86%)       8.3525 ( -3.09%)       9.7380 (-20.19%)       9.4905 (-17.13%)       8.1110 ( -0.10%)
System     4        8.1635 (  0.00%)       8.2875 ( -1.52%)       8.5415 ( -4.63%)       8.7440 ( -7.11%)       8.6145 ( -5.52%)       8.1800 ( -0.20%)
System     5        8.4600 (  0.00%)       8.5900 ( -1.54%)       8.8910 ( -5.09%)       8.8365 ( -4.45%)       8.6755 ( -2.55%)       8.5105 ( -0.60%)
System     6        8.7565 (  0.00%)       8.8120 ( -0.63%)       9.3630 ( -6.93%)       8.9460 ( -2.16%)       8.8490 ( -1.06%)       8.7390 (  0.20%)
System     7        8.7390 (  0.00%)       8.8430 ( -1.19%)       9.9310 (-13.64%)       9.0680 ( -3.76%)       8.9600 ( -2.53%)       8.8300 ( -1.04%)
System     8        8.7700 (  0.00%)       8.9110 ( -1.61%)      10.1445 (-15.67%)       9.0435 ( -3.12%)       8.8060 ( -0.41%)       8.7615 (  0.10%)
System     9        9.3455 (  0.00%)       9.3505 ( -0.05%)      10.5340 (-12.72%)       9.4765 ( -1.40%)       9.3955 ( -0.54%)       9.2860 (  0.64%)
System     10       9.4195 (  0.00%)       9.4780 ( -0.62%)      11.6035 (-23.19%)       9.6500 ( -2.45%)       9.5350 ( -1.23%)       9.4735 ( -0.57%)
System     11       9.5405 (  0.00%)       9.6495 ( -1.14%)      12.8475 (-34.66%)       9.7370 ( -2.06%)       9.5995 ( -0.62%)       9.5835 ( -0.45%)
System     12       9.7035 (  0.00%)       9.7470 ( -0.45%)      13.2560 (-36.61%)       9.8445 ( -1.45%)       9.7260 ( -0.23%)       9.5890 (  1.18%)
System     13      10.2745 (  0.00%)      10.2270 (  0.46%)      13.5490 (-31.87%)      10.3840 ( -1.07%)      10.1880 (  0.84%)      10.1480 (  1.23%)
System     14      10.5405 (  0.00%)      10.6135 ( -0.69%)      13.9225 (-32.09%)      10.6915 ( -1.43%)      10.5255 (  0.14%)      10.5620 ( -0.20%)
System     15      10.7190 (  0.00%)      10.8635 ( -1.35%)      15.0760 (-40.65%)      10.9380 ( -2.04%)      10.8190 ( -0.93%)      10.7040 (  0.14%)
System     16      11.2575 (  0.00%)      11.2750 ( -0.16%)      15.0995 (-34.13%)      11.3315 ( -0.66%)      11.2615 ( -0.04%)      11.2345 (  0.20%)
System     17      11.8090 (  0.00%)      12.0865 ( -2.35%)      16.1715 (-36.94%)      11.8925 ( -0.71%)      11.7655 (  0.37%)      11.7585 (  0.43%)
System     18      12.3910 (  0.00%)      12.4270 ( -0.29%)      16.7410 (-35.11%)      12.4425 ( -0.42%)      12.4235 ( -0.26%)      12.3295 (  0.50%)
System     19      12.7915 (  0.00%)      12.8340 ( -0.33%)      16.7175 (-30.69%)      12.7980 ( -0.05%)      12.9825 ( -1.49%)      12.7980 ( -0.05%)
System     20      13.5870 (  0.00%)      13.3100 (  2.04%)      16.5590 (-21.87%)      13.2725 (  2.31%)      13.1720 (  3.05%)      13.1855 (  2.96%)
System     21      13.9325 (  0.00%)      13.9705 ( -0.27%)      16.9110 (-21.38%)      13.8975 (  0.25%)      14.0360 ( -0.74%)      13.8760 (  0.41%)
System     22      14.5810 (  0.00%)      14.7345 ( -1.05%)      18.1160 (-24.24%)      14.7635 ( -1.25%)      14.4805 (  0.69%)      14.4130 (  1.15%)
System     23      15.0710 (  0.00%)      15.1400 ( -0.46%)      18.3805 (-21.96%)      15.2020 ( -0.87%)      15.1100 ( -0.26%)      15.0385 (  0.22%)
System     24      15.8815 (  0.00%)      15.7120 (  1.07%)      19.7195 (-24.17%)      15.6205 (  1.64%)      15.5965 (  1.79%)      15.5950 (  1.80%)
System     25      16.1480 (  0.00%)      16.6115 ( -2.87%)      19.5480 (-21.06%)      16.2305 ( -0.51%)      16.1775 ( -0.18%)      16.1510 ( -0.02%)
System     26      17.1075 (  0.00%)      17.1015 (  0.04%)      19.7100 (-15.21%)      17.0800 (  0.16%)      16.8955 (  1.24%)      16.7845 (  1.89%)
System     27      17.3015 (  0.00%)      17.4120 ( -0.64%)      20.2640 (-17.12%)      17.2615 (  0.23%)      17.2430 (  0.34%)      17.2895 (  0.07%)
System     28      17.8750 (  0.00%)      17.9675 ( -0.52%)      21.2030 (-18.62%)      17.7305 (  0.81%)      17.7480 (  0.71%)      17.7615 (  0.63%)
System     29      18.5260 (  0.00%)      18.8165 ( -1.57%)      20.4045 (-10.14%)      18.3895 (  0.74%)      18.2980 (  1.23%)      18.4480 (  0.42%)
System     30      19.0865 (  0.00%)      19.1865 ( -0.52%)      21.0970 (-10.53%)      18.9800 (  0.56%)      18.8510 (  1.23%)      19.0500 (  0.19%)
System     31      19.8095 (  0.00%)      19.7210 (  0.45%)      22.8030 (-15.11%)      19.7365 (  0.37%)      19.6370 (  0.87%)      19.9115 ( -0.51%)
System     32      20.3360 (  0.00%)      20.3510 ( -0.07%)      23.3780 (-14.96%)      20.2040 (  0.65%)      20.0695 (  1.31%)      20.2110 (  0.61%)
System     33      21.0240 (  0.00%)      21.0225 (  0.01%)      23.3495 (-11.06%)      20.8200 (  0.97%)      20.6455 (  1.80%)      21.0125 (  0.05%)
System     34      21.6065 (  0.00%)      21.9710 ( -1.69%)      23.2650 ( -7.68%)      21.4115 (  0.90%)      21.4230 (  0.85%)      21.8570 ( -1.16%)
System     35      22.3005 (  0.00%)      22.3190 ( -0.08%)      23.2305 ( -4.17%)      22.1695 (  0.59%)      22.0695 (  1.04%)      22.2485 (  0.23%)
System     36      23.0245 (  0.00%)      22.9430 (  0.35%)      24.8930 ( -8.12%)      22.7685 (  1.11%)      22.7385 (  1.24%)      23.0900 ( -0.28%)
System     37      23.8225 (  0.00%)      23.7100 (  0.47%)      24.9290 ( -4.64%)      23.5425 (  1.18%)      23.3270 (  2.08%)      23.6795 (  0.60%)
System     38      24.5015 (  0.00%)      24.4780 (  0.10%)      25.3145 ( -3.32%)      24.3460 (  0.63%)      24.1105 (  1.60%)      24.5430 ( -0.17%)
System     39      25.1855 (  0.00%)      25.1445 (  0.16%)      25.1985 ( -0.05%)      25.1355 (  0.20%)      24.9305 (  1.01%)      25.0000 (  0.74%)
System     40      25.8990 (  0.00%)      25.8310 (  0.26%)      26.5205 ( -2.40%)      25.7115 (  0.72%)      25.5310 (  1.42%)      25.9605 ( -0.24%)
System     41      26.5585 (  0.00%)      26.7045 ( -0.55%)      27.5060 ( -3.57%)      26.5825 ( -0.09%)      26.3515 (  0.78%)      26.5835 ( -0.09%)
System     42      27.3840 (  0.00%)      27.5735 ( -0.69%)      27.3995 ( -0.06%)      27.2475 (  0.50%)      27.1680 (  0.79%)      27.3810 (  0.01%)
System     43      28.1595 (  0.00%)      28.2515 ( -0.33%)      27.5285 (  2.24%)      27.9805 (  0.64%)      27.8795 (  0.99%)      28.1255 (  0.12%)
System     44      28.8460 (  0.00%)      29.0390 ( -0.67%)      28.4580 (  1.35%)      28.9385 ( -0.32%)      28.7750 (  0.25%)      28.8655 ( -0.07%)
System     45      29.5430 (  0.00%)      29.8280 ( -0.96%)      28.5270 (  3.44%)      29.8165 ( -0.93%)      29.6105 ( -0.23%)      29.5655 ( -0.08%)
System     46      30.3290 (  0.00%)      30.6420 ( -1.03%)      29.1955 (  3.74%)      30.6235 ( -0.97%)      30.4205 ( -0.30%)      30.2640 (  0.21%)
System     47      30.9365 (  0.00%)      31.3360 ( -1.29%)      29.2915 (  5.32%)      31.3365 ( -1.29%)      31.3660 ( -1.39%)      30.9300 (  0.02%)
System     48      31.5680 (  0.00%)      32.1220 ( -1.75%)      29.3805 (  6.93%)      32.1925 ( -1.98%)      31.9820 ( -1.31%)      31.6180 ( -0.16%)

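autonuma is showing a lot of system CPU overhead here. numacore and
balancenuma are ok in the rc6-delaystart-v5r4 kernel; thpmigrate and
adaptscan on their own regress badly in the first three rows (up to
-374%). The remaining blips are small enough that they are nothing to
get excited over.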

Elapsed    1        8.7170 (  0.00%)       8.9585 ( -2.77%)       8.7485 ( -0.36%)      38.5375 (-342.10%)      38.8065 (-345.18%)       8.6755 (  0.48%)
Elapsed    2        4.4075 (  0.00%)       4.4345 ( -0.61%)       4.5320 ( -2.82%)       6.5940 (-49.61%)       6.1920 (-40.49%)       4.4090 ( -0.03%)
Elapsed    3        2.9785 (  0.00%)       2.9990 ( -0.69%)       3.0945 ( -3.89%)       3.5820 (-20.26%)       3.4765 (-16.72%)       2.9840 ( -0.18%)
Elapsed    4        2.2530 (  0.00%)       2.3010 ( -2.13%)       2.3845 ( -5.84%)       2.4400 ( -8.30%)       2.4045 ( -6.72%)       2.2675 ( -0.64%)
Elapsed    5        1.9070 (  0.00%)       1.9315 ( -1.28%)       1.9885 ( -4.27%)       2.0180 ( -5.82%)       1.9725 ( -3.43%)       1.9195 ( -0.66%)
Elapsed    6        1.6490 (  0.00%)       1.6705 ( -1.30%)       1.7470 ( -5.94%)       1.6695 ( -1.24%)       1.6575 ( -0.52%)       1.6385 (  0.64%)
Elapsed    7        1.4235 (  0.00%)       1.4385 ( -1.05%)       1.6090 (-13.03%)       1.4590 ( -2.49%)       1.4495 ( -1.83%)       1.4200 (  0.25%)
Elapsed    8        1.2500 (  0.00%)       1.2600 ( -0.80%)       1.4345 (-14.76%)       1.2650 ( -1.20%)       1.2340 (  1.28%)       1.2345 (  1.24%)
Elapsed    9        1.2090 (  0.00%)       1.2125 ( -0.29%)       1.3355 (-10.46%)       1.2275 ( -1.53%)       1.2185 ( -0.79%)       1.1975 (  0.95%)
Elapsed    10       1.0885 (  0.00%)       1.0900 ( -0.14%)       1.3390 (-23.01%)       1.1195 ( -2.85%)       1.1110 ( -2.07%)       1.0985 ( -0.92%)
Elapsed    11       0.9970 (  0.00%)       1.0220 ( -2.51%)       1.3575 (-36.16%)       1.0210 ( -2.41%)       1.0145 ( -1.76%)       1.0005 ( -0.35%)
Elapsed    12       0.9355 (  0.00%)       0.9375 ( -0.21%)       1.3060 (-39.60%)       0.9505 ( -1.60%)       0.9390 ( -0.37%)       0.9205 (  1.60%)
Elapsed    13       0.9345 (  0.00%)       0.9320 (  0.27%)       1.2940 (-38.47%)       0.9435 ( -0.96%)       0.9200 (  1.55%)       0.9195 (  1.61%)
Elapsed    14       0.8815 (  0.00%)       0.8960 ( -1.64%)       1.2755 (-44.70%)       0.8955 ( -1.59%)       0.8780 (  0.40%)       0.8860 ( -0.51%)
Elapsed    15       0.8175 (  0.00%)       0.8375 ( -2.45%)       1.3655 (-67.03%)       0.8470 ( -3.61%)       0.8260 ( -1.04%)       0.8170 (  0.06%)
Elapsed    16       0.8135 (  0.00%)       0.8045 (  1.11%)       1.3165 (-61.83%)       0.8130 (  0.06%)       0.8040 (  1.17%)       0.7970 (  2.03%)
Elapsed    17       0.8375 (  0.00%)       0.8530 ( -1.85%)       1.4175 (-69.25%)       0.8380 ( -0.06%)       0.8405 ( -0.36%)       0.8305 (  0.84%)
Elapsed    18       0.8045 (  0.00%)       0.8100 ( -0.68%)       1.4135 (-75.70%)       0.8120 ( -0.93%)       0.8050 ( -0.06%)       0.8010 (  0.44%)
Elapsed    19       0.7600 (  0.00%)       0.7625 ( -0.33%)       1.3640 (-79.47%)       0.7700 ( -1.32%)       0.7870 ( -3.55%)       0.7720 ( -1.58%)
Elapsed    20       0.7860 (  0.00%)       0.7410 (  5.73%)       1.3125 (-66.98%)       0.7580 (  3.56%)       0.7375 (  6.17%)       0.7370 (  6.23%)
Elapsed    21       0.8080 (  0.00%)       0.7970 (  1.36%)       1.2775 (-58.11%)       0.7960 (  1.49%)       0.8175 ( -1.18%)       0.7970 (  1.36%)
Elapsed    22       0.7930 (  0.00%)       0.7840 (  1.13%)       1.3940 (-75.79%)       0.8035 ( -1.32%)       0.7780 (  1.89%)       0.7640 (  3.66%)
Elapsed    23       0.7570 (  0.00%)       0.7525 (  0.59%)       1.3490 (-78.20%)       0.7915 ( -4.56%)       0.7710 ( -1.85%)       0.7800 ( -3.04%)
Elapsed    24       0.7705 (  0.00%)       0.7280 (  5.52%)       1.4550 (-88.84%)       0.7400 (  3.96%)       0.7630 (  0.97%)       0.7575 (  1.69%)
Elapsed    25       0.8165 (  0.00%)       0.8630 ( -5.70%)       1.3755 (-68.46%)       0.8790 ( -7.65%)       0.9015 (-10.41%)       0.8505 ( -4.16%)
Elapsed    26       0.8465 (  0.00%)       0.8425 (  0.47%)       1.3405 (-58.36%)       0.8790 ( -3.84%)       0.8660 ( -2.30%)       0.8360 (  1.24%)
Elapsed    27       0.8025 (  0.00%)       0.8045 ( -0.25%)       1.3655 (-70.16%)       0.8325 ( -3.74%)       0.8420 ( -4.92%)       0.8175 ( -1.87%)
Elapsed    28       0.7990 (  0.00%)       0.7850 (  1.75%)       1.3475 (-68.65%)       0.8075 ( -1.06%)       0.8185 ( -2.44%)       0.7885 (  1.31%)
Elapsed    29       0.8010 (  0.00%)       0.8005 (  0.06%)       1.2595 (-57.24%)       0.8075 ( -0.81%)       0.8130 ( -1.50%)       0.7970 (  0.50%)
Elapsed    30       0.7965 (  0.00%)       0.7825 (  1.76%)       1.2365 (-55.24%)       0.8105 ( -1.76%)       0.8050 ( -1.07%)       0.8095 ( -1.63%)
Elapsed    31       0.7820 (  0.00%)       0.7740 (  1.02%)       1.2670 (-62.02%)       0.7980 ( -2.05%)       0.8035 ( -2.75%)       0.7970 ( -1.92%)
Elapsed    32       0.7905 (  0.00%)       0.7675 (  2.91%)       1.3765 (-74.13%)       0.8000 ( -1.20%)       0.7935 ( -0.38%)       0.7725 (  2.28%)
Elapsed    33       0.7980 (  0.00%)       0.7640 (  4.26%)       1.2225 (-53.20%)       0.7985 ( -0.06%)       0.7945 (  0.44%)       0.7900 (  1.00%)
Elapsed    34       0.7875 (  0.00%)       0.7820 (  0.70%)       1.1880 (-50.86%)       0.8030 ( -1.97%)       0.8175 ( -3.81%)       0.8090 ( -2.73%)
Elapsed    35       0.7910 (  0.00%)       0.7735 (  2.21%)       1.2100 (-52.97%)       0.8050 ( -1.77%)       0.8025 ( -1.45%)       0.7830 (  1.01%)
Elapsed    36       0.7745 (  0.00%)       0.7565 (  2.32%)       1.3075 (-68.82%)       0.8010 ( -3.42%)       0.8095 ( -4.52%)       0.8000 ( -3.29%)
Elapsed    37       0.7960 (  0.00%)       0.7660 (  3.77%)       1.1970 (-50.38%)       0.8045 ( -1.07%)       0.7950 (  0.13%)       0.8010 ( -0.63%)
Elapsed    38       0.7800 (  0.00%)       0.7825 ( -0.32%)       1.1305 (-44.94%)       0.8095 ( -3.78%)       0.8015 ( -2.76%)       0.8065 ( -3.40%)
Elapsed    39       0.7915 (  0.00%)       0.7635 (  3.54%)       1.0915 (-37.90%)       0.8085 ( -2.15%)       0.8060 ( -1.83%)       0.7790 (  1.58%)
Elapsed    40       0.7810 (  0.00%)       0.7635 (  2.24%)       1.1175 (-43.09%)       0.7870 ( -0.77%)       0.8025 ( -2.75%)       0.7895 ( -1.09%)
Elapsed    41       0.7675 (  0.00%)       0.7730 ( -0.72%)       1.1610 (-51.27%)       0.8025 ( -4.56%)       0.7780 ( -1.37%)       0.7870 ( -2.54%)
Elapsed    42       0.7705 (  0.00%)       0.7925 ( -2.86%)       1.1095 (-44.00%)       0.7850 ( -1.88%)       0.7890 ( -2.40%)       0.7950 ( -3.18%)
Elapsed    43       0.7830 (  0.00%)       0.7680 (  1.92%)       1.1470 (-46.49%)       0.7960 ( -1.66%)       0.7830 (  0.00%)       0.7855 ( -0.32%)
Elapsed    44       0.7745 (  0.00%)       0.7560 (  2.39%)       1.1575 (-49.45%)       0.7870 ( -1.61%)       0.7950 ( -2.65%)       0.7835 ( -1.16%)
Elapsed    45       0.7665 (  0.00%)       0.7635 (  0.39%)       1.0200 (-33.07%)       0.7935 ( -3.52%)       0.7745 ( -1.04%)       0.7695 ( -0.39%)
Elapsed    46       0.7660 (  0.00%)       0.7695 ( -0.46%)       1.0610 (-38.51%)       0.7835 ( -2.28%)       0.7830 ( -2.22%)       0.7725 ( -0.85%)
Elapsed    47       0.7575 (  0.00%)       0.7710 ( -1.78%)       1.0340 (-36.50%)       0.7895 ( -4.22%)       0.7800 ( -2.97%)       0.7755 ( -2.38%)
Elapsed    48       0.7740 (  0.00%)       0.7665 (  0.97%)       1.0505 (-35.72%)       0.7735 (  0.06%)       0.7795 ( -0.71%)       0.7630 (  1.42%)

autonuma hurts here. numacore and balancenuma are ok.
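
One way to see how much autonuma hurts is to turn the Elapsed rows into a
scaling factor. The sketch below assumes every row does the same total
amount of work, so that elapsed(1) / elapsed(48) approximates parallel
speedup. That assumption and the helper name are mine; the values are
copied from the first and last Elapsed rows.

```python
def speedup(elapsed_one: float, elapsed_n: float) -> float:
    """Speedup of the N-row run over the 1-row run, assuming equal total work."""
    return elapsed_one / elapsed_n


# (Elapsed row 1, Elapsed row 48) per kernel, from the table above.
rows = {
    "rc6-stats-v5r1": (8.7170, 0.7740),
    "rc6-numacore-20121123": (8.9585, 0.7665),
    "rc6-autonuma-v28fastr4": (8.7485, 1.0505),
    "rc6-delaystart-v5r4": (8.6755, 0.7630),
}

for kernel, (e1, e48) in rows.items():
    print(f"{kernel}: {speedup(e1, e48):.1f}x")
```

thpmigrate and adaptscan are left out because their row-1 times (about 38
seconds) would inflate the ratio and make it meaningless.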

Faults/cpu 1   379968.7014 (  0.00%)  369716.7221 ( -2.70%)  378284.9642 ( -0.44%)   86427.8993 (-77.25%)   87036.4027 (-77.09%)  381109.9811 (  0.30%)
Faults/cpu 2   379324.0493 (  0.00%)  376624.9420 ( -0.71%)  372938.2576 ( -1.68%)  258617.9410 (-31.82%)  272229.5372 (-28.23%)  379332.1426 (  0.00%)
Faults/cpu 3   374110.9252 (  0.00%)  371809.0394 ( -0.62%)  362384.3379 ( -3.13%)  315364.3194 (-15.70%)  322932.0319 (-13.68%)  373740.6327 ( -0.10%)
Faults/cpu 4   371054.3320 (  0.00%)  366010.1683 ( -1.36%)  354374.7659 ( -4.50%)  347925.4511 ( -6.23%)  351926.8213 ( -5.15%)  369718.8116 ( -0.36%)
Faults/cpu 5   357644.9509 (  0.00%)  353116.2568 ( -1.27%)  340954.4156 ( -4.67%)  342873.2808 ( -4.13%)  348837.4032 ( -2.46%)  355357.9808 ( -0.64%)
Faults/cpu 6   345166.0268 (  0.00%)  343605.5937 ( -0.45%)  324566.0244 ( -5.97%)  339177.9361 ( -1.73%)  341785.4988 ( -0.98%)  345830.4062 (  0.19%)
Faults/cpu 7   346686.9164 (  0.00%)  343254.5354 ( -0.99%)  307569.0063 (-11.28%)  334501.4563 ( -3.51%)  337715.4825 ( -2.59%)  342176.3071 ( -1.30%)
Faults/cpu 8   345617.2248 (  0.00%)  341409.8570 ( -1.22%)  301005.0046 (-12.91%)  335797.8156 ( -2.84%)  344630.9102 ( -0.29%)  346313.4237 (  0.20%)
Faults/cpu 9   324187.6755 (  0.00%)  324493.4570 (  0.09%)  292467.7328 ( -9.78%)  320295.6357 ( -1.20%)  321737.9910 ( -0.76%)  325867.9016 (  0.52%)
Faults/cpu 10  323260.5270 (  0.00%)  321706.2762 ( -0.48%)  267253.0641 (-17.33%)  314825.0722 ( -2.61%)  317861.8672 ( -1.67%)  320046.7340 ( -0.99%)
Faults/cpu 11  319485.7975 (  0.00%)  315952.8672 ( -1.11%)  242837.3072 (-23.99%)  312472.4466 ( -2.20%)  316449.1894 ( -0.95%)  317039.2752 ( -0.77%)
Faults/cpu 12  314193.4166 (  0.00%)  313068.6101 ( -0.36%)  235605.3115 (-25.01%)  309340.3850 ( -1.54%)  313383.0113 ( -0.26%)  317336.9315 (  1.00%)
Faults/cpu 13  297642.2341 (  0.00%)  299213.5432 (  0.53%)  234437.1802 (-21.24%)  293494.9766 ( -1.39%)  299705.3429 (  0.69%)  300624.5210 (  1.00%)
Faults/cpu 14  290534.1543 (  0.00%)  288426.1514 ( -0.73%)  224483.1714 (-22.73%)  285707.6328 ( -1.66%)  290879.5737 (  0.12%)  289279.0242 ( -0.43%)
Faults/cpu 15  288135.4034 (  0.00%)  283193.5948 ( -1.72%)  212413.0189 (-26.28%)  280349.0344 ( -2.70%)  284072.2862 ( -1.41%)  287647.8834 ( -0.17%)
Faults/cpu 16  272332.8272 (  0.00%)  272814.3475 (  0.18%)  207466.3481 (-23.82%)  270402.6579 ( -0.71%)  271763.7503 ( -0.21%)  274964.5255 (  0.97%)
Faults/cpu 17  259801.4891 (  0.00%)  254678.1893 ( -1.97%)  195438.3763 (-24.77%)  258832.2108 ( -0.37%)  260388.8630 (  0.23%)  260959.0635 (  0.45%)
Faults/cpu 18  247485.0166 (  0.00%)  247528.4736 (  0.02%)  188851.6906 (-23.69%)  246617.6952 ( -0.35%)  246672.7250 ( -0.33%)  248623.7380 (  0.46%)
Faults/cpu 19  240874.3964 (  0.00%)  240040.1762 ( -0.35%)  188854.7002 (-21.60%)  241091.5604 (  0.09%)  235779.1526 ( -2.12%)  240054.8191 ( -0.34%)
Faults/cpu 20  230055.4776 (  0.00%)  233739.6952 (  1.60%)  189561.1074 (-17.60%)  232361.9801 (  1.00%)  235648.3672 (  2.43%)  235093.1838 (  2.19%)
Faults/cpu 21  221089.0306 (  0.00%)  222658.7857 (  0.71%)  185501.7940 (-16.10%)  221778.3227 (  0.31%)  220242.8822 ( -0.38%)  222037.5554 (  0.43%)
Faults/cpu 22  212928.6223 (  0.00%)  211709.9070 ( -0.57%)  173833.3256 (-18.36%)  210452.7972 ( -1.16%)  214426.3103 (  0.70%)  214947.4742 (  0.95%)
Faults/cpu 23  207494.8662 (  0.00%)  206521.8192 ( -0.47%)  171758.7557 (-17.22%)  205407.2927 ( -1.01%)  206721.0393 ( -0.37%)  207409.9085 ( -0.04%)
Faults/cpu 24  198271.6218 (  0.00%)  200140.9741 (  0.94%)  162334.1621 (-18.13%)  201006.4327 (  1.38%)  201252.9323 (  1.50%)  200952.4305 (  1.35%)
Faults/cpu 25  194049.1874 (  0.00%)  188802.4110 ( -2.70%)  161943.4996 (-16.55%)  191462.4322 ( -1.33%)  191439.2795 ( -1.34%)  192108.4659 ( -1.00%)
Faults/cpu 26  183620.4998 (  0.00%)  183343.6939 ( -0.15%)  160425.1497 (-12.63%)  182870.8145 ( -0.41%)  184395.3448 (  0.42%)  186077.3626 (  1.34%)
Faults/cpu 27  181390.7603 (  0.00%)  180468.1260 ( -0.51%)  156356.5144 (-13.80%)  181196.8598 ( -0.11%)  181266.5928 ( -0.07%)  180640.5088 ( -0.41%)
Faults/cpu 28  176180.0531 (  0.00%)  175634.1202 ( -0.31%)  150357.6004 (-14.66%)  177080.1177 (  0.51%)  177119.5918 (  0.53%)  176368.0055 (  0.11%)
Faults/cpu 29  169650.2633 (  0.00%)  168217.8595 ( -0.84%)  155420.2194 ( -8.39%)  170747.8837 (  0.65%)  171278.7622 (  0.96%)  170279.8400 (  0.37%)
Faults/cpu 30  165035.8356 (  0.00%)  164500.4660 ( -0.32%)  149498.3808 ( -9.41%)  165260.2440 (  0.14%)  166184.8081 (  0.70%)  164413.5702 ( -0.38%)
Faults/cpu 31  159436.3440 (  0.00%)  160203.2927 (  0.48%)  139138.4143 (-12.73%)  159857.9330 (  0.26%)  160602.8294 (  0.73%)  158802.3951 ( -0.40%)
Faults/cpu 32  155345.7802 (  0.00%)  155688.0137 (  0.22%)  136290.5101 (-12.27%)  156028.5649 (  0.44%)  156660.6132 (  0.85%)  156110.2021 (  0.49%)
Faults/cpu 33  150219.6220 (  0.00%)  150761.8116 (  0.36%)  135744.4512 ( -9.64%)  151295.3001 (  0.72%)  152374.5286 (  1.43%)  149876.4226 ( -0.23%)
Faults/cpu 34  145772.3820 (  0.00%)  144612.2751 ( -0.80%)  136039.8268 ( -6.68%)  147191.8811 (  0.97%)  146490.6089 (  0.49%)  144259.7221 ( -1.04%)
Faults/cpu 35  141844.4600 (  0.00%)  141708.8606 ( -0.10%)  136089.5490 ( -4.06%)  141913.1720 (  0.05%)  142196.7473 (  0.25%)  141281.3582 ( -0.40%)
Faults/cpu 36  137593.5661 (  0.00%)  138161.2436 (  0.41%)  128386.3001 ( -6.69%)  138513.0778 (  0.67%)  138313.7914 (  0.52%)  136719.5046 ( -0.64%)
Faults/cpu 37  132889.3691 (  0.00%)  133510.5699 (  0.47%)  127211.5973 ( -4.27%)  133844.4348 (  0.72%)  134542.6731 (  1.24%)  133044.9847 (  0.12%)
Faults/cpu 38  129464.8808 (  0.00%)  129309.9659 ( -0.12%)  124991.9760 ( -3.45%)  129698.4299 (  0.18%)  130383.7440 (  0.71%)  128545.0900 ( -0.71%)
Faults/cpu 39  125847.2523 (  0.00%)  126247.6919 (  0.32%)  125720.8199 ( -0.10%)  125748.5172 ( -0.08%)  126184.8812 (  0.27%)  126166.4376 (  0.25%)
Faults/cpu 40  122497.3658 (  0.00%)  122904.6230 (  0.33%)  119592.8625 ( -2.37%)  122917.6924 (  0.34%)  123206.4626 (  0.58%)  121880.4385 ( -0.50%)
Faults/cpu 41  119450.0397 (  0.00%)  119031.7169 ( -0.35%)  115547.9382 ( -3.27%)  118794.7652 ( -0.55%)  119418.5855 ( -0.03%)  118715.8560 ( -0.61%)
Faults/cpu 42  116004.5444 (  0.00%)  115247.2406 ( -0.65%)  115673.3669 ( -0.29%)  115894.3102 ( -0.10%)  115924.0103 ( -0.07%)  115546.2484 ( -0.40%)
Faults/cpu 43  112825.6897 (  0.00%)  112555.8521 ( -0.24%)  115351.1821 (  2.24%)  113205.7203 (  0.34%)  112896.3224 (  0.06%)  112501.5505 ( -0.29%)
Faults/cpu 44  110221.9798 (  0.00%)  109799.1269 ( -0.38%)  111690.2165 (  1.33%)  109460.3398 ( -0.69%)  109736.3227 ( -0.44%)  109822.0646 ( -0.36%)
Faults/cpu 45  107808.1019 (  0.00%)  106853.8230 ( -0.89%)  111211.9257 (  3.16%)  106613.8474 ( -1.11%)  106835.5728 ( -0.90%)  107420.9722 ( -0.36%)
Faults/cpu 46  105338.7289 (  0.00%)  104322.1338 ( -0.97%)  108688.1743 (  3.18%)  103868.0598 ( -1.40%)  104019.1548 ( -1.25%)  105022.6610 ( -0.30%)
Faults/cpu 47  103330.7670 (  0.00%)  102023.9900 ( -1.26%)  108331.5085 (  4.84%)  101681.8182 ( -1.60%)  101245.4175 ( -2.02%)  102871.1021 ( -0.44%)
Faults/cpu 48  101441.4170 (  0.00%)   99674.9779 ( -1.74%)  108007.0665 (  6.47%)   99354.5932 ( -2.06%)   99252.9156 ( -2.16%)  100868.6868 ( -0.56%)

Same story on number of faults processed per CPU.

Faults/sec 1   379226.4553 (  0.00%)  368933.2163 ( -2.71%)  377567.1922 ( -0.44%)   86267.2515 (-77.25%)   86875.1744 (-77.09%)  380376.2873 (  0.30%)
Faults/sec 2   749973.6389 (  0.00%)  745368.4598 ( -0.61%)  729046.6001 ( -2.79%)  501399.0067 (-33.14%)  533091.7531 (-28.92%)  748098.5102 ( -0.25%)
Faults/sec 3  1109387.2150 (  0.00%) 1101815.4855 ( -0.68%) 1067844.4241 ( -3.74%)  922150.6228 (-16.88%)  948926.6753 (-14.46%) 1105559.1712 ( -0.35%)
Faults/sec 4  1466774.3100 (  0.00%) 1436277.7333 ( -2.08%) 1386595.2563 ( -5.47%) 1352804.9587 ( -7.77%) 1373754.4330 ( -6.34%) 1455926.9804 ( -0.74%)
Faults/sec 5  1734004.1931 (  0.00%) 1712341.4333 ( -1.25%) 1663159.2063 ( -4.09%) 1636827.0073 ( -5.60%) 1674262.7667 ( -3.45%) 1719713.1856 ( -0.82%)
Faults/sec 6  2005083.6885 (  0.00%) 1980047.8898 ( -1.25%) 1892759.0575 ( -5.60%) 1978591.3286 ( -1.32%) 1990385.5922 ( -0.73%) 2012957.1946 (  0.39%)
Faults/sec 7  2323523.7344 (  0.00%) 2297209.3144 ( -1.13%) 2064475.4665 (-11.15%) 2260510.6371 ( -2.71%) 2278640.0597 ( -1.93%) 2324813.2040 (  0.06%)
Faults/sec 8  2648167.0893 (  0.00%) 2624742.9343 ( -0.88%) 2314968.6209 (-12.58%) 2606988.4580 ( -1.55%) 2671599.7800 (  0.88%) 2673032.1950 (  0.94%)
Faults/sec 9  2736925.7247 (  0.00%) 2728207.1722 ( -0.32%) 2491913.1048 ( -8.95%) 2689604.9745 ( -1.73%) 2708047.0077 ( -1.06%) 2760248.2053 (  0.85%)
Faults/sec 10 3039414.3444 (  0.00%) 3038105.4345 ( -0.04%) 2492174.2233 (-18.00%) 2947139.9612 ( -3.04%) 2973073.5636 ( -2.18%) 3002803.7061 ( -1.20%)
Faults/sec 11 3321706.1658 (  0.00%) 3239414.0527 ( -2.48%) 2456634.8702 (-26.04%) 3237117.6282 ( -2.55%) 3260521.6371 ( -1.84%) 3298132.1843 ( -0.71%)
Faults/sec 12 3532409.7672 (  0.00%) 3534748.1800 (  0.07%) 2556542.9426 (-27.63%) 3478409.1401 ( -1.53%) 3513285.3467 ( -0.54%) 3587238.4424 (  1.55%)
Faults/sec 13 3537583.2973 (  0.00%) 3555979.7240 (  0.52%) 2643676.1015 (-25.27%) 3498887.6802 ( -1.09%) 3584695.8753 (  1.33%) 3590044.7697 (  1.48%)
Faults/sec 14 3746624.1500 (  0.00%) 3689003.6175 ( -1.54%) 2630758.3449 (-29.78%) 3690864.4632 ( -1.49%) 3751840.8797 (  0.14%) 3724950.8729 ( -0.58%)
Faults/sec 15 4051109.8741 (  0.00%) 3953680.3643 ( -2.41%) 2541857.4723 (-37.26%) 3905515.7917 ( -3.59%) 3998526.1306 ( -1.30%) 4049199.2538 ( -0.05%)
Faults/sec 16 4078126.4712 (  0.00%) 4123441.7643 (  1.11%) 2549782.7076 (-37.48%) 4067671.7626 ( -0.26%) 4106454.4320 (  0.69%) 4167569.6242 (  2.19%)
Faults/sec 17 3946209.5066 (  0.00%) 3886274.3946 ( -1.52%) 2405328.1767 (-39.05%) 3937304.5223 ( -0.23%) 3920485.2382 ( -0.65%) 3967957.4690 (  0.55%)
Faults/sec 18 4115112.1063 (  0.00%) 4079027.7233 ( -0.88%) 2385981.0332 (-42.02%) 4062940.8129 ( -1.27%) 4103770.0811 ( -0.28%) 4121303.7070 (  0.15%)
Faults/sec 19 4354086.4908 (  0.00%) 4333268.5610 ( -0.48%) 2501627.6834 (-42.55%) 4284800.1294 ( -1.59%) 4206148.7446 ( -3.40%) 4287512.8517 ( -1.53%)
Faults/sec 20 4263596.5894 (  0.00%) 4472167.3677 (  4.89%) 2564140.4929 (-39.86%) 4370659.6359 (  2.51%) 4479581.9679 (  5.07%) 4484166.9738 (  5.17%)
Faults/sec 21 4098972.5089 (  0.00%) 4151322.9576 (  1.28%) 2626683.1075 (-35.92%) 4149013.2160 (  1.22%) 4058372.3890 ( -0.99%) 4143527.1704 (  1.09%)
Faults/sec 22 4175738.8898 (  0.00%) 4237648.8102 (  1.48%) 2388945.8252 (-42.79%) 4137584.2163 ( -0.91%) 4247730.7669 (  1.72%) 4322814.4495 (  3.52%)
Faults/sec 23 4373975.8159 (  0.00%) 4395014.8420 (  0.48%) 2491320.6893 (-43.04%) 4195839.4189 ( -4.07%) 4289031.3045 ( -1.94%) 4249735.3807 ( -2.84%)
Faults/sec 24 4343903.6909 (  0.00%) 4539539.0281 (  4.50%) 2367142.7680 (-45.51%) 4463459.6633 (  2.75%) 4347883.8816 (  0.09%) 4361808.4405 (  0.41%)
Faults/sec 25 4049139.5490 (  0.00%) 3836819.6187 ( -5.24%) 2452593.4879 (-39.43%) 3756917.3563 ( -7.22%) 3667462.3028 ( -9.43%) 3882470.4622 ( -4.12%)
Faults/sec 26 3923558.8580 (  0.00%) 3926335.3913 (  0.07%) 2497179.3566 (-36.35%) 3758947.5820 ( -4.20%) 3810590.6641 ( -2.88%) 3949958.5833 (  0.67%)
Faults/sec 27 4120929.2726 (  0.00%) 4111259.5839 ( -0.23%) 2444020.3202 (-40.69%) 3958866.4333 ( -3.93%) 3934181.7350 ( -4.53%) 4038502.1999 ( -2.00%)
Faults/sec 28 4148296.9993 (  0.00%) 4208740.3644 (  1.46%) 2508485.6715 (-39.53%) 4084949.7113 ( -1.53%) 4037661.6209 ( -2.67%) 4185738.4607 (  0.90%)
Faults/sec 29 4124742.2486 (  0.00%) 4142048.5869 (  0.42%) 2672716.5715 (-35.20%) 4085761.2234 ( -0.95%) 4068650.8559 ( -1.36%) 4144694.1129 (  0.48%)
Faults/sec 30 4160740.4979 (  0.00%) 4236457.4748 (  1.82%) 2695629.9415 (-35.21%) 4076825.3513 ( -2.02%) 4106802.5562 ( -1.30%) 4084027.7691 ( -1.84%)
Faults/sec 31 4237767.8919 (  0.00%) 4262954.1215 (  0.59%) 2622045.7226 (-38.13%) 4147492.6973 ( -2.13%) 4129507.3254 ( -2.55%) 4154591.8086 ( -1.96%)
Faults/sec 32 4193896.3492 (  0.00%) 4313804.9370 (  2.86%) 2486013.3793 (-40.72%) 4144234.0287 ( -1.18%) 4167653.2985 ( -0.63%) 4280308.2714 (  2.06%)
Faults/sec 33 4162942.9767 (  0.00%) 4324720.6943 (  3.89%) 2705706.6138 (-35.00%) 4148215.3556 ( -0.35%) 4160800.6591 ( -0.05%) 4188855.2428 (  0.62%)
Faults/sec 34 4204133.3523 (  0.00%) 4246486.4313 (  1.01%) 2801163.4164 (-33.37%) 4115498.6406 ( -2.11%) 4050464.9098 ( -3.66%) 4092430.9384 ( -2.66%)
Faults/sec 35 4189096.5835 (  0.00%) 4271877.3268 (  1.98%) 2763406.1657 (-34.03%) 4112864.6044 ( -1.82%) 4116065.7955 ( -1.74%) 4219699.5756 (  0.73%)
Faults/sec 36 4277421.2521 (  0.00%) 4373426.4356 (  2.24%) 2692221.4270 (-37.06%) 4129438.5970 ( -3.46%) 4108075.3296 ( -3.96%) 4149259.8944 ( -3.00%)
Faults/sec 37 4168551.9047 (  0.00%) 4319223.3874 (  3.61%) 2836764.2086 (-31.95%) 4109725.0377 ( -1.41%) 4156874.2769 ( -0.28%) 4149515.4613 ( -0.46%)
Faults/sec 38 4247525.5670 (  0.00%) 4229905.6978 ( -0.41%) 2938912.4587 (-30.81%) 4085058.1995 ( -3.82%) 4127366.4416 ( -2.83%) 4096271.9211 ( -3.56%)
Faults/sec 39 4190989.8515 (  0.00%) 4329385.1325 (  3.30%) 3061436.0988 (-26.95%) 4099026.7324 ( -2.19%) 4094648.2005 ( -2.30%) 4240087.0764 (  1.17%)
Faults/sec 40 4238307.5210 (  0.00%) 4337475.3368 (  2.34%) 2988097.1336 (-29.50%) 4203501.6812 ( -0.82%) 4120604.7912 ( -2.78%) 4193144.8164 ( -1.07%)
Faults/sec 41 4317393.3854 (  0.00%) 4282458.5094 ( -0.81%) 2949899.0149 (-31.67%) 4120836.6477 ( -4.55%) 4248620.8455 ( -1.59%) 4206700.7050 ( -2.56%)
Faults/sec 42 4299075.7581 (  0.00%) 4181602.0005 ( -2.73%) 3037710.0530 (-29.34%) 4205958.7415 ( -2.17%) 4181449.1786 ( -2.74%) 4155578.2275 ( -3.34%)
Faults/sec 43 4234922.1492 (  0.00%) 4301130.5970 (  1.56%) 2996342.1505 (-29.25%) 4170975.0653 ( -1.51%) 4210039.9002 ( -0.59%) 4203158.8656 ( -0.75%)
Faults/sec 44 4270913.7498 (  0.00%) 4376035.4745 (  2.46%) 3054249.1521 (-28.49%) 4193693.1721 ( -1.81%) 4154034.6390 ( -2.74%) 4207031.5562 ( -1.50%)
Faults/sec 45 4313055.5348 (  0.00%) 4342993.1271 (  0.69%) 3263986.2960 (-24.32%) 4172891.7566 ( -3.25%) 4262028.6193 ( -1.18%) 4293905.9657 ( -0.44%)
Faults/sec 46 4323716.1160 (  0.00%) 4306994.5183 ( -0.39%) 3198502.0716 (-26.02%) 4212553.2514 ( -2.57%) 4216000.7652 ( -2.49%) 4277511.4815 ( -1.07%)
Faults/sec 47 4364354.4986 (  0.00%) 4290609.7996 ( -1.69%) 3274654.5504 (-24.97%) 4185908.2435 ( -4.09%) 4235166.8662 ( -2.96%) 4267607.2786 ( -2.22%)
Faults/sec 48 4280234.1143 (  0.00%) 4312820.1724 (  0.76%) 3168212.5669 (-25.98%) 4272168.2365 ( -0.19%) 4235504.6092 ( -1.05%) 4322535.9118 (  0.99%)

More or less the same story.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User         1076.65      935.93     1276.09     1089.84     1134.60     1097.18
System      18726.05    18738.26    22038.05    19395.18    19281.62    18688.61
Elapsed      1353.67     1346.72     1798.95     2022.47     2010.67     1355.63

autonuma's system CPU usage overhead is obvious here. balancenuma and
numacore are OK, although it's interesting to note that balancenuma required
the delaystart logic to keep the usage down here.
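
The overhead relative to the baseline kernel can be worked out from the
duration table above. A minimal Python sketch, assuming nothing beyond the
System and Elapsed figures in that table (the short kernel labels and the
helper name pct_change are shorthand introduced here, not MMTests output):

```python
# System CPU and elapsed seconds copied from the "MMTests Statistics: duration"
# table. The short kernel labels are shorthand chosen for this sketch.
duration = {
    "stats-v5r1":         {"system": 18726.05, "elapsed": 1353.67},  # baseline
    "numacore-20121123":  {"system": 18738.26, "elapsed": 1346.72},
    "autonuma-v28fastr4": {"system": 22038.05, "elapsed": 1798.95},
    "thpmigrate-v5r1":    {"system": 19395.18, "elapsed": 2022.47},
    "adaptscan-v5r1":     {"system": 19281.62, "elapsed": 2010.67},
    "delaystart-v5r4":    {"system": 18688.61, "elapsed": 1355.63},
}


def pct_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


base = duration["stats-v5r1"]
overhead = {
    kernel: {metric: pct_change(row[metric], base[metric]) for metric in row}
    for kernel, row in duration.items()
}

for kernel, row in overhead.items():
    print(f"{kernel:20s} system {row['system']:+7.2f}%  elapsed {row['elapsed']:+7.2f}%")
```

On these figures autonuma's system time comes out roughly 18% above the
baseline. thpmigrate and adaptscan stay within a few percent on system time
but take nearly 50% longer in elapsed time, and delaystart brings both back
to within about 1% of the baseline.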

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins                           680         536         536         540         540         540
Page Outs                        16004       15496       19048       19052       19888       15892
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      0           0           0           0           0           0
THP collapse alloc                   0           0           0           0           0           0
THP splits                           0           0           0           1           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0        1093         986         613
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0           1           1           0
NUMA PTE updates                     0           0           0   505196235   493301672      515709
NUMA hint faults                     0           0           0     2549799     2482875      105795
NUMA hint local faults               0           0           0     2545441     2480546      102428
NUMA pages migrated                  0           0           0        1093         986         613
AutoNUMA cost                        0           0           0       16285       15867         532
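
The raw NUMA counters above are easier to compare as ratios. A rough Python
sketch follows, using only the values in the vmstat table. The reading of
"NUMA hint local faults" as the share of hinting faults that found the page
already on the local node is my interpretation of the counter names, and the
dictionary keys are labels made up for this sketch:

```python
# NUMA counters copied from the vmstat table; only the three balancenuma
# kernels report non-zero values. Labels are shorthand for this sketch.
vmstat = {
    "thpmigrate-v5r1": {"pte_updates": 505196235, "hint_faults": 2549799,
                        "hint_local": 2545441, "migrated": 1093},
    "adaptscan-v5r1":  {"pte_updates": 493301672, "hint_faults": 2482875,
                        "hint_local": 2480546, "migrated": 986},
    "delaystart-v5r4": {"pte_updates": 515709, "hint_faults": 105795,
                        "hint_local": 102428, "migrated": 613},
}


def summarise(counters):
    """Derive ratios that are easier to compare than the raw counters."""
    return {
        # Share of hinting faults counted as local (assumed meaning).
        "local_pct": counters["hint_local"] / counters["hint_faults"] * 100.0,
        # PTE updates performed per hinting fault actually taken.
        "updates_per_fault": counters["pte_updates"] / counters["hint_faults"],
    }


summary = {kernel: summarise(c) for kernel, c in vmstat.items()}

# Factor by which delaystart cut the number of PTE updates on this workload.
scan_reduction = (vmstat["thpmigrate-v5r1"]["pte_updates"]
                  / vmstat["delaystart-v5r4"]["pte_updates"])

for kernel, row in summary.items():
    print(f"{kernel:16s} local {row['local_pct']:6.2f}%  "
          f"updates/fault {row['updates_per_fault']:8.1f}")
print(f"PTE update reduction with delaystart: {scan_reduction:.0f}x")
```

Nearly all hinting faults were already local in every case, so on this
workload most of the PTE scanning done by thpmigrate and adaptscan bought
nothing. delaystart avoids almost all of it.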

There you have it. Some good results, some great, some bad results, some
disastrous. Of course this is for only one machine and other machines
might report differently. I've outlined what other factors could impact the
results and will re-run tests if there is a complaint about one of them.

I'll keep my overall comments to balancenuma. I think it did pretty well
overall. It generally was an improvement on the baseline kernel and in only
one case did it heavily regress (specjbb, single JVM, no THP). Here it hit
its worst-case scenario of always dealing with PTE faults, almost always
migrating and not reducing the scan rate. I could try to be clever about
this, I could ignore it or I could hit it with a hammer. I have a hammer.

Other comments?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: Comparison between three trees (was: Latest numa/core release, v17)
  2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
@ 2012-11-25  8:47   ` Hillf Danton
  2012-11-26  9:38     ` Mel Gorman
  2012-11-25 23:37   ` Mel Gorman
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 55+ messages in thread
From: Hillf Danton @ 2012-11-25  8:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli, linux-kernel,
	linux-mm, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton

On 11/24/12, Mel Gorman <mgorman@suse.de> wrote:
> Warning: This is an insanely long mail and there is a lot of data here. Get
> 	coffee or something.
>
> This is another round of comparisons between the latest released versions
> of each of three automatic numa balancing trees that are out there.
>
> From the series "Automatic NUMA Balancing V5", the kernels tested were
>
> stats-v5r1	Patches 1-10. TLB optimisations, migration stats
> thpmigrate-v5r1	Patches 1-37. Basic placement policy, PMD handling, THP
> migration etc.
> adaptscan-v5r1	Patches 1-38. Heavy handed PTE scan reduction
> delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new
> node
>
> If I just say balancenuma, I mean the "delaystart-v5r1" kernel. The other
> kernels are included so you can see the impact the scan rate adaption
> patch has and what that might mean for a placement policy using a proper
> feedback mechanism.
>
> The other two kernels were
>
> numacore-20121123 It was no longer clear what the deltas between releases
> and
> 	the dependencies might be so I just pulled tip/master on November
> 	23rd, 2012. An earlier pull had serious difficulties and the patch
> 	responsible has been dropped since. This is not a like-with-like
> 	comparison as the tree contains numerous other patches but it's
> 	the best available given the timeframe
>
> autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
> 	branch with Hugh's THP migration patch on top.

FYI, based on how the target huge page is selected,

+
+	new_page = alloc_pages_node(numa_node_id(),
+		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);

the THP replacement policy is changed to MORON (migrate on reference),

+	/* Migrate the page towards the node whose CPU is referencing it */
+	if (pol->flags & MPOL_F_MORON)
+		polnid = numa_node_id();


described in
	[PATCH 29/46] mm: numa: Migrate on reference policy
	https://lkml.org/lkml/2012/11/21/228

Hillf


* Re: Comparison between three trees (was: Latest numa/core release, v17)
  2012-11-25  8:47   ` Hillf Danton
@ 2012-11-26  9:38     ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2012-11-26  9:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli, linux-kernel,
	linux-mm, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Sun, Nov 25, 2012 at 04:47:15PM +0800, Hillf Danton wrote:
> On 11/24/12, Mel Gorman <mgorman@suse.de> wrote:
> > Warning: This is an insanely long mail and there is a lot of data here. Get
> > 	coffee or something.
> >
> > This is another round of comparisons between the latest released versions
> > of each of three automatic numa balancing trees that are out there.
> >
> > From the series "Automatic NUMA Balancing V5", the kernels tested were
> >
> > stats-v5r1	Patches 1-10. TLB optimisations, migration stats
> > thpmigrate-v5r1	Patches 1-37. Basic placement policy, PMD handling, THP
> > migration etc.
> > adaptscan-v5r1	Patches 1-38. Heavy handed PTE scan reduction
> > delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new
> > node
> >
> > If I just say balancenuma, I mean the "delaystart-v5r1" kernel. The other
> > kernels are included so you can see the impact the scan rate adaption
> > patch has and what that might mean for a placement policy using a proper
> > feedback mechanism.
> >
> > The other two kernels were
> >
> > numacore-20121123 It was no longer clear what the deltas between releases
> > and
> > 	the dependencies might be so I just pulled tip/master on November
> > 	23rd, 2012. An earlier pull had serious difficulties and the patch
> > 	responsible has been dropped since. This is not a like-with-like
> > 	comparison as the tree contains numerous other patches but it's
> > 	the best available given the timeframe
> >
> > autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
> > 	branch with Hugh's THP migration patch on top.
> 
> FYI, based on how the target huge page is selected,
> 
> +
> +	new_page = alloc_pages_node(numa_node_id(),
> +		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
> 
> the thp replacement policy is changed to be MORON,
> 

That is likely true. When rebasing a policy on top of balancenuma it is
important to keep an eye on what node is used for target migration and
what node is passed to task_numa_fault() and confirm this is the node
the policy expects.

-- 
Mel Gorman
SUSE Labs


* Re: Comparison between three trees (was: Latest numa/core release, v17)
  2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
  2012-11-25  8:47   ` Hillf Danton
@ 2012-11-25 23:37   ` Mel Gorman
  2012-11-25 23:40   ` Mel Gorman
  2012-11-26 13:33   ` Mel Gorman
  3 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2012-11-25 23:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Paul Turner, Lee Schermerhorn,
	Christoph Lameter, Rik van Riel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Johannes Weiner, Hugh Dickins

On Fri, Nov 23, 2012 at 05:32:05PM +0000, Mel Gorman wrote:
> From here, we're onto the single JVM configuration. I suspect
> this is tested much more commonly but note that it behaves very
> differently to the multi JVM configuration as explained by Andrea
> (http://choon.net/forum/read.php?21,1599976,page=4).
> 
> A concern with the single JVM results as reported here is the maximum
> number of warehouses. In the Multi JVM configuration, the expected peak
> was 12 warehouses so I ran up to 18 so that the tests could complete in a
> reasonable amount of time. The expected peak for a single JVM is 48 (the
> number of CPUs) but the configuration file was derived from the multi JVM
> configuration so it was restricted to running up to 18 warehouses. Again,
> the reason was so it would complete in a reasonable amount of time but
> specjbb does not give a score for this type of configuration and I am
> only reporting on the 1-18 warehouses it ran for. I've reconfigured the
> 4 specjbb configs to run a full config and it'll run over the weekend.
> 

The use of just peak figures really is a factor. The THP-enabled, single
JVM configuration is the best configuration for numacore but this is only
visible at peak numbers of warehouses. For lower numbers of warehouses it
regresses, but this is not reported by the specjbb benchmark and could
easily have been missed. It also mostly explains why I was seeing very
different figures to other testers.

More below.

> SPECJBB: Single JVMs (one per node, 4 nodes), THP is enabled
> 
> SPECJBB BOPS
>                         3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
>                rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
> TPut 1      26802.00 (  0.00%)     22808.00 (-14.90%)     24482.00 ( -8.66%)     25723.00 ( -4.03%)     24387.00 ( -9.01%)     25940.00 ( -3.22%)
> TPut 2      57720.00 (  0.00%)     51245.00 (-11.22%)     55018.00 ( -4.68%)     55498.00 ( -3.85%)     55259.00 ( -4.26%)     55581.00 ( -3.71%)
> TPut 3      86940.00 (  0.00%)     79172.00 ( -8.93%)     87705.00 (  0.88%)     86101.00 ( -0.97%)     86894.00 ( -0.05%)     86875.00 ( -0.07%)
> TPut 4     117203.00 (  0.00%)    107315.00 ( -8.44%)    117382.00 (  0.15%)    116282.00 ( -0.79%)    116322.00 ( -0.75%)    115263.00 ( -1.66%)
> TPut 5     145375.00 (  0.00%)    121178.00 (-16.64%)    145802.00 (  0.29%)    142378.00 ( -2.06%)    144947.00 ( -0.29%)    144211.00 ( -0.80%)
> TPut 6     169232.00 (  0.00%)    157796.00 ( -6.76%)    173409.00 (  2.47%)    171066.00 (  1.08%)    173341.00 (  2.43%)    169861.00 (  0.37%)
> TPut 7     195468.00 (  0.00%)    169834.00 (-13.11%)    197201.00 (  0.89%)    197536.00 (  1.06%)    198347.00 (  1.47%)    198047.00 (  1.32%)
> TPut 8     217863.00 (  0.00%)    169975.00 (-21.98%)    222559.00 (  2.16%)    224901.00 (  3.23%)    226268.00 (  3.86%)    218354.00 (  0.23%)
> TPut 9     240679.00 (  0.00%)    197498.00 (-17.94%)    245997.00 (  2.21%)    250022.00 (  3.88%)    253838.00 (  5.47%)    250264.00 (  3.98%)
> TPut 10    261454.00 (  0.00%)    204909.00 (-21.63%)    269551.00 (  3.10%)    275125.00 (  5.23%)    274658.00 (  5.05%)    274155.00 (  4.86%)
> TPut 11    281079.00 (  0.00%)    230118.00 (-18.13%)    281588.00 (  0.18%)    304383.00 (  8.29%)    297198.00 (  5.73%)    299131.00 (  6.42%)
> TPut 12    302007.00 (  0.00%)    275511.00 ( -8.77%)    313281.00 (  3.73%)    327826.00 (  8.55%)    325324.00 (  7.72%)    325372.00 (  7.74%)
> TPut 13    319139.00 (  0.00%)    293501.00 ( -8.03%)    332581.00 (  4.21%)    352389.00 ( 10.42%)    340169.00 (  6.59%)    351215.00 ( 10.05%)
> TPut 14    321069.00 (  0.00%)    312088.00 ( -2.80%)    337911.00 (  5.25%)    376198.00 ( 17.17%)    370669.00 ( 15.45%)    366491.00 ( 14.15%)
> TPut 15    345851.00 (  0.00%)    283856.00 (-17.93%)    369104.00 (  6.72%)    389772.00 ( 12.70%)    392963.00 ( 13.62%)    389254.00 ( 12.55%)
> TPut 16    346868.00 (  0.00%)    317127.00 ( -8.57%)    380930.00 (  9.82%)    420331.00 ( 21.18%)    412974.00 ( 19.06%)    408575.00 ( 17.79%)
> TPut 17    357755.00 (  0.00%)    349624.00 ( -2.27%)    387635.00 (  8.35%)    441223.00 ( 23.33%)    426558.00 ( 19.23%)    435985.00 ( 21.87%)
> TPut 18    357467.00 (  0.00%)    360056.00 (  0.72%)    399487.00 ( 11.75%)    464603.00 ( 29.97%)    442907.00 ( 23.90%)    453011.00 ( 26.73%)
> 
> numacore is not doing well here for low numbers of warehouses. However,
> note that by 18 warehouses it had drawn level and the expected peak is 48
> warehouses. The specjbb reported figure would be using the higher numbers
> of warehouses. I'll run a full range over the weekend and report back. If
> time permits, I'll also run a "monitors disabled" run in case the read of
> numa_maps every 10 seconds is crippling it.
> 

Over the weekend I ran a few configurations that used a large number of
warehouses. The numacore and autonuma kernels are as before.  The balancenuma
kernel is a reshuffled tree that moves the THP patches towards the end of the
series. It's functionally very similar to delaystart-v5r4 from the earlier
report. The differences are bug fixes from Hillf and accounting fixes.

In terms of testing, the big difference is the number of warehouses
tested. Here are the results.

SPECJBB: Single JVM, THP is enabled
                        3.7.0                 3.7.0                 3.7.0                 3.7.0
               rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
TPut 1      25598.00 (  0.00%)     24938.00 ( -2.58%)     24663.00 ( -3.65%)     25641.00 (  0.17%)
TPut 2      56182.00 (  0.00%)     50701.00 ( -9.76%)     55059.00 ( -2.00%)     56300.00 (  0.21%)
TPut 3      84856.00 (  0.00%)     80000.00 ( -5.72%)     86692.00 (  2.16%)     87656.00 (  3.30%)
TPut 4     115406.00 (  0.00%)    102629.00 (-11.07%)    118576.00 (  2.75%)    117089.00 (  1.46%)
TPut 5     143810.00 (  0.00%)    131824.00 ( -8.33%)    142516.00 ( -0.90%)    143652.00 ( -0.11%)
TPut 6     168681.00 (  0.00%)    138700.00 (-17.77%)    171938.00 (  1.93%)    171625.00 (  1.75%)
TPut 7     196629.00 (  0.00%)    158003.00 (-19.64%)    184263.00 ( -6.29%)    196422.00 ( -0.11%)
TPut 8     219888.00 (  0.00%)    173094.00 (-21.28%)    222689.00 (  1.27%)    226163.00 (  2.85%)
TPut 9     244790.00 (  0.00%)    201543.00 (-17.67%)    247785.00 (  1.22%)    252223.00 (  3.04%)
TPut 10    265824.00 (  0.00%)    224522.00 (-15.54%)    268362.00 (  0.95%)    273253.00 (  2.79%)
TPut 11    286745.00 (  0.00%)    240431.00 (-16.15%)    297968.00 (  3.91%)    303903.00 (  5.98%)
TPut 12    312593.00 (  0.00%)    278749.00 (-10.83%)    322880.00 (  3.29%)    324283.00 (  3.74%)
TPut 13    319508.00 (  0.00%)    297467.00 ( -6.90%)    337332.00 (  5.58%)    350443.00 (  9.68%)
TPut 14    348575.00 (  0.00%)    301683.00 (-13.45%)    374828.00 (  7.53%)    371199.00 (  6.49%)
TPut 15    350516.00 (  0.00%)    357707.00 (  2.05%)    370428.00 (  5.68%)    400114.00 ( 14.15%)
TPut 16    370886.00 (  0.00%)    326597.00 (-11.94%)    412694.00 ( 11.27%)    420616.00 ( 13.41%)
TPut 17    386422.00 (  0.00%)    363441.00 ( -5.95%)    427190.00 ( 10.55%)    444268.00 ( 14.97%)
TPut 18    387031.00 (  0.00%)    387802.00 (  0.20%)    449808.00 ( 16.22%)    459404.00 ( 18.70%)
TPut 19    397352.00 (  0.00%)    387513.00 ( -2.48%)    444231.00 ( 11.80%)    480527.00 ( 20.93%)
TPut 20    386512.00 (  0.00%)    409861.00 (  6.04%)    469152.00 ( 21.38%)    503000.00 ( 30.14%)
TPut 21    406441.00 (  0.00%)    453321.00 ( 11.53%)    475290.00 ( 16.94%)    517443.00 ( 27.31%)
TPut 22    399667.00 (  0.00%)    473069.00 ( 18.37%)    494780.00 ( 23.80%)    530384.00 ( 32.71%)
TPut 23    406795.00 (  0.00%)    459549.00 ( 12.97%)    498187.00 ( 22.47%)    545605.00 ( 34.12%)
TPut 24    410499.00 (  0.00%)    442373.00 (  7.76%)    506758.00 ( 23.45%)    555870.00 ( 35.41%)
TPut 25    400845.00 (  0.00%)    463657.00 ( 15.67%)    497653.00 ( 24.15%)    554370.00 ( 38.30%)
TPut 26    390073.00 (  0.00%)    488957.00 ( 25.35%)    500685.00 ( 28.36%)    553714.00 ( 41.95%)
TPut 27    391689.00 (  0.00%)    452545.00 ( 15.54%)    498155.00 ( 27.18%)    561167.00 ( 43.27%)
TPut 28    380903.00 (  0.00%)    483782.00 ( 27.01%)    494085.00 ( 29.71%)    546296.00 ( 43.42%)
TPut 29    381805.00 (  0.00%)    527448.00 ( 38.15%)    502872.00 ( 31.71%)    552729.00 ( 44.77%)
TPut 30    375810.00 (  0.00%)    483409.00 ( 28.63%)    494412.00 ( 31.56%)    548433.00 ( 45.93%)
TPut 31    378324.00 (  0.00%)    477776.00 ( 26.29%)    497701.00 ( 31.55%)    548419.00 ( 44.96%)
TPut 32    372322.00 (  0.00%)    444958.00 ( 19.51%)    488683.00 ( 31.25%)    536867.00 ( 44.19%)
TPut 33    359918.00 (  0.00%)    431751.00 ( 19.96%)    484478.00 ( 34.61%)    538970.00 ( 49.75%)
TPut 34    357685.00 (  0.00%)    452866.00 ( 26.61%)    476558.00 ( 33.23%)    521906.00 ( 45.91%)
TPut 35    354902.00 (  0.00%)    456795.00 ( 28.71%)    484244.00 ( 36.44%)    533609.00 ( 50.35%)
TPut 36    337517.00 (  0.00%)    469182.00 ( 39.01%)    454640.00 ( 34.70%)    526363.00 ( 55.95%)
TPut 37    332136.00 (  0.00%)    456822.00 ( 37.54%)    458413.00 ( 38.02%)    519400.00 ( 56.38%)
TPut 38    330084.00 (  0.00%)    453377.00 ( 37.35%)    434666.00 ( 31.68%)    512187.00 ( 55.17%)
TPut 39    319024.00 (  0.00%)    412778.00 ( 29.39%)    428688.00 ( 34.37%)    509798.00 ( 59.80%)
TPut 40    315002.00 (  0.00%)    391376.00 ( 24.25%)    398529.00 ( 26.52%)    480411.00 ( 52.51%)
TPut 41    299693.00 (  0.00%)    353819.00 ( 18.06%)    403541.00 ( 34.65%)    492599.00 ( 64.37%)
TPut 42    298226.00 (  0.00%)    347563.00 ( 16.54%)    362189.00 ( 21.45%)    476979.00 ( 59.94%)
TPut 43    295595.00 (  0.00%)    401208.00 ( 35.73%)    393026.00 ( 32.96%)    459142.00 ( 55.33%)
TPut 44    296490.00 (  0.00%)    419443.00 ( 41.47%)    341222.00 ( 15.09%)    452357.00 ( 52.57%)
TPut 45    292584.00 (  0.00%)    420579.00 ( 43.75%)    393112.00 ( 34.36%)    468680.00 ( 60.19%)
TPut 46    287256.00 (  0.00%)    384628.00 ( 33.90%)    375230.00 ( 30.63%)    433550.00 ( 50.93%)
TPut 47    277411.00 (  0.00%)    349226.00 ( 25.89%)    392540.00 ( 41.50%)    449038.00 ( 61.87%)
TPut 48    277058.00 (  0.00%)    396594.00 ( 43.14%)    398184.00 ( 43.72%)    457085.00 ( 64.98%)
TPut 49    279962.00 (  0.00%)    402671.00 ( 43.83%)    394294.00 ( 40.84%)    425650.00 ( 52.04%)
TPut 50    279948.00 (  0.00%)    372190.00 ( 32.95%)    420082.00 ( 50.06%)    447108.00 ( 59.71%)
TPut 51    282160.00 (  0.00%)    362593.00 ( 28.51%)    404464.00 ( 43.35%)    460767.00 ( 63.30%)
TPut 52    275574.00 (  0.00%)    343943.00 ( 24.81%)    397754.00 ( 44.34%)    425609.00 ( 54.44%)
TPut 53    283902.00 (  0.00%)    355129.00 ( 25.09%)    410938.00 ( 44.75%)    427099.00 ( 50.44%)
TPut 54    277341.00 (  0.00%)    371739.00 ( 34.04%)    398662.00 ( 43.74%)    427941.00 ( 54.30%)
TPut 55    272116.00 (  0.00%)    417531.00 ( 53.44%)    390286.00 ( 43.43%)    436491.00 ( 60.41%)
TPut 56    280207.00 (  0.00%)    347432.00 ( 23.99%)    404331.00 ( 44.30%)    439342.00 ( 56.79%)
TPut 57    282146.00 (  0.00%)    329932.00 ( 16.94%)    379562.00 ( 34.53%)    407568.00 ( 44.45%)
TPut 58    275901.00 (  0.00%)    373810.00 ( 35.49%)    394333.00 ( 42.93%)    428118.00 ( 55.17%)
TPut 59    276583.00 (  0.00%)    359812.00 ( 30.09%)    376969.00 ( 36.30%)    429891.00 ( 55.43%)
TPut 60    272523.00 (  0.00%)    368938.00 ( 35.38%)    385033.00 ( 41.28%)    427636.00 ( 56.92%)
TPut 61    272427.00 (  0.00%)    387343.00 ( 42.18%)    376525.00 ( 38.21%)    417755.00 ( 53.35%)
TPut 62    258730.00 (  0.00%)    390303.00 ( 50.85%)    373770.00 ( 44.46%)    438145.00 ( 69.34%)
TPut 63    269246.00 (  0.00%)    389464.00 ( 44.65%)    381536.00 ( 41.71%)    433943.00 ( 61.17%)
TPut 64    266261.00 (  0.00%)    387660.00 ( 45.59%)    387200.00 ( 45.42%)    399805.00 ( 50.16%)
TPut 65    259147.00 (  0.00%)    373458.00 ( 44.11%)    389666.00 ( 50.36%)    400191.00 ( 54.43%)
TPut 66    273445.00 (  0.00%)    374637.00 ( 37.01%)    359764.00 ( 31.57%)    419330.00 ( 53.35%)
TPut 67    269350.00 (  0.00%)    380035.00 ( 41.09%)    391560.00 ( 45.37%)    391418.00 ( 45.32%)
TPut 68    275532.00 (  0.00%)    379096.00 ( 37.59%)    396028.00 ( 43.73%)    390213.00 ( 41.62%)
TPut 69    274195.00 (  0.00%)    368116.00 ( 34.25%)    393802.00 ( 43.62%)    391539.00 ( 42.80%)
TPut 70    269523.00 (  0.00%)    372521.00 ( 38.21%)    381988.00 ( 41.73%)    360330.00 ( 33.69%)
TPut 71    264778.00 (  0.00%)    372533.00 ( 40.70%)    377377.00 ( 42.53%)    395088.00 ( 49.21%)
TPut 72    265705.00 (  0.00%)    359686.00 ( 35.37%)    390037.00 ( 46.79%)    399126.00 ( 50.21%)

Note that for lower numbers of warehouses numacore regresses and then
improves as the number of warehouses increases. The expected peak is at
48 warehouses, matching the 48 cores, and note how numacore gets a 43.14%
improvement here, autonuma sees a 43.72% gain and balancenuma sees a
64.98% gain.

This explains why there was a big difference in reported figures. First, I
was using Multiple JVMs as ordinarily one would expect one JVM per node and
to have each JVM bound to a node. Multiple JVMs and Single JVMs generate
very different results. Second, there are massive differences depending on
whether THP is enabled or disabled. Lastly, as we can see here, numacore
regresses for small numbers of warehouses, which is what I initially saw,
but does very well as the number of warehouses increases. specjbb reports
based on the peak number of warehouses so if people were using just the
specjbb score or were only testing the peak number of warehouses, they
would see the performance gains but miss the regressions.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4       rc6-thpmigrate-v6r10
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               277058.00 (  0.00%)               396594.00 ( 43.14%)               398184.00 ( 43.72%)               457085.00 ( 64.98%)
 Actual Warehouse                   24.00 (  0.00%)                   29.00 ( 20.83%)                   24.00 (  0.00%)                   27.00 ( 12.50%)
 Actual Peak Bops               410499.00 (  0.00%)               527448.00 ( 28.49%)               506758.00 ( 23.45%)               561167.00 ( 36.70%)
 SpecJBB Bops                   139464.00 (  0.00%)               190554.00 ( 36.63%)               199064.00 ( 42.74%)               213820.00 ( 53.32%)
 SpecJBB Bops/JVM               139464.00 (  0.00%)               190554.00 ( 36.63%)               199064.00 ( 42.74%)               213820.00 ( 53.32%)

Here you can see that numacore scales to a higher number of warehouses
and sees a 43.14% performance gain at the expected peak (28.49% at its
actual peak) and a 36.63% gain on the specjbb score. The peaks are great,
just not the smaller number of warehouses.

autonuma sees a 23.45% performance gain at the peak and a 42.74%
performance gain on the specjbb score.

balancenuma gets a 36.7% performance gain at the peak and a 53.32%
gain on the specjbb score.
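
The percentages in the PEAKS table are plain relative changes against the
rc6-stats-v5r1 baseline, so they can be re-derived by hand. A minimal sketch,
assuming Python and a helper called pct_change (both are illustration choices
and not part of mmtests or SPECjbb); the numbers are copied from the table
above:

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent, 2 decimals."""
    return round((value - baseline) / baseline * 100, 2)

# Single JVM, THP enabled. Figures copied from the SPECJBB PEAKS table.
expected_peak = {
    "baseline": 277058.0,
    "numacore": 396594.0,
    "autonuma": 398184.0,
    "balancenuma": 457085.0,
}
actual_peak = {
    "baseline": 410499.0,
    "numacore": 527448.0,
    "autonuma": 506758.0,
    "balancenuma": 561167.0,
}

for tree in ("numacore", "autonuma", "balancenuma"):
    at_expected = pct_change(expected_peak["baseline"], expected_peak[tree])
    at_actual = pct_change(actual_peak["baseline"], actual_peak[tree])
    print(f"{tree:12s} expected peak {at_expected:6.2f}%  actual peak {at_actual:6.2f}%")
```

Keeping the two peaks apart matters when quoting a single number: the
expected peak is read at 48 warehouses for every kernel, while the actual
peak is each kernel's best result at whatever warehouse count it occurred.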

MMTests Statistics: duration
                   3.7.0                  3.7.0                   3.7.0                 3.7.0
          rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
User           317241.10              311543.98               314980.59             315357.34
System            105.47                2989.96                  341.54                431.13
Elapsed          7432.59                7439.32                 7433.84               7433.72

Same comments about the system CPU usage. numacore's is really high.
balancenuma's is higher than I'd like.

MMTests Statistics: vmstat
                                     3.7.0                  3.7.0                   3.7.0                 3.7.0
                            rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
Page Ins                             38252                  38036                   38212                 37976
Page Outs                            55364                  59772                   55704                 54824
Swap Ins                                 0                      0                       0                     0
Swap Outs                                0                      0                       0                     0
Direct pages scanned                     0                      0                       0                     0
Kswapd pages scanned                     0                      0                       0                     0
Kswapd pages reclaimed                   0                      0                       0                     0
Direct pages reclaimed                   0                      0                       0                     0
Kswapd efficiency                     100%                   100%                    100%                  100%
Kswapd velocity                      0.000                  0.000                   0.000                 0.000
Direct efficiency                     100%                   100%                    100%                  100%
Direct velocity                      0.000                  0.000                   0.000                 0.000
Percentage direct scans                 0%                     0%                      0%                    0%
Page writes by reclaim                   0                      0                       0                     0
Page writes file                         0                      0                       0                     0
Page writes anon                         0                      0                       0                     0
Page reclaim immediate                   0                      0                       0                     0
Page rescued immediate                   0                      0                       0                     0
Slabs scanned                            0                      0                       0                     0
Direct inode steals                      0                      0                       0                     0
Kswapd inode steals                      0                      0                       0                     0
Kswapd skipped wait                      0                      0                       0                     0
THP fault alloc                      51908                  43137                   46165                 49523
THP collapse alloc                      62                      3                     179                    59
THP splits                              72                     45                      86                    75
THP fault fallback                       0                      0                       0                     0
THP collapse fail                        0                      0                       0                     0
Compaction stalls                        0                      0                       0                     0
Compaction success                       0                      0                       0                     0
Compaction failures                      0                      0                       0                     0
Page migrate success                     0                      0                       0              46917509
Page migrate failure                     0                      0                       0                     0
Compaction pages isolated                0                      0                       0                     0
Compaction migrate scanned               0                      0                       0                     0
Compaction free scanned                  0                      0                       0                     0
Compaction cost                          0                      0                       0                 48700
NUMA PTE updates                         0                      0                       0             356453719
NUMA hint faults                         0                      0                       0               2056190
NUMA hint local faults                   0                      0                       0                752408
NUMA pages migrated                      0                      0                       0              46917509
AutoNUMA cost                            0                      0                       0                 13667

Note that THP was certainly in use here. balancenuma migrated a lot more
than I'd like but it cannot be compared with numacore or autonuma at
this point.
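
To put the balancenuma counters in proportion, they can be turned into a
local-fault ratio and per-second rates. This is only a rough sketch in
Python: the function name derived_rates is made up for illustration, and
dividing by the Elapsed figure from the duration table assumes the counters
cover the whole run.

```python
def derived_rates(elapsed_s, pte_updates, hint_faults, hint_local_faults,
                  pages_migrated):
    """Turn raw NUMA counters into ratios and per-second rates."""
    return {
        # Share of NUMA hinting faults that were already node-local.
        "local_fault_pct": hint_local_faults / hint_faults * 100,
        # Average pages migrated per second of wall-clock time.
        "migrations_per_s": pages_migrated / elapsed_s,
        # Average PTE updates per second of wall-clock time.
        "pte_updates_per_s": pte_updates / elapsed_s,
    }

# rc6-thpmigrate-v6r10 column of the vmstat table, Elapsed from the
# duration table (single JVM, THP enabled).
rates = derived_rates(
    elapsed_s=7433.72,
    pte_updates=356453719,
    hint_faults=2056190,
    hint_local_faults=752408,
    pages_migrated=46917509,
)

for name, value in rates.items():
    print(f"{name:20s} {value:12.2f}")
```

With these inputs a bit over a third of the hinting faults were local and
the run averaged several thousand page migrations per second.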


SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled
                        3.7.0                 3.7.0                 3.7.0                 3.7.0
               rc6-stats-v5r1 rc6-numacore-20121123rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
TPut 1      20507.00 (  0.00%)     16702.00 (-18.55%)     19496.00 ( -4.93%)     19831.00 ( -3.30%)
TPut 2      48723.00 (  0.00%)     36714.00 (-24.65%)     49452.00 (  1.50%)     45973.00 ( -5.64%)
TPut 3      72618.00 (  0.00%)     59086.00 (-18.63%)     69728.00 ( -3.98%)     71996.00 ( -0.86%)
TPut 4      98383.00 (  0.00%)     76940.00 (-21.80%)     98216.00 ( -0.17%)     95339.00 ( -3.09%)
TPut 5     122240.00 (  0.00%)     95981.00 (-21.48%)    119822.00 ( -1.98%)    117487.00 ( -3.89%)
TPut 6     144010.00 (  0.00%)    100095.00 (-30.49%)    141127.00 ( -2.00%)    143931.00 ( -0.05%)
TPut 7     164690.00 (  0.00%)    119577.00 (-27.39%)    159922.00 ( -2.90%)    164073.00 ( -0.37%)
TPut 8     190702.00 (  0.00%)    125183.00 (-34.36%)    189187.00 ( -0.79%)    180400.00 ( -5.40%)
TPut 9     209898.00 (  0.00%)    137179.00 (-34.64%)    160205.00 (-23.67%)    206052.00 ( -1.83%)
TPut 10    234064.00 (  0.00%)    140225.00 (-40.09%)    220768.00 ( -5.68%)    218224.00 ( -6.77%)
TPut 11    252408.00 (  0.00%)    134453.00 (-46.73%)    250953.00 ( -0.58%)    248507.00 ( -1.55%)
TPut 12    278689.00 (  0.00%)    140355.00 (-49.64%)    271815.00 ( -2.47%)    255907.00 ( -8.17%)
TPut 13    298940.00 (  0.00%)    153780.00 (-48.56%)    190433.00 (-36.30%)    289418.00 ( -3.19%)
TPut 14    315971.00 (  0.00%)    126929.00 (-59.83%)    309899.00 ( -1.92%)    283315.00 (-10.34%)
TPut 15    340446.00 (  0.00%)    132710.00 (-61.02%)    290484.00 (-14.68%)    327168.00 ( -3.90%)
TPut 16    362010.00 (  0.00%)    156255.00 (-56.84%)    347844.00 ( -3.91%)    311160.00 (-14.05%)
TPut 17    376476.00 (  0.00%)     95441.00 (-74.65%)    333508.00 (-11.41%)    366629.00 ( -2.62%)
TPut 18    399230.00 (  0.00%)    132993.00 (-66.69%)    374946.00 ( -6.08%)    358280.00 (-10.26%)
TPut 19    414300.00 (  0.00%)    129194.00 (-68.82%)    392675.00 ( -5.22%)    363700.00 (-12.21%)
TPut 20    429780.00 (  0.00%)     90068.00 (-79.04%)    241891.00 (-43.72%)    413210.00 ( -3.86%)
TPut 21    439977.00 (  0.00%)    136793.00 (-68.91%)    412629.00 ( -6.22%)    398914.00 ( -9.33%)
TPut 22    459593.00 (  0.00%)    134292.00 (-70.78%)    426511.00 ( -7.20%)    414652.00 ( -9.78%)
TPut 23    473600.00 (  0.00%)    137794.00 (-70.90%)    436081.00 ( -7.92%)    421456.00 (-11.01%)
TPut 24    483442.00 (  0.00%)    139342.00 (-71.18%)    390536.00 (-19.22%)    453552.00 ( -6.18%)
TPut 25    484584.00 (  0.00%)    144745.00 (-70.13%)    430863.00 (-11.09%)    397971.00 (-17.87%)
TPut 26    483041.00 (  0.00%)    145326.00 (-69.91%)    333960.00 (-30.86%)    454575.00 ( -5.89%)
TPut 27    480788.00 (  0.00%)    145395.00 (-69.76%)    402433.00 (-16.30%)    415528.00 (-13.57%)
TPut 28    470141.00 (  0.00%)    146261.00 (-68.89%)    385008.00 (-18.11%)    445938.00 ( -5.15%)
TPut 29    476984.00 (  0.00%)    147988.00 (-68.97%)    379719.00 (-20.39%)    395984.00 (-16.98%)
TPut 30    471709.00 (  0.00%)    148658.00 (-68.49%)    417249.00 (-11.55%)    424000.00 (-10.11%)
TPut 31    470451.00 (  0.00%)    147949.00 (-68.55%)    408792.00 (-13.11%)    384502.00 (-18.27%)
TPut 32    468377.00 (  0.00%)    158685.00 (-66.12%)    414694.00 (-11.46%)    405441.00 (-13.44%)
TPut 33    463536.00 (  0.00%)    159097.00 (-65.68%)    412259.00 (-11.06%)    399323.00 (-13.85%)
TPut 34    457678.00 (  0.00%)    153025.00 (-66.56%)    408133.00 (-10.83%)    402190.00 (-12.12%)
TPut 35    448181.00 (  0.00%)    154037.00 (-65.63%)    405535.00 ( -9.52%)    422016.00 ( -5.84%)
TPut 36    450490.00 (  0.00%)    149057.00 (-66.91%)    407218.00 ( -9.61%)    381320.00 (-15.35%)
TPut 37    435425.00 (  0.00%)    153996.00 (-64.63%)    400370.00 ( -8.05%)    403088.00 ( -7.43%)
TPut 38    434985.00 (  0.00%)    158683.00 (-63.52%)    408266.00 ( -6.14%)    406860.00 ( -6.47%)
TPut 39    425064.00 (  0.00%)    160263.00 (-62.30%)    397737.00 ( -6.43%)    385657.00 ( -9.27%)
TPut 40    428366.00 (  0.00%)    161150.00 (-62.38%)    383404.00 (-10.50%)    405984.00 ( -5.22%)
TPut 41    417072.00 (  0.00%)    155817.00 (-62.64%)    394627.00 ( -5.38%)    398389.00 ( -4.48%)
TPut 42    398350.00 (  0.00%)    156774.00 (-60.64%)    388583.00 ( -2.45%)    329310.00 (-17.33%)
TPut 43    405526.00 (  0.00%)    162938.00 (-59.82%)    371761.00 ( -8.33%)    396379.00 ( -2.26%)
TPut 44    400696.00 (  0.00%)    167164.00 (-58.28%)    372067.00 ( -7.14%)    373746.00 ( -6.73%)
TPut 45    391357.00 (  0.00%)    163075.00 (-58.33%)    365494.00 ( -6.61%)    348089.00 (-11.06%)
TPut 46    394109.00 (  0.00%)    173557.00 (-55.96%)    357955.00 ( -9.17%)    372188.00 ( -5.56%)
TPut 47    383292.00 (  0.00%)    168575.00 (-56.02%)    357946.00 ( -6.61%)    352658.00 ( -7.99%)
TPut 48    373607.00 (  0.00%)    158491.00 (-57.58%)    358227.00 ( -4.12%)    373779.00 (  0.05%)
TPut 49    372131.00 (  0.00%)    145881.00 (-60.80%)    360147.00 ( -3.22%)    358224.00 ( -3.74%)
TPut 50    369060.00 (  0.00%)    139450.00 (-62.21%)    355721.00 ( -3.61%)    367608.00 ( -0.39%)
TPut 51    375906.00 (  0.00%)    139823.00 (-62.80%)    367783.00 ( -2.16%)    364796.00 ( -2.96%)
TPut 52    379731.00 (  0.00%)    158706.00 (-58.21%)    381289.00 (  0.41%)    370100.00 ( -2.54%)
TPut 53    366656.00 (  0.00%)    178068.00 (-51.43%)    382147.00 (  4.22%)    369301.00 (  0.72%)
TPut 54    373531.00 (  0.00%)    177087.00 (-52.59%)    374892.00 (  0.36%)    367863.00 ( -1.52%)
TPut 55    374440.00 (  0.00%)    174830.00 (-53.31%)    372036.00 ( -0.64%)    377606.00 (  0.85%)
TPut 56    351285.00 (  0.00%)    175761.00 (-49.97%)    370602.00 (  5.50%)    371896.00 (  5.87%)
TPut 57    366069.00 (  0.00%)    172227.00 (-52.95%)    377253.00 (  3.06%)    364024.00 ( -0.56%)
TPut 58    367753.00 (  0.00%)    174523.00 (-52.54%)    376854.00 (  2.47%)    372580.00 (  1.31%)
TPut 59    364282.00 (  0.00%)    176119.00 (-51.65%)    365806.00 (  0.42%)    370299.00 (  1.65%)
TPut 60    372531.00 (  0.00%)    175673.00 (-52.84%)    354662.00 ( -4.80%)    365126.00 ( -1.99%)
TPut 61    359648.00 (  0.00%)    174686.00 (-51.43%)    365387.00 (  1.60%)    370039.00 (  2.89%)
TPut 62    361856.00 (  0.00%)    171420.00 (-52.63%)    366173.00 (  1.19%)    345029.00 ( -4.65%)
TPut 63    363032.00 (  0.00%)    171603.00 (-52.73%)    360794.00 ( -0.62%)    349379.00 ( -3.76%)
TPut 64    351549.00 (  0.00%)    170967.00 (-51.37%)    354632.00 (  0.88%)    352406.00 (  0.24%)
TPut 65    360425.00 (  0.00%)    170349.00 (-52.74%)    346205.00 ( -3.95%)    351510.00 ( -2.47%)
TPut 66    359197.00 (  0.00%)    170037.00 (-52.66%)    355970.00 ( -0.90%)    330963.00 ( -7.86%)
TPut 67    356962.00 (  0.00%)    168949.00 (-52.67%)    355577.00 ( -0.39%)    358511.00 (  0.43%)
TPut 68    360411.00 (  0.00%)    167892.00 (-53.42%)    337932.00 ( -6.24%)    358516.00 ( -0.53%)
TPut 69    354346.00 (  0.00%)    166288.00 (-53.07%)    334951.00 ( -5.47%)    360614.00 (  1.77%)
TPut 70    354596.00 (  0.00%)    166214.00 (-53.13%)    333059.00 ( -6.07%)    337859.00 ( -4.72%)
TPut 71    351838.00 (  0.00%)    167198.00 (-52.48%)    316732.00 ( -9.98%)    350369.00 ( -0.42%)
TPut 72    357716.00 (  0.00%)    164325.00 (-54.06%)    309282.00 (-13.54%)    353090.00 ( -1.29%)

Without THP, numacore suffers really badly. Neither autonuma nor
balancenuma does great. The reasons why balancenuma suffers have already
been explained -- the scan rate is not reducing but this can be
addressed with a big hammer. A patch already exists that does that but
is not included here.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4       rc6-thpmigrate-v6r10
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               373607.00 (  0.00%)               158491.00 (-57.58%)               358227.00 ( -4.12%)               373779.00 (  0.05%)
 Actual Warehouse                   25.00 (  0.00%)                   53.00 (112.00%)                   23.00 ( -8.00%)                   26.00 (  4.00%)
 Actual Peak Bops               484584.00 (  0.00%)               178068.00 (-63.25%)               436081.00 (-10.01%)               454575.00 ( -6.19%)
 SpecJBB Bops                   185685.00 (  0.00%)                85236.00 (-54.10%)               182329.00 ( -1.81%)               183908.00 ( -0.96%)
 SpecJBB Bops/JVM               185685.00 (  0.00%)                85236.00 (-54.10%)               182329.00 ( -1.81%)               183908.00 ( -0.96%)

numacore regresses 63.25% at its peak and has a 54.10% loss on its
specjbb score.

autonuma regresses 10.01% at its peak, 1.81% on the specjbb score.

balancenuma does "best" in that it regresses the least.

MMTests Statistics: duration
                   3.7.0                  3.7.0                   3.7.0                 3.7.0
          rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
User           316094.47              169409.35                    0.00             308074.71
System             62.67              123927.05                    0.00               1897.43
Elapsed          7434.12                7452.00                    0.00               7438.16

The autonuma file that stored the system CPU usage was truncated for some
reason. I've set it to rerun.

numacore's system CPU usage is massive.

balancenuma's is also far too high due to it failing to reduce the scan
rate.
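
One way to make "massive" concrete is to express system time as a share of
all CPU time and as a multiple of the baseline. A minimal sketch in Python
with made-up helper names (system_share, summarize); the figures are copied
from the duration table above, and the truncated autonuma result is treated
as missing rather than as a real 0.00:

```python
# (user_seconds, system_seconds) per kernel; None marks the truncated
# autonuma result file, which must not be read as zero CPU usage.
cpu = {
    "rc6-stats-v5r1": (316094.47, 62.67),
    "rc6-numacore-20121123": (169409.35, 123927.05),
    "rc6-autonuma-v28fastr4": None,
    "rc6-thpmigrate-v6r10": (308074.71, 1897.43),
}

def system_share(user, system):
    """System time as a percentage of user + system time."""
    return system / (user + system) * 100

def summarize(cpu, baseline="rc6-stats-v5r1"):
    """Map kernel -> (system share in %, system time as multiple of baseline)."""
    base_system = cpu[baseline][1]
    summary = {}
    for kernel, times in cpu.items():
        if times is None:
            summary[kernel] = None
            continue
        user, system = times
        summary[kernel] = (system_share(user, system), system / base_system)
    return summary

summary = summarize(cpu)
for kernel, row in summary.items():
    if row is None:
        print(f"{kernel:24s} no data")
    else:
        print(f"{kernel:24s} system share {row[0]:6.2f}%  x{row[1]:.1f} baseline")
```

On these figures numacore spends roughly 42% of its CPU time in the kernel,
against well under 1% for the baseline and for balancenuma.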

So, now I'm seeing figures compatible with those that have been reported
elsewhere. To get those figures you must use a single JVM, THP must be
enabled and it must run with a large enough number of warehouses. For other
configurations or a lower number of warehouses, it can suffer.

-- 
Mel Gorman
SUSE Labs

* Re: Comparison between three trees (was: Latest numa/core release, v17)
  2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
  2012-11-25  8:47   ` Hillf Danton
  2012-11-25 23:37   ` Mel Gorman
@ 2012-11-25 23:40   ` Mel Gorman
  2012-11-26 13:33   ` Mel Gorman
  3 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2012-11-25 23:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Paul Turner, Lee Schermerhorn,
	Christoph Lameter, Rik van Riel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Johannes Weiner, Hugh Dickins

On Fri, Nov 23, 2012 at 05:32:05PM +0000, Mel Gorman wrote:

> <SNIP>
> SPECJBB: Single JVMs (one per node, 4 nodes), THP is enabled
> 
> <SNIP>
> SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled

Just to clarify, the "JVMs (one per node, 4 nodes)" was a cut&paste
error. Single JVM meant that there was just one JVM running and it was
configured to use 80% of available RAM.

-- 
Mel Gorman
SUSE Labs

* Re: Comparison between three trees (was: Latest numa/core release, v17)
  2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
                     ` (2 preceding siblings ...)
  2012-11-25 23:40   ` Mel Gorman
@ 2012-11-26 13:33   ` Mel Gorman
  3 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2012-11-26 13:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Paul Turner, Lee Schermerhorn,
	Christoph Lameter, Rik van Riel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Johannes Weiner, Hugh Dickins

On Fri, Nov 23, 2012 at 05:32:05PM +0000, Mel Gorman wrote:
> SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled
> 
>                         3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
>                rc6-stats-v5r1 rc6-numacore-20121123rc6-autonuma-v28fastr4   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
> TPut 1      20890.00 (  0.00%)     18720.00 (-10.39%)     21127.00 (  1.13%)     20376.00 ( -2.46%)     20806.00 ( -0.40%)     20698.00 ( -0.92%)
> TPut 2      48259.00 (  0.00%)     38121.00 (-21.01%)     47920.00 ( -0.70%)     47085.00 ( -2.43%)     48594.00 (  0.69%)     48094.00 ( -0.34%)
> TPut 3      73203.00 (  0.00%)     60057.00 (-17.96%)     73630.00 (  0.58%)     70241.00 ( -4.05%)     73418.00 (  0.29%)     74016.00 (  1.11%)
> TPut 4      98694.00 (  0.00%)     73669.00 (-25.36%)     98929.00 (  0.24%)     96721.00 ( -2.00%)     96797.00 ( -1.92%)     97930.00 ( -0.77%)
> TPut 5     122563.00 (  0.00%)     98786.00 (-19.40%)    118969.00 ( -2.93%)    118045.00 ( -3.69%)    121553.00 ( -0.82%)    122781.00 (  0.18%)
> TPut 6     144095.00 (  0.00%)    114485.00 (-20.55%)    145328.00 (  0.86%)    141713.00 ( -1.65%)    142589.00 ( -1.05%)    143771.00 ( -0.22%)
> TPut 7     166457.00 (  0.00%)    112416.00 (-32.47%)    163503.00 ( -1.77%)    166971.00 (  0.31%)    166788.00 (  0.20%)    165188.00 ( -0.76%)
> TPut 8     191067.00 (  0.00%)    122996.00 (-35.63%)    189477.00 ( -0.83%)    183090.00 ( -4.17%)    187710.00 ( -1.76%)    192157.00 (  0.57%)
> TPut 9     210634.00 (  0.00%)    141200.00 (-32.96%)    209639.00 ( -0.47%)    207968.00 ( -1.27%)    215216.00 (  2.18%)    214222.00 (  1.70%)
> TPut 10    234121.00 (  0.00%)    129508.00 (-44.68%)    231221.00 ( -1.24%)    221553.00 ( -5.37%)    219998.00 ( -6.03%)    227193.00 ( -2.96%)
> TPut 11    257885.00 (  0.00%)    131232.00 (-49.11%)    256568.00 ( -0.51%)    252734.00 ( -2.00%)    258433.00 (  0.21%)    260534.00 (  1.03%)
> TPut 12    271751.00 (  0.00%)    154763.00 (-43.05%)    277319.00 (  2.05%)    277154.00 (  1.99%)    265747.00 ( -2.21%)    262285.00 ( -3.48%)
> TPut 13    297457.00 (  0.00%)    119716.00 (-59.75%)    296068.00 ( -0.47%)    289716.00 ( -2.60%)    276527.00 ( -7.04%)    293199.00 ( -1.43%)
> TPut 14    319074.00 (  0.00%)    129730.00 (-59.34%)    311604.00 ( -2.34%)    308798.00 ( -3.22%)    316807.00 ( -0.71%)    275748.00 (-13.58%)
> TPut 15    337859.00 (  0.00%)    177494.00 (-47.47%)    329288.00 ( -2.54%)    300463.00 (-11.07%)    305116.00 ( -9.69%)    287814.00 (-14.81%)
> TPut 16    356396.00 (  0.00%)    145173.00 (-59.27%)    355616.00 ( -0.22%)    342598.00 ( -3.87%)    364077.00 (  2.16%)    339649.00 ( -4.70%)
> TPut 17    373925.00 (  0.00%)    176956.00 (-52.68%)    368589.00 ( -1.43%)    360917.00 ( -3.48%)    366043.00 ( -2.11%)    345586.00 ( -7.58%)
> TPut 18    388373.00 (  0.00%)    150100.00 (-61.35%)    372873.00 ( -3.99%)    389062.00 (  0.18%)    386779.00 ( -0.41%)    370871.00 ( -4.51%)
> 
> balancenuma suffered here. It is very likely that it was not able to handle
> faults at a PMD level due to the lack of THP and I would expect that the
> pages within a PMD boundary are not on the same node so pmd_numa is not
> set. This results in its worst case of always having to deal with PTE
> faults. Further, it must be migrating many or almost all of these because
> the adaptscan patch made no difference. This is a worst-case scenario for
> balancenuma. The scan rates later will indicate if that was the case.
> 

This worst case for balancenuma can be hit with a hammer to some extent
(patch below) but the results are too variable to be considered useful. The
headline figures say that balancenuma comes back in line with mainline so
it's not regressing but the devil is in the details. It regresses less
but balancenuma's worst-case scenario still hurts. I'm not including the
patch in the tree because the right answer is to rebase a scheduling and
placement policy on top that results in fewer migrations.

However, for reference here is how the hammer affects the results for a
single JVM with THP disabled. adaptalways-v6r12 is the hammer.

SPECJBB BOPS
                        3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
               rc6-stats-v5r1 rc6-numacore-20121123rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10 rc6-adaptalways-v6r12
TPut 1      20507.00 (  0.00%)     16702.00 (-18.55%)     19496.00 ( -4.93%)     19831.00 ( -3.30%)     20539.00 (  0.16%)
TPut 2      48723.00 (  0.00%)     36714.00 (-24.65%)     49452.00 (  1.50%)     45973.00 ( -5.64%)     47664.00 ( -2.17%)
TPut 3      72618.00 (  0.00%)     59086.00 (-18.63%)     69728.00 ( -3.98%)     71996.00 ( -0.86%)     71917.00 ( -0.97%)
TPut 4      98383.00 (  0.00%)     76940.00 (-21.80%)     98216.00 ( -0.17%)     95339.00 ( -3.09%)     96118.00 ( -2.30%)
TPut 5     122240.00 (  0.00%)     95981.00 (-21.48%)    119822.00 ( -1.98%)    117487.00 ( -3.89%)    121080.00 ( -0.95%)
TPut 6     144010.00 (  0.00%)    100095.00 (-30.49%)    141127.00 ( -2.00%)    143931.00 ( -0.05%)    141666.00 ( -1.63%)
TPut 7     164690.00 (  0.00%)    119577.00 (-27.39%)    159922.00 ( -2.90%)    164073.00 ( -0.37%)    163861.00 ( -0.50%)
TPut 8     190702.00 (  0.00%)    125183.00 (-34.36%)    189187.00 ( -0.79%)    180400.00 ( -5.40%)    187520.00 ( -1.67%)
TPut 9     209898.00 (  0.00%)    137179.00 (-34.64%)    160205.00 (-23.67%)    206052.00 ( -1.83%)    214639.00 (  2.26%)
TPut 10    234064.00 (  0.00%)    140225.00 (-40.09%)    220768.00 ( -5.68%)    218224.00 ( -6.77%)    224924.00 ( -3.90%)
TPut 11    252408.00 (  0.00%)    134453.00 (-46.73%)    250953.00 ( -0.58%)    248507.00 ( -1.55%)    247219.00 ( -2.06%)
TPut 12    278689.00 (  0.00%)    140355.00 (-49.64%)    271815.00 ( -2.47%)    255907.00 ( -8.17%)    266701.00 ( -4.30%)
TPut 13    298940.00 (  0.00%)    153780.00 (-48.56%)    190433.00 (-36.30%)    289418.00 ( -3.19%)    269335.00 ( -9.90%)
TPut 14    315971.00 (  0.00%)    126929.00 (-59.83%)    309899.00 ( -1.92%)    283315.00 (-10.34%)    308350.00 ( -2.41%)
TPut 15    340446.00 (  0.00%)    132710.00 (-61.02%)    290484.00 (-14.68%)    327168.00 ( -3.90%)    342031.00 (  0.47%)
TPut 16    362010.00 (  0.00%)    156255.00 (-56.84%)    347844.00 ( -3.91%)    311160.00 (-14.05%)    360196.00 ( -0.50%)
TPut 17    376476.00 (  0.00%)     95441.00 (-74.65%)    333508.00 (-11.41%)    366629.00 ( -2.62%)    341397.00 ( -9.32%)
TPut 18    399230.00 (  0.00%)    132993.00 (-66.69%)    374946.00 ( -6.08%)    358280.00 (-10.26%)    324370.00 (-18.75%)
TPut 19    414300.00 (  0.00%)    129194.00 (-68.82%)    392675.00 ( -5.22%)    363700.00 (-12.21%)    368777.00 (-10.99%)
TPut 20    429780.00 (  0.00%)     90068.00 (-79.04%)    241891.00 (-43.72%)    413210.00 ( -3.86%)    351444.00 (-18.23%)
TPut 21    439977.00 (  0.00%)    136793.00 (-68.91%)    412629.00 ( -6.22%)    398914.00 ( -9.33%)    442260.00 (  0.52%)
TPut 22    459593.00 (  0.00%)    134292.00 (-70.78%)    426511.00 ( -7.20%)    414652.00 ( -9.78%)    422916.00 ( -7.98%)
TPut 23    473600.00 (  0.00%)    137794.00 (-70.90%)    436081.00 ( -7.92%)    421456.00 (-11.01%)    359619.00 (-24.07%)
TPut 24    483442.00 (  0.00%)    139342.00 (-71.18%)    390536.00 (-19.22%)    453552.00 ( -6.18%)    486759.00 (  0.69%)
TPut 25    484584.00 (  0.00%)    144745.00 (-70.13%)    430863.00 (-11.09%)    397971.00 (-17.87%)    396648.00 (-18.15%)
TPut 26    483041.00 (  0.00%)    145326.00 (-69.91%)    333960.00 (-30.86%)    454575.00 ( -5.89%)    472979.00 ( -2.08%)
TPut 27    480788.00 (  0.00%)    145395.00 (-69.76%)    402433.00 (-16.30%)    415528.00 (-13.57%)    418540.00 (-12.95%)
TPut 28    470141.00 (  0.00%)    146261.00 (-68.89%)    385008.00 (-18.11%)    445938.00 ( -5.15%)    455615.00 ( -3.09%)
TPut 29    476984.00 (  0.00%)    147988.00 (-68.97%)    379719.00 (-20.39%)    395984.00 (-16.98%)    479828.00 (  0.60%)
TPut 30    471709.00 (  0.00%)    148658.00 (-68.49%)    417249.00 (-11.55%)    424000.00 (-10.11%)    435163.00 ( -7.75%)
TPut 31    470451.00 (  0.00%)    147949.00 (-68.55%)    408792.00 (-13.11%)    384502.00 (-18.27%)    415069.00 (-11.77%)
TPut 32    468377.00 (  0.00%)    158685.00 (-66.12%)    414694.00 (-11.46%)    405441.00 (-13.44%)    468585.00 (  0.04%)
TPut 33    463536.00 (  0.00%)    159097.00 (-65.68%)    412259.00 (-11.06%)    399323.00 (-13.85%)    455622.00 ( -1.71%)
TPut 34    457678.00 (  0.00%)    153025.00 (-66.56%)    408133.00 (-10.83%)    402190.00 (-12.12%)    432962.00 ( -5.40%)
TPut 35    448181.00 (  0.00%)    154037.00 (-65.63%)    405535.00 ( -9.52%)    422016.00 ( -5.84%)    452914.00 (  1.06%)
TPut 36    450490.00 (  0.00%)    149057.00 (-66.91%)    407218.00 ( -9.61%)    381320.00 (-15.35%)    427438.00 ( -5.12%)
TPut 37    435425.00 (  0.00%)    153996.00 (-64.63%)    400370.00 ( -8.05%)    403088.00 ( -7.43%)    381348.00 (-12.42%)
TPut 38    434985.00 (  0.00%)    158683.00 (-63.52%)    408266.00 ( -6.14%)    406860.00 ( -6.47%)    404181.00 ( -7.08%)
TPut 39    425064.00 (  0.00%)    160263.00 (-62.30%)    397737.00 ( -6.43%)    385657.00 ( -9.27%)    425414.00 (  0.08%)
TPut 40    428366.00 (  0.00%)    161150.00 (-62.38%)    383404.00 (-10.50%)    405984.00 ( -5.22%)    444815.00 (  3.84%)
TPut 41    417072.00 (  0.00%)    155817.00 (-62.64%)    394627.00 ( -5.38%)    398389.00 ( -4.48%)    391735.00 ( -6.07%)
TPut 42    398350.00 (  0.00%)    156774.00 (-60.64%)    388583.00 ( -2.45%)    329310.00 (-17.33%)    430361.00 (  8.04%)
TPut 43    405526.00 (  0.00%)    162938.00 (-59.82%)    371761.00 ( -8.33%)    396379.00 ( -2.26%)    397849.00 ( -1.89%)
TPut 44    400696.00 (  0.00%)    167164.00 (-58.28%)    372067.00 ( -7.14%)    373746.00 ( -6.73%)    388050.00 ( -3.16%)
TPut 45    391357.00 (  0.00%)    163075.00 (-58.33%)    365494.00 ( -6.61%)    348089.00 (-11.06%)    414737.00 (  5.97%)
TPut 46    394109.00 (  0.00%)    173557.00 (-55.96%)    357955.00 ( -9.17%)    372188.00 ( -5.56%)    400373.00 (  1.59%)
TPut 47    383292.00 (  0.00%)    168575.00 (-56.02%)    357946.00 ( -6.61%)    352658.00 ( -7.99%)    395851.00 (  3.28%)
TPut 48    373607.00 (  0.00%)    158491.00 (-57.58%)    358227.00 ( -4.12%)    373779.00 (  0.05%)    388631.00 (  4.02%)
TPut 49    372131.00 (  0.00%)    145881.00 (-60.80%)    360147.00 ( -3.22%)    358224.00 ( -3.74%)    377922.00 (  1.56%)
TPut 50    369060.00 (  0.00%)    139450.00 (-62.21%)    355721.00 ( -3.61%)    367608.00 ( -0.39%)    369852.00 (  0.21%)
TPut 51    375906.00 (  0.00%)    139823.00 (-62.80%)    367783.00 ( -2.16%)    364796.00 ( -2.96%)    353863.00 ( -5.86%)
TPut 52    379731.00 (  0.00%)    158706.00 (-58.21%)    381289.00 (  0.41%)    370100.00 ( -2.54%)    379472.00 ( -0.07%)
TPut 53    366656.00 (  0.00%)    178068.00 (-51.43%)    382147.00 (  4.22%)    369301.00 (  0.72%)    376606.00 (  2.71%)
TPut 54    373531.00 (  0.00%)    177087.00 (-52.59%)    374892.00 (  0.36%)    367863.00 ( -1.52%)    372560.00 ( -0.26%)
TPut 55    374440.00 (  0.00%)    174830.00 (-53.31%)    372036.00 ( -0.64%)    377606.00 (  0.85%)    375134.00 (  0.19%)
TPut 56    351285.00 (  0.00%)    175761.00 (-49.97%)    370602.00 (  5.50%)    371896.00 (  5.87%)    366349.00 (  4.29%)
TPut 57    366069.00 (  0.00%)    172227.00 (-52.95%)    377253.00 (  3.06%)    364024.00 ( -0.56%)    367468.00 (  0.38%)
TPut 58    367753.00 (  0.00%)    174523.00 (-52.54%)    376854.00 (  2.47%)    372580.00 (  1.31%)    363218.00 ( -1.23%)
TPut 59    364282.00 (  0.00%)    176119.00 (-51.65%)    365806.00 (  0.42%)    370299.00 (  1.65%)    367422.00 (  0.86%)
TPut 60    372531.00 (  0.00%)    175673.00 (-52.84%)    354662.00 ( -4.80%)    365126.00 ( -1.99%)    372139.00 ( -0.11%)
TPut 61    359648.00 (  0.00%)    174686.00 (-51.43%)    365387.00 (  1.60%)    370039.00 (  2.89%)    368296.00 (  2.40%)
TPut 62    361856.00 (  0.00%)    171420.00 (-52.63%)    366173.00 (  1.19%)    345029.00 ( -4.65%)    368224.00 (  1.76%)
TPut 63    363032.00 (  0.00%)    171603.00 (-52.73%)    360794.00 ( -0.62%)    349379.00 ( -3.76%)    364463.00 (  0.39%)
TPut 64    351549.00 (  0.00%)    170967.00 (-51.37%)    354632.00 (  0.88%)    352406.00 (  0.24%)    365522.00 (  3.97%)
TPut 65    360425.00 (  0.00%)    170349.00 (-52.74%)    346205.00 ( -3.95%)    351510.00 ( -2.47%)    360351.00 ( -0.02%)
TPut 66    359197.00 (  0.00%)    170037.00 (-52.66%)    355970.00 ( -0.90%)    330963.00 ( -7.86%)    347958.00 ( -3.13%)
TPut 67    356962.00 (  0.00%)    168949.00 (-52.67%)    355577.00 ( -0.39%)    358511.00 (  0.43%)    371059.00 (  3.95%)
TPut 68    360411.00 (  0.00%)    167892.00 (-53.42%)    337932.00 ( -6.24%)    358516.00 ( -0.53%)    361518.00 (  0.31%)
TPut 69    354346.00 (  0.00%)    166288.00 (-53.07%)    334951.00 ( -5.47%)    360614.00 (  1.77%)    367286.00 (  3.65%)
TPut 70    354596.00 (  0.00%)    166214.00 (-53.13%)    333059.00 ( -6.07%)    337859.00 ( -4.72%)    350505.00 ( -1.15%)
TPut 71    351838.00 (  0.00%)    167198.00 (-52.48%)    316732.00 ( -9.98%)    350369.00 ( -0.42%)    353104.00 (  0.36%)
TPut 72    357716.00 (  0.00%)    164325.00 (-54.06%)    309282.00 (-13.54%)    353090.00 ( -1.29%)    339898.00 ( -4.98%)

adaptalways reduces the scanning rate on every fault. It mitigates many
of the worst regressions but does not eliminate them because there
are still remote faults and migrations.
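
As a rough illustration of that backoff, here is a minimal userspace
sketch in C. It is not the kernel code. It assumes a scan period in
milliseconds that grows by a larger step when a hinting fault found the
page already in place and by a smaller step when the page had to migrate,
capped at a maximum. The struct, its field names and the step sizes are
hypothetical stand-ins for p->numa_scan_period and
sysctl_balance_numa_scan_period_max.

```c
#include <assert.h>

/* Hypothetical userspace model of the scan-period backoff; not kernel code. */
struct scan_model {
	unsigned int period_ms;        /* current scan period */
	unsigned int period_max_ms;    /* cap, stands in for the scan_period_max sysctl */
	unsigned int step_local_ms;    /* backoff when the page did not migrate */
	unsigned int step_migrated_ms; /* smaller backoff when the page migrated */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Account one NUMA hinting fault and return the updated scan period. */
static unsigned int scan_model_fault(struct scan_model *m, int migrated)
{
	unsigned int step = migrated ? m->step_migrated_ms : m->step_local_ms;

	/* The model only makes sense while the period is within its cap. */
	assert(m->period_ms <= m->period_max_ms);

	/* A longer period means a slower scan; never exceed the cap. */
	m->period_ms = min_uint(m->period_max_ms, m->period_ms + step);
	return m->period_ms;
}
```

Starting at 100 ms with a 130 ms cap and steps of 10 and 5, the fault
sequence local, migrated, local, local yields 110, 115, 125 and 130, so
the scan slows down even when every other fault migrates a page.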

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1      rc6-numacore-20121123     rc6-autonuma-v28fastr4       rc6-thpmigrate-v6r10      rc6-adaptalways-v6r12
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        373607.00 (  0.00%)        158491.00 (-57.58%)        358227.00 ( -4.12%)        373779.00 (  0.05%)        388631.00 (  4.02%)
 Actual Warehouse            25.00 (  0.00%)            53.00 (112.00%)            23.00 ( -8.00%)            26.00 (  4.00%)            24.00 ( -4.00%)
 Actual Peak Bops        484584.00 (  0.00%)        178068.00 (-63.25%)        436081.00 (-10.01%)        454575.00 ( -6.19%)        486759.00 (  0.45%)
 SpecJBB Bops            185685.00 (  0.00%)         85236.00 (-54.10%)        182329.00 ( -1.81%)        183908.00 ( -0.96%)        186711.00 (  0.55%)
 SpecJBB Bops/JVM        185685.00 (  0.00%)         85236.00 (-54.10%)        182329.00 ( -1.81%)        183908.00 ( -0.96%)        186711.00 (  0.55%)

The actual peak performance figures look ok though and if you were just
looking at the headline figures you might be tempted to conclude that the
patch works but the per-warehouse figures show that it's not really the
case at all.

MMTests Statistics: duration
                            3.7.0                   3.7.0                   3.7.0                   3.7.0                   3.7.0
                   rc6-stats-v5r1   rc6-numacore-20121123  rc6-autonuma-v28fastr4    rc6-thpmigrate-v6r10   rc6-adaptalways-v6r12
User                    316094.47               169409.35               308316.22               308074.71               309256.18
System                      62.67               123927.05                 4304.26                 1897.43                 1650.29
Elapsed                   7434.12                 7452.00                 7439.70                 7438.16                 7437.24

It does reduce system CPU usage a bit but the fact is that it's still
migrating uselessly.

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10 rc6-adaptalways-v6r12
Page Ins                         34248       37888       38048       38148       38076
Page Outs                        50932       60036       54448       55196       55368
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                      3           3           3           3           2
THP collapse alloc                   0           0          12           0           0
THP splits                           0           0           0           0           0
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0    27257642    22698940
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0       28293       23561
NUMA PTE updates                     0           0           0   220482204   187969232
NUMA hint faults                     0           0           0   214660099   183397210
NUMA hint local faults               0           0           0    55657689    47359679
NUMA pages migrated                  0           0           0    27257642    22698940
AutoNUMA cost                        0           0           0     1075361      918733

Note that it alters the number of PTEs that are updated and the number
of faults but not enough to make a difference. Far too many of those
NUMA faults were remote and resulted in migration.
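
The proportions behind that remark can be worked out from the counters
in the table. A small C sketch, with hypothetical struct and helper
names, assuming each counter is simply divided by the hint-fault total:

```c
#include <assert.h>

/* Counters copied from one column of the vmstat table. */
struct numa_counters {
	unsigned long long hint_faults;       /* NUMA hint faults */
	unsigned long long hint_local_faults; /* NUMA hint local faults */
	unsigned long long pages_migrated;    /* NUMA pages migrated */
};

/* Percentage of part in total; total must be non-zero. */
static double pct(unsigned long long part, unsigned long long total)
{
	assert(total > 0);
	return 100.0 * (double)part / (double)total;
}

/* Share of hinting faults that hit memory on the local node. */
static double local_fault_pct(const struct numa_counters *c)
{
	return pct(c->hint_local_faults, c->hint_faults);
}

/* Pages migrated, expressed as a percentage of hinting faults. */
static double migrated_pct(const struct numa_counters *c)
{
	return pct(c->pages_migrated, c->hint_faults);
}
```

Plugging in the thpmigrate and adaptalways columns gives a local-fault
share of roughly 26% in both cases. adaptalways lowers the absolute
counts but leaves the local/remote mix essentially unchanged.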

Here is the "hammer" for reference but I'll not be including it.

---8<---
mm: sched: Adapt the scanning rate even if a NUMA hinting fault migrates

specjbb on single JVM for balancenuma indicated that the scan rate was
not reducing and the performance was impaired. The problem was that
the threads are getting scheduled between nodes and balancenuma is
migrating the pages around in circles uselessly. It needs a scheduling
policy that makes tasks stickier to a node if much of their memory is
there.

In the meantime, I have a hammer and this problem looks mighty like a
nail.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd9c78c..ed54789 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,11 +818,18 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
-	 * This is reset periodically in case of phase changes
+	 * This is reset periodically in case of phase changes. If the page
+	 * was migrated, we still slow the scan rate but less. If the
+	 * workload is not converging at all, at least it will update
+	 * fewer PTEs and stop thrashing around but in ideal circumstances it
+	 * also means we converge slower.
 	 */
         if (!migrated)
 		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	else
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+			p->numa_scan_period + jiffies_to_msecs(5));
 
 	task_numa_placement(p);
 }

end of thread, newest message ~2013-01-02 19:44 UTC

Thread overview: 55+ messages
2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
2012-11-22 22:49 ` [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
2012-11-22 22:49 ` [PATCH 02/33] x86/mm: Only do a local tlb flush " Ingo Molnar
2012-11-22 22:49 ` [PATCH 03/33] x86/mm: Introduce pte_accessible() Ingo Molnar
2012-11-22 22:49 ` [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
2012-11-22 22:49 ` [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
2012-11-22 22:49 ` [PATCH 06/33] mm: Count the number of pages affected in change_protection() Ingo Molnar
2012-11-22 22:49 ` [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
2012-11-22 22:49 ` [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
2012-11-22 22:49 ` [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides Ingo Molnar
2012-11-22 22:49 ` [PATCH 10/33] sched: Make find_busiest_queue() a method Ingo Molnar
2012-11-22 22:49 ` [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
2012-11-22 22:49 ` [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
2012-11-22 22:49 ` [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
2012-11-22 22:49 ` [PATCH 14/33] mm/migration: Improve migrate_misplaced_page() Ingo Molnar
2012-11-22 22:49 ` [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
2012-11-22 22:49 ` [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
2012-11-22 22:49 ` [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
2012-11-22 22:49 ` [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
2012-12-04  0:56   ` [patch] mm, mempolicy: Introduce spinlock to read shared policy tree David Rientjes
2012-12-20 18:34     ` Linus Torvalds
2012-12-20 22:55       ` David Rientjes
2012-12-21 13:47         ` Mel Gorman
2012-12-21 16:53           ` Linus Torvalds
2012-12-21 18:21             ` Hugh Dickins
2012-12-21 21:51               ` Linus Torvalds
2012-12-21 19:58             ` Mel Gorman
2012-12-21 22:02               ` Linus Torvalds
2012-12-21 23:10                 ` Mel Gorman
2012-12-22  0:36                   ` Linus Torvalds
2013-01-02 19:43                     ` KOSAKI Motohiro
2012-11-22 22:49 ` [PATCH 19/33] sched: Add adaptive NUMA affinity support Ingo Molnar
2012-11-26 20:32   ` Sasha Levin
2012-11-22 22:49 ` [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
2012-11-22 22:49 ` [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
2012-11-22 22:49 ` [PATCH 22/33] sched: Implement slow start for working set sampling Ingo Molnar
2012-11-22 22:49 ` [PATCH 23/33] sched, numa, mm: Interleave shared tasks Ingo Molnar
2012-11-22 22:49 ` [PATCH 24/33] sched: Implement NUMA scanning backoff Ingo Molnar
2012-11-22 22:49 ` [PATCH 25/33] sched: Improve convergence Ingo Molnar
2012-11-22 22:49 ` [PATCH 26/33] sched: Introduce staged average NUMA faults Ingo Molnar
2012-11-22 22:49 ` [PATCH 27/33] sched: Track groups of shared tasks Ingo Molnar
2012-11-22 22:49 ` [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
2012-11-22 22:49 ` [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy Ingo Molnar
2012-11-22 22:49 ` [PATCH 30/33] sched: Average the fault stats longer Ingo Molnar
2012-11-22 22:49 ` [PATCH 31/33] sched: Use the ideal CPU to drive active balancing Ingo Molnar
2012-11-22 22:49 ` [PATCH 32/33] sched: Add hysteresis to p->numa_shared Ingo Molnar
2012-11-22 22:49 ` [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations Ingo Molnar
2012-11-22 22:53 ` [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
2012-11-23  6:47   ` Zhouping Liu
2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
2012-11-25  8:47   ` Hillf Danton
2012-11-26  9:38     ` Mel Gorman
2012-11-25 23:37   ` Mel Gorman
2012-11-25 23:40   ` Mel Gorman
2012-11-26 13:33   ` Mel Gorman
