linux-mm.kvack.org archive mirror
* [PATCH 00/33] Latest numa/core release, v17
@ 2012-11-22 22:49 Ingo Molnar
From: Ingo Molnar @ 2012-11-22 22:49 UTC
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This release mainly addresses one of the regressions Linus
(rightfully) complained about: the "4x JVM" SPECjbb run.

[ Note to testers: if possible, please still run with
  CONFIG_TRANSPARENT_HUGEPAGE=y, to avoid the !THP
  regression that is still not fully fixed.
  It will be fixed next. ]

The new 4x JVM results on a 4-node, 32-CPU, 64 GB RAM system
(240-second run, 8 warehouses per 4 JVM instances):

     spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
     spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
     spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
     spec4.txt:           throughput =     171383.52 SPECjbb2005 bops
    
This is close to (but does not yet completely match) the hard-binding
performance figures.
 
Mainline has the following 4x JVM performance:
    
     spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
     spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
     spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
     spec4.txt:           throughput =     157873.86 SPECjbb2005 bops
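
To put the two tables above side by side, the four per-JVM figures can
be summed and compared. This is only a small arithmetic sketch; the
helper names total_bops() and gain_percent() are made up for the
illustration and are not part of the series or of SPECjbb:

```c
#include <stddef.h>

/* Sum an array of per-JVM throughput figures (SPECjbb2005 bops). */
static double total_bops(const double *bops, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += bops[i];
	return sum;
}

/* Relative gain of 'new_total' over 'base_total', in percent. */
static double gain_percent(double new_total, double base_total)
{
	return (new_total / base_total - 1.0) * 100.0;
}

/* Figures copied from the two result tables above. */
static const double v17_bops[4] = {
	177460.44, 176175.08, 175053.91, 171383.52
};
static const double mainline_bops[4] = {
	157839.25, 156969.15, 157571.59, 157873.86
};
```

Calling gain_percent(total_bops(v17_bops, 4), total_bops(mainline_bops, 4))
gives an aggregate gain of roughly 11% over mainline for this run.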

This result is achieved through the following patches:

  sched: Introduce staged average NUMA faults
  sched: Track groups of shared tasks
  sched: Use the best-buddy 'ideal cpu' in balancing decisions
  sched, mm, mempolicy: Add per task mempolicy
  sched: Average the fault stats longer
  sched: Use the ideal CPU to drive active balancing
  sched: Add hysteresis to p->numa_shared
  sched: Track shared task's node groups and interleave their memory allocations

These patches make increasing use of the shared/private access
pattern distinction between tasks.

Automatic, task group accurate interleaving of memory is the
most important new placement optimization feature in -v17.

It works by first implementing a task CPU placement feature:

    Using our shared/private distinction to allow the separate
    handling of 'private' versus 'shared' workloads, we enable
    the active-balancing of them:
    
     - private tasks, via the sched_update_ideal_cpu_private() function,
       try to 'spread' across the system as evenly as possible.
    
     - shared-access tasks that also share their mm (threads), via the
       sched_update_ideal_cpu_shared() function, try to 'compress'
       with other shared tasks on as few nodes as possible.
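
The 'spread' versus 'compress' idea can be illustrated with a toy
user-space model. This is not the code of
sched_update_ideal_cpu_private() or sched_update_ideal_cpu_shared();
the function names, the per-node load arrays and the fixed NR_NODES
below are all assumptions chosen for the sketch:

```c
#define NR_NODES 4	/* assumption: matches the 4-node test system */

/*
 * 'Spread': pick the node currently running the fewest tasks.
 * Ties go to the lowest node id.
 */
static int pick_node_private(const int node_load[NR_NODES])
{
	int best = 0;

	for (int n = 1; n < NR_NODES; n++)
		if (node_load[n] < node_load[best])
			best = n;
	return best;
}

/*
 * 'Compress': among nodes that still have a free CPU, pick the one
 * that already hosts the most tasks of this shared group.
 * Returns -1 if every node is full.
 */
static int pick_node_shared(const int group_tasks[NR_NODES],
			    const int node_load[NR_NODES],
			    int cpus_per_node)
{
	int best = -1;

	for (int n = 0; n < NR_NODES; n++) {
		if (node_load[n] >= cpus_per_node)
			continue;	/* no room on this node */
		if (best < 0 || group_tasks[n] > group_tasks[best])
			best = n;
	}
	return best;
}
```

In this model a private task drifts towards the emptiest node, while a
shared task is pulled towards the node where most of its group already
runs, as long as that node has capacity left.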
    
As tasks are tracked as distinct groups of 'shared access pattern'
tasks, they are compressed towards as few nodes as possible. While
the scheduler performs this compression, a mempolicy node mask can
be constructed almost for free - and in turn be used for the memory
allocations of the tasks.
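
As a rough illustration of why the node mask comes "almost for free":
once the node of each task in a shared group is known, the mask is
just the OR of one bit per task. The sketch below is a user-space
model, not the kernel's nodemask_t handling; group_node_mask() and
the task_node[] array are hypothetical names:

```c
#include <stdint.h>

/*
 * Build a node mask for one shared group: bit n is set if at least
 * one task of the group currently runs on node n.
 * task_node[i] is the node id of the group's i-th task; ids are
 * assumed to be in the range 0..63.
 */
static uint64_t group_node_mask(const int *task_node, int nr_tasks)
{
	uint64_t mask = 0;

	for (int i = 0; i < nr_tasks; i++)
		mask |= UINT64_C(1) << task_node[i];
	return mask;
}
```

For a group whose tasks sit on nodes 1, 1, 3 and 1 this yields 0xA
(bits 1 and 3); the tighter the scheduler compresses the group, the
fewer bits remain set.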

There are two notable special cases of the interleaving:

     - if a group of shared tasks fits on a single node: in this case
       the node mask has a single bit set, so the interleaving covers
       a single node and turns into nice node-local allocations.
    
     - if a large group spans the whole system: in this case the node
       masks will cover the whole system, and all memory gets evenly
       interleaved and available RAM bandwidth gets utilized. This is
       preferable to allocating memory asymmetrically and overloading
       certain CPU links and running into their bandwidth limitations.
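
Both special cases fall out of a plain round-robin walk over the set
bits of the node mask. The following is a simplified user-space
sketch under that assumption, not the mempolicy code from the series;
next_interleave_node() is a hypothetical helper:

```c
#include <stdint.h>

/*
 * Return the next node after 'prev' whose bit is set in 'mask',
 * wrapping around after node nr_nodes - 1 (nr_nodes <= 64).
 * Pass prev = -1 for the first allocation.
 * Returns -1 if no bit below nr_nodes is set.
 */
static int next_interleave_node(uint64_t mask, int prev, int nr_nodes)
{
	for (int step = 1; step <= nr_nodes; step++) {
		int n = (prev + step) % nr_nodes;

		if (mask & (UINT64_C(1) << n))
			return n;
	}
	return -1;
}
```

With a single-bit mask such as 0x4 every call returns node 2, which is
the node-local case. With a full 4-node mask of 0xF successive calls
cycle through nodes 0, 1, 2, 3 and back to 0, spreading allocations
evenly.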

"Private" and non-NUMA tasks on the other hand are not affected and
continue to do efficient node-local allocations.

With this approach we avoid most of the 'threading means shared access
patterns' heuristics that AutoNUMA uses, by automatically separating
out threads that have a private working set and not binding them to
the other threads forcibly.

The thread group heuristics are not completely eliminated though, as
can be seen in the "sched: Use the ideal CPU to drive active balancing"
patch. They are not hard-coded into the design in any case and could be
extended to other task group information: the automatic NUMA balancing
of cgroups, for example.
 
Thanks,

    Ingo

-------------------->

Andrea Arcangeli (1):
  numa, mm: Support NUMA hinting page faults from gup/gup_fast

Ingo Molnar (14):
  mm: Optimize the TLB flush of sys_mprotect() and change_protection()
    users
  sched, mm, numa: Create generic NUMA fault infrastructure, with
    architectures overrides
  sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
  sched, numa, mm: Interleave shared tasks
  sched: Implement NUMA scanning backoff
  sched: Improve convergence
  sched: Introduce staged average NUMA faults
  sched: Track groups of shared tasks
  sched: Use the best-buddy 'ideal cpu' in balancing decisions
  sched, mm, mempolicy: Add per task mempolicy
  sched: Average the fault stats longer
  sched: Use the ideal CPU to drive active balancing
  sched: Add hysteresis to p->numa_shared
  sched: Track shared task's node groups and interleave their memory
    allocations

Mel Gorman (1):
  mm/migration: Improve migrate_misplaced_page()

Peter Zijlstra (11):
  mm: Count the number of pages affected in change_protection()
  sched, numa, mm: Add last_cpu to page flags
  sched: Make find_busiest_queue() a method
  sched, numa, mm: Describe the NUMA scheduling problem formally
  mm/migrate: Introduce migrate_misplaced_page()
  sched, numa, mm, arch: Add variable locality exception
  sched, numa, mm: Add the scanning page fault machinery
  sched: Add adaptive NUMA affinity support
  sched: Implement constant, per task Working Set Sampling (WSS) rate
  sched, numa, mm: Count WS scanning against present PTEs, not virtual
    memory ranges
  sched: Implement slow start for working set sampling

Rik van Riel (6):
  mm/generic: Only flush the local TLB in ptep_set_access_flags()
  x86/mm: Only do a local tlb flush in ptep_set_access_flags()
  x86/mm: Introduce pte_accessible()
  mm: Only flush the TLB when clearing an accessible pte
  x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
  sched, numa, mm: Add credits for NUMA placement

 CREDITS                                  |    1 +
 Documentation/scheduler/numa-problem.txt |  236 +++++
 arch/sh/mm/Kconfig                       |    1 +
 arch/x86/Kconfig                         |    2 +
 arch/x86/include/asm/pgtable.h           |    6 +
 arch/x86/mm/pgtable.c                    |    8 +-
 include/asm-generic/pgtable.h            |   59 ++
 include/linux/huge_mm.h                  |   12 +
 include/linux/hugetlb.h                  |    8 +-
 include/linux/init_task.h                |    8 +
 include/linux/mempolicy.h                |   47 +-
 include/linux/migrate.h                  |    7 +
 include/linux/mm.h                       |   99 +-
 include/linux/mm_types.h                 |   50 +
 include/linux/mmzone.h                   |   14 +-
 include/linux/page-flags-layout.h        |   83 ++
 include/linux/sched.h                    |   54 +-
 init/Kconfig                             |   81 ++
 kernel/bounds.c                          |    4 +
 kernel/sched/core.c                      |  105 ++-
 kernel/sched/fair.c                      | 1464 ++++++++++++++++++++++++++----
 kernel/sched/features.h                  |   13 +
 kernel/sched/sched.h                     |   39 +-
 kernel/sysctl.c                          |   45 +-
 mm/Makefile                              |    1 +
 mm/huge_memory.c                         |  163 ++++
 mm/hugetlb.c                             |   10 +-
 mm/internal.h                            |    5 +-
 mm/memcontrol.c                          |    7 +-
 mm/memory.c                              |  105 ++-
 mm/mempolicy.c                           |  175 +++-
 mm/migrate.c                             |  106 ++-
 mm/mprotect.c                            |   69 +-
 mm/numa.c                                |   73 ++
 mm/pgtable-generic.c                     |    9 +-
 35 files changed, 2818 insertions(+), 351 deletions(-)
 create mode 100644 Documentation/scheduler/numa-problem.txt
 create mode 100644 include/linux/page-flags-layout.h
 create mode 100644 mm/numa.c

-- 
1.7.11.7



end of thread, other threads:[~2013-01-02 19:44 UTC | newest]

Thread overview: 55+ messages
2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
2012-11-22 22:49 ` [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
2012-11-22 22:49 ` [PATCH 02/33] x86/mm: Only do a local tlb flush " Ingo Molnar
2012-11-22 22:49 ` [PATCH 03/33] x86/mm: Introduce pte_accessible() Ingo Molnar
2012-11-22 22:49 ` [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
2012-11-22 22:49 ` [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
2012-11-22 22:49 ` [PATCH 06/33] mm: Count the number of pages affected in change_protection() Ingo Molnar
2012-11-22 22:49 ` [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
2012-11-22 22:49 ` [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
2012-11-22 22:49 ` [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides Ingo Molnar
2012-11-22 22:49 ` [PATCH 10/33] sched: Make find_busiest_queue() a method Ingo Molnar
2012-11-22 22:49 ` [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
2012-11-22 22:49 ` [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
2012-11-22 22:49 ` [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
2012-11-22 22:49 ` [PATCH 14/33] mm/migration: Improve migrate_misplaced_page() Ingo Molnar
2012-11-22 22:49 ` [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
2012-11-22 22:49 ` [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
2012-11-22 22:49 ` [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
2012-11-22 22:49 ` [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
2012-12-04  0:56   ` [patch] mm, mempolicy: Introduce spinlock to read shared policy tree David Rientjes
2012-12-20 18:34     ` Linus Torvalds
2012-12-20 22:55       ` David Rientjes
2012-12-21 13:47         ` Mel Gorman
2012-12-21 16:53           ` Linus Torvalds
2012-12-21 18:21             ` Hugh Dickins
2012-12-21 21:51               ` Linus Torvalds
2012-12-21 19:58             ` Mel Gorman
2012-12-21 22:02               ` Linus Torvalds
2012-12-21 23:10                 ` Mel Gorman
2012-12-22  0:36                   ` Linus Torvalds
2013-01-02 19:43                     ` KOSAKI Motohiro
2012-11-22 22:49 ` [PATCH 19/33] sched: Add adaptive NUMA affinity support Ingo Molnar
2012-11-26 20:32   ` Sasha Levin
2012-11-22 22:49 ` [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
2012-11-22 22:49 ` [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
2012-11-22 22:49 ` [PATCH 22/33] sched: Implement slow start for working set sampling Ingo Molnar
2012-11-22 22:49 ` [PATCH 23/33] sched, numa, mm: Interleave shared tasks Ingo Molnar
2012-11-22 22:49 ` [PATCH 24/33] sched: Implement NUMA scanning backoff Ingo Molnar
2012-11-22 22:49 ` [PATCH 25/33] sched: Improve convergence Ingo Molnar
2012-11-22 22:49 ` [PATCH 26/33] sched: Introduce staged average NUMA faults Ingo Molnar
2012-11-22 22:49 ` [PATCH 27/33] sched: Track groups of shared tasks Ingo Molnar
2012-11-22 22:49 ` [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
2012-11-22 22:49 ` [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy Ingo Molnar
2012-11-22 22:49 ` [PATCH 30/33] sched: Average the fault stats longer Ingo Molnar
2012-11-22 22:49 ` [PATCH 31/33] sched: Use the ideal CPU to drive active balancing Ingo Molnar
2012-11-22 22:49 ` [PATCH 32/33] sched: Add hysteresis to p->numa_shared Ingo Molnar
2012-11-22 22:49 ` [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations Ingo Molnar
2012-11-22 22:53 ` [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
2012-11-23  6:47   ` Zhouping Liu
2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
2012-11-25  8:47   ` Hillf Danton
2012-11-26  9:38     ` Mel Gorman
2012-11-25 23:37   ` Mel Gorman
2012-11-25 23:40   ` Mel Gorman
2012-11-26 13:33   ` Mel Gorman
