From: Ingo Molnar <mingo@kernel.org>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Paul Turner <pjt@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>
Subject: [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations
Date: Thu, 22 Nov 2012 23:49:54 +0100 [thread overview]
Message-ID: <1353624594-1118-34-git-send-email-mingo@kernel.org> (raw)
In-Reply-To: <1353624594-1118-1-git-send-email-mingo@kernel.org>
This patch shows the power of the shared/private distinction: in
the shared tasks active balancing function (sched_update_ideal_cpu_shared())
we are able to to build a per (shared) task node mask of the nodes that
it and its buddies occupy at the moment.
Private tasks on the other hand are not affected and continue to do
efficient node-local allocations.
There's two important special cases:
- if a group of shared tasks fits on a single node. In this case
the interleaving happens on a single bit, a single node and thus
turns into nice node-local allocations.
- if a large group spans the whole system: in this case the node
masks will cover the whole system, and all memory gets evenly
interleaved and available RAM bandwidth gets utilized. This is
preferable to allocating memory assymetrically and overloading
certain CPU links and running into their bandwidth limitations.
This patch, in combination with the private/shared buddies patch,
optimizes the "4x JVM", "single JVM" and "2x JVM" SPECjbb workloads
on a 4-node system produce almost completely perfect memory placement.
For example a 4-JVM workload on a 4-node, 32-CPU system has
this performance (8 SPECjbb warehouses per JVM):
spec1.txt: throughput = 177460.44 SPECjbb2005 bops
spec2.txt: throughput = 176175.08 SPECjbb2005 bops
spec3.txt: throughput = 175053.91 SPECjbb2005 bops
spec4.txt: throughput = 171383.52 SPECjbb2005 bops
Which is close to the hard binding performance figures.
while previously it would regress compared to mainline.
Mainline has the following 4x JVM performance:
spec1.txt: throughput = 157839.25 SPECjbb2005 bops
spec2.txt: throughput = 156969.15 SPECjbb2005 bops
spec3.txt: throughput = 157571.59 SPECjbb2005 bops
spec4.txt: throughput = 157873.86 SPECjbb2005 bops
So the patch brings a 12% speedup.
This placement idea came while discussing interleaving strategies
with Christoph Lameter.
Suggested-by: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4a7130..5cc3620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -922,6 +922,10 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
buddies++;
}
WARN_ON_ONCE(buddies > full_buddies);
+ if (buddies)
+ node_set(node, p->numa_policy.v.nodes);
+ else
+ node_clear(node, p->numa_policy.v.nodes);
/* Don't go to a node that is already at full capacity: */
if (buddies == full_buddies)
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-11-22 22:51 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-11-22 22:49 [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
2012-11-22 22:49 ` [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
2012-11-22 22:49 ` [PATCH 02/33] x86/mm: Only do a local tlb flush " Ingo Molnar
2012-11-22 22:49 ` [PATCH 03/33] x86/mm: Introduce pte_accessible() Ingo Molnar
2012-11-22 22:49 ` [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
2012-11-22 22:49 ` [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
2012-11-22 22:49 ` [PATCH 06/33] mm: Count the number of pages affected in change_protection() Ingo Molnar
2012-11-22 22:49 ` [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
2012-11-22 22:49 ` [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
2012-11-22 22:49 ` [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides Ingo Molnar
2012-11-22 22:49 ` [PATCH 10/33] sched: Make find_busiest_queue() a method Ingo Molnar
2012-11-22 22:49 ` [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
2012-11-22 22:49 ` [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
2012-11-22 22:49 ` [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
2012-11-22 22:49 ` [PATCH 14/33] mm/migration: Improve migrate_misplaced_page() Ingo Molnar
2012-11-22 22:49 ` [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
2012-11-22 22:49 ` [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
2012-11-22 22:49 ` [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
2012-11-22 22:49 ` [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
2012-12-04 0:56 ` [patch] mm, mempolicy: Introduce spinlock to read shared policy tree David Rientjes
2012-12-20 18:34 ` Linus Torvalds
2012-12-20 22:55 ` David Rientjes
2012-12-21 13:47 ` Mel Gorman
2012-12-21 16:53 ` Linus Torvalds
2012-12-21 18:21 ` Hugh Dickins
2012-12-21 21:51 ` Linus Torvalds
2012-12-21 19:58 ` Mel Gorman
2012-12-21 22:02 ` Linus Torvalds
2012-12-21 23:10 ` Mel Gorman
2012-12-22 0:36 ` Linus Torvalds
2013-01-02 19:43 ` KOSAKI Motohiro
2012-11-22 22:49 ` [PATCH 19/33] sched: Add adaptive NUMA affinity support Ingo Molnar
2012-11-26 20:32 ` Sasha Levin
2012-11-22 22:49 ` [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
2012-11-22 22:49 ` [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
2012-11-22 22:49 ` [PATCH 22/33] sched: Implement slow start for working set sampling Ingo Molnar
2012-11-22 22:49 ` [PATCH 23/33] sched, numa, mm: Interleave shared tasks Ingo Molnar
2012-11-22 22:49 ` [PATCH 24/33] sched: Implement NUMA scanning backoff Ingo Molnar
2012-11-22 22:49 ` [PATCH 25/33] sched: Improve convergence Ingo Molnar
2012-11-22 22:49 ` [PATCH 26/33] sched: Introduce staged average NUMA faults Ingo Molnar
2012-11-22 22:49 ` [PATCH 27/33] sched: Track groups of shared tasks Ingo Molnar
2012-11-22 22:49 ` [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
2012-11-22 22:49 ` [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy Ingo Molnar
2012-11-22 22:49 ` [PATCH 30/33] sched: Average the fault stats longer Ingo Molnar
2012-11-22 22:49 ` [PATCH 31/33] sched: Use the ideal CPU to drive active balancing Ingo Molnar
2012-11-22 22:49 ` [PATCH 32/33] sched: Add hysteresis to p->numa_shared Ingo Molnar
2012-11-22 22:49 ` Ingo Molnar [this message]
2012-11-22 22:53 ` [PATCH 00/33] Latest numa/core release, v17 Ingo Molnar
2012-11-23 6:47 ` Zhouping Liu
2012-11-23 17:32 ` Comparison between three trees (was: Latest numa/core release, v17) Mel Gorman
2012-11-25 8:47 ` Hillf Danton
2012-11-26 9:38 ` Mel Gorman
2012-11-25 23:37 ` Mel Gorman
2012-11-25 23:40 ` Mel Gorman
2012-11-26 13:33 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1353624594-1118-34-git-send-email-mingo@kernel.org \
--to=mingo@kernel.org \
--cc=Lee.Schermerhorn@hp.com \
--cc=a.p.zijlstra@chello.nl \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=cl@linux.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=pjt@google.com \
--cc=riel@redhat.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).