linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/19] sched-numa rewrite
@ 2012-07-31 19:12 Peter Zijlstra
  2012-07-31 19:12 ` [PATCH 01/19] task_work: Remove dependency on sched.h Peter Zijlstra
                   ` (19 more replies)
  0 siblings, 20 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

Hi all,

After a talk with Rik about all this NUMA nonsense, in which he proposed the
scheme implemented in the next-to-last patch, I came up with a related means
of doing the home-node selection.

I've also switched to (ab)using PROT_NONE for driving the migration faults.

These patches go on top of tip/master with origin/master (Linus' tree) merged in.

Since last week's posting (which was private due to operator error, not
intention), Ingo dropped all the previous patches from tip and I did a
complete rebase of the series, so all the back and forth between old and new
stuff is gone.

---
 arch/x86/include/asm/pgtable.h |    1 +
 drivers/base/node.c            |    2 +-
 include/linux/huge_mm.h        |    3 +
 include/linux/init_task.h      |    9 +
 include/linux/mempolicy.h      |   30 ++-
 include/linux/migrate.h        |    7 +
 include/linux/migrate_mode.h   |    3 +
 include/linux/mm.h             |   12 +
 include/linux/mm_types.h       |   12 +
 include/linux/mmzone.h         |    1 -
 include/linux/sched.h          |   30 ++-
 include/linux/task_work.h      |    7 -
 kernel/exit.c                  |    5 +-
 kernel/sched/core.c            |   71 ++++++-
 kernel/sched/debug.c           |    3 +
 kernel/sched/fair.c            |  501 ++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h        |   10 +
 kernel/sched/sched.h           |   37 +++
 kernel/sysctl.c                |   13 +-
 mm/huge_memory.c               |  165 +++++++++-----
 mm/memory.c                    |  125 ++++++++++-
 mm/mempolicy.c                 |  296 ++++++++++++++++++------
 mm/migrate.c                   |   85 ++++++-
 mm/mprotect.c                  |   24 ++-
 mm/vmstat.c                    |    1 -
 25 files changed, 1257 insertions(+), 196 deletions(-)




* [PATCH 01/19] task_work: Remove dependency on sched.h
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 20:52   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: decouple-sched-task_work.patch --]
[-- Type: text/plain, Size: 1160 bytes --]

Remove the need for sched.h from task_work.h so that we can use struct
task_work in struct task_struct in a later patch.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/task_work.h |    7 -------
 kernel/exit.c             |    5 ++++-
 2 files changed, 4 insertions(+), 8 deletions(-)
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -2,7 +2,6 @@
 #define _LINUX_TASK_WORK_H
 
 #include <linux/list.h>
-#include <linux/sched.h>
 
 typedef void (*task_work_func_t)(struct callback_head *);
 
@@ -16,10 +15,4 @@ int task_work_add(struct task_struct *ta
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
 
-static inline void exit_task_work(struct task_struct *task)
-{
-	if (unlikely(task->task_works))
-		task_work_run();
-}
-
 #endif	/* _LINUX_TASK_WORK_H */
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,10 @@ void do_exit(long code)
 	exit_shm(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
-	exit_task_work(tsk);
+
+	if (unlikely(tsk->task_works))
+		task_work_run();
+
 	check_stack_usage();
 	exit_thread();
 




* [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
  2012-07-31 19:12 ` [PATCH 01/19] task_work: Remove dependency on sched.h Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 20:52   ` Rik van Riel
  2012-08-09 21:41   ` Andrea Arcangeli
  2012-07-31 19:12 ` [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: mm_mpol-Remove_NUMA_INTERLEAVE_HIT.patch --]
[-- Type: text/plain, Size: 5134 bytes --]

The NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants to be
compared to either a total of interleave allocations or to a miss count.
Remove it.

Fixing it would be possible, but since we've gone years without these
statistics I figure we can continue that way.

Also, NUMA_HIT fully includes NUMA_INTERLEAVE_HIT, so users can switch to
using that instead.

This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.

Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/base/node.c    |    2 -
 include/linux/mmzone.h |    1 
 mm/mempolicy.c         |   68 +++++++++++++++----------------------------------
 mm/vmstat.c            |    1 
 4 files changed, 22 insertions(+), 50 deletions(-)
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,7 +169,7 @@ static ssize_t node_read_numastat(struct
 		       node_page_state(dev->id, NUMA_HIT),
 		       node_page_state(dev->id, NUMA_MISS),
 		       node_page_state(dev->id, NUMA_FOREIGN),
-		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       0UL,
 		       node_page_state(dev->id, NUMA_LOCAL),
 		       node_page_state(dev->id, NUMA_OTHER));
 }
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -137,7 +137,6 @@ enum zone_stat_item {
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
 	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
-	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1553,11 +1553,29 @@ static nodemask_t *policy_nodemask(gfp_t
 	return NULL;
 }
 
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy)
+{
+	unsigned nid, next;
+	struct task_struct *me = current;
+
+	nid = me->il_next;
+	next = next_node(nid, policy->v.nodes);
+	if (next >= MAX_NUMNODES)
+		next = first_node(policy->v.nodes);
+	if (next < MAX_NUMNODES)
+		me->il_next = next;
+	return nid;
+}
+
 /* Return a zonelist indicated by gfp for node representing a mempolicy */
 static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	int nd)
 {
 	switch (policy->mode) {
+	case MPOL_INTERLEAVE:
+		nd = interleave_nodes(policy);
+		break;
 	case MPOL_PREFERRED:
 		if (!(policy->flags & MPOL_F_LOCAL))
 			nd = policy->v.preferred_node;
@@ -1579,21 +1597,6 @@ static struct zonelist *policy_zonelist(
 	return node_zonelist(nd, gfp);
 }
 
-/* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
-{
-	unsigned nid, next;
-	struct task_struct *me = current;
-
-	nid = me->il_next;
-	next = next_node(nid, policy->v.nodes);
-	if (next >= MAX_NUMNODES)
-		next = first_node(policy->v.nodes);
-	if (next < MAX_NUMNODES)
-		me->il_next = next;
-	return nid;
-}
-
 /*
  * Depending on the memory policy provide a node from which to allocate the
  * next slab entry.
@@ -1824,21 +1827,6 @@ bool mempolicy_nodemask_intersects(struc
 	return ret;
 }
 
-/* Allocate a page in interleaved policy.
-   Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
-					unsigned nid)
-{
-	struct zonelist *zl;
-	struct page *page;
-
-	zl = node_zonelist(nid, gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
-		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
-	return page;
-}
-
 /**
  * 	alloc_pages_vma	- Allocate a page for a VMA.
  *
@@ -1875,17 +1863,6 @@ alloc_pages_vma(gfp_t gfp, int order, st
 	pol = get_vma_policy(current, vma, addr);
 	cpuset_mems_cookie = get_mems_allowed();
 
-	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
-		unsigned nid;
-
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
-			goto retry_cpuset;
-
-		return page;
-	}
 	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
 		/*
@@ -1943,12 +1920,9 @@ struct page *alloc_pages_current(gfp_t g
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
 	 */
-	if (pol->mode == MPOL_INTERLEAVE)
-		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
-		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
-				policy_nodemask(gfp, pol));
+	page = __alloc_pages_nodemask(gfp, order,
+			policy_zonelist(gfp, pol, numa_node_id()),
+			policy_nodemask(gfp, pol));
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -717,7 +717,6 @@ const char * const vmstat_text[] = {
 	"numa_hit",
 	"numa_miss",
 	"numa_foreign",
-	"numa_interleave",
 	"numa_local",
 	"numa_other",
 #endif




* [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
  2012-07-31 19:12 ` [PATCH 01/19] task_work: Remove dependency on sched.h Peter Zijlstra
  2012-07-31 19:12 ` [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 20:52   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 04/19] mm, thp: Preserve pgprot across huge page split Peter Zijlstra
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: mm_mpol-Make_MPOL_LOCAL_a_real_policy.patch --]
[-- Type: text/plain, Size: 2083 bytes --]

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
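
As a rough illustration (not part of the patch), a minimal userspace sketch
of explicitly requesting the new policy via set_mempolicy(2). The MPOL_LOCAL
define is an assumption matching the enum slot added below -- it is not in
<numaif.h> yet -- and the program links against libnuma (-lnuma):

/* Build with: gcc -o mpol_local mpol_local.c -lnuma */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef MPOL_LOCAL
#define MPOL_LOCAL 4	/* assumed: the enum slot added by this patch */
#endif

int main(void)
{
	/* MPOL_LOCAL takes no nodemask, so pass NULL/0. */
	if (set_mempolicy(MPOL_LOCAL, NULL, 0)) {
		perror("set_mempolicy(MPOL_LOCAL)");
		return EXIT_FAILURE;
	}

	/* Subsequent allocations of this task now prefer the local node. */
	return EXIT_SUCCESS;
}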

Requested-by: Christoph Lameter <cl@linux.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 4aa4273..fb515b4 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 30bea02..e7e3b58 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2273,7 +2277,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2326,12 +2329,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.2.3





* [PATCH 04/19] mm, thp: Preserve pgprot across huge page split
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (2 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 20:53   ` Rik van Riel
  2012-08-09 21:42   ` Andrea Arcangeli
  2012-07-31 19:12 ` [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure Peter Zijlstra
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: thp-preserve-prot.patch --]
[-- Type: text/plain, Size: 5521 bytes --]

If we marked a THP with our special PROT_NONE protections, ensure we
don't lose them across a split.

Collapse always seems to allocate a new (huge) page, which should already
end up on the new target node, so losing the protections there isn't a
problem.

Cc: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/pgtable.h |    1 
 mm/huge_memory.c               |  104 +++++++++++++++++++----------------------
 2 files changed, 50 insertions(+), 55 deletions(-)
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -350,6 +350,7 @@ static inline pgprot_t pgprot_modify(pgp
 }
 
 #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
 
 #define canon_pgprot(p) __pgprot(massage_pgprot(p))
 
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1353,64 +1353,60 @@ static int __split_huge_page_map(struct 
 	int ret = 0, i;
 	pgtable_t pgtable;
 	unsigned long haddr;
+	pgprot_t prot;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = page_check_address_pmd(page, mm, address,
 				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
-	if (pmd) {
-		pgtable = get_pmd_huge_pte(mm);
-		pmd_populate(mm, &_pmd, pgtable);
-
-		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
-		     i++, haddr += PAGE_SIZE) {
-			pte_t *pte, entry;
-			BUG_ON(PageCompound(page+i));
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			if (!pmd_write(*pmd))
-				entry = pte_wrprotect(entry);
-			else
-				BUG_ON(page_mapcount(page) != 1);
-			if (!pmd_young(*pmd))
-				entry = pte_mkold(entry);
-			pte = pte_offset_map(&_pmd, haddr);
-			BUG_ON(!pte_none(*pte));
-			set_pte_at(mm, haddr, pte, entry);
-			pte_unmap(pte);
-		}
+	if (!pmd)
+		goto unlock;
 
-		smp_wmb(); /* make pte visible before pmd */
-		/*
-		 * Up to this point the pmd is present and huge and
-		 * userland has the whole access to the hugepage
-		 * during the split (which happens in place). If we
-		 * overwrite the pmd with the not-huge version
-		 * pointing to the pte here (which of course we could
-		 * if all CPUs were bug free), userland could trigger
-		 * a small page size TLB miss on the small sized TLB
-		 * while the hugepage TLB entry is still established
-		 * in the huge TLB. Some CPU doesn't like that. See
-		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-		 * Erratum 383 on page 93. Intel should be safe but is
-		 * also warns that it's only safe if the permission
-		 * and cache attributes of the two entries loaded in
-		 * the two TLB is identical (which should be the case
-		 * here). But it is generally safer to never allow
-		 * small and huge TLB entries for the same virtual
-		 * address to be loaded simultaneously. So instead of
-		 * doing "pmd_populate(); flush_tlb_range();" we first
-		 * mark the current pmd notpresent (atomically because
-		 * here the pmd_trans_huge and pmd_trans_splitting
-		 * must remain set at all times on the pmd until the
-		 * split is complete for this pmd), then we flush the
-		 * SMP TLB and finally we write the non-huge version
-		 * of the pmd entry with pmd_populate.
-		 */
-		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-		pmd_populate(mm, pmd, pgtable);
-		ret = 1;
+	prot = pmd_pgprot(*pmd);
+	pgtable = get_pmd_huge_pte(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+
+		BUG_ON(PageCompound(page+i));
+		entry = mk_pte(page + i, prot);
+		entry = pte_mkdirty(entry);
+		if (!pmd_young(*pmd))
+			entry = pte_mkold(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
 	}
+
+	smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
+	/*
+	 * Up to this point the pmd is present and huge.
+	 *
+	 * If we overwrite the pmd with the not-huge version, we could trigger
+	 * a small page size TLB miss on the small sized TLB while the hugepage
+	 * TLB entry is still established in the huge TLB.
+	 *
+	 * Some CPUs don't like that. See
+	 * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
+	 * on page 93.
+	 *
+	 * Thus it is generally safer to never allow small and huge TLB entries
+	 * for overlapping virtual addresses to be loaded. So we first mark the
+	 * current pmd not present, then we flush the TLB and finally we write
+	 * the non-huge version of the pmd entry with pmd_populate.
+	 *
+	 * The above needs to be done under the ptl because pmd_trans_huge and
+	 * pmd_trans_splitting must remain set on the pmd until the split is
+	 * complete. The ptl also protects against concurrent faults due to
+	 * making the pmd not-present.
+	 */
+	set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	pmd_populate(mm, pmd, pgtable);
+	ret = 1;
+
+unlock:
 	spin_unlock(&mm->page_table_lock);
 
 	return ret;
@@ -2241,9 +2237,7 @@ static int khugepaged_wait_event(void)
 static void khugepaged_do_scan(struct page **hpage)
 {
 	unsigned int progress = 0, pass_through_head = 0;
-	unsigned int pages = khugepaged_pages_to_scan;
-
-	barrier(); /* write khugepaged_pages_to_scan to local stack */
+	unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
 
 	while (progress < pages) {
 		cond_resched();




* [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (3 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 04/19] mm, thp: Preserve pgprot across huge page split Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 20:55   ` Rik van Riel
  2012-08-09 21:43   ` Andrea Arcangeli
  2012-07-31 19:12 ` [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: numa-prot-none.patch --]
[-- Type: text/plain, Size: 9878 bytes --]

In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant; we can then use the
'spurious' protection faults to drive our migrations.

Pages that already had an effective PROT_NONE mapping will not be
detected as generating these 'spurious' faults, for the simple reason
that we cannot distinguish them by their protection bits; see
pte_prot_none.

This isn't a problem since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) aren't used, or are rare enough, for us not to care
about their placement.

Suggested-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/huge_mm.h   |    3 +
 include/linux/mempolicy.h |    4 +-
 include/linux/mm.h        |   12 ++++++
 mm/huge_memory.c          |   21 +++++++++++
 mm/memory.c               |   86 ++++++++++++++++++++++++++++++++++++++++++----
 mm/mempolicy.c            |   24 ++++++++++++
 mm/mprotect.c             |   24 +++++++++---
 7 files changed, 159 insertions(+), 15 deletions(-)
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,9 @@ extern int copy_huge_pmd(struct mm_struc
 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       pmd_t orig_pmd);
+extern void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags, pmd_t orig_pmd);
 extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
 extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
 					  unsigned long addr,
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -254,7 +254,9 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
-#else
+extern void lazy_migrate_process(struct mm_struct *mm);
+
+#else /* CONFIG_NUMA */
 
 struct mempolicy {};
 
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1046,6 +1046,9 @@ extern unsigned long move_page_tables(st
 extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long old_len, unsigned long new_len,
 			       unsigned long flags, unsigned long new_addr);
+extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end, pgprot_t newprot,
+			      int dirty_accountable);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
@@ -1495,6 +1498,15 @@ static inline pgprot_t vm_get_page_prot(
 }
 #endif
 
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+	/*
+	 * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+	 */
+	vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+	return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -750,6 +750,27 @@ int do_huge_pmd_anonymous_page(struct mm
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   unsigned int flags, pmd_t entry)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry)))
+		goto out_unlock;
+
+	/* do fancy stuff */
+
+	/* change back to regular protection */
+	entry = pmd_modify(entry, vma->vm_page_prot);
+	if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+		update_mmu_cache(vma, address, entry);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3415,6 +3415,71 @@ static int do_nonlinear_fault(struct mm_
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
+{
+	/*
+	 * If we have the normal vma->vm_page_prot protections we're not a
+	 * 'special' PROT_NONE page.
+	 *
+	 * This means we cannot get 'special' PROT_NONE faults from genuine
+	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+	 * tracking.
+	 *
+	 * Neither case is really interesting for our current use though so we
+	 * don't care.
+	 */
+	if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+		return false;
+
+	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
+static bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+{
+	/*
+	 * See pte_prot_none().
+	 */
+	if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+		return false;
+
+	return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+
+static int do_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pmd_t *pmd,
+			unsigned int flags, pte_t entry)
+{
+	spinlock_t *ptl;
+	int ret = 0;
+
+	if (!pte_unmap_same(mm, pmd, ptep, entry))
+		goto out;
+
+	/*
+	 * Do fancy stuff...
+	 */
+
+	/*
+	 * OK, nothing to do,.. change the protection back to what it
+	 * ought to be.
+	 */
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (unlikely(!pte_same(*ptep, entry)))
+		goto unlock;
+
+	flush_cache_page(vma, address, pte_pfn(entry));
+
+	ptep_modify_prot_start(mm, address, ptep);
+	entry = pte_modify(entry, vma->vm_page_prot);
+	ptep_modify_prot_commit(mm, address, ptep, entry);
+
+	update_mmu_cache(vma, address, ptep);
+unlock:
+	pte_unmap_unlock(ptep, ptl);
+out:
+	return ret;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3453,6 +3518,9 @@ int handle_pte_fault(struct mm_struct *m
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_prot_none(vma, entry))
+		return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3517,13 +3585,16 @@ int handle_mm_fault(struct mm_struct *mm
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
-		int ret;
+		int ret = 0;
 
 		barrier();
-		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+		if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_prot_none(vma, orig_pmd)) {
+				do_huge_pmd_prot_none(mm, vma, address, pmd,
+						      flags, orig_pmd);
+			}
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3533,12 +3604,13 @@ int handle_mm_fault(struct mm_struct *mm
 				 */
 				if (unlikely(ret & VM_FAULT_OOM))
 					goto retry;
-				return ret;
 			}
-			return 0;
+
+			return ret;
 		}
 	}
 
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -565,6 +565,12 @@ static inline int check_pgd_range(struct
 	return 0;
 }
 
+static void
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
+
 /*
  * Check if all pages in a range are on a set of nodes.
  * If pagelist != NULL then isolate pages from the LRU and
@@ -1197,6 +1203,24 @@ static long do_mbind(unsigned long start
 	return err;
 }
 
+static void lazy_migrate_vma(struct vm_area_struct *vma)
+{
+	if (!vma_migratable(vma))
+		return;
+
+	change_prot_none(vma, vma->vm_start, vma->vm_end);
+}
+
+void lazy_migrate_process(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next)
+		lazy_migrate_vma(vma);
+	up_read(&mm->mmap_sem);
+}
+
 /*
  * User space interface with variable sized bitmaps for nodelists.
  */
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -119,7 +119,7 @@ static inline void change_pud_range(stru
 	} while (pud++, addr = next, addr != end);
 }
 
-static void change_protection(struct vm_area_struct *vma,
+static void change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -141,6 +141,20 @@ static void change_protection(struct vm_
 	flush_tlb_range(vma, start, end);
 }
 
+void change_protection(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, pgprot_t newprot,
+		       int dirty_accountable)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_change_protection(vma, start, end, newprot);
+	else
+		change_protection_range(vma, start, end, newprot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	unsigned long start, unsigned long end, unsigned long newflags)
@@ -213,12 +227,8 @@ mprotect_fixup(struct vm_area_struct *vm
 		dirty_accountable = 1;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
-	else
-		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	perf_event_mmap(vma);




* [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY ...
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (4 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:04   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: mm_mpol-Add_MPOL_MF_LAZY____.patch --]
[-- Type: text/plain, Size: 4951 bytes --]

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch adds another mbind() flag to request "lazy migration".
The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are simply unmapped from the calling task's page table ['_MOVE]
or from all referencing page tables [_MOVE_ALL].  Anon pages will first
be added to the swap [or migration?] cache, if necessary.  The pages
will be migrated in the fault path on "first touch", if the policy
dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
After unmap, the pages in regions assigned to the worker threads
will be automatically migrated local to the threads on 1st touch.
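
A minimal userspace sketch (not part of the patch) of the intended call.
MPOL_MF_LAZY is defined locally on the assumption that it matches the
(1<<3) value introduced below; it is not exported in <numaif.h> yet, and
the program links against libnuma (-lnuma):

/* Build with: gcc -c lazy_bind.c -lnuma */
#include <numaif.h>
#include <stdio.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* assumed: matches this patch */
#endif

/*
 * Bind [addr, addr+len) to node 1, but defer the page moves: the pages
 * are only unmapped here and migrated on first touch, policy permitting.
 */
static void bind_node1_lazily(void *addr, unsigned long len)
{
	unsigned long nodemask = 1UL << 1;	/* node 1 */

	if (mbind(addr, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
		  MPOL_MF_MOVE | MPOL_MF_LAZY))
		perror("mbind(MPOL_BIND, MPOL_MF_MOVE | MPOL_MF_LAZY)");
}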

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ nearly complete rewrite.. ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |   13 ++++++++++---
 include/linux/migrate.h   |    1 +
 mm/mempolicy.c            |   46 +++++++++++++++++++++++++++++-----------------
 3 files changed, 40 insertions(+), 20 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,9 +48,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -589,22 +589,32 @@ check_range(struct mm_struct *mm, unsign
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_none(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		      vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -612,6 +622,7 @@ check_range(struct mm_struct *mm, unsign
 				break;
 			}
 		}
+next:
 		prev = vma;
 	}
 	return first;
@@ -1118,8 +1129,7 @@ static long do_mbind(unsigned long start
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+  	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1178,21 +1188,23 @@ static long do_mbind(unsigned long start
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
-						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						  (unsigned long)vma,
+						  false, MIGRATE_SYNC);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);




* [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (5 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:06   ` Rik van Riel
                     ` (2 more replies)
  2012-07-31 19:12 ` [PATCH 08/19] mm/mpol: Check for misplaced page Peter Zijlstra
                   ` (12 subsequent siblings)
  19 siblings, 3 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: mm_mpol-Add_MPOL_MF_NOOP.patch --]
[-- Type: text/plain, Size: 2695 bytes --]

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
policy to mbind().  When the NOOP policy is used with the 'MOVE
and 'LAZY flags, mbind() [check_range()] will walk the specified
range and unmap eligible pages so that they will be migrated on
next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
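
For illustration only (not part of the patch), a sketch of the usage
described above. MPOL_NOOP and MPOL_MF_LAZY are defined locally on the
assumption that they match the values introduced by this series; neither
is in <numaif.h> yet (link with -lnuma):

/* Build with: gcc -c lazy_noop.c -lnuma */
#include <numaif.h>
#include <stdio.h>

#ifndef MPOL_NOOP
#define MPOL_NOOP	5		/* assumed: enum slot added by this patch */
#endif
#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* assumed: from the MPOL_MF_LAZY patch */
#endif

/*
 * Unmap the range so its pages get lazily migrated on next touch according
 * to whatever mempolicy already applies; the policy itself is left alone
 * because the mode is MPOL_NOOP.
 */
static void lazy_migrate_range(void *addr, unsigned long len)
{
	if (mbind(addr, len, MPOL_NOOP, NULL, 0, MPOL_MF_MOVE | MPOL_MF_LAZY))
		perror("mbind(MPOL_NOOP)");
}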

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |    8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 87fabfa..668311a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4fba5f2..251ef31 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1069,7 +1069,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1121,7 +1121,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma))
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
 	if (!err) {
-- 
1.7.2.3





* [PATCH 08/19] mm/mpol: Check for misplaced page
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (6 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:13   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page() Peter Zijlstra
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Lee Schermerhorn, Peter Zijlstra

[-- Attachment #1: mm_mpol-Check_for_misplaced_page.patch --]
[-- Type: text/plain, Size: 5045 bytes --]

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocated the page
without consulting the policy again.  Because interleaved and
non-interleaved allocations are accounted differently, the function
also returns whether or not the new node came from an interleaved
policy, if the page is misplaced.

A subsequent patch will call this function from the fault path for
stable pages with zero page_mapcount().  Because of this, I don't
want to go ahead and allocate the page, e.g., via alloc_page_vma()
only to have to free it if it has the correct policy.  So, I just
mimic the alloc_page_vma() node computation logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any node in
the interleave nodemask.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    9 +++++
 mm/mempolicy.c            |   79 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -68,6 +68,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 #ifdef __KERNEL__
 
@@ -262,6 +263,8 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 extern void lazy_migrate_process(struct mm_struct *mm);
 
 #else /* CONFIG_NUMA */
@@ -389,6 +392,12 @@ static inline int mpol_to_str(char *buff
 	return 0;
 }
 
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
+
 #endif /* CONFIG_NUMA */
 #endif /* __KERNEL__ */
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1152,6 +1152,9 @@ static long do_mbind(unsigned long start
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -2143,6 +2146,82 @@ mpol_shared_policy_lookup(struct shared_
 	return pol;
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);




* [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page()
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (7 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 08/19] mm/mpol: Check for misplaced page Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:16   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: numa-migrate_misplaced_page.patch --]
[-- Type: text/plain, Size: 5881 bytes --]

Add migrate_misplaced_page(), which deals with migrating pages from the
fault path. This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required because we had to look up
the page.

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/migrate.h      |    7 +++
 include/linux/migrate_mode.h |    3 +
 mm/migrate.c                 |   85 ++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 87 insertions(+), 8 deletions(-)
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct mm_struct *, struct page *, int);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,5 +64,11 @@ static inline int migrate_huge_page_move
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline
+int migrate_misplaced_page(struct mm_struct *mm, struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ *	this path has an extra reference count
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_FAULT,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -224,7 +224,7 @@ static bool buffer_migrate_lock_buffers(
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC) {
+	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -278,12 +278,22 @@ static int migrate_page_move_mapping(str
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
+	if (mode == MIGRATE_FAULT) {
+		/*
+		 * MIGRATE_FAULT has an extra reference on the page and
+		 * otherwise acts like ASYNC, no point in delaying the
+		 * fault, we'll try again next time.
+		 */
+		expected_count++;
+	}
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		if (page_count(page) != 1)
+		expected_count += 1;
+		if (page_count(page) != expected_count)
 			return -EAGAIN;
 		return 0;
 	}
@@ -293,7 +303,7 @@ static int migrate_page_move_mapping(str
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	expected_count = 2 + page_has_private(page);
+	expected_count += 2 + page_has_private(page);
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -312,7 +322,7 @@ static int migrate_page_move_mapping(str
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if (mode == MIGRATE_ASYNC && head &&
+	if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
 			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -520,7 +530,7 @@ int buffer_migrate_page(struct address_s
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked now
 	 */
-	if (mode != MIGRATE_ASYNC)
+	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
 		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page 
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
+		if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
 			goto out;
 
 		/*
@@ -1428,4 +1438,63 @@ int migrate_vmas(struct mm_struct *mm, c
  	}
  	return err;
 }
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct mm_struct *mm, struct page *page, int node)
+{
+	struct address_space *mapping = page_mapping(page);
+	int page_lru = page_is_file_cache(page);
+	struct page *newpage;
+	int ret = -EAGAIN;
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 */
+	if (page_mapcount(page) != 1)
+		goto out;
+
+	/*
+	 * Never wait for allocations just to migrate on fault, but don't dip
+	 * into reserves. And, only accept pages from the specified node. No
+	 * sense migrating to a different "misplaced" page!
+	 */
+	if (mapping)
+		gfp = mapping_gfp_mask(mapping);
+	gfp &= ~__GFP_WAIT;
+	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+	newpage = alloc_pages_node(node, gfp, 0);
+	if (!newpage) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (isolate_lru_page(page)) {
+		ret = -EBUSY;
+		goto put_new;
+	}
+
+	inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+	/*
+	 * A page that has been migrated has all references removed and will be
+	 * freed. A page that has not been migrated will have kepts its
+	 * references and be restored.
+	 */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	putback_lru_page(page);
+put_new:
+	/*
+	 * Move the new page to the LRU. If migration was not successful
+	 * then this will free the page.
+	 */
+	putback_lru_page(newpage);
+out:
+	return ret;
+}
+
+#endif /* CONFIG_NUMA */




* [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (8 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page() Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:24   ` Rik van Riel
  2012-08-09 21:44   ` Andrea Arcangeli
  2012-07-31 19:12 ` [PATCH 11/19] sched, mm: Introduce tsk_home_node() Peter Zijlstra
                   ` (9 subsequent siblings)
  19 siblings, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: numa-prot-none-migrate.patch --]
[-- Type: text/plain, Size: 3927 bytes --]

Combine our previous PROT_NONE, mpol_misplaced and
migrate_misplaced_page() pieces into an effective migrate-on-fault
scheme.

Suggested-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/huge_memory.c |   41 ++++++++++++++++++++++++++++++++++++++++-
 mm/memory.c      |   42 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 76 insertions(+), 7 deletions(-)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -755,12 +756,48 @@ void do_huge_pmd_prot_none(struct mm_str
 			   unsigned int flags, pmd_t entry)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = NULL;
+	int node;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
 		goto out_unlock;
 
-	/* do fancy stuff */
+	if (unlikely(pmd_trans_splitting(entry))) {
+		spin_unlock(&mm->page_table_lock);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return;
+	}
+
+#ifdef CONFIG_NUMA
+	page = pmd_page(entry);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	/*
+	 * XXX should we serialize against split_huge_page ?
+	 */
+
+	node = mpol_misplaced(page, vma, haddr);
+	if (node == -1)
+		goto do_fixup;
+
+	/*
+	 * Due to lacking code to migrate thp pages, we'll split
+	 * (which preserves the special PROT_NONE) and re-take the
+	 * fault on the normal pages.
+	 */
+	split_huge_page(page);
+	put_page(page);
+	return;
+
+do_fixup:
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry)))
+		goto out_unlock;
+#endif
 
 	/* change back to regular protection */
 	entry = pmd_modify(entry, vma->vm_page_prot);
@@ -769,6 +806,8 @@ void do_huge_pmd_prot_none(struct mm_str
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3449,17 +3450,42 @@ static int do_prot_none(struct mm_struct
 			unsigned long address, pte_t *ptep, pmd_t *pmd,
 			unsigned int flags, pte_t entry)
 {
+	struct page *page = NULL;
 	spinlock_t *ptl;
-	int ret = 0;
+	int node;
 
-	if (!pte_unmap_same(mm, pmd, ptep, entry))
-		goto out;
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, entry)))
+		goto unlock;
 
+#ifdef CONFIG_NUMA
 	/*
-	 * Do fancy stuff...
+	 * For NUMA systems we use the special PROT_NONE maps to drive
+	 * lazy page migration, see MPOL_MF_LAZY and related.
 	 */
+	page = vm_normal_page(vma, address, entry);
+	if (!page)
+		goto do_fixup_locked;
+
+	get_page(page);
+	pte_unmap_unlock(ptep, ptl);
+
+	node = mpol_misplaced(page, vma, address);
+	if (node == -1)
+		goto do_fixup;
 
 	/*
+	 * Page migration will install a new pte with vma->vm_page_prot,
+	 * otherwise fall-through to the fixup. Next time,.. perhaps.
+	 */
+	if (!migrate_misplaced_page(mm, page, node)) {
+		put_page(page);
+		return 0;
+	}
+
+do_fixup:
+	/*
 	 * OK, nothing to do,.. change the protection back to what it
 	 * ought to be.
 	 */
@@ -3467,6 +3493,9 @@ static int do_prot_none(struct mm_struct
 	if (unlikely(!pte_same(*ptep, entry)))
 		goto unlock;
 
+do_fixup_locked:
+#endif /* CONFIG_NUMA */
+
 	flush_cache_page(vma, address, pte_pfn(entry));
 
 	ptep_modify_prot_start(mm, address, ptep);
@@ -3476,8 +3505,9 @@ static int do_prot_none(struct mm_struct
 	update_mmu_cache(vma, address, ptep);
 unlock:
 	pte_unmap_unlock(ptep, ptl);
-out:
-	return ret;
+	if (page)
+		put_page(page);
+	return 0;
 }
 
 /*



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 11/19] sched, mm: Introduce tsk_home_node()
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (9 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:30   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 12/19] mm/mpol: Make mempolicy home-node aware Peter Zijlstra
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched_mm-Introduce_tsk_home_node.patch --]
[-- Type: text/plain, Size: 3790 bytes --]

Introduce the home-node concept for tasks. In order to keep memory
local we need something to stay local to; we define the home-node of a
task as the node we prefer to allocate memory from and prefer to
execute on.

These are not hard guarantees, merely preferences. This allows for
better resource usage: we can run a task away from its home-node, since
the remote memory hit -- while expensive -- is less expensive than not
running at all, or barely running, due to severe cpu overload.

Similarly, we can allocate memory from another node if our home-node
is depleted; again, some memory is better than no memory.

This patch merely introduces the basic infrastructure; all policy
comes later.
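
As a stand-alone illustration (not the kernel code; the struct and the
helper below merely model the new task_struct::node field), the
home-node is nothing more than a hint that callers consult and may
ignore, and -- mirroring how the real tsk_home_node() behaves on
!CONFIG_NUMA builds -- a value of -1 means "no preference":

/* Illustrative model only, not kernel code. */
struct task {
	int node;			/* home node, -1 means no preference */
};

static inline int tsk_home_node(const struct task *p)
{
	return p->node;
}

/* Prefer the home node for allocation/placement when one is set. */
static int preferred_node(const struct task *p, int local_node)
{
	int home = tsk_home_node(p);

	return home == -1 ? local_node : home;
}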

Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/sched/core.c       |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b806b82..53be033 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_NUMA
+# define INIT_TASK_NUMA(tsk)						\
+	.node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fd9436a..3384ae8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1518,6 +1518,7 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+	int node;
 #endif
 	struct rcu_head rcu;
 
@@ -1592,6 +1593,15 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+#ifdef CONFIG_NUMA
+	return p->node;
+#else
+	return -1;
+#endif
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5d011ef..fddb68f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6545,6 +6545,38 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Requeues a task ensuring its on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- its all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
-- 
1.7.2.3




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 12/19] mm/mpol: Make mempolicy home-node aware
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (10 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 11/19] sched, mm: Introduce tsk_home_node() Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:33   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 13/19] sched: Introduce sched_feat_numa() Peter Zijlstra
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Christoph Lameter, Peter Zijlstra

[-- Attachment #1: mm_mpol-Make_mempolicy_home-node_aware.patch --]
[-- Type: text/plain, Size: 2574 bytes --]

Add another layer of fallback policy to make the home node concept
useful from a memory allocation PoV.

This changes the mpol order to:

 - vma->vm_ops->get_policy	[if applicable]
 - vma->vm_policy		[if applicable]
 - task->mempolicy
 - tsk_home_node() preferred	[NEW]
 - default_policy

Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
facilitate efficient on-demand memory migration.
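
A condensed stand-alone model of the resulting lookup order follows
(illustrative only: the *_model types, the policy_for() name and the
8-node array size are made up; locking, shared-policy lookup and
refcounting are elided):

/* Stand-alone model (not kernel code) of the new policy fallback order. */
struct mempolicy { int mode; };

struct vma_model {
	struct mempolicy *(*get_policy)(void);	/* models vma->vm_ops->get_policy */
	struct mempolicy *vm_policy;		/* models vma->vm_policy */
};

struct task_model {
	struct mempolicy *mempolicy;		/* models task->mempolicy */
	int home_node;				/* -1 = no home node */
};

static struct mempolicy preferred_node_policy[8];	/* per-node MPOL_PREFERRED|MPOL_F_MOF */
static struct mempolicy default_policy;

static struct mempolicy *policy_for(struct task_model *task, struct vma_model *vma)
{
	if (vma && vma->get_policy)
		return vma->get_policy();	/* 1) vma->vm_ops->get_policy */
	if (vma && vma->vm_policy)
		return vma->vm_policy;		/* 2) vma->vm_policy */
	if (task->mempolicy)
		return task->mempolicy;		/* 3) task->mempolicy */
	if (task->home_node != -1)		/* 4) tsk_home_node() preferred [NEW] */
		return &preferred_node_policy[task->home_node];
	return &default_policy;			/* 5) default_policy */
}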

Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempolicy.c |   29 +++++++++++++++++++++++++++--
 1 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e0dde20..3e109ae 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -117,6 +117,22 @@ static struct mempolicy default_policy = {
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = tsk_home_node(p);
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1519,7 +1535,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -1903,7 +1919,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2371,6 +2387,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
-- 
1.7.2.3




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 13/19] sched: Introduce sched_feat_numa()
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (11 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 12/19] mm/mpol: Make mempolicy home-node aware Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:34   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 14/19] sched: Make find_busiest_queue() a method Peter Zijlstra
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Christoph Lameter, Peter Zijlstra

[-- Attachment #1: sched-Introduce_sched_feat_numa.patch --]
[-- Type: text/plain, Size: 995 bytes --]

Avoid a few #ifdef's later on.
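
For example, patch 15 in this series can then simply write the
following without wrapping it in #ifdef CONFIG_NUMA; on !NUMA builds
sched_feat_numa() evaluates to 0 and the compiler discards the branch:

	if (sched_feat_numa(NUMA_PULL))
		env.tasks = offnode_tasks(busiest);
	else
		env.tasks = &busiest->cfs_tasks;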

Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/sched.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fce5a8..a817261 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -644,6 +644,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
-- 
1.7.2.3




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 14/19] sched: Make find_busiest_queue() a method
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (12 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 13/19] sched: Introduce sched_feat_numa() Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:34   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 15/19] sched: Implement home-node awareness Peter Zijlstra
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Christoph Lameter, Peter Zijlstra

[-- Attachment #1: sched-Make_find_busiest_queue_a_method.patch --]
[-- Type: text/plain, Size: 1827 bytes --]

It's a bit awkward, but it was the least painful way of modifying the
queue selection. This is used in a later patch to conditionally pick a
random queue.
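
A stand-alone sketch of the pattern (not the kernel code;
pick_by_load() and pick_numa_random() are made-up stand-ins for
find_busiest_queue() and the randomised off-node variant a later patch
adds):

struct rq;
struct sched_group;
struct cpumask;

struct lb_env {
	/* queue-selection policy, now a per-balance callback */
	struct rq *(*find_busiest_queue)(struct lb_env *env,
					 struct sched_group *group,
					 const struct cpumask *cpus);
};

static struct rq *pick_by_load(struct lb_env *env, struct sched_group *group,
			       const struct cpumask *cpus)
{
	return (struct rq *)0;	/* stub: the regular most-loaded queue */
}

static struct rq *pick_numa_random(struct lb_env *env, struct sched_group *group,
				   const struct cpumask *cpus)
{
	return (struct rq *)0;	/* stub: random queue holding off-node tasks */
}

static void setup_env(struct lb_env *env, int numa_pull_bias)
{
	env->find_busiest_queue = pick_by_load;		/* default */
	if (numa_pull_bias)				/* overridden later in the series */
		env->find_busiest_queue = pick_numa_random;
}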

Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/fair.c |   19 ++++++++++++-------
 1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22321db..9c4164e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3074,6 +3074,10 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *,
+						      const struct cpumask *);
 };
 
 /*
@@ -4246,12 +4250,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
 
 	struct lb_env env = {
-		.sd		= sd,
-		.dst_cpu	= this_cpu,
-		.dst_rq		= this_rq,
-		.dst_grpmask    = sched_group_cpus(sd->groups),
-		.idle		= idle,
-		.loop_break	= sched_nr_migrate_break,
+		.sd		    = sd,
+		.dst_cpu	    = this_cpu,
+		.dst_rq		    = this_rq,
+		.dst_grpmask	    = sched_group_cpus(sd->groups),
+		.idle		    = idle,
+		.loop_break	    = sched_nr_migrate_break,
+		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4270,7 +4275,7 @@ redo:
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(&env, group, cpus);
+	busiest = env.find_busiest_queue(&env, group, cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
-- 
1.7.2.3




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 15/19] sched: Implement home-node awareness
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (13 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 14/19] sched: Make find_busiest_queue() a method Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:52   ` Rik van Riel
  2012-08-09 21:51   ` Andrea Arcangeli
  2012-07-31 19:12 ` [PATCH 16/19] sched, numa: NUMA home-node selection code Peter Zijlstra
                   ` (4 subsequent siblings)
  19 siblings, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Christoph Lameter, Peter Zijlstra

[-- Attachment #1: sched-Implement_home-node_awareness.patch --]
[-- Type: text/plain, Size: 15337 bytes --]

Implement home node preference in the load-balancer.

This is done in four pieces:

 - task_numa_hot(); make it harder to migrate tasks away from their
   home-node, controlled using the NUMA_HOT feature flag.

 - select_task_rq_fair(); prefer placing the task in their home-node,
   controlled using the NUMA_BIAS feature flag.

 - load_balance(); during the regular pull load-balance pass, try
   pulling tasks that are on the wrong node first with a preference
   of moving them nearer to their home-node through task_numa_hot(),
   controlled through the NUMA_PULL feature flag.

 - load_balance(); when the balancer finds no imbalance, introduce
   some imbalance such that it still prefers to move tasks towards
   their home-node, using active load-balance if needed, controlled
   through the NUMA_PULL_BIAS feature flag.

In order to easily find off-node tasks, split the per-cpu task list
into two parts.
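
To illustrate the NUMA_HOT piece, a stand-alone model of the test (the
kernel version feeds it node_distance(cpu_to_node(from), home) and
node_distance(cpu_to_node(to), home)); with typical SLIT distances of
10 (local) and 20 (remote), a move from a remote node to the home node
(20 -> 10) is allowed, whereas home -> remote (10 -> 20) or remote ->
equally-remote (20 -> 20) makes the task count as numa-hot and the
migration is resisted:

/* Stand-alone model (not kernel code) of the NUMA_HOT test. */
static int numa_hot(int home_node, int dist_from_home, int dist_to_home)
{
	if (home_node == -1)
		return 0;		/* no node preference */

	if (dist_to_home < dist_from_home)
		return 0;		/* getting closer is ok */

	return 1;			/* stick to where we are */
}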

Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |   21 +++-
 kernel/sched/debug.c    |    3 
 kernel/sched/fair.c     |  236 ++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |    8 +
 kernel/sched/sched.h    |   19 +++
 6 files changed, 271 insertions(+), 17 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -862,6 +862,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6034,7 +6034,9 @@ static void destroy_sched_domains(struct
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_id);
 
-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
 {
 	struct sched_domain *sd;
 	int id = cpu;
@@ -6077,6 +6079,15 @@ static void update_top_cache_domain(int 
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_id, cpu) = id;
+
+	for_each_domain(cpu, sd) {
+		if (cpumask_equal(sched_domain_span(sd),
+				  cpumask_of_node(cpu_to_node(cpu))))
+			goto got_node;
+	}
+	sd = NULL;
+got_node:
+	rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
 }
 
 /*
@@ -6119,7 +6130,7 @@ cpu_attach_domain(struct sched_domain *s
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
-	update_top_cache_domain(cpu);
+	update_domain_cache(cpu);
 }
 
 /* cpus with isolated domains */
@@ -6619,6 +6630,7 @@ sd_numa_init(struct sched_domain_topolog
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
@@ -7410,6 +7422,11 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
+#ifdef CONFIG_NUMA
+		INIT_LIST_HEAD(&rq->offnode_tasks);
+		rq->offnode_running = 0;
+		rq->offnode_weight = 0;
+#endif
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -132,6 +132,9 @@ print_task(struct seq_file *m, struct rq
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_NUMA
+	SEQ_printf(m, " %d/%d", p->node, cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/random.h>
 
 #include <trace/events/sched.h>
 
@@ -2688,6 +2693,7 @@ select_task_rq_fair(struct task_struct *
 	int want_affine = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
+	int node = tsk_home_node(p);
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -2699,6 +2705,29 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	rcu_read_lock();
+	if (sched_feat_numa(NUMA_BIAS) && node != -1) {
+		int node_cpu;
+
+		node_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpumask_of_node(node));
+		if (node_cpu >= nr_cpu_ids)
+			goto find_sd;
+
+		/*
+		 * For fork,exec find the idlest cpu in the home-node.
+		 */
+		if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
+			new_cpu = cpu = node_cpu;
+			sd = per_cpu(sd_node, cpu);
+			goto pick_idlest;
+		}
+
+		/*
+		 * For wake, pretend we were running in the home-node.
+		 */
+		prev_cpu = node_cpu;
+	}
+
+find_sd:
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -2752,6 +2781,7 @@ select_task_rq_fair(struct task_struct *
 		goto unlock;
 	}
 
+pick_idlest:
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
@@ -3071,6 +3101,8 @@ struct lb_env {
 	long			imbalance;
 	unsigned int		flags;
 
+	struct list_head	*tasks;
+
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
@@ -3092,6 +3124,23 @@ static void move_task(struct task_struct
 	check_preempt_curr(env->dst_rq, p, 0);
 }
 
+static int task_numa_hot(struct task_struct *p, int from_cpu, int to_cpu)
+{
+	int from_dist, to_dist;
+	int node = tsk_home_node(p);
+
+	if (!sched_feat_numa(NUMA_HOT) || node == -1)
+		return 0; /* no node preference */
+
+	from_dist = node_distance(cpu_to_node(from_cpu), node);
+	to_dist = node_distance(cpu_to_node(to_cpu), node);
+
+	if (to_dist < from_dist)
+		return 0; /* getting closer is ok */
+
+	return 1; /* stick to where we are */
+}
+
 /*
  * Is this task likely cache-hot:
  */
@@ -3177,6 +3226,7 @@ int can_migrate_task(struct task_struct 
 	 */
 
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	tsk_cache_hot |= task_numa_hot(p, env->src_cpu, env->dst_cpu);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
@@ -3202,11 +3252,11 @@ int can_migrate_task(struct task_struct 
  *
  * Called with both runqueues locked.
  */
-static int move_one_task(struct lb_env *env)
+static int __move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
-	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3225,6 +3275,21 @@ static int move_one_task(struct lb_env *
 	return 0;
 }
 
+static int move_one_task(struct lb_env *env)
+{
+	if (sched_feat_numa(NUMA_PULL)) {
+		env->tasks = offnode_tasks(env->src_rq);
+		if (__move_one_task(env))
+			return 1;
+	}
+
+	env->tasks = &env->src_rq->cfs_tasks;
+	if (__move_one_task(env))
+		return 1;
+
+	return 0;
+}
+
 static unsigned long task_h_load(struct task_struct *p);
 
 static const unsigned int sched_nr_migrate_break = 32;
@@ -3238,7 +3303,6 @@ static const unsigned int sched_nr_migra
  */
 static int move_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
 	unsigned long load;
 	int pulled = 0;
@@ -3246,8 +3310,9 @@ static int move_tasks(struct lb_env *env
 	if (env->imbalance <= 0)
 		return 0;
 
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
+again:
+	while (!list_empty(env->tasks)) {
+		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
 
 		env->loop++;
 		/* We've more or less seen every task there is, call it quits */
@@ -3258,7 +3323,7 @@ static int move_tasks(struct lb_env *env
 		if (env->loop > env->loop_break) {
 			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
-			break;
+			goto out;
 		}
 
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3286,7 +3351,7 @@ static int move_tasks(struct lb_env *env
 		 * the critical section.
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+			goto out;
 #endif
 
 		/*
@@ -3294,13 +3359,20 @@ static int move_tasks(struct lb_env *env
 		 * weighted load.
 		 */
 		if (env->imbalance <= 0)
-			break;
+			goto out;
 
 		continue;
 next:
-		list_move_tail(&p->se.group_node, tasks);
+		list_move_tail(&p->se.group_node, env->tasks);
 	}
 
+	if (env->tasks == offnode_tasks(env->src_rq)) {
+		env->tasks = &env->src_rq->cfs_tasks;
+		env->loop = 0;
+		goto again;
+	}
+
+out:
 	/*
 	 * Right now, this is one of only two places move_task() is called,
 	 * so we can safely collect move_task() stats here rather than
@@ -3447,6 +3519,11 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+#ifdef CONFIG_NUMA
+	struct sched_group *numa_group; /* group which has offnode_tasks */
+	unsigned long numa_group_weight;
+	unsigned long numa_group_running;
+#endif
 };
 
 /*
@@ -3462,6 +3539,10 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_NUMA
+	unsigned long numa_weight;
+	unsigned long numa_running;
+#endif
 };
 
 /**
@@ -3490,6 +3571,117 @@ static inline int get_sd_load_idx(struct
 	return load_idx;
 }
 
+#ifdef CONFIG_NUMA
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+	sgs->numa_weight += rq->offnode_weight;
+	sgs->numa_running += rq->offnode_running;
+}
+
+/*
+ * Since the offnode lists are indiscriminate (they contain tasks for all other
+ * nodes) it is impossible to say if there's any task on there that wants to
+ * move towards the pulling cpu. Therefore select a random offnode list to pull
+ * from such that eventually we'll try them all.
+ */
+static inline bool pick_numa_rand(void)
+{
+	return get_random_int() & 1;
+}
+
+/*
+ * Select a random group that has offnode tasks as sds->numa_group
+ */
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+	if (!(sd->flags & SD_NUMA))
+		return;
+
+	if (local_group)
+		return;
+
+	if (!sgs->numa_running)
+		return;
+
+	if (!sds->numa_group || pick_numa_rand()) {
+		sds->numa_group = group;
+		sds->numa_group_weight = sgs->numa_weight;
+		sds->numa_group_running = sgs->numa_running;
+	}
+}
+
+/*
+ * Pick a random queue from the group that has offnode tasks.
+ */
+static struct rq *find_busiest_numa_queue(struct lb_env *env,
+					  struct sched_group *group,
+					  const struct cpumask *cpus)
+{
+	struct rq *busiest = NULL, *rq;
+	int cpu;
+
+	for_each_cpu_and(cpu, sched_group_cpus(group), cpus) {
+		rq = cpu_rq(cpu);
+		if (!rq->offnode_running)
+			continue;
+		if (!busiest || pick_numa_rand())
+			busiest = rq;
+	}
+
+	return busiest;
+}
+
+/*
+ * Called in case of no other imbalance, if there is a queue running offnode
+ * tasks we'll say we're imbalanced anyway to nudge these tasks towards their
+ * proper node.
+ */
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	if (!sched_feat(NUMA_PULL_BIAS))
+		return 0;
+
+	if (!sds->numa_group)
+		return 0;
+
+	env->imbalance = sds->numa_group_weight / sds->numa_group_running;
+	sds->busiest = sds->numa_group;
+	env->find_busiest_queue = find_busiest_numa_queue;
+	return 1;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return env->find_busiest_queue == find_busiest_numa_queue &&
+			env->src_rq->offnode_running == 1 &&
+			env->src_rq->nr_running == 1;
+}
+
+#else /* CONFIG_NUMA */
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	return 0;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return false;
+}
+#endif /* CONFIG_NUMA */
+
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -3707,6 +3899,8 @@ static inline void update_sg_lb_stats(st
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		update_sg_numa_stats(sgs, rq);
 	}
 
 	/*
@@ -3863,6 +4057,8 @@ static inline void update_sd_lb_stats(st
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
+
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4150,6 +4346,9 @@ find_busiest_group(struct lb_env *env, c
 	return sds.busiest;
 
 out_balanced:
+	if (check_numa_busiest_group(env, &sds))
+		return sds.busiest;
+
 ret:
 	env->imbalance = 0;
 	return NULL;
@@ -4229,6 +4428,9 @@ static int need_active_balance(struct lb
 			return 1;
 	}
 
+	if (need_active_numa_balance(env))
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -4280,6 +4482,8 @@ static int load_balance(int this_cpu, st
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
 	}
+	env.src_rq  = busiest;
+	env.src_cpu = busiest->cpu;
 
 	BUG_ON(busiest == this_rq);
 
@@ -4295,9 +4499,11 @@ static int load_balance(int this_cpu, st
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
-		env.src_rq    = busiest;
-		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		if (sched_feat_numa(NUMA_PULL))
+			env.tasks = offnode_tasks(busiest);
+		else
+			env.tasks = &busiest->cfs_tasks;
 
 more_balance:
 		local_irq_save(flags);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,3 +69,11 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_NUMA
+SCHED_FEAT(NUMA_HOT,       true)
+SCHED_FEAT(NUMA_BIAS,      true)
+SCHED_FEAT(NUMA_PULL,      true)
+SCHED_FEAT(NUMA_PULL_BIAS, true)
+#endif
+
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -414,6 +414,12 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA
+	unsigned long    offnode_running;
+	unsigned long	 offnode_weight;
+	struct list_head offnode_tasks;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -465,6 +471,15 @@ struct rq {
 #endif
 };
 
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+#ifdef CONFIG_NUMA
+	return &rq->offnode_tasks;
+#else
+	return NULL;
+#endif
+}
+
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -525,6 +540,7 @@ static inline struct sched_domain *highe
 
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);
 
 extern int group_balance_cpu(struct sched_group *sg);
 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 16/19] sched, numa: NUMA home-node selection code
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (14 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 15/19] sched: Implement home-node awareness Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:52   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 17/19] sched, numa: Detect big processes Peter Zijlstra
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: numa-1.patch --]
[-- Type: text/plain, Size: 12429 bytes --]

Now that we have infrastructure in place to migrate pages back to
their home-node, and migrate memory towards the home-node, we need to
set the home-node.

Instead of creating a secondary control loop, fully rely on the
existing load-balancer to do the right thing. The home-node selection
logic will simply pick the node the task has been found to run on
for two consecutive samples (see task_tick_numa).

This means NUMA placement is directly related to regular placement.
The home-node logic in the load-balancer tries to keep a task on the
home-node, whereas the fairness and work-conserving constraints will
try to move it away.

The balance between these two 'forces' is what will result in the NUMA
placement.
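
A stand-alone sketch of that sampling filter (not the kernel code; the
helper name is made up, the real logic lives in task_tick_numa() and is
additionally rate-limited on runtime):

struct numa_sample_state {
	int node;	/* current home node, -1 = none yet */
	int node_last;	/* node observed at the previous sample */
};

/*
 * True when the task was seen on the same node in two consecutive
 * samples and that node differs from its current home node; only then
 * is the (expensive) home-node change plus lazy migration queued.
 */
static int home_node_change_wanted(struct numa_sample_state *s, int node_now)
{
	int change = (s->node_last == node_now && s->node != node_now);

	s->node_last = node_now;
	return change;
}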

Cc: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |    3 
 include/linux/mm_types.h  |    3 
 include/linux/sched.h     |   19 +++--
 kernel/sched/core.c       |   18 ++++-
 kernel/sched/fair.c       |  163 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h   |    1 
 kernel/sched/sched.h      |   33 ++++++---
 kernel/sysctl.c           |   13 +++
 8 files changed, 227 insertions(+), 26 deletions(-)
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -145,7 +145,8 @@ extern struct task_group root_task_group
 
 #ifdef CONFIG_NUMA
 # define INIT_TASK_NUMA(tsk)						\
-	.node = -1,
+	.node = -1,							\
+	.node_last = -1,
 #else
 # define INIT_TASK_NUMA(tsk)
 #endif
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -388,6 +388,9 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_NUMA
+	unsigned long numa_next_scan;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -62,6 +62,7 @@ struct sched_param {
 #include <linux/errno.h>
 #include <linux/nodemask.h>
 #include <linux/mm_types.h>
+#include <linux/task_work.h>
 
 #include <asm/page.h>
 #include <asm/ptrace.h>
@@ -1519,8 +1520,14 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
-	int node;
-#endif
+
+	int node;			/* task home node   */
+	int node_last;			/* home node filter */
+#ifdef CONFIG_SMP
+	u64 node_stamp;			/* migration stamp  */
+	unsigned long numa_contrib;
+#endif /* CONFIG_SMP  */
+#endif /* CONFIG_NUMA */
 	struct rcu_head rcu;
 
 	/*
@@ -2029,22 +2036,22 @@ extern unsigned int sysctl_sched_nr_migr
 extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_timer_migration;
 extern unsigned int sysctl_sched_shares_window;
+extern unsigned int sysctl_sched_numa_task_period;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
 static inline unsigned int get_sysctl_timer_migration(void)
 {
 	return sysctl_timer_migration;
 }
-#else
+#else /* CONFIG_SCHED_DEBUG */
 static inline unsigned int get_sysctl_timer_migration(void)
 {
 	return 1;
 }
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1722,6 +1722,17 @@ static void __sched_fork(struct task_str
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_NUMA
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1)
+		p->mm->numa_next_scan = jiffies;
+
+	p->node = -1;
+	p->node_last = -1;
+#ifdef CONFIG_SMP
+	p->node_stamp = 0ULL;
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_NUMA */
 }
 
 /*
@@ -6558,9 +6569,9 @@ static struct sched_domain_topology_leve
  * Requeues a task ensuring its on the right load-balance list so
  * that it might get migrated to its new home.
  *
- * Note that we cannot actively migrate ourselves since our callers
- * can be from atomic context. We rely on the regular load-balance
- * mechanisms to move us around -- its all preference anyway.
+ * Since home-node is pure preference there's no hard migrate to force
+ * us anywhere, this also allows us to call this from atomic context if
+ * required.
  */
 void sched_setnode(struct task_struct *p, int node)
 {
@@ -6578,6 +6589,7 @@ void sched_setnode(struct task_struct *p
 		p->sched_class->put_prev_task(rq, p);
 
 	p->node = node;
+	p->node_last = node;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -27,6 +27,7 @@
 #include <linux/profile.h>
 #include <linux/interrupt.h>
 #include <linux/random.h>
+#include <linux/mempolicy.h>
 
 #include <trace/events/sched.h>
 
@@ -774,6 +775,139 @@ update_stats_curr_start(struct cfs_rq *c
 }
 
 /**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits are to maintain compute (task) and data
+ * (memory) locality. We try and achieve this by making tasks stick to
+ * a particular node (their home node) but if fairness mandates they run
+ * elsewhere for long enough, we let the memory follow them.
+ *
+ * Tasks start out with their home-node unset (-1) this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ */
+
+static unsigned long task_h_load(struct task_struct *p);
+
+#if defined(CONFIG_SMP) && defined(CONFIG_NUMA)
+static void account_offnode_enqueue(struct rq *rq, struct task_struct *p)
+{
+	p->numa_contrib = task_h_load(p);
+	rq->offnode_weight += p->numa_contrib;
+	rq->offnode_running++;
+}
+static void account_offnode_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->offnode_weight -= p->numa_contrib;
+	rq->offnode_running--;
+}
+
+/*
+ * numa task sample period in ms
+ */
+unsigned int sysctl_sched_numa_task_period = 2500;
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *t, *p = current;
+	int node = p->node_last;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, rcu));
+
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal migration frequency..
+	 */
+	migrate = p->mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_task_period);
+	if (cmpxchg(&p->mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	rcu_read_lock();
+	t = p;
+	do {
+		sched_setnode(t, node);
+	} while ((t = next_thread(p)) != p);
+	rcu_read_unlock();
+
+	lazy_migrate_process(p->mm);
+}
+
+/*
+ * Sample task location from hardirq context (tick), this has minimal bias with
+ * obvious exceptions of frequency interference and tick avoidance techniques.
+ * If this were to become a problem we could move this sampling into the
+ * sleep/wakeup path -- but we'd prefer to avoid that for obvious reasons.
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	u64 period, now;
+	int node;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm)
+		return;
+
+	/*
+	 * Sample our node location every @sysctl_sched_numa_task_period
+	 * runtime ms. We use a two stage selection in order to filter
+	 * unlikely locations.
+	 *
+	 * If P(n) is the probability we're on node 'n', then the probability
+	 * we sample the same node twice is P(n)^2. This quadratic squishes small
+	 * values and makes it more likely we end up on nodes where we have
+	 * significant presence.
+	 *
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)sysctl_sched_numa_task_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+		node = numa_node_id();
+
+		if (curr->node_last == node && curr->node != node) {
+			/*
+			 * We can re-use curr->rcu because we checked curr->mm
+			 * != NULL so release_task()->call_rcu() was not called
+			 * yet and exit_task_work() is called before
+			 * exit_notify().
+			 */
+			init_task_work(&curr->rcu, task_numa_work);
+			task_work_add(curr, &curr->rcu, true);
+		}
+		curr->node_last = node;
+	}
+}
+#else
+static void account_offnode_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static void account_offnode_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+#endif /* SMP && NUMA */
+
+/**************************************************
  * Scheduling class queueing methods:
  */
 
@@ -784,9 +918,19 @@ account_entity_enqueue(struct cfs_rq *cf
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct task_struct *p = task_of(se);
+		struct list_head *tasks = &rq->cfs_tasks;
+
+		if (offnode_task(p)) {
+			account_offnode_enqueue(rq, p);
+			tasks = offnode_tasks(rq);
+		}
+
+		list_add(&se->group_node, tasks);
+	}
+#endif /* CONFIG_SMP */
 	cfs_rq->nr_running++;
 }
 
@@ -796,8 +940,14 @@ account_entity_dequeue(struct cfs_rq *cf
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+
 		list_del_init(&se->group_node);
+
+		if (offnode_task(p))
+			account_offnode_dequeue(rq_of(cfs_rq), p);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -3286,8 +3436,6 @@ static int move_one_task(struct lb_env *
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
@@ -5173,6 +5321,9 @@ static void task_tick_fair(struct rq *rq
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	if (sched_feat_numa(NUMA))
+		task_tick_numa(rq, curr);
 }
 
 /*
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -71,6 +71,7 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 
 #ifdef CONFIG_NUMA
+SCHED_FEAT(NUMA,           true)
 SCHED_FEAT(NUMA_HOT,       true)
 SCHED_FEAT(NUMA_BIAS,      true)
 SCHED_FEAT(NUMA_PULL,      true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -471,15 +471,6 @@ struct rq {
 #endif
 };
 
-static inline struct list_head *offnode_tasks(struct rq *rq)
-{
-#ifdef CONFIG_NUMA
-	return &rq->offnode_tasks;
-#else
-	return NULL;
-#endif
-}
-
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -497,6 +488,30 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NUMA)
+static inline bool offnode_task(struct task_struct *t)
+{
+	return t->node != -1 && t->node != cpu_to_node(task_cpu(t));
+}
+
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+	return &rq->offnode_tasks;
+}
+
+void sched_setnode(struct task_struct *p, int node);
+#else /* SMP && NUMA */
+static inline bool offnode_task(struct task_struct *t)
+{
+	return false;
+}
+
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+	return NULL;
+}
+#endif /* SMP && NUMA */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -291,6 +291,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -337,7 +338,17 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#ifdef CONFIG_NUMA
+	{
+		.procname	= "sched_numa_task_period_ms",
+		.data		= &sysctl_sched_numa_task_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 17/19] sched, numa: Detect big processes
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (15 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 16/19] sched, numa: NUMA home-node selection code Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:53   ` Rik van Riel
  2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: numa-2.patch --]
[-- Type: text/plain, Size: 4038 bytes --]

Detect 'big' processes for which the one home-node per process isn't
going to work as desired.

The current policy for such tasks is to ignore them entirely and put
the home-node back to -1 (no preference) so they'll behave as if none
of this NUMA nonsense is there.

The current heuristic for determining if a task is 'big' is whether it
is consuming more than 1/2 a node's worth of cputime. We might want to
add a term here looking at the RSS of the process and compare it
against the available memory per node.
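
Concretely the check compares the process-wide cputime burnt since the
last sample against the elapsed walltime times half the node's cpu
count. A stand-alone model (not the kernel code) with a worked example:
on a node spanning 8 cpus and over, say, a 2.5s window (the default
sample period), a process is 'big' once it burnt more than
2.5s * max(1, 8/2) = 10s of cputime in that window:

/* Stand-alone model (not kernel code) of the 'big' process heuristic. */
static int process_is_big(unsigned long long runtime_delta,
			  unsigned long long walltime_delta,
			  int node_cpus)
{
	int half = node_cpus / 2;

	if (half < 1)
		half = 1;

	/* more cputime than half a node could have supplied? */
	return runtime_delta > walltime_delta * (unsigned long long)half;
}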

Cc: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    1 
 include/linux/sched.h    |    2 +
 kernel/sched/core.c      |    6 ++++-
 kernel/sched/fair.c      |   49 +++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 55 insertions(+), 3 deletions(-)
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -389,6 +389,7 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 #ifdef CONFIG_NUMA
+	unsigned int  numa_big;
 	unsigned long numa_next_scan;
 #endif
 	struct uprobes_state uprobes_state;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1525,6 +1525,8 @@ struct task_struct {
 	int node_last;			/* home node filter */
 #ifdef CONFIG_SMP
 	u64 node_stamp;			/* migration stamp  */
+	u64 numa_runtime_stamp;
+	u64 numa_walltime_stamp;
 	unsigned long numa_contrib;
 #endif /* CONFIG_SMP  */
 #endif /* CONFIG_NUMA */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1724,13 +1724,17 @@ static void __sched_fork(struct task_str
 #endif
 
 #ifdef CONFIG_NUMA
-	if (p->mm && atomic_read(&p->mm->mm_users) == 1)
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_big = 0;
 		p->mm->numa_next_scan = jiffies;
+	}
 
 	p->node = -1;
 	p->node_last = -1;
 #ifdef CONFIG_SMP
 	p->node_stamp = 0ULL;
+	p->numa_runtime_stamp = 0;
+	p->numa_walltime_stamp = local_clock();
 #endif /* CONFIG_SMP */
 #endif /* CONFIG_NUMA */
 }
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -803,11 +803,47 @@ static void account_offnode_dequeue(stru
 }
 
 /*
- * numa task sample period in ms
+ * numa task sample period in ms: 2.5s
  */
 unsigned int sysctl_sched_numa_task_period = 2500;
 
 /*
+ * Determine if a process is 'big'.
+ */
+static bool task_numa_big(struct task_struct *p)
+{
+	struct sched_domain *sd;
+	struct task_struct *t;
+	u64 walltime = local_clock();
+	u64 runtime = 0;
+	int weight = 0;
+
+	rcu_read_lock();
+	t = p;
+	do {
+		if (t->sched_class == &fair_sched_class)
+			runtime += t->se.sum_exec_runtime;
+	} while ((t = next_thread(t)) != p);
+
+	sd = rcu_dereference(__get_cpu_var(sd_node));
+	if (sd)
+		weight = sd->span_weight;
+	rcu_read_unlock();
+
+	runtime -= p->numa_runtime_stamp;
+	walltime -= p->numa_walltime_stamp;
+
+	p->numa_runtime_stamp += runtime;
+	p->numa_walltime_stamp += walltime;
+
+	/*
+	 * We're 'big' when we burn more than half a node's worth
+	 * of cputime.
+	 */
+	return runtime > walltime * max(1, weight / 2);
+}
+
+/*
  * The expensive part of numa migration is done from task_work context.
  */
 void task_numa_work(struct callback_head *work)
@@ -815,6 +851,7 @@ void task_numa_work(struct callback_head
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *t, *p = current;
 	int node = p->node_last;
+	int big;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, rcu));
 
@@ -835,6 +872,13 @@ void task_numa_work(struct callback_head
 	if (cmpxchg(&p->mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
+	/*
+	 * If this task is too big, we bail on NUMA placement of the process.
+	 */
+	big = p->mm->numa_big = task_numa_big(p);
+	if (big)
+		node = -1;
+
 	rcu_read_lock();
 	t = p;
 	do {
@@ -858,8 +902,9 @@ void task_tick_numa(struct rq *rq, struc
 
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
+	 * We also bail on placement if we're too big.
 	 */
-	if (!curr->mm)
+	if (!curr->mm || curr->mm->numa_big)
 		return;
 
 	/*



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 18/19] sched, numa: Per task memory placement for big processes
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (16 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 17/19] sched, numa: Detect big processes Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-07-31 21:56   ` Rik van Riel
                     ` (2 more replies)
  2012-07-31 19:12 ` [PATCH 19/19] mm, numa: retry failed page migrations Peter Zijlstra
  2012-08-08 17:17 ` [PATCH 00/19] sched-numa rewrite Andrea Arcangeli
  19 siblings, 3 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: numa-3.patch --]
[-- Type: text/plain, Size: 10590 bytes --]

Implement a per-task memory placement scheme for 'big' tasks (as per
the last patch). It relies on a regular PROT_NONE 'migration' fault to
scan the memory space of the process and uses a two-stage migration
scheme to reduce the influence of unlikely usage relations.

It relies on the assumption that the compute part is tied to a
particular task and builds a task<->page relation set to model the
compute<->data relation.

Probability says that the task faulting on a page after we protect it,
is most likely to be the task that uses that page most.

To decrease the likelihood of acting on a false relation, we only
migrate a page when two consecutive samples are from the same task.

I'm still not entirely convinced this scheme is sound, esp. for things
like virtualization and n:m threading solutions, where in general the
compute<->task relation is fundamentally untrue.

NOTES:
 - we don't actually sample the task, but the task's home-node as a
   migration target, so we're effectively building home-node<->page
   relations not task<->page relations.

 - we migrate to the task's home-node, not the node the task is
   currently running on, since the home-node is the long term target
   for the task to run on, irrespective of whatever node it might
   temporarily run on.
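
As a rough worked example of why the two-stage filter helps: if our
home-node accounts for 90% of the usage of a page over the sample
window, two consecutive samples pick it with probability 0.9^2 = 81%;
at a 30% share that drops to 9%, so weak task<->page relations rarely
trigger a migration. A stand-alone model of the filter itself (not the
kernel code, which keys off page->nid_last under the MPOL_F_HOME
policy; the names below are made up):

/* Stand-alone model (not kernel code) of the two-stage migration filter. */
struct page_model {
	int nid_cur;	/* node the page currently lives on */
	int nid_last;	/* node that last claimed the page at a fault */
};

/* Returns the target node, or -1 for "leave the page where it is". */
static int misplaced_node(struct page_model *page, int want_node)
{
	if (page->nid_last != want_node) {
		page->nid_last = want_node;
		return -1;		/* first claim: remember, don't migrate yet */
	}
	if (page->nid_cur != want_node)
		return want_node;	/* second consecutive claim: migrate */
	return -1;			/* already on the right node */
}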

Suggested-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mempolicy.h |    6 ++--
 include/linux/mm_types.h  |    6 ++++
 kernel/sched/fair.c       |   67 ++++++++++++++++++++++++++++++++++------------
 kernel/sched/features.h   |    1 
 mm/huge_memory.c          |    3 +-
 mm/memory.c               |    2 -
 mm/mempolicy.c            |   39 +++++++++++++++++++++++++-
 7 files changed, 101 insertions(+), 23 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -69,6 +69,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_HOME	(1 << 4) /* this is the home-node policy */
 
 #ifdef __KERNEL__
 
@@ -263,7 +264,8 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *, struct vm_area_struct *,
+			  unsigned long, int);
 
 extern void lazy_migrate_process(struct mm_struct *mm);
 
@@ -393,7 +395,7 @@ static inline int mpol_to_str(char *buff
 }
 
 static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address)
+				 unsigned long address, int multi)
 {
 	return -1; /* no node preference */
 }
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -160,6 +160,12 @@ struct page {
 	 */
 	void *shadow;
 #endif
+#ifdef CONFIG_NUMA
+	/*
+	 * XXX fold this into flags for 64bit or so...
+	 */
+	int nid_last;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -785,6 +785,16 @@ update_stats_curr_start(struct cfs_rq *c
  * Tasks start out with their home-node unset (-1) this effectively means
  * they act !NUMA until we've established the task is busy enough to bother
  * with placement.
+ *
+ * Once we start doing NUMA placement there's two modes, 'small' process-wide
+ * and 'big' per-task. For the small mode we have a process-wide home node
+ * and lazily migrate all memory only when this home-node changes.
+ *
+ * For big mode we keep a home-node per task and use periodic fault scans
+ * to try and establish a task<->page relation. This assumes the task<->page
+ * relation is a compute<->data relation, this is false for things like virt.
+ * and n:m threading solutions but its the best we can do given the
+ * information we have.
  */
 
 static unsigned long task_h_load(struct task_struct *p);
@@ -796,6 +806,7 @@ static void account_offnode_enqueue(stru
 	rq->offnode_weight += p->numa_contrib;
 	rq->offnode_running++;
 }
+
 static void account_offnode_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->offnode_weight -= p->numa_contrib;
@@ -818,6 +829,9 @@ static bool task_numa_big(struct task_st
 	u64 runtime = 0;
 	int weight = 0;
 
+	if (sched_feat(NUMA_FORCE_BIG))
+		return true;
+
 	rcu_read_lock();
 	t = p;
 	do {
@@ -851,7 +865,7 @@ void task_numa_work(struct callback_head
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *t, *p = current;
 	int node = p->node_last;
-	int big;
+	int big = p->mm->numa_big;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, rcu));
 
@@ -862,7 +876,7 @@ void task_numa_work(struct callback_head
 		return;
 
 	/*
-	 * Enforce maximal migration frequency..
+	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = p->mm->numa_next_scan;
 	if (time_before(now, migrate))
@@ -872,20 +886,34 @@ void task_numa_work(struct callback_head
 	if (cmpxchg(&p->mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	/*
-	 * If this task is too big, we bail on NUMA placement of the process.
-	 */
-	big = p->mm->numa_big = task_numa_big(p);
-	if (big)
-		node = -1;
+	if (!big)
+		big = p->mm->numa_big = task_numa_big(p);
 
-	rcu_read_lock();
-	t = p;
-	do {
-		sched_setnode(t, node);
-	} while ((t = next_thread(p)) != p);
-	rcu_read_unlock();
+	if (big) {
+		/*
+		 * For 'big' processes we do per-thread home-node, combined
+		 * with periodic fault scans.
+		 */
+		if (p->node != node)
+			sched_setnode(p, node);
+	} else {
+		/*
+		 * For 'small' processes we keep the entire process on a
+		 * node and migrate all memory once.
+		 */
+		rcu_read_lock();
+		t = p;
+		do {
+			sched_setnode(t, node);
+		} while ((t = next_thread(p)) != p);
+		rcu_read_unlock();
+	}
 
+	/*
+	 * Trigger fault driven migration, small processes do direct
+	 * lazy migration, big processes do gradual task<->page relations.
+	 * See mpol_misplaced().
+	 */
 	lazy_migrate_process(p->mm);
 }
 
@@ -902,9 +930,8 @@ void task_tick_numa(struct rq *rq, struc
 
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
-	 * We also bail on placement if we're too big.
 	 */
-	if (!curr->mm || curr->mm->numa_big)
+	if (!curr->mm)
 		return;
 
 	/*
@@ -929,7 +956,13 @@ void task_tick_numa(struct rq *rq, struc
 		curr->node_stamp = now;
 		node = numa_node_id();
 
-		if (curr->node_last == node && curr->node != node) {
+		/*
+		 * 'small' tasks only migrate once when their process home-node
+		 * changes, 'big' tasks need continuous 'migration' faults to
+		 * keep the task<->page map accurate.
+		 */
+		if (curr->node_last == node &&
+		    (curr->node != node || curr->mm->numa_big)) {
 			/*
 			 * We can re-use curr->rcu because we checked curr->mm
 			 * != NULL so release_task()->call_rcu() was not called
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -72,6 +72,7 @@ SCHED_FEAT(LB_MIN, false)
 
 #ifdef CONFIG_NUMA
 SCHED_FEAT(NUMA,           true)
+SCHED_FEAT(NUMA_FORCE_BIG, false)
 SCHED_FEAT(NUMA_HOT,       true)
 SCHED_FEAT(NUMA_BIAS,      true)
 SCHED_FEAT(NUMA_PULL,      true)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -780,7 +780,7 @@ void do_huge_pmd_prot_none(struct mm_str
 	 * XXX should we serialize against split_huge_page ?
 	 */
 
-	node = mpol_misplaced(page, vma, haddr);
+	node = mpol_misplaced(page, vma, haddr, mm->numa_big);
 	if (node == -1)
 		goto do_fixup;
 
@@ -1366,6 +1366,7 @@ static void __split_huge_page_refcount(s
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
+		page_tail->nid_last = page->nid_last;
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3471,7 +3471,7 @@ static int do_prot_none(struct mm_struct
 	get_page(page);
 	pte_unmap_unlock(ptep, ptl);
 
-	node = mpol_misplaced(page, vma, address);
+	node = mpol_misplaced(page, vma, address, mm->numa_big);
 	if (node == -1)
 		goto do_fixup;
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2168,6 +2168,7 @@ mpol_shared_policy_lookup(struct shared_
  * @page   - page to be checked
  * @vma    - vm area where page mapped
  * @addr   - virtual address where page mapped
+ * @multi  - use multi-stage node binding
  *
  * Lookup current policy node id for vma,addr and "compare to" page's
  * node id.
@@ -2179,7 +2180,8 @@ mpol_shared_policy_lookup(struct shared_
  * Policy determination "mimics" alloc_page_vma().
  * Called from fault path where we know the vma and faulting address.
  */
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+		   unsigned long addr, int multi)
 {
 	struct mempolicy *pol;
 	struct zone *zone;
@@ -2230,6 +2232,39 @@ int mpol_misplaced(struct page *page, st
 	default:
 		BUG();
 	}
+
+	/*
+	 * Multi-stage node selection is used in conjunction with a periodic
+	 * migration fault to build a temporal task<->page relation. By
+	 * using a two-stage filter we remove short/unlikely relations.
+	 *
+	 * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+	 * equate a task's usage of a particular page (n_p) per total usage
+	 * of this page (n_t) (in a given time-span) to a probability.
+	 *
+	 * Our periodic faults will then sample this probability and getting
+	 * the same result twice in a row, given these samples are fully
+	 * independent, is then given by P(p)^2, provided our sample period
+	 * is sufficiently short compared to the usage pattern.
+	 *
+	 * This quadratic squishes small probabilities, making it less likely
+	 * we act on an unlikely task<->page relation.
+	 *
+	 * NOTE: effectively we're using task-home-node<->page-node relations
+	 * since those are the only thing we can affect.
+	 *
+	 * NOTE: we're using task-home-node as opposed to the current node
+	 * the task might be running on, since the task-home-node is the
+	 * long-term node of this task, further reducing noise. Also see
+	 * task_tick_numa().
+	 */
+	if (multi && (pol->flags & MPOL_F_HOME)) {
+		if (page->nid_last != polnid) {
+			page->nid_last = polnid;
+			goto out;
+		}
+	}
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
@@ -2421,7 +2456,7 @@ void __init numa_policy_init(void)
 		preferred_node_policy[nid] = (struct mempolicy) {
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
-			.flags = MPOL_F_MOF,
+			.flags = MPOL_F_MOF | MPOL_F_HOME,
 			.v = { .preferred_node = nid, },
 		};
 	}


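A quick way to get an intuition for the two-stage filter added to mpol_misplaced() above: a task responsible for only ~30% of a page's accesses has P(p) ~ 0.3 of showing up in a given sample, so requiring two consecutive identical samples cuts the chance of migrating on that weak relation to roughly 0.09. Below is a minimal userspace sketch of the filter logic; the names and the fault sequence are illustrative only, not kernel code.

/* Toy model of the two-stage task<->page filter in mpol_misplaced():
 * only act when two consecutive hinting faults see the same preferred
 * node, so single, unlikely relations are filtered out.
 */
#include <stdio.h>

struct toy_page {
	int nid;	/* node the page currently lives on */
	int nid_last;	/* node seen by the previous hinting fault */
};

/* Return -1 to leave the page alone, or the node to migrate it to. */
static int toy_misplaced(struct toy_page *page, int polnid)
{
	if (page->nid_last != polnid) {
		/* First sample from this node: remember it, don't act yet. */
		page->nid_last = polnid;
		return -1;
	}
	/* Second consecutive sample agrees: act if the page is remote. */
	return (page->nid != polnid) ? polnid : -1;
}

int main(void)
{
	struct toy_page page = { .nid = 0, .nid_last = -1 };
	/* Hinting faults mostly from node 1, with one stray fault from node 2. */
	int faults[] = { 1, 2, 1, 1, 1 };

	for (unsigned int i = 0; i < sizeof(faults) / sizeof(faults[0]); i++) {
		int target = toy_misplaced(&page, faults[i]);

		if (target >= 0) {
			printf("fault %u: migrate page to node %d\n", i, target);
			page.nid = target;
		} else {
			printf("fault %u: no migration\n", i);
		}
	}
	return 0;
}

Note how the stray fault from node 2 resets the filter, so it costs one extra fault before the node 1 relation is acted on, but it never causes a migration by itself.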

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 19/19] mm, numa: retry failed page migrations
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (17 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
@ 2012-07-31 19:12 ` Peter Zijlstra
  2012-08-02 20:40   ` Christoph Lameter
  2012-08-08 17:17 ` [PATCH 00/19] sched-numa rewrite Andrea Arcangeli
  19 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2012-07-31 19:12 UTC (permalink / raw)
  To: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: rik_van_riel-mm_numa-retry_failed_page_migrations.patch --]
[-- Type: text/plain, Size: 3538 bytes --]

From: Rik van Riel <riel@redhat.com>

Keep track of how many NUMA page migrations succeeded and
failed (in a way that wants retrying later) per process.

If a lot of the page migrations of a process fail, unmap the
process pages at some point later, so the migration can be tried
again at the next fault.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    2 ++
 kernel/sched/core.c      |    2 ++
 kernel/sched/fair.c      |   19 ++++++++++++++++++-
 mm/memory.c              |   15 ++++++++++++---
 4 files changed, 34 insertions(+), 4 deletions(-)
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -397,6 +397,8 @@ struct mm_struct {
 #ifdef CONFIG_NUMA
 	unsigned int  numa_big;
 	unsigned long numa_next_scan;
+	unsigned int  numa_migrate_success;
+	unsigned int  numa_migrate_failed;
 #endif
 	struct uprobes_state uprobes_state;
 };
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1727,6 +1727,8 @@ static void __sched_fork(struct task_str
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_big = 0;
 		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_migrate_success = 0;
+		p->mm->numa_migrate_failed = 0;
 	}
 
 	p->node = -1;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -857,6 +857,18 @@ static bool task_numa_big(struct task_st
 	return runtime > walltime * max(1, weight / 2);
 }
 
+static bool many_migrate_failures(struct task_struct *p)
+{
+	if (!p->mm)
+		return false;
+
+	/* More than 1/4 of the attempted NUMA page migrations failed. */
+	if (p->mm->numa_migrate_failed * 3 > p->mm->numa_migrate_success)
+		return true;
+
+	return false;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  */
@@ -909,6 +921,10 @@ void task_numa_work(struct task_work *wo
 		rcu_read_unlock();
 	}
 
+	/* Age the numa migrate statistics. */
+	p->mm->numa_migrate_failed /= 2;
+	p->mm->numa_migrate_success /= 2;
+
 	/*
 	 * Trigger fault driven migration, small processes do direct
 	 * lazy migration, big processes do gradual task<->page relations.
@@ -962,7 +978,8 @@ void task_tick_numa(struct rq *rq, struc
 		 * keep the task<->page map accurate.
 		 */
 		if (curr->node_last == node &&
-		    (curr->node != node || curr->mm->numa_big)) {
+		    (curr->node != node || curr->mm->numa_big ||
+				many_migrate_failures(curr))) {
 			/*
 			 * We can re-use curr->rcu because we checked curr->mm
 			 * != NULL so release_task()->call_rcu() was not called
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3452,7 +3452,7 @@ static int do_prot_none(struct mm_struct
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int node;
+	int node, ret;
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -3472,18 +3472,27 @@ static int do_prot_none(struct mm_struct
 	pte_unmap_unlock(ptep, ptl);
 
 	node = mpol_misplaced(page, vma, address, mm->numa_big);
-	if (node == -1)
+	if (node == -1) {
+		mm->numa_migrate_success++;
 		goto do_fixup;
+	}
 
 	/*
 	 * Page migration will install a new pte with vma->vm_page_prot,
 	 * otherwise fall-through to the fixup. Next time,.. perhaps.
 	 */
-	if (!migrate_misplaced_page(mm, page, node)) {
+	ret = migrate_misplaced_page(mm, page, node);
+	if (!ret) {
+		mm->numa_migrate_success++;
 		put_page(page);
 		return 0;
 	}
 
+	if (ret == -ENOMEM || ret == -EBUSY) {
+		/* This fault should be tried again later. */
+		mm->numa_migrate_failed++;
+	}
+
 do_fixup:
 	/*
 	 * OK, nothing to do,.. change the protection back to what it


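The retry logic above boils down to: keep the hinting faults (and thus the migration retries) coming while more than a quarter of the recent migration attempts failed, and halve both counters on every scan pass so old history fades out once migrations start succeeding again. A self-contained sketch of just that accounting follows; the numbers fed in are made up for illustration.

/* Toy model of the retry accounting: failed*3 > success means more than
 * 1/4 of all attempts failed (f/(f+s) > 1/4  <=>  3*f > s), and the
 * periodic halving ages out a transient burst of failures.
 */
#include <stdio.h>
#include <stdbool.h>

struct toy_mm {
	unsigned int numa_migrate_success;
	unsigned int numa_migrate_failed;
};

static bool many_migrate_failures(const struct toy_mm *mm)
{
	return mm->numa_migrate_failed * 3 > mm->numa_migrate_success;
}

static void age_stats(struct toy_mm *mm)
{
	mm->numa_migrate_failed /= 2;
	mm->numa_migrate_success /= 2;
}

int main(void)
{
	struct toy_mm mm = { .numa_migrate_success = 10, .numa_migrate_failed = 40 };

	for (int scan = 0; scan < 5; scan++) {
		printf("scan %d: success=%u failed=%u -> keep retrying: %s\n",
		       scan, mm.numa_migrate_success, mm.numa_migrate_failed,
		       many_migrate_failures(&mm) ? "yes" : "no");
		age_stats(&mm);			/* once per task_numa_work() pass */
		mm.numa_migrate_success += 20;	/* pretend later migrations succeed */
	}
	return 0;
}

Because aging halves both counters, it only changes the verdict once new (successful) attempts get mixed in; with no further activity the ratio, and hence the decision, stays put.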

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/19] task_work: Remove dependency on sched.h
  2012-07-31 19:12 ` [PATCH 01/19] task_work: Remove dependency on sched.h Peter Zijlstra
@ 2012-07-31 20:52   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel<riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-07-31 19:12 ` [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
@ 2012-07-31 20:52   ` Rik van Riel
  2012-08-09 21:41   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy
  2012-07-31 19:12 ` [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
@ 2012-07-31 20:52   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/19] mm, thp: Preserve pgprot across huge page split
  2012-07-31 19:12 ` [PATCH 04/19] mm, thp: Preserve pgprot across huge page split Peter Zijlstra
@ 2012-07-31 20:53   ` Rik van Riel
  2012-08-09 21:42   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure
  2012-07-31 19:12 ` [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure Peter Zijlstra
@ 2012-07-31 20:55   ` Rik van Riel
  2012-08-09 21:43   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY ...
  2012-07-31 19:12 ` [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
@ 2012-07-31 21:04   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP
  2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
@ 2012-07-31 21:06   ` Rik van Riel
  2012-08-09 21:44   ` Andrea Arcangeli
  2012-10-01  9:36   ` Michael Kerrisk
  2 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/19] mm/mpol: Check for misplaced page
  2012-07-31 19:12 ` [PATCH 08/19] mm/mpol: Check for misplaced page Peter Zijlstra
@ 2012-07-31 21:13   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page()
  2012-07-31 19:12 ` [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page() Peter Zijlstra
@ 2012-07-31 21:16   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages
  2012-07-31 19:12 ` [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
@ 2012-07-31 21:24   ` Rik van Riel
  2012-08-09 21:44   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

> +	if (unlikely(pmd_trans_splitting(entry))) {
> +		spin_unlock(&mm->page_table_lock);
> +		wait_split_huge_page(vma->anon_vma, pmd);
> +		return;
> +	}
> +
> +#ifdef CONFIG_NUMA
> +	page = pmd_page(entry);
> +	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	/*
> +	 * XXX should we serialize against split_huge_page ?
> +	 */

I believe we are already serialized here, because we check
for pmd_trans_splitting while holding the page table lock.

The THP code grabs the page table lock when modifying this
status, so we should be good.

> +	/*
> +	 * Due to lacking code to migrate thp pages, we'll split
> +	 * (which preserves the special PROT_NONE) and re-take the
> +	 * fault on the normal pages.
> +	 */
> +	split_huge_page(page);
> +	put_page(page);
> +	return;

Likewise, the THP code serializes split_huge_page, and has
protection against multiple simultaneous invocations of
split_huge_page.

A second invocation of split_huge_page will see that the
page was already split, and it will bail out.

> +do_fixup:
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, entry)))
> +		goto out_unlock;

If the THP was split for another reason than a NUMA
fault, the !pmd_same check here should result in us
doing the right thing automatically.

I believe this code is correct.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

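The pattern Rik walks through above (note the pmd value under the lock, drop the lock for the slow path, re-take it and bail out if the value changed) is not kernel-specific. Here is a minimal userspace sketch of the same shape, with a pthread mutex standing in for the page table lock and a plain counter standing in for *pmd, purely for illustration:

/* Snapshot-under-lock, slow work unlocked, then revalidate: if someone
 * else changed the value in between (e.g. a concurrent split), give up
 * instead of applying a stale update.
 */
#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long pmd_val = 42;	/* stands in for *pmd */

static int try_slow_fixup(void)
{
	unsigned long entry;

	pthread_mutex_lock(&lock);
	entry = pmd_val;		/* snapshot while locked */
	pthread_mutex_unlock(&lock);

	/* ... slow work without the lock (migration, split, ...) ... */

	pthread_mutex_lock(&lock);
	if (pmd_val != entry) {		/* the !pmd_same() check */
		pthread_mutex_unlock(&lock);
		return 0;		/* raced: someone changed it, do nothing */
	}
	pmd_val = entry + 1;		/* still ours: apply the update */
	pthread_mutex_unlock(&lock);
	return 1;
}

int main(void)
{
	int applied = try_slow_fixup();

	printf("fixup applied: %d, value is now %lu\n", applied, pmd_val);
	return 0;
}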
^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/19] sched, mm: Introduce tsk_home_node()
  2012-07-31 19:12 ` [PATCH 11/19] sched, mm: Introduce tsk_home_node() Peter Zijlstra
@ 2012-07-31 21:30   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 12/19] mm/mpol: Make mempolicy home-node aware
  2012-07-31 19:12 ` [PATCH 12/19] mm/mpol: Make mempolicy home-node aware Peter Zijlstra
@ 2012-07-31 21:33   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel, Christoph Lameter

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 13/19] sched: Introduce sched_feat_numa()
  2012-07-31 19:12 ` [PATCH 13/19] sched: Introduce sched_feat_numa() Peter Zijlstra
@ 2012-07-31 21:34   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel, Christoph Lameter

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 14/19] sched: Make find_busiest_queue() a method
  2012-07-31 19:12 ` [PATCH 14/19] sched: Make find_busiest_queue() a method Peter Zijlstra
@ 2012-07-31 21:34   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel, Christoph Lameter

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 15/19] sched: Implement home-node awareness
  2012-07-31 19:12 ` [PATCH 15/19] sched: Implement home-node awareness Peter Zijlstra
@ 2012-07-31 21:52   ` Rik van Riel
  2012-08-09 21:51   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel, Christoph Lameter

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 16/19] sched, numa: NUMA home-node selection code
  2012-07-31 19:12 ` [PATCH 16/19] sched, numa: NUMA home-node selection code Peter Zijlstra
@ 2012-07-31 21:52   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 17/19] sched, numa: Detect big processes
  2012-07-31 19:12 ` [PATCH 17/19] sched, numa: Detect big processes Peter Zijlstra
@ 2012-07-31 21:53   ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:

> The current heuristic for determining if a task is 'big' is if its
> consuming more than 1/2 a node's worth of cputime. We might want to
> add a term here looking at the RSS of the process and compare this
> against the available memory per node.

This could probably use some refinement in the future, but
it looks like a reasonable start.

> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paul Turner <pjt@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 18/19] sched, numa: Per task memory placement for big processes
  2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
@ 2012-07-31 21:56   ` Rik van Riel
  2012-08-08 21:35   ` Peter Zijlstra
  2012-08-09 21:57   ` Andrea Arcangeli
  2 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-07-31 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On 07/31/2012 03:12 PM, Peter Zijlstra wrote:
> Probability says that the task faulting on a page after we protect it,
> is most likely to be the task that uses that page most.
>
> To decrease the likelyhood of acting on a false relation, we only
> migrate a page when two consecutive samples are from the same task.
>
> I'm still not entirely convinced this scheme is sound, esp. for things
> like virtualization and n:m threading solutions in general the
> compute<->task relation is fundamentally untrue.

Again, we may need some additional code on top in the future,
eg. something like Andrea's policy that tries grouping related
tasks/threads together, but this looks like a very good way
to start.

We can introduce complexity if it is needed. Simplicity is good.

Acked-by: Rik van Riel

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 19/19] mm, numa: retry failed page migrations
  2012-07-31 19:12 ` [PATCH 19/19] mm, numa: retry failed page migrations Peter Zijlstra
@ 2012-08-02 20:40   ` Christoph Lameter
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Lameter @ 2012-08-02 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, 31 Jul 2012, Peter Zijlstra wrote:

> Keep track of how many NUMA page migrations succeeded and
> failed (in a way that wants retrying later) per process.

It would be good if we could also somehow determine if that
migration actually made sense?

Were there enough accesses to the page so that the effort to migrate
the page was amortized?

Still skeptical on this endeavor.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/19] sched-numa rewrite
  2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
                   ` (18 preceding siblings ...)
  2012-07-31 19:12 ` [PATCH 19/19] mm, numa: retry failed page migrations Peter Zijlstra
@ 2012-08-08 17:17 ` Andrea Arcangeli
  2012-08-08 18:43   ` Rik van Riel
  19 siblings, 1 reply; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-08 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

Hi everyone,

On Tue, Jul 31, 2012 at 09:12:04PM +0200, Peter Zijlstra wrote:
> Hi all,
> 
> After having had a talk with Rik about all this NUMA nonsense where he proposed
> the scheme implemented in the next to last patch, I came up with a related
> means of doing the home-node selection.
> 
> I've also switched to (ab)using PROT_NONE for driving the migration faults.

I'm glad we agree on the introduction of the numa hinting page faults.

I ran a benchmark to compare your sched-numa rewrite with autonuma22:

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120808.pdf

> These patches go on top of tip/master with origin/master (Linus' tree) merged in.

It applied cleanly (with git am) on top of 3.6-rc1
(0d7614f09c1ebdbaa1599a5aba7593f147bf96ee) which already had a pull of
sched-core from tip and other tip bits. If that's not ok let me know
which commit I should use, and I'll repeat.

I released autonuma22 yesterday to provide an exact commit
(f958aa119a8ec417571ea8bdb527182d8ebe8b68) in case somebody wants to
reproduce the numbers on 2 node systems.

The autonuma-benchmark used to run the benchmark was at commit
65d93e485f09e3c1005e8c55cb5b1f97bd3a9ed8 which matches tag 0.1:

git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git

I'll update the pdf shortly by adding 8 node results too.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/19] sched-numa rewrite
  2012-08-08 17:17 ` [PATCH 00/19] sched-numa rewrite Andrea Arcangeli
@ 2012-08-08 18:43   ` Rik van Riel
  2012-08-17 18:08     ` Andrea Arcangeli
  0 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-08-08 18:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, mingo, oleg, pjt, akpm, torvalds, tglx,
	Lee.Schermerhorn, linux-kernel

On 08/08/2012 01:17 PM, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Tue, Jul 31, 2012 at 09:12:04PM +0200, Peter Zijlstra wrote:
>> Hi all,
>>
>> After having had a talk with Rik about all this NUMA nonsense where he proposed
>> the scheme implemented in the next to last patch, I came up with a related
>> means of doing the home-node selection.
>>
>> I've also switched to (ab)using PROT_NONE for driving the migration faults.
>
> I'm glad we agree on the introduction of the numa hinting page faults.
>
> I run a benchmark to compare your sched-numa rewrite with autonuma22:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120808.pdf

For the people who have not yet read that PDF:

While the sched-numa code is relatively small and clean, the
current version does not seem to offer a significant
performance improvement over not having it, and in one of
the tests performance actually regresses vs. mainline.

On the other hand, the autonuma code is pretty large and
hard to understand, but it does provide a significant
speedup on each of the tests.

I have not looked at why sched-numa is not giving a significant
performance improvement.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 18/19] sched, numa: Per task memory placement for big processes
  2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
  2012-07-31 21:56   ` Rik van Riel
@ 2012-08-08 21:35   ` Peter Zijlstra
  2012-08-09 21:57   ` Andrea Arcangeli
  2 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2012-08-08 21:35 UTC (permalink / raw)
  To: mingo
  Cc: riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel, Paul Mundt

On Tue, 2012-07-31 at 21:12 +0200, Peter Zijlstra wrote:
> +#ifdef CONFIG_NUMA
> +       /*
> +        * XXX fold this into flags for 64bit or so...
> +        */
> +       int nid_last;
> +#endif 

Something like the below? I still ought to update all the various
comments about page flag layout etc..

Also, that #warning gives a very noisy build indeed, I guess we should
either make it silent or increase the page frame size for those
configs.. 32bit NUMA is quite rare for normal people (sorry Paul) :)

---
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -611,10 +611,19 @@ static inline pte_t maybe_mkwrite(pte_t 
 #define NODES_WIDTH		0
 #endif
 
+#if NODES_WIDTH && (SECTIONS_WIDTH+ZONES_WIDTH+2*NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS)
+#define LAST_NID_WIDTH	NODES_SHIFT
+#else
+#warning "faking page_xchg_last_nid"
+#define LAST_NID_NOT_IN_PAGE_FLAGS
+#define LAST_NID_WIDTH	0
+#endif
+
 /* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
+#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
 
 /*
  * We are going to use the flags for the page to node mapping if its in
@@ -632,6 +641,7 @@ static inline pte_t maybe_mkwrite(pte_t 
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -653,6 +663,7 @@ static inline pte_t maybe_mkwrite(pte_t 
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -691,6 +702,39 @@ static inline int page_to_nid(const stru
 }
 #endif
 
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return nid; /* fakin' it */
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page_to_nid(page);
+}	
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	unsigned long old_flags, flags;
+	int last_nid;
+
+       	old_flags = flags = page->flags;
+	last_nid = (flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+
+	flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
+	flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+
+ 	(void)cmpxchg(&page->flags, old_flags, flags);
+
+	return last_nid;
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+}
+#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -176,12 +176,6 @@ struct page {
 	 */
 	void *shadow;
 #endif
-#ifdef CONFIG_NUMA
-	/*
-	 * XXX fold this into flags for 64bit or so...
-	 */
-	int nid_last;
-#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1366,6 +1366,7 @@ static void __split_huge_page_refcount(s
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
+		page_xchg_last_nid(page_tail, page_last_nid(page));
 		page_tail->nid_last = page->nid_last;
 
 		BUG_ON(!PageAnon(page_tail));
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2265,10 +2265,9 @@ int mpol_misplaced(struct page *page, st
 	 * task_tick_numa().
 	 */
 	if (multi && (pol->flags & MPOL_F_HOME)) {
-		if (page->nid_last != polnid) {
-			page->nid_last = polnid;
+		int last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
 			goto out;
-		}
 	}
 
 	if (curnid != polnid)


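The essence of the patch above is plain bit-packing: steal NODES_SHIFT worth of bits from page->flags for the last faulting node and swap them atomically so concurrent faults can't corrupt the other flag bits. A small userspace analogue using C11 atomics is below; the field width and names are illustrative, and unlike the kernel's single best-effort cmpxchg it loops until the CAS succeeds.

/* Userspace analogue of page_xchg_last_nid(): a small "last node" field
 * packed into the top bits of a flags word, updated with a CAS loop so
 * the remaining bits are never clobbered.
 */
#include <stdio.h>
#include <stdatomic.h>

#define LAST_NID_SHIFT	3		/* room for 8 nodes, illustrative */
#define LAST_NID_MASK	((1UL << LAST_NID_SHIFT) - 1)
#define LAST_NID_PGOFF	(8 * sizeof(unsigned long) - LAST_NID_SHIFT)

static _Atomic unsigned long flags;	/* stands in for page->flags */

static int xchg_last_nid(int nid)
{
	unsigned long old = atomic_load(&flags), new;
	int last;

	do {
		last = (old >> LAST_NID_PGOFF) & LAST_NID_MASK;
		new  = old & ~(LAST_NID_MASK << LAST_NID_PGOFF);
		new |= ((unsigned long)nid & LAST_NID_MASK) << LAST_NID_PGOFF;
	} while (!atomic_compare_exchange_weak(&flags, &old, new));

	return last;
}

int main(void)
{
	printf("previous nid: %d\n", xchg_last_nid(2));	/* 0 initially */
	printf("previous nid: %d\n", xchg_last_nid(5));	/* prints 2 */
	printf("non-nid flag bits untouched: %#lx\n",
	       atomic_load(&flags) & ~(LAST_NID_MASK << LAST_NID_PGOFF));
	return 0;
}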

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-07-31 19:12 ` [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
  2012-07-31 20:52   ` Rik van Riel
@ 2012-08-09 21:41   ` Andrea Arcangeli
  2012-08-10  0:50     ` Andi Kleen
  1 sibling, 1 reply; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

Hi,

On Tue, Jul 31, 2012 at 09:12:06PM +0200, Peter Zijlstra wrote:
> Since the NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants
> to be compared to either a total of interleave allocations or to a miss
> count, remove it.
> 
> Fixing it would be possible, but since we've gone years without these
> statistics I figure we can continue that way.
> 
> Also NUMA_HIT fully includes NUMA_INTERLEAVE_HIT so users might
> switch to using that.
> 
> This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.

It's not apparent why you need to remove it for sched-numa. I think I
see it, but it would be nicer if it were explained here, so one doesn't
need to read an internal bit of a later patch to understand why this
is needed.

> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -169,7 +169,7 @@ static ssize_t node_read_numastat(struct
>  		       node_page_state(dev->id, NUMA_HIT),
>  		       node_page_state(dev->id, NUMA_MISS),
>  		       node_page_state(dev->id, NUMA_FOREIGN),
> -		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
> +		       0UL,
>  		       node_page_state(dev->id, NUMA_LOCAL),
>  		       node_page_state(dev->id, NUMA_OTHER));
>  }

It's not so nice to leave a permanent 0 here. Even if nobody can
act on it on its own, because it needs to be compared against something,
it's still useful as an informative value for vmstat below:

> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -717,7 +717,6 @@ const char * const vmstat_text[] = {
>  	"numa_hit",
>  	"numa_miss",
>  	"numa_foreign",
> -	"numa_interleave",
>  	"numa_local",
>  	"numa_other",
>  #endif
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/19] mm, thp: Preserve pgprot across huge page split
  2012-07-31 19:12 ` [PATCH 04/19] mm, thp: Preserve pgprot across huge page split Peter Zijlstra
  2012-07-31 20:53   ` Rik van Riel
@ 2012-08-09 21:42   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, Jul 31, 2012 at 09:12:08PM +0200, Peter Zijlstra wrote:
> If we marked a THP with our special PROT_NONE protections, ensure we
> don't loose them over a split.
> 
> Collapse seems to always allocate a new (huge) page which should
> already end up on the new target node so loosing protections there
> isn't a problem.

This looks like an optimization too, as it reduces a few branches.

If you hadn't introduced an unnecessary goto, the actual change would
have been more readable and the patch much smaller. (You could
have cleaned it up with a later patch if you disliked the coding
style that tried to avoid unnecessary gotos.)

The s/barrier/ACCESS_ONCE/ change I'll merge into my tree as a separate
commit, as it's not related to sched-numa.

> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paul Turner <pjt@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/x86/include/asm/pgtable.h |    1 
>  mm/huge_memory.c               |  104 +++++++++++++++++++----------------------
>  2 files changed, 50 insertions(+), 55 deletions(-)
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -350,6 +350,7 @@ static inline pgprot_t pgprot_modify(pgp
>  }
>  
>  #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
> +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
>  
>  #define canon_pgprot(p) __pgprot(massage_pgprot(p))
>  
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1353,64 +1353,60 @@ static int __split_huge_page_map(struct 
>  	int ret = 0, i;
>  	pgtable_t pgtable;
>  	unsigned long haddr;
> +	pgprot_t prot;
>  
>  	spin_lock(&mm->page_table_lock);
>  	pmd = page_check_address_pmd(page, mm, address,
>  				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> -	if (pmd) {
> -		pgtable = get_pmd_huge_pte(mm);
> -		pmd_populate(mm, &_pmd, pgtable);
> -
> -		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> -		     i++, haddr += PAGE_SIZE) {
> -			pte_t *pte, entry;
> -			BUG_ON(PageCompound(page+i));
> -			entry = mk_pte(page + i, vma->vm_page_prot);
> -			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -			if (!pmd_write(*pmd))
> -				entry = pte_wrprotect(entry);
> -			else
> -				BUG_ON(page_mapcount(page) != 1);
> -			if (!pmd_young(*pmd))
> -				entry = pte_mkold(entry);
> -			pte = pte_offset_map(&_pmd, haddr);
> -			BUG_ON(!pte_none(*pte));
> -			set_pte_at(mm, haddr, pte, entry);
> -			pte_unmap(pte);
> -		}
> +	if (!pmd)
> +		goto unlock;
>  
> -		smp_wmb(); /* make pte visible before pmd */
> -		/*
> -		 * Up to this point the pmd is present and huge and
> -		 * userland has the whole access to the hugepage
> -		 * during the split (which happens in place). If we
> -		 * overwrite the pmd with the not-huge version
> -		 * pointing to the pte here (which of course we could
> -		 * if all CPUs were bug free), userland could trigger
> -		 * a small page size TLB miss on the small sized TLB
> -		 * while the hugepage TLB entry is still established
> -		 * in the huge TLB. Some CPU doesn't like that. See
> -		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> -		 * Erratum 383 on page 93. Intel should be safe but is
> -		 * also warns that it's only safe if the permission
> -		 * and cache attributes of the two entries loaded in
> -		 * the two TLB is identical (which should be the case
> -		 * here). But it is generally safer to never allow
> -		 * small and huge TLB entries for the same virtual
> -		 * address to be loaded simultaneously. So instead of
> -		 * doing "pmd_populate(); flush_tlb_range();" we first
> -		 * mark the current pmd notpresent (atomically because
> -		 * here the pmd_trans_huge and pmd_trans_splitting
> -		 * must remain set at all times on the pmd until the
> -		 * split is complete for this pmd), then we flush the
> -		 * SMP TLB and finally we write the non-huge version
> -		 * of the pmd entry with pmd_populate.
> -		 */
> -		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> -		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> -		pmd_populate(mm, pmd, pgtable);
> -		ret = 1;
> +	prot = pmd_pgprot(*pmd);
> +	pgtable = get_pmd_huge_pte(mm);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +
> +		BUG_ON(PageCompound(page+i));
> +		entry = mk_pte(page + i, prot);
> +		entry = pte_mkdirty(entry);
> +		if (!pmd_young(*pmd))
> +			entry = pte_mkold(entry);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
>  	}
> +
> +	smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
> +	/*
> +	 * Up to this point the pmd is present and huge.
> +	 *
> +	 * If we overwrite the pmd with the not-huge version, we could trigger
> +	 * a small page size TLB miss on the small sized TLB while the hugepage
> +	 * TLB entry is still established in the huge TLB.
> +	 *
> +	 * Some CPUs don't like that. See
> +	 * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
> +	 * on page 93.
> +	 *
> +	 * Thus it is generally safer to never allow small and huge TLB entries
> +	 * for overlapping virtual addresses to be loaded. So we first mark the
> +	 * current pmd not present, then we flush the TLB and finally we write
> +	 * the non-huge version of the pmd entry with pmd_populate.
> +	 *
> +	 * The above needs to be done under the ptl because pmd_trans_huge and
> +	 * pmd_trans_splitting must remain set on the pmd until the split is
> +	 * complete. The ptl also protects against concurrent faults due to
> +	 * making the pmd not-present.
> +	 */
> +	set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	pmd_populate(mm, pmd, pgtable);
> +	ret = 1;
> +
> +unlock:
>  	spin_unlock(&mm->page_table_lock);
>  
>  	return ret;
> @@ -2241,9 +2237,7 @@ static int khugepaged_wait_event(void)
>  static void khugepaged_do_scan(struct page **hpage)
>  {
>  	unsigned int progress = 0, pass_through_head = 0;
> -	unsigned int pages = khugepaged_pages_to_scan;
> -
> -	barrier(); /* write khugepaged_pages_to_scan to local stack */
> +	unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
>  
>  	while (progress < pages) {
>  		cond_resched();
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure
  2012-07-31 19:12 ` [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure Peter Zijlstra
  2012-07-31 20:55   ` Rik van Riel
@ 2012-08-09 21:43   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, Jul 31, 2012 at 09:12:09PM +0200, Peter Zijlstra wrote:
> +static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
> +{
> +	/*
> +	 * If we have the normal vma->vm_page_prot protections we're not a
> +	 * 'special' PROT_NONE page.
> +	 *
> +	 * This means we cannot get 'special' PROT_NONE faults from genuine
> +	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
> +	 * tracking.
> +	 *
> +	 * Neither case is really interesting for our current use though so we
> +	 * don't care.
> +	 */
> +	if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
> +		return false;
> +
> +	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
> +}
> @@ -3453,6 +3518,9 @@ int handle_pte_fault(struct mm_struct *m
>  					pte, pmd, flags, entry);
>  	}
>  
> +	if (pte_prot_none(vma, entry))
> +		return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
> +
>  	ptl = pte_lockptr(mm, pmd);
>  	spin_lock(ptl);
>  	if (unlikely(!pte_same(*pte, entry)))

I recommend calling it pte_numa(), not pte_prot_none(), given it returns
true only when it's not a real PROT_NONE mapping.

Also I'd leave the details fully hidden in arch code; there's no need
to expose them to common code and force all archs to use the
PROT_NONE bitflag to implement NUMA hinting page faults. If an arch
wants to avoid touching the vma cacheline and wasting time
comparing vma->vm_page_prot against the pteval, it can use a bitflag
other than protnone, like AutoNUMA is currently doing. There's no reason to
force the use of PROT_NONE at the common code level.

My current implementation of the numa hinting page fault follows:

	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry)))
		goto unlock;
	entry = pte_numa_fixup(mm, vma, address, entry, pte);

static inline pte_t pte_numa_fixup(struct mm_struct *mm,
				   struct vm_area_struct *vma,
				   unsigned long addr, pte_t pte, pte_t *ptep)
{
	if (pte_numa(pte))
		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
	return pte;
}

I can easily change my entry point a bit to call it before taking the
spinlocks, like you're doing, to accommodate your sync migration needs.

But I think it's better implemented like the above; just passing the
vma along until it reaches a pte_numa(pte, vma) check should be enough.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP
  2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
  2012-07-31 21:06   ` Rik van Riel
@ 2012-08-09 21:44   ` Andrea Arcangeli
  2012-10-01  9:36   ` Michael Kerrisk
  2 siblings, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, Jul 31, 2012 at 09:12:11PM +0200, Peter Zijlstra wrote:
> From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
> policy to mbind().  When the NOOP policy is used with the 'MOVE
> and 'LAZY flags, mbind() [check_range()] will walk the specified
> range and unmap eligible pages so that they will be migrated on
> next touch.
> 
> This allows an application to prepare for a new phase of operation
> where different regions of shared storage will be assigned to
> worker threads, w/o changing policy.  Note that we could just use
> "default" policy in this case.  However, this also allows an
> application to request that pages be migrated, only if necessary,
> to follow any arbitrary policy that might currently apply to a
> range of pages, without knowing the policy, or without specifying
> multiple mbind()s for ranges with different policies.

This is a new kernel API change. I could hardly understand the above, so I
wonder how long it will take before userland programmers are
familiar enough with MPOL_NOOP to actually use it in most apps. Could you
just enable/disable your logic with a sysfs knob instead?

Enabling/disabling sched-numa is something an admin can easily do with
a sysfs control; patching and rebuilding an app to use mbind()
calls is not, especially if the app is proprietary.

> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/mempolicy.h |    1 +
>  mm/mempolicy.c            |    8 ++++----
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 87fabfa..668311a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -21,6 +21,7 @@ enum {
>  	MPOL_BIND,
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
> +	MPOL_NOOP,		/* retain existing policy for range */
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4fba5f2..251ef31 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
>  	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
>  		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
>  
> -	if (mode == MPOL_DEFAULT) {
> +	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
>  		if (nodes && !nodes_empty(*nodes))
>  			return ERR_PTR(-EINVAL);
> -		return NULL;	/* simply delete any existing policy */
> +		return NULL;
>  	}
>  	VM_BUG_ON(!nodes);
>  
> @@ -1069,7 +1069,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	if (start & ~PAGE_MASK)
>  		return -EINVAL;
>  
> -	if (mode == MPOL_DEFAULT)
> +	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
>  		flags &= ~MPOL_MF_STRICT;
>  
>  	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
> @@ -1121,7 +1121,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  			  flags | MPOL_MF_INVERT, &pagelist);
>  
>  	err = PTR_ERR(vma);	/* maybe ... */
> -	if (!IS_ERR(vma))
> +	if (!IS_ERR(vma) && mode != MPOL_NOOP)
>  		err = mbind_range(mm, start, end, new);
>  
>  	if (!err) {
> -- 
> 1.7.2.3
> 
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages
  2012-07-31 19:12 ` [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
  2012-07-31 21:24   ` Rik van Riel
@ 2012-08-09 21:44   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, Jul 31, 2012 at 09:12:14PM +0200, Peter Zijlstra wrote:
> +#ifdef CONFIG_NUMA
>  	/*
> -	 * Do fancy stuff...
> +	 * For NUMA systems we use the special PROT_NONE maps to drive
> +	 * lazy page migration, see MPOL_MF_LAZY and related.
>  	 */
> +	page = vm_normal_page(vma, address, entry);
> +	if (!page)
> +		goto do_fixup_locked;
> +
> +	get_page(page);
> +	pte_unmap_unlock(ptep, ptl);
> +
> +	node = mpol_misplaced(page, vma, address);
> +	if (node == -1)
> +		goto do_fixup;
>  
>  	/*
> +	 * Page migration will install a new pte with vma->vm_page_prot,
> +	 * otherwise fall-through to the fixup. Next time,.. perhaps.
> +	 */
> +	if (!migrate_misplaced_page(mm, page, node)) {
> +		put_page(page);
> +		return 0;
> +	}
> +
> +do_fixup:
> +	/*
>  	 * OK, nothing to do,.. change the protection back to what it
>  	 * ought to be.
>  	 */
> @@ -3467,6 +3493,9 @@ static int do_prot_none(struct mm_struct
>  	if (unlikely(!pte_same(*ptep, entry)))
>  		goto unlock;
>  
> +do_fixup_locked:
> +#endif /* CONFIG_NUMA */
> +

The "do fancy stuff" part would better be in a separate file instead of
being mixed with the NUMA hinting page fault entry points in memory.c. My
"fancy stuff" happens in mm/autonuma.c; memory.c just calls into it.

>  	flush_cache_page(vma, address, pte_pfn(entry));
>  
>  	ptep_modify_prot_start(mm, address, ptep);
> @@ -3476,8 +3505,9 @@ static int do_prot_none(struct mm_struct
>  	update_mmu_cache(vma, address, ptep);
>  unlock:
>  	pte_unmap_unlock(ptep, ptl);
> -out:
> -	return ret;
> +	if (page)
> +		put_page(page);
> +	return 0;
>  }
>  
>  /*
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 15/19] sched: Implement home-node awareness
  2012-07-31 19:12 ` [PATCH 15/19] sched: Implement home-node awareness Peter Zijlstra
  2012-07-31 21:52   ` Rik van Riel
@ 2012-08-09 21:51   ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel, Christoph Lameter

On Tue, Jul 31, 2012 at 09:12:19PM +0200, Peter Zijlstra wrote:
> @@ -2699,6 +2705,29 @@ select_task_rq_fair(struct task_struct *
>  	}
>  
>  	rcu_read_lock();
> +	if (sched_feat_numa(NUMA_BIAS) && node != -1) {
> +		int node_cpu;
> +
> +		node_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpumask_of_node(node));
> +		if (node_cpu >= nr_cpu_ids)
> +			goto find_sd;
> +
> +		/*
> +		 * For fork,exec find the idlest cpu in the home-node.
> +		 */
> +		if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
> +			new_cpu = cpu = node_cpu;
> +			sd = per_cpu(sd_node, cpu);
> +			goto pick_idlest;
> +		}
> +
> +		/*
> +		 * For wake, pretend we were running in the home-node.
> +		 */
> +		prev_cpu = node_cpu;
> +	}
> +
> +find_sd:
>  	for_each_domain(cpu, tmp) {
>  		if (!(tmp->flags & SD_LOAD_BALANCE))
>  			continue;

This won't work right. I allow fork/clone to go anywhere if there are
idle cpus elsewhere; I don't limit the choice to the idlest cpu in the
home node.

With AutoNUMA, tsk->task_selected_nid is a hint, never an
"enforcement", no matter whether it's a wakeup, fork or execve. Idle cpus
always get priority over NUMA affinity. Not doing that will regress
performance instead of improving it.

The only way to get good performance is what AutoNUMA does (a toy
sketch follows the list below):

1) once in a while, check whether tsk has a better node to go to

2) if yes, move it to the better node and set task_selected_nid

3) in CFS, during regular load balancing and idle balancing, use
   task_selected_nid as a _not_strict_ hint until 1) runs again

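A toy sketch of that "hint, not enforcement" placement, with invented CPU/topology data (the function and array names are made up for the example):

/* Prefer an idle CPU on the hinted node, fall back to any idle CPU,
 * and only use a busy CPU on the hinted node as a last resort.
 */
#include <stdio.h>

#define NR_CPUS 8

static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
static const int cpu_idle[NR_CPUS] = { 0, 0, 0, 0, 0, 0, 1, 0 };	/* only cpu 6 idle */

static int pick_cpu(int selected_nid)
{
	int fallback = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_idle[cpu])
			continue;
		if (cpu_node[cpu] == selected_nid)
			return cpu;		/* idle and on the hinted node */
		if (fallback < 0)
			fallback = cpu;		/* remember some idle cpu */
	}
	if (fallback >= 0)
		return fallback;		/* an idle cpu beats NUMA affinity */

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_node[cpu] == selected_nid)
			return cpu;		/* nothing idle: honour the hint */
	return 0;
}

int main(void)
{
	/* The hint says node 0, but the only idle CPU sits on node 1. */
	printf("picked cpu %d\n", pick_cpu(0));
	return 0;
}
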
> @@ -3092,6 +3124,23 @@ static void move_task(struct task_struct
>  	check_preempt_curr(env->dst_rq, p, 0);
>  }
>  
> +static int task_numa_hot(struct task_struct *p, int from_cpu, int to_cpu)
> +{
> +	int from_dist, to_dist;
> +	int node = tsk_home_node(p);
> +
> +	if (!sched_feat_numa(NUMA_HOT) || node == -1)
> +		return 0; /* no node preference */
> +
> +	from_dist = node_distance(cpu_to_node(from_cpu), node);
> +	to_dist = node_distance(cpu_to_node(to_cpu), node);
> +
> +	if (to_dist < from_dist)
> +		return 0; /* getting closer is ok */

Getting closer is not ok if 30% of the ram is in the current node, 70%
is in the "home node" and the "to_dist" node has 0% of the ram. In short,
you're taking into account something that is almost irrelevant (distance)
and ignoring the really important information needed to decide whether
you're improving the overall NUMA convergence or not.

The objective of any CPU migration across nodes has to be a
global overall improvement of NUMA convergence, not moving the local
30% closer to the remote 70%: that's a regression in convergence if
you end up with 15% here, 15% there and 70% over there.

You can take distance into account, but only after you know the
mm->mm_autonuma statistical data.

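To make the 30%/70% example concrete, here is a toy comparison of the two criteria; the per-node page counts stand in for mm_autonuma-style statistics and the distance matrix and numbers are invented:

/* Distance-only vs footprint-aware placement. The distance rule happily
 * moves the task to a node that holds none of its memory, just because
 * that node is closer to the "home node".
 */
#include <stdio.h>

#define NR_NODES 3

/* Pages of the task's working set resident on each node. */
static const unsigned long pages_on_node[NR_NODES] = { 300, 700, 0 };

/* SLIT-style distances: node 2 is closer to home node 1 than node 0 is. */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 40, 20 },
	{ 40, 10, 20 },
	{ 20, 20, 10 },
};

/* The "getting closer to the home node is ok" rule from task_numa_hot(). */
static int move_allowed_by_distance(int from, int to, int home)
{
	return node_dist[to][home] < node_dist[from][home];
}

/* Footprint rule: prefer the node holding most of the task's memory. */
static int best_node_by_footprint(void)
{
	int best = 0;

	for (int n = 1; n < NR_NODES; n++)
		if (pages_on_node[n] > pages_on_node[best])
			best = n;
	return best;
}

int main(void)
{
	int from = 0, to = 2, home = 1;
	int best = best_node_by_footprint();

	printf("distance rule allows moving node %d -> node %d: %s\n",
	       from, to, move_allowed_by_distance(from, to, home) ? "yes" : "no");
	printf("yet node %d holds %lu pages; the footprint rule picks node %d (%lu pages)\n",
	       to, pages_on_node[to], best, pages_on_node[best]);
	return 0;
}
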
> @@ -3177,6 +3226,7 @@ int can_migrate_task(struct task_struct 
>  	 */
>  
>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> +	tsk_cache_hot |= task_numa_hot(p, env->src_cpu, env->dst_cpu);
>  	if (!tsk_cache_hot ||
>  		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -3202,11 +3252,11 @@ int can_migrate_task(struct task_struct 
>   *
>   * Called with both runqueues locked.
>   */
> -static int move_one_task(struct lb_env *env)
> +static int __move_one_task(struct lb_env *env)
>  {
>  	struct task_struct *p, *n;
>  
> -	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
> +	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
>  		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
>  			continue;
>  
> @@ -3225,6 +3275,21 @@ static int move_one_task(struct lb_env *
>  	return 0;
>  }
>  
> +static int move_one_task(struct lb_env *env)
> +{
> +	if (sched_feat_numa(NUMA_PULL)) {
> +		env->tasks = offnode_tasks(env->src_rq);
> +		if (__move_one_task(env))
> +			return 1;
> +	}
> +
> +	env->tasks = &env->src_rq->cfs_tasks;
> +	if (__move_one_task(env))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static unsigned long task_h_load(struct task_struct *p);
>  
>  static const unsigned int sched_nr_migrate_break = 32;
> @@ -3238,7 +3303,6 @@ static const unsigned int sched_nr_migra
>   */
>  static int move_tasks(struct lb_env *env)
>  {
> -	struct list_head *tasks = &env->src_rq->cfs_tasks;
>  	struct task_struct *p;
>  	unsigned long load;
>  	int pulled = 0;
> @@ -3246,8 +3310,9 @@ static int move_tasks(struct lb_env *env
>  	if (env->imbalance <= 0)
>  		return 0;
>  
> -	while (!list_empty(tasks)) {
> -		p = list_first_entry(tasks, struct task_struct, se.group_node);
> +again:
> +	while (!list_empty(env->tasks)) {
> +		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
>  
>  		env->loop++;
>  		/* We've more or less seen every task there is, call it quits */
> @@ -3258,7 +3323,7 @@ static int move_tasks(struct lb_env *env
>  		if (env->loop > env->loop_break) {
>  			env->loop_break += sched_nr_migrate_break;
>  			env->flags |= LBF_NEED_BREAK;
> -			break;
> +			goto out;
>  		}
>  
>  		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> @@ -3286,7 +3351,7 @@ static int move_tasks(struct lb_env *env
>  		 * the critical section.
>  		 */
>  		if (env->idle == CPU_NEWLY_IDLE)
> -			break;
> +			goto out;
>  #endif
>  
>  		/*
> @@ -3294,13 +3359,20 @@ static int move_tasks(struct lb_env *env
>  		 * weighted load.
>  		 */
>  		if (env->imbalance <= 0)
> -			break;
> +			goto out;
>  
>  		continue;
>  next:
> -		list_move_tail(&p->se.group_node, tasks);
> +		list_move_tail(&p->se.group_node, env->tasks);
>  	}
>  
> +	if (env->tasks == offnode_tasks(env->src_rq)) {
> +		env->tasks = &env->src_rq->cfs_tasks;
> +		env->loop = 0;
> +		goto again;
> +	}
> +
> +out:
>  	/*
>  	 * Right now, this is one of only two places move_task() is called,
>  	 * so we can safely collect move_task() stats here rather than
> @@ -3447,6 +3519,11 @@ struct sd_lb_stats {
>  	unsigned int  busiest_group_weight;
>  
>  	int group_imb; /* Is there imbalance in this sd */
> +#ifdef CONFIG_NUMA
> +	struct sched_group *numa_group; /* group which has offnode_tasks */
> +	unsigned long numa_group_weight;
> +	unsigned long numa_group_running;
> +#endif
>  };
>  
>  /*
> @@ -3462,6 +3539,10 @@ struct sg_lb_stats {
>  	unsigned long group_weight;
>  	int group_imb; /* Is there an imbalance in the group ? */
>  	int group_has_capacity; /* Is there extra capacity in the group? */
> +#ifdef CONFIG_NUMA
> +	unsigned long numa_weight;
> +	unsigned long numa_running;
> +#endif
>  };
>  
>  /**
> @@ -3490,6 +3571,117 @@ static inline int get_sd_load_idx(struct
>  	return load_idx;
>  }
>  
> +#ifdef CONFIG_NUMA
> +static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
> +{
> +	sgs->numa_weight += rq->offnode_weight;
> +	sgs->numa_running += rq->offnode_running;
> +}
> +

offnode_weight is never increased in this patch; maybe the increase of
numa_weight could be moved to the later patches that increase
offnode_weight.

> +/*
> + * Since the offnode lists are indiscriminate (they contain tasks for all other
> + * nodes) it is impossible to say if there's any task on there that wants to
> + * move towards the pulling cpu. Therefore select a random offnode list to pull
> + * from such that eventually we'll try them all.
> + */
> +static inline bool pick_numa_rand(void)
> +{
> +	return get_random_int() & 1;
> +}

This is what you have to do to pick tasks. I think having to resort to
the random generator to generate your algorithm's input shows how bad
this really is.

In comparison, AutoNUMA never, anywhere, does anything as a
function of randomness. Try to find a single place in AutoNUMA where I
take a random decision.

AutoNUMA only takes decisions it is sure will improve the overall
NUMA convergence.

Because AutoNUMA does things incrementally, and is not fully
synchronous (it leaves CFS alone with the hint for some period of time),
it will behave slightly differently too, but when it kicks in, it is fully
deterministic.


The only way to create a converging algorithm without
mm_autonuma/task_autonuma is to enforce an overcommit_memory=2-style model
where you account for everything perfectly, decide yourself where the memory
goes, and can enforce it (regardless of mlocks, tmpfs memory,
slab, or whatever). That kind of really strict model will never happen,
so you could try to approximate it, maybe by making wild (sometimes false)
assumptions, like pagecache always being freeable.

But that's not what you're doing here. You're trying to solve the
problem in an incremental way like AutoNUMA, i.e. without recalculating
the whole layout of the whole system at every page fault that
allocates 4k and could require rebalancing and moving everything by
spilling something over to a new node. The problem is that you lack enough
information to solve it with an incremental algorithm, so it'll never
work.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 18/19] sched, numa: Per task memory placement for big processes
  2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
  2012-07-31 21:56   ` Rik van Riel
  2012-08-08 21:35   ` Peter Zijlstra
@ 2012-08-09 21:57   ` Andrea Arcangeli
  2 siblings, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-09 21:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, Jul 31, 2012 at 09:12:22PM +0200, Peter Zijlstra wrote:
> Implement a per-task memory placement scheme for 'big' tasks (as per
> the last patch). It relies on a regular PROT_NONE 'migration' fault to
> scan the memory space of the process and uses a two-stage migration
> scheme to reduce the influence of unlikely usage relations.
> 
> It relies on the assumption that the compute part is tied to a
> particular task and builds a task<->page relation set to model the
> compute<->data relation.
> 
> Probability says that the task faulting on a page after we protect it,
> is most likely to be the task that uses that page most.
> 
> To decrease the likelihood of acting on a false relation, we only

Would you prefer that I use the term false relation instead of false
sharing too? I was never fond of the term false sharing, even though at
the NUMA level it resembles the cache effects.

> migrate a page when two consecutive samples are from the same task.
> 
> I'm still not entirely convinced this scheme is sound, esp. for things
> like virtualization and n:m threading solutions, where in general the
> compute<->task relation is fundamentally untrue.

To make it true, virt requires AutoNUMA or hard bindings inside the
guest too, plus a fake vtopology matching the hardware topology. Anyway
we can discuss that later; it's not the primary topic here.

> + *
> + * Once we start doing NUMA placement there are two modes, 'small' process-wide
> + * and 'big' per-task. For the small mode we have a process-wide home node
> + * and lazily migrate all memory only when this home-node changes.
> + *
> + * For big mode we keep a home-node per task and use periodic fault scans
> + * to try and establish a task<->page relation. This assumes the task<->page
> + * relation is a compute<->data relation; this is false for things like virt.
> + * and n:m threading solutions but it's the best we can do given the
> + * information we have.

Differentiating the way you migrate memory between small and big tasks
looks really bad and like a major band-aid. It is only needed because
the small mode is too broken to be used on big tasks, and the big mode
performs too poorly.

> +	if (big) {
> +		/*
> +		 * For 'big' processes we do per-thread home-node, combined
> +		 * with periodic fault scans.
> +		 */
> +		if (p->node != node)
> +			sched_setnode(p, node);
> +	} else {
> +		/*
> +		 * For 'small' processes we keep the entire process on a
> +		 * node and migrate all memory once.
> +		 */
> +		rcu_read_lock();
> +		t = p;
> +		do {
> +			sched_setnode(t, node);
> +		} while ((t = next_thread(p)) != p);
> +		rcu_read_unlock();
> +	}

The small mode tries to do the right thing, except that, expressed like
the above, it is so rigid and set in stone that it can't work well in
all possible scenarios (hence you can't use it for big tasks).

AutoNUMA only uses the "small" mode, but it does so in a way that works
perfectly for big tasks too. This is the relevant documentation on how
AutoNUMA's CPU-follows-memory algorithm sorts out the above problem in
an optimal way that doesn't require small/big classifications or other
hacks like that:

 * One important thing is how we calculate the weights using
 * task_autonuma or mm_autonuma, depending on whether the other CPU is
 * running a thread of the current process or a thread of a different
 * process.
 *
 * We use the mm_autonuma statistics to calculate the NUMA weights of
 * the two candidate tasks for the exchange, if the task on the other
 * CPU belongs to a different process. This way all threads of the
 * same process will try to converge on the same "mm" best nodes if
 * their "thread local" best CPU is already busy with a thread of a
 * different process. This is important because with threads there is
 * always the possibility of NUMA false sharing, and so it's better to
 * converge all threads onto as few nodes as possible.
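
Schematically (simplified, with hypothetical helper names, not the
actual AutoNUMA code) the selection boils down to:

static unsigned long numa_exchange_weight(struct task_struct *p,
					  struct task_struct *other, int nid)
{
	if (other->mm == p->mm)
		/* same process: thread-local task_autonuma statistics */
		return task_autonuma_weight(other, nid);

	/* different process: process-wide mm_autonuma statistics */
	return mm_autonuma_weight(other->mm, nid);
}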

Your approach will never work well on large systems, unless of course
all tasks are of the small kind.

At least you have now introduced the autonuma_last_nid NUMA hinting
page fault confirmation before starting the migration, to mitigate the
bad behavior of the big mode for big tasks.

> +	/*
> +	 * Trigger fault driven migration, small processes do direct
> +	 * lazy migration, big processes do gradual task<->page relations.
> +	 * See mpol_misplaced().
> +	 */
>  	lazy_migrate_process(p->mm);

The comment should go in the previous patch, the one that introduced
the call to lazy_migrate_process(), not here; bad splitup.

> +	/*
> +	 * Multi-stage node selection is used in conjunction with a periodic
> +	 * migration fault to build a temporal task<->page relation. By
> +	 * using a two-stage filter we remove short/unlikely relations.
> +	 *
> +	 * Using P(p) ~ n_p / n_t as per frequentist probability, we can
> +	 * equate a task's usage of a particular page (n_p) per total usage
> +	 * of this page (n_t) (in a given time-span) to a probability.
> +	 *
> +	 * Our periodic faults will then sample this probability and getting
> +	 * the same result twice in a row, given these samples are fully
> +	 * independent, is then given by P(p)^2, provided our sample period
> +	 * is sufficiently short compared to the usage pattern.
> +	 *
> +	 * This squaring squishes small probabilities, making it less likely
> +	 * we act on an unlikely task<->page relation.
> +	 *
> +	 * NOTE: effectively we're using task-home-node<->page-node relations
> +	 * since those are the only thing we can affect.
> +	 *
> +	 * NOTE: we're using task-home-node as opposed to the current node
> +	 * the task might be running on, since the task-home-node is the
> +	 * long-term node of this task, further reducing noise. Also see
> +	 * task_tick_numa().
> +	 */
> +	if (multi && (pol->flags & MPOL_F_HOME)) {
> +		if (page->nid_last != polnid) {
> +			page->nid_last = polnid;
> +			goto out;
> +		}
> +	}
> +

Great to see that we agree on the autonuma_last_nid logic now that
you've included it in sched-numa, so finally sched-numa also has a
task<->page relation. You also wrote the math to explain it! I hope
you don't mind if I copy the above math into that comment.
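
As a back-of-the-envelope illustration of that squaring (numbers purely
illustrative):

	P(p) = n_p / n_t = 0.3  \Rightarrow  P(p)^2 = 0.09
	P(p) = n_p / n_t = 0.9  \Rightarrow  P(p)^2 = 0.81

so a task responsible for 30% of the faults on a page triggers the
migration on only ~9% of consecutive sample pairs, while a task
responsible for 90% of the faults still migrates ~81% of the time.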

I just saw that you posted a patch to remove the 8 bytes per page and
return to zero cost by dropping 32bit support, but it was still
interesting to see sched-numa use 8 bytes per page unconditionally for
the first time! (AutoNUMA currently uses 12 bytes per page, but only on
NUMA hardware and zero on non-NUMA, and it works on 32bit archs too
with the same 12 byte cost.)

> @@ -2421,7 +2456,7 @@ void __init numa_policy_init(void)
>  		preferred_node_policy[nid] = (struct mempolicy) {
>  			.refcnt = ATOMIC_INIT(1),
>  			.mode = MPOL_PREFERRED,
> -			.flags = MPOL_F_MOF,
> +			.flags = MPOL_F_MOF | MPOL_F_HOME,
>  			.v = { .preferred_node = nid, },
>  		};
>  	}
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-08-09 21:41   ` Andrea Arcangeli
@ 2012-08-10  0:50     ` Andi Kleen
  0 siblings, 0 replies; 53+ messages in thread
From: Andi Kleen @ 2012-08-10  0:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, mingo, riel, oleg, pjt, akpm, torvalds, tglx,
	Lee.Schermerhorn, linux-kernel

Andrea Arcangeli <aarcange@redhat.com> writes:

> Hi,
>
> On Tue, Jul 31, 2012 at 09:12:06PM +0200, Peter Zijlstra wrote:
>> Since the NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants
>> to be compared to either a total of interleave allocations or to a miss
>> count, remove it.
>> 
>> Fixing it would be possible, but since we've gone years without these
>> statistics I figure we can continue that way.
>> 
>> Also NUMA_HIT fully includes NUMA_INTERLEAVE_HIT so users might
>> switch to using that.
>> 
>> This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.
>
It's not apparent why you need to remove it for sched-numa. I think I
see it, but it would be nicer if it were explained, so one doesn't need
to read an internal bit of several later patches to understand why this
is needed.

Also it still breaks the numactl test suite, as already explained
multiple times. Without the HIT counter there is no way to check that
interleaving actually happened.
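
For reference, the check is essentially of this shape (a rough
userspace sketch, not the actual test suite code): sample a node's
interleave_hit counter from /sys/devices/system/node/nodeN/numastat
around an MPOL_INTERLEAVE allocation and verify that it increased.

#include <stdio.h>
#include <string.h>

static long read_interleave_hit(int node)
{
	char path[64], name[64];
	long long val;
	long ret = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/numastat", node);
	f = fopen(path, "r");
	if (!f)
		return -1;
	/* numastat is a list of "<counter> <value>" lines */
	while (fscanf(f, "%63s %lld", name, &val) == 2) {
		if (!strcmp(name, "interleave_hit")) {
			ret = (long)val;
			break;
		}
	}
	fclose(f);
	return ret;
}

Without the counter there is simply nothing for such a check to compare.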

I'm a bit concerned about patch kits like this ignoring review feedback?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/19] sched-numa rewrite
  2012-08-08 18:43   ` Rik van Riel
@ 2012-08-17 18:08     ` Andrea Arcangeli
  0 siblings, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2012-08-17 18:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, mingo, oleg, pjt, akpm, torvalds, tglx,
	Lee.Schermerhorn, linux-kernel, Petr Holasek

Hi,

On Wed, Aug 08, 2012 at 02:43:34PM -0400, Rik van Riel wrote:
> While the sched-numa code is relatively small and clean, the
> current version does not seem to offer a significant
> performance improvement over not having it, and in one of
> the tests performance actually regresses vs. mainline.

sched-numa is small, true, but I dispute that it is clean. It does
lots of hacks, it has a worse NUMA hinting page fault implementation,
it has no runtime disable tweak, it has no config option, and it's
very intrusive in the scheduler and MM code, so it would be very hard
to back out if a better solution emerged in the future.

> On the other hand, the autonuma code is pretty large and
> hard to understand, but it does provide a significant
> speedup on each of the tests.

The AutoNUMA code is certainly pretty large, but it is totally self
contained. 90% of it is in isolated files that can be deleted and
won't even get built with CONFIG_AUTONUMA=n. The remaining common-code
changes can be wiped out by following the build errors after dropping
the include files with CONFIG_AUTONUMA=n, should a better solution
emerge in the future.

I think it's important that whatever is merged is self contained and
easy to back out in the future, especially if the non-self-contained
code is full of hacks like the big/small mode or a random number
generator generating part of the "input".

I applied the fix for sched-numa rewrite/v2 posted on lkml, but I still
get lockups when running the autonuma-benchmark on the 8 node system;
I could never complete the first numa01 test. I provided stack traces
off-list to help debug it.

So for now I updated the pdf with only the autonuma23 results for the
8 node system. I had to bump the AutoNUMA version to 23 and repeat
all benchmarks because of a one-liner s/kalloc/kzalloc/ change needed
to successfully boot AutoNUMA on the 8 node system (which boots with
RAM not zeroed out).

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120817.pdf

I didn't include the convergence charts for 3.6-rc1 on the 8 node
system because they're equal to the ones on the 2 node system and
would only waste PDF real estate.

From the numa02_SMT charts I suspect something may not be perfect in
the active/idle load balancing of CFS. The imperfection is likely lost
in the noise, and without the convergence charts showing the exact
memory distribution across the nodes it would be hard to notice.

numa01 on the 8 node system is quite a pathological case, and it shows
the heavy NUMA false sharing/relation there is when 2 processes cross 4
nodes each and touch all memory in a loop. The smooth async memory
migration of that pathological case still doesn't hurt, despite some
small amount of migration going on in the background forever (this is
why async migration providing smooth behavior is quite important).
numa01 is a very different load on 2 nodes vs 8 nodes (on 2 nodes it
can converge 100% and will stop the memory migrations altogether).

Some time near the end of the tests (the X axis is time) you'll notice
some divergence; that happens because some threads complete sooner (the
threads of the node that had all RAM local at startup will certainly
always complete faster than the others). The reason for that
divergence is that the load falls into the _SMT case to fill all idle
cores.

I also noticed on the 8 node system some repetition of the task
migrations invoked by sched_autonuma_balance() that I intend to
optimize away in future versions (it is only visible after enabling
the debug mode). Fixing it will save a small amount of CPU. What
happens is that the idle load balancing invoked by the CPU that becomes
idle after the task migration sometimes grabs the migrated task and
puts it back in its original position, so the migration has to be
repeated at the next invocation of sched_autonuma_balance().

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP
  2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
  2012-07-31 21:06   ` Rik van Riel
  2012-08-09 21:44   ` Andrea Arcangeli
@ 2012-10-01  9:36   ` Michael Kerrisk
  2012-10-01  9:45     ` Ingo Molnar
  2 siblings, 1 reply; 53+ messages in thread
From: Michael Kerrisk @ 2012-10-01  9:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, riel, oleg, pjt, akpm, torvalds, tglx, Lee.Schermerhorn,
	linux-kernel

On Tue, Jul 31, 2012 at 9:12 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> From: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
> policy to mbind().  When the NOOP policy is used with the 'MOVE
> and 'LAZY flags, mbind() [check_range()] will walk the specified
> range and unmap eligible pages so that they will be migrated on
> next touch.

This patch is mistitled. The new flag is MPOL_NOOP. That made it
difficult to find the commit in linux-next. Does it make sense to fix
the patch title before this hits mainline?

Thanks,

Michael



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP
  2012-10-01  9:36   ` Michael Kerrisk
@ 2012-10-01  9:45     ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2012-10-01  9:45 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Peter Zijlstra, riel, oleg, pjt, akpm, torvalds, tglx,
	Lee.Schermerhorn, linux-kernel


* Michael Kerrisk <mtk.manpages@gmail.com> wrote:

> On Tue, Jul 31, 2012 at 9:12 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> >
> > This patch augments the MPOL_MF_LAZY feature by adding a 
> > "NOOP" policy to mbind().  When the NOOP policy is used with 
> > the 'MOVE and 'LAZY flags, mbind() [check_range()] will walk 
> > the specified range and unmap eligible pages so that they 
> > will be migrated on next touch.
> 
> This patch is mistitled. The new flag is MPOL_NOOP. That made 
> it difficult to find the commit in linux-next. Does it make 
> sense to fix the patch title before this hits mainline?

In isolation such a minor title problem alone does not justify 
rebasing.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2012-10-01  9:46 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-31 19:12 [PATCH 00/19] sched-numa rewrite Peter Zijlstra
2012-07-31 19:12 ` [PATCH 01/19] task_work: Remove dependency on sched.h Peter Zijlstra
2012-07-31 20:52   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 02/19] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
2012-07-31 20:52   ` Rik van Riel
2012-08-09 21:41   ` Andrea Arcangeli
2012-08-10  0:50     ` Andi Kleen
2012-07-31 19:12 ` [PATCH 03/19] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
2012-07-31 20:52   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 04/19] mm, thp: Preserve pgprot across huge page split Peter Zijlstra
2012-07-31 20:53   ` Rik van Riel
2012-08-09 21:42   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure Peter Zijlstra
2012-07-31 20:55   ` Rik van Riel
2012-08-09 21:43   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 06/19] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
2012-07-31 21:04   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 07/19] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
2012-07-31 21:06   ` Rik van Riel
2012-08-09 21:44   ` Andrea Arcangeli
2012-10-01  9:36   ` Michael Kerrisk
2012-10-01  9:45     ` Ingo Molnar
2012-07-31 19:12 ` [PATCH 08/19] mm/mpol: Check for misplaced page Peter Zijlstra
2012-07-31 21:13   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 09/19] mm, migrate: Introduce migrate_misplaced_page() Peter Zijlstra
2012-07-31 21:16   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 10/19] mm, mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
2012-07-31 21:24   ` Rik van Riel
2012-08-09 21:44   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 11/19] sched, mm: Introduce tsk_home_node() Peter Zijlstra
2012-07-31 21:30   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 12/19] mm/mpol: Make mempolicy home-node aware Peter Zijlstra
2012-07-31 21:33   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 13/19] sched: Introduce sched_feat_numa() Peter Zijlstra
2012-07-31 21:34   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 14/19] sched: Make find_busiest_queue() a method Peter Zijlstra
2012-07-31 21:34   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 15/19] sched: Implement home-node awareness Peter Zijlstra
2012-07-31 21:52   ` Rik van Riel
2012-08-09 21:51   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 16/19] sched, numa: NUMA home-node selection code Peter Zijlstra
2012-07-31 21:52   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 17/19] sched, numa: Detect big processes Peter Zijlstra
2012-07-31 21:53   ` Rik van Riel
2012-07-31 19:12 ` [PATCH 18/19] sched, numa: Per task memory placement for " Peter Zijlstra
2012-07-31 21:56   ` Rik van Riel
2012-08-08 21:35   ` Peter Zijlstra
2012-08-09 21:57   ` Andrea Arcangeli
2012-07-31 19:12 ` [PATCH 19/19] mm, numa: retry failed page migrations Peter Zijlstra
2012-08-02 20:40   ` Christoph Lameter
2012-08-08 17:17 ` [PATCH 00/19] sched-numa rewrite Andrea Arcangeli
2012-08-08 18:43   ` Rik van Riel
2012-08-17 18:08     ` Andrea Arcangeli
