[PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter

From: Mel Gorman <mgorman@suse.de>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>, Mel Gorman <mgorman@suse.de>
Subject: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
Date: Wed,  3 Jul 2013 15:21:34 +0100
Message-ID: <1372861300-9973-8-git-send-email-mgorman@suse.de>
In-Reply-To: <1372861300-9973-1-git-send-email-mgorman@suse.de>

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. Doing so would
require recording the last task that accessed a page on a hinting fault,
which would increase the size of struct page. Instead, this patch
approximates private pages by assuming that faults that pass the two-stage
filter are private and all others are shared. The preferred NUMA
node is then selected based on where the maximum number of approximately
private faults was measured. Shared faults are not taken into
consideration for a few reasons.
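
The accounting described above can be sketched in plain userspace C. This
is only an illustration, not the kernel code: NR_NODES, struct fault_stats,
record_fault() and pick_preferred_node() are hypothetical names, and the
fixed node count stands in for nr_node_ids. The index layout and the
private/shared test mirror task_faults_idx() and the last_nid comparison
in the diff below.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical; the kernel uses nr_node_ids */

/* Two slots per node: slot 0 counts shared faults, slot 1 private faults. */
static inline int faults_idx(int nid, int priv)
{
	assert(nid >= 0 && nid < NR_NODES);
	assert(priv == 0 || priv == 1);
	return 2 * nid + priv;
}

/* Hypothetical stand-in for the per-task fault arrays. */
struct fault_stats {
	unsigned long faults[2 * NR_NODES];	/* decayed history */
	unsigned long buffer[2 * NR_NODES];	/* faults since the last scan */
};

/*
 * Record a hinting fault for a page on page_nid. The fault is treated as
 * private when the node that last faulted the page (last_nid) is the node
 * the task is running on (task_nid). Migrated pages count double.
 */
static void record_fault(struct fault_stats *s, int task_nid, int last_nid,
			 int page_nid, int pages, int migrated)
{
	int priv = (task_nid == last_nid);

	s->buffer[faults_idx(page_nid, priv)] +=
		(unsigned long)pages << (migrated ? 1 : 0);
}

/*
 * Halve the history, fold in the buffered faults for both classes, and
 * return the node with the most private faults (node 0 if there are none).
 */
static int pick_preferred_node(struct fault_stats *s)
{
	unsigned long faults, max_faults = 0;
	int nid, priv, i, max_nid = 0;

	for (nid = 0; nid < NR_NODES; nid++) {
		for (priv = 0; priv < 2; priv++) {
			i = faults_idx(nid, priv);
			s->faults[i] >>= 1;
			s->faults[i] += s->buffer[i];
			s->buffer[i] = 0;
		}

		/* Only the private slot takes part in the comparison. */
		faults = s->faults[faults_idx(nid, 1)];
		if (faults > max_faults) {
			max_faults = faults;
			max_nid = nid;
		}
	}
	return max_nid;
}
```

With a zeroed struct fault_stats, recording 10 private pages on node 1 and
500 shared pages on node 2 makes pick_preferred_node() return 1, because
the shared slot is decayed and stored but never compared.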

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will become compute-overloaded and the
tasks will be scheduled away later, only to bounce back again.
Alternatively, the sharing tasks would just bounce around nodes because
the fault information is effectively noise. Either way, accounting for
shared faults the same as private faults may result in lower performance
overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults, and its node
would be selected as the preferred node even though that is the wrong
decision.
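
A small sketch with made-up numbers illustrates this hypothetical workload.
struct node_faults, busiest_by_total() and busiest_by_private() are
illustrative names only and do not exist in the kernel; the fault counts
are invented for the example.

```c
#include <assert.h>

#define EX_NODES 2	/* hypothetical two-node machine */

/* Hypothetical per-node fault counts for one task. */
struct node_faults {
	unsigned long shared;	/* faults on pages other tasks also touch */
	unsigned long priv;	/* faults that passed the two-stage filter */
};

/* Selection that lumps shared and private faults together. */
static int busiest_by_total(const struct node_faults *n, int nr)
{
	unsigned long total, max = 0;
	int nid, max_nid = 0;

	for (nid = 0; nid < nr; nid++) {
		total = n[nid].shared + n[nid].priv;
		if (total > max) {
			max = total;
			max_nid = nid;
		}
	}
	return max_nid;
}

/* Selection that only looks at approximately private faults. */
static int busiest_by_private(const struct node_faults *n, int nr)
{
	unsigned long max = 0;
	int nid, max_nid = 0;

	for (nid = 0; nid < nr; nid++) {
		if (n[nid].priv > max) {
			max = n[nid].priv;
			max_nid = nid;
		}
	}
	return max_nid;
}
```

Suppose node 0 holds the hot private pages (64 private faults, no shared
faults) and node 1 holds the large shared array (4096 shared faults, no
private faults). busiest_by_total() picks node 1, while
busiest_by_private() picks node 0, which is where the task's private
working set actually lives.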

The third reason is that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 30 +++++++++++++++++++++---------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9139be..a66f2bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -831,6 +831,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -847,13 +852,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -886,16 +897,17 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -903,7 +915,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -917,7 +929,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4
