linux-mm.kvack.org archive mirror
From: Mel Gorman <mgorman@suse.de>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>, Mel Gorman <mgorman@suse.de>
Subject: [PATCH 12/18] sched: Set the scan rate proportional to the size of the task being scanned
Date: Mon, 15 Jul 2013 16:20:14 +0100
Message-ID: <1373901620-2021-13-git-send-email-mgorman@suse.de>
In-Reply-To: <1373901620-2021-1-git-send-email-mgorman@suse.de>

The NUMA PTE scan rate is controlled by a combination of
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task. As an aside, it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables to be
about tuning the length of time it takes to complete a scan of a task's
virtual address space. Conceptually this is a lot easier to understand.
There is a "sanity" check to ensure the scan rate is never extremely fast,
based on the amount of virtual memory that should be scanned in a second.
The default of 2.5G seems arbitrary but it was chosen so that the maximum
scan rate after the patch roughly matches the maximum scan rate before the
patch was applied.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 ++++---
 include/linux/sched.h           |  1 +
 kernel/sched/fair.c             | 72 +++++++++++++++++++++++++++++++++++------
 3 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a task's virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a task's virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b81195e..d44fbc6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
+	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 022a04c..8a392c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the task's virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 600000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,46 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long nr_vm_pages = 0;
+	unsigned long nr_scan_pages;
+
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	nr_vm_pages = p->mm->total_vm;
+	if (!nr_vm_pages)
+		nr_vm_pages = nr_scan_pages;
+
+	nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+	return nr_vm_pages / nr_scan_pages;
+}
+
+/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int scan, floor;
+	unsigned int windows = 1;
+
+	if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+	floor = 1000 / windows;
+
+	scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	unsigned int smin = task_scan_min(p);
+	unsigned int smax;
+
+	/* Watch for max being lower than min due to floor calculations */
+	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+	return max(smin, smax);
+}
+
 /*
  * Once a preferred node is selected the scheduler balancer will prefer moving
  * a task to that node for sysctl_numa_balancing_settle_count number of PTE
@@ -839,6 +881,7 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 	p->numa_migrate_seq++;
+	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -894,7 +937,7 @@ static void task_numa_placement(struct task_struct *p)
 		 */
 		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
 			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+					task_scan_min(p));
 		}
 	}
 }
@@ -935,7 +978,7 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	 * This is reset periodically in case of phase changes
 	 */
         if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
@@ -961,6 +1004,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -1002,8 +1046,10 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+	if (p->numa_scan_period == 0) {
+		p->numa_scan_period_max = task_scan_max(p);
+		p->numa_scan_period = task_scan_min(p);
+	}
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -1042,7 +1088,15 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
+			 */
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
 			if (pages <= 0)
@@ -1089,7 +1143,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.1.4
