linux-mm.kvack.org archive mirror
* [PATCH] Add a sysctl for numa_balancing.
@ 2013-04-24 23:56 Andi Kleen
  2013-04-25  0:14 ` Will Huck
  2013-04-29  8:41 ` Mel Gorman
  0 siblings, 2 replies; 4+ messages in thread
From: Andi Kleen @ 2013-04-24 23:56 UTC
  To: mgorman; +Cc: linux-kernel, linux-mm, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

As discussed earlier, this adds a working sysctl to enable/disable
automatic numa memory balancing at runtime.

This was possible earlier through debugfs, but only with special
debugging options set. Also fix the boot message.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/sysctl/kernel.txt |   10 ++++++++++
 include/linux/sched/sysctl.h    |    4 ++++
 kernel/sched/core.c             |   24 +++++++++++++++++++++++-
 kernel/sysctl.c                 |   11 +++++++++++
 mm/mempolicy.c                  |    2 +-
 5 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..17a7004 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,16 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+TBD someone document the other numa_balancing tunables
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..e228a1b 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -101,4 +101,8 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern int sched_numa_balancing(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..679be74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,29 @@ void set_numabalancing_state(bool enabled)
 	numabalancing_enabled = enabled;
 }
 #endif /* CONFIG_SCHED_DEBUG */
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_PROC_SYSCTL
+int sched_numa_balancing(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = numabalancing_enabled;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_numabalancing_state(state);
+	return err;
+}
+#endif
+#endif
 
 /*
  * fork()/clone()-time setup:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..94164ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "numa_balancing",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_numa_balancing,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+
+
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..7eee646 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2531,7 +2531,7 @@ static void __init check_numabalancing_enable(void)
 
 	if (nr_node_ids > 1 && !numabalancing_override) {
 		printk(KERN_INFO "Enabling automatic NUMA balancing. "
-			"Configure with numa_balancing= or sysctl");
+			"Configure with numa_balancing= or the kernel.numa_balancing sysctl");
 		set_numabalancing_state(numabalancing_default);
 	}
 }
-- 
1.7.7.6
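
A minimal userspace sketch, not part of the patch, of how the new knob
could be read and toggled once the patch is applied. It assumes the file
shows up as /proc/sys/kernel/numa_balancing (the path implied by the
kernel.numa_balancing name in the boot message) and that writes require
CAP_SYS_ADMIN, as in the handler above. The helper names
numa_balancing_get() and numa_balancing_set() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Assumed location of the knob; derived from the kernel.numa_balancing name. */
#define NUMA_BALANCING_PATH "/proc/sys/kernel/numa_balancing"

/* Read the setting from `path`. Returns 0 or 1, or a negative errno value. */
static int numa_balancing_get(const char *path)
{
	FILE *f;
	int val;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (fscanf(f, "%d", &val) != 1 || (val != 0 && val != 1)) {
		fclose(f);
		return -EINVAL;
	}
	fclose(f);
	return val;
}

/* Write 0 or 1 to `path`. Returns 0 on success, a negative errno otherwise. */
static int numa_balancing_set(const char *path, int enable)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL);
	/* Mirror the 0..1 bounds (extra1/extra2) before touching the file. */
	if (enable != 0 && enable != 1)
		return -EINVAL;
	f = fopen(path, "w");
	if (!f)
		return -errno;	/* e.g. -EACCES when not running as root */
	if (fprintf(f, "%d\n", enable) < 0)
		ret = -EIO;
	/* stdio buffers the write, so a rejection by the kernel shows up here. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would use numa_balancing_set(NUMA_BALANCING_PATH, 0) to switch
balancing off at runtime and numa_balancing_get(NUMA_BALANCING_PATH) to
check the current state. Taking the path as a parameter keeps the helpers
testable against an ordinary file.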

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] Add a sysctl for numa_balancing.
  2013-04-24 23:56 [PATCH] Add a sysctl for numa_balancing Andi Kleen
@ 2013-04-25  0:14 ` Will Huck
  2013-04-29  8:41 ` Mel Gorman
  1 sibling, 0 replies; 4+ messages in thread
From: Will Huck @ 2013-04-25  0:14 UTC
  To: Andi Kleen; +Cc: mgorman, linux-kernel, linux-mm, Andi Kleen

On 04/25/2013 07:56 AM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> As discussed earlier, this adds a working sysctl to enable/disable
> automatic numa memory balancing at runtime.
>
> This was possible earlier through debugfs, but only with special
> debugging options set. Also fix the boot message.

One off-topic question.

If I configure a UMA machine to use fake NUMA, is there any benefit or downside?



* Re: [PATCH] Add a sysctl for numa_balancing.
  2013-04-24 23:56 [PATCH] Add a sysctl for numa_balancing Andi Kleen
  2013-04-25  0:14 ` Will Huck
@ 2013-04-29  8:41 ` Mel Gorman
  2013-04-29 20:32   ` David Rientjes
  1 sibling, 1 reply; 4+ messages in thread
From: Mel Gorman @ 2013-04-29  8:41 UTC
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, Andi Kleen

On Wed, Apr 24, 2013 at 04:56:24PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> As discussed earlier, this adds a working sysctl to enable/disable
> automatic numa memory balancing at runtime.
> 
> This was possible earlier through debugfs, but only with special
> debugging options set. Also fix the boot message.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Mel Gorman <mgorman@suse.de>

Would you like to merge the following patch with it to remove the TBD?

---8<---
mm: numa: Document remaining automatic NUMA balancing sysctls

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 62 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 17a7004..4d56060 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -356,11 +356,63 @@ utilize.
 
 numa_balancing
 
-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
-
-TBD someone document the other numa_balancing tunables
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples which task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate at which the kernel samples for NUMA
+hinting faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes, which minimises the performance impact of remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.
 
 ==============================================================
 

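The relationship between "scan delay", "scan size" and scan rate described
in the documentation above can be sketched in a few lines of userspace C.
This is only an illustration of the arithmetic, not the kernel's
implementation: the struct and function names are hypothetical, and the
numbers in the usage note are illustrative values rather than documented
defaults.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the tunables named in the documentation. */
struct numa_scan_tunables {
	unsigned int period_min_ms;	/* numa_balancing_scan_period_min_ms */
	unsigned int period_max_ms;	/* numa_balancing_scan_period_max_ms */
	unsigned int scan_size_mb;	/* numa_balancing_scan_size_mb */
};

/* Keep a task's adaptive scan delay within [period_min_ms, period_max_ms]. */
static unsigned int clamp_scan_delay_ms(const struct numa_scan_tunables *t,
					unsigned int delay_ms)
{
	assert(t != NULL);
	assert(t->period_min_ms > 0 && t->period_min_ms <= t->period_max_ms);
	if (delay_ms < t->period_min_ms)
		return t->period_min_ms;
	if (delay_ms > t->period_max_ms)
		return t->period_max_ms;
	return delay_ms;
}

/*
 * Per-task scan rate in MB/s: one "scan size" worth of address space is
 * scanned every "scan delay", so the rate is size divided by delay.
 */
static double scan_rate_mb_per_sec(const struct numa_scan_tunables *t,
				   unsigned int delay_ms)
{
	unsigned int d = clamp_scan_delay_ms(t, delay_ms);

	return (double)t->scan_size_mb * 1000.0 / (double)d;
}
```

With an illustrative minimum of 100 ms, maximum of 5000 ms and scan size of
256 MB, the per-task rate is bounded between about 51 MB/s (delay at the
maximum) and 2560 MB/s (delay at the minimum). Lowering the minimum period
or raising the scan size raises the ceiling, and with it the potential
overhead.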

* Re: [PATCH] Add a sysctl for numa_balancing.
  2013-04-29  8:41 ` Mel Gorman
@ 2013-04-29 20:32   ` David Rientjes
  0 siblings, 0 replies; 4+ messages in thread
From: David Rientjes @ 2013-04-29 20:32 UTC
  To: Mel Gorman; +Cc: Andi Kleen, linux-kernel, linux-mm, Andi Kleen

On Mon, 29 Apr 2013, Mel Gorman wrote:

> On Wed, Apr 24, 2013 at 04:56:24PM -0700, Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > 
> > As discussed earlier, this adds a working sysctl to enable/disable
> > automatic numa memory balancing at runtime.
> > 
> > This was possible earlier through debugfs, but only with special
> > debugging options set. Also fix the boot message.
> > 
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> 

Acked-by: David Rientjes <rientjes@google.com>

> Would you like to merge the following patch with it to remove the TBD?
> 
> ---8<---
> mm: numa: Document remaining automatic NUMA balancing sysctls
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>
