linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Hansen <dave@sr71.net>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
	Cody P Schafer <cody@linux.vnet.ibm.com>,
	Andi Kleen <ak@linux.intel.com>,
	cl@gentwo.org, Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mel@csn.ul.ie>, Dave Hansen <dave@sr71.net>
Subject: [RFC][PATCH 5/8] mm: pcp: make percpu_pagelist_fraction sysctl undoable
Date: Tue, 15 Oct 2013 13:35:45 -0700	[thread overview]
Message-ID: <20131015203545.9DAADC18@viggo.jf.intel.com> (raw)
In-Reply-To: <20131015203536.1475C2BE@viggo.jf.intel.com>


From: Dave Hansen <dave.hansen@linux.intel.com>

The kernel has two methods of setting the sizes of the percpu
pagesets:

 1. The default, according to a page_alloc.c comment is "set to
    around 1000th of the size of the zone.  But no more than 1/2
    of a meg."
 2. After boot, vm.percpu_pagelist_fraction can be set to
    override the default.

However, the trip from 1->2 is a one-way street.  There's no way
to get back.  You can get either the 'high' or 'batch' values to
match the boot-time value, but since the relationship between the
two is different in the two different modes, you can never get
back _exactly_ to where you were.  This kinda sucks if you are
trying to do performance testing to find optimal values.

Note that we remove the .extra1 argument to the sysctl structure.
The bounding behavior is now open-coded in the handler.

Since we are now able to go back to the boot-time values, we
need the boot-time function zone_batchsize() to be available
at runtime, so remove its __meminit.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/Documentation/sysctl/vm.txt |    6 +++---
 linux.git-davehans/kernel/sysctl.c             |   25 +++++++++++++++++++++----
 linux.git-davehans/mm/page_alloc.c             |    2 +-
 3 files changed, 25 insertions(+), 8 deletions(-)

diff -puN Documentation/sysctl/vm.txt~make-percpu_pagelist_fraction-sysctl-undoable Documentation/sysctl/vm.txt
--- linux.git/Documentation/sysctl/vm.txt~make-percpu_pagelist_fraction-sysctl-undoable	2013-10-15 09:57:07.004662395 -0700
+++ linux.git-davehans/Documentation/sysctl/vm.txt	2013-10-15 09:57:07.011662705 -0700
@@ -653,6 +653,9 @@ why oom happens. You can get snapshot.
 
 percpu_pagelist_fraction
 
+Set (at boot) to 0.  The kernel will size each percpu pagelist to around
+1/1000th of the size of the zone but limited to be around 0.75MB.
+
 This is the fraction of pages at most (high mark pcp->high) in each zone that
 are allocated for each per cpu page list.  The min value for this is 8.  It
 means that we don't allow more than 1/8th of pages in each zone to be
@@ -663,9 +666,6 @@ of hot per cpu pagelists.  User can spec
 The batch value of each per cpu pagelist is also updated as a result.  It is
 set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)
 
-The initial value is zero.  Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list.
-
 ==============================================================
 
 stat_interval
diff -puN kernel/sysctl.c~make-percpu_pagelist_fraction-sysctl-undoable kernel/sysctl.c
--- linux.git/kernel/sysctl.c~make-percpu_pagelist_fraction-sysctl-undoable	2013-10-15 09:57:07.005662439 -0700
+++ linux.git-davehans/kernel/sysctl.c	2013-10-15 09:57:07.012662750 -0700
@@ -138,7 +138,6 @@ static unsigned long dirty_bytes_min = 2
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
-static int min_percpu_pagelist_fract = 8;
 
 static int ngroups_max = NGROUPS_MAX;
 static const int cap_last_cap = CAP_LAST_CAP;
@@ -1289,7 +1288,6 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(percpu_pagelist_fraction),
 		.mode		= 0644,
 		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
-		.extra1		= &min_percpu_pagelist_fract,
 	},
 #ifdef CONFIG_MMU
 	{
@@ -1910,7 +1908,7 @@ static int do_proc_dointvec_conv(bool *n
 
 static const char proc_wspace_sep[] = { ' ', '\t', '\n' };
 
-static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
+int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
 		  int write, void __user *buffer,
 		  size_t *lenp, loff_t *ppos,
 		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
@@ -2466,7 +2464,26 @@ static int proc_do_cad_pid(struct ctl_ta
 static int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
-	int ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ 	int ret;
+	int tmp = percpu_pagelist_fraction;
+	int min_percpu_pagelist_fract = 8;
+
+	ret = __do_proc_dointvec(&tmp, table, write, buffer, length, ppos,
+		       NULL, NULL);
+	/*
+	 * We want values >= min_percpu_pagelist_fract, but we
+	 * also accept 0 to mean "stop using the fractions and
+	 * go back to the default behavior".
+	 */
+	if (write) {
+		if (tmp < 0)
+			return -EINVAL;
+		if ((tmp < min_percpu_pagelist_fract) &&
+		    (tmp != 0))
+			return -EINVAL;
+		percpu_pagelist_fraction = tmp;
+	}
+
 	if (!write || (ret < 0))
 		return ret;
 
diff -puN mm/page_alloc.c~make-percpu_pagelist_fraction-sysctl-undoable mm/page_alloc.c
--- linux.git/mm/page_alloc.c~make-percpu_pagelist_fraction-sysctl-undoable	2013-10-15 09:57:07.008662572 -0700
+++ linux.git-davehans/mm/page_alloc.c	2013-10-15 09:57:07.015662883 -0700
@@ -4059,7 +4059,7 @@ static void __meminit zone_init_free_lis
 	memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
 #endif
 
-static int __meminit zone_batchsize(struct zone *zone)
+static int zone_batchsize(struct zone *zone)
 {
 #ifdef CONFIG_MMU
 	int batch;
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2013-10-15 20:35 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-15 20:35 [RFC][PATCH 0/8] mm: freshen percpu pageset code Dave Hansen
2013-10-15 20:35 ` [RFC][PATCH 1/8] mm: pcp: rename percpu pageset functions Dave Hansen
2013-10-17  1:32   ` David Rientjes
2013-10-17 16:11     ` Dave Hansen
2013-10-15 20:35 ` [RFC][PATCH 2/8] mm: pcp: consolidate percpu_pagelist_fraction code Dave Hansen
2013-10-15 20:35 ` [RFC][PATCH 3/8] mm: pcp: separate pageset update code from sysctl code Dave Hansen
2013-10-15 20:35 ` [RFC][PATCH 4/8] mm: pcp: move pageset sysctl code to sysctl.c Dave Hansen
2013-10-15 20:35 ` Dave Hansen [this message]
2013-10-15 20:35 ` [RFC][PATCH 6/8] mm: pcp: consolidate high-to-batch ratio code Dave Hansen
2013-10-15 20:35 ` [RFC][PATCH 7/8] mm: pcp: move page coloring optimization away from pcp sizing Dave Hansen
2013-10-15 20:35 ` [RFC][PATCH 8/8] mm: pcp: create setup_boot_pageset() Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131015203545.9DAADC18@viggo.jf.intel.com \
    --to=dave@sr71.net \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@gentwo.org \
    --cc=cody@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).