From: Dave Hansen <dave@sr71.net>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
Cody P Schafer <cody@linux.vnet.ibm.com>,
Andi Kleen <ak@linux.intel.com>,
cl@gentwo.org, Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mel@csn.ul.ie>, Dave Hansen <dave@sr71.net>
Subject: [RFC][PATCH 5/8] mm: pcp: make percpu_pagelist_fraction sysctl undoable
Date: Tue, 15 Oct 2013 13:35:45 -0700
Message-ID: <20131015203545.9DAADC18@viggo.jf.intel.com>
In-Reply-To: <20131015203536.1475C2BE@viggo.jf.intel.com>
From: Dave Hansen <dave.hansen@linux.intel.com>

The kernel has two methods of setting the sizes of the percpu
pagesets:

 1. The default, which according to a page_alloc.c comment is "set
    to around 1000th of the size of the zone. But no more than 1/2
    of a meg."
 2. After boot, vm.percpu_pagelist_fraction can be set to
    override the default.

However, the trip from 1->2 is a one-way street. There's no way
to get back. You can get either the 'high' or 'batch' values to
match the boot-time value, but since the relationship between the
two is different in the two modes, you can never get back
_exactly_ to where you were. This kinda sucks if you are
trying to do performance testing to find optimal values.

Make the sysctl undoable: writing 0 to vm.percpu_pagelist_fraction
now means "stop using the fractions and go back to the default
behavior".

Note that we remove the .extra1 argument from the sysctl structure.
The bounding behavior is now open-coded in the handler, which
accepts either 0 or a value >= 8.

Since we are now able to go back to the boot-time values, we
need the boot-time function zone_batchsize() to be available
at runtime, so remove its __meminit.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
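
The changelog's point that 'high' and 'batch' relate differently in
the two modes can be illustrated with a rough userspace sketch. The
fraction-mode arithmetic follows the vm.txt text in this patch
(batch = high/4, capped at PAGE_SHIFT * 8). The default-mode
arithmetic (divide by 1024, cap at 512KB, divide by 4, clamp to
2^n - 1, high = 6 * batch) is an assumption based on a reading of
mm/page_alloc.c from around this time and is not shown in this
patch. All names below are hypothetical, and 4KB pages are assumed.

```c
#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12UL                     /* assumption: 4KB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

struct pcp_sizes {
	unsigned long high;   /* pages allowed on one percpu list */
	unsigned long batch;  /* pages moved to/from the list at once */
};

/* Largest power of two that is <= n (n must be >= 1). */
static unsigned long sketch_rounddown_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

/* Default (boot-time) mode. ASSUMED arithmetic, see the note above. */
struct pcp_sizes pcp_default_sizes(unsigned long zone_pages)
{
	struct pcp_sizes s;
	unsigned long batch = zone_pages / 1024;

	if (batch * SKETCH_PAGE_SIZE > 512 * 1024)
		batch = (512 * 1024) / SKETCH_PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	/* Clamp to a 2^n - 1 value. */
	batch = sketch_rounddown_pow2(batch + batch / 2) - 1;

	s.high = 6 * batch;
	s.batch = batch < 1 ? 1 : batch;
	return s;
}

/* Fraction mode, per vm.txt: high = zone/fraction, batch = high/4. */
struct pcp_sizes pcp_fraction_sizes(unsigned long zone_pages,
				    unsigned long fraction)
{
	struct pcp_sizes s;

	s.high = zone_pages / fraction;   /* caller passes fraction > 0 */
	s.batch = s.high / 4;
	if (s.batch < 1)
		s.batch = 1;
	if (s.batch > SKETCH_PAGE_SHIFT * 8)
		s.batch = SKETCH_PAGE_SHIFT * 8;
	return s;
}

/* 0 selects the default mode; anything else selects fraction mode. */
struct pcp_sizes pcp_sizes_for(unsigned long zone_pages,
			       unsigned long fraction)
{
	if (fraction == 0)
		return pcp_default_sizes(zone_pages);
	return pcp_fraction_sizes(zone_pages, fraction);
}
```

Under those assumptions, a zone of 1,000,000 pages gets batch 31 and
high 186 (about 744KB) in the default mode. In fraction mode, a
fraction of 5376 reproduces high = 186 but yields batch 46, while a
fraction of 8000 reproduces batch = 31 but yields high 125. No single
fraction gives both, which is why only the 0 value can restore the
boot-time sizes.
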
linux.git-davehans/Documentation/sysctl/vm.txt | 6 +++---
linux.git-davehans/kernel/sysctl.c | 25 +++++++++++++++++++++----
linux.git-davehans/mm/page_alloc.c | 2 +-
3 files changed, 25 insertions(+), 8 deletions(-)
diff -puN Documentation/sysctl/vm.txt~make-percpu_pagelist_fraction-sysctl-undoable Documentation/sysctl/vm.txt
--- linux.git/Documentation/sysctl/vm.txt~make-percpu_pagelist_fraction-sysctl-undoable 2013-10-15 09:57:07.004662395 -0700
+++ linux.git-davehans/Documentation/sysctl/vm.txt 2013-10-15 09:57:07.011662705 -0700
@@ -653,6 +653,9 @@ why oom happens. You can get snapshot.
percpu_pagelist_fraction
+Set to 0 at boot. While it is 0, the kernel sizes each percpu pagelist to
+around 1/1000th of the size of the zone, limited to around 0.75MB.
+
This is the fraction of pages at most (high mark pcp->high) in each zone that
are allocated for each per cpu page list. The min value for this is 8. It
means that we don't allow more than 1/8th of pages in each zone to be
@@ -663,9 +666,6 @@ of hot per cpu pagelists. User can spec
The batch value of each per cpu pagelist is also updated as a result. It is
set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
-The initial value is zero. Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list.
-
==============================================================
stat_interval
diff -puN kernel/sysctl.c~make-percpu_pagelist_fraction-sysctl-undoable kernel/sysctl.c
--- linux.git/kernel/sysctl.c~make-percpu_pagelist_fraction-sysctl-undoable 2013-10-15 09:57:07.005662439 -0700
+++ linux.git-davehans/kernel/sysctl.c 2013-10-15 09:57:07.012662750 -0700
@@ -138,7 +138,6 @@ static unsigned long dirty_bytes_min = 2
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
static int minolduid;
-static int min_percpu_pagelist_fract = 8;
static int ngroups_max = NGROUPS_MAX;
static const int cap_last_cap = CAP_LAST_CAP;
@@ -1289,7 +1288,6 @@ static struct ctl_table vm_table[] = {
.maxlen = sizeof(percpu_pagelist_fraction),
.mode = 0644,
.proc_handler = percpu_pagelist_fraction_sysctl_handler,
- .extra1 = &min_percpu_pagelist_fract,
},
#ifdef CONFIG_MMU
{
@@ -1910,7 +1908,7 @@ static int do_proc_dointvec_conv(bool *n
static const char proc_wspace_sep[] = { ' ', '\t', '\n' };
-static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
+int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
int write, void __user *buffer,
size_t *lenp, loff_t *ppos,
int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
@@ -2466,7 +2464,26 @@ static int proc_do_cad_pid(struct ctl_ta
static int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
- int ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ int ret;
+ int tmp = percpu_pagelist_fraction;
+ int min_percpu_pagelist_fract = 8;
+
+ ret = __do_proc_dointvec(&tmp, table, write, buffer, length, ppos,
+ NULL, NULL);
+ /*
+ * We want values >= min_percpu_pagelist_fract, but we
+ * also accept 0 to mean "stop using the fractions and
+ * go back to the default behavior".
+ */
+ if (write) {
+ if (tmp < 0)
+ return -EINVAL;
+ if ((tmp < min_percpu_pagelist_fract) &&
+ (tmp != 0))
+ return -EINVAL;
+ percpu_pagelist_fraction = tmp;
+ }
+
if (!write || (ret < 0))
return ret;
diff -puN mm/page_alloc.c~make-percpu_pagelist_fraction-sysctl-undoable mm/page_alloc.c
--- linux.git/mm/page_alloc.c~make-percpu_pagelist_fraction-sysctl-undoable 2013-10-15 09:57:07.008662572 -0700
+++ linux.git-davehans/mm/page_alloc.c 2013-10-15 09:57:07.015662883 -0700
@@ -4059,7 +4059,7 @@ static void __meminit zone_init_free_lis
memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
#endif
-static int __meminit zone_batchsize(struct zone *zone)
+static int zone_batchsize(struct zone *zone)
{
#ifdef CONFIG_MMU
int batch;
_
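
The open-coded bounding in the sysctl handler hunk above boils down
to one small check. The following is only a minimal standalone
sketch of that check: pcp_fraction_validate() and
MIN_PERCPU_PAGELIST_FRACT are hypothetical names that do not appear
in the patch, and the real handler also parses the user buffer and
stores the accepted value.

```c
#include <errno.h>

/* Hypothetical name; mirrors min_percpu_pagelist_fract in the handler. */
#define MIN_PERCPU_PAGELIST_FRACT 8

/*
 * Return 0 if 'val' would be accepted by the handler, -EINVAL
 * otherwise.  0 means "go back to the default behavior"; any
 * other value must be at least the minimum fraction.
 */
int pcp_fraction_validate(int val)
{
	if (val == 0)
		return 0;
	if (val < MIN_PERCPU_PAGELIST_FRACT)
		return -EINVAL;   /* covers negatives and 1..7 */
	return 0;
}
```

With this check, writes of 0, 8 or 100 are accepted, while -1, 1 and
7 are rejected with -EINVAL, which matches the two tests on tmp in
the handler.
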