* [PATCH v6] KSM: numa awareness sysfs knob
@ 2012-12-24 3:22 Petr Holasek
2012-12-24 5:08 ` Greg KH
0 siblings, 1 reply; 13+ messages in thread
From: Petr Holasek @ 2012-12-24 3:22 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrea Arcangeli, Andrew Morton, Chris Wright, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov, Petr Holasek
Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).
A typical use-case is a NUMA machine running many KVM guests, where
CPUs on more distant nodes would see a significant increase in access
latency to the merged ksm page. A sysfs knob was chosen for flexibility,
since some users still prefer a higher amount of saved physical memory
regardless of access latency.
Every NUMA node has its own stable and unstable tree, for faster
searching and inserting. The merge_across_nodes value can be changed
only when there are no ksm shared pages in the system.
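The switching procedure this implies can be sketched as a small shell helper (a sketch assuming a CONFIG_NUMA kernel with this patch applied; the function wrapper and its path argument are only for illustration, so the steps can be exercised against a mock directory):

```shell
# Sketch: switch KSM to per-node merging. The sysfs root is passed in so
# the steps can be tried against a mock directory; on a real patched
# system call it with no argument (requires root).
ksm_set_per_node() {
    ksm=${1:-/sys/kernel/mm/ksm}
    echo 2 > "$ksm/run"                 # stop ksmd and unmerge all pages
    # writing merge_across_nodes returns -EBUSY while pages are shared,
    # so wait until the unmerge has completed
    while [ "$(cat "$ksm/pages_shared")" -ne 0 ]; do sleep 1; done
    echo 0 > "$ksm/merge_across_nodes"  # accepted once nothing is shared
    echo 1 > "$ksm/run"                 # restart ksmd with per-node trees
}
```

The unmerge step comes first precisely because the store function below rejects the new value with -EBUSY while ksm_pages_shared is non-zero.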
I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
measured the speed of memory access inside KVM guests with memory
pinned to one of the nodes, using this benchmark:
http://pholasek.fedorapeople.org/alloc_pg.c
Population standard deviations of access times, as a percentage of
the average, were as follows:
merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%
merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%
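For reference, the metric in these tables, population standard deviation as a percentage of the average, can be recomputed from raw per-access timings with a short awk helper (the one-sample-per-line input layout is an assumption here, not necessarily what the benchmark above emits):

```shell
# Sketch: population standard deviation as a percentage of the mean,
# the figure quoted in the tables above. Input: one numeric sample per line.
pop_sd_pct() {
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n
               sd = sqrt(ss / n - m * m)
               printf "%.2f%%\n", 100 * sd / m }' "$1"
}
```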
RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
Changelog:
v2: Andrew's objections were reflected:
- value of merge_nodes can't be changed while there are some ksm
pages in system
- merge_nodes sysfs entry appearance depends on CONFIG_NUMA
- more verbose documentation
- added some performance testing results
v3: - more verbose documentation
- fixed race in merge_nodes store function
- introduced share_all debugging knob proposed by Andrew
- minor cleanups
v4: - merge_nodes was renamed to merge_across_nodes
- share_all debug knob was dropped
- get_kpfn_nid helper
- fixed page migration behaviour
v5: - unstable node's nid presence depends on CONFIG_NUMA
- fixed oops appearing when stable nodes were removed from tree
- roots of stable trees are initialized properly
- fixed unstable page migration issue
v6: - fixed oops caused by stable_nodes appended to wrong tree
- KSM_RUN_MERGE test removed
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
Documentation/vm/ksm.txt | 7 +++
mm/ksm.c | 151 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 139 insertions(+), 19 deletions(-)
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index b392e49..25cc89b 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
+merge_across_nodes - specifies if pages from different NUMA nodes can be merged.
+ When set to 0, ksm merges only pages which physically
+ reside in the memory area of the same NUMA node. This
+ brings lower latency when accessing a shared page. The
+ value can be changed only when there are no ksm shared
+ pages in the system. Default: 1
+
run - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5157385..d1e1041 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -36,6 +36,7 @@
#include <linux/hash.h>
#include <linux/freezer.h>
#include <linux/oom.h>
+#include <linux/numa.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -139,6 +140,9 @@ struct rmap_item {
struct mm_struct *mm;
unsigned long address; /* + low bits used for flags below */
unsigned int oldchecksum; /* when unstable */
+#ifdef CONFIG_NUMA
+ unsigned int nid;
+#endif
union {
struct rb_node node; /* when node of unstable tree */
struct { /* when listed from stable tree */
@@ -153,8 +157,8 @@ struct rmap_item {
#define STABLE_FLAG 0x200 /* is listed from the stable tree */
/* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES];
+static struct rb_root root_stable_tree[MAX_NUMNODES];
#define MM_SLOTS_HASH_SHIFT 10
#define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
@@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
/* Milliseconds ksmd should sleep between batches */
static unsigned int ksm_thread_sleep_millisecs = 20;
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;
+
#define KSM_RUN_STOP 0
#define KSM_RUN_MERGE 1
#define KSM_RUN_UNMERGE 2
@@ -447,10 +454,25 @@ out: page = NULL;
return page;
}
+/*
+ * This helper returns the right index into the array of tree roots.
+ * When the merge_across_nodes knob is set to 1, there are only two rb-trees,
+ * for stable and unstable pages from all nodes, both rooted at index 0.
+ * Otherwise, every node has its own stable and unstable tree.
+ */
+static inline int get_kpfn_nid(unsigned long kpfn)
+{
+ if (ksm_merge_across_nodes)
+ return 0;
+ else
+ return pfn_to_nid(kpfn);
+}
+
static void remove_node_from_stable_tree(struct stable_node *stable_node)
{
struct rmap_item *rmap_item;
struct hlist_node *hlist;
+ int nid;
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
if (rmap_item->hlist.next)
@@ -462,7 +484,9 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
cond_resched();
}
- rb_erase(&stable_node->node, &root_stable_tree);
+ nid = get_kpfn_nid(stable_node->kpfn);
+
+ rb_erase(&stable_node->node, &root_stable_tree[nid]);
free_stable_node(stable_node);
}
@@ -560,7 +584,12 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
BUG_ON(age > 1);
if (!age)
- rb_erase(&rmap_item->node, &root_unstable_tree);
+#ifdef CONFIG_NUMA
+ rb_erase(&rmap_item->node,
+ &root_unstable_tree[rmap_item->nid]);
+#else
+ rb_erase(&rmap_item->node, &root_unstable_tree[0]);
+#endif
ksm_pages_unshared--;
rmap_item->address &= PAGE_MASK;
@@ -996,8 +1025,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
*/
static struct page *stable_tree_search(struct page *page)
{
- struct rb_node *node = root_stable_tree.rb_node;
+ struct rb_node *node;
struct stable_node *stable_node;
+ int nid;
stable_node = page_stable_node(page);
if (stable_node) { /* ksm page forked */
@@ -1005,6 +1035,9 @@ static struct page *stable_tree_search(struct page *page)
return page;
}
+ nid = get_kpfn_nid(page_to_pfn(page));
+ node = root_stable_tree[nid].rb_node;
+
while (node) {
struct page *tree_page;
int ret;
@@ -1039,10 +1072,16 @@ static struct page *stable_tree_search(struct page *page)
*/
static struct stable_node *stable_tree_insert(struct page *kpage)
{
- struct rb_node **new = &root_stable_tree.rb_node;
+ int nid;
+ unsigned long kpfn;
+ struct rb_node **new;
struct rb_node *parent = NULL;
struct stable_node *stable_node;
+ kpfn = page_to_pfn(kpage);
+ nid = get_kpfn_nid(kpfn);
+ new = &root_stable_tree[nid].rb_node;
+
while (*new) {
struct page *tree_page;
int ret;
@@ -1076,11 +1115,11 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
return NULL;
rb_link_node(&stable_node->node, parent, new);
- rb_insert_color(&stable_node->node, &root_stable_tree);
+ rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
INIT_HLIST_HEAD(&stable_node->hlist);
- stable_node->kpfn = page_to_pfn(kpage);
+ stable_node->kpfn = kpfn;
set_page_stable_node(kpage, stable_node);
return stable_node;
@@ -1104,10 +1143,15 @@ static
struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
struct page *page,
struct page **tree_pagep)
-
{
- struct rb_node **new = &root_unstable_tree.rb_node;
+ struct rb_node **new;
+ struct rb_root *root;
struct rb_node *parent = NULL;
+ int nid;
+
+ nid = get_kpfn_nid(page_to_pfn(page));
+ root = &root_unstable_tree[nid];
+ new = &root->rb_node;
while (*new) {
struct rmap_item *tree_rmap_item;
@@ -1128,6 +1172,18 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
return NULL;
}
+ /*
+ * If tree_page has been migrated to another NUMA node, it
+ * will be flushed out and put into the right unstable tree
+ * next time: only merge with it if merge_across_nodes.
+ * Note that we have no similar problem for PageKsm pages,
+ * because their migration is disabled now. (62b61f611e)
+ */
+ if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
+ put_page(tree_page);
+ return NULL;
+ }
+
ret = memcmp_pages(page, tree_page);
parent = *new;
@@ -1145,8 +1201,11 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
rmap_item->address |= UNSTABLE_FLAG;
rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
+#ifdef CONFIG_NUMA
+ rmap_item->nid = nid;
+#endif
rb_link_node(&rmap_item->node, parent, new);
- rb_insert_color(&rmap_item->node, &root_unstable_tree);
+ rb_insert_color(&rmap_item->node, root);
ksm_pages_unshared++;
return NULL;
@@ -1160,6 +1219,13 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
static void stable_tree_append(struct rmap_item *rmap_item,
struct stable_node *stable_node)
{
+#ifdef CONFIG_NUMA
+ /*
+ * Usually rmap_item->nid is already set correctly,
+ * but it may be wrong after switching merge_across_nodes.
+ */
+ rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
+#endif
rmap_item->head = stable_node;
rmap_item->address |= STABLE_FLAG;
hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
@@ -1289,6 +1355,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
struct mm_slot *slot;
struct vm_area_struct *vma;
struct rmap_item *rmap_item;
+ int nid;
if (list_empty(&ksm_mm_head.mm_list))
return NULL;
@@ -1307,7 +1374,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
*/
lru_add_drain_all();
- root_unstable_tree = RB_ROOT;
+ for (nid = 0; nid < nr_node_ids; nid++)
+ root_unstable_tree[nid] = RB_ROOT;
spin_lock(&ksm_mmlist_lock);
slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
@@ -1782,15 +1850,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
unsigned long end_pfn)
{
struct rb_node *node;
+ int nid;
- for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
- struct stable_node *stable_node;
+ for (nid = 0; nid < nr_node_ids; nid++)
+ for (node = rb_first(&root_stable_tree[nid]); node;
+ node = rb_next(node)) {
+ struct stable_node *stable_node;
+
+ stable_node = rb_entry(node, struct stable_node, node);
+ if (stable_node->kpfn >= start_pfn &&
+ stable_node->kpfn < end_pfn)
+ return stable_node;
+ }
- stable_node = rb_entry(node, struct stable_node, node);
- if (stable_node->kpfn >= start_pfn &&
- stable_node->kpfn < end_pfn)
- return stable_node;
- }
return NULL;
}
@@ -1937,6 +2009,40 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
}
KSM_ATTR(run);
+#ifdef CONFIG_NUMA
+static ssize_t merge_across_nodes_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", ksm_merge_across_nodes);
+}
+
+static ssize_t merge_across_nodes_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long knob;
+
+ err = kstrtoul(buf, 10, &knob);
+ if (err)
+ return err;
+ if (knob > 1)
+ return -EINVAL;
+
+ mutex_lock(&ksm_thread_mutex);
+ if (ksm_merge_across_nodes != knob) {
+ if (ksm_pages_shared)
+ err = -EBUSY;
+ else
+ ksm_merge_across_nodes = knob;
+ }
+ mutex_unlock(&ksm_thread_mutex);
+
+ return err ? err : count;
+}
+KSM_ATTR(merge_across_nodes);
+#endif
+
static ssize_t pages_shared_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -1991,6 +2097,9 @@ static struct attribute *ksm_attrs[] = {
&pages_unshared_attr.attr,
&pages_volatile_attr.attr,
&full_scans_attr.attr,
+#ifdef CONFIG_NUMA
+ &merge_across_nodes_attr.attr,
+#endif
NULL,
};
@@ -2004,11 +2113,15 @@ static int __init ksm_init(void)
{
struct task_struct *ksm_thread;
int err;
+ int nid;
err = ksm_slab_init();
if (err)
goto out;
+ for (nid = 0; nid < nr_node_ids; nid++)
+ root_stable_tree[nid] = RB_ROOT;
+
ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
if (IS_ERR(ksm_thread)) {
printk(KERN_ERR "ksm: creating kthread failed\n");
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
* Re: [PATCH v6] KSM: numa awareness sysfs knob
2012-12-24 3:22 [PATCH v6] KSM: numa awareness sysfs knob Petr Holasek
@ 2012-12-24 5:08 ` Greg KH
2012-12-28 1:32 ` [PATCH v7 1/2] " Petr Holasek
0 siblings, 1 reply; 13+ messages in thread
From: Greg KH @ 2012-12-24 5:08 UTC (permalink / raw)
To: Petr Holasek
Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Chris Wright,
Izik Eidus, Rik van Riel, David Rientjes, Sasha Levin,
linux-kernel, linux-mm, Anton Arapov
On Mon, Dec 24, 2012 at 04:22:54AM +0100, Petr Holasek wrote:
> Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
> which controls merging of pages across different NUMA nodes.
All sysfs files must be documented in Documentation/ABI, please update
the files there as well (subsystem documentation, like you did, is also
nice, but the ABI files are the required ones.)
thanks,
greg k-h
* [PATCH v7 1/2] KSM: numa awareness sysfs knob
2012-12-24 5:08 ` Greg KH
@ 2012-12-28 1:32 ` Petr Holasek
2012-12-28 1:32 ` [PATCH v7 2/2] Documentation: add sysfs ABI documentation for ksm Petr Holasek
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Petr Holasek @ 2012-12-28 1:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrea Arcangeli, Andrew Morton, Izik Eidus, Rik van Riel,
David Rientjes, Sasha Levin, linux-kernel, linux-mm, Anton Arapov,
Petr Holasek
Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).
A typical use-case is a NUMA machine running many KVM guests, where
CPUs on more distant nodes would see a significant increase in access
latency to the merged ksm page. A sysfs knob was chosen for flexibility,
since some users still prefer a higher amount of saved physical memory
regardless of access latency.
Every NUMA node has its own stable and unstable tree, for faster
searching and inserting. The merge_across_nodes value can be changed
only when there are no ksm shared pages in the system.
I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
measured the speed of memory access inside KVM guests with memory
pinned to one of the nodes, using this benchmark:
http://pholasek.fedorapeople.org/alloc_pg.c
Population standard deviations of access times, as a percentage of
the average, were as follows:
merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%
merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%
RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
Changelog:
v2: Andrew's objections were reflected:
- value of merge_nodes can't be changed while there are some ksm
pages in system
- merge_nodes sysfs entry appearance depends on CONFIG_NUMA
- more verbose documentation
- added some performance testing results
v3: - more verbose documentation
- fixed race in merge_nodes store function
- introduced share_all debugging knob proposed by Andrew
- minor cleanups
v4: - merge_nodes was renamed to merge_across_nodes
- share_all debug knob was dropped
- get_kpfn_nid helper
- fixed page migration behaviour
v5: - unstable node's nid presence depends on CONFIG_NUMA
- fixed oops appearing when stable nodes were removed from tree
- roots of stable trees are initialized properly
- fixed unstable page migration issue
v6: - fixed oops caused by stable_nodes appended to wrong tree
- KSM_RUN_MERGE test removed
v7: - added sysfs ABI documentation for KSM
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
Documentation/vm/ksm.txt | 7 +++
mm/ksm.c | 151 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 139 insertions(+), 19 deletions(-)
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index b392e49..25cc89b 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
+merge_across_nodes - specifies if pages from different NUMA nodes can be merged.
+ When set to 0, ksm merges only pages which physically
+ reside in the memory area of the same NUMA node. This
+ brings lower latency when accessing a shared page. The
+ value can be changed only when there are no ksm shared
+ pages in the system. Default: 1
+
run - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5157385..d1e1041 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -36,6 +36,7 @@
#include <linux/hash.h>
#include <linux/freezer.h>
#include <linux/oom.h>
+#include <linux/numa.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -139,6 +140,9 @@ struct rmap_item {
struct mm_struct *mm;
unsigned long address; /* + low bits used for flags below */
unsigned int oldchecksum; /* when unstable */
+#ifdef CONFIG_NUMA
+ unsigned int nid;
+#endif
union {
struct rb_node node; /* when node of unstable tree */
struct { /* when listed from stable tree */
@@ -153,8 +157,8 @@ struct rmap_item {
#define STABLE_FLAG 0x200 /* is listed from the stable tree */
/* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES];
+static struct rb_root root_stable_tree[MAX_NUMNODES];
#define MM_SLOTS_HASH_SHIFT 10
#define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
@@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
/* Milliseconds ksmd should sleep between batches */
static unsigned int ksm_thread_sleep_millisecs = 20;
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;
+
#define KSM_RUN_STOP 0
#define KSM_RUN_MERGE 1
#define KSM_RUN_UNMERGE 2
@@ -447,10 +454,25 @@ out: page = NULL;
return page;
}
+/*
+ * This helper returns the right index into the array of tree roots.
+ * When the merge_across_nodes knob is set to 1, there are only two rb-trees,
+ * for stable and unstable pages from all nodes, both rooted at index 0.
+ * Otherwise, every node has its own stable and unstable tree.
+ */
+static inline int get_kpfn_nid(unsigned long kpfn)
+{
+ if (ksm_merge_across_nodes)
+ return 0;
+ else
+ return pfn_to_nid(kpfn);
+}
+
static void remove_node_from_stable_tree(struct stable_node *stable_node)
{
struct rmap_item *rmap_item;
struct hlist_node *hlist;
+ int nid;
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
if (rmap_item->hlist.next)
@@ -462,7 +484,9 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
cond_resched();
}
- rb_erase(&stable_node->node, &root_stable_tree);
+ nid = get_kpfn_nid(stable_node->kpfn);
+
+ rb_erase(&stable_node->node, &root_stable_tree[nid]);
free_stable_node(stable_node);
}
@@ -560,7 +584,12 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
BUG_ON(age > 1);
if (!age)
- rb_erase(&rmap_item->node, &root_unstable_tree);
+#ifdef CONFIG_NUMA
+ rb_erase(&rmap_item->node,
+ &root_unstable_tree[rmap_item->nid]);
+#else
+ rb_erase(&rmap_item->node, &root_unstable_tree[0]);
+#endif
ksm_pages_unshared--;
rmap_item->address &= PAGE_MASK;
@@ -996,8 +1025,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
*/
static struct page *stable_tree_search(struct page *page)
{
- struct rb_node *node = root_stable_tree.rb_node;
+ struct rb_node *node;
struct stable_node *stable_node;
+ int nid;
stable_node = page_stable_node(page);
if (stable_node) { /* ksm page forked */
@@ -1005,6 +1035,9 @@ static struct page *stable_tree_search(struct page *page)
return page;
}
+ nid = get_kpfn_nid(page_to_pfn(page));
+ node = root_stable_tree[nid].rb_node;
+
while (node) {
struct page *tree_page;
int ret;
@@ -1039,10 +1072,16 @@ static struct page *stable_tree_search(struct page *page)
*/
static struct stable_node *stable_tree_insert(struct page *kpage)
{
- struct rb_node **new = &root_stable_tree.rb_node;
+ int nid;
+ unsigned long kpfn;
+ struct rb_node **new;
struct rb_node *parent = NULL;
struct stable_node *stable_node;
+ kpfn = page_to_pfn(kpage);
+ nid = get_kpfn_nid(kpfn);
+ new = &root_stable_tree[nid].rb_node;
+
while (*new) {
struct page *tree_page;
int ret;
@@ -1076,11 +1115,11 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
return NULL;
rb_link_node(&stable_node->node, parent, new);
- rb_insert_color(&stable_node->node, &root_stable_tree);
+ rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
INIT_HLIST_HEAD(&stable_node->hlist);
- stable_node->kpfn = page_to_pfn(kpage);
+ stable_node->kpfn = kpfn;
set_page_stable_node(kpage, stable_node);
return stable_node;
@@ -1104,10 +1143,15 @@ static
struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
struct page *page,
struct page **tree_pagep)
-
{
- struct rb_node **new = &root_unstable_tree.rb_node;
+ struct rb_node **new;
+ struct rb_root *root;
struct rb_node *parent = NULL;
+ int nid;
+
+ nid = get_kpfn_nid(page_to_pfn(page));
+ root = &root_unstable_tree[nid];
+ new = &root->rb_node;
while (*new) {
struct rmap_item *tree_rmap_item;
@@ -1128,6 +1172,18 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
return NULL;
}
+ /*
+ * If tree_page has been migrated to another NUMA node, it
+ * will be flushed out and put into the right unstable tree
+ * next time: only merge with it if merge_across_nodes.
+ * Note that we have no similar problem for PageKsm pages,
+ * because their migration is disabled now. (62b61f611e)
+ */
+ if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
+ put_page(tree_page);
+ return NULL;
+ }
+
ret = memcmp_pages(page, tree_page);
parent = *new;
@@ -1145,8 +1201,11 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
rmap_item->address |= UNSTABLE_FLAG;
rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
+#ifdef CONFIG_NUMA
+ rmap_item->nid = nid;
+#endif
rb_link_node(&rmap_item->node, parent, new);
- rb_insert_color(&rmap_item->node, &root_unstable_tree);
+ rb_insert_color(&rmap_item->node, root);
ksm_pages_unshared++;
return NULL;
@@ -1160,6 +1219,13 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
static void stable_tree_append(struct rmap_item *rmap_item,
struct stable_node *stable_node)
{
+#ifdef CONFIG_NUMA
+ /*
+ * Usually rmap_item->nid is already set correctly,
+ * but it may be wrong after switching merge_across_nodes.
+ */
+ rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
+#endif
rmap_item->head = stable_node;
rmap_item->address |= STABLE_FLAG;
hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
@@ -1289,6 +1355,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
struct mm_slot *slot;
struct vm_area_struct *vma;
struct rmap_item *rmap_item;
+ int nid;
if (list_empty(&ksm_mm_head.mm_list))
return NULL;
@@ -1307,7 +1374,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
*/
lru_add_drain_all();
- root_unstable_tree = RB_ROOT;
+ for (nid = 0; nid < nr_node_ids; nid++)
+ root_unstable_tree[nid] = RB_ROOT;
spin_lock(&ksm_mmlist_lock);
slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
@@ -1782,15 +1850,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
unsigned long end_pfn)
{
struct rb_node *node;
+ int nid;
- for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
- struct stable_node *stable_node;
+ for (nid = 0; nid < nr_node_ids; nid++)
+ for (node = rb_first(&root_stable_tree[nid]); node;
+ node = rb_next(node)) {
+ struct stable_node *stable_node;
+
+ stable_node = rb_entry(node, struct stable_node, node);
+ if (stable_node->kpfn >= start_pfn &&
+ stable_node->kpfn < end_pfn)
+ return stable_node;
+ }
- stable_node = rb_entry(node, struct stable_node, node);
- if (stable_node->kpfn >= start_pfn &&
- stable_node->kpfn < end_pfn)
- return stable_node;
- }
return NULL;
}
@@ -1937,6 +2009,40 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
}
KSM_ATTR(run);
+#ifdef CONFIG_NUMA
+static ssize_t merge_across_nodes_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", ksm_merge_across_nodes);
+}
+
+static ssize_t merge_across_nodes_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long knob;
+
+ err = kstrtoul(buf, 10, &knob);
+ if (err)
+ return err;
+ if (knob > 1)
+ return -EINVAL;
+
+ mutex_lock(&ksm_thread_mutex);
+ if (ksm_merge_across_nodes != knob) {
+ if (ksm_pages_shared)
+ err = -EBUSY;
+ else
+ ksm_merge_across_nodes = knob;
+ }
+ mutex_unlock(&ksm_thread_mutex);
+
+ return err ? err : count;
+}
+KSM_ATTR(merge_across_nodes);
+#endif
+
static ssize_t pages_shared_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -1991,6 +2097,9 @@ static struct attribute *ksm_attrs[] = {
&pages_unshared_attr.attr,
&pages_volatile_attr.attr,
&full_scans_attr.attr,
+#ifdef CONFIG_NUMA
+ &merge_across_nodes_attr.attr,
+#endif
NULL,
};
@@ -2004,11 +2113,15 @@ static int __init ksm_init(void)
{
struct task_struct *ksm_thread;
int err;
+ int nid;
err = ksm_slab_init();
if (err)
goto out;
+ for (nid = 0; nid < nr_node_ids; nid++)
+ root_stable_tree[nid] = RB_ROOT;
+
ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
if (IS_ERR(ksm_thread)) {
printk(KERN_ERR "ksm: creating kthread failed\n");
--
1.7.11.7
* [PATCH v7 2/2] Documentation: add sysfs ABI documentation for ksm
2012-12-28 1:32 ` [PATCH v7 1/2] " Petr Holasek
@ 2012-12-28 1:32 ` Petr Holasek
2013-01-01 4:41 ` [PATCH v7 1/2] KSM: numa awareness sysfs knob Simon Jeons
2013-01-01 8:46 ` Simon Jeons
2 siblings, 0 replies; 13+ messages in thread
From: Petr Holasek @ 2012-12-28 1:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrea Arcangeli, Andrew Morton, Izik Eidus, Rik van Riel,
David Rientjes, Sasha Levin, linux-kernel, linux-mm, Anton Arapov,
Petr Holasek
This patch adds sysfs documentation for Kernel Samepage Merging (KSM)
including new merge_across_nodes knob.
Signed-off-by: Petr Holasek <pholasek@redhat.com>
---
Documentation/ABI/testing/sysfs-kernel-mm-ksm | 51 +++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-ksm
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm
new file mode 100644
index 0000000..44384ae
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm
@@ -0,0 +1,51 @@
+What: /sys/kernel/mm/ksm
+Date: September 2009
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: /sys/kernel/mm/ksm contains the interface of Kernel
+ Samepage Merging (KSM)
+
+What: /sys/kernel/mm/ksm/full_scans
+What: /sys/kernel/mm/ksm/pages_shared
+What: /sys/kernel/mm/ksm/pages_sharing
+What: /sys/kernel/mm/ksm/pages_to_scan
+What: /sys/kernel/mm/ksm/pages_unshared
+What: /sys/kernel/mm/ksm/pages_volatile
+What: /sys/kernel/mm/ksm/run
+What: /sys/kernel/mm/ksm/sleep_millisecs
+Date: September 2009
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Kernel Samepage Merging daemon sysfs interface
+
+ full_scans: how many times all mergeable areas have been
+ scanned.
+
+ pages_shared: how many shared pages are being used.
+
+ pages_sharing: how many more sites are sharing them i.e. how
+ much saved.
+
+ pages_to_scan: how many present pages to scan before ksmd goes
+ to sleep.
+
+ pages_unshared: how many pages unique but repeatedly checked
+ for merging.
+
+ pages_volatile: how many pages changing too fast to be placed
+ in a tree.
+
+ run: write 0 to disable ksm, read 0 while ksm is disabled.
+ write 1 to run ksm, read 1 while ksm is running.
+ write 2 to disable ksm and unmerge all its pages.
+
+ sleep_millisecs: how many milliseconds ksm should sleep between
+ scans.
+
+ See Documentation/vm/ksm.txt for more information.
+
+What: /sys/kernel/mm/ksm/merge_across_nodes
+Date: December 2012
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Control merging pages across different NUMA nodes.
+
+ When it is set to 0, only pages from the same node are merged;
+ otherwise pages from all nodes can be merged together (default).
--
1.7.11.7
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2012-12-28 1:32 ` [PATCH v7 1/2] " Petr Holasek
2012-12-28 1:32 ` [PATCH v7 2/2] Documentation: add sysfs ABI documentation for ksm Petr Holasek
@ 2013-01-01 4:41 ` Simon Jeons
2013-01-03 12:24 ` Petr Holasek
2013-01-01 8:46 ` Simon Jeons
2 siblings, 1 reply; 13+ messages in thread
From: Simon Jeons @ 2013-01-01 4:41 UTC (permalink / raw)
To: Petr Holasek
Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Fri, 2012-12-28 at 02:32 +0100, Petr Holasek wrote:
> Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
>
> A typical use-case is a NUMA machine running many KVM guests, where
> CPUs on more distant nodes would see a significant increase in access
> latency to the merged ksm page. A sysfs knob was chosen for flexibility,
> since some users still prefer a higher amount of saved physical memory
> regardless of access latency.
>
> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing of merge_across_nodes value is possible
> only when there are not any ksm shared pages in system.
>
> I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
> measured the speed of memory access inside KVM guests with memory pinned
> to one of the nodes, using this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
>
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes 1.7%
>
> merge_across_nodes=0
> 2 nodes 1%
> 4 nodes 0.32%
> 8 nodes 0.018%
>
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
>
> Changelog:
>
> v2: Andrew's objections were reflected:
> - value of merge_nodes can't be changed while there are some ksm
> pages in system
> - merge_nodes sysfs entry appearance depends on CONFIG_NUMA
> - more verbose documentation
> - added some performance testing results
>
> v3: - more verbose documentation
> - fixed race in merge_nodes store function
> - introduced share_all debugging knob proposed by Andrew
> - minor cleanups
>
> v4: - merge_nodes was renamed to merge_across_nodes
> - share_all debug knob was dropped
> - get_kpfn_nid helper
> - fixed page migration behaviour
>
> v5: - unstable node's nid presence depends on CONFIG_NUMA
> - fixed oops appearing when stable nodes were removed from tree
> - roots of stable trees are initialized properly
> - fixed unstable page migration issue
>
> v6: - fixed oops caused by stable_nodes appended to wrong tree
> - KSM_RUN_MERGE test removed
>
> v7: - added sysfs ABI documentation for KSM
Hi Petr,
How do you handle the "memory corruption because the ksm page still points
to the stable_node that has been freed" that Andrea mentioned, this time?
>
> Signed-off-by: Petr Holasek <pholasek@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> Documentation/vm/ksm.txt | 7 +++
> mm/ksm.c | 151 +++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 139 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index b392e49..25cc89b 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
> e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> Default: 20 (chosen for demonstration purposes)
>
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> + When set to 0, ksm merges only pages which physically
> + reside in the memory area of same NUMA node. It brings
> + lower latency to access to shared page. Value can be
> + changed only when there is no ksm shared pages in system.
> + Default: 1
> +
> run - set 0 to stop ksmd from running but keep merged pages,
> set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
> set 2 to stop ksmd and unmerge all pages currently merged,
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 5157385..d1e1041 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -36,6 +36,7 @@
> #include <linux/hash.h>
> #include <linux/freezer.h>
> #include <linux/oom.h>
> +#include <linux/numa.h>
>
> #include <asm/tlbflush.h>
> #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
> struct mm_struct *mm;
> unsigned long address; /* + low bits used for flags below */
> unsigned int oldchecksum; /* when unstable */
> +#ifdef CONFIG_NUMA
> + unsigned int nid;
> +#endif
> union {
> struct rb_node node; /* when node of unstable tree */
> struct { /* when listed from stable tree */
> @@ -153,8 +157,8 @@ struct rmap_item {
> #define STABLE_FLAG 0x200 /* is listed from the stable tree */
>
> /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];
>
> #define MM_SLOTS_HASH_SHIFT 10
> #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
> @@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
> /* Milliseconds ksmd should sleep between batches */
> static unsigned int ksm_thread_sleep_millisecs = 20;
>
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
> #define KSM_RUN_STOP 0
> #define KSM_RUN_MERGE 1
> #define KSM_RUN_UNMERGE 2
> @@ -447,10 +454,25 @@ out: page = NULL;
> return page;
> }
>
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> + if (ksm_merge_across_nodes)
> + return 0;
> + else
> + return pfn_to_nid(kpfn);
> +}
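As an aside, the index-selection logic of the helper above can be modeled in userspace like this (the names and the faked pfn_to_nid() are mine, not kernel code):

```c
#include <assert.h>

/* Userspace model of get_kpfn_nid(): with merge_across_nodes == 1 every
 * page maps to tree index 0 (one global stable/unstable tree pair);
 * with 0, each NUMA node gets its own pair. pfn_to_nid() is faked here
 * as a fixed-size range split; the real one depends on the memory model. */
enum { FAKE_PFNS_PER_NODE = 1024 };

static int fake_pfn_to_nid(unsigned long kpfn)
{
	return (int)(kpfn / FAKE_PFNS_PER_NODE);
}

static int get_kpfn_nid_model(unsigned long kpfn, int merge_across_nodes)
{
	return merge_across_nodes ? 0 : fake_pfn_to_nid(kpfn);
}
```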
> +
> static void remove_node_from_stable_tree(struct stable_node *stable_node)
> {
> struct rmap_item *rmap_item;
> struct hlist_node *hlist;
> + int nid;
>
> hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
> if (rmap_item->hlist.next)
> @@ -462,7 +484,9 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
> cond_resched();
> }
>
> - rb_erase(&stable_node->node, &root_stable_tree);
> + nid = get_kpfn_nid(stable_node->kpfn);
> +
> + rb_erase(&stable_node->node, &root_stable_tree[nid]);
> free_stable_node(stable_node);
> }
>
> @@ -560,7 +584,12 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
> age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> BUG_ON(age > 1);
> if (!age)
> - rb_erase(&rmap_item->node, &root_unstable_tree);
> +#ifdef CONFIG_NUMA
> + rb_erase(&rmap_item->node,
> + &root_unstable_tree[rmap_item->nid]);
> +#else
> + rb_erase(&rmap_item->node, &root_unstable_tree[0]);
> +#endif
>
> ksm_pages_unshared--;
> rmap_item->address &= PAGE_MASK;
> @@ -996,8 +1025,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
> */
> static struct page *stable_tree_search(struct page *page)
> {
> - struct rb_node *node = root_stable_tree.rb_node;
> + struct rb_node *node;
> struct stable_node *stable_node;
> + int nid;
>
> stable_node = page_stable_node(page);
> if (stable_node) { /* ksm page forked */
> @@ -1005,6 +1035,9 @@ static struct page *stable_tree_search(struct page *page)
> return page;
> }
>
> + nid = get_kpfn_nid(page_to_pfn(page));
> + node = root_stable_tree[nid].rb_node;
> +
> while (node) {
> struct page *tree_page;
> int ret;
> @@ -1039,10 +1072,16 @@ static struct page *stable_tree_search(struct page *page)
> */
> static struct stable_node *stable_tree_insert(struct page *kpage)
> {
> - struct rb_node **new = &root_stable_tree.rb_node;
> + int nid;
> + unsigned long kpfn;
> + struct rb_node **new;
> struct rb_node *parent = NULL;
> struct stable_node *stable_node;
>
> + kpfn = page_to_pfn(kpage);
> + nid = get_kpfn_nid(kpfn);
> + new = &root_stable_tree[nid].rb_node;
> +
> while (*new) {
> struct page *tree_page;
> int ret;
> @@ -1076,11 +1115,11 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
> return NULL;
>
> rb_link_node(&stable_node->node, parent, new);
> - rb_insert_color(&stable_node->node, &root_stable_tree);
> + rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>
> INIT_HLIST_HEAD(&stable_node->hlist);
>
> - stable_node->kpfn = page_to_pfn(kpage);
> + stable_node->kpfn = kpfn;
> set_page_stable_node(kpage, stable_node);
>
> return stable_node;
> @@ -1104,10 +1143,15 @@ static
> struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> struct page *page,
> struct page **tree_pagep)
> -
> {
> - struct rb_node **new = &root_unstable_tree.rb_node;
> + struct rb_node **new;
> + struct rb_root *root;
> struct rb_node *parent = NULL;
> + int nid;
> +
> + nid = get_kpfn_nid(page_to_pfn(page));
> + root = &root_unstable_tree[nid];
> + new = &root->rb_node;
>
> while (*new) {
> struct rmap_item *tree_rmap_item;
> @@ -1128,6 +1172,18 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> return NULL;
> }
>
> + /*
> + * If tree_page has been migrated to another NUMA node, it
> + * will be flushed out and put into the right unstable tree
> + * next time: only merge with it if merge_across_nodes.
Why? Do you mean swap-based migration? Or what am I missing?
> + * Just notice, we don't have similar problem for PageKsm
> + * because their migration is disabled now. (62b61f611e)
> + */
> + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> + put_page(tree_page);
> + return NULL;
> + }
> +
> ret = memcmp_pages(page, tree_page);
>
> parent = *new;
> @@ -1145,8 +1201,11 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>
> rmap_item->address |= UNSTABLE_FLAG;
> rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> +#ifdef CONFIG_NUMA
> + rmap_item->nid = nid;
> +#endif
> rb_link_node(&rmap_item->node, parent, new);
> - rb_insert_color(&rmap_item->node, &root_unstable_tree);
> + rb_insert_color(&rmap_item->node, root);
>
> ksm_pages_unshared++;
> return NULL;
> @@ -1160,6 +1219,13 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> static void stable_tree_append(struct rmap_item *rmap_item,
> struct stable_node *stable_node)
> {
> +#ifdef CONFIG_NUMA
> + /*
> + * Usually rmap_item->nid is already set correctly,
> + * but it may be wrong after switching merge_across_nodes.
> + */
> + rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
> +#endif
> rmap_item->head = stable_node;
> rmap_item->address |= STABLE_FLAG;
> hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1289,6 +1355,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> struct mm_slot *slot;
> struct vm_area_struct *vma;
> struct rmap_item *rmap_item;
> + int nid;
>
> if (list_empty(&ksm_mm_head.mm_list))
> return NULL;
> @@ -1307,7 +1374,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> */
> lru_add_drain_all();
>
> - root_unstable_tree = RB_ROOT;
> + for (nid = 0; nid < nr_node_ids; nid++)
> + root_unstable_tree[nid] = RB_ROOT;
>
> spin_lock(&ksm_mmlist_lock);
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1782,15 +1850,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> unsigned long end_pfn)
> {
> struct rb_node *node;
> + int nid;
>
> - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> - struct stable_node *stable_node;
> + for (nid = 0; nid < nr_node_ids; nid++)
> + for (node = rb_first(&root_stable_tree[nid]); node;
> + node = rb_next(node)) {
> + struct stable_node *stable_node;
> +
> + stable_node = rb_entry(node, struct stable_node, node);
> + if (stable_node->kpfn >= start_pfn &&
> + stable_node->kpfn < end_pfn)
> + return stable_node;
> + }
>
> - stable_node = rb_entry(node, struct stable_node, node);
> - if (stable_node->kpfn >= start_pfn &&
> - stable_node->kpfn < end_pfn)
> - return stable_node;
> - }
> return NULL;
> }
>
> @@ -1937,6 +2009,40 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
> }
> KSM_ATTR(run);
>
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err;
> + unsigned long knob;
> +
> + err = kstrtoul(buf, 10, &knob);
> + if (err)
> + return err;
> + if (knob > 1)
> + return -EINVAL;
> +
> + mutex_lock(&ksm_thread_mutex);
> + if (ksm_merge_across_nodes != knob) {
> + if (ksm_pages_shared)
> + err = -EBUSY;
> + else
> + ksm_merge_across_nodes = knob;
> + }
> + mutex_unlock(&ksm_thread_mutex);
> +
> + return err ? err : count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
> static ssize_t pages_shared_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> @@ -1991,6 +2097,9 @@ static struct attribute *ksm_attrs[] = {
> &pages_unshared_attr.attr,
> &pages_volatile_attr.attr,
> &full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> + &merge_across_nodes_attr.attr,
> +#endif
> NULL,
> };
>
> @@ -2004,11 +2113,15 @@ static int __init ksm_init(void)
> {
> struct task_struct *ksm_thread;
> int err;
> + int nid;
>
> err = ksm_slab_init();
> if (err)
> goto out;
>
> + for (nid = 0; nid < nr_node_ids; nid++)
> + root_stable_tree[nid] = RB_ROOT;
> +
> ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
> if (IS_ERR(ksm_thread)) {
> printk(KERN_ERR "ksm: creating kthread failed\n");
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-01 4:41 ` [PATCH v7 1/2] KSM: numa awareness sysfs knob Simon Jeons
@ 2013-01-03 12:24 ` Petr Holasek
2013-01-08 1:40 ` Simon Jeons
0 siblings, 1 reply; 13+ messages in thread
From: Petr Holasek @ 2013-01-03 12:24 UTC (permalink / raw)
To: Simon Jeons
Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
Hi Simon,
On Mon, 31 Dec 2012, Simon Jeons wrote:
> On Fri, 2012-12-28 at 02:32 +0100, Petr Holasek wrote:
> >
> > v7: - added sysfs ABI documentation for KSM
>
> Hi Petr,
>
> How do you handle the "memory corruption because the ksm page still points
> to the stable_node that has been freed" that Andrea mentioned, this time?
>
<snip>
> >
> > + /*
> > + * If tree_page has been migrated to another NUMA node, it
> > + * will be flushed out and put into the right unstable tree
> > + * next time: only merge with it if merge_across_nodes.
>
> Why? Do you mean swap-based migration? Or what am I missing?
>
It can be physical page migration triggered by page compaction, memory hotplug
or some NUMA sched/memory balancing algorithm developed recently.
> > + * Just notice, we don't have similar problem for PageKsm
> > + * because their migration is disabled now. (62b61f611e)
> > + */
Migration of KSM pages is disabled now; you can look into the commit
referenced above (62b61f611e) and the changes introduced to migrate.c.
> > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> > + put_page(tree_page);
> > + return NULL;
> > + }
> > +
> > ret = memcmp_pages(page, tree_page);
</snip>
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-03 12:24 ` Petr Holasek
@ 2013-01-08 1:40 ` Simon Jeons
2013-01-08 2:46 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: Simon Jeons @ 2013-01-08 1:40 UTC (permalink / raw)
To: Petr Holasek
Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Thu, 2013-01-03 at 13:24 +0100, Petr Holasek wrote:
> Hi Simon,
>
> On Mon, 31 Dec 2012, Simon Jeons wrote:
> > On Fri, 2012-12-28 at 02:32 +0100, Petr Holasek wrote:
> > >
> > > v7: - added sysfs ABI documentation for KSM
> >
> > Hi Petr,
> >
> > How do you handle the "memory corruption because the ksm page still points
> > to the stable_node that has been freed" that Andrea mentioned, this time?
> >
>
Hi Petr,
You still haven't answered my question above. :)
> <snip>
>
> > >
> > > + /*
> > > + * If tree_page has been migrated to another NUMA node, it
> > > + * will be flushed out and put into the right unstable tree
> > > + * next time: only merge with it if merge_across_nodes.
> >
> > Why? Do you mean swap-based migration? Or what am I missing?
> >
>
> It can be physical page migration triggered by page compaction, memory hotplug
> or some NUMA sched/memory balancing algorithm developed recently.
>
> > > + * Just notice, we don't have similar problem for PageKsm
> > > + * because their migration is disabled now. (62b61f611e)
> > > + */
>
> Migration of KSM pages is disabled now; you can look into the commit
> referenced above (62b61f611e) and the changes introduced to migrate.c.
>
> > > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> > > + put_page(tree_page);
> > > + return NULL;
> > > + }
> > > +
> > > ret = memcmp_pages(page, tree_page);
>
> </snip>
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-08 1:40 ` Simon Jeons
@ 2013-01-08 2:46 ` Hugh Dickins
0 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2013-01-08 2:46 UTC (permalink / raw)
To: Simon Jeons
Cc: Petr Holasek, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Mon, 7 Jan 2013, Simon Jeons wrote:
> On Thu, 2013-01-03 at 13:24 +0100, Petr Holasek wrote:
> > Hi Simon,
> >
> > On Mon, 31 Dec 2012, Simon Jeons wrote:
> > > On Fri, 2012-12-28 at 02:32 +0100, Petr Holasek wrote:
> > > >
> > > > v7: - added sysfs ABI documentation for KSM
> > >
> > > Hi Petr,
> > >
> > > How do you handle the "memory corruption because the ksm page still points
> > > to the stable_node that has been freed" that Andrea mentioned, this time?
> > >
> >
>
> Hi Petr,
>
> You still didn't answer my question mentioned above. :)
Yes, I noticed that too :) I think Petr probably hopes that I'll
answer; and yes, I do hold myself responsible for solving this.
The honest answer is that I forgot all about it for a while. I
had to go back to read the various threads to remind myself of what
Andrea said back then, and the ideas I had in replying. Thank you
for reminding us.
I do intend to fix it along the lines I suggested then, if that works
out; but that is a danger in memory hotremove only, so at present I'm
still wrestling with the more immediate problem of stale stable_nodes
when switching merge_across_nodes between 1 and 0 and 1.
Many of the problems there come from reclaim under memory pressure:
stable pages being written out to swap, and faulted back in at "the
wrong time". Essentially, existing bugs in KSM_RUN_UNMERGE, that
were not visible until merge_across_nodes brought us to rely upon it.
I have "advanced" from kernel oopses to userspace corruption: that's
no advance at all, no doubt I'm doing something stupid, but I haven't
spotted it yet; and once I've fixed that up, shall probably want to
look back at the little heap of fixups (a remove_all_stable_nodes()
function) and go about it quite differently - but for now I'm still
learning from the bugs I give myself.
>
> > <snip>
> >
> > > >
> > > > + /*
> > > > + * If tree_page has been migrated to another NUMA node, it
> > > > + * will be flushed out and put into the right unstable tree
> > > > + * next time: only merge with it if merge_across_nodes.
> > >
> > > Why? Do you mean swap-based migration? Or what am I missing?
> > >
> >
> > It can be physical page migration triggered by page compaction, memory hotplug
> > or some NUMA sched/memory balancing algorithm developed recently.
> >
> > > > + * Just notice, we don't have similar problem for PageKsm
> > > > + * because their migration is disabled now. (62b61f611e)
> > > > + */
> >
> > Migration of KSM pages is disabled now; you can look into the commit
> > referenced above (62b61f611e) and the changes introduced to migrate.c.
Migration of KSM pages is still enabled in the memory hotremove case.
I don't remember how I tested that back then, so I want to enable KSM
page migration generally, just to be able to test it more thoroughly.
That would then benefit compaction, no longer frustrated by a KSM
page in the way.
Hugh
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2012-12-28 1:32 ` [PATCH v7 1/2] " Petr Holasek
2012-12-28 1:32 ` [PATCH v7 2/2] Documentation: add sysfs ABI documentation for ksm Petr Holasek
2013-01-01 4:41 ` [PATCH v7 1/2] KSM: numa awareness sysfs knob Simon Jeons
@ 2013-01-01 8:46 ` Simon Jeons
2013-01-03 5:10 ` Hugh Dickins
2 siblings, 1 reply; 13+ messages in thread
From: Simon Jeons @ 2013-01-01 8:46 UTC (permalink / raw)
To: Petr Holasek
Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Fri, 2012-12-28 at 02:32 +0100, Petr Holasek wrote:
> Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
>
> A typical use case is many KVM guests on a NUMA machine, where
> CPUs on more distant nodes would see a significant increase
> in access latency to the merged KSM page. A sysfs knob was chosen
> for flexibility, since some users may still prefer a higher amount
> of saved physical memory regardless of access latency.
>
> Every NUMA node has its own stable and unstable tree, for faster
> searching and inserting. The merge_across_nodes value can only be
> changed when there are no KSM shared pages in the system.
>
> I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
> measured the speed of memory access inside KVM guests with memory pinned
> to one of the nodes, using this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
>
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes 1.7%
>
> merge_across_nodes=0
> 2 nodes 1%
> 4 nodes 0.32%
> 8 nodes 0.018%
>
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
>
> Changelog:
>
> v2: Andrew's objections were reflected:
> - value of merge_nodes can't be changed while there are some ksm
> pages in system
> - merge_nodes sysfs entry appearance depends on CONFIG_NUMA
> - more verbose documentation
> - added some performance testing results
>
> v3: - more verbose documentation
> - fixed race in merge_nodes store function
> - introduced share_all debugging knob proposed by Andrew
> - minor cleanups
>
> v4: - merge_nodes was renamed to merge_across_nodes
> - share_all debug knob was dropped
> - get_kpfn_nid helper
> - fixed page migration behaviour
>
> v5: - unstable node's nid presence depends on CONFIG_NUMA
> - fixed oops appearing when stable nodes were removed from tree
> - roots of stable trees are initialized properly
> - fixed unstable page migration issue
>
> v6: - fixed oops caused by stable_nodes appended to wrong tree
> - KSM_RUN_MERGE test removed
>
> v7: - added sysfs ABI documentation for KSM
>
> Signed-off-by: Petr Holasek <pholasek@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> Documentation/vm/ksm.txt | 7 +++
> mm/ksm.c | 151 +++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 139 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index b392e49..25cc89b 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
> e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> Default: 20 (chosen for demonstration purposes)
>
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> + When set to 0, ksm merges only pages which physically
> + reside in the memory area of same NUMA node. It brings
> + lower latency to access to shared page. Value can be
> + changed only when there is no ksm shared pages in system.
> + Default: 1
> +
> run - set 0 to stop ksmd from running but keep merged pages,
> set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
> set 2 to stop ksmd and unmerge all pages currently merged,
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 5157385..d1e1041 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -36,6 +36,7 @@
> #include <linux/hash.h>
> #include <linux/freezer.h>
> #include <linux/oom.h>
> +#include <linux/numa.h>
>
> #include <asm/tlbflush.h>
> #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
> struct mm_struct *mm;
> unsigned long address; /* + low bits used for flags below */
> unsigned int oldchecksum; /* when unstable */
> +#ifdef CONFIG_NUMA
> + unsigned int nid;
> +#endif
> union {
> struct rb_node node; /* when node of unstable tree */
> struct { /* when listed from stable tree */
> @@ -153,8 +157,8 @@ struct rmap_item {
> #define STABLE_FLAG 0x200 /* is listed from the stable tree */
>
> /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];
>
> #define MM_SLOTS_HASH_SHIFT 10
> #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
> @@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
> /* Milliseconds ksmd should sleep between batches */
> static unsigned int ksm_thread_sleep_millisecs = 20;
>
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
> #define KSM_RUN_STOP 0
> #define KSM_RUN_MERGE 1
> #define KSM_RUN_UNMERGE 2
> @@ -447,10 +454,25 @@ out: page = NULL;
> return page;
> }
>
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> + if (ksm_merge_across_nodes)
> + return 0;
> + else
> + return pfn_to_nid(kpfn);
> +}
> +
> static void remove_node_from_stable_tree(struct stable_node *stable_node)
> {
> struct rmap_item *rmap_item;
> struct hlist_node *hlist;
> + int nid;
>
> hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
> if (rmap_item->hlist.next)
> @@ -462,7 +484,9 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
> cond_resched();
> }
>
> - rb_erase(&stable_node->node, &root_stable_tree);
> + nid = get_kpfn_nid(stable_node->kpfn);
> +
> + rb_erase(&stable_node->node, &root_stable_tree[nid]);
> free_stable_node(stable_node);
> }
>
> @@ -560,7 +584,12 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
> age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> BUG_ON(age > 1);
> if (!age)
> - rb_erase(&rmap_item->node, &root_unstable_tree);
> +#ifdef CONFIG_NUMA
> + rb_erase(&rmap_item->node,
> + &root_unstable_tree[rmap_item->nid]);
> +#else
> + rb_erase(&rmap_item->node, &root_unstable_tree[0]);
> +#endif
Hi Petr and Hugh,
One offline question, thanks for clarifying.
How should I understand age = (unsigned char)(ksm_scan.seqnr -
rmap_item->address);? What is it used for?
>
> ksm_pages_unshared--;
> rmap_item->address &= PAGE_MASK;
> @@ -996,8 +1025,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
> */
> static struct page *stable_tree_search(struct page *page)
> {
> - struct rb_node *node = root_stable_tree.rb_node;
> + struct rb_node *node;
> struct stable_node *stable_node;
> + int nid;
>
> stable_node = page_stable_node(page);
> if (stable_node) { /* ksm page forked */
> @@ -1005,6 +1035,9 @@ static struct page *stable_tree_search(struct page *page)
> return page;
> }
>
> + nid = get_kpfn_nid(page_to_pfn(page));
> + node = root_stable_tree[nid].rb_node;
> +
> while (node) {
> struct page *tree_page;
> int ret;
> @@ -1039,10 +1072,16 @@ static struct page *stable_tree_search(struct page *page)
> */
> static struct stable_node *stable_tree_insert(struct page *kpage)
> {
> - struct rb_node **new = &root_stable_tree.rb_node;
> + int nid;
> + unsigned long kpfn;
> + struct rb_node **new;
> struct rb_node *parent = NULL;
> struct stable_node *stable_node;
>
> + kpfn = page_to_pfn(kpage);
> + nid = get_kpfn_nid(kpfn);
> + new = &root_stable_tree[nid].rb_node;
> +
> while (*new) {
> struct page *tree_page;
> int ret;
> @@ -1076,11 +1115,11 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
> return NULL;
>
> rb_link_node(&stable_node->node, parent, new);
> - rb_insert_color(&stable_node->node, &root_stable_tree);
> + rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>
> INIT_HLIST_HEAD(&stable_node->hlist);
>
> - stable_node->kpfn = page_to_pfn(kpage);
> + stable_node->kpfn = kpfn;
> set_page_stable_node(kpage, stable_node);
>
> return stable_node;
> @@ -1104,10 +1143,15 @@ static
> struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> struct page *page,
> struct page **tree_pagep)
> -
> {
> - struct rb_node **new = &root_unstable_tree.rb_node;
> + struct rb_node **new;
> + struct rb_root *root;
> struct rb_node *parent = NULL;
> + int nid;
> +
> + nid = get_kpfn_nid(page_to_pfn(page));
> + root = &root_unstable_tree[nid];
> + new = &root->rb_node;
>
> while (*new) {
> struct rmap_item *tree_rmap_item;
> @@ -1128,6 +1172,18 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> return NULL;
> }
>
> + /*
> + * If tree_page has been migrated to another NUMA node, it
> + * will be flushed out and put into the right unstable tree
> + * next time: only merge with it if merge_across_nodes.
> + * Just notice, we don't have similar problem for PageKsm
> + * because their migration is disabled now. (62b61f611e)
> + */
> + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> + put_page(tree_page);
> + return NULL;
> + }
> +
> ret = memcmp_pages(page, tree_page);
>
> parent = *new;
> @@ -1145,8 +1201,11 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>
> rmap_item->address |= UNSTABLE_FLAG;
> rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> +#ifdef CONFIG_NUMA
> + rmap_item->nid = nid;
> +#endif
> rb_link_node(&rmap_item->node, parent, new);
> - rb_insert_color(&rmap_item->node, &root_unstable_tree);
> + rb_insert_color(&rmap_item->node, root);
>
> ksm_pages_unshared++;
> return NULL;
> @@ -1160,6 +1219,13 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> static void stable_tree_append(struct rmap_item *rmap_item,
> struct stable_node *stable_node)
> {
> +#ifdef CONFIG_NUMA
> + /*
> + * Usually rmap_item->nid is already set correctly,
> + * but it may be wrong after switching merge_across_nodes.
> + */
> + rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
> +#endif
> rmap_item->head = stable_node;
> rmap_item->address |= STABLE_FLAG;
> hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1289,6 +1355,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> struct mm_slot *slot;
> struct vm_area_struct *vma;
> struct rmap_item *rmap_item;
> + int nid;
>
> if (list_empty(&ksm_mm_head.mm_list))
> return NULL;
> @@ -1307,7 +1374,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> */
> lru_add_drain_all();
>
> - root_unstable_tree = RB_ROOT;
> + for (nid = 0; nid < nr_node_ids; nid++)
> + root_unstable_tree[nid] = RB_ROOT;
>
> spin_lock(&ksm_mmlist_lock);
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1782,15 +1850,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> unsigned long end_pfn)
> {
> struct rb_node *node;
> + int nid;
>
> - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> - struct stable_node *stable_node;
> + for (nid = 0; nid < nr_node_ids; nid++)
> + for (node = rb_first(&root_stable_tree[nid]); node;
> + node = rb_next(node)) {
> + struct stable_node *stable_node;
> +
> + stable_node = rb_entry(node, struct stable_node, node);
> + if (stable_node->kpfn >= start_pfn &&
> + stable_node->kpfn < end_pfn)
> + return stable_node;
> + }
>
> - stable_node = rb_entry(node, struct stable_node, node);
> - if (stable_node->kpfn >= start_pfn &&
> - stable_node->kpfn < end_pfn)
> - return stable_node;
> - }
> return NULL;
> }
>
> @@ -1937,6 +2009,40 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
> }
> KSM_ATTR(run);
>
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err;
> + unsigned long knob;
> +
> + err = kstrtoul(buf, 10, &knob);
> + if (err)
> + return err;
> + if (knob > 1)
> + return -EINVAL;
> +
> + mutex_lock(&ksm_thread_mutex);
> + if (ksm_merge_across_nodes != knob) {
> + if (ksm_pages_shared)
> + err = -EBUSY;
> + else
> + ksm_merge_across_nodes = knob;
> + }
> + mutex_unlock(&ksm_thread_mutex);
> +
> + return err ? err : count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
> static ssize_t pages_shared_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> @@ -1991,6 +2097,9 @@ static struct attribute *ksm_attrs[] = {
> &pages_unshared_attr.attr,
> &pages_volatile_attr.attr,
> &full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> + &merge_across_nodes_attr.attr,
> +#endif
> NULL,
> };
>
> @@ -2004,11 +2113,15 @@ static int __init ksm_init(void)
> {
> struct task_struct *ksm_thread;
> int err;
> + int nid;
>
> err = ksm_slab_init();
> if (err)
> goto out;
>
> + for (nid = 0; nid < nr_node_ids; nid++)
> + root_stable_tree[nid] = RB_ROOT;
> +
> ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
> if (IS_ERR(ksm_thread)) {
> printk(KERN_ERR "ksm: creating kthread failed\n");
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-01 8:46 ` Simon Jeons
@ 2013-01-03 5:10 ` Hugh Dickins
2013-01-04 0:24 ` Simon Jeons
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2013-01-03 5:10 UTC (permalink / raw)
To: Simon Jeons
Cc: Petr Holasek, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Tue, 1 Jan 2013, Simon Jeons wrote:
>
> Hi Petr and Hugh,
>
> One offline question, thanks for your clarification.
Perhaps not as offline as you intended :)
>
> How should I understand age = (unsigned char)(ksm_scan.seqnr -
> rmap_item->address);? What is it used for?
As you can see, remove_rmap_item_from_tree uses it to decide whether
or not it should rb_erase the rmap_item from the unstable_tree.
Every full scan of all the rmap_items, we increment ksm_scan.seqnr,
forget the old unstable_tree (it would just be a waste of processing
to remove every node one by one), and build up the unstable_tree afresh.
That works fine until we need to remove an rmap_item: then we have to be
very sure to remove it from the unstable_tree if it's already been linked
there during this scan, but ignore its rblinkage if that's just left over
from the previous scan.
A single bit would be enough to decide this; but we got it troublesomely
wrong in the early days of KSM (didn't always visit every rmap_item each
scan), so it's convenient to use 8 bits (the low unsigned char, stored
below the FLAGs and below the page-aligned address in the rmap_item -
there's lots of them, best keep them as small as we can) and do a
BUG_ON(age > 1) if we made a mistake.
We haven't hit that BUG_ON in over three years: if we need some more
bits for something, we can cut the age down to one or two bits.
Hugh
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-03 5:10 ` Hugh Dickins
@ 2013-01-04 0:24 ` Simon Jeons
2013-01-04 23:03 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: Simon Jeons @ 2013-01-04 0:24 UTC (permalink / raw)
To: Hugh Dickins
Cc: Petr Holasek, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Wed, 2013-01-02 at 21:10 -0800, Hugh Dickins wrote:
> On Tue, 1 Jan 2013, Simon Jeons wrote:
> >
> > Hi Petr and Hugh,
> >
> > One offline question, thanks for your clarification.
>
> Perhaps not as offline as you intended :)
Hi Hugh,
Thanks for your detailed explanation. :)
>
> >
> > How should I understand age = (unsigned char)(ksm_scan.seqnr -
> > rmap_item->address);? What is it used for?
>
> As you can see, remove_rmap_item_from_tree uses it to decide whether
> or not it should rb_erase the rmap_item from the unstable_tree.
>
> Every full scan of all the rmap_items, we increment ksm_scan.seqnr,
> forget the old unstable_tree (it would just be a waste of processing
> to remove every node one by one), and build up the unstable_tree afresh.
>
When will the rmap_items left over from the previous scan be removed?
> That works fine until we need to remove an rmap_item: then we have to be
> very sure to remove it from the unstable_tree if it's already been linked
> there during this scan, but ignore its rblinkage if that's just left over
> from the previous scan.
>
> A single bit would be enough to decide this; but we got it troublesomely
> wrong in the early days of KSM (didn't always visit every rmap_item each
> scan), so it's convenient to use 8 bits (the low unsigned char, stored
When can the scenario where not every rmap_item is visited each scan
occur?
> below the FLAGs and below the page-aligned address in the rmap_item -
> there's lots of them, best keep them as small as we can) and do a
> BUG_ON(age > 1) if we made a mistake.
>
> We haven't hit that BUG_ON in over three years: if we need some more
> bits for something, we can cut the age down to one or two bits.
>
> Hugh
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-04 0:24 ` Simon Jeons
@ 2013-01-04 23:03 ` Hugh Dickins
2013-01-05 0:30 ` Simon Jeons
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2013-01-04 23:03 UTC (permalink / raw)
To: Simon Jeons
Cc: Petr Holasek, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Thu, 3 Jan 2013, Simon Jeons wrote:
> On Wed, 2013-01-02 at 21:10 -0800, Hugh Dickins wrote:
> >
> > As you can see, remove_rmap_item_from_tree uses it to decide whether
> > or not it should rb_erase the rmap_item from the unstable_tree.
> >
> > Every full scan of all the rmap_items, we increment ksm_scan.seqnr,
> > forget the old unstable_tree (it would just be a waste of processing
> > to remove every node one by one), and build up the unstable_tree afresh.
> >
>
> When will the rmap_items left over from the previous scan be removed?
Removed from the unstable rbtree? Not at all, it's simply restarted
afresh, and the old rblinkages ignored. Freed back to slab? When the
scan passes that mm+address and realizes that rmap_item is not wanted
any more. (Or when ksm is shut down with KSM_RUN_UNMERGE.)
>
> > That works fine until we need to remove an rmap_item: then we have to be
> > very sure to remove it from the unstable_tree if it's already been linked
> > there during this scan, but ignore its rblinkage if that's just left over
> > from the previous scan.
> >
> > A single bit would be enough to decide this; but we got it troublesomely
> > wrong in the early days of KSM (didn't always visit every rmap_item each
> > scan), so it's convenient to use 8 bits (the low unsigned char, stored
>
> When can the scenario where not every rmap_item is visited each scan
> occur?
You're asking me about a stage of KSM development 3.5 years ago:
I don't remember the details.
>
> > below the FLAGs and below the page-aligned address in the rmap_item -
> > there's lots of them, best keep them as small as we can) and do a
> > BUG_ON(age > 1) if we made a mistake.
> >
> > We haven't hit that BUG_ON in over three years: if we need some more
> > bits for something, we can cut the age down to one or two bits.
> >
> > Hugh
* Re: [PATCH v7 1/2] KSM: numa awareness sysfs knob
2013-01-04 23:03 ` Hugh Dickins
@ 2013-01-05 0:30 ` Simon Jeons
0 siblings, 0 replies; 13+ messages in thread
From: Simon Jeons @ 2013-01-05 0:30 UTC (permalink / raw)
To: Hugh Dickins
Cc: Petr Holasek, Andrea Arcangeli, Andrew Morton, Izik Eidus,
Rik van Riel, David Rientjes, Sasha Levin, linux-kernel, linux-mm,
Anton Arapov
On Fri, 2013-01-04 at 15:03 -0800, Hugh Dickins wrote:
> On Thu, 3 Jan 2013, Simon Jeons wrote:
> > On Wed, 2013-01-02 at 21:10 -0800, Hugh Dickins wrote:
> > >
> > > As you can see, remove_rmap_item_from_tree uses it to decide whether
> > > or not it should rb_erase the rmap_item from the unstable_tree.
> > >
> > > Every full scan of all the rmap_items, we increment ksm_scan.seqnr,
> > > forget the old unstable_tree (it would just be a waste of processing
> > > to remove every node one by one), and build up the unstable_tree afresh.
> > >
> >
> > When will the rmap_items left over from the previous scan be removed?
>
> Removed from the unstable rbtree? Not at all, it's simply restarted
> afresh, and the old rblinkages ignored. Freed back to slab? When the
> scan passes that mm+address and realizes that rmap_item is not wanted
> any more. (Or when ksm is shut down with KSM_RUN_UNMERGE.)
>
Makes sense. Thanks Hugh. :)
> >
> > > That works fine until we need to remove an rmap_item: then we have to be
> > > very sure to remove it from the unstable_tree if it's already been linked
> > > there during this scan, but ignore its rblinkage if that's just left over
> > > from the previous scan.
> > >
> > > A single bit would be enough to decide this; but we got it troublesomely
> > > wrong in the early days of KSM (didn't always visit every rmap_item each
> > > scan), so it's convenient to use 8 bits (the low unsigned char, stored
> >
> > When the scenario didn't always visit every rmap_item each scan can
> > occur?
>
> You're asking me about a stage of KSM development 3.5 years ago:
> I don't remember the details.
>
> >
> > > below the FLAGs and below the page-aligned address in the rmap_item -
> > > there's lots of them, best keep them as small as we can) and do a
> > > BUG_ON(age > 1) if we made a mistake.
> > >
> > > We haven't hit that BUG_ON in over three years: if we need some more
> > > bits for something, we can cut the age down to one or two bits.
> > >
> > > Hugh
end of thread, other threads:[~2013-01-08 2:46 UTC | newest]
Thread overview: 13+ messages
2012-12-24 3:22 [PATCH v6] KSM: numa awareness sysfs knob Petr Holasek
2012-12-24 5:08 ` Greg KH
2012-12-28 1:32 ` [PATCH v7 1/2] " Petr Holasek
2012-12-28 1:32 ` [PATCH v7 2/2] Documentation: add sysfs ABI documentation for ksm Petr Holasek
2013-01-01 4:41 ` [PATCH v7 1/2] KSM: numa awareness sysfs knob Simon Jeons
2013-01-03 12:24 ` Petr Holasek
2013-01-08 1:40 ` Simon Jeons
2013-01-08 2:46 ` Hugh Dickins
2013-01-01 8:46 ` Simon Jeons
2013-01-03 5:10 ` Hugh Dickins
2013-01-04 0:24 ` Simon Jeons
2013-01-04 23:03 ` Hugh Dickins
2013-01-05 0:30 ` Simon Jeons