From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hugh Dickins
Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration
Date: Mon, 28 Jan 2013 17:07:15 -0800 (PST)
Message-ID:
References: <20130128155452.16882a6e.akpm@linux-foundation.org>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: Marcelo Tosatti, Gleb Natapov, Petr Holasek, Andrea Arcangeli,
    Izik Eidus, Rik van Riel, David Rientjes, Anton Arapov, Mel Gorman,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org
To: Andrew Morton
Return-path:
In-Reply-To: <20130128155452.16882a6e.akpm@linux-foundation.org>
Sender: owner-linux-mm@kvack.org
List-Id: kvm.vger.kernel.org

On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> Hugh Dickins wrote:
>
> > Here's a KSM series
>
> Sanity check: do you have a feeling for how useful KSM is?
> Performance/space improvements for typical (or atypical) workloads?
> Are people using it? Successfully?
>
> IOW, is it justifying itself?

I have no idea! To me it's simply a technical challenge - and I agree
with your implication that that's not a good enough justification.

I've added Marcelo and Gleb and the KVM list to the Cc:
my understanding is that it's the KVM guys who really appreciate KSM.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gleb Natapov
Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration
Date: Tue, 29 Jan 2013 12:45:14 +0200
Message-ID: <20130129104513.GA15004@redhat.com>
References: <20130128155452.16882a6e.akpm@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Andrew Morton, Marcelo Tosatti, Petr Holasek, Andrea Arcangeli,
    Izik Eidus, Rik van Riel, David Rientjes, Anton Arapov, Mel Gorman,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org
To: Hugh Dickins
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: kvm.vger.kernel.org

On Mon, Jan 28, 2013 at 05:07:15PM -0800, Hugh Dickins wrote:
> On Mon, 28 Jan 2013, Andrew Morton wrote:
> > On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> > Hugh Dickins wrote:
> >
> > > Here's a KSM series
> >
> > Sanity check: do you have a feeling for how useful KSM is?
> > Performance/space improvements for typical (or atypical) workloads?
> > Are people using it? Successfully?
> >
> > IOW, is it justifying itself?
>
> I have no idea! To me it's simply a technical challenge - and I agree
> with your implication that that's not a good enough justification.
>
> I've added Marcelo and Gleb and the KVM list to the Cc:
> my understanding is that it's the KVM guys who really appreciate KSM.
>
KSM is used on all RH kvm deployments for memory overcommit. I asked
around for numbers and got the answer that it allows squeezing anywhere
between 10% and 100% more VMs onto the same machine, depending on the
type of guest OS and how similar the workloads of the VMs are. And
management tries to keep VMs with similar OSes/workloads on the same
host to gain more from KSM.

--
			Gleb.
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149])
	by kanga.kvack.org (Postfix) with SMTP id CFDE96B0005
	for ; Fri, 25 Jan 2013 20:53:14 -0500 (EST)
Received: by mail-da0-f48.google.com with SMTP id k18so430809dae.21
	for ; Fri, 25 Jan 2013 17:53:14 -0800 (PST)
Date: Fri, 25 Jan 2013 17:53:10 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 0/11] ksm: NUMA trees and page migration
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
    David Rientjes, Anton Arapov, Mel Gorman,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org

Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.

(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense
in preferring to allocate the new pages local to the caller's node.)

Petr, I have intentionally changed the titles of yours: partly because
your "sysfs knob" understated it, but mainly because I think gmail is
liable to assign 1/11 and 2/11 to your earlier December thread, making
them vanish from this series. I hope a change of title prevents that.

 1 ksm: allow trees per NUMA node
 2 ksm: add sysfs ABI Documentation
 3 ksm: trivial tidyups
 4 ksm: reorganize ksm_check_stable_tree
 5 ksm: get_ksm_page locked
 6 ksm: remove old stable nodes more thoroughly
 7 ksm: make KSM page migration possible
 8 ksm: make !merge_across_nodes migration safe
 9 mm: enable KSM page migration
10 mm: remove offlining arg to migrate_pages
11 ksm: stop hotremove lockdep warning

 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   52 +
 Documentation/vm/ksm.txt                      |    7
 include/linux/ksm.h                           |   18
 include/linux/migrate.h                       |   14
 mm/compaction.c                               |    2
 mm/ksm.c                                      |  566 +++++++++++++---
 mm/memory-failure.c                           |    7
 mm/memory.c                                   |   19
 mm/memory_hotplug.c                           |    3
 mm/mempolicy.c                                |   11
 mm/migrate.c                                  |   61 -
 mm/page_alloc.c                               |    6
 12 files changed, 580 insertions(+), 186 deletions(-)

Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167])
	by kanga.kvack.org (Postfix) with SMTP id CDA6D6B0005
	for ; Fri, 25 Jan 2013 20:54:51 -0500 (EST)
Received: by mail-pa0-f51.google.com with SMTP id fb11so576664pad.38
	for ; Fri, 25 Jan 2013 17:54:51 -0800 (PST)
Date: Fri, 25 Jan 2013 17:54:53 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 1/11] ksm: allow trees per NUMA node
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
    David Rientjes, Anton Arapov,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org

From: Petr Holasek

Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).

A typical use case is a large number of KVM guests on a NUMA machine,
where cpus on more distant nodes would see a significant increase in
access latency to the merged ksm page. A sysfs knob was chosen for
flexibility, since some users still prefer a higher amount of saved
physical memory regardless of access latency.

Every NUMA node has its own stable & unstable trees, for faster
searching and inserting. The merge_across_nodes value can be changed
only when there are no ksm shared pages in the system.

I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
measured the speed of memory access inside KVM guests, with memory
pinned to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times, as a percentage of the
average, were as follows:

merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%

merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:

1) switching merge_across_nodes after running KSM is liable to oops
   on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
   here for !merge_across_nodes to migrate nodes to the proper tree.
Signed-off-by: Petr Holasek Signed-off-by: Hugh Dickins Acked-by: Rik van Riel --- Documentation/vm/ksm.txt | 7 + mm/ksm.c | 151 ++++++++++++++++++++++++++++++++----- 2 files changed, 139 insertions(+), 19 deletions(-) --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" Default: 20 (chosen for demonstration purposes) +merge_across_nodes - specifies if pages from different numa nodes can be merged. + When set to 0, ksm merges only pages which physically + reside in the memory area of same NUMA node. It brings + lower latency to access to shared page. Value can be + changed only when there is no ksm shared pages in system. + Default: 1 + run - set 0 to stop ksmd from running but keep merged pages, set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", set 2 to stop ksmd and unmerge all pages currently merged, --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 @@ -36,6 +36,7 @@ #include #include #include +#include #include #include "internal.h" @@ -139,6 +140,9 @@ struct rmap_item { struct mm_struct *mm; unsigned long address; /* + low bits used for flags below */ unsigned int oldchecksum; /* when unstable */ +#ifdef CONFIG_NUMA + unsigned int nid; +#endif union { struct rb_node node; /* when node of unstable tree */ struct { /* when listed from stable tree */ @@ -153,8 +157,8 @@ struct rmap_item { #define STABLE_FLAG 0x200 /* is listed from the stable tree */ /* The stable and unstable tree heads */ -static struct rb_root root_stable_tree = RB_ROOT; -static struct rb_root root_unstable_tree = RB_ROOT; +static struct rb_root root_unstable_tree[MAX_NUMNODES]; +static struct rb_root root_stable_tree[MAX_NUMNODES]; #define MM_SLOTS_HASH_BITS 10 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); 
@@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ /* Milliseconds ksmd should sleep between batches */ static unsigned int ksm_thread_sleep_millisecs = 20; +/* Zeroed when merging across nodes is not allowed */ +static unsigned int ksm_merge_across_nodes = 1; + #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 #define KSM_RUN_UNMERGE 2 @@ -441,10 +448,25 @@ out: page = NULL; return page; } +/* + * This helper is used for getting right index into array of tree roots. + * When merge_across_nodes knob is set to 1, there are only two rb-trees for + * stable and unstable pages from all nodes with roots in index 0. Otherwise, + * every node has its own stable and unstable tree. + */ +static inline int get_kpfn_nid(unsigned long kpfn) +{ + if (ksm_merge_across_nodes) + return 0; + else + return pfn_to_nid(kpfn); +} + static void remove_node_from_stable_tree(struct stable_node *stable_node) { struct rmap_item *rmap_item; struct hlist_node *hlist; + int nid; hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { if (rmap_item->hlist.next) @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree cond_resched(); } - rb_erase(&stable_node->node, &root_stable_tree); + nid = get_kpfn_nid(stable_node->kpfn); + + rb_erase(&stable_node->node, &root_stable_tree[nid]); free_stable_node(stable_node); } @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); BUG_ON(age > 1); if (!age) - rb_erase(&rmap_item->node, &root_unstable_tree); +#ifdef CONFIG_NUMA + rb_erase(&rmap_item->node, + &root_unstable_tree[rmap_item->nid]); +#else + rb_erase(&rmap_item->node, &root_unstable_tree[0]); +#endif ksm_pages_unshared--; rmap_item->address &= PAGE_MASK; @@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag */ static struct page *stable_tree_search(struct page *page) { - struct rb_node *node = root_stable_tree.rb_node; + struct rb_node *node; struct stable_node *stable_node; + int nid; stable_node 
= page_stable_node(page); if (stable_node) { /* ksm page forked */ @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s return page; } + nid = get_kpfn_nid(page_to_pfn(page)); + node = root_stable_tree[nid].rb_node; + while (node) { struct page *tree_page; int ret; @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s */ static struct stable_node *stable_tree_insert(struct page *kpage) { - struct rb_node **new = &root_stable_tree.rb_node; + int nid; + unsigned long kpfn; + struct rb_node **new; struct rb_node *parent = NULL; struct stable_node *stable_node; + kpfn = page_to_pfn(kpage); + nid = get_kpfn_nid(kpfn); + new = &root_stable_tree[nid].rb_node; + while (*new) { struct page *tree_page; int ret; @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i return NULL; rb_link_node(&stable_node->node, parent, new); - rb_insert_color(&stable_node->node, &root_stable_tree); + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); INIT_HLIST_HEAD(&stable_node->hlist); - stable_node->kpfn = page_to_pfn(kpage); + stable_node->kpfn = kpfn; set_page_stable_node(kpage, stable_node); return stable_node; @@ -1098,10 +1137,15 @@ static struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, struct page *page, struct page **tree_pagep) - { - struct rb_node **new = &root_unstable_tree.rb_node; + struct rb_node **new; + struct rb_root *root; struct rb_node *parent = NULL; + int nid; + + nid = get_kpfn_nid(page_to_pfn(page)); + root = &root_unstable_tree[nid]; + new = &root->rb_node; while (*new) { struct rmap_item *tree_rmap_item; @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i return NULL; } + /* + * If tree_page has been migrated to another NUMA node, it + * will be flushed out and put into the right unstable tree + * next time: only merge with it if merge_across_nodes. + * Just notice, we don't have similar problem for PageKsm + * because their migration is disabled now. 
(62b61f611e) + */ + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { + put_page(tree_page); + return NULL; + } + ret = memcmp_pages(page, tree_page); parent = *new; @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i rmap_item->address |= UNSTABLE_FLAG; rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); +#ifdef CONFIG_NUMA + rmap_item->nid = nid; +#endif rb_link_node(&rmap_item->node, parent, new); - rb_insert_color(&rmap_item->node, &root_unstable_tree); + rb_insert_color(&rmap_item->node, root); ksm_pages_unshared++; return NULL; @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i static void stable_tree_append(struct rmap_item *rmap_item, struct stable_node *stable_node) { +#ifdef CONFIG_NUMA + /* + * Usually rmap_item->nid is already set correctly, + * but it may be wrong after switching merge_across_nodes. + */ + rmap_item->nid = get_kpfn_nid(stable_node->kpfn); +#endif rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r struct mm_slot *slot; struct vm_area_struct *vma; struct rmap_item *rmap_item; + int nid; if (list_empty(&ksm_mm_head.mm_list)) return NULL; @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r */ lru_add_drain_all(); - root_unstable_tree = RB_ROOT; + for (nid = 0; nid < nr_node_ids; nid++) + root_unstable_tree[nid] = RB_ROOT; spin_lock(&ksm_mmlist_lock); slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list); @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta unsigned long end_pfn) { struct rb_node *node; + int nid; - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) { - struct stable_node *stable_node; + for (nid = 0; nid < nr_node_ids; nid++) + for (node = rb_first(&root_stable_tree[nid]); node; + node = rb_next(node)) { + struct stable_node *stable_node; + + stable_node = rb_entry(node, struct stable_node, node); + if 
(stable_node->kpfn >= start_pfn && + stable_node->kpfn < end_pfn) + return stable_node; + } - stable_node = rb_entry(node, struct stable_node, node); - if (stable_node->kpfn >= start_pfn && - stable_node->kpfn < end_pfn) - return stable_node; - } return NULL; } @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject } KSM_ATTR(run); +#ifdef CONFIG_NUMA +static ssize_t merge_across_nodes_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", ksm_merge_across_nodes); +} + +static ssize_t merge_across_nodes_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned long knob; + + err = kstrtoul(buf, 10, &knob); + if (err) + return err; + if (knob > 1) + return -EINVAL; + + mutex_lock(&ksm_thread_mutex); + if (ksm_merge_across_nodes != knob) { + if (ksm_pages_shared) + err = -EBUSY; + else + ksm_merge_across_nodes = knob; + } + mutex_unlock(&ksm_thread_mutex); + + return err ? err : count; +} +KSM_ATTR(merge_across_nodes); +#endif + static ssize_t pages_shared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = { &pages_unshared_attr.attr, &pages_volatile_attr.attr, &full_scans_attr.attr, +#ifdef CONFIG_NUMA + &merge_across_nodes_attr.attr, +#endif NULL, }; @@ -1992,11 +2101,15 @@ static int __init ksm_init(void) { struct task_struct *ksm_thread; int err; + int nid; err = ksm_slab_init(); if (err) goto out; + for (nid = 0; nid < nr_node_ids; nid++) + root_stable_tree[nid] = RB_ROOT; + ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); if (IS_ERR(ksm_thread)) { printk(KERN_ERR "ksm: creating kthread failed\n"); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
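
[Editorial illustration, not part of the posted patch.] The tree
selection that patch 1/11 introduces with get_kpfn_nid() can be
sketched in userspace C. This is only a minimal model under stated
assumptions: every sketch_* name and both SKETCH_* constants are
hypothetical, and the pfn-to-node mapping is a made-up stand-in for
the kernel's pfn_to_nid(), chosen so the example is self-contained.

```c
#include <assert.h>

#define SKETCH_MAX_NODES     4        /* hypothetical stand-in for MAX_NUMNODES */
#define SKETCH_PFNS_PER_NODE 1024UL   /* assumption: each node owns a contiguous pfn range */

/* Stand-in for the sysfs knob: 1 (the default) means merge across nodes. */
static unsigned int sketch_merge_across_nodes = 1;

/* Hypothetical pfn -> node mapping; the kernel uses pfn_to_nid() instead. */
static int sketch_pfn_to_nid(unsigned long pfn)
{
	unsigned long nid = pfn / SKETCH_PFNS_PER_NODE;

	return nid < SKETCH_MAX_NODES ? (int)nid : SKETCH_MAX_NODES - 1;
}

/* Same shape as get_kpfn_nid(): one shared tree at index 0, or one per node. */
static int sketch_get_kpfn_nid(unsigned long kpfn)
{
	assert(sketch_merge_across_nodes <= 1);
	return sketch_merge_across_nodes ? 0 : sketch_pfn_to_nid(kpfn);
}

/* Two pages can only meet in a tree, and so be merged, if this is true. */
static int sketch_same_tree(unsigned long pfn_a, unsigned long pfn_b)
{
	return sketch_get_kpfn_nid(pfn_a) == sketch_get_kpfn_nid(pfn_b);
}
```

With the knob at 1 every pfn maps to index 0, so any two pages are
candidates for the same tree; with the knob at 0 only pages whose pfns
fall on the same node are.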
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id B12296B0005 for ; Fri, 25 Jan 2013 20:56:55 -0500 (EST) Received: by mail-da0-f51.google.com with SMTP id i30so432737dad.10 for ; Fri, 25 Jan 2013 17:56:55 -0800 (PST) Date: Fri, 25 Jan 2013 17:56:57 -0800 (PST) From: Hugh Dickins Subject: [PATCH 2/11] ksm: add sysfs ABI Documentation In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Greg KH , linux-kernel@vger.kernel.org, linux-mm@kvack.org From: Petr Holasek This patch adds sysfs documentation for Kernel Samepage Merging (KSM) including new merge_across_nodes knob. Signed-off-by: Petr Holasek Signed-off-by: Hugh Dickins --- Documentation/ABI/testing/sysfs-kernel-mm-ksm | 52 ++++++++++++++++ 1 file changed, 52 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-ksm --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ mmotm/Documentation/ABI/testing/sysfs-kernel-mm-ksm 2013-01-25 14:36:50.660205905 -0800 @@ -0,0 +1,52 @@ +What: /sys/kernel/mm/ksm +Date: September 2009 +KernelVersion: 2.6.32 +Contact: Linux memory management mailing list +Description: Interface for Kernel Samepage Merging (KSM) + +What: /sys/kernel/mm/ksm/full_scans +What: /sys/kernel/mm/ksm/pages_shared +What: /sys/kernel/mm/ksm/pages_sharing +What: /sys/kernel/mm/ksm/pages_to_scan +What: /sys/kernel/mm/ksm/pages_unshared +What: /sys/kernel/mm/ksm/pages_volatile +What: /sys/kernel/mm/ksm/run +What: /sys/kernel/mm/ksm/sleep_millisecs +Date: September 2009 +Contact: Linux memory management mailing list +Description: Kernel Samepage Merging daemon sysfs interface + + full_scans: how many times all mergeable areas have been + scanned. 
+ + pages_shared: how many shared pages are being used. + + pages_sharing: how many more sites are sharing them i.e. how + much saved. + + pages_to_scan: how many present pages to scan before ksmd goes + to sleep. + + pages_unshared: how many pages unique but repeatedly checked + for merging. + + pages_volatile: how many pages changing too fast to be placed + in a tree. + + run: write 0 to disable ksm, read 0 while ksm is disabled. + write 1 to run ksm, read 1 while ksm is running. + write 2 to disable ksm and unmerge all its pages. + + sleep_millisecs: how many milliseconds ksm should sleep between + scans. + + See Documentation/vm/ksm.txt for more information. + +What: /sys/kernel/mm/ksm/merge_across_nodes +Date: January 2013 +KernelVersion: 3.9 +Contact: Linux memory management mailing list +Description: Control merging pages across different NUMA nodes. + + When it is set to 0 only pages from the same node are merged, + otherwise pages from all nodes can be merged together (default). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
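
[Editorial illustration, not part of the posted patch.] The ABI entry
above describes merge_across_nodes as a 0/1 knob. A userspace helper
for reading and writing such a file might look like the sketch below.
The path comes from the patch; the helper names knob_read() and
knob_write() are hypothetical, and both take the path as a parameter
so they can be exercised against an ordinary file.

```c
#include <assert.h>
#include <stdio.h>

/* Path as documented in the ABI file added by patch 2/11. */
#define KSM_MERGE_ACROSS_NODES "/sys/kernel/mm/ksm/merge_across_nodes"

/* Read an unsigned value from a sysfs-style file; returns 0 on success, -1 on failure. */
static int knob_read(const char *path, unsigned int *value)
{
	FILE *f;
	int ok;

	assert(path && value);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%u", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write 0 or 1 to the knob; returns 0 on success, -1 on failure. */
static int knob_write(const char *path, unsigned int value)
{
	FILE *f;

	assert(path);
	if (value > 1)
		return -1;	/* the store handler in patch 1/11 rejects these with -EINVAL */
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%u\n", value) < 0) {
		fclose(f);
		return -1;
	}
	/* Buffered write errors only become visible when the stream is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Per the store handler in patch 1/11, a write that changes the value
while ksm_pages_shared is non-zero fails with -EBUSY, so a caller of
knob_write(KSM_MERGE_ACROSS_NODES, 0) should be prepared for failure
and unmerge first (writing 2 to run) before retrying.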
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103])
	by kanga.kvack.org (Postfix) with SMTP id AE2E36B0008
	for ; Fri, 25 Jan 2013 20:58:10 -0500 (EST)
Received: by mail-pa0-f52.google.com with SMTP id fb1so578274pad.25
	for ; Fri, 25 Jan 2013 17:58:09 -0800 (PST)
Date: Fri, 25 Jan 2013 17:58:11 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 3/11] ksm: trivial tidyups
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org

Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef
CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by
nid when not NUMA). Add comment, remove "unsigned" from rmap_item->nid,
as "int nid" elsewhere. Define ksm_merge_across_nodes 1U when #ifndef
NUMA to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few
comments noticed in ongoing work.

Leave stable_tree_insert()'s rb_linkage until after the node has been
set up, as unstable_tree_search_insert() does: ksm_thread_mutex and
page lock make either way safe, but we're going to copy and I prefer
this precedent.
Signed-off-by: Hugh Dickins --- mm/ksm.c | 48 ++++++++++++++++++++++-------------------------- 1 file changed, 22 insertions(+), 26 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 @@ -41,6 +41,14 @@ #include #include "internal.h" +#ifdef CONFIG_NUMA +#define NUMA(x) (x) +#define DO_NUMA(x) (x) +#else +#define NUMA(x) (0) +#define DO_NUMA(x) do { } while (0) +#endif + /* * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: @@ -130,6 +138,7 @@ struct stable_node { * @mm: the memory structure this rmap_item is pointing into * @address: the virtual address this rmap_item tracks (+ flags in low bits) * @oldchecksum: previous checksum of the page at that virtual address + * @nid: NUMA node id of unstable tree in which linked (may not match page) * @node: rb node of this rmap_item in the unstable tree * @head: pointer to stable_node heading this list in the stable tree * @hlist: link into hlist of rmap_items hanging off that stable_node @@ -141,7 +150,7 @@ struct rmap_item { unsigned long address; /* + low bits used for flags below */ unsigned int oldchecksum; /* when unstable */ #ifdef CONFIG_NUMA - unsigned int nid; + int nid; #endif union { struct rb_node node; /* when node of unstable tree */ @@ -192,8 +201,12 @@ static unsigned int ksm_thread_pages_to_ /* Milliseconds ksmd should sleep between batches */ static unsigned int ksm_thread_sleep_millisecs = 20; +#ifdef CONFIG_NUMA /* Zeroed when merging across nodes is not allowed */ static unsigned int ksm_merge_across_nodes = 1; +#else +#define ksm_merge_across_nodes 1U +#endif #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 @@ -456,10 +469,7 @@ out: page = NULL; */ static inline int get_kpfn_nid(unsigned long kpfn) { - if (ksm_merge_across_nodes) - return 0; - else - return pfn_to_nid(kpfn); + return ksm_merge_across_nodes ? 
0 : pfn_to_nid(kpfn); } static void remove_node_from_stable_tree(struct stable_node *stable_node) @@ -479,7 +489,6 @@ static void remove_node_from_stable_tree } nid = get_kpfn_nid(stable_node->kpfn); - rb_erase(&stable_node->node, &root_stable_tree[nid]); free_stable_node(stable_node); } @@ -578,13 +587,8 @@ static void remove_rmap_item_from_tree(s age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); BUG_ON(age > 1); if (!age) -#ifdef CONFIG_NUMA rb_erase(&rmap_item->node, - &root_unstable_tree[rmap_item->nid]); -#else - rb_erase(&rmap_item->node, &root_unstable_tree[0]); -#endif - + &root_unstable_tree[NUMA(rmap_item->nid)]); ksm_pages_unshared--; rmap_item->address &= PAGE_MASK; } @@ -604,7 +608,7 @@ static void remove_trailing_rmap_items(s } /* - * Though it's very tempting to unmerge in_stable_tree(rmap_item)s rather + * Though it's very tempting to unmerge rmap_items from stable tree rather * than check every pte of a given vma, the locking doesn't quite work for * that - an rmap_item is assigned to the stable tree after inserting ksm * page and upping mmap_sem. Nor does it fit with the way we skip dup'ing @@ -1058,7 +1062,7 @@ static struct page *stable_tree_search(s } /* - * stable_tree_insert - insert rmap_item pointing to new ksm page + * stable_tree_insert - insert stable tree node pointing to new ksm page * into the stable tree. 
* * This function returns the stable tree node just allocated on success, @@ -1108,13 +1112,11 @@ static struct stable_node *stable_tree_i if (!stable_node) return NULL; - rb_link_node(&stable_node->node, parent, new); - rb_insert_color(&stable_node->node, &root_stable_tree[nid]); - INIT_HLIST_HEAD(&stable_node->hlist); - stable_node->kpfn = kpfn; set_page_stable_node(kpage, stable_node); + rb_link_node(&stable_node->node, parent, new); + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); return stable_node; } @@ -1170,8 +1172,6 @@ struct rmap_item *unstable_tree_search_i * If tree_page has been migrated to another NUMA node, it * will be flushed out and put into the right unstable tree * next time: only merge with it if merge_across_nodes. - * Just notice, we don't have similar problem for PageKsm - * because their migration is disabled now. (62b61f611e) */ if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { put_page(tree_page); @@ -1195,9 +1195,7 @@ struct rmap_item *unstable_tree_search_i rmap_item->address |= UNSTABLE_FLAG; rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); -#ifdef CONFIG_NUMA - rmap_item->nid = nid; -#endif + DO_NUMA(rmap_item->nid = nid); rb_link_node(&rmap_item->node, parent, new); rb_insert_color(&rmap_item->node, root); @@ -1213,13 +1211,11 @@ struct rmap_item *unstable_tree_search_i static void stable_tree_append(struct rmap_item *rmap_item, struct stable_node *stable_node) { -#ifdef CONFIG_NUMA /* * Usually rmap_item->nid is already set correctly, * but it may be wrong after switching merge_across_nodes. */ - rmap_item->nid = get_kpfn_nid(stable_node->kpfn); -#endif + DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn)); rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
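
[Editorial illustration, not part of the posted patch.] The NUMA() and
DO_NUMA() macros from patch 3/11 can be tried out in userspace. The
macro bodies below are copied from the patch; everything else is an
assumption made for the example: SKETCH_NUMA stands in for
CONFIG_NUMA, and struct sketch_item with its two helpers is a
hypothetical cut-down model of rmap_item.

```c
#include <assert.h>

/* Hypothetical switch modelling CONFIG_NUMA=y; remove it to model !NUMA. */
#define SKETCH_NUMA 1

#ifdef SKETCH_NUMA
#define NUMA(x)		(x)
#define DO_NUMA(x)	(x)
#else
#define NUMA(x)		(0)
#define DO_NUMA(x)	do { } while (0)
#endif

struct sketch_item {
	unsigned long address;
#ifdef SKETCH_NUMA
	int nid;	/* field exists only on NUMA builds, as in rmap_item */
#endif
};

/* Index of the tree this item lives in; always 0 when NUMA() discards nid. */
static int sketch_tree_index(const struct sketch_item *item)
{
	assert(item);
	return NUMA(item->nid);
}

/* Record the node id; compiles to nothing when the field does not exist. */
static void sketch_set_nid(struct sketch_item *item, int nid)
{
	assert(item);
	(void)nid;	/* avoids an unused-parameter warning on !NUMA builds */
	DO_NUMA(item->nid = nid);
}
```

Because the argument is dropped by the preprocessor on non-NUMA
builds, item->nid is never named there, which is what lets the struct
omit the field without an #ifdef at each use.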
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129])
	by kanga.kvack.org (Postfix) with SMTP id 8BD1A6B0005
	for ; Fri, 25 Jan 2013 20:59:34 -0500 (EST)
Received: by mail-pa0-f48.google.com with SMTP id fa1so587914pad.7
	for ; Fri, 25 Jan 2013 17:59:33 -0800 (PST)
Date: Fri, 25 Jan 2013 17:59:35 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org

Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange
so that at least it does not needlessly restart from nid 0 each time.
And add a couple of comments: here is why we keep pfn instead of page.
Signed-off-by: Hugh Dickins --- mm/ksm.c | 38 ++++++++++++++++++++++---------------- 1 file changed, 22 insertions(+), 16 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, - unsigned long end_pfn) +static void ksm_check_stable_tree(unsigned long start_pfn, + unsigned long end_pfn) { + struct stable_node *stable_node; struct rb_node *node; int nid; - for (nid = 0; nid < nr_node_ids; nid++) - for (node = rb_first(&root_stable_tree[nid]); node; - node = rb_next(node)) { - struct stable_node *stable_node; - + for (nid = 0; nid < nr_node_ids; nid++) { + node = rb_first(&root_stable_tree[nid]); + while (node) { stable_node = rb_entry(node, struct stable_node, node); if (stable_node->kpfn >= start_pfn && - stable_node->kpfn < end_pfn) - return stable_node; + stable_node->kpfn < end_pfn) { + /* + * Don't get_ksm_page, page has already gone: + * which is why we keep kpfn instead of page* + */ + remove_node_from_stable_tree(stable_node); + node = rb_first(&root_stable_tree[nid]); + } else + node = rb_next(node); + cond_resched(); } - - return NULL; + } } static int ksm_memory_callback(struct notifier_block *self, unsigned long action, void *arg) { struct memory_notify *mn = arg; - struct stable_node *stable_node; switch (action) { case MEM_GOING_OFFLINE: @@ -1874,11 +1879,12 @@ static int ksm_memory_callback(struct no /* * Most of the work is done by page migration; but there might * be a few stable_nodes left over, still pointing to struct - * pages which have been offlined: prune those from the tree. + * pages which have been offlined: prune those from the tree, + * otherwise get_ksm_page() might later try to access a + * non-existent struct page. 
 		 */
-		while ((stable_node = ksm_check_stable_tree(mn->start_pfn,
-					mn->start_pfn + mn->nr_pages)) != NULL)
-			remove_node_from_stable_tree(stable_node);
+		ksm_check_stable_tree(mn->start_pfn,
+				      mn->start_pfn + mn->nr_pages);
 		/* fallthrough */

 	case MEM_CANCEL_OFFLINE:

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id 0B8506B0008 for ; Fri, 25 Jan 2013 21:00:48 -0500 (EST)
Received: by mail-da0-f51.google.com with SMTP id i30so429107dad.38 for ; Fri, 25 Jan 2013 18:00:48 -0800 (PST)
Date: Fri, 25 Jan 2013 18:00:50 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 5/11] ksm: get_ksm_page locked
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

In some places where get_ksm_page() is used, we need the page to be locked.

When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration notices).
Whereas when navigating through the stable tree, we certainly do not want
to lock each node (raised page count is enough to guarantee the memcmps,
even if page is migrated to another node).

Since we're about to add another use case, add the locked argument to
get_ksm_page() now.

Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I
really got the wrong end of the stick on that! There's a configuration in
which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it. There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here. Cut out that
silliness before making this any harder to understand.

Signed-off-by: Hugh Dickins
---
 mm/ksm.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800
+++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800
@@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
  * but this is different - made simpler by ksm_thread_mutex being held, but
  * interesting for assuming that no other use of the struct page could ever
  * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping). The RCU calls are not for KSM at all, but
- * to keep the page_count protocol described with page_cache_get_speculative.
+ * coincides with page->mapping).
  *
  * Note: it is possible that get_ksm_page() will return NULL one moment,
  * then page the next, if the page is in between page_freeze_refs() and
  * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
*/ -static struct page *get_ksm_page(struct stable_node *stable_node) +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) { struct page *page; void *expected_mapping; @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct page = pfn_to_page(stable_node->kpfn); expected_mapping = (void *)stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); - rcu_read_lock(); if (page->mapping != expected_mapping) goto stale; if (!get_page_unless_zero(page)) @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct put_page(page); goto stale; } - rcu_read_unlock(); + if (locked) { + lock_page(page); + if (page->mapping != expected_mapping) { + unlock_page(page); + put_page(page); + goto stale; + } + } return page; stale: - rcu_read_unlock(); remove_node_from_stable_tree(stable_node); return NULL; } @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s struct page *page; stable_node = rmap_item->head; - page = get_ksm_page(stable_node); + page = get_ksm_page(stable_node, true); if (!page) goto out; - lock_page(page); hlist_del(&rmap_item->hlist); unlock_page(page); put_page(page); @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s cond_resched(); stable_node = rb_entry(node, struct stable_node, node); - tree_page = get_ksm_page(stable_node); + tree_page = get_ksm_page(stable_node, false); if (!tree_page) return NULL; @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i cond_resched(); stable_node = rb_entry(*new, struct stable_node, node); - tree_page = get_ksm_page(stable_node); + tree_page = get_ksm_page(stable_node, false); if (!tree_page) return NULL; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx204.postini.com [74.125.245.204]) by kanga.kvack.org (Postfix) with SMTP id 60A256B0005 for ; Fri, 25 Jan 2013 21:01:58 -0500 (EST)
Received: by mail-pb0-f42.google.com with SMTP id rp2so534176pbb.15 for ; Fri, 25 Jan 2013 18:01:57 -0800 (PST)
Date: Fri, 25 Jan 2013 18:01:59 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree. It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.

How can this happen? We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.

Three causes:

1. The old stable tree (built according to the inverse merge_across_nodes)
   has not been fully torn down. A stable node lingers until get_ksm_page()
   notices that the page it references no longer references it: but the
   page is not necessarily freed as soon as expected, particularly when
   swapcache.

   Fix this with a pass through the old stable tree, applying
   get_ksm_page() to each of the remaining nodes (most found stale and
   removed immediately), with forced removal of any left over. Unless the
   page is still mapped: I've not seen that case, it shouldn't occur, but
   better to WARN_ON_ONCE and EBUSY than BUG.

2. __ksm_enter() has a nice little optimization, to insert the new mm just
   behind ksmd's cursor, so there's a full pass for it to stabilize (or be
   removed) before ksmd addresses it. Nice when ksmd is running, but not
   so nice when we're trying to unmerge all mms: we were missing those mms
   forked and inserted behind the unmerge cursor. Easily fixed by
   inserting at the end when KSM_RUN_UNMERGE.

3. It is possible for a KSM page to be faulted back from swapcache into an
   mm, just after unmerge_and_remove_all_rmap_items() scanned past it. Fix
   this by copying on fault when KSM_RUN_UNMERGE: but that is private to
   ksm.c, so dissolve the distinction between ksm_might_need_to_copy() and
   ksm_does_need_to_copy(), doing it all in the one call into ksm.c.

A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
error when read in from swap) to a page which it then marks Uptodate. Fix
this case by not copying, letting do_swap_page() discover the error.

Signed-off-by: Hugh Dickins
---
 include/linux/ksm.h | 18 ++-------
 mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++---
 mm/memory.c | 19 ++++-----
 3 files changed, 92 insertions(+), 28 deletions(-)

--- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800
+++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800
@@ -16,9 +16,6 @@
 struct stable_node;
 struct mem_cgroup;

-struct page *ksm_does_need_to_copy(struct page *page,
-			struct vm_area_struct *vma, unsigned long address);
-
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
@@ -73,15 +70,8 @@ static inline void set_page_stable_node(
  * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
  * but what if the vma was unmerged while the page was swapped out?
*/ -static inline int ksm_might_need_to_copy(struct page *page, - struct vm_area_struct *vma, unsigned long address) -{ - struct anon_vma *anon_vma = page_anon_vma(page); - - return anon_vma && - (anon_vma->root != vma->anon_vma->root || - page->index != linear_page_index(vma, address)); -} +struct page *ksm_might_need_to_copy(struct page *page, + struct vm_area_struct *vma, unsigned long address); int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg, unsigned long *vm_flags); @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ return 0; } -static inline int ksm_might_need_to_copy(struct page *page, +static inline struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address) { - return 0; + return page; } static inline int page_referenced_ksm(struct page *page, --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a /* * Only called through the sysfs control interface: */ +static int remove_stable_node(struct stable_node *stable_node) +{ + struct page *page; + int err; + + page = get_ksm_page(stable_node, true); + if (!page) { + /* + * get_ksm_page did remove_node_from_stable_tree itself. + */ + return 0; + } + + if (WARN_ON_ONCE(page_mapped(page))) + err = -EBUSY; + else { + /* + * This page might be in a pagevec waiting to be freed, + * or it might be PageSwapCache (perhaps under writeback), + * or it might have been removed from swapcache a moment ago. 
+ */ + set_page_stable_node(page, NULL); + remove_node_from_stable_tree(stable_node); + err = 0; + } + + unlock_page(page); + put_page(page); + return err; +} + +static int remove_all_stable_nodes(void) +{ + struct stable_node *stable_node; + int nid; + int err = 0; + + for (nid = 0; nid < nr_node_ids; nid++) { + while (root_stable_tree[nid].rb_node) { + stable_node = rb_entry(root_stable_tree[nid].rb_node, + struct stable_node, node); + if (remove_stable_node(stable_node)) { + err = -EBUSY; + break; /* proceed to next nid */ + } + cond_resched(); + } + } + return err; +} + static int unmerge_and_remove_all_rmap_items(void) { struct mm_slot *mm_slot; @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i } } + /* Clean up stable nodes, but don't worry if some are still busy */ + remove_all_stable_nodes(); ksm_scan.seqnr = 0; return 0; @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) spin_lock(&ksm_mmlist_lock); insert_to_mm_slots_hash(mm, mm_slot); /* - * Insert just behind the scanning cursor, to let the area settle + * When KSM_RUN_MERGE (or KSM_RUN_STOP), + * insert just behind the scanning cursor, to let the area settle * down a little; when fork is followed by immediate exec, we don't * want ksmd to waste time setting up and tearing down an rmap_list. + * + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its + * scanning cursor, otherwise KSM pages in newly forked mms will be + * missed: then we might as well insert at the end of the list. 
*/ - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); + if (ksm_run & KSM_RUN_UNMERGE) + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); + else + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); spin_unlock(&ksm_mmlist_lock); set_bit(MMF_VM_MERGEABLE, &mm->flags); @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) } } -struct page *ksm_does_need_to_copy(struct page *page, +struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address) { + struct anon_vma *anon_vma = page_anon_vma(page); struct page *new_page; + if (PageKsm(page)) { + if (page_stable_node(page) && + !(ksm_run & KSM_RUN_UNMERGE)) + return page; /* no need to copy it */ + } else if (!anon_vma) { + return page; /* no need to copy it */ + } else if (anon_vma->root == vma->anon_vma->root && + page->index == linear_page_index(vma, address)) { + return page; /* still no need to copy it */ + } + if (!PageUptodate(page)) + return page; /* let do_swap_page report the error */ + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (new_page) { copy_user_highpage(new_page, page, address, vma); @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( mutex_lock(&ksm_thread_mutex); if (ksm_merge_across_nodes != knob) { - if (ksm_pages_shared) + if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; else ksm_merge_across_nodes = knob; --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) goto out_page; - if (ksm_might_need_to_copy(page, vma, address)) { - swapcache = page; - page = ksm_does_need_to_copy(page, vma, address); - - if (unlikely(!page)) { - ret = VM_FAULT_OOM; - page = swapcache; - swapcache = NULL; - goto out_page; - } + swapcache = page; + page = ksm_might_need_to_copy(page, vma, 
address);
+	if (unlikely(!page)) {
+		ret = VM_FAULT_OOM;
+		page = swapcache;
+		swapcache = NULL;
+		goto out_page;
 	}
+	if (page == swapcache)
+		swapcache = NULL;

 	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
 		ret = VM_FAULT_OOM;

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx120.postini.com [74.125.245.120]) by kanga.kvack.org (Postfix) with SMTP id EC8EE6B0005 for ; Fri, 25 Jan 2013 21:03:30 -0500 (EST)
Received: by mail-da0-f53.google.com with SMTP id x6so430070dac.40 for ; Fri, 25 Jan 2013 18:03:30 -0800 (PST)
Date: Fri, 25 Jan 2013 18:03:31 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 7/11] ksm: make KSM page migration possible
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree? And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt
to deal with it either. But how will I test a solution, when I don't
know how to hotremove memory? The best answer is to enable KSM page
migration in all cases now, and test more common cases. With THP and
compaction added since KSM came in, page migration is now mainstream,
and it's a shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets
KSM page migration working reliably for default merge_across_nodes 1
(but leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require
an additional tier of locking: page migration relies on the page lock,
KSM page reclaim relies on the page lock, the page lock is enough for
KSM page migration too.

Almost all the care has to be in get_ksm_page(): that's the function
which worries about when a stable node is stale and should be freed,
now it also has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here. That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if
not swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)

You might worry about mremap, and whether page migration's rmap_walk
to remove migration entries will find all the KSM locations where it
inserted earlier: that should already be handled, by the satisfyingly
heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
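
For readers who want to poke at the kpfn recheck outside the kernel, here is
a rough user-space sketch of it. It is not the kernel code: every model_*
name is made up for illustration, C11 fences merely stand in for
smp_wmb()/smp_rmb(), and the page count, page lock and PageSwapCache checks
that the real get_ksm_page() needs are left out entirely. It only shows why
a reader that finds a stale back-pointer must re-read kpfn before declaring
the node stale.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_NPAGES 8

struct model_node;

/* Hypothetical stand-in for struct page: only the back-pointer matters. */
struct model_page {
	_Atomic(struct model_node *) mapping;	/* NULL once the page is stale */
};

/* Hypothetical stand-in for struct stable_node. */
struct model_node {
	atomic_ulong kpfn;	/* frame currently holding the shared page */
};

enum model_verdict { MODEL_FOUND, MODEL_RETRY, MODEL_STALE };

static struct model_page model_pages[MODEL_NPAGES];

/* Writer side, loosely following the store order in ksm_migrate_page(). */
static void model_migrate(struct model_node *node, unsigned long new_pfn)
{
	unsigned long old_pfn =
		atomic_load_explicit(&node->kpfn, memory_order_relaxed);

	assert(new_pfn < MODEL_NPAGES && new_pfn != old_pfn);
	/* The new page's back-pointer is set in advance. */
	atomic_store_explicit(&model_pages[new_pfn].mapping, node,
			      memory_order_relaxed);
	atomic_store_explicit(&node->kpfn, new_pfn, memory_order_relaxed);
	/* Stands in for smp_wmb(): publish kpfn before the old page goes stale. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&model_pages[old_pfn].mapping,
			      (struct model_node *)NULL, memory_order_relaxed);
}

/* One reader attempt against a previously sampled kpfn. */
static enum model_verdict model_check(struct model_node *node,
				      unsigned long kpfn)
{
	if (atomic_load_explicit(&model_pages[kpfn].mapping,
				 memory_order_relaxed) == node)
		return MODEL_FOUND;
	/* Stands in for smp_rmb(): re-read kpfn only after the stale mapping. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&node->kpfn, memory_order_relaxed) != kpfn)
		return MODEL_RETRY;	/* migrated beneath us: look again */
	return MODEL_STALE;		/* really gone: caller would free the node */
}

/* Reader loop: returns the page, or NULL when the node is stale. */
static struct model_page *model_get_page(struct model_node *node)
{
	for (;;) {
		unsigned long kpfn =
			atomic_load_explicit(&node->kpfn, memory_order_relaxed);

		switch (model_check(node, kpfn)) {
		case MODEL_FOUND:
			return &model_pages[kpfn];
		case MODEL_STALE:
			return NULL;
		case MODEL_RETRY:
			break;	/* leave the switch and resample kpfn */
		}
	}
}
```

Splitting out model_check() is only there so that the "sampled kpfn, then
migrated" interleaving can be replayed deterministically from one thread;
the patch itself does the whole thing inline with a goto.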
Signed-off-by: Hugh Dickins --- mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++----------- mm/migrate.c | 5 ++ 2 files changed, 77 insertions(+), 22 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree * In which case we can trust the content of the page, and it * returns the gotten page; but if the page has now been zapped, * remove the stale node from the stable tree and return NULL. + * But beware, the stable node's page might be being migrated. * * You would expect the stable_node to hold a reference to the ksm page. * But if it increments the page's count, swapping out has to wait for @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree * pointing back to this stable node. This relies on freeing a PageAnon * page to reset its page->mapping to NULL, and relies on no other use of * a page to put something that might look like our key in page->mapping. - * - * include/linux/pagemap.h page_cache_get_speculative() is a good reference, - * but this is different - made simpler by ksm_thread_mutex being held, but - * interesting for assuming that no other use of the struct page could ever - * put our expected_mapping into page->mapping (or a field of the union which - * coincides with page->mapping). - * - * Note: it is possible that get_ksm_page() will return NULL one moment, - * then page the next, if the page is in between page_freeze_refs() and - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page * is on its way to being freed; but it is an anomaly to bear in mind. 
*/ static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) { struct page *page; void *expected_mapping; + unsigned long kpfn; - page = pfn_to_page(stable_node->kpfn); expected_mapping = (void *)stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); - if (page->mapping != expected_mapping) - goto stale; - if (!get_page_unless_zero(page)) +again: + kpfn = ACCESS_ONCE(stable_node->kpfn); + page = pfn_to_page(kpfn); + + /* + * page is computed from kpfn, so on most architectures reading + * page->mapping is naturally ordered after reading node->kpfn, + * but on Alpha we need to be more careful. + */ + smp_read_barrier_depends(); + if (ACCESS_ONCE(page->mapping) != expected_mapping) goto stale; - if (page->mapping != expected_mapping) { + + /* + * We cannot do anything with the page while its refcount is 0. + * Usually 0 means free, or tail of a higher-order page: in which + * case this node is no longer referenced, and should be freed; + * however, it might mean that the page is under page_freeze_refs(). + * The __remove_mapping() case is easy, again the node is now stale; + * but if page is swapcache in migrate_page_move_mapping(), it might + * still be our page, in which case it's essential to keep the node. + */ + while (!get_page_unless_zero(page)) { + /* + * Another check for page->mapping != expected_mapping would + * work here too. We have chosen the !PageSwapCache test to + * optimize the common case, when the page is or is about to + * be freed: PageSwapCache is cleared (under spin_lock_irq) + * in the freeze_refs section of __remove_mapping(); but Anon + * page->mapping reset to NULL later, in free_pages_prepare(). 
+ */ + if (!PageSwapCache(page)) + goto stale; + cpu_relax(); + } + + if (ACCESS_ONCE(page->mapping) != expected_mapping) { put_page(page); goto stale; } + if (locked) { lock_page(page); - if (page->mapping != expected_mapping) { + if (ACCESS_ONCE(page->mapping) != expected_mapping) { unlock_page(page); put_page(page); goto stale; } } return page; + stale: + /* + * We come here from above when page->mapping or !PageSwapCache + * suggests that the node is stale; but it might be under migration. + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), + * before checking whether node->kpfn has been changed. + */ + smp_rmb(); + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) + goto again; remove_node_from_stable_tree(stable_node); return NULL; } @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s return NULL; ret = memcmp_pages(page, tree_page); + put_page(tree_page); - if (ret < 0) { - put_page(tree_page); + if (ret < 0) node = node->rb_left; - } else if (ret > 0) { - put_page(tree_page); + else if (ret > 0) node = node->rb_right; - } else + else { + /* + * Lock and unlock the stable_node's page (which + * might already have been migrated) so that page + * migration is sure to notice its raised count. + * It would be more elegant to return stable_node + * than kpage, but that involves more changes. + */ + tree_page = get_ksm_page(stable_node, true); + if (tree_page) + unlock_page(tree_page); return tree_page; + } } return NULL; @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa if (stable_node) { VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); stable_node->kpfn = page_to_pfn(newpage); + /* + * newpage->mapping was set in advance; now we need smp_wmb() + * to make sure that the new stable_node->kpfn is visible + * to get_ksm_page() before it can see that oldpage->mapping + * has gone stale (or that PageSwapCache has been cleared). 
+		 */
+		smp_wmb();
+		set_page_stable_node(oldpage, NULL);
 	}
 }
 #endif /* CONFIG_MIGRATION */
--- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800
+++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800
@@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp

 	mlock_migrate_page(newpage, page);
 	ksm_migrate_page(newpage, page);
-
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
 	ClearPageSwapCache(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id E637C6B0005 for ; Fri, 25 Jan 2013 21:05:00 -0500 (EST)
Received: by mail-pb0-f47.google.com with SMTP id wz17so534259pbc.6 for ; Fri, 25 Jan 2013 18:05:00 -0800 (PST)
Date: Fri, 25 Jan 2013 18:05:02 -0800 (PST)
From: Hugh Dickins
Subject: [PATCH 8/11] ksm: make !merge_across_nodes migration safe
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree? And what if that
collides with an existing stable node?

ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex. So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until
the next pass of ksmd, just be careful not to merge other pages on to a
misplaced page. Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.

A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at
a matching node for another page, and this node page is found misplaced.

In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree. If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(), in
order to handle these insertions of migrated nodes.

remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself. Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.
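
To make the "address of migrate_nodes as magic" idea concrete, here is a
small user-space sketch. It is an illustration under assumptions, not the
kernel code: the model_* names, the three-pointer model_rb link and the
fixed MODEL_PFNS_PER_NID mapping from pfn to node id are all invented, and
the tree itself is omitted. What it tries to show is that one union can
serve both states, with the first word telling them apart, and that a node
counts as misplaced only while it is still in a tree whose nid no longer
matches its kpfn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal hypothetical list and tree links, standing in for list_head/rb_node. */
struct model_list { struct model_list *prev, *next; };
struct model_rb { struct model_rb *parent, *left, *right; };

/* Nodes exiled from a tree wait here; its address doubles as the tag. */
static struct model_list model_migrate_nodes = {
	&model_migrate_nodes, &model_migrate_nodes
};

struct model_stable_node {
	union {
		struct model_rb node;		/* when linked into a tree */
		struct {			/* when listed for migration */
			struct model_list *head;	/* overlays node.parent */
			struct model_list list;
		};
	};
	unsigned long kpfn;	/* frame of the shared page */
	int nid;		/* nid of the tree holding this node */
};

/* Assumed layout for the sketch: each node id owns a fixed run of pfns. */
#define MODEL_PFNS_PER_NID 1024UL

static int model_kpfn_nid(unsigned long kpfn)
{
	return (int)(kpfn / MODEL_PFNS_PER_NID);
}

/* Put the node into its "in a tree" state (tree linkage itself not modelled). */
static void model_plant(struct model_stable_node *sn, unsigned long kpfn, int nid)
{
	sn->node.parent = sn->node.left = sn->node.right = NULL;
	sn->kpfn = kpfn;
	sn->nid = nid;
}

static bool model_on_migrate_list(const struct model_stable_node *sn)
{
	/* A tree parent pointer can never equal the list head's address. */
	return sn->head == &model_migrate_nodes;
}

static bool model_misplaced(const struct model_stable_node *sn)
{
	return !model_on_migrate_list(sn) &&
	       model_kpfn_nid(sn->kpfn) != sn->nid;
}

/* Caller is assumed to have erased sn from its tree already. */
static void model_exile(struct model_stable_node *sn)
{
	sn->head = &model_migrate_nodes;
	sn->list.prev = &model_migrate_nodes;
	sn->list.next = model_migrate_nodes.next;
	model_migrate_nodes.next->prev = &sn->list;
	model_migrate_nodes.next = &sn->list;
}

/* Take sn off the list and mark it as belonging to the tree for nid. */
static void model_replant(struct model_stable_node *sn, int nid)
{
	assert(model_on_migrate_list(sn));
	sn->list.prev->next = sn->list.next;
	sn->list.next->prev = sn->list.prev;
	model_plant(sn, sn->kpfn, nid);
}
```

Reading head after writing node.parent relies on the two pointer types
sharing a representation, which holds on the usual platforms; the patch
leans on the same overlay with the real rb_node.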
Signed-off-by: Hugh Dickins --- mm/ksm.c | 164 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 134 insertions(+), 30 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 @@ -122,13 +122,25 @@ struct ksm_scan { /** * struct stable_node - node of the stable rbtree * @node: rb node of this ksm page in the stable tree + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list + * @list: linked into migrate_nodes, pending placement in the proper node tree * @hlist: hlist head of rmap_items using this ksm page - * @kpfn: page frame number of this ksm page + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) + * @nid: NUMA node id of stable tree in which linked (may not match kpfn) */ struct stable_node { - struct rb_node node; + union { + struct rb_node node; /* when node of stable tree */ + struct { /* when listed for migration */ + struct list_head *head; + struct list_head list; + }; + }; struct hlist_head hlist; unsigned long kpfn; +#ifdef CONFIG_NUMA + int nid; +#endif }; /** @@ -169,6 +181,9 @@ struct rmap_item { static struct rb_root root_unstable_tree[MAX_NUMNODES]; static struct rb_root root_stable_tree[MAX_NUMNODES]; +/* Recently migrated nodes of stable tree, pending proper placement */ +static LIST_HEAD(migrate_nodes); + #define MM_SLOTS_HASH_BITS 10 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm); } -static inline int in_stable_tree(struct rmap_item *rmap_item) -{ - return rmap_item->address & STABLE_FLAG; -} - /* * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's * page tables after it has passed through ksm_exit() - which, if necessary, @@ -476,7 +486,6 @@ static void remove_node_from_stable_tree { struct rmap_item *rmap_item; struct hlist_node *hlist; - int 
nid; hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { if (rmap_item->hlist.next) @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree cond_resched(); } - nid = get_kpfn_nid(stable_node->kpfn); - rb_erase(&stable_node->node, &root_stable_tree[nid]); + if (stable_node->head == &migrate_nodes) + list_del(&stable_node->list); + else + rb_erase(&stable_node->node, + &root_stable_tree[NUMA(stable_node->nid)]); free_stable_node(stable_node); } @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta static int remove_all_stable_nodes(void) { struct stable_node *stable_node; + struct list_head *this, *next; int nid; int err = 0; @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void) cond_resched(); } } + list_for_each_safe(this, next, &migrate_nodes) { + stable_node = list_entry(this, struct stable_node, list); + if (remove_stable_node(stable_node)) + err = -EBUSY; + cond_resched(); + } return err; } @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag */ static struct page *stable_tree_search(struct page *page) { - struct rb_node *node; - struct stable_node *stable_node; int nid; + struct rb_node **new; + struct rb_node *parent; + struct stable_node *stable_node; + struct stable_node *page_node; - stable_node = page_stable_node(page); - if (stable_node) { /* ksm page forked */ + page_node = page_stable_node(page); + if (page_node && page_node->head != &migrate_nodes) { + /* ksm page forked */ get_page(page); return page; } nid = get_kpfn_nid(page_to_pfn(page)); - node = root_stable_tree[nid].rb_node; +again: + new = &root_stable_tree[nid].rb_node; + parent = NULL; - while (node) { + while (*new) { struct page *tree_page; int ret; cond_resched(); - stable_node = rb_entry(node, struct stable_node, node); + stable_node = rb_entry(*new, struct stable_node, node); tree_page = get_ksm_page(stable_node, false); if (!tree_page) return NULL; @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s ret = memcmp_pages(page, 
tree_page); put_page(tree_page); + parent = *new; if (ret < 0) - node = node->rb_left; + new = &parent->rb_left; else if (ret > 0) - node = node->rb_right; + new = &parent->rb_right; else { /* * Lock and unlock the stable_node's page (which @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s * than kpage, but that involves more changes. */ tree_page = get_ksm_page(stable_node, true); - if (tree_page) + if (tree_page) { unlock_page(tree_page); - return tree_page; + if (get_kpfn_nid(stable_node->kpfn) != + NUMA(stable_node->nid)) { + put_page(tree_page); + goto replace; + } + return tree_page; + } + /* + * There is now a place for page_node, but the tree may + * have been rebalanced, so re-evaluate parent and new. + */ + if (page_node) + goto again; + return NULL; } } - return NULL; + if (!page_node) + return NULL; + + list_del(&page_node->list); + DO_NUMA(page_node->nid = nid); + rb_link_node(&page_node->node, parent, new); + rb_insert_color(&page_node->node, &root_stable_tree[nid]); + get_page(page); + return page; + +replace: + if (page_node) { + list_del(&page_node->list); + DO_NUMA(page_node->nid = nid); + rb_replace_node(&stable_node->node, + &page_node->node, &root_stable_tree[nid]); + get_page(page); + } else { + rb_erase(&stable_node->node, &root_stable_tree[nid]); + page = NULL; + } + stable_node->head = &migrate_nodes; + list_add(&stable_node->list, stable_node->head); + return page; } /* @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i INIT_HLIST_HEAD(&stable_node->hlist); stable_node->kpfn = kpfn; set_page_stable_node(kpage, stable_node); + DO_NUMA(stable_node->nid = nid); rb_link_node(&stable_node->node, parent, new); rb_insert_color(&stable_node->node, &root_stable_tree[nid]); @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i static void stable_tree_append(struct rmap_item *rmap_item, struct stable_node *stable_node) { - /* - * Usually rmap_item->nid is already set correctly, - * but it may be wrong after 
switching merge_across_nodes. - */ - DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn)); rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa unsigned int checksum; int err; - remove_rmap_item_from_tree(rmap_item); + stable_node = page_stable_node(page); + if (stable_node) { + if (stable_node->head != &migrate_nodes && + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { + rb_erase(&stable_node->node, + &root_stable_tree[NUMA(stable_node->nid)]); + stable_node->head = &migrate_nodes; + list_add(&stable_node->list, stable_node->head); + } + if (stable_node->head != &migrate_nodes && + rmap_item->head == stable_node) + return; + } /* We first start with searching the page inside the stable tree */ kpage = stable_tree_search(page); + if (kpage == page && rmap_item->head == stable_node) { + put_page(kpage); + return; + } + + remove_rmap_item_from_tree(rmap_item); + if (kpage) { err = try_to_merge_with_ksm_page(rmap_item, page, kpage); if (!err) { @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r */ lru_add_drain_all(); + /* + * Whereas stale stable_nodes on the stable_tree itself + * get pruned in the regular course of stable_tree_search(), + * those moved out to the migrate_nodes list can accumulate: + * so prune them once before each full scan. 
+ */ + if (!ksm_merge_across_nodes) { + struct stable_node *stable_node; + struct list_head *this, *next; + struct page *page; + + list_for_each_safe(this, next, &migrate_nodes) { + stable_node = list_entry(this, + struct stable_node, list); + page = get_ksm_page(stable_node, false); + if (page) + put_page(page); + cond_resched(); + } + } + for (nid = 0; nid < nr_node_ids; nid++) root_unstable_tree[nid] = RB_ROOT; @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca rmap_item = scan_get_next_rmap_item(&page); if (!rmap_item) return; - if (!PageKsm(page) || !in_stable_tree(rmap_item)) - cmp_and_merge_page(page, rmap_item); + cmp_and_merge_page(page, rmap_item); put_page(page); } } @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign unsigned long end_pfn) { struct stable_node *stable_node; + struct list_head *this, *next; struct rb_node *node; int nid; @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign cond_resched(); } } + list_for_each_safe(this, next, &migrate_nodes) { + stable_node = list_entry(this, struct stable_node, list); + if (stable_node->kpfn >= start_pfn && + stable_node->kpfn < end_pfn) + remove_node_from_stable_tree(stable_node); + cond_resched(); + } } static int ksm_memory_callback(struct notifier_block *self, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id B96C06B0005 for ; Fri, 25 Jan 2013 21:06:23 -0500 (EST) Received: by mail-pa0-f47.google.com with SMTP id fa10so582938pad.20 for ; Fri, 25 Jan 2013 18:06:22 -0800 (PST) Date: Fri, 25 Jan 2013 18:06:24 -0800 (PST) From: Hugh Dickins Subject: [PATCH 9/11] ksm: enable KSM page migration In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Migration of KSM pages is now safe: remove the PageKsm restrictions from mempolicy.c and migrate.c. But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which are irrelevant to KSM: it looks as if that code was preventing hotremove migration of KSM pages, unless they happened to be in swapcache. There is some question as to whether enforcing a NUMA mempolicy migration ought to migrate KSM pages, mapped into entirely unrelated processes; but moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway, and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on any area where this is a worry. Signed-off-by: Hugh Dickins --- mm/mempolicy.c | 3 +-- mm/migrate.c | 21 +++------------------ 2 files changed, 4 insertions(+), 20 deletions(-) --- mmotm.orig/mm/mempolicy.c 2013-01-24 12:28:38.848127553 -0800 +++ mmotm/mm/mempolicy.c 2013-01-25 14:38:49.596208731 -0800 @@ -496,9 +496,8 @@ static int check_pte_range(struct vm_are /* * vm_normal_page() filters out zero pages, but there might * still be PageReserved pages to skip, perhaps in a VDSO. - * And we cannot move PageKsm pages sensibly or safely yet. 
*/ - if (PageReserved(page) || PageKsm(page)) + if (PageReserved(page)) continue; nid = page_to_nid(page); if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) --- mmotm.orig/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 +++ mmotm/mm/migrate.c 2013-01-25 14:38:49.596208731 -0800 @@ -731,20 +731,6 @@ static int __unmap_and_move(struct page lock_page(page); } - /* - * Only memory hotplug's offline_pages() caller has locked out KSM, - * and can safely migrate a KSM page. The other cases have skipped - * PageKsm along with PageReserved - but it is only now when we have - * the page lock that we can be certain it will not go KSM beneath us - * (KSM will not upgrade a page from PageAnon to PageKsm when it sees - * its pagecount raised, but only here do we take the page lock which - * serializes that). - */ - if (PageKsm(page) && !offlining) { - rc = -EBUSY; - goto unlock; - } - /* charge against new page */ mem_cgroup_prepare_migration(page, newpage, &mem); @@ -771,7 +757,7 @@ static int __unmap_and_move(struct page * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { + if (PageAnon(page) && !PageKsm(page)) { /* * Only page_lock_anon_vma_read() understands the subtleties of * getting a hold on an anon_vma from outside one of its mms. 
@@ -851,7 +837,6 @@ uncharge: mem_cgroup_end_migration(mem, page, newpage, (rc == MIGRATEPAGE_SUCCESS || rc == MIGRATEPAGE_BALLOON_SUCCESS)); -unlock: unlock_page(page); out: return rc; @@ -1156,7 +1141,7 @@ static int do_move_page_to_node_array(st goto set_status; /* Use PageReserved to check for zero page */ - if (PageReserved(page) || PageKsm(page)) + if (PageReserved(page)) goto put_and_set; pp->page = page; @@ -1318,7 +1303,7 @@ static void do_pages_stat_array(struct m err = -ENOENT; /* Use PageReserved to check for zero page */ - if (!page || PageReserved(page) || PageKsm(page)) + if (!page || PageReserved(page)) goto set_status; err = page_to_nid(page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx190.postini.com [74.125.245.190]) by kanga.kvack.org (Postfix) with SMTP id AE7F16B0005 for ; Fri, 25 Jan 2013 21:07:49 -0500 (EST) Received: by mail-da0-f42.google.com with SMTP id z17so434409dal.15 for ; Fri, 25 Jan 2013 18:07:48 -0800 (PST) Date: Fri, 25 Jan 2013 18:07:51 -0800 (PST) From: Hugh Dickins Subject: [PATCH 10/11] mm: remove offlining arg to migrate_pages In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org No functional change, but the only purpose of the offlining argument to migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a KSM page for memory hotremove (which took ksm_thread_mutex) but not for other callers. Now all cases are safe, remove the arg. 
Signed-off-by: Hugh Dickins --- include/linux/migrate.h | 14 ++++++-------- mm/compaction.c | 2 +- mm/memory-failure.c | 7 +++---- mm/memory_hotplug.c | 3 +-- mm/mempolicy.c | 8 +++----- mm/migrate.c | 35 +++++++++++++---------------------- mm/page_alloc.c | 6 ++---- 7 files changed, 29 insertions(+), 46 deletions(-) --- mmotm.orig/include/linux/migrate.h 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/include/linux/migrate.h 2013-01-25 14:38:51.468208776 -0800 @@ -40,11 +40,9 @@ extern void putback_movable_pages(struct extern int migrate_page(struct address_space *, struct page *, struct page *, enum migrate_mode); extern int migrate_pages(struct list_head *l, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode, int reason); + unsigned long private, enum migrate_mode mode, int reason); extern int migrate_huge_page(struct page *, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode); + unsigned long private, enum migrate_mode mode); extern int fail_migrate_page(struct address_space *, struct page *, struct page *); @@ -62,11 +60,11 @@ extern int migrate_huge_page_move_mappin static inline void putback_lru_pages(struct list_head *l) {} static inline void putback_movable_pages(struct list_head *l) {} static inline int migrate_pages(struct list_head *l, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode, int reason) { return -ENOSYS; } + unsigned long private, enum migrate_mode mode, int reason) + { return -ENOSYS; } static inline int migrate_huge_page(struct page *page, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode) { return -ENOSYS; } + unsigned long private, enum migrate_mode mode) + { return -ENOSYS; } static inline int migrate_prep(void) { return -ENOSYS; } static inline int migrate_prep_local(void) { return -ENOSYS; } --- mmotm.orig/mm/compaction.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/compaction.c 2013-01-25 14:38:51.472208776 -0800 @@ 
-980,7 +980,7 @@ static int compact_zone(struct zone *zon nr_migrate = cc->nr_migratepages; err = migrate_pages(&cc->migratepages, compaction_alloc, - (unsigned long)cc, false, + (unsigned long)cc, cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC, MR_COMPACTION); update_nr_listpages(cc); --- mmotm.orig/mm/memory-failure.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/memory-failure.c 2013-01-25 14:38:51.472208776 -0800 @@ -1432,7 +1432,7 @@ static int soft_offline_huge_page(struct goto done; /* Keep page count to indicate a given hugepage is isolated. */ - ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false, + ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, MIGRATE_SYNC); put_page(hpage); if (ret) { @@ -1564,11 +1564,10 @@ int soft_offline_page(struct page *page, if (!ret) { LIST_HEAD(pagelist); inc_zone_page_state(page, NR_ISOLATED_ANON + - page_is_file_cache(page)); + page_is_file_cache(page)); list_add(&page->lru, &pagelist); ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, - false, MIGRATE_SYNC, - MR_MEMORY_FAILURE); + MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { putback_lru_pages(&pagelist); pr_info("soft offline: %#lx: migration failed %d, type %lx\n", --- mmotm.orig/mm/memory_hotplug.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/memory_hotplug.c 2013-01-25 14:38:51.472208776 -0800 @@ -1283,8 +1283,7 @@ do_migrate_range(unsigned long start_pfn * migrate_pages returns # of failed pages. 
*/ ret = migrate_pages(&source, alloc_migrate_target, 0, - true, MIGRATE_SYNC, - MR_MEMORY_HOTPLUG); + MIGRATE_SYNC, MR_MEMORY_HOTPLUG); if (ret) putback_lru_pages(&source); } --- mmotm.orig/mm/mempolicy.c 2013-01-25 14:38:49.596208731 -0800 +++ mmotm/mm/mempolicy.c 2013-01-25 14:38:51.472208776 -0800 @@ -1014,8 +1014,7 @@ static int migrate_to_node(struct mm_str if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, new_node_page, dest, - false, MIGRATE_SYNC, - MR_SYSCALL); + MIGRATE_SYNC, MR_SYSCALL); if (err) putback_lru_pages(&pagelist); } @@ -1259,9 +1258,8 @@ static long do_mbind(unsigned long start if (!list_empty(&pagelist)) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(&pagelist, new_vma_page, - (unsigned long)vma, - false, MIGRATE_SYNC, - MR_MEMPOLICY_MBIND); + (unsigned long)vma, + MIGRATE_SYNC, MR_MEMPOLICY_MBIND); if (nr_failed) putback_lru_pages(&pagelist); } --- mmotm.orig/mm/migrate.c 2013-01-25 14:38:49.596208731 -0800 +++ mmotm/mm/migrate.c 2013-01-25 14:38:51.476208776 -0800 @@ -701,7 +701,7 @@ static int move_to_new_page(struct page } static int __unmap_and_move(struct page *page, struct page *newpage, - int force, bool offlining, enum migrate_mode mode) + int force, enum migrate_mode mode) { int rc = -EAGAIN; int remap_swapcache = 1; @@ -847,8 +847,7 @@ out: * to the newly allocated page in newpage. 
*/ static int unmap_and_move(new_page_t get_new_page, unsigned long private, - struct page *page, int force, bool offlining, - enum migrate_mode mode) + struct page *page, int force, enum migrate_mode mode) { int rc = 0; int *result = NULL; @@ -866,7 +865,7 @@ static int unmap_and_move(new_page_t get if (unlikely(split_huge_page(page))) goto out; - rc = __unmap_and_move(page, newpage, force, offlining, mode); + rc = __unmap_and_move(page, newpage, force, mode); if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) { /* @@ -927,8 +926,7 @@ out: */ static int unmap_and_move_huge_page(new_page_t get_new_page, unsigned long private, struct page *hpage, - int force, bool offlining, - enum migrate_mode mode) + int force, enum migrate_mode mode) { int rc = 0; int *result = NULL; @@ -990,9 +988,8 @@ out: * * Return: Number of pages not migrated or error code. */ -int migrate_pages(struct list_head *from, - new_page_t get_new_page, unsigned long private, bool offlining, - enum migrate_mode mode, int reason) +int migrate_pages(struct list_head *from, new_page_t get_new_page, + unsigned long private, enum migrate_mode mode, int reason) { int retry = 1; int nr_failed = 0; @@ -1013,8 +1010,7 @@ int migrate_pages(struct list_head *from cond_resched(); rc = unmap_and_move(get_new_page, private, - page, pass > 2, offlining, - mode); + page, pass > 2, mode); switch(rc) { case -ENOMEM: @@ -1047,15 +1043,13 @@ out: } int migrate_huge_page(struct page *hpage, new_page_t get_new_page, - unsigned long private, bool offlining, - enum migrate_mode mode) + unsigned long private, enum migrate_mode mode) { int pass, rc; for (pass = 0; pass < 10; pass++) { - rc = unmap_and_move_huge_page(get_new_page, - private, hpage, pass > 2, offlining, - mode); + rc = unmap_and_move_huge_page(get_new_page, private, + hpage, pass > 2, mode); switch (rc) { case -ENOMEM: goto out; @@ -1178,8 +1172,7 @@ set_status: err = 0; if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, new_page_node, - (unsigned 
long)pm, 0, MIGRATE_SYNC, - MR_SYSCALL); + (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL); if (err) putback_lru_pages(&pagelist); } @@ -1614,10 +1607,8 @@ int migrate_misplaced_page(struct page * goto out; list_add(&page->lru, &migratepages); - nr_remaining = migrate_pages(&migratepages, - alloc_misplaced_dst_page, - node, false, MIGRATE_ASYNC, - MR_NUMA_MISPLACED); + nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page, + node, MIGRATE_ASYNC, MR_NUMA_MISPLACED); if (nr_remaining) { putback_lru_pages(&migratepages); isolated = 0; --- mmotm.orig/mm/page_alloc.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/page_alloc.c 2013-01-25 14:38:51.476208776 -0800 @@ -6064,10 +6064,8 @@ static int __alloc_contig_migrate_range( &cc->migratepages); cc->nr_migratepages -= nr_reclaimed; - ret = migrate_pages(&cc->migratepages, - alloc_migrate_target, - 0, false, MIGRATE_SYNC, - MR_CMA); + ret = migrate_pages(&cc->migratepages, alloc_migrate_target, + 0, MIGRATE_SYNC, MR_CMA); } if (ret < 0) { putback_movable_pages(&cc->migratepages); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id BFDAB6B0005 for ; Fri, 25 Jan 2013 21:10:17 -0500 (EST) Received: by mail-da0-f49.google.com with SMTP id v40so434347dad.36 for ; Fri, 25 Jan 2013 18:10:17 -0800 (PST) Date: Fri, 25 Jan 2013 18:10:18 -0800 (PST) From: Hugh Dickins Subject: [PATCH 11/11] ksm: stop hotremove lockdep warning In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Gerald Schaefer , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org Complaints are rare, but lockdep still does not understand the way ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a problem because notifier callbacks are made under down_read of blocking_notifier_head->rwsem (so first the mutex is taken while holding the rwsem, then later the rwsem is taken while still holding the mutex); but is not in fact a problem because mem_hotplug_mutex is held throughout the dance. There was an attempt to fix this with mutex_lock_nested(); but if that happened to fool lockdep two years ago, apparently it does so no longer. I had hoped to eradicate this issue in extending KSM page migration not to need the ksm_thread_mutex. But then realized that although the page migration itself is safe, we do still need to lock out ksmd and other users of get_ksm_page() while offlining memory - at some point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish, and get_ksm_page()'s accesses to them become a violation. 
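For illustration only, the lockout described above (keep the scanner out for the whole window between the two notifier calls, without any mutex staying held across them) can be modelled in plain userspace C. This is a minimal sketch under stated assumptions, not the kernel code: struct scanner, model_lock(), model_unlock(), going_offline(), offline_done() and scanner_may_run() are hypothetical names, and lock_depth merely stands in for a mutex so that the example stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; names are illustrative, not kernel API. */
#define RUN_STOP    0UL
#define RUN_MERGE   1UL
#define RUN_UNMERGE 2UL
#define RUN_OFFLINE 4UL	/* lockout bit, kept apart from the mode values */

struct scanner {
	unsigned long run;	/* mode bits plus the lockout bit */
	int lock_depth;		/* stand-in for a mutex: 1 while "held" */
};

static void model_lock(struct scanner *s)
{
	assert(s->lock_depth == 0);
	s->lock_depth = 1;
}

static void model_unlock(struct scanner *s)
{
	assert(s->lock_depth == 1);
	s->lock_depth = 0;
}

/* First callback: record the lockout, then drop the lock before returning. */
static void going_offline(struct scanner *s)
{
	model_lock(s);
	s->run |= RUN_OFFLINE;
	model_unlock(s);
}

/* Second callback: clear the lockout; again no lock is held on return. */
static void offline_done(struct scanner *s)
{
	model_lock(s);
	s->run &= ~RUN_OFFLINE;
	model_unlock(s);
}

/* The scanner checks the bit under the lock and backs off while it is set. */
static bool scanner_may_run(struct scanner *s)
{
	bool ok;

	model_lock(s);
	ok = (s->run & RUN_MERGE) && !(s->run & RUN_OFFLINE);
	model_unlock(s);
	return ok;
}
```

In this model each callback takes and drops the lock on its own, so nothing is held between the two calls; the lockout is carried by the RUN_OFFLINE bit instead. A real implementation would also need a way for the scanner to sleep until the bit clears, which this sketch leaves out.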
So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() checks, to achieve the same lockout without being caught by lockdep. This is less elegant for KSM, but it's more important to keep lockdep useful to other users - and I apologize for how long it took to fix. Reported-by: Gerald Schaefer Signed-off-by: Hugh Dickins --- mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 41 insertions(+), 14 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:38:53.984208836 -0800 @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 #define KSM_RUN_UNMERGE 2 -static unsigned int ksm_run = KSM_RUN_STOP; +#define KSM_RUN_OFFLINE 4 +static unsigned long ksm_run = KSM_RUN_STOP; +static void wait_while_offlining(void); static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); static DEFINE_MUTEX(ksm_thread_mutex); @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing while (!kthread_should_stop()) { mutex_lock(&ksm_thread_mutex); + wait_while_offlining(); if (ksmd_should_run()) ksm_do_scan(ksm_thread_pages_to_scan); mutex_unlock(&ksm_thread_mutex); @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE +static int just_wait(void *word) +{ + schedule(); + return 0; +} + +static void wait_while_offlining(void) +{ + while (ksm_run & KSM_RUN_OFFLINE) { + mutex_unlock(&ksm_thread_mutex); + wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), + just_wait, TASK_UNINTERRUPTIBLE); + mutex_lock(&ksm_thread_mutex); + } +} + static void ksm_check_stable_tree(unsigned long start_pfn, unsigned long end_pfn) { @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no switch (action) { case MEM_GOING_OFFLINE: /* - * Keep it very simple for now: just lock out ksmd and - * MADV_UNMERGEABLE while any memory 
is going offline. - * mutex_lock_nested() is necessary because lockdep was alarmed - * that here we take ksm_thread_mutex inside notifier chain - * mutex, and later take notifier chain mutex inside - * ksm_thread_mutex to unlock it. But that's safe because both - * are inside mem_hotplug_mutex. + * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() + * and remove_all_stable_nodes() while memory is going offline: + * it is unsafe for them to touch the stable tree at this time. + * But unmerge_ksm_pages(), rmap lookups and other entry points + * which do not need the ksm_thread_mutex are all safe. */ - mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING); + mutex_lock(&ksm_thread_mutex); + ksm_run |= KSM_RUN_OFFLINE; + mutex_unlock(&ksm_thread_mutex); break; case MEM_OFFLINE: @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no /* fallthrough */ case MEM_CANCEL_OFFLINE: + mutex_lock(&ksm_thread_mutex); + ksm_run &= ~KSM_RUN_OFFLINE; mutex_unlock(&ksm_thread_mutex); + + smp_mb(); /* wake_up_bit advises this */ + wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); break; } return NOTIFY_OK; } +#else +static void wait_while_offlining(void) +{ +} #endif /* CONFIG_MEMORY_HOTREMOVE */ #ifdef CONFIG_SYSFS @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan); static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_run); + return sprintf(buf, "%lu\n", ksm_run); } static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject */ mutex_lock(&ksm_thread_mutex); + wait_while_offlining(); if (ksm_run != flags) { ksm_run = flags; if (flags & KSM_RUN_UNMERGE) { @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store( return -EINVAL; mutex_lock(&ksm_thread_mutex); + wait_while_offlining(); if (ksm_merge_across_nodes != knob) { if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; @@ -2366,10 +2396,7 @@ static int 
__init ksm_init(void) #endif /* CONFIG_SYSFS */ #ifdef CONFIG_MEMORY_HOTREMOVE - /* - * Choose a high priority since the callback takes ksm_thread_mutex: - * later callbacks could only be taking locks which nest within that. - */ + /* There is no significance to this priority 100 */ hotplug_memory_notifier(ksm_memory_callback, 100); #endif return 0; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 289C36B0005 for ; Sat, 26 Jan 2013 20:14:45 -0500 (EST) Received: by mail-ia0-f172.google.com with SMTP id u8so2625417iag.31 for ; Sat, 26 Jan 2013 17:14:44 -0800 (PST) Message-ID: <1359249282.4159.4.camel@kernel> Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node From: Simon Jeons Date: Sat, 26 Jan 2013 19:14:42 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Hi Hugh, On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote: > From: Petr Holasek > > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes > which control merging pages across different numa nodes. > When it is set to zero only pages from the same node are merged, > otherwise pages from all nodes can be merged together (default behavior). > > Typical use-case could be a lot of KVM guests on NUMA machine > and cpus from more distant nodes would have significant increase > of access latency to the merged ksm page.
Sysfs knob was chosen > for higher variability when some users still prefer a higher amount > of saved physical memory regardless of access latency. > > Every numa node has its own stable & unstable trees because of faster > searching and inserting. Changing of merge_across_nodes value is possible > only when there are not any ksm shared pages in system. > > I've tested this patch on numa machines with 2, 4 and 8 nodes and > measured speed of memory access inside of KVM guests with memory pinned > to one of nodes with this benchmark: > > http://pholasek.fedorapeople.org/alloc_pg.c > > Population standard deviations of access times in percentage of average > were as follows: > > merge_across_nodes=1 > 2 nodes 1.4% > 4 nodes 1.6% > 8 nodes 1.7% > > merge_across_nodes=0 > 2 nodes 1% > 4 nodes 0.32% > 8 nodes 0.018% > > RFC: https://lkml.org/lkml/2011/11/30/91 > v1: https://lkml.org/lkml/2012/1/23/46 > v2: https://lkml.org/lkml/2012/6/29/105 > v3: https://lkml.org/lkml/2012/9/14/550 > v4: https://lkml.org/lkml/2012/9/23/137 > v5: https://lkml.org/lkml/2012/12/10/540 > v6: https://lkml.org/lkml/2012/12/23/154 > v7: https://lkml.org/lkml/2012/12/27/225 > > Hugh notes that this patch brings two problems, whose solution needs > further support in mm/ksm.c, which follows in subsequent patches: > 1) switching merge_across_nodes after running KSM is liable to oops > on stale nodes still left over from the previous stable tree; > 2) memory hotremove may migrate KSM pages, but there is no provision > here for !merge_across_nodes to migrate nodes to the proper tree.
> > Signed-off-by: Petr Holasek > Signed-off-by: Hugh Dickins > Acked-by: Rik van Riel > --- > Documentation/vm/ksm.txt | 7 + > mm/ksm.c | 151 ++++++++++++++++++++++++++++++++----- > 2 files changed, 139 insertions(+), 19 deletions(-) > > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > Default: 20 (chosen for demonstration purposes) > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > + When set to 0, ksm merges only pages which physically > + reside in the memory area of same NUMA node. It brings > + lower latency to access to shared page. Value can be > + changed only when there is no ksm shared pages in system. > + Default: 1 > + > run - set 0 to stop ksmd from running but keep merged pages, > set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", > set 2 to stop ksmd and unmerge all pages currently merged, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 > @@ -36,6 +36,7 @@ > #include > #include > #include > +#include > > #include > #include "internal.h" > @@ -139,6 +140,9 @@ struct rmap_item { > struct mm_struct *mm; > unsigned long address; /* + low bits used for flags below */ > unsigned int oldchecksum; /* when unstable */ > +#ifdef CONFIG_NUMA > + unsigned int nid; > +#endif > union { > struct rb_node node; /* when node of unstable tree */ > struct { /* when listed from stable tree */ > @@ -153,8 +157,8 @@ struct rmap_item { > #define STABLE_FLAG 0x200 /* is listed from the stable tree */ > > /* The stable and unstable tree heads */ > -static struct rb_root root_stable_tree = RB_ROOT; > -static struct rb_root root_unstable_tree = RB_ROOT; > +static struct rb_root root_unstable_tree[MAX_NUMNODES]; > +static struct rb_root 
root_stable_tree[MAX_NUMNODES]; > > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ > /* Milliseconds ksmd should sleep between batches */ > static unsigned int ksm_thread_sleep_millisecs = 20; > > +/* Zeroed when merging across nodes is not allowed */ > +static unsigned int ksm_merge_across_nodes = 1; > + > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > @@ -441,10 +448,25 @@ out: page = NULL; > return page; > } > > +/* > + * This helper is used for getting right index into array of tree roots. > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for > + * stable and unstable pages from all nodes with roots in index 0. Otherwise, > + * every node has its own stable and unstable tree. > + */ > +static inline int get_kpfn_nid(unsigned long kpfn) > +{ > + if (ksm_merge_across_nodes) > + return 0; > + else > + return pfn_to_nid(kpfn); > +} > + > static void remove_node_from_stable_tree(struct stable_node *stable_node) > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > + int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - rb_erase(&stable_node->node, &root_stable_tree); > + nid = get_kpfn_nid(stable_node->kpfn); > + > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > free_stable_node(stable_node); > } > > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s > age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); > BUG_ON(age > 1); > if (!age) > - rb_erase(&rmap_item->node, &root_unstable_tree); > +#ifdef CONFIG_NUMA > + rb_erase(&rmap_item->node, > + &root_unstable_tree[rmap_item->nid]); > +#else > + rb_erase(&rmap_item->node, &root_unstable_tree[0]); > +#endif > > ksm_pages_unshared--; > rmap_item->address &= PAGE_MASK; > @@ 
-990,8 +1019,9 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node = root_stable_tree.rb_node; > + struct rb_node *node; > struct stable_node *stable_node; > + int nid; > > stable_node = page_stable_node(page); > if (stable_node) { /* ksm page forked */ > @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s > return page; > } > > + nid = get_kpfn_nid(page_to_pfn(page)); > + node = root_stable_tree[nid].rb_node; > + > while (node) { > struct page *tree_page; > int ret; > @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s > */ > static struct stable_node *stable_tree_insert(struct page *kpage) > { > - struct rb_node **new = &root_stable_tree.rb_node; > + int nid; > + unsigned long kpfn; > + struct rb_node **new; > struct rb_node *parent = NULL; > struct stable_node *stable_node; > > + kpfn = page_to_pfn(kpage); > + nid = get_kpfn_nid(kpfn); > + new = &root_stable_tree[nid].rb_node; > + > while (*new) { > struct page *tree_page; > int ret; > @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i > return NULL; > > rb_link_node(&stable_node->node, parent, new); > - rb_insert_color(&stable_node->node, &root_stable_tree); > + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > INIT_HLIST_HEAD(&stable_node->hlist); > > - stable_node->kpfn = page_to_pfn(kpage); > + stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > > return stable_node; > @@ -1098,10 +1137,15 @@ static > struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, > struct page *page, > struct page **tree_pagep) > - > { > - struct rb_node **new = &root_unstable_tree.rb_node; > + struct rb_node **new; > + struct rb_root *root; > struct rb_node *parent = NULL; > + int nid; > + > + nid = get_kpfn_nid(page_to_pfn(page)); > + root = &root_unstable_tree[nid]; > + new = &root->rb_node; > > while (*new) { > struct rmap_item *tree_rmap_item; > @@ 
-1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > return NULL; > } > > + /* > + * If tree_page has been migrated to another NUMA node, it > + * will be flushed out and put into the right unstable tree Then why does current upstream not insert the new page into the unstable tree during page migration? Is it because the default behavior is to merge across nodes? > + * next time: only merge with it if merge_across_nodes. > + * Just notice, we don't have similar problem for PageKsm > + * because their migration is disabled now. (62b61f611e) > + */ > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { > + put_page(tree_page); > + return NULL; > + } > + > ret = memcmp_pages(page, tree_page); > > parent = *new; > @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i > > rmap_item->address |= UNSTABLE_FLAG; > rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); > +#ifdef CONFIG_NUMA > + rmap_item->nid = nid; > +#endif > rb_link_node(&rmap_item->node, parent, new); > - rb_insert_color(&rmap_item->node, &root_unstable_tree); > + rb_insert_color(&rmap_item->node, root); > > ksm_pages_unshared++; > return NULL; > @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > +#ifdef CONFIG_NUMA > + /* > + * Usually rmap_item->nid is already set correctly, > + * but it may be wrong after switching merge_across_nodes.
> + */ > + rmap_item->nid = get_kpfn_nid(stable_node->kpfn); > +#endif > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r > struct mm_slot *slot; > struct vm_area_struct *vma; > struct rmap_item *rmap_item; > + int nid; > > if (list_empty(&ksm_mm_head.mm_list)) > return NULL; > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > - root_unstable_tree = RB_ROOT; > + for (nid = 0; nid < nr_node_ids; nid++) > + root_unstable_tree[nid] = RB_ROOT; > > spin_lock(&ksm_mmlist_lock); > slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list); > @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta > unsigned long end_pfn) > { > struct rb_node *node; > + int nid; > > - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) { > - struct stable_node *stable_node; > + for (nid = 0; nid < nr_node_ids; nid++) > + for (node = rb_first(&root_stable_tree[nid]); node; > + node = rb_next(node)) { > + struct stable_node *stable_node; > + > + stable_node = rb_entry(node, struct stable_node, node); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + return stable_node; > + } > > - stable_node = rb_entry(node, struct stable_node, node); > - if (stable_node->kpfn >= start_pfn && > - stable_node->kpfn < end_pfn) > - return stable_node; > - } > return NULL; > } > > @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject > } > KSM_ATTR(run); > > +#ifdef CONFIG_NUMA > +static ssize_t merge_across_nodes_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return sprintf(buf, "%u\n", ksm_merge_across_nodes); > +} > + > +static ssize_t merge_across_nodes_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + int err; > + unsigned long knob; > + > + err = kstrtoul(buf, 10, 
&knob); > + if (err) > + return err; > + if (knob > 1) > + return -EINVAL; > + > + mutex_lock(&ksm_thread_mutex); > + if (ksm_merge_across_nodes != knob) { > + if (ksm_pages_shared) > + err = -EBUSY; > + else > + ksm_merge_across_nodes = knob; > + } > + mutex_unlock(&ksm_thread_mutex); > + > + return err ? err : count; > +} > +KSM_ATTR(merge_across_nodes); > +#endif > + > static ssize_t pages_shared_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = { > &pages_unshared_attr.attr, > &pages_volatile_attr.attr, > &full_scans_attr.attr, > +#ifdef CONFIG_NUMA > + &merge_across_nodes_attr.attr, > +#endif > NULL, > }; > > @@ -1992,11 +2101,15 @@ static int __init ksm_init(void) > { > struct task_struct *ksm_thread; > int err; > + int nid; > > err = ksm_slab_init(); > if (err) > goto out; > > + for (nid = 0; nid < nr_node_ids; nid++) > + root_stable_tree[nid] = RB_ROOT; > + > ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); > if (IS_ERR(ksm_thread)) { > printk(KERN_ERR "ksm: creating kthread failed\n"); > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id C8A636B0005 for ; Sat, 26 Jan 2013 21:36:29 -0500 (EST) Received: by mail-da0-f44.google.com with SMTP id z20so728274dae.3 for ; Sat, 26 Jan 2013 18:36:29 -0800 (PST) Message-ID: <1359254187.4159.10.camel@kernel> Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked From: Simon Jeons Date: Sat, 26 Jan 2013 20:36:27 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Hi Hugh, On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > In some places where get_ksm_page() is used, we need the page to be locked. > In get_ksm_page(), why is the sequence check page->mapping => get_page_unless_zero() => check page->mapping, rather than just get_page_unless_zero() => check page->mapping? Is it because get_page_unless_zero() is expensive? > When KSM migration is fully enabled, we shall want that to make sure that > the page just acquired cannot be migrated beneath us (raised page count is > only effective when there is serialization to make sure migration notices). > Whereas when navigating through the stable tree, we certainly do not want What does "navigating through the stable tree" mean? > to lock each node (raised page count is enough to guarantee the memcmps, > even if page is migrated to another node). > > Since we're about to add another use case, add the locked argument to > get_ksm_page() now. Why is the locked argument false when passed from stable_tree_search/insert, but true from remove_rmap_item_from_tree? > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > really got the wrong end of the stick on that!
There's a configuration > in which page_cache_get_speculative() can do something cheaper than > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > disabled preemption for it. There's no need for rcu_read_lock() around > get_page_unless_zero() (and mapping checks) here. Cut out that > silliness before making this any harder to understand. > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 23 +++++++++++++---------- > 1 file changed, 13 insertions(+), 10 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > * but this is different - made simpler by ksm_thread_mutex being held, but > * interesting for assuming that no other use of the struct page could ever > * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > - * to keep the page_count protocol described with page_cache_get_speculative. > + * coincides with page->mapping). > * > * Note: it is possible that get_ksm_page() will return NULL one moment, > * then page the next, if the page is in between page_freeze_refs() and > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. 
> */ > -static struct page *get_ksm_page(struct stable_node *stable_node) > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct > page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - rcu_read_lock(); > if (page->mapping != expected_mapping) > goto stale; > if (!get_page_unless_zero(page)) > @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct > put_page(page); > goto stale; > } > - rcu_read_unlock(); > + if (locked) { > + lock_page(page); > + if (page->mapping != expected_mapping) { > + unlock_page(page); > + put_page(page); > + goto stale; > + } > + } > return page; > stale: > - rcu_read_unlock(); > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s > struct page *page; > > stable_node = rmap_item->head; > - page = get_ksm_page(stable_node); > + page = get_ksm_page(stable_node, true); > if (!page) > goto out; > > - lock_page(page); > hlist_del(&rmap_item->hlist); > unlock_page(page); > put_page(page); > @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s > > cond_resched(); > stable_node = rb_entry(node, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i > > cond_resched(); > stable_node = rb_entry(*new, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id 7E1446B0005 for ; Sat, 26 Jan 2013 21:48:49 -0500 (EST) Received: by mail-da0-f52.google.com with SMTP id f10so735586dak.39 for ; Sat, 26 Jan 2013 18:48:48 -0800 (PST) Message-ID: <1359254927.4159.11.camel@kernel> Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked From: Simon Jeons Date: Sat, 26 Jan 2013 20:48:47 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > In some places where get_ksm_page() is used, we need the page to be locked. > > When KSM migration is fully enabled, we shall want that to make sure that > the page just acquired cannot be migrated beneath us (raised page count is > only effective when there is serialization to make sure migration notices). > Whereas when navigating through the stable tree, we certainly do not want > to lock each node (raised page count is enough to guarantee the memcmps, > even if page is migrated to another node). > > Since we're about to add another use case, add the locked argument to > get_ksm_page() now. > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > really got the wrong end of the stick on that! There's a configuration > in which page_cache_get_speculative() can do something cheaper than > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > disabled preemption for it. 
There's no need for rcu_read_lock() around > get_page_unless_zero() (and mapping checks) here. Cut out that > silliness before making this any harder to understand. BTW, what does "ksm page forked" mean? > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 23 +++++++++++++---------- > 1 file changed, 13 insertions(+), 10 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > * but this is different - made simpler by ksm_thread_mutex being held, but > * interesting for assuming that no other use of the struct page could ever > * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > - * to keep the page_count protocol described with page_cache_get_speculative. > + * coincides with page->mapping). > * > * Note: it is possible that get_ksm_page() will return NULL one moment, > * then page the next, if the page is in between page_freeze_refs() and > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind.
> */ > -static struct page *get_ksm_page(struct stable_node *stable_node) > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct > page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - rcu_read_lock(); > if (page->mapping != expected_mapping) > goto stale; > if (!get_page_unless_zero(page)) > @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct > put_page(page); > goto stale; > } > - rcu_read_unlock(); > + if (locked) { > + lock_page(page); > + if (page->mapping != expected_mapping) { > + unlock_page(page); > + put_page(page); > + goto stale; > + } > + } > return page; > stale: > - rcu_read_unlock(); > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s > struct page *page; > > stable_node = rmap_item->head; > - page = get_ksm_page(stable_node); > + page = get_ksm_page(stable_node, true); > if (!page) > goto out; > > - lock_page(page); > hlist_del(&rmap_item->hlist); > unlock_page(page); > put_page(page); > @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s > > cond_resched(); > stable_node = rb_entry(node, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i > > cond_resched(); > stable_node = rb_entry(*new, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118]) by kanga.kvack.org (Postfix) with SMTP id BED716B0005 for ; Sat, 26 Jan 2013 21:54:41 -0500 (EST) Received: by mail-da0-f42.google.com with SMTP id z17so733960dal.15 for ; Sat, 26 Jan 2013 18:54:40 -0800 (PST) Date: Sat, 26 Jan 2013 18:54:36 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <1359249282.4159.4.camel@kernel> Message-ID: References: <1359249282.4159.4.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote: > > From: Petr Holasek > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > > return NULL; > > } > > > > + /* > > + * If tree_page has been migrated to another NUMA node, it > > + * will be flushed out and put into the right unstable tree > > Then why not insert the new page to unstable tree during page migration > against current upstream? Because default behavior is merge across > nodes. I don't understand the words "against current upstream" in your question. We cannot move a page (strictly, a node) from one tree to another during page migration itself, because the necessary ksm_thread_mutex is not held. Nor would we even want to while "merge across nodes".
Ah, perhaps you are pointing out that in current upstream, the only user of ksm page migration is memory hotremove, which in current upstream does hold ksm_thread_mutex. So you'd like us to add code for moving a node from one tree to another in ksm_migrate_page() (and what would it do when it collides with an existing node?), code which will then be removed a few patches later when ksm page migration is fully enabled? No, I'm not going to put any more thought into that. When Andrea pointed out the problem with Petr's original change to ksm_migrate_page(), I did indeed think that we could do something cleverer at that point; but once I got down to trying it, found that a dead end. I wasn't going to be able to test the hotremove case properly anyway, so no good pursuing solutions that couldn't be generalized. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id 1BD2B6B0005 for ; Sat, 26 Jan 2013 22:16:25 -0500 (EST) Received: by mail-da0-f53.google.com with SMTP id x6so735065dac.26 for ; Sat, 26 Jan 2013 19:16:24 -0800 (PST) Message-ID: <1359256581.4159.16.camel@kernel> Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node From: Simon Jeons Date: Sat, 26 Jan 2013 21:16:21 -0600 In-Reply-To: References: <1359249282.4159.4.camel@kernel> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote: > On Sat, 26 Jan 2013, Simon Jeons wrote: > > On Fri, 2013-01-25 at 
17:54 -0800, Hugh Dickins wrote: > > > From: Petr Holasek > > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > > > return NULL; > > > } > > > > > > + /* > > > + * If tree_page has been migrated to another NUMA node, it > > > + * will be flushed out and put into the right unstable tree > > > > Then why not insert the new page to unstable tree during page migration > > against current upstream? Because default behavior is merge across > > nodes. > > I don't understand the words "against current upstream" in your question. I mean the current upstream code, without NUMA awareness. :) > > We cannot move a page (strictly, a node) from one tree to another during > page migration itself, because the necessary ksm_thread_mutex is not held. > Nor would we even want to while "merge across nodes". > > Ah, perhaps you are pointing out that in current upstream, the only user > of ksm page migration is memory hotremove, which in current upstream does > hold ksm_thread_mutex. > > So you'd like us to add code for moving a node from one tree to another > in ksm_migrate_page() (and what would it do when it collides with an Without NUMA awareness, I still can't understand from your explanation why the node can't be inserted into the tree just after page migration, instead of being inserted at the next scan. > existing node?), code which will then be removed a few patches later > when ksm page migration is fully enabled? > > No, I'm not going to put any more thought into that. When Andrea pointed > out the problem with Petr's original change to ksm_migrate_page(), I did > indeed think that we could do something cleverer at that point; but once > I got down to trying it, found that a dead end. I wasn't going to be > able to test the hotremove case properly anyway, so no good pursuing > solutions that couldn't be generalized. > > Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id B841B6B0005 for ; Sat, 26 Jan 2013 23:55:58 -0500 (EST) Received: by mail-da0-f42.google.com with SMTP id z17so758115dal.29 for ; Sat, 26 Jan 2013 20:55:57 -0800 (PST) Message-ID: <1359262556.4159.23.camel@kernel> Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly From: Simon Jeons Date: Sat, 26 Jan 2013 23:55:56 -0500 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Hi Hugh, On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > Switching merge_across_nodes after running KSM is liable to oops on stale > nodes still left over from the previous stable tree. It's not something > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. > When can this happen? 
> Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. Forked mms will be unmerged just after ksmd's cursor, since they're inserted behind it; why would they be missed? > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. > Makes sense. :) > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error.
> > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? > */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through 
the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /* > + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. > + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ > + } > + cond_resched(); > + } > + } > + return err; > +} > + > static int unmerge_and_remove_all_rmap_items(void) > { > struct mm_slot *mm_slot; > @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i > } > } > > + /* Clean up stable nodes, but don't worry if some are still busy */ > + remove_all_stable_nodes(); > ksm_scan.seqnr = 0; > return 0; > > @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) > spin_lock(&ksm_mmlist_lock); > insert_to_mm_slots_hash(mm, mm_slot); > /* > - * Insert just behind the scanning cursor, to let the area settle > + * When KSM_RUN_MERGE (or KSM_RUN_STOP), > + * insert just behind the scanning cursor, to let the area settle > * down a little; when fork is followed by immediate exec, we don't > * want ksmd to waste time setting up and tearing down an rmap_list. 
> + * > + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its > + * scanning cursor, otherwise KSM pages in newly forked mms will be > + * missed: then we might as well insert at the end of the list. > */ > - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > + if (ksm_run & KSM_RUN_UNMERGE) > + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); > + else > + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > spin_unlock(&ksm_mmlist_lock); > > set_bit(MMF_VM_MERGEABLE, &mm->flags); > @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) > } > } > > -struct page *ksm_does_need_to_copy(struct page *page, > +struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > + struct anon_vma *anon_vma = page_anon_vma(page); > struct page *new_page; > > + if (PageKsm(page)) { > + if (page_stable_node(page) && > + !(ksm_run & KSM_RUN_UNMERGE)) > + return page; /* no need to copy it */ > + } else if (!anon_vma) { > + return page; /* no need to copy it */ > + } else if (anon_vma->root == vma->anon_vma->root && > + page->index == linear_page_index(vma, address)) { > + return page; /* still no need to copy it */ > + } > + if (!PageUptodate(page)) > + return page; /* let do_swap_page report the error */ > + > new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); > if (new_page) { > copy_user_highpage(new_page, page, address, vma); > @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( > > mutex_lock(&ksm_thread_mutex); > if (ksm_merge_across_nodes != knob) { > - if (ksm_pages_shared) > + if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > else > ksm_merge_across_nodes = knob; > --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 > @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > goto 
out_page; > > - if (ksm_might_need_to_copy(page, vma, address)) { > - swapcache = page; > - page = ksm_does_need_to_copy(page, vma, address); > - > - if (unlikely(!page)) { > - ret = VM_FAULT_OOM; > - page = swapcache; > - swapcache = NULL; > - goto out_page; > - } > + swapcache = page; > + page = ksm_might_need_to_copy(page, vma, address); > + if (unlikely(!page)) { > + ret = VM_FAULT_OOM; > + page = swapcache; > + swapcache = NULL; > + goto out_page; > } > + if (page == swapcache) > + swapcache = NULL; > > if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > ret = VM_FAULT_OOM; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id 47CDB6B0005 for ; Sun, 27 Jan 2013 00:47:18 -0500 (EST) Received: by mail-da0-f48.google.com with SMTP id k18so771061dae.7 for ; Sat, 26 Jan 2013 21:47:17 -0800 (PST) Message-ID: <1359265635.6763.0.camel@kernel> Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible From: Simon Jeons Date: Sat, 26 Jan 2013 23:47:15 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote: > KSM page migration is already supported in the case of memory hotremove, > which takes the ksm_thread_mutex across all its migrations to keep life > simple. 
> > But the new KSM NUMA merge_across_nodes knob introduces a problem, when > it's set to non-default 0: if a KSM page is migrated to a different NUMA > node, how do we migrate its stable node to the right tree? And what if > that collides with an existing stable node? > > So far there's no provision for that, and this patch does not attempt > to deal with it either. But how will I test a solution, when I don't > know how to hotremove memory? The best answer is to enable KSM page > migration in all cases now, and test more common cases. With THP and > compaction added since KSM came in, page migration is now mainstream, > and it's a shame that a KSM page can frustrate freeing a page block. > > Without worrying about merge_across_nodes 0 for now, this patch gets > KSM page migration working reliably for default merge_across_nodes 1 > (but leave the patch enabling it until near the end of the series). > > It's much simpler than I'd originally imagined, and does not require > an additional tier of locking: page migration relies on the page lock, > KSM page reclaim relies on the page lock, the page lock is enough for > KSM page migration too. > > Almost all the care has to be in get_ksm_page(): that's the function > which worries about when a stable node is stale and should be freed, > now it also has to worry about the KSM page being migrated. > > The only new overhead is an additional put/get/lock/unlock_page when > stable_tree_search() arrives at a matching node: to make sure migration > respects the raised page count, and so does not migrate the page while > we're busy with it here. That's probably avoidable, either by changing > internal interfaces from using kpage to stable_node, or by moving the > ksm_migrate_page() callsite into a page_freeze_refs() section (even if > not swapcache); but this works well, I've no urge to pull it apart now. 
> > (Descents of the stable tree may pass through nodes whose KSM pages are > under migration: being unlocked, the raised page count does not prevent > that, nor need it: it's safe to memcmp against either old or new page.) > > You might worry about mremap, and whether page migration's rmap_walk > to remove migration entries will find all the KSM locations where it > inserted earlier: that should already be handled, by the satisfyingly > heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++----------- > mm/migrate.c | 5 ++ > 2 files changed, 77 insertions(+), 22 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree > * In which case we can trust the content of the page, and it > * returns the gotten page; but if the page has now been zapped, > * remove the stale node from the stable tree and return NULL. > + * But beware, the stable node's page might be being migrated. > * > * You would expect the stable_node to hold a reference to the ksm page. > * But if it increments the page's count, swapping out has to wait for > @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree > * pointing back to this stable node. This relies on freeing a PageAnon > * page to reset its page->mapping to NULL, and relies on no other use of > * a page to put something that might look like our key in page->mapping. > - * > - * include/linux/pagemap.h page_cache_get_speculative() is a good reference, > - * but this is different - made simpler by ksm_thread_mutex being held, but > - * interesting for assuming that no other use of the struct page could ever > - * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). 
> - * > - * Note: it is possible that get_ksm_page() will return NULL one moment, > - * then page the next, if the page is in between page_freeze_refs() and > - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. > */ > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > + unsigned long kpfn; > > - page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - if (page->mapping != expected_mapping) > - goto stale; > - if (!get_page_unless_zero(page)) > +again: > + kpfn = ACCESS_ONCE(stable_node->kpfn); > + page = pfn_to_page(kpfn); > + > + /* > + * page is computed from kpfn, so on most architectures reading > + * page->mapping is naturally ordered after reading node->kpfn, > + * but on Alpha we need to be more careful. > + */ > + smp_read_barrier_depends(); > + if (ACCESS_ONCE(page->mapping) != expected_mapping) > goto stale; > - if (page->mapping != expected_mapping) { > + > + /* > + * We cannot do anything with the page while its refcount is 0. > + * Usually 0 means free, or tail of a higher-order page: in which > + * case this node is no longer referenced, and should be freed; > + * however, it might mean that the page is under page_freeze_refs(). > + * The __remove_mapping() case is easy, again the node is now stale; > + * but if page is swapcache in migrate_page_move_mapping(), it might > + * still be our page, in which case it's essential to keep the node. > + */ > + while (!get_page_unless_zero(page)) { > + /* > + * Another check for page->mapping != expected_mapping would > + * work here too. 
We have chosen the !PageSwapCache test to > + * optimize the common case, when the page is or is about to > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > + * in the freeze_refs section of __remove_mapping(); but Anon > + * page->mapping reset to NULL later, in free_pages_prepare(). > + */ > + if (!PageSwapCache(page)) > + goto stale; > + cpu_relax(); > + } > + > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > put_page(page); > goto stale; > } > + > if (locked) { > lock_page(page); > - if (page->mapping != expected_mapping) { > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > unlock_page(page); > put_page(page); > goto stale; > } > } Could you explain why need check page->mapping twice after get page? > return page; > + > stale: > + /* > + * We come here from above when page->mapping or !PageSwapCache > + * suggests that the node is stale; but it might be under migration. > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), > + * before checking whether node->kpfn has been changed. > + */ > + smp_rmb(); > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) > + goto again; > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s > return NULL; > > ret = memcmp_pages(page, tree_page); > + put_page(tree_page); > > - if (ret < 0) { > - put_page(tree_page); > + if (ret < 0) > node = node->rb_left; > - } else if (ret > 0) { > - put_page(tree_page); > + else if (ret > 0) > node = node->rb_right; > - } else > + else { > + /* > + * Lock and unlock the stable_node's page (which > + * might already have been migrated) so that page > + * migration is sure to notice its raised count. > + * It would be more elegant to return stable_node > + * than kpage, but that involves more changes. 
> + */ > + tree_page = get_ksm_page(stable_node, true); > + if (tree_page) > + unlock_page(tree_page); > return tree_page; > + } > } > > return NULL; > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa > if (stable_node) { > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); > stable_node->kpfn = page_to_pfn(newpage); > + /* > + * newpage->mapping was set in advance; now we need smp_wmb() > + * to make sure that the new stable_node->kpfn is visible > + * to get_ksm_page() before it can see that oldpage->mapping > + * has gone stale (or that PageSwapCache has been cleared). > + */ > + smp_wmb(); > + set_page_stable_node(oldpage, NULL); > } > } > #endif /* CONFIG_MIGRATION */ > --- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800 > +++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 > @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp > > mlock_migrate_page(newpage, page); > ksm_migrate_page(newpage, page); > - > + /* > + * Please do not reorder this without considering how mm/ksm.c's > + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). > + */ > ClearPageSwapCache(page); > ClearPagePrivate(page); > set_page_private(page, 0);
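The smp_wmb() in ksm_migrate_page() and the smp_rmb() on the stale path of get_ksm_page() form a pair, and the quoted patch relies on that pairing to tell "page migrated" apart from "node really stale". A minimal userspace sketch of the idea follows, assuming C11 fences as stand-ins for the kernel barriers. The names page_sketch, node_sketch, pages[], migrate_sketch() and lookup_sketch() are hypothetical and not taken from mm/ksm.c, and the acquire load of kpfn is stronger than the dependency ordering the patch actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct stable_node. */
struct page_sketch {
	_Atomic(void *) mapping;	/* points back at the node while valid */
};

struct node_sketch {
	atomic_ulong kpfn;		/* index into pages[] */
};

static struct page_sketch pages[4];

/* Writer side: mirrors the ordering described for ksm_migrate_page(). */
static void migrate_sketch(struct node_sketch *node,
			   unsigned long oldpfn, unsigned long newpfn)
{
	assert(oldpfn != newpfn);
	/* The new page's mapping is set in advance. */
	atomic_store_explicit(&pages[newpfn].mapping, (void *)node,
			      memory_order_relaxed);
	atomic_store_explicit(&node->kpfn, newpfn, memory_order_release);
	/* Stand-in for smp_wmb(): publish kpfn before the old page goes stale. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&pages[oldpfn].mapping, NULL,
			      memory_order_relaxed);
}

/* Reader side: returns the node's page, or NULL if the node is really stale. */
static struct page_sketch *lookup_sketch(struct node_sketch *node)
{
	unsigned long kpfn;
	struct page_sketch *page;

	for (;;) {
		kpfn = atomic_load_explicit(&node->kpfn, memory_order_acquire);
		page = &pages[kpfn];
		if (atomic_load_explicit(&page->mapping,
					 memory_order_relaxed) == (void *)node)
			return page;
		/* Stand-in for smp_rmb(): order the stale read before the recheck. */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&node->kpfn,
					 memory_order_relaxed) == kpfn)
			return NULL;	/* kpfn unchanged: not a migration */
		/* kpfn changed under us: the page migrated, so look again. */
	}
}
```

If the reader sees the old page's mapping already cleared, the fence pair guarantees that its second read of kpfn observes the new value, so it retries instead of discarding a live node.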
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id C78496B0005 for ; Sun, 27 Jan 2013 01:23:32 -0500 (EST) Received: by mail-ia0-f179.google.com with SMTP id x24so2720230iak.38 for ; Sat, 26 Jan 2013 22:23:32 -0800 (PST) Message-ID: <1359267810.6763.1.camel@kernel> Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning From: Simon Jeons Date: Sun, 27 Jan 2013 00:23:30 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Gerald Schaefer , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 2013-01-25 at 18:10 -0800, Hugh Dickins wrote: > Complaints are rare, but lockdep still does not understand the way > ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and > holds it until the ksm_memory_callback(MEM_OFFLINE): that appears > to be a problem because notifier callbacks are made under down_read > of blocking_notifier_head->rwsem (so first the mutex is taken while > holding the rwsem, then later the rwsem is taken while still holding > the mutex); but is not in fact a problem because mem_hotplug_mutex > is held throughout the dance. > > There was an attempt to fix this with mutex_lock_nested(); but if that > happened to fool lockdep two years ago, apparently it does so no longer. > > I had hoped to eradicate this issue in extending KSM page migration not > to need the ksm_thread_mutex. 
But then realized that although the page > migration itself is safe, we do still need to lock out ksmd and other > users of get_ksm_page() while offlining memory - at some point between > MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may > vanish, and get_ksm_page()'s accesses to them become a violation. > > So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to > MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() > checks, to achieve the same lockout without being caught by lockdep. > This is less elegant for KSM, but it's more important to keep lockdep > useful to other users - and I apologize for how long it took to fix. > > Reported-by: Gerald Schaefer > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 41 insertions(+), 14 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:38:53.984208836 -0800 > @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > -static unsigned int ksm_run = KSM_RUN_STOP; > +#define KSM_RUN_OFFLINE 4 > +static unsigned long ksm_run = KSM_RUN_STOP; > +static void wait_while_offlining(void); > > static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); > static DEFINE_MUTEX(ksm_thread_mutex); > @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing > > while (!kthread_should_stop()) { > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksmd_should_run()) > ksm_do_scan(ksm_thread_pages_to_scan); > mutex_unlock(&ksm_thread_mutex); > @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa > #endif /* CONFIG_MIGRATION */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > +static int just_wait(void *word) > +{ > + schedule(); > + return 0; > +} > + > +static void wait_while_offlining(void) > +{ > + while (ksm_run & KSM_RUN_OFFLINE) { > + 
mutex_unlock(&ksm_thread_mutex); > + wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), > + just_wait, TASK_UNINTERRUPTIBLE); > + mutex_lock(&ksm_thread_mutex); > + } > +} > + > static void ksm_check_stable_tree(unsigned long start_pfn, > unsigned long end_pfn) > { > @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no > switch (action) { > case MEM_GOING_OFFLINE: > /* > - * Keep it very simple for now: just lock out ksmd and > - * MADV_UNMERGEABLE while any memory is going offline. > - * mutex_lock_nested() is necessary because lockdep was alarmed > - * that here we take ksm_thread_mutex inside notifier chain > - * mutex, and later take notifier chain mutex inside > - * ksm_thread_mutex to unlock it. But that's safe because both > - * are inside mem_hotplug_mutex. > + * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() > + * and remove_all_stable_nodes() while memory is going offline: > + * it is unsafe for them to touch the stable tree at this time. > + * But unmerge_ksm_pages(), rmap lookups and other entry points Why unmerge_ksm_pages beneath us is safe for ksm memory hotremove? > + * which do not need the ksm_thread_mutex are all safe. 
> */ > - mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING); > + mutex_lock(&ksm_thread_mutex); > + ksm_run |= KSM_RUN_OFFLINE; > + mutex_unlock(&ksm_thread_mutex); > break; > > case MEM_OFFLINE: > @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no > /* fallthrough */ > > case MEM_CANCEL_OFFLINE: > + mutex_lock(&ksm_thread_mutex); > + ksm_run &= ~KSM_RUN_OFFLINE; > mutex_unlock(&ksm_thread_mutex); > + > + smp_mb(); /* wake_up_bit advises this */ > + wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); > break; > } > return NOTIFY_OK; > } > +#else > +static void wait_while_offlining(void) > +{ > +} > #endif /* CONFIG_MEMORY_HOTREMOVE */ > > #ifdef CONFIG_SYSFS > @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan); > static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, > char *buf) > { > - return sprintf(buf, "%u\n", ksm_run); > + return sprintf(buf, "%lu\n", ksm_run); > } > > static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, > @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject > */ > > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksm_run != flags) { > ksm_run = flags; > if (flags & KSM_RUN_UNMERGE) { > @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store( > return -EINVAL; > > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksm_merge_across_nodes != knob) { > if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > @@ -2366,10 +2396,7 @@ static int __init ksm_init(void) > #endif /* CONFIG_SYSFS */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > - /* > - * Choose a high priority since the callback takes ksm_thread_mutex: > - * later callbacks could only be taking locks which nest within that. > - */ > + /* There is no significance to this priority 100 */ > hotplug_memory_notifier(ksm_memory_callback, 100); > #endif > return 0;
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id 966306B0005 for ; Sun, 27 Jan 2013 03:49:17 -0500 (EST) Received: by mail-ia0-f172.google.com with SMTP id u8so2853185iag.31 for ; Sun, 27 Jan 2013 00:49:16 -0800 (PST) Message-ID: <1359276555.6763.6.camel@kernel> Subject: Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe From: Simon Jeons Date: Sun, 27 Jan 2013 02:49:15 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote: > The new KSM NUMA merge_across_nodes knob introduces a problem, when it's > set to non-default 0: if a KSM page is migrated to a different NUMA node, > how do we migrate its stable node to the right tree? And what if that > collides with an existing stable node? > > ksm_migrate_page() can do no more than it's already doing, updating > stable_node->kpfn: the stable tree itself cannot be manipulated without > holding ksm_thread_mutex. So accept that a stable tree may temporarily > indicate a page belonging to the wrong NUMA node, leave updating until > the next pass of ksmd, just be careful not to merge other pages on to a > misplaced page. Note nid of holding tree in stable_node, and recognize > that it will not always match nid of kpfn. 
> > A misplaced KSM page is discovered, either when ksm_do_scan() next comes > around to one of its rmap_items (we now have to go to cmp_and_merge_page > even on pages in a stable tree), or when stable_tree_search() arrives at > a matching node for another page, and this node page is found misplaced. > > In each case, move the misplaced stable_node to a list of migrate_nodes > (and use the address of migrate_nodes as magic by which to identify them): > we don't need them in a tree. If stable_tree_search() finds no match for > a page, but it's currently exiled to this list, then slot its stable_node > right there into the tree, bringing all of its mappings with it; otherwise > they get migrated one by one to the original page of the colliding node. > stable_tree_search() is now modelled more like stable_tree_insert(), > in order to handle these insertions of migrated nodes. > > remove_node_from_stable_tree(), remove_all_stable_nodes() and > ksm_check_stable_tree() have to handle the migrate_nodes list as well as > the stable tree itself. Less obviously, we do need to prune the list of > stale entries from time to time (scan_get_next_rmap_item() does it once > each full scan): whereas stale nodes in the stable tree get naturally > pruned as searches try to brush past them, these migrate_nodes may get > forgotten and accumulate. 
> > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 164 +++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 134 insertions(+), 30 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 > @@ -122,13 +122,25 @@ struct ksm_scan { > /** > * struct stable_node - node of the stable rbtree > * @node: rb node of this ksm page in the stable tree > + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list > + * @list: linked into migrate_nodes, pending placement in the proper node tree > * @hlist: hlist head of rmap_items using this ksm page > - * @kpfn: page frame number of this ksm page > + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) > + * @nid: NUMA node id of stable tree in which linked (may not match kpfn) > */ > struct stable_node { > - struct rb_node node; > + union { > + struct rb_node node; /* when node of stable tree */ > + struct { /* when listed for migration */ > + struct list_head *head; > + struct list_head list; > + }; > + }; > struct hlist_head hlist; > unsigned long kpfn; > +#ifdef CONFIG_NUMA > + int nid; > +#endif > }; > > /** > @@ -169,6 +181,9 @@ struct rmap_item { > static struct rb_root root_unstable_tree[MAX_NUMNODES]; > static struct rb_root root_stable_tree[MAX_NUMNODES]; > > +/* Recently migrated nodes of stable tree, pending proper placement */ > +static LIST_HEAD(migrate_nodes); > + > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > > @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru > hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm); > } > > -static inline int in_stable_tree(struct rmap_item *rmap_item) > -{ > - return rmap_item->address & STABLE_FLAG; > -} > - > /* > * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's > * page tables after it has passed through ksm_exit() - which, if necessary, > @@ 
-476,7 +486,6 @@ static void remove_node_from_stable_tree > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > - int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - nid = get_kpfn_nid(stable_node->kpfn); > - rb_erase(&stable_node->node, &root_stable_tree[nid]); > + if (stable_node->head == &migrate_nodes) > + list_del(&stable_node->list); > + else > + rb_erase(&stable_node->node, > + &root_stable_tree[NUMA(stable_node->nid)]); > free_stable_node(stable_node); > } > > @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta > static int remove_all_stable_nodes(void) > { > struct stable_node *stable_node; > + struct list_head *this, *next; > int nid; > int err = 0; > > @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void) > cond_resched(); > } > } > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, struct stable_node, list); > + if (remove_stable_node(stable_node)) > + err = -EBUSY; > + cond_resched(); > + } > return err; > } > > @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node; > - struct stable_node *stable_node; > int nid; > + struct rb_node **new; > + struct rb_node *parent; > + struct stable_node *stable_node; > + struct stable_node *page_node; > > - stable_node = page_stable_node(page); > - if (stable_node) { /* ksm page forked */ > + page_node = page_stable_node(page); > + if (page_node && page_node->head != &migrate_nodes) { > + /* ksm page forked */ > get_page(page); > return page; > } > > nid = get_kpfn_nid(page_to_pfn(page)); > - node = root_stable_tree[nid].rb_node; > +again: > + new = &root_stable_tree[nid].rb_node; > + parent = NULL; > > - while (node) { > + while (*new) { > struct page *tree_page; > int ret; > > cond_resched(); > - 
stable_node = rb_entry(node, struct stable_node, node); > + stable_node = rb_entry(*new, struct stable_node, node); > tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s > ret = memcmp_pages(page, tree_page); > put_page(tree_page); > > + parent = *new; > if (ret < 0) > - node = node->rb_left; > + new = &parent->rb_left; > else if (ret > 0) > - node = node->rb_right; > + new = &parent->rb_right; > else { > /* > * Lock and unlock the stable_node's page (which > @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s > * than kpage, but that involves more changes. > */ > tree_page = get_ksm_page(stable_node, true); > - if (tree_page) > + if (tree_page) { > unlock_page(tree_page); > - return tree_page; > + if (get_kpfn_nid(stable_node->kpfn) != > + NUMA(stable_node->nid)) { > + put_page(tree_page); > + goto replace; > + } > + return tree_page; > + } > + /* > + * There is now a place for page_node, but the tree may > + * have been rebalanced, so re-evaluate parent and new. 
> + */ > + if (page_node) > + goto again; > + return NULL; > } > } > > - return NULL; > + if (!page_node) > + return NULL; > + > + list_del(&page_node->list); > + DO_NUMA(page_node->nid = nid); > + rb_link_node(&page_node->node, parent, new); > + rb_insert_color(&page_node->node, &root_stable_tree[nid]); > + get_page(page); > + return page; > + > +replace: > + if (page_node) { > + list_del(&page_node->list); > + DO_NUMA(page_node->nid = nid); > + rb_replace_node(&stable_node->node, > + &page_node->node, &root_stable_tree[nid]); > + get_page(page); > + } else { > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > + page = NULL; > + } > + stable_node->head = &migrate_nodes; > + list_add(&stable_node->list, stable_node->head); > + return page; > } > > /* > @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i > INIT_HLIST_HEAD(&stable_node->hlist); > stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > + DO_NUMA(stable_node->nid = nid); > rb_link_node(&stable_node->node, parent, new); > rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > - /* > - * Usually rmap_item->nid is already set correctly, > - * but it may be wrong after switching merge_across_nodes. 
> - */ > - DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn)); > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa > unsigned int checksum; > int err; > > - remove_rmap_item_from_tree(rmap_item); > + stable_node = page_stable_node(page); > + if (stable_node) { > + if (stable_node->head != &migrate_nodes && > + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { > + rb_erase(&stable_node->node, > + &root_stable_tree[NUMA(stable_node->nid)]); > + stable_node->head = &migrate_nodes; > + list_add(&stable_node->list, stable_node->head); Why list add &stable_node->list to stable_node->head? stable_node->head is used for queue what? > + } > + if (stable_node->head != &migrate_nodes && > + rmap_item->head == stable_node) > + return; > + } > > /* We first start with searching the page inside the stable tree */ > kpage = stable_tree_search(page); > + if (kpage == page && rmap_item->head == stable_node) { > + put_page(kpage); > + return; > + } > + > + remove_rmap_item_from_tree(rmap_item); > + > if (kpage) { > err = try_to_merge_with_ksm_page(rmap_item, page, kpage); > if (!err) { > @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > + /* > + * Whereas stale stable_nodes on the stable_tree itself > + * get pruned in the regular course of stable_tree_search(), Which kinds of stable_nodes can be treated as stale? I just see remove rmap_item in stable_tree_search() and scan_get_next_rmap_item(). > + * those moved out to the migrate_nodes list can accumulate: > + * so prune them once before each full scan. 
> + */ > + if (!ksm_merge_across_nodes) { > + struct stable_node *stable_node; > + struct list_head *this, *next; > + struct page *page; > + > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, > + struct stable_node, list); > + page = get_ksm_page(stable_node, false); > + if (page) > + put_page(page); > + cond_resched(); > + } > + } > + Why get page of misplaced pages here? > for (nid = 0; nid < nr_node_ids; nid++) > root_unstable_tree[nid] = RB_ROOT; > > @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca > rmap_item = scan_get_next_rmap_item(&page); > if (!rmap_item) > return; > - if (!PageKsm(page) || !in_stable_tree(rmap_item)) > - cmp_and_merge_page(page, rmap_item); > + cmp_and_merge_page(page, rmap_item); > put_page(page); > } > } > @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign > unsigned long end_pfn) > { > struct stable_node *stable_node; > + struct list_head *this, *next; > struct rb_node *node; > int nid; > > @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign > cond_resched(); > } > } > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, struct stable_node, list); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + remove_node_from_stable_tree(stable_node); > + cond_resched(); > + } > } > > static int ksm_memory_callback(struct notifier_block *self,
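The commit message quoted above uses "the address of migrate_nodes as magic" to tell a stable_node parked on the list from one linked into a tree. That works because the head pointer overlays the first word of the rb_node, which holds a parent pointer and never holds &migrate_nodes while the node is in a tree. A minimal userspace sketch of the overlay follows, assuming simplified stand-in types. rb_node_sketch, stable_node_sketch, list_add_sketch(), exile_sketch() and is_exiled() are hypothetical names and not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Simplified rbtree node: parent pointer and colour share the first word. */
struct rb_node_sketch {
	uintptr_t parent_color;
	struct rb_node_sketch *right, *left;
};

/* Empty circular list, as LIST_HEAD(migrate_nodes) would set up. */
static struct list_head migrate_nodes = { &migrate_nodes, &migrate_nodes };

struct stable_node_sketch {
	union {
		struct rb_node_sketch node;	/* while linked in a tree */
		struct {			/* while parked on the list */
			struct list_head *head;	/* overlays parent_color */
			struct list_head list;
		};
	};
	unsigned long kpfn;
};

/* Insert new_entry right after head, like the kernel's list_add(). */
static void list_add_sketch(struct list_head *new_entry,
			    struct list_head *head)
{
	new_entry->next = head->next;
	new_entry->prev = head;
	head->next->prev = new_entry;
	head->next = new_entry;
}

/* The magic test: only a parked node has head == &migrate_nodes. */
static bool is_exiled(const struct stable_node_sketch *sn)
{
	return sn->head == &migrate_nodes;
}

/* Move a node (already erased from its tree) onto the migrate list. */
static void exile_sketch(struct stable_node_sketch *sn)
{
	assert(!is_exiled(sn));
	sn->head = &migrate_nodes;
	list_add_sketch(&sn->list, sn->head);
}
```

Since sn->head has just been set to &migrate_nodes, list_add_sketch(&sn->list, sn->head) queues the node on migrate_nodes itself. The head field is only a tag, not a separate queue.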
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106])
	by kanga.kvack.org (Postfix) with SMTP id D218F6B0007
	for ; Sun, 27 Jan 2013 16:55:24 -0500 (EST)
Received: by mail-da0-f48.google.com with SMTP id k18so944432dae.21
	for ; Sun, 27 Jan 2013 13:55:24 -0800 (PST)
Date: Sun, 27 Jan 2013 13:55:19 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node
In-Reply-To: <1359256581.4159.16.camel@kernel>
Message-ID:
References: <1359249282.4159.4.camel@kernel> <1359256581.4159.16.camel@kernel>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote:
> >
> > So you'd like us to add code for moving a node from one tree to another
> > in ksm_migrate_page() (and what would it do when it collides with an
>
> Without numa awareness, I still can't understand your explanation why
> can't insert the node to the tree just after page migration instead of
> inserting it at the next scan.

The node is already there in the right (only) tree in that case.

>
> > existing node?), code which will then be removed a few patches later
> > when ksm page migration is fully enabled?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160])
	by kanga.kvack.org (Postfix) with SMTP id B32576B0007
	for ; Sun, 27 Jan 2013 17:07:59 -0500 (EST)
Received: by mail-da0-f46.google.com with SMTP id p5so939652dak.5
	for ; Sun, 27 Jan 2013 14:07:58 -0800 (PST)
Date: Sun, 27 Jan 2013 14:08:00 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked
In-Reply-To: <1359254187.4159.10.camel@kernel>
Message-ID:
References: <1359254187.4159.10.camel@kernel>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote:
> > In some places where get_ksm_page() is used, we need the page to be locked.
> >
>
> In function get_ksm_page, why check page->mapping =>
> get_page_unless_zero => check page->mapping instead of
> get_page_unless_zero => check page->mapping, because
> get_page_unless_zero is expensive?

Yes, it's more expensive.

>
> > When KSM migration is fully enabled, we shall want that to make sure that
> > the page just acquired cannot be migrated beneath us (raised page count is
> > only effective when there is serialization to make sure migration notices).
> > Whereas when navigating through the stable tree, we certainly do not want
>
> What's the meaning of "navigating through the stable tree"?

Finding the right place in the stable tree,
as stable_tree_search() and stable_tree_insert() do.

>
> > to lock each node (raised page count is enough to guarantee the memcmps,
> > even if page is migrated to another node).
> >
> > Since we're about to add another use case, add the locked argument to
> > get_ksm_page() now.
>
> Why the parameter lock passed from stable_tree_search/insert is true,
> but remove_rmap_item_from_tree is false?

The other way round? remove_rmap_item_from_tree needs the page locked,
because it's about to modify the list: that's secured (e.g. against
concurrent KSM page reclaim) by the page lock.

stable_tree_search and stable_tree_insert do not need intermediate nodes
to be locked: get_page is enough to secure the page contents for memcmp,
and we don't want a pointless wait for exclusive page lock on every
intermediate node.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173])
	by kanga.kvack.org (Postfix) with SMTP id 97D216B0007
	for ; Sun, 27 Jan 2013 17:10:15 -0500 (EST)
Received: by mail-pa0-f49.google.com with SMTP id bi1so1152999pad.36
	for ; Sun, 27 Jan 2013 14:10:14 -0800 (PST)
Date: Sun, 27 Jan 2013 14:10:16 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked
In-Reply-To: <1359254927.4159.11.camel@kernel>
Message-ID:
References: <1359254927.4159.11.camel@kernel>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat, 26 Jan 2013, Simon Jeons wrote:
>
> BTW, what's the meaning of ksm page forked?

A ksm page is mapped into a process's mm, then that process calls fork():
the ksm page then appears in the child's mm, before ksmd has tracked it.
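The first [PATCH 5/11] reply above confirms that get_ksm_page() checks page->mapping before get_page_unless_zero() because the reference grab is the more expensive step. It then checks again afterwards because the page may have changed between the first check and the grab. A minimal userspace sketch of that "cheap check, take reference unless zero, recheck" pattern follows, assuming C11 atomics. page_sketch, get_unless_zero_sketch(), put_sketch() and try_get_sketch() are hypothetical names and not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct page fields that matter here. */
struct page_sketch {
	_Atomic(void *) mapping;
	atomic_int refcount;
};

/* Take a reference only if the count is not already zero. */
static bool get_unless_zero_sketch(struct page_sketch *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_sketch(struct page_sketch *p)
{
	int old = atomic_fetch_sub(&p->refcount, 1);

	assert(old > 0);
	(void)old;
}

/* Returns p with a reference held, or NULL if it is not (or no longer) ours. */
static struct page_sketch *try_get_sketch(struct page_sketch *p,
					  void *expected_mapping)
{
	/* Cheap read-only check first: most stale pages are rejected here. */
	if (atomic_load(&p->mapping) != expected_mapping)
		return NULL;
	/* The more expensive step: an atomic read-modify-write. */
	if (!get_unless_zero_sketch(p))
		return NULL;
	/* Recheck: the page may have been reused before we pinned it. */
	if (atomic_load(&p->mapping) != expected_mapping) {
		put_sketch(p);
		return NULL;
	}
	return p;
}
```

Dropping the first check would still be correct, but every stale page would then cost an atomic increment and a matching decrement.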
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172])
	by kanga.kvack.org (Postfix) with SMTP id F1C1F6B0007
	for ; Sun, 27 Jan 2013 18:05:50 -0500 (EST)
Received: by mail-pb0-f46.google.com with SMTP id mc17so462231pbc.19
	for ; Sun, 27 Jan 2013 15:05:50 -0800 (PST)
Date: Sun, 27 Jan 2013 15:05:46 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
In-Reply-To: <1359262556.4159.23.camel@kernel>
Message-ID:
References: <1359262556.4159.23.camel@kernel>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> > Switching merge_across_nodes after running KSM is liable to oops on stale
> > nodes still left over from the previous stable tree. It's not something
> > that people will often want to do, but it would be lame to demand a reboot
> > when they're trying to determine which merge_across_nodes setting is best.
> >
> > How can this happen? We only permit switching merge_across_nodes when
> > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > ought to unmerge everything: yet oopses still occur when you then run 1.
> >
> > Three causes:
> >
> > 1. The old stable tree (built according to the inverse merge_across_nodes)
> > has not been fully torn down. A stable node lingers until get_ksm_page()
> > notices that the page it references no longer references it: but the page
> > is not necessarily freed as soon as expected, particularly when swapcache.
> >
>
> When can this happen?

Whenever there's an additional reference to the page, beyond those for
its ptes in userspace - swapcache for example, or pinned by get_user_pages.
That delays its being freed (arriving at the "page->mapping = NULL;"
in free_pages_prepare()). Or it might simply be sitting in a pagevec,
waiting for that to be filled up, to be freed as part of a batch.

>
> > Fix this with a pass through the old stable tree, applying get_ksm_page()
> > to each of the remaining nodes (most found stale and removed immediately),
> > with forced removal of any left over. Unless the page is still mapped:
> > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> > and EBUSY than BUG.
> >
> > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > (or be removed) before ksmd addresses it. Nice when ksmd is running,
> > but not so nice when we're trying to unmerge all mms: we were missing
> > those mms forked and inserted behind the unmerge cursor. Easily fixed
> > by inserting at the end when KSM_RUN_UNMERGE.
>
> mms forked will be unmerged just after ksmd's cursor since they're
> inserted behind it, why will be missing?

unmerge_and_remove_all_rmap_items() makes one pass through the list
from start to finish: insert behind the cursor and it will be missed.
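The point about the single unmerge pass can be shown on a toy list. An entry inserted just behind a forward-moving cursor is never reached by that pass, while one inserted at the tail is. A minimal userspace sketch follows, assuming a plain circular doubly linked list. slot_sketch, insert_before() and forked_slot_visited() are hypothetical names that only model the mm_slot list.

```c
#include <assert.h>
#include <stddef.h>

struct slot_sketch {
	struct slot_sketch *prev, *next;
	int visited;
};

enum fork_where { BEHIND_CURSOR, AT_TAIL };

/* Link new_slot immediately before pos in a circular list. */
static void insert_before(struct slot_sketch *new_slot,
			  struct slot_sketch *pos)
{
	new_slot->prev = pos->prev;
	new_slot->next = pos;
	pos->prev->next = new_slot;
	pos->prev = new_slot;
}

/*
 * One pass from start to finish. While the first slot is being processed,
 * a "fork" adds a new slot, either just behind the cursor or at the tail.
 * Returns 1 if this pass visited the forked slot, else 0.
 */
static int forked_slot_visited(enum fork_where where)
{
	struct slot_sketch head = { &head, &head, 0 };
	struct slot_sketch a = { NULL, NULL, 0 }, b = { NULL, NULL, 0 };
	struct slot_sketch forked = { NULL, NULL, 0 };
	struct slot_sketch *cur;
	int inserted = 0;

	insert_before(&a, &head);	/* list: a */
	insert_before(&b, &head);	/* list: a, b */

	for (cur = head.next; cur != &head; cur = cur->next) {
		cur->visited = 1;
		if (!inserted) {
			/* Before cur means behind the cursor; before head means tail. */
			insert_before(&forked,
				      where == BEHIND_CURSOR ? cur : &head);
			inserted = 1;
		}
	}
	assert(a.visited && b.visited);
	return forked.visited;
}
```

Under these assumptions forked_slot_visited(BEHIND_CURSOR) yields 0 and forked_slot_visited(AT_TAIL) yields 1, which matches the fix of inserting at the end when KSM_RUN_UNMERGE is set.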
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id C13146B0007 for ; Sun, 27 Jan 2013 18:12:28 -0500 (EST) Received: by mail-pb0-f44.google.com with SMTP id wz12so43919pbc.3 for ; Sun, 27 Jan 2013 15:12:28 -0800 (PST) Date: Sun, 27 Jan 2013 15:12:29 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: <1359265635.6763.0.camel@kernel> Message-ID: References: <1359265635.6763.0.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote: > > + while (!get_page_unless_zero(page)) { > > + /* > > + * Another check for page->mapping != expected_mapping would > > + * work here too. We have chosen the !PageSwapCache test to > > + * optimize the common case, when the page is or is about to > > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > > + * in the freeze_refs section of __remove_mapping(); but Anon > > + * page->mapping reset to NULL later, in free_pages_prepare(). > > + */ > > + if (!PageSwapCache(page)) > > + goto stale; > > + cpu_relax(); > > + } > > + > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > put_page(page); > > goto stale; > > } > > + > > if (locked) { > > lock_page(page); > > - if (page->mapping != expected_mapping) { > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > unlock_page(page); > > put_page(page); > > goto stale; > > } > > } > > Could you explain why need check page->mapping twice after get page? Once for the !locked case, which should not return page if mapping changed. 
Once for the locked case, which should not return page if mapping changed. We could use "else", but that wouldn't be an improvement. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 549F06B0007 for ; Sun, 27 Jan 2013 18:25:54 -0500 (EST) Received: by mail-pa0-f45.google.com with SMTP id bg2so1161909pad.4 for ; Sun, 27 Jan 2013 15:25:53 -0800 (PST) Date: Sun, 27 Jan 2013 15:25:54 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe In-Reply-To: <1359276555.6763.6.camel@kernel> Message-ID: References: <1359276555.6763.6.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote: > > @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa > > unsigned int checksum; > > int err; > > > > - remove_rmap_item_from_tree(rmap_item); > > + stable_node = page_stable_node(page); > > + if (stable_node) { > > + if (stable_node->head != &migrate_nodes && > > + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { > > + rb_erase(&stable_node->node, > > + &root_stable_tree[NUMA(stable_node->nid)]); > > + stable_node->head = &migrate_nodes; > > + list_add(&stable_node->list, stable_node->head); > > Why list add &stable_node->list to stable_node->head? stable_node->head > is used for queue what? Read that as list_add(&stable_node->list, &migrate_nodes) if you prefer. 
stable_node->head (overlaying stable_node->node.__rb_parent_color, which would never point to migrate_nodes as an rb_node) &migrate_nodes is used as "magic" to show that that rb_node is currently saved on this list, rather than linked into the stable tree itself. We could do some #define MIGRATE_NODES_MAGIC 0xwhatever and put that in head instead. > > @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r > > */ > > lru_add_drain_all(); > > > > + /* > > + * Whereas stale stable_nodes on the stable_tree itself > > + * get pruned in the regular course of stable_tree_search(), > > Which kinds of stable_nodes can be treated as stale? I just see remove > rmap_item in stable_tree_search() and scan_get_next_rmap_item(). See get_ksm_page(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id B7C356B0007 for ; Sun, 27 Jan 2013 18:35:20 -0500 (EST) Received: by mail-da0-f49.google.com with SMTP id v40so955845dad.22 for ; Sun, 27 Jan 2013 15:35:20 -0800 (PST) Date: Sun, 27 Jan 2013 15:35:21 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning In-Reply-To: <1359267810.6763.1.camel@kernel> Message-ID: References: <1359267810.6763.1.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Gerald Schaefer , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:10 -0800, Hugh Dickins wrote: > > @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no > > switch (action) { > > case MEM_GOING_OFFLINE: > > /* > > - 
* Keep it very simple for now: just lock out ksmd and > > - * MADV_UNMERGEABLE while any memory is going offline. > > - * mutex_lock_nested() is necessary because lockdep was alarmed > > - * that here we take ksm_thread_mutex inside notifier chain > > - * mutex, and later take notifier chain mutex inside > > - * ksm_thread_mutex to unlock it. But that's safe because both > > - * are inside mem_hotplug_mutex. > > + * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() > > + * and remove_all_stable_nodes() while memory is going offline: > > + * it is unsafe for them to touch the stable tree at this time. > > + * But unmerge_ksm_pages(), rmap lookups and other entry points > > Why unmerge_ksm_pages beneath us is safe for ksm memory hotremove? > > > + * which do not need the ksm_thread_mutex are all safe. It's just like userspace doing a write-fault on every KSM page in the vma. If that were unsafe for memory hotremove, then it would not be KSM's problem, memory hotremove would already be unsafe. (But memory hotremove is safe because it migrates away from all the pages to be removed before it can reach MEM_OFFLINE.) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 060136B0007 for ; Sun, 27 Jan 2013 19:36:11 -0500 (EST) Received: by mail-ia0-f173.google.com with SMTP id l29so3449835iag.32 for ; Sun, 27 Jan 2013 16:36:11 -0800 (PST) Message-ID: <1359333371.6763.12.camel@kernel> Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked From: Simon Jeons Date: Sun, 27 Jan 2013 18:36:11 -0600 In-Reply-To: References: <1359254187.4159.10.camel@kernel> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 2013-01-27 at 14:08 -0800, Hugh Dickins wrote: > On Sat, 26 Jan 2013, Simon Jeons wrote: > > On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > > > In some places where get_ksm_page() is used, we need the page to be locked. > > > > > > > In function get_ksm_page, why check page->mapping => > > get_page_unless_zero => check page->mapping instead of > > get_page_unless_zero => check page->mapping, because > > get_page_unless_zero is expensive? > > Yes, it's more expensive. > > > > > > When KSM migration is fully enabled, we shall want that to make sure that > > > the page just acquired cannot be migrated beneath us (raised page count is > > > only effective when there is serialization to make sure migration notices). > > > Whereas when navigating through the stable tree, we certainly do not want > > > > What's the meaning of "navigating through the stable tree"? > > Finding the right place in the stable tree, > as stable_tree_search() and stable_tree_insert() do. > > > > > > to lock each node (raised page count is enough to guarantee the memcmps, > > > even if page is migrated to another node). 
> > >
> > > Since we're about to add another use case, add the locked argument to
> > > get_ksm_page() now.
> >
> > Why the parameter lock passed from stable_tree_search/insert is true,
> > but remove_rmap_item_from_tree is false?
>
> The other way round? remove_rmap_item_from_tree needs the page locked,
> because it's about to modify the list: that's secured (e.g. against
> concurrent KSM page reclaim) by the page lock.

How can the KSM page reclaim path call remove_rmap_item_from_tree? I have
already tracked every call site but can't find it. BTW, I'm curious about
KSM page reclaim: there seems to be no special handling for it in vmscan.c,
so will a KSM page be reclaimed in the same way as a normal page?

>
> stable_tree_search and stable_tree_insert do not need intermediate nodes
> to be locked: get_page is enough to secure the page contents for memcmp,
> and we don't want a pointless wait for exclusive page lock on every
> intermediate node.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
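The ordering asked about earlier in this message (check page->mapping,
then get_page_unless_zero, then check page->mapping again) can be sketched
in userspace with C11 atomics. This is only an analogy under stated
assumptions, not the kernel's get_ksm_page(): struct fake_page,
get_unless_zero(), put_ref() and get_if_still_ours() are hypothetical
names, and the sketch leaves out the PageSwapCache retry loop, the optional
page lock and the removal of the stale stable node. The first check is a
cheap early exit; the second is the one that can be trusted, because by
then a reference is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
	atomic_int refcount;     /* 0 means the page is on its way to being freed */
	_Atomic(void *) mapping; /* owner tag; changes once the page is reused */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool get_unless_zero(struct fake_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_ref(struct fake_page *p)
{
	atomic_fetch_sub(&p->refcount, 1);
}

/* Returns p with an extra reference held, or NULL if it is stale. */
struct fake_page *get_if_still_ours(struct fake_page *p, void *expected)
{
	/* Cheap early check: skip the atomic increment for obviously stale pages. */
	if (atomic_load(&p->mapping) != expected)
		return NULL;

	if (!get_unless_zero(p))
		return NULL;

	/* Recheck now that the reference pins the page. */
	if (atomic_load(&p->mapping) != expected) {
		put_ref(p);
		return NULL;
	}
	return p;
}
```

A caller that gets a non-NULL result owns one reference and is expected to
drop it with put_ref() when done; a NULL result leaves the count untouched.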
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx162.postini.com [74.125.245.162]) by kanga.kvack.org (Postfix) with SMTP id 70B5F6B0007 for ; Sun, 27 Jan 2013 19:41:24 -0500 (EST) Received: by mail-da0-f52.google.com with SMTP id f10so969604dak.11 for ; Sun, 27 Jan 2013 16:41:23 -0800 (PST) Message-ID: <1359333683.6763.13.camel@kernel> Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible From: Simon Jeons Date: Sun, 27 Jan 2013 18:41:23 -0600 In-Reply-To: References: <1359265635.6763.0.camel@kernel> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 2013-01-27 at 15:12 -0800, Hugh Dickins wrote: > On Sat, 26 Jan 2013, Simon Jeons wrote: > > On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote: > > > + while (!get_page_unless_zero(page)) { > > > + /* > > > + * Another check for page->mapping != expected_mapping would > > > + * work here too. We have chosen the !PageSwapCache test to > > > + * optimize the common case, when the page is or is about to > > > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > > > + * in the freeze_refs section of __remove_mapping(); but Anon > > > + * page->mapping reset to NULL later, in free_pages_prepare(). 
> > > +                 */
> > > +                if (!PageSwapCache(page))
> > > +                        goto stale;
> > > +                cpu_relax();
> > > +        }
> > > +
> > > +        if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > >                  put_page(page);
> > >                  goto stale;
> > >          }
> > > +
> > >          if (locked) {
> > >                  lock_page(page);
> > > -                if (page->mapping != expected_mapping) {
> > > +                if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > >                          unlock_page(page);
> > >                          put_page(page);
> > >                          goto stale;
> > >                  }
> > >          }
> >
> > Could you explain why need check page->mapping twice after get page?
>
> Once for the !locked case, which should not return page if mapping changed.
> Once for the locked case, which should not return page if mapping changed.
> We could use "else", but that wouldn't be an improvement.

But in the locked case, page->mapping will be checked twice.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158])
    by kanga.kvack.org (Postfix) with SMTP id A5FEE6B0007
    for ; Sun, 27 Jan 2013 20:42:01 -0500 (EST)
Received: by mail-ie0-f173.google.com with SMTP id e13so721523iej.4
    for ; Sun, 27 Jan 2013 17:42:01 -0800 (PST)
Message-ID: <1359337321.6763.18.camel@kernel>
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
From: Simon Jeons
Date: Sun, 27 Jan 2013 19:42:01 -0600
In-Reply-To:
References: <1359262556.4159.23.camel@kernel>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sun, 2013-01-27 at 15:05 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> > > Switching
merge_across_nodes after running KSM is liable to oops on stale > > > nodes still left over from the previous stable tree. It's not something > > > that people will often want to do, but it would be lame to demand a reboot > > > when they're trying to determine which merge_across_nodes setting is best. > > > > > > How can this happen? We only permit switching merge_across_nodes when > > > pages_shared is 0, and usually set run 2 to force that beforehand, which > > > ought to unmerge everything: yet oopses still occur when you then run 1. > > > > > > Three causes: > > > > > > 1. The old stable tree (built according to the inverse merge_across_nodes) ^^^^^^^^^^^^^^^^^^^^^ How to understand inverse merge_across_nodes here? > > > has not been fully torn down. A stable node lingers until get_ksm_page() > > > notices that the page it references no longer references it: but the page Do you mean page->mapping is NULL when call get_ksm_page()? Who clear it NULL? > > > is not necessarily freed as soon as expected, particularly when swapcache. Why is not necessarily freed as soon as expected? > > > > > > > When can this happen? > > Whenever there's an additional reference to the page, beyond those for > its ptes in userspace - swapcache for example, or pinned by get_user_pages. > That delays its being freed (arriving at the "page->mapping = NULL;" > in free_pages_prepare()). Or it might simply be sitting in a pagevec, > waiting for that to be filled up, to be freed as part of a batch. > > > > > > Fix this with a pass through the old stable tree, applying get_ksm_page() > > > to each of the remaining nodes (most found stale and removed immediately), > > > with forced removal of any left over. Unless the page is still mapped: > > > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > > > and EBUSY than BUG. > > > > > > 2. 
__ksm_enter() has a nice little optimization, to insert the new mm
> > > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > > (or be removed) before ksmd addresses it. Nice when ksmd is running,
> > > but not so nice when we're trying to unmerge all mms: we were missing
> > > those mms forked and inserted behind the unmerge cursor. Easily fixed
> > > by inserting at the end when KSM_RUN_UNMERGE.
> >
> > mms forked will be unmerged just after ksmd's cursor since they're
> > inserted behind it, why will be missing?
>
> unmerge_and_remove_all_rmap_items() makes one pass through the list
> from start to finish: insert behind the cursor and it will be missed.

Since forked mms are inserted just after ksmd's cursor, they are the next
to be scanned and unmerged. What am I missing?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187])
    by kanga.kvack.org (Postfix) with SMTP id 9A4E66B0007
    for ; Sun, 27 Jan 2013 21:12:28 -0500 (EST)
Received: by mail-pa0-f51.google.com with SMTP id fb11so1217308pad.24
    for ; Sun, 27 Jan 2013 18:12:27 -0800 (PST)
Message-ID: <1359339147.6763.25.camel@kernel>
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
From: Simon Jeons
Date: Sun, 27 Jan 2013 20:12:27 -0600
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> Switching merge_across_nodes after running KSM is liable to oops on stale
> nodes still left over from the previous stable tree.
It's not something Since this patch solve the problem, so the description of merge_across_nodes(Value can be changed only when there is no ksm shared pages in system) should be changed in this patch. > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. > > Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. 
> Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. > > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error. > > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? 
> */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /* > + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. 
> + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ > + } > + cond_resched(); > + } > + } > + return err; > +} > + > static int unmerge_and_remove_all_rmap_items(void) > { > struct mm_slot *mm_slot; > @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i > } > } > > + /* Clean up stable nodes, but don't worry if some are still busy */ > + remove_all_stable_nodes(); > ksm_scan.seqnr = 0; > return 0; > > @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) > spin_lock(&ksm_mmlist_lock); > insert_to_mm_slots_hash(mm, mm_slot); > /* > - * Insert just behind the scanning cursor, to let the area settle > + * When KSM_RUN_MERGE (or KSM_RUN_STOP), > + * insert just behind the scanning cursor, to let the area settle > * down a little; when fork is followed by immediate exec, we don't > * want ksmd to waste time setting up and tearing down an rmap_list. > + * > + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its > + * scanning cursor, otherwise KSM pages in newly forked mms will be > + * missed: then we might as well insert at the end of the list. 
> */ > - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > + if (ksm_run & KSM_RUN_UNMERGE) > + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); > + else > + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > spin_unlock(&ksm_mmlist_lock); > > set_bit(MMF_VM_MERGEABLE, &mm->flags); > @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) > } > } > > -struct page *ksm_does_need_to_copy(struct page *page, > +struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > + struct anon_vma *anon_vma = page_anon_vma(page); > struct page *new_page; > > + if (PageKsm(page)) { > + if (page_stable_node(page) && > + !(ksm_run & KSM_RUN_UNMERGE)) > + return page; /* no need to copy it */ > + } else if (!anon_vma) { > + return page; /* no need to copy it */ > + } else if (anon_vma->root == vma->anon_vma->root && > + page->index == linear_page_index(vma, address)) { > + return page; /* still no need to copy it */ > + } > + if (!PageUptodate(page)) > + return page; /* let do_swap_page report the error */ > + > new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); > if (new_page) { > copy_user_highpage(new_page, page, address, vma); > @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( > > mutex_lock(&ksm_thread_mutex); > if (ksm_merge_across_nodes != knob) { > - if (ksm_pages_shared) > + if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > else > ksm_merge_across_nodes = knob; > --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 > @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > goto out_page; > > - if (ksm_might_need_to_copy(page, vma, address)) { > - swapcache = page; > - page = ksm_does_need_to_copy(page, vma, address); > - > - if (unlikely(!page)) { > - ret = VM_FAULT_OOM; > - page = 
swapcache; > - swapcache = NULL; > - goto out_page; > - } > + swapcache = page; > + page = ksm_might_need_to_copy(page, vma, address); > + if (unlikely(!page)) { > + ret = VM_FAULT_OOM; > + page = swapcache; > + swapcache = NULL; > + goto out_page; > } > + if (page == swapcache) > + swapcache = NULL; > > if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > ret = VM_FAULT_OOM; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id C24136B0007 for ; Sun, 27 Jan 2013 22:35:35 -0500 (EST) Received: by mail-pb0-f42.google.com with SMTP id rp2so1202110pbb.15 for ; Sun, 27 Jan 2013 19:35:35 -0800 (PST) Date: Sun, 27 Jan 2013 19:35:31 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked In-Reply-To: <1359333371.6763.12.camel@kernel> Message-ID: References: <1359254187.4159.10.camel@kernel> <1359333371.6763.12.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Sun, 2013-01-27 at 14:08 -0800, Hugh Dickins wrote: > > On Sat, 26 Jan 2013, Simon Jeons wrote: > > > > > > Why the parameter lock passed from stable_tree_search/insert is true, > > > but remove_rmap_item_from_tree is false? > > > > The other way round? 
remove_rmap_item_from_tree needs the page locked,
> > because it's about to modify the list: that's secured (e.g. against
> > concurrent KSM page reclaim) by the page lock.
>
> How can the KSM page reclaim path call remove_rmap_item_from_tree? I have
> already tracked every call site but can't find it.

It doesn't. Please read what I said above again.

> BTW, I'm curious about
> KSM page reclaim: there seems to be no special handling for it in vmscan.c,
> so will a KSM page be reclaimed in the same way as a normal page?

Look for PageKsm in mm/rmap.c.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108])
    by kanga.kvack.org (Postfix) with SMTP id BFC476B0008
    for ; Sun, 27 Jan 2013 22:44:25 -0500 (EST)
Received: by mail-da0-f41.google.com with SMTP id e20so1037040dak.0
    for ; Sun, 27 Jan 2013 19:44:24 -0800 (PST)
Message-ID: <1359344663.6763.32.camel@kernel>
Subject: Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe
From: Simon Jeons
Date: Sun, 27 Jan 2013 21:44:23 -0600
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote:
> The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
> set to non-default 0: if a KSM page is migrated to a different NUMA node,
> how do we migrate its stable node to the right tree? And what if that
> collides with an existing stable node?
> > ksm_migrate_page() can do no more than it's already doing, updating > stable_node->kpfn: the stable tree itself cannot be manipulated without > holding ksm_thread_mutex. So accept that a stable tree may temporarily > indicate a page belonging to the wrong NUMA node, leave updating until > the next pass of ksmd, just be careful not to merge other pages on to a How you not to merge other pages on to a misplaced page? I don't see it. > misplaced page. Note nid of holding tree in stable_node, and recognize > that it will not always match nid of kpfn. > > A misplaced KSM page is discovered, either when ksm_do_scan() next comes > around to one of its rmap_items (we now have to go to cmp_and_merge_page > even on pages in a stable tree), or when stable_tree_search() arrives at > a matching node for another page, and this node page is found misplaced. > > In each case, move the misplaced stable_node to a list of migrate_nodes > (and use the address of migrate_nodes as magic by which to identify them): > we don't need them in a tree. If stable_tree_search() finds no match for > a page, but it's currently exiled to this list, then slot its stable_node > right there into the tree, bringing all of its mappings with it; otherwise > they get migrated one by one to the original page of the colliding node. > stable_tree_search() is now modelled more like stable_tree_insert(), > in order to handle these insertions of migrated nodes. When node will be removed from migrate_nodes list and insert to stable tree? > > remove_node_from_stable_tree(), remove_all_stable_nodes() and > ksm_check_stable_tree() have to handle the migrate_nodes list as well as > the stable tree itself. Less obviously, we do need to prune the list of > stale entries from time to time (scan_get_next_rmap_item() does it once > each full scan): > whereas stale nodes in the stable tree get naturally > pruned as searches try to brush past them, these migrate_nodes may get > forgotten and accumulate. 
Hard to understand this description. Could you explain it? :) > Signed-off-by: Hugh Dickins What will happen if page node of an unstable tree migrate to a new numa node? Also need to handle colliding? > --- > mm/ksm.c | 164 +++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 134 insertions(+), 30 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 > @@ -122,13 +122,25 @@ struct ksm_scan { > /** > * struct stable_node - node of the stable rbtree > * @node: rb node of this ksm page in the stable tree > + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list > + * @list: linked into migrate_nodes, pending placement in the proper node tree > * @hlist: hlist head of rmap_items using this ksm page > - * @kpfn: page frame number of this ksm page > + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) > + * @nid: NUMA node id of stable tree in which linked (may not match kpfn) > */ > struct stable_node { > - struct rb_node node; > + union { > + struct rb_node node; /* when node of stable tree */ > + struct { /* when listed for migration */ > + struct list_head *head; > + struct list_head list; > + }; > + }; > struct hlist_head hlist; > unsigned long kpfn; > +#ifdef CONFIG_NUMA > + int nid; > +#endif > }; > > /** > @@ -169,6 +181,9 @@ struct rmap_item { > static struct rb_root root_unstable_tree[MAX_NUMNODES]; > static struct rb_root root_stable_tree[MAX_NUMNODES]; > > +/* Recently migrated nodes of stable tree, pending proper placement */ > +static LIST_HEAD(migrate_nodes); > + > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > > @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru > hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm); > } > > -static inline int in_stable_tree(struct rmap_item *rmap_item) > -{ > - return rmap_item->address & STABLE_FLAG; > -} > 
- > /* > * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's > * page tables after it has passed through ksm_exit() - which, if necessary, > @@ -476,7 +486,6 @@ static void remove_node_from_stable_tree > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > - int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - nid = get_kpfn_nid(stable_node->kpfn); > - rb_erase(&stable_node->node, &root_stable_tree[nid]); > + if (stable_node->head == &migrate_nodes) > + list_del(&stable_node->list); > + else > + rb_erase(&stable_node->node, > + &root_stable_tree[NUMA(stable_node->nid)]); > free_stable_node(stable_node); > } > > @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta > static int remove_all_stable_nodes(void) > { > struct stable_node *stable_node; > + struct list_head *this, *next; > int nid; > int err = 0; > > @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void) > cond_resched(); > } > } > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, struct stable_node, list); > + if (remove_stable_node(stable_node)) > + err = -EBUSY; > + cond_resched(); > + } > return err; > } > > @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node; > - struct stable_node *stable_node; > int nid; > + struct rb_node **new; > + struct rb_node *parent; > + struct stable_node *stable_node; > + struct stable_node *page_node; > > - stable_node = page_stable_node(page); > - if (stable_node) { /* ksm page forked */ > + page_node = page_stable_node(page); > + if (page_node && page_node->head != &migrate_nodes) { > + /* ksm page forked */ > get_page(page); > return page; > } > > nid = get_kpfn_nid(page_to_pfn(page)); > - node = root_stable_tree[nid].rb_node; > +again: > + 
new = &root_stable_tree[nid].rb_node; > + parent = NULL; > > - while (node) { > + while (*new) { > struct page *tree_page; > int ret; > > cond_resched(); > - stable_node = rb_entry(node, struct stable_node, node); > + stable_node = rb_entry(*new, struct stable_node, node); > tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s > ret = memcmp_pages(page, tree_page); > put_page(tree_page); > > + parent = *new; > if (ret < 0) > - node = node->rb_left; > + new = &parent->rb_left; > else if (ret > 0) > - node = node->rb_right; > + new = &parent->rb_right; > else { > /* > * Lock and unlock the stable_node's page (which > @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s > * than kpage, but that involves more changes. > */ > tree_page = get_ksm_page(stable_node, true); > - if (tree_page) > + if (tree_page) { > unlock_page(tree_page); > - return tree_page; > + if (get_kpfn_nid(stable_node->kpfn) != > + NUMA(stable_node->nid)) { > + put_page(tree_page); > + goto replace; > + } > + return tree_page; > + } > + /* > + * There is now a place for page_node, but the tree may > + * have been rebalanced, so re-evaluate parent and new. > + */ > + if (page_node) > + goto again; > + return NULL; > } > } > > - return NULL; > + if (!page_node) > + return NULL; > + > + list_del(&page_node->list); > + DO_NUMA(page_node->nid = nid); > + rb_link_node(&page_node->node, parent, new); > + rb_insert_color(&page_node->node, &root_stable_tree[nid]); > + get_page(page); > + return page; > + > +replace: > + if (page_node) { > + list_del(&page_node->list); > + DO_NUMA(page_node->nid = nid); > + rb_replace_node(&stable_node->node, > + &page_node->node, &root_stable_tree[nid]); > + get_page(page); > + } else { > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > + page = NULL; > + } > + stable_node->head = &migrate_nodes; Why still set this magic since node has already insert to the tree? 
> + list_add(&stable_node->list, stable_node->head); > + return page; > } > > /* > @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i > INIT_HLIST_HEAD(&stable_node->hlist); > stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > + DO_NUMA(stable_node->nid = nid); > rb_link_node(&stable_node->node, parent, new); > rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > - /* > - * Usually rmap_item->nid is already set correctly, > - * but it may be wrong after switching merge_across_nodes. > - */ > - DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn)); > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa > unsigned int checksum; > int err; > > - remove_rmap_item_from_tree(rmap_item); > + stable_node = page_stable_node(page); > + if (stable_node) { > + if (stable_node->head != &migrate_nodes && > + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { > + rb_erase(&stable_node->node, > + &root_stable_tree[NUMA(stable_node->nid)]); > + stable_node->head = &migrate_nodes; > + list_add(&stable_node->list, stable_node->head); > + } > + if (stable_node->head != &migrate_nodes && > + rmap_item->head == stable_node) > + return; > + } > > /* We first start with searching the page inside the stable tree */ > kpage = stable_tree_search(page); > + if (kpage == page && rmap_item->head == stable_node) { > + put_page(kpage); > + return; > + } > + > + remove_rmap_item_from_tree(rmap_item); > + > if (kpage) { > err = try_to_merge_with_ksm_page(rmap_item, page, kpage); > if (!err) { > @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > + /* > + * Whereas stale stable_nodes on the 
stable_tree itself > + * get pruned in the regular course of stable_tree_search(), > + * those moved out to the migrate_nodes list can accumulate: > + * so prune them once before each full scan. > + */ > + if (!ksm_merge_across_nodes) { > + struct stable_node *stable_node; > + struct list_head *this, *next; > + struct page *page; > + > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, > + struct stable_node, list); > + page = get_ksm_page(stable_node, false); > + if (page) > + put_page(page); > + cond_resched(); > + } > + } > + > for (nid = 0; nid < nr_node_ids; nid++) > root_unstable_tree[nid] = RB_ROOT; > > @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca > rmap_item = scan_get_next_rmap_item(&page); > if (!rmap_item) > return; > - if (!PageKsm(page) || !in_stable_tree(rmap_item)) > - cmp_and_merge_page(page, rmap_item); > + cmp_and_merge_page(page, rmap_item); > put_page(page); > } > } > @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign > unsigned long end_pfn) > { > struct stable_node *stable_node; > + struct list_head *this, *next; > struct rb_node *node; > int nid; > > @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign > cond_resched(); > } > } > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, struct stable_node, list); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + remove_node_from_stable_tree(stable_node); > + cond_resched(); > + } > } > > static int ksm_memory_callback(struct notifier_block *self, > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
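Editorial sketch of the overlay used by the quoted struct stable_node change, for readers following the "@head: (overlaying parent)" comment. This is a userspace illustration under stated assumptions, not code from mm/ksm.c: fake_rb_node, fake_list_head, demo_node, migrate_list and the helper functions are hypothetical stand-ins. The idea shown is that a pointer sharing storage with the tree node's first word can act as a marker, since a tree node's parent word should never hold the address of the list head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's list_head and rb_node. */
struct fake_list_head { struct fake_list_head *next, *prev; };
struct fake_rb_node {
	uintptr_t parent_color;		/* parent address plus colour bit */
	struct fake_rb_node *right, *left;
};

/* Hypothetical equivalent of the migrate_nodes list head. */
struct fake_list_head migrate_list = { &migrate_list, &migrate_list };

struct demo_node {
	union {
		struct fake_rb_node node;		/* when linked in a tree */
		struct {				/* when parked on the list */
			struct fake_list_head *head;	/* overlays parent_color */
			struct fake_list_head list;
		};
	};
	unsigned long kpfn;
};

/* True if the node is currently parked on migrate_list. */
bool on_migrate_list(const struct demo_node *n)
{
	return n->head == &migrate_list;
}

/* Mark the node as a parentless tree root (parent word is zero). */
void link_as_tree_root(struct demo_node *n)
{
	n->node.parent_color = 0;
	n->node.left = n->node.right = NULL;
}

/* Park the node: set the marker, then link it at the front of the list. */
void park_on_migrate_list(struct demo_node *n)
{
	n->head = &migrate_list;
	n->list.next = migrate_list.next;
	n->list.prev = &migrate_list;
	migrate_list.next->prev = &n->list;
	migrate_list.next = &n->list;
}

/* Unlink a parked node from the list. */
void unpark(struct demo_node *n)
{
	assert(on_migrate_list(n));
	n->list.prev->next = n->list.next;
	n->list.next->prev = n->list.prev;
	n->head = NULL;
}
```

A caller would test on_migrate_list() before deciding whether to unlink from the list or erase from a tree, which is the same branch the quoted remove_node_from_stable_tree() hunk takes on stable_node->head.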
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 53B126B0009 for ; Sun, 27 Jan 2013 22:44:55 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id e20so1040416dak.28 for ; Sun, 27 Jan 2013 19:44:54 -0800 (PST) Date: Sun, 27 Jan 2013 19:44:56 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: <1359333683.6763.13.camel@kernel> Message-ID: References: <1359265635.6763.0.camel@kernel> <1359333683.6763.13.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Sun, 2013-01-27 at 15:12 -0800, Hugh Dickins wrote: > > On Sat, 26 Jan 2013, Simon Jeons wrote: > > > > > > Could you explain why need check page->mapping twice after get page? > > > > Once for the !locked case, which should not return page if mapping changed. > > Once for the locked case, which should not return page if mapping changed. > > We could use "else", but that wouldn't be an improvement. > > But for locked case, page->mapping will be check twice. Thrice. I'm beginning to wonder: you do realize that page->mapping is volatile, from the point of view of get_ksm_page()? That is the whole point of why get_ksm_page() exists. I can see that the word "volatile" is not obviously used here - it's tucked away inside the ACCESS_ONCE() - but I thought the descriptions of races and barriers made that obvious. If the comments here haven't helped enough, please take a look at git commit 4035c07a8959 "ksm: take keyhole reference to page". -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx174.postini.com [74.125.245.174]) by kanga.kvack.org (Postfix) with SMTP id F29F76B0009 for ; Sun, 27 Jan 2013 23:14:21 -0500 (EST) Received: by mail-pb0-f48.google.com with SMTP id wy12so1226758pbc.7 for ; Sun, 27 Jan 2013 20:14:21 -0800 (PST) Date: Sun, 27 Jan 2013 20:14:22 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <1359337321.6763.18.camel@kernel> Message-ID: References: <1359262556.4159.23.camel@kernel> <1359337321.6763.18.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Sun, 2013-01-27 at 15:05 -0800, Hugh Dickins wrote: > > On Sat, 26 Jan 2013, Simon Jeons wrote: > > > > How can this happen? We only permit switching merge_across_nodes when > > > > pages_shared is 0, and usually set run 2 to force that beforehand, which > > > > ought to unmerge everything: yet oopses still occur when you then run 1. > > > > > > > > Three causes: > > > > > > > > 1. The old stable tree (built according to the inverse merge_across_nodes) > ^^^^^^^^^^^^^^^^^^^^^ > How to understand inverse merge_across_nodes here? How not to understand it? Either it was 0 before (in which case there were as many stable trees as NUMA nodes) and is being changed to 1 (in which case there is to be only one stable tree), or it was 1 before (for one) and is being changed to 0 (for many). > > > > > has not been fully torn down. A stable node lingers until get_ksm_page() > > > > notices that the page it references no longer references it: but the page > > Do you mean page->mapping is NULL when call get_ksm_page()? 
Who clear it > NULL? I think I already pointed you to free_pages_prepare(). > > > > > is not necessarily freed as soon as expected, particularly when swapcache. > > Why is not necessarily freed as soon as expected? As I answered below. > > > > > > > > > > When can this happen? > > > > Whenever there's an additional reference to the page, beyond those for > > its ptes in userspace - swapcache for example, or pinned by get_user_pages. > > That delays its being freed (arriving at the "page->mapping = NULL;" > > in free_pages_prepare()). Or it might simply be sitting in a pagevec, > > waiting for that to be filled up, to be freed as part of a batch. > > > mms forked will be unmerged just after ksmd's cursor since they're > > > inserted behind it, why will be missing? > > > > unmerge_and_remove_all_rmap_items() makes one pass through the list > > from start to finish: insert behind the cursor and it will be missed. > > Since mms forked will be insert just after ksmd's cursor, so it is the > next which will be scan and unmerge, where I miss? mms forked are normally inserted just behind (== before) ksmd's cursor, as I've said in comments and explanations several times. Simon, I've had enough: you clearly have much more time to spare for asking questions than I have for answering them repeatedly: I would rather spend my time attending to 100 higher priorities. Please try much harder to work these things out for yourself from the source (perhaps with help from kernelnewbies.org), before interrogating linux-kernel and linux-mm developers. Sometimes your questions may help everybody to understand better, but often they just waste our time. I'll happily admit that mm, and mm/ksm.c in particular, is not the easiest place to start in understanding the kernel, nor I the best expositor. Best wishes, Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
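Editorial sketch of the "inserted behind the cursor" point, assuming a simplified circular list in userspace (struct slot and the helpers are hypothetical, not the kernel's list.h or mm/ksm.c). Adding with tail semantics relative to an entry links the new element immediately before that entry, so a single forward pass from the cursor to the list head never reaches it; adding relative to the list head places it at the end, where the pass does reach it.

```c
#include <assert.h>
#include <stddef.h>

struct slot { struct slot *next, *prev; };

/* Make an empty circular list. */
void slot_list_init(struct slot *head)
{
	head->next = head->prev = head;
}

/* Same linkage as list_add_tail(new, pos): new goes just before pos. */
void slot_add_tail(struct slot *new, struct slot *pos)
{
	assert(new != pos);
	new->prev = pos->prev;
	new->next = pos;
	pos->prev->next = new;
	pos->prev = new;
}

/* One forward pass from cursor up to head; returns 1 if target is visited. */
int pass_visits(const struct slot *head, const struct slot *cursor,
		const struct slot *target)
{
	const struct slot *s;

	for (s = cursor; s != head; s = s->next)
		if (s == target)
			return 1;
	return 0;
}
```

With a cursor in the middle of the list, an element added before the cursor is skipped by the pass, while an element added before the head (that is, at the end) is visited.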
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id 0B0DC6B0009 for ; Sun, 27 Jan 2013 23:19:27 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id e20so1042840dak.14 for ; Sun, 27 Jan 2013 20:19:27 -0800 (PST) Date: Sun, 27 Jan 2013 20:19:28 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <1359339147.6763.25.camel@kernel> Message-ID: References: <1359339147.6763.25.camel@kernel> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > > Switching merge_across_nodes after running KSM is liable to oops on stale > > nodes still left over from the previous stable tree. It's not something > > Since this patch solve the problem, so the description of > merge_across_nodes(Value can be changed only when there is no ksm shared > pages in system) should be changed in this patch. No. The code could be changed to unmerge_and_remove_all_rmap_items() automatically whenever merge_across_nodes is changed; but that's not what Petr chose to do, and I didn't feel strongly to change it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 7A5646B0007 for ; Mon, 28 Jan 2013 01:36:42 -0500 (EST) Received: by mail-ia0-f182.google.com with SMTP id w33so3733895iag.27 for ; Sun, 27 Jan 2013 22:36:41 -0800 (PST) Message-ID: <1359355000.17885.1.camel@kernel> Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly From: Simon Jeons Date: Mon, 28 Jan 2013 00:36:40 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > Switching merge_across_nodes after running KSM is liable to oops on stale > nodes still left over from the previous stable tree. It's not something > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. > > Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. 
Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. > > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error. 
> > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? > */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through 
the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /* > + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. > + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ Why proceed to next nid if meet unstale stable node in stable tree? Then still can't fully cleanup stale stable nodes. 
> + } > + cond_resched(); > + } > + } > + return err; > +} > + > static int unmerge_and_remove_all_rmap_items(void) > { > struct mm_slot *mm_slot; > @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i > } > } > > + /* Clean up stable nodes, but don't worry if some are still busy */ > + remove_all_stable_nodes(); > ksm_scan.seqnr = 0; > return 0; > > @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) > spin_lock(&ksm_mmlist_lock); > insert_to_mm_slots_hash(mm, mm_slot); > /* > - * Insert just behind the scanning cursor, to let the area settle > + * When KSM_RUN_MERGE (or KSM_RUN_STOP), > + * insert just behind the scanning cursor, to let the area settle > * down a little; when fork is followed by immediate exec, we don't > * want ksmd to waste time setting up and tearing down an rmap_list. > + * > + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its > + * scanning cursor, otherwise KSM pages in newly forked mms will be > + * missed: then we might as well insert at the end of the list. 
> */ > - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > + if (ksm_run & KSM_RUN_UNMERGE) > + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); > + else > + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > spin_unlock(&ksm_mmlist_lock); > > set_bit(MMF_VM_MERGEABLE, &mm->flags); > @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) > } > } > > -struct page *ksm_does_need_to_copy(struct page *page, > +struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > + struct anon_vma *anon_vma = page_anon_vma(page); > struct page *new_page; > > + if (PageKsm(page)) { > + if (page_stable_node(page) && > + !(ksm_run & KSM_RUN_UNMERGE)) > + return page; /* no need to copy it */ > + } else if (!anon_vma) { > + return page; /* no need to copy it */ > + } else if (anon_vma->root == vma->anon_vma->root && > + page->index == linear_page_index(vma, address)) { > + return page; /* still no need to copy it */ > + } > + if (!PageUptodate(page)) > + return page; /* let do_swap_page report the error */ > + > new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); > if (new_page) { > copy_user_highpage(new_page, page, address, vma); > @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( > > mutex_lock(&ksm_thread_mutex); > if (ksm_merge_across_nodes != knob) { > - if (ksm_pages_shared) > + if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > else > ksm_merge_across_nodes = knob; > --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 > @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > goto out_page; > > - if (ksm_might_need_to_copy(page, vma, address)) { > - swapcache = page; > - page = ksm_does_need_to_copy(page, vma, address); > - > - if (unlikely(!page)) { > - ret = VM_FAULT_OOM; > - page = 
swapcache; > - swapcache = NULL; > - goto out_page; > - } > + swapcache = page; > + page = ksm_might_need_to_copy(page, vma, address); > + if (unlikely(!page)) { > + ret = VM_FAULT_OOM; > + page = swapcache; > + swapcache = NULL; > + goto out_page; > } > + if (page == swapcache) > + swapcache = NULL; > > if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > ret = VM_FAULT_OOM; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 6F6296B0007 for ; Mon, 28 Jan 2013 18:03:06 -0500 (EST) Date: Mon, 28 Jan 2013 15:03:04 -0800 From: Andrew Morton Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node Message-Id: <20130128150304.2e7a2fb4.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 25 Jan 2013 17:54:53 -0800 (PST) Hugh Dickins wrote: > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > Default: 20 (chosen for demonstration purposes) > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. 
> + When set to 0, ksm merges only pages which physically > + reside in the memory area of same NUMA node. It brings > + lower latency to access to shared page. Value can be > + changed only when there is no ksm shared pages in system. > + Default: 1 > + The explanation doesn't really tell the operator whether or not to set merge_across_nodes for a particular machine/workload. I guess most people will just shrug, turn the thing on and see if it improved things, but that's rather random. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id D48C96B0007 for ; Mon, 28 Jan 2013 18:08:56 -0500 (EST) Date: Mon, 28 Jan 2013 15:08:54 -0800 From: Andrew Morton Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node Message-Id: <20130128150854.6813b1ca.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 25 Jan 2013 17:54:53 -0800 (PST) Hugh Dickins wrote: > +/* Zeroed when merging across nodes is not allowed */ > +static unsigned int ksm_merge_across_nodes = 1; I spose this should be __read_mostly. If __read_mostly is not really a synonym for __make_write_often_storage_slower. I continue to harbor fear, uncertainty and doubt about this... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx110.postini.com [74.125.245.110])
	by kanga.kvack.org (Postfix) with SMTP id 0722A6B0007
	for ; Mon, 28 Jan 2013 18:11:20 -0500 (EST)
Date: Mon, 28 Jan 2013 15:11:19 -0800
From: Andrew Morton
Subject: Re: [PATCH 3/11] ksm: trivial tidyups
Message-Id: <20130128151119.b74d0150.akpm@linux-foundation.org>
In-Reply-To:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 25 Jan 2013 17:58:11 -0800 (PST) Hugh Dickins wrote:

> +#ifdef CONFIG_NUMA
> +#define NUMA(x)		(x)
> +#define DO_NUMA(x)	(x)

Did we consider

	#define DO_NUMA(x)	do { (x); } while (0)

?  That could avoid some nasty config-dependent compilation issues.

> +#else
> +#define NUMA(x)		(0)
> +#define DO_NUMA(x)	do { } while (0)
> +#endif
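Editorial sketch of the do-while(0) suggestion, assuming a hypothetical DEMO_NUMA_ENABLED switch standing in for CONFIG_NUMA (the DEMO_ macros, struct demo_item and both functions are illustrative names, not the kernel's). When the enabled variant is also wrapped in do { ...; } while (0), the macro expands to a single statement in both configurations, so accidentally using it as a value fails to compile either way, instead of only when the option is off.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUMA_ENABLED 1	/* hypothetical stand-in for CONFIG_NUMA */

#if DEMO_NUMA_ENABLED
#define DEMO_NUMA(x)	(x)
#define DEMO_DO_NUMA(x)	do { (x); } while (0)
#else
#define DEMO_NUMA(x)	(0)
#define DEMO_DO_NUMA(x)	do { } while (0)
#endif

struct demo_item { int nid; };

/* Record the node id only when the NUMA variant is compiled in. */
void demo_set_nid(struct demo_item *item, int nid, int update)
{
	assert(item != NULL);
	if (update)
		DEMO_DO_NUMA(item->nid = nid);
	else
		item->nid = -1;	/* the macro is one statement, so this else pairs with the if */
}

/* Value-style macro: yields the node id, or 0 without NUMA. */
int demo_tree_index(const struct demo_item *item)
{
	return DEMO_NUMA(item->nid);
}
```

Flipping DEMO_NUMA_ENABLED to 0 should still compile, with demo_set_nid() leaving nid untouched on the update path and demo_tree_index() always returning 0.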
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118])
	by kanga.kvack.org (Postfix) with SMTP id B12836B0008
	for ; Mon, 28 Jan 2013 18:44:09 -0500 (EST)
Date: Mon, 28 Jan 2013 15:44:07 -0800
From: Andrew Morton
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
Message-Id: <20130128154407.16a623a4.akpm@linux-foundation.org>
In-Reply-To:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 25 Jan 2013 18:01:59 -0800 (PST) Hugh Dickins wrote:

> +static int remove_all_stable_nodes(void)
> +{
> +	struct stable_node *stable_node;
> +	int nid;
> +	int err = 0;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		while (root_stable_tree[nid].rb_node) {
> +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> +						struct stable_node, node);
> +			if (remove_stable_node(stable_node)) {
> +				err = -EBUSY;

It's a bit rude to overwrite remove_stable_node()'s return value.

> +				break;	/* proceed to next nid */
> +			}
> +			cond_resched();

Why is this here?

> +		}
> +	}
> +	return err;
> +}
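Editorial sketch of one way to read the first review comment, assuming hypothetical userspace types (struct demo_tree, demo_remove_top() and demo_remove_all() are illustrative and are not the patch's code). The loop keeps the quoted structure, giving up on a tree at the first busy entry and moving on to the next one, but it returns the first error code the callee reported rather than a hard-coded value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX 4

/* A toy "tree": entries are removed from the top; busy[i] != 0 blocks removal. */
struct demo_tree {
	int busy[DEMO_MAX];
	size_t count;
};

/* Remove the top entry, or report -EBUSY if it is still in use. */
int demo_remove_top(struct demo_tree *t)
{
	assert(t->count > 0);
	if (t->busy[t->count - 1])
		return -EBUSY;
	t->count--;
	return 0;
}

/* Empty every tree as far as possible; return the first error seen, if any. */
int demo_remove_all(struct demo_tree trees[], size_t ntrees)
{
	int err = 0;
	size_t nid;

	for (nid = 0; nid < ntrees; nid++) {
		while (trees[nid].count > 0) {
			int ret = demo_remove_top(&trees[nid]);

			if (ret) {
				if (!err)
					err = ret;	/* keep the callee's code */
				break;			/* proceed to next tree */
			}
		}
	}
	return err;
}
```

In this toy version a busy entry leaves the rest of its tree in place, while later trees are still emptied.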
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135])
	by kanga.kvack.org (Postfix) with SMTP id 575A26B0010
	for ; Mon, 28 Jan 2013 18:54:54 -0500 (EST)
Date: Mon, 28 Jan 2013 15:54:52 -0800
From: Andrew Morton
Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration
Message-Id: <20130128155452.16882a6e.akpm@linux-foundation.org>
In-Reply-To:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel, David Rientjes, Anton Arapov, Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 25 Jan 2013 17:53:10 -0800 (PST) Hugh Dickins wrote:

> Here's a KSM series

Sanity check: do you have a feeling for how useful KSM is?
Performance/space improvements for typical (or atypical) workloads?
Are people using it?  Successfully?

IOW, is it justifying itself?
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx116.postini.com [74.125.245.116]) by kanga.kvack.org (Postfix) with SMTP id 6FA756B0007 for ; Mon, 28 Jan 2013 19:49:42 -0500 (EST) Received: by mail-wi0-f200.google.com with SMTP id hn14so2980681wib.7 for ; Mon, 28 Jan 2013 16:49:40 -0800 (PST) Message-ID: <51071CA0.801@ravellosystems.com> Date: Tue, 29 Jan 2013 02:49:36 +0200 From: Izik Eidus MIME-Version: 1.0 Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration References: <20130128155452.16882a6e.akpm@linux-foundation.org> In-Reply-To: <20130128155452.16882a6e.akpm@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Hugh Dickins , Petr Holasek , Andrea Arcangeli , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org On 01/29/2013 01:54 AM, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > Hugh Dickins wrote: > >> Here's a KSM series > Sanity check: do you have a feeling for how useful KSM is? > Performance/space improvements for typical (or atypical) workloads? > Are people using it? Successfully? Hi, I think it mostly used for virtualization, I know at least two products that it use - RHEV - RedHat enterprise virtualization, and my current place (Ravello Systems) that use it to do vm consolidation on top of cloud enviorments (Run multiple unmodified VMs on top of one vm you get from ec2 / rackspace / what so ever), for Ravello it is highly critical in achieving high rate of consolidation ratio... > > IOW, is it justifying itself? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id B8AF46B0007 for ; Mon, 28 Jan 2013 20:17:21 -0500 (EST) Received: by mail-pb0-f42.google.com with SMTP id wz17so577460pbc.29 for ; Mon, 28 Jan 2013 17:17:21 -0800 (PST) Date: Mon, 28 Jan 2013 17:17:24 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <20130128150304.2e7a2fb4.akpm@linux-foundation.org> Message-ID: References: <20130128150304.2e7a2fb4.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:54:53 -0800 (PST) > Hugh Dickins wrote: > > > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > > Default: 20 (chosen for demonstration purposes) > > > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > > + When set to 0, ksm merges only pages which physically > > + reside in the memory area of same NUMA node. It brings > > + lower latency to access to shared page. Value can be > > + changed only when there is no ksm shared pages in system. > > + Default: 1 > > + > > The explanation doesn't really tell the operator whether or not to set > merge_across_nodes for a particular machine/workload. > > I guess most people will just shrug, turn the thing on and see if it > improved things, but that's rather random. Right. 
I don't think we can tell them which is going to be better, but surely we could do a better job of hinting at the tradeoffs. I think we expect large NUMA machines with lots of memory to want the better NUMA behavior of !merge_across_nodes, but machines with more limited memory across short-distance NUMA nodes to prefer the greater deduplication of merge_across_nodes. Petr, do you have a more informative text for this? Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id 154406B0007 for ; Mon, 28 Jan 2013 20:38:41 -0500 (EST) Received: by mail-pa0-f54.google.com with SMTP id bi5so48335pad.27 for ; Mon, 28 Jan 2013 17:38:40 -0800 (PST) Date: Mon, 28 Jan 2013 17:38:43 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <20130128150854.6813b1ca.akpm@linux-foundation.org> Message-ID: References: <20130128150854.6813b1ca.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:54:53 -0800 (PST) > Hugh Dickins wrote: > > > +/* Zeroed when merging across nodes is not allowed */ > > +static unsigned int ksm_merge_across_nodes = 1; > > I spose this should be __read_mostly. If __read_mostly is not really a > synonym for __make_write_often_storage_slower. I continue to harbor > fear, uncertainty and doubt about this... Could do. 
No strong feeling, but I think I'd rather it share its cacheline with other KSM-related stuff, than be off mixed up with unrelateds. I think there's a much stronger case for __read_mostly when it's a library thing accessed by different subsystems. You're right that this variable is accessed significantly more often than the other KSM tunables, so deserves a __read_mostly more than they do. But where to stop? Similar reluctance led me to avoid using "unlikely" throughout ksm.c, unlikely as some conditions are (I'm aghast to see that Andrea sneaked in a "likely" :). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 648A96B0007 for ; Mon, 28 Jan 2013 20:44:20 -0500 (EST) Received: by mail-pb0-f42.google.com with SMTP id wz17so590278pbc.29 for ; Mon, 28 Jan 2013 17:44:19 -0800 (PST) Date: Mon, 28 Jan 2013 17:44:23 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 3/11] ksm: trivial tidyups In-Reply-To: <20130128151119.b74d0150.akpm@linux-foundation.org> Message-ID: References: <20130128151119.b74d0150.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:58:11 -0800 (PST) > Hugh Dickins wrote: > > > +#ifdef CONFIG_NUMA > > +#define NUMA(x) (x) > > +#define DO_NUMA(x) (x) > > Did we consider > > #define DO_NUMA do { (x) } while (0) > > ? It didn't occur to me at all. I like that it makes more sense of the DO_NUMA variant. 
Is it okay that, to work with the way I was using it, we need "(x);" in there rather than just "(x)"? > > That could avoid some nasty config-dependent compilation issues. > > > +#else > > +#define NUMA(x) (0) [PATCH] ksm: trivial tidyups fix Suggested by akpm: make DO_NUMA(x) do { (x); } while (0) more like the #else. Signed-off-by: Hugh Dickins --- mm/ksm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- mmotm.org/mm/ksm.c 2013-01-27 09:55:45.000000000 -0800 +++ mmotm/mm/ksm.c 2013-01-28 16:50:25.772026446 -0800 @@ -43,7 +43,7 @@ #ifdef CONFIG_NUMA #define NUMA(x) (x) -#define DO_NUMA(x) (x) +#define DO_NUMA(x) do { (x); } while (0) #else #define NUMA(x) (0) #define DO_NUMA(x) do { } while (0) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id 9E8C16B0007 for ; Mon, 28 Jan 2013 21:03:13 -0500 (EST) Received: by mail-pa0-f50.google.com with SMTP id hz10so60390pad.23 for ; Mon, 28 Jan 2013 18:03:12 -0800 (PST) Date: Mon, 28 Jan 2013 18:03:16 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <20130128154407.16a623a4.akpm@linux-foundation.org> Message-ID: References: <20130128154407.16a623a4.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 18:01:59 -0800 (PST) > Hugh Dickins wrote: > > > +static int remove_all_stable_nodes(void) > > +{ > > + struct stable_node *stable_node; > > + int nid; > > + int err = 0; > > + > > + for (nid = 0; nid < 
nr_node_ids; nid++) { > > + while (root_stable_tree[nid].rb_node) { > > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > > + struct stable_node, node); > > + if (remove_stable_node(stable_node)) { > > + err = -EBUSY; > > It's a bit rude to overwrite remove_stable_node()'s return value. Well.... yes, but only the tiniest bit rude :) > > > + break; /* proceed to next nid */ > > + } > > + cond_resched(); > > Why is this here? Because we don't have a limit on the length of this loop, and if every node which remove_stable_node() finds is already stale, and has no rmap_item still attached, then there would be no rescheduling point in the unbounded loop without this one. I was taught to worry about bad latencies even in unpreemptible kernels. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 483166B0007 for ; Mon, 28 Jan 2013 21:26:23 -0500 (EST) Received: by mail-lb0-f198.google.com with SMTP id gf14so102746lbb.9 for ; Mon, 28 Jan 2013 18:26:18 -0800 (PST) Message-ID: <51073345.4070605@ravellosystems.com> Date: Tue, 29 Jan 2013 04:26:13 +0200 From: Izik Eidus MIME-Version: 1.0 Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration References: <20130128155452.16882a6e.akpm@linux-foundation.org> <51071CA0.801@ravellosystems.com> In-Reply-To: <51071CA0.801@ravellosystems.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Hugh Dickins , Petr Holasek , Andrea Arcangeli , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org On 01/29/2013 02:49 AM, Izik Eidus wrote: > On 01/29/2013 01:54 AM, Andrew 
Morton wrote: >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST) >> Hugh Dickins wrote: >> >>> Here's a KSM series >> Sanity check: do you have a feeling for how useful KSM is? >> Performance/space improvements for typical (or atypical) workloads? >> Are people using it? Successfully? BTW, After thinking a bit about the word people, I wanted to see if normal users of linux that just download and install Linux (without using special virtualization product) are able to use it. So I google little bit for it, and found some nice results from users: http://serverascode.com/2012/11/11/ksm-kvm.html But I do agree that it provide justifying value only for virtualization users... > > Hi, > I think it mostly used for virtualization, I know at least two > products that it use - > RHEV - RedHat enterprise virtualization, and my current place (Ravello > Systems) that use it to do vm consolidation on top of cloud enviorments > (Run multiple unmodified VMs on top of one vm you get from ec2 / > rackspace / what so ever), for Ravello it is highly critical in > achieving high rate > of consolidation ratio... > >> >> IOW, is it justifying itself? > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150]) by kanga.kvack.org (Postfix) with SMTP id 3CFFA8D0002 for ; Tue, 29 Jan 2013 11:51:29 -0500 (EST) Date: Tue, 29 Jan 2013 17:51:25 +0100 From: Andrea Arcangeli Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration Message-ID: <20130129165125.GA17671@redhat.com> References: <20130128155452.16882a6e.akpm@linux-foundation.org> <51071CA0.801@ravellosystems.com> <51073345.4070605@ravellosystems.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51073345.4070605@ravellosystems.com> Sender: owner-linux-mm@kvack.org List-ID: To: Izik Eidus Cc: Andrew Morton , Hugh Dickins , Petr Holasek , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Hi everyone, On Tue, Jan 29, 2013 at 04:26:13AM +0200, Izik Eidus wrote: > On 01/29/2013 02:49 AM, Izik Eidus wrote: > > On 01/29/2013 01:54 AM, Andrew Morton wrote: > >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > >> Hugh Dickins wrote: > >> > >>> Here's a KSM series > >> Sanity check: do you have a feeling for how useful KSM is? > >> Performance/space improvements for typical (or atypical) workloads? > >> Are people using it? Successfully? > > > BTW, After thinking a bit about the word people, I wanted to see if > normal users of linux > that just download and install Linux (without using special > virtualization product) are able to use it. > So I google little bit for it, and found some nice results from users: > http://serverascode.com/2012/11/11/ksm-kvm.html > > But I do agree that it provide justifying value only for virtualization > users... 
Mostly for virtualization users indeed, but I'm aware of a few non virtualization users too: 1) CERN has been one of the early adopters of KSM and initially they were using KSM standalone (probably because not all hypervisors they had to deal with were KVM/linux based, while all guests were linux and in turn KSM capable). More info in the KSM paper page 2: http://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf However lately they're running KSM in combination with KVM too, and I'm not sure if they're still using it standalone. See the "KSM shared" blue area in slide 12 and the comparison with KSM on and off in slide 14. https://indico.fnal.gov/getFile.py/access?contribId=18&sessionId=4&resId=0&materialId=slides&confId=4986 2) all recent cyanogenmod in the performance menu in settings supports KSM out of the box. You can run it for a while and then shut it off. Not sure how good idea it is to leave it always on, but the only efficient cellphone/tablet powersaving design (i.e. the wakelocks + suspend to ram) still won't waste energy while the screen is off and the phone has suspended to ram, regardless of KSM on or off. KSM NUMA awareness however is not needed on the cellphone :). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id C76976B0007 for ; Wed, 30 Jan 2013 19:05:47 -0500 (EST) Received: by mail-pa0-f49.google.com with SMTP id bi1so1372509pad.22 for ; Wed, 30 Jan 2013 16:05:47 -0800 (PST) Message-ID: <1359590736.1557.0.camel@kernel> Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration From: Ric Mason Date: Wed, 30 Jan 2013 18:05:36 -0600 In-Reply-To: <20130129165125.GA17671@redhat.com> References: <20130128155452.16882a6e.akpm@linux-foundation.org> <51071CA0.801@ravellosystems.com> <51073345.4070605@ravellosystems.com> <20130129165125.GA17671@redhat.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli Cc: Izik Eidus , Andrew Morton , Hugh Dickins , Petr Holasek , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, 2013-01-29 at 17:51 +0100, Andrea Arcangeli wrote: > Hi everyone, > > On Tue, Jan 29, 2013 at 04:26:13AM +0200, Izik Eidus wrote: > > On 01/29/2013 02:49 AM, Izik Eidus wrote: > > > On 01/29/2013 01:54 AM, Andrew Morton wrote: > > >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > > >> Hugh Dickins wrote: > > >> > > >>> Here's a KSM series > > >> Sanity check: do you have a feeling for how useful KSM is? > > >> Performance/space improvements for typical (or atypical) workloads? > > >> Are people using it? Successfully? > > > > > > BTW, After thinking a bit about the word people, I wanted to see if > > normal users of linux > > that just download and install Linux (without using special > > virtualization product) are able to use it. 
> > So I google little bit for it, and found some nice results from users: > > http://serverascode.com/2012/11/11/ksm-kvm.html > > > > But I do agree that it provide justifying value only for virtualization > > users... > > Mostly for virtualization users indeed, but I'm aware of a few non > virtualization users too: > > 1) CERN has been one of the early adopters of KSM and initially they > were using KSM standalone (probably because not all hypervisors they > had to deal with were KVM/linux based, while all guests were linux and > in turn KSM capable). More info in the KSM paper page 2: > > http://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf > > However lately they're running KSM in combination with KVM too, and I'm > not sure if they're still using it standalone. See the "KSM shared" > blue area in slide 12 and the comparison with KSM on and off in slide > 14. > > https://indico.fnal.gov/getFile.py/access?contribId=18&sessionId=4&resId=0&materialId=slides&confId=4986 > > 2) all recent cyanogenmod in the performance menu in settings supports > KSM out of the box. You can run it for a while and then shut it > off. > > Not sure how good idea it is to leave it always on, but the only > efficient cellphone/tablet powersaving design (i.e. the wakelocks + > suspend to ram) still won't waste energy while the screen is off and > the phone has suspended to ram, regardless of KSM on or off. > > KSM NUMA awareness however is not needed on the cellphone :). Thanks for your sharing. Is there ksm benchmark? How to get it? > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx155.postini.com [74.125.245.155]) by kanga.kvack.org (Postfix) with SMTP id 410276B000D for ; Tue, 5 Feb 2013 11:41:24 -0500 (EST) Date: Tue, 5 Feb 2013 16:41:18 +0000 From: Mel Gorman Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node Message-ID: <20130205164118.GI21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote: > From: Petr Holasek > > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes > which control merging pages across different numa nodes. > When it is set to zero only pages from the same node are merged, > otherwise pages from all nodes can be merged together (default behavior). > > Typical use-case could be a lot of KVM guests on NUMA machine > and cpus from more distant nodes would have significant increase > of access latency to the merged ksm page. Sysfs knob was choosen > for higher variability when some users still prefers higher amount > of saved physical memory regardless of access latency. > This is understandable but it's going to be a fairly obscure option. I do not think it can be known in advance if the option should be set. The user must either run benchmarks before and after or use perf to record the "node-load-misses" event and see if setting the parameter reduces the number of remote misses. I don't know the internals of ksm.c at all and this is my first time reading this series. Everything in this review is subject to being completely wrong or due to a major misunderstanding on my part. Delete all feedback if desired. 
> Every numa node has its own stable & unstable trees because of faster > searching and inserting. Changing of merge_across_nodes value is possible > only when there are not any ksm shared pages in system. > > I've tested this patch on numa machines with 2, 4 and 8 nodes and > measured speed of memory access inside of KVM guests with memory pinned > to one of nodes with this benchmark: > > http://pholasek.fedorapeople.org/alloc_pg.c > > Population standard deviations of access times in percentage of average > were following: > > merge_across_nodes=1 > 2 nodes 1.4% > 4 nodes 1.6% > 8 nodes 1.7% > > merge_across_nodes=0 > 2 nodes 1% > 4 nodes 0.32% > 8 nodes 0.018% > > RFC: https://lkml.org/lkml/2011/11/30/91 > v1: https://lkml.org/lkml/2012/1/23/46 > v2: https://lkml.org/lkml/2012/6/29/105 > v3: https://lkml.org/lkml/2012/9/14/550 > v4: https://lkml.org/lkml/2012/9/23/137 > v5: https://lkml.org/lkml/2012/12/10/540 > v6: https://lkml.org/lkml/2012/12/23/154 > v7: https://lkml.org/lkml/2012/12/27/225 > > Hugh notes that this patch brings two problems, whose solution needs > further support in mm/ksm.c, which follows in subsequent patches: > 1) switching merge_across_nodes after running KSM is liable to oops > on stale nodes still left over from the previous stable tree; > 2) memory hotremove may migrate KSM pages, but there is no provision > here for !merge_across_nodes to migrate nodes to the proper tree. > > Signed-off-by: Petr Holasek > Signed-off-by: Hugh Dickins > Acked-by: Rik van Riel > --- > Documentation/vm/ksm.txt | 7 + > mm/ksm.c | 151 ++++++++++++++++++++++++++++++++----- > 2 files changed, 139 insertions(+), 19 deletions(-) > > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > e.g. 
"echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > Default: 20 (chosen for demonstration purposes) > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > + When set to 0, ksm merges only pages which physically > + reside in the memory area of same NUMA node. It brings > + lower latency to access to shared page. Value can be > + changed only when there is no ksm shared pages in system. > + Default: 1 > + > run - set 0 to stop ksmd from running but keep merged pages, > set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", > set 2 to stop ksmd and unmerge all pages currently merged, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 > @@ -36,6 +36,7 @@ > #include > #include > #include > +#include > > #include > #include "internal.h" > @@ -139,6 +140,9 @@ struct rmap_item { > struct mm_struct *mm; > unsigned long address; /* + low bits used for flags below */ > unsigned int oldchecksum; /* when unstable */ > +#ifdef CONFIG_NUMA > + unsigned int nid; > +#endif > union { > struct rb_node node; /* when node of unstable tree */ > struct { /* when listed from stable tree */ > @@ -153,8 +157,8 @@ struct rmap_item { > #define STABLE_FLAG 0x200 /* is listed from the stable tree */ > > /* The stable and unstable tree heads */ > -static struct rb_root root_stable_tree = RB_ROOT; > -static struct rb_root root_unstable_tree = RB_ROOT; > +static struct rb_root root_unstable_tree[MAX_NUMNODES]; > +static struct rb_root root_stable_tree[MAX_NUMNODES]; > With multiple stable node trees does the comment that begins with * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: need an update? It's uninitialised so kernel data size in vmlinux should be unaffected but it's an additional runtime cost of around 4K for a standardish enterprise distro kernel config. 
Small beans on a NUMA machine and maybe not worth the hassle of kmalloc for nr_online_nodes and dealing with node memory hotplug but it's a pity. > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ > /* Milliseconds ksmd should sleep between batches */ > static unsigned int ksm_thread_sleep_millisecs = 20; > > +/* Zeroed when merging across nodes is not allowed */ > +static unsigned int ksm_merge_across_nodes = 1; > + Nit but initialised data does increase the size of vmlinux so maybe this should be the "opposite". i.e. rename it to ksm_merge_within_nodes and default it to 0? __read_mostly? > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > @@ -441,10 +448,25 @@ out: page = NULL; > return page; > } > > +/* > + * This helper is used for getting right index into array of tree roots. > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for > + * stable and unstable pages from all nodes with roots in index 0. Otherwise, > + * every node has its own stable and unstable tree. > + */ > +static inline int get_kpfn_nid(unsigned long kpfn) > +{ > + if (ksm_merge_across_nodes) > + return 0; > + else > + return pfn_to_nid(kpfn); > +} > + If we start with ksm_merge_across_nodes, KSM runs for a while and populates the stable node tree for node 0 and then ksm_merge_across_nodes gets set then badness happens because this can go anywhere nid = get_kpfn_nid(stable_node->kpfn); rb_erase(&stable_node->node, &root_stable_tree[nid]); Very late in the review I noticed that you comment on this already in the changelog and that it is addressed later in the series. I haven't seen this patch yet so the following suggestion is very stale but might still be relevant. 
We could increase the size of root_stable_tree[] by 1, have get_kpfn_nid return MAX_NUMNODES if ksm_merge_across_nodes and if ksm_merge_across_nodes gets set to 0 then we walk the stable tree at root_stable_tree[MAX_NUMNODES] and delete the entire tree? It'd be disruptive as hell unfortunately and might break entirely if there is not enough memory to unshare the pages. Ideally we could take our time walking root_stable_tree[MAX_NUMNODES] without worrying about collisions and fix it up somehow. Dunno. > static void remove_node_from_stable_tree(struct stable_node *stable_node) > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > + int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - rb_erase(&stable_node->node, &root_stable_tree); > + nid = get_kpfn_nid(stable_node->kpfn); > + > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > free_stable_node(stable_node); > } > > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s > age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); > BUG_ON(age > 1); > if (!age) > - rb_erase(&rmap_item->node, &root_unstable_tree); > +#ifdef CONFIG_NUMA > + rb_erase(&rmap_item->node, > + &root_unstable_tree[rmap_item->nid]); > +#else > + rb_erase(&rmap_item->node, &root_unstable_tree[0]); > +#endif > nit, does rmap_item->nid deserve a getter and setter helper instead? 
#ifdef CONFIG_NUMA static inline int rmap_item_nid(struct rmap_item *item) { return item->nid; } static inline void set_rmap_item_nid(struct rmap_item *item, int nid) { item->nid = nid; } #else static inline int rmap_item_nid(struct rmap_item *item) { return 0; } static inline void set_rmap_item_nid(struct rmap_item *item, int nid) { } #endif > ksm_pages_unshared--; > rmap_item->address &= PAGE_MASK; > @@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node = root_stable_tree.rb_node; > + struct rb_node *node; > struct stable_node *stable_node; > + int nid; > > stable_node = page_stable_node(page); > if (stable_node) { /* ksm page forked */ > @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s > return page; > } > > + nid = get_kpfn_nid(page_to_pfn(page)); > + node = root_stable_tree[nid].rb_node; > + > while (node) { > struct page *tree_page; > int ret; > @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s > */ > static struct stable_node *stable_tree_insert(struct page *kpage) > { > - struct rb_node **new = &root_stable_tree.rb_node; > + int nid; > + unsigned long kpfn; > + struct rb_node **new; > struct rb_node *parent = NULL; > struct stable_node *stable_node; > > + kpfn = page_to_pfn(kpage); > + nid = get_kpfn_nid(kpfn); > + new = &root_stable_tree[nid].rb_node; > + > while (*new) { > struct page *tree_page; > int ret; > @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i > return NULL; > > rb_link_node(&stable_node->node, parent, new); > - rb_insert_color(&stable_node->node, &root_stable_tree); > + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > INIT_HLIST_HEAD(&stable_node->hlist); > > - stable_node->kpfn = page_to_pfn(kpage); > + stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > > return stable_node; > @@ -1098,10 +1137,15 @@ static > struct rmap_item 
*unstable_tree_search_insert(struct rmap_item *rmap_item, > struct page *page, > struct page **tree_pagep) > - > { > - struct rb_node **new = &root_unstable_tree.rb_node; > + struct rb_node **new; > + struct rb_root *root; > struct rb_node *parent = NULL; > + int nid; > + > + nid = get_kpfn_nid(page_to_pfn(page)); > + root = &root_unstable_tree[nid]; > + new = &root->rb_node; > > while (*new) { > struct rmap_item *tree_rmap_item; > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > return NULL; > } > > + /* > + * If tree_page has been migrated to another NUMA node, it > + * will be flushed out and put into the right unstable tree > + * next time: only merge with it if merge_across_nodes. > + * Just notice, we don't have similar problem for PageKsm > + * because their migration is disabled now. (62b61f611e) > + */ > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { > + put_page(tree_page); > + return NULL; > + } > + What about this case? 1. ksm_merge_across_nodes==0 2. pages gets placed on different unstable trees 3. ksm_merge_across_nodes==1 At that point we should be removing pages from the different unstable tree and moving them to root_unstable_tree[0] but this put_page() doesn't happen. Does it matter? 
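To make the scenario above concrete, here is a small user-space model of the index selection. It is only a sketch based on the hunks quoted in this mail: tree_index() mirrors what get_kpfn_nid() does in the patch, while struct item, insert_item(), remove_item() and the per-node counters are hypothetical stand-ins for rmap_items and the rb-trees. It shows that an item inserted while the knob is 0 stays recorded under its own node after the knob flips to 1, while new lookups go to index 0.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical node count for the sketch */

static unsigned int merge_across_nodes = 1;
static int unstable_count[NR_NODES];	/* stands in for the unstable rb-trees */

struct item {
	int page_nid;	/* node the page lives on */
	int tree_nid;	/* index of the tree the item was inserted into */
};

/* Mirrors get_kpfn_nid(): one shared tree at index 0, or one per node. */
static int tree_index(int page_nid)
{
	return merge_across_nodes ? 0 : page_nid;
}

/* Records the chosen index in the item, as the patch does with ->nid. */
static void insert_item(struct item *it)
{
	it->tree_nid = tree_index(it->page_nid);
	unstable_count[it->tree_nid]++;
}

/* Removal uses the recorded index, not the current knob value. */
static void remove_item(struct item *it)
{
	assert(unstable_count[it->tree_nid] > 0);
	unstable_count[it->tree_nid]--;
}
```

In this model the stale entry is still found through its recorded index, which matches the patch erasing from root_unstable_tree[rmap_item->nid]. Whether leaving such an entry in a per-node tree until it is next removed matters in the real code is exactly the open question, and the model does not answer it.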
> ret = memcmp_pages(page, tree_page); > > parent = *new; > @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i > > rmap_item->address |= UNSTABLE_FLAG; > rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); > +#ifdef CONFIG_NUMA > + rmap_item->nid = nid; > +#endif > rb_link_node(&rmap_item->node, parent, new); > - rb_insert_color(&rmap_item->node, &root_unstable_tree); > + rb_insert_color(&rmap_item->node, root); > > ksm_pages_unshared++; > return NULL; > @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > +#ifdef CONFIG_NUMA > + /* > + * Usually rmap_item->nid is already set correctly, > + * but it may be wrong after switching merge_across_nodes. > + */ > + rmap_item->nid = get_kpfn_nid(stable_node->kpfn); > +#endif > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r > struct mm_slot *slot; > struct vm_area_struct *vma; > struct rmap_item *rmap_item; > + int nid; > > if (list_empty(&ksm_mm_head.mm_list)) > return NULL; > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > - root_unstable_tree = RB_ROOT; > + for (nid = 0; nid < nr_node_ids; nid++) > + root_unstable_tree[nid] = RB_ROOT; >

Minor, but you shouldn't need to reset them all if ksm_merge_across_nodes==1.

Initially this triggered an alarm because it's not immediately obvious why you can just discard an rbtree like this. It looks like it works because the unstable tree's entries are also part of a linked list, so the rb representation can be reset quickly without leaking memory.
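
The point about discarding the tree can be illustrated with a minimal userspace sketch. This is not kernel code: struct item, item_add(), tree_reset() and item_free_all() are hypothetical names, and a plain unbalanced binary search tree stands in for the kernel rbtree. The only thing it shows is that when every entry is owned by a separate list, resetting the tree root drops the index without losing track of any allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for rmap_item: owned by a list, indexed by a tree. */
struct item {
	struct item *next;		/* ownership list */
	struct item *left, *right;	/* tree links (stand-in for rb_node) */
	int key;
};

static struct item *item_list;	/* owns every item */
static struct item *tree_root;	/* index only, owns nothing */

/* Allocate an item, put it on the owning list and link it into the tree. */
static struct item *item_add(int key)
{
	struct item *it = calloc(1, sizeof(*it));
	struct item **link = &tree_root;

	if (!it)
		return NULL;
	it->key = key;
	it->next = item_list;
	item_list = it;

	/* Plain BST descent; the kernel would rebalance, this sketch does not. */
	while (*link)
		link = key < (*link)->key ? &(*link)->left : &(*link)->right;
	*link = it;
	return it;
}

/* Same spirit as "root_unstable_tree = RB_ROOT": forget the index. */
static void tree_reset(void)
{
	tree_root = NULL;
}

/* Free everything through the owning list; returns the number freed. */
static int item_free_all(void)
{
	int freed = 0;

	while (item_list) {
		struct item *it = item_list;

		item_list = it->next;
		free(it);
		freed++;
	}
	return freed;
}
```

After three item_add() calls followed by tree_reset(), item_free_all() still finds and frees all three items, because the list, not the tree, holds ownership.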
> spin_lock(&ksm_mmlist_lock); > slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list); > @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta > unsigned long end_pfn) > { > struct rb_node *node; > + int nid; > > - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) { > - struct stable_node *stable_node; > + for (nid = 0; nid < nr_node_ids; nid++) > + for (node = rb_first(&root_stable_tree[nid]); node; > + node = rb_next(node)) { > + struct stable_node *stable_node; > + > + stable_node = rb_entry(node, struct stable_node, node); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + return stable_node; > + } > > - stable_node = rb_entry(node, struct stable_node, node); > - if (stable_node->kpfn >= start_pfn && > - stable_node->kpfn < end_pfn) > - return stable_node; > - } > return NULL; > } > > @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject > } > KSM_ATTR(run); > > +#ifdef CONFIG_NUMA > +static ssize_t merge_across_nodes_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return sprintf(buf, "%u\n", ksm_merge_across_nodes); > +} > + > +static ssize_t merge_across_nodes_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + int err; > + unsigned long knob; > + > + err = kstrtoul(buf, 10, &knob); > + if (err) > + return err; > + if (knob > 1) > + return -EINVAL; > + > + mutex_lock(&ksm_thread_mutex); > + if (ksm_merge_across_nodes != knob) { > + if (ksm_pages_shared) > + err = -EBUSY; > + else > + ksm_merge_across_nodes = knob; > + } > + mutex_unlock(&ksm_thread_mutex); > + > + return err ? 
err : count; > +} > +KSM_ATTR(merge_across_nodes); > +#endif > + > static ssize_t pages_shared_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = { > &pages_unshared_attr.attr, > &pages_volatile_attr.attr, > &full_scans_attr.attr, > +#ifdef CONFIG_NUMA > + &merge_across_nodes_attr.attr, > +#endif > NULL, > }; > > @@ -1992,11 +2101,15 @@ static int __init ksm_init(void) > { > struct task_struct *ksm_thread; > int err; > + int nid; > > err = ksm_slab_init(); > if (err) > goto out; > > + for (nid = 0; nid < nr_node_ids; nid++) > + root_stable_tree[nid] = RB_ROOT; > + > ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); > if (IS_ERR(ksm_thread)) { > printk(KERN_ERR "ksm: creating kthread failed\n"); > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id A48616B0009 for ; Tue, 5 Feb 2013 11:48:27 -0500 (EST) Date: Tue, 5 Feb 2013 16:48:23 +0000 From: Mel Gorman Subject: Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Message-ID: <20130205164823.GJ21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote: > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient > (restarting whenever it finds a stale node to remove), but rearrange > so that at least it does not needlessly restart from nid 0 each time. 
> And add a couple of comments: here is why we keep pfn instead of page. > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 38 ++++++++++++++++++++++---------------- > 1 file changed, 22 insertions(+), 16 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa > #endif /* CONFIG_MIGRATION */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, > - unsigned long end_pfn) > +static void ksm_check_stable_tree(unsigned long start_pfn, > + unsigned long end_pfn) > { > + struct stable_node *stable_node; > struct rb_node *node; > int nid; > > - for (nid = 0; nid < nr_node_ids; nid++) > - for (node = rb_first(&root_stable_tree[nid]); node; > - node = rb_next(node)) { > - struct stable_node *stable_node; > - > + for (nid = 0; nid < nr_node_ids; nid++) { > + node = rb_first(&root_stable_tree[nid]); > + while (node) {

This is not your fault, the old code is wrong too. It is assuming that all nodes are populated in numeric order with no holes. It won't work if just two nodes, 0 and 4, are online. It should be using for_each_online_node().

-- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id 5908D6B0008 for ; Tue, 5 Feb 2013 12:18:10 -0500 (EST) Date: Tue, 5 Feb 2013 17:18:05 +0000 From: Mel Gorman Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked Message-ID: <20130205171805.GK21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Jan 25, 2013 at 06:00:50PM -0800, Hugh Dickins wrote: > In some places where get_ksm_page() is used, we need the page to be locked. > > When KSM migration is fully enabled, we shall want that to make sure that > the page just acquired cannot be migrated beneath us (raised page count is > only effective when there is serialization to make sure migration notices). > Whereas when navigating through the stable tree, we certainly do not want > to lock each node (raised page count is enough to guarantee the memcmps, > even if page is migrated to another node). > > Since we're about to add another use case, add the locked argument to > get_ksm_page() now. > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > really got the wrong end of the stick on that! There's a configuration > in which page_cache_get_speculative() can do something cheaper than > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > disabled preemption for it. There's no need for rcu_read_lock() around > get_page_unless_zero() (and mapping checks) here. Cut out that > silliness before making this any harder to understand. 
> > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 23 +++++++++++++---------- > 1 file changed, 13 insertions(+), 10 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > * but this is different - made simpler by ksm_thread_mutex being held, but > * interesting for assuming that no other use of the struct page could ever > * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > - * to keep the page_count protocol described with page_cache_get_speculative. > + * coincides with page->mapping). > * > * Note: it is possible that get_ksm_page() will return NULL one moment, > * then page the next, if the page is in between page_freeze_refs() and > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. > */ > -static struct page *get_ksm_page(struct stable_node *stable_node) > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { The naming is unhelpful :( Because the second parameter is called "locked", it implies that the caller of this function holds the page lock (which is obviously very silly). ret_locked maybe? As the function is akin to find_lock_page I would prefer if there was a new get_lock_ksm_page() instead of locking depending on the value of a parameter. We can do this because expected_mapping is recorded by the stable_node and we only need to recalculate it if the page has been successfully pinned. We calculate the expected value twice but that's not earth shattering. 
It'd look something like:

/*
 * get_lock_ksm_page: Similar to get_ksm_page except returns with page
 * locked and pinned
 */
static struct page *get_lock_ksm_page(struct stable_node *stable_node)
{
	struct page *page = get_ksm_page(stable_node);
	void *expected_mapping;

	if (page) {
		expected_mapping = (void *)stable_node +
				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
		lock_page(page);
		if (page->mapping != expected_mapping) {
			unlock_page(page);

			/* release pin taken by get_ksm_page() */
			put_page(page);
			page = NULL;
		}
	}

	return page;
}

Up to you, I'm not going to make a big deal of it.

FWIW, I agree that removing rcu_read_lock() is fine.

-- Mel Gorman SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id 4FD4D6B0002 for ; Tue, 5 Feb 2013 12:55:57 -0500 (EST) Date: Tue, 5 Feb 2013 17:55:51 +0000 From: Mel Gorman Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly Message-ID: <20130205175551.GL21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, Jan 25, 2013 at 06:01:59PM -0800, Hugh Dickins wrote: > Switching merge_across_nodes after running KSM is liable to oops on stale > nodes still left over from the previous stable tree. It's not something > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? 
We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > When reviewing patch 1, I missed that the pages_shared check would prevent most of the problems I was envisioning with leftover entries in the stable tree. Sorry about that. > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. > > Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. 
> > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error. > > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? 
> */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /*

It will probably be very obvious to people familiar with ksm.c, but even so, maybe remind the reader that the pages must already have been unmerged:

 * This page must already have been unmerged and should be stale.
 * It might be in a pagevec waiting to be freed or it might be ......

> + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. 
> + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ > + }

If remove_stable_node() returns an error then it's quite possible that it'll go boom when that page is encountered later, but it's not guaranteed. It'd be best effort to continue removing as many of the stable nodes as possible anyway. We're in trouble either way, of course.

Otherwise I didn't spot a problem, so, weak as it is given my limited familiarity with KSM:

Acked-by: Mel Gorman

-- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id 0E5C36B0007 for ; Tue, 5 Feb 2013 14:11:07 -0500 (EST) Date: Tue, 5 Feb 2013 19:11:02 +0000 From: Mel Gorman Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible Message-ID: <20130205191102.GM21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote: > KSM page migration is already supported in the case of memory hotremove, > which takes the ksm_thread_mutex across all its migrations to keep life > simple. > > But the new KSM NUMA merge_across_nodes knob introduces a problem, when > it's set to non-default 0: if a KSM page is migrated to a different NUMA > node, how do we migrate its stable node to the right tree? And what if > that collides with an existing stable node? > > So far there's no provision for that, and this patch does not attempt > to deal with it either. But how will I test a solution, when I don't > know how to hotremove memory? Just reach in and yank it straight out with a chisel. > The best answer is to enable KSM page > migration in all cases now, and test more common cases. With THP and > compaction added since KSM came in, page migration is now mainstream, > and it's a shame that a KSM page can frustrate freeing a page block. > THP will at least check if migration within a node works. It won't necessarily check we can migrate across nodes properly but it's a lot better than nothing. 
> Without worrying about merge_across_nodes 0 for now, this patch gets > KSM page migration working reliably for default merge_across_nodes 1 > (but leave the patch enabling it until near the end of the series). > > It's much simpler than I'd originally imagined, and does not require > an additional tier of locking: page migration relies on the page lock, > KSM page reclaim relies on the page lock, the page lock is enough for > KSM page migration too. > > Almost all the care has to be in get_ksm_page(): that's the function > which worries about when a stable node is stale and should be freed, > now it also has to worry about the KSM page being migrated. > > The only new overhead is an additional put/get/lock/unlock_page when > stable_tree_search() arrives at a matching node: to make sure migration > respects the raised page count, and so does not migrate the page while > we're busy with it here. That's probably avoidable, either by changing > internal interfaces from using kpage to stable_node, or by moving the > ksm_migrate_page() callsite into a page_freeze_refs() section (even if > not swapcache); but this works well, I've no urge to pull it apart now. > > (Descents of the stable tree may pass through nodes whose KSM pages are > under migration: being unlocked, the raised page count does not prevent > that, nor need it: it's safe to memcmp against either old or new page.) > > You might worry about mremap, and whether page migration's rmap_walk > to remove migration entries will find all the KSM locations where it > inserted earlier: that should already be handled, by the satisfyingly > heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). 
> > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++----------- > mm/migrate.c | 5 ++ > 2 files changed, 77 insertions(+), 22 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree > * In which case we can trust the content of the page, and it > * returns the gotten page; but if the page has now been zapped, > * remove the stale node from the stable tree and return NULL. > + * But beware, the stable node's page might be being migrated. > * > * You would expect the stable_node to hold a reference to the ksm page. > * But if it increments the page's count, swapping out has to wait for > @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree > * pointing back to this stable node. This relies on freeing a PageAnon > * page to reset its page->mapping to NULL, and relies on no other use of > * a page to put something that might look like our key in page->mapping. > - * > - * include/linux/pagemap.h page_cache_get_speculative() is a good reference, > - * but this is different - made simpler by ksm_thread_mutex being held, but > - * interesting for assuming that no other use of the struct page could ever > - * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). > - * > - * Note: it is possible that get_ksm_page() will return NULL one moment, > - * then page the next, if the page is in between page_freeze_refs() and > - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. 
> */ > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > + unsigned long kpfn; > > - page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - if (page->mapping != expected_mapping) > - goto stale; > - if (!get_page_unless_zero(page)) > +again: > + kpfn = ACCESS_ONCE(stable_node->kpfn); > + page = pfn_to_page(kpfn); > +

Ok. There should be no concern that hot-remove made the kpfn invalid because those stable tree entries should have been discarded.

> + /* > + * page is computed from kpfn, so on most architectures reading > + * page->mapping is naturally ordered after reading node->kpfn, > + * but on Alpha we need to be more careful. > + */ > + smp_read_barrier_depends();

The value of page is data dependent on pfn_to_page(). Is it really possible for that to be re-ordered even on Alpha?

> + if (ACCESS_ONCE(page->mapping) != expected_mapping) > goto stale; > - if (page->mapping != expected_mapping) { > + > + /* > + * We cannot do anything with the page while its refcount is 0. > + * Usually 0 means free, or tail of a higher-order page: in which > + * case this node is no longer referenced, and should be freed; > + * however, it might mean that the page is under page_freeze_refs(). > + * The __remove_mapping() case is easy, again the node is now stale; > + * but if page is swapcache in migrate_page_move_mapping(), it might > + * still be our page, in which case it's essential to keep the node. > + */ > + while (!get_page_unless_zero(page)) { > + /* > + * Another check for page->mapping != expected_mapping would > + * work here too. 
We have chosen the !PageSwapCache test to > + * optimize the common case, when the page is or is about to > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > + * in the freeze_refs section of __remove_mapping(); but Anon > + * page->mapping reset to NULL later, in free_pages_prepare(). > + */ > + if (!PageSwapCache(page)) > + goto stale; > + cpu_relax(); > + }

The recheck of stable_node->kpfn after a barrier distinguishes between a free and a completed migration; that's fine.

I hesitate to ask because it must be obvious, but where is the guarantee that a KSM page is in the swap cache?

> + > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > put_page(page); > goto stale; > } > + > if (locked) { > lock_page(page); > - if (page->mapping != expected_mapping) { > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > unlock_page(page); > put_page(page); > goto stale; > } > } > return page; > + > stale: > + /* > + * We come here from above when page->mapping or !PageSwapCache > + * suggests that the node is stale; but it might be under migration. > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), > + * before checking whether node->kpfn has been changed. > + */ > + smp_rmb(); > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) > + goto again; > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s > return NULL; > > ret = memcmp_pages(page, tree_page); > + put_page(tree_page); > > - if (ret < 0) { > - put_page(tree_page); > + if (ret < 0) > node = node->rb_left; > - } else if (ret > 0) { > - put_page(tree_page); > + else if (ret > 0) > node = node->rb_right; > - } else > + else { > + /* > + * Lock and unlock the stable_node's page (which > + * might already have been migrated) so that page > + * migration is sure to notice its raised count. > + * It would be more elegant to return stable_node > + * than kpage, but that involves more changes. 
> + */ > + tree_page = get_ksm_page(stable_node, true); > + if (tree_page) > + unlock_page(tree_page); > return tree_page; > + } > } > > return NULL; > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa > if (stable_node) { > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); > stable_node->kpfn = page_to_pfn(newpage); > + /* > + * newpage->mapping was set in advance; now we need smp_wmb() > + * to make sure that the new stable_node->kpfn is visible > + * to get_ksm_page() before it can see that oldpage->mapping > + * has gone stale (or that PageSwapCache has been cleared). > + */ > + smp_wmb(); > + set_page_stable_node(oldpage, NULL); > } > } > #endif /* CONFIG_MIGRATION */ > --- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800 > +++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 > @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp > > mlock_migrate_page(newpage, page); > ksm_migrate_page(newpage, page); > - > + /* > + * Please do not reorder this without considering how mm/ksm.c's > + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). > + */ > ClearPageSwapCache(page); > ClearPagePrivate(page); > set_page_private(page, 0); > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156])
	by kanga.kvack.org (Postfix) with SMTP id ADF236B0005
	for ; Thu, 7 Feb 2013 18:57:51 -0500 (EST)
Received: by mail-da0-f49.google.com with SMTP id t11so1492225daj.36
	for ; Thu, 07 Feb 2013 15:57:50 -0800 (PST)
Date: Thu, 7 Feb 2013 15:57:50 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node
In-Reply-To: <20130205164118.GI21389@suse.de>
Message-ID:
References: <20130205164118.GI21389@suse.de>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> > From: Petr Holasek
> >
> > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which controls merging pages across different numa nodes.
> > When it is set to zero only pages from the same node are merged,
> > otherwise pages from all nodes can be merged together (default behavior).
> >
> > Typical use-case could be a lot of KVM guests on NUMA machine
> > and cpus from more distant nodes would have significant increase
> > of access latency to the merged ksm page. Sysfs knob was chosen
> > for higher variability when some users still prefer higher amount
> > of saved physical memory regardless of access latency.
> >
>
> This is understandable but it's going to be a fairly obscure option.
> I do not think it can be known in advance if the option should be set.
> The user must either run benchmarks before and after or use perf to
> record the "node-load-misses" event and see if setting the parameter
> reduces the number of remote misses.

Andrew made a similar point on the description of merge_across_nodes
in ksm.txt. Petr's quiet at the moment, so I'll add a few more lines
to that description (in an incremental patch): but be assured what I say
will remain inadequate and unspecific - I don't have much idea of how to
decide the setting, but assume that the people who are interested in
using the knob will have a firmer idea of how to test for it.

>
> I don't know the internals of ksm.c at all and this is my first time reading
> this series. Everything in this review is subject to being completely
> wrong or due to a major misunderstanding on my part. Delete all feedback
> if desired.

Thank you for spending your time on it.

[...snippings, but let's leave this paragraph in]

> > Hugh notes that this patch brings two problems, whose solution needs
> > further support in mm/ksm.c, which follows in subsequent patches:
> > 1) switching merge_across_nodes after running KSM is liable to oops
> >    on stale nodes still left over from the previous stable tree;
> > 2) memory hotremove may migrate KSM pages, but there is no provision
> >    here for !merge_across_nodes to migrate nodes to the proper tree.
...
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
...
>
> With multiple stable node trees does the comment that begins with
>
>  * A few notes about the KSM scanning process,
>  * to make it easier to understand the data structures below:
>
> need an update?

Okay: I won't go through it pluralizing everything, but a couple of lines
on the !merge_across_nodes multiplicity of trees would be helpful.

>
> It's uninitialised so kernel data size in vmlinux should be unaffected but
> it's an additional runtime cost of around 4K for a standardish enterprise
> distro kernel config. Small beans on a NUMA machine and maybe not worth
> the hassle of kmalloc for nr_online_nodes and dealing with node memory
> hotplug but it's a pity.

It's a pity, I agree; as is the addition of int nid into rmap_item
on 32-bit (on 64-bit it just occupies a hole) - there can be a lot of
those. We were kind of hoping that the #ifdef CONFIG_NUMA would cover
it, but some distros now enable NUMA by default even on 32-bit. And
it's a pity because 99% of users will leave merge_across_nodes at its
default of 1 and only ever need a single tree of each kind.

I'll look into starting off with just root_stable_tree[1] and
root_unstable_tree[1], then kmalloc'ing nr_node_ids of them when and if
merge_across_nodes is switched off. Then I don't think we need bother
about hotplug. If it ends up looking clean enough, I'll add that patch.

>
> >  #define MM_SLOTS_HASH_BITS 10
> >  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
> >  /* Milliseconds ksmd should sleep between batches */
> >  static unsigned int ksm_thread_sleep_millisecs = 20;
> >
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
> > +
>
> Nit but initialised data does increase the size of vmlinux so maybe this
> should be the "opposite". i.e. rename it to ksm_merge_within_nodes and
> default it to 0?

I don't find that particular increase in size very compelling! Though
I would have preferred the tunable to be the opposite way around: it
annoys me that the new code comes into play when !ksm_merge_across_nodes.

However, I do find "merge across nodes" (thanks to Andrew for "across")
a much more vivid description than the opposite "merge within nodes",
and can't think of a better alternative for that; and wouldn't want to
change it anyway at this late (v7) stage, not without Petr's consent.

>
> __read_mostly?

I feel the same way as I did when Andrew suggested it:

>
> I spose this should be __read_mostly. If __read_mostly is not really a
> synonym for __make_write_often_storage_slower. I continue to harbor
> fear, uncertainty and doubt about this...

Could do. No strong feeling, but I think I'd rather it share its
cacheline with other KSM-related stuff, than be off mixed up with
unrelateds. I think there's a much stronger case for __read_mostly
when it's a library thing accessed by different subsystems.

You're right that this variable is accessed significantly more often
than the other KSM tunables, so deserves a __read_mostly more than they
do. But where to stop? Similar reluctance led me to avoid using
"unlikely" throughout ksm.c, unlikely as some conditions are (I'm
aghast to see that Andrea sneaked in a "likely" :).

>
> >  #define KSM_RUN_STOP 0
> >  #define KSM_RUN_MERGE 1
> >  #define KSM_RUN_UNMERGE 2
> > @@ -441,10 +448,25 @@ out: page = NULL;
> >  	return page;
> >  }
> >
> > +/*
> > + * This helper is used for getting right index into array of tree roots.
> > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> > + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> > + * every node has its own stable and unstable tree.
> > + */
> > +static inline int get_kpfn_nid(unsigned long kpfn)
> > +{
> > +	if (ksm_merge_across_nodes)
> > +		return 0;
> > +	else
> > +		return pfn_to_nid(kpfn);
> > +}
> > +
>
> If we start with ksm_merge_across_nodes, KSM runs for a while and populates
> the stable node tree for node 0 and then ksm_merge_across_nodes gets set
> then badness happens because this can go anywhere
>
> 	nid = get_kpfn_nid(stable_node->kpfn);
> 	rb_erase(&stable_node->node, &root_stable_tree[nid]);
>
> Very late in the review I noticed that you comment on this already in the
> changelog and that it is addressed later in the series. I haven't seen

Yes. Nobody's git bisection will be thwarted by this defect, so I'm happy
for Petr's patch to go in as is first, then fix applied after. And even
in this patch, there's already a pages_shared 0 test: which is inadequate,
but covers the common case.
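
The hazard Mel describes above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: every model_ name
and the fixed-size node layout are hypothetical, invented for the
illustration; only the shape of get_kpfn_nid() is taken from the quoted
patch. It shows that a stable node filed under one knob setting may be
looked up in a different tree once the knob is flipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: each NUMA node owns a contiguous run of pfns. */
#define MODEL_PAGES_PER_NODE 1024UL

/* Models the merge_across_nodes sysfs knob (default 1 in the patch). */
static bool model_merge_across_nodes = true;

/* Hypothetical stand-in for the kernel's pfn_to_nid(). */
static int model_pfn_to_nid(unsigned long pfn)
{
	return (int)(pfn / MODEL_PAGES_PER_NODE);
}

/* Same decision as the quoted get_kpfn_nid(): tree 0 when merging across nodes. */
static int model_get_kpfn_nid(unsigned long kpfn)
{
	return model_merge_across_nodes ? 0 : model_pfn_to_nid(kpfn);
}

/*
 * Returns true when a node inserted under one knob setting would be
 * erased from a different tree index under another setting, which is
 * the "this can go anywhere" case in the review.
 */
static bool model_lookup_is_stale(unsigned long kpfn,
				  bool knob_at_insert, bool knob_at_erase)
{
	int inserted_in, erased_from;

	model_merge_across_nodes = knob_at_insert;
	inserted_in = model_get_kpfn_nid(kpfn);

	model_merge_across_nodes = knob_at_erase;
	erased_from = model_get_kpfn_nid(kpfn);

	return inserted_in != erased_from;
}
```

In this model a page on node 0 is unaffected by the switch, which is why
the defect only bites for pages that live on other nodes.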

> this patch yet so the following suggestion is very stale but might still
> be relevant.
>
> We could increase size of root_stable_node[] by 1, have
> get_kpfn_nid return MAX_NR_NODES if ksm_merge_across_nodes and
> if ksm_merge_across_nodes gets set to 0 then we walk the stable
> tree at root_stable_tree[MAX_NR_NODES] and delete the entire
> tree? It'd be disruptive as hell unfortunately and might break
> entirely if there is not enough memory to unshare the pages.
>
> Ideally we could take our time walking root_stable_tree[MAX_NR_NODES]
> without worrying about collisions and fix it up somehow. Dunno

Petr's intention was that we just be disruptive, and insist on the old
tree being torn down first: it was merely a defect that this patch does
not quite ensure that.

You're right that we could be cleverer: in the light of the changes
I ended up making for collisions in migration, maybe that approach could
be extended to switching merge_across_nodes.

But I think you'll agree that switching merge_across_nodes is a path
that needs to be handled correctly, but no way does it need optimization:
people will do it when they're trying to work out the right tuning for
their loads, and thereafter probably never again.

> > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s
> >  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> >  		BUG_ON(age > 1);
> >  		if (!age)
> > -			rb_erase(&rmap_item->node, &root_unstable_tree);
> > +#ifdef CONFIG_NUMA
> > +			rb_erase(&rmap_item->node,
> > +					&root_unstable_tree[rmap_item->nid]);
> > +#else
> > +			rb_erase(&rmap_item->node, &root_unstable_tree[0]);
> > +#endif
> >
>
> nit, does rmap_item->nid deserve a getter and setter helper instead?

I found that part ugly too: it gets macro helpers in trivial tidyups 3/11,
though not quite the getter/setter helpers you had in mind.
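
For illustration only, the getter and setter Mel asks about might look
something like the sketch below. Everything here is hypothetical:
struct model_rmap_item stands in for the kernel's struct rmap_item,
MODEL_NUMA stands in for CONFIG_NUMA, and the helper names are invented.
The helpers that 3/11 actually adds are macros and differ in form.

```c
#include <assert.h>

/* Stand-in for CONFIG_NUMA; build with -DMODEL_NUMA=0 to model !NUMA. */
#ifndef MODEL_NUMA
#define MODEL_NUMA 1
#endif

/* Hypothetical, much reduced stand-in for struct rmap_item. */
struct model_rmap_item {
	unsigned long address;
#if MODEL_NUMA
	int nid;	/* which per-node unstable tree holds this item */
#endif
};

/* Getter: callers index the tree array without any #ifdef of their own. */
static inline int model_rmap_item_nid(const struct model_rmap_item *item)
{
#if MODEL_NUMA
	return item->nid;
#else
	(void)item;
	return 0;	/* a single tree lives at index 0 */
#endif
}

/* Setter: compiles to nothing when the nid field does not exist. */
static inline void model_rmap_item_set_nid(struct model_rmap_item *item,
					   int nid)
{
#if MODEL_NUMA
	item->nid = nid;
#else
	(void)item;
	(void)nid;
#endif
}
```

With helpers of this kind the #ifdef moves out of
remove_rmap_item_from_tree() and into one place next to the structure
definition.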

> > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> >  			return NULL;
> >  		}
> >
> > +		/*
> > +		 * If tree_page has been migrated to another NUMA node, it
> > +		 * will be flushed out and put into the right unstable tree
> > +		 * next time: only merge with it if merge_across_nodes.
> > +		 * Just notice, we don't have similar problem for PageKsm
> > +		 * because their migration is disabled now. (62b61f611e)
> > +		 */
> > +		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> > +			put_page(tree_page);
> > +			return NULL;
> > +		}
> > +
>
> What about this case?
>
> 1. ksm_merge_across_nodes==0
> 2. pages gets placed on different unstable trees
> 3. ksm_merge_across_nodes==1
>
> At that point we should be removing pages from the different unstable
> tree and moving them to root_unstable_tree[0] but this put_page() doesn't
> happen. Does it matter?

It doesn't matter. The general philosophy in ksm.c is to be very lazy
about the unstable tree: all kinds of things can go "wrong" with it
temporarily, that's okay so long as we don't fall for errors that would
persist round after round. The check above is required (somewhere) to
make sure that we don't merge pages from different nodes into the same
stable tree when the switch says not to do that. But the case that
you're thinking of, it'll just sort itself out in a later round
(I think you later realized how the unstable tree is rebuilt from
scratch each round).

Or have I misunderstood: are you worrying that a put_page() is missing?
I don't see that.

But now you point me to this block, I do wonder if we could place it
better. When I came to worry about such an issue in the stable tree,
I decided that it's perfectly okay to use a page from the wrong node
for an intermediate test, and suboptimal to give up at that point,
just wrong to return it as a final match. But here we give up even
when it's an intermediate: seems inconsistent, I'll give it some more
thought later, and probably want to move it: it's not wrong as is,
but I think it could be more efficient and more consistent.

> > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r
> >  		 */
> >  		lru_add_drain_all();
> >
> > -		root_unstable_tree = RB_ROOT;
> > +		for (nid = 0; nid < nr_node_ids; nid++)
> > +			root_unstable_tree[nid] = RB_ROOT;
> >
>
> Minor but you shouldn't need to reset them all if
> ksm_merge_across_nodes==1

True; and I'll need to attend to this if we do move away from
the static allocation of root_unstable_tree[MAX_NUMNODES].

>
> Initially this triggered an alarm because it's not immediately obvious
> why you can just discard an rbtree like this. It looks like because the
> unstable tree is also part of a linked list so the rb representation can
> be reset quickly without leaking memory.

Right, it takes a while to get your head around the way we just forget
the old tree and start again each time. There's a funny place in
remove_rmap_item_from_tree() (visible in an earlier extract) where it
has to consider the "age" of the rmap_item, to decide whether it's
linked into the current tree or not.

Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196])
	by kanga.kvack.org (Postfix) with SMTP id AD1F66B0005
	for ; Thu, 7 Feb 2013 19:07:09 -0500 (EST)
Received: by mail-da0-f45.google.com with SMTP id w4so1465023dam.18
	for ; Thu, 07 Feb 2013 16:07:08 -0800 (PST)
Date: Thu, 7 Feb 2013 16:07:17 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
In-Reply-To: <20130205164823.GJ21389@suse.de>
Message-ID:
References: <20130205164823.GJ21389@suse.de>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote:
> > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
> > (restarting whenever it finds a stale node to remove), but rearrange
> > so that at least it does not needlessly restart from nid 0 each time.
> > And add a couple of comments: here is why we keep pfn instead of page.
> >
> > Signed-off-by: Hugh Dickins
> > ---
> >  mm/ksm.c | 38 ++++++++++++++++++++++----------------
> >  1 file changed, 22 insertions(+), 16 deletions(-)
> >
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:52.152205940 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa
> >  #endif /* CONFIG_MIGRATION */
> >
> >  #ifdef CONFIG_MEMORY_HOTREMOVE
> > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> > -						 unsigned long end_pfn)
> > +static void ksm_check_stable_tree(unsigned long start_pfn,
> > +				  unsigned long end_pfn)
> >  {
> > +	struct stable_node *stable_node;
> >  	struct rb_node *node;
> >  	int nid;
> >
> > -	for (nid = 0; nid < nr_node_ids; nid++)
> > -		for (node = rb_first(&root_stable_tree[nid]); node;
> > -				node = rb_next(node)) {
> > -			struct stable_node *stable_node;
> > -
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		node = rb_first(&root_stable_tree[nid]);
> > +		while (node) {
>
> This is not your fault, the old code is wrong too. It is assuming that all
> nodes are populated in numeric orders with no holes. It won't work if just
> two nodes 0 and 4 are online. It should be using for_each_online_node().

If the old code is wrong, it probably would be my fault! But I believe
this is okay: these rb_roots we're looking at, they are in memory which
is not being offlined, and the trees for offline nodes will simply be
empty, won't they? Something's badly wrong if otherwise.

I certainly prefer to avoid for_each_online_node() etc: maybe I'm
confusing with for_each_online_something_else(), but experience tells
that you can get into nasty hotplug mutex ordering issues with those
things - not worth the pain if you can easily and safely avoid them.

Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx176.postini.com [74.125.245.176])
	by kanga.kvack.org (Postfix) with SMTP id A78396B0005
	for ; Thu, 7 Feb 2013 19:33:50 -0500 (EST)
Received: by mail-da0-f47.google.com with SMTP id s35so1508159dak.34
	for ; Thu, 07 Feb 2013 16:33:49 -0800 (PST)
Date: Thu, 7 Feb 2013 16:33:58 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked
In-Reply-To: <20130205171805.GK21389@suse.de>
Message-ID:
References: <20130205171805.GK21389@suse.de>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 06:00:50PM -0800, Hugh Dickins wrote:
> > In some places where get_ksm_page() is used, we need the page to be locked.
> >
> > When KSM migration is fully enabled, we shall want that to make sure that
> > the page just acquired cannot be migrated beneath us (raised page count is
> > only effective when there is serialization to make sure migration notices).
> > Whereas when navigating through the stable tree, we certainly do not want
> > to lock each node (raised page count is enough to guarantee the memcmps,
> > even if page is migrated to another node).
> >
> > Since we're about to add another use case, add the locked argument to
> > get_ksm_page() now.
> >
> > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I
> > really got the wrong end of the stick on that! There's a configuration
> > in which page_cache_get_speculative() can do something cheaper than
> > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
> > disabled preemption for it. There's no need for rcu_read_lock() around
> > get_page_unless_zero() (and mapping checks) here. Cut out that
> > silliness before making this any harder to understand.
> >
> > Signed-off-by: Hugh Dickins
> > ---
> >  mm/ksm.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> >
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
> >   * but this is different - made simpler by ksm_thread_mutex being held, but
> >   * interesting for assuming that no other use of the struct page could ever
> >   * put our expected_mapping into page->mapping (or a field of the union which
> > - * coincides with page->mapping). The RCU calls are not for KSM at all, but
> > - * to keep the page_count protocol described with page_cache_get_speculative.
> > + * coincides with page->mapping).
> >   *
> >   * Note: it is possible that get_ksm_page() will return NULL one moment,
> >   * then page the next, if the page is in between page_freeze_refs() and
> >   * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
> >   * is on its way to being freed; but it is an anomaly to bear in mind.
> >   */
> > -static struct page *get_ksm_page(struct stable_node *stable_node)
> > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> >  {
>
> The naming is unhelpful :(
>
> Because the second parameter is called "locked", it implies that the
> caller of this function holds the page lock (which is obviously very
> silly). ret_locked maybe?

I'd prefer "lock_it": I'll make that change unless you've a better.

>
> As the function is akin to find_lock_page I would prefer if there was
> a new get_lock_ksm_page() instead of locking depending on the value of a
> parameter.

I demur. If it were a global interface rather than a function static
to ksm.c, yes, I'm sure Linus would side very strongly with you, and I'd
be providing a pair of wrappers to get_ksm_page() to hide the bool arg.

But this is a private function (you're invited :) which doesn't need
that level of hand-holding. And I'm a firm believer in having one,
difficult, function where all the heavy thought is focussed, which does
the nasty work and spares everywhere else from having to worry about the
difficulties.

You'll shiver with horror as I recite shmem_getpage(_gfp),
page_lock_anon_vma(_read), page_relock_lruvec (well, that one did not
yet get beyond its posting): get_ksm_page is one of those.

> We can do this because expected_mapping is recorded by the
> stable_node and we only need to recalculate it if the page has been
> successfully pinned. We calculate the expected value twice but that's
> not earth shattering. It'd look something like;
>
> /*
>  * get_lock_ksm_page: Similar to get_ksm_page except returns with page
>  * locked and pinned
>  */
> static struct page *get_lock_ksm_page(struct stable_node *stable_node)
> {
> 	struct page *page = get_ksm_page(stable_node);
>
> 	if (page) {
> 		expected_mapping = (void *)stable_node +
> 				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> 		lock_page(page);
> 		if (page->mapping != expected_mapping) {
> 			unlock_page(page);
>
> 			/* release pin taken by get_ksm_page() */
> 			put_page(page);
> 			page = NULL;
> 		}
> 	}
>
> 	return page;
> }

Something like; but would also need the remove_node_from_stable_tree.

>
> Up to you, I'm not going to make a big deal of it.

Phew! Probably my insistence springs from knowing what this function
develops into a few patches later, rather than the simpler version that
appears at this stage of the series.

>
> FWIW, I agree that removing rcu_read_lock() is fine.

Good, thanks, I was rather embarrassed by my misunderstanding there.

Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135])
	by kanga.kvack.org (Postfix) with SMTP id B9ABE6B0005
	for ; Fri, 8 Feb 2013 13:45:20 -0500 (EST)
Received: from /spool/local by e06smtp13.uk.ibm.com
	with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for from ; Fri, 8 Feb 2013 18:43:55 -0000
Received: from d06av04.portsmouth.uk.ibm.com (d06av04.portsmouth.uk.ibm.com [9.149.37.216])
	by b06cxnps4074.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r18Ij7xr36307080
	for ; Fri, 8 Feb 2013 18:45:07 GMT
Received: from d06av04.portsmouth.uk.ibm.com (loopback [127.0.0.1])
	by d06av04.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r18IjFd5001331
	for ; Fri, 8 Feb 2013 11:45:15 -0700
Date: Fri, 8 Feb 2013 19:45:10 +0100
From: Gerald Schaefer
Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning
Message-ID: <20130208194510.65fadd37@thinkpad.boeblingen.de.com>
In-Reply-To:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hugh Dickins
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	KOSAKI Motohiro, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 25 Jan 2013 18:10:18 -0800 (PST)
Hugh Dickins wrote:

> Complaints are rare, but lockdep still does not understand the way
> ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and
> holds it until the ksm_memory_callback(MEM_OFFLINE): that appears
> to be a problem because notifier callbacks are made under down_read
> of blocking_notifier_head->rwsem (so first the mutex is taken while
> holding the rwsem, then later the rwsem is taken while still holding
> the mutex); but is not in fact a problem because mem_hotplug_mutex
> is held throughout the dance.
>
> There was an attempt to fix this with mutex_lock_nested(); but if that
> happened to fool lockdep two years ago, apparently it does so no
> longer.
>
> I had hoped to eradicate this issue in extending KSM page migration
> not to need the ksm_thread_mutex. But then realized that although
> the page migration itself is safe, we do still need to lock out ksmd
> and other users of get_ksm_page() while offlining memory - at some
> point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages
> themselves may vanish, and get_ksm_page()'s accesses to them become a
> violation.
>
> So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE
> to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and
> wait_while_offlining() checks, to achieve the same lockout without
> being caught by lockdep. This is less elegant for KSM, but it's more
> important to keep lockdep useful to other users - and I apologize for
> how long it took to fix.

Thanks a lot for the patch! I verified that it fixes the lockdep warning
that we got on memory hotremove.
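
The lockout described in the changelog can be modelled in userspace with
a mutex and a condition variable. This is a sketch under stated
assumptions, not the kernel code: pthread_cond_wait() and
pthread_cond_broadcast() are swapped in for the kernel's wait_on_bit()
and wake_up_bit(), and every model_ name is hypothetical. What it tries
to show is that the hotplug callbacks take the mutex only briefly to
flip the OFFLINE bit, so nothing holds it from MEM_GOING_OFFLINE through
to MEM_OFFLINE.

```c
#include <assert.h>
#include <pthread.h>

#define MODEL_RUN_MERGE		1UL
#define MODEL_RUN_OFFLINE	4UL

static unsigned long model_run;		/* models ksm_run */
static pthread_mutex_t model_thread_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t model_offline_done = PTHREAD_COND_INITIALIZER;

/*
 * Caller holds model_thread_mutex. pthread_cond_wait() drops the mutex
 * while sleeping and retakes it before returning, which plays the role
 * of the unlock / wait_on_bit / lock sequence in the patch.
 */
static void model_wait_while_offlining(void)
{
	while (model_run & MODEL_RUN_OFFLINE)
		pthread_cond_wait(&model_offline_done, &model_thread_mutex);
}

/* Models MEM_GOING_OFFLINE: set the bit, then release the mutex at once. */
static void model_going_offline(void)
{
	pthread_mutex_lock(&model_thread_mutex);
	model_run |= MODEL_RUN_OFFLINE;
	pthread_mutex_unlock(&model_thread_mutex);
}

/* Models MEM_OFFLINE / MEM_CANCEL_OFFLINE: clear the bit, wake waiters. */
static void model_offline_finished(void)
{
	pthread_mutex_lock(&model_thread_mutex);
	model_run &= ~MODEL_RUN_OFFLINE;
	pthread_mutex_unlock(&model_thread_mutex);
	pthread_cond_broadcast(&model_offline_done);
}

/* Models one pass of the scan thread; returns 1 if a scan would run. */
static int model_scan_once(void)
{
	int scanned = 0;

	pthread_mutex_lock(&model_thread_mutex);
	model_wait_while_offlining();
	if (model_run & MODEL_RUN_MERGE)
		scanned = 1;
	pthread_mutex_unlock(&model_thread_mutex);
	return scanned;
}
```

A scanner calling model_scan_once() between model_going_offline() and
model_offline_finished() would sleep inside model_wait_while_offlining()
without holding the mutex, which is the property the patch needs.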

>
> Reported-by: Gerald Schaefer
> Signed-off-by: Hugh Dickins
> ---
>  mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 41 insertions(+), 14 deletions(-)
>
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:38:53.984208836 -0800
> @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod
>  #define KSM_RUN_STOP 0
>  #define KSM_RUN_MERGE 1
>  #define KSM_RUN_UNMERGE 2
> -static unsigned int ksm_run = KSM_RUN_STOP;
> +#define KSM_RUN_OFFLINE 4
> +static unsigned long ksm_run = KSM_RUN_STOP;
> +static void wait_while_offlining(void);
>
>  static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
>  static DEFINE_MUTEX(ksm_thread_mutex);
> @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing
>
>  	while (!kthread_should_stop()) {
>  		mutex_lock(&ksm_thread_mutex);
> +		wait_while_offlining();
>  		if (ksmd_should_run())
>  			ksm_do_scan(ksm_thread_pages_to_scan);
>  		mutex_unlock(&ksm_thread_mutex);
> @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa
>  #endif /* CONFIG_MIGRATION */
>
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> +static int just_wait(void *word)
> +{
> +	schedule();
> +	return 0;
> +}
> +
> +static void wait_while_offlining(void)
> +{
> +	while (ksm_run & KSM_RUN_OFFLINE) {
> +		mutex_unlock(&ksm_thread_mutex);
> +		wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE),
> +				just_wait, TASK_UNINTERRUPTIBLE);
> +		mutex_lock(&ksm_thread_mutex);
> +	}
> +}
> +
>  static void ksm_check_stable_tree(unsigned long start_pfn,
>  				  unsigned long end_pfn)
>  {
> @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no
>  	switch (action) {
>  	case MEM_GOING_OFFLINE:
>  		/*
> -		 * Keep it very simple for now: just lock out ksmd and
> -		 * MADV_UNMERGEABLE while any memory is going offline.
> -		 * mutex_lock_nested() is necessary because lockdep was alarmed
> -		 * that here we take ksm_thread_mutex inside notifier chain
> -		 * mutex, and later take notifier chain mutex inside
> -		 * ksm_thread_mutex to unlock it. But that's safe because both
> -		 * are inside mem_hotplug_mutex.
> +		 * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items()
> +		 * and remove_all_stable_nodes() while memory is going offline:
> +		 * it is unsafe for them to touch the stable tree at this time.
> +		 * But unmerge_ksm_pages(), rmap lookups and other entry points
> +		 * which do not need the ksm_thread_mutex are all safe.
>  		 */
> -		mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
> +		mutex_lock(&ksm_thread_mutex);
> +		ksm_run |= KSM_RUN_OFFLINE;
> +		mutex_unlock(&ksm_thread_mutex);
>  		break;
>
>  	case MEM_OFFLINE:
> @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no
>  		/* fallthrough */
>
>  	case MEM_CANCEL_OFFLINE:
> +		mutex_lock(&ksm_thread_mutex);
> +		ksm_run &= ~KSM_RUN_OFFLINE;
>  		mutex_unlock(&ksm_thread_mutex);
> +
> +		smp_mb();	/* wake_up_bit advises this */
> +		wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE));
>  		break;
>  	}
>  	return NOTIFY_OK;
>  }
> +#else
> +static void wait_while_offlining(void)
> +{
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>
>  #ifdef CONFIG_SYSFS
> @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan);
>  static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
>  			char *buf)
>  {
> -	return sprintf(buf, "%u\n", ksm_run);
> +	return sprintf(buf, "%lu\n", ksm_run);
>  }
>
>  static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
> @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject
>  	 */
>
>  	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
>  	if (ksm_run != flags) {
>  		ksm_run = flags;
>  		if (flags & KSM_RUN_UNMERGE) {
> @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store(
>  		return -EINVAL;
>
>  	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
>  	if (ksm_merge_across_nodes != knob) {
>  		if (ksm_pages_shared || remove_all_stable_nodes())
>  			err = -EBUSY;
> @@ -2366,10 +2396,7 @@ static int __init ksm_init(void)
>  #endif /* CONFIG_SYSFS */
>
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -	/*
> -	 * Choose a high priority since the callback takes ksm_thread_mutex:
> -	 * later callbacks could only be taking locks which nest within that.
> -	 */
> +	/* There is no significance to this priority 100 */
>  	hotplug_memory_notifier(ksm_memory_callback, 100);
>  #endif
>  	return 0;
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131])
	by kanga.kvack.org (Postfix) with SMTP id 0896E6B0007
	for ; Fri, 8 Feb 2013 14:33:36 -0500 (EST)
Received: by mail-pa0-f47.google.com with SMTP id bj3so2256647pad.34
	for ; Fri, 08 Feb 2013 11:33:36 -0800 (PST)
Date: Fri, 8 Feb 2013 11:33:40 -0800 (PST)
From: Hugh Dickins
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
In-Reply-To: <20130205175551.GL21389@suse.de>
Message-ID:
References: <20130205175551.GL21389@suse.de>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 06:01:59PM -0800, Hugh Dickins wrote:
> > Switching merge_across_nodes after running KSM is liable to oops on stale
> > nodes still left over from the previous stable tree. It's not something
> > that people will often want to do, but it would be lame to demand a reboot
> > when they're trying to determine which merge_across_nodes setting is best.
> >
> > How can this happen? We only permit switching merge_across_nodes when
> > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > ought to unmerge everything: yet oopses still occur when you then run 1.
> >
>
> When reviewing patch 1, I missed that the pages_shared check would prevent
> most of the problems I was envisioning with leftover entries in the
> stable tree. Sorry about that.

No apology necessary!

>
> > Three causes:
> >
> > 1. The old stable tree (built according to the inverse merge_across_nodes)
> > has not been fully torn down. A stable node lingers until get_ksm_page()
> > notices that the page it references no longer references it: but the page
> > is not necessarily freed as soon as expected, particularly when swapcache.
> >
> > Fix this with a pass through the old stable tree, applying get_ksm_page()
> > to each of the remaining nodes (most found stale and removed immediately),
> > with forced removal of any left over. Unless the page is still mapped:
> > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> > and EBUSY than BUG.

But once I applied the testing for this to the completed patch series,
I did start seeing that WARN_ON_ONCE: it's made safe by the EBUSY,
but not working as intended. Cause outlined below.

> >
> > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > (or be removed) before ksmd addresses it. Nice when ksmd is running,
> > but not so nice when we're trying to unmerge all mms: we were missing
> > those mms forked and inserted behind the unmerge cursor. Easily fixed
> > by inserting at the end when KSM_RUN_UNMERGE.
> >
> > 3. It is possible for a KSM page to be faulted back from swapcache into
> > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.

What I found is that a 4th cause emerges once KSM migration
is properly working: that interval during page migration
when the old page has been fully unmapped but the new not yet
mapped in its place.

The KSM COW breaking cannot see a page there then, so it ends up with
a (newly migrated) KSM page left behind. Almost certainly has to be
fixed in follow_page(), but I've not yet settled on its final form -
the fix I have works well, but a different approach might be better.

I'm also puzzled that I've never in practice been hit by a 5th cause:
swapoff's try_to_unuse() is much like faulting, and ought to have the
same ksm_might_need_to_copy() safeguards as faulting (or at least,
I cannot see why not).

> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
> >  /*
> >   * Only called through the sysfs control interface:
> >   */
> > +static int remove_stable_node(struct stable_node *stable_node)
> > +{
> > +	struct page *page;
> > +	int err;
> > +
> > +	page = get_ksm_page(stable_node, true);
> > +	if (!page) {
> > +		/*
> > +		 * get_ksm_page did remove_node_from_stable_tree itself.
> > +		 */
> > +		return 0;
> > +	}
> > +
> > +	if (WARN_ON_ONCE(page_mapped(page)))
> > +		err = -EBUSY;
> > +	else {
> > +		/*
>
> It will probably be very obvious to people familiar with ksm.c but even
> so maybe remind the reader that the pages must already have been unmerged
>
>  * This page must already have been unmerged and should be stale.
>  * It might be in a pagevec waiting to be freed or it might be

Okay, I'll add a little more comment there; but I need to think longer
for exactly how to express it.

> ......
>
> >
> > +		 * This page might be in a pagevec waiting to be freed,
> > +		 * or it might be PageSwapCache (perhaps under writeback),
> > +		 * or it might have been removed from swapcache a moment ago.
> > +		 */
> > +		set_page_stable_node(page, NULL);
> > +		remove_node_from_stable_tree(stable_node);
> > +		err = 0;
> > +	}
> > +
> > +	unlock_page(page);
> > +	put_page(page);
> > +	return err;
> > +}
> > +
> > +static int remove_all_stable_nodes(void)
> > +{
> > +	struct stable_node *stable_node;
> > +	int nid;
> > +	int err = 0;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		while (root_stable_tree[nid].rb_node) {
> > +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> > +						struct stable_node, node);
> > +			if (remove_stable_node(stable_node)) {
> > +				err = -EBUSY;
> > +				break;	/* proceed to next nid */
> > +			}
>
> If remove_stable_node() returns an error then it's quite possible that it'll
> go boom when that page is encountered later but it's not guaranteed. It'd
> be best effort to continue removing as many of the stable nodes anyway.
> We're in trouble either way of course.

If it returns an error, then indeed something we don't yet understand
has occurred, and we shall want to debug it. But unless it's due to
corruption somewhere, we shouldn't be in much trouble, shouldn't go boom:
remove_all_stable_nodes() error is ignored at the end of unmerging, it
will be tried again when changing merge_across_nodes, and an error then
will just prevent changing merge_across_nodes at that time. So the
mysteriously unremovable stable nodes remain the same kind of tree.

>
> Otherwise I didn't spot a problem so as weak as it is due my familiarity
> with KSM;
>
> Acked-by: Mel Gorman

Thanks,
Hugh
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id A517E6B0002 for ; Fri, 8 Feb 2013 15:52:03 -0500 (EST) Received: by mail-pa0-f52.google.com with SMTP id fb1so2222385pad.25 for ; Fri, 08 Feb 2013 12:52:02 -0800 (PST) Date: Fri, 8 Feb 2013 12:52:12 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: <20130205191102.GM21389@suse.de> Message-ID: References: <20130205191102.GM21389@suse.de> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: "Paul E. McKenney" , Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Paul, I've added you to the Cc in the hope that you can shed your light on an smp_read_barrier_depends() question with which Mel taxes me below. You may ask for more context: linux-next currently has an mm/ksm.c after this patch is applied, but you may have questions beyond that - thanks! On Tue, 5 Feb 2013, Mel Gorman wrote: > On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote: > > KSM page migration is already supported in the case of memory hotremove, > > which takes the ksm_thread_mutex across all its migrations to keep life > > simple. > > > > But the new KSM NUMA merge_across_nodes knob introduces a problem, when > > it's set to non-default 0: if a KSM page is migrated to a different NUMA > > node, how do we migrate its stable node to the right tree? And what if > > that collides with an existing stable node? > > > > So far there's no provision for that, and this patch does not attempt > > to deal with it either. But how will I test a solution, when I don't > > know how to hotremove memory? > > Just reach in and yank it straight out with a chisel. 
:) > > > The best answer is to enable KSM page > > migration in all cases now, and test more common cases. With THP and > > compaction added since KSM came in, page migration is now mainstream, > > and it's a shame that a KSM page can frustrate freeing a page block. > > > > THP will at least check if migration within a node works. It won't > necessarily check we can migrate across nodes properly but it's a lot > better than nothing. No, I went back and dug out a hack-patch I was using three or four years ago: occasionally on fault, just migrate every possible page in that mm for no reason other than to test page migration. > > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > > { > > struct page *page; > > void *expected_mapping; > > + unsigned long kpfn; > > > > - page = pfn_to_page(stable_node->kpfn); > > expected_mapping = (void *)stable_node + > > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > > - if (page->mapping != expected_mapping) > > - goto stale; > > - if (!get_page_unless_zero(page)) > > +again: > > + kpfn = ACCESS_ONCE(stable_node->kpfn); > > + page = pfn_to_page(kpfn); > > + > > Ok. > > There should be no concern that hot-remove made the kpfn invalid because > those stable tree entries should have been discarded. Yes. > > > + /* > > + * page is computed from kpfn, so on most architectures reading > > + * page->mapping is naturally ordered after reading node->kpfn, > > + * but on Alpha we need to be more careful. > > + */ > > + smp_read_barrier_depends(); > > The value of page is data dependant on pfn_to_page(). Is it really possible > for that to be re-ordered even on Alpha? My intuition (to say "understanding" would be an exaggeration) is that on Alpha a very old value of page->mapping (in the line below) might be lying around and read from one cache, which has not necessarily been invalidated by ksm_migrate_page() pointing stable_node->kpfn to this new page. 
And if that happens, we could easily and mistakenly conclude that this stable node is stale: although there's an smp_rmb() after goto stale, stable_node->kpfn would still match kpfn, and we wrongly remove the node. My confidence that I've expressed that clearly in words is lower than my confidence that I've coded it right; and if I'm wrong, yes, surely it's better to remove any cargo-cult smp_read_barrier_depends(). > > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) > > goto stale; > > - if (page->mapping != expected_mapping) { > > + > > + /* > > + * We cannot do anything with the page while its refcount is 0. > > + * Usually 0 means free, or tail of a higher-order page: in which > > + * case this node is no longer referenced, and should be freed; > > + * however, it might mean that the page is under page_freeze_refs(). > > + * The __remove_mapping() case is easy, again the node is now stale; > > + * but if page is swapcache in migrate_page_move_mapping(), it might > > + * still be our page, in which case it's essential to keep the node. > > + */ > > + while (!get_page_unless_zero(page)) { > > + /* > > + * Another check for page->mapping != expected_mapping would > > + * work here too. We have chosen the !PageSwapCache test to > > + * optimize the common case, when the page is or is about to > > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > > + * in the freeze_refs section of __remove_mapping(); but Anon > > + * page->mapping reset to NULL later, in free_pages_prepare(). > > + */ > > + if (!PageSwapCache(page)) > > + goto stale; > > + cpu_relax(); > > + } > > The recheck of stable_node->kpfn after a barrier distinguishes between > a free and a completed migration, that's fine. I hesitate to ask because > it must be obvious but where is the guarantee that a KSM page is in the > swap cache? Certainly none at all: it's the less common case that a KSM page is in swap cache.
But if it is not in swap cache, how could its page count be 0 (causing get_page_unless_zero to fail)? By being free, or well on its way to being freed (hence stale); or reused as part of a compound page (hence stale also); or reused for another purpose which arrives at a page_freeze_refs() (hence stale also); other cases? It's hard to see from the diff, but in the original version of get_ksm_page(), !get_page_unless_zero goes straight to stale. Don't for a moment imagine that this function sprang fully formed from my mind: it was hard to get it working right (the swap cache get_page_unless_zero failure during migration really caught me out), and then to pare it down to its fairly simple final form. Hugh > > > + > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > put_page(page); > > goto stale; > > } > > + > > if (locked) { > > lock_page(page); > > - if (page->mapping != expected_mapping) { > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > unlock_page(page); > > put_page(page); > > goto stale; > > } > > } > > return page; > > + > > stale: > > + /* > > + * We come here from above when page->mapping or !PageSwapCache > > + * suggests that the node is stale; but it might be under migration. > > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), > > + * before checking whether node->kpfn has been changed. > > + */ > > + smp_rmb(); > > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) > > + goto again; > > remove_node_from_stable_tree(stable_node); > > return NULL; > > } > > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa > > if (stable_node) { > > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); > > stable_node->kpfn = page_to_pfn(newpage); > > + /* > > + * newpage->mapping was set in advance; now we need smp_wmb() > > + * to make sure that the new stable_node->kpfn is visible > > + * to get_ksm_page() before it can see that oldpage->mapping > > + * has gone stale (or that PageSwapCache has been cleared). 
> > + */ > > + smp_wmb(); > > + set_page_stable_node(oldpage, NULL); > > } > > } > > #endif /* CONFIG_MIGRATION */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172]) by kanga.kvack.org (Postfix) with SMTP id 442706B0005 for ; Mon, 11 Feb 2013 17:13:38 -0500 (EST) Received: by mail-pa0-f51.google.com with SMTP id hz1so3237218pad.38 for ; Mon, 11 Feb 2013 14:13:37 -0800 (PST) Date: Mon, 11 Feb 2013 14:13:48 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning In-Reply-To: <20130208194510.65fadd37@thinkpad.boeblingen.de.com> Message-ID: References: <20130208194510.65fadd37@thinkpad.boeblingen.de.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Gerald Schaefer Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 8 Feb 2013, Gerald Schaefer wrote: > On Fri, 25 Jan 2013 18:10:18 -0800 (PST) > Hugh Dickins wrote: > > > Complaints are rare, but lockdep still does not understand the way > > ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and > > holds it until the ksm_memory_callback(MEM_OFFLINE): that appears > > to be a problem because notifier callbacks are made under down_read > > of blocking_notifier_head->rwsem (so first the mutex is taken while > > holding the rwsem, then later the rwsem is taken while still holding > > the mutex); but is not in fact a problem because mem_hotplug_mutex > > is held throughout the dance. > > > > There was an attempt to fix this with mutex_lock_nested(); but if that > > happened to fool lockdep two years ago, apparently it does so no > > longer. 
> > > > I had hoped to eradicate this issue in extending KSM page migration > > not to need the ksm_thread_mutex. But then realized that although > > the page migration itself is safe, we do still need to lock out ksmd > > and other users of get_ksm_page() while offlining memory - at some > > point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages > > themselves may vanish, and get_ksm_page()'s accesses to them become a > > violation. > > > > So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE > > to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and > > wait_while_offlining() checks, to achieve the same lockout without > > being caught by lockdep. This is less elegant for KSM, but it's more > > important to keep lockdep useful to other users - and I apologize for > > how long it took to fix. > > Thanks a lot for the patch! I verified that it fixes the lockdep warning > that we got on memory hotremove. > > > > > Reported-by: Gerald Schaefer > > Signed-off-by: Hugh Dickins Thank you for reporting and testing and reporting back: sorry again for taking so long to fix it. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
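The lockout described in the patch above can be illustrated with a small userspace analogue. This is only a sketch under stated assumptions: it uses a pthread mutex and condition variable in place of whatever primitive the kernel patch uses, and the toy_ names and flag values are hypothetical. What it shows is that the mutex is not held between "going offline" and "offline", while other users still wait until the offline flag clears.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_RUN_MERGE	1	/* hypothetical flag values */
#define TOY_RUN_OFFLINE	4

static pthread_mutex_t toy_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t toy_cond = PTHREAD_COND_INITIALIZER;
static unsigned int toy_run = TOY_RUN_MERGE;
static int toy_scans;

/* Call with toy_mutex held; returns with it held once offlining is over. */
static void toy_wait_while_offlining(void)
{
	while (toy_run & TOY_RUN_OFFLINE)
		pthread_cond_wait(&toy_cond, &toy_mutex);
}

/* Analogue of MEM_GOING_OFFLINE: set the flag, then drop the mutex. */
static void toy_going_offline(void)
{
	pthread_mutex_lock(&toy_mutex);
	toy_run |= TOY_RUN_OFFLINE;
	pthread_mutex_unlock(&toy_mutex);	/* not held across the window */
}

/* Analogue of MEM_OFFLINE: clear the flag and wake any waiters. */
static void toy_offline_done(void)
{
	pthread_mutex_lock(&toy_mutex);
	toy_run &= ~TOY_RUN_OFFLINE;
	pthread_cond_broadcast(&toy_cond);
	pthread_mutex_unlock(&toy_mutex);
}

/* Analogue of a scanner: it may only proceed when no offlining is pending. */
static void *toy_scanner(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&toy_mutex);
	toy_wait_while_offlining();
	assert(!(toy_run & TOY_RUN_OFFLINE));
	toy_scans++;
	pthread_mutex_unlock(&toy_mutex);
	return NULL;
}
```

Because no lock is held from one callback to the next in this arrangement, a lock-order checker has nothing spanning the two notifier calls to complain about; the cost is an extra flag check at each entry point.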
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id 7D6926B0002 for ; Thu, 14 Feb 2013 06:30:10 -0500 (EST) Date: Thu, 14 Feb 2013 11:30:05 +0000 From: Mel Gorman Subject: Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Message-ID: <20130214113005.GA7367@suse.de> References: <20130205164823.GJ21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, Feb 07, 2013 at 04:07:17PM -0800, Hugh Dickins wrote: > On Tue, 5 Feb 2013, Mel Gorman wrote: > > On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote: > > > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient > > > (restarting whenever it finds a stale node to remove), but rearrange > > > so that at least it does not needlessly restart from nid 0 each time. > > > And add a couple of comments: here is why we keep pfn instead of page. 
> > > > > > Signed-off-by: Hugh Dickins > > > --- > > > mm/ksm.c | 38 ++++++++++++++++++++++---------------- > > > 1 file changed, 22 insertions(+), 16 deletions(-) > > > > > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 > > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > > > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa > > > #endif /* CONFIG_MIGRATION */ > > > > > > #ifdef CONFIG_MEMORY_HOTREMOVE > > > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, > > > - unsigned long end_pfn) > > > +static void ksm_check_stable_tree(unsigned long start_pfn, > > > + unsigned long end_pfn) > > > { > > > + struct stable_node *stable_node; > > > struct rb_node *node; > > > int nid; > > > > > > - for (nid = 0; nid < nr_node_ids; nid++) > > > - for (node = rb_first(&root_stable_tree[nid]); node; > > > - node = rb_next(node)) { > > > - struct stable_node *stable_node; > > > - > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + node = rb_first(&root_stable_tree[nid]); > > > + while (node) { > > > > This is not your fault, the old code is wrong too. It is assuming that all > > nodes are populated in numeric orders with no holes. It won't work if just > > two nodes 0 and 4 are online. It should be using for_each_online_node(). > > If the old code is wrong, it probably would be my fault! But I believe > this is okay: these rb_roots we're looking at, they are in memory which > is not being offlined, and the trees for offline nodes will simply be > empty, won't they? Something's badly wrong if otherwise. > I would expect them to be empty but that was not the problem I had in mind. Unfortunately I mixed up nr_online_ids and nr_node_ids and read the loop incorrectly. What you have is fine. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
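A minimal userspace sketch of the loop shape being discussed, assuming simple linked lists in place of rbtrees; the toy_ names and TOY_NR_NODE_IDS are hypothetical. After removing a stale entry it rescans only the current nid rather than restarting from nid 0, and nids that never had entries are simply empty lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_NODE_IDS 4	/* hypothetical stand-in for nr_node_ids */

/* Hypothetical stand-in for a stable node: just a pfn on a per-nid list. */
struct toy_node {
	unsigned long kpfn;
	struct toy_node *next;
};

static struct toy_node *toy_tree[TOY_NR_NODE_IDS];	/* NULL means empty */

static void toy_insert(int nid, unsigned long kpfn)
{
	struct toy_node *n = malloc(sizeof(*n));

	assert(n);
	n->kpfn = kpfn;
	n->next = toy_tree[nid];
	toy_tree[nid] = n;
}

/* Unlink and free the first entry of this nid with start <= kpfn < end. */
static bool toy_remove_one(int nid, unsigned long start, unsigned long end)
{
	struct toy_node **pp;

	for (pp = &toy_tree[nid]; *pp; pp = &(*pp)->next) {
		struct toy_node *n = *pp;

		if (n->kpfn >= start && n->kpfn < end) {
			*pp = n->next;
			free(n);
			return true;
		}
	}
	return false;
}

/* Returns how many entries were removed from all per-nid lists. */
static int toy_check_stable_tree(unsigned long start, unsigned long end)
{
	int nid;
	int removed = 0;

	for (nid = 0; nid < TOY_NR_NODE_IDS; nid++)
		while (toy_remove_one(nid, start, end))
			removed++;	/* rescan this nid only */
	return removed;
}

static int toy_len(int nid)
{
	const struct toy_node *n;
	int len = 0;

	for (n = toy_tree[nid]; n; n = n->next)
		len++;
	return len;
}
```

The inner rescan is still quadratic in the worst case for a single nid, which matches the "pitifully inefficient" remark in the quoted changelog; the sketch only shows that earlier nids are not revisited.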
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id EAD026B0007 for ; Thu, 14 Feb 2013 06:34:22 -0500 (EST) Date: Thu, 14 Feb 2013 11:34:18 +0000 From: Mel Gorman Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked Message-ID: <20130214113418.GB7367@suse.de> References: <20130205171805.GK21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, Feb 07, 2013 at 04:33:58PM -0800, Hugh Dickins wrote: > > > > > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > > > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > > > * but this is different - made simpler by ksm_thread_mutex being held, but > > > * interesting for assuming that no other use of the struct page could ever > > > * put our expected_mapping into page->mapping (or a field of the union which > > > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > > > - * to keep the page_count protocol described with page_cache_get_speculative. > > > + * coincides with page->mapping). > > > * > > > * Note: it is possible that get_ksm_page() will return NULL one moment, > > > * then page the next, if the page is in between page_freeze_refs() and > > > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > > > * is on its way to being freed; but it is an anomaly to bear in mind. 
> > > */ > > > -static struct page *get_ksm_page(struct stable_node *stable_node) > > > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > > > { > > > > The naming is unhelpful :( > > > > Because the second parameter is called "locked", it implies that the > > caller of this function holds the page lock (which is obviously very > > silly). ret_locked maybe? > > I'd prefer "lock_it": I'll make that change unless you've a better. > I don't. > > > > As the function is akin to find_lock_page I would prefer if there was > > a new get_lock_ksm_page() instead of locking depending on the value of a > > parameter. > > I demur. If it were a global interface rather than a function static > to ksm.c, yes, I'm sure Linus would side very strongly with you, and I'd > be providing a pair of wrappers to get_ksm_page() to hide the bool arg. > > But this is a private function (you're invited :) which doesn't need > that level of hand-holding. > > And I'm a firm believer in having one, difficult, function where all > the heavy thought is focussed, which does the nasty work and spares > everywhere else from having to worry about the difficulties. > Ok, I'm convinced. As you say, the case for having one function is a lot stronger later in the series when this function becomes quite complex. Thanks. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id ABC596B0002 for ; Thu, 14 Feb 2013 06:58:09 -0500 (EST) Date: Thu, 14 Feb 2013 11:58:05 +0000 From: Mel Gorman Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly Message-ID: <20130214115805.GC7367@suse.de> References: <20130205175551.GL21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Feb 08, 2013 at 11:33:40AM -0800, Hugh Dickins wrote: > > > > > > > > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > > > just behind ksmd's cursor, so there's a full pass for it to stabilize > > > (or be removed) before ksmd addresses it. Nice when ksmd is running, > > > but not so nice when we're trying to unmerge all mms: we were missing > > > those mms forked and inserted behind the unmerge cursor. Easily fixed > > > by inserting at the end when KSM_RUN_UNMERGE. > > > > > > 3. It is possible for a KSM page to be faulted back from swapcache into > > > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > > > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > > > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > > > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. > > What I found is that a 4th cause emerges once KSM migration > is properly working: that interval during page migration when the old > page has been fully unmapped but the new not yet mapped in its place. > For anyone else watching -- normal page migration expects to be protected during that particular window with migration ptes. 
Any references to the PTE mapping a page being migrated faults on a swap-like PTE and waits in migration_entry_wait(). > The KSM COW breaking cannot see a page there then, so it ends up with > a (newly migrated) KSM page left behind. Almost certainly has to be > fixed in follow_page(), but I've not yet settled on its final form - > the fix I have works well, but a different approach might be better. > follow_page() is one option. My guess is that you're thinking of adding a FOLL_ flag that will cause follow_page() to check is_migration_entry() and migration_entry_wait() if the flag is present. Otherwise you would need to check for migration ptes in a number of places under page lock and then hold the lock for long periods of time to prevent migration starting. I did not check this option in depth because it quickly looked like it would be a mess, with long page lock hold times and might not even be workable. > > > +static int remove_all_stable_nodes(void) > > > +{ > > > + struct stable_node *stable_node; > > > + int nid; > > > + int err = 0; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + while (root_stable_tree[nid].rb_node) { > > > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > > > + struct stable_node, node); > > > + if (remove_stable_node(stable_node)) { > > > + err = -EBUSY; > > > + break; /* proceed to next nid */ > > > + } > > > > If remove_stable_node() returns an error then it's quite possible that it'll > > go boom when that page is encountered later but it's not guaranteed. It'd > > be best effort to continue removing as many of the stable nodes anyway. > > We're in trouble either way of course. > > If it returns an error, then indeed something we don't yet understand > has occurred, and we shall want to debug it. 
But unless it's due to > corruption somewhere, we shouldn't be in much trouble, shouldn't go boom: > remove_all_stable_nodes() error is ignored at the end of unmerging, it > will be tried again when changing merge_across_nodes, and an error > then will just prevent changing merge_across_nodes at that time. So > the mysteriously unremovable stable nodes remain the same kind of tree. > Ok. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id B8D3E6B0002 for ; Thu, 14 Feb 2013 17:19:19 -0500 (EST) Received: by mail-pa0-f48.google.com with SMTP id hz10so1448021pad.35 for ; Thu, 14 Feb 2013 14:19:19 -0800 (PST) Date: Thu, 14 Feb 2013 14:19:26 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <20130214115805.GC7367@suse.de> Message-ID: References: <20130205175551.GL21389@suse.de> <20130214115805.GC7367@suse.de> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, 14 Feb 2013, Mel Gorman wrote: > On Fri, Feb 08, 2013 at 11:33:40AM -0800, Hugh Dickins wrote: > > > > What I found is that a 4th cause emerges once KSM migration > > is properly working: that interval during page migration when the old > > page has been fully unmapped but the new not yet mapped in its place. > > > > For anyone else watching -- normal page migration expects to be protected > during that particular window with migration ptes. 
Any references to the > PTE mapping a page being migrated faults on a swap-like PTE and waits > in migration_entry_wait(). > > > The KSM COW breaking cannot see a page there then, so it ends up with > > a (newly migrated) KSM page left behind. Almost certainly has to be > > fixed in follow_page(), but I've not yet settled on its final form - > > the fix I have works well, but a different approach might be better. > > The fix I had (following migration entry to old page) was a bit too PageKsm specific, and probably wrong for when get_user_pages() needs to get a hold on the _new_ page. > > follow_page() is one option. My guess is that you're thinking of adding > a FOLL_ flag that will cause follow_page() to check is_migration_entry() > and migration_entry_wait() if the flag is present. Maybe a FOLL_flag, but I was thinking of doing it always. The usual get_user_pages() case will already wait in handle_mm_fault() and works okay, and I didn't identify a problem case for follow_page() apart from this ksm.c usage; but I did wonder if someone might have or add code which gets similarly caught out by the migration case. It's not a change I'd dare to make (without a FOLL_flag) if Andrea hadn't already added a wait_split_huge_page() into follow_page(); and I need to convince myself that adding another cause for waiting is necessarily safe (perhaps adding a might_sleep would be good). Sorry, I expected to have posted follow-up patches days and days ago, but in fact my time has vanished elsewhere and I've not even started. > > Otherwise you would need to check for migration ptes in a number of places > under page lock and then hold the lock for long periods of time to prevent > migration starting. I did not check this option in depth because it quickly > looked like it would be a mess, with long page lock hold times and might > not even be workable. Yes, I think that's more or less why I quickly decided on doing it in follow_page().
Another option would be to move the ksm_migrate_page() callsite, and allow it to reject the migration attempt when "inconvenient" (I haven't stopped to think of the definition of inconvenient). Though it wouldn't fail often enough for anyone out there to care, that option just feels like a shameful cop-out to me: I'm trying to improve migration, not add strange cases when it fails. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755366Ab3AZBxQ (ORCPT ); Fri, 25 Jan 2013 20:53:16 -0500 Received: from mail-pa0-f44.google.com ([209.85.220.44]:45503 "EHLO mail-pa0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754517Ab3AZBxO (ORCPT ); Fri, 25 Jan 2013 20:53:14 -0500 Date: Fri, 25 Jan 2013 17:53:10 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 0/11] ksm: NUMA trees and page migration Message-ID: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues we had with that, fully enabling KSM page migration on the way. (A different kind of KSM/NUMA issue which I've certainly not begun to address here: when KSM pages are unmerged, there's usually no sense in preferring to allocate the new pages local to the caller's node.) 
Petr, I have intentionally changed the titles of yours: partly because your "sysfs knob" understated it, but mainly because I think gmail is liable to assign 1/11 and 2/11 to your earlier December thread, making them vanish from this series. I hope a change of title prevents that. 1 ksm: allow trees per NUMA node 2 ksm: add sysfs ABI Documentation 3 ksm: trivial tidyups 4 ksm: reorganize ksm_check_stable_tree 5 ksm: get_ksm_page locked 6 ksm: remove old stable nodes more thoroughly 7 ksm: make KSM page migration possible 8 ksm: make !merge_across_nodes migration safe 9 mm: enable KSM page migration 10 mm: remove offlining arg to migrate_pages 11 ksm: stop hotremove lockdep warning Documentation/ABI/testing/sysfs-kernel-mm-ksm | 52 + Documentation/vm/ksm.txt | 7 include/linux/ksm.h | 18 include/linux/migrate.h | 14 mm/compaction.c | 2 mm/ksm.c | 566 +++++++++++++--- mm/memory-failure.c | 7 mm/memory.c | 19 mm/memory_hotplug.c | 3 mm/mempolicy.c | 11 mm/migrate.c | 61 - mm/page_alloc.c | 6 12 files changed, 580 insertions(+), 186 deletions(-) Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755143Ab3AZByy (ORCPT ); Fri, 25 Jan 2013 20:54:54 -0500 Received: from mail-pa0-f52.google.com ([209.85.220.52]:53504 "EHLO mail-pa0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753468Ab3AZByv (ORCPT ); Fri, 25 Jan 2013 20:54:51 -0500 Date: Fri, 25 Jan 2013 17:54:53 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org

From: Petr Holasek

Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes, which controls merging of pages across different NUMA nodes. When it is set to zero, only pages from the same node are merged; otherwise pages from all nodes can be merged together (default behavior).

A typical use case is many KVM guests on a NUMA machine, where CPUs on more distant nodes would see a significant increase in access latency to a merged KSM page. A sysfs knob was chosen for flexibility, since some users still prefer a higher amount of saved physical memory regardless of access latency.

Every NUMA node has its own stable and unstable trees, for faster searching and inserting. The merge_across_nodes value can be changed only when there are no KSM shared pages in the system.

I've tested this patch on NUMA machines with 2, 4 and 8 nodes and measured the speed of memory access inside KVM guests with memory pinned to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times, as a percentage of the average, were as follows:

merge_across_nodes=1
 2 nodes 1.4%
 4 nodes 1.6%
 8 nodes 1.7%

merge_across_nodes=0
 2 nodes 1%
 4 nodes 0.32%
 8 nodes 0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek Signed-off-by: Hugh Dickins Acked-by: Rik van Riel --- Documentation/vm/ksm.txt | 7 + mm/ksm.c | 151 ++++++++++++++++++++++++++++++++----- 2 files changed, 139 insertions(+), 19 deletions(-) --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" Default: 20 (chosen for demonstration purposes) +merge_across_nodes - specifies if pages from different numa nodes can be merged. + When set to 0, ksm merges only pages which physically + reside in the memory area of same NUMA node. It brings + lower latency to access to shared page. Value can be + changed only when there is no ksm shared pages in system. + Default: 1 + run - set 0 to stop ksmd from running but keep merged pages, set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", set 2 to stop ksmd and unmerge all pages currently merged, --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 @@ -36,6 +36,7 @@ #include #include #include +#include #include #include "internal.h" @@ -139,6 +140,9 @@ struct rmap_item { struct mm_struct *mm; unsigned long address; /* + low bits used for flags below */ unsigned int oldchecksum; /* when unstable */ +#ifdef CONFIG_NUMA + unsigned int nid; +#endif union { struct rb_node node; /* when node of unstable tree */ struct { /* when listed from stable tree */ @@ -153,8 +157,8 @@ struct rmap_item { #define STABLE_FLAG 0x200 /* is listed from the stable tree */ /* The stable and unstable tree heads */ -static struct rb_root root_stable_tree = RB_ROOT; -static struct rb_root root_unstable_tree = RB_ROOT; +static struct rb_root root_unstable_tree[MAX_NUMNODES]; +static struct rb_root root_stable_tree[MAX_NUMNODES]; #define MM_SLOTS_HASH_BITS 10 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); 
@@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ /* Milliseconds ksmd should sleep between batches */ static unsigned int ksm_thread_sleep_millisecs = 20; +/* Zeroed when merging across nodes is not allowed */ +static unsigned int ksm_merge_across_nodes = 1; + #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 #define KSM_RUN_UNMERGE 2 @@ -441,10 +448,25 @@ out: page = NULL; return page; } +/* + * This helper is used for getting right index into array of tree roots. + * When merge_across_nodes knob is set to 1, there are only two rb-trees for + * stable and unstable pages from all nodes with roots in index 0. Otherwise, + * every node has its own stable and unstable tree. + */ +static inline int get_kpfn_nid(unsigned long kpfn) +{ + if (ksm_merge_across_nodes) + return 0; + else + return pfn_to_nid(kpfn); +} + static void remove_node_from_stable_tree(struct stable_node *stable_node) { struct rmap_item *rmap_item; struct hlist_node *hlist; + int nid; hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { if (rmap_item->hlist.next) @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree cond_resched(); } - rb_erase(&stable_node->node, &root_stable_tree); + nid = get_kpfn_nid(stable_node->kpfn); + + rb_erase(&stable_node->node, &root_stable_tree[nid]); free_stable_node(stable_node); } @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); BUG_ON(age > 1); if (!age) - rb_erase(&rmap_item->node, &root_unstable_tree); +#ifdef CONFIG_NUMA + rb_erase(&rmap_item->node, + &root_unstable_tree[rmap_item->nid]); +#else + rb_erase(&rmap_item->node, &root_unstable_tree[0]); +#endif ksm_pages_unshared--; rmap_item->address &= PAGE_MASK; @@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag */ static struct page *stable_tree_search(struct page *page) { - struct rb_node *node = root_stable_tree.rb_node; + struct rb_node *node; struct stable_node *stable_node; + int nid; stable_node 
= page_stable_node(page); if (stable_node) { /* ksm page forked */ @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s return page; } + nid = get_kpfn_nid(page_to_pfn(page)); + node = root_stable_tree[nid].rb_node; + while (node) { struct page *tree_page; int ret; @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s */ static struct stable_node *stable_tree_insert(struct page *kpage) { - struct rb_node **new = &root_stable_tree.rb_node; + int nid; + unsigned long kpfn; + struct rb_node **new; struct rb_node *parent = NULL; struct stable_node *stable_node; + kpfn = page_to_pfn(kpage); + nid = get_kpfn_nid(kpfn); + new = &root_stable_tree[nid].rb_node; + while (*new) { struct page *tree_page; int ret; @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i return NULL; rb_link_node(&stable_node->node, parent, new); - rb_insert_color(&stable_node->node, &root_stable_tree); + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); INIT_HLIST_HEAD(&stable_node->hlist); - stable_node->kpfn = page_to_pfn(kpage); + stable_node->kpfn = kpfn; set_page_stable_node(kpage, stable_node); return stable_node; @@ -1098,10 +1137,15 @@ static struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, struct page *page, struct page **tree_pagep) - { - struct rb_node **new = &root_unstable_tree.rb_node; + struct rb_node **new; + struct rb_root *root; struct rb_node *parent = NULL; + int nid; + + nid = get_kpfn_nid(page_to_pfn(page)); + root = &root_unstable_tree[nid]; + new = &root->rb_node; while (*new) { struct rmap_item *tree_rmap_item; @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i return NULL; } + /* + * If tree_page has been migrated to another NUMA node, it + * will be flushed out and put into the right unstable tree + * next time: only merge with it if merge_across_nodes. + * Just notice, we don't have similar problem for PageKsm + * because their migration is disabled now. 
(62b61f611e) + */ + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { + put_page(tree_page); + return NULL; + } + ret = memcmp_pages(page, tree_page); parent = *new; @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i rmap_item->address |= UNSTABLE_FLAG; rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); +#ifdef CONFIG_NUMA + rmap_item->nid = nid; +#endif rb_link_node(&rmap_item->node, parent, new); - rb_insert_color(&rmap_item->node, &root_unstable_tree); + rb_insert_color(&rmap_item->node, root); ksm_pages_unshared++; return NULL; @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i static void stable_tree_append(struct rmap_item *rmap_item, struct stable_node *stable_node) { +#ifdef CONFIG_NUMA + /* + * Usually rmap_item->nid is already set correctly, + * but it may be wrong after switching merge_across_nodes. + */ + rmap_item->nid = get_kpfn_nid(stable_node->kpfn); +#endif rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r struct mm_slot *slot; struct vm_area_struct *vma; struct rmap_item *rmap_item; + int nid; if (list_empty(&ksm_mm_head.mm_list)) return NULL; @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r */ lru_add_drain_all(); - root_unstable_tree = RB_ROOT; + for (nid = 0; nid < nr_node_ids; nid++) + root_unstable_tree[nid] = RB_ROOT; spin_lock(&ksm_mmlist_lock); slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list); @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta unsigned long end_pfn) { struct rb_node *node; + int nid; - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) { - struct stable_node *stable_node; + for (nid = 0; nid < nr_node_ids; nid++) + for (node = rb_first(&root_stable_tree[nid]); node; + node = rb_next(node)) { + struct stable_node *stable_node; + + stable_node = rb_entry(node, struct stable_node, node); + if 
(stable_node->kpfn >= start_pfn && + stable_node->kpfn < end_pfn) + return stable_node; + } - stable_node = rb_entry(node, struct stable_node, node); - if (stable_node->kpfn >= start_pfn && - stable_node->kpfn < end_pfn) - return stable_node; - } return NULL; } @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject } KSM_ATTR(run); +#ifdef CONFIG_NUMA +static ssize_t merge_across_nodes_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", ksm_merge_across_nodes); +} + +static ssize_t merge_across_nodes_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned long knob; + + err = kstrtoul(buf, 10, &knob); + if (err) + return err; + if (knob > 1) + return -EINVAL; + + mutex_lock(&ksm_thread_mutex); + if (ksm_merge_across_nodes != knob) { + if (ksm_pages_shared) + err = -EBUSY; + else + ksm_merge_across_nodes = knob; + } + mutex_unlock(&ksm_thread_mutex); + + return err ? err : count; +} +KSM_ATTR(merge_across_nodes); +#endif + static ssize_t pages_shared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = { &pages_unshared_attr.attr, &pages_volatile_attr.attr, &full_scans_attr.attr, +#ifdef CONFIG_NUMA + &merge_across_nodes_attr.attr, +#endif NULL, }; @@ -1992,11 +2101,15 @@ static int __init ksm_init(void) { struct task_struct *ksm_thread; int err; + int nid; err = ksm_slab_init(); if (err) goto out; + for (nid = 0; nid < nr_node_ids; nid++) + root_stable_tree[nid] = RB_ROOT; + ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); if (IS_ERR(ksm_thread)) { printk(KERN_ERR "ksm: creating kthread failed\n"); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755329Ab3AZB45 (ORCPT ); Fri, 25 Jan 2013 20:56:57 -0500 Received: from mail-pa0-f51.google.com ([209.85.220.51]:50617 "EHLO 
mail-pa0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754865Ab3AZB4z (ORCPT ); Fri, 25 Jan 2013 20:56:55 -0500 Date: Fri, 25 Jan 2013 17:56:57 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Greg KH , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 2/11] ksm: add sysfs ABI Documentation In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Petr Holasek This patch adds sysfs documentation for Kernel Samepage Merging (KSM) including new merge_across_nodes knob. Signed-off-by: Petr Holasek Signed-off-by: Hugh Dickins --- Documentation/ABI/testing/sysfs-kernel-mm-ksm | 52 ++++++++++++++++ 1 file changed, 52 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-ksm --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ mmotm/Documentation/ABI/testing/sysfs-kernel-mm-ksm 2013-01-25 14:36:50.660205905 -0800 @@ -0,0 +1,52 @@ +What: /sys/kernel/mm/ksm +Date: September 2009 +KernelVersion: 2.6.32 +Contact: Linux memory management mailing list +Description: Interface for Kernel Samepage Merging (KSM) + +What: /sys/kernel/mm/ksm/full_scans +What: /sys/kernel/mm/ksm/pages_shared +What: /sys/kernel/mm/ksm/pages_sharing +What: /sys/kernel/mm/ksm/pages_to_scan +What: /sys/kernel/mm/ksm/pages_unshared +What: /sys/kernel/mm/ksm/pages_volatile +What: /sys/kernel/mm/ksm/run +What: /sys/kernel/mm/ksm/sleep_millisecs +Date: September 2009 +Contact: Linux memory management mailing list +Description: Kernel Samepage Merging daemon sysfs interface + + full_scans: how many times all mergeable areas have been + scanned. + + pages_shared: how many shared pages are being used. + + pages_sharing: how many more sites are sharing them i.e. how + much saved. 
+ + pages_to_scan: how many present pages to scan before ksmd goes + to sleep. + + pages_unshared: how many pages unique but repeatedly checked + for merging. + + pages_volatile: how many pages changing too fast to be placed + in a tree. + + run: write 0 to disable ksm, read 0 while ksm is disabled. + write 1 to run ksm, read 1 while ksm is running. + write 2 to disable ksm and unmerge all its pages. + + sleep_millisecs: how many milliseconds ksm should sleep between + scans. + + See Documentation/vm/ksm.txt for more information. + +What: /sys/kernel/mm/ksm/merge_across_nodes +Date: January 2013 +KernelVersion: 3.9 +Contact: Linux memory management mailing list +Description: Control merging pages across different NUMA nodes. + + When it is set to 0 only pages from the same node are merged, + otherwise pages from all nodes can be merged together (default). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755285Ab3AZB6N (ORCPT ); Fri, 25 Jan 2013 20:58:13 -0500 Received: from mail-da0-f51.google.com ([209.85.210.51]:37910 "EHLO mail-da0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754308Ab3AZB6K (ORCPT ); Fri, 25 Jan 2013 20:58:10 -0500 Date: Fri, 25 Jan 2013 17:58:11 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 3/11] ksm: trivial tidyups In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid when not NUMA). Add comment, remove "unsigned" from rmap_item->nid, as "int nid" elsewhere. 
Define ksm_merge_across_nodes 1U when #ifndef NUMA to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few comments noticed in ongoing work. Leave stable_tree_insert()'s rb_linkage until after the node has been set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock make either way safe, but we're going to copy and I prefer this precedent. Signed-off-by: Hugh Dickins --- mm/ksm.c | 48 ++++++++++++++++++++++-------------------------- 1 file changed, 22 insertions(+), 26 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 @@ -41,6 +41,14 @@ #include #include "internal.h" +#ifdef CONFIG_NUMA +#define NUMA(x) (x) +#define DO_NUMA(x) (x) +#else +#define NUMA(x) (0) +#define DO_NUMA(x) do { } while (0) +#endif + /* * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: @@ -130,6 +138,7 @@ struct stable_node { * @mm: the memory structure this rmap_item is pointing into * @address: the virtual address this rmap_item tracks (+ flags in low bits) * @oldchecksum: previous checksum of the page at that virtual address + * @nid: NUMA node id of unstable tree in which linked (may not match page) * @node: rb node of this rmap_item in the unstable tree * @head: pointer to stable_node heading this list in the stable tree * @hlist: link into hlist of rmap_items hanging off that stable_node @@ -141,7 +150,7 @@ struct rmap_item { unsigned long address; /* + low bits used for flags below */ unsigned int oldchecksum; /* when unstable */ #ifdef CONFIG_NUMA - unsigned int nid; + int nid; #endif union { struct rb_node node; /* when node of unstable tree */ @@ -192,8 +201,12 @@ static unsigned int ksm_thread_pages_to_ /* Milliseconds ksmd should sleep between batches */ static unsigned int ksm_thread_sleep_millisecs = 20; +#ifdef CONFIG_NUMA /* Zeroed when merging across nodes is not allowed */ static unsigned int ksm_merge_across_nodes 
= 1; +#else +#define ksm_merge_across_nodes 1U +#endif #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 @@ -456,10 +469,7 @@ out: page = NULL; */ static inline int get_kpfn_nid(unsigned long kpfn) { - if (ksm_merge_across_nodes) - return 0; - else - return pfn_to_nid(kpfn); + return ksm_merge_across_nodes ? 0 : pfn_to_nid(kpfn); } static void remove_node_from_stable_tree(struct stable_node *stable_node) @@ -479,7 +489,6 @@ static void remove_node_from_stable_tree } nid = get_kpfn_nid(stable_node->kpfn); - rb_erase(&stable_node->node, &root_stable_tree[nid]); free_stable_node(stable_node); } @@ -578,13 +587,8 @@ static void remove_rmap_item_from_tree(s age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); BUG_ON(age > 1); if (!age) -#ifdef CONFIG_NUMA rb_erase(&rmap_item->node, - &root_unstable_tree[rmap_item->nid]); -#else - rb_erase(&rmap_item->node, &root_unstable_tree[0]); -#endif - + &root_unstable_tree[NUMA(rmap_item->nid)]); ksm_pages_unshared--; rmap_item->address &= PAGE_MASK; } @@ -604,7 +608,7 @@ static void remove_trailing_rmap_items(s } /* - * Though it's very tempting to unmerge in_stable_tree(rmap_item)s rather + * Though it's very tempting to unmerge rmap_items from stable tree rather * than check every pte of a given vma, the locking doesn't quite work for * that - an rmap_item is assigned to the stable tree after inserting ksm * page and upping mmap_sem. Nor does it fit with the way we skip dup'ing @@ -1058,7 +1062,7 @@ static struct page *stable_tree_search(s } /* - * stable_tree_insert - insert rmap_item pointing to new ksm page + * stable_tree_insert - insert stable tree node pointing to new ksm page * into the stable tree. 
* * This function returns the stable tree node just allocated on success, @@ -1108,13 +1112,11 @@ static struct stable_node *stable_tree_i if (!stable_node) return NULL; - rb_link_node(&stable_node->node, parent, new); - rb_insert_color(&stable_node->node, &root_stable_tree[nid]); - INIT_HLIST_HEAD(&stable_node->hlist); - stable_node->kpfn = kpfn; set_page_stable_node(kpage, stable_node); + rb_link_node(&stable_node->node, parent, new); + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); return stable_node; } @@ -1170,8 +1172,6 @@ struct rmap_item *unstable_tree_search_i * If tree_page has been migrated to another NUMA node, it * will be flushed out and put into the right unstable tree * next time: only merge with it if merge_across_nodes. - * Just notice, we don't have similar problem for PageKsm - * because their migration is disabled now. (62b61f611e) */ if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { put_page(tree_page); @@ -1195,9 +1195,7 @@ struct rmap_item *unstable_tree_search_i rmap_item->address |= UNSTABLE_FLAG; rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); -#ifdef CONFIG_NUMA - rmap_item->nid = nid; -#endif + DO_NUMA(rmap_item->nid = nid); rb_link_node(&rmap_item->node, parent, new); rb_insert_color(&rmap_item->node, root); @@ -1213,13 +1211,11 @@ struct rmap_item *unstable_tree_search_i static void stable_tree_append(struct rmap_item *rmap_item, struct stable_node *stable_node) { -#ifdef CONFIG_NUMA /* * Usually rmap_item->nid is already set correctly, * but it may be wrong after switching merge_across_nodes. 
 	 */
-	rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
-#endif
+	DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn));
 	rmap_item->head = stable_node;
 	rmap_item->address |= STABLE_FLAG;
 	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755255Ab3AZB7g (ORCPT ); Fri, 25 Jan 2013 20:59:36 -0500
Received: from mail-pa0-f53.google.com ([209.85.220.53]:60529 "EHLO
	mail-pa0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753468Ab3AZB7e (ORCPT ); Fri, 25 Jan 2013 20:59:34 -0500
Date: Fri, 25 Jan 2013 17:59:35 -0800 (PST)
From: Hugh Dickins
X-X-Sender: hugh@eggly.anvils
To: Andrew Morton
cc: Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
In-Reply-To:
Message-ID:
References:
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange so
that at least it does not needlessly restart from nid 0 each time.  And
add a couple of comments: here is why we keep pfn instead of page.

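[Illustration only, not part of the patch.] The rearranged loop can be
sketched in userspace roughly as follows.  This is a toy model under
stated assumptions: plain linked lists stand in for the per-node rb
trees, and every toy_* name and TOY_NR_NODES is hypothetical, not a
kernel symbol.  It shows only the control flow described above: after
pruning a node, the walk restarts at the head of the same nid's tree
rather than going back to nid 0.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_NR_NODES 3	/* hypothetical stand-in for nr_node_ids */

/* Hypothetical stand-in for struct stable_node: only kpfn matters here. */
struct toy_node {
	unsigned long kpfn;
	struct toy_node *next;
};

/* One list per NUMA node, standing in for root_stable_tree[]. */
static struct toy_node *toy_tree[TOY_NR_NODES];

/* Push a node recording kpfn onto the list for nid. */
static void toy_insert(int nid, unsigned long kpfn)
{
	struct toy_node *node = malloc(sizeof(*node));

	assert(node);
	node->kpfn = kpfn;
	node->next = toy_tree[nid];
	toy_tree[nid] = node;
}

/* Unlink and free victim, which must be on the list for nid. */
static void toy_remove(int nid, struct toy_node *victim)
{
	struct toy_node **pp = &toy_tree[nid];

	while (*pp != victim) {
		assert(*pp);
		pp = &(*pp)->next;
	}
	*pp = victim->next;
	free(victim);
}

/* Count the nodes still listed for nid. */
static int toy_count(int nid)
{
	struct toy_node *node;
	int count = 0;

	for (node = toy_tree[nid]; node; node = node->next)
		count++;
	return count;
}

/*
 * Prune every node whose kpfn lies in [start_pfn, end_pfn) and
 * return how many were pruned.
 */
static int toy_check_stable_tree(unsigned long start_pfn,
				 unsigned long end_pfn)
{
	struct toy_node *node;
	int removed = 0;
	int nid;

	for (nid = 0; nid < TOY_NR_NODES; nid++) {
		node = toy_tree[nid];
		while (node) {
			if (node->kpfn >= start_pfn && node->kpfn < end_pfn) {
				toy_remove(nid, node);
				removed++;
				/* Restart within this nid only, not from nid 0. */
				node = toy_tree[nid];
			} else
				node = node->next;
		}
	}
	return removed;
}
```

Restarting from the head after each removal is still quadratic in the
worst case, which matches the "pitifully inefficient" caveat above; the
only saving modelled here is that earlier nids are not rescanned.
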
Signed-off-by: Hugh Dickins --- mm/ksm.c | 38 ++++++++++++++++++++++---------------- 1 file changed, 22 insertions(+), 16 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, - unsigned long end_pfn) +static void ksm_check_stable_tree(unsigned long start_pfn, + unsigned long end_pfn) { + struct stable_node *stable_node; struct rb_node *node; int nid; - for (nid = 0; nid < nr_node_ids; nid++) - for (node = rb_first(&root_stable_tree[nid]); node; - node = rb_next(node)) { - struct stable_node *stable_node; - + for (nid = 0; nid < nr_node_ids; nid++) { + node = rb_first(&root_stable_tree[nid]); + while (node) { stable_node = rb_entry(node, struct stable_node, node); if (stable_node->kpfn >= start_pfn && - stable_node->kpfn < end_pfn) - return stable_node; + stable_node->kpfn < end_pfn) { + /* + * Don't get_ksm_page, page has already gone: + * which is why we keep kpfn instead of page* + */ + remove_node_from_stable_tree(stable_node); + node = rb_first(&root_stable_tree[nid]); + } else + node = rb_next(node); + cond_resched(); } - - return NULL; + } } static int ksm_memory_callback(struct notifier_block *self, unsigned long action, void *arg) { struct memory_notify *mn = arg; - struct stable_node *stable_node; switch (action) { case MEM_GOING_OFFLINE: @@ -1874,11 +1879,12 @@ static int ksm_memory_callback(struct no /* * Most of the work is done by page migration; but there might * be a few stable_nodes left over, still pointing to struct - * pages which have been offlined: prune those from the tree. + * pages which have been offlined: prune those from the tree, + * otherwise get_ksm_page() might later try to access a + * non-existent struct page. 
 		 */
-		while ((stable_node = ksm_check_stable_tree(mn->start_pfn,
-					mn->start_pfn + mn->nr_pages)) != NULL)
-			remove_node_from_stable_tree(stable_node);
+		ksm_check_stable_tree(mn->start_pfn,
+				      mn->start_pfn + mn->nr_pages);
 		/* fallthrough */
 	case MEM_CANCEL_OFFLINE:
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755360Ab3AZCAu (ORCPT ); Fri, 25 Jan 2013 21:00:50 -0500
Received: from mail-da0-f48.google.com ([209.85.210.48]:53629 "EHLO
	mail-da0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753468Ab3AZCAs (ORCPT ); Fri, 25 Jan 2013 21:00:48 -0500
Date: Fri, 25 Jan 2013 18:00:50 -0800 (PST)
From: Hugh Dickins
X-X-Sender: hugh@eggly.anvils
To: Andrew Morton
cc: Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 5/11] ksm: get_ksm_page locked
In-Reply-To:
Message-ID:
References:
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

In some places where get_ksm_page() is used, we need the page to be locked.

When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration
notices).  Whereas when navigating through the stable tree, we certainly
do not want to lock each node (raised page count is enough to guarantee
the memcmps, even if page is migrated to another node).

Since we're about to add another use case, add the locked argument to
get_ksm_page() now.

Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
really got the wrong end of the stick on that!
There's a configuration in which page_cache_get_speculative() can do something cheaper than get_page_unless_zero(), relying on its caller's rcu_read_lock() to have disabled preemption for it. There's no need for rcu_read_lock() around get_page_unless_zero() (and mapping checks) here. Cut out that silliness before making this any harder to understand. Signed-off-by: Hugh Dickins --- mm/ksm.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree * but this is different - made simpler by ksm_thread_mutex being held, but * interesting for assuming that no other use of the struct page could ever * put our expected_mapping into page->mapping (or a field of the union which - * coincides with page->mapping). The RCU calls are not for KSM at all, but - * to keep the page_count protocol described with page_cache_get_speculative. + * coincides with page->mapping). * * Note: it is possible that get_ksm_page() will return NULL one moment, * then page the next, if the page is in between page_freeze_refs() and * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page * is on its way to being freed; but it is an anomaly to bear in mind. 
*/ -static struct page *get_ksm_page(struct stable_node *stable_node) +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) { struct page *page; void *expected_mapping; @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct page = pfn_to_page(stable_node->kpfn); expected_mapping = (void *)stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); - rcu_read_lock(); if (page->mapping != expected_mapping) goto stale; if (!get_page_unless_zero(page)) @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct put_page(page); goto stale; } - rcu_read_unlock(); + if (locked) { + lock_page(page); + if (page->mapping != expected_mapping) { + unlock_page(page); + put_page(page); + goto stale; + } + } return page; stale: - rcu_read_unlock(); remove_node_from_stable_tree(stable_node); return NULL; } @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s struct page *page; stable_node = rmap_item->head; - page = get_ksm_page(stable_node); + page = get_ksm_page(stable_node, true); if (!page) goto out; - lock_page(page); hlist_del(&rmap_item->hlist); unlock_page(page); put_page(page); @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s cond_resched(); stable_node = rb_entry(node, struct stable_node, node); - tree_page = get_ksm_page(stable_node); + tree_page = get_ksm_page(stable_node, false); if (!tree_page) return NULL; @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i cond_resched(); stable_node = rb_entry(*new, struct stable_node, node); - tree_page = get_ksm_page(stable_node); + tree_page = get_ksm_page(stable_node, false); if (!tree_page) return NULL; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755463Ab3AZCCA (ORCPT ); Fri, 25 Jan 2013 21:02:00 -0500 Received: from mail-pb0-f49.google.com ([209.85.160.49]:47575 "EHLO mail-pb0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753468Ab3AZCB6 (ORCPT ); Fri, 
25 Jan 2013 21:01:58 -0500
Date: Fri, 25 Jan 2013 18:01:59 -0800 (PST)
From: Hugh Dickins
X-X-Sender: hugh@eggly.anvils
To: Andrew Morton
cc: Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
In-Reply-To:
Message-ID:
References:
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree.  It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.

How can this happen?  We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.
Three causes:

1. The old stable tree (built according to the inverse merge_across_nodes)
has not been fully torn down.  A stable node lingers until get_ksm_page()
notices that the page it references no longer references it: but the page
is not necessarily freed as soon as expected, particularly when swapcache.

Fix this with a pass through the old stable tree, applying get_ksm_page()
to each of the remaining nodes (most found stale and removed immediately),
with forced removal of any left over.  Unless the page is still mapped:
I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
and EBUSY than BUG.

2. __ksm_enter() has a nice little optimization, to insert the new mm
just behind ksmd's cursor, so there's a full pass for it to stabilize
(or be removed) before ksmd addresses it.
Nice when ksmd is running, but not so nice when we're trying to unmerge
all mms: we were missing those mms forked and inserted behind the unmerge
cursor.  Easily fixed by inserting at the end when KSM_RUN_UNMERGE.

3. It is possible for a KSM page to be faulted back from swapcache into
an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.

A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
I/O error when read in from swap) to a page which it then marks Uptodate.
Fix this case by not copying, letting do_swap_page() discover the error.

Signed-off-by: Hugh Dickins
---
 include/linux/ksm.h |   18 ++-------
 mm/ksm.c            |   83 +++++++++++++++++++++++++++++++++++++++---
 mm/memory.c         |   19 ++++-----
 3 files changed, 92 insertions(+), 28 deletions(-)

--- mmotm.orig/include/linux/ksm.h	2013-01-25 14:27:58.220193250 -0800
+++ mmotm/include/linux/ksm.h	2013-01-25 14:37:00.764206145 -0800
@@ -16,9 +16,6 @@
 struct stable_node;
 struct mem_cgroup;
 
-struct page *ksm_does_need_to_copy(struct page *page,
-			struct vm_area_struct *vma, unsigned long address);
-
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
@@ -73,15 +70,8 @@ static inline void set_page_stable_node(
  * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
  * but what if the vma was unmerged while the page was swapped out?
*/ -static inline int ksm_might_need_to_copy(struct page *page, - struct vm_area_struct *vma, unsigned long address) -{ - struct anon_vma *anon_vma = page_anon_vma(page); - - return anon_vma && - (anon_vma->root != vma->anon_vma->root || - page->index != linear_page_index(vma, address)); -} +struct page *ksm_might_need_to_copy(struct page *page, + struct vm_area_struct *vma, unsigned long address); int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg, unsigned long *vm_flags); @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ return 0; } -static inline int ksm_might_need_to_copy(struct page *page, +static inline struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address) { - return 0; + return page; } static inline int page_referenced_ksm(struct page *page, --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a /* * Only called through the sysfs control interface: */ +static int remove_stable_node(struct stable_node *stable_node) +{ + struct page *page; + int err; + + page = get_ksm_page(stable_node, true); + if (!page) { + /* + * get_ksm_page did remove_node_from_stable_tree itself. + */ + return 0; + } + + if (WARN_ON_ONCE(page_mapped(page))) + err = -EBUSY; + else { + /* + * This page might be in a pagevec waiting to be freed, + * or it might be PageSwapCache (perhaps under writeback), + * or it might have been removed from swapcache a moment ago. 
+ */ + set_page_stable_node(page, NULL); + remove_node_from_stable_tree(stable_node); + err = 0; + } + + unlock_page(page); + put_page(page); + return err; +} + +static int remove_all_stable_nodes(void) +{ + struct stable_node *stable_node; + int nid; + int err = 0; + + for (nid = 0; nid < nr_node_ids; nid++) { + while (root_stable_tree[nid].rb_node) { + stable_node = rb_entry(root_stable_tree[nid].rb_node, + struct stable_node, node); + if (remove_stable_node(stable_node)) { + err = -EBUSY; + break; /* proceed to next nid */ + } + cond_resched(); + } + } + return err; +} + static int unmerge_and_remove_all_rmap_items(void) { struct mm_slot *mm_slot; @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i } } + /* Clean up stable nodes, but don't worry if some are still busy */ + remove_all_stable_nodes(); ksm_scan.seqnr = 0; return 0; @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) spin_lock(&ksm_mmlist_lock); insert_to_mm_slots_hash(mm, mm_slot); /* - * Insert just behind the scanning cursor, to let the area settle + * When KSM_RUN_MERGE (or KSM_RUN_STOP), + * insert just behind the scanning cursor, to let the area settle * down a little; when fork is followed by immediate exec, we don't * want ksmd to waste time setting up and tearing down an rmap_list. + * + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its + * scanning cursor, otherwise KSM pages in newly forked mms will be + * missed: then we might as well insert at the end of the list. 
*/ - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); + if (ksm_run & KSM_RUN_UNMERGE) + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); + else + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); spin_unlock(&ksm_mmlist_lock); set_bit(MMF_VM_MERGEABLE, &mm->flags); @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) } } -struct page *ksm_does_need_to_copy(struct page *page, +struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address) { + struct anon_vma *anon_vma = page_anon_vma(page); struct page *new_page; + if (PageKsm(page)) { + if (page_stable_node(page) && + !(ksm_run & KSM_RUN_UNMERGE)) + return page; /* no need to copy it */ + } else if (!anon_vma) { + return page; /* no need to copy it */ + } else if (anon_vma->root == vma->anon_vma->root && + page->index == linear_page_index(vma, address)) { + return page; /* still no need to copy it */ + } + if (!PageUptodate(page)) + return page; /* let do_swap_page report the error */ + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (new_page) { copy_user_highpage(new_page, page, address, vma); @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( mutex_lock(&ksm_thread_mutex); if (ksm_merge_across_nodes != knob) { - if (ksm_pages_shared) + if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; else ksm_merge_across_nodes = knob; --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) goto out_page; - if (ksm_might_need_to_copy(page, vma, address)) { - swapcache = page; - page = ksm_does_need_to_copy(page, vma, address); - - if (unlikely(!page)) { - ret = VM_FAULT_OOM; - page = swapcache; - swapcache = NULL; - goto out_page; - } + swapcache = page; + page = ksm_might_need_to_copy(page, vma, 
address); + if (unlikely(!page)) { + ret = VM_FAULT_OOM; + page = swapcache; + swapcache = NULL; + goto out_page; } + if (page == swapcache) + swapcache = NULL; if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { ret = VM_FAULT_OOM; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755311Ab3AZCDd (ORCPT ); Fri, 25 Jan 2013 21:03:33 -0500 Received: from mail-pa0-f41.google.com ([209.85.220.41]:37506 "EHLO mail-pa0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754865Ab3AZCDa (ORCPT ); Fri, 25 Jan 2013 21:03:30 -0500 Date: Fri, 25 Jan 2013 18:03:31 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org KSM page migration is already supported in the case of memory hotremove, which takes the ksm_thread_mutex across all its migrations to keep life simple. But the new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? So far there's no provision for that, and this patch does not attempt to deal with it either. But how will I test a solution, when I don't know how to hotremove memory? The best answer is to enable KSM page migration in all cases now, and test more common cases. With THP and compaction added since KSM came in, page migration is now mainstream, and it's a shame that a KSM page can frustrate freeing a page block. 
Without worrying about merge_across_nodes 0 for now, this patch gets KSM page migration working reliably for default merge_across_nodes 1 (but leaves the patch that enables it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require an additional tier of locking: page migration relies on the page lock, KSM page reclaim relies on the page lock, the page lock is enough for KSM page migration too.

Almost all the care has to be in get_ksm_page(): that's the function which worries about when a stable node is stale and should be freed; now it also has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when stable_tree_search() arrives at a matching node: to make sure migration respects the raised page count, and so does not migrate the page while we're busy with it here. That's probably avoidable, either by changing internal interfaces from using kpage to stable_node, or by moving the ksm_migrate_page() callsite into a page_freeze_refs() section (even if not swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are under migration: being unlocked, the raised page count does not prevent that, nor need it: it's safe to memcmp against either old or new page.)

You might worry about mremap, and whether page migration's rmap_walk to remove migration entries will find all the KSM locations where it inserted earlier: that should already be handled, by the satisfyingly heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
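
For readers who want the ordering between ksm_migrate_page() and get_ksm_page() spelled out, here is a minimal userspace sketch. It is not the kernel code: the model_* names, the fixed page array and the C11 fences standing in for smp_wmb()/smp_rmb() are all assumptions made for illustration, and the page refcount and PageSwapCache checks are left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_NPAGES 4

/* Hypothetical stand-in for struct page: only the back-pointer matters. */
struct model_page {
	_Atomic(void *) mapping;	/* owning node, or NULL once stale */
};

/* Hypothetical stand-in for struct stable_node. */
struct model_node {
	atomic_ulong kpfn;		/* index into model_pages[] */
};

static struct model_page model_pages[MODEL_NPAGES];

/* Writer side, loosely modelled on ksm_migrate_page(). */
static void model_migrate(struct model_node *node,
			  unsigned long old_pfn, unsigned long new_pfn)
{
	assert(atomic_load(&node->kpfn) == old_pfn);

	/* The new page's mapping is set in advance. */
	atomic_store_explicit(&model_pages[new_pfn].mapping, (void *)node,
			      memory_order_relaxed);
	atomic_store_explicit(&node->kpfn, new_pfn, memory_order_relaxed);

	/* Plays the role of smp_wmb(): kpfn first, then the old mapping. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&model_pages[old_pfn].mapping, NULL,
			      memory_order_relaxed);
}

/*
 * Reader side, loosely modelled on the "again"/"stale" logic of
 * get_ksm_page(). Returns the page, or NULL if the node is truly stale.
 */
static struct model_page *model_lookup(struct model_node *node)
{
	for (;;) {
		unsigned long kpfn = atomic_load_explicit(&node->kpfn,
							  memory_order_relaxed);
		struct model_page *page = &model_pages[kpfn];

		if (atomic_load_explicit(&page->mapping,
					 memory_order_relaxed) == (void *)node)
			return page;

		/* Plays the role of smp_rmb() before rechecking kpfn. */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&node->kpfn,
					 memory_order_relaxed) == kpfn)
			return NULL;	/* kpfn unchanged: really stale */
		/* kpfn changed under us: the page was migrated, retry. */
	}
}
```

The point of the sketch is only the shape of the protocol: a reader that finds a stale mapping must recheck kpfn before declaring the node stale, because the writer publishes the new kpfn before it invalidates the old page. Exercising it single-threaded shows the control flow, not the memory-model guarantees.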
Signed-off-by: Hugh Dickins --- mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++----------- mm/migrate.c | 5 ++ 2 files changed, 77 insertions(+), 22 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree * In which case we can trust the content of the page, and it * returns the gotten page; but if the page has now been zapped, * remove the stale node from the stable tree and return NULL. + * But beware, the stable node's page might be being migrated. * * You would expect the stable_node to hold a reference to the ksm page. * But if it increments the page's count, swapping out has to wait for @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree * pointing back to this stable node. This relies on freeing a PageAnon * page to reset its page->mapping to NULL, and relies on no other use of * a page to put something that might look like our key in page->mapping. - * - * include/linux/pagemap.h page_cache_get_speculative() is a good reference, - * but this is different - made simpler by ksm_thread_mutex being held, but - * interesting for assuming that no other use of the struct page could ever - * put our expected_mapping into page->mapping (or a field of the union which - * coincides with page->mapping). - * - * Note: it is possible that get_ksm_page() will return NULL one moment, - * then page the next, if the page is in between page_freeze_refs() and - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page * is on its way to being freed; but it is an anomaly to bear in mind. 
*/ static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) { struct page *page; void *expected_mapping; + unsigned long kpfn; - page = pfn_to_page(stable_node->kpfn); expected_mapping = (void *)stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); - if (page->mapping != expected_mapping) - goto stale; - if (!get_page_unless_zero(page)) +again: + kpfn = ACCESS_ONCE(stable_node->kpfn); + page = pfn_to_page(kpfn); + + /* + * page is computed from kpfn, so on most architectures reading + * page->mapping is naturally ordered after reading node->kpfn, + * but on Alpha we need to be more careful. + */ + smp_read_barrier_depends(); + if (ACCESS_ONCE(page->mapping) != expected_mapping) goto stale; - if (page->mapping != expected_mapping) { + + /* + * We cannot do anything with the page while its refcount is 0. + * Usually 0 means free, or tail of a higher-order page: in which + * case this node is no longer referenced, and should be freed; + * however, it might mean that the page is under page_freeze_refs(). + * The __remove_mapping() case is easy, again the node is now stale; + * but if page is swapcache in migrate_page_move_mapping(), it might + * still be our page, in which case it's essential to keep the node. + */ + while (!get_page_unless_zero(page)) { + /* + * Another check for page->mapping != expected_mapping would + * work here too. We have chosen the !PageSwapCache test to + * optimize the common case, when the page is or is about to + * be freed: PageSwapCache is cleared (under spin_lock_irq) + * in the freeze_refs section of __remove_mapping(); but Anon + * page->mapping reset to NULL later, in free_pages_prepare(). 
+ */ + if (!PageSwapCache(page)) + goto stale; + cpu_relax(); + } + + if (ACCESS_ONCE(page->mapping) != expected_mapping) { put_page(page); goto stale; } + if (locked) { lock_page(page); - if (page->mapping != expected_mapping) { + if (ACCESS_ONCE(page->mapping) != expected_mapping) { unlock_page(page); put_page(page); goto stale; } } return page; + stale: + /* + * We come here from above when page->mapping or !PageSwapCache + * suggests that the node is stale; but it might be under migration. + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), + * before checking whether node->kpfn has been changed. + */ + smp_rmb(); + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) + goto again; remove_node_from_stable_tree(stable_node); return NULL; } @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s return NULL; ret = memcmp_pages(page, tree_page); + put_page(tree_page); - if (ret < 0) { - put_page(tree_page); + if (ret < 0) node = node->rb_left; - } else if (ret > 0) { - put_page(tree_page); + else if (ret > 0) node = node->rb_right; - } else + else { + /* + * Lock and unlock the stable_node's page (which + * might already have been migrated) so that page + * migration is sure to notice its raised count. + * It would be more elegant to return stable_node + * than kpage, but that involves more changes. + */ + tree_page = get_ksm_page(stable_node, true); + if (tree_page) + unlock_page(tree_page); return tree_page; + } } return NULL; @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa if (stable_node) { VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); stable_node->kpfn = page_to_pfn(newpage); + /* + * newpage->mapping was set in advance; now we need smp_wmb() + * to make sure that the new stable_node->kpfn is visible + * to get_ksm_page() before it can see that oldpage->mapping + * has gone stale (or that PageSwapCache has been cleared). 
+ */ + smp_wmb(); + set_page_stable_node(oldpage, NULL); } } #endif /* CONFIG_MIGRATION */ --- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800 +++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp mlock_migrate_page(newpage, page); ksm_migrate_page(newpage, page); - + /* + * Please do not reorder this without considering how mm/ksm.c's + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). + */ ClearPageSwapCache(page); ClearPagePrivate(page); set_page_private(page, 0); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755544Ab3AZCFC (ORCPT ); Fri, 25 Jan 2013 21:05:02 -0500 Received: from mail-da0-f51.google.com ([209.85.210.51]:51858 "EHLO mail-da0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755015Ab3AZCFA (ORCPT ); Fri, 25 Jan 2013 21:05:00 -0500 Date: Fri, 25 Jan 2013 18:05:02 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 8/11] ksm: make !merge_across_nodes migration safe In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? ksm_migrate_page() can do no more than it's already doing, updating stable_node->kpfn: the stable tree itself cannot be manipulated without holding ksm_thread_mutex. 
So accept that a stable tree may temporarily indicate a page belonging to the wrong NUMA node, leave updating until the next pass of ksmd, just be careful not to merge other pages on to a misplaced page. Note nid of holding tree in stable_node, and recognize that it will not always match nid of kpfn.

A misplaced KSM page is discovered, either when ksm_do_scan() next comes around to one of its rmap_items (we now have to go to cmp_and_merge_page even on pages in a stable tree), or when stable_tree_search() arrives at a matching node for another page, and this node page is found misplaced.

In each case, move the misplaced stable_node to a list of migrate_nodes (and use the address of migrate_nodes as magic by which to identify them): we don't need them in a tree. If stable_tree_search() finds no match for a page, but it's currently exiled to this list, then slot its stable_node right there into the tree, bringing all of its mappings with it; otherwise they get migrated one by one to the original page of the colliding node. stable_tree_search() is now modelled more like stable_tree_insert(), in order to handle these insertions of migrated nodes.

remove_node_from_stable_tree(), remove_all_stable_nodes() and ksm_check_stable_tree() have to handle the migrate_nodes list as well as the stable tree itself. Less obviously, we do need to prune the list of stale entries from time to time (scan_get_next_rmap_item() does it once each full scan): whereas stale nodes in the stable tree get naturally pruned as searches try to brush past them, these migrate_nodes may get forgotten and accumulate.
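
The "address of migrate_nodes as magic" idea may be easier to see in a small userspace sketch. Everything here is a simplified assumption: the model_* names, the three-pointer tree link standing in for struct rb_node, and the fixed pfn-to-nid mapping are hypothetical, and the rb_erase() that precedes parking in the real code is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PFNS_PER_NODE 1024UL	/* hypothetical pfn-to-nid mapping */

struct model_list { struct model_list *next, *prev; };

/* Simplified stand-in for struct rb_node. */
struct model_tree_link { struct model_tree_link *parent, *left, *right; };

struct model_stable_node {
	union {
		struct model_tree_link node;	/* while linked in a tree */
		struct {			/* while parked for migration */
			struct model_list *head;	/* overlays node.parent */
			struct model_list list;
		};
	};
	unsigned long kpfn;
	int nid;	/* tree it was linked into; may not match kpfn */
};

/* An empty circular list, like LIST_HEAD(migrate_nodes). */
static struct model_list model_migrate_nodes = {
	&model_migrate_nodes, &model_migrate_nodes
};

static int model_pfn_to_nid(unsigned long pfn)
{
	return (int)(pfn / MODEL_PFNS_PER_NODE);
}

/* A tree parent can never be the list head, so this test is unambiguous. */
static bool model_is_parked(const struct model_stable_node *sn)
{
	return sn->head == &model_migrate_nodes;
}

/* Still in a tree, but its page now lives on another node. */
static bool model_is_misplaced(const struct model_stable_node *sn)
{
	return !model_is_parked(sn) && model_pfn_to_nid(sn->kpfn) != sn->nid;
}

/* Park the node on the list (the real code does rb_erase() first). */
static void model_park(struct model_stable_node *sn)
{
	sn->head = &model_migrate_nodes;
	sn->list.next = model_migrate_nodes.next;
	sn->list.prev = &model_migrate_nodes;
	model_migrate_nodes.next->prev = &sn->list;
	model_migrate_nodes.next = &sn->list;
}
```

The union costs no extra space: a node is either in a tree or on the list, never both, and the first word tells you which.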
Signed-off-by: Hugh Dickins --- mm/ksm.c | 164 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 134 insertions(+), 30 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 @@ -122,13 +122,25 @@ struct ksm_scan { /** * struct stable_node - node of the stable rbtree * @node: rb node of this ksm page in the stable tree + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list + * @list: linked into migrate_nodes, pending placement in the proper node tree * @hlist: hlist head of rmap_items using this ksm page - * @kpfn: page frame number of this ksm page + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) + * @nid: NUMA node id of stable tree in which linked (may not match kpfn) */ struct stable_node { - struct rb_node node; + union { + struct rb_node node; /* when node of stable tree */ + struct { /* when listed for migration */ + struct list_head *head; + struct list_head list; + }; + }; struct hlist_head hlist; unsigned long kpfn; +#ifdef CONFIG_NUMA + int nid; +#endif }; /** @@ -169,6 +181,9 @@ struct rmap_item { static struct rb_root root_unstable_tree[MAX_NUMNODES]; static struct rb_root root_stable_tree[MAX_NUMNODES]; +/* Recently migrated nodes of stable tree, pending proper placement */ +static LIST_HEAD(migrate_nodes); + #define MM_SLOTS_HASH_BITS 10 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm); } -static inline int in_stable_tree(struct rmap_item *rmap_item) -{ - return rmap_item->address & STABLE_FLAG; -} - /* * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's * page tables after it has passed through ksm_exit() - which, if necessary, @@ -476,7 +486,6 @@ static void remove_node_from_stable_tree { struct rmap_item *rmap_item; struct hlist_node *hlist; - int 
nid; hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { if (rmap_item->hlist.next) @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree cond_resched(); } - nid = get_kpfn_nid(stable_node->kpfn); - rb_erase(&stable_node->node, &root_stable_tree[nid]); + if (stable_node->head == &migrate_nodes) + list_del(&stable_node->list); + else + rb_erase(&stable_node->node, + &root_stable_tree[NUMA(stable_node->nid)]); free_stable_node(stable_node); } @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta static int remove_all_stable_nodes(void) { struct stable_node *stable_node; + struct list_head *this, *next; int nid; int err = 0; @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void) cond_resched(); } } + list_for_each_safe(this, next, &migrate_nodes) { + stable_node = list_entry(this, struct stable_node, list); + if (remove_stable_node(stable_node)) + err = -EBUSY; + cond_resched(); + } return err; } @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag */ static struct page *stable_tree_search(struct page *page) { - struct rb_node *node; - struct stable_node *stable_node; int nid; + struct rb_node **new; + struct rb_node *parent; + struct stable_node *stable_node; + struct stable_node *page_node; - stable_node = page_stable_node(page); - if (stable_node) { /* ksm page forked */ + page_node = page_stable_node(page); + if (page_node && page_node->head != &migrate_nodes) { + /* ksm page forked */ get_page(page); return page; } nid = get_kpfn_nid(page_to_pfn(page)); - node = root_stable_tree[nid].rb_node; +again: + new = &root_stable_tree[nid].rb_node; + parent = NULL; - while (node) { + while (*new) { struct page *tree_page; int ret; cond_resched(); - stable_node = rb_entry(node, struct stable_node, node); + stable_node = rb_entry(*new, struct stable_node, node); tree_page = get_ksm_page(stable_node, false); if (!tree_page) return NULL; @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s ret = memcmp_pages(page, 
tree_page); put_page(tree_page); + parent = *new; if (ret < 0) - node = node->rb_left; + new = &parent->rb_left; else if (ret > 0) - node = node->rb_right; + new = &parent->rb_right; else { /* * Lock and unlock the stable_node's page (which @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s * than kpage, but that involves more changes. */ tree_page = get_ksm_page(stable_node, true); - if (tree_page) + if (tree_page) { unlock_page(tree_page); - return tree_page; + if (get_kpfn_nid(stable_node->kpfn) != + NUMA(stable_node->nid)) { + put_page(tree_page); + goto replace; + } + return tree_page; + } + /* + * There is now a place for page_node, but the tree may + * have been rebalanced, so re-evaluate parent and new. + */ + if (page_node) + goto again; + return NULL; } } - return NULL; + if (!page_node) + return NULL; + + list_del(&page_node->list); + DO_NUMA(page_node->nid = nid); + rb_link_node(&page_node->node, parent, new); + rb_insert_color(&page_node->node, &root_stable_tree[nid]); + get_page(page); + return page; + +replace: + if (page_node) { + list_del(&page_node->list); + DO_NUMA(page_node->nid = nid); + rb_replace_node(&stable_node->node, + &page_node->node, &root_stable_tree[nid]); + get_page(page); + } else { + rb_erase(&stable_node->node, &root_stable_tree[nid]); + page = NULL; + } + stable_node->head = &migrate_nodes; + list_add(&stable_node->list, stable_node->head); + return page; } /* @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i INIT_HLIST_HEAD(&stable_node->hlist); stable_node->kpfn = kpfn; set_page_stable_node(kpage, stable_node); + DO_NUMA(stable_node->nid = nid); rb_link_node(&stable_node->node, parent, new); rb_insert_color(&stable_node->node, &root_stable_tree[nid]); @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i static void stable_tree_append(struct rmap_item *rmap_item, struct stable_node *stable_node) { - /* - * Usually rmap_item->nid is already set correctly, - * but it may be wrong after 
switching merge_across_nodes. - */ - DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn)); rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa unsigned int checksum; int err; - remove_rmap_item_from_tree(rmap_item); + stable_node = page_stable_node(page); + if (stable_node) { + if (stable_node->head != &migrate_nodes && + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { + rb_erase(&stable_node->node, + &root_stable_tree[NUMA(stable_node->nid)]); + stable_node->head = &migrate_nodes; + list_add(&stable_node->list, stable_node->head); + } + if (stable_node->head != &migrate_nodes && + rmap_item->head == stable_node) + return; + } /* We first start with searching the page inside the stable tree */ kpage = stable_tree_search(page); + if (kpage == page && rmap_item->head == stable_node) { + put_page(kpage); + return; + } + + remove_rmap_item_from_tree(rmap_item); + if (kpage) { err = try_to_merge_with_ksm_page(rmap_item, page, kpage); if (!err) { @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r */ lru_add_drain_all(); + /* + * Whereas stale stable_nodes on the stable_tree itself + * get pruned in the regular course of stable_tree_search(), + * those moved out to the migrate_nodes list can accumulate: + * so prune them once before each full scan. 
+ */ + if (!ksm_merge_across_nodes) { + struct stable_node *stable_node; + struct list_head *this, *next; + struct page *page; + + list_for_each_safe(this, next, &migrate_nodes) { + stable_node = list_entry(this, + struct stable_node, list); + page = get_ksm_page(stable_node, false); + if (page) + put_page(page); + cond_resched(); + } + } + for (nid = 0; nid < nr_node_ids; nid++) root_unstable_tree[nid] = RB_ROOT; @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca rmap_item = scan_get_next_rmap_item(&page); if (!rmap_item) return; - if (!PageKsm(page) || !in_stable_tree(rmap_item)) - cmp_and_merge_page(page, rmap_item); + cmp_and_merge_page(page, rmap_item); put_page(page); } } @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign unsigned long end_pfn) { struct stable_node *stable_node; + struct list_head *this, *next; struct rb_node *node; int nid; @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign cond_resched(); } } + list_for_each_safe(this, next, &migrate_nodes) { + stable_node = list_entry(this, struct stable_node, list); + if (stable_node->kpfn >= start_pfn && + stable_node->kpfn < end_pfn) + remove_node_from_stable_tree(stable_node); + cond_resched(); + } } static int ksm_memory_callback(struct notifier_block *self, From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755520Ab3AZCG0 (ORCPT ); Fri, 25 Jan 2013 21:06:26 -0500 Received: from mail-da0-f41.google.com ([209.85.210.41]:64219 "EHLO mail-da0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754963Ab3AZCGX (ORCPT ); Fri, 25 Jan 2013 21:06:23 -0500 Date: Fri, 25 Jan 2013 18:06:24 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 9/11] ksm: enable KSM page migration In-Reply-To: Message-ID: References: User-Agent: 
Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Migration of KSM pages is now safe: remove the PageKsm restrictions from mempolicy.c and migrate.c. But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which are irrelevant to KSM: it looks as if that code was preventing hotremove migration of KSM pages, unless they happened to be in swapcache. There is some question as to whether enforcing a NUMA mempolicy migration ought to migrate KSM pages, mapped into entirely unrelated processes; but moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway, and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on any area where this is a worry. Signed-off-by: Hugh Dickins --- mm/mempolicy.c | 3 +-- mm/migrate.c | 21 +++------------------ 2 files changed, 4 insertions(+), 20 deletions(-) --- mmotm.orig/mm/mempolicy.c 2013-01-24 12:28:38.848127553 -0800 +++ mmotm/mm/mempolicy.c 2013-01-25 14:38:49.596208731 -0800 @@ -496,9 +496,8 @@ static int check_pte_range(struct vm_are /* * vm_normal_page() filters out zero pages, but there might * still be PageReserved pages to skip, perhaps in a VDSO. - * And we cannot move PageKsm pages sensibly or safely yet. */ - if (PageReserved(page) || PageKsm(page)) + if (PageReserved(page)) continue; nid = page_to_nid(page); if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) --- mmotm.orig/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 +++ mmotm/mm/migrate.c 2013-01-25 14:38:49.596208731 -0800 @@ -731,20 +731,6 @@ static int __unmap_and_move(struct page lock_page(page); } - /* - * Only memory hotplug's offline_pages() caller has locked out KSM, - * and can safely migrate a KSM page. 
The other cases have skipped - * PageKsm along with PageReserved - but it is only now when we have - * the page lock that we can be certain it will not go KSM beneath us - * (KSM will not upgrade a page from PageAnon to PageKsm when it sees - * its pagecount raised, but only here do we take the page lock which - * serializes that). - */ - if (PageKsm(page) && !offlining) { - rc = -EBUSY; - goto unlock; - } - /* charge against new page */ mem_cgroup_prepare_migration(page, newpage, &mem); @@ -771,7 +757,7 @@ static int __unmap_and_move(struct page * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { + if (PageAnon(page) && !PageKsm(page)) { /* * Only page_lock_anon_vma_read() understands the subtleties of * getting a hold on an anon_vma from outside one of its mms. @@ -851,7 +837,6 @@ uncharge: mem_cgroup_end_migration(mem, page, newpage, (rc == MIGRATEPAGE_SUCCESS || rc == MIGRATEPAGE_BALLOON_SUCCESS)); -unlock: unlock_page(page); out: return rc; @@ -1156,7 +1141,7 @@ static int do_move_page_to_node_array(st goto set_status; /* Use PageReserved to check for zero page */ - if (PageReserved(page) || PageKsm(page)) + if (PageReserved(page)) goto put_and_set; pp->page = page; @@ -1318,7 +1303,7 @@ static void do_pages_stat_array(struct m err = -ENOENT; /* Use PageReserved to check for zero page */ - if (!page || PageReserved(page) || PageKsm(page)) + if (!page || PageReserved(page)) goto set_status; err = page_to_nid(page); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755549Ab3AZCHv (ORCPT ); Fri, 25 Jan 2013 21:07:51 -0500 Received: from mail-da0-f50.google.com ([209.85.210.50]:63813 "EHLO mail-da0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754963Ab3AZCHt (ORCPT ); Fri, 25 Jan 2013 21:07:49 -0500 Date: Fri, 25 Jan 2013 18:07:51 -0800 (PST) From: Hugh Dickins X-X-Sender: 
hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 10/11] mm: remove offlining arg to migrate_pages In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org No functional change, but the only purpose of the offlining argument to migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a KSM page for memory hotremove (which took ksm_thread_mutex) but not for other callers. Now all cases are safe, remove the arg. Signed-off-by: Hugh Dickins --- include/linux/migrate.h | 14 ++++++-------- mm/compaction.c | 2 +- mm/memory-failure.c | 7 +++---- mm/memory_hotplug.c | 3 +-- mm/mempolicy.c | 8 +++----- mm/migrate.c | 35 +++++++++++++---------------------- mm/page_alloc.c | 6 ++---- 7 files changed, 29 insertions(+), 46 deletions(-) --- mmotm.orig/include/linux/migrate.h 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/include/linux/migrate.h 2013-01-25 14:38:51.468208776 -0800 @@ -40,11 +40,9 @@ extern void putback_movable_pages(struct extern int migrate_page(struct address_space *, struct page *, struct page *, enum migrate_mode); extern int migrate_pages(struct list_head *l, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode, int reason); + unsigned long private, enum migrate_mode mode, int reason); extern int migrate_huge_page(struct page *, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode); + unsigned long private, enum migrate_mode mode); extern int fail_migrate_page(struct address_space *, struct page *, struct page *); @@ -62,11 +60,11 @@ extern int migrate_huge_page_move_mappin static inline void putback_lru_pages(struct list_head *l) {} static inline void putback_movable_pages(struct list_head *l) {} 
static inline int migrate_pages(struct list_head *l, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode, int reason) { return -ENOSYS; } + unsigned long private, enum migrate_mode mode, int reason) + { return -ENOSYS; } static inline int migrate_huge_page(struct page *page, new_page_t x, - unsigned long private, bool offlining, - enum migrate_mode mode) { return -ENOSYS; } + unsigned long private, enum migrate_mode mode) + { return -ENOSYS; } static inline int migrate_prep(void) { return -ENOSYS; } static inline int migrate_prep_local(void) { return -ENOSYS; } --- mmotm.orig/mm/compaction.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/compaction.c 2013-01-25 14:38:51.472208776 -0800 @@ -980,7 +980,7 @@ static int compact_zone(struct zone *zon nr_migrate = cc->nr_migratepages; err = migrate_pages(&cc->migratepages, compaction_alloc, - (unsigned long)cc, false, + (unsigned long)cc, cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC, MR_COMPACTION); update_nr_listpages(cc); --- mmotm.orig/mm/memory-failure.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/memory-failure.c 2013-01-25 14:38:51.472208776 -0800 @@ -1432,7 +1432,7 @@ static int soft_offline_huge_page(struct goto done; /* Keep page count to indicate a given hugepage is isolated. 
*/ - ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false, + ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, MIGRATE_SYNC); put_page(hpage); if (ret) { @@ -1564,11 +1564,10 @@ int soft_offline_page(struct page *page, if (!ret) { LIST_HEAD(pagelist); inc_zone_page_state(page, NR_ISOLATED_ANON + - page_is_file_cache(page)); + page_is_file_cache(page)); list_add(&page->lru, &pagelist); ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, - false, MIGRATE_SYNC, - MR_MEMORY_FAILURE); + MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { putback_lru_pages(&pagelist); pr_info("soft offline: %#lx: migration failed %d, type %lx\n", --- mmotm.orig/mm/memory_hotplug.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/memory_hotplug.c 2013-01-25 14:38:51.472208776 -0800 @@ -1283,8 +1283,7 @@ do_migrate_range(unsigned long start_pfn * migrate_pages returns # of failed pages. */ ret = migrate_pages(&source, alloc_migrate_target, 0, - true, MIGRATE_SYNC, - MR_MEMORY_HOTPLUG); + MIGRATE_SYNC, MR_MEMORY_HOTPLUG); if (ret) putback_lru_pages(&source); } --- mmotm.orig/mm/mempolicy.c 2013-01-25 14:38:49.596208731 -0800 +++ mmotm/mm/mempolicy.c 2013-01-25 14:38:51.472208776 -0800 @@ -1014,8 +1014,7 @@ static int migrate_to_node(struct mm_str if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, new_node_page, dest, - false, MIGRATE_SYNC, - MR_SYSCALL); + MIGRATE_SYNC, MR_SYSCALL); if (err) putback_lru_pages(&pagelist); } @@ -1259,9 +1258,8 @@ static long do_mbind(unsigned long start if (!list_empty(&pagelist)) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(&pagelist, new_vma_page, - (unsigned long)vma, - false, MIGRATE_SYNC, - MR_MEMPOLICY_MBIND); + (unsigned long)vma, + MIGRATE_SYNC, MR_MEMPOLICY_MBIND); if (nr_failed) putback_lru_pages(&pagelist); } --- mmotm.orig/mm/migrate.c 2013-01-25 14:38:49.596208731 -0800 +++ mmotm/mm/migrate.c 2013-01-25 14:38:51.476208776 -0800 @@ -701,7 +701,7 @@ static int move_to_new_page(struct page } static 
int __unmap_and_move(struct page *page, struct page *newpage, - int force, bool offlining, enum migrate_mode mode) + int force, enum migrate_mode mode) { int rc = -EAGAIN; int remap_swapcache = 1; @@ -847,8 +847,7 @@ out: * to the newly allocated page in newpage. */ static int unmap_and_move(new_page_t get_new_page, unsigned long private, - struct page *page, int force, bool offlining, - enum migrate_mode mode) + struct page *page, int force, enum migrate_mode mode) { int rc = 0; int *result = NULL; @@ -866,7 +865,7 @@ static int unmap_and_move(new_page_t get if (unlikely(split_huge_page(page))) goto out; - rc = __unmap_and_move(page, newpage, force, offlining, mode); + rc = __unmap_and_move(page, newpage, force, mode); if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) { /* @@ -927,8 +926,7 @@ out: */ static int unmap_and_move_huge_page(new_page_t get_new_page, unsigned long private, struct page *hpage, - int force, bool offlining, - enum migrate_mode mode) + int force, enum migrate_mode mode) { int rc = 0; int *result = NULL; @@ -990,9 +988,8 @@ out: * * Return: Number of pages not migrated or error code. 
*/ -int migrate_pages(struct list_head *from, - new_page_t get_new_page, unsigned long private, bool offlining, - enum migrate_mode mode, int reason) +int migrate_pages(struct list_head *from, new_page_t get_new_page, + unsigned long private, enum migrate_mode mode, int reason) { int retry = 1; int nr_failed = 0; @@ -1013,8 +1010,7 @@ int migrate_pages(struct list_head *from cond_resched(); rc = unmap_and_move(get_new_page, private, - page, pass > 2, offlining, - mode); + page, pass > 2, mode); switch(rc) { case -ENOMEM: @@ -1047,15 +1043,13 @@ out: } int migrate_huge_page(struct page *hpage, new_page_t get_new_page, - unsigned long private, bool offlining, - enum migrate_mode mode) + unsigned long private, enum migrate_mode mode) { int pass, rc; for (pass = 0; pass < 10; pass++) { - rc = unmap_and_move_huge_page(get_new_page, - private, hpage, pass > 2, offlining, - mode); + rc = unmap_and_move_huge_page(get_new_page, private, + hpage, pass > 2, mode); switch (rc) { case -ENOMEM: goto out; @@ -1178,8 +1172,7 @@ set_status: err = 0; if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, new_page_node, - (unsigned long)pm, 0, MIGRATE_SYNC, - MR_SYSCALL); + (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL); if (err) putback_lru_pages(&pagelist); } @@ -1614,10 +1607,8 @@ int migrate_misplaced_page(struct page * goto out; list_add(&page->lru, &migratepages); - nr_remaining = migrate_pages(&migratepages, - alloc_misplaced_dst_page, - node, false, MIGRATE_ASYNC, - MR_NUMA_MISPLACED); + nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page, + node, MIGRATE_ASYNC, MR_NUMA_MISPLACED); if (nr_remaining) { putback_lru_pages(&migratepages); isolated = 0; --- mmotm.orig/mm/page_alloc.c 2013-01-24 12:28:38.740127550 -0800 +++ mmotm/mm/page_alloc.c 2013-01-25 14:38:51.476208776 -0800 @@ -6064,10 +6064,8 @@ static int __alloc_contig_migrate_range( &cc->migratepages); cc->nr_migratepages -= nr_reclaimed; - ret = migrate_pages(&cc->migratepages, - 
alloc_migrate_target, - 0, false, MIGRATE_SYNC, - MR_CMA); + ret = migrate_pages(&cc->migratepages, alloc_migrate_target, + 0, MIGRATE_SYNC, MR_CMA); } if (ret < 0) { putback_movable_pages(&cc->migratepages); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755518Ab3AZCKV (ORCPT ); Fri, 25 Jan 2013 21:10:21 -0500 Received: from mail-pa0-f52.google.com ([209.85.220.52]:55474 "EHLO mail-pa0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754963Ab3AZCKR (ORCPT ); Fri, 25 Jan 2013 21:10:17 -0500 Date: Fri, 25 Jan 2013 18:10:18 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Gerald Schaefer , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 11/11] ksm: stop hotremove lockdep warning In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Complaints are rare, but lockdep still does not understand the way ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a problem because notifier callbacks are made under down_read of blocking_notifier_head->rwsem (so first the mutex is taken while holding the rwsem, then later the rwsem is taken while still holding the mutex); but is not in fact a problem because mem_hotplug_mutex is held throughout the dance. There was an attempt to fix this with mutex_lock_nested(); but if that happened to fool lockdep two years ago, apparently it does so no longer. I had hoped to eradicate this issue in extending KSM page migration not to need the ksm_thread_mutex. 
But then realized that although the page migration itself is safe, we do still need to lock out ksmd and other users of get_ksm_page() while offlining memory - at some point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish, and get_ksm_page()'s accesses to them become a violation. So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() checks, to achieve the same lockout without being caught by lockdep. This is less elegant for KSM, but it's more important to keep lockdep useful to other users - and I apologize for how long it took to fix. Reported-by: Gerald Schaefer Signed-off-by: Hugh Dickins --- mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 41 insertions(+), 14 deletions(-) --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 +++ mmotm/mm/ksm.c 2013-01-25 14:38:53.984208836 -0800 @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 #define KSM_RUN_UNMERGE 2 -static unsigned int ksm_run = KSM_RUN_STOP; +#define KSM_RUN_OFFLINE 4 +static unsigned long ksm_run = KSM_RUN_STOP; +static void wait_while_offlining(void); static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); static DEFINE_MUTEX(ksm_thread_mutex); @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing while (!kthread_should_stop()) { mutex_lock(&ksm_thread_mutex); + wait_while_offlining(); if (ksmd_should_run()) ksm_do_scan(ksm_thread_pages_to_scan); mutex_unlock(&ksm_thread_mutex); @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE +static int just_wait(void *word) +{ + schedule(); + return 0; +} + +static void wait_while_offlining(void) +{ + while (ksm_run & KSM_RUN_OFFLINE) { + mutex_unlock(&ksm_thread_mutex); + wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), + just_wait, TASK_UNINTERRUPTIBLE); + 
mutex_lock(&ksm_thread_mutex); + } +} + static void ksm_check_stable_tree(unsigned long start_pfn, unsigned long end_pfn) { @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no switch (action) { case MEM_GOING_OFFLINE: /* - * Keep it very simple for now: just lock out ksmd and - * MADV_UNMERGEABLE while any memory is going offline. - * mutex_lock_nested() is necessary because lockdep was alarmed - * that here we take ksm_thread_mutex inside notifier chain - * mutex, and later take notifier chain mutex inside - * ksm_thread_mutex to unlock it. But that's safe because both - * are inside mem_hotplug_mutex. + * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() + * and remove_all_stable_nodes() while memory is going offline: + * it is unsafe for them to touch the stable tree at this time. + * But unmerge_ksm_pages(), rmap lookups and other entry points + * which do not need the ksm_thread_mutex are all safe. */ - mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING); + mutex_lock(&ksm_thread_mutex); + ksm_run |= KSM_RUN_OFFLINE; + mutex_unlock(&ksm_thread_mutex); break; case MEM_OFFLINE: @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no /* fallthrough */ case MEM_CANCEL_OFFLINE: + mutex_lock(&ksm_thread_mutex); + ksm_run &= ~KSM_RUN_OFFLINE; mutex_unlock(&ksm_thread_mutex); + + smp_mb(); /* wake_up_bit advises this */ + wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); break; } return NOTIFY_OK; } +#else +static void wait_while_offlining(void) +{ +} #endif /* CONFIG_MEMORY_HOTREMOVE */ #ifdef CONFIG_SYSFS @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan); static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_run); + return sprintf(buf, "%lu\n", ksm_run); } static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject */ mutex_lock(&ksm_thread_mutex); + wait_while_offlining(); if (ksm_run != 
flags) { ksm_run = flags; if (flags & KSM_RUN_UNMERGE) { @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store( return -EINVAL; mutex_lock(&ksm_thread_mutex); + wait_while_offlining(); if (ksm_merge_across_nodes != knob) { if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; @@ -2366,10 +2396,7 @@ static int __init ksm_init(void) #endif /* CONFIG_SYSFS */ #ifdef CONFIG_MEMORY_HOTREMOVE - /* - * Choose a high priority since the callback takes ksm_thread_mutex: - * later callbacks could only be taking locks which nest within that. - */ + /* There is no significance to this priority 100 */ hotplug_memory_notifier(ksm_memory_callback, 100); #endif return 0; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755204Ab3A0BVe (ORCPT ); Sat, 26 Jan 2013 20:21:34 -0500 Received: from mail-ia0-f175.google.com ([209.85.210.175]:38382 "EHLO mail-ia0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754942Ab3A0BVb (ORCPT ); Sat, 26 Jan 2013 20:21:31 -0500 Message-ID: <1359249282.4159.4.camel@kernel> Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sat, 26 Jan 2013 19:14:42 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hugh, On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote: > From: Petr Holasek > > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes > which control merging pages across different numa nodes. 
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
>
> A typical use case could be a lot of KVM guests on a NUMA machine,
> where cpus from more distant nodes would have a significant increase
> of access latency to the merged ksm page. A sysfs knob was chosen
> for higher variability, as some users still prefer a higher amount
> of saved physical memory regardless of access latency.
>
> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing the merge_across_nodes value is possible
> only when there are no ksm shared pages in the system.
>
> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of the nodes with this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
>
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes 1.7%
>
> merge_across_nodes=0
> 2 nodes 1%
> 4 nodes 0.32%
> 8 nodes 0.018%
>
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
>
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
> on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
> here for !merge_across_nodes to migrate nodes to the proper tree.
> > Signed-off-by: Petr Holasek > Signed-off-by: Hugh Dickins > Acked-by: Rik van Riel > --- > Documentation/vm/ksm.txt | 7 + > mm/ksm.c | 151 ++++++++++++++++++++++++++++++++----- > 2 files changed, 139 insertions(+), 19 deletions(-) > > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > Default: 20 (chosen for demonstration purposes) > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > + When set to 0, ksm merges only pages which physically > + reside in the memory area of same NUMA node. It brings > + lower latency to access to shared page. Value can be > + changed only when there is no ksm shared pages in system. > + Default: 1 > + > run - set 0 to stop ksmd from running but keep merged pages, > set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", > set 2 to stop ksmd and unmerge all pages currently merged, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 > @@ -36,6 +36,7 @@ > #include > #include > #include > +#include > > #include > #include "internal.h" > @@ -139,6 +140,9 @@ struct rmap_item { > struct mm_struct *mm; > unsigned long address; /* + low bits used for flags below */ > unsigned int oldchecksum; /* when unstable */ > +#ifdef CONFIG_NUMA > + unsigned int nid; > +#endif > union { > struct rb_node node; /* when node of unstable tree */ > struct { /* when listed from stable tree */ > @@ -153,8 +157,8 @@ struct rmap_item { > #define STABLE_FLAG 0x200 /* is listed from the stable tree */ > > /* The stable and unstable tree heads */ > -static struct rb_root root_stable_tree = RB_ROOT; > -static struct rb_root root_unstable_tree = RB_ROOT; > +static struct rb_root root_unstable_tree[MAX_NUMNODES]; > +static struct rb_root 
root_stable_tree[MAX_NUMNODES]; > > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ > /* Milliseconds ksmd should sleep between batches */ > static unsigned int ksm_thread_sleep_millisecs = 20; > > +/* Zeroed when merging across nodes is not allowed */ > +static unsigned int ksm_merge_across_nodes = 1; > + > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > @@ -441,10 +448,25 @@ out: page = NULL; > return page; > } > > +/* > + * This helper is used for getting right index into array of tree roots. > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for > + * stable and unstable pages from all nodes with roots in index 0. Otherwise, > + * every node has its own stable and unstable tree. > + */ > +static inline int get_kpfn_nid(unsigned long kpfn) > +{ > + if (ksm_merge_across_nodes) > + return 0; > + else > + return pfn_to_nid(kpfn); > +} > + > static void remove_node_from_stable_tree(struct stable_node *stable_node) > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > + int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - rb_erase(&stable_node->node, &root_stable_tree); > + nid = get_kpfn_nid(stable_node->kpfn); > + > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > free_stable_node(stable_node); > } > > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s > age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); > BUG_ON(age > 1); > if (!age) > - rb_erase(&rmap_item->node, &root_unstable_tree); > +#ifdef CONFIG_NUMA > + rb_erase(&rmap_item->node, > + &root_unstable_tree[rmap_item->nid]); > +#else > + rb_erase(&rmap_item->node, &root_unstable_tree[0]); > +#endif > > ksm_pages_unshared--; > rmap_item->address &= PAGE_MASK; > @@ 
-990,8 +1019,9 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node = root_stable_tree.rb_node; > + struct rb_node *node; > struct stable_node *stable_node; > + int nid; > > stable_node = page_stable_node(page); > if (stable_node) { /* ksm page forked */ > @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s > return page; > } > > + nid = get_kpfn_nid(page_to_pfn(page)); > + node = root_stable_tree[nid].rb_node; > + > while (node) { > struct page *tree_page; > int ret; > @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s > */ > static struct stable_node *stable_tree_insert(struct page *kpage) > { > - struct rb_node **new = &root_stable_tree.rb_node; > + int nid; > + unsigned long kpfn; > + struct rb_node **new; > struct rb_node *parent = NULL; > struct stable_node *stable_node; > > + kpfn = page_to_pfn(kpage); > + nid = get_kpfn_nid(kpfn); > + new = &root_stable_tree[nid].rb_node; > + > while (*new) { > struct page *tree_page; > int ret; > @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i > return NULL; > > rb_link_node(&stable_node->node, parent, new); > - rb_insert_color(&stable_node->node, &root_stable_tree); > + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > INIT_HLIST_HEAD(&stable_node->hlist); > > - stable_node->kpfn = page_to_pfn(kpage); > + stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > > return stable_node; > @@ -1098,10 +1137,15 @@ static > struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, > struct page *page, > struct page **tree_pagep) > - > { > - struct rb_node **new = &root_unstable_tree.rb_node; > + struct rb_node **new; > + struct rb_root *root; > struct rb_node *parent = NULL; > + int nid; > + > + nid = get_kpfn_nid(page_to_pfn(page)); > + root = &root_unstable_tree[nid]; > + new = &root->rb_node; > > while (*new) { > struct rmap_item *tree_rmap_item; > @@ 
-1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > return NULL; > } > > + /* > + * If tree_page has been migrated to another NUMA node, it > + * will be flushed out and put into the right unstable tree Then why not insert the new page to unstable tree during page migration against current upstream? Because default behavior is merge across nodes. > + * next time: only merge with it if merge_across_nodes. > + * Just notice, we don't have similar problem for PageKsm > + * because their migration is disabled now. (62b61f611e) > + */ > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { > + put_page(tree_page); > + return NULL; > + } > + > ret = memcmp_pages(page, tree_page); > > parent = *new; > @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i > > rmap_item->address |= UNSTABLE_FLAG; > rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); > +#ifdef CONFIG_NUMA > + rmap_item->nid = nid; > +#endif > rb_link_node(&rmap_item->node, parent, new); > - rb_insert_color(&rmap_item->node, &root_unstable_tree); > + rb_insert_color(&rmap_item->node, root); > > ksm_pages_unshared++; > return NULL; > @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > +#ifdef CONFIG_NUMA > + /* > + * Usually rmap_item->nid is already set correctly, > + * but it may be wrong after switching merge_across_nodes. 
> + */ > + rmap_item->nid = get_kpfn_nid(stable_node->kpfn); > +#endif > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r > struct mm_slot *slot; > struct vm_area_struct *vma; > struct rmap_item *rmap_item; > + int nid; > > if (list_empty(&ksm_mm_head.mm_list)) > return NULL; > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > - root_unstable_tree = RB_ROOT; > + for (nid = 0; nid < nr_node_ids; nid++) > + root_unstable_tree[nid] = RB_ROOT; > > spin_lock(&ksm_mmlist_lock); > slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list); > @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta > unsigned long end_pfn) > { > struct rb_node *node; > + int nid; > > - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) { > - struct stable_node *stable_node; > + for (nid = 0; nid < nr_node_ids; nid++) > + for (node = rb_first(&root_stable_tree[nid]); node; > + node = rb_next(node)) { > + struct stable_node *stable_node; > + > + stable_node = rb_entry(node, struct stable_node, node); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + return stable_node; > + } > > - stable_node = rb_entry(node, struct stable_node, node); > - if (stable_node->kpfn >= start_pfn && > - stable_node->kpfn < end_pfn) > - return stable_node; > - } > return NULL; > } > > @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject > } > KSM_ATTR(run); > > +#ifdef CONFIG_NUMA > +static ssize_t merge_across_nodes_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return sprintf(buf, "%u\n", ksm_merge_across_nodes); > +} > + > +static ssize_t merge_across_nodes_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + int err; > + unsigned long knob; > + > + err = kstrtoul(buf, 10, 
&knob); > + if (err) > + return err; > + if (knob > 1) > + return -EINVAL; > + > + mutex_lock(&ksm_thread_mutex); > + if (ksm_merge_across_nodes != knob) { > + if (ksm_pages_shared) > + err = -EBUSY; > + else > + ksm_merge_across_nodes = knob; > + } > + mutex_unlock(&ksm_thread_mutex); > + > + return err ? err : count; > +} > +KSM_ATTR(merge_across_nodes); > +#endif > + > static ssize_t pages_shared_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = { > &pages_unshared_attr.attr, > &pages_volatile_attr.attr, > &full_scans_attr.attr, > +#ifdef CONFIG_NUMA > + &merge_across_nodes_attr.attr, > +#endif > NULL, > }; > > @@ -1992,11 +2101,15 @@ static int __init ksm_init(void) > { > struct task_struct *ksm_thread; > int err; > + int nid; > > err = ksm_slab_init(); > if (err) > goto out; > > + for (nid = 0; nid < nr_node_ids; nid++) > + root_stable_tree[nid] = RB_ROOT; > + > ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); > if (IS_ERR(ksm_thread)) { > printk(KERN_ERR "ksm: creating kthread failed\n"); > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755370Ab3A0Cgb (ORCPT ); Sat, 26 Jan 2013 21:36:31 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:40158 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755229Ab3A0Cg3 (ORCPT ); Sat, 26 Jan 2013 21:36:29 -0500 Message-ID: <1359254187.4159.10.camel@kernel> Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sat, 26 Jan 2013 20:36:27 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hugh, On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > In some places where get_ksm_page() is used, we need the page to be locked. > In function get_ksm_page, why check page->mapping => get_page_unless_zero => check page->mapping instead of get_page_unless_zero => check page->mapping, because get_page_unless_zero is expensive? > When KSM migration is fully enabled, we shall want that to make sure that > the page just acquired cannot be migrated beneath us (raised page count is > only effective when there is serialization to make sure migration notices). > Whereas when navigating through the stable tree, we certainly do not want What's the meaning of "navigating through the stable tree"? > to lock each node (raised page count is enough to guarantee the memcmps, > even if page is migrated to another node). > > Since we're about to add another use case, add the locked argument to > get_ksm_page() now. 
Why the parameter lock passed from stable_tree_search/insert is true, but remove_rmap_item_from_tree is false? > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > really got the wrong end of the stick on that! There's a configuration > in which page_cache_get_speculative() can do something cheaper than > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > disabled preemption for it. There's no need for rcu_read_lock() around > get_page_unless_zero() (and mapping checks) here. Cut out that > silliness before making this any harder to understand. > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 23 +++++++++++++---------- > 1 file changed, 13 insertions(+), 10 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > * but this is different - made simpler by ksm_thread_mutex being held, but > * interesting for assuming that no other use of the struct page could ever > * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > - * to keep the page_count protocol described with page_cache_get_speculative. > + * coincides with page->mapping). > * > * Note: it is possible that get_ksm_page() will return NULL one moment, > * then page the next, if the page is in between page_freeze_refs() and > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. 
> */ > -static struct page *get_ksm_page(struct stable_node *stable_node) > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct > page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - rcu_read_lock(); > if (page->mapping != expected_mapping) > goto stale; > if (!get_page_unless_zero(page)) > @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct > put_page(page); > goto stale; > } > - rcu_read_unlock(); > + if (locked) { > + lock_page(page); > + if (page->mapping != expected_mapping) { > + unlock_page(page); > + put_page(page); > + goto stale; > + } > + } > return page; > stale: > - rcu_read_unlock(); > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s > struct page *page; > > stable_node = rmap_item->head; > - page = get_ksm_page(stable_node); > + page = get_ksm_page(stable_node, true); > if (!page) > goto out; > > - lock_page(page); > hlist_del(&rmap_item->hlist); > unlock_page(page); > put_page(page); > @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s > > cond_resched(); > stable_node = rb_entry(node, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i > > cond_resched(); > stable_node = rb_entry(*new, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755435Ab3A0Csv (ORCPT ); Sat, 26 Jan 2013 21:48:51 -0500 Received: from mail-pb0-f41.google.com ([209.85.160.41]:39840 "EHLO mail-pb0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755245Ab3A0Cst (ORCPT ); Sat, 26 Jan 2013 21:48:49 -0500 Message-ID: <1359254927.4159.11.camel@kernel> Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sat, 26 Jan 2013 20:48:47 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > In some places where get_ksm_page() is used, we need the page to be locked. > > When KSM migration is fully enabled, we shall want that to make sure that > the page just acquired cannot be migrated beneath us (raised page count is > only effective when there is serialization to make sure migration notices). > Whereas when navigating through the stable tree, we certainly do not want > to lock each node (raised page count is enough to guarantee the memcmps, > even if page is migrated to another node). > > Since we're about to add another use case, add the locked argument to > get_ksm_page() now. > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > really got the wrong end of the stick on that! There's a configuration > in which page_cache_get_speculative() can do something cheaper than > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > disabled preemption for it. 
There's no need for rcu_read_lock() around > get_page_unless_zero() (and mapping checks) here. Cut out that > silliness before making this any harder to understand. BTW, what's the meaning of ksm page forked? > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 23 +++++++++++++---------- > 1 file changed, 13 insertions(+), 10 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > * but this is different - made simpler by ksm_thread_mutex being held, but > * interesting for assuming that no other use of the struct page could ever > * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > - * to keep the page_count protocol described with page_cache_get_speculative. > + * coincides with page->mapping). > * > * Note: it is possible that get_ksm_page() will return NULL one moment, > * then page the next, if the page is in between page_freeze_refs() and > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. 
> */ > -static struct page *get_ksm_page(struct stable_node *stable_node) > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct > page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - rcu_read_lock(); > if (page->mapping != expected_mapping) > goto stale; > if (!get_page_unless_zero(page)) > @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct > put_page(page); > goto stale; > } > - rcu_read_unlock(); > + if (locked) { > + lock_page(page); > + if (page->mapping != expected_mapping) { > + unlock_page(page); > + put_page(page); > + goto stale; > + } > + } > return page; > stale: > - rcu_read_unlock(); > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s > struct page *page; > > stable_node = rmap_item->head; > - page = get_ksm_page(stable_node); > + page = get_ksm_page(stable_node, true); > if (!page) > goto out; > > - lock_page(page); > hlist_del(&rmap_item->hlist); > unlock_page(page); > put_page(page); > @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s > > cond_resched(); > stable_node = rb_entry(node, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i > > cond_resched(); > stable_node = rb_entry(*new, struct stable_node, node); > - tree_page = get_ksm_page(stable_node); > + tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755467Ab3A0Cyn (ORCPT ); Sat, 26 Jan 2013 21:54:43 -0500 Received: from mail-da0-f54.google.com ([209.85.210.54]:64183 "EHLO mail-da0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755288Ab3A0Cyl (ORCPT ); Sat, 26 Jan 2013 21:54:41 -0500 Date: Sat, 26 Jan 2013 18:54:36 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <1359249282.4159.4.camel@kernel> Message-ID: References: <1359249282.4159.4.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote: > > From: Petr Holasek > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > > return NULL; > > } > > > > + /* > > + * If tree_page has been migrated to another NUMA node, it > > + * will be flushed out and put into the right unstable tree > > Then why not insert the new page to unstable tree during page migration > > against current upstream? Because default behavior is merge across > > nodes. I don't understand the words "against current upstream" in your question. We cannot move a page (strictly, a node) from one tree to another during page migration itself, because the necessary ksm_thread_mutex is not held. Nor would we even want to while "merge across nodes".
Ah, perhaps you are pointing out that in current upstream, the only user of ksm page migration is memory hotremove, which in current upstream does hold ksm_thread_mutex. So you'd like us to add code for moving a node from one tree to another in ksm_migrate_page() (and what would it do when it collides with an existing node?), code which will then be removed a few patches later when ksm page migration is fully enabled? No, I'm not going to put any more thought into that. When Andrea pointed out the problem with Petr's original change to ksm_migrate_page(), I did indeed think that we could do something cleverer at that point; but once I got down to trying it, found that a dead end. I wasn't going to be able to test the hotremove case properly anyway, so no good pursuing solutions that couldn't be generalized. Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755568Ab3A0DQ0 (ORCPT ); Sat, 26 Jan 2013 22:16:26 -0500 Received: from mail-pb0-f53.google.com ([209.85.160.53]:45391 "EHLO mail-pb0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755361Ab3A0DQY (ORCPT ); Sat, 26 Jan 2013 22:16:24 -0500 Message-ID: <1359256581.4159.16.camel@kernel> Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sat, 26 Jan 2013 21:16:21 -0600 In-Reply-To: References: <1359249282.4159.4.camel@kernel> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote: > On Sat, 26 Jan 2013, Simon Jeons wrote: > > On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins 
wrote: > > > From: Petr Holasek > > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > > > return NULL; > > > } > > > > > > + /* > > > + * If tree_page has been migrated to another NUMA node, it > > > + * will be flushed out and put into the right unstable tree > > > > Then why not insert the new page to unstable tree during page migration > > against current upstream? Because default behavior is merge across > > nodes. > > I don't understand the words "against current upstream" in your question. I mean the current upstream code, without NUMA awareness. :) > > We cannot move a page (strictly, a node) from one tree to another during > page migration itself, because the necessary ksm_thread_mutex is not held. > Nor would we even want to while "merge across nodes". > > Ah, perhaps you are pointing out that in current upstream, the only user > of ksm page migration is memory hotremove, which in current upstream does > hold ksm_thread_mutex. > > So you'd like us to add code for moving a node from one tree to another > in ksm_migrate_page() (and what would it do when it collides with an Without NUMA awareness, I still can't see from your explanation why we can't insert the node into the tree just after page migration, instead of inserting it at the next scan. > existing node?), code which will then be removed a few patches later > when ksm page migration is fully enabled? > > No, I'm not going to put any more thought into that. When Andrea pointed > out the problem with Petr's original change to ksm_migrate_page(), I did > indeed think that we could do something cleverer at that point; but once > I got down to trying it, found that a dead end. I wasn't going to be > able to test the hotremove case properly anyway, so no good pursuing > solutions that couldn't be generalized.
> > Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755737Ab3A0E4B (ORCPT ); Sat, 26 Jan 2013 23:56:01 -0500 Received: from mail-pa0-f48.google.com ([209.85.220.48]:65296 "EHLO mail-pa0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755571Ab3A0Ez6 (ORCPT ); Sat, 26 Jan 2013 23:55:58 -0500 Message-ID: <1359262556.4159.23.camel@kernel> Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sat, 26 Jan 2013 23:55:56 -0500 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hugh, On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > Switching merge_across_nodes after running KSM is liable to oops on stale > nodes still left over from the previous stable tree. It's not something > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. > When can this happen? 
> Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. Forked mms will be unmerged just after ksmd's cursor, since they're inserted behind it, so why would they be missed? > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. > Makes sense. :) > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error.
> > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? > */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through 
the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /* > + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. > + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ > + } > + cond_resched(); > + } > + } > + return err; > +} > + > static int unmerge_and_remove_all_rmap_items(void) > { > struct mm_slot *mm_slot; > @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i > } > } > > + /* Clean up stable nodes, but don't worry if some are still busy */ > + remove_all_stable_nodes(); > ksm_scan.seqnr = 0; > return 0; > > @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) > spin_lock(&ksm_mmlist_lock); > insert_to_mm_slots_hash(mm, mm_slot); > /* > - * Insert just behind the scanning cursor, to let the area settle > + * When KSM_RUN_MERGE (or KSM_RUN_STOP), > + * insert just behind the scanning cursor, to let the area settle > * down a little; when fork is followed by immediate exec, we don't > * want ksmd to waste time setting up and tearing down an rmap_list. 
> + * > + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its > + * scanning cursor, otherwise KSM pages in newly forked mms will be > + * missed: then we might as well insert at the end of the list. > */ > - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > + if (ksm_run & KSM_RUN_UNMERGE) > + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); > + else > + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > spin_unlock(&ksm_mmlist_lock); > > set_bit(MMF_VM_MERGEABLE, &mm->flags); > @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) > } > } > > -struct page *ksm_does_need_to_copy(struct page *page, > +struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > + struct anon_vma *anon_vma = page_anon_vma(page); > struct page *new_page; > > + if (PageKsm(page)) { > + if (page_stable_node(page) && > + !(ksm_run & KSM_RUN_UNMERGE)) > + return page; /* no need to copy it */ > + } else if (!anon_vma) { > + return page; /* no need to copy it */ > + } else if (anon_vma->root == vma->anon_vma->root && > + page->index == linear_page_index(vma, address)) { > + return page; /* still no need to copy it */ > + } > + if (!PageUptodate(page)) > + return page; /* let do_swap_page report the error */ > + > new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); > if (new_page) { > copy_user_highpage(new_page, page, address, vma); > @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( > > mutex_lock(&ksm_thread_mutex); > if (ksm_merge_across_nodes != knob) { > - if (ksm_pages_shared) > + if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > else > ksm_merge_across_nodes = knob; > --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 > @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > goto 
out_page; > > - if (ksm_might_need_to_copy(page, vma, address)) { > - swapcache = page; > - page = ksm_does_need_to_copy(page, vma, address); > - > - if (unlikely(!page)) { > - ret = VM_FAULT_OOM; > - page = swapcache; > - swapcache = NULL; > - goto out_page; > - } > + swapcache = page; > + page = ksm_might_need_to_copy(page, vma, address); > + if (unlikely(!page)) { > + ret = VM_FAULT_OOM; > + page = swapcache; > + swapcache = NULL; > + goto out_page; > } > + if (page == swapcache) > + swapcache = NULL; > > if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > ret = VM_FAULT_OOM; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755990Ab3A0FrU (ORCPT ); Sun, 27 Jan 2013 00:47:20 -0500 Received: from mail-pb0-f44.google.com ([209.85.160.44]:37999 "EHLO mail-pb0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751983Ab3A0FrR (ORCPT ); Sun, 27 Jan 2013 00:47:17 -0500 Message-ID: <1359265635.6763.0.camel@kernel> Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sat, 26 Jan 2013 23:47:15 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote: > KSM page migration is already supported in the case of memory hotremove, > which takes the ksm_thread_mutex across all its migrations to keep life > simple. 
> > But the new KSM NUMA merge_across_nodes knob introduces a problem, when > it's set to non-default 0: if a KSM page is migrated to a different NUMA > node, how do we migrate its stable node to the right tree? And what if > that collides with an existing stable node? > > So far there's no provision for that, and this patch does not attempt > to deal with it either. But how will I test a solution, when I don't > know how to hotremove memory? The best answer is to enable KSM page > migration in all cases now, and test more common cases. With THP and > compaction added since KSM came in, page migration is now mainstream, > and it's a shame that a KSM page can frustrate freeing a page block. > > Without worrying about merge_across_nodes 0 for now, this patch gets > KSM page migration working reliably for default merge_across_nodes 1 > (but leave the patch enabling it until near the end of the series). > > It's much simpler than I'd originally imagined, and does not require > an additional tier of locking: page migration relies on the page lock, > KSM page reclaim relies on the page lock, the page lock is enough for > KSM page migration too. > > Almost all the care has to be in get_ksm_page(): that's the function > which worries about when a stable node is stale and should be freed, > now it also has to worry about the KSM page being migrated. > > The only new overhead is an additional put/get/lock/unlock_page when > stable_tree_search() arrives at a matching node: to make sure migration > respects the raised page count, and so does not migrate the page while > we're busy with it here. That's probably avoidable, either by changing > internal interfaces from using kpage to stable_node, or by moving the > ksm_migrate_page() callsite into a page_freeze_refs() section (even if > not swapcache); but this works well, I've no urge to pull it apart now. 
> > (Descents of the stable tree may pass through nodes whose KSM pages are > under migration: being unlocked, the raised page count does not prevent > that, nor need it: it's safe to memcmp against either old or new page.) > > You might worry about mremap, and whether page migration's rmap_walk > to remove migration entries will find all the KSM locations where it > inserted earlier: that should already be handled, by the satisfyingly > heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++----------- > mm/migrate.c | 5 ++ > 2 files changed, 77 insertions(+), 22 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree > * In which case we can trust the content of the page, and it > * returns the gotten page; but if the page has now been zapped, > * remove the stale node from the stable tree and return NULL. > + * But beware, the stable node's page might be being migrated. > * > * You would expect the stable_node to hold a reference to the ksm page. > * But if it increments the page's count, swapping out has to wait for > @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree > * pointing back to this stable node. This relies on freeing a PageAnon > * page to reset its page->mapping to NULL, and relies on no other use of > * a page to put something that might look like our key in page->mapping. > - * > - * include/linux/pagemap.h page_cache_get_speculative() is a good reference, > - * but this is different - made simpler by ksm_thread_mutex being held, but > - * interesting for assuming that no other use of the struct page could ever > - * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). 
> - * > - * Note: it is possible that get_ksm_page() will return NULL one moment, > - * then page the next, if the page is in between page_freeze_refs() and > - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. > */ > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > + unsigned long kpfn; > > - page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - if (page->mapping != expected_mapping) > - goto stale; > - if (!get_page_unless_zero(page)) > +again: > + kpfn = ACCESS_ONCE(stable_node->kpfn); > + page = pfn_to_page(kpfn); > + > + /* > + * page is computed from kpfn, so on most architectures reading > + * page->mapping is naturally ordered after reading node->kpfn, > + * but on Alpha we need to be more careful. > + */ > + smp_read_barrier_depends(); > + if (ACCESS_ONCE(page->mapping) != expected_mapping) > goto stale; > - if (page->mapping != expected_mapping) { > + > + /* > + * We cannot do anything with the page while its refcount is 0. > + * Usually 0 means free, or tail of a higher-order page: in which > + * case this node is no longer referenced, and should be freed; > + * however, it might mean that the page is under page_freeze_refs(). > + * The __remove_mapping() case is easy, again the node is now stale; > + * but if page is swapcache in migrate_page_move_mapping(), it might > + * still be our page, in which case it's essential to keep the node. > + */ > + while (!get_page_unless_zero(page)) { > + /* > + * Another check for page->mapping != expected_mapping would > + * work here too. 
We have chosen the !PageSwapCache test to > + * optimize the common case, when the page is or is about to > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > + * in the freeze_refs section of __remove_mapping(); but Anon > + * page->mapping reset to NULL later, in free_pages_prepare(). > + */ > + if (!PageSwapCache(page)) > + goto stale; > + cpu_relax(); > + } > + > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > put_page(page); > goto stale; > } > + > if (locked) { > lock_page(page); > - if (page->mapping != expected_mapping) { > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > unlock_page(page); > put_page(page); > goto stale; > } > } Could you explain why page->mapping needs to be checked twice after getting the page? > return page; > + > stale: > + /* > + * We come here from above when page->mapping or !PageSwapCache > + * suggests that the node is stale; but it might be under migration. > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), > + * before checking whether node->kpfn has been changed. > + */ > + smp_rmb(); > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) > + goto again; > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s > return NULL; > > ret = memcmp_pages(page, tree_page); > + put_page(tree_page); > > - if (ret < 0) { > - put_page(tree_page); > + if (ret < 0) > node = node->rb_left; > - } else if (ret > 0) { > - put_page(tree_page); > + else if (ret > 0) > node = node->rb_right; > - } else > + else { > + /* > + * Lock and unlock the stable_node's page (which > + * might already have been migrated) so that page > + * migration is sure to notice its raised count. > + * It would be more elegant to return stable_node > + * than kpage, but that involves more changes.
> + */ > + tree_page = get_ksm_page(stable_node, true); > + if (tree_page) > + unlock_page(tree_page); > return tree_page; > + } > } > > return NULL; > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa > if (stable_node) { > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); > stable_node->kpfn = page_to_pfn(newpage); > + /* > + * newpage->mapping was set in advance; now we need smp_wmb() > + * to make sure that the new stable_node->kpfn is visible > + * to get_ksm_page() before it can see that oldpage->mapping > + * has gone stale (or that PageSwapCache has been cleared). > + */ > + smp_wmb(); > + set_page_stable_node(oldpage, NULL); > } > } > #endif /* CONFIG_MIGRATION */ > --- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800 > +++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 > @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp > > mlock_migrate_page(newpage, page); > ksm_migrate_page(newpage, page); > - > + /* > + * Please do not reorder this without considering how mm/ksm.c's > + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). > + */ > ClearPageSwapCache(page); > ClearPagePrivate(page); > set_page_private(page, 0); > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756095Ab3A0GXe (ORCPT ); Sun, 27 Jan 2013 01:23:34 -0500 Received: from mail-ia0-f172.google.com ([209.85.210.172]:47108 "EHLO mail-ia0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755928Ab3A0GXc (ORCPT ); Sun, 27 Jan 2013 01:23:32 -0500 Message-ID: <1359267810.6763.1.camel@kernel> Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Gerald Schaefer , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sun, 27 Jan 2013 00:23:30 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2013-01-25 at 18:10 -0800, Hugh Dickins wrote: > Complaints are rare, but lockdep still does not understand the way > ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and > holds it until the ksm_memory_callback(MEM_OFFLINE): that appears > to be a problem because notifier callbacks are made under down_read > of blocking_notifier_head->rwsem (so first the mutex is taken while > holding the rwsem, then later the rwsem is taken while still holding > the mutex); but is not in fact a problem because mem_hotplug_mutex > is held throughout the dance. > > There was an attempt to fix this with mutex_lock_nested(); but if that > happened to fool lockdep two years ago, apparently it does so no longer. > > I had hoped to eradicate this issue in extending KSM page migration not > to need the ksm_thread_mutex. 
But then realized that although the page > migration itself is safe, we do still need to lock out ksmd and other > users of get_ksm_page() while offlining memory - at some point between > MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may > vanish, and get_ksm_page()'s accesses to them become a violation. > > So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to > MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() > checks, to achieve the same lockout without being caught by lockdep. > This is less elegant for KSM, but it's more important to keep lockdep > useful to other users - and I apologize for how long it took to fix. > > Reported-by: Gerald Schaefer > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 41 insertions(+), 14 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:38:53.984208836 -0800 > @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > -static unsigned int ksm_run = KSM_RUN_STOP; > +#define KSM_RUN_OFFLINE 4 > +static unsigned long ksm_run = KSM_RUN_STOP; > +static void wait_while_offlining(void); > > static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); > static DEFINE_MUTEX(ksm_thread_mutex); > @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing > > while (!kthread_should_stop()) { > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksmd_should_run()) > ksm_do_scan(ksm_thread_pages_to_scan); > mutex_unlock(&ksm_thread_mutex); > @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa > #endif /* CONFIG_MIGRATION */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > +static int just_wait(void *word) > +{ > + schedule(); > + return 0; > +} > + > +static void wait_while_offlining(void) > +{ > + while (ksm_run & KSM_RUN_OFFLINE) { > + 
mutex_unlock(&ksm_thread_mutex); > + wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), > + just_wait, TASK_UNINTERRUPTIBLE); > + mutex_lock(&ksm_thread_mutex); > + } > +} > + > static void ksm_check_stable_tree(unsigned long start_pfn, > unsigned long end_pfn) > { > @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no > switch (action) { > case MEM_GOING_OFFLINE: > /* > - * Keep it very simple for now: just lock out ksmd and > - * MADV_UNMERGEABLE while any memory is going offline. > - * mutex_lock_nested() is necessary because lockdep was alarmed > - * that here we take ksm_thread_mutex inside notifier chain > - * mutex, and later take notifier chain mutex inside > - * ksm_thread_mutex to unlock it. But that's safe because both > - * are inside mem_hotplug_mutex. > + * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() > + * and remove_all_stable_nodes() while memory is going offline: > + * it is unsafe for them to touch the stable tree at this time. > + * But unmerge_ksm_pages(), rmap lookups and other entry points Why is unmerge_ksm_pages() beneath us safe for KSM memory hotremove? > + * which do not need the ksm_thread_mutex are all safe.
> */ > - mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING); > + mutex_lock(&ksm_thread_mutex); > + ksm_run |= KSM_RUN_OFFLINE; > + mutex_unlock(&ksm_thread_mutex); > break; > > case MEM_OFFLINE: > @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no > /* fallthrough */ > > case MEM_CANCEL_OFFLINE: > + mutex_lock(&ksm_thread_mutex); > + ksm_run &= ~KSM_RUN_OFFLINE; > mutex_unlock(&ksm_thread_mutex); > + > + smp_mb(); /* wake_up_bit advises this */ > + wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); > break; > } > return NOTIFY_OK; > } > +#else > +static void wait_while_offlining(void) > +{ > +} > #endif /* CONFIG_MEMORY_HOTREMOVE */ > > #ifdef CONFIG_SYSFS > @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan); > static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, > char *buf) > { > - return sprintf(buf, "%u\n", ksm_run); > + return sprintf(buf, "%lu\n", ksm_run); > } > > static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, > @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject > */ > > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksm_run != flags) { > ksm_run = flags; > if (flags & KSM_RUN_UNMERGE) { > @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store( > return -EINVAL; > > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksm_merge_across_nodes != knob) { > if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > @@ -2366,10 +2396,7 @@ static int __init ksm_init(void) > #endif /* CONFIG_SYSFS */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > - /* > - * Choose a high priority since the callback takes ksm_thread_mutex: > - * later callbacks could only be taking locks which nest within that. > - */ > + /* There is no significance to this priority 100 */ > hotplug_memory_notifier(ksm_memory_callback, 100); > #endif > return 0; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. 
For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756271Ab3A0ItU (ORCPT ); Sun, 27 Jan 2013 03:49:20 -0500 Received: from mail-ia0-f169.google.com ([209.85.210.169]:48867 "EHLO mail-ia0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756238Ab3A0ItR (ORCPT ); Sun, 27 Jan 2013 03:49:17 -0500 Message-ID: <1359276555.6763.6.camel@kernel> Subject: Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sun, 27 Jan 2013 02:49:15 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote: > The new KSM NUMA merge_across_nodes knob introduces a problem, when it's > set to non-default 0: if a KSM page is migrated to a different NUMA node, > how do we migrate its stable node to the right tree? And what if that > collides with an existing stable node? > > ksm_migrate_page() can do no more than it's already doing, updating > stable_node->kpfn: the stable tree itself cannot be manipulated without > holding ksm_thread_mutex. So accept that a stable tree may temporarily > indicate a page belonging to the wrong NUMA node, leave updating until > the next pass of ksmd, just be careful not to merge other pages on to a > misplaced page. Note nid of holding tree in stable_node, and recognize > that it will not always match nid of kpfn. 
> > A misplaced KSM page is discovered, either when ksm_do_scan() next comes > around to one of its rmap_items (we now have to go to cmp_and_merge_page > even on pages in a stable tree), or when stable_tree_search() arrives at > a matching node for another page, and this node page is found misplaced. > > In each case, move the misplaced stable_node to a list of migrate_nodes > (and use the address of migrate_nodes as magic by which to identify them): > we don't need them in a tree. If stable_tree_search() finds no match for > a page, but it's currently exiled to this list, then slot its stable_node > right there into the tree, bringing all of its mappings with it; otherwise > they get migrated one by one to the original page of the colliding node. > stable_tree_search() is now modelled more like stable_tree_insert(), > in order to handle these insertions of migrated nodes. > > remove_node_from_stable_tree(), remove_all_stable_nodes() and > ksm_check_stable_tree() have to handle the migrate_nodes list as well as > the stable tree itself. Less obviously, we do need to prune the list of > stale entries from time to time (scan_get_next_rmap_item() does it once > each full scan): whereas stale nodes in the stable tree get naturally > pruned as searches try to brush past them, these migrate_nodes may get > forgotten and accumulate. 
> > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 164 +++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 134 insertions(+), 30 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 > @@ -122,13 +122,25 @@ struct ksm_scan { > /** > * struct stable_node - node of the stable rbtree > * @node: rb node of this ksm page in the stable tree > + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list > + * @list: linked into migrate_nodes, pending placement in the proper node tree > * @hlist: hlist head of rmap_items using this ksm page > - * @kpfn: page frame number of this ksm page > + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) > + * @nid: NUMA node id of stable tree in which linked (may not match kpfn) > */ > struct stable_node { > - struct rb_node node; > + union { > + struct rb_node node; /* when node of stable tree */ > + struct { /* when listed for migration */ > + struct list_head *head; > + struct list_head list; > + }; > + }; > struct hlist_head hlist; > unsigned long kpfn; > +#ifdef CONFIG_NUMA > + int nid; > +#endif > }; > > /** > @@ -169,6 +181,9 @@ struct rmap_item { > static struct rb_root root_unstable_tree[MAX_NUMNODES]; > static struct rb_root root_stable_tree[MAX_NUMNODES]; > > +/* Recently migrated nodes of stable tree, pending proper placement */ > +static LIST_HEAD(migrate_nodes); > + > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > > @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru > hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm); > } > > -static inline int in_stable_tree(struct rmap_item *rmap_item) > -{ > - return rmap_item->address & STABLE_FLAG; > -} > - > /* > * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's > * page tables after it has passed through ksm_exit() - which, if necessary, > @@ 
-476,7 +486,6 @@ static void remove_node_from_stable_tree > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > - int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - nid = get_kpfn_nid(stable_node->kpfn); > - rb_erase(&stable_node->node, &root_stable_tree[nid]); > + if (stable_node->head == &migrate_nodes) > + list_del(&stable_node->list); > + else > + rb_erase(&stable_node->node, > + &root_stable_tree[NUMA(stable_node->nid)]); > free_stable_node(stable_node); > } > > @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta > static int remove_all_stable_nodes(void) > { > struct stable_node *stable_node; > + struct list_head *this, *next; > int nid; > int err = 0; > > @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void) > cond_resched(); > } > } > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, struct stable_node, list); > + if (remove_stable_node(stable_node)) > + err = -EBUSY; > + cond_resched(); > + } > return err; > } > > @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node; > - struct stable_node *stable_node; > int nid; > + struct rb_node **new; > + struct rb_node *parent; > + struct stable_node *stable_node; > + struct stable_node *page_node; > > - stable_node = page_stable_node(page); > - if (stable_node) { /* ksm page forked */ > + page_node = page_stable_node(page); > + if (page_node && page_node->head != &migrate_nodes) { > + /* ksm page forked */ > get_page(page); > return page; > } > > nid = get_kpfn_nid(page_to_pfn(page)); > - node = root_stable_tree[nid].rb_node; > +again: > + new = &root_stable_tree[nid].rb_node; > + parent = NULL; > > - while (node) { > + while (*new) { > struct page *tree_page; > int ret; > > cond_resched(); > - 
stable_node = rb_entry(node, struct stable_node, node); > + stable_node = rb_entry(*new, struct stable_node, node); > tree_page = get_ksm_page(stable_node, false); > if (!tree_page) > return NULL; > @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s > ret = memcmp_pages(page, tree_page); > put_page(tree_page); > > + parent = *new; > if (ret < 0) > - node = node->rb_left; > + new = &parent->rb_left; > else if (ret > 0) > - node = node->rb_right; > + new = &parent->rb_right; > else { > /* > * Lock and unlock the stable_node's page (which > @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s > * than kpage, but that involves more changes. > */ > tree_page = get_ksm_page(stable_node, true); > - if (tree_page) > + if (tree_page) { > unlock_page(tree_page); > - return tree_page; > + if (get_kpfn_nid(stable_node->kpfn) != > + NUMA(stable_node->nid)) { > + put_page(tree_page); > + goto replace; > + } > + return tree_page; > + } > + /* > + * There is now a place for page_node, but the tree may > + * have been rebalanced, so re-evaluate parent and new. 
> + */ > + if (page_node) > + goto again; > + return NULL; > } > } > > - return NULL; > + if (!page_node) > + return NULL; > + > + list_del(&page_node->list); > + DO_NUMA(page_node->nid = nid); > + rb_link_node(&page_node->node, parent, new); > + rb_insert_color(&page_node->node, &root_stable_tree[nid]); > + get_page(page); > + return page; > + > +replace: > + if (page_node) { > + list_del(&page_node->list); > + DO_NUMA(page_node->nid = nid); > + rb_replace_node(&stable_node->node, > + &page_node->node, &root_stable_tree[nid]); > + get_page(page); > + } else { > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > + page = NULL; > + } > + stable_node->head = &migrate_nodes; > + list_add(&stable_node->list, stable_node->head); > + return page; > } > > /* > @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i > INIT_HLIST_HEAD(&stable_node->hlist); > stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > + DO_NUMA(stable_node->nid = nid); > rb_link_node(&stable_node->node, parent, new); > rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > - /* > - * Usually rmap_item->nid is already set correctly, > - * but it may be wrong after switching merge_across_nodes. 
> - */ > - DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn)); > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa > unsigned int checksum; > int err; > > - remove_rmap_item_from_tree(rmap_item); > + stable_node = page_stable_node(page); > + if (stable_node) { > + if (stable_node->head != &migrate_nodes && > + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { > + rb_erase(&stable_node->node, > + &root_stable_tree[NUMA(stable_node->nid)]); > + stable_node->head = &migrate_nodes; > + list_add(&stable_node->list, stable_node->head); Why list add &stable_node->list to stable_node->head? stable_node->head is used for queue what? > + } > + if (stable_node->head != &migrate_nodes && > + rmap_item->head == stable_node) > + return; > + } > > /* We first start with searching the page inside the stable tree */ > kpage = stable_tree_search(page); > + if (kpage == page && rmap_item->head == stable_node) { > + put_page(kpage); > + return; > + } > + > + remove_rmap_item_from_tree(rmap_item); > + > if (kpage) { > err = try_to_merge_with_ksm_page(rmap_item, page, kpage); > if (!err) { > @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > + /* > + * Whereas stale stable_nodes on the stable_tree itself > + * get pruned in the regular course of stable_tree_search(), Which kinds of stable_nodes can be treated as stale? I just see remove rmap_item in stable_tree_search() and scan_get_next_rmap_item(). > + * those moved out to the migrate_nodes list can accumulate: > + * so prune them once before each full scan. 
> + */ > + if (!ksm_merge_across_nodes) { > + struct stable_node *stable_node; > + struct list_head *this, *next; > + struct page *page; > + > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, > + struct stable_node, list); > + page = get_ksm_page(stable_node, false); > + if (page) > + put_page(page); > + cond_resched(); > + } > + } > + Why get page of misplaced pages here? > for (nid = 0; nid < nr_node_ids; nid++) > root_unstable_tree[nid] = RB_ROOT; > > @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca > rmap_item = scan_get_next_rmap_item(&page); > if (!rmap_item) > return; > - if (!PageKsm(page) || !in_stable_tree(rmap_item)) > - cmp_and_merge_page(page, rmap_item); > + cmp_and_merge_page(page, rmap_item); > put_page(page); > } > } > @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign > unsigned long end_pfn) > { > struct stable_node *stable_node; > + struct list_head *this, *next; > struct rb_node *node; > int nid; > > @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign > cond_resched(); > } > } > + list_for_each_safe(this, next, &migrate_nodes) { > + stable_node = list_entry(this, struct stable_node, list); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + remove_node_from_stable_tree(stable_node); > + cond_resched(); > + } > } > > static int ksm_memory_callback(struct notifier_block *self, > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756634Ab3A0Vz0 (ORCPT ); Sun, 27 Jan 2013 16:55:26 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:43658 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756545Ab3A0VzY (ORCPT ); Sun, 27 Jan 2013 16:55:24 -0500 Date: Sun, 27 Jan 2013 13:55:19 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <1359256581.4159.16.camel@kernel> Message-ID: References: <1359249282.4159.4.camel@kernel> <1359256581.4159.16.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote: > > > > So you'd like us to add code for moving a node from one tree to another > > in ksm_migrate_page() (and what would it do when it collides with an > > Without numa awareness, I still can't understand your explanation why > can't insert the node to the tree just after page migration instead of > inserting it at the next scan. The node is already there in the right (only) tree in that case. > > > existing node?), code which will then be removed a few patches later > > when ksm page migration is fully enabled? 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756690Ab3A0WIC (ORCPT ); Sun, 27 Jan 2013 17:08:02 -0500 Received: from mail-da0-f44.google.com ([209.85.210.44]:43992 "EHLO mail-da0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756310Ab3A0WH7 (ORCPT ); Sun, 27 Jan 2013 17:07:59 -0500 Date: Sun, 27 Jan 2013 14:08:00 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked In-Reply-To: <1359254187.4159.10.camel@kernel> Message-ID: References: <1359254187.4159.10.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > > In some places where get_ksm_page() is used, we need the page to be locked. > > > > In function get_ksm_page, why check page->mapping => > get_page_unless_zero => check page->mapping instead of > get_page_unless_zero => check page->mapping, because > get_page_unless_zero is expensive? Yes, it's more expensive. > > > When KSM migration is fully enabled, we shall want that to make sure that > > the page just acquired cannot be migrated beneath us (raised page count is > > only effective when there is serialization to make sure migration notices). > > Whereas when navigating through the stable tree, we certainly do not want > > What's the meaning of "navigating through the stable tree"? Finding the right place in the stable tree, as stable_tree_search() and stable_tree_insert() do. > > > to lock each node (raised page count is enough to guarantee the memcmps, > > even if page is migrated to another node). 
> > > > Since we're about to add another use case, add the locked argument to > > get_ksm_page() now. > > Why the parameter lock passed from stable_tree_search/insert is true, > but remove_rmap_item_from_tree is false? The other way round? remove_rmap_item_from_tree needs the page locked, because it's about to modify the list: that's secured (e.g. against concurrent KSM page reclaim) by the page lock. stable_tree_search and stable_tree_insert do not need intermediate nodes to be locked: get_page is enough to secure the page contents for memcmp, and we don't want a pointless wait for exclusive page lock on every intermediate node. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756741Ab3A0WKS (ORCPT ); Sun, 27 Jan 2013 17:10:18 -0500 Received: from mail-pb0-f54.google.com ([209.85.160.54]:34580 "EHLO mail-pb0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756310Ab3A0WKP (ORCPT ); Sun, 27 Jan 2013 17:10:15 -0500 Date: Sun, 27 Jan 2013 14:10:16 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked In-Reply-To: <1359254927.4159.11.camel@kernel> Message-ID: References: <1359254927.4159.11.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 26 Jan 2013, Simon Jeons wrote: > > BTW, what's the meaning of ksm page forked? A ksm page is mapped into a process's mm, then that process calls fork(): the ksm page then appears in the child's mm, before ksmd has tracked it. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756843Ab3A0XFx (ORCPT ); Sun, 27 Jan 2013 18:05:53 -0500 Received: from mail-pb0-f45.google.com ([209.85.160.45]:53300 "EHLO mail-pb0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755920Ab3A0XFu (ORCPT ); Sun, 27 Jan 2013 18:05:50 -0500 Date: Sun, 27 Jan 2013 15:05:46 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <1359262556.4159.23.camel@kernel> Message-ID: References: <1359262556.4159.23.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > > Switching merge_across_nodes after running KSM is liable to oops on stale > > nodes still left over from the previous stable tree. It's not something > > that people will often want to do, but it would be lame to demand a reboot > > when they're trying to determine which merge_across_nodes setting is best. > > > > How can this happen? We only permit switching merge_across_nodes when > > pages_shared is 0, and usually set run 2 to force that beforehand, which > > ought to unmerge everything: yet oopses still occur when you then run 1. > > > > Three causes: > > > > 1. The old stable tree (built according to the inverse merge_across_nodes) > > has not been fully torn down. A stable node lingers until get_ksm_page() > > notices that the page it references no longer references it: but the page > > is not necessarily freed as soon as expected, particularly when swapcache. 
> > > > When can this happen? Whenever there's an additional reference to the page, beyond those for its ptes in userspace - swapcache for example, or pinned by get_user_pages. That delays its being freed (arriving at the "page->mapping = NULL;" in free_pages_prepare()). Or it might simply be sitting in a pagevec, waiting for that to be filled up, to be freed as part of a batch. > > > Fix this with a pass through the old stable tree, applying get_ksm_page() > > to each of the remaining nodes (most found stale and removed immediately), > > with forced removal of any left over. Unless the page is still mapped: > > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > > and EBUSY than BUG. > > > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > > just behind ksmd's cursor, so there's a full pass for it to stabilize > > (or be removed) before ksmd addresses it. Nice when ksmd is running, > > but not so nice when we're trying to unmerge all mms: we were missing > > those mms forked and inserted behind the unmerge cursor. Easily fixed > > by inserting at the end when KSM_RUN_UNMERGE. > > mms forked will be unmerged just after ksmd's cursor since they're > inserted behind it, why will be missing? unmerge_and_remove_all_rmap_items() makes one pass through the list from start to finish: insert behind the cursor and it will be missed. 
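The point about the single pass can be shown with a small userspace sketch. Everything in it (`struct slot`, `insert_behind`, `insert_at_end`, `single_pass_misses`) is hypothetical and only models the list walk; it is not the mm/ksm.c code. It assumes a "fork" adds a new entry while the cursor is partway through the list, and compares inserting just behind the cursor with inserting at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry on the list that the unmerge pass walks. */
struct slot {
    int id;
    int visited;
    struct slot *next;
};

/* Insert new_slot immediately before cursor ("behind the cursor"). */
static void insert_behind(struct slot **head, struct slot *cursor,
                          struct slot *new_slot)
{
    struct slot **pp = head;

    while (*pp != cursor) {
        assert(*pp != NULL);    /* cursor must be on the list */
        pp = &(*pp)->next;
    }
    new_slot->next = cursor;
    *pp = new_slot;
}

/* Insert new_slot after the last entry. */
static void insert_at_end(struct slot **head, struct slot *new_slot)
{
    struct slot **pp = head;

    while (*pp)
        pp = &(*pp)->next;
    new_slot->next = NULL;
    *pp = new_slot;
}

/*
 * Walk the list once from start to finish. While the cursor sits on the
 * second entry, a new entry is added. Returns 1 if the pass finished
 * without ever visiting the new entry, 0 if it was visited.
 */
static int single_pass_misses(int insert_at_tail)
{
    struct slot a = { 1, 0, NULL }, b = { 2, 0, NULL }, c = { 3, 0, NULL };
    struct slot child = { 4, 0, NULL };
    struct slot *head = &a;
    struct slot *cursor;

    a.next = &b;
    b.next = &c;

    for (cursor = head; cursor; cursor = cursor->next) {
        if (cursor == &b) {
            if (insert_at_tail)
                insert_at_end(&head, &child);
            else
                insert_behind(&head, cursor, &child);
        }
        cursor->visited = 1;
    }
    return !child.visited;
}
```

In this model the entry inserted behind the cursor is never reached by the pass, while the one appended at the end is, which is the reasoning behind inserting at the end when KSM_RUN_UNMERGE.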
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756856Ab3A0XMa (ORCPT ); Sun, 27 Jan 2013 18:12:30 -0500 Received: from mail-da0-f44.google.com ([209.85.210.44]:34882 "EHLO mail-da0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755707Ab3A0XM2 (ORCPT ); Sun, 27 Jan 2013 18:12:28 -0500 Date: Sun, 27 Jan 2013 15:12:29 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: <1359265635.6763.0.camel@kernel> Message-ID: References: <1359265635.6763.0.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 26 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote: > > + while (!get_page_unless_zero(page)) { > > + /* > > + * Another check for page->mapping != expected_mapping would > > + * work here too. We have chosen the !PageSwapCache test to > > + * optimize the common case, when the page is or is about to > > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > > + * in the freeze_refs section of __remove_mapping(); but Anon > > + * page->mapping reset to NULL later, in free_pages_prepare(). 
> > + */ > > + if (!PageSwapCache(page)) > > + goto stale; > > + cpu_relax(); > > + } > > + > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > put_page(page); > > goto stale; > > } > > + > > if (locked) { > > lock_page(page); > > - if (page->mapping != expected_mapping) { > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > unlock_page(page); > > put_page(page); > > goto stale; > > } > > } > > Could you explain why need check page->mapping twice after get page? Once for the !locked case, which should not return page if mapping changed. Once for the locked case, which should not return page if mapping changed. We could use "else", but that wouldn't be an improvement. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756582Ab3A0XZz (ORCPT ); Sun, 27 Jan 2013 18:25:55 -0500 Received: from mail-pa0-f50.google.com ([209.85.220.50]:48514 "EHLO mail-pa0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751086Ab3A0XZy (ORCPT ); Sun, 27 Jan 2013 18:25:54 -0500 Date: Sun, 27 Jan 2013 15:25:54 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe In-Reply-To: <1359276555.6763.6.camel@kernel> Message-ID: References: <1359276555.6763.6.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote: > > @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa > > unsigned int checksum; > > int err; > > > > - remove_rmap_item_from_tree(rmap_item); > > + stable_node = page_stable_node(page); > > 
+ if (stable_node) { > > + if (stable_node->head != &migrate_nodes && > > + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { > > + rb_erase(&stable_node->node, > > + &root_stable_tree[NUMA(stable_node->nid)]); > > + stable_node->head = &migrate_nodes; > > + list_add(&stable_node->list, stable_node->head); > > Why list add &stable_node->list to stable_node->head? stable_node->head > is used for queue what? Read that as list_add(&stable_node->list, &migrate_nodes) if you prefer. stable_node->head (overlaying stable_node->node.__rb_parent_color, which would never point to migrate_nodes as an rb_node) &migrate_nodes is used as "magic" to show that that rb_node is currently saved on this list, rather than linked into the stable tree itself. We could do some #define MIGRATE_NODES_MAGIC 0xwhatever and put that in head instead. > > @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r > > */ > > lru_add_drain_all(); > > > > + /* > > + * Whereas stale stable_nodes on the stable_tree itself > > + * get pruned in the regular course of stable_tree_search(), > > Which kinds of stable_nodes can be treated as stale? I just see remove > rmap_item in stable_tree_search() and scan_get_next_rmap_item(). See get_ksm_page(). 
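The "magic" use of &migrate_nodes can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel code: `struct list_node`, `struct tree_node`, `struct stable_node_sketch` and the helper names are hypothetical stand-ins for list_head, rb_node and stable_node, and the tree node is reduced to its three words. The idea shown is that `head` overlays the tree node's parent word, which never legitimately holds the address of the list head, so comparing it against &migrate_nodes tells which union member is live.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal circular doubly linked list, standing in for list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* Simplified tree node: first word is the parent pointer (plus colour bits). */
struct tree_node {
    uintptr_t parent_color;
    struct tree_node *right, *left;
};

/* Hypothetical userspace analogue of the patched struct stable_node. */
struct stable_node_sketch {
    union {
        struct tree_node node;          /* when linked into a tree */
        struct {                        /* when parked for migration */
            struct list_node *head;     /* overlays node.parent_color */
            struct list_node list;      /* overlays node.right, node.left */
        };
    };
    unsigned long kpfn;
};

static struct list_node migrate_nodes = { &migrate_nodes, &migrate_nodes };

static void list_add_front(struct list_node *item, struct list_node *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

static void list_remove(struct list_node *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
}

/* The "magic" test: is this node parked on the list rather than in a tree? */
static bool on_migrate_list(const struct stable_node_sketch *sn)
{
    return sn->head == &migrate_nodes;
}

static void park_for_migration(struct stable_node_sketch *sn)
{
    assert(!on_migrate_list(sn));
    sn->head = &migrate_nodes;
    list_add_front(&sn->list, sn->head);
}

/* Stand-in for linking into a tree: overwrites the words head/list used. */
static void link_under(struct stable_node_sketch *sn, struct tree_node *parent)
{
    if (on_migrate_list(sn))
        list_remove(&sn->list);
    sn->node.parent_color = (uintptr_t)parent;
    sn->node.left = sn->node.right = NULL;
}

/* Tree -> list -> tree; each correctly observed state sets one bit. */
static int migrate_roundtrip(void)
{
    struct tree_node parent = { 0, NULL, NULL };
    struct stable_node_sketch sn = { .kpfn = 42 };
    int states = 0;

    link_under(&sn, &parent);
    if (!on_migrate_list(&sn))
        states |= 1;
    park_for_migration(&sn);
    if (on_migrate_list(&sn) && migrate_nodes.next == &sn.list)
        states |= 2;
    link_under(&sn, &parent);
    if (!on_migrate_list(&sn) && migrate_nodes.next == &migrate_nodes)
        states |= 4;
    return states;
}
```

A separate flag or a MIGRATE_NODES_MAGIC constant would work equally well in this model; reusing the list head's address just avoids growing the structure.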
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756885Ab3A0XfX (ORCPT ); Sun, 27 Jan 2013 18:35:23 -0500 Received: from mail-da0-f54.google.com ([209.85.210.54]:50647 "EHLO mail-da0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753080Ab3A0XfU (ORCPT ); Sun, 27 Jan 2013 18:35:20 -0500 Date: Sun, 27 Jan 2013 15:35:21 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Gerald Schaefer , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning In-Reply-To: <1359267810.6763.1.camel@kernel> Message-ID: References: <1359267810.6763.1.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:10 -0800, Hugh Dickins wrote: > > @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no > > switch (action) { > > case MEM_GOING_OFFLINE: > > /* > > - * Keep it very simple for now: just lock out ksmd and > > - * MADV_UNMERGEABLE while any memory is going offline. > > - * mutex_lock_nested() is necessary because lockdep was alarmed > > - * that here we take ksm_thread_mutex inside notifier chain > > - * mutex, and later take notifier chain mutex inside > > - * ksm_thread_mutex to unlock it. But that's safe because both > > - * are inside mem_hotplug_mutex. > > + * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() > > + * and remove_all_stable_nodes() while memory is going offline: > > + * it is unsafe for them to touch the stable tree at this time. 
> > + * But unmerge_ksm_pages(), rmap lookups and other entry points > > Why unmerge_ksm_pages beneath us is safe for ksm memory hotremove? > > > + * which do not need the ksm_thread_mutex are all safe. It's just like userspace doing a write-fault on every KSM page in the vma. If that were unsafe for memory hotremove, then it would not be KSM's problem, memory hotremove would already be unsafe. (But memory hotremove is safe because it migrates away from all the pages to be removed before it can reach MEM_OFFLINE.) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757069Ab3A1AgO (ORCPT ); Sun, 27 Jan 2013 19:36:14 -0500 Received: from mail-ia0-f176.google.com ([209.85.210.176]:46299 "EHLO mail-ia0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754963Ab3A1AgM (ORCPT ); Sun, 27 Jan 2013 19:36:12 -0500 Message-ID: <1359333371.6763.12.camel@kernel> Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Sun, 27 Jan 2013 18:36:11 -0600 In-Reply-To: References: <1359254187.4159.10.camel@kernel> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 2013-01-27 at 14:08 -0800, Hugh Dickins wrote: > On Sat, 26 Jan 2013, Simon Jeons wrote: > > On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote: > > > In some places where get_ksm_page() is used, we need the page to be locked. > > > > > > > In function get_ksm_page, why check page->mapping => > > get_page_unless_zero => check page->mapping instead of > > get_page_unless_zero => check page->mapping, because > > get_page_unless_zero is expensive? > > Yes, it's more expensive. 
>
> >
> > > When KSM migration is fully enabled, we shall want that to make sure that
> > > the page just acquired cannot be migrated beneath us (raised page count is
> > > only effective when there is serialization to make sure migration notices).
> > > Whereas when navigating through the stable tree, we certainly do not want
> >
> > What's the meaning of "navigating through the stable tree"?
>
> Finding the right place in the stable tree,
> as stable_tree_search() and stable_tree_insert() do.
>
> >
> > > to lock each node (raised page count is enough to guarantee the memcmps,
> > > even if page is migrated to another node).
> > >
> > > Since we're about to add another use case, add the locked argument to
> > > get_ksm_page() now.
> >
> > Why the parameter lock passed from stable_tree_search/insert is true,
> > but remove_rmap_item_from_tree is false?
>
> The other way round? remove_rmap_item_from_tree needs the page locked,
> because it's about to modify the list: that's secured (e.g. against
> concurrent KSM page reclaim) by the page lock.

How can KSM page reclaim path call remove_rmap_item_from_tree? I have
already track every callsites but can't find it. BTW, I'm curious about
KSM page reclaim, it seems that there're no special handle in vmscan.c
for KSM page reclaim, is it will be reclaimed similiar with normal
page?

>
> stable_tree_search and stable_tree_insert do not need intermediate nodes
> to be locked: get_page is enough to secure the page contents for memcmp,
> and we don't want a pointless wait for exclusive page lock on every
> intermediate node.
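
An illustrative aside on the get_ksm_page() discussion above: the order of
operations (cheap check, take a reference unless the count is zero,
re-check, optionally lock and re-check once more) can be sketched in
user-space C. This is only a sketch under stated assumptions, not the
kernel code: struct obj, get_unless_zero(), put_obj(), unlock_obj(),
get_obj() and the atomic_flag spin lock are hypothetical stand-ins for
struct page, get_page_unless_zero(), put_page(), unlock_page(),
get_ksm_page() and the page lock, and the stale-node removal done by the
real function is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct page; not the kernel type. */
struct obj {
	atomic_int refcount;	/* 0 means the object is about to be freed */
	_Atomic(void *) owner;	/* plays the role of page->mapping */
	atomic_flag lock;	/* plays the role of the page lock */
};

/* Take a reference only if the count has not already dropped to zero. */
bool get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

void put_obj(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);
}

void unlock_obj(struct obj *o)
{
	atomic_flag_clear(&o->lock);
}

/*
 * Return o with a reference held (and the lock held, if requested),
 * but only if it still belongs to expected_owner; otherwise NULL.
 */
struct obj *get_obj(struct obj *o, void *expected_owner, bool locked)
{
	/* Cheap check first: skip the atomic update when already stale. */
	if (atomic_load(&o->owner) != expected_owner)
		return NULL;
	if (!get_unless_zero(o))
		return NULL;
	/* The owner may have changed before the reference was taken. */
	if (atomic_load(&o->owner) != expected_owner) {
		put_obj(o);
		return NULL;
	}
	if (locked) {
		while (atomic_flag_test_and_set(&o->lock))
			;	/* spin; the kernel would sleep in lock_page() */
		/* It may have changed again while waiting for the lock. */
		if (atomic_load(&o->owner) != expected_owner) {
			unlock_obj(o);
			put_obj(o);
			return NULL;
		}
	}
	return o;
}
```

In terms of the thread, a call with locked false corresponds to what
stable_tree_search() and stable_tree_insert() want for intermediate
nodes, and a call with locked true to what remove_rmap_item_from_tree()
needs before it modifies the list.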

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1757099Ab3A1Al0 (ORCPT ); Sun, 27 Jan 2013 19:41:26 -0500
Received: from mail-da0-f42.google.com ([209.85.210.42]:38571
    "EHLO mail-da0-f42.google.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1754963Ab3A1AlY (ORCPT );
    Sun, 27 Jan 2013 19:41:24 -0500
Message-ID: <1359333683.6763.13.camel@kernel>
Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible
From: Simon Jeons
To: Hugh Dickins
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus ,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Sun, 27 Jan 2013 18:41:23 -0600
In-Reply-To:
References: <1359265635.6763.0.camel@kernel>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17)
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 2013-01-27 at 15:12 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote:
> > > + while (!get_page_unless_zero(page)) {
> > > + /*
> > > + * Another check for page->mapping != expected_mapping would
> > > + * work here too. We have chosen the !PageSwapCache test to
> > > + * optimize the common case, when the page is or is about to
> > > + * be freed: PageSwapCache is cleared (under spin_lock_irq)
> > > + * in the freeze_refs section of __remove_mapping(); but Anon
> > > + * page->mapping reset to NULL later, in free_pages_prepare().
> > > + */
> > > + if (!PageSwapCache(page))
> > > + goto stale;
> > > + cpu_relax();
> > > + }
> > > +
> > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > > put_page(page);
> > > goto stale;
> > > }
> > > +
> > > if (locked) {
> > > lock_page(page);
> > > - if (page->mapping != expected_mapping) {
> > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > > unlock_page(page);
> > > put_page(page);
> > > goto stale;
> > > }
> > > }
> >
> > Could you explain why need check page->mapping twice after get page?
>
> Once for the !locked case, which should not return page if mapping changed.
> Once for the locked case, which should not return page if mapping changed.
> We could use "else", but that wouldn't be an improvement.

But for locked case, page->mapping will be check twice.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1757215Ab3A1BmH (ORCPT ); Sun, 27 Jan 2013 20:42:07 -0500
Received: from mail-ia0-f171.google.com ([209.85.210.171]:45521
    "EHLO mail-ia0-f171.google.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1756371Ab3A1BmE (ORCPT );
    Sun, 27 Jan 2013 20:42:04 -0500
Message-ID: <1359337321.6763.18.camel@kernel>
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
From: Simon Jeons
To: Hugh Dickins
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus ,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Sun, 27 Jan 2013 19:42:01 -0600
In-Reply-To:
References: <1359262556.4159.23.camel@kernel>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17)
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 2013-01-27 at 15:05 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> > > Switching merge_across_nodes after running KSM is liable to oops on stale
> > > nodes still left over from the previous stable tree. It's not something
> > > that people will often want to do, but it would be lame to demand a reboot
> > > when they're trying to determine which merge_across_nodes setting is best.
> > >
> > > How can this happen? We only permit switching merge_across_nodes when
> > > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > > ought to unmerge everything: yet oopses still occur when you then run 1.
> > >
> > > Three causes:
> > >
> > > 1. The old stable tree (built according to the inverse merge_across_nodes)
                                                          ^^^^^^^^^^^^^^^^^^^^^
How to understand inverse merge_across_nodes here?

> > > has not been fully torn down. A stable node lingers until get_ksm_page()
> > > notices that the page it references no longer references it: but the page

Do you mean page->mapping is NULL when call get_ksm_page()? Who clear it
NULL?

> > > is not necessarily freed as soon as expected, particularly when swapcache.

Why is not necessarily freed as soon as expected?

> > >
> >
> > When can this happen?
>
> Whenever there's an additional reference to the page, beyond those for
> its ptes in userspace - swapcache for example, or pinned by get_user_pages.
> That delays its being freed (arriving at the "page->mapping = NULL;"
> in free_pages_prepare()). Or it might simply be sitting in a pagevec,
> waiting for that to be filled up, to be freed as part of a batch.
>
> >
> > > Fix this with a pass through the old stable tree, applying get_ksm_page()
> > > to each of the remaining nodes (most found stale and removed immediately),
> > > with forced removal of any left over. Unless the page is still mapped:
> > > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> > > and EBUSY than BUG.
> > >
> > > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > > (or be removed) before ksmd addresses it. Nice when ksmd is running,
> > > but not so nice when we're trying to unmerge all mms: we were missing
> > > those mms forked and inserted behind the unmerge cursor. Easily fixed
> > > by inserting at the end when KSM_RUN_UNMERGE.
> >
> > mms forked will be unmerged just after ksmd's cursor since they're
> > inserted behind it, why will be missing?
>
> unmerge_and_remove_all_rmap_items() makes one pass through the list
> from start to finish: insert behind the cursor and it will be missed.

Since mms forked will be insert just after ksmd's cursor, so it is the
next which will be scan and unmerge, where I miss?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1751519Ab3A1CM3 (ORCPT ); Sun, 27 Jan 2013 21:12:29 -0500
Received: from mail-pa0-f43.google.com ([209.85.220.43]:37817
    "EHLO mail-pa0-f43.google.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1751342Ab3A1CM2 (ORCPT );
    Sun, 27 Jan 2013 21:12:28 -0500
Message-ID: <1359339147.6763.25.camel@kernel>
Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
From: Simon Jeons
To: Hugh Dickins
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus ,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Sun, 27 Jan 2013 20:12:27 -0600
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17)
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> Switching merge_across_nodes after running KSM is liable to oops on stale
> nodes still left over from the previous stable tree. It's not something

Since this patch solve the problem, so the description of
merge_across_nodes(Value can be changed only when there is no ksm shared
pages in system) should be changed in this patch.

> that people will often want to do, but it would be lame to demand a reboot
> when they're trying to determine which merge_across_nodes setting is best.
>
> How can this happen? We only permit switching merge_across_nodes when
> pages_shared is 0, and usually set run 2 to force that beforehand, which
> ought to unmerge everything: yet oopses still occur when you then run 1.
>
> Three causes:
>
> 1. The old stable tree (built according to the inverse merge_across_nodes)
> has not been fully torn down. A stable node lingers until get_ksm_page()
> notices that the page it references no longer references it: but the page
> is not necessarily freed as soon as expected, particularly when swapcache.
>
> Fix this with a pass through the old stable tree, applying get_ksm_page()
> to each of the remaining nodes (most found stale and removed immediately),
> with forced removal of any left over. Unless the page is still mapped:
> I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> and EBUSY than BUG.
>
> 2. __ksm_enter() has a nice little optimization, to insert the new mm
> just behind ksmd's cursor, so there's a full pass for it to stabilize
> (or be removed) before ksmd addresses it. Nice when ksmd is running,
> but not so nice when we're trying to unmerge all mms: we were missing
> those mms forked and inserted behind the unmerge cursor. Easily fixed
> by inserting at the end when KSM_RUN_UNMERGE.
>
> 3. It is possible for a KSM page to be faulted back from swapcache into
> an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.
>
> A long outstanding, unrelated bugfix sneaks in with that third fix:
> ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
> I/O error when read in from swap) to a page which it then marks Uptodate.
> Fix this case by not copying, letting do_swap_page() discover the error.
>
> Signed-off-by: Hugh Dickins
> ---
> include/linux/ksm.h | 18 ++-------
> mm/ksm.c            | 83 +++++++++++++++++++++++++++++++++++++++---
> mm/memory.c         | 19 ++++-----
> 3 files changed, 92 insertions(+), 28 deletions(-)
>
> --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800
> @@ -16,9 +16,6 @@
> struct stable_node;
> struct mem_cgroup;
>
> -struct page *ksm_does_need_to_copy(struct page *page,
> - struct vm_area_struct *vma, unsigned long address);
> -
> #ifdef CONFIG_KSM
> int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, int advice, unsigned long *vm_flags);
> @@ -73,15 +70,8 @@ static inline void set_page_stable_node(
> * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
> * but what if the vma was unmerged while the page was swapped out?
> */
> -static inline int ksm_might_need_to_copy(struct page *page,
> - struct vm_area_struct *vma, unsigned long address)
> -{
> - struct anon_vma *anon_vma = page_anon_vma(page);
> -
> - return anon_vma &&
> - (anon_vma->root != vma->anon_vma->root ||
> - page->index != linear_page_index(vma, address));
> -}
> +struct page *ksm_might_need_to_copy(struct page *page,
> + struct vm_area_struct *vma, unsigned long address);
>
> int page_referenced_ksm(struct page *page,
> struct mem_cgroup *memcg, unsigned long *vm_flags);
> @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
> return 0;
> }
>
> -static inline int ksm_might_need_to_copy(struct page *page,
> +static inline struct page *ksm_might_need_to_copy(struct page *page,
> struct vm_area_struct *vma, unsigned long address)
> {
> - return 0;
> + return page;
> }
>
> static inline int page_referenced_ksm(struct page *page,
> --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800
> +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800
> @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
> /*
> * Only called through the sysfs control interface:
> */
> +static int remove_stable_node(struct stable_node *stable_node)
> +{
> + struct page *page;
> + int err;
> +
> + page = get_ksm_page(stable_node, true);
> + if (!page) {
> + /*
> + * get_ksm_page did remove_node_from_stable_tree itself.
> + */
> + return 0;
> + }
> +
> + if (WARN_ON_ONCE(page_mapped(page)))
> + err = -EBUSY;
> + else {
> + /*
> + * This page might be in a pagevec waiting to be freed,
> + * or it might be PageSwapCache (perhaps under writeback),
> + * or it might have been removed from swapcache a moment ago.
> + */
> + set_page_stable_node(page, NULL);
> + remove_node_from_stable_tree(stable_node);
> + err = 0;
> + }
> +
> + unlock_page(page);
> + put_page(page);
> + return err;
> +}
> +
> +static int remove_all_stable_nodes(void)
> +{
> + struct stable_node *stable_node;
> + int nid;
> + int err = 0;
> +
> + for (nid = 0; nid < nr_node_ids; nid++) {
> + while (root_stable_tree[nid].rb_node) {
> + stable_node = rb_entry(root_stable_tree[nid].rb_node,
> + struct stable_node, node);
> + if (remove_stable_node(stable_node)) {
> + err = -EBUSY;
> + break; /* proceed to next nid */
> + }
> + cond_resched();
> + }
> + }
> + return err;
> +}
> +
> static int unmerge_and_remove_all_rmap_items(void)
> {
> struct mm_slot *mm_slot;
> @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i
> }
> }
>
> + /* Clean up stable nodes, but don't worry if some are still busy */
> + remove_all_stable_nodes();
> ksm_scan.seqnr = 0;
> return 0;
>
> @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm)
> spin_lock(&ksm_mmlist_lock);
> insert_to_mm_slots_hash(mm, mm_slot);
> /*
> - * Insert just behind the scanning cursor, to let the area settle
> + * When KSM_RUN_MERGE (or KSM_RUN_STOP),
> + * insert just behind the scanning cursor, to let the area settle
> * down a little; when fork is followed by immediate exec, we don't
> * want ksmd to waste time setting up and tearing down an rmap_list.
> + *
> + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
> + * scanning cursor, otherwise KSM pages in newly forked mms will be
> + * missed: then we might as well insert at the end of the list.
> */
> - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
> + if (ksm_run & KSM_RUN_UNMERGE)
> + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
> + else
> + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
> spin_unlock(&ksm_mmlist_lock);
>
> set_bit(MMF_VM_MERGEABLE, &mm->flags);
> @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm)
> }
> }
>
> -struct page *ksm_does_need_to_copy(struct page *page,
> +struct page *ksm_might_need_to_copy(struct page *page,
> struct vm_area_struct *vma, unsigned long address)
> {
> + struct anon_vma *anon_vma = page_anon_vma(page);
> struct page *new_page;
>
> + if (PageKsm(page)) {
> + if (page_stable_node(page) &&
> + !(ksm_run & KSM_RUN_UNMERGE))
> + return page; /* no need to copy it */
> + } else if (!anon_vma) {
> + return page; /* no need to copy it */
> + } else if (anon_vma->root == vma->anon_vma->root &&
> + page->index == linear_page_index(vma, address)) {
> + return page; /* still no need to copy it */
> + }
> + if (!PageUptodate(page))
> + return page; /* let do_swap_page report the error */
> +
> new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> if (new_page) {
> copy_user_highpage(new_page, page, address, vma);
> @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store(
>
> mutex_lock(&ksm_thread_mutex);
> if (ksm_merge_across_nodes != knob) {
> - if (ksm_pages_shared)
> + if (ksm_pages_shared || remove_all_stable_nodes())
> err = -EBUSY;
> else
> ksm_merge_across_nodes = knob;
> --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800
> @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct
> if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
> goto out_page;
>
> - if (ksm_might_need_to_copy(page, vma, address)) {
> - swapcache = page;
> - page = ksm_does_need_to_copy(page, vma, address);
> -
> - if (unlikely(!page)) {
> - ret = VM_FAULT_OOM;
> - page = swapcache;
> - swapcache = NULL;
> - goto out_page;
> - }
> + swapcache = page;
> + page = ksm_might_need_to_copy(page, vma, address);
> + if (unlikely(!page)) {
> + ret = VM_FAULT_OOM;
> + page = swapcache;
> + swapcache = NULL;
> + goto out_page;
> }
> + if (page == swapcache)
> + swapcache = NULL;
>
> if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
> ret = VM_FAULT_OOM;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1753854Ab3A1Dfl (ORCPT ); Sun, 27 Jan 2013 22:35:41 -0500
Received: from mail-pa0-f48.google.com ([209.85.220.48]:57206
    "EHLO mail-pa0-f48.google.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1753431Ab3A1Dff (ORCPT );
    Sun, 27 Jan 2013 22:35:35 -0500
Date: Sun, 27 Jan 2013 19:35:31 -0800 (PST)
From: Hugh Dickins
X-X-Sender: hugh@eggly.anvils
To: Simon Jeons
cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus ,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked
In-Reply-To: <1359333371.6763.12.camel@kernel>
Message-ID:
References: <1359254187.4159.10.camel@kernel> <1359333371.6763.12.camel@kernel>
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Sun, 2013-01-27 at 14:08 -0800, Hugh Dickins wrote:
> > On Sat, 26 Jan 2013, Simon Jeons wrote:
> > >
> > > Why the parameter lock passed from stable_tree_search/insert is true,
> > > but remove_rmap_item_from_tree is false?
> >
> > The other way round? remove_rmap_item_from_tree needs the page locked,
> > because it's about to modify the list: that's secured (e.g. against
> > concurrent KSM page reclaim) by the page lock.
>
> How can KSM page reclaim path call remove_rmap_item_from_tree? I have
> already track every callsites but can't find it.

It doesn't. Please read what I said above again.

> BTW, I'm curious about
> KSM page reclaim, it seems that there're no special handle in vmscan.c
> for KSM page reclaim, is it will be reclaimed similiar with normal
> page?

Look for PageKsm in mm/rmap.c.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1754287Ab3A1Do1 (ORCPT ); Sun, 27 Jan 2013 22:44:27 -0500
Received: from mail-da0-f51.google.com ([209.85.210.51]:54984
    "EHLO mail-da0-f51.google.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1753547Ab3A1DoZ (ORCPT );
    Sun, 27 Jan 2013 22:44:25 -0500
Message-ID: <1359344663.6763.32.camel@kernel>
Subject: Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe
From: Simon Jeons
To: Hugh Dickins
Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus ,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Sun, 27 Jan 2013 21:44:23 -0600
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17)
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote:
> The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
> set to non-default 0: if a KSM page is migrated to a different NUMA node,
> how do we migrate its stable node to the right tree? And what if that
> collides with an existing stable node?
>
> ksm_migrate_page() can do no more than it's already doing, updating
> stable_node->kpfn: the stable tree itself cannot be manipulated without
> holding ksm_thread_mutex. So accept that a stable tree may temporarily
> indicate a page belonging to the wrong NUMA node, leave updating until
> the next pass of ksmd, just be careful not to merge other pages on to a

How you not to merge other pages on to a misplaced page? I don't see it.

> misplaced page. Note nid of holding tree in stable_node, and recognize
> that it will not always match nid of kpfn.
>
> A misplaced KSM page is discovered, either when ksm_do_scan() next comes
> around to one of its rmap_items (we now have to go to cmp_and_merge_page
> even on pages in a stable tree), or when stable_tree_search() arrives at
> a matching node for another page, and this node page is found misplaced.
>
> In each case, move the misplaced stable_node to a list of migrate_nodes
> (and use the address of migrate_nodes as magic by which to identify them):
> we don't need them in a tree. If stable_tree_search() finds no match for
> a page, but it's currently exiled to this list, then slot its stable_node
> right there into the tree, bringing all of its mappings with it; otherwise
> they get migrated one by one to the original page of the colliding node.
> stable_tree_search() is now modelled more like stable_tree_insert(),
> in order to handle these insertions of migrated nodes.

When node will be removed from migrate_nodes list and insert to stable
tree?

>
> remove_node_from_stable_tree(), remove_all_stable_nodes() and
> ksm_check_stable_tree() have to handle the migrate_nodes list as well as
> the stable tree itself. Less obviously, we do need to prune the list of
> stale entries from time to time (scan_get_next_rmap_item() does it once
> each full scan):
> whereas stale nodes in the stable tree get naturally
> pruned as searches try to brush past them, these migrate_nodes may get
> forgotten and accumulate.

Hard to understand this description. Could you explain it? :)

> Signed-off-by: Hugh Dickins

What will happen if page node of an unstable tree migrate to a new numa
node? Also need to handle colliding?

> ---
> mm/ksm.c | 164 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 134 insertions(+), 30 deletions(-)
>
> --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800
> +++ mmotm/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800
> @@ -122,13 +122,25 @@ struct ksm_scan {
> /**
> * struct stable_node - node of the stable rbtree
> * @node: rb node of this ksm page in the stable tree
> + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
> + * @list: linked into migrate_nodes, pending placement in the proper node tree
> * @hlist: hlist head of rmap_items using this ksm page
> - * @kpfn: page frame number of this ksm page
> + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
> + * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
> */
> struct stable_node {
> - struct rb_node node;
> + union {
> + struct rb_node node; /* when node of stable tree */
> + struct { /* when listed for migration */
> + struct list_head *head;
> + struct list_head list;
> + };
> + };
> struct hlist_head hlist;
> unsigned long kpfn;
> +#ifdef CONFIG_NUMA
> + int nid;
> +#endif
> };
>
> /**
> @@ -169,6 +181,9 @@ struct rmap_item {
> static struct rb_root root_unstable_tree[MAX_NUMNODES];
> static struct rb_root root_stable_tree[MAX_NUMNODES];
>
> +/* Recently migrated nodes of stable tree, pending proper placement */
> +static LIST_HEAD(migrate_nodes);
> +
> #define MM_SLOTS_HASH_BITS 10
> static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
> @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru
> hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm);
> }
>
> -static inline int in_stable_tree(struct rmap_item *rmap_item)
> -{
> - return rmap_item->address & STABLE_FLAG;
> -}
> -
> /*
> * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's
> * page tables after it has passed through ksm_exit() - which, if necessary,
> @@ -476,7 +486,6 @@ static void remove_node_from_stable_tree
> {
> struct rmap_item *rmap_item;
> struct hlist_node *hlist;
> - int nid;
>
> hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
> if (rmap_item->hlist.next)
> @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree
> cond_resched();
> }
>
> - nid = get_kpfn_nid(stable_node->kpfn);
> - rb_erase(&stable_node->node, &root_stable_tree[nid]);
> + if (stable_node->head == &migrate_nodes)
> + list_del(&stable_node->list);
> + else
> + rb_erase(&stable_node->node,
> + &root_stable_tree[NUMA(stable_node->nid)]);
> free_stable_node(stable_node);
> }
>
> @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta
> static int remove_all_stable_nodes(void)
> {
> struct stable_node *stable_node;
> + struct list_head *this, *next;
> int nid;
> int err = 0;
>
> @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void)
> cond_resched();
> }
> }
> + list_for_each_safe(this, next, &migrate_nodes) {
> + stable_node = list_entry(this, struct stable_node, list);
> + if (remove_stable_node(stable_node))
> + err = -EBUSY;
> + cond_resched();
> + }
> return err;
> }
>
> @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag
> */
> static struct page *stable_tree_search(struct page *page)
> {
> - struct rb_node *node;
> - struct stable_node *stable_node;
> int nid;
> + struct rb_node **new;
> + struct rb_node *parent;
> + struct stable_node *stable_node;
> + struct stable_node *page_node;
>
> - stable_node = page_stable_node(page);
> - if (stable_node) { /* ksm page forked */
> + page_node = page_stable_node(page);
> + if (page_node && page_node->head != &migrate_nodes) {
> + /* ksm page forked */
> get_page(page);
> return page;
> }
>
> nid = get_kpfn_nid(page_to_pfn(page));
> - node = root_stable_tree[nid].rb_node;
> +again:
> + new = &root_stable_tree[nid].rb_node;
> + parent = NULL;
>
> - while (node) {
> + while (*new) {
> struct page *tree_page;
> int ret;
>
> cond_resched();
> - stable_node = rb_entry(node, struct stable_node, node);
> + stable_node = rb_entry(*new, struct stable_node, node);
> tree_page = get_ksm_page(stable_node, false);
> if (!tree_page)
> return NULL;
> @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s
> ret = memcmp_pages(page, tree_page);
> put_page(tree_page);
>
> + parent = *new;
> if (ret < 0)
> - node = node->rb_left;
> + new = &parent->rb_left;
> else if (ret > 0)
> - node = node->rb_right;
> + new = &parent->rb_right;
> else {
> /*
> * Lock and unlock the stable_node's page (which
> @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s
> * than kpage, but that involves more changes.
> */
> tree_page = get_ksm_page(stable_node, true);
> - if (tree_page)
> + if (tree_page) {
> unlock_page(tree_page);
> - return tree_page;
> + if (get_kpfn_nid(stable_node->kpfn) !=
> + NUMA(stable_node->nid)) {
> + put_page(tree_page);
> + goto replace;
> + }
> + return tree_page;
> + }
> + /*
> + * There is now a place for page_node, but the tree may
> + * have been rebalanced, so re-evaluate parent and new.
> + */
> + if (page_node)
> + goto again;
> + return NULL;
> }
> }
>
> - return NULL;
> + if (!page_node)
> + return NULL;
> +
> + list_del(&page_node->list);
> + DO_NUMA(page_node->nid = nid);
> + rb_link_node(&page_node->node, parent, new);
> + rb_insert_color(&page_node->node, &root_stable_tree[nid]);
> + get_page(page);
> + return page;
> +
> +replace:
> + if (page_node) {
> + list_del(&page_node->list);
> + DO_NUMA(page_node->nid = nid);
> + rb_replace_node(&stable_node->node,
> + &page_node->node, &root_stable_tree[nid]);
> + get_page(page);
> + } else {
> + rb_erase(&stable_node->node, &root_stable_tree[nid]);
> + page = NULL;
> + }
> + stable_node->head = &migrate_nodes;

Why still set this magic since node has already insert to the tree?
> + list_add(&stable_node->list, stable_node->head);
> + return page;
> }
>
> /*
> @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i
> INIT_HLIST_HEAD(&stable_node->hlist);
> stable_node->kpfn = kpfn;
> set_page_stable_node(kpage, stable_node);
> + DO_NUMA(stable_node->nid = nid);
> rb_link_node(&stable_node->node, parent, new);
> rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>
> @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i
> static void stable_tree_append(struct rmap_item *rmap_item,
> struct stable_node *stable_node)
> {
> - /*
> - * Usually rmap_item->nid is already set correctly,
> - * but it may be wrong after switching merge_across_nodes.
> - */
> - DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn));
> rmap_item->head = stable_node;
> rmap_item->address |= STABLE_FLAG;
> hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa
> unsigned int checksum;
> int err;
>
> - remove_rmap_item_from_tree(rmap_item);
> + stable_node = page_stable_node(page);
> + if (stable_node) {
> + if (stable_node->head != &migrate_nodes &&
> + get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
> + rb_erase(&stable_node->node,
> + &root_stable_tree[NUMA(stable_node->nid)]);
> + stable_node->head = &migrate_nodes;
> + list_add(&stable_node->list, stable_node->head);
> + }
> + if (stable_node->head != &migrate_nodes &&
> + rmap_item->head == stable_node)
> + return;
> + }
>
> /* We first start with searching the page inside the stable tree */
> kpage = stable_tree_search(page);
> + if (kpage == page && rmap_item->head == stable_node) {
> + put_page(kpage);
> + return;
> + }
> +
> + remove_rmap_item_from_tree(rmap_item);
> +
> if (kpage) {
> err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
> if (!err) {
> @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r
> */
> lru_add_drain_all();
>
> + /*
> + * Whereas stale stable_nodes on the stable_tree itself
> + * get pruned in the regular course of stable_tree_search(),
> + * those moved out to the migrate_nodes list can accumulate:
> + * so prune them once before each full scan.
> + */
> + if (!ksm_merge_across_nodes) {
> + struct stable_node *stable_node;
> + struct list_head *this, *next;
> + struct page *page;
> +
> + list_for_each_safe(this, next, &migrate_nodes) {
> + stable_node = list_entry(this,
> + struct stable_node, list);
> + page = get_ksm_page(stable_node, false);
> + if (page)
> + put_page(page);
> + cond_resched();
> + }
> + }
> +
> for (nid = 0; nid < nr_node_ids; nid++)
> root_unstable_tree[nid] = RB_ROOT;
>
> @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca
> rmap_item = scan_get_next_rmap_item(&page);
> if (!rmap_item)
> return;
> - if (!PageKsm(page) || !in_stable_tree(rmap_item))
> - cmp_and_merge_page(page, rmap_item);
> + cmp_and_merge_page(page, rmap_item);
> put_page(page);
> }
> }
> @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign
> unsigned long end_pfn)
> {
> struct stable_node *stable_node;
> + struct list_head *this, *next;
> struct rb_node *node;
> int nid;
>
> @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign
> cond_resched();
> }
> }
> + list_for_each_safe(this, next, &migrate_nodes) {
> + stable_node = list_entry(this, struct stable_node, list);
> + if (stable_node->kpfn >= start_pfn &&
> + stable_node->kpfn < end_pfn)
> + remove_node_from_stable_tree(stable_node);
> + cond_resched();
> + }
> }
>
> static int ksm_memory_callback(struct notifier_block *self,
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754334Ab3A1Do5 (ORCPT ); Sun, 27 Jan 2013 22:44:57 -0500 Received: from mail-pb0-f52.google.com ([209.85.160.52]:54689 "EHLO mail-pb0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753368Ab3A1Doz (ORCPT ); Sun, 27 Jan 2013 22:44:55 -0500 Date: Sun, 27 Jan 2013 19:44:56 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: <1359333683.6763.13.camel@kernel> Message-ID: References: <1359265635.6763.0.camel@kernel> <1359333683.6763.13.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Sun, 2013-01-27 at 15:12 -0800, Hugh Dickins wrote: > > On Sat, 26 Jan 2013, Simon Jeons wrote: > > > > > > Could you explain why need check page->mapping twice after get page? > > > > Once for the !locked case, which should not return page if mapping changed. > > Once for the locked case, which should not return page if mapping changed. > > We could use "else", but that wouldn't be an improvement. > > But for locked case, page->mapping will be check twice. Thrice. I'm beginning to wonder: you do realize that page->mapping is volatile, from the point of view of get_ksm_page()? That is the whole point of why get_ksm_page() exists. I can see that the word "volatile" is not obviously used here - it's tucked away inside the ACCESS_ONCE() - but I thought the descriptions of races and barriers made that obvious. 
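To illustrate the point about page->mapping being volatile here, a minimal userspace sketch of the check / take reference / re-check pattern follows. It is not the code of get_ksm_page(): struct fake_page, get_page_if_still_ours() and the plain refcount increment are hypothetical stand-ins, and the ACCESS_ONCE() below is only a userspace approximation of the kernel macro (it relies on the GCC/Clang __typeof__ extension).

```c
#include <stddef.h>

/* Userspace approximation of ACCESS_ONCE(): force a fresh load each time. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for struct page, reduced to what the sketch needs. */
struct fake_page {
	void *mapping;	/* may be changed by another context at any time */
	int refcount;
};

/*
 * Check the mapping, take a reference, then check the mapping again.
 * Returns the page with a reference held, or NULL if it no longer matches.
 */
static struct fake_page *get_page_if_still_ours(struct fake_page *page,
						void *expected_mapping)
{
	if (ACCESS_ONCE(page->mapping) != expected_mapping)
		return NULL;		/* already stale before we looked */

	page->refcount++;		/* stand-in for taking a page reference */

	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
		page->refcount--;	/* changed under us: drop the reference */
		return NULL;
	}
	return page;
}
```

The second check is what matters: between the first read and taking the reference the page could have been freed and reused, so the first result alone proves nothing once the reference is held.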
If the comments here haven't helped enough, please take a look at git commit 4035c07a8959 "ksm: take keyhole reference to page". From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753587Ab3A1EOX (ORCPT ); Sun, 27 Jan 2013 23:14:23 -0500 Received: from mail-da0-f44.google.com ([209.85.210.44]:62782 "EHLO mail-da0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751505Ab3A1EOV (ORCPT ); Sun, 27 Jan 2013 23:14:21 -0500 Date: Sun, 27 Jan 2013 20:14:22 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <1359337321.6763.18.camel@kernel> Message-ID: References: <1359262556.4159.23.camel@kernel> <1359337321.6763.18.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Sun, 2013-01-27 at 15:05 -0800, Hugh Dickins wrote: > > On Sat, 26 Jan 2013, Simon Jeons wrote: > > > > How can this happen? We only permit switching merge_across_nodes when > > > > pages_shared is 0, and usually set run 2 to force that beforehand, which > > > > ought to unmerge everything: yet oopses still occur when you then run 1. > > > > > > > > Three causes: > > > > > > > > 1. The old stable tree (built according to the inverse merge_across_nodes) > ^^^^^^^^^^^^^^^^^^^^^ > How to understand inverse merge_across_nodes here? How not to understand it? Either it was 0 before (in which case there were as many stable trees as NUMA nodes) and is being changed to 1 (in which case there is to be only one stable tree), or it was 1 before (for one) and is being changed to 0 (for many). 
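Put as code, the two settings select trees roughly like this. A sketch only, under the assumption that a page's tree is chosen purely from the setting and the page's NUMA node; struct ksm_model and both helpers are hypothetical names, not taken from mm/ksm.c.

```c
/* Hypothetical model of the tunable; not the kernel's data structure. */
struct ksm_model {
	int merge_across_nodes;	/* 1: one shared tree, 0: one tree per node */
};

/* Index of the stable tree used for a page residing on node page_nid. */
static int stable_tree_index(const struct ksm_model *ksm, int page_nid)
{
	return ksm->merge_across_nodes ? 0 : page_nid;
}

/* Number of stable trees in use on a machine with nr_nodes NUMA nodes. */
static int stable_tree_count(const struct ksm_model *ksm, int nr_nodes)
{
	return ksm->merge_across_nodes ? 1 : nr_nodes;
}
```

Switching the setting therefore changes which tree a given page belongs in, which is why nodes left over from the old arrangement are a problem.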
> > > > > has not been fully torn down. A stable node lingers until get_ksm_page() > > > > notices that the page it references no longer references it: but the page > > Do you mean page->mapping is NULL when call get_ksm_page()? Who clear it > NULL? I think I already pointed you to free_pages_prepare(). > > > > > is not necessarily freed as soon as expected, particularly when swapcache. > > Why is not necessarily freed as soon as expected? As I answered below. > > > > > > > > > > When can this happen? > > > > Whenever there's an additional reference to the page, beyond those for > > its ptes in userspace - swapcache for example, or pinned by get_user_pages. > > That delays its being freed (arriving at the "page->mapping = NULL;" > > in free_pages_prepare()). Or it might simply be sitting in a pagevec, > > waiting for that to be filled up, to be freed as part of a batch. > > > mms forked will be unmerged just after ksmd's cursor since they're > > > inserted behind it, why will be missing? > > > > unmerge_and_remove_all_rmap_items() makes one pass through the list > > from start to finish: insert behind the cursor and it will be missed. > > Since mms forked will be insert just after ksmd's cursor, so it is the > next which will be scan and unmerge, where I miss? mms forked are normally inserted just behind (== before) ksmd's cursor, as I've said in comments and explanations several times. Simon, I've had enough: you clearly have much more time to spare for asking questions than I have for answering them repeatedly: I would rather spend my time attending to 100 higher priorities. Please try much harder to work these things out for yourself from the source (perhaps with help from kernelnewbies.org), before interrogating linux-kernel and linux-mm developers. Sometimes your questions may help everybody to understand better, but often they just waste our time. 
I'll happily admit that mm, and mm/ksm.c in particular, is not the easiest place to start in understanding the kernel, nor I the best expositor. Best wishes, Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753664Ab3A1ETa (ORCPT ); Sun, 27 Jan 2013 23:19:30 -0500 Received: from mail-da0-f42.google.com ([209.85.210.42]:62113 "EHLO mail-da0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753029Ab3A1ET1 (ORCPT ); Sun, 27 Jan 2013 23:19:27 -0500 Date: Sun, 27 Jan 2013 20:19:28 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Simon Jeons cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <1359339147.6763.25.camel@kernel> Message-ID: References: <1359339147.6763.25.camel@kernel> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 27 Jan 2013, Simon Jeons wrote: > On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > > Switching merge_across_nodes after running KSM is liable to oops on stale > > nodes still left over from the previous stable tree. It's not something > > Since this patch solve the problem, so the description of > merge_across_nodes(Value can be changed only when there is no ksm shared > pages in system) should be changed in this patch. No. The code could be changed to unmerge_and_remove_all_rmap_items() automatically whenever merge_across_nodes is changed; but that's not what Petr chose to do, and I didn't feel strongly to change it. 
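For readers following the patch 6/11 discussion: the effect of inserting a new mm just behind the cursor versus at the end of the list can be modelled with a small circular list. This is a userspace sketch under stated assumptions, not mm/ksm.c code; struct node, list_init(), insert_before() and scan_from() are hypothetical, with insert_before() playing the role that list_add_tail() plays in the patch.

```c
#include <stddef.h>

/* Hypothetical list node; "visited" records whether a pass reached it. */
struct node {
	struct node *prev, *next;
	int visited;
};

/* Make head an empty circular list. */
static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

/* Insert new_node immediately before pos (before head means "at the end"). */
static void insert_before(struct node *new_node, struct node *pos)
{
	new_node->prev = pos->prev;
	new_node->next = pos;
	pos->prev->next = new_node;
	pos->prev = new_node;
}

/*
 * One forward pass from cursor up to (not including) head, marking each
 * node visited. Returns the number of nodes visited.
 */
static int scan_from(struct node *cursor, struct node *head)
{
	struct node *p;
	int n = 0;

	for (p = cursor; p != head; p = p->next) {
		p->visited = 1;
		n++;
	}
	return n;
}
```

A node inserted before the cursor is never reached by the remainder of a single pass, while a node inserted before head (the end of the list) is: that is the difference between the two list_add_tail() calls in the patch.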
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751895Ab3A1Ggn (ORCPT ); Mon, 28 Jan 2013 01:36:43 -0500 Received: from mail-ia0-f173.google.com ([209.85.210.173]:36721 "EHLO mail-ia0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751131Ab3A1Ggm (ORCPT ); Mon, 28 Jan 2013 01:36:42 -0500 Message-ID: <1359355000.17885.1.camel@kernel> Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly From: Simon Jeons To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Mon, 28 Jan 2013 00:36:40 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote: > Switching merge_across_nodes after running KSM is liable to oops on stale > nodes still left over from the previous stable tree. It's not something > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. 
> > Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. > > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error. 
> > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? > */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through 
the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /* > + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. > + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ Why proceed to next nid if meet unstale stable node in stable tree? Then still can't fully cleanup stale stable nodes. 
> + } > + cond_resched(); > + } > + } > + return err; > +} > + > static int unmerge_and_remove_all_rmap_items(void) > { > struct mm_slot *mm_slot; > @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i > } > } > > + /* Clean up stable nodes, but don't worry if some are still busy */ > + remove_all_stable_nodes(); > ksm_scan.seqnr = 0; > return 0; > > @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm) > spin_lock(&ksm_mmlist_lock); > insert_to_mm_slots_hash(mm, mm_slot); > /* > - * Insert just behind the scanning cursor, to let the area settle > + * When KSM_RUN_MERGE (or KSM_RUN_STOP), > + * insert just behind the scanning cursor, to let the area settle > * down a little; when fork is followed by immediate exec, we don't > * want ksmd to waste time setting up and tearing down an rmap_list. > + * > + * But when KSM_RUN_UNMERGE, it's important to insert ahead of its > + * scanning cursor, otherwise KSM pages in newly forked mms will be > + * missed: then we might as well insert at the end of the list. 
> */ > - list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > + if (ksm_run & KSM_RUN_UNMERGE) > + list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list); > + else > + list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); > spin_unlock(&ksm_mmlist_lock); > > set_bit(MMF_VM_MERGEABLE, &mm->flags); > @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm) > } > } > > -struct page *ksm_does_need_to_copy(struct page *page, > +struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > + struct anon_vma *anon_vma = page_anon_vma(page); > struct page *new_page; > > + if (PageKsm(page)) { > + if (page_stable_node(page) && > + !(ksm_run & KSM_RUN_UNMERGE)) > + return page; /* no need to copy it */ > + } else if (!anon_vma) { > + return page; /* no need to copy it */ > + } else if (anon_vma->root == vma->anon_vma->root && > + page->index == linear_page_index(vma, address)) { > + return page; /* still no need to copy it */ > + } > + if (!PageUptodate(page)) > + return page; /* let do_swap_page report the error */ > + > new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); > if (new_page) { > copy_user_highpage(new_page, page, address, vma); > @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store( > > mutex_lock(&ksm_thread_mutex); > if (ksm_merge_across_nodes != knob) { > - if (ksm_pages_shared) > + if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > else > ksm_merge_across_nodes = knob; > --- mmotm.orig/mm/memory.c 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/mm/memory.c 2013-01-25 14:37:00.768206145 -0800 > @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct > if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) > goto out_page; > > - if (ksm_might_need_to_copy(page, vma, address)) { > - swapcache = page; > - page = ksm_does_need_to_copy(page, vma, address); > - > - if (unlikely(!page)) { > - ret = VM_FAULT_OOM; > - page = 
swapcache; > - swapcache = NULL; > - goto out_page; > - } > + swapcache = page; > + page = ksm_might_need_to_copy(page, vma, address); > + if (unlikely(!page)) { > + ret = VM_FAULT_OOM; > + page = swapcache; > + swapcache = NULL; > + goto out_page; > } > + if (page == swapcache) > + swapcache = NULL; > > if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > ret = VM_FAULT_OOM; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755425Ab3A1XDI (ORCPT ); Mon, 28 Jan 2013 18:03:08 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:43046 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750978Ab3A1XDG (ORCPT ); Mon, 28 Jan 2013 18:03:06 -0500 Date: Mon, 28 Jan 2013 15:03:04 -0800 From: Andrew Morton To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node Message-Id: <20130128150304.2e7a2fb4.akpm@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Jan 2013 17:54:53 -0800 (PST) Hugh Dickins wrote: > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > e.g. 
"echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > Default: 20 (chosen for demonstration purposes) > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > + When set to 0, ksm merges only pages which physically > + reside in the memory area of same NUMA node. It brings > + lower latency to access to shared page. Value can be > + changed only when there is no ksm shared pages in system. > + Default: 1 > + The explanation doesn't really tell the operator whether or not to set merge_across_nodes for a particular machine/workload. I guess most people will just shrug, turn the thing on and see if it improved things, but that's rather random. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755453Ab3A1XI5 (ORCPT ); Mon, 28 Jan 2013 18:08:57 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:43060 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750978Ab3A1XI4 (ORCPT ); Mon, 28 Jan 2013 18:08:56 -0500 Date: Mon, 28 Jan 2013 15:08:54 -0800 From: Andrew Morton To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node Message-Id: <20130128150854.6813b1ca.akpm@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Jan 2013 17:54:53 -0800 (PST) Hugh Dickins wrote: > +/* Zeroed when merging across nodes is not allowed */ > +static unsigned int ksm_merge_across_nodes = 1; I spose this should be __read_mostly. If __read_mostly is not really a synonym for __make_write_often_storage_slower. 
I continue to harbor fear, uncertainty and doubt about this... From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755480Ab3A1XLW (ORCPT ); Mon, 28 Jan 2013 18:11:22 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:43070 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750908Ab3A1XLU (ORCPT ); Mon, 28 Jan 2013 18:11:20 -0500 Date: Mon, 28 Jan 2013 15:11:19 -0800 From: Andrew Morton To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 3/11] ksm: trivial tidyups Message-Id: <20130128151119.b74d0150.akpm@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Jan 2013 17:58:11 -0800 (PST) Hugh Dickins wrote: > +#ifdef CONFIG_NUMA > +#define NUMA(x) (x) > +#define DO_NUMA(x) (x) Did we consider #define DO_NUMA do { (x) } while (0) ? That could avoid some nasty config-dependent compilation issues. 
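A small sketch of why the statement-like form helps, assuming a hypothetical MODEL_NUMA switch in place of CONFIG_NUMA and a hypothetical struct item. The macro is written with the semicolon that the statement form requires inside the braces.

```c
#define MODEL_NUMA 1	/* hypothetical stand-in for CONFIG_NUMA; try 0 as well */

#if MODEL_NUMA
#define DO_NUMA(x)	do { (x); } while (0)
#else
#define DO_NUMA(x)	do { } while (0)
#endif

/* Hypothetical structure carrying a node id. */
struct item {
	int nid;
};

/*
 * Both expansions are a single statement, so the unbraced if/else below
 * parses the same way whichever configuration is selected.
 */
static void set_nid(struct item *it, int nid, int valid)
{
	if (valid)
		DO_NUMA(it->nid = nid);
	else
		DO_NUMA(it->nid = -1);
}
```

With the (x) form, something like "int n = DO_NUMA(expr);" would build with NUMA enabled and fail without it; with do { } while (0) in both branches it fails to build either way, so the mistake is caught regardless of configuration.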
> +#else > +#define NUMA(x) (0) > +#define DO_NUMA(x) do { } while (0) > +#endif From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755707Ab3A1XoR (ORCPT ); Mon, 28 Jan 2013 18:44:17 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:43178 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755601Ab3A1XoJ (ORCPT ); Mon, 28 Jan 2013 18:44:09 -0500 Date: Mon, 28 Jan 2013 15:44:07 -0800 From: Andrew Morton To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly Message-Id: <20130128154407.16a623a4.akpm@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Jan 2013 18:01:59 -0800 (PST) Hugh Dickins wrote: > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; It's a bit rude to overwrite remove_stable_node()'s return value. > + break; /* proceed to next nid */ > + } > + cond_resched(); Why is this here? 
> + } > + } > + return err; > +} From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755698Ab3A1XzB (ORCPT ); Mon, 28 Jan 2013 18:55:01 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:43197 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753853Ab3A1Xy7 (ORCPT ); Mon, 28 Jan 2013 18:54:59 -0500 Date: Mon, 28 Jan 2013 15:54:52 -0800 From: Andrew Morton To: Hugh Dickins Cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration Message-Id: <20130128155452.16882a6e.akpm@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Jan 2013 17:53:10 -0800 (PST) Hugh Dickins wrote: > Here's a KSM series Sanity check: do you have a feeling for how useful KSM is? Performance/space improvements for typical (or atypical) workloads? Are people using it? Successfully? IOW, is it justifying itself? 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753798Ab3A2Azx (ORCPT ); Mon, 28 Jan 2013 19:55:53 -0500 Received: from na3sys010aog104.obsmtp.com ([74.125.245.76]:49055 "HELO na3sys010aog104.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1751326Ab3A2Azv (ORCPT ); Mon, 28 Jan 2013 19:55:51 -0500 X-Greylist: delayed 368 seconds by postgrey-1.27 at vger.kernel.org; Mon, 28 Jan 2013 19:55:51 EST Message-ID: <51071CA0.801@ravellosystems.com> Date: Tue, 29 Jan 2013 02:49:36 +0200 From: Izik Eidus User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: Andrew Morton CC: Hugh Dickins , Petr Holasek , Andrea Arcangeli , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration References: <20130128155452.16882a6e.akpm@linux-foundation.org> In-Reply-To: <20130128155452.16882a6e.akpm@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/29/2013 01:54 AM, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > Hugh Dickins wrote: > >> Here's a KSM series > Sanity check: do you have a feeling for how useful KSM is? > Performance/space improvements for typical (or atypical) workloads? > Are people using it? Successfully? Hi, I think it is mostly used for virtualization. I know of at least two products that use it: RHEV (Red Hat Enterprise Virtualization), and my current place (Ravello Systems), which uses it to do VM consolidation on top of cloud environments (running multiple unmodified VMs on top of one VM you get from EC2 / Rackspace / whatever); for Ravello it is highly critical in achieving a high consolidation ratio...
> > IOW, is it justifying itself? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753648Ab3A2BHU (ORCPT ); Mon, 28 Jan 2013 20:07:20 -0500 Received: from mail-pa0-f53.google.com ([209.85.220.53]:48858 "EHLO mail-pa0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751451Ab3A2BHS (ORCPT ); Mon, 28 Jan 2013 20:07:18 -0500 Date: Mon, 28 Jan 2013 17:07:15 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Marcelo Tosatti , Gleb Natapov , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration In-Reply-To: <20130128155452.16882a6e.akpm@linux-foundation.org> Message-ID: References: <20130128155452.16882a6e.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > Hugh Dickins wrote: > > > Here's a KSM series > > Sanity check: do you have a feeling for how useful KSM is? > Performance/space improvements for typical (or atypical) workloads? > Are people using it? Successfully? > > IOW, is it justifying itself? I have no idea! To me it's simply a technical challenge - and I agree with your implication that that's not a good enough justification. I've added Marcelo and Gleb and the KVM list to the Cc: my understanding is that it's the KVM guys who really appreciate KSM. 
Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754790Ab3A2BRX (ORCPT ); Mon, 28 Jan 2013 20:17:23 -0500 Received: from mail-pb0-f43.google.com ([209.85.160.43]:42126 "EHLO mail-pb0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751518Ab3A2BRV (ORCPT ); Mon, 28 Jan 2013 20:17:21 -0500 Date: Mon, 28 Jan 2013 17:17:24 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <20130128150304.2e7a2fb4.akpm@linux-foundation.org> Message-ID: References: <20130128150304.2e7a2fb4.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:54:53 -0800 (PST) > Hugh Dickins wrote: > > > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > > Default: 20 (chosen for demonstration purposes) > > > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > > + When set to 0, ksm merges only pages which physically > > + reside in the memory area of same NUMA node. It brings > > + lower latency to access to shared page. Value can be > > + changed only when there is no ksm shared pages in system. > > + Default: 1 > > + > > The explanation doesn't really tell the operator whether or not to set > merge_across_nodes for a particular machine/workload. 
> > I guess most people will just shrug, turn the thing on and see if it > improved things, but that's rather random. Right. I don't think we can tell them which is going to be better, but surely we could do a better job of hinting at the tradeoffs. I think we expect large NUMA machines with lots of memory to want the better NUMA behavior of !merge_across_nodes, but machines with more limited memory across short-distance NUMA nodes to prefer the greater deduplication of merge_across_nodes. Petr, do you have a more informative text for this? Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754942Ab3A2Bin (ORCPT ); Mon, 28 Jan 2013 20:38:43 -0500 Received: from mail-pa0-f45.google.com ([209.85.220.45]:49229 "EHLO mail-pa0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752513Ab3A2Bik (ORCPT ); Mon, 28 Jan 2013 20:38:40 -0500 Date: Mon, 28 Jan 2013 17:38:43 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <20130128150854.6813b1ca.akpm@linux-foundation.org> Message-ID: References: <20130128150854.6813b1ca.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:54:53 -0800 (PST) > Hugh Dickins wrote: > > > +/* Zeroed when merging across nodes is not allowed */ > > +static unsigned int ksm_merge_across_nodes = 1; > > I spose this should be __read_mostly. If __read_mostly is not really a > synonym for __make_write_often_storage_slower.
I continue to harbor > fear, uncertainty and doubt about this... Could do. No strong feeling, but I think I'd rather it share its cacheline with other KSM-related stuff, than be off mixed up with unrelateds. I think there's a much stronger case for __read_mostly when it's a library thing accessed by different subsystems. You're right that this variable is accessed significantly more often than the other KSM tunables, so deserves a __read_mostly more than they do. But where to stop? Similar reluctance led me to avoid using "unlikely" throughout ksm.c, unlikely as some conditions are (I'm aghast to see that Andrea sneaked in a "likely" :). Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753016Ab3A2BoW (ORCPT ); Mon, 28 Jan 2013 20:44:22 -0500 Received: from mail-pb0-f49.google.com ([209.85.160.49]:43941 "EHLO mail-pb0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751342Ab3A2BoU (ORCPT ); Mon, 28 Jan 2013 20:44:20 -0500 Date: Mon, 28 Jan 2013 17:44:23 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 3/11] ksm: trivial tidyups In-Reply-To: <20130128151119.b74d0150.akpm@linux-foundation.org> Message-ID: References: <20130128151119.b74d0150.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 17:58:11 -0800 (PST) > Hugh Dickins wrote: > > > +#ifdef CONFIG_NUMA > > +#define NUMA(x) (x) > > +#define DO_NUMA(x) (x) > > Did we consider > > #define DO_NUMA do { (x) } while (0) > > ? It didn't occur to me at all. I like that it makes more sense of the DO_NUMA variant.
Is it okay that, to work with the way I was using it, we need "(x);" in there rather than just "(x)"? > > That could avoid some nasty config-dependent compilation issues. > > > +#else > > +#define NUMA(x) (0) [PATCH] ksm: trivial tidyups fix Suggested by akpm: make DO_NUMA(x) do { (x); } while (0) more like the #else. Signed-off-by: Hugh Dickins --- mm/ksm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- mmotm.org/mm/ksm.c 2013-01-27 09:55:45.000000000 -0800 +++ mmotm/mm/ksm.c 2013-01-28 16:50:25.772026446 -0800 @@ -43,7 +43,7 @@ #ifdef CONFIG_NUMA #define NUMA(x) (x) -#define DO_NUMA(x) (x) +#define DO_NUMA(x) do { (x); } while (0) #else #define NUMA(x) (0) #define DO_NUMA(x) do { } while (0) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755330Ab3A2CDP (ORCPT ); Mon, 28 Jan 2013 21:03:15 -0500 Received: from mail-pb0-f43.google.com ([209.85.160.43]:52375 "EHLO mail-pb0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751736Ab3A2CDN (ORCPT ); Mon, 28 Jan 2013 21:03:13 -0500 Date: Mon, 28 Jan 2013 18:03:16 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Andrew Morton cc: Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <20130128154407.16a623a4.akpm@linux-foundation.org> Message-ID: References: <20130128154407.16a623a4.akpm@linux-foundation.org> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Jan 2013, Andrew Morton wrote: > On Fri, 25 Jan 2013 18:01:59 -0800 (PST) > Hugh Dickins wrote: > > > +static int remove_all_stable_nodes(void) > > +{ > > + struct stable_node *stable_node; > > + int nid; > > + int err = 0; > > + > > + for (nid = 0; nid 
< nr_node_ids; nid++) { > > + while (root_stable_tree[nid].rb_node) { > > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > > + struct stable_node, node); > > + if (remove_stable_node(stable_node)) { > > + err = -EBUSY; > > It's a bit rude to overwrite remove_stable_node()'s return value. Well.... yes, but only the tiniest bit rude :) > > > + break; /* proceed to next nid */ > > + } > > + cond_resched(); > > Why is this here? Because we don't have a limit on the length of this loop, and if every node which remove_stable_node() finds is already stale, and has no rmap_item still attached, then there would be no rescheduling point in the unbounded loop without this one. I was taught to worry about bad latencies even in unpreemptible kernels. Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755296Ab3A2C0V (ORCPT ); Mon, 28 Jan 2013 21:26:21 -0500 Received: from na3sys010aog110.obsmtp.com ([74.125.245.88]:57039 "HELO na3sys010aog110.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1751958Ab3A2C0U (ORCPT ); Mon, 28 Jan 2013 21:26:20 -0500 Message-ID: <51073345.4070605@ravellosystems.com> Date: Tue, 29 Jan 2013 04:26:13 +0200 From: Izik Eidus User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: Andrew Morton CC: Hugh Dickins , Petr Holasek , Andrea Arcangeli , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration References: <20130128155452.16882a6e.akpm@linux-foundation.org> <51071CA0.801@ravellosystems.com> In-Reply-To: <51071CA0.801@ravellosystems.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/29/2013 02:49 AM, Izik Eidus wrote: > On 
01/29/2013 01:54 AM, Andrew Morton wrote: >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST) >> Hugh Dickins wrote: >> >>> Here's a KSM series >> Sanity check: do you have a feeling for how useful KSM is? >> Performance/space improvements for typical (or atypical) workloads? >> Are people using it? Successfully? BTW, After thinking a bit about the word people, I wanted to see if normal users of linux that just download and install Linux (without using special virtualization product) are able to use it. So I google little bit for it, and found some nice results from users: http://serverascode.com/2012/11/11/ksm-kvm.html But I do agree that it provide justifying value only for virtualization users... > > Hi, > I think it mostly used for virtualization, I know at least two > products that it use - > RHEV - RedHat enterprise virtualization, and my current place (Ravello > Systems) that use it to do vm consolidation on top of cloud enviorments > (Run multiple unmodified VMs on top of one vm you get from ec2 / > rackspace / what so ever), for Ravello it is highly critical in > achieving high rate > of consolidation ratio... > >> >> IOW, is it justifying itself? 
> From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753528Ab3A2QwQ (ORCPT ); Tue, 29 Jan 2013 11:52:16 -0500 Received: from mx1.redhat.com ([209.132.183.28]:42836 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751770Ab3A2QwO (ORCPT ); Tue, 29 Jan 2013 11:52:14 -0500 Date: Tue, 29 Jan 2013 17:51:25 +0100 From: Andrea Arcangeli To: Izik Eidus Cc: Andrew Morton , Hugh Dickins , Petr Holasek , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration Message-ID: <20130129165125.GA17671@redhat.com> References: <20130128155452.16882a6e.akpm@linux-foundation.org> <51071CA0.801@ravellosystems.com> <51073345.4070605@ravellosystems.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51073345.4070605@ravellosystems.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi everyone, On Tue, Jan 29, 2013 at 04:26:13AM +0200, Izik Eidus wrote: > On 01/29/2013 02:49 AM, Izik Eidus wrote: > > On 01/29/2013 01:54 AM, Andrew Morton wrote: > >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > >> Hugh Dickins wrote: > >> > >>> Here's a KSM series > >> Sanity check: do you have a feeling for how useful KSM is? > >> Performance/space improvements for typical (or atypical) workloads? > >> Are people using it? Successfully?
> > > BTW, After thinking a bit about the word people, I wanted to see if > normal users of linux > that just download and install Linux (without using special > virtualization product) are able to use it. > So I google little bit for it, and found some nice results from users: > http://serverascode.com/2012/11/11/ksm-kvm.html > > But I do agree that it provide justifying value only for virtualization > users... Mostly for virtualization users indeed, but I'm aware of a few non virtualization users too: 1) CERN has been one of the early adopters of KSM and initially they were using KSM standalone (probably because not all hypervisors they had to deal with were KVM/linux based, while all guests were linux and in turn KSM capable). More info in the KSM paper page 2: http://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf However lately they're running KSM in combination with KVM too, and I'm not sure if they're still using it standalone. See the "KSM shared" blue area in slide 12 and the comparison with KSM on and off in slide 14. https://indico.fnal.gov/getFile.py/access?contribId=18&sessionId=4&resId=0&materialId=slides&confId=4986 2) all recent cyanogenmod in the performance menu in settings supports KSM out of the box. You can run it for a while and then shut it off. Not sure how good idea it is to leave it always on, but the only efficient cellphone/tablet powersaving design (i.e. the wakelocks + suspend to ram) still won't waste energy while the screen is off and the phone has suspended to ram, regardless of KSM on or off. KSM NUMA awareness however is not needed on the cellphone :). 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754902Ab3AaAFt (ORCPT ); Wed, 30 Jan 2013 19:05:49 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:63682 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753229Ab3AaAFr (ORCPT ); Wed, 30 Jan 2013 19:05:47 -0500 Message-ID: <1359590736.1557.0.camel@kernel> Subject: Re: [PATCH 0/11] ksm: NUMA trees and page migration From: Ric Mason To: Andrea Arcangeli Cc: Izik Eidus , Andrew Morton , Hugh Dickins , Petr Holasek , Rik van Riel , David Rientjes , Anton Arapov , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Wed, 30 Jan 2013 18:05:36 -0600 In-Reply-To: <20130129165125.GA17671@redhat.com> References: <20130128155452.16882a6e.akpm@linux-foundation.org> <51071CA0.801@ravellosystems.com> <51073345.4070605@ravellosystems.com> <20130129165125.GA17671@redhat.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.4 (3.4.4-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2013-01-29 at 17:51 +0100, Andrea Arcangeli wrote: > Hi everyone, > > On Tue, Jan 29, 2013 at 04:26:13AM +0200, Izik Eidus wrote: > > On 01/29/2013 02:49 AM, Izik Eidus wrote: > > > On 01/29/2013 01:54 AM, Andrew Morton wrote: > > >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST) > > >> Hugh Dickins wrote: > > >> > > >>> Here's a KSM series > > >> Sanity check: do you have a feeling for how useful KSM is? > > >> Performance/space improvements for typical (or atypical) workloads? > > >> Are people using it? Successfully? > > > > > > BTW, After thinking a bit about the word people, I wanted to see if > > normal users of linux > > that just download and install Linux (without using special > > virtualization product) are able to use it. 
> > So I google little bit for it, and found some nice results from users: > > http://serverascode.com/2012/11/11/ksm-kvm.html > > > > But I do agree that it provide justifying value only for virtualization > > users... > > Mostly for virtualization users indeed, but I'm aware of a few non > virtualization users too: > > 1) CERN has been one of the early adopters of KSM and initially they > were using KSM standalone (probably because not all hypervisors they > had to deal with were KVM/linux based, while all guests were linux and > in turn KSM capable). More info in the KSM paper page 2: > > http://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf > > However lately they're running KSM in combination with KVM too, and I'm > not sure if they're still using it standalone. See the "KSM shared" > blue area in slide 12 and the comparison with KSM on and off in slide > 14. > > https://indico.fnal.gov/getFile.py/access?contribId=18&sessionId=4&resId=0&materialId=slides&confId=4986 > > 2) all recent cyanogenmod in the performance menu in settings supports > KSM out of the box. You can run it for a while and then shut it > off. > > Not sure how good idea it is to leave it always on, but the only > efficient cellphone/tablet powersaving design (i.e. the wakelocks + > suspend to ram) still won't waste energy while the screen is off and > the phone has suspended to ram, regardless of KSM on or off. > > KSM NUMA awareness however is not needed on the cellphone :). Thanks for your sharing. Is there ksm benchmark? How to get it?
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755268Ab3BEQl0 (ORCPT ); Tue, 5 Feb 2013 11:41:26 -0500 Received: from cantor2.suse.de ([195.135.220.15]:58216 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751142Ab3BEQlX (ORCPT ); Tue, 5 Feb 2013 11:41:23 -0500 Date: Tue, 5 Feb 2013 16:41:18 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node Message-ID: <20130205164118.GI21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote: > From: Petr Holasek > > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes > which control merging pages across different numa nodes. > When it is set to zero only pages from the same node are merged, > otherwise pages from all nodes can be merged together (default behavior). > > Typical use-case could be a lot of KVM guests on NUMA machine > and cpus from more distant nodes would have significant increase > of access latency to the merged ksm page. Sysfs knob was choosen > for higher variability when some users still prefers higher amount > of saved physical memory regardless of access latency. > This is understandable but it's going to be a fairly obscure option. I do not think it can be known in advance if the option should be set. The user must either run benchmarks before and after or use perf to record the "node-load-misses" event and see if setting the parameter reduces the number of remote misses.
I don't know the internals of ksm.c at all and this is my first time reading this series. Everything in this review is subject to being completely wrong or due to a major misunderstanding on my part. Delete all feedback if desired. > Every numa node has its own stable & unstable trees because of faster > searching and inserting. Changing of merge_across_nodes value is possible > only when there are not any ksm shared pages in system. > > I've tested this patch on numa machines with 2, 4 and 8 nodes and > measured speed of memory access inside of KVM guests with memory pinned > to one of nodes with this benchmark: > > http://pholasek.fedorapeople.org/alloc_pg.c > > Population standard deviations of access times in percentage of average > were following: > > merge_across_nodes=1 > 2 nodes 1.4% > 4 nodes 1.6% > 8 nodes 1.7% > > merge_across_nodes=0 > 2 nodes 1% > 4 nodes 0.32% > 8 nodes 0.018% > > RFC: https://lkml.org/lkml/2011/11/30/91 > v1: https://lkml.org/lkml/2012/1/23/46 > v2: https://lkml.org/lkml/2012/6/29/105 > v3: https://lkml.org/lkml/2012/9/14/550 > v4: https://lkml.org/lkml/2012/9/23/137 > v5: https://lkml.org/lkml/2012/12/10/540 > v6: https://lkml.org/lkml/2012/12/23/154 > v7: https://lkml.org/lkml/2012/12/27/225 > > Hugh notes that this patch brings two problems, whose solution needs > further support in mm/ksm.c, which follows in subsequent patches: > 1) switching merge_across_nodes after running KSM is liable to oops > on stale nodes still left over from the previous stable tree; > 2) memory hotremove may migrate KSM pages, but there is no provision > here for !merge_across_nodes to migrate nodes to the proper tree. 
> > Signed-off-by: Petr Holasek > Signed-off-by: Hugh Dickins > Acked-by: Rik van Riel > --- > Documentation/vm/ksm.txt | 7 + > mm/ksm.c | 151 ++++++++++++++++++++++++++++++++----- > 2 files changed, 139 insertions(+), 19 deletions(-) > > --- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/Documentation/vm/ksm.txt 2013-01-25 14:36:38.608205618 -0800 > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" > Default: 20 (chosen for demonstration purposes) > > +merge_across_nodes - specifies if pages from different numa nodes can be merged. > + When set to 0, ksm merges only pages which physically > + reside in the memory area of same NUMA node. It brings > + lower latency to access to shared page. Value can be > + changed only when there is no ksm shared pages in system. > + Default: 1 > + > run - set 0 to stop ksmd from running but keep merged pages, > set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", > set 2 to stop ksmd and unmerge all pages currently merged, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 > @@ -36,6 +36,7 @@ > #include > #include > #include > +#include > > #include > #include "internal.h" > @@ -139,6 +140,9 @@ struct rmap_item { > struct mm_struct *mm; > unsigned long address; /* + low bits used for flags below */ > unsigned int oldchecksum; /* when unstable */ > +#ifdef CONFIG_NUMA > + unsigned int nid; > +#endif > union { > struct rb_node node; /* when node of unstable tree */ > struct { /* when listed from stable tree */ > @@ -153,8 +157,8 @@ struct rmap_item { > #define STABLE_FLAG 0x200 /* is listed from the stable tree */ > > /* The stable and unstable tree heads */ > -static struct rb_root root_stable_tree = RB_ROOT; > -static struct rb_root root_unstable_tree = RB_ROOT; > +static struct rb_root root_unstable_tree[MAX_NUMNODES]; > +static struct rb_root 
root_stable_tree[MAX_NUMNODES]; > With multiple stable node trees does the comment that begins with * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: need an update? It's uninitialised so kernel data size in vmlinux should be unaffected but it's an additional runtime cost of around 4K for a standardish enterprise distro kernel config. Small beans on a NUMA machine and maybe not worth the hassle of kmalloc for nr_online_nodes and dealing with node memory hotplug but it's a pity. > #define MM_SLOTS_HASH_BITS 10 > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ > /* Milliseconds ksmd should sleep between batches */ > static unsigned int ksm_thread_sleep_millisecs = 20; > > +/* Zeroed when merging across nodes is not allowed */ > +static unsigned int ksm_merge_across_nodes = 1; > + Nit but initialised data does increase the size of vmlinux so maybe this should be the "opposite". i.e. rename it to ksm_merge_within_nodes and default it to 0? __read_mostly? > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > @@ -441,10 +448,25 @@ out: page = NULL; > return page; > } > > +/* > + * This helper is used for getting right index into array of tree roots. > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for > + * stable and unstable pages from all nodes with roots in index 0. Otherwise, > + * every node has its own stable and unstable tree. 
> + */ > +static inline int get_kpfn_nid(unsigned long kpfn) > +{ > + if (ksm_merge_across_nodes) > + return 0; > + else > + return pfn_to_nid(kpfn); > +} > + If we start with ksm_merge_across_nodes set, KSM runs for a while and populates the stable tree for node 0, and then ksm_merge_across_nodes gets cleared, then badness happens because this can go anywhere: nid = get_kpfn_nid(stable_node->kpfn); rb_erase(&stable_node->node, &root_stable_tree[nid]); Very late in the review I noticed that you comment on this already in the changelog and that it is addressed later in the series. I haven't seen this patch yet so the following suggestion is very stale but might still be relevant. We could increase the size of root_stable_tree[] by 1, have get_kpfn_nid() return MAX_NUMNODES if ksm_merge_across_nodes and, if ksm_merge_across_nodes gets set to 0, then walk the stable tree at root_stable_tree[MAX_NUMNODES] and delete the entire tree? It'd be disruptive as hell unfortunately and might break entirely if there is not enough memory to unshare the pages. Ideally we could take our time walking root_stable_tree[MAX_NUMNODES] without worrying about collisions and fix it up somehow.
Dunno > static void remove_node_from_stable_tree(struct stable_node *stable_node) > { > struct rmap_item *rmap_item; > struct hlist_node *hlist; > + int nid; > > hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) { > if (rmap_item->hlist.next) > @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree > cond_resched(); > } > > - rb_erase(&stable_node->node, &root_stable_tree); > + nid = get_kpfn_nid(stable_node->kpfn); > + > + rb_erase(&stable_node->node, &root_stable_tree[nid]); > free_stable_node(stable_node); > } > > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s > age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); > BUG_ON(age > 1); > if (!age) > - rb_erase(&rmap_item->node, &root_unstable_tree); > +#ifdef CONFIG_NUMA > + rb_erase(&rmap_item->node, > + &root_unstable_tree[rmap_item->nid]); > +#else > + rb_erase(&rmap_item->node, &root_unstable_tree[0]); > +#endif > nit, does rmap_item->nid deserve a getter and setter helper instead? #ifdef CONFIG_NUMA static inline int rmap_item_nid(struct rmap_item *item) { return item->nid; } static inline void set_rmap_item_nid(struct rmap_item *item, int nid) { item->nid = nid; } #else static inline int rmap_item_nid(struct rmap_item *item) { return 0; } static inline void set_rmap_item_nid(struct rmap_item *item, int nid) { } #endif > ksm_pages_unshared--; > rmap_item->address &= PAGE_MASK; > @@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag > */ > static struct page *stable_tree_search(struct page *page) > { > - struct rb_node *node = root_stable_tree.rb_node; > + struct rb_node *node; > struct stable_node *stable_node; > + int nid; > > stable_node = page_stable_node(page); > if (stable_node) { /* ksm page forked */ > @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s > return page; > } > > + nid = get_kpfn_nid(page_to_pfn(page)); > + node = root_stable_tree[nid].rb_node; > + > while (node) { > struct page *tree_page; > int ret; > @@ -1033,10
+1066,16 @@ static struct page *stable_tree_search(s > */ > static struct stable_node *stable_tree_insert(struct page *kpage) > { > - struct rb_node **new = &root_stable_tree.rb_node; > + int nid; > + unsigned long kpfn; > + struct rb_node **new; > struct rb_node *parent = NULL; > struct stable_node *stable_node; > > + kpfn = page_to_pfn(kpage); > + nid = get_kpfn_nid(kpfn); > + new = &root_stable_tree[nid].rb_node; > + > while (*new) { > struct page *tree_page; > int ret; > @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i > return NULL; > > rb_link_node(&stable_node->node, parent, new); > - rb_insert_color(&stable_node->node, &root_stable_tree); > + rb_insert_color(&stable_node->node, &root_stable_tree[nid]); > > INIT_HLIST_HEAD(&stable_node->hlist); > > - stable_node->kpfn = page_to_pfn(kpage); > + stable_node->kpfn = kpfn; > set_page_stable_node(kpage, stable_node); > > return stable_node; > @@ -1098,10 +1137,15 @@ static > struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, > struct page *page, > struct page **tree_pagep) > - > { > - struct rb_node **new = &root_unstable_tree.rb_node; > + struct rb_node **new; > + struct rb_root *root; > struct rb_node *parent = NULL; > + int nid; > + > + nid = get_kpfn_nid(page_to_pfn(page)); > + root = &root_unstable_tree[nid]; > + new = &root->rb_node; > > while (*new) { > struct rmap_item *tree_rmap_item; > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > return NULL; > } > > + /* > + * If tree_page has been migrated to another NUMA node, it > + * will be flushed out and put into the right unstable tree > + * next time: only merge with it if merge_across_nodes. > + * Just notice, we don't have similar problem for PageKsm > + * because their migration is disabled now. (62b61f611e) > + */ > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { > + put_page(tree_page); > + return NULL; > + } > + What about this case? 1. ksm_merge_across_nodes==0 2. 
pages get placed on different unstable trees 3. ksm_merge_across_nodes==1 At that point we should be removing pages from the different unstable trees and moving them to root_unstable_tree[0] but this put_page() doesn't happen. Does it matter? > ret = memcmp_pages(page, tree_page); > > parent = *new; > @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i > > rmap_item->address |= UNSTABLE_FLAG; > rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); > +#ifdef CONFIG_NUMA > + rmap_item->nid = nid; > +#endif > rb_link_node(&rmap_item->node, parent, new); > - rb_insert_color(&rmap_item->node, &root_unstable_tree); > + rb_insert_color(&rmap_item->node, root); > > ksm_pages_unshared++; > return NULL; > @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i > static void stable_tree_append(struct rmap_item *rmap_item, > struct stable_node *stable_node) > { > +#ifdef CONFIG_NUMA > + /* > + * Usually rmap_item->nid is already set correctly, > + * but it may be wrong after switching merge_across_nodes. > + */ > + rmap_item->nid = get_kpfn_nid(stable_node->kpfn); > +#endif > rmap_item->head = stable_node; > rmap_item->address |= STABLE_FLAG; > hlist_add_head(&rmap_item->hlist, &stable_node->hlist); > @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r > struct mm_slot *slot; > struct vm_area_struct *vma; > struct rmap_item *rmap_item; > + int nid; > > if (list_empty(&ksm_mm_head.mm_list)) > return NULL; > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r > */ > lru_add_drain_all(); > > - root_unstable_tree = RB_ROOT; > + for (nid = 0; nid < nr_node_ids; nid++) > + root_unstable_tree[nid] = RB_ROOT; > Minor but you shouldn't need to reset them all if ksm_merge_across_nodes==1. Initially this triggered an alarm because it's not immediately obvious why you can just discard an rbtree like this. It looks like this works because the items in the unstable tree are also part of a linked list, so the rb representation can be reset quickly without leaking memory.
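A minimal userspace sketch of the point above, under stated assumptions: items are owned by a list and the tree only indexes them, so the root can be dropped in O(1) without leaking. The names (struct scan_item, scan_index_add and friends) are hypothetical, and this is a plain unbalanced binary tree, not the kernel rbtree or the actual ksm.c structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical item: owned by a list, additionally indexed by a tree. */
struct scan_item {
	int key;
	struct scan_item *next;          /* owning list link */
	struct scan_item *left, *right;  /* index-only tree links */
};

struct scan_index {
	struct scan_item *list_head;  /* owns every item */
	struct scan_item *tree_root;  /* search structure, can be dropped */
};

/* Allocate an item, put it on the owning list and link it into the tree. */
static struct scan_item *scan_index_add(struct scan_index *ix, int key)
{
	struct scan_item *it;
	struct scan_item **link;

	assert(ix != NULL);
	it = calloc(1, sizeof(*it));
	if (!it)
		return NULL;

	it->key = key;
	it->next = ix->list_head;
	ix->list_head = it;

	/* Plain binary-search-tree descent to find the empty slot. */
	link = &ix->tree_root;
	while (*link)
		link = key < (*link)->key ? &(*link)->left : &(*link)->right;
	*link = it;
	return it;
}

/* Drop the tree in O(1): every item stays reachable through the list. */
static void scan_index_reset_tree(struct scan_index *ix)
{
	ix->tree_root = NULL;
}

/* Free everything by walking the list only; returns the number freed. */
static size_t scan_index_free_all(struct scan_index *ix)
{
	size_t n = 0;
	struct scan_item *it = ix->list_head;

	while (it) {
		struct scan_item *next = it->next;

		free(it);
		it = next;
		n++;
	}
	ix->list_head = NULL;
	ix->tree_root = NULL;
	return n;
}
```

After a reset, the child pointers left inside each item are stale and must not be followed; in the quoted remove_rmap_item_from_tree() hunk that job appears to fall to the seqnr age check before rb_erase().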
> spin_lock(&ksm_mmlist_lock); > slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list); > @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta > unsigned long end_pfn) > { > struct rb_node *node; > + int nid; > > - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) { > - struct stable_node *stable_node; > + for (nid = 0; nid < nr_node_ids; nid++) > + for (node = rb_first(&root_stable_tree[nid]); node; > + node = rb_next(node)) { > + struct stable_node *stable_node; > + > + stable_node = rb_entry(node, struct stable_node, node); > + if (stable_node->kpfn >= start_pfn && > + stable_node->kpfn < end_pfn) > + return stable_node; > + } > > - stable_node = rb_entry(node, struct stable_node, node); > - if (stable_node->kpfn >= start_pfn && > - stable_node->kpfn < end_pfn) > - return stable_node; > - } > return NULL; > } > > @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject > } > KSM_ATTR(run); > > +#ifdef CONFIG_NUMA > +static ssize_t merge_across_nodes_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return sprintf(buf, "%u\n", ksm_merge_across_nodes); > +} > + > +static ssize_t merge_across_nodes_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + int err; > + unsigned long knob; > + > + err = kstrtoul(buf, 10, &knob); > + if (err) > + return err; > + if (knob > 1) > + return -EINVAL; > + > + mutex_lock(&ksm_thread_mutex); > + if (ksm_merge_across_nodes != knob) { > + if (ksm_pages_shared) > + err = -EBUSY; > + else > + ksm_merge_across_nodes = knob; > + } > + mutex_unlock(&ksm_thread_mutex); > + > + return err ? 
err : count; > +} > +KSM_ATTR(merge_across_nodes); > +#endif > + > static ssize_t pages_shared_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = { > &pages_unshared_attr.attr, > &pages_volatile_attr.attr, > &full_scans_attr.attr, > +#ifdef CONFIG_NUMA > + &merge_across_nodes_attr.attr, > +#endif > NULL, > }; > > @@ -1992,11 +2101,15 @@ static int __init ksm_init(void) > { > struct task_struct *ksm_thread; > int err; > + int nid; > > err = ksm_slab_init(); > if (err) > goto out; > > + for (nid = 0; nid < nr_node_ids; nid++) > + root_stable_tree[nid] = RB_ROOT; > + > ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); > if (IS_ERR(ksm_thread)) { > printk(KERN_ERR "ksm: creating kthread failed\n"); > -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755679Ab3BEQsc (ORCPT ); Tue, 5 Feb 2013 11:48:32 -0500 Received: from cantor2.suse.de ([195.135.220.15]:58528 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755572Ab3BEQs1 (ORCPT ); Tue, 5 Feb 2013 11:48:27 -0500 Date: Tue, 5 Feb 2013 16:48:23 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Message-ID: <20130205164823.GJ21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote: > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient > (restarting whenever it finds a stale node to remove), but rearrange > so that at least it does not needlessly restart 
from nid 0 each time. > And add a couple of comments: here is why we keep pfn instead of page. > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 38 ++++++++++++++++++++++---------------- > 1 file changed, 22 insertions(+), 16 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa > #endif /* CONFIG_MIGRATION */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, > - unsigned long end_pfn) > +static void ksm_check_stable_tree(unsigned long start_pfn, > + unsigned long end_pfn) > { > + struct stable_node *stable_node; > struct rb_node *node; > int nid; > > - for (nid = 0; nid < nr_node_ids; nid++) > - for (node = rb_first(&root_stable_tree[nid]); node; > - node = rb_next(node)) { > - struct stable_node *stable_node; > - > + for (nid = 0; nid < nr_node_ids; nid++) { > + node = rb_first(&root_stable_tree[nid]); > + while (node) { This is not your fault; the old code is wrong too. It is assuming that all nodes are populated in numeric order with no holes. It won't work if just two nodes 0 and 4 are online. It should be using for_each_online_node(). 
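The concern can be illustrated with a small userspace model. This is only a sketch: MAX_NODES, the bitmask representation and both function names are hypothetical, and it shows nothing more than the difference between a loop driven by an online mask (the style of a for_each_online_node() walk) and a loop bounded by a node count. Whether nr_node_ids behaves like the count in visit_dense() is a property of the kernel that this sketch does not settle.

```c
#include <assert.h>

#define MAX_NODES 8	/* illustrative; not the kernel's MAX_NUMNODES */

/*
 * Visit every node id whose bit is set in online_mask, storing the
 * visited ids in out[] and returning how many there were.
 */
static int visit_online(unsigned int online_mask, int out[MAX_NODES])
{
	int nid, n = 0;

	for (nid = 0; nid < MAX_NODES; nid++)
		if (online_mask & (1u << nid))
			out[n++] = nid;
	return n;
}

/*
 * Visit ids 0..count-1, as a loop bounded by a node count would.
 * Returns how many of the visited ids were actually online.
 */
static int visit_dense(unsigned int online_mask, int count)
{
	int nid, hits = 0;

	assert(count >= 0 && count <= MAX_NODES);
	for (nid = 0; nid < count; nid++)
		if (online_mask & (1u << nid))
			hits++;
	return hits;
}
```

With nodes 0 and 4 online (mask 0x11), the mask-driven walk reaches both, while a walk bounded by the count of two online nodes only ever looks at ids 0 and 1 and so finds one.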
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756393Ab3BERSP (ORCPT ); Tue, 5 Feb 2013 12:18:15 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60679 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756136Ab3BERSJ (ORCPT ); Tue, 5 Feb 2013 12:18:09 -0500 Date: Tue, 5 Feb 2013 17:18:05 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked Message-ID: <20130205171805.GK21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 25, 2013 at 06:00:50PM -0800, Hugh Dickins wrote: > In some places where get_ksm_page() is used, we need the page to be locked. > > When KSM migration is fully enabled, we shall want that to make sure that > the page just acquired cannot be migrated beneath us (raised page count is > only effective when there is serialization to make sure migration notices). > Whereas when navigating through the stable tree, we certainly do not want > to lock each node (raised page count is enough to guarantee the memcmps, > even if page is migrated to another node). > > Since we're about to add another use case, add the locked argument to > get_ksm_page() now. > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > really got the wrong end of the stick on that! There's a configuration > in which page_cache_get_speculative() can do something cheaper than > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > disabled preemption for it. 
There's no need for rcu_read_lock() around > get_page_unless_zero() (and mapping checks) here. Cut out that > silliness before making this any harder to understand. > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 23 +++++++++++++---------- > 1 file changed, 13 insertions(+), 10 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > * but this is different - made simpler by ksm_thread_mutex being held, but > * interesting for assuming that no other use of the struct page could ever > * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > - * to keep the page_count protocol described with page_cache_get_speculative. > + * coincides with page->mapping). > * > * Note: it is possible that get_ksm_page() will return NULL one moment, > * then page the next, if the page is in between page_freeze_refs() and > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. > */ > -static struct page *get_ksm_page(struct stable_node *stable_node) > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { The naming is unhelpful :( Because the second parameter is called "locked", it implies that the caller of this function holds the page lock (which is obviously very silly). ret_locked maybe? As the function is akin to find_lock_page I would prefer if there was a new get_lock_ksm_page() instead of locking depending on the value of a parameter. We can do this because expected_mapping is recorded by the stable_node and we only need to recalculate it if the page has been successfully pinned. We calculate the expected value twice but that's not earth shattering. 
It'd look something like; /* * get_lock_ksm_page: Similar to get_ksm_page except returns with page * locked and pinned */ static struct page *get_lock_ksm_page(struct stable_node *stable_node) { struct page *page = get_ksm_page(stable_node); if (page) { expected_mapping = (void *)stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); lock_page(page); if (page->mapping != expected_mapping) { unlock_page(page); /* release pin taken by get_ksm_page() */ put_page(page); page = NULL; } } return page; } Up to you, I'm not going to make a big deal of it. FWIW, I agree that removing rcu_read_lock() is fine. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756318Ab3BERz7 (ORCPT ); Tue, 5 Feb 2013 12:55:59 -0500 Received: from cantor2.suse.de ([195.135.220.15]:33792 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756232Ab3BERz4 (ORCPT ); Tue, 5 Feb 2013 12:55:56 -0500 Date: Tue, 5 Feb 2013 17:55:51 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly Message-ID: <20130205175551.GL21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 25, 2013 at 06:01:59PM -0800, Hugh Dickins wrote: > Switching merge_across_nodes after running KSM is liable to oops on stale > nodes still left over from the previous stable tree. It's not something > that people will often want to do, but it would be lame to demand a reboot > when they're trying to determine which merge_across_nodes setting is best. > > How can this happen? 
We only permit switching merge_across_nodes when > pages_shared is 0, and usually set run 2 to force that beforehand, which > ought to unmerge everything: yet oopses still occur when you then run 1. > When reviewing patch 1, I missed that the pages_shared check would prevent most of the problems I was envisioning with leftover entries in the stable tree. Sorry about that. > Three causes: > > 1. The old stable tree (built according to the inverse merge_across_nodes) > has not been fully torn down. A stable node lingers until get_ksm_page() > notices that the page it references no longer references it: but the page > is not necessarily freed as soon as expected, particularly when swapcache. > > Fix this with a pass through the old stable tree, applying get_ksm_page() > to each of the remaining nodes (most found stale and removed immediately), > with forced removal of any left over. Unless the page is still mapped: > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > and EBUSY than BUG. > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > just behind ksmd's cursor, so there's a full pass for it to stabilize > (or be removed) before ksmd addresses it. Nice when ksmd is running, > but not so nice when we're trying to unmerge all mms: we were missing > those mms forked and inserted behind the unmerge cursor. Easily fixed > by inserting at the end when KSM_RUN_UNMERGE. > > 3. It is possible for a KSM page to be faulted back from swapcache into > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. 
> > A long outstanding, unrelated bugfix sneaks in with that third fix: > ksm_does_need_to_copy() would copy from a !PageUptodate page (implying > I/O error when read in from swap) to a page which it then marks Uptodate. > Fix this case by not copying, letting do_swap_page() discover the error. > > Signed-off-by: Hugh Dickins > --- > include/linux/ksm.h | 18 ++------- > mm/ksm.c | 83 +++++++++++++++++++++++++++++++++++++++--- > mm/memory.c | 19 ++++----- > 3 files changed, 92 insertions(+), 28 deletions(-) > > --- mmotm.orig/include/linux/ksm.h 2013-01-25 14:27:58.220193250 -0800 > +++ mmotm/include/linux/ksm.h 2013-01-25 14:37:00.764206145 -0800 > @@ -16,9 +16,6 @@ > struct stable_node; > struct mem_cgroup; > > -struct page *ksm_does_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address); > - > #ifdef CONFIG_KSM > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > @@ -73,15 +70,8 @@ static inline void set_page_stable_node( > * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE, > * but what if the vma was unmerged while the page was swapped out? 
> */ > -static inline int ksm_might_need_to_copy(struct page *page, > - struct vm_area_struct *vma, unsigned long address) > -{ > - struct anon_vma *anon_vma = page_anon_vma(page); > - > - return anon_vma && > - (anon_vma->root != vma->anon_vma->root || > - page->index != linear_page_index(vma, address)); > -} > +struct page *ksm_might_need_to_copy(struct page *page, > + struct vm_area_struct *vma, unsigned long address); > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_ > return 0; > } > > -static inline int ksm_might_need_to_copy(struct page *page, > +static inline struct page *ksm_might_need_to_copy(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - return 0; > + return page; > } > > static inline int page_referenced_ksm(struct page *page, > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > /* > * Only called through the sysfs control interface: > */ > +static int remove_stable_node(struct stable_node *stable_node) > +{ > + struct page *page; > + int err; > + > + page = get_ksm_page(stable_node, true); > + if (!page) { > + /* > + * get_ksm_page did remove_node_from_stable_tree itself. > + */ > + return 0; > + } > + > + if (WARN_ON_ONCE(page_mapped(page))) > + err = -EBUSY; > + else { > + /* It will probably be very obvious to people familiar with ksm.c but even so maybe remind the reader that the pages must already have been unmerged * This page must already have been unmerged and should be stale. * It might be in a pagevec waiting to be freed or it might be ...... > + * This page might be in a pagevec waiting to be freed, > + * or it might be PageSwapCache (perhaps under writeback), > + * or it might have been removed from swapcache a moment ago. 
> + */ > + set_page_stable_node(page, NULL); > + remove_node_from_stable_tree(stable_node); > + err = 0; > + } > + > + unlock_page(page); > + put_page(page); > + return err; > +} > + > +static int remove_all_stable_nodes(void) > +{ > + struct stable_node *stable_node; > + int nid; > + int err = 0; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + while (root_stable_tree[nid].rb_node) { > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > + struct stable_node, node); > + if (remove_stable_node(stable_node)) { > + err = -EBUSY; > + break; /* proceed to next nid */ > + } If remove_stable_node() returns an error then it's quite possible that it'll go boom when that page is encountered later but it's not guaranteed. It'd be best effort to continue removing as many of the stable nodes anyway. We're in trouble either way of course. Otherwise I didn't spot a problem so as weak as it is due my familiarity with KSM; Acked-by: Mel Gorman -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756629Ab3BETLN (ORCPT ); Tue, 5 Feb 2013 14:11:13 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36827 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755575Ab3BETLH (ORCPT ); Tue, 5 Feb 2013 14:11:07 -0500 Date: Tue, 5 Feb 2013 19:11:02 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible Message-ID: <20130205191102.GM21389@suse.de> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote: > KSM page migration is already 
supported in the case of memory hotremove, > which takes the ksm_thread_mutex across all its migrations to keep life > simple. > > But the new KSM NUMA merge_across_nodes knob introduces a problem, when > it's set to non-default 0: if a KSM page is migrated to a different NUMA > node, how do we migrate its stable node to the right tree? And what if > that collides with an existing stable node? > > So far there's no provision for that, and this patch does not attempt > to deal with it either. But how will I test a solution, when I don't > know how to hotremove memory? Just reach in and yank it straight out with a chisel. > The best answer is to enable KSM page > migration in all cases now, and test more common cases. With THP and > compaction added since KSM came in, page migration is now mainstream, > and it's a shame that a KSM page can frustrate freeing a page block. > THP will at least check if migration within a node works. It won't necessarily check we can migrate across nodes properly but it's a lot better than nothing. > Without worrying about merge_across_nodes 0 for now, this patch gets > KSM page migration working reliably for default merge_across_nodes 1 > (but leave the patch enabling it until near the end of the series). > > It's much simpler than I'd originally imagined, and does not require > an additional tier of locking: page migration relies on the page lock, > KSM page reclaim relies on the page lock, the page lock is enough for > KSM page migration too. > > Almost all the care has to be in get_ksm_page(): that's the function > which worries about when a stable node is stale and should be freed, > now it also has to worry about the KSM page being migrated. > > The only new overhead is an additional put/get/lock/unlock_page when > stable_tree_search() arrives at a matching node: to make sure migration > respects the raised page count, and so does not migrate the page while > we're busy with it here. 
That's probably avoidable, either by changing > internal interfaces from using kpage to stable_node, or by moving the > ksm_migrate_page() callsite into a page_freeze_refs() section (even if > not swapcache); but this works well, I've no urge to pull it apart now. > > (Descents of the stable tree may pass through nodes whose KSM pages are > under migration: being unlocked, the raised page count does not prevent > that, nor need it: it's safe to memcmp against either old or new page.) > > You might worry about mremap, and whether page migration's rmap_walk > to remove migration entries will find all the KSM locations where it > inserted earlier: that should already be handled, by the satisfyingly > heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). > > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++----------- > mm/migrate.c | 5 ++ > 2 files changed, 77 insertions(+), 22 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800 > @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree > * In which case we can trust the content of the page, and it > * returns the gotten page; but if the page has now been zapped, > * remove the stale node from the stable tree and return NULL. > + * But beware, the stable node's page might be being migrated. > * > * You would expect the stable_node to hold a reference to the ksm page. > * But if it increments the page's count, swapping out has to wait for > @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree > * pointing back to this stable node. This relies on freeing a PageAnon > * page to reset its page->mapping to NULL, and relies on no other use of > * a page to put something that might look like our key in page->mapping. 
> - * > - * include/linux/pagemap.h page_cache_get_speculative() is a good reference, > - * but this is different - made simpler by ksm_thread_mutex being held, but > - * interesting for assuming that no other use of the struct page could ever > - * put our expected_mapping into page->mapping (or a field of the union which > - * coincides with page->mapping). > - * > - * Note: it is possible that get_ksm_page() will return NULL one moment, > - * then page the next, if the page is in between page_freeze_refs() and > - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > * is on its way to being freed; but it is an anomaly to bear in mind. > */ > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > { > struct page *page; > void *expected_mapping; > + unsigned long kpfn; > > - page = pfn_to_page(stable_node->kpfn); > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > - if (page->mapping != expected_mapping) > - goto stale; > - if (!get_page_unless_zero(page)) > +again: > + kpfn = ACCESS_ONCE(stable_node->kpfn); > + page = pfn_to_page(kpfn); > + Ok. There should be no concern that hot-remove made the kpfn invalid because those stable tree entries should have been discarded. > + /* > + * page is computed from kpfn, so on most architectures reading > + * page->mapping is naturally ordered after reading node->kpfn, > + * but on Alpha we need to be more careful. > + */ > + smp_read_barrier_depends(); The value of page is data dependent on pfn_to_page(). Is it really possible for that to be re-ordered even on Alpha? > + if (ACCESS_ONCE(page->mapping) != expected_mapping) > goto stale; > - if (page->mapping != expected_mapping) { > + > + /* > + * We cannot do anything with the page while its refcount is 0. 
> + * Usually 0 means free, or tail of a higher-order page: in which > + * case this node is no longer referenced, and should be freed; > + * however, it might mean that the page is under page_freeze_refs(). > + * The __remove_mapping() case is easy, again the node is now stale; > + * but if page is swapcache in migrate_page_move_mapping(), it might > + * still be our page, in which case it's essential to keep the node. > + */ > + while (!get_page_unless_zero(page)) { > + /* > + * Another check for page->mapping != expected_mapping would > + * work here too. We have chosen the !PageSwapCache test to > + * optimize the common case, when the page is or is about to > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > + * in the freeze_refs section of __remove_mapping(); but Anon > + * page->mapping reset to NULL later, in free_pages_prepare(). > + */ > + if (!PageSwapCache(page)) > + goto stale; > + cpu_relax(); > + } The recheck of stable_node->kpfn after a barrier distinguishes between a free and a completed migration, that's fine. I hesitate to ask because it must be obvious, but where is the guarantee that a KSM page is in the swap cache? > + > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > put_page(page); > goto stale; > } > + > if (locked) { > lock_page(page); > - if (page->mapping != expected_mapping) { > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > unlock_page(page); > put_page(page); > goto stale; > } > } > return page; > + > stale: > + /* > + * We come here from above when page->mapping or !PageSwapCache > + * suggests that the node is stale; but it might be under migration. > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), > + * before checking whether node->kpfn has been changed. 
> + */ > + smp_rmb(); > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) > + goto again; > remove_node_from_stable_tree(stable_node); > return NULL; > } > @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s > return NULL; > > ret = memcmp_pages(page, tree_page); > + put_page(tree_page); > > - if (ret < 0) { > - put_page(tree_page); > + if (ret < 0) > node = node->rb_left; > - } else if (ret > 0) { > - put_page(tree_page); > + else if (ret > 0) > node = node->rb_right; > - } else > + else { > + /* > + * Lock and unlock the stable_node's page (which > + * might already have been migrated) so that page > + * migration is sure to notice its raised count. > + * It would be more elegant to return stable_node > + * than kpage, but that involves more changes. > + */ > + tree_page = get_ksm_page(stable_node, true); > + if (tree_page) > + unlock_page(tree_page); > return tree_page; > + } > } > > return NULL; > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa > if (stable_node) { > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); > stable_node->kpfn = page_to_pfn(newpage); > + /* > + * newpage->mapping was set in advance; now we need smp_wmb() > + * to make sure that the new stable_node->kpfn is visible > + * to get_ksm_page() before it can see that oldpage->mapping > + * has gone stale (or that PageSwapCache has been cleared). > + */ > + smp_wmb(); > + set_page_stable_node(oldpage, NULL); > } > } > #endif /* CONFIG_MIGRATION */ > --- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800 > +++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800 > @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp > > mlock_migrate_page(newpage, page); > ksm_migrate_page(newpage, page); > - > + /* > + * Please do not reorder this without considering how mm/ksm.c's > + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). 
> + */ > ClearPageSwapCache(page); > ClearPagePrivate(page); > set_page_private(page, 0); > -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753184Ab3BGX5x (ORCPT ); Thu, 7 Feb 2013 18:57:53 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:41455 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751268Ab3BGX5v (ORCPT ); Thu, 7 Feb 2013 18:57:51 -0500 Date: Thu, 7 Feb 2013 15:57:50 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Mel Gorman cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , Rik van Riel , David Rientjes , Anton Arapov , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 1/11] ksm: allow trees per NUMA node In-Reply-To: <20130205164118.GI21389@suse.de> Message-ID: References: <20130205164118.GI21389@suse.de> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 5 Feb 2013, Mel Gorman wrote: > On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote: > > From: Petr Holasek > > > > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes > > which control merging pages across different numa nodes. > > When it is set to zero only pages from the same node are merged, > > otherwise pages from all nodes can be merged together (default behavior). > > > > Typical use-case could be a lot of KVM guests on NUMA machine > > and cpus from more distant nodes would have significant increase > > of access latency to the merged ksm page. Sysfs knob was choosen > > for higher variability when some users still prefers higher amount > > of saved physical memory regardless of access latency. > > > > This is understandable but it's going to be a fairly obscure option. 
> I do not think it can be known in advance if the option should be set. > The user must either run benchmarks before and after or use perf to > record the "node-load-misses" event and see if setting the parameter > reduces the number of remote misses. Andrew made a similar point on the description of merge_across_nodes in ksm.txt. Petr's quiet at the moment, so I'll add a few more lines to that description (in an incremental patch): but be assured what I say will remain inadequate and unspecific - I don't have much idea of how to decide the setting, but assume that the people who are interested in using the knob will have a firmer idea of how to test for it. > > I don't know the internals of ksm.c at all and this is my first time reading > this series. Everything in this review is subject to being completely > wrong or due to a major misunderstanding on my part. Delete all feedback > if desired. Thank you for spending your time on it. [...snippings, but let's leave this paragraph in] > > Hugh notes that this patch brings two problems, whose solution needs > > further support in mm/ksm.c, which follows in subsequent patches: > > 1) switching merge_across_nodes after running KSM is liable to oops > > on stale nodes still left over from the previous stable tree; > > 2) memory hotremove may migrate KSM pages, but there is no provision > > here for !merge_across_nodes to migrate nodes to the proper tree. ... > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800 > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800 ... > > With multiple stable node trees does the comment that begins with > > * A few notes about the KSM scanning process, > * to make it easier to understand the data structures below: > > need an update? Okay: I won't go through it pluralizing everything, but a couple of lines on the !merge_across_nodes multiplicity of trees would be helpful. 
> > It's uninitialised so kernel data size in vmlinux should be unaffected but > it's an additional runtime cost of around 4K for a standardish enterprise > distro kernel config. Small beans on a NUMA machine and maybe not worth > the hassle of kmalloc for nr_online_nodes and dealing with node memory > hotplug but it's a pity. It's a pity, I agree; as is the addition of int nid into rmap_item on 32-bit (on 64-bit it just occupies a hole) - there can be a lot of those. We were kind of hoping that the #ifdef CONFIG_NUMA would cover it, but some distros now enable NUMA by default even on 32-bit. And it's a pity because 99% of users will leave merge_across_nodes at its default of 1 and only ever need a single tree of each kind. I'll look into starting off with just root_stable_tree[1] and root_unstable_tree[1], then kmalloc'ing nr_node_ids of them when and if merge_across_nodes is switched off. Then I don't think we need bother about hotplug. If it ends up looking clean enough, I'll add that patch. > > > #define MM_SLOTS_HASH_BITS 10 > > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_ > > /* Milliseconds ksmd should sleep between batches */ > > static unsigned int ksm_thread_sleep_millisecs = 20; > > > > +/* Zeroed when merging across nodes is not allowed */ > > +static unsigned int ksm_merge_across_nodes = 1; > > + > > Nit but initialised data does increase the size of vmlinux so maybe this > should be the "opposite". i.e. rename it to ksm_merge_within_nodes and > default it to 0? I don't find that particular increase in size very compelling! Though I would have preferred the tunable to be the opposite way around: it annoys me that the new code comes into play when !ksm_merge_across_nodes. 
However, I do find "merge across nodes" (thanks to Andrew for "across") a much more vivid description than the opposite "merge within nodes", and can't think of a better alternative for that; and wouldn't want to change it anyway at this late (v7) stage, not without Petr's consent. > > __read_mostly? I feel the same way as I did when Andrew suggested it: > > I spose this should be __read_mostly. If __read_mostly is not really a > synonym for __make_write_often_storage_slower. I continue to harbor > fear, uncertainty and doubt about this... Could do. No strong feeling, but I think I'd rather it share its cacheline with other KSM-related stuff, than be off mixed up with unrelateds. I think there's a much stronger case for __read_mostly when it's a library thing accessed by different subsystems. You're right that this variable is accessed significantly more often than the other KSM tunables, so deserves a __read_mostly more than they do. But where to stop? Similar reluctance led me to avoid using "unlikely" throughout ksm.c, unlikely as some conditions are (I'm aghast to see that Andrea sneaked in a "likely" :). > > > #define KSM_RUN_STOP 0 > > #define KSM_RUN_MERGE 1 > > #define KSM_RUN_UNMERGE 2 > > @@ -441,10 +448,25 @@ out: page = NULL; > > return page; > > } > > > > +/* > > + * This helper is used for getting right index into array of tree roots. > > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for > > + * stable and unstable pages from all nodes with roots in index 0. Otherwise, > > + * every node has its own stable and unstable tree. 
> > + */ > > +static inline int get_kpfn_nid(unsigned long kpfn) > > +{ > > + if (ksm_merge_across_nodes) > > + return 0; > > + else > > + return pfn_to_nid(kpfn); > > +} > > + > > If we start with ksm_merge_across_nodes, KSM runs for a while and populates > the stable node tree for node 0 and then ksm_merge_across_nodes gets set > then badness happens because this can go anywhere > > nid = get_kpfn_nid(stable_node->kpfn); > rb_erase(&stable_node->node, &root_stable_tree[nid]); > > Very late in the review I noticed that you comment on this already in the > changelog and that it is addressed later in the series. I haven't seen Yes. Nobody's git bisection will be thwarted by this defect, so I'm happy for Petr's patch to go in as is first, then fix applied after. And even in this patch, there's already a pages_shared 0 test: which is inadequate, but covers the common case. > this patch yet so the following suggestion is very stale but might still > be relevant. > > We could increase size of root_stable_node[] by 1, have > get_kpfn_nid return MAX_NR_NODES if ksm_merge_across_nodes and > if ksm_merge_across_nodes gets set to 0 then we walk the stable > tree at root_stable_tree[MAX_NR_NODES] and delete the entire > tree? It'd be disruptive as hell unfortunately and might break > entirely if there is not enough memory to unshare the pages. > > Ideally we could take our time walking root_stable_tree[MAX_NR_NODES] > without worrying about collisions and fix it up somehow. Dunno Petr's intention was that we just be disruptive, and insist on the old tree being torn down first: it was merely a defect that this patch does not quite ensure that. You're right that we could be cleverer: in the light of the changes I ended up making for collisions in migration, maybe that approach could be extended to switching merge_across_nodes.
But I think you'll agree that switching merge_across_nodes is a path that needs to be handled correctly, but no way does it need optimization: people will do it when they're trying to work out the right tuning for their loads, and thereafter probably never again. > > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s > > age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); > > BUG_ON(age > 1); > > if (!age) > > - rb_erase(&rmap_item->node, &root_unstable_tree); > > +#ifdef CONFIG_NUMA > > + rb_erase(&rmap_item->node, > > + &root_unstable_tree[rmap_item->nid]); > > +#else > > + rb_erase(&rmap_item->node, &root_unstable_tree[0]); > > +#endif > > > > nit, does rmap_item->nid deserve a getter and setter helper instead? I found that part ugly too: it gets macro helpers in trivial tidyups 3/11, though not quite the getter/setter helpers you had in mind. > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > > return NULL; > > } > > > > + /* > > + * If tree_page has been migrated to another NUMA node, it > > + * will be flushed out and put into the right unstable tree > > + * next time: only merge with it if merge_across_nodes. > > + * Just notice, we don't have similar problem for PageKsm > > + * because their migration is disabled now. (62b61f611e) > > + */ > > + if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { > > + put_page(tree_page); > > + return NULL; > > + } > > + > > What about this case? > > 1. ksm_merge_across_nodes==0 > 2. pages gets placed on different unstable trees > 3. ksm_merge_across_nodes==1 > > At that point we should be removing pages from the different unstable > tree and moving them to root_unstable_tree[0] but this put_page() doesn't > happen. Does it matter? It doesn't matter. The general philosophy in ksm.c is to be very lazy about the unstable tree: all kinds of things can go "wrong" with it temporarily, that's okay so long as we don't fall for errors that would persist round after round. 
The check above is required (somewhere) to make sure that we don't merge pages from different nodes into the same stable tree when the switch says not to do that. But the case that you're thinking of, it'll just sort itself out in a later round (I think you later realized how the unstable tree is rebuilt from scratch each round). Or have I misunderstood: are you worrying that a put_page() is missing? I don't see that. But now you point me to this block, I do wonder if we could place it better. When I came to worry about such an issue in the stable tree, I decided that it's perfectly okay to use a page from the wrong node for an intermediate test, and suboptimal to give up at that point, just wrong to return it as a final match. But here we give up even when it's an intermediate: seems inconsistent, I'll give it some more thought later, and probably want to move it: it's not wrong as is, but I think it could be more efficient and more consistent. > > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r > > */ > > lru_add_drain_all(); > > > > - root_unstable_tree = RB_ROOT; > > + for (nid = 0; nid < nr_node_ids; nid++) > > + root_unstable_tree[nid] = RB_ROOT; > > > > Minor but you shouldn't need to reset them all if > ksm_merge_across_nodes==1 True; and I'll need to attend to this if we do move away from the static allocation of root_unstable_tree[MAX_NUMNODES]. > > Initially this triggered an alarm because it's not immediately obvious > why you can just discard an rbtree like this. It looks like because the > unstable tree is also part of a linked list so the rb representation can > be reset quickly without leaking memory. Right, it takes a while to get your head around the way we just forget the old tree and start again each time. There's a funny place in remove_rmap_item_from_tree() (visible in an earlier extract) where it has to consider the "age" of the rmap_item, to decide whether it's linked into the current tree or not.
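For readers following along, the "age" test can be sketched in isolation. This is only an illustration, not the code in mm/ksm.c: it assumes, as the quoted remove_rmap_item_from_tree() extract suggests, that the low byte of rmap_item->address carries the scan sequence number at which the item was linked into the unstable tree. The names item_sketch, SEQNR_MASK_SKETCH, mark_inserted() and in_current_unstable_tree() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for rmap_item: only the field the test needs. */
struct item_sketch {
	unsigned long address;	/* page-aligned address | low byte of seqnr */
};

#define SEQNR_MASK_SKETCH 0xffUL	/* assumed: low 8 bits hold the seqnr */

/* Record which scan round linked this item into the unstable tree. */
static void mark_inserted(struct item_sketch *item, unsigned long seqnr)
{
	item->address &= ~SEQNR_MASK_SKETCH;
	item->address |= seqnr & SEQNR_MASK_SKETCH;
}

/*
 * age 0: linked in during the current round, so the item is still in the
 * current tree and would need an rb_erase().  age 1: linked in during the
 * previous round; that tree has already been forgotten, nothing to erase.
 */
static bool in_current_unstable_tree(const struct item_sketch *item,
				     unsigned long seqnr)
{
	unsigned char age = (unsigned char)(seqnr - item->address);

	assert(age <= 1);	/* mirrors the BUG_ON(age > 1) in the extract */
	return age == 0;
}
```

Because only the low byte of the difference is kept, the comparison survives the sequence number wrapping from 255 to 256, which is presumably why the extract casts to unsigned char rather than comparing full values.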
Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755079Ab3BHAHL (ORCPT ); Thu, 7 Feb 2013 19:07:11 -0500 Received: from mail-da0-f48.google.com ([209.85.210.48]:36107 "EHLO mail-da0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751542Ab3BHAHJ (ORCPT ); Thu, 7 Feb 2013 19:07:09 -0500 Date: Thu, 7 Feb 2013 16:07:17 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Mel Gorman cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree In-Reply-To: <20130205164823.GJ21389@suse.de> Message-ID: References: <20130205164823.GJ21389@suse.de> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 5 Feb 2013, Mel Gorman wrote: > On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote: > > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient > > (restarting whenever it finds a stale node to remove), but rearrange > > so that at least it does not needlessly restart from nid 0 each time. > > And add a couple of comments: here is why we keep pfn instead of page. 
> > > > Signed-off-by: Hugh Dickins > > --- > > mm/ksm.c | 38 ++++++++++++++++++++++---------------- > > 1 file changed, 22 insertions(+), 16 deletions(-) > > > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa > > #endif /* CONFIG_MIGRATION */ > > > > #ifdef CONFIG_MEMORY_HOTREMOVE > > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, > > - unsigned long end_pfn) > > +static void ksm_check_stable_tree(unsigned long start_pfn, > > + unsigned long end_pfn) > > { > > + struct stable_node *stable_node; > > struct rb_node *node; > > int nid; > > > > - for (nid = 0; nid < nr_node_ids; nid++) > > - for (node = rb_first(&root_stable_tree[nid]); node; > > - node = rb_next(node)) { > > - struct stable_node *stable_node; > > - > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + node = rb_first(&root_stable_tree[nid]); > > + while (node) { > > This is not your fault, the old code is wrong too. It is assuming that all > nodes are populated in numeric orders with no holes. It won't work if just > two nodes 0 and 4 are online. It should be using for_each_online_node(). If the old code is wrong, it probably would be my fault! But I believe this is okay: these rb_roots we're looking at, they are in memory which is not being offlined, and the trees for offline nodes will simply be empty, won't they? Something's badly wrong if otherwise. I certainly prefer to avoid for_each_online_node() etc: maybe I'm confusing with for_each_online_something_else(), but experience tells that you can get into nasty hotplug mutex ordering issues with those things - not worth the pain if you can easily and safely avoid them. 
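The shape of the rearranged walk can be sketched as follows. This is a simplified model under stated assumptions, not the patch itself: a singly linked list stands in for each rbtree, the node count is a small fixed constant, and the range test [start_pfn, end_pfn) is assumed rather than taken from the quoted hunk. All names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODE_IDS_SKETCH 4	/* assumed small fixed count, for illustration */

/* Hypothetical stand-in for a stable node; a list replaces the rbtree. */
struct snode_sketch {
	unsigned long kpfn;
	struct snode_sketch *next;
};

static struct snode_sketch *roots_sketch[NR_NODE_IDS_SKETCH];

/* Unlink one node from the list belonging to the given nid. */
static void unlink_sketch(int nid, struct snode_sketch *victim)
{
	struct snode_sketch **pp = &roots_sketch[nid];

	while (*pp != victim)
		pp = &(*pp)->next;
	*pp = victim->next;
}

/*
 * Drop every node whose kpfn lies in [start_pfn, end_pfn).  After a
 * removal the walk restarts at the head of the same nid only, never from
 * nid 0.  Roots of offline nodes are simply empty, so a plain loop over
 * all ids is harmless.  Returns the number of nodes removed.
 */
static int check_stable_sketch(unsigned long start_pfn, unsigned long end_pfn)
{
	int nid, removed = 0;

	for (nid = 0; nid < NR_NODE_IDS_SKETCH; nid++) {
		struct snode_sketch *node = roots_sketch[nid];

		while (node) {
			if (node->kpfn >= start_pfn && node->kpfn < end_pfn) {
				unlink_sketch(nid, node);
				removed++;
				node = roots_sketch[nid];	/* restart this nid */
			} else {
				node = node->next;
			}
		}
	}
	return removed;
}
```

With nodes only on ids 0 and 3, the loop passes over the empty roots for ids 1 and 2 without touching any hotplug state, which is the point made in the reply above.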
Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751499Ab3BHAdv (ORCPT ); Thu, 7 Feb 2013 19:33:51 -0500 Received: from mail-da0-f41.google.com ([209.85.210.41]:45618 "EHLO mail-da0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750755Ab3BHAdu (ORCPT ); Thu, 7 Feb 2013 19:33:50 -0500 Date: Thu, 7 Feb 2013 16:33:58 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Mel Gorman cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked In-Reply-To: <20130205171805.GK21389@suse.de> Message-ID: References: <20130205171805.GK21389@suse.de> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 5 Feb 2013, Mel Gorman wrote: > On Fri, Jan 25, 2013 at 06:00:50PM -0800, Hugh Dickins wrote: > > In some places where get_ksm_page() is used, we need the page to be locked. > > > > When KSM migration is fully enabled, we shall want that to make sure that > > the page just acquired cannot be migrated beneath us (raised page count is > > only effective when there is serialization to make sure migration notices). > > Whereas when navigating through the stable tree, we certainly do not want > > to lock each node (raised page count is enough to guarantee the memcmps, > > even if page is migrated to another node). > > > > Since we're about to add another use case, add the locked argument to > > get_ksm_page() now. > > > > Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I > > really got the wrong end of the stick on that! 
There's a configuration > > in which page_cache_get_speculative() can do something cheaper than > > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have > > disabled preemption for it. There's no need for rcu_read_lock() around > > get_page_unless_zero() (and mapping checks) here. Cut out that > > silliness before making this any harder to understand. > > > > Signed-off-by: Hugh Dickins > > --- > > mm/ksm.c | 23 +++++++++++++---------- > > 1 file changed, 13 insertions(+), 10 deletions(-) > > > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > > * but this is different - made simpler by ksm_thread_mutex being held, but > > * interesting for assuming that no other use of the struct page could ever > > * put our expected_mapping into page->mapping (or a field of the union which > > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > > - * to keep the page_count protocol described with page_cache_get_speculative. > > + * coincides with page->mapping). > > * > > * Note: it is possible that get_ksm_page() will return NULL one moment, > > * then page the next, if the page is in between page_freeze_refs() and > > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > > * is on its way to being freed; but it is an anomaly to bear in mind. > > */ > > -static struct page *get_ksm_page(struct stable_node *stable_node) > > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > > { > > The naming is unhelpful :( > > Because the second parameter is called "locked", it implies that the > caller of this function holds the page lock (which is obviously very > silly). ret_locked maybe? I'd prefer "lock_it": I'll make that change unless you've a better. 
> > As the function is akin to find_lock_page I would prefer if there was > a new get_lock_ksm_page() instead of locking depending on the value of a > parameter. I demur. If it were a global interface rather than a function static to ksm.c, yes, I'm sure Linus would side very strongly with you, and I'd be providing a pair of wrappers to get_ksm_page() to hide the bool arg. But this is a private function (you're invited :) which doesn't need that level of hand-holding. And I'm a firm believer in having one, difficult, function where all the heavy thought is focussed, which does the nasty work and spares everywhere else from having to worry about the difficulties. You'll shiver with horror as I recite shmem_getpage(_gfp), page_lock_anon_vma(_read), page_relock_lruvec (well, that one did not yet get beyond its posting): get_ksm_page is one of those. > We can do this because expected_mapping is recorded by the > stable_node and we only need to recalculate it if the page has been > successfully pinned. We calculate the expected value twice but that's > not earth shattering. It'd look something like; > > /* > * get_lock_ksm_page: Similar to get_ksm_page except returns with page > * locked and pinned > */ > static struct page *get_lock_ksm_page(struct stable_node *stable_node) > { > struct page *page = get_ksm_page(stable_node); > > if (page) { > expected_mapping = (void *)stable_node + > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > lock_page(page); > if (page->mapping != expected_mapping) { > unlock_page(page); > > /* release pin taken by get_ksm_page() */ > put_page(page); > page = NULL; > } > } > > return page; > } Something like; but would also need the remove_node_from_stable_tree. > > Up to you, I'm not going to make a big deal of it. Phew! Probably my insistence springs from knowing what this function develops into a few patches later, rather than the simpler version that appears at this stage of the series. > > FWIW, I agree that removing rcu_read_lock() is fine. 
Good, thanks, I was rather embarrassed by my misunderstanding there. Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946836Ab3BHSpY (ORCPT ); Fri, 8 Feb 2013 13:45:24 -0500 Received: from e06smtp14.uk.ibm.com ([195.75.94.110]:34569 "EHLO e06smtp14.uk.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946783Ab3BHSpV (ORCPT ); Fri, 8 Feb 2013 13:45:21 -0500 Date: Fri, 8 Feb 2013 19:45:10 +0100 From: Gerald Schaefer To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning Message-ID: <20130208194510.65fadd37@thinkpad.boeblingen.de.com> In-Reply-To: References: X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit x-cbid: 13020818-1948-0000-0000-00000441A517 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Jan 2013 18:10:18 -0800 (PST) Hugh Dickins wrote: > Complaints are rare, but lockdep still does not understand the way > ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and > holds it until the ksm_memory_callback(MEM_OFFLINE): that appears > to be a problem because notifier callbacks are made under down_read > of blocking_notifier_head->rwsem (so first the mutex is taken while > holding the rwsem, then later the rwsem is taken while still holding > the mutex); but is not in fact a problem because mem_hotplug_mutex > is held throughout the dance. > > There was an attempt to fix this with mutex_lock_nested(); but if that > happened to fool lockdep two years ago, apparently it does so no > longer. > > I had hoped to eradicate this issue in extending KSM page migration > not to need the ksm_thread_mutex. 
But then realized that although > the page migration itself is safe, we do still need to lock out ksmd > and other users of get_ksm_page() while offlining memory - at some > point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages > themselves may vanish, and get_ksm_page()'s accesses to them become a > violation. > > So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE > to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and > wait_while_offlining() checks, to achieve the same lockout without > being caught by lockdep. This is less elegant for KSM, but it's more > important to keep lockdep useful to other users - and I apologize for > how long it took to fix. Thanks a lot for the patch! I verified that it fixes the lockdep warning that we got on memory hotremove. > > Reported-by: Gerald Schaefer > Signed-off-by: Hugh Dickins > --- > mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 41 insertions(+), 14 deletions(-) > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800 > +++ mmotm/mm/ksm.c 2013-01-25 14:38:53.984208836 -0800 > @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod > #define KSM_RUN_STOP 0 > #define KSM_RUN_MERGE 1 > #define KSM_RUN_UNMERGE 2 > -static unsigned int ksm_run = KSM_RUN_STOP; > +#define KSM_RUN_OFFLINE 4 > +static unsigned long ksm_run = KSM_RUN_STOP; > +static void wait_while_offlining(void); > > static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); > static DEFINE_MUTEX(ksm_thread_mutex); > @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing > > while (!kthread_should_stop()) { > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksmd_should_run()) > ksm_do_scan(ksm_thread_pages_to_scan); > mutex_unlock(&ksm_thread_mutex); > @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa > #endif /* CONFIG_MIGRATION */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > +static int just_wait(void *word) > +{ > + schedule(); > + return 0; > +} > + > +static 
void wait_while_offlining(void) > +{ > + while (ksm_run & KSM_RUN_OFFLINE) { > + mutex_unlock(&ksm_thread_mutex); > + wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), > + just_wait, TASK_UNINTERRUPTIBLE); > + mutex_lock(&ksm_thread_mutex); > + } > +} > + > static void ksm_check_stable_tree(unsigned long start_pfn, > unsigned long end_pfn) > { > @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no > switch (action) { > case MEM_GOING_OFFLINE: > /* > - * Keep it very simple for now: just lock out ksmd > and > - * MADV_UNMERGEABLE while any memory is going > offline. > - * mutex_lock_nested() is necessary because lockdep > was alarmed > - * that here we take ksm_thread_mutex inside > notifier chain > - * mutex, and later take notifier chain mutex inside > - * ksm_thread_mutex to unlock it. But that's safe > because both > - * are inside mem_hotplug_mutex. > + * Prevent ksm_do_scan(), > unmerge_and_remove_all_rmap_items() > + * and remove_all_stable_nodes() while memory is > going offline: > + * it is unsafe for them to touch the stable tree at > this time. > + * But unmerge_ksm_pages(), rmap lookups and other > entry points > + * which do not need the ksm_thread_mutex are all > safe. 
*/ > - mutex_lock_nested(&ksm_thread_mutex, > SINGLE_DEPTH_NESTING); > + mutex_lock(&ksm_thread_mutex); > + ksm_run |= KSM_RUN_OFFLINE; > + mutex_unlock(&ksm_thread_mutex); > break; > > case MEM_OFFLINE: > @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no > /* fallthrough */ > > case MEM_CANCEL_OFFLINE: > + mutex_lock(&ksm_thread_mutex); > + ksm_run &= ~KSM_RUN_OFFLINE; > mutex_unlock(&ksm_thread_mutex); > + > + smp_mb(); /* wake_up_bit advises this */ > + wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); > break; > } > return NOTIFY_OK; > } > +#else > +static void wait_while_offlining(void) > +{ > +} > #endif /* CONFIG_MEMORY_HOTREMOVE */ > > #ifdef CONFIG_SYSFS > @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan); > static ssize_t run_show(struct kobject *kobj, struct kobj_attribute > *attr, char *buf) > { > - return sprintf(buf, "%u\n", ksm_run); > + return sprintf(buf, "%lu\n", ksm_run); > } > > static ssize_t run_store(struct kobject *kobj, struct kobj_attribute > *attr, @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject > */ > > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksm_run != flags) { > ksm_run = flags; > if (flags & KSM_RUN_UNMERGE) { > @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store( > return -EINVAL; > > mutex_lock(&ksm_thread_mutex); > + wait_while_offlining(); > if (ksm_merge_across_nodes != knob) { > if (ksm_pages_shared || remove_all_stable_nodes()) > err = -EBUSY; > @@ -2366,10 +2396,7 @@ static int __init ksm_init(void) > #endif /* CONFIG_SYSFS */ > > #ifdef CONFIG_MEMORY_HOTREMOVE > - /* > - * Choose a high priority since the callback takes > ksm_thread_mutex: > - * later callbacks could only be taking locks which nest > within that. 
> - */ > + /* There is no significance to this priority 100 */ > hotplug_memory_notifier(ksm_memory_callback, 100); > #endif > return 0; > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1947013Ab3BHTdt (ORCPT ); Fri, 8 Feb 2013 14:33:49 -0500 Received: from mail-pa0-f45.google.com ([209.85.220.45]:57691 "EHLO mail-pa0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946873Ab3BHTdh (ORCPT ); Fri, 8 Feb 2013 14:33:37 -0500 Date: Fri, 8 Feb 2013 11:33:40 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Mel Gorman cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <20130205175551.GL21389@suse.de> Message-ID: References: <20130205175551.GL21389@suse.de> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 5 Feb 2013, Mel Gorman wrote: > On Fri, Jan 25, 2013 at 06:01:59PM -0800, Hugh Dickins wrote: > > Switching merge_across_nodes after running KSM is liable to oops on stale > > nodes still left over from the previous stable tree. It's not something > > that people will often want to do, but it would be lame to demand a reboot > > when they're trying to determine which merge_across_nodes setting is best. > > > > How can this happen? We only permit switching merge_across_nodes when > > pages_shared is 0, and usually set run 2 to force that beforehand, which > > ought to unmerge everything: yet oopses still occur when you then run 1. > > > > When reviewing patch 1, I missed that the pages_shared check would prevent > most of the problems I was envisioning with leftover entries in the > stable tree. Sorry about that. No apology necessary! 
> > > Three causes: > > > > 1. The old stable tree (built according to the inverse merge_across_nodes) > > has not been fully torn down. A stable node lingers until get_ksm_page() > > notices that the page it references no longer references it: but the page > > is not necessarily freed as soon as expected, particularly when swapcache. > > > > Fix this with a pass through the old stable tree, applying get_ksm_page() > > to each of the remaining nodes (most found stale and removed immediately), > > with forced removal of any left over. Unless the page is still mapped: > > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE > > and EBUSY than BUG. But once I applied the testing for this to the completed patch series, I did start seeing that WARN_ON_ONCE: it's made safe by the EBUSY, but not working as intended. Cause outlined below. > > > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > > just behind ksmd's cursor, so there's a full pass for it to stabilize > > (or be removed) before ksmd addresses it. Nice when ksmd is running, > > but not so nice when we're trying to unmerge all mms: we were missing > > those mms forked and inserted behind the unmerge cursor. Easily fixed > > by inserting at the end when KSM_RUN_UNMERGE. > > > > 3. It is possible for a KSM page to be faulted back from swapcache into > > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. What I found is that a 4th cause emerges once KSM migration is properly working: that interval during page migration when the old page has been fully unmapped but the new not yet mapped in its place. The KSM COW breaking cannot see a page there then, so it ends up with a (newly migrated) KSM page left behind. 
Almost certainly has to be fixed in follow_page(), but I've not yet settled on its final form - the fix I have works well, but a different approach might be better. I'm also puzzled that I've never in practice been hit by a 5th cause: swapoff's try_to_unuse() is much like faulting, and ought to have the same ksm_might_need_to_copy() safeguards as faulting (or at least, I cannot see why not). > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > > +++ mmotm/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800 > > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a > > /* > > * Only called through the sysfs control interface: > > */ > > +static int remove_stable_node(struct stable_node *stable_node) > > +{ > > + struct page *page; > > + int err; > > + > > + page = get_ksm_page(stable_node, true); > > + if (!page) { > > + /* > > + * get_ksm_page did remove_node_from_stable_tree itself. > > + */ > > + return 0; > > + } > > + > > + if (WARN_ON_ONCE(page_mapped(page))) > > + err = -EBUSY; > > + else { > > + /* > > It will probably be very obvious to people familiar with ksm.c but even > so maybe remind the reader that the pages must already have been unmerged > > * This page must already have been unmerged and should be stale. > * It might be in a pagevec waiting to be freed or it might be Okay, I'll add a little more comment there; but I need to think longer for exactly how to express it. > ...... > > > > > + * This page might be in a pagevec waiting to be freed, > > + * or it might be PageSwapCache (perhaps under writeback), > > + * or it might have been removed from swapcache a moment ago. 
> > + */ > > + set_page_stable_node(page, NULL); > > + remove_node_from_stable_tree(stable_node); > > + err = 0; > > + } > > + > > + unlock_page(page); > > + put_page(page); > > + return err; > > +} > > + > > +static int remove_all_stable_nodes(void) > > +{ > > + struct stable_node *stable_node; > > + int nid; > > + int err = 0; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + while (root_stable_tree[nid].rb_node) { > > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > > + struct stable_node, node); > > + if (remove_stable_node(stable_node)) { > > + err = -EBUSY; > > + break; /* proceed to next nid */ > > + } > > If remove_stable_node() returns an error then it's quite possible that it'll > go boom when that page is encountered later but it's not guaranteed. It'd > be best effort to continue removing as many of the stable nodes anyway. > We're in trouble either way of course. If it returns an error, then indeed something we don't yet understand has occurred, and we shall want to debug it. But unless it's due to corruption somewhere, we shouldn't be in much trouble, shouldn't go boom: remove_all_stable_nodes() error is ignored at the end of unmerging, it will be tried again when changing merge_across_nodes, and an error then will just prevent changing merge_across_nodes at that time. So the mysteriously unremovable stable nodes remain the same kind of tree. 
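The error path described here can be modelled in a few lines. This is a sketch under assumptions, not the kernel code: lists stand in for the rbtrees, a still_mapped flag stands in for the page_mapped() test, and the pages_shared check that the real store path also makes is left out. All names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NIDS_SKETCH 3	/* assumed small fixed count, for illustration */

/* Hypothetical stand-in for a stable node. */
struct stable_sketch {
	bool still_mapped;		/* stands in for page_mapped(page) */
	struct stable_sketch *next;
};

static struct stable_sketch *tree_sketch[NR_NIDS_SKETCH];

/* Remove the first node of a nid's tree, or refuse if it is still mapped. */
static int remove_one_sketch(int nid)
{
	struct stable_sketch *node = tree_sketch[nid];

	if (node->still_mapped)
		return -EBUSY;
	tree_sketch[nid] = node->next;
	return 0;
}

/* On failure, remember the error and proceed to the next nid. */
static int remove_all_sketch(void)
{
	int nid, err = 0;

	for (nid = 0; nid < NR_NIDS_SKETCH; nid++) {
		while (tree_sketch[nid]) {
			if (remove_one_sketch(nid)) {
				err = -EBUSY;
				break;
			}
		}
	}
	return err;
}

/* Models the knob store: the switch is refused while any node remains. */
static int switch_knob_sketch(unsigned int *knob, unsigned int new_value)
{
	if (*knob == new_value)
		return 0;
	if (remove_all_sketch())
		return -EBUSY;
	*knob = new_value;
	return 0;
}
```

A node that cannot be removed leaves the knob at its old value, so whatever remains in the trees still matches the setting they were built under; a later attempt, once the node has become removable, succeeds.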
> > Otherwise I didn't spot a problem so as weak as it is due to my familiarity > with KSM; > > Acked-by: Mel Gorman Thanks, Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1947072Ab3BHUwF (ORCPT ); Fri, 8 Feb 2013 15:52:05 -0500 Received: from mail-da0-f47.google.com ([209.85.210.47]:56978 "EHLO mail-da0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946936Ab3BHUwE (ORCPT ); Fri, 8 Feb 2013 15:52:04 -0500 Date: Fri, 8 Feb 2013 12:52:12 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Mel Gorman cc: "Paul E. McKenney" , Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible In-Reply-To: <20130205191102.GM21389@suse.de> Message-ID: References: <20130205191102.GM21389@suse.de> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Paul, I've added you to the Cc in the hope that you can shed your light on an smp_read_barrier_depends() question with which Mel taxes me below. You may ask for more context: linux-next currently has an mm/ksm.c after this patch is applied, but you may have questions beyond that - thanks! On Tue, 5 Feb 2013, Mel Gorman wrote: > On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote: > > KSM page migration is already supported in the case of memory hotremove, > > which takes the ksm_thread_mutex across all its migrations to keep life > > simple. > > > > But the new KSM NUMA merge_across_nodes knob introduces a problem, when > > it's set to non-default 0: if a KSM page is migrated to a different NUMA > > node, how do we migrate its stable node to the right tree? And what if > > that collides with an existing stable node?
> > > > So far there's no provision for that, and this patch does not attempt > > to deal with it either. But how will I test a solution, when I don't > > know how to hotremove memory? > > Just reach in and yank it straight out with a chisel. :) > > > The best answer is to enable KSM page > > migration in all cases now, and test more common cases. With THP and > > compaction added since KSM came in, page migration is now mainstream, > > and it's a shame that a KSM page can frustrate freeing a page block. > > > > THP will at least check if migration within a node works. It won't > necessarily check we can migrate across nodes properly but it's a lot > better than nothing. No, I went back and dug out a hack-patch I was using three or four years ago: occasionally on fault, just migrate every possible page in that mm for no reason other than to test page migration. > > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > > { > > struct page *page; > > void *expected_mapping; > > + unsigned long kpfn; > > > > - page = pfn_to_page(stable_node->kpfn); > > expected_mapping = (void *)stable_node + > > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); > > - if (page->mapping != expected_mapping) > > - goto stale; > > - if (!get_page_unless_zero(page)) > > +again: > > + kpfn = ACCESS_ONCE(stable_node->kpfn); > > + page = pfn_to_page(kpfn); > > + > > Ok. > > There should be no concern that hot-remove made the kpfn invalid because > those stable tree entries should have been discarded. Yes. > > > + /* > > + * page is computed from kpfn, so on most architectures reading > > + * page->mapping is naturally ordered after reading node->kpfn, > > + * but on Alpha we need to be more careful. > > + */ > > + smp_read_barrier_depends(); > > The value of page is data dependent on pfn_to_page(). Is it really possible > for that to be re-ordered even on Alpha?
My intuition (to say "understanding" would be an exaggeration) is that on Alpha a very old value of page->mapping (in the line below) might be lying around and read from one cache, which has not necessarily been invalidated by ksm_migrate_page() pointing stable_node->kpfn to this new page. And if that happens, we could easily and mistakenly conclude that this stable node is stale: although there's an smp_rmb() after goto stale, stable_node->kpfn would still match kpfn, and we wrongly remove the node. My confidence that I've expressed that clearly in words, is lower than my confidence that I've coded it right; and if I'm wrong, yes, surely it's better to remove any cargo-cult smp_read_barrier_depends(). > > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) > > goto stale; > > - if (page->mapping != expected_mapping) { > > + > > + /* > > + * We cannot do anything with the page while its refcount is 0. > > + * Usually 0 means free, or tail of a higher-order page: in which > > + * case this node is no longer referenced, and should be freed; > > + * however, it might mean that the page is under page_freeze_refs(). > > + * The __remove_mapping() case is easy, again the node is now stale; > > + * but if page is swapcache in migrate_page_move_mapping(), it might > > + * still be our page, in which case it's essential to keep the node. > > + */ > > + while (!get_page_unless_zero(page)) { > > + /* > > + * Another check for page->mapping != expected_mapping would > > + * work here too. We have chosen the !PageSwapCache test to > > + * optimize the common case, when the page is or is about to > > + * be freed: PageSwapCache is cleared (under spin_lock_irq) > > + * in the freeze_refs section of __remove_mapping(); but Anon > > + * page->mapping reset to NULL later, in free_pages_prepare(). 
> > + */ > > + if (!PageSwapCache(page)) > > + goto stale; > > + cpu_relax(); > > + } > > The recheck of stable_node->kpfn check after a barrier distinguishes between > a free and a completed migration, that's fine. I'm hesitate to ask because > it must be obvious but where is the guarantee that a KSM page is in the > swap cache? Certainly none at all: it's the less common case that a KSM page is in swap cache. But if it is not in swap cache, how could its page count be 0 (causing get_page_unless_zero to fail)? By being free, or well on its way to being freed (hence stale); or reused as part of a compound page (hence stale also); or reused for another purpose which arrives at a page_freeze_refs() (hence stale also); other cases? It's hard to see from the diff, but in the original version of get_ksm_page(), !get_page_unless_zero goes straight to stale. Don't for a moment imagine that this function sprang fully formed from my mind: it was hard to get it working right (the swap cache get_page_unless_zero failure during migration really caught me out), and then to pare it down to its fairly simple final form. Hugh > > > + > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > put_page(page); > > goto stale; > > } > > + > > if (locked) { > > lock_page(page); > > - if (page->mapping != expected_mapping) { > > + if (ACCESS_ONCE(page->mapping) != expected_mapping) { > > unlock_page(page); > > put_page(page); > > goto stale; > > } > > } > > return page; > > + > > stale: > > + /* > > + * We come here from above when page->mapping or !PageSwapCache > > + * suggests that the node is stale; but it might be under migration. > > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), > > + * before checking whether node->kpfn has been changed. 
> > + */ > > + smp_rmb(); > > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn) > > + goto again; > > remove_node_from_stable_tree(stable_node); > > return NULL; > > } > > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa > > if (stable_node) { > > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); > > stable_node->kpfn = page_to_pfn(newpage); > > + /* > > + * newpage->mapping was set in advance; now we need smp_wmb() > > + * to make sure that the new stable_node->kpfn is visible > > + * to get_ksm_page() before it can see that oldpage->mapping > > + * has gone stale (or that PageSwapCache has been cleared). > > + */ > > + smp_wmb(); > > + set_page_stable_node(oldpage, NULL); > > } > > } > > #endif /* CONFIG_MIGRATION */ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932761Ab3BKWNj (ORCPT ); Mon, 11 Feb 2013 17:13:39 -0500 Received: from mail-pa0-f50.google.com ([209.85.220.50]:54368 "EHLO mail-pa0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932253Ab3BKWNh (ORCPT ); Mon, 11 Feb 2013 17:13:37 -0500 Date: Mon, 11 Feb 2013 14:13:48 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Gerald Schaefer cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 11/11] ksm: stop hotremove lockdep warning In-Reply-To: <20130208194510.65fadd37@thinkpad.boeblingen.de.com> Message-ID: References: <20130208194510.65fadd37@thinkpad.boeblingen.de.com> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 8 Feb 2013, Gerald Schaefer wrote: > On Fri, 25 Jan 2013 18:10:18 -0800 (PST) > Hugh Dickins wrote: > > > Complaints are rare, but lockdep still does not understand the way > > 
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and > > holds it until the ksm_memory_callback(MEM_OFFLINE): that appears > > to be a problem because notifier callbacks are made under down_read > > of blocking_notifier_head->rwsem (so first the mutex is taken while > > holding the rwsem, then later the rwsem is taken while still holding > > the mutex); but is not in fact a problem because mem_hotplug_mutex > > is held throughout the dance. > > > > There was an attempt to fix this with mutex_lock_nested(); but if that > > happened to fool lockdep two years ago, apparently it does so no > > longer. > > > > I had hoped to eradicate this issue in extending KSM page migration > > not to need the ksm_thread_mutex. But then realized that although > > the page migration itself is safe, we do still need to lock out ksmd > > and other users of get_ksm_page() while offlining memory - at some > > point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages > > themselves may vanish, and get_ksm_page()'s accesses to them become a > > violation. > > > > So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE > > to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and > > wait_while_offlining() checks, to achieve the same lockout without > > being caught by lockdep. This is less elegant for KSM, but it's more > > important to keep lockdep useful to other users - and I apologize for > > how long it took to fix. > > Thanks a lot for the patch! I verified that it fixes the lockdep warning > that we got on memory hotremove. > > > > > Reported-by: Gerald Schaefer > > Signed-off-by: Hugh Dickins Thank you for reporting and testing and reporting back: sorry again for taking so long to fix it. 
Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934147Ab3BNLaL (ORCPT ); Thu, 14 Feb 2013 06:30:11 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59314 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759063Ab3BNLaK (ORCPT ); Thu, 14 Feb 2013 06:30:10 -0500 Date: Thu, 14 Feb 2013 11:30:05 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Message-ID: <20130214113005.GA7367@suse.de> References: <20130205164823.GJ21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 07, 2013 at 04:07:17PM -0800, Hugh Dickins wrote: > On Tue, 5 Feb 2013, Mel Gorman wrote: > > On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote: > > > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient > > > (restarting whenever it finds a stale node to remove), but rearrange > > > so that at least it does not needlessly restart from nid 0 each time. > > > And add a couple of comments: here is why we keep pfn instead of page. 
> > > > > > Signed-off-by: Hugh Dickins > > > --- > > > mm/ksm.c | 38 ++++++++++++++++++++++---------------- > > > 1 file changed, 22 insertions(+), 16 deletions(-) > > > > > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800 > > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > > > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa > > > #endif /* CONFIG_MIGRATION */ > > > > > > #ifdef CONFIG_MEMORY_HOTREMOVE > > > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn, > > > - unsigned long end_pfn) > > > +static void ksm_check_stable_tree(unsigned long start_pfn, > > > + unsigned long end_pfn) > > > { > > > + struct stable_node *stable_node; > > > struct rb_node *node; > > > int nid; > > > > > > - for (nid = 0; nid < nr_node_ids; nid++) > > > - for (node = rb_first(&root_stable_tree[nid]); node; > > > - node = rb_next(node)) { > > > - struct stable_node *stable_node; > > > - > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + node = rb_first(&root_stable_tree[nid]); > > > + while (node) { > > > > This is not your fault, the old code is wrong too. It is assuming that all > > nodes are populated in numeric orders with no holes. It won't work if just > > two nodes 0 and 4 are online. It should be using for_each_online_node(). > > If the old code is wrong, it probably would be my fault! But I believe > this is okay: these rb_roots we're looking at, they are in memory which > is not being offlined, and the trees for offline nodes will simply be > empty, won't they? Something's badly wrong if otherwise. > I would expect them to be empty but that was not the problem I had in mind. Unfortunately I mixed up nr_online_ids and nr_node_ids and read the loop incorrectly. What you have is fine. 
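
To make the nr_node_ids point concrete, here is a minimal userspace sketch of the loop shape under discussion. It is an illustration only, not the mm/ksm.c code: the toy_ names, the singly linked lists standing in for the per-node rb-trees, and the value 5 for the number of possible node ids are all assumptions. It walks every possible node id, finds nothing in the lists of ids that hold no nodes, and after a removal restarts from the head of the same nid's list, not from nid 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: number of possible node ids (highest id + 1), not online nodes. */
#define TOY_NR_NODE_IDS 5

/* Hypothetical stand-in for a stable node: it records a pfn, not a page. */
struct toy_node {
	unsigned long kpfn;
	struct toy_node *next;
};

/* One list head per possible node id; heads for unused ids stay NULL. */
static struct toy_node *toy_tree[TOY_NR_NODE_IDS];

static void toy_insert(int nid, struct toy_node *n)
{
	assert(nid >= 0 && nid < TOY_NR_NODE_IDS);
	n->next = toy_tree[nid];
	toy_tree[nid] = n;
}

/*
 * Unlink every entry whose pfn lies in [start_pfn, end_pfn).
 * Empty lists (ids with no nodes) fall straight through the while loop.
 * Returns the number of entries removed.
 */
static int toy_check_stable_tree(unsigned long start_pfn, unsigned long end_pfn)
{
	int removed = 0;
	int nid;

	for (nid = 0; nid < TOY_NR_NODE_IDS; nid++) {
		struct toy_node **link = &toy_tree[nid];

		while (*link) {
			struct toy_node *n = *link;

			if (n->kpfn >= start_pfn && n->kpfn < end_pfn) {
				*link = n->next;
				n->next = NULL;
				removed++;
				/* restart within this nid only, not from nid 0 */
				link = &toy_tree[nid];
			} else {
				link = &n->next;
			}
		}
	}
	return removed;
}
```

With entries placed only under ids 0 and 4, the loop still visits ids 1 to 3, finds their lists empty, and moves on, which is why iterating up to nr_node_ids is harmless even when the populated ids have holes.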
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759595Ab3BNLeX (ORCPT ); Thu, 14 Feb 2013 06:34:23 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59501 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754296Ab3BNLeW (ORCPT ); Thu, 14 Feb 2013 06:34:22 -0500 Date: Thu, 14 Feb 2013 11:34:18 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 5/11] ksm: get_ksm_page locked Message-ID: <20130214113418.GB7367@suse.de> References: <20130205171805.GK21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 07, 2013 at 04:33:58PM -0800, Hugh Dickins wrote: > > > > > > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800 > > > +++ mmotm/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800 > > > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree > > > * but this is different - made simpler by ksm_thread_mutex being held, but > > > * interesting for assuming that no other use of the struct page could ever > > > * put our expected_mapping into page->mapping (or a field of the union which > > > - * coincides with page->mapping). The RCU calls are not for KSM at all, but > > > - * to keep the page_count protocol described with page_cache_get_speculative. > > > + * coincides with page->mapping). 
> > > * > > > * Note: it is possible that get_ksm_page() will return NULL one moment, > > > * then page the next, if the page is in between page_freeze_refs() and > > > * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page > > > * is on its way to being freed; but it is an anomaly to bear in mind. > > > */ > > > -static struct page *get_ksm_page(struct stable_node *stable_node) > > > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) > > > { > > > > The naming is unhelpful :( > > > > Because the second parameter is called "locked", it implies that the > > caller of this function holds the page lock (which is obviously very > > silly). ret_locked maybe? > > I'd prefer "lock_it": I'll make that change unless you've a better. > I don't. > > > > As the function is akin to find_lock_page I would prefer if there was > > a new get_lock_ksm_page() instead of locking depending on the value of a > > parameter. > > I demur. If it were a global interface rather than a function static > to ksm.c, yes, I'm sure Linus would side very strongly with you, and I'd > be providing a pair of wrappers to get_ksm_page() to hide the bool arg. > > But this is a private function (you're invited :) which doesn't need > that level of hand-holding. > > And I'm a firm believer in having one, difficult, function where all > the heavy thought is focussed, which does the nasty work and spares > everywhere else from having to worry about the difficulties. > Ok, I'm convinced. As you say, the case for having one function is a lot stronger later in the series when this function becomes quite complex. Thanks.
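
The two interface shapes debated above can be compared in a small userspace sketch. Everything here is hypothetical (struct toy_page, the toy_ function names, the single-threaded "lock" flag) and only models the API question, not the real get_ksm_page(): one function does the difficult work and takes a lock_it flag, and a pair of thin wrappers shows what hiding the bool argument would look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: a reference count and a lock bit. */
struct toy_page {
	int refcount;
	bool locked;
};

/*
 * The single "difficult" function: takes a reference unless the count is
 * zero (treated as stale), and optionally returns the page locked.
 */
static struct toy_page *toy_get_page(struct toy_page *page, bool lock_it)
{
	if (page == NULL || page->refcount == 0)
		return NULL;		/* stale: nothing to hand back */
	page->refcount++;
	if (lock_it) {
		assert(!page->locked);	/* toy model: no lock contention */
		page->locked = true;
	}
	return page;
}

/* The wrapper pair a global interface might offer to hide the bool. */
static struct toy_page *toy_get_page_nolock(struct toy_page *page)
{
	return toy_get_page(page, false);
}

static struct toy_page *toy_get_lock_page(struct toy_page *page)
{
	return toy_get_page(page, true);
}
```

Either shape keeps the staleness and refcount logic in one place; the wrappers only change how call sites read.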
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934560Ab3BNL6N (ORCPT ); Thu, 14 Feb 2013 06:58:13 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60499 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934129Ab3BNL6J (ORCPT ); Thu, 14 Feb 2013 06:58:09 -0500 Date: Thu, 14 Feb 2013 11:58:05 +0000 From: Mel Gorman To: Hugh Dickins Cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly Message-ID: <20130214115805.GC7367@suse.de> References: <20130205175551.GL21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 08, 2013 at 11:33:40AM -0800, Hugh Dickins wrote: > > > > > > > > > 2. __ksm_enter() has a nice little optimization, to insert the new mm > > > just behind ksmd's cursor, so there's a full pass for it to stabilize > > > (or be removed) before ksmd addresses it. Nice when ksmd is running, > > > but not so nice when we're trying to unmerge all mms: we were missing > > > those mms forked and inserted behind the unmerge cursor. Easily fixed > > > by inserting at the end when KSM_RUN_UNMERGE. > > > > > > 3. It is possible for a KSM page to be faulted back from swapcache into > > > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. > > > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private > > > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() > > > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. 
> > What I found is that a 4th cause emerges once KSM migration > is properly working: that interval during page migration when the old > page has been fully unmapped but the new not yet mapped in its place. > For anyone else watching -- normal page migration expects to be protected during that particular window with migration ptes. Any references to the PTE mapping a page being migrated faults on a swap-like PTE and waits in migration_entry_wait(). > The KSM COW breaking cannot see a page there then, so it ends up with > a (newly migrated) KSM page left behind. Almost certainly has to be > fixed in follow_page(), but I've not yet settled on its final form - > the fix I have works well, but a different approach might be better. > follow_page() is one option. My guess is that you're thinking of adding a FOLL_ flag that will cause follow_page() to check is_migration_entry() and migration_entry_wait() if the flag is present. Otherwise you would need to check for migration ptes in a number of places under page lock and then hold the lock for long periods of time to prevent migration starting. I did not check this option in depth because it quickly looked like it would be a mess, with long page lock hold times and might not even be workable. > > > +static int remove_all_stable_nodes(void) > > > +{ > > > + struct stable_node *stable_node; > > > + int nid; > > > + int err = 0; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + while (root_stable_tree[nid].rb_node) { > > > + stable_node = rb_entry(root_stable_tree[nid].rb_node, > > > + struct stable_node, node); > > > + if (remove_stable_node(stable_node)) { > > > + err = -EBUSY; > > > + break; /* proceed to next nid */ > > > + } > > > > If remove_stable_node() returns an error then it's quite possible that it'll > > go boom when that page is encountered later but it's not guaranteed. It'd > > be best effort to continue removing as many of the stable nodes anyway. > > We're in trouble either way of course. 
> > If it returns an error, then indeed something we don't yet understand > has occurred, and we shall want to debug it. But unless it's due to > corruption somewhere, we shouldn't be in much trouble, shouldn't go boom: > remove_all_stable_nodes() error is ignored at the end of unmerging, it > will be tried again when changing merge_across_nodes, and an error > then will just prevent changing merge_across_nodes at that time. So > the mysteriously unremovable stable nodes remain the same kind of tree. > Ok. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934996Ab3BNWTU (ORCPT ); Thu, 14 Feb 2013 17:19:20 -0500 Received: from mail-pb0-f54.google.com ([209.85.160.54]:39034 "EHLO mail-pb0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758933Ab3BNWTT (ORCPT ); Thu, 14 Feb 2013 17:19:19 -0500 Date: Thu, 14 Feb 2013 14:19:26 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Mel Gorman cc: Andrew Morton , Petr Holasek , Andrea Arcangeli , Izik Eidus , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly In-Reply-To: <20130214115805.GC7367@suse.de> Message-ID: References: <20130205175551.GL21389@suse.de> <20130214115805.GC7367@suse.de> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 14 Feb 2013, Mel Gorman wrote: > On Fri, Feb 08, 2013 at 11:33:40AM -0800, Hugh Dickins wrote: > > > > What I found is that a 4th cause emerges once KSM migration > > is properly working: that interval during page migration when the old > > page has been fully unmapped but the new not yet mapped in its place. 
> > > > For anyone else watching -- normal page migration expects to be protected > during that particular window with migration ptes. Any references to the > PTE mapping a page being migrated faults on a swap-like PTE and waits > in migration_entry_wait(). > > > The KSM COW breaking cannot see a page there then, so it ends up with > > a (newly migrated) KSM page left behind. Almost certainly has to be > > fixed in follow_page(), but I've not yet settled on its final form - > > the fix I have works well, but a different approach might be better. > > The fix I had (following migration entry to old page) was a bit too PageKsm specific, and probably wrong for when get_user_pages() needs to get a hold on the _new_ page. > > follow_page() is one option. My guess is that you're thinking of adding > a FOLL_ flag that will cause follow_page() to check is_migration_entry() > and migration_entry_wait() if the flag is present. Maybe a FOLL_flag, but I was thinking of doing it always. The usual get_user_pages() case will already wait in handle_mm_fault() and works okay, and I didn't identify a problem case for follow_page() apart from this ksm.c usage; but I did wonder if someone might have or add code which gets similarly caught out by the migration case. It's not a change I'd dare to make (without a FOLL_flag) if Andrea hadn't already added a wait_split_huge_page() into follow_page(); and I need to convince myself that adding another cause for waiting is necessarily safe (perhaps adding a might_sleep would be good). Sorry, I expected to have posted follow-up patches days and days ago, but in fact my time has vanished elsewhere and I've not even started. > > Otherwise you would need to check for migration ptes in a number of places > under page lock and then hold the lock for long periods of time to prevent > migration starting.
I did not check this option in depth because it quickly > looked like it would be a mess, with long page lock hold times and might > not even be workable. Yes, I think that's more or less why I quickly decided on doing it in follow_page(). Another option would be to move the ksm_migrate_page() callsite, and allow it to reject the migration attempt when "inconvenient" (I haven't stopped to think of the definition of inconvenient). Though it wouldn't fail often enough for anyone out there to care, that option just feels like a shameful cop-out to me: I'm trying to improve migration, not add strange cases when it fails. Hugh
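
The flag-gated variant of the follow_page() idea discussed in this exchange can be sketched in userspace as follows. This is a toy model under stated assumptions, not the fix itself, whose final form the thread leaves open: struct toy_pte, TOY_FOLL_MIGRATION and the toy_ functions are hypothetical names, and "waiting" is modelled by simply completing the migration on the spot. Without the flag, a lookup during the migration window sees no page, which is the window in which KSM COW breaking misses the page; with the flag, the lookup waits and retries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pte states: empty, mapping a page, or a migration entry. */
enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_MIGRATION };

struct toy_pte {
	enum toy_pte_kind kind;
	int page;	/* toy page id; for a migration entry, the new page */
};

/* Hypothetical flag asking the lookup to wait out a migration entry. */
#define TOY_FOLL_MIGRATION 0x1

/*
 * Stand-in for waiting on a migration entry: in this toy, waiting
 * completes the migration by turning the entry into a present pte.
 */
static void toy_migration_entry_wait(struct toy_pte *pte)
{
	assert(pte->kind == TOY_PTE_MIGRATION);
	pte->kind = TOY_PTE_PRESENT;
}

/* Returns the page id mapped by pte, or -1 if no page is visible. */
static int toy_follow_page(struct toy_pte *pte, unsigned int flags)
{
retry:
	if (pte->kind == TOY_PTE_PRESENT)
		return pte->page;
	if (pte->kind == TOY_PTE_MIGRATION && (flags & TOY_FOLL_MIGRATION)) {
		toy_migration_entry_wait(pte);
		goto retry;	/* look again now that migration finished */
	}
	return -1;
}
```

Doing the wait unconditionally, as also floated above, would correspond to dropping the flag test in the second branch.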