* [PATCH 1/6] ksm: fix rmap_item->anon_vma memory corruption and vma use after free
2015-10-15 16:04 [PATCH 0/6] KSM fixes Andrea Arcangeli
@ 2015-10-15 16:04 ` Andrea Arcangeli
2015-10-26 0:12 ` Hugh Dickins
2015-10-15 16:04 ` [PATCH 2/6] ksm: add cond_resched() to the rmap_walks Andrea Arcangeli
` (4 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-15 16:04 UTC (permalink / raw)
To: Hugh Dickins, Petr Holasek; +Cc: linux-mm, Andrew Morton
The ksm_test_exit() run after down_read(&mm->mmap_sem) assumed it
could serialize against ksm_exit() and prevent exit_mmap() from
running until the up_read(&mm->mmap_sem). That is true when the
rmap_item->mm is the one associated with the ksm_scan.mm_slot, as
ksm_exit() would take the !easy_to_free path.
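For reference, here is a simplified sketch of ksm_exit() as it stands
in this tree (the busy-slot list_move and other details are trimmed);
only a registered mm_slot with no rmap_list, and not currently under
scan, takes the easy_to_free path:

	void ksm_exit(struct mm_struct *mm)
	{
		struct mm_slot *mm_slot;
		int easy_to_free = 0;

		spin_lock(&ksm_mmlist_lock);
		mm_slot = get_mm_slot(mm);
		if (mm_slot && ksm_scan.mm_slot != mm_slot) {
			if (!mm_slot->rmap_list) {
				/* no rmap_items left: free the slot right away */
				hash_del(&mm_slot->link);
				list_del(&mm_slot->mm_list);
				easy_to_free = 1;
			}
		}
		spin_unlock(&ksm_mmlist_lock);

		if (easy_to_free) {
			free_mm_slot(mm_slot);
			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
			mmdrop(mm);
		} else if (mm_slot) {
			/* !easy_to_free path: serialize against ksm_test_exit() */
			down_write(&mm->mmap_sem);
			up_write(&mm->mmap_sem);
		}
	}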
The problem is that when merging the current rmap_item (the one whose
->mm pointer is always associated with ksm_scan.mm_slot) with a
tree_rmap_item in the unstable tree, the unstable tree
tree_rmap_item->mm can be any random mm. The locking technique
described above is a noop if the rmap_item->mm is not the one
associated with the ksm_scan.mm_slot. In turn the tree_rmap_item,
when converted to a stable tree rmap_item and added to the
stable_node->hlist, can have a rmap_item->anon_vma that points to
already freed memory. The find_vma and other vma operations to reach
the anon_vma also run on potentially already freed memory. The
atomic_inc in get_anon_vma itself could randomly corrupt already
reused memory.
The result is oopses like the one below:
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/sleep_millisecs
CPU 14
Modules linked in: netconsole nfs nfs_acl auth_rpcgss fscache lockd sunrpc msr binfmt_misc sr_mod
Pid: 904, comm: ksmd Not tainted 2.6.32 #21 Supermicro X8DTN/X8DTN
RIP: 0010:[<ffffffff81138300>] [<ffffffff81138300>] drop_anon_vma+0x0/0xe0
RSP: 0000:ffff88023b94bd28 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff880231e64d50 RCX: 0000000000000017
RDX: ffff88023b94bfd8 RSI: 0000000000000216 RDI: 80000002192c6067
RBP: ffff88023b94bd40 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff880217fa4198
R13: ffff880217fa4198 R14: 000000000021916f R15: ffff880217fa419b
FS: 0000000000000000(0000) GS:ffff880030600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000d116a0 CR3: 000000023aa35000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksmd (pid: 904, threadinfo ffff88023b948000, task ffff88023b946500)
Stack:
ffffffff8114d049 ffffea000757d048 ffffea0000000000 ffff88023b94bda0
<d> ffffffff8114d153 ffff88023b946500 ffff88023afaab40 ffffffff8114e85d
<d> 0100000000000038 ffff88023b94bdb0 ffff880231d33560 ffff880217fa4198
Call Trace:
[<ffffffff8114d049>] ? remove_node_from_stable_tree+0x29/0x80
[<ffffffff8114d153>] get_ksm_page+0xb3/0x1e0
[<ffffffff8114e85d>] ? ksm_scan_thread+0x60d/0x1130
[<ffffffff8114d499>] remove_rmap_item_from_tree+0x99/0x130
[<ffffffff8114ed19>] ksm_scan_thread+0xac9/0x1130
[<ffffffff81095ce0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8114e250>] ? ksm_scan_thread+0x0/0x1130
[<ffffffff8109508b>] kthread+0x8b/0xb0
[<ffffffff8100c0ea>] child_rip+0xa/0x20
[<ffffffff8100b910>] ? restore_args+0x0/0x30
[<ffffffff81095000>] ? kthread+0x0/0xb0
[<ffffffff8100c0e0>] ? child_rip+0x0/0x20
Code: 01 75 10 e8 d3 f7 ff ff 5d c3 90 0f 0b eb fe 0f 1f 40 00 e8 73 f6 ff ff 5d c3 e8 4c 77 01 00 5d c3 66 2e 0f 1f 84 00 00 00 00 00 <8b> 47 48 85 c0 0f 8e c5 00 00 00 55 48 89 e5 41 56 41 55 41 54
RIP [<ffffffff81138300>] drop_anon_vma+0x0/0xe0
RSP <ffff88023b94bd28>
---[ end trace b1f69fd4c12ce1ce ]---
rmap_item->anon_vma was set to the RDI value 0x80000002192c6067.
0xffffffff81138300 <+0>: mov 0x48(%rdi),%eax
0xffffffff81138303 <+3>: test %eax,%eax
0xffffffff81138305 <+5>: jle 0xffffffff811383d0 <drop_anon_vma+208>
0xffffffff8113830b <+11>: push %rbp
Other oopses are more random, harder-to-debug side effects of the
memory corruption. In this case the anon_vma was a dangling pointer
because by the time try_to_merge_with_ksm_page did rmap_item->anon_vma
= vma->anon_vma, the vma had already been freed and its memory
reused. At other times the oopses materialize with a vma->anon_vma
pointer that looks legitimate but points to an already freed and
reused anon_vma.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 51 insertions(+), 4 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101e..8fc6793 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -350,6 +350,24 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
}
/*
+ * If the mm isn't the one associated with the current
+ * ksm_scan.mm_slot, ksm_exit() will not down_write();up_write() and
+ * in turn the ksm_test_exit() check run inside a mm->mmap_sem
+ * critical section will not prevent exit_mmap() from running under
+ * us. So in those cases where we could work with an "mm" that isn't
+ * guaranteed to be associated with the current ksm_scan.mm_slot,
+ * ksm_get_mm() is needed instead of the ksm_test_exit() run inside
+ * the mmap_sem. Return true if mm_users was incremented, or false
+ * if we failed to take the mm because it was freed from under
+ * us. If it returns true, the caller must call mmput() after it
+ * finishes using the mm.
+ */
+static __always_inline bool ksm_get_mm(struct mm_struct *mm)
+{
+ return likely(atomic_inc_not_zero(&mm->mm_users));
+}
+
+/*
* We use break_ksm to break COW on a ksm page: it's a stripped down
*
* if (get_user_pages(current, mm, addr, 1, 1, 1, &page, NULL) == 1)
@@ -412,8 +430,6 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
unsigned long addr)
{
struct vm_area_struct *vma;
- if (ksm_test_exit(mm))
- return NULL;
vma = find_vma(mm, addr);
if (!vma || vma->vm_start > addr)
return NULL;
@@ -434,11 +450,21 @@ static void break_cow(struct rmap_item *rmap_item)
*/
put_anon_vma(rmap_item->anon_vma);
+ /*
+ * The "mm" of the unstable tree rmap_item isn't necessarily
+ * associated with the current ksm_scan.mm_slot; it could be
+ * any random mm. So we need ksm_get_mm() here to prevent
+ * exit_mmap() from running from under us in mmput().
+ */
+ if (!ksm_get_mm(mm))
+ return;
+
down_read(&mm->mmap_sem);
vma = find_mergeable_vma(mm, addr);
if (vma)
break_ksm(vma, addr);
up_read(&mm->mmap_sem);
+ mmput(mm);
}
static struct page *page_trans_compound_anon(struct page *page)
@@ -462,6 +488,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
struct vm_area_struct *vma;
struct page *page;
+ /*
+ * The "mm" of the unstable tree rmap_item isn't necessarily
+ * associated with the current ksm_scan.mm_slot; it could be
+ * any random mm. So we need ksm_get_mm() here to prevent
+ * exit_mmap() from running from under us in mmput().
+ */
+ if (!ksm_get_mm(mm))
+ return NULL;
+
down_read(&mm->mmap_sem);
vma = find_mergeable_vma(mm, addr);
if (!vma)
@@ -478,6 +513,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
out: page = NULL;
}
up_read(&mm->mmap_sem);
+ mmput(mm);
return page;
}
@@ -1086,9 +1122,19 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
struct vm_area_struct *vma;
int err = -EFAULT;
+ /*
+ * The "mm" of the unstable tree rmap_item isn't necessarily
+ * associated with the current ksm_scan.mm_slot; it could be
+ * any random mm. So we need ksm_get_mm() here to prevent
+ * exit_mmap() from running from under us in mmput(). Otherwise
+ * rmap_item->anon_vma could point to an anon_vma that has
+ * already been freed (i.e. get_anon_vma() below would run too
+ * late).
+ */
+ if (!ksm_get_mm(mm))
+ return err;
+
down_read(&mm->mmap_sem);
- if (ksm_test_exit(mm))
- goto out;
vma = find_vma(mm, rmap_item->address);
if (!vma || vma->vm_start > rmap_item->address)
goto out;
@@ -1105,6 +1151,7 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
get_anon_vma(vma->anon_vma);
out:
up_read(&mm->mmap_sem);
+ mmput(mm);
return err;
}
* Re: [PATCH 1/6] ksm: fix rmap_item->anon_vma memory corruption and vma use after free
2015-10-15 16:04 ` [PATCH 1/6] ksm: fix rmap_item->anon_vma memory corruption and vma use after free Andrea Arcangeli
@ 2015-10-26 0:12 ` Hugh Dickins
2015-10-30 18:55 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2015-10-26 0:12 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> The ksm_test_exit() run after down_read(&mm->mmap_sem) assumed it
> could serialize against ksm_exit() and prevent exit_mmap() from
> running until the up_read(&mm->mmap_sem). That is true when the
> rmap_item->mm is the one associated with the ksm_scan.mm_slot, as
> ksm_exit() would take the !easy_to_free path.
>
> The problem is that when merging the current rmap_item (the one whose
> ->mm pointer is always associated with ksm_scan.mm_slot) with a
> tree_rmap_item in the unstable tree, the unstable tree
> tree_rmap_item->mm can be any random mm. The locking technique
> described above is a noop if the rmap_item->mm is not the one
> associated with the ksm_scan.mm_slot.
I was convinced for an hour, though puzzled how this had survived
six years without being noticed: I'd certainly found the need for
the special ksm_exit()/ksm_test_exit() stuff when testing originally,
that wasn't hard, and why would this race be so very much harder?
Now, after looking again at ksm_exit(), I just don't see the point
you're making. As I read it (and I certainly accept that I may be
wrong on all this), it will do the down_write,up_write on any mm
that is registered with it, and that has a chain of rmap_items -
the easy_to_free route is only for those that have no rmap_items
(and are not being worked on at present); and those that have no
rmap_items, have no rmap_items in the unstable or the stable tree.
Please explain what I'm missing before I look again harder. One
nit below. It looked very reasonable and nicely implemented to me,
but I didn't complete checking it before I lost sight of what it's
fixing. (And incrementing mm_users always makes me worry a bit
about what happens under OOM, so I prefer not to do it.)
I've seen no problems at all in running rc5-mm1 and rc6-mm1
(with "allksm" of course) on this series.
Hugh
> In turn the tree_rmap_item,
> when converted to a stable tree rmap_item and added to the
> stable_node->hlist, can have a rmap_item->anon_vma that points to
> already freed memory. The find_vma and other vma operations to reach
> the anon_vma also run on potentially already freed memory. The
> atomic_inc in get_anon_vma itself could randomly corrupt already
> reused memory.
>
> The result is oopses like the one below:
>
> general protection fault: 0000 [#1] SMP
> last sysfs file: /sys/kernel/mm/ksm/sleep_millisecs
> CPU 14
> Modules linked in: netconsole nfs nfs_acl auth_rpcgss fscache lockd sunrpc msr binfmt_misc sr_mod
>
> Pid: 904, comm: ksmd Not tainted 2.6.32 #21 Supermicro X8DTN/X8DTN
Time for an update, maybe :-? Though, in seriousness, I don't
think we have fixed anything of this nature in KSM since then.
Have probably fixed some anon_vma lifetime issues, though.
> RIP: 0010:[<ffffffff81138300>] [<ffffffff81138300>] drop_anon_vma+0x0/0xe0
> RSP: 0000:ffff88023b94bd28 EFLAGS: 00010206
> RAX: 0000000000000000 RBX: ffff880231e64d50 RCX: 0000000000000017
> RDX: ffff88023b94bfd8 RSI: 0000000000000216 RDI: 80000002192c6067
> RBP: ffff88023b94bd40 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffff880217fa4198
> R13: ffff880217fa4198 R14: 000000000021916f R15: ffff880217fa419b
> FS: 0000000000000000(0000) GS:ffff880030600000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000d116a0 CR3: 000000023aa35000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ksmd (pid: 904, threadinfo ffff88023b948000, task ffff88023b946500)
> Stack:
> ffffffff8114d049 ffffea000757d048 ffffea0000000000 ffff88023b94bda0
> <d> ffffffff8114d153 ffff88023b946500 ffff88023afaab40 ffffffff8114e85d
> <d> 0100000000000038 ffff88023b94bdb0 ffff880231d33560 ffff880217fa4198
> Call Trace:
> [<ffffffff8114d049>] ? remove_node_from_stable_tree+0x29/0x80
> [<ffffffff8114d153>] get_ksm_page+0xb3/0x1e0
> [<ffffffff8114e85d>] ? ksm_scan_thread+0x60d/0x1130
> [<ffffffff8114d499>] remove_rmap_item_from_tree+0x99/0x130
> [<ffffffff8114ed19>] ksm_scan_thread+0xac9/0x1130
> [<ffffffff81095ce0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff8114e250>] ? ksm_scan_thread+0x0/0x1130
> [<ffffffff8109508b>] kthread+0x8b/0xb0
> [<ffffffff8100c0ea>] child_rip+0xa/0x20
> [<ffffffff8100b910>] ? restore_args+0x0/0x30
> [<ffffffff81095000>] ? kthread+0x0/0xb0
> [<ffffffff8100c0e0>] ? child_rip+0x0/0x20
> Code: 01 75 10 e8 d3 f7 ff ff 5d c3 90 0f 0b eb fe 0f 1f 40 00 e8 73 f6 ff ff 5d c3 e8 4c 77 01 00 5d c3 66 2e 0f 1f 84 00 00 00 00 00 <8b> 47 48 85 c0 0f 8e c5 00 00 00 55 48 89 e5 41 56 41 55 41 54
> RIP [<ffffffff81138300>] drop_anon_vma+0x0/0xe0
> RSP <ffff88023b94bd28>
> ---[ end trace b1f69fd4c12ce1ce ]---
>
> rmap_item->anon_vma was set to the RDI value 0x80000002192c6067.
>
> 0xffffffff81138300 <+0>: mov 0x48(%rdi),%eax
> 0xffffffff81138303 <+3>: test %eax,%eax
> 0xffffffff81138305 <+5>: jle 0xffffffff811383d0 <drop_anon_vma+208>
> 0xffffffff8113830b <+11>: push %rbp
>
> Other oopses are more random, harder-to-debug side effects of the
> memory corruption. In this case the anon_vma was a dangling pointer
> because by the time try_to_merge_with_ksm_page did rmap_item->anon_vma
> = vma->anon_vma, the vma had already been freed and its memory
> reused. At other times the oopses materialize with a vma->anon_vma
> pointer that looks legitimate but points to an already freed and
> reused anon_vma.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> mm/ksm.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 51 insertions(+), 4 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 7ee101e..8fc6793 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -350,6 +350,24 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
> }
>
> /*
> + * If the mm isn't the one associated with the current
> + * ksm_scan.mm_slot, ksm_exit() will not down_write();up_write() and
> + * in turn the ksm_test_exit() check run inside a mm->mmap_sem
> + * critical section will not prevent exit_mmap() from running under
> + * us. So in those cases where we could work with an "mm" that isn't
> + * guaranteed to be associated with the current ksm_scan.mm_slot,
> + * ksm_get_mm() is needed instead of the ksm_test_exit() run inside
> + * the mmap_sem. Return true if mm_users was incremented, or false
> + * if we failed to take the mm because it was freed from under
> + * us. If it returns true, the caller must call mmput() after it
> + * finishes using the mm.
> + */
> +static __always_inline bool ksm_get_mm(struct mm_struct *mm)
The nit was to ask, what's behind that __always_inline?
If the compiler chooses oddly not to inline it, that's okay, isn't it?
> +{
> + return likely(atomic_inc_not_zero(&mm->mm_users));
> +}
> +
> +/*
> * We use break_ksm to break COW on a ksm page: it's a stripped down
> *
> * if (get_user_pages(current, mm, addr, 1, 1, 1, &page, NULL) == 1)
> @@ -412,8 +430,6 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
> unsigned long addr)
> {
> struct vm_area_struct *vma;
> - if (ksm_test_exit(mm))
> - return NULL;
> vma = find_vma(mm, addr);
> if (!vma || vma->vm_start > addr)
> return NULL;
> @@ -434,11 +450,21 @@ static void break_cow(struct rmap_item *rmap_item)
> */
> put_anon_vma(rmap_item->anon_vma);
>
> + /*
> + * The "mm" of the unstable tree rmap_item isn't necessarily
> + * associated with the current ksm_scan.mm_slot; it could be
> + * any random mm. So we need ksm_get_mm() here to prevent
> + * exit_mmap() from running from under us in mmput().
> + */
> + if (!ksm_get_mm(mm))
> + return;
> +
> down_read(&mm->mmap_sem);
> vma = find_mergeable_vma(mm, addr);
> if (vma)
> break_ksm(vma, addr);
> up_read(&mm->mmap_sem);
> + mmput(mm);
> }
>
> static struct page *page_trans_compound_anon(struct page *page)
> @@ -462,6 +488,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
> struct vm_area_struct *vma;
> struct page *page;
>
> + /*
> + * The "mm" of the unstable tree rmap_item isn't necessarily
> + * associated with the current ksm_scan.mm_slot; it could be
> + * any random mm. So we need ksm_get_mm() here to prevent
> + * exit_mmap() from running from under us in mmput().
> + */
> + if (!ksm_get_mm(mm))
> + return NULL;
> +
> down_read(&mm->mmap_sem);
> vma = find_mergeable_vma(mm, addr);
> if (!vma)
> @@ -478,6 +513,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
> out: page = NULL;
> }
> up_read(&mm->mmap_sem);
> + mmput(mm);
> return page;
> }
>
> @@ -1086,9 +1122,19 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
> struct vm_area_struct *vma;
> int err = -EFAULT;
>
> + /*
> + * The "mm" of the unstable tree rmap_item isn't necessarily
> + * associated with the current ksm_scan.mm_slot; it could be
> + * any random mm. So we need ksm_get_mm() here to prevent
> + * exit_mmap() from running from under us in mmput(). Otherwise
> + * rmap_item->anon_vma could point to an anon_vma that has
> + * already been freed (i.e. get_anon_vma() below would run too
> + * late).
> + */
> + if (!ksm_get_mm(mm))
> + return err;
> +
> down_read(&mm->mmap_sem);
> - if (ksm_test_exit(mm))
> - goto out;
> vma = find_vma(mm, rmap_item->address);
> if (!vma || vma->vm_start > rmap_item->address)
> goto out;
> @@ -1105,6 +1151,7 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
> get_anon_vma(vma->anon_vma);
> out:
> up_read(&mm->mmap_sem);
> + mmput(mm);
> return err;
> }
>
* Re: [PATCH 1/6] ksm: fix rmap_item->anon_vma memory corruption and vma use after free
2015-10-26 0:12 ` Hugh Dickins
@ 2015-10-30 18:55 ` Andrea Arcangeli
0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-30 18:55 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Petr Holasek, linux-mm, Andrew Morton
Hello,
On Sun, Oct 25, 2015 at 05:12:22PM -0700, Hugh Dickins wrote:
> I was convinced for an hour, though puzzled how this had survived
> six years without being noticed: I'd certainly found the need for
> the special ksm_exit()/ksm_test_exit() stuff when testing originally,
> that wasn't hard, and why would this race be so very much harder?
I was traveling the last few days, but I left the testcase running to
try to reproduce again with only this patch rolled back, and I failed
to reproduce. I now assume that it was another buggy patch that caused
a corruption consistent with ksm_exit() not taking the mmap_sem for
writing.
The patch that may have caused this (not part of this patchset) tried
to synchronously drop the stable nodes, in order to remove the
migrate_nodes loop in scan_get_next_rmap_item.
> Now, after looking again at ksm_exit(), I just don't see the point
> you're making. As I read it (and I certainly accept that I may be
> wrong on all this), it will do the down_write,up_write on any mm
> that is registered with it, and that has a chain of rmap_items -
> the easy_to_free route is only for those that have no rmap_items
> (and are not being worked on at present); and those that have no
> rmap_items, have no rmap_items in the unstable or the stable tree.
So the safety comes from relying on various implicit memory barriers
that are taken as we change mm_slot during the scan, so that we know
the mm_slot->rmap_list fields of the mm_slots not under scan are
stable and never zero if we run into a rmap_item that belongs to
them.
> Please explain what I'm missing before I look again harder. One
> nit below. It looked very reasonable and nicely implemented to me,
> but I didn't complete checking it before I lost sight of what it's
> fixing. (And incrementing mm_users always makes me worry a bit
> about what happens under OOM, so I prefer not to do it.)
Well the atomic_inc_not_zero is simpler and OOM shouldn't be a
practical problem because this would do mmput immediately after (it's
not holding it for long like the scan could do). However it adds an
atomic op that the current logic doesn't require, and I wouldn't like
to run an atomic op if there's no need.
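For concreteness, this is the pattern 1/6 uses (a sketch mirroring the
patch above, not new code); the mm is pinned only across the short
mmap_sem critical section:

	if (!ksm_get_mm(mm))	/* atomic_inc_not_zero(&mm->mm_users) */
		return;		/* mm already exited */
	down_read(&mm->mmap_sem);
	/* short vma lookup and merge work */
	up_read(&mm->mmap_sem);
	mmput(mm);	/* may run exit_mmap() if this was the last reference */

The mm_users reference is never held for long, which is why OOM
exposure stays minimal here.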
So for the time being I agree that 1/6 is a noop and should be
dropped. This only applies to 1/6.
Thanks and sorry for the confusion about the mm_slot->rmap_list.
Andrea
* [PATCH 2/6] ksm: add cond_resched() to the rmap_walks
2015-10-15 16:04 [PATCH 0/6] KSM fixes Andrea Arcangeli
2015-10-15 16:04 ` [PATCH 1/6] ksm: fix rmap_item->anon_vma memory corruption and vma use after free Andrea Arcangeli
@ 2015-10-15 16:04 ` Andrea Arcangeli
2015-10-25 23:41 ` Hugh Dickins
2015-10-15 16:04 ` [PATCH 3/6] ksm: don't fail stable tree lookups if walking over stale stable_nodes Andrea Arcangeli
` (3 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-15 16:04 UTC (permalink / raw)
To: Hugh Dickins, Petr Holasek; +Cc: linux-mm, Andrew Morton
While at it add it to the file and anon walks too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/ksm.c | 2 ++
mm/rmap.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/mm/ksm.c b/mm/ksm.c
index 8fc6793..39ef485 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1961,9 +1961,11 @@ again:
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
+ cond_resched();
anon_vma_lock_read(anon_vma);
anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
0, ULONG_MAX) {
+ cond_resched();
vma = vmac->vma;
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f..b949778 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1607,6 +1607,8 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
+ cond_resched();
+
if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
continue;
@@ -1656,6 +1658,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
+ cond_resched();
+
if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
continue;
* Re: [PATCH 2/6] ksm: add cond_resched() to the rmap_walks
2015-10-15 16:04 ` [PATCH 2/6] ksm: add cond_resched() to the rmap_walks Andrea Arcangeli
@ 2015-10-25 23:41 ` Hugh Dickins
2015-10-27 0:32 ` Davidlohr Bueso
0 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2015-10-25 23:41 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Davidlohr Bueso, Hugh Dickins, Petr Holasek, linux-mm,
Andrew Morton
On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> While at it add it to the file and anon walks too.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Subject really should be "mm: add cond_resched() to the rmap walks",
then body "Add cond_resched() to the ksm and anon and file rmap walks."
Acked-by: Hugh Dickins <hughd@google.com>
but I think we need a blessing from Davidlohr too, if not more.
> ---
> mm/ksm.c | 2 ++
> mm/rmap.c | 4 ++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 8fc6793..39ef485 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1961,9 +1961,11 @@ again:
> struct anon_vma_chain *vmac;
> struct vm_area_struct *vma;
>
> + cond_resched();
> anon_vma_lock_read(anon_vma);
> anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
> 0, ULONG_MAX) {
> + cond_resched();
> vma = vmac->vma;
> if (rmap_item->address < vma->vm_start ||
> rmap_item->address >= vma->vm_end)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5b5c1f..b949778 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1607,6 +1607,8 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
> struct vm_area_struct *vma = avc->vma;
> unsigned long address = vma_address(page, vma);
>
> + cond_resched();
> +
> if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
> continue;
>
> @@ -1656,6 +1658,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> unsigned long address = vma_address(page, vma);
>
> + cond_resched();
> +
> if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
> continue;
>
* Re: [PATCH 2/6] ksm: add cond_resched() to the rmap_walks
2015-10-25 23:41 ` Hugh Dickins
@ 2015-10-27 0:32 ` Davidlohr Bueso
2015-11-01 22:19 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 0:32 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrea Arcangeli, Petr Holasek, linux-mm, Andrew Morton
On Sun, 25 Oct 2015, Hugh Dickins wrote:
>On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
>
>> While at it add it to the file and anon walks too.
>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>
>Subject really should be "mm: add cond_resched() to the rmap walks",
>then body "Add cond_resched() to the ksm and anon and file rmap walks."
>
>Acked-by: Hugh Dickins <hughd@google.com>
>but I think we need a blessing from Davidlohr too, if not more.
Perhaps I'm lost in the context, but by the changelog alone I cannot
see the reasoning for the patch. Are latencies really that high? Maybe,
at least the changelog needs some love.
Thanks,
Davidlohr
* Re: [PATCH 2/6] ksm: add cond_resched() to the rmap_walks
2015-10-27 0:32 ` Davidlohr Bueso
@ 2015-11-01 22:19 ` Andrea Arcangeli
0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2015-11-01 22:19 UTC (permalink / raw)
To: Davidlohr Bueso; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Mon, Oct 26, 2015 at 05:32:02PM -0700, Davidlohr Bueso wrote:
> On Sun, 25 Oct 2015, Hugh Dickins wrote:
>
> >On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> >
> >> While at it add it to the file and anon walks too.
> >>
> >> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> >
> >Subject really should be "mm: add cond_resched() to the rmap walks",
> >then body "Add cond_resched() to the ksm and anon and file rmap walks."
> >
> >Acked-by: Hugh Dickins <hughd@google.com>
> >but I think we need a blessing from Davidlohr too, if not more.
>
> Perhaps I'm lost in the context, but by the changelog alone I cannot
> see the reasoning for the patch. Are latencies really that high? Maybe,
> at least the changelog needs some love.
Yes they can be that high. The rmap walk must reach every possible
mapping of the page, so if a page is heavily shared (no matter if it's
KSM, anon or pagecache) there will be tons of entries to walk
through. All the optimizations with the prio_tree, anon_vma chains and
the interval tree help to find the right virtual mapping faster, but
if there are lots of virtual mappings, all mappings must still be
walked through.
The biggest cost is for the IPIs, and the IPIs can be optimized in a
variety of ways, but at least for KSM, if each virtual mapping ends up
in a different mm and each mm runs on a different CPU and there are
tons of CPUs, it's actually impossible to reduce the number of IPIs
during KSM page migration.
Regardless of the IPIs, it's generally safer to keep these
cond_resched() in all cases, as even if we massively reduce the number
of IPIs, the number of entries to walk IPI-less may still be large and
no entry can possibly be skipped in the page migration case. Plus we
leverage having made those locks sleepable.
Dropping 1/6 triggers a reject in a later patch, so I'll have to
resubmit. So while at it, I'll add more commentary to the commit.
Thanks,
Andrea
* [PATCH 3/6] ksm: don't fail stable tree lookups if walking over stale stable_nodes
2015-10-15 16:04 [PATCH 0/6] KSM fixes Andrea Arcangeli
2015-10-15 16:04 ` [PATCH 1/6] ksm: fix rmap_item->anon_vma memory corruption and vma use after free Andrea Arcangeli
2015-10-15 16:04 ` [PATCH 2/6] ksm: add cond_resched() to the rmap_walks Andrea Arcangeli
@ 2015-10-15 16:04 ` Andrea Arcangeli
2015-10-25 23:34 ` Hugh Dickins
2015-10-15 16:04 ` [PATCH 4/6] ksm: use the helper method to do the hlist_empty check Andrea Arcangeli
` (2 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-15 16:04 UTC (permalink / raw)
To: Hugh Dickins, Petr Holasek; +Cc: linux-mm, Andrew Morton
The stable_nodes can become stale at any time if the underlying pages
get freed. The stable_node gets collected and removed from the stable
rbtree if that is detected during the rbtree lookups.
Don't fail the lookup if running into stale stable_nodes; just restart
the lookup after collecting the stale entries. Otherwise the CPU spent
in the preparation stage is wasted and the lookup must be repeated at
the next pass, potentially failing a second time on a second stale
entry.
This will also contribute to pruning the stable tree and releasing the
stable_node memory more efficiently.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/ksm.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 39ef485..929b5c2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1225,7 +1225,18 @@ again:
stable_node = rb_entry(*new, struct stable_node, node);
tree_page = get_ksm_page(stable_node, false);
if (!tree_page)
- return NULL;
+ /*
+ * If we walked over a stale stable_node,
+ * get_ksm_page() will call rb_erase() and it
+ * may rebalance the tree from under us. So
+ * restart the search from scratch. Returning
+ * NULL would be safe too, but we'd generate
+ * false negative insertions just because some
+ * stable_node was stale which would waste CPU
+ * by doing the preparation work twice at the
+ * next KSM pass.
+ */
+ goto again;
ret = memcmp_pages(page, tree_page);
put_page(tree_page);
@@ -1301,12 +1312,14 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
unsigned long kpfn;
struct rb_root *root;
struct rb_node **new;
- struct rb_node *parent = NULL;
+ struct rb_node *parent;
struct stable_node *stable_node;
kpfn = page_to_pfn(kpage);
nid = get_kpfn_nid(kpfn);
root = root_stable_tree + nid;
+again:
+ parent = NULL;
new = &root->rb_node;
while (*new) {
@@ -1317,7 +1330,18 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
stable_node = rb_entry(*new, struct stable_node, node);
tree_page = get_ksm_page(stable_node, false);
if (!tree_page)
- return NULL;
+ /*
+ * If we walked over a stale stable_node,
+ * get_ksm_page() will call rb_erase() and it
+ * may rebalance the tree from under us. So
+ * restart the search from scratch. Returning
+ * NULL would be safe too, but we'd generate
+ * false negative insertions just because some
+ * stable_node was stale which would waste CPU
+ * by doing the preparation work twice at the
+ * next KSM pass.
+ */
+ goto again;
ret = memcmp_pages(kpage, tree_page);
put_page(tree_page);
* Re: [PATCH 3/6] ksm: don't fail stable tree lookups if walking over stale stable_nodes
2015-10-15 16:04 ` [PATCH 3/6] ksm: don't fail stable tree lookups if walking over stale stable_nodes Andrea Arcangeli
@ 2015-10-25 23:34 ` Hugh Dickins
2015-11-01 23:03 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2015-10-25 23:34 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> The stable_nodes can become stale at any time if the underlying pages
> get freed. The stable_node gets collected and removed from the stable
> rbtree if that is detected during the rbtree lookups.
>
> Don't fail the lookup if running into stale stable_nodes; just restart
> the lookup after collecting the stale entries. Otherwise the CPU spent
> in the preparation stage is wasted and the lookup must be repeated at
> the next pass, potentially failing a second time on a second stale
> entry.
>
> This will also contribute to pruning the stable tree and releasing the
> stable_node memory more efficiently.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
I'll say
Acked-by: Hugh Dickins <hughd@google.com>
as a gesture of goodwill, but in honesty I'm sitting on the fence,
and couldn't decide. I think I've gone back and forth on this in
my own mind in the past, worried that we might get stuck a long
time going back round to "again". In the past I've felt that to
give up with NULL is consistent with KSM's willingness to give way
to any obstruction; but if you're finding "goto again" a better
strategy, sure, go ahead. And at least there's a cond_resched()
just above the diff context shown.
A dittoed nit below...
> ---
> mm/ksm.c | 30 +++++++++++++++++++++++++++---
> 1 file changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 39ef485..929b5c2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1225,7 +1225,18 @@ again:
> stable_node = rb_entry(*new, struct stable_node, node);
> tree_page = get_ksm_page(stable_node, false);
> if (!tree_page)
> - return NULL;
> + /*
> + * If we walked over a stale stable_node,
> + * get_ksm_page() will call rb_erase() and it
> + * may rebalance the tree from under us. So
> + * restart the search from scratch. Returning
> + * NULL would be safe too, but we'd generate
> + * false negative insertions just because some
> + * stable_node was stale which would waste CPU
> + * by doing the preparation work twice at the
> + * next KSM pass.
> + */
> + goto again;
When a comment gets that long, in fact even if it were only one line,
I'd much prefer that block inside braces. I think I noticed Linus
feeling the same way a few days ago, when he fixed up someone's patch.
>
> ret = memcmp_pages(page, tree_page);
> put_page(tree_page);
> @@ -1301,12 +1312,14 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
> unsigned long kpfn;
> struct rb_root *root;
> struct rb_node **new;
> - struct rb_node *parent = NULL;
> + struct rb_node *parent;
> struct stable_node *stable_node;
>
> kpfn = page_to_pfn(kpage);
> nid = get_kpfn_nid(kpfn);
> root = root_stable_tree + nid;
> +again:
> + parent = NULL;
> new = &root->rb_node;
>
> while (*new) {
> @@ -1317,7 +1330,18 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
> stable_node = rb_entry(*new, struct stable_node, node);
> tree_page = get_ksm_page(stable_node, false);
> if (!tree_page)
> - return NULL;
> + /*
> + * If we walked over a stale stable_node,
> + * get_ksm_page() will call rb_erase() and it
> + * may rebalance the tree from under us. So
> + * restart the search from scratch. Returning
> + * NULL would be safe too, but we'd generate
> + * false negative insertions just because some
> + * stable_node was stale which would waste CPU
> + * by doing the preparation work twice at the
> + * next KSM pass.
> + */
> + goto again;
Ditto.
>
> ret = memcmp_pages(kpage, tree_page);
> put_page(tree_page);
* Re: [PATCH 3/6] ksm: don't fail stable tree lookups if walking over stale stable_nodes
2015-10-25 23:34 ` Hugh Dickins
@ 2015-11-01 23:03 ` Andrea Arcangeli
0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2015-11-01 23:03 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Petr Holasek, linux-mm, Andrew Morton
On Sun, Oct 25, 2015 at 04:34:27PM -0700, Hugh Dickins wrote:
> I'll say
> Acked-by: Hugh Dickins <hughd@google.com>
> as a gesture of goodwill, but in honesty I'm sitting on the fence,
> and couldn't decide. I think I've gone back and forth on this in
> my own mind in the past, worried that we might get stuck a long
> time going back round to "again". In the past I've felt that to
> give up with NULL is consistent with KSM's willingness to give way
> to any obstruction; but if you're finding "goto again" a better
> strategy, sure, go ahead. And at least there's a cond_resched()
> just above the diff context shown.
If a couple of large processes exit and create lots of stale
stable_nodes, we'll miss the opportunity to merge lots of unstable
tree nodes for potentially many passes. So KSM gets an artificial
latency in merging new unstable tree nodes. At the same time, if we
don't prune aggressively, we delay the freeing of the stale
stable_nodes. Keeping stale stable_nodes around wastes memory and
can't provide any benefit.
Ideally we should be doing a full rbtree walk to do a perfect pruning
after each pass, to be sure some stale stable_nodes don't stay around
forever, but walking the whole rbtree takes more memory and would use
more CPU. So considering we're not perfect at pruning the tree after
each pass, we can at least be more aggressive at pruning the tree
during the lookup.
> When a comment gets that long, in fact even if it were only one line,
> I'd much prefer that block inside braces. I think I noticed Linus
> feeling the same way a few days ago, when he fixed up someone's patch.
Ok, I'll add braces, I don't mind either ways.
* [PATCH 4/6] ksm: use the helper method to do the hlist_empty check
2015-10-15 16:04 [PATCH 0/6] KSM fixes Andrea Arcangeli
` (2 preceding siblings ...)
2015-10-15 16:04 ` [PATCH 3/6] ksm: don't fail stable tree lookups if walking over stale stable_nodes Andrea Arcangeli
@ 2015-10-15 16:04 ` Andrea Arcangeli
2015-10-25 23:22 ` Hugh Dickins
2015-10-15 16:04 ` [PATCH 5/6] ksm: use find_mergeable_vma in try_to_merge_with_ksm_page Andrea Arcangeli
2015-10-15 16:04 ` [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup Andrea Arcangeli
5 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-15 16:04 UTC (permalink / raw)
To: Hugh Dickins, Petr Holasek; +Cc: linux-mm, Andrew Morton
This just uses the helper function to clean up the assumption about
the hlist_node internals.
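For reference, the helper is essentially this (from
include/linux/list.h of this era), so the conversion doesn't change
behavior; it only stops open-coding the ->first dereference:

	static inline int hlist_empty(const struct hlist_head *h)
	{
		return !h->first;
	}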
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/ksm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 929b5c2..241588e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -661,7 +661,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
unlock_page(page);
put_page(page);
- if (stable_node->hlist.first)
+ if (!hlist_empty(&stable_node->hlist))
ksm_pages_sharing--;
else
ksm_pages_shared--;
* Re: [PATCH 4/6] ksm: use the helper method to do the hlist_empty check
2015-10-15 16:04 ` [PATCH 4/6] ksm: use the helper method to do the hlist_empty check Andrea Arcangeli
@ 2015-10-25 23:22 ` Hugh Dickins
0 siblings, 0 replies; 19+ messages in thread
From: Hugh Dickins @ 2015-10-25 23:22 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> This just uses the helper function to clean up the assumption about
> the hlist_node internals.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
> ---
> mm/ksm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 929b5c2..241588e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -661,7 +661,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
> unlock_page(page);
> put_page(page);
>
> - if (stable_node->hlist.first)
> + if (!hlist_empty(&stable_node->hlist))
> ksm_pages_sharing--;
> else
> ksm_pages_shared--;
>
* [PATCH 5/6] ksm: use find_mergeable_vma in try_to_merge_with_ksm_page
2015-10-15 16:04 [PATCH 0/6] KSM fixes Andrea Arcangeli
` (3 preceding siblings ...)
2015-10-15 16:04 ` [PATCH 4/6] ksm: use the helper method to do the hlist_empty check Andrea Arcangeli
@ 2015-10-15 16:04 ` Andrea Arcangeli
2015-10-25 23:21 ` Hugh Dickins
2015-10-15 16:04 ` [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup Andrea Arcangeli
5 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-15 16:04 UTC (permalink / raw)
To: Hugh Dickins, Petr Holasek; +Cc: linux-mm, Andrew Morton
Doing the VM_MERGEABLE check after the page == kpage check doesn't
provide any meaningful benefit. The !vma->anon_vma check of
find_mergeable_vma() is the only superfluous bit in using
find_mergeable_vma(), because the !PageAnon check of
try_to_merge_one_page() implicitly checks for that, but it still looks
cleaner to share the same find_mergeable_vma().
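For reference, find_mergeable_vma() looks roughly like this in the
unpatched tree (1/6 earlier in this series removes the ksm_test_exit()
check at the top):

	static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
							 unsigned long addr)
	{
		struct vm_area_struct *vma;
		if (ksm_test_exit(mm))
			return NULL;
		vma = find_vma(mm, addr);
		if (!vma || vma->vm_start > addr)
			return NULL;
		if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
			return NULL;
		return vma;
	}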
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/ksm.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 241588e..10618a3 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1057,8 +1057,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
if (page == kpage) /* ksm page forked */
return 0;
- if (!(vma->vm_flags & VM_MERGEABLE))
- goto out;
if (PageTransCompound(page) && page_trans_compound_anon_split(page))
goto out;
BUG_ON(PageTransCompound(page));
@@ -1135,8 +1133,8 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
return err;
down_read(&mm->mmap_sem);
- vma = find_vma(mm, rmap_item->address);
- if (!vma || vma->vm_start > rmap_item->address)
+ vma = find_mergeable_vma(mm, rmap_item->address);
+ if (!vma)
goto out;
err = try_to_merge_one_page(vma, page, kpage);
* Re: [PATCH 5/6] ksm: use find_mergeable_vma in try_to_merge_with_ksm_page
2015-10-15 16:04 ` [PATCH 5/6] ksm: use find_mergeable_vma in try_to_merge_with_ksm_page Andrea Arcangeli
@ 2015-10-25 23:21 ` Hugh Dickins
0 siblings, 0 replies; 19+ messages in thread
From: Hugh Dickins @ 2015-10-25 23:21 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> Doing the VM_MERGEABLE check after the page == kpage check doesn't
> provide any meaningful benefit. The !vma->anon_vma check of
> find_mergeable_vma() is the only superfluous bit in using
> find_mergeable_vma(), because the !PageAnon check of
> try_to_merge_one_page() implicitly checks for that, but it still looks
> cleaner to share the same find_mergeable_vma().
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
This looks like a nice little cleanup; I'm not 100% sure of it, forked
pages are always awkward here; but you're clearly more in touch with
this now than I am, and I've seen no problem from it, so let's go with
this.
> ---
> mm/ksm.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 241588e..10618a3 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1057,8 +1057,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
> if (page == kpage) /* ksm page forked */
> return 0;
>
> - if (!(vma->vm_flags & VM_MERGEABLE))
> - goto out;
> if (PageTransCompound(page) && page_trans_compound_anon_split(page))
> goto out;
> BUG_ON(PageTransCompound(page));
> @@ -1135,8 +1133,8 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
> return err;
>
> down_read(&mm->mmap_sem);
> - vma = find_vma(mm, rmap_item->address);
> - if (!vma || vma->vm_start > rmap_item->address)
> + vma = find_mergeable_vma(mm, rmap_item->address);
> + if (!vma)
> goto out;
>
> err = try_to_merge_one_page(vma, page, kpage);
* [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup
2015-10-15 16:04 [PATCH 0/6] KSM fixes Andrea Arcangeli
` (4 preceding siblings ...)
2015-10-15 16:04 ` [PATCH 5/6] ksm: use find_mergeable_vma in try_to_merge_with_ksm_page Andrea Arcangeli
@ 2015-10-15 16:04 ` Andrea Arcangeli
2015-10-25 23:18 ` Hugh Dickins
5 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-10-15 16:04 UTC (permalink / raw)
To: Hugh Dickins, Petr Holasek; +Cc: linux-mm, Andrew Morton
get_mergeable_page() can only return NULL (in case of errors) or the
pinned mergeable page. It can't return an error other than NULL.
This makes the code more readable and less confusing, in addition to
optimizing the check.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/ksm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 10618a3..dcefc37 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1409,7 +1409,7 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
cond_resched();
tree_rmap_item = rb_entry(*new, struct rmap_item, node);
tree_page = get_mergeable_page(tree_rmap_item);
- if (IS_ERR_OR_NULL(tree_page))
+ if (!tree_page)
return NULL;
/*
* Re: [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup
2015-10-15 16:04 ` [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup Andrea Arcangeli
@ 2015-10-25 23:18 ` Hugh Dickins
2015-11-01 23:45 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2015-10-25 23:18 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> get_mergeable_page() can only return NULL (in case of errors) or the
> pinned mergeable page. It can't return an error other than NULL.
> This makes the code more readable and less confusing, in addition to
> optimizing the check.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
I share your sentiment, prefer to avoid an unnecessary IS_ERR_OR_NULL.
And you may be right that it's unnecessary; but that's far from clear
to me, and you haven't changed the IS_ERR_OR_NULL after follow_page()
in get_mergeable_page() where it originates, so I wonder if you just
got confused on this.
Even if you have established that there's currently no way that
follow_page(vma, addr, FOLL_GET) could return an -errno on a vma
validated by find_mergeable_vma(), I think we'd still be better off
to allow for some future -errno there; but I'd be happy for you to
keep the change below, but also adjust get_mergeable_page() to
convert an -errno immediately to NULL after follow_page().
So, I think this is gently Nacked in its present form,
but a replacement eagerly Acked.
Hugh
> ---
> mm/ksm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 10618a3..dcefc37 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1409,7 +1409,7 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> cond_resched();
> tree_rmap_item = rb_entry(*new, struct rmap_item, node);
> tree_page = get_mergeable_page(tree_rmap_item);
> - if (IS_ERR_OR_NULL(tree_page))
> + if (!tree_page)
> return NULL;
>
> /*
* Re: [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup
2015-10-25 23:18 ` Hugh Dickins
@ 2015-11-01 23:45 ` Andrea Arcangeli
2015-11-02 0:23 ` Hugh Dickins
0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2015-11-01 23:45 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Petr Holasek, linux-mm, Andrew Morton
On Sun, Oct 25, 2015 at 04:18:05PM -0700, Hugh Dickins wrote:
> On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
>
> > get_mergeable_page() can only return NULL (in case of errors) or the
> > pinned mergeable page. It can't return an error other than NULL.
> > This makes the code more readable and less confusing, in addition to
> > optimizing the check.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>
> I share your sentiment, prefer to avoid an unnecessary IS_ERR_OR_NULL.
> And you may be right that it's unnecessary; but that's far from clear
> to me, and you haven't changed the IS_ERR_OR_NULL after follow_page()
> in get_mergeable_page() where it originates, so I wonder if you just
> got confused on this.
>
> Even if you have established that there's currently no way that
> follow_page(vma, addr, FOLL_GET) could return an -errno on a vma
> validated by find_mergeable_vma(), I think we'd still be better off
> to allow for some future -errno there; but I'd be happy for you to
> keep the change below, but also adjust get_mergeable_page() to
> convert an -errno immediately to NULL after follow_page().
>
> So, I think this is gently Nacked in its present form,
> but a replacement eagerly Acked.
The "out:" label is followed by page = NULL, so if follow_page returns
an error, get_mergeable_page still cannot return an error.
If this wasn't the case, get_mergeable_page would return an
uninitialized pointer if find_mergeable_vma would return NULL.
I guess the IS_ERR_OR_NULL that I removed, was the direct result of
overlooking the location of the "out:".
If there was a return after the "out:" the readability would have been
improved, but I haven't touched the code around the "out:". That's the
way it was before. Now I'll add a return in this same patch after the
"out:" before resubmitting without 1/6.
* Re: [PATCH 6/6] ksm: unstable_tree_search_insert error checking cleanup
2015-11-01 23:45 ` Andrea Arcangeli
@ 2015-11-02 0:23 ` Hugh Dickins
0 siblings, 0 replies; 19+ messages in thread
From: Hugh Dickins @ 2015-11-02 0:23 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Petr Holasek, linux-mm, Andrew Morton
On Mon, 2 Nov 2015, Andrea Arcangeli wrote:
> On Sun, Oct 25, 2015 at 04:18:05PM -0700, Hugh Dickins wrote:
> > On Thu, 15 Oct 2015, Andrea Arcangeli wrote:
> >
> > > get_mergeable_page() can only return NULL (in case of errors) or the
> > > pinned mergeable page. It can't return an error other than NULL.
> > > This makes the code more readable and less confusing, in addition to
> > > optimizing the check.
> > >
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> >
> > I share your sentiment, prefer to avoid an unnecessary IS_ERR_OR_NULL.
> > And you may be right that it's unnecessary; but that's far from clear
> > to me, and you haven't changed the IS_ERR_OR_NULL after follow_page()
> > in get_mergeable_page() where it originates, so I wonder if you just
> > got confused on this.
> >
> > Even if you have established that there's currently no way that
> > follow_page(vma, addr, FOLL_GET) could return an -errno on a vma
> > validated by find_mergeable_vma(), I think we'd still be better off
> > to allow for some future -errno there; but I'd be happy for you to
> > keep the change below, but also adjust get_mergeable_page() to
> > convert an -errno immediately to NULL after follow_page().
> >
> > So, I think this is gently Nacked in its present form,
> > but a replacement eagerly Acked.
>
> The "out:" label is followed by page = NULL, so if follow_page returns
> an error, get_mergeable_page still cannot return an error.
Ah, yes, I missed that, thanks.
Acked-by: Hugh Dickins <hughd@google.com>
>
> If this wasn't the case, get_mergeable_page would return an
> uninitialized pointer if find_mergeable_vma would return NULL.
>
> I guess the IS_ERR_OR_NULL that I removed, was the direct result of
> overlooking the location of the "out:".
>
> If there was a return after the "out:" the readability would have been
> improved, but I haven't touched the code around the "out:". That's the
> way it was before. Now I'll add a return in this same patch after the
> "out:" before resubmitting without 1/6.
I can't very well claim now that it was readable before, can I?-)
We'll see; I wouldn't mind if you leave this patch as is.
Hugh