stable.vger.kernel.org archive mirror
* [PATCH stable pre 3.9] mm, gup: close FOLL MAP_PRIVATE race
From: Michal Hocko @ 2016-10-19 20:49 UTC
  To: Stable tree
  Cc: Michal Hocko, Linus Torvalds, Andy Lutomirski, Hugh Dickins,
	Phil not Paul Oester

From: Michal Hocko <mhocko@suse.com>

commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

This is an alternative fix. See more below.

faultin_page drops FOLL_WRITE after the page fault handler did the CoW
and then we retry follow_page_mask to get our CoWed page. This is racy,
however, because the page might have been unmapped by that time and so
we would have to do a page fault again, this time without CoW. This
would cause page cache corruption for FOLL_FORCE on MAP_PRIVATE
read-only mappings, with obvious consequences.
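
To make the window concrete, here is a rough sketch of the old flow in
__get_user_pages() for a FOLL_FORCE|FOLL_WRITE request on a read-only
MAP_PRIVATE mapping. This is simplified and abbreviated, not the actual
kernel source; the names follow the pre-3.9 mm/memory.c code:

	for (;;) {
		page = follow_page(vma, address, foll_flags);
		if (page)
			break;			/* return this page to the caller */
		ret = handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
		if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
			foll_flags &= ~FOLL_WRITE;	/* CoW done, retry as a read */
		/*
		 * Race window: madvise(MADV_DONTNEED) or KSM can drop the
		 * freshly CoWed page here, so the retried lookup faults
		 * again as a plain read and hands back the original
		 * read-only page cache page, which the caller then writes.
		 */
	}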

This is an ancient bug that was actually already fixed once by Linus
eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race
for write access") but that was then undone due to problems on s390
by commit f33ea7f404e5 ("fix get_user_pages bug") because s390 didn't
have proper dirty pte tracking until abf09bed3cce ("s390/mm: implement
software dirty bits"). This wasn't a problem at the time, as pointed out
by Hugh Dickins, because madvise relied on mmap_sem held for write up
until 0a27a14a6292 ("mm: madvise avoid exclusive mmap_sem"), but since
then we can race with madvise, which can unmap the freshly COWed page,
or with KSM, and corrupt the content of the shared page.

This patch is based on Linus' approach of not clearing FOLL_WRITE after the
CoW page fault (aka VM_FAULT_WRITE) and instead introducing FOLL_COW to note
this fact. The flag is then rechecked when we translate the pte to the page
again, to force the page fault again if we do not see the CoWed page. Linus
suggested checking pte_dirty again, as s390 is OK now, but that would make
backporting to some old kernels harder. So instead let's just make sure that
vm_normal_page sees a pure anonymous page.

This guarantees we are seeing a real CoW page. Introduce
can_follow_write_pte, which checks pte_write and falls back to
PageAnon on forced write faults that have already passed CoW. Thanks to
Hugh for pointing out that special care has to be taken for KSM pages,
because our COWed page might have been merged with a KSM one while
keeping its PageAnon flag.

Fixes: 0a27a14a6292 ("mm: madvise avoid exclusive mmap_sem")
Cc: stable@vger.kernel.org # 2.6.22+
Reported-by: Phil "not Paul" Oester <kernel@linuxace.com>
Disclosed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
the patch should apply cleanly on 3.0, 3.2 and 3.4. Other versions might
need small tweaks, but the essence of the patch should be the same.
Let me know if something is not clear.

 include/linux/mm.h |  1 +
 mm/memory.c        | 38 +++++++++++++++++++++++++++-----------
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ceebf63ef5da..ef706abe56be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1525,6 +1525,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_COW	0x4000	/* internal GUP flag */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 4774579d8acb..2fc9a2cddda3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1447,6 +1447,24 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					unsigned int flags)
+{
+	if (pte_write(pte))
+		return true;
+
+	/*
+	 * Make sure that we are really following a CoWed page. We do not
+	 * have to care about exclusiveness of the page; we only want to
+	 * ensure that the once-COWed page hasn't disappeared in the
+	 * meantime and hasn't been merged into a KSM page.
+	 */
+	if ((flags & FOLL_FORCE) && (flags & FOLL_COW))
+		return page && PageAnon(page) && !PageKsm(page);
+
+	return false;
+}
+
 /**
  * follow_page - look up a page descriptor from a user-virtual address
  * @vma: vm_area_struct mapping @address
@@ -1529,10 +1547,13 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !pte_write(pte))
-		goto unlock;
 
 	page = vm_normal_page(vma, address, pte);
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, page, flags)) {
+		pte_unmap_unlock(ptep, ptl);
+		return NULL;
+	}
+
 	if (unlikely(!page)) {
 		if ((flags & FOLL_DUMP) ||
 		    !is_zero_pfn(pte_pfn(pte)))
@@ -1575,7 +1596,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			unlock_page(page);
 		}
 	}
-unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
 	return page;
@@ -1809,17 +1829,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				 * The VM_FAULT_WRITE bit tells us that
 				 * do_wp_page has broken COW when necessary,
 				 * even if maybe_mkwrite decided not to set
-				 * pte_write. We can thus safely do subsequent
-				 * page lookups as if they were reads. But only
-				 * do so when looping for pte_write is futile:
-				 * in some cases userspace may also be wanting
-				 * to write to the gotten user page, which a
-				 * read fault here might prevent (a readonly
-				 * page might get reCOWed by userspace write).
+				 * pte_write. We cannot simply drop FOLL_WRITE
+				 * here because the COWed page might be gone by
+				 * the time we do the subsequent page lookups.
 				 */
 				if ((ret & VM_FAULT_WRITE) &&
 				    !(vma->vm_flags & VM_WRITE))
-					foll_flags &= ~FOLL_WRITE;
+					foll_flags |= FOLL_COW;
 
 				cond_resched();
 			}
-- 
2.9.3



* Re: [PATCH stable pre 3.9] mm, gup: close FOLL MAP_PRIVATE race
From: Willy Tarreau @ 2016-10-19 22:22 UTC
  To: Michal Hocko
  Cc: Stable tree, Michal Hocko, Linus Torvalds, Andy Lutomirski,
	Hugh Dickins, Phil not Paul Oester

Hi Michal,

On Wed, Oct 19, 2016 at 10:49:32PM +0200, Michal Hocko wrote:
> the patch should apply cleanly on 3.0, 3.2 and 3.4. Other versions might
> need small tweaks but the essence of the patch should be same.
> Let me know if something is not clear.
(...)

Thanks. And FWIW, here is the backport I made for 3.10, which should also
work for 3.9 to 3.15 (before the move from mm/memory.c to mm/gup.c). It was
tested on 3.10 only. Probably only Jiri is interested in this one anyway.

Willy

From f9022af65029866d4a7a9a29521a66a9af38878c Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 13 Oct 2016 13:07:36 -0700
Subject: mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better).  The
s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
software dirty bits") which made it into v3.9.  Earlier kernels will
have to look at the page state itself.

Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.

To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.

Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask;
     s/faultin_page/__get_user_page]
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 53b0d70..55590f4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1715,6 +1715,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_COW	0x4000	/* internal GUP flag */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 10cdade..2ca2ee1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1462,6 +1462,16 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
+/*
+ * FOLL_FORCE can write to even unwritable pte's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+{
+	return pte_write(pte) ||
+		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+}
+
 /**
  * follow_page_mask - look up a page descriptor from a user-virtual address
  * @vma: vm_area_struct mapping @address
@@ -1569,7 +1579,7 @@ split_fallthrough:
 	}
 	if ((flags & FOLL_NUMA) && pte_numa(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !pte_write(pte))
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags))
 		goto unlock;
 
 	page = vm_normal_page(vma, address, pte);
@@ -1877,7 +1887,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				 */
 				if ((ret & VM_FAULT_WRITE) &&
 				    !(vma->vm_flags & VM_WRITE))
-					foll_flags &= ~FOLL_WRITE;
+					foll_flags |= FOLL_COW;
 
 				cond_resched();
 			}
-- 
2.8.0.rc2.1.gbe9624a

