From: Yu Zhao <yuzhao@google.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Peter Xu <peterx@redhat.com>, Nadav Amit <nadav.amit@gmail.com>,
linux-mm <linux-mm@kvack.org>,
lkml <linux-kernel@vger.kernel.org>,
Pavel Emelyanov <xemul@openvz.org>,
Mike Kravetz <mike.kravetz@oracle.com>,
Mike Rapoport <rppt@linux.vnet.ibm.com>,
stable <stable@vger.kernel.org>, Minchan Kim <minchan@kernel.org>,
Will Deacon <will@kernel.org>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Tue, 22 Dec 2020 20:36:04 -0700 [thread overview]
Message-ID: <X+K7JMrTEC9SpVIB@google.com> (raw)
In-Reply-To: <X+Kxy3oBMSLz8Eaq@redhat.com>
On Tue, Dec 22, 2020 at 09:56:11PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> > We are talking about non-COW anon pages here -- they can't be mapped
> > more than once. So why not just identify them by checking
> > page_mapcount == 1 and then unconditionally reuse them? (This is
> > probably where I've missed things.)
>
> The problem in depending on page_mapcount to decide if it's COW or
> non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP
> may elevate the count of a COW anon page that become a non-COW anon
> page.
>
> This is Jann's idea not mine.
>
> The problem is we have an unprivileged long term GUP like vmsplice
> that facilitates elevating the page count indefinitely, until the
> parent finally writes a secret to it. Theoretically a short term pin
> would do it too so it's not just vmpslice, but the short term pin
> would be incredibly more challenging to become a concern since it'd
> kill a phone battery and flash before it can read any data.
>
> So what happens with your page_mapcount == 1 check is that it doesn't
> mean non-COW (we thought it did until it didn't for the long term gup
> pin in vmsplice).
>
> Jann's testcases does fork() and set page_mapcount 2 and page_count to
> 2, vmsplice, take unprivileged infinitely long GUP pin to set
> page_count to 3, queue the page in the pipe with page_count elevated,
> munmap to drop page_count to 2 and page_mapcount to 1.
>
> page_mapcount is 1, so you'd think the page is non-COW and owned by
> the parent, but the child can still read it so it's very much still
> wp_page_copy material if the parent tries to modify it. Otherwise the
> child can read the content.
>
> This was supposed to be solvable by just doing the COW in gup(write=0)
> case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly
> sure why that didn't fly and it had to be reverted by Peter in
> a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was
> happening I was side tracked by urgent issues and I didn't manage to
> look back of how we ended up with the big hammer page_count == 1 check
> instead to decide if to call wp_page_reuse or wp_page_shared.
>
> So anyway, the only thing that is clear to me is that keeping the
> child from reading the page_mapcount == 1 pages of the parent, is the
> only reason why wp_page_reuse(vmf) will only be called on
> page_count(page) == 1 and not on page_mapcount(page) == 1.
>
> It's also the reason why your page_mapcount assumption will risk to
> reintroduce the issue, and I only wish we could put back page_mapcount
> == 1 back there.
>
> Still even if we put back page_mapcount there, it is not ok to leave
> the page fault with stale TLB entries and to rely on the fact
> wp_page_shared won't run. It'd also avoid the problem but I think if
> you leave stale TLB entries in change_protection just like NUMA
> balancing does, it also requires a catcher just like NUMA balancing
> has, or it'd truly work by luck.
>
> So until we can put a page_mapcount == 1 check back there, the
> page_count will be by definition unreliable because of the speculative
> lookups randomly elevating all non zero page_counts at any time in the
> background on all pages, so you will never be able to tell if a page
> is true COW or if it's just a spurious COW because of a speculative
> lookup. It is impossible to differentiate a speculative lookup from a
> vmsplice ref in a child.
Thanks for the details.
In your patch, do we need to take wrprotect_rwsem in
handle_userfault() as well? Otherwise, it seems userspace would have
to synchronize between its wrprotect ioctl and fault handler? i.e.,
the fault hander needs to be aware that the content of write-
protected pages can actually change before the iotcl returns.
next prev parent reply other threads:[~2020-12-23 3:36 UTC|newest]
Thread overview: 121+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-19 4:30 [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2020-12-19 19:15 ` Andrea Arcangeli
2020-12-19 21:34 ` Nadav Amit
2020-12-19 22:06 ` Nadav Amit
2020-12-20 2:20 ` Andrea Arcangeli
2020-12-21 4:36 ` Nadav Amit
2020-12-21 5:12 ` Yu Zhao
2020-12-21 5:25 ` Nadav Amit
2020-12-21 5:39 ` Nadav Amit
2020-12-21 7:29 ` Yu Zhao
2020-12-22 20:34 ` Andy Lutomirski
2020-12-22 20:58 ` Nadav Amit
2020-12-22 21:34 ` Andrea Arcangeli
2020-12-20 2:01 ` Andy Lutomirski
2020-12-20 2:49 ` Andrea Arcangeli
2020-12-20 5:08 ` Andy Lutomirski
2020-12-21 18:03 ` Andrea Arcangeli
2020-12-21 18:22 ` Andy Lutomirski
2020-12-20 6:05 ` Yu Zhao
2020-12-20 8:06 ` Nadav Amit
2020-12-20 9:54 ` Yu Zhao
2020-12-21 3:33 ` Nadav Amit
2020-12-21 4:44 ` Yu Zhao
2020-12-21 17:27 ` Peter Xu
2020-12-21 18:31 ` Nadav Amit
2020-12-21 19:16 ` Yu Zhao
2020-12-21 19:55 ` Linus Torvalds
2020-12-21 20:21 ` Yu Zhao
2020-12-21 20:25 ` Linus Torvalds
2020-12-21 20:23 ` Nadav Amit
2020-12-21 20:26 ` Linus Torvalds
2020-12-21 21:24 ` Yu Zhao
2020-12-21 21:49 ` Nadav Amit
2020-12-21 22:30 ` Peter Xu
2020-12-21 22:55 ` Nadav Amit
2020-12-21 23:30 ` Linus Torvalds
2020-12-21 23:46 ` Nadav Amit
2020-12-22 19:44 ` Andrea Arcangeli
2020-12-22 20:19 ` Nadav Amit
2020-12-22 21:17 ` Andrea Arcangeli
2020-12-21 23:12 ` Yu Zhao
2020-12-21 23:33 ` Linus Torvalds
2020-12-22 0:00 ` Yu Zhao
2020-12-22 0:11 ` Linus Torvalds
2020-12-22 0:24 ` Yu Zhao
2020-12-21 23:22 ` Linus Torvalds
2020-12-22 3:19 ` Andy Lutomirski
2020-12-22 4:16 ` Linus Torvalds
2020-12-22 20:19 ` Andy Lutomirski
2021-01-05 15:37 ` Peter Zijlstra
2021-01-05 18:03 ` Andrea Arcangeli
2021-01-12 16:20 ` Peter Zijlstra
2021-01-12 11:43 ` Vinayak Menon
2021-01-12 15:47 ` Laurent Dufour
2021-01-12 16:57 ` Peter Zijlstra
2021-01-12 19:02 ` Laurent Dufour
2021-01-12 19:15 ` Nadav Amit
2021-01-12 19:56 ` Yu Zhao
2021-01-12 20:38 ` Nadav Amit
2021-01-12 20:49 ` Yu Zhao
2021-01-12 21:43 ` Will Deacon
2021-01-12 22:29 ` Nadav Amit
2021-01-12 22:46 ` Will Deacon
2021-01-13 0:31 ` Andy Lutomirski
2021-01-17 4:41 ` Yu Zhao
2021-01-17 7:32 ` Nadav Amit
2021-01-17 9:16 ` Yu Zhao
2021-01-17 10:13 ` Nadav Amit
2021-01-17 19:25 ` Yu Zhao
2021-01-18 2:49 ` Nadav Amit
2020-12-22 9:38 ` Nadav Amit
2020-12-22 19:31 ` Andrea Arcangeli
2020-12-22 20:15 ` Matthew Wilcox
2020-12-22 20:26 ` Andrea Arcangeli
2020-12-22 21:14 ` Yu Zhao
2020-12-22 22:02 ` Andrea Arcangeli
2020-12-22 23:39 ` Yu Zhao
2020-12-22 23:50 ` Linus Torvalds
2020-12-23 0:01 ` Linus Torvalds
2020-12-23 0:23 ` Yu Zhao
2020-12-23 2:17 ` Andrea Arcangeli
2020-12-23 9:44 ` Linus Torvalds
2020-12-23 10:06 ` Yu Zhao
2020-12-23 16:24 ` Peter Xu
2020-12-23 18:51 ` Andrea Arcangeli
2020-12-23 18:55 ` Andrea Arcangeli
2020-12-23 19:12 ` Yu Zhao
2020-12-23 19:32 ` Peter Xu
2020-12-23 0:20 ` Linus Torvalds
2020-12-23 2:56 ` Andrea Arcangeli
2020-12-23 3:36 ` Yu Zhao [this message]
2020-12-23 15:52 ` Peter Xu
2020-12-23 21:07 ` Andrea Arcangeli
2020-12-23 21:39 ` Andrea Arcangeli
2020-12-23 22:29 ` Yu Zhao
2020-12-23 23:04 ` Andrea Arcangeli
2020-12-24 1:21 ` Andy Lutomirski
2020-12-24 2:00 ` Andrea Arcangeli
2020-12-24 3:09 ` Nadav Amit
2020-12-24 3:30 ` Nadav Amit
2020-12-24 3:34 ` Yu Zhao
2020-12-24 4:01 ` Andrea Arcangeli
2020-12-24 5:18 ` Nadav Amit
2020-12-24 18:49 ` Andrea Arcangeli
2020-12-24 19:16 ` Andrea Arcangeli
2020-12-24 4:37 ` Nadav Amit
2020-12-24 3:31 ` Andrea Arcangeli
2020-12-23 23:39 ` Linus Torvalds
2020-12-24 1:01 ` Andrea Arcangeli
2020-12-22 21:14 ` Nadav Amit
2020-12-22 12:40 ` Nadav Amit
2020-12-22 18:30 ` Yu Zhao
2020-12-22 19:20 ` Nadav Amit
2020-12-23 16:23 ` Will Deacon
2020-12-23 19:04 ` Nadav Amit
2020-12-23 22:05 ` Andrea Arcangeli
2020-12-23 22:45 ` Nadav Amit
2020-12-23 23:55 ` Andrea Arcangeli
2020-12-21 21:55 ` Peter Xu
2020-12-21 23:13 ` Linus Torvalds
2020-12-21 19:53 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=X+K7JMrTEC9SpVIB@google.com \
--to=yuzhao@google.com \
--cc=aarcange@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mike.kravetz@oracle.com \
--cc=minchan@kernel.org \
--cc=nadav.amit@gmail.com \
--cc=peterx@redhat.com \
--cc=peterz@infradead.org \
--cc=rppt@linux.vnet.ibm.com \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=will@kernel.org \
--cc=xemul@openvz.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.