From: Jerome Glisse <jglisse@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
David Hildenbrand <david@redhat.com>,
Hugh Dickins <hughd@google.com>, Maya Gokhale <gokhale2@llnl.gov>,
Pavel Emelyanov <xemul@virtuozzo.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Martin Cracauer <cracauer@cons.org>, Shaohua Li <shli@fb.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Denis Plotnikov <dplotnikov@virtuozzo.com>,
Mike Rapoport <rppt@linux.vnet.ibm.com>,
Marty McFadden <mcfadden8@llnl.gov>, Mel Gorman <mgorman@suse.de>,
"Kirill A . Shutemov" <kirill@shutemov.name>,
"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
Date: Fri, 19 Apr 2019 11:02:53 -0400 [thread overview]
Message-ID: <20190419150253.GA3311@redhat.com> (raw)
In-Reply-To: <20190419062650.GF13323@xz-x1>
On Fri, Apr 19, 2019 at 02:26:50PM +0800, Peter Xu wrote:
> On Thu, Apr 18, 2019 at 04:51:15PM -0400, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> > > This allows uffd-wp to support write-protected pages for COW.
> > >
> > > For example, the uffd write-protected PTE could also be write-protected
> > > by other usages like COW or zero pages. When that happens, we can't
> > > simply set the write bit in the PTE since otherwise it'll change the
> > > content of every single reference to the page. Instead, we should do
> > > the COW first if necessary, then handle the uffd-wp fault.
> > >
> > > To correctly copy the page, we'll also need to carry over the
> > > _PAGE_UFFD_WP bit if it was set in the original PTE.
> > >
> > > For huge PMDs, we just simply split the huge PMDs where we want to
> > > resolve an uffd-wp page fault always. That matches what we do with
> > > general huge PMD write protections. In that way, we resolved the huge
> > > PMD copy-on-write issue into PTE copy-on-write.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> >
> > This one has a bug see below.
> >
> >
> > > ---
> > > mm/memory.c | 5 +++-
> > > mm/mprotect.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > > 2 files changed, 65 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e7a4b9650225..b8a4c0bab461 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
> > > }
> > > flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> > > entry = mk_pte(new_page, vma->vm_page_prot);
> > > - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > > + if (pte_uffd_wp(vmf->orig_pte))
> > > + entry = pte_mkuffd_wp(entry);
> > > + else
> > > + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > > /*
> > > * Clear the pte entry and flush it first, before updating the
> > > * pte with the new entry. This will avoid a race condition
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 9d4433044c21..855dddb07ff2 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > > flush_tlb_batched_pending(vma->vm_mm);
> > > arch_enter_lazy_mmu_mode();
> > > do {
> > > +retry_pte:
> > > oldpte = *pte;
> > > if (pte_present(oldpte)) {
> > > pte_t ptent;
> > > bool preserve_write = prot_numa && pte_write(oldpte);
> > > + struct page *page;
> > >
> > > /*
> > > * Avoid trapping faults against the zero or KSM
> > > * pages. See similar comment in change_huge_pmd.
> > > */
> > > if (prot_numa) {
> > > - struct page *page;
> > > -
> > > page = vm_normal_page(vma, addr, oldpte);
> > > if (!page || PageKsm(page))
> > > continue;
> > > @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > > continue;
> > > }
> > >
> > > + /*
> > > + * Detect whether we'll need to COW before
> > > + * resolving an uffd-wp fault. Note that this
> > > + * includes detection of the zero page (where
> > > + * page==NULL)
> > > + */
> > > + if (uffd_wp_resolve) {
> > > + /* If the fault is resolved already, skip */
> > > + if (!pte_uffd_wp(*pte))
> > > + continue;
> > > + page = vm_normal_page(vma, addr, oldpte);
> > > + if (!page || page_mapcount(page) > 1) {
> > > + struct vm_fault vmf = {
> > > + .vma = vma,
> > > + .address = addr & PAGE_MASK,
> > > + .page = page,
> > > + .orig_pte = oldpte,
> > > + .pmd = pmd,
> > > + /* pte and ptl not needed */
> > > + };
> > > + vm_fault_t ret;
> > > +
> > > + if (page)
> > > + get_page(page);
> > > + arch_leave_lazy_mmu_mode();
> > > + pte_unmap_unlock(pte, ptl);
> > > + ret = wp_page_copy(&vmf);
> > > + /* PTE is changed, or OOM */
> > > + if (ret == 0)
> > > + /* It's done by others */
> > > + continue;
> >
> > This is wrong if ret == 0 you still need to remap the pte before
> > continuing as otherwise you will go to next pte without the page
> > table lock for the directory. So 0 case must be handled after
> > arch_enter_lazy_mmu_mode() below.
> >
> > Sorry i should have catch that in previous review.
>
> My fault to not have noticed it since the very beginning... thanks for
> spotting that.
>
> I'm squashing below changes into the patch:
Well thinking of this some more i think you should use do_wp_page() and
not wp_page_copy() it would avoid bunch of code above and also you are
not properly handling KSM page or page in the swap cache. Instead of
duplicating same code that is in do_wp_page() it would be better to call
it here.
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 3cddfd6627b8..13d493b836bb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -141,22 +141,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(pte, ptl);
> ret = wp_page_copy(&vmf);
> - /* PTE is changed, or OOM */
> - if (ret == 0)
> - /* It's done by others */
> - continue;
> - else if (WARN_ON(ret != VM_FAULT_WRITE))
> + if (ret != VM_FAULT_WRITE && ret != 0)
> + /* Probably OOM */
> return pages;
> pte = pte_offset_map_lock(vma->vm_mm,
> pmd, addr,
> &ptl);
> arch_enter_lazy_mmu_mode();
> - if (!pte_present(*pte))
> + if (ret == 0 || !pte_present(*pte))
> /*
> * This PTE could have been
> - * modified after COW
> - * before we have taken the
> - * lock; retry this PTE
> + * modified during or after
> + * COW before take the lock;
> + * retry.
> */
> goto retry_pte;
> }
>
> [...]
>
> > > if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> > > - if (next - addr != HPAGE_PMD_SIZE) {
> > > + /*
> > > + * When resolving an userfaultfd write
> > > + * protection fault, it's not easy to identify
> > > + * whether a THP is shared with others and
> > > + * whether we'll need to do copy-on-write, so
> > > + * just split it always for now to simply the
> > > + * procedure. And that's the policy too for
> > > + * general THP write-protect in af9e4d5f2de2.
> > > + */
> > > + if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> >
> > Just a nit pick can you please add () to next - addr ie:
> > if ((next - addr) != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> >
> > I know it is not needed but each time i bump into this i
> > have to scratch my head for second to remember the operator
> > rules :)
>
> Sure, as usual. :) And I tend to agree it's a good habit. It's just
> me that always forgot about it.
>
> Thanks,
>
> --
> Peter Xu
next prev parent reply other threads:[~2019-04-19 15:03 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-20 2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
2019-03-20 2:06 ` [PATCH v3 01/28] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
2019-03-20 2:06 ` [PATCH v3 02/28] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
2019-03-20 2:06 ` [PATCH v3 03/28] userfaultfd: don't retake mmap_sem to emulate NOPAGE Peter Xu
2019-03-20 2:06 ` [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
2019-04-18 20:11 ` Jerome Glisse
2019-04-19 6:00 ` Peter Xu
2019-03-20 2:06 ` [PATCH v3 05/28] mm: gup: " Peter Xu
2019-03-20 2:06 ` [PATCH v3 06/28] userfaultfd: wp: add helper for writeprotect check Peter Xu
2019-03-20 2:06 ` [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
2019-04-18 20:03 ` Jerome Glisse
2019-03-20 2:06 ` [PATCH v3 08/28] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
2019-03-20 2:06 ` [PATCH v3 09/28] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
2019-03-20 2:06 ` [PATCH v3 10/28] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
2019-03-20 2:06 ` [PATCH v3 11/28] mm: merge parameters for change_protection() Peter Xu
2019-03-20 2:06 ` [PATCH v3 12/28] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
2019-03-20 2:06 ` [PATCH v3 13/28] mm: export wp_page_copy() Peter Xu
2019-03-20 2:06 ` [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
2019-04-18 20:51 ` Jerome Glisse
2019-04-19 6:26 ` Peter Xu
2019-04-19 15:02 ` Jerome Glisse [this message]
2019-04-22 12:20 ` Peter Xu
2019-04-22 14:54 ` Jerome Glisse
2019-04-23 3:00 ` Peter Xu
2019-04-23 15:34 ` Jerome Glisse
2019-04-24 8:38 ` Peter Xu
2019-03-20 2:06 ` [PATCH v3 15/28] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
2019-03-20 2:06 ` [PATCH v3 16/28] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
2019-03-20 2:06 ` [PATCH v3 17/28] userfaultfd: wp: support swap and page migration Peter Xu
2019-04-18 20:59 ` Jerome Glisse
2019-04-19 7:42 ` Peter Xu
2019-04-19 15:08 ` Jerome Glisse
2019-04-22 12:23 ` Peter Xu
2019-03-20 2:06 ` [PATCH v3 18/28] khugepaged: skip collapse if uffd-wp detected Peter Xu
2019-03-20 2:06 ` [PATCH v3 19/28] userfaultfd: introduce helper vma_find_uffd Peter Xu
2019-03-20 2:06 ` [PATCH v3 20/28] userfaultfd: wp: support write protection for userfault vma range Peter Xu
2019-03-20 2:06 ` [PATCH v3 21/28] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
2019-03-20 2:06 ` [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
2019-03-22 21:37 ` Mike Rapoport
2019-03-20 2:06 ` [PATCH v3 23/28] userfaultfd: wp: don't wake up when doing write protect Peter Xu
2019-03-20 2:06 ` [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
2019-03-22 21:46 ` Mike Rapoport
2019-03-20 2:06 ` [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range Peter Xu
2019-04-18 21:01 ` Jerome Glisse
2019-03-20 2:06 ` [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally Peter Xu
2019-03-22 21:43 ` Mike Rapoport
2019-03-20 2:06 ` [PATCH v3 27/28] userfaultfd: selftests: refactor statistics Peter Xu
2019-03-20 2:06 ` [PATCH v3 28/28] userfaultfd: selftests: add write-protect test Peter Xu
2019-04-09 6:08 ` [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
2019-04-18 21:07 ` Jerome Glisse
2019-04-19 7:53 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190419150253.GA3311@redhat.com \
--to=jglisse@redhat.com \
--cc=aarcange@redhat.com \
--cc=cracauer@cons.org \
--cc=david@redhat.com \
--cc=dgilbert@redhat.com \
--cc=dplotnikov@virtuozzo.com \
--cc=gokhale2@llnl.gov \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mcfadden8@llnl.gov \
--cc=mgorman@suse.de \
--cc=mike.kravetz@oracle.com \
--cc=peterx@redhat.com \
--cc=rppt@linux.vnet.ibm.com \
--cc=shli@fb.com \
--cc=xemul@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.