All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: David Stevens <stevensd@chromium.org>
Cc: linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Yang Shi <shy828301@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
Date: Thu, 16 Feb 2023 09:41:22 -0500	[thread overview]
Message-ID: <Y+5Akpz4CvGywt6R@x1n> (raw)
In-Reply-To: <CAD=HUj69L2e-Z4TB19qFt8h1cn0r1oGbWovJGMOjjyvfDcQ7NA@mail.gmail.com>

On Thu, Feb 16, 2023 at 10:37:47AM +0900, David Stevens wrote:
> On Thu, Feb 16, 2023 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Make sure that collapse_file respects any userfaultfds registered with
> > > MODE_MISSING. If userspace has any such userfaultfds registered, then
> > > for any page which it knows to be missing, it may expect a
> > > UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
> > > collapsing a shmem range would result in replacing an empty page with a
> > > THP, so that it doesn't break userfaultfd.
> > >
> > > Synchronization when checking for userfaultfds in collapse_file is
> > > tricky because the mmap locks can't be used to prevent races with the
> > > registration of new userfaultfds. Instead, we provide synchronization by
> > > ensuring that userspace cannot observe the fact that pages are missing
> > > before we check for userfaultfds. Although this allows registration of a
> > > userfaultfd to race with collapse_file, it ensures that userspace cannot
> > > observe any pages transition from missing to present after such a race.
> > > This makes such a race indistinguishable to the collapse occurring
> > > immediately before the userfaultfd registration.
> > >
> > > The first step to provide this synchronization is to stop filling gaps
> > > during the loop iterating over the target range, since the page cache
> > > lock can be dropped during that loop. The second step is to fill the
> > > gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> > > time, to avoid races with accesses to the page cache that only take the
> > > RCU read lock.
> > >
> > > This fix is targeted at khugepaged, but the change also applies to
> > > MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> > > return EBUSY if there are any missing pages (instead of succeeding on
> > > shmem and returning EINVAL on anonymous memory). There is also now a
> > > window during MADV_COLLAPSE where a fault on a missing page will cause
> > > the syscall to fail with EAGAIN.
> > >
> > > The fact that intermediate page cache state can no longer be observed
> > > before the rollback of a failed collapse is also technically a
> > > userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> > > is exceedingly unlikely that anything relies on being able to observe
> > > that transient state.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  mm/khugepaged.c | 66 +++++++++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 58 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index b648f1053d95..8c2e2349e883 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -55,6 +55,7 @@ enum scan_result {
> > >       SCAN_CGROUP_CHARGE_FAIL,
> > >       SCAN_TRUNCATED,
> > >       SCAN_PAGE_HAS_PRIVATE,
> > > +     SCAN_PAGE_FILLED,
> >
> > PS: You may want to also touch SCAN_STATUS in huge_memory.h next time.
> >
> > >  };
> > >
> > >  #define CREATE_TRACE_POINTS
> > > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > >   *  - allocate and lock a new huge page;
> > >   *  - scan page cache replacing old pages with the new one
> > >   *    + swap/gup in pages if necessary;
> > > - *    + fill in gaps;
> >
> > IIUC it's not a complete removal, but just moved downwards:
> >
> > >   *    + keep old pages around in case rollback is required;
> > > + *  - finalize updates to the page cache;
> >
> >          + fill in gaps with RETRY entries
> >          + detect race conditions with userfaultfds
> >
> > >   *  - if replacing succeeds:
> > >   *    + copy data over;
> > >   *    + free old pages;
> > > @@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >                                               result = SCAN_TRUNCATED;
> > >                                               goto xa_locked;
> > >                                       }
> > > -                                     xas_set(&xas, index);
> > > +                                     xas_set(&xas, index + 1);
> > >                               }
> > >                               if (!shmem_charge(mapping->host, 1)) {
> > >                                       result = SCAN_FAIL;
> > >                                       goto xa_locked;
> > >                               }
> > > -                             xas_store(&xas, hpage);
> > >                               nr_none++;
> > >                               continue;
> > >                       }
> > > @@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >               put_page(page);
> > >               goto xa_unlocked;
> > >       }
> > > +
> > > +     if (nr_none) {
> > > +             struct vm_area_struct *vma;
> > > +             int nr_none_check = 0;
> > > +
> > > +             xas_unlock_irq(&xas);
> > > +             i_mmap_lock_read(mapping);
> > > +             xas_lock_irq(&xas);
> > > +
> > > +             xas_set(&xas, start);
> > > +             for (index = start; index < end; index++) {
> > > +                     if (!xas_next(&xas)) {
> > > +                             xas_store(&xas, XA_RETRY_ENTRY);
> > > +                             nr_none_check++;
> > > +                     }
> > > +             }
> > > +
> > > +             if (nr_none != nr_none_check) {
> > > +                     result = SCAN_PAGE_FILLED;
> > > +                     goto immap_locked;
> > > +             }
> > > +
> > > +             /*
> > > +              * If userspace observed a missing page in a VMA with an armed
> > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > +              * that page, so we need to roll back to avoid suppressing such
> > > +              * an event. Any userfaultfds armed after this point will not be
> > > +              * able to observe any missing pages due to the previously
> > > +              * inserted retry entries.
> > > +              */
> > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > +                     if (userfaultfd_missing(vma)) {
> > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > +                             goto immap_locked;
> > > +                     }
> > > +             }
> > > +
> > > +immap_locked:
> > > +             i_mmap_unlock_read(mapping);
> > > +             if (result != SCAN_SUCCEED) {
> > > +                     xas_set(&xas, start);
> > > +                     for (index = start; index < end; index++) {
> > > +                             if (xas_next(&xas) == XA_RETRY_ENTRY)
> > > +                                     xas_store(&xas, NULL);
> > > +                     }
> > > +
> > > +                     goto xa_locked;
> > > +             }
> > > +     }
> > > +
> >
> > Until here, all look fine to me (ignoring patch 1 for now; assuming the
> > hpage is always uptodate).
> >
> > My question is after here we'll release page cache lock again before
> > try_to_unmap_flush(), but is it safe to keep RETRY entries after releasing
> > page cache lock?  It means other threads can be spinning.  I assume page
> > lock is always safe and sleepable, but not sure about the page cache lock
> > here.
> 
> We insert the multi-index entry for hpage before releasing the page
> cache lock, which should replace all of the XA_RETRY_ENTRYs. So the
> page cache will be fully up to date when we release the lock, at least
> in terms of which pages it contains.

IIUC we released it before copying the pages:

xa_locked:
	xas_unlock_irq(&xas);  <-------------------------------- here
xa_unlocked:

	/*
	 * If collapse is successful, flush must be done now before copying.
	 * If collapse is unsuccessful, does flush actually need to be done?
	 * Do it anyway, to clear the state.
	 */
	try_to_unmap_flush();

Before insertion of the multi-index:

        /* Join all the small entries into a single multi-index entry. */
        xas_set_order(&xas, start, HPAGE_PMD_ORDER);
        xas_store(&xas, hpage);

Thanks,

-- 
Peter Xu



  reply	other threads:[~2023-02-16 14:41 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-14  7:57 [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem David Stevens
2023-02-14  7:57 ` [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd David Stevens
2023-02-14 22:35   ` Peter Xu
2023-02-15  1:57     ` David Stevens
2023-02-15 22:27       ` Peter Xu
2023-02-15 22:48   ` Peter Xu
2023-02-16  1:37     ` David Stevens
2023-02-16 14:41       ` Peter Xu [this message]
2023-02-16 21:58         ` Yang Shi
2023-02-16 23:07           ` Peter Xu
2023-02-16 23:52             ` Yang Shi
2023-02-17  2:00         ` David Stevens
2023-02-17  3:20           ` Yang Shi
2023-02-14 15:44 ` [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem Matthew Wilcox
2023-02-15  1:33   ` David Stevens
2023-02-15 22:05     ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+5Akpz4CvGywt6R@x1n \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=shy828301@gmail.com \
    --cc=stevensd@chromium.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.