From: Jason Gunthorpe <jgg@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mgorman@suse.de>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast
Date: Tue, 30 Aug 2022 20:44:44 -0300 [thread overview]
Message-ID: <Yw6g7G4jvXaoBORm@nvidia.com> (raw)
In-Reply-To: <68b38ac4-c680-b694-21a9-1971396d63b9@redhat.com>
On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
> */
> if (!PageAnon(page))
> return false;
> +
> + /* See page_try_share_anon_rmap() for GUP-fast details. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> + smp_rmb();
> +
> /*
> * Note that PageKsm() pages cannot be exclusive, and consequently,
> * cannot get pinned.
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
> * @page: the exclusive anonymous page to try marking possibly shared
> *
> * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
> *
> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
> * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
> {
> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>
> - /* See page_try_dup_anon_rmap(). */
> - if (likely(!is_device_private_page(page) &&
> - unlikely(page_maybe_dma_pinned(page))))
> - return -EBUSY;
> + /* device private pages cannot get pinned via GUP. */
> + if (unlikely(is_device_private_page(page))) {
> + ClearPageAnonExclusive(page);
> + return 0;
> + }
>
> + /*
> + * We have to make sure that while we clear PageAnonExclusive, that
> + * the page is not pinned and that concurrent GUP-fast won't succeed in
> + * concurrently pinning the page.
> + *
> + * Conceptually, GUP-fast pinning code of anon pages consists of:
> + * (1) Read the PTE
> + * (2) Pin the mapped page
> + * (3) Check if the PTE changed by re-reading it; back off if so.
> + * (4) Check if PageAnonExclusive is not set; back off if so.
> + *
> + * Conceptually, PageAnonExclusive clearing code consists of:
> + * (1) Clear PTE
> + * (2) Check if the page is pinned; back off if so.
> + * (3) Clear PageAnonExclusive
> + * (4) Restore PTE (optional)
> + *
> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> + * the right order. Memory order between (2) and (3) is handled by
> + * GUP-fast, independent of PageAnonExclusive.
> + *
> + * When clearing PageAnonExclusive(), we have to make sure that (1),
> + * (2), (3) and (4) happen in the right order.
> + *
> + * Note that (4) has to happen after (3) in both cases to handle the
> + * corner case whereby the PTE is restored to the original value after
> + * clearing PageAnonExclusive and while GUP-fast might not detect the
> + * PTE change, it will detect the PageAnonExclusive change.
> + *
> + * We assume that there might not be a memory barrier after
> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> + * so we use explicit ones here.
> + *
> + * These memory barriers are paired with memory barriers in GUP-fast
> + * code, including gup_must_unshare().
> + */
> +
> + /* Clear/invalidate the PTE before checking for PINs. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb();
> +
> + if (unlikely(page_maybe_dma_pinned(page)))
> + return -EBUSY;
It is usually a bad sign to see an attempt to create a "read release"..
> ClearPageAnonExclusive(page);
> +
> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb__after_atomic();
> return 0;
> }
I don't know enough about the memory model to say if this is OK..
Generally, I've never seen an algorithm be successfull with these
kinds of multi-atomic gyrations.
If we break it down a bit, and replace the 'read release' with an
actual atomic for discussion:
CPU0 CPU1
clear pte
incr_return ref // release & acquire
add_ref // acquire
This seems OK, if CPU1 views !dma then CPU0 must view clear pte due to
the atomic's release/acquire semantic
If CPU1 views dma then it just exits
Now the second phase:
CPU0 CPU1
clear anon_exclusive
restore pte // release
read_pte // acquire
read anon_exclusive
If CPU0 observes the restored PTE then it must observe the cleared
anon_exclusive
Otherwise CPU0 must observe the cleared PTE.
So, maybe I could convince myself it is OK, but I think your placement
of barriers is confusing as to what data the barrier is actually
linked to.
We are using a barrier around the ref - acquire on the CPU0 and full
barier on the CPU1 (eg atomic_read(); smb_mb_after_atomic() )
The second phase uses a smp_store_release/load_acquire on the PTE.
It is the same barriers you sketched but integrated with the data they
are ordering.
Jason
next prev parent reply other threads:[~2022-08-30 23:44 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-25 16:46 [PATCH v1 0/3] mm: minor cleanups around NUMA hinting David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 1/3] mm/gup: replace FOLL_NUMA by gup_can_follow_protnone() David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast David Hildenbrand
2022-08-26 14:59 ` David Hildenbrand
2022-08-30 18:23 ` David Hildenbrand
2022-08-30 18:45 ` Jason Gunthorpe
2022-08-30 18:53 ` David Hildenbrand
2022-08-30 19:18 ` John Hubbard
2022-08-30 19:23 ` David Hildenbrand
2022-08-30 23:44 ` Jason Gunthorpe [this message]
2022-08-31 7:44 ` David Hildenbrand
2022-08-31 16:21 ` Peter Xu
2022-08-31 16:31 ` David Hildenbrand
2022-08-31 18:23 ` Peter Xu
2022-08-31 19:25 ` David Hildenbrand
2022-09-01 7:55 ` Alistair Popple
2022-08-30 19:57 ` Jason Gunthorpe
2022-08-30 20:12 ` John Hubbard
2022-08-30 22:39 ` Jason Gunthorpe
2022-08-31 7:15 ` David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 3/3] mm: fixup documentation regarding pte_numa() and PROT_NUMA David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yw6g7G4jvXaoBORm@nvidia.com \
--to=jgg@nvidia.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=jhubbard@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=peterx@redhat.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.