From: Jason Gunthorpe <jgg@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mgorman@suse.de>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast
Date: Tue, 30 Aug 2022 20:44:44 -0300 [thread overview]
Message-ID: <Yw6g7G4jvXaoBORm@nvidia.com> (raw)
In-Reply-To: <68b38ac4-c680-b694-21a9-1971396d63b9@redhat.com>
On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
> */
> if (!PageAnon(page))
> return false;
> +
> + /* See page_try_share_anon_rmap() for GUP-fast details. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> + smp_rmb();
> +
> /*
> * Note that PageKsm() pages cannot be exclusive, and consequently,
> * cannot get pinned.
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
> * @page: the exclusive anonymous page to try marking possibly shared
> *
> * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
> *
> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
> * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
> {
> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>
> - /* See page_try_dup_anon_rmap(). */
> - if (likely(!is_device_private_page(page) &&
> - unlikely(page_maybe_dma_pinned(page))))
> - return -EBUSY;
> + /* device private pages cannot get pinned via GUP. */
> + if (unlikely(is_device_private_page(page))) {
> + ClearPageAnonExclusive(page);
> + return 0;
> + }
>
> + /*
> + * We have to make sure that while we clear PageAnonExclusive, that
> + * the page is not pinned and that concurrent GUP-fast won't succeed in
> + * concurrently pinning the page.
> + *
> + * Conceptually, GUP-fast pinning code of anon pages consists of:
> + * (1) Read the PTE
> + * (2) Pin the mapped page
> + * (3) Check if the PTE changed by re-reading it; back off if so.
> + * (4) Check if PageAnonExclusive is not set; back off if so.
> + *
> + * Conceptually, PageAnonExclusive clearing code consists of:
> + * (1) Clear PTE
> + * (2) Check if the page is pinned; back off if so.
> + * (3) Clear PageAnonExclusive
> + * (4) Restore PTE (optional)
> + *
> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> + * the right order. Memory order between (2) and (3) is handled by
> + * GUP-fast, independent of PageAnonExclusive.
> + *
> + * When clearing PageAnonExclusive(), we have to make sure that (1),
> + * (2), (3) and (4) happen in the right order.
> + *
> + * Note that (4) has to happen after (3) in both cases to handle the
> + * corner case whereby the PTE is restored to the original value after
> + * clearing PageAnonExclusive and while GUP-fast might not detect the
> + * PTE change, it will detect the PageAnonExclusive change.
> + *
> + * We assume that there might not be a memory barrier after
> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> + * so we use explicit ones here.
> + *
> + * These memory barriers are paired with memory barriers in GUP-fast
> + * code, including gup_must_unshare().
> + */
> +
> + /* Clear/invalidate the PTE before checking for PINs. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb();
> +
> + if (unlikely(page_maybe_dma_pinned(page)))
> + return -EBUSY;
It is usually a bad sign to see an attempt to create a "read release"..
> ClearPageAnonExclusive(page);
> +
> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb__after_atomic();
> return 0;
> }
I don't know enough about the memory model to say if this is OK..
Generally, I've never seen an algorithm be successfull with these
kinds of multi-atomic gyrations.
If we break it down a bit, and replace the 'read release' with an
actual atomic for discussion:
CPU0 CPU1
clear pte
incr_return ref // release & acquire
add_ref // acquire
This seems OK, if CPU1 views !dma then CPU0 must view clear pte due to
the atomic's release/acquire semantic
If CPU1 views dma then it just exits
Now the second phase:
CPU0 CPU1
clear anon_exclusive
restore pte // release
read_pte // acquire
read anon_exclusive
If CPU0 observes the restored PTE then it must observe the cleared
anon_exclusive
Otherwise CPU0 must observe the cleared PTE.
So, maybe I could convince myself it is OK, but I think your placement
of barriers is confusing as to what data the barrier is actually
linked to.
We are using a barrier around the ref - acquire on the CPU0 and full
barier on the CPU1 (eg atomic_read(); smb_mb_after_atomic() )
The second phase uses a smp_store_release/load_acquire on the PTE.
It is the same barriers you sketched but integrated with the data they
are ordering.
Jason
next prev parent reply other threads:[~2022-08-30 23:44 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-25 16:46 [PATCH v1 0/3] mm: minor cleanups around NUMA hinting David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 1/3] mm/gup: replace FOLL_NUMA by gup_can_follow_protnone() David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast David Hildenbrand
2022-08-26 14:59 ` David Hildenbrand
2022-08-30 18:23 ` David Hildenbrand
2022-08-30 18:45 ` Jason Gunthorpe
2022-08-30 18:53 ` David Hildenbrand
2022-08-30 19:18 ` John Hubbard
2022-08-30 19:23 ` David Hildenbrand
2022-08-30 23:44 ` Jason Gunthorpe [this message]
2022-08-31 7:44 ` David Hildenbrand
2022-08-31 16:21 ` Peter Xu
2022-08-31 16:31 ` David Hildenbrand
2022-08-31 18:23 ` Peter Xu
2022-08-31 19:25 ` David Hildenbrand
2022-09-01 7:55 ` Alistair Popple
2022-08-30 19:57 ` Jason Gunthorpe
2022-08-30 20:12 ` John Hubbard
2022-08-30 22:39 ` Jason Gunthorpe
2022-08-31 7:15 ` David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 3/3] mm: fixup documentation regarding pte_numa() and PROT_NUMA David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yw6g7G4jvXaoBORm@nvidia.com \
--to=jgg@nvidia.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=jhubbard@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=peterx@redhat.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).