All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast
Date: Tue, 30 Aug 2022 20:44:44 -0300	[thread overview]
Message-ID: <Yw6g7G4jvXaoBORm@nvidia.com> (raw)
In-Reply-To: <68b38ac4-c680-b694-21a9-1971396d63b9@redhat.com>

On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
>  	 */
>  	if (!PageAnon(page))
>  		return false;
> +
> +	/* See page_try_share_anon_rmap() for GUP-fast details. */
> +	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> +		smp_rmb();
> +
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
>   * @page: the exclusive anonymous page to try marking possibly shared
>   *
>   * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
>   *
>   * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
>   * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>  
> -	/* See page_try_dup_anon_rmap(). */
> -	if (likely(!is_device_private_page(page) &&
> -	    unlikely(page_maybe_dma_pinned(page))))
> -		return -EBUSY;
> +	/* device private pages cannot get pinned via GUP. */
> +	if (unlikely(is_device_private_page(page))) {
> +		ClearPageAnonExclusive(page);
> +		return 0;
> +	}
>  
> +	/*
> +	 * We have to make sure that while we clear PageAnonExclusive, that
> +	 * the page is not pinned and that concurrent GUP-fast won't succeed in
> +	 * concurrently pinning the page.
> +	 *
> +	 * Conceptually, GUP-fast pinning code of anon pages consists of:
> +	 * (1) Read the PTE
> +	 * (2) Pin the mapped page
> +	 * (3) Check if the PTE changed by re-reading it; back off if so.
> +	 * (4) Check if PageAnonExclusive is not set; back off if so.
> +	 *
> +	 * Conceptually, PageAnonExclusive clearing code consists of:
> +	 * (1) Clear PTE
> +	 * (2) Check if the page is pinned; back off if so.
> +	 * (3) Clear PageAnonExclusive
> +	 * (4) Restore PTE (optional)
> +	 *
> +	 * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> +	 * the right order. Memory order between (2) and (3) is handled by
> +	 * GUP-fast, independent of PageAnonExclusive.
> +	 *
> +	 * When clearing PageAnonExclusive(), we have to make sure that (1),
> +	 * (2), (3) and (4) happen in the right order.
> +	 *
> +	 * Note that (4) has to happen after (3) in both cases to handle the
> +	 * corner case whereby the PTE is restored to the original value after
> +	 * clearing PageAnonExclusive and while GUP-fast might not detect the
> +	 * PTE change, it will detect the PageAnonExclusive change.
> +	 *
> +	 * We assume that there might not be a memory barrier after
> +	 * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> +	 * so we use explicit ones here.
> +	 *
> +	 * These memory barriers are paired with memory barriers in GUP-fast
> +	 * code, including gup_must_unshare().
> +	 */
> +
> +	/* Clear/invalidate the PTE before checking for PINs. */
> +	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> +		smp_mb();
> +
> +	if (unlikely(page_maybe_dma_pinned(page)))
> +		return -EBUSY;

It is usually a bad sign to see an attempt to create a "read release"..

>  	ClearPageAnonExclusive(page);
> +
> +	/* Clear PageAnonExclusive() before eventually restoring the PTE. */
> +	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> +		smp_mb__after_atomic();
>  	return 0;
>  }

I don't know enough about the memory model to say if this is OK..

Generally, I've never seen an algorithm be successfull with these
kinds of multi-atomic gyrations.

If we break it down a bit, and replace the 'read release' with an
actual atomic for discussion:


       CPU0                  CPU1
                            clear pte 
                            incr_return ref // release & acquire
 add_ref // acquire

This seems OK, if CPU1 views !dma then CPU0 must view clear pte due to
the atomic's release/acquire semantic

If CPU1 views dma then it just exits


Now the second phase:

       CPU0                  CPU1
                            clear anon_exclusive
                            restore pte // release

 read_pte // acquire
 read anon_exclusive 

If CPU0 observes the restored PTE then it must observe the cleared
anon_exclusive

Otherwise CPU0 must observe the cleared PTE.

So, maybe I could convince myself it is OK, but I think your placement
of barriers is confusing as to what data the barrier is actually
linked to.

We are using a barrier around the ref - acquire on the CPU0 and full
barier on the CPU1 (eg atomic_read(); smb_mb_after_atomic() )

The second phase uses a smp_store_release/load_acquire on the PTE.

It is the same barriers you sketched but integrated with the data they
are ordering.

Jason


  reply	other threads:[~2022-08-30 23:44 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-25 16:46 [PATCH v1 0/3] mm: minor cleanups around NUMA hinting David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 1/3] mm/gup: replace FOLL_NUMA by gup_can_follow_protnone() David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast David Hildenbrand
2022-08-26 14:59   ` David Hildenbrand
2022-08-30 18:23     ` David Hildenbrand
2022-08-30 18:45       ` Jason Gunthorpe
2022-08-30 18:53         ` David Hildenbrand
2022-08-30 19:18           ` John Hubbard
2022-08-30 19:23             ` David Hildenbrand
2022-08-30 23:44               ` Jason Gunthorpe [this message]
2022-08-31  7:44                 ` David Hildenbrand
2022-08-31 16:21               ` Peter Xu
2022-08-31 16:31                 ` David Hildenbrand
2022-08-31 18:23                   ` Peter Xu
2022-08-31 19:25                     ` David Hildenbrand
2022-09-01  7:55                       ` Alistair Popple
2022-08-30 19:57           ` Jason Gunthorpe
2022-08-30 20:12             ` John Hubbard
2022-08-30 22:39               ` Jason Gunthorpe
2022-08-31  7:15             ` David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 3/3] mm: fixup documentation regarding pte_numa() and PROT_NUMA David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yw6g7G4jvXaoBORm@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=peterx@redhat.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.