Re: [PATCH 0 of 9] mmu notifier #v12

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Robin Holt <holt@sgi.com>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: Robin Holt <holt@sgi.com>, Christoph Lameter <clameter@sgi.com>,
	akpm@linux-foundation.org, Nick Piggin <npiggin@suse.de>,
	Steve Wise <swise@opengridcomputing.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	linux-mm@kvack.org, Kanoj Sarcar <kanojsarcar@yahoo.com>,
	Roland Dreier <rdreier@cisco.com>, Jack Steiner <steiner@sgi.com>,
	linux-kernel@vger.kernel.org, Avi Kivity <avi@qumranet.com>,
	kvm-devel@lists.sourceforge.net, general@lists.openfabrics.org,
	Hugh Dickins <hugh@veritas.com>
Subject: Re: [PATCH 0 of 9] mmu notifier #v12
Date: Wed, 9 Apr 2008 13:55:00 -0500	[thread overview]
Message-ID: <20080409185500.GT11364@sgi.com> (raw)
In-Reply-To: <20080409144401.GT10133@duo.random>

On Wed, Apr 09, 2008 at 04:44:01PM +0200, Andrea Arcangeli wrote:
> BTW, how did you implement invalidate_page? As this?
> 
>        	invalidate_page() {
>        		invalidate_range_begin()
> 		invalidate_range_end()
> 	}

Essentially, I did the work of each step without releasing and
reacquiring locks.

> If yes, I prefer to remind you that normally invalidate_range_begin is
> always called before zapping the pte. In the invalidate_page case
> instead, invalidate_range_begin is called _after_ the pte has been
> zapped already.
> 
> Now there's no problem if the pte is established and the spte isn't
> established. But it must never happen that the spte is established and
> the pte isn't established (with page-pinning that means unswappable
> memlock leak, without page-pinning it would mean memory corruption).

I am not sure I follow what you are saying.  Here is a very terse
breakdown of how PFNs flow through xpmem's structures.

We have a PFN table associated with our structure describing a grant.
We use get_user_pages() to acquire information for that table and we
fill the table in under a mutex.  Remote hosts (on the same numa network
so they have direct access to the users memory) have a PROXY version of
that structure.  It is filled out in a similar fashion to the local
table.  PTEs are created for the other processes while holding the mutex
for this table (either local or remote).  During the process of
faulting, we have a simple linked list of ongoing faults that is
maintained whenever the mutex is going to be released.

Our version of a zap_page_range is called recall_PFNs.  The recall
process grabs the mutex, scans the faulters list for any that cover the
range and mark them as needing a retry.  It then calls zap_page_range
for any processes that have attached the granted memory to clear out
their page tables.  Finally, we release the mutex and proceed.

The locking is more complex than this, but that is the essential idea.

What that means for mmu_notifiers is we have a single reference on the
page for all the remote processes using it.  When the callout to
invalidate_page() is made, we will still have processes with that PTE in
their page tables and potentially TLB entries.  When we return from the
invalidate_page() callout, we will have removed all those page table
entries, we will have no in-progress page table or tlb insertions that
will complete, and we will have released all our references to the page.

Does that meet your expectations?

Thanks,
Robin
> 
> So the range_begin must serialize against the secondary mmu page fault
> so that it can't establish the spte on a pte that was zapped by the
> rmap code after get_user_pages/follow_page returned. I think your
> range_begin already does that so you should be ok but I wanted to
> remind about this slight difference in implementing invalidate_page as
> I suggested above in previous email just to be sure ;).
> 
> This is the race you must guard against in invalidate_page:
> 
> 
>    	 CPU0 	     	      CPU1
> 	 try_to_unmap on page
> 			      secondary mmu page fault
> 			      get_user_pages()/follow_page found a page
>          ptep_clear_flush
> 	 invalidate_page()
> 	  invalidate_range_begin()
>           invalidate_range_end()
>           return from invalidate_page
> 			      establish spte on page
> 			      return from secodnary mmu page fault
> 
> If your range_begin already serializes in a hard way against the
> secondary mmu page fault, my previously "trivial" suggested
> implementation for invalidate_page should work just fine and this this
> saves 1 branch for each try_to_unmap_one if compared to the emm
> implementation. The branch check is inlined and it checks against the
> mmu_notifier_head that is the hot cacheline, no new cachline is
> checked just one branch is saved and so it worth it IMHO even if it
> doesn't provide any other advantage if you implement it the way above.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-04-09 18:55 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-04-08 15:44 [PATCH 0 of 9] mmu notifier #v12 Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 1 of 9] Lock the entire mm to prevent any mmu related operation to happen Andrea Arcangeli
2008-04-16 16:33   ` Robin Holt
2008-04-16 18:35     ` Christoph Lameter
2008-04-16 19:02       ` Robin Holt
2008-04-16 19:15         ` Christoph Lameter
2008-04-17 11:14           ` Robin Holt
2008-04-17 15:51       ` Andrea Arcangeli
2008-04-17 16:36         ` Robin Holt
2008-04-17 17:14           ` Andrea Arcangeli
2008-04-17 17:25             ` Robin Holt
2008-04-17 19:10             ` Christoph Lameter
2008-04-17 22:16               ` Andrea Arcangeli
2008-04-22  5:06   ` Rusty Russell
2008-04-25 16:56     ` Andrea Arcangeli
2008-04-25 17:04       ` Andrea Arcangeli
2008-04-25 19:25       ` Robin Holt
2008-04-26  0:57         ` Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 2 of 9] Core of mmu notifiers Andrea Arcangeli
2008-04-08 16:26   ` Robin Holt
2008-04-08 17:05     ` Andrea Arcangeli
2008-04-14 19:57   ` Christoph Lameter
2008-04-14 19:59   ` Christoph Lameter
2008-04-08 15:44 ` [PATCH 3 of 9] Moves all mmu notifier methods outside the PT lock (first and not last Andrea Arcangeli
2008-04-14 19:57   ` Christoph Lameter
2008-04-08 15:44 ` [PATCH 4 of 9] Move the tlb flushing into free_pgtables. The conversion of the locks Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 5 of 9] The conversion to a rwsem allows callbacks during rmap traversal Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 6 of 9] We no longer abort unmapping in unmap vmas because we can reschedule while Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 7 of 9] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 8 of 9] XPMEM would have used sys_madvise() except that madvise_dontneed() Andrea Arcangeli
2008-04-08 15:44 ` [PATCH 9 of 9] This patch adds a lock ordering rule to avoid a potential deadlock when Andrea Arcangeli
2008-04-08 21:46 ` [PATCH 0 of 9] mmu notifier #v12 Avi Kivity
2008-04-08 22:06   ` Andrea Arcangeli
2008-04-09 13:17 ` Robin Holt
2008-04-09 14:44   ` Andrea Arcangeli
2008-04-09 18:55     ` Robin Holt [this message]
2008-04-22  7:20       ` Andrea Arcangeli
2008-04-22 12:00         ` Andrea Arcangeli
2008-04-22 13:01           ` Robin Holt
2008-04-22 13:21             ` Andrea Arcangeli
2008-04-22 13:36               ` Robin Holt
2008-04-22 13:48                 ` Andrea Arcangeli
2008-04-22 15:26                   ` Robin Holt
2008-04-14 23:09 ` Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080409185500.GT11364@sgi.com \
    --to=holt@sgi.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=andrea@qumranet.com \
    --cc=avi@qumranet.com \
    --cc=clameter@sgi.com \
    --cc=general@lists.openfabrics.org \
    --cc=hugh@veritas.com \
    --cc=kanojsarcar@yahoo.com \
    --cc=kvm-devel@lists.sourceforge.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@suse.de \
    --cc=rdreier@cisco.com \
    --cc=steiner@sgi.com \
    --cc=swise@opengridcomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).