All of lore.kernel.org
 help / color / mirror / Atom feed
From: Robin Holt <holt@sgi.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>, Nick Piggin <nickpiggin@yahoo.com.au>,
	Ingo Molnar <mingo@elte.hu>, Christoph Lameter <clameter@sgi.com>,
	Jack Steiner <steiner@sgi.com>,
	linux-mm@kvack.org
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
Date: Mon, 23 Jun 2008 10:54:00 -0500	[thread overview]
Message-ID: <20080623155400.GH10123@sgi.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0806191441040.25832@blonde.site>

On Thu, Jun 19, 2008 at 02:49:50PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Robin Holt wrote:
> > On Thu, Jun 19, 2008 at 12:09:15PM +0100, Hugh Dickins wrote:
> > > 
> > > (I assume Robin is not forking, we do know that causes this kind
> > > of problem, but he didn't mention any forking so I assume not.)
> > 
> > There has been a fork long before this mapping was created.  There was a
> > hole at this location and the mapping gets established and pages populated
> > following all ranks of the MPI job getting initialized.
> 
> There's usually been a fork somewhen in the past!  That's no problem.
> 
> The fork problem comes when someone has done a get_user_pages to break
> all the COWs, then another thread does a fork which writeprotects and
> raises page_mapcount, so the next write from userspace breaks COW again
> and writes to a different page from that which the kernel is holding.
> 
> That one kept on coming up, but I've not heard of it again since we
> added madvise MADV_DONTFORK so apps could exclude such parts of the
> address space from copy_page_range.

I finally tracked this down.  I think it is a problem specific to XPMEM
on the SLES10 kernel and will not be a problem once Andrea's mmu_notifier
is in the kernel.  It is a problem, as far as I can tell, specific to
the way XPMEM works.

I will open a SuSE bugzilla to work the issue directly with them.

Prior to the transition event, we have a page of memory that was
pre-faulted by a process.  The process has exported (via XPMEM) a
window of its own address space.  A remote process has attached and
touched the page of memory.  The fault will call into XPMEM which does
the get_user_pages.

At this point, both processes have a writable PTE entry to the same
page and XPMEM has one additional reference count (_count) on the page
acquired via get_user_pages().

Memory pressure causes swap_page to get called.  It clears the two
process's page table entries, returns the _count values, etc.  The only
thing that remains different from normal at this point is XPMEM retains
a reference.

Both processes then read-fault the page which results in readable PTEs
being installed.

The failure point comes when either process write faults the page.
At that point, a COW is initiated and now the two processes are looking
at seperate pages.

The scenario would be different in the case of mmu_notifiers.
The notifier callout when the readable PTE was being replaced with a
writable PTE would result in the remote page table getting cleared and
XPMEM releasing the _count.


All that said, I think the race we discussed earlier in the thread is
a legitimate one and believe Hugh's fix is correct.

Thank you for all your patience,
Robin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2008-06-23 15:54 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-18 16:41 Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2? Robin Holt
2008-06-18 17:29 ` Nick Piggin
2008-06-18 19:01   ` Hugh Dickins
2008-06-18 20:33     ` Robin Holt
2008-06-18 21:46       ` Hugh Dickins
2008-06-19  3:31         ` Nick Piggin
2008-06-19  3:34           ` Nick Piggin
2008-06-19 11:39           ` Hugh Dickins
2008-06-19 12:07             ` Nick Piggin
2008-06-19 12:21               ` Nick Piggin
2008-06-19 17:48                 ` Christoph Lameter
2008-06-19 12:34               ` Hugh Dickins
2008-06-19 12:53                 ` Nick Piggin
2008-06-19 13:25                   ` Hugh Dickins
2008-06-19 13:35                     ` Robin Holt
2008-06-19 16:32         ` Robin Holt
2008-06-20  9:23           ` Nick Piggin
2008-06-19  3:07     ` Nick Piggin
2008-06-19 11:09       ` Hugh Dickins
2008-06-19 13:38         ` Robin Holt
2008-06-19 13:49           ` Hugh Dickins
2008-06-23 15:54             ` Robin Holt [this message]
2008-06-23 16:48               ` Hugh Dickins
2008-06-23 17:52                 ` Robin Holt
2008-06-23 20:58                   ` Hugh Dickins
2008-06-24 11:56                     ` Robin Holt
2008-06-24 15:19                     ` Robin Holt
2008-06-24 20:19                       ` Hugh Dickins
2008-06-23 19:11             ` Robin Holt
2008-06-23 19:12               ` Robin Holt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080623155400.GH10123@sgi.com \
    --to=holt@sgi.com \
    --cc=clameter@sgi.com \
    --cc=hugh@veritas.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=steiner@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.