All of lore.kernel.org
 help / color / mirror / Atom feed
From: Robin Holt <holt@sgi.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>, Robin Holt <holt@sgi.com>,
	Ingo Molnar <mingo@elte.hu>, Christoph Lameter <clameter@sgi.com>,
	Jack Steiner <steiner@sgi.com>,
	linux-mm@kvack.org
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
Date: Wed, 18 Jun 2008 15:33:00 -0500	[thread overview]
Message-ID: <20080618203300.GA10123@sgi.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0806181944080.4968@blonde.site>

On Wed, Jun 18, 2008 at 08:01:48PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Nick Piggin wrote:
> > On Thursday 19 June 2008 02:41, Robin Holt wrote:
> > > I am running into a problem where I think a call to get_user_pages(...,
> > > write=1, force=1,...) is returning a readable pte and a page ref count
> 
> I'm hoping Robin doesn't really need force=1 - can't you do what you
> need with force=0, Robin? force=1 is weird and really only for ptrace
> I think.  And assuming that Robin meant to say "readonly pte" above.

I don't know the exact reason for force=1.  The driver has been this
way for years and I don't recall the history.  I can dig into that.

... Removed text added to end of this email.

> I think perhaps Robin is wanting to write into the page both from the
> kernel (hence the get_user_pages) and from userspace: but finding that
> the attempt to write from userspace breaks COW again (because gup
> raised the page count and it's a readonly pte), so they end up
> writing into different pages.  We know that COW didn't need to
> be broken a second time, but do_wp_page doesn't know that.

That is exactly the problem I think I am seeing.  How should I be handling
this to get the correct behavior?  As a test, should I be looking at
the process's page table to see if the pfn matches and is writable.
If it is not, putting the page and redoing the call to get_user_pages()?

> > > Any subsequent write fault by the process will 
> > > result in a COW break and the process pointing at a different page than
> > > the get_user_pages() returned page.
> > >
> > > Is this sequence plausible or am I missing something key?
> > >
> > > If this sequence is plausible, I need to know how to either work around
> > > this problem or if it should really be fixed in the kernel.
> > 
> > I'd be interested to know the situation that leads to this problem.
> > If possible a test case would be ideal.
> 
> Might it help if do_wp_page returned VM_FAULT_WRITE (perhaps renamed)
> only in the case where maybe_mkwrite decided not to mkwrite i.e. the
> weird write=1,force=1 on readonly vma case?

I don't think it is in the return value, but rather the clearing of the
FOLL_WRITE flag.  Is that being done to handle a force=1 where the vma
is marked readonly?  Could follow_page handle the force case
differently?

What is the intent of force=1?

Thanks,
Robin Holt


Cut from above:
> > > of 2.  I have not yet trapped the event, but I think I see one place
> > > where this _may_ be happening.
> > >
> > > The case I am seeing is under heavy memory pressure.
> > >
> > > I think the first pass at follow_page has failed and we called
> > > __handle_mm_fault().  At the time in __handle_mm_fault where the page table
> > > is unlocked, there is a writable pte in the processes page table, and a
> > > struct page with a reference count of 1.  ret will have VM_FAULT_WRITE
> > > set so the get_user_pages code will clear FOLL_WRITE from foll_flags.
> > >
> > > Between the time above and the second attempt at follow_page, the
> > > page gets swapped out.  The second attempt at follow_page, now without
> > > FOLL_WRITE (and FOLL_GET is set) will result in a read-only pte with a
> > > reference count of 2.
> > 
> > There would not be a writeable pte in the page table, otherwise
> > VM_FAULT_WRITE should not get returned. But it can be returned via
> > other paths...
> 
> In his scenario, there wasn't a writeable pte originally, the
> first call to handle_pte_fault instantiates the writeable pte
> and returns with VM_FAULT_WRITE set.
> 
> > 
> > However, assuming it was returned, then mmap_sem is still held, so
> > the vma should not get changed from a writeable to a readonly one,
> > so I can't see the problem you're describin with that sequence.
> 
> The vma doesn't get changed, but the pte just instantiated writably
> above, gets swapped out before the next follow_page, then brought
> back in by the second call to handle_pte_fault.  But this is with
> FOLL_WRITE cleared, so it behaves as a read fault, and puts just
> a readonly pte.
> 
> > 
> > Swap pages, for one, could return with VM_FAULT_WRITE, then
> > subsequently have its page swapped out, then set up a readonly pte
> > due to the __handle_mm_fault with write access cleared. *I think*.
> 
> Yes.
> 
> > But although that feels a bit unclean, I don't think it would cause
> > a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
> > ensures our swap page should still be valid to write into via get
> > user pages (and a subsequent write access should cause do_wp_page to
> > go through the proper reuse logic and now COW).
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2008-06-18 20:33 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-18 16:41 Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2? Robin Holt
2008-06-18 17:29 ` Nick Piggin
2008-06-18 19:01   ` Hugh Dickins
2008-06-18 20:33     ` Robin Holt [this message]
2008-06-18 21:46       ` Hugh Dickins
2008-06-19  3:31         ` Nick Piggin
2008-06-19  3:34           ` Nick Piggin
2008-06-19 11:39           ` Hugh Dickins
2008-06-19 12:07             ` Nick Piggin
2008-06-19 12:21               ` Nick Piggin
2008-06-19 17:48                 ` Christoph Lameter
2008-06-19 12:34               ` Hugh Dickins
2008-06-19 12:53                 ` Nick Piggin
2008-06-19 13:25                   ` Hugh Dickins
2008-06-19 13:35                     ` Robin Holt
2008-06-19 16:32         ` Robin Holt
2008-06-20  9:23           ` Nick Piggin
2008-06-19  3:07     ` Nick Piggin
2008-06-19 11:09       ` Hugh Dickins
2008-06-19 13:38         ` Robin Holt
2008-06-19 13:49           ` Hugh Dickins
2008-06-23 15:54             ` Robin Holt
2008-06-23 16:48               ` Hugh Dickins
2008-06-23 17:52                 ` Robin Holt
2008-06-23 20:58                   ` Hugh Dickins
2008-06-24 11:56                     ` Robin Holt
2008-06-24 15:19                     ` Robin Holt
2008-06-24 20:19                       ` Hugh Dickins
2008-06-23 19:11             ` Robin Holt
2008-06-23 19:12               ` Robin Holt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080618203300.GA10123@sgi.com \
    --to=holt@sgi.com \
    --cc=clameter@sgi.com \
    --cc=hugh@veritas.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=steiner@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.