From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Izik Eidus <ieidus@redhat.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kvm@vger.kernel.org, chrisw@redhat.com, avi@redhat.com,
izike@qumranet.com
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Date: Wed, 12 Nov 2008 15:08:07 -0500 [thread overview]
Message-ID: <1226520487.7560.65.camel@lts-notebook> (raw)
In-Reply-To: <20081112173258.GX10818@random.random>
On Wed, 2008-11-12 at 18:32 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in
> migrate.c returns a swapcache not mapped but with a pte pointing to
> it. That means page_count 1 (+1 after you isolate it from the lru),
> page_mapcount 0, page_mapped 0, page_mapping = swap address space,
> swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
> one thread does o_direct read from disk that triggers a minor fault in
> do_swap_cache called by get_user_pages. The other cpu is running
> sys_move_pages and the expected count will match the page count in
> migrate_page_move_mapping. Page is still in swapcache. So after the
> expected count matches in the migrate.c thread, the other thread
> continues in do_swap_page and runs lookup_swap_cache that succeeds
> (the page wasn't removed from swapcache yet as migrate.c needs to bail
> out if the expected count doesn't match, so it can't mess with the
> oldpage until it's sure it can migrate it). After that do_swap_page
> gets a reference on the swapcache (at that point migrate.c continues
> despite the expected count isn't 2 anymore! just a second after having
> verified that it was 2). lock_page blocks do_swap_page until migration
> is complete but pte_same in do_swap_page won't fail because the pte is
> still pointing to the same swapentry (it's just the swapcache inode
> radix tree that points to a different page, the swapentry is still the
> same as before the migration - is_swap_pte will succeed but
> is_migration_entry failed when restoring the pte).
Ah. try_to_unmap_one() won't replace the pte entry with a
migration_pte() if the [anon] page is already in the swap cache. When
migration completes, we won't modify the page tables with the newpage
pte--we'll just let any subsequent swap page [minor] fault handle that.
That suggests a possible fix: instead of replacing the pte with a
duplicate swap entry in try_to_unmap_one(), go ahead and replace the pte
with a migration pte. Then back in try_to_unmap_anon(), after unmapping
all references, free the swap cache entry, so as not to leak it
[assuming we're in a lock state that allows that--I haven't checked that
far]. Then, the page table WILL have been modified by the time
migration unlocks the page.
Might want/need to check for migration entry in do_swap_page() and loop
back to migration_entry_wait() call when the changed pte is detected
rather than returning an error to the caller.
Does that sound reasonable?
> Finally the pte is
> overwritten with the old page and any data written to the new page in
> between is lost.
And wouldn't the new page potentially be leaked? That is, could it end
up on the lru with page_count == page_mapcount() >= 1, but no page table
reference to ever be unmapped to release the count?
>
> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still
> page_wrprotect + replace_page is orders of magnitude simpler logic
> than migrate.c and it has no bugs or at least it's certainly much
> simpler to proof as correct. Furthermore we never 'stall' any userland
> task while we do our work. We only mark the pte wrprotected, the task
> can cow or takeover it if refcount allows anytime, and later we'll
> bailout during replace_page if something has happened in between
> page_wrprotect and replace_page. So our logic is simpler and tuned for
> max performance and fewer interference with userland runtime. Not
> really sure if it worth for us to call into migrate.c.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-11-12 20:08 UTC|newest]
Thread overview: 70+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-11-11 13:21 [PATCH 0/4] ksm - dynamic page sharing driver for linux Izik Eidus
2008-11-11 13:21 ` [PATCH 1/4] rmap: add page_wrprotect() function, Izik Eidus, Izik Eidus
2008-11-11 13:21 ` [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another Izik Eidus, Izik Eidus
2008-11-11 13:21 ` [PATCH 3/4] add ksm kernel shared memory driver Izik Eidus, Izik Eidus
2008-11-11 13:21 ` [PATCH 4/4] MMU_NOTIFIRES: add set_pte_at_notify() Izik Eidus, Izik Eidus
2008-11-11 20:38 ` [PATCH 3/4] add ksm kernel shared memory driver Andrew Morton
2008-11-11 22:03 ` Andrea Arcangeli
2008-11-11 22:03 ` Jonathan Corbet
2008-11-11 22:17 ` Izik Eidus
2008-11-11 22:25 ` Jonathan Corbet
2008-11-11 22:31 ` Izik Eidus
2008-11-11 22:30 ` Jonathan Corbet
2008-11-11 22:38 ` Izik Eidus
2008-11-11 23:02 ` Izik Eidus
2008-11-11 23:03 ` Andrea Arcangeli
2008-11-11 22:49 ` Avi Kivity
2008-11-11 22:40 ` Valdis.Kletnieks
2008-11-13 6:13 ` Eric Rannaud
2008-11-11 22:43 ` Avi Kivity
2008-11-11 19:45 ` [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another Andrew Morton
2008-11-11 20:57 ` Izik Eidus
2008-11-11 21:21 ` Christoph Lameter
2008-11-11 21:23 ` Izik Eidus
2008-11-11 21:31 ` Christoph Lameter
2008-11-11 21:37 ` Izik Eidus
2008-11-11 22:24 ` Andrea Arcangeli
2008-11-12 2:19 ` KAMEZAWA Hiroyuki
2008-11-12 10:05 ` Avi Kivity
2008-11-12 11:11 ` Izik Eidus
2008-11-13 6:11 ` KAMEZAWA Hiroyuki
2008-11-13 10:38 ` Izik Eidus
2008-11-13 11:32 ` KAMEZAWA Hiroyuki
2008-11-11 21:35 ` Andrea Arcangeli
2008-11-11 21:06 ` Andrea Arcangeli
2008-11-11 21:26 ` Christoph Lameter
2008-11-11 21:39 ` Avi Kivity
2008-11-11 21:47 ` Christoph Lameter
2008-11-11 21:55 ` Izik Eidus
2008-11-11 22:36 ` Avi Kivity
2008-11-11 22:17 ` Andrea Arcangeli
2008-11-11 22:30 ` Christoph Lameter
2008-11-11 23:17 ` Andrea Arcangeli
2008-11-11 23:25 ` Andrea Arcangeli
2008-11-12 0:27 ` Christoph Lameter
2008-11-12 2:27 ` Andrea Arcangeli
2008-11-12 3:10 ` Christoph Lameter
2008-11-12 17:32 ` Andrea Arcangeli
2008-11-12 20:08 ` Lee Schermerhorn [this message]
2008-11-12 20:31 ` Christoph Lameter
2008-11-12 20:27 ` Christoph Lameter
2008-11-12 22:09 ` Lee Schermerhorn
2008-11-13 2:00 ` Andrea Arcangeli
2008-11-13 2:31 ` Andrea Arcangeli
2008-11-13 4:02 ` Nick Piggin
2008-11-11 19:39 ` [PATCH 1/4] rmap: add page_wrprotect() function, Andrew Morton
2008-11-11 20:38 ` Andrea Arcangeli
2008-11-11 21:01 ` Andrew Morton
2008-11-11 21:17 ` Andrea Arcangeli
2008-11-11 18:30 ` [PATCH 0/4] ksm - dynamic page sharing driver for linux Andrew Morton
2008-11-11 18:48 ` Avi Kivity
2008-11-11 19:08 ` Izik Eidus
2008-11-11 19:11 ` Andrew Morton
2008-11-11 19:18 ` Izik Eidus
2008-11-11 19:32 ` Andrew Morton
2008-11-11 19:52 ` Izik Eidus
2008-11-11 20:08 ` Izik Eidus
2008-11-11 19:29 ` Avi Kivity
2008-11-11 19:55 ` Andrea Arcangeli
2008-11-11 19:07 ` Izik Eidus
2008-11-11 19:20 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1226520487.7560.65.camel@lts-notebook \
--to=lee.schermerhorn@hp.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=avi@redhat.com \
--cc=chrisw@redhat.com \
--cc=cl@linux-foundation.org \
--cc=ieidus@redhat.com \
--cc=izike@qumranet.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).