From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Izik Eidus <ieidus@redhat.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kvm@vger.kernel.org, chrisw@redhat.com, avi@redhat.com,
izike@qumranet.com
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Date: Wed, 12 Nov 2008 15:08:07 -0500 [thread overview]
Message-ID: <1226520487.7560.65.camel@lts-notebook> (raw)
In-Reply-To: <20081112173258.GX10818@random.random>
On Wed, 2008-11-12 at 18:32 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in
> migrate.c returns a swapcache not mapped but with a pte pointing to
> it. That means page_count 1 (+1 after you isolate it from the lru),
> page_mapcount 0, page_mapped 0, page_mapping = swap address space,
> swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
> one thread does o_direct read from disk that triggers a minor fault in
> do_swap_cache called by get_user_pages. The other cpu is running
> sys_move_pages and the expected count will match the page count in
> migrate_page_move_mapping. Page is still in swapcache. So after the
> expected count matches in the migrate.c thread, the other thread
> continues in do_swap_page and runs lookup_swap_cache that succeeds
> (the page wasn't removed from swapcache yet as migrate.c needs to bail
> out if the expected count doesn't match, so it can't mess with the
> oldpage until it's sure it can migrate it). After that do_swap_page
> gets a reference on the swapcache (at that point migrate.c continues
> despite the expected count isn't 2 anymore! just a second after having
> verified that it was 2). lock_page blocks do_swap_page until migration
> is complete but pte_same in do_swap_page won't fail because the pte is
> still pointing to the same swapentry (it's just the swapcache inode
> radix tree that points to a different page, the swapentry is still the
> same as before the migration - is_swap_pte will succeed but
> is_migration_entry failed when restoring the pte).
Ah. try_to_unmap_one() won't replace the pte entry with a
migration_pte() if the [anon] page is already in the swap cache. When
migration completes, we won't modify the page tables with the newpage
pte--we'll just let any subsequent swap page [minor] fault handle that.
That suggests a possible fix: instead of replacing the pte with a
duplicate swap entry in try_to_unmap_one(), go ahead and replace the pte
with a migration pte. Then back in try_to_unmap_anon(), after unmapping
all references, free the swap cache entry, so as not to leak it
[assuming we're in a lock state that allows that--I haven't checked that
far]. Then, the page table WILL have been modified by the time
migration unlocks the page.
Might want/need to check for migration entry in do_swap_page() and loop
back to migration_entry_wait() call when the changed pte is detected
rather than returning an error to the caller.
Does that sound reasonable?
> Finally the pte is
> overwritten with the old page and any data written to the new page in
> between is lost.
And wouldn't the new page potentially be leaked? That is, could it end
up on the lru with page_count == page_mapcount() >= 1, but no page table
reference to ever be unmapped to release the count?
>
> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still
> page_wrprotect + replace_page is orders of magnitude simpler logic
> than migrate.c and it has no bugs or at least it's certainly much
> simpler to proof as correct. Furthermore we never 'stall' any userland
> task while we do our work. We only mark the pte wrprotected, the task
> can cow or takeover it if refcount allows anytime, and later we'll
> bailout during replace_page if something has happened in between
> page_wrprotect and replace_page. So our logic is simpler and tuned for
> max performance and fewer interference with userland runtime. Not
> really sure if it worth for us to call into migrate.c.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Izik Eidus <ieidus@redhat.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kvm@vger.kernel.org, chrisw@redhat.com, avi@redhat.com,
izike@qumranet.com
Subject: Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Date: Wed, 12 Nov 2008 15:08:07 -0500 [thread overview]
Message-ID: <1226520487.7560.65.camel@lts-notebook> (raw)
In-Reply-To: <20081112173258.GX10818@random.random>
On Wed, 2008-11-12 at 18:32 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in
> migrate.c returns a swapcache not mapped but with a pte pointing to
> it. That means page_count 1 (+1 after you isolate it from the lru),
> page_mapcount 0, page_mapped 0, page_mapping = swap address space,
> swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
> one thread does o_direct read from disk that triggers a minor fault in
> do_swap_cache called by get_user_pages. The other cpu is running
> sys_move_pages and the expected count will match the page count in
> migrate_page_move_mapping. Page is still in swapcache. So after the
> expected count matches in the migrate.c thread, the other thread
> continues in do_swap_page and runs lookup_swap_cache that succeeds
> (the page wasn't removed from swapcache yet as migrate.c needs to bail
> out if the expected count doesn't match, so it can't mess with the
> oldpage until it's sure it can migrate it). After that do_swap_page
> gets a reference on the swapcache (at that point migrate.c continues
> despite the expected count isn't 2 anymore! just a second after having
> verified that it was 2). lock_page blocks do_swap_page until migration
> is complete but pte_same in do_swap_page won't fail because the pte is
> still pointing to the same swapentry (it's just the swapcache inode
> radix tree that points to a different page, the swapentry is still the
> same as before the migration - is_swap_pte will succeed but
> is_migration_entry failed when restoring the pte).
Ah. try_to_unmap_one() won't replace the pte entry with a
migration_pte() if the [anon] page is already in the swap cache. When
migration completes, we won't modify the page tables with the newpage
pte--we'll just let any subsequent swap page [minor] fault handle that.
That suggests a possible fix: instead of replacing the pte with a
duplicate swap entry in try_to_unmap_one(), go ahead and replace the pte
with a migration pte. Then back in try_to_unmap_anon(), after unmapping
all references, free the swap cache entry, so as not to leak it
[assuming we're in a lock state that allows that--I haven't checked that
far]. Then, the page table WILL have been modified by the time
migration unlocks the page.
Might want/need to check for migration entry in do_swap_page() and loop
back to migration_entry_wait() call when the changed pte is detected
rather than returning an error to the caller.
Does that sound reasonable?
> Finally the pte is
> overwritten with the old page and any data written to the new page in
> between is lost.
And wouldn't the new page potentially be leaked? That is, could it end
up on the lru with page_count == page_mapcount() >= 1, but no page table
reference to ever be unmapped to release the count?
>
> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still
> page_wrprotect + replace_page is orders of magnitude simpler logic
> than migrate.c and it has no bugs or at least it's certainly much
> simpler to proof as correct. Furthermore we never 'stall' any userland
> task while we do our work. We only mark the pte wrprotected, the task
> can cow or takeover it if refcount allows anytime, and later we'll
> bailout during replace_page if something has happened in between
> page_wrprotect and replace_page. So our logic is simpler and tuned for
> max performance and fewer interference with userland runtime. Not
> really sure if it worth for us to call into migrate.c.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-11-12 20:08 UTC|newest]
Thread overview: 139+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-11-11 13:21 [PATCH 0/4] ksm - dynamic page sharing driver for linux Izik Eidus
2008-11-11 13:21 ` Izik Eidus
2008-11-11 13:21 ` [PATCH 1/4] rmap: add page_wrprotect() function, Izik Eidus
2008-11-11 13:21 ` Izik Eidus, Izik Eidus
2008-11-11 13:21 ` [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another Izik Eidus
2008-11-11 13:21 ` Izik Eidus, Izik Eidus
2008-11-11 13:21 ` [PATCH 3/4] add ksm kernel shared memory driver Izik Eidus
2008-11-11 13:21 ` Izik Eidus, Izik Eidus
2008-11-11 13:21 ` [PATCH 4/4] MMU_NOTIFIRES: add set_pte_at_notify() Izik Eidus
2008-11-11 13:21 ` Izik Eidus, Izik Eidus
2008-11-11 20:38 ` [PATCH 3/4] add ksm kernel shared memory driver Andrew Morton
2008-11-11 20:38 ` Andrew Morton
2008-11-11 22:03 ` Andrea Arcangeli
2008-11-11 22:03 ` Andrea Arcangeli
2008-11-11 22:03 ` Jonathan Corbet
2008-11-11 22:03 ` Jonathan Corbet
2008-11-11 22:17 ` Izik Eidus
2008-11-11 22:17 ` Izik Eidus
2008-11-11 22:25 ` Jonathan Corbet
2008-11-11 22:25 ` Jonathan Corbet
2008-11-11 22:31 ` Izik Eidus
2008-11-11 22:31 ` Izik Eidus
2008-11-11 22:30 ` Jonathan Corbet
2008-11-11 22:30 ` Jonathan Corbet
2008-11-11 22:38 ` Izik Eidus
2008-11-11 22:38 ` Izik Eidus
2008-11-11 23:02 ` Izik Eidus
2008-11-11 23:02 ` Izik Eidus
2008-11-11 23:03 ` Andrea Arcangeli
2008-11-11 23:03 ` Andrea Arcangeli
2008-11-11 22:49 ` Avi Kivity
2008-11-11 22:49 ` Avi Kivity
2008-11-11 22:40 ` Valdis.Kletnieks
2008-11-13 6:13 ` Eric Rannaud
2008-11-13 6:13 ` Eric Rannaud
2008-11-11 22:43 ` Avi Kivity
2008-11-11 22:43 ` Avi Kivity
2008-11-11 19:45 ` [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another Andrew Morton
2008-11-11 19:45 ` Andrew Morton
2008-11-11 20:57 ` Izik Eidus
2008-11-11 20:57 ` Izik Eidus
2008-11-11 21:21 ` Christoph Lameter
2008-11-11 21:21 ` Christoph Lameter
2008-11-11 21:23 ` Izik Eidus
2008-11-11 21:23 ` Izik Eidus
2008-11-11 21:31 ` Christoph Lameter
2008-11-11 21:31 ` Christoph Lameter
2008-11-11 21:37 ` Izik Eidus
2008-11-11 21:37 ` Izik Eidus
2008-11-11 22:24 ` Andrea Arcangeli
2008-11-11 22:24 ` Andrea Arcangeli
2008-11-12 2:19 ` KAMEZAWA Hiroyuki
2008-11-12 2:19 ` KAMEZAWA Hiroyuki
2008-11-12 10:05 ` Avi Kivity
2008-11-12 10:05 ` Avi Kivity
2008-11-12 11:11 ` Izik Eidus
2008-11-12 11:11 ` Izik Eidus
2008-11-13 6:11 ` KAMEZAWA Hiroyuki
2008-11-13 6:11 ` KAMEZAWA Hiroyuki
2008-11-13 10:38 ` Izik Eidus
2008-11-13 10:38 ` Izik Eidus
2008-11-13 11:32 ` KAMEZAWA Hiroyuki
2008-11-13 11:32 ` KAMEZAWA Hiroyuki
2008-11-11 21:35 ` Andrea Arcangeli
2008-11-11 21:35 ` Andrea Arcangeli
2008-11-11 21:06 ` Andrea Arcangeli
2008-11-11 21:06 ` Andrea Arcangeli
2008-11-11 21:26 ` Christoph Lameter
2008-11-11 21:26 ` Christoph Lameter
2008-11-11 21:39 ` Avi Kivity
2008-11-11 21:39 ` Avi Kivity
2008-11-11 21:47 ` Christoph Lameter
2008-11-11 21:47 ` Christoph Lameter
2008-11-11 21:55 ` Izik Eidus
2008-11-11 21:55 ` Izik Eidus
2008-11-11 22:36 ` Avi Kivity
2008-11-11 22:36 ` Avi Kivity
2008-11-11 22:17 ` Andrea Arcangeli
2008-11-11 22:17 ` Andrea Arcangeli
2008-11-11 22:30 ` Christoph Lameter
2008-11-11 22:30 ` Christoph Lameter
2008-11-11 23:17 ` Andrea Arcangeli
2008-11-11 23:17 ` Andrea Arcangeli
2008-11-11 23:25 ` Andrea Arcangeli
2008-11-11 23:25 ` Andrea Arcangeli
2008-11-12 0:27 ` Christoph Lameter
2008-11-12 0:27 ` Christoph Lameter
2008-11-12 2:27 ` Andrea Arcangeli
2008-11-12 2:27 ` Andrea Arcangeli
2008-11-12 3:10 ` Christoph Lameter
2008-11-12 3:10 ` Christoph Lameter
2008-11-12 17:32 ` Andrea Arcangeli
2008-11-12 17:32 ` Andrea Arcangeli
2008-11-12 20:08 ` Lee Schermerhorn [this message]
2008-11-12 20:08 ` Lee Schermerhorn
2008-11-12 20:31 ` Christoph Lameter
2008-11-12 20:31 ` Christoph Lameter
2008-11-12 20:27 ` Christoph Lameter
2008-11-12 20:27 ` Christoph Lameter
2008-11-12 22:09 ` Lee Schermerhorn
2008-11-12 22:09 ` Lee Schermerhorn
2008-11-13 2:00 ` Andrea Arcangeli
2008-11-13 2:00 ` Andrea Arcangeli
2008-11-13 2:31 ` Andrea Arcangeli
2008-11-13 2:31 ` Andrea Arcangeli
2008-11-13 4:02 ` Nick Piggin
2008-11-13 4:02 ` Nick Piggin
2008-11-11 19:39 ` [PATCH 1/4] rmap: add page_wrprotect() function, Andrew Morton
2008-11-11 19:39 ` Andrew Morton
2008-11-11 20:38 ` Andrea Arcangeli
2008-11-11 20:38 ` Andrea Arcangeli
2008-11-11 21:01 ` Andrew Morton
2008-11-11 21:01 ` Andrew Morton
2008-11-11 21:17 ` Andrea Arcangeli
2008-11-11 21:17 ` Andrea Arcangeli
2008-11-11 18:30 ` [PATCH 0/4] ksm - dynamic page sharing driver for linux Andrew Morton
2008-11-11 18:30 ` Andrew Morton
2008-11-11 18:48 ` Avi Kivity
2008-11-11 18:48 ` Avi Kivity
2008-11-11 19:08 ` Izik Eidus
2008-11-11 19:08 ` Izik Eidus
2008-11-11 19:11 ` Andrew Morton
2008-11-11 19:11 ` Andrew Morton
2008-11-11 19:18 ` Izik Eidus
2008-11-11 19:18 ` Izik Eidus
2008-11-11 19:32 ` Andrew Morton
2008-11-11 19:32 ` Andrew Morton
2008-11-11 19:52 ` Izik Eidus
2008-11-11 19:52 ` Izik Eidus
2008-11-11 20:08 ` Izik Eidus
2008-11-11 20:08 ` Izik Eidus
2008-11-11 19:29 ` Avi Kivity
2008-11-11 19:29 ` Avi Kivity
2008-11-11 19:55 ` Andrea Arcangeli
2008-11-11 19:55 ` Andrea Arcangeli
2008-11-11 19:07 ` Izik Eidus
2008-11-11 19:07 ` Izik Eidus
2008-11-11 19:20 ` Andrew Morton
2008-11-11 19:20 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1226520487.7560.65.camel@lts-notebook \
--to=lee.schermerhorn@hp.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=avi@redhat.com \
--cc=chrisw@redhat.com \
--cc=cl@linux-foundation.org \
--cc=ieidus@redhat.com \
--cc=izike@qumranet.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.