From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jerome Glisse <j.glisse@gmail.com>,
Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Alex Williamson <alex.williamson@redhat.com>,
kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: GUP guarantees wrt to userspace mappings redesign
Date: Mon, 2 May 2016 20:56:49 +0200 [thread overview]
Message-ID: <20160502185649.GC12310@redhat.com> (raw)
In-Reply-To: <20160502121402.GB23305@node.shutemov.name>
On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> Quick look around:
>
> - I don't see any check page_count() around __replace_page() in uprobes,
> so it can easily replace pinned page.
>
> - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> take the pin between the check and establishing new pte entry.
* Ok this is tricky, when get_user_pages_fast() run it doesn't
* take any lock, therefore the check that we are going to make
* with the pagecount against the mapcount is racey and
* O_DIRECT can happen right after the check.
* So we clear the pte and flush the tlb before the check
* this assure us that no O_DIRECT can happen after the check
* or in the middle of the check.
*/
entry = ptep_clear_flush_notify(vma, addr, ptep);
KSM takes care of that or it wouldn't be safe if KSM was with memory
under O_DIRECT.
> - khugepaged: the same story as with KSM.
In __collapse_huge_page_isolate we do:
/*
* cannot use mapcount: can't collapse if there's a gup pin.
* The page must only be referenced by the scanned process
* and page swap cache.
*/
if (page_count(page) != 1 + !!PageSwapCache(page)) {
unlock_page(page);
result = SCAN_PAGE_COUNT;
goto out;
}
At that point the pmd has been zapped (pmdp_collapse_flush already
run) and like for KSM case that is enough to ensure
get_user_pages_fast can't succeed and it'll have to call into the slow
get_user_pages.
These two issues are not specific to vfio and IOMMUs, this is must be
correct or O_DIRECT will generate data corruption in presence of
KSM/khugepaged. Both looks fine to me.
> I don't see how we can deliver on the guarantee, especially with lockless
> GUP_fast.
By zapping the pmd_trans_huge/pte and sending IPIs if needed
(get_user_pages_fast runs with irq disabled), before checking
page_count.
With the RCU version of it it's the same, but instead of sending IPIs,
we'll wait for a quiescient point to be sure of having flushed any
concurrent get_user_pages_fast out of the other CPUs, before we
proceed to check page_count (then no other get_user_pages_fast can
increase the page count for this page on this "mm" anymore).
That's how the guaranteed is provided against get_user_pages_fast.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jerome Glisse <j.glisse@gmail.com>,
Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Alex Williamson <alex.williamson@redhat.com>,
kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: GUP guarantees wrt to userspace mappings redesign
Date: Mon, 2 May 2016 20:56:49 +0200 [thread overview]
Message-ID: <20160502185649.GC12310@redhat.com> (raw)
In-Reply-To: <20160502121402.GB23305@node.shutemov.name>
On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> Quick look around:
>
> - I don't see any check page_count() around __replace_page() in uprobes,
> so it can easily replace pinned page.
>
> - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> take the pin between the check and establishing new pte entry.
* Ok this is tricky, when get_user_pages_fast() run it doesn't
* take any lock, therefore the check that we are going to make
* with the pagecount against the mapcount is racey and
* O_DIRECT can happen right after the check.
* So we clear the pte and flush the tlb before the check
* this assure us that no O_DIRECT can happen after the check
* or in the middle of the check.
*/
entry = ptep_clear_flush_notify(vma, addr, ptep);
KSM takes care of that or it wouldn't be safe if KSM was with memory
under O_DIRECT.
> - khugepaged: the same story as with KSM.
In __collapse_huge_page_isolate we do:
/*
* cannot use mapcount: can't collapse if there's a gup pin.
* The page must only be referenced by the scanned process
* and page swap cache.
*/
if (page_count(page) != 1 + !!PageSwapCache(page)) {
unlock_page(page);
result = SCAN_PAGE_COUNT;
goto out;
}
At that point the pmd has been zapped (pmdp_collapse_flush already
run) and like for KSM case that is enough to ensure
get_user_pages_fast can't succeed and it'll have to call into the slow
get_user_pages.
These two issues are not specific to vfio and IOMMUs, this is must be
correct or O_DIRECT will generate data corruption in presence of
KSM/khugepaged. Both looks fine to me.
> I don't see how we can deliver on the guarantee, especially with lockless
> GUP_fast.
By zapping the pmd_trans_huge/pte and sending IPIs if needed
(get_user_pages_fast runs with irq disabled), before checking
page_count.
With the RCU version of it it's the same, but instead of sending IPIs,
we'll wait for a quiescient point to be sure of having flushed any
concurrent get_user_pages_fast out of the other CPUs, before we
proceed to check page_count (then no other get_user_pages_fast can
increase the page count for this page on this "mm" anymore).
That's how the guaranteed is provided against get_user_pages_fast.
next prev parent reply other threads:[~2016-05-02 18:56 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-28 16:20 [BUG] vfio device assignment regression with THP ref counting redesign Alex Williamson
2016-04-28 16:20 ` Alex Williamson
2016-04-28 18:17 ` Kirill A. Shutemov
2016-04-28 18:17 ` Kirill A. Shutemov
2016-04-28 18:58 ` Alex Williamson
2016-04-28 18:58 ` Alex Williamson
2016-04-28 23:21 ` Andrea Arcangeli
2016-04-28 23:21 ` Andrea Arcangeli
2016-04-29 0:44 ` Alex Williamson
2016-04-29 0:44 ` Alex Williamson
2016-04-29 0:51 ` Kirill A. Shutemov
2016-04-29 0:51 ` Kirill A. Shutemov
2016-04-29 2:45 ` Alex Williamson
2016-04-29 2:45 ` Alex Williamson
2016-04-29 7:06 ` Kirill A. Shutemov
2016-04-29 7:06 ` Kirill A. Shutemov
2016-04-29 15:12 ` Alex Williamson
2016-04-29 15:12 ` Alex Williamson
2016-04-29 16:34 ` Andrea Arcangeli
2016-04-29 16:34 ` Andrea Arcangeli
2016-04-29 22:34 ` Alex Williamson
2016-04-29 22:34 ` Alex Williamson
2016-05-02 10:41 ` Kirill A. Shutemov
2016-05-02 10:41 ` Kirill A. Shutemov
2016-05-02 11:15 ` Jerome Glisse
2016-05-02 11:15 ` Jerome Glisse
2016-05-02 12:14 ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
2016-05-02 12:14 ` Kirill A. Shutemov
2016-05-02 13:39 ` Jerome Glisse
2016-05-02 13:39 ` Jerome Glisse
2016-05-02 15:00 ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
2016-05-02 15:00 ` Kirill A. Shutemov
2016-05-02 15:22 ` Jerome Glisse
2016-05-02 15:22 ` Jerome Glisse
2016-05-02 16:12 ` Kirill A. Shutemov
2016-05-02 16:12 ` Kirill A. Shutemov
2016-05-02 19:14 ` Andrea Arcangeli
2016-05-02 19:14 ` Andrea Arcangeli
2016-05-02 19:11 ` Andrea Arcangeli
2016-05-02 19:11 ` Andrea Arcangeli
2016-05-02 19:02 ` Andrea Arcangeli
2016-05-02 19:02 ` Andrea Arcangeli
2016-05-02 14:15 ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
2016-05-02 14:15 ` Oleg Nesterov
2016-05-02 16:21 ` Kirill A. Shutemov
2016-05-02 16:21 ` Kirill A. Shutemov
2016-05-02 16:22 ` Oleg Nesterov
2016-05-02 16:22 ` Oleg Nesterov
2016-05-02 18:03 ` Kirill A. Shutemov
2016-05-02 18:03 ` Kirill A. Shutemov
2016-05-02 17:41 ` Oleg Nesterov
2016-05-02 17:41 ` Oleg Nesterov
2016-05-02 18:56 ` Andrea Arcangeli [this message]
2016-05-02 18:56 ` Andrea Arcangeli
2016-05-02 15:23 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
2016-05-02 15:23 ` Andrea Arcangeli
2016-05-02 16:00 ` Kirill A. Shutemov
2016-05-02 16:00 ` Kirill A. Shutemov
2016-05-02 18:03 ` Andrea Arcangeli
2016-05-02 18:03 ` Andrea Arcangeli
2016-05-05 1:19 ` Alex Williamson
2016-05-05 1:19 ` Alex Williamson
2016-05-05 14:39 ` Andrea Arcangeli
2016-05-05 14:39 ` Andrea Arcangeli
2016-05-05 15:09 ` Andrea Arcangeli
2016-05-05 15:09 ` Andrea Arcangeli
2016-05-05 15:11 ` Kirill A. Shutemov
2016-05-05 15:11 ` Kirill A. Shutemov
2016-05-05 15:24 ` Andrea Arcangeli
2016-05-05 15:24 ` Andrea Arcangeli
2016-05-06 7:29 ` Kirill A. Shutemov
2016-05-06 7:29 ` Kirill A. Shutemov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160502185649.GC12310@redhat.com \
--to=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alex.williamson@redhat.com \
--cc=hughd@google.com \
--cc=j.glisse@gmail.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=oleg@redhat.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.