All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Jerome Glisse <j.glisse@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: GUP guarantees wrt to userspace mappings
Date: Mon, 2 May 2016 19:12:52 +0300	[thread overview]
Message-ID: <20160502161252.GE24419@node.shutemov.name> (raw)
In-Reply-To: <20160502152249.GA5827@gmail.com>

On Mon, May 02, 2016 at 05:22:49PM +0200, Jerome Glisse wrote:
> On Mon, May 02, 2016 at 06:00:13PM +0300, Kirill A. Shutemov wrote:
> > > > Quick look around:
> > > > 
> > > >  - I don't see any check page_count() around __replace_page() in uprobes,
> > > >    so it can easily replace pinned page.
> > > 
> > > Not an issue for existing user as this is only use to instrument code, existing
> > > user do not execute code from virtual address for which they have done a GUP.
> > 
> > Okay, so we can establish that GUP doesn't provide the guarantee in some
> > cases.
> 
> Correct but it use to provide that guarantee in respect to THP.

Yes, the THP regression need to be fixed. I don't argue with that.

> > > >  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> > > >    take the pin between the check and establishing new pte entry.
> > > 
> > > KSM is not an issue for existing user as they all do get_user_pages() with
> > > write = 1 and the KSM first map page read only before considering to replace
> > > them and check page refcount. So there can be no race with gup_fast there.
> > 
> > In vfio case, 'write' is conditional on IOMMU_WRITE, meaning not all
> > get_user_pages() are with write=1.
> 
> I think this is still fine as it means that device will read only and thus
> you can migrate to different page (ie the guest is not expecting to read back
> anything writen by the device and device writting to the page would be illegal
> and a proper IOMMU would forbid it). So it is like direct-io when you write
> from anonymous memory to a file.

Hm. Okay.

> > > >  - khugepaged: the same story as with KSM.
> > > 
> > > I am assuming you are talking about collapse_huge_page() here, if you look in
> > > that function there is a comment about GUP_fast. Noneless i believe the comment
> > > is wrong as i believe there is an existing race window btw pmdp_collapse_flush()
> > > and __collapse_huge_page_isolate() :
> > > 
> > >   get_user_pages_fast()          | collapse_huge_page()
> > >    gup_pmd_range() -> valid pmd  | ...
> > >                                  | pmdp_collapse_flush() clear pmd
> > >                                  | ...
> > >                                  | __collapse_huge_page_isolate()
> > >                                  | [Above check page count and see no GUP]
> > >    gup_pte_range() -> ref page   |
> > > 
> > > This is a very unlikely race because get_user_pages_fast() can not be preempted
> > > while collapse_huge_page() can be preempted btw pmdp_collapse_flush() and
> > > __collapse_huge_page_isolate(), more over collapse_huge_page() has lot more
> > > instructions to chew on than get_user_pages_fast() btw gup_pmd_range() and
> > > gup_pte_range().
> > 
> > Yes, the race window is small, but there.
> 
> Now that i think again about it, i don't think it exist. pmdp_collapse_flush()
> will flush the tlb and thus send an IPI but get_user_pages_fast() can't be
> preempted so the flush will have to wait for existing get_user_pages_fast() to
> complete. Or am i missunderstanding flush ? So khugepaged is safe from GUP_fast
> point of view like the comment, inside it, says.

You are right. It's safe too.

> > > So as said above, i think existing user of get_user_pages() are not sensitive
> > > to the races you pointed above. I am sure there are some corner case where
> > > the guarantee that GUP pin a page against a virtual address is violated but
> > > i do not think they apply to any existing user of GUP.
> > > 
> > > Note that i would personaly like that this existing assumption about GUP did
> > > not exist. I hate it, but fact is that it does exist and nobody can remember
> > > where the Doc did park the Delorean
> > 
> > The drivers who want the guarantee can provide own ->mmap and have more
> > control on what is visible in userspace.
> > 
> > Alternatively, we have mmu_notifiers to track changes in userspace
> > mappings.
> > 
> 
> Well you can't not rely on special vma here. Qemu alloc anonymous memory and
> hand it over to guest, then a guest driver (ie runing in the guest not on the
> host) try to map that memory and need valid DMA address for it, this is when
> vfio (on the host kernel) starts pining memory of regular anonymous vma (on
> the host). That same memory might back some special vma with ->mmap callback
> but in the guest. Point is there is no driver on the host and no special vma.
> From host point of view this is anonymous memory, but from guest POV it is
> just memory.
> 
> Requiring special vma would need major change to kvm and probably xen, in
> respect on how they support things like PCI passthrough.
> 
> In existing workload, host kernel can not make assumption on how anonymous
> memory is gonna be use.

Any reason why mmu_notifier is not an option?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Jerome Glisse <j.glisse@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: GUP guarantees wrt to userspace mappings
Date: Mon, 2 May 2016 19:12:52 +0300	[thread overview]
Message-ID: <20160502161252.GE24419@node.shutemov.name> (raw)
In-Reply-To: <20160502152249.GA5827@gmail.com>

On Mon, May 02, 2016 at 05:22:49PM +0200, Jerome Glisse wrote:
> On Mon, May 02, 2016 at 06:00:13PM +0300, Kirill A. Shutemov wrote:
> > > > Quick look around:
> > > > 
> > > >  - I don't see any check page_count() around __replace_page() in uprobes,
> > > >    so it can easily replace pinned page.
> > > 
> > > Not an issue for existing user as this is only use to instrument code, existing
> > > user do not execute code from virtual address for which they have done a GUP.
> > 
> > Okay, so we can establish that GUP doesn't provide the guarantee in some
> > cases.
> 
> Correct but it use to provide that guarantee in respect to THP.

Yes, the THP regression need to be fixed. I don't argue with that.

> > > >  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> > > >    take the pin between the check and establishing new pte entry.
> > > 
> > > KSM is not an issue for existing user as they all do get_user_pages() with
> > > write = 1 and the KSM first map page read only before considering to replace
> > > them and check page refcount. So there can be no race with gup_fast there.
> > 
> > In vfio case, 'write' is conditional on IOMMU_WRITE, meaning not all
> > get_user_pages() are with write=1.
> 
> I think this is still fine as it means that device will read only and thus
> you can migrate to different page (ie the guest is not expecting to read back
> anything writen by the device and device writting to the page would be illegal
> and a proper IOMMU would forbid it). So it is like direct-io when you write
> from anonymous memory to a file.

Hm. Okay.

> > > >  - khugepaged: the same story as with KSM.
> > > 
> > > I am assuming you are talking about collapse_huge_page() here, if you look in
> > > that function there is a comment about GUP_fast. Noneless i believe the comment
> > > is wrong as i believe there is an existing race window btw pmdp_collapse_flush()
> > > and __collapse_huge_page_isolate() :
> > > 
> > >   get_user_pages_fast()          | collapse_huge_page()
> > >    gup_pmd_range() -> valid pmd  | ...
> > >                                  | pmdp_collapse_flush() clear pmd
> > >                                  | ...
> > >                                  | __collapse_huge_page_isolate()
> > >                                  | [Above check page count and see no GUP]
> > >    gup_pte_range() -> ref page   |
> > > 
> > > This is a very unlikely race because get_user_pages_fast() can not be preempted
> > > while collapse_huge_page() can be preempted btw pmdp_collapse_flush() and
> > > __collapse_huge_page_isolate(), more over collapse_huge_page() has lot more
> > > instructions to chew on than get_user_pages_fast() btw gup_pmd_range() and
> > > gup_pte_range().
> > 
> > Yes, the race window is small, but there.
> 
> Now that i think again about it, i don't think it exist. pmdp_collapse_flush()
> will flush the tlb and thus send an IPI but get_user_pages_fast() can't be
> preempted so the flush will have to wait for existing get_user_pages_fast() to
> complete. Or am i missunderstanding flush ? So khugepaged is safe from GUP_fast
> point of view like the comment, inside it, says.

You are right. It's safe too.

> > > So as said above, i think existing user of get_user_pages() are not sensitive
> > > to the races you pointed above. I am sure there are some corner case where
> > > the guarantee that GUP pin a page against a virtual address is violated but
> > > i do not think they apply to any existing user of GUP.
> > > 
> > > Note that i would personaly like that this existing assumption about GUP did
> > > not exist. I hate it, but fact is that it does exist and nobody can remember
> > > where the Doc did park the Delorean
> > 
> > The drivers who want the guarantee can provide own ->mmap and have more
> > control on what is visible in userspace.
> > 
> > Alternatively, we have mmu_notifiers to track changes in userspace
> > mappings.
> > 
> 
> Well you can't not rely on special vma here. Qemu alloc anonymous memory and
> hand it over to guest, then a guest driver (ie runing in the guest not on the
> host) try to map that memory and need valid DMA address for it, this is when
> vfio (on the host kernel) starts pining memory of regular anonymous vma (on
> the host). That same memory might back some special vma with ->mmap callback
> but in the guest. Point is there is no driver on the host and no special vma.
> From host point of view this is anonymous memory, but from guest POV it is
> just memory.
> 
> Requiring special vma would need major change to kvm and probably xen, in
> respect on how they support things like PCI passthrough.
> 
> In existing workload, host kernel can not make assumption on how anonymous
> memory is gonna be use.

Any reason why mmu_notifier is not an option?

-- 
 Kirill A. Shutemov

  reply	other threads:[~2016-05-02 16:12 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-28 16:20 [BUG] vfio device assignment regression with THP ref counting redesign Alex Williamson
2016-04-28 16:20 ` Alex Williamson
2016-04-28 18:17 ` Kirill A. Shutemov
2016-04-28 18:17   ` Kirill A. Shutemov
2016-04-28 18:58   ` Alex Williamson
2016-04-28 18:58     ` Alex Williamson
2016-04-28 23:21     ` Andrea Arcangeli
2016-04-28 23:21       ` Andrea Arcangeli
2016-04-29  0:44       ` Alex Williamson
2016-04-29  0:44         ` Alex Williamson
2016-04-29  0:51       ` Kirill A. Shutemov
2016-04-29  0:51         ` Kirill A. Shutemov
2016-04-29  2:45         ` Alex Williamson
2016-04-29  2:45           ` Alex Williamson
2016-04-29  7:06           ` Kirill A. Shutemov
2016-04-29  7:06             ` Kirill A. Shutemov
2016-04-29 15:12             ` Alex Williamson
2016-04-29 15:12               ` Alex Williamson
2016-04-29 16:34             ` Andrea Arcangeli
2016-04-29 16:34               ` Andrea Arcangeli
2016-04-29 22:34               ` Alex Williamson
2016-04-29 22:34                 ` Alex Williamson
2016-05-02 10:41               ` Kirill A. Shutemov
2016-05-02 10:41                 ` Kirill A. Shutemov
2016-05-02 11:15                 ` Jerome Glisse
2016-05-02 11:15                   ` Jerome Glisse
2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
2016-05-02 12:14                     ` Kirill A. Shutemov
2016-05-02 13:39                     ` Jerome Glisse
2016-05-02 13:39                       ` Jerome Glisse
2016-05-02 15:00                       ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
2016-05-02 15:00                         ` Kirill A. Shutemov
2016-05-02 15:22                         ` Jerome Glisse
2016-05-02 15:22                           ` Jerome Glisse
2016-05-02 16:12                           ` Kirill A. Shutemov [this message]
2016-05-02 16:12                             ` Kirill A. Shutemov
2016-05-02 19:14                             ` Andrea Arcangeli
2016-05-02 19:14                               ` Andrea Arcangeli
2016-05-02 19:11                           ` Andrea Arcangeli
2016-05-02 19:11                             ` Andrea Arcangeli
2016-05-02 19:02                         ` Andrea Arcangeli
2016-05-02 19:02                           ` Andrea Arcangeli
2016-05-02 14:15                     ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
2016-05-02 14:15                       ` Oleg Nesterov
2016-05-02 16:21                       ` Kirill A. Shutemov
2016-05-02 16:21                         ` Kirill A. Shutemov
2016-05-02 16:22                         ` Oleg Nesterov
2016-05-02 16:22                           ` Oleg Nesterov
2016-05-02 18:03                           ` Kirill A. Shutemov
2016-05-02 18:03                             ` Kirill A. Shutemov
2016-05-02 17:41                             ` Oleg Nesterov
2016-05-02 17:41                               ` Oleg Nesterov
2016-05-02 18:56                     ` Andrea Arcangeli
2016-05-02 18:56                       ` Andrea Arcangeli
2016-05-02 15:23                 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
2016-05-02 15:23                   ` Andrea Arcangeli
2016-05-02 16:00                   ` Kirill A. Shutemov
2016-05-02 16:00                     ` Kirill A. Shutemov
2016-05-02 18:03                     ` Andrea Arcangeli
2016-05-02 18:03                       ` Andrea Arcangeli
2016-05-05  1:19                       ` Alex Williamson
2016-05-05  1:19                         ` Alex Williamson
2016-05-05 14:39                         ` Andrea Arcangeli
2016-05-05 14:39                           ` Andrea Arcangeli
2016-05-05 15:09                           ` Andrea Arcangeli
2016-05-05 15:09                             ` Andrea Arcangeli
2016-05-05 15:11                           ` Kirill A. Shutemov
2016-05-05 15:11                             ` Kirill A. Shutemov
2016-05-05 15:24                             ` Andrea Arcangeli
2016-05-05 15:24                               ` Andrea Arcangeli
2016-05-06  7:29                               ` Kirill A. Shutemov
2016-05-06  7:29                                 ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160502161252.GE24419@node.shutemov.name \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=hughd@google.com \
    --cc=j.glisse@gmail.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.