* Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
       [not found] ` <ZxcrJHtIGckMo9Ni@google.com>
@ 2024-10-22  8:26   ` Yosry Ahmed
  2024-10-22 15:39     ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: Yosry Ahmed @ 2024-10-22  8:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Matthew Wilcox, Andrew Morton, linux-mm, Vlastimil Babka,
	linux-kernel, stable, Hugh Dickins, kvm, Sean Christopherson,
	Paolo Bonzini

On Mon, Oct 21, 2024 at 9:33 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > > Fix it by moving the mlocked flag clearance down to
> > > > > free_page_prepare().
> > > >
> > > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > > It feels like a layering violation.  I'll think about where else we
> > > > could put this.
> > >
> > > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > > no way it can be backported to older kernels. As for this fix, I don't
> > > have better ideas...
> >
> > Well, what is KVM doing that causes this page to get mapped to userspace?
> > Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> > can tell is that it's freed with vfree().
> >
> > Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> > but I'd hate to spend a lot of time on it and then discover I was looking
> > at the wrong thing.
>
> One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.

Looking at kvm_vcpu_fault(), it seems like after mmap'ing the fd
returned by KVM_CREATE_VCPU we can access one of the following:
- vcpu->run
- vcpu->arch.pio_data
- vcpu->kvm->coalesced_mmio_ring
- a page returned by kvm_dirty_ring_get_page()

It doesn't seem like any of these are reclaimable, so why is
mlock()'ing them supported to begin with?

Even if we don't want mlock() to err in this case, shouldn't we just
do nothing? I see a lot of checks at the beginning of mlock_fixup()
that decide whether we should operate on the vma; perhaps we should
also check for these KVM vmas? Or maybe set VM_SPECIAL in
kvm_vcpu_mmap()? I am not sure tbh, but this doesn't seem right.

FWIW, I think moving the mlock clearing from __page_cache_release()
to free_pages_prepare() (or another common function in the page
freeing path) may be the right thing to do in its own right. I am just
wondering why we are not questioning the mlock() on the KVM vCPU
mapping to begin with.

Is there a use case for this that I am missing?
>
> Here is the reproducer:
>
> #define _GNU_SOURCE
>
> #include <endian.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mount.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #ifndef __NR_mlock2
> #define __NR_mlock2 325
> #endif
>
> uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
>
> #ifndef KVM_CREATE_VM
> #define KVM_CREATE_VM 0xae01
> #endif
>
> #ifndef KVM_CREATE_VCPU
> #define KVM_CREATE_VCPU 0xae41
> #endif
>
> int main(void)
> {
>   syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>           /*offset=*/0ul);
>   syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
>           /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
>           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>           /*offset=*/0ul);
>   syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>           /*offset=*/0ul);
>   intptr_t res = syscall(__NR_openat, /*fd=*/0xffffff9c, /*file=*/"/dev/kvm",
>                          /*flags=*/0, /*mode=*/0);
>   if (res != -1)
>     r[0] = res;
>   res = syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/KVM_CREATE_VM, /*type=*/0ul);
>   if (res != -1)
>     r[1] = res;
>   res = syscall(__NR_ioctl, /*fd=*/r[1], /*cmd=*/KVM_CREATE_VCPU, /*id=*/0ul);
>   if (res != -1)
>     r[2] = res;
>   syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0xb36000ul,
>           /*prot=PROT_SEM|PROT_WRITE|PROT_READ|PROT_EXEC*/ 0xful,
>           /*flags=MAP_FIXED|MAP_SHARED*/ 0x11ul, /*fd=*/r[2], /*offset=*/0ul);
>   syscall(__NR_mlock2, /*addr=*/0x20000000ul, /*size=*/0x400000ul,
>           /*flags=*/0ul);
>   syscall(__NR_mremap, /*addr=*/0x200ab000ul, /*len=*/0x1000ul,
>           /*newlen=*/0x1000ul,
>           /*flags=MREMAP_DONTUNMAP|MREMAP_FIXED|MREMAP_MAYMOVE*/ 7ul,
>           /*newaddr=*/0x20ffc000ul);
>   return 0;
> }

^ permalink raw reply	[flat|nested] 5+ messages in thread
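[For context on the fault path Yosry describes above: kvm_vcpu_fault() dispatches on the fault's page offset to pick one of those kernel-allocated pages. The following is an illustrative, kernel-style sketch written for this summary — it is simplified pseudocode, not the actual source; see virt/kvm/kvm_main.c for the real logic, which varies by architecture and kernel version.]

```c
/* Hedged sketch of kvm_vcpu_fault()'s page selection (illustrative only). */
static vm_fault_t vcpu_fault_sketch(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
{
	struct page *page;

	if (vmf->pgoff == 0)
		/* offset 0 maps the shared kvm_run structure */
		page = virt_to_page(vcpu->run);
	else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
		page = virt_to_page(vcpu->arch.pio_data);
	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
	else if (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET)
		/* dirty-ring pages; range check is simplified here */
		page = kvm_dirty_ring_get_page(&vcpu->dirty_ring,
					       vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
	else
		return VM_FAULT_SIGBUS;

	/*
	 * All of these are kernel-allocated pages handed out with an extra
	 * reference -- which is why mlock() treating them as "normal" pages
	 * is problematic when they are later freed with vfree().
	 */
	get_page(page);
	vmf->page = page;
	return 0;
}
```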
* Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  2024-10-22  8:26 ` [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare() Yosry Ahmed
@ 2024-10-22 15:39   ` Sean Christopherson
  2024-10-22 16:59     ` Matthew Wilcox
  2024-10-23  2:04     ` Roman Gushchin
  0 siblings, 2 replies; 5+ messages in thread
From: Sean Christopherson @ 2024-10-22 15:39 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Roman Gushchin, Matthew Wilcox, Andrew Morton, linux-mm,
	Vlastimil Babka, linux-kernel, stable, Hugh Dickins, kvm,
	Paolo Bonzini

On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> On Mon, Oct 21, 2024 at 9:33 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > > > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > > > Fix it by moving the mlocked flag clearance down to
> > > > > > free_page_prepare().
> > > > >
> > > > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > > > It feels like a layering violation.  I'll think about where else we
> > > > > could put this.
> > > >
> > > > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > > > no way it can be backported to older kernels. As for this fix, I don't
> > > > have better ideas...
> > >
> > > Well, what is KVM doing that causes this page to get mapped to userspace?
> > > Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> > > can tell is that it's freed with vfree().
> > >
> > > Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> > > but I'd hate to spend a lot of time on it and then discover I was looking
> > > at the wrong thing.
> >
> > One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.
>
> Looking at kvm_vcpu_fault(), it seems like after mmap'ing the fd
> returned by KVM_CREATE_VCPU we can access one of the following:
> - vcpu->run
> - vcpu->arch.pio_data
> - vcpu->kvm->coalesced_mmio_ring
> - a page returned by kvm_dirty_ring_get_page()
>
> It doesn't seem like any of these are reclaimable,

Correct, these are all kernel allocated pages that KVM exposes to userspace to
facilitate bidirectional sharing of large chunks of data.

> why is mlock()'ing them supported to begin with?

Because no one realized it would be problematic, and KVM would have had to go
out of its way to prevent mlock().

> Even if we don't want mlock() to err in this case, shouldn't we just do
> nothing?

Ideally, yes.

> I see a lot of checks at the beginning of mlock_fixup() to check
> whether we should operate on the vma, perhaps we should also check for
> these KVM vmas?

Definitely not.  KVM may be doing something unexpected, but the VMA certainly
isn't unique enough to warrant mm/ needing dedicated handling.

Focusing on KVM is likely a waste of time.  There are probably other subsystems
and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
even if there aren't any other existing cases, nothing would prevent them from
coming along in the future.

> Or maybe set VM_SPECIAL in kvm_vcpu_mmap()?  I am not
> sure tbh, but this doesn't seem right.

Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
but setting VM_DONTEXPAND could theoretically break userspace, and other than
preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
to fudge around this outside of core mm/ feels kludgy and has the potential to
turn into a game of whack-a-mole.
> FWIW, I think moving the mlock clearing from __page_cache_release()
> to free_pages_prepare() (or another common function in the page
> freeing path) may be the right thing to do in its own right. I am just
> wondering why we are not questioning the mlock() on the KVM vCPU
> mapping to begin with.
>
> Is there a use case for this that I am missing?

Not that I know of, I suspect mlock() is allowed simply because it's allowed
by default.

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  2024-10-22 15:39 ` Sean Christopherson
@ 2024-10-22 16:59   ` Matthew Wilcox
  2024-10-22 19:52     ` Sean Christopherson
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2024-10-22 16:59 UTC (permalink / raw)
To: Sean Christopherson
Cc: Yosry Ahmed, Roman Gushchin, Andrew Morton, linux-mm,
	Vlastimil Babka, linux-kernel, stable, Hugh Dickins, kvm,
	Paolo Bonzini

On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> > Even if we don't want mlock() to err in this case, shouldn't we just do
> > nothing?
>
> Ideally, yes.

Agreed.  There's no sense in having this count against the NR_MLOCK stats,
for example.

> > I see a lot of checks at the beginning of mlock_fixup() to check
> > whether we should operate on the vma, perhaps we should also check for
> > these KVM vmas?
>
> Definitely not.  KVM may be doing something unexpected, but the VMA certainly
> isn't unique enough to warrant mm/ needing dedicated handling.
>
> Focusing on KVM is likely a waste of time.  There are probably other subsystems
> and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
> good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
> even if there aren't any other existing cases, nothing would prevent them from
> coming along in the future.

They all need to be fixed.  How to do that is not an answer I have at
this point.  Ideally we can fix them without changing them all immediately
(but they will all need to be fixed eventually because pages will no
longer have a refcount and so get_page() will need to go away ...)

> > Or maybe set VM_SPECIAL in kvm_vcpu_mmap()?  I am not
> > sure tbh, but this doesn't seem right.
>
> Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
> but setting VM_DONTEXPAND could theoretically break userspace, and other than
> preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
> any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
> to fudge around this outside of core mm/ feels kludgy and has the potential to
> turn into a game of whack-a-mole.

Actually, VM_PFNMAP is probably ideal.  We're not really mapping pages
here (I mean, they are pages, but they're not filesystem pages or
anonymous pages ... there's no rmap to them).  We're mapping blobs of
memory whose refcount is controlled by the vma that maps them.  We don't
particularly want to be able to splice() this memory, or do RDMA to it.
We probably do want gdb to be able to read it (... yes?) which might be
a complication with a PFNMAP VMA.

We've given a lot of flexibility to device drivers about how they
implement mmap() and I think that's now getting in the way of some
important improvements.  I want to see a simpler way of providing the
same functionality, and I'm not quite there yet.

^ permalink raw reply	[flat|nested] 5+ messages in thread
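[To illustrate the VM_PFNMAP approach Matthew suggests above: a driver can map its kernel buffer as raw PFNs instead of handing out refcounted pages from a .fault handler. The sketch below is kernel-style pseudocode written for this summary — the device structure, field names, and buffer layout are hypothetical, remap_pfn_range() requires physically contiguous memory (vmalloc'd buffers would need per-page vmf_insert_pfn() instead), and vm_flags manipulation differs across kernel versions (newer kernels use vm_flags_set()).]

```c
/*
 * Hypothetical sketch: exposing a kernel buffer via VM_PFNMAP.
 * With VM_PFNMAP set, vm_normal_page() returns NULL for these PTEs,
 * so mlock()/munlock() and rmap never treat them as "normal" pages --
 * which is exactly the property the discussion above is after.
 */
static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct example_dev *dev = file->private_data;  /* illustrative */
	unsigned long pfn = virt_to_phys(dev->shared_buf) >> PAGE_SHIFT;

	/* remap_pfn_range() also sets VM_PFNMAP|VM_IO|VM_DONTEXPAND itself. */
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start, vma->vm_page_prot);
}
```

The trade-off Matthew raises applies here: PFNMAP mappings are invisible to the normal page machinery, so debuggers and core dumps need extra care (e.g. an .access vm_operations_struct hook) to read the memory.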
* Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  2024-10-22 16:59 ` Matthew Wilcox
@ 2024-10-22 19:52   ` Sean Christopherson
  0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2024-10-22 19:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yosry Ahmed, Roman Gushchin, Andrew Morton, linux-mm,
	Vlastimil Babka, linux-kernel, stable, Hugh Dickins, kvm,
	Paolo Bonzini

On Tue, Oct 22, 2024, Matthew Wilcox wrote:
> On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> > > Or maybe set VM_SPECIAL in kvm_vcpu_mmap()?  I am not
> > > sure tbh, but this doesn't seem right.
> >
> > Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
> > but setting VM_DONTEXPAND could theoretically break userspace, and other than
> > preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
> > any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
> > to fudge around this outside of core mm/ feels kludgy and has the potential to
> > turn into a game of whack-a-mole.
>
> Actually, VM_PFNMAP is probably ideal.  We're not really mapping pages
> here (I mean, they are pages, but they're not filesystem pages or
> anonymous pages ... there's no rmap to them).  We're mapping blobs of
> memory whose refcount is controlled by the vma that maps them.  We don't
> particularly want to be able to splice() this memory, or do RDMA to it.
> We probably do want gdb to be able to read it (... yes?)

More than likely, yes.  And we probably want the pages to show up in core dumps,
and be gup()-able.

I think that's the underlying problem with KVM's pages.  In many cases, we want
them to show up as vm_normal_page() pages.  But for a few things, e.g. mlock(),
it's nonsensical because they aren't entirely normal, just mostly normal.

> which might be a complication with a PFNMAP VMA.
>
> We've given a lot of flexibility to device drivers about how they
> implement mmap() and I think that's now getting in the way of some
> important improvements.  I want to see a simpler way of providing the
> same functionality, and I'm not quite there yet.

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  2024-10-22 15:39 ` Sean Christopherson
  2024-10-22 16:59   ` Matthew Wilcox
@ 2024-10-23  2:04   ` Roman Gushchin
  1 sibling, 0 replies; 5+ messages in thread
From: Roman Gushchin @ 2024-10-23  2:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: Yosry Ahmed, Matthew Wilcox, Andrew Morton, linux-mm,
	Vlastimil Babka, linux-kernel, stable, Hugh Dickins, kvm,
	Paolo Bonzini

On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> > On Mon, Oct 21, 2024 at 9:33 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> > > > On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > > > > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > > > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > > > > Fix it by moving the mlocked flag clearance down to
> > > > > > > free_page_prepare().
> > > > > >
> > > > > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > > > > It feels like a layering violation.  I'll think about where else we
> > > > > > could put this.
> > > > >
> > > > > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > > > > no way it can be backported to older kernels. As for this fix, I don't
> > > > > have better ideas...
> > > >
> > > > Well, what is KVM doing that causes this page to get mapped to userspace?
> > > > Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> > > > can tell is that it's freed with vfree().
> > > >
> > > > Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> > > > but I'd hate to spend a lot of time on it and then discover I was looking
> > > > at the wrong thing.
> > >
> > > One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.
> >
> > Looking at kvm_vcpu_fault(), it seems like after mmap'ing the fd
> > returned by KVM_CREATE_VCPU we can access one of the following:
> > - vcpu->run
> > - vcpu->arch.pio_data
> > - vcpu->kvm->coalesced_mmio_ring
> > - a page returned by kvm_dirty_ring_get_page()
> >
> > It doesn't seem like any of these are reclaimable,
>
> Correct, these are all kernel allocated pages that KVM exposes to userspace to
> facilitate bidirectional sharing of large chunks of data.
>
> > why is mlock()'ing them supported to begin with?
>
> Because no one realized it would be problematic, and KVM would have had to go
> out of its way to prevent mlock().
>
> > Even if we don't want mlock() to err in this case, shouldn't we just do
> > nothing?
>
> Ideally, yes.
>
> > I see a lot of checks at the beginning of mlock_fixup() to check
> > whether we should operate on the vma, perhaps we should also check for
> > these KVM vmas?
>
> Definitely not.  KVM may be doing something unexpected, but the VMA certainly
> isn't unique enough to warrant mm/ needing dedicated handling.
>
> Focusing on KVM is likely a waste of time.  There are probably other subsystems
> and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
> good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
> even if there aren't any other existing cases, nothing would prevent them from
> coming along in the future.

Yeah, I also think so. It seems that bpf/ringbuf.c contains another
example. There are likely more.

So I think we have to fix it either as proposed or on the mlock() side.

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-10-23 2:04 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20241021173455.2691973-1-roman.gushchin@linux.dev>
[not found] ` <Zxa60Ftbh8eN1MG5@casper.infradead.org>
[not found] ` <ZxcKjwhMKmnHTX8Q@google.com>
[not found] ` <ZxcgR46zpW8uVKrt@casper.infradead.org>
[not found] ` <ZxcrJHtIGckMo9Ni@google.com>
2024-10-22 8:26 ` [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare() Yosry Ahmed
2024-10-22 15:39 ` Sean Christopherson
2024-10-22 16:59 ` Matthew Wilcox
2024-10-22 19:52 ` Sean Christopherson
2024-10-23 2:04 ` Roman Gushchin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox