* Re: [Qemu-devel] [LSF/MM TOPIC][LSF/MM, ATTEND] shared TLB, hugetlb reservations
From: Andrea Arcangeli @ 2017-03-14 18:37 UTC
To: Mike Kravetz
Cc: lsf-pc, linux-mm, linux-kernel, Dr. David Alan Gilbert,
qemu-devel, Mike Rapoport
Hello,
On Wed, Mar 08, 2017 at 05:30:55PM -0800, Mike Kravetz wrote:
> On 01/10/2017 03:02 PM, Mike Kravetz wrote:
> > Another more concrete topic is hugetlb reservations. Michal Hocko
> > proposed the topic "mm patches review bandwidth", and brought up the
> > related subject of areas in need of attention from an architectural
> > POV. I suggested that hugetlb reservations was one such area. I'm
> > guessing it was introduced to solve a rather concrete problem. However,
> > over time additional hugetlb functionality was added and the
> > capabilities of the reservation code were stretched to accommodate.
> > It would be good to step back and take a look at the design of this
> > code to determine if a rewrite/redesign is necessary. Michal suggested
> > documenting the current design/code as a first step. If people think
> > this is worth discussion at the summit, I could put together such a
> > design before the gathering.
>
> I attempted to put together a design/overview of how hugetlb reservations
> currently work. Hopefully, this will be useful.
Another area of hugetlbfs that is not clear is the status of
MADV_REMOVE and the behavior of fallocate punch hole that deviates
from more standard shmem semantics. That might also be a topic of
interest related to your hugetlbfs topic and marginally related to
userfaultfd.
The current status for anon, shmem and hugetlbfs is like this:

MADV_DONTNEED works: anon, !VM_SHARED shmem
MADV_DONTNEED doesn't work: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED
MADV_DONTNEED works but not guaranteed to fault: shmem VM_SHARED

MADV_REMOVE works: shmem VM_SHARED, hugetlbfs VM_SHARED
MADV_REMOVE doesn't work: anon, shmem !VM_SHARED, hugetlbfs !VM_SHARED

fallocate punch hole works: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED,
                            shmem VM_SHARED
fallocate punch hole doesn't work: anon, shmem !VM_SHARED
So what happens in qemu is:

anon -> MADV_DONTNEED

shmem !VM_SHARED -> MADV_DONTNEED (fallocate punch hole wouldn't zap
private pages, but it does on hugetlbfs)

shmem VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)

hugetlbfs !VM_SHARED -> fallocate punch hole (works for hugetlbfs
but not for shmem !VM_SHARED)

hugetlbfs VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)
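
To make those choices concrete, here is a minimal sketch of the calls
each case boils down to. It is not qemu's actual code: the helper name
and the is_hugetlbfs/is_shared parameters are made up for illustration.

#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate, FALLOC_FL_* */
#include <stdbool.h>
#include <sys/mman.h>   /* madvise */
#include <sys/types.h>

/* Illustrative helper: zap guest RAM backed by anon, shmem or hugetlbfs. */
static int zap_range(void *addr, size_t len, int fd, off_t offset,
                     bool is_hugetlbfs, bool is_shared)
{
    if (fd == -1) {
        /* anonymous memory: MADV_DONTNEED is enough */
        return madvise(addr, len, MADV_DONTNEED);
    }
    if (!is_hugetlbfs && !is_shared) {
        /* shmem !VM_SHARED: punch hole would not zap the private copies */
        return madvise(addr, len, MADV_DONTNEED);
    }
    /* shmem VM_SHARED and hugetlbfs (shared or private): punch a hole,
     * which needs the fd and the file offset, not just the vaddr */
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}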
This means qemu has to carry around information on the type of memory
it got from the initial memblock setup, so at live migration time it
can zap the memory with the right call. (NOTE: such memory is not
generated by userfaultfd UFFDIO_COPY, but it was allocated and mapped
and it must be zapped well before calling userfaultfd the first time).
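
As an aside, a sketch of that ordering (illustrative only, not qemu
code; register_after_zap is a made-up name): the range must already
have been zapped by the time it gets registered with userfaultfd, so
that every later access reliably faults and can be filled by
UFFDIO_COPY.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Register a range that has already been zapped. */
static int register_after_zap(void *addr, unsigned long len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };

    if (uffd < 0 ||
        ioctl(uffd, UFFDIO_API, &api) ||
        ioctl(uffd, UFFDIO_REGISTER, &reg))
        return -1;
    /* missing faults in [addr, addr+len) are now resolved by UFFDIO_COPY */
    return uffd;
}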
To pick the right zap call qemu uses fstatfs, which tells it what kind
of memory it's dealing with.
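
A sketch of that detection (again illustrative, not the actual qemu
code; mem_backing_type and the enum are made up): the filesystem magic
returned by fstatfs tells hugetlbfs and tmpfs/shmem apart from
everything else.

#include <linux/magic.h>   /* HUGETLBFS_MAGIC, TMPFS_MAGIC */
#include <sys/vfs.h>       /* fstatfs, struct statfs */

enum mem_backing { MEM_OTHER, MEM_SHMEM, MEM_HUGETLBFS };

/* Illustrative helper: classify the fd backing a memory region. */
static int mem_backing_type(int fd)
{
    struct statfs fs;

    if (fstatfs(fd, &fs) != 0)
        return -1;
    if (fs.f_type == HUGETLBFS_MAGIC)
        return MEM_HUGETLBFS;   /* always use fallocate punch hole */
    if (fs.f_type == TMPFS_MAGIC)
        return MEM_SHMEM;       /* punch hole only for VM_SHARED mappings */
    return MEM_OTHER;           /* fall back to MADV_DONTNEED */
}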
In short it'd be better to have something like a generic MADV_REMOVE
that guarantees a non-present fault after it succeeds, no matter what
kind of memory is mapped in the virtual range that has to be
zapped. The above is far from ideal from a userland developer
perspective.
Overall fallocate punch hole covers the most cases, so to keep the code
simpler MADV_REMOVE ironically ends up never being used, despite
providing a more qemu-friendly API than fallocate. The files are
always mapped and the older code only dealt with virtual addresses
(before hugetlbfs and shmem entered the equation). Ideally qemu wants
to call the same madvise regardless of whether the memory is anon,
shmem or hugetlbfs, without having to carry around file descriptors,
file offsets and superblock types.
It's also not clear why MADV_DONTNEED doesn't work for hugetlbfs
!VM_SHARED mappings and why fallocate punch hole is also zapping
private cow-like pages from !VM_SHARED mappings (although if it
didn't, it would be impossible to zap those... so it's good luck it
does).
Thanks,
Andrea
PS. CC'ed also qemu-devel in case it may help clarify why things are
implemented the way they are in the postcopy live migration
hugetlbfs/shmem support and in the future patches for shmem/hugetlbfs
share=on.
* Re: [Qemu-devel] [LSF/MM TOPIC][LSF/MM, ATTEND] shared TLB, hugetlb reservations
From: Mike Kravetz @ 2017-03-17 22:13 UTC
To: Andrea Arcangeli
Cc: lsf-pc, linux-mm, linux-kernel, Dr. David Alan Gilbert,
qemu-devel, Mike Rapoport
On 03/14/2017 11:37 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Wed, Mar 08, 2017 at 05:30:55PM -0800, Mike Kravetz wrote:
>> On 01/10/2017 03:02 PM, Mike Kravetz wrote:
>>> Another more concrete topic is hugetlb reservations. Michal Hocko
>>> proposed the topic "mm patches review bandwidth", and brought up the
>>> related subject of areas in need of attention from an architectural
>>> POV. I suggested that hugetlb reservations was one such area. I'm
>>> guessing it was introduced to solve a rather concrete problem. However,
>>> over time additional hugetlb functionality was added and the
>>> capabilities of the reservation code were stretched to accommodate.
>>> It would be good to step back and take a look at the design of this
>>> code to determine if a rewrite/redesign is necessary. Michal suggested
>>> documenting the current design/code as a first step. If people think
>>> this is worth discussion at the summit, I could put together such a
>>> design before the gathering.
>>
>> I attempted to put together a design/overview of how hugetlb reservations
>> currently work. Hopefully, this will be useful.
>
> Another area of hugetlbfs that is not clear is the status of
> MADV_REMOVE and the behavior of fallocate punch hole that deviates
> from more standard shmem semantics. That might also be a topic of
> interest related to your hugetlbfs topic and marginally related to
> userfaultfd.
Thanks Andrea,
I was not aware qemu was carrying all this information.
> The current status for anon, shmem and hugetlbfs is like this:
>
> MADV_DONTNEED works: anon, !VM_SHARED shmem
> MADV_DONTNEED doesn't work: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED
> MADV_DONTNEED works but not guaranteed to fault: shmem VM_SHARED
>
> MADV_REMOVE works: shmem VM_SHARED, hugetlbfs VM_SHARED
> MADV_REMOVE doesn't work: anon, shmem !VM_SHARED, hugetlbfs !VM_SHARED
>
> fallocate punch hole works: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED,
> shmem VM_SHARED
> fallocate punch hole doesn't work: anon, shmem !VM_SHARED
>
> So what happens in qemu is:
>
> anon -> MADV_DONTNEED
>
> shmem !VM_SHARED -> MADV_DONTNEED (fallocate punch hole wouldn't zap
> private pages, but it does on hugetlbfs)
>
> shmem VM_SHARED -> fallocate punch hole (MADV_REMOVE would
> work too)
>
> hugetlbfs !VM_SHARED -> fallocate punch hole (works for hugetlbfs
> but not for shmem !VM_SHARED)
>
> hugetlbfs VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)
>
> This means qemu has to carry around information on the type of memory
> it got from the initial memblock setup, so at live migration time it
> can zap the memory with the right call. (NOTE: such memory is not
> generated by userfaultfd UFFDIO_COPY, but it was allocated and mapped
> and it must be zapped well before calling userfaultfd the first time).
>
> To pick the right zap call qemu uses fstatfs, which tells it what kind
> of memory it's dealing with.
>
> In short it'd be better to have something like a generic MADV_REMOVE
> that guarantees a non-present fault after it succeeds, no matter what
> kind of memory is mapped in the virtual range that has to be
> zapped. The above is far from ideal from a userland developer
> perspective.
I think we will need to have a new generic MADV_REMOVE type of call
as you suggest. Based on the existing documentation for MADV_DONTNEED,
MADV_REMOVE and fallocate hole punch, each of them is designed not to
work on at least one of the desired memory mapping types.
> Overall fallocate punch hole covers the most cases, so to keep the code
> simpler MADV_REMOVE ironically ends up never being used, despite
> providing a more qemu-friendly API than fallocate. The files are
> always mapped and the older code only dealt with virtual addresses
> (before hugetlbfs and shmem entered the equation). Ideally qemu wants
> to call the same madvise regardless of whether the memory is anon,
> shmem or hugetlbfs, without having to carry around file descriptors,
> file offsets and superblock types.
>
> It's also not clear why MADV_DONTNEED doesn't work for hugetlbfs
> !VM_SHARED mappings and why fallocate punch hole is also zapping
> private cow-like pages from !VM_SHARED mappings (although if it
> didn't, it would be impossible to zap those... so it's good luck it
> does).
Yes, it is more like good luck than design. fallocate hole punch for
hugetlbfs VM_SHARED was the original use case/design. MADV_REMOVE was
added just because it could be, without additional effort.
Thanks for bringing this up. We should definitely discuss this within
the scope of hugetlbfs and/or userfaultfd.
--
Mike Kravetz
>
> Thanks,
> Andrea
>
> PS. CC'ed also qemu-devel in case it may help clarify why things are
> implemented the way they are in the postcopy live migration
> hugetlbfs/shmem support and in the future patches for shmem/hugetlbfs
> share=on.