* [Qemu-devel] Dual userfaultfd behavior [not found] <CGME20170310140816eucas1p2bd85d3759739d64bb184ce0fbd74f90d@eucas1p2.samsung.com> @ 2017-03-10 14:08 ` Alexey Perevalov 2017-03-13 10:53 ` Dr. David Alan Gilbert 0 siblings, 1 reply; 6+ messages in thread From: Alexey Perevalov @ 2017-03-10 14:08 UTC (permalink / raw) To: Dr. David Alan Gilbert, qemu-devel, Andrea Arcangeli, Mike Kravetz Hi, David, Andrea and Mike The problem I want to discuss it's 1G hugepage based VM and post copy live migration. I would like to know your opinion on following approach of avoiding such problem: Once we have mmap'ed area through 1G hugetlbfs, remap physical pages with /dev/mem. It will be 2 types of vmas mapped to the same PFN. Register userfaultfd for newly obtained virtual addresses, it could reduce granularity of pages and reduce downtime per one 1G page. So registering userfaultfd for 2Mb, when the real hugepage was 1G, I think, could help. Current postcopy implementation in QEMU allows to make live migration from 1G based hugepage VM to 2Mb based hugepages VM (sanity checks prevent it). Also I checked, it's possible to remap through /dev/mem and get PFN based vmas, register userfaultfd (with allowance in vma_can_userfault) and finally make UFFDIO_COPY with allowing PFN based vmas in __mcopy_atomic. But there are a lot of drawback of such approach: First of all it's /dev/mem interface. Need to provide full access (kernel w/o CONFIG_STRICT_DEVMEM) and need to disable PAT. The second drawback, maybe I just didn't find possibility to remap hugepages again, but mmap of /dev/mem character driver maps 4Kb pages. I don't know how THP could help here, but madvise with MADV_HUGEPAGE didn't. So 4Kb is not exactly what needed, due to overhead of encapsulation summary downtime is worse than in other cases. It would be great to have interface to obtain new virtual address based on existing PFN, but for hugepages. Honestly, I can't find another use case for this feature. -- BR Alexey ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Dual userfaultfd behavior 2017-03-10 14:08 ` [Qemu-devel] Dual userfaultfd behavior Alexey Perevalov @ 2017-03-13 10:53 ` Dr. David Alan Gilbert 2017-03-13 21:46 ` Andrea Arcangeli 0 siblings, 1 reply; 6+ messages in thread From: Dr. David Alan Gilbert @ 2017-03-13 10:53 UTC (permalink / raw) To: Alexey Perevalov; +Cc: qemu-devel, Andrea Arcangeli, Mike Kravetz * Alexey Perevalov (a.perevalov@samsung.com) wrote: > Hi, David, Andrea and Mike Hi Alexey, > The problem I want to discuss it's 1G hugepage based VM and post copy live > migration. > > I would like to know your opinion on following approach of avoiding such > problem: > Once we have mmap'ed area through 1G hugetlbfs, remap physical pages > with /dev/mem. It will be 2 types of vmas mapped to the same PFN. > Register userfaultfd for newly obtained virtual > addresses, it could reduce granularity of pages and reduce downtime per > one 1G page. So registering userfaultfd for 2Mb, when the real hugepage > was 1G, I think, could help. > > Current postcopy implementation in QEMU allows to make live migration > from 1G based hugepage VM to 2Mb based hugepages VM (sanity checks prevent > it). > > Also I checked, it's possible to remap through /dev/mem and get PFN > based vmas, register userfaultfd (with allowance in vma_can_userfault) > and finally make UFFDIO_COPY with allowing PFN based vmas in __mcopy_atomic. > > But there are a lot of drawback of such approach: > First of all it's /dev/mem interface. Need to provide full access > (kernel w/o CONFIG_STRICT_DEVMEM) and need to disable PAT. > The second drawback, maybe I just didn't find possibility to remap > hugepages again, but mmap of /dev/mem character driver maps 4Kb pages. > I don't know how THP could help here, but madvise with MADV_HUGEPAGE > didn't. So 4Kb is not exactly what needed, due to overhead of > encapsulation summary downtime is worse than in other cases. > It would be great to have interface to obtain new virtual address based > on existing PFN, but for hugepages. Yes, and I think as well on some architectures there can be cache problems from mapping the same page in two addresses unless we're careful. I think to do this we'd basically need the kernel to set up something similar to what you're saying, but without the mess of having to go via /dev/mem. Ideally it would all happen magically when I mark a hugetlb page as userfault and start UFFDIO_COPYing in 4kb pages; but I can imagine perhaps some more syscalls needed to tell it to do it. I've no idea how hard that is to do though. Dave > Honestly, I can't find another use case for this feature. > > > -- > > BR > Alexey -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Dual userfaultfd behavior 2017-03-13 10:53 ` Dr. David Alan Gilbert @ 2017-03-13 21:46 ` Andrea Arcangeli 2017-03-15 13:47 ` Alexey Perevalov 0 siblings, 1 reply; 6+ messages in thread From: Andrea Arcangeli @ 2017-03-13 21:46 UTC (permalink / raw) To: Dr. David Alan Gilbert; +Cc: Alexey Perevalov, qemu-devel, Mike Kravetz Hello, On Mon, Mar 13, 2017 at 10:53:39AM +0000, Dr. David Alan Gilbert wrote: > * Alexey Perevalov (a.perevalov@samsung.com) wrote: > > Hi, David, Andrea and Mike > > Hi Alexey, > > > The problem I want to discuss it's 1G hugepage based VM and post copy live > > migration. > > > > I would like to know your opinion on following approach of avoiding such > > problem: > > Once we have mmap'ed area through 1G hugetlbfs, remap physical pages > > with /dev/mem. It will be 2 types of vmas mapped to the same PFN. > > Register userfaultfd for newly obtained virtual > > addresses, it could reduce granularity of pages and reduce downtime per > > one 1G page. So registering userfaultfd for 2Mb, when the real hugepage > > was 1G, I think, could help. > > > > Current postcopy implementation in QEMU allows to make live migration > > from 1G based hugepage VM to 2Mb based hugepages VM (sanity checks prevent > > it). > > > > Also I checked, it's possible to remap through /dev/mem and get PFN > > based vmas, register userfaultfd (with allowance in vma_can_userfault) > > and finally make UFFDIO_COPY with allowing PFN based vmas in __mcopy_atomic. > > > > But there are a lot of drawback of such approach: > > First of all it's /dev/mem interface. Need to provide full access > > (kernel w/o CONFIG_STRICT_DEVMEM) and need to disable PAT. > > The second drawback, maybe I just didn't find possibility to remap > > hugepages again, but mmap of /dev/mem character driver maps 4Kb pages. > > I don't know how THP could help here, but madvise with MADV_HUGEPAGE > > didn't. So 4Kb is not exactly what needed, due to overhead of > > encapsulation summary downtime is worse than in other cases. > > It would be great to have interface to obtain new virtual address based > > on existing PFN, but for hugepages. > > Yes, and I think as well on some architectures there can be cache problems > from mapping the same page in two addresses unless we're careful. > > I think to do this we'd basically need the kernel to set up something > similar to what you're saying, but without the mess of having to > go via /dev/mem. Ideally it would all happen magically when I mark /dev/mem is problematic also if the hugetlbfs page is being migrated for example (the virtual2physical mapping that may change and be directed to a different physical page with no notification to userland). > a hugetlb page as userfault and start UFFDIO_COPYing in 4kb pages; > but I can imagine perhaps some more syscalls needed to tell it to do it. I think what is needed is an extension to UFFDIO_COPY so that it will not fail if used with lower granularity than the hugetlbfs page size. Then a special feature flag should be set in the uffdio_api.features to tell userland it can use a lower granularity than the hugetlbfs page size. UFFDIO_COPY will still allocate a 1GB page, but it will copy only part of and then map only the copied part with a pte of 4k granularity (or 2MB granularity if the uffdio_copy.dst/len parameters allows for hugepmds). This way if there's a missing fault in the part that is not copied yet, it'll still trigger a page fault and we'll call handle_userfault from hugetlb_no_page, which will trigger a new UFFDIO_COPY and map another fragment of the 1GB page. The same code than should allow us to map a 2MB hugetlbfs page with 4k granularity in UFFDIO_COPY so then userland can choose if to do postcopy live migration on hugetlbfs with 2MB or 4kb granularity (on very slow network 4kb would be preferable for example for the latency). Mapping a 1GB page without a hugepud or mapping a 2MB page without a hugepmd breaks all sort of invariants into the hugetlbfs code, page migration wouldn't be able to cope with it either. Solving the fallout from such breakage is what is required to implement this basically. Ideally this could be a UFFDIO_COPY-only feature, simply every other part of hugetlbfs should be able to cope with whatever UFFDIO_COPY has generated in terms of pagetables mapping those hugepages. The only thing that will change from hugetlbfs point of view is that a pte or hugepmd could be mapping a 1GB page, that couldn't be the case before it. Once the entirety of the hugepage has been mapped or when the UFFDIO_UNREGISTER is called on the 1GB range, all those hugepmds and/or ptes should be dropped and the hugepud would then point directly to the 1GB page restoring the current behavior (which is required to get a 1GB large TLB and top performance, i.e. immediately after the live migration completed). This is very different concept than THP that in fact never requires to allocate hugepages. THP nowadays also allows to map a THP page using ptes, instead of hugepmds, just for a different reason (i.e. to allow split_huge_page to fail and be deferred, so get/put_page are simplified, while still guaranteeing that split_huge_page_pmd cannot fail). So we're already doing something like the above with THP but we only cover ptes vs hugepmd. The codebase however is different, hugetlbfs doesn't interact much with the main VM, so there's likely nothing to share with THP, but the concept of mapping an hugepage with ptes is somewhat similar. THP fundamentally always allows fallback, in the UFFDIO_COPY on hugetlbfs instead the idea would be to never allow fallback on the physical side, but to allow partial mapping so that the UFFDIO_COPY granularity can be reduced, despite the physical granularity remains native. This also means the collapse after live migration will be zerocopy: the collapse would be of the virtual kind for hugetlbfs (khugepaged as opposed has to copy the actual data because the physical pages were not huge to begin with, to allow the fallback). When UFFDIO_COPY is called again on the same 1GB range, it would need to notice that the pud is not null but is not huge (new case introduced), and then keep going in the pagetables lookup until it finds the copied range is not mapped, skip the 1GB allocation in such case and copy into the pre-existing 1GB page. If instead it finds the dst range is already mapped by a pte or a hugepmd, it'll instead return -EEXIST as usual. Thanks, Andrea ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Dual userfaultfd behavior 2017-03-13 21:46 ` Andrea Arcangeli @ 2017-03-15 13:47 ` Alexey Perevalov 2017-03-17 23:27 ` Mike Kravetz 0 siblings, 1 reply; 6+ messages in thread From: Alexey Perevalov @ 2017-03-15 13:47 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Dr. David Alan Gilbert, qemu-devel, Mike Kravetz Hi Andrea, thank you for so perfect design description, the main question who will do RFC patches, you or Mike or if you not against I could try. On 03/14/2017 12:46 AM, Andrea Arcangeli wrote: > Hello, > > On Mon, Mar 13, 2017 at 10:53:39AM +0000, Dr. David Alan Gilbert wrote: >> * Alexey Perevalov (a.perevalov@samsung.com) wrote: >>> Hi, David, Andrea and Mike >> Hi Alexey, >> >>> The problem I want to discuss it's 1G hugepage based VM and post copy live >>> migration. >>> >>> I would like to know your opinion on following approach of avoiding such >>> problem: >>> Once we have mmap'ed area through 1G hugetlbfs, remap physical pages >>> with /dev/mem. It will be 2 types of vmas mapped to the same PFN. >>> Register userfaultfd for newly obtained virtual >>> addresses, it could reduce granularity of pages and reduce downtime per >>> one 1G page. So registering userfaultfd for 2Mb, when the real hugepage >>> was 1G, I think, could help. >>> >>> Current postcopy implementation in QEMU allows to make live migration >>> from 1G based hugepage VM to 2Mb based hugepages VM (sanity checks prevent >>> it). >>> >>> Also I checked, it's possible to remap through /dev/mem and get PFN >>> based vmas, register userfaultfd (with allowance in vma_can_userfault) >>> and finally make UFFDIO_COPY with allowing PFN based vmas in __mcopy_atomic. >>> >>> But there are a lot of drawback of such approach: >>> First of all it's /dev/mem interface. Need to provide full access >>> (kernel w/o CONFIG_STRICT_DEVMEM) and need to disable PAT. >>> The second drawback, maybe I just didn't find possibility to remap >>> hugepages again, but mmap of /dev/mem character driver maps 4Kb pages. >>> I don't know how THP could help here, but madvise with MADV_HUGEPAGE >>> didn't. So 4Kb is not exactly what needed, due to overhead of >>> encapsulation summary downtime is worse than in other cases. >>> It would be great to have interface to obtain new virtual address based >>> on existing PFN, but for hugepages. >> Yes, and I think as well on some architectures there can be cache problems >> from mapping the same page in two addresses unless we're careful. >> >> I think to do this we'd basically need the kernel to set up something >> similar to what you're saying, but without the mess of having to >> go via /dev/mem. Ideally it would all happen magically when I mark > /dev/mem is problematic also if the hugetlbfs page is being migrated > for example (the virtual2physical mapping that may change and be > directed to a different physical page with no notification to userland). > >> a hugetlb page as userfault and start UFFDIO_COPYing in 4kb pages; >> but I can imagine perhaps some more syscalls needed to tell it to do it. > I think what is needed is an extension to UFFDIO_COPY so that it will > not fail if used with lower granularity than the hugetlbfs page > size. Then a special feature flag should be set in the > uffdio_api.features to tell userland it can use a lower granularity > than the hugetlbfs page size. > > UFFDIO_COPY will still allocate a 1GB page, but it will copy only part > of and then map only the copied part with a pte of 4k granularity (or > 2MB granularity if the uffdio_copy.dst/len parameters allows for > hugepmds). This way if there's a missing fault in the part that is not > copied yet, it'll still trigger a page fault and we'll call > handle_userfault from hugetlb_no_page, which will trigger a new > UFFDIO_COPY and map another fragment of the 1GB page. > The same code than should allow us to map a 2MB hugetlbfs page with 4k > granularity in UFFDIO_COPY so then userland can choose if to do > postcopy live migration on hugetlbfs with 2MB or 4kb granularity (on > very slow network 4kb would be preferable for example for the latency). > > Mapping a 1GB page without a hugepud or mapping a 2MB page without a > hugepmd breaks all sort of invariants into the hugetlbfs code, page > migration wouldn't be able to cope with it either. Solving the fallout > from such breakage is what is required to implement this basically. > > Ideally this could be a UFFDIO_COPY-only feature, simply every other > part of hugetlbfs should be able to cope with whatever UFFDIO_COPY has > generated in terms of pagetables mapping those hugepages. > > The only thing that will change from hugetlbfs point of view is that a > pte or hugepmd could be mapping a 1GB page, that couldn't be the case > before it. > > Once the entirety of the hugepage has been mapped or when the > UFFDIO_UNREGISTER is called on the 1GB range, all those hugepmds > and/or ptes should be dropped and the hugepud would then point > directly to the 1GB page restoring the current behavior (which is > required to get a 1GB large TLB and top performance, i.e. immediately > after the live migration completed). > > This is very different concept than THP that in fact never requires to > allocate hugepages. THP nowadays also allows to map a THP page using > ptes, instead of hugepmds, just for a different reason (i.e. to allow > split_huge_page to fail and be deferred, so get/put_page are > simplified, while still guaranteeing that split_huge_page_pmd cannot > fail). So we're already doing something like the above with THP but we > only cover ptes vs hugepmd. The codebase however is different, > hugetlbfs doesn't interact much with the main VM, so there's likely > nothing to share with THP, but the concept of mapping an hugepage with > ptes is somewhat similar. > > THP fundamentally always allows fallback, in the UFFDIO_COPY on > hugetlbfs instead the idea would be to never allow fallback on the > physical side, but to allow partial mapping so that the UFFDIO_COPY > granularity can be reduced, despite the physical granularity remains > native. This also means the collapse after live migration will be > zerocopy: the collapse would be of the virtual kind for hugetlbfs > (khugepaged as opposed has to copy the actual data because the > physical pages were not huge to begin with, to allow the fallback). > > When UFFDIO_COPY is called again on the same 1GB range, it would need > to notice that the pud is not null but is not huge (new case > introduced), and then keep going in the pagetables lookup until it > finds the copied range is not mapped, skip the 1GB allocation in such > case and copy into the pre-existing 1GB page. If instead it finds the > dst range is already mapped by a pte or a hugepmd, it'll instead > return -EEXIST as usual. > > Thanks, > Andrea > > > -- Best regards, Alexey Perevalov ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Dual userfaultfd behavior 2017-03-15 13:47 ` Alexey Perevalov @ 2017-03-17 23:27 ` Mike Kravetz 2017-04-10 16:31 ` Alexey Perevalov 0 siblings, 1 reply; 6+ messages in thread From: Mike Kravetz @ 2017-03-17 23:27 UTC (permalink / raw) To: Alexey Perevalov, Andrea Arcangeli; +Cc: Dr. David Alan Gilbert, qemu-devel On 03/15/2017 06:47 AM, Alexey Perevalov wrote: > Hi Andrea, > > thank you for so perfect design description, > > the main question who will do RFC patches, > you or Mike or if you not against I could try. Sorry for not replying sooner, I have been away from e-mail. I have some other projects which require attention before I could start any in depth work on this issue. > On 03/14/2017 12:46 AM, Andrea Arcangeli wrote: >> I think what is needed is an extension to UFFDIO_COPY so that it will >> not fail if used with lower granularity than the hugetlbfs page >> size. Then a special feature flag should be set in the >> uffdio_api.features to tell userland it can use a lower granularity >> than the hugetlbfs page size. >> >> UFFDIO_COPY will still allocate a 1GB page, but it will copy only part >> of and then map only the copied part with a pte of 4k granularity (or >> 2MB granularity if the uffdio_copy.dst/len parameters allows for >> hugepmds). This way if there's a missing fault in the part that is not >> copied yet, it'll still trigger a page fault and we'll call >> handle_userfault from hugetlb_no_page, which will trigger a new >> UFFDIO_COPY and map another fragment of the 1GB page. >> The same code than should allow us to map a 2MB hugetlbfs page with 4k >> granularity in UFFDIO_COPY so then userland can choose if to do >> postcopy live migration on hugetlbfs with 2MB or 4kb granularity (on >> very slow network 4kb would be preferable for example for the latency). >> >> Mapping a 1GB page without a hugepud or mapping a 2MB page without a >> hugepmd breaks all sort of invariants into the hugetlbfs code, page >> migration wouldn't be able to cope with it either. Solving the fallout >> from such breakage is what is required to implement this basically. Agree! I was wondering if it might be easier or more straight forward to split the target vma. You would then create a new vma of the copy page size to be used during the copy. In this way the the mappings and page table entries remain as expected. Of course, you would still want to allocate a huge page of the original target vma size and map that as smaller page sizes are copied. I don't think there is anything that does something similar today. When the copy is done (or aborted) we then create/convert a new vma for the huge page and merge it into the target vma(s). Not sure if that would be any easier. It was just the first thing that popped into my head. -- Mike Kravetz ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Dual userfaultfd behavior 2017-03-17 23:27 ` Mike Kravetz @ 2017-04-10 16:31 ` Alexey Perevalov 0 siblings, 0 replies; 6+ messages in thread From: Alexey Perevalov @ 2017-04-10 16:31 UTC (permalink / raw) To: Mike Kravetz, Andrea Arcangeli; +Cc: Dr. David Alan Gilbert, qemu-devel Hello, On 03/18/2017 02:27 AM, Mike Kravetz wrote: > On 03/15/2017 06:47 AM, Alexey Perevalov wrote: >> Hi Andrea, >> >> thank you for so perfect design description, >> >> the main question who will do RFC patches, >> you or Mike or if you not against I could try. > Sorry for not replying sooner, I have been away from e-mail. > > I have some other projects which require attention before I could start > any in depth work on this issue. ok, if nobody started, let me to begin to code that. I already implemented straightforward approach, but it didn't take into account page migration. >> On 03/14/2017 12:46 AM, Andrea Arcangeli wrote: >>> I think what is needed is an extension to UFFDIO_COPY so that it will >>> not fail if used with lower granularity than the hugetlbfs page >>> size. Then a special feature flag should be set in the >>> uffdio_api.features to tell userland it can use a lower granularity >>> than the hugetlbfs page size. >>> >>> UFFDIO_COPY will still allocate a 1GB page, but it will copy only part >>> of and then map only the copied part with a pte of 4k granularity (or >>> 2MB granularity if the uffdio_copy.dst/len parameters allows for >>> hugepmds). This way if there's a missing fault in the part that is not >>> copied yet, it'll still trigger a page fault and we'll call >>> handle_userfault from hugetlb_no_page, which will trigger a new >>> UFFDIO_COPY and map another fragment of the 1GB page. >>> The same code than should allow us to map a 2MB hugetlbfs page with 4k >>> granularity in UFFDIO_COPY so then userland can choose if to do >>> postcopy live migration on hugetlbfs with 2MB or 4kb granularity (on >>> very slow network 4kb would be preferable for example for the latency). >>> >>> Mapping a 1GB page without a hugepud or mapping a 2MB page without a >>> hugepmd breaks all sort of invariants into the hugetlbfs code, page >>> migration wouldn't be able to cope with it either. Solving the fallout >>> from such breakage is what is required to implement this basically. > Agree! > > I was wondering if it might be easier or more straight forward to split > the target vma. You would then create a new vma of the copy page size > to be used during the copy. In this way the the mappings and page table > entries remain as expected. > > Of course, you would still want to allocate a huge page of the original > target vma size and map that as smaller page sizes are copied. I don't > think there is anything that does something similar today. > > When the copy is done (or aborted) we then create/convert a new vma for > the huge page and merge it into the target vma(s). > > Not sure if that would be any easier. It was just the first thing that > popped into my head. > -- Best regards, Alexey Perevalov ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-04-10 16:31 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <CGME20170310140816eucas1p2bd85d3759739d64bb184ce0fbd74f90d@eucas1p2.samsung.com> 2017-03-10 14:08 ` [Qemu-devel] Dual userfaultfd behavior Alexey Perevalov 2017-03-13 10:53 ` Dr. David Alan Gilbert 2017-03-13 21:46 ` Andrea Arcangeli 2017-03-15 13:47 ` Alexey Perevalov 2017-03-17 23:27 ` Mike Kravetz 2017-04-10 16:31 ` Alexey Perevalov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).