* [PATCH 0/2] vfio: Align mmaps @ 2024-10-22 20:08 Alex Williamson 2024-10-22 20:08 ` [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling Alex Williamson ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Alex Williamson @ 2024-10-22 20:08 UTC (permalink / raw) To: qemu-devel; +Cc: Alex Williamson, clg, peterx As described in patch /2, newer kernels may support pfnmap with PMD or PUD sized mappings. Mappings must be aligned in order to see the full benefit of this support. We're largely able to get PMD alignment for free from mmap, but PUD alignment requires some effort. Further, we don't actually have an easy way to determine PMD or PUD alignment, therefore we align all mmaps to the nearest power-of-two relative to the region size, up to the maximum PUD size known the be currently available. Enabling debug prints in the kernel shows that this exclusively enables 1GiB mappings for a GPU with a multi-gigabyte BAR whereas previously the BAR is mapped with a combination of 2MiB and 1GiB mappings, only using 1GiB when opportunistically crossing an alignment boundary. If there are ways to determine discrete alignment intervals or better ways to generate a properly aligned address value for mmap, please share. Thanks, Alex Alex Williamson (2): vfio/helpers: Refactor vfio_region_mmap() error handling vfio/helpers: Align mmaps hw/vfio/helpers.c | 66 +++++++++++++++++++++++++++++++++-------------- 1 file changed, 47 insertions(+), 19 deletions(-) -- 2.46.2 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling 2024-10-22 20:08 [PATCH 0/2] vfio: Align mmaps Alex Williamson @ 2024-10-22 20:08 ` Alex Williamson 2024-10-22 21:44 ` Peter Xu 2024-10-23 9:13 ` Cédric Le Goater 2024-10-22 20:08 ` [PATCH 2/2] vfio/helpers: Align mmaps Alex Williamson 2024-10-23 15:08 ` [PATCH 0/2] vfio: " Cédric Le Goater 2 siblings, 2 replies; 9+ messages in thread From: Alex Williamson @ 2024-10-22 20:08 UTC (permalink / raw) To: qemu-devel; +Cc: Alex Williamson, clg, peterx Move error handling code to the end of the function so that it can more easily be shared by new mmap failure conditions. No functional change intended. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- hw/vfio/helpers.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c index ea15c79db0a3..b9e606e364a2 100644 --- a/hw/vfio/helpers.c +++ b/hw/vfio/helpers.c @@ -395,7 +395,7 @@ static void vfio_subregion_unmap(VFIORegion *region, int index) int vfio_region_mmap(VFIORegion *region) { - int i, prot = 0; + int i, ret, prot = 0; char *name; if (!region->mem) { @@ -411,22 +411,8 @@ int vfio_region_mmap(VFIORegion *region) region->fd_offset + region->mmaps[i].offset); if (region->mmaps[i].mmap == MAP_FAILED) { - int ret = -errno; - - trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, - region->fd_offset + - region->mmaps[i].offset, - region->fd_offset + - region->mmaps[i].offset + - region->mmaps[i].size - 1, ret); - - region->mmaps[i].mmap = NULL; - - for (i--; i >= 0; i--) { - vfio_subregion_unmap(region, i); - } - - return ret; + ret = -errno; + goto no_mmap; } name = g_strdup_printf("%s mmaps[%d]", @@ -446,6 +432,20 @@ int vfio_region_mmap(VFIORegion *region) } return 0; + +no_mmap: + trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, + region->fd_offset + region->mmaps[i].offset, + region->fd_offset + region->mmaps[i].offset + + region->mmaps[i].size - 1, ret); + + region->mmaps[i].mmap = NULL; + + for (i--; i >= 0; i--) { + vfio_subregion_unmap(region, i); + } + + return ret; } void vfio_region_unmap(VFIORegion *region) -- 2.46.2 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling 2024-10-22 20:08 ` [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling Alex Williamson @ 2024-10-22 21:44 ` Peter Xu 2024-10-23 9:13 ` Cédric Le Goater 1 sibling, 0 replies; 9+ messages in thread From: Peter Xu @ 2024-10-22 21:44 UTC (permalink / raw) To: Alex Williamson; +Cc: qemu-devel, clg On Tue, Oct 22, 2024 at 02:08:28PM -0600, Alex Williamson wrote: > Move error handling code to the end of the function so that it can more > easily be shared by new mmap failure conditions. No functional change > intended. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > hw/vfio/helpers.c | 34 +++++++++++++++++----------------- > 1 file changed, 17 insertions(+), 17 deletions(-) > > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c > index ea15c79db0a3..b9e606e364a2 100644 > --- a/hw/vfio/helpers.c > +++ b/hw/vfio/helpers.c > @@ -395,7 +395,7 @@ static void vfio_subregion_unmap(VFIORegion *region, int index) > > int vfio_region_mmap(VFIORegion *region) > { > - int i, prot = 0; > + int i, ret, prot = 0; > char *name; > > if (!region->mem) { > @@ -411,22 +411,8 @@ int vfio_region_mmap(VFIORegion *region) > region->fd_offset + > region->mmaps[i].offset); > if (region->mmaps[i].mmap == MAP_FAILED) { > - int ret = -errno; > - > - trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, > - region->fd_offset + > - region->mmaps[i].offset, > - region->fd_offset + > - region->mmaps[i].offset + > - region->mmaps[i].size - 1, ret); > - > - region->mmaps[i].mmap = NULL; > - > - for (i--; i >= 0; i--) { > - vfio_subregion_unmap(region, i); > - } > - > - return ret; > + ret = -errno; > + goto no_mmap; > } > > name = g_strdup_printf("%s mmaps[%d]", > @@ -446,6 +432,20 @@ int vfio_region_mmap(VFIORegion *region) > } > > return 0; > + > +no_mmap: > + trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, > + region->fd_offset + region->mmaps[i].offset, > + region->fd_offset + region->mmaps[i].offset + > + region->mmaps[i].size - 1, ret); While at it, maybe when moving we can rename s/fault/fail/ or some other words to avoid "mmap" + "fault" being together. :) Totally not a huge deal. Reviewed-by: Peter Xu <peterx@redhat.com> > + > + region->mmaps[i].mmap = NULL; > + > + for (i--; i >= 0; i--) { > + vfio_subregion_unmap(region, i); > + } > + > + return ret; > } > > void vfio_region_unmap(VFIORegion *region) > -- > 2.46.2 > -- Peter Xu ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling 2024-10-22 20:08 ` [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling Alex Williamson 2024-10-22 21:44 ` Peter Xu @ 2024-10-23 9:13 ` Cédric Le Goater 1 sibling, 0 replies; 9+ messages in thread From: Cédric Le Goater @ 2024-10-23 9:13 UTC (permalink / raw) To: Alex Williamson, qemu-devel; +Cc: peterx On 10/22/24 22:08, Alex Williamson wrote: > Move error handling code to the end of the function so that it can more > easily be shared by new mmap failure conditions. No functional change > intended. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Thanks, C. > --- > hw/vfio/helpers.c | 34 +++++++++++++++++----------------- > 1 file changed, 17 insertions(+), 17 deletions(-) > > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c > index ea15c79db0a3..b9e606e364a2 100644 > --- a/hw/vfio/helpers.c > +++ b/hw/vfio/helpers.c > @@ -395,7 +395,7 @@ static void vfio_subregion_unmap(VFIORegion *region, int index) > > int vfio_region_mmap(VFIORegion *region) > { > - int i, prot = 0; > + int i, ret, prot = 0; > char *name; > > if (!region->mem) { > @@ -411,22 +411,8 @@ int vfio_region_mmap(VFIORegion *region) > region->fd_offset + > region->mmaps[i].offset); > if (region->mmaps[i].mmap == MAP_FAILED) { > - int ret = -errno; > - > - trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, > - region->fd_offset + > - region->mmaps[i].offset, > - region->fd_offset + > - region->mmaps[i].offset + > - region->mmaps[i].size - 1, ret); > - > - region->mmaps[i].mmap = NULL; > - > - for (i--; i >= 0; i--) { > - vfio_subregion_unmap(region, i); > - } > - > - return ret; > + ret = -errno; > + goto no_mmap; > } > > name = g_strdup_printf("%s mmaps[%d]", > @@ -446,6 +432,20 @@ int vfio_region_mmap(VFIORegion *region) > } > > return 0; > + > +no_mmap: > + trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, > + region->fd_offset + region->mmaps[i].offset, > + region->fd_offset + region->mmaps[i].offset + > + region->mmaps[i].size - 1, ret); > + > + region->mmaps[i].mmap = NULL; > + > + for (i--; i >= 0; i--) { > + vfio_subregion_unmap(region, i); > + } > + > + return ret; > } > > void vfio_region_unmap(VFIORegion *region) ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] vfio/helpers: Align mmaps 2024-10-22 20:08 [PATCH 0/2] vfio: Align mmaps Alex Williamson 2024-10-22 20:08 ` [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling Alex Williamson @ 2024-10-22 20:08 ` Alex Williamson 2024-10-22 21:50 ` Peter Xu 2024-10-23 12:44 ` Cédric Le Goater 2024-10-23 15:08 ` [PATCH 0/2] vfio: " Cédric Le Goater 2 siblings, 2 replies; 9+ messages in thread From: Alex Williamson @ 2024-10-22 20:08 UTC (permalink / raw) To: qemu-devel; +Cc: Alex Williamson, clg, peterx Thanks to work by Peter Xu, support is introduced in Linux v6.12 to allow pfnmap insertions at PMD and PUD levels of the page table. This means that provided a properly aligned mmap, the vfio driver is able to map MMIO at significantly larger intervals than PAGE_SIZE. For example on x86_64 (the only architecture currently supporting huge pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO using 2MiB and even 1GiB page table entries. Typically mmap will already provide PMD aligned mappings, so devices with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs, will already take advantage of this support. However in order to better support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs with resizable BARs enabled, we need to manually align the mmap. There doesn't seem to be a way for userspace to easily learn about PMD and PUD mapping level sizes, therefore this takes the simple approach to align the mapping to the power-of-two size of the region, up to 1GiB, which is currently the maximum alignment we care about. Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- hw/vfio/helpers.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c index b9e606e364a2..913796f437f8 100644 --- a/hw/vfio/helpers.c +++ b/hw/vfio/helpers.c @@ -27,6 +27,7 @@ #include "trace.h" #include "qapi/error.h" #include "qemu/error-report.h" +#include "qemu/units.h" #include "monitor/monitor.h" /* @@ -406,8 +407,35 @@ int vfio_region_mmap(VFIORegion *region) prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; for (i = 0; i < region->nr_mmaps; i++) { - region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, - MAP_SHARED, region->vbasedev->fd, + size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB); + void *map_base, *map_align; + + /* + * Align the mmap for more efficient mapping in the kernel. Ideally + * we'd know the PMD and PUD mapping sizes to use as discrete alignment + * intervals, but we don't. As of Linux v6.12, the largest PUD size + * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set + * on x86_64). Align by power-of-two size, capped at 1GiB. + * + * NB. qemu_memalign() and friends actually allocate memory, whereas + * the region size here can exceed host memory, therefore we manually + * create an oversized anonymous mapping and clean it up for alignment. + */ + map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (map_base == MAP_FAILED) { + ret = -errno; + goto no_mmap; + } + + map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align); + munmap(map_base, map_align - map_base); + munmap(map_align + region->mmaps[i].size, + align - (map_align - map_base)); + + region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot, + MAP_SHARED | MAP_FIXED, + region->vbasedev->fd, region->fd_offset + region->mmaps[i].offset); if (region->mmaps[i].mmap == MAP_FAILED) { -- 2.46.2 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] vfio/helpers: Align mmaps 2024-10-22 20:08 ` [PATCH 2/2] vfio/helpers: Align mmaps Alex Williamson @ 2024-10-22 21:50 ` Peter Xu 2024-10-23 12:44 ` Cédric Le Goater 1 sibling, 0 replies; 9+ messages in thread From: Peter Xu @ 2024-10-22 21:50 UTC (permalink / raw) To: Alex Williamson; +Cc: qemu-devel, clg On Tue, Oct 22, 2024 at 02:08:29PM -0600, Alex Williamson wrote: > Thanks to work by Peter Xu, support is introduced in Linux v6.12 to > allow pfnmap insertions at PMD and PUD levels of the page table. This > means that provided a properly aligned mmap, the vfio driver is able > to map MMIO at significantly larger intervals than PAGE_SIZE. For > example on x86_64 (the only architecture currently supporting huge > pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO > using 2MiB and even 1GiB page table entries. > > Typically mmap will already provide PMD aligned mappings, so devices > with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs, > will already take advantage of this support. However in order to better > support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs > with resizable BARs enabled, we need to manually align the mmap. > > There doesn't seem to be a way for userspace to easily learn about PMD > and PUD mapping level sizes, therefore this takes the simple approach > to align the mapping to the power-of-two size of the region, up to 1GiB, > which is currently the maximum alignment we care about. > > Cc: Peter Xu <peterx@redhat.com> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> For the longer term, maybe QEMU can provide a function to reserve a range of mmap with some specific alignment requirement. For example, currently qemu_ram_mmap() does mostly the same thing (and it hides a hugetlb fix on ppc only with 7197fb4058, which isn't a concern here). Then the complexity can hide in that function. Kind of a comment for the future only. Reviewed-by: Peter Xu <peterx@redhat.com> Thanks! > --- > hw/vfio/helpers.c | 32 ++++++++++++++++++++++++++++++-- > 1 file changed, 30 insertions(+), 2 deletions(-) > > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c > index b9e606e364a2..913796f437f8 100644 > --- a/hw/vfio/helpers.c > +++ b/hw/vfio/helpers.c > @@ -27,6 +27,7 @@ > #include "trace.h" > #include "qapi/error.h" > #include "qemu/error-report.h" > +#include "qemu/units.h" > #include "monitor/monitor.h" > > /* > @@ -406,8 +407,35 @@ int vfio_region_mmap(VFIORegion *region) > prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; > > for (i = 0; i < region->nr_mmaps; i++) { > - region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, > - MAP_SHARED, region->vbasedev->fd, > + size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB); > + void *map_base, *map_align; > + > + /* > + * Align the mmap for more efficient mapping in the kernel. Ideally > + * we'd know the PMD and PUD mapping sizes to use as discrete alignment > + * intervals, but we don't. As of Linux v6.12, the largest PUD size > + * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set > + * on x86_64). Align by power-of-two size, capped at 1GiB. > + * > + * NB. qemu_memalign() and friends actually allocate memory, whereas > + * the region size here can exceed host memory, therefore we manually > + * create an oversized anonymous mapping and clean it up for alignment. > + */ > + map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE, > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > + if (map_base == MAP_FAILED) { > + ret = -errno; > + goto no_mmap; > + } > + > + map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align); > + munmap(map_base, map_align - map_base); > + munmap(map_align + region->mmaps[i].size, > + align - (map_align - map_base)); > + > + region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot, > + MAP_SHARED | MAP_FIXED, > + region->vbasedev->fd, > region->fd_offset + > region->mmaps[i].offset); > if (region->mmaps[i].mmap == MAP_FAILED) { > -- > 2.46.2 > -- Peter Xu ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] vfio/helpers: Align mmaps 2024-10-22 20:08 ` [PATCH 2/2] vfio/helpers: Align mmaps Alex Williamson 2024-10-22 21:50 ` Peter Xu @ 2024-10-23 12:44 ` Cédric Le Goater 2024-10-23 13:55 ` Alex Williamson 1 sibling, 1 reply; 9+ messages in thread From: Cédric Le Goater @ 2024-10-23 12:44 UTC (permalink / raw) To: Alex Williamson, qemu-devel; +Cc: peterx On 10/22/24 22:08, Alex Williamson wrote: > Thanks to work by Peter Xu, support is introduced in Linux v6.12 to > allow pfnmap insertions at PMD and PUD levels of the page table. This > means that provided a properly aligned mmap, the vfio driver is able > to map MMIO at significantly larger intervals than PAGE_SIZE. For > example on x86_64 (the only architecture currently supporting huge > pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO > using 2MiB and even 1GiB page table entries. > > Typically mmap will already provide PMD aligned mappings, so devices > with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs, > will already take advantage of this support. However in order to better > support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs > with resizable BARs enabled, we need to manually align the mmap. > > There doesn't seem to be a way for userspace to easily learn about PMD > and PUD mapping level sizes, therefore this takes the simple approach > to align the mapping to the power-of-two size of the region, up to 1GiB, > which is currently the maximum alignment we care about. Couldn't we inspect /sys/kernel/mm/hugepages/ to get the sizes ? > Cc: Peter Xu <peterx@redhat.com> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> anyhow, Reviewed-by: Cédric Le Goater <clg@redhat.com> Thanks, C. > --- > hw/vfio/helpers.c | 32 ++++++++++++++++++++++++++++++-- > 1 file changed, 30 insertions(+), 2 deletions(-) > > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c > index b9e606e364a2..913796f437f8 100644 > --- a/hw/vfio/helpers.c > +++ b/hw/vfio/helpers.c > @@ -27,6 +27,7 @@ > #include "trace.h" > #include "qapi/error.h" > #include "qemu/error-report.h" > +#include "qemu/units.h" > #include "monitor/monitor.h" > > /* > @@ -406,8 +407,35 @@ int vfio_region_mmap(VFIORegion *region) > prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; > > for (i = 0; i < region->nr_mmaps; i++) { > - region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, > - MAP_SHARED, region->vbasedev->fd, > + size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB); > + void *map_base, *map_align; > + > + /* > + * Align the mmap for more efficient mapping in the kernel. Ideally > + * we'd know the PMD and PUD mapping sizes to use as discrete alignment > + * intervals, but we don't. As of Linux v6.12, the largest PUD size > + * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set > + * on x86_64). Align by power-of-two size, capped at 1GiB. > + * > + * NB. qemu_memalign() and friends actually allocate memory, whereas > + * the region size here can exceed host memory, therefore we manually > + * create an oversized anonymous mapping and clean it up for alignment. > + */ > + map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE, > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > + if (map_base == MAP_FAILED) { > + ret = -errno; > + goto no_mmap; > + } > + > + map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align); > + munmap(map_base, map_align - map_base); > + munmap(map_align + region->mmaps[i].size, > + align - (map_align - map_base)); > + > + region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot, > + MAP_SHARED | MAP_FIXED, > + region->vbasedev->fd, > region->fd_offset + > region->mmaps[i].offset); > if (region->mmaps[i].mmap == MAP_FAILED) { ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] vfio/helpers: Align mmaps 2024-10-23 12:44 ` Cédric Le Goater @ 2024-10-23 13:55 ` Alex Williamson 0 siblings, 0 replies; 9+ messages in thread From: Alex Williamson @ 2024-10-23 13:55 UTC (permalink / raw) To: Cédric Le Goater; +Cc: qemu-devel, peterx On Wed, 23 Oct 2024 14:44:19 +0200 Cédric Le Goater <clg@redhat.com> wrote: > On 10/22/24 22:08, Alex Williamson wrote: > > Thanks to work by Peter Xu, support is introduced in Linux v6.12 to > > allow pfnmap insertions at PMD and PUD levels of the page table. This > > means that provided a properly aligned mmap, the vfio driver is able > > to map MMIO at significantly larger intervals than PAGE_SIZE. For > > example on x86_64 (the only architecture currently supporting huge > > pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO > > using 2MiB and even 1GiB page table entries. > > > > Typically mmap will already provide PMD aligned mappings, so devices > > with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs, > > will already take advantage of this support. However in order to better > > support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs > > with resizable BARs enabled, we need to manually align the mmap. > > > > There doesn't seem to be a way for userspace to easily learn about PMD > > and PUD mapping level sizes, therefore this takes the simple approach > > to align the mapping to the power-of-two size of the region, up to 1GiB, > > which is currently the maximum alignment we care about. > > > Couldn't we inspect /sys/kernel/mm/hugepages/ to get the sizes ? Sifting through sysfs doesn't seem like a great solution if we want to support running QEMU in a sandbox with limited access, but also hugepage sizes don't seem to strictly map to PMD and PUD page table levels on all platforms. For instance ARM seems to support an assortment of hugepage sizes, some of which appear to be available via contiguous page hinting rather than actual page table level sizes. At that point we're still approximating the actual discrete mapping intervals, but at a substantial increase in complexity, unless we already have dependencies and existing code that can be leveraged. > > Cc: Peter Xu <peterx@redhat.com> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > anyhow, > > > Reviewed-by: Cédric Le Goater <clg@redhat.com> Thanks! Alex > > --- > > hw/vfio/helpers.c | 32 ++++++++++++++++++++++++++++++-- > > 1 file changed, 30 insertions(+), 2 deletions(-) > > > > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c > > index b9e606e364a2..913796f437f8 100644 > > --- a/hw/vfio/helpers.c > > +++ b/hw/vfio/helpers.c > > @@ -27,6 +27,7 @@ > > #include "trace.h" > > #include "qapi/error.h" > > #include "qemu/error-report.h" > > +#include "qemu/units.h" > > #include "monitor/monitor.h" > > > > /* > > @@ -406,8 +407,35 @@ int vfio_region_mmap(VFIORegion *region) > > prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; > > > > for (i = 0; i < region->nr_mmaps; i++) { > > - region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, > > - MAP_SHARED, region->vbasedev->fd, > > + size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB); > > + void *map_base, *map_align; > > + > > + /* > > + * Align the mmap for more efficient mapping in the kernel. Ideally > > + * we'd know the PMD and PUD mapping sizes to use as discrete alignment > > + * intervals, but we don't. As of Linux v6.12, the largest PUD size > > + * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set > > + * on x86_64). Align by power-of-two size, capped at 1GiB. > > + * > > + * NB. qemu_memalign() and friends actually allocate memory, whereas > > + * the region size here can exceed host memory, therefore we manually > > + * create an oversized anonymous mapping and clean it up for alignment. > > + */ > > + map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE, > > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > + if (map_base == MAP_FAILED) { > > + ret = -errno; > > + goto no_mmap; > > + } > > + > > + map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align); > > + munmap(map_base, map_align - map_base); > > + munmap(map_align + region->mmaps[i].size, > > + align - (map_align - map_base)); > > + > > + region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot, > > + MAP_SHARED | MAP_FIXED, > > + region->vbasedev->fd, > > region->fd_offset + > > region->mmaps[i].offset); > > if (region->mmaps[i].mmap == MAP_FAILED) { > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] vfio: Align mmaps 2024-10-22 20:08 [PATCH 0/2] vfio: Align mmaps Alex Williamson 2024-10-22 20:08 ` [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling Alex Williamson 2024-10-22 20:08 ` [PATCH 2/2] vfio/helpers: Align mmaps Alex Williamson @ 2024-10-23 15:08 ` Cédric Le Goater 2 siblings, 0 replies; 9+ messages in thread From: Cédric Le Goater @ 2024-10-23 15:08 UTC (permalink / raw) To: Alex Williamson, qemu-devel; +Cc: peterx On 10/22/24 22:08, Alex Williamson wrote: > As described in patch /2, newer kernels may support pfnmap with PMD or > PUD sized mappings. Mappings must be aligned in order to see the full > benefit of this support. We're largely able to get PMD alignment for > free from mmap, but PUD alignment requires some effort. Further, we > don't actually have an easy way to determine PMD or PUD alignment, > therefore we align all mmaps to the nearest power-of-two relative to the > region size, up to the maximum PUD size known the be currently available. > > Enabling debug prints in the kernel shows that this exclusively enables > 1GiB mappings for a GPU with a multi-gigabyte BAR whereas previously the > BAR is mapped with a combination of 2MiB and 1GiB mappings, only using > 1GiB when opportunistically crossing an alignment boundary. > > If there are ways to determine discrete alignment intervals or better > ways to generate a properly aligned address value for mmap, please share. > Thanks, > > Alex > > Alex Williamson (2): > vfio/helpers: Refactor vfio_region_mmap() error handling > vfio/helpers: Align mmaps > > hw/vfio/helpers.c | 66 +++++++++++++++++++++++++++++++++-------------- > 1 file changed, 47 insertions(+), 19 deletions(-) > Applied to vfio-next. Thanks, C. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-10-23 15:09 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-10-22 20:08 [PATCH 0/2] vfio: Align mmaps Alex Williamson 2024-10-22 20:08 ` [PATCH 1/2] vfio/helpers: Refactor vfio_region_mmap() error handling Alex Williamson 2024-10-22 21:44 ` Peter Xu 2024-10-23 9:13 ` Cédric Le Goater 2024-10-22 20:08 ` [PATCH 2/2] vfio/helpers: Align mmaps Alex Williamson 2024-10-22 21:50 ` Peter Xu 2024-10-23 12:44 ` Cédric Le Goater 2024-10-23 13:55 ` Alex Williamson 2024-10-23 15:08 ` [PATCH 0/2] vfio: " Cédric Le Goater
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).