* [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
  From: Alexey Kardashevskiy @ 2018-06-26  5:59 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson, Paul Mackerras

This is to improve page boundary checking and should probably be cc:stable.
I came across this while debugging nvlink2 passthrough, but the lack of
checking might be exploited by the existing userspace.

Changes:
v2:
* 2/2: explicitly check for compound pages before calling compound_order()

Please comment. Thanks.

Alexey Kardashevskiy (2):
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page

 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 20 +++++++++++++++++---
 drivers/vfio/vfio_iommu_spapr_tce.c    | 10 +++++-----
 5 files changed, 29 insertions(+), 13 deletions(-)

-- 
2.11.0
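The core of the problem this series fixes can be stated in a few lines of
arithmetic: both the IOMMU (TCE) page and the pinned physical page are
naturally aligned power-of-two blocks, so "is the IOMMU page fully backed by
pinned memory?" reduces to comparing page shifts. The following standalone
sketch is illustrative only - it is not the kernel code, and the function
name and values are made up:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * An IOMMU page of size 1UL << iommu_shift is fully contained in a
     * backing physical page of size 1UL << host_shift if and only if it
     * is not bigger, since both blocks are naturally aligned.
     */
    static bool iommu_page_contained(unsigned int iommu_shift,
                                     unsigned int host_shift)
    {
            return iommu_shift <= host_shift;
    }

    int main(void)
    {
            /* 16MB IOMMU page backed by 64K system pages - must be rejected */
            printf("16M in 64K: %d\n", iommu_page_contained(24, 16));
            /* 64K IOMMU page backed by a 16MB huge page - fine */
            printf("64K in 16M: %d\n", iommu_page_contained(16, 24));
            return 0;
    }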
* [PATCH kernel v2 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
  From: Alexey Kardashevskiy @ 2018-06-26  5:59 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson, Paul Mackerras

The size is always equal to 1 page so let's use this. Later on this will
be used for other checks which use page shifts to check the granularity
of access.

This should cause no behavioral change.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2da5f05 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -457,13 +457,13 @@ static void tce_iommu_unuse_page(struct tce_container *container,
 }
 
 static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
-		unsigned long tce, unsigned long size,
+		unsigned long tce, unsigned long shift,
 		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(container->mm, tce, size);
+	mem = mm_iommu_lookup(container->mm, tce, 1ULL << shift);
 	if (!mem)
 		return -EINVAL;
 
@@ -487,7 +487,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
 	if (!pua)
 		return;
 
-	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
+	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
 			&hpa, &mem);
 	if (ret)
 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
@@ -611,7 +611,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 				entry + i);
 
 		ret = tce_iommu_prereg_ua_to_hpa(container,
-				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
+				tce, tbl->it_page_shift, &hpa, &mem);
 		if (ret)
 			break;

-- 
2.11.0
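A side note on the unit change in this patch: as far as I can tell the
powerpc IOMMU page size is simply 1 << it_page_shift, so passing the shift
instead of the size loses no information and makes later granularity checks
plain integer comparisons. A purely illustrative sketch (names are made up;
this is not the kernel macro):

    #include <assert.h>

    /*
     * The relationship the patch relies on: an IOMMU page size is a power
     * of two, size == 1ULL << shift, so the shift is enough to carry around.
     */
    static unsigned long long iommu_page_size(unsigned int it_page_shift)
    {
            return 1ULL << it_page_shift;
    }

    int main(void)
    {
            unsigned int it_page_shift = 16;        /* 64K IOMMU pages */

            assert(iommu_page_size(it_page_shift) == 64 * 1024);
            /* "is the granularity fine enough?" becomes a shift compare */
            assert(it_page_shift <= 24);            /* fits in a 16MB backing page */
            return 0;
    }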
* Re: [PATCH kernel v2 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize 2018-06-26 5:59 ` [PATCH kernel v2 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize Alexey Kardashevskiy @ 2018-06-30 19:56 ` Alex Williamson 0 siblings, 0 replies; 17+ messages in thread From: Alex Williamson @ 2018-06-30 19:56 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: linuxppc-dev, David Gibson, kvm-ppc, Paul Mackerras On Tue, 26 Jun 2018 15:59:25 +1000 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > The size is always equal to 1 page so let's use this. Later on this will > be used for other checks which use page shifts to check the granularity > of access. > > This should cause no behavioral change. > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > drivers/vfio/vfio_iommu_spapr_tce.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) I assume a v3+ will go in through the ppc tree since the bulk of the series is there. For this, Acked-by: Alex Williamson <alex.williamson@redhat.com> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c > index 759a5bd..2da5f05 100644 > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > @@ -457,13 +457,13 @@ static void tce_iommu_unuse_page(struct tce_container *container, > } > > static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container, > - unsigned long tce, unsigned long size, > + unsigned long tce, unsigned long shift, > unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem) > { > long ret = 0; > struct mm_iommu_table_group_mem_t *mem; > > - mem = mm_iommu_lookup(container->mm, tce, size); > + mem = mm_iommu_lookup(container->mm, tce, 1ULL << shift); > if (!mem) > return -EINVAL; > > @@ -487,7 +487,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container, > if (!pua) > return; > > - ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl), > + ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift, > &hpa, &mem); > if (ret) > pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n", > @@ -611,7 +611,7 @@ static long tce_iommu_build_v2(struct tce_container *container, > entry + i); > > ret = tce_iommu_prereg_ua_to_hpa(container, > - tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem); > + tce, tbl->it_page_shift, &hpa, &mem); > if (ret) > break; > ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
  From: Alexey Kardashevskiy @ 2018-06-26  5:59 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson, Paul Mackerras

We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory.

However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
code) so the user space can pin memory backed with 64k pages and create
a hardware TCE table with a bigger page size. We were lucky so far and
did not hit this yet as the very first time the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver and that fails
because of the check in vfio_iommu_spapr_tce.c which is really
sustainable solution.

This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* explicitly check for compound pages before calling compound_order()

---
The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
advertise 16MB pages to the guest; a typical pseries guest will use 16MB
for IOMMU pages without checking the mmu pagesize and this will fail
at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256

With the change, mapping will fail in KVM and the guest will print:

mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 20 +++++++++++++++++---
 drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
 5 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 896efa5..79d570c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index d066e37..8c456fa 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
 		return H_HARDWARE;
 
 	if (mm_iommu_mapped_inc(mem))
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 925fc31..5b298f5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 	if (!mem)
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
+			&hpa)))
 		return H_HARDWARE;
 
 	pua = (void *) vmalloc_to_phys(pua);
@@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
 		if (mem)
-			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
+					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
 	}
 
 	if (!prereg) {
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index abb4364..dc0e8cd 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -27,6 +27,7 @@ struct mm_iommu_table_group_mem_t {
 	struct rcu_head rcu;
 	unsigned long used;
 	atomic64_t mapped;
+	unsigned int pageshift;
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
@@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
-	long i, j, ret = 0, locked_entries = 0;
+	long i, j, ret = 0, locked_entries = 0, pageshift;
 	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
@@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		goto unlock_exit;
 	}
 
+	mem->pageshift = 30; /* start from 1G pages - the biggest we have */
+
 	for (i = 0; i < entries; ++i) {
 		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
 					1/* pages */, 1/* iswrite */, &page)) {
@@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 			}
 		}
 populate:
+		pageshift = PAGE_SHIFT;
+		if (PageCompound(page))
+			pageshift += compound_order(compound_head(page));
+		mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift);
+
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
@@ -349,7 +357,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(mm_iommu_find);
 
 long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	u64 *va = &mem->hpas[entry];
@@ -357,6 +365,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -364,7 +375,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
 long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	void *va = &mem->hpas[entry];
@@ -373,6 +384,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	pa = (void *) vmalloc_to_phys(va);
 	if (!pa)
 		return -EFAULT;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2da5f05..7cd63b0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
 	if (!mem)
 		return -EINVAL;
 
-	ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
+	ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa);
 	if (ret)
 		return -EINVAL;
 
-- 
2.11.0
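The bookkeeping added by this patch boils down to a running minimum: each
pinned system page contributes its own (possibly huge) page shift, and the
region descriptor keeps the smallest one so that any larger IOMMU page can be
refused later in mm_iommu_ua_to_hpa(). A userspace-style simulation of that
logic follows, with hypothetical values and helper names rather than the
kernel API:

    #include <stdio.h>

    #define PAGE_SHIFT      16      /* 64K base pages, as on ppc64 */

    struct region {
            unsigned int pageshift; /* smallest backing page shift seen */
    };

    /* compound_order() analogue: 0 for a normal page, 8 for a 16MB huge page */
    static void account_page(struct region *r, unsigned int compound_order)
    {
            unsigned int pageshift = PAGE_SHIFT + compound_order;

            if (pageshift < r->pageshift)
                    r->pageshift = pageshift;
    }

    int main(void)
    {
            struct region r = { .pageshift = 30 }; /* start from 1G as in the patch */

            account_page(&r, 8);    /* one 16MB huge page (64K << 8) */
            account_page(&r, 0);    /* one ordinary 64K page */

            /* A 16MB (shift 24) IOMMU page must now be refused, a 64K one is fine. */
            printf("region pageshift = %u\n", r.pageshift);  /* prints 16 */
            printf("16M allowed: %d\n", 24 <= r.pageshift);  /* 0 */
            printf("64K allowed: %d\n", 16 <= r.pageshift);  /* 1 */
            return 0;
    }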
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-26 5:59 ` [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page Alexey Kardashevskiy @ 2018-06-29 4:12 ` David Gibson 2018-06-29 4:51 ` Alexey Kardashevskiy 0 siblings, 1 reply; 17+ messages in thread From: David Gibson @ 2018-06-29 4:12 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 5241 bytes --] On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > an IOMMU page is contained in the physical page so the PCI hardware won't > get access to unassigned host memory. > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > code) so the user space can pin memory backed with 64k pages and create > a hardware TCE table with a bigger page size. We were lucky so far and > did not hit this yet as the very first time the mapping happens > we do not have tbl::it_userspace allocated yet and fall back to > the userspace which in turn calls VFIO IOMMU driver and that fails > because of the check in vfio_iommu_spapr_tce.c which is really > sustainable solution. > > This stores the smallest preregistered page size in the preregistered > region descriptor and changes the mm_iommu_xxx API to check this against > the IOMMU page size. > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > Changes: > v2: > * explicitly check for compound pages before calling compound_order() > > --- > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > for IOMMU pages without checking the mmu pagesize and this will fail > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > With the change, mapping will fail in KVM and the guest will print: > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > mlx5_core 0000:00:00.0: failed to map direct window for > /pci@800000020000000/ethernet@0: -1 [snip] > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > struct mm_iommu_table_group_mem_t **pmem) > { > struct mm_iommu_table_group_mem_t *mem; > - long i, j, ret = 0, locked_entries = 0; > + long i, j, ret = 0, locked_entries = 0, pageshift; > struct page *page = NULL; > > mutex_lock(&mem_list_mutex); > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > goto unlock_exit; > } > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ What about 16G pages on an HPT system? > for (i = 0; i < entries; ++i) { > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > 1/* pages */, 1/* iswrite */, &page)) { > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > } > } > populate: > + pageshift = PAGE_SHIFT; > + if (PageCompound(page)) > + pageshift += compound_order(compound_head(page)); > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); Why not make mem->pageshift and pageshift local the same type to avoid the min_t() ? 
> + > mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT; > } > > @@ -349,7 +357,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, > EXPORT_SYMBOL_GPL(mm_iommu_find); > > long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > - unsigned long ua, unsigned long *hpa) > + unsigned long ua, unsigned int pageshift, unsigned long *hpa) > { > const long entry = (ua - mem->ua) >> PAGE_SHIFT; > u64 *va = &mem->hpas[entry]; > @@ -357,6 +365,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > if (entry >= mem->entries) > return -EFAULT; > > + if (pageshift > mem->pageshift) > + return -EFAULT; > + > *hpa = *va | (ua & ~PAGE_MASK); > > return 0; > @@ -364,7 +375,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa); > > long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, > - unsigned long ua, unsigned long *hpa) > + unsigned long ua, unsigned int pageshift, unsigned long *hpa) > { > const long entry = (ua - mem->ua) >> PAGE_SHIFT; > void *va = &mem->hpas[entry]; > @@ -373,6 +384,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, > if (entry >= mem->entries) > return -EFAULT; > > + if (pageshift > mem->pageshift) > + return -EFAULT; > + > pa = (void *) vmalloc_to_phys(va); > if (!pa) > return -EFAULT; > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c > index 2da5f05..7cd63b0 100644 > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > @@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container, > if (!mem) > return -EINVAL; > > - ret = mm_iommu_ua_to_hpa(mem, tce, phpa); > + ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa); > if (ret) > return -EINVAL; > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
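For context on the min_t() question above: in the patch mem->pageshift is an
unsigned int while the local pageshift is declared long, and to my
understanding the kernel's plain min() complains at build time when the
operand types differ, which is why the choice is between min_t() and giving
both variables the same type. A trimmed-down, purely illustrative sketch of
the difference (the real macros live in include/linux/kernel.h and are
stricter than this):

    #include <stdio.h>

    /* Simplified stand-ins; the kernel versions add compile-time type checks. */
    #define min(a, b)       ((a) < (b) ? (a) : (b))
    #define min_t(t, a, b)  ((t)(a) < (t)(b) ? (t)(a) : (t)(b))

    int main(void)
    {
            unsigned int mem_pageshift = 30;
            long pageshift = 24;

            /* Casting both operands to one type avoids signed/unsigned surprises. */
            printf("%u\n", min_t(unsigned int, mem_pageshift, pageshift)); /* 24 */
            /* If both variables shared a type, plain min() would do. */
            printf("%u\n", min(mem_pageshift, (unsigned int)pageshift));   /* 24 */
            return 0;
    }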
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 4:12 ` David Gibson @ 2018-06-29 4:51 ` Alexey Kardashevskiy 2018-06-29 4:57 ` David Gibson 0 siblings, 1 reply; 17+ messages in thread From: Alexey Kardashevskiy @ 2018-06-29 4:51 UTC (permalink / raw) To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 5771 bytes --] On Fri, 29 Jun 2018 14:12:41 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > an IOMMU page is contained in the physical page so the PCI hardware won't > > get access to unassigned host memory. > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > code) so the user space can pin memory backed with 64k pages and create > > a hardware TCE table with a bigger page size. We were lucky so far and > > did not hit this yet as the very first time the mapping happens > > we do not have tbl::it_userspace allocated yet and fall back to > > the userspace which in turn calls VFIO IOMMU driver and that fails > > because of the check in vfio_iommu_spapr_tce.c which is really > > sustainable solution. > > > > This stores the smallest preregistered page size in the preregistered > > region descriptor and changes the mm_iommu_xxx API to check this against > > the IOMMU page size. > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > --- > > Changes: > > v2: > > * explicitly check for compound pages before calling compound_order() > > > > --- > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > for IOMMU pages without checking the mmu pagesize and this will fail > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > With the change, mapping will fail in KVM and the guest will print: > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > mlx5_core 0000:00:00.0: failed to map direct window for > > /pci@800000020000000/ethernet@0: -1 > > [snip] > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > struct mm_iommu_table_group_mem_t **pmem) > > { > > struct mm_iommu_table_group_mem_t *mem; > > - long i, j, ret = 0, locked_entries = 0; > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > struct page *page = NULL; > > > > mutex_lock(&mem_list_mutex); > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > goto unlock_exit; > > } > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > What about 16G pages on an HPT system? Below in the loop mem->pageshift will reduce to the biggest actual size which will be 16mb/64k/4k. Or remain 1GB if no memory is actually pinned, no loss there. 
> > > for (i = 0; i < entries; ++i) { > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > 1/* pages */, 1/* iswrite */, &page)) { > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > } > > } > > populate: > > + pageshift = PAGE_SHIFT; > > + if (PageCompound(page)) > > + pageshift += compound_order(compound_head(page)); > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > Why not make mem->pageshift and pageshift local the same type to avoid > the min_t() ? I was under impression min() is deprecated (misinterpret checkpatch.pl may be) and therefore did not pay attention to it. I can fix this and repost if there is no other question. > > > + > > mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT; > > } > > > > @@ -349,7 +357,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, > > EXPORT_SYMBOL_GPL(mm_iommu_find); > > > > long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > > - unsigned long ua, unsigned long *hpa) > > + unsigned long ua, unsigned int pageshift, unsigned long *hpa) > > { > > const long entry = (ua - mem->ua) >> PAGE_SHIFT; > > u64 *va = &mem->hpas[entry]; > > @@ -357,6 +365,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > > if (entry >= mem->entries) > > return -EFAULT; > > > > + if (pageshift > mem->pageshift) > > + return -EFAULT; > > + > > *hpa = *va | (ua & ~PAGE_MASK); > > > > return 0; > > @@ -364,7 +375,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > > EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa); > > > > long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, > > - unsigned long ua, unsigned long *hpa) > > + unsigned long ua, unsigned int pageshift, unsigned long *hpa) > > { > > const long entry = (ua - mem->ua) >> PAGE_SHIFT; > > void *va = &mem->hpas[entry]; > > @@ -373,6 +384,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, > > if (entry >= mem->entries) > > return -EFAULT; > > > > + if (pageshift > mem->pageshift) > > + return -EFAULT; > > + > > pa = (void *) vmalloc_to_phys(va); > > if (!pa) > > return -EFAULT; > > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c > > index 2da5f05..7cd63b0 100644 > > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > > @@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container, > > if (!mem) > > return -EINVAL; > > > > - ret = mm_iommu_ua_to_hpa(mem, tce, phpa); > > + ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa); > > if (ret) > > return -EINVAL; > > -- Alexey [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 4:51 ` Alexey Kardashevskiy @ 2018-06-29 4:57 ` David Gibson 2018-06-29 5:18 ` Alexey Kardashevskiy 0 siblings, 1 reply; 17+ messages in thread From: David Gibson @ 2018-06-29 4:57 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 4251 bytes --] On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > On Fri, 29 Jun 2018 14:12:41 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > get access to unassigned host memory. > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > code) so the user space can pin memory backed with 64k pages and create > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > did not hit this yet as the very first time the mapping happens > > > we do not have tbl::it_userspace allocated yet and fall back to > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > sustainable solution. > > > > > > This stores the smallest preregistered page size in the preregistered > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > the IOMMU page size. > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > --- > > > Changes: > > > v2: > > > * explicitly check for compound pages before calling compound_order() > > > > > > --- > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > /pci@800000020000000/ethernet@0: -1 > > > > [snip] > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > struct mm_iommu_table_group_mem_t **pmem) > > > { > > > struct mm_iommu_table_group_mem_t *mem; > > > - long i, j, ret = 0, locked_entries = 0; > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > struct page *page = NULL; > > > > > > mutex_lock(&mem_list_mutex); > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > goto unlock_exit; > > > } > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > What about 16G pages on an HPT system? > > > Below in the loop mem->pageshift will reduce to the biggest actual size > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > pinned, no loss there. Are you saying that 16G IOMMU pages aren't supported? 
Or that there's some reason a guest can never use them? > > > for (i = 0; i < entries; ++i) { > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > > 1/* pages */, 1/* iswrite */, &page)) { > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > } > > > } > > > populate: > > > + pageshift = PAGE_SHIFT; > > > + if (PageCompound(page)) > > > + pageshift += compound_order(compound_head(page)); > > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > > > Why not make mem->pageshift and pageshift local the same type to avoid > > the min_t() ? > > I was under impression min() is deprecated (misinterpret checkpatch.pl > may be) and therefore did not pay attention to it. I can fix this and > repost if there is no other question. Hm, it's possible. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 4:57 ` David Gibson @ 2018-06-29 5:18 ` Alexey Kardashevskiy 2018-06-29 7:07 ` Alexey Kardashevskiy 0 siblings, 1 reply; 17+ messages in thread From: Alexey Kardashevskiy @ 2018-06-29 5:18 UTC (permalink / raw) To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 4518 bytes --] On Fri, 29 Jun 2018 14:57:02 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > On Fri, 29 Jun 2018 14:12:41 +1000 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > get access to unassigned host memory. > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > code) so the user space can pin memory backed with 64k pages and create > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > did not hit this yet as the very first time the mapping happens > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > sustainable solution. > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > the IOMMU page size. > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > --- > > > > Changes: > > > > v2: > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > --- > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > [snip] > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > { > > > > struct mm_iommu_table_group_mem_t *mem; > > > > - long i, j, ret = 0, locked_entries = 0; > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > struct page *page = NULL; > > > > > > > > mutex_lock(&mem_list_mutex); > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > goto unlock_exit; > > > > } > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > What about 16G pages on an HPT system? 
> > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > pinned, no loss there. > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > some reason a guest can never use them? ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will lift the limit up to 64 then, easier this way. > > > > > for (i = 0; i < entries; ++i) { > > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > > > 1/* pages */, 1/* iswrite */, &page)) { > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > } > > > > } > > > > populate: > > > > + pageshift = PAGE_SHIFT; > > > > + if (PageCompound(page)) > > > > + pageshift += compound_order(compound_head(page)); > > > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > > > > > Why not make mem->pageshift and pageshift local the same type to avoid > > > the min_t() ? > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl > > may be) and therefore did not pay attention to it. I can fix this and > > repost if there is no other question. > > Hm, it's possible. Nah, tried min(), compiles fine. -- Alexey [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 5:18 ` Alexey Kardashevskiy @ 2018-06-29 7:07 ` Alexey Kardashevskiy 2018-07-02 4:08 ` David Gibson 0 siblings, 1 reply; 17+ messages in thread From: Alexey Kardashevskiy @ 2018-06-29 7:07 UTC (permalink / raw) To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 5000 bytes --] On Fri, 29 Jun 2018 15:18:20 +1000 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On Fri, 29 Jun 2018 14:57:02 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > > On Fri, 29 Jun 2018 14:12:41 +1000 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > > get access to unassigned host memory. > > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > > code) so the user space can pin memory backed with 64k pages and create > > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > > did not hit this yet as the very first time the mapping happens > > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > > sustainable solution. > > > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > > the IOMMU page size. 
> > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > > --- > > > > > Changes: > > > > > v2: > > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > > > --- > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > > > [snip] > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > > { > > > > > struct mm_iommu_table_group_mem_t *mem; > > > > > - long i, j, ret = 0, locked_entries = 0; > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > > struct page *page = NULL; > > > > > > > > > > mutex_lock(&mem_list_mutex); > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > goto unlock_exit; > > > > > } > > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > > > What about 16G pages on an HPT system? > > > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > > pinned, no loss there. > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > > some reason a guest can never use them? > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will > lift the limit up to 64 then, easier this way. Ah, no, rather this as the upper limit: mem->pageshift = ilog2(entries) + PAGE_SHIFT; @entries here is a number of system pages being pinned in that function. > > > > > > > > for (i = 0; i < entries; ++i) { > > > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > > > > 1/* pages */, 1/* iswrite */, &page)) { > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > } > > > > > } > > > > > populate: > > > > > + pageshift = PAGE_SHIFT; > > > > > + if (PageCompound(page)) > > > > > + pageshift += compound_order(compound_head(page)); > > > > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > > > > > > > Why not make mem->pageshift and pageshift local the same type to avoid > > > > the min_t() ? > > > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl > > > may be) and therefore did not pay attention to it. I can fix this and > > > repost if there is no other question. > > > > Hm, it's possible. > > Nah, tried min(), compiles fine. -- Alexey [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
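To make the proposed cap concrete: entries is the number of base pages being
pinned, so ilog2(entries) + PAGE_SHIFT is the log2 of the whole preregistered
region, i.e. the largest IOMMU page that could possibly be contained in it. A
back-of-the-envelope check with illustrative numbers (ilog2_ull() below is a
stand-in for the kernel's ilog2()):

    #include <stdio.h>

    #define PAGE_SHIFT      16              /* 64K base pages */

    /* ilog2() analogue for the example: index of the highest set bit. */
    static unsigned int ilog2_ull(unsigned long long v)
    {
            unsigned int r = 0;

            while (v >>= 1)
                    r++;
            return r;
    }

    int main(void)
    {
            /* A 4GB region pinned as 64K pages: 4GB / 64K = 65536 entries. */
            unsigned long long entries = 65536;
            unsigned int cap = ilog2_ull(entries) + PAGE_SHIFT;

            /* cap = 16 + 16 = 32, i.e. nothing bigger than a 4GB page can fit,
             * so a 16GB (shift 34) IOMMU page would be rejected up front. */
            printf("cap shift = %u\n", cap);
            return 0;
    }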
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 7:07 ` Alexey Kardashevskiy @ 2018-07-02 4:08 ` David Gibson 2018-07-02 4:33 ` Alexey Kardashevskiy 0 siblings, 1 reply; 17+ messages in thread From: David Gibson @ 2018-07-02 4:08 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 5797 bytes --] On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote: > On Fri, 29 Jun 2018 15:18:20 +1000 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > On Fri, 29 Jun 2018 14:57:02 +1000 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > > > On Fri, 29 Jun 2018 14:12:41 +1000 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > > > get access to unassigned host memory. > > > > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > > > code) so the user space can pin memory backed with 64k pages and create > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > > > did not hit this yet as the very first time the mapping happens > > > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > > > sustainable solution. > > > > > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > > > the IOMMU page size. 
> > > > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > > > --- > > > > > > Changes: > > > > > > v2: > > > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > > > > > --- > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > > > > > [snip] > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > > > { > > > > > > struct mm_iommu_table_group_mem_t *mem; > > > > > > - long i, j, ret = 0, locked_entries = 0; > > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > > > struct page *page = NULL; > > > > > > > > > > > > mutex_lock(&mem_list_mutex); > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > goto unlock_exit; > > > > > > } > > > > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > > > > > What about 16G pages on an HPT system? > > > > > > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > > > pinned, no loss there. > > > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > > > some reason a guest can never use them? > > > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will > > lift the limit up to 64 then, easier this way. > > > Ah, no, rather this as the upper limit: > > mem->pageshift = ilog2(entries) + PAGE_SHIFT; I can't make sense of this comment in context. I see how you're computing the minimum page size in the reserved region. My question is about what the "maximum minimum" is - the starting value from which you calculate. Currently it's 1G, but I can't immediately see a reason that 16G is impossible here. > @entries here is a number of system pages being pinned in that > function. > > > > > > > > > > > > > > for (i = 0; i < entries; ++i) { > > > > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > > > > > 1/* pages */, 1/* iswrite */, &page)) { > > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > } > > > > > > } > > > > > > populate: > > > > > > + pageshift = PAGE_SHIFT; > > > > > > + if (PageCompound(page)) > > > > > > + pageshift += compound_order(compound_head(page)); > > > > > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > > > > > > > > > Why not make mem->pageshift and pageshift local the same type to avoid > > > > > the min_t() ? 
> > > > > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl > > > > may be) and therefore did not pay attention to it. I can fix this and > > > > repost if there is no other question. > > > > > > Hm, it's possible. > > > > Nah, tried min(), compiles fine. > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-07-02 4:08 ` David Gibson @ 2018-07-02 4:33 ` Alexey Kardashevskiy 2018-07-02 4:52 ` David Gibson 0 siblings, 1 reply; 17+ messages in thread From: Alexey Kardashevskiy @ 2018-07-02 4:33 UTC (permalink / raw) To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras On Mon, 2 Jul 2018 14:08:52 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote: > > On Fri, 29 Jun 2018 15:18:20 +1000 > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > > > On Fri, 29 Jun 2018 14:57:02 +1000 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > > > > On Fri, 29 Jun 2018 14:12:41 +1000 > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > > > > get access to unassigned host memory. > > > > > > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > > > > code) so the user space can pin memory backed with 64k pages and create > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > > > > did not hit this yet as the very first time the mapping happens > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > > > > sustainable solution. > > > > > > > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > > > > the IOMMU page size. 
> > > > > > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > > > > --- > > > > > > > Changes: > > > > > > > v2: > > > > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > > > > > > > --- > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > > > > > > > [snip] > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > > > > { > > > > > > > struct mm_iommu_table_group_mem_t *mem; > > > > > > > - long i, j, ret = 0, locked_entries = 0; > > > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > > > > struct page *page = NULL; > > > > > > > > > > > > > > mutex_lock(&mem_list_mutex); > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > goto unlock_exit; > > > > > > > } > > > > > > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > > > > > > > What about 16G pages on an HPT system? > > > > > > > > > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > > > > pinned, no loss there. > > > > > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > > > > some reason a guest can never use them? > > > > > > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will > > > lift the limit up to 64 then, easier this way. > > > > > > Ah, no, rather this as the upper limit: > > > > mem->pageshift = ilog2(entries) + PAGE_SHIFT; > > I can't make sense of this comment in context. I see how you're > computing the minimum page size in the reserved region. > > My question is about what the "maximum minimum" is - the starting > value from which you calculate. Currently it's 1G, but I can't > immediately see a reason that 16G is impossible here. 16GB is impossible if the chunk we are preregistering here is smaller than that, for example, the entire guest ram is 4GB. If that is the case and we try mapping a 16GB IOMMU page, this should fail as I do not really know what happens to the memory between 4GB..16GB. imho if not that, than 1<<64 would make a good upper limit. > > > @entries here is a number of system pages being pinned in that > > function. 
> > > > > > > > > > > > > > > > > > > > for (i = 0; i < entries; ++i) { > > > > > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > > > > > > 1/* pages */, 1/* iswrite */, &page)) { > > > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > } > > > > > > > } > > > > > > > populate: > > > > > > > + pageshift = PAGE_SHIFT; > > > > > > > + if (PageCompound(page)) > > > > > > > + pageshift += compound_order(compound_head(page)); > > > > > > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > > > > > > > > > > > Why not make mem->pageshift and pageshift local the same type to avoid > > > > > > the min_t() ? > > > > > > > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl > > > > > may be) and therefore did not pay attention to it. I can fix this and > > > > > repost if there is no other question. > > > > > > > > Hm, it's possible. > > > > > > Nah, tried min(), compiles fine. > > > > > > > > > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson -- Alexey ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-07-02 4:33 ` Alexey Kardashevskiy @ 2018-07-02 4:52 ` David Gibson 2018-07-02 6:32 ` Alexey Kardashevskiy 0 siblings, 1 reply; 17+ messages in thread From: David Gibson @ 2018-07-02 4:52 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 7092 bytes --] On Mon, Jul 02, 2018 at 02:33:30PM +1000, Alexey Kardashevskiy wrote: > On Mon, 2 Jul 2018 14:08:52 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote: > > > On Fri, 29 Jun 2018 15:18:20 +1000 > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > > > > > On Fri, 29 Jun 2018 14:57:02 +1000 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > > > > > On Fri, 29 Jun 2018 14:12:41 +1000 > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > > > > > get access to unassigned host memory. > > > > > > > > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > > > > > code) so the user space can pin memory backed with 64k pages and create > > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > > > > > did not hit this yet as the very first time the mapping happens > > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > > > > > sustainable solution. > > > > > > > > > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > > > > > the IOMMU page size. 
> > > > > > > > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > > > > > --- > > > > > > > > Changes: > > > > > > > > v2: > > > > > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > > > > > > > > > --- > > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > > > > > > > > > [snip] > > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > > > > > { > > > > > > > > struct mm_iommu_table_group_mem_t *mem; > > > > > > > > - long i, j, ret = 0, locked_entries = 0; > > > > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > > > > > struct page *page = NULL; > > > > > > > > > > > > > > > > mutex_lock(&mem_list_mutex); > > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > goto unlock_exit; > > > > > > > > } > > > > > > > > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > > > > > > > > > What about 16G pages on an HPT system? > > > > > > > > > > > > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > > > > > pinned, no loss there. > > > > > > > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > > > > > some reason a guest can never use them? > > > > > > > > > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will > > > > lift the limit up to 64 then, easier this way. > > > > > > > > > Ah, no, rather this as the upper limit: > > > > > > mem->pageshift = ilog2(entries) + PAGE_SHIFT; > > > > I can't make sense of this comment in context. I see how you're > > computing the minimum page size in the reserved region. > > > > My question is about what the "maximum minimum" is - the starting > > value from which you calculate. Currently it's 1G, but I can't > > immediately see a reason that 16G is impossible here. > > > 16GB is impossible if the chunk we are preregistering here is smaller > than that, for example, the entire guest ram is 4GB. Of course. Just like it was for 1GiB if you had a 512MiB guest, for example. I'm talking about a case where you have a guest that's >=16GiB and you *have* allocated 16GiB hugepages to back it. > If that is the > case and we try mapping a 16GB IOMMU page, this should fail as I do not > really know what happens to the memory between 4GB..16GB. > > imho if not that, than 1<<64 would make a good upper limit. 
> > > > > > > > @entries here is a number of system pages being pinned in that > > > function. > > > > > > > > > > > > > > > > > > > > > > > > > > for (i = 0; i < entries; ++i) { > > > > > > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), > > > > > > > > 1/* pages */, 1/* iswrite */, &page)) { > > > > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > } > > > > > > > > } > > > > > > > > populate: > > > > > > > > + pageshift = PAGE_SHIFT; > > > > > > > > + if (PageCompound(page)) > > > > > > > > + pageshift += compound_order(compound_head(page)); > > > > > > > > + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift); > > > > > > > > > > > > > > Why not make mem->pageshift and pageshift local the same type to avoid > > > > > > > the min_t() ? > > > > > > > > > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl > > > > > > may be) and therefore did not pay attention to it. I can fix this and > > > > > > repost if there is no other question. > > > > > > > > > > Hm, it's possible. > > > > > > > > Nah, tried min(), compiles fine. > > > > > > > > > > > > > > > > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
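(For context, a minimal sketch of the clamping being debated above, written against the mm_iommu_get() hunk quoted in the patch. It assumes the starting shift is derived from the size of the chunk being preregistered - Alexey's proposed ilog2(entries) + PAGE_SHIFT - rather than the hardcoded 30, and is then reduced to the order of the smallest page actually pinned. This illustrates the idea only; it is not necessarily the exact code that was eventually committed.)

	/* cap at the chunk size instead of assuming 1G is the biggest page */
	mem->pageshift = ilog2(entries) + PAGE_SHIFT;

	for (i = 0; i < entries; ++i) {
		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
				1/* pages */, 1/* iswrite */, &page)) {
			ret = -EFAULT;
			goto unlock_exit;
		}
		/* shrink the cap to the order of the page backing this entry */
		pageshift = PAGE_SHIFT;
		if (PageCompound(page))
			pageshift += compound_order(compound_head(page));
		mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift);
		...
	}

With this starting value a 4GiB chunk of 64K pages begins at shift 32 and a 16GiB hugepage-backed chunk at 34 or more, so 16GiB IOMMU pages are no longer excluded by the initial value alone.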
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-07-02 4:52 ` David Gibson @ 2018-07-02 6:32 ` Alexey Kardashevskiy 2018-07-03 1:36 ` David Gibson 0 siblings, 1 reply; 17+ messages in thread From: Alexey Kardashevskiy @ 2018-07-02 6:32 UTC (permalink / raw) To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 5996 bytes --] On Mon, 2 Jul 2018 14:52:43 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Mon, Jul 02, 2018 at 02:33:30PM +1000, Alexey Kardashevskiy wrote: > > On Mon, 2 Jul 2018 14:08:52 +1000 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote: > > > > On Fri, 29 Jun 2018 15:18:20 +1000 > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > > > > > > > On Fri, 29 Jun 2018 14:57:02 +1000 > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > > > > > > On Fri, 29 Jun 2018 14:12:41 +1000 > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > > > > > > get access to unassigned host memory. > > > > > > > > > > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > > > > > > code) so the user space can pin memory backed with 64k pages and create > > > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > > > > > > did not hit this yet as the very first time the mapping happens > > > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > > > > > > sustainable solution. > > > > > > > > > > > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > > > > > > the IOMMU page size. 
> > > > > > > > > > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > > > > > > --- > > > > > > > > > Changes: > > > > > > > > > v2: > > > > > > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > > > > > > > > > > > --- > > > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > > > > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > > > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > > > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > > > > > > > > > > > [snip] > > > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > > > > > > { > > > > > > > > > struct mm_iommu_table_group_mem_t *mem; > > > > > > > > > - long i, j, ret = 0, locked_entries = 0; > > > > > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > > > > > > struct page *page = NULL; > > > > > > > > > > > > > > > > > > mutex_lock(&mem_list_mutex); > > > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > > goto unlock_exit; > > > > > > > > > } > > > > > > > > > > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > > > > > > > > > > > What about 16G pages on an HPT system? > > > > > > > > > > > > > > > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > > > > > > pinned, no loss there. > > > > > > > > > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > > > > > > some reason a guest can never use them? > > > > > > > > > > > > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will > > > > > lift the limit up to 64 then, easier this way. > > > > > > > > > > > > Ah, no, rather this as the upper limit: > > > > > > > > mem->pageshift = ilog2(entries) + PAGE_SHIFT; > > > > > > I can't make sense of this comment in context. I see how you're > > > computing the minimum page size in the reserved region. > > > > > > My question is about what the is - the starting > > > value from which you calculate. Currently it's 1G, but I can't > > > immediately see a reason that 16G is impossible here. > > > > > > 16GB is impossible if the chunk we are preregistering here is smaller > > than that, for example, the entire guest ram is 4GB. > > Of course. Just like it was for 1GiB if you had a 512MiB guest, for > example. I'm talking about a case where you have a guest that's > >=16GiB and you *have* allocated 16GiB hugepages to back it. 
Then, assuming we are preregistering the entire RAM as a single chunk, the "maximum minimum" will be initialized as ">=16GiB" (but floor-aligned to a power of two) before the pinning loop and will then be reduced to the actual page size inside the loop. I feel like I am missing something in the question - what is it? -- Alexey [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
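(A quick standalone illustration of the numbers in this exchange, assuming 64K system pages (PAGE_SHIFT = 16) and an LP64 host; it only models the proposed ilog2(entries) + PAGE_SHIFT starting value and is ordinary userspace C, not kernel code.)

	#include <stdio.h>

	/* floor(log2(v)), a stand-in for the kernel's ilog2() */
	static unsigned int ilog2_ul(unsigned long v)
	{
		unsigned int s = 0;

		while (v >>= 1)
			s++;
		return s;
	}

	int main(void)
	{
		const unsigned int page_shift = 16;                /* 64K system pages */
		unsigned long chunks[] = { 1UL << 32, 1UL << 34 }; /* 4GiB, 16GiB */

		for (int i = 0; i < 2; i++) {
			unsigned long entries = chunks[i] >> page_shift;
			unsigned int cap = ilog2_ul(entries) + page_shift;

			printf("chunk %lu bytes -> starting pageshift %u (largest IOMMU page %lu bytes)\n",
			       chunks[i], cap, 1UL << cap);
		}
		return 0;
	}

For the 4GiB chunk the starting shift is 32, so a 16GiB IOMMU page would be refused; for the 16GiB chunk it is 34, so 16GiB pages remain possible - which is the case David is asking about with the hardcoded 30.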
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-07-02 6:32 ` Alexey Kardashevskiy @ 2018-07-03 1:36 ` David Gibson 0 siblings, 0 replies; 17+ messages in thread From: David Gibson @ 2018-07-03 1:36 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras [-- Attachment #1: Type: text/plain, Size: 6731 bytes --] On Mon, Jul 02, 2018 at 04:32:27PM +1000, Alexey Kardashevskiy wrote: > On Mon, 2 Jul 2018 14:52:43 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Mon, Jul 02, 2018 at 02:33:30PM +1000, Alexey Kardashevskiy wrote: > > > On Mon, 2 Jul 2018 14:08:52 +1000 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote: > > > > > On Fri, 29 Jun 2018 15:18:20 +1000 > > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > > > > > > > > > On Fri, 29 Jun 2018 14:57:02 +1000 > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote: > > > > > > > > On Fri, 29 Jun 2018 14:12:41 +1000 > > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that > > > > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't > > > > > > > > > > get access to unassigned host memory. > > > > > > > > > > > > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated > > > > > > > > > > code) so the user space can pin memory backed with 64k pages and create > > > > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and > > > > > > > > > > did not hit this yet as the very first time the mapping happens > > > > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to > > > > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails > > > > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really > > > > > > > > > > sustainable solution. > > > > > > > > > > > > > > > > > > > > This stores the smallest preregistered page size in the preregistered > > > > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against > > > > > > > > > > the IOMMU page size. 
> > > > > > > > > > > > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > > > > > > > --- > > > > > > > > > > Changes: > > > > > > > > > > v2: > > > > > > > > > > * explicitly check for compound pages before calling compound_order() > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to > > > > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB > > > > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail > > > > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256 > > > > > > > > > > > > > > > > > > > > With the change, mapping will fail in KVM and the guest will print: > > > > > > > > > > > > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0) > > > > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0 > > > > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for > > > > > > > > > > /pci@800000020000000/ethernet@0: -1 > > > > > > > > > > > > > > > > > > [snip] > > > > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > > > struct mm_iommu_table_group_mem_t **pmem) > > > > > > > > > > { > > > > > > > > > > struct mm_iommu_table_group_mem_t *mem; > > > > > > > > > > - long i, j, ret = 0, locked_entries = 0; > > > > > > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift; > > > > > > > > > > struct page *page = NULL; > > > > > > > > > > > > > > > > > > > > mutex_lock(&mem_list_mutex); > > > > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, > > > > > > > > > > goto unlock_exit; > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */ > > > > > > > > > > > > > > > > > > What about 16G pages on an HPT system? > > > > > > > > > > > > > > > > > > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size > > > > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually > > > > > > > > pinned, no loss there. > > > > > > > > > > > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's > > > > > > > some reason a guest can never use them? > > > > > > > > > > > > > > > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will > > > > > > lift the limit up to 64 then, easier this way. > > > > > > > > > > > > > > > Ah, no, rather this as the upper limit: > > > > > > > > > > mem->pageshift = ilog2(entries) + PAGE_SHIFT; > > > > > > > > I can't make sense of this comment in context. I see how you're > > > > computing the minimum page size in the reserved region. > > > > > > > > My question is about what the is - the starting > > > > value from which you calculate. Currently it's 1G, but I can't > > > > immediately see a reason that 16G is impossible here. > > > > > > > > > 16GB is impossible if the chunk we are preregistering here is smaller > > > than that, for example, the entire guest ram is 4GB. > > > > Of course. Just like it was for 1GiB if you had a 512MiB guest, for > > example. 
I'm talking about a case where you have a guest that's > > >=16GiB and you *have* allocated 16GiB hugepages to back it. > > > Then, assuming we are preregistering entire RAM as a single chunk, the > "maximum minimum" will be initialized as ">=16GiB" (but floor-aligned > to power of two) before the pinning loop and then reduce to the actual > page size, inside the loop. I feel like I am missing something in the > question, what is that? I wish I knew... By the "maximum minimum" I'm talking about the shift value you start with before you start looking at the actual pages. Currently that is exactly 30 (1 GiB). AFAICT that prevents using 16GiB pages, and I can't see a good reason for that. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-26 5:59 [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page Alexey Kardashevskiy 2018-06-26 5:59 ` [PATCH kernel v2 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize Alexey Kardashevskiy 2018-06-26 5:59 ` [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page Alexey Kardashevskiy @ 2018-06-29 1:55 ` Michael Ellerman 2018-06-29 3:00 ` Alexey Kardashevskiy 2 siblings, 1 reply; 17+ messages in thread From: Michael Ellerman @ 2018-06-29 1:55 UTC (permalink / raw) To: Alexey Kardashevskiy, linuxppc-dev Cc: Alexey Kardashevskiy, Alex Williamson, kvm-ppc, David Gibson Alexey Kardashevskiy <aik@ozlabs.ru> writes: > This is to improve page boundaries checking and should probably > be cc:stable. I came accross this while debugging nvlink2 passthrough > but the lack of checking might be exploited by the existing userspace. Do you really mean "exploited" ? As in there's a security issue? Your change log for patch 2 sort of suggests that but then says that without the fix you just hit an error in vfio code. So I'm not clear on what the exposure is. cheers ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 1:55 ` [PATCH kernel v2 0/2] " Michael Ellerman @ 2018-06-29 3:00 ` Alexey Kardashevskiy 2018-06-29 4:14 ` David Gibson 0 siblings, 1 reply; 17+ messages in thread From: Alexey Kardashevskiy @ 2018-06-29 3:00 UTC (permalink / raw) To: Michael Ellerman; +Cc: linuxppc-dev, Alex Williamson, kvm-ppc, David Gibson On Fri, 29 Jun 2018 11:55:40 +1000 Michael Ellerman <mpe@ellerman.id.au> wrote: > Alexey Kardashevskiy <aik@ozlabs.ru> writes: > > > This is to improve page boundaries checking and should probably > > be cc:stable. I came accross this while debugging nvlink2 passthrough > > but the lack of checking might be exploited by the existing userspace. > > Do you really mean "exploited" ? As in there's a security issue? > > Your change log for patch 2 sort of suggests that but then says that > without the fix you just hit an error in vfio code. The bug is that I can easily make an unmodified guest use 16MB IOMMU pages while guest RAM is backed with system 64K pages, so unless the guest RAM is allocated contiguously (which is unlikely), a 16MB TCE will give the hardware access to host physical memory it is not supposed to have access to - 16MB minus the first 64K. There is a fast path for H_PUT_TCE - via KVM - with no containment test. There is a slow path for H_PUT_TCE - via VFIO ioctl() - with a containment test. Because of a different feature of VFIO on sPAPR (it stores an array of userspace addresses which we received from QEMU, translated to host physical addresses and programmed into the TCE table), we do not take the fast path on the very first H_PUT_TCE (the array is only allocated when the slow path is taken the very first time), so we fail there, pass the failure to the guest, and the guest decides that it is over. But a modified guest could ignore that initial H_PUT_TCE failure and simply repeat H_PUT_TCE again - this time it will take the fast path and allow the bad mapping. > So I'm not clear on what the exposure is. > > cheers -- Alexey ^ permalink raw reply [flat|nested] 17+ messages in thread
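(To make the hole concrete: the fix adds a page-shift argument to the preregistered-region lookup so both H_PUT_TCE paths refuse an IOMMU page larger than the smallest physical page backing the region. Below is a hedged sketch of that check, following the naming used in the series; the committed helper may differ in detail.)

	long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
			unsigned long ua, unsigned int pageshift, unsigned long *hpa)
	{
		const long entry = (ua - mem->ua) >> PAGE_SHIFT;
		u64 *va = &mem->hpas[entry];

		if (entry >= mem->entries)
			return -EFAULT;

		/* the IOMMU page must be contained in the pinned physical page */
		if (pageshift > mem->pageshift)
			return -EFAULT;

		*hpa = *va | (ua & ~PAGE_MASK);

		return 0;
	}

With such a check in place, a 16MB H_PUT_TCE against a region pinned from 64K pages fails on the KVM fast path as well, instead of being programmed into the hardware TCE table.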
* Re: [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page 2018-06-29 3:00 ` Alexey Kardashevskiy @ 2018-06-29 4:14 ` David Gibson 0 siblings, 0 replies; 17+ messages in thread From: David Gibson @ 2018-06-29 4:14 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Michael Ellerman, linuxppc-dev, Alex Williamson, kvm-ppc [-- Attachment #1: Type: text/plain, Size: 2181 bytes --] On Fri, Jun 29, 2018 at 01:00:07PM +1000, Alexey Kardashevskiy wrote: > On Fri, 29 Jun 2018 11:55:40 +1000 > Michael Ellerman <mpe@ellerman.id.au> wrote: > > > Alexey Kardashevskiy <aik@ozlabs.ru> writes: > > > > > This is to improve page boundaries checking and should probably > > > be cc:stable. I came accross this while debugging nvlink2 passthrough > > > but the lack of checking might be exploited by the existing userspace. > > > > Do you really mean "exploited" ? As in there's a security issue? > > > > Your change log for patch 2 sort of suggests that but then says that > > without the fix you just hit an error in vfio code. > > > The bug is that I can easily make unmodified guest use 16MB IOMMU pages > while guest RAM is backed with system 64K pages so unless the guest RAM > is allocated contigously (which is unlikely), a 16MB TCE will provide > the hardware access to the host physical memory it is not supposed to > have access to, which is 16MB minus first 64K. > > There is a fast path for H_PUT_TCE - via KVM - there is no contained > test. > > There is a slow path for H_PUT_TCE - via VFIO ioctl() - there is a > contained test. > > Because of a different feature of VFIO on sPAPR (it stores an array of > userspace addresses which we received from QEMU and translated to host > physical addresses and programmed those to the TCE table) we do not take > the fast path on the very first H_PUT_TCE (because I allocate the > array when the slow path is taken the very first time), fail there, > pass the failure to the guest the guest decides that is over. > > But a modified guest could ignore that initial H_PUT_TCE failure and > simply repeat H_PUT_TCE again - this time it will take the fast path > and allow the bad mapping. In short, yes, it's an exploitable security hole in the host. An unmodified Linux guest kernel just doesn't happen to exploit it, even if the guest userspace tries to get it to. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread