* [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

Ben, ping! :)

This series has tiny fixes (capability and ioctl numbers,
changed documentation, compile errors in some configurations).
More details are in the commit messages.

Rebased on v3.10-rc4.

Alexey Kardashevskiy (4):
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |  45 +++
 arch/powerpc/include/asm/kvm_host.h      |   7 +
 arch/powerpc/include/asm/kvm_ppc.h       |  40 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |   4 +
 arch/powerpc/include/uapi/asm/kvm.h      |   7 +
 arch/powerpc/kvm/book3s_64_vio.c         | 398 ++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      | 471 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |  39 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |  37 ++-
 arch/powerpc/kvm/powerpc.c               |  15 +
 arch/powerpc/mm/init_64.c                |  77 ++++-
 include/uapi/linux/kvm.h                 |   3 +
 13 files changed, 1121 insertions(+), 28 deletions(-)

-- 
1.7.10.4
* [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call, which saves time on
transitions to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in virtual
mode if the real mode handler fails for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Changelog:
2013/06/05:
* fixed typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user or they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt       |  17 ++
 arch/powerpc/include/asm/kvm_host.h     |   2 +
 arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c        | 118 ++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     | 266 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |  39 +++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
 arch/powerpc/kvm/powerpc.c              |   3 +
 include/uapi/linux/kvm.h                |   1 +
 10 files changed, 473 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..6c082ff 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.83 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability tells the guest that multiple TCE entry add/remove hypercalls
+handling is supported by the kernel. This significantly accelerates DMA
+operations for PPC KVM guests.
+
+Unlike other capabilities in this section, this one does not have an ioctl.
+Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
+H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
+userspace. Otherwise it might be better for the guest to continue using the
+H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..06b7b20 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,8 +37,11 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -148,3 +152,117 @@ fail:
 	}
 	return ret;
 }
+
+/* Converts guest physical address into host virtual */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long __user *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs,
+	 * so the whole table addressed resides in a 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+			return H_PARAMETER;
+
+		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),
+				vcpu->arch.tce_tmp[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c68d538 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,249 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validate TCE address.
+ * At the moment only flags are validated
+ * as other checks would significantly slow down
+ * or could even make it impossible to handle TCE requests
+ * in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ * It cannot fail so kvmppc_emulated_validate_tce must be called before it.
  */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is an
+	 * arithmetical operation and does not access the page struct.
+	 *
+	 * Theoretically page_address() could be defined differently
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
+			unsigned long *pte_sizep)
+{
+	pte_t *ptep;
+	unsigned int shift = 0;
+	pte_t pte, tmp, ret;
+
+	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+	if (!ptep)
+		return __pte(0);
+	if (shift)
+		*pte_sizep = 1ul << shift;
+	else
+		*pte_sizep = PAGE_SIZE;
+
+	if (!pte_present(*ptep))
+		return __pte(0);
+
+	/* wait until _PAGE_BUSY is clear then set it atomically */
+	__asm__ __volatile__ (
+		"1:	ldarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stdcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
+		: "r" (ptep), "i" (_PAGE_BUSY)
+		: "cc");
+
+	ret = pte;
+
+	return ret;
+}
+
+/*
+ * Converts guest physical address into host physical address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, hpa, pg_size = 0, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	bool writing = gpa & TCE_PCI_WRITE;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte)
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	return hpa;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to virtual mode */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to virtual mode */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs,
+	 * so the whole table addressed resides in a 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	if ((unsigned long)tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		ret = kvmppc_emulated_validate_tce(tces[i]);
+		if (ret)
+			return ret;
 	}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				tces[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to virtual mode */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 550f592..a39039a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+		return RESUME_HOST;
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
 
+	/*
+	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+	 * half executed, we first read TCEs from the user, check them and
+	 * return error if something went wrong and only then put TCEs into
+	 * the TCE table.
+	 *
+	 * tce_tmp is a cache for TCEs to avoid stack allocation or
+	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+	 * each (4096 bytes).
+	 */
+	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+	if (!vcpu->arch.tce_tmp)
+		goto free_vcpu;
+
 	return vcpu;
 
 free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
+	kfree(vcpu->arch.tce_tmp);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..91d4b45 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..8465c2a 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a5c86fc..fc0d6b9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_MPIC 90
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4
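
For readers following the H_PUT_TCE_INDIRECT handlers in the patch above, the "validate everything first, write only afterwards" ordering can be sketched outside the kernel. This is a minimal userspace illustration, not the patch's code: every `sketch_` and `SKETCH_` name, the 4K page shift, the flag values and the return codes are assumptions made for the example (the kernel takes the real ones from asm/iommu.h and asm/tce.h).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants, chosen for illustration only. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_MASK  (~((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1))
#define SKETCH_TCE_READ   UINT64_C(0x1)
#define SKETCH_TCE_WRITE  UINT64_C(0x2)
#define SKETCH_MAX_TCES   512

/* Hypothetical return codes; they are not the PAPR hcall values. */
enum sketch_rc { SKETCH_SUCCESS = 0, SKETCH_PARAMETER = -1 };

/* Accept a TCE only if no bits are set outside the address and R/W flags. */
static int sketch_validate_tce(uint64_t tce)
{
    if (tce & ~(SKETCH_PAGE_MASK | SKETCH_TCE_READ | SKETCH_TCE_WRITE))
        return SKETCH_PARAMETER;
    return SKETCH_SUCCESS;
}

/*
 * Two-phase update: phase 1 checks the bounds and every entry,
 * phase 2 writes. A bad entry therefore leaves the table untouched
 * instead of half updated.
 */
static int sketch_put_tce_list(uint64_t *table, size_t table_len,
                               size_t first, const uint64_t *tces, size_t n)
{
    size_t i;

    if (n > SKETCH_MAX_TCES || first > table_len || n > table_len - first)
        return SKETCH_PARAMETER;

    /* Phase 1: validate all entries before touching the table. */
    for (i = 0; i < n; ++i)
        if (sketch_validate_tce(tces[i]) != SKETCH_SUCCESS)
            return SKETCH_PARAMETER;

    /* Phase 2: commit; nothing in this loop can fail. */
    for (i = 0; i < n; ++i)
        table[first + i] = tces[i];

    return SKETCH_SUCCESS;
}
```

The point of the split is that the commit loop has no failure path, which is the same reason the patch says kvmppc_emulated_put_tce "cannot fail" and requires kvmppc_emulated_validate_tce to be called before it.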
<paulus@au1.ibm.com> * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> */ #include <linux/types.h> @@ -35,42 +36,249 @@ #include <asm/ppc-opcode.h> #include <asm/kvm_host.h> #include <asm/udbg.h> +#include <asm/iommu.h> +#include <asm/tce.h> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) +#define ERROR_ADDR (~(unsigned long)0x0) -/* WARNING: This will be called in real-mode on HV KVM and virtual - * mode on PR KVM +/* Finds a TCE table descriptor by LIOBN */ +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu, + unsigned long liobn) +{ + struct kvmppc_spapr_tce_table *tt; + + list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) { + if (tt->liobn == liobn) + return tt; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table); + +/* + * Validate TCE address. + * At the moment only flags are validated + * as other checks would significantly slow down + * or even make it impossible to handle TCE requests + * in real mode. + */ +long kvmppc_emulated_validate_tce(unsigned long tce) +{ + if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ)) + return H_PARAMETER; + + return H_SUCCESS; +} +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce); + +/* + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated + * by QEMU. It puts guest TCE values into the table and expects + * QEMU to convert them later in the QEMU device implementation. + * Works in both real and virtual modes. + * It cannot fail so kvmppc_emulated_validate_tce must be called before it. 
*/ +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, + unsigned long ioba, unsigned long tce) +{ + unsigned long idx = ioba >> SPAPR_TCE_SHIFT; + struct page *page; + u64 *tbl; + + /* + * Note on the use of page_address() in real mode: + * + * It is safe to use page_address() in real mode on ppc64 because + * page_address() is always defined as lowmem_page_address() + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is an arithmetic + * operation and does not access the page struct. + * + * Theoretically page_address() could be defined differently + * but then either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL + * would have to be enabled. + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64, + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP + * is not expected to be enabled on ppc32, page_address() + * is safe for ppc32 as well. + */ +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL) +#error TODO: fix to avoid page_address() here +#endif + page = tt->pages[idx / TCES_PER_PAGE]; + tbl = (u64 *)page_address(page); + + /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ + tbl[idx % TCES_PER_PAGE] = tce; +} +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce); + +#ifdef CONFIG_KVM_BOOK3S_64_HV + +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, + unsigned long *pte_sizep) +{ + pte_t *ptep; + unsigned int shift = 0; + pte_t pte, tmp, ret; + + ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift); + if (!ptep) + return __pte(0); + if (shift) + *pte_sizep = 1ul << shift; + else + *pte_sizep = PAGE_SIZE; + + if (!pte_present(*ptep)) + return __pte(0); + + /* wait until _PAGE_BUSY is clear then set it atomically */ + __asm__ __volatile__ ( + "1: ldarx %0,0,%3\n" + " andi. %1,%0,%4\n" + " bne- 1b\n" + " ori %1,%0,%4\n" + " stdcx. 
%1,0,%3\n" + " bne- 1b" + : "=&r" (pte), "=&r" (tmp), "=m" (*ptep) + : "r" (ptep), "i" (_PAGE_BUSY) + : "cc"); + + ret = pte; + + return ret; +} + +/* + * Converts guest physical address into host physical address. + * Also returns pte and page size if the page is present in page table. + */ +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, + unsigned long gpa) +{ + struct kvm_memory_slot *memslot; + pte_t pte; + unsigned long hva, hpa, pg_size = 0, offset; + unsigned long gfn = gpa >> PAGE_SHIFT; + bool writing = gpa & TCE_PCI_WRITE; + + /* Find a KVM memslot */ + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); + if (!memslot) + return ERROR_ADDR; + + /* Convert guest physical address to host virtual */ + hva = __gfn_to_hva_memslot(memslot, gfn); + + /* Find a PTE and determine the size */ + pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva, + writing, &pg_size); + if (!pte) + return ERROR_ADDR; + + /* Calculate host phys address keeping flags and offset in the page */ + offset = gpa & (pg_size - 1); + + /* pte_pfn(pte) should return an address aligned to pg_size */ + hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset; + + return hpa; +} + long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvm *kvm = vcpu->kvm; - struct kvmppc_spapr_tce_table *stt; - - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ - /* liobn, ioba, tce); */ - - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { - if (stt->liobn == liobn) { - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; - struct page *page; - u64 *tbl; - - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ - /* liobn, stt, stt->window_size); */ - if (ioba >= stt->window_size) - return H_PARAMETER; - - page = stt->pages[idx / TCES_PER_PAGE]; - tbl = (u64 *)page_address(page); - - /* FIXME: Need to validate the TCE itself */ - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ - 
tbl[idx % TCES_PER_PAGE] = tce; - return H_SUCCESS; - } + long ret; + struct kvmppc_spapr_tce_table *tt; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to virtual space */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + if (ioba >= tt->window_size) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(tce); + if (ret) + return ret; + + kvmppc_emulated_put_tce(tt, ioba, tce); + + return H_SUCCESS; +} + +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i, ret; + unsigned long *tces; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to virtual space */ + if (!tt) + return H_TOO_HARD; + + /* + * The spec says that the maximum size of the list is 512 TCEs so + * so the whole table addressed resides in 4K page + */ + if (npages > 512) + return H_PARAMETER; + + if (tce_list & ~IOMMU_PAGE_MASK) + return H_PARAMETER; + + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list); + if ((unsigned long)tces == ERROR_ADDR) + return H_TOO_HARD; + + /* Emulated IO */ + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) { + ret = kvmppc_emulated_validate_tce(tces[i]); + if (ret) + return ret; } - /* Didn't find the liobn, punt it to userspace */ - return H_TOO_HARD; + for (i = 0; i < npages; ++i) + kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT), + tces[i]); + + return H_SUCCESS; +} + +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i, ret; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to virtual space */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > 
tt->window_size) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(tce_value); + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) + return H_PARAMETER; + + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) + kvmppc_emulated_put_tce(tt, ioba, tce_value); + + return H_SUCCESS; } +#endif /* CONFIG_KVM_BOOK3S_64_HV */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 550f592..a39039a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) ret = kvmppc_xics_hcall(vcpu, req); break; } /* fallthrough */ + return RESUME_HOST; + case H_PUT_TCE: + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), + kvmppc_get_gpr(vcpu, 5), + kvmppc_get_gpr(vcpu, 6)); + if (ret == H_TOO_HARD) + return RESUME_HOST; + break; + case H_PUT_TCE_INDIRECT: + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), + kvmppc_get_gpr(vcpu, 5), + kvmppc_get_gpr(vcpu, 6), + kvmppc_get_gpr(vcpu, 7)); + if (ret == H_TOO_HARD) + return RESUME_HOST; + break; + case H_STUFF_TCE: + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), + kvmppc_get_gpr(vcpu, 5), + kvmppc_get_gpr(vcpu, 6), + kvmppc_get_gpr(vcpu, 7)); + if (ret == H_TOO_HARD) + return RESUME_HOST; + break; default: return RESUME_HOST; } @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) vcpu->arch.cpu_type = KVM_CPU_3S_64; kvmppc_sanity_check(vcpu); + /* + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT + * half executed, we first read TCEs from the user, check them and + * return error if something went wrong and only then put TCEs into + * the TCE table. + * + * tce_tmp is a cache for TCEs to avoid stack allocation or + * kmalloc as the whole TCE list can take up to 512 items 8 bytes + * each (4096 bytes). 
+ */ + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); + if (!vcpu->arch.tce_tmp) + goto free_vcpu; + return vcpu; free_vcpu: @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); spin_unlock(&vcpu->arch.vpa_update_lock); + kfree(vcpu->arch.tce_tmp); kvm_vcpu_uninit(vcpu); kmem_cache_free(kvm_vcpu_cache, vcpu); } diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index b02f91e..d35554e 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1490,6 +1490,12 @@ hcall_real_table: .long 0 /* 0x11c */ .long 0 /* 0x120 */ .long .kvmppc_h_bulk_remove - hcall_real_table + .long 0 /* 0x128 */ + .long 0 /* 0x12c */ + .long 0 /* 0x130 */ + .long 0 /* 0x134 */ + .long .kvmppc_h_stuff_tce - hcall_real_table + .long .kvmppc_h_put_tce_indirect - hcall_real_table hcall_real_table_end: ignore_hdec: diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c index da0e0bc..91d4b45 100644 --- a/arch/powerpc/kvm/book3s_pr_papr.c +++ b/arch/powerpc/kvm/book3s_pr_papr.c @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) unsigned long tce = kvmppc_get_gpr(vcpu, 6); long rc; - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); + if (rc == H_TOO_HARD) + return EMULATE_FAIL; + kvmppc_set_gpr(vcpu, 3, rc); + return EMULATE_DONE; +} + +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) +{ + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); + unsigned long tce = kvmppc_get_gpr(vcpu, 6); + unsigned long npages = kvmppc_get_gpr(vcpu, 7); + long rc; + + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, + tce, npages); + if (rc == H_TOO_HARD) + return EMULATE_FAIL; + kvmppc_set_gpr(vcpu, 3, rc); + return EMULATE_DONE; +} + +static int 
kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) +{ + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); + unsigned long npages = kvmppc_get_gpr(vcpu, 7); + long rc; + + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); if (rc == H_TOO_HARD) return EMULATE_FAIL; kvmppc_set_gpr(vcpu, 3, rc); @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) return kvmppc_h_pr_bulk_remove(vcpu); case H_PUT_TCE: return kvmppc_h_pr_put_tce(vcpu); + case H_PUT_TCE_INDIRECT: + return kvmppc_h_pr_put_tce_indirect(vcpu); + case H_STUFF_TCE: + return kvmppc_h_pr_stuff_tce(vcpu); case H_CEDE: vcpu->arch.shared->msr |= MSR_EE; kvm_vcpu_block(vcpu); diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 6316ee3..8465c2a 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) r = 1; break; #endif + case KVM_CAP_SPAPR_MULTITCE: + r = 1; + break; default: r = 0; break; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a5c86fc..fc0d6b9 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_IRQ_MPIC 90 #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 +#define KVM_CAP_SPAPR_MULTITCE 93 #ifdef KVM_CAP_IRQ_ROUTING -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 160+ messages in thread
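The two-pass flow used by the H_PUT_TCE_INDIRECT handlers in the patch above (check the whole list first, write to the table only afterwards) can be sketched in plain C outside the kernel. This is only an illustration under stated assumptions: every sketch_* and SKETCH_* name is hypothetical, the 4K page shift and the read/write bit values are assumed here rather than taken from <asm/iommu.h> and <asm/tce.h>, and the return codes are placeholders, not the PAPR H_* values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel gets these from its headers. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_MASK  (~((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1))
#define SKETCH_TCE_READ   UINT64_C(0x1)
#define SKETCH_TCE_WRITE  UINT64_C(0x2)
#define SKETCH_MAX_TCES   512   /* list limit quoted in the patch */

/* Placeholder return codes, not the real H_SUCCESS/H_PARAMETER values. */
enum sketch_rc { SKETCH_SUCCESS = 0, SKETCH_PARAMETER = 1 };

/* Hypothetical stand-in for struct kvmppc_spapr_tce_table. */
struct sketch_tce_table {
	uint64_t window_size;   /* DMA window size in bytes */
	uint64_t *entries;      /* one slot per IOMMU page in the window */
};

/* Only the page address and the two permission bits may be set. */
static int sketch_validate_tce(uint64_t tce)
{
	if (tce & ~(SKETCH_PAGE_MASK | SKETCH_TCE_READ | SKETCH_TCE_WRITE))
		return SKETCH_PARAMETER;
	return SKETCH_SUCCESS;
}

/*
 * Validate-then-commit: nothing is written unless every TCE in the
 * list passed validation, so a bad list never leaves the table half
 * updated.
 */
static int sketch_put_tce_indirect(struct sketch_tce_table *tt, uint64_t ioba,
				   const uint64_t *tces, uint64_t npages)
{
	uint64_t i, bytes;

	assert(tt != NULL && tces != NULL);

	if (npages > SKETCH_MAX_TCES)
		return SKETCH_PARAMETER;

	/* Window check written so that the addition cannot overflow. */
	bytes = npages << SKETCH_PAGE_SHIFT;
	if (ioba > tt->window_size || bytes > tt->window_size - ioba)
		return SKETCH_PARAMETER;

	/* Pass 1: validate the whole list. */
	for (i = 0; i < npages; ++i)
		if (sketch_validate_tce(tces[i]) != SKETCH_SUCCESS)
			return SKETCH_PARAMETER;

	/* Pass 2: commit; this pass cannot fail. */
	for (i = 0; i < npages; ++i)
		tt->entries[(ioba >> SKETCH_PAGE_SHIFT) + i] = tces[i];

	return SKETCH_SUCCESS;
}
```

The sketch reads the caller's array twice, which is only safe because nobody else can modify it. The virtual mode handler in the patch instead copies each TCE into vcpu->arch.tce_tmp during the first pass and commits from that copy.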
* [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls @ 2013-06-05 6:11 ` Alexey Kardashevskiy 0 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras, linuxppc-dev, David Gibson This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU-emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) to the TCE table in one call, which saves time on transitions to and from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to hold valid TCEs (copied from user and verified) before the whole list is written into the TCE table. The upcoming VFIO/IOMMU support will use this cache further, to continue TCE list processing in virtual mode if the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by the upcoming VFIO support patches. This also implements the KVM_CAP_PPC_MULTITCE capability. To benefit from this patch, QEMU needs to query for this capability and set the "hcall-multi-tce" hypertas property only if the capability is present; otherwise there will be serious performance degradation. 
Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> --- Changelog: 2013/06/05: * fixed a typo about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user or they are invalid * kvmppc_emulated_h_put_tce is split into kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 17 ++ arch/powerpc/include/asm/kvm_host.h | 2 + arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c | 39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- arch/powerpc/kvm/powerpc.c | 3 + include/uapi/linux/kvm.h | 1 + 10 files changed, 473 insertions(+), 32 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be handled. +4.83 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability tells the guest that multiple TCE entry add/remove hypercalls +handling is supported by the kernel. This significantly accelerates DMA +operations for PPC KVM guests. + +Unlike other capabilities in this section, this one does not have an ioctl. 
+Instead, when the capability is present, the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to +userspace. Otherwise it might be better for the guest to continue using the H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU is present). + + 5. The kvm_run structure ------------------------ diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..85d8f26 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp; /* TCE cache for H_PUT_TCE_INDIRECT hcall */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index a5287fe..e852921b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, - unsigned long ioba, unsigned long tce); +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( + struct kvm_vcpu *vcpu, unsigned long liobn); +extern long kvmppc_emulated_validate_tce(unsigned long tce); +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, + unsigned long ioba, unsigned long tce); +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce); +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages); +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages); extern long 
kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct kvm_allocate_rma *rma); extern struct kvmppc_linear_info *kvm_alloc_rma(void); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index b2d3f3b..06b7b20 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -14,6 +14,7 @@ * * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com> * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> */ #include <linux/types.h> @@ -36,8 +37,11 @@ #include <asm/ppc-opcode.h> #include <asm/kvm_host.h> #include <asm/udbg.h> +#include <asm/iommu.h> +#include <asm/tce.h> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) +#define ERROR_ADDR ((void *)~(unsigned long)0x0) static long kvmppc_stt_npages(unsigned long window_size) { @@ -148,3 +152,117 @@ fail: } return ret; } + +/* Converts guest physical address into host virtual */ +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu, + unsigned long gpa) +{ + unsigned long hva, gfn = gpa >> PAGE_SHIFT; + struct kvm_memory_slot *memslot; + + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); + if (!memslot) + return ERROR_ADDR; + + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK); + return (void *) hva; +} + +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce) +{ + long ret; + struct kvmppc_spapr_tce_table *tt; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + if (ioba >= tt->window_size) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(tce); + if (ret) + return ret; + + kvmppc_emulated_put_tce(tt, ioba, tce); + + return H_SUCCESS; +} + +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned 
long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i, ret; + unsigned long __user *tces; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!tt) + return H_TOO_HARD; + + /* + * The spec says that the maximum size of the list is 512 TCEs so + * so the whole table addressed resides in 4K page + */ + if (npages > 512) + return H_PARAMETER; + + if (tce_list & ~IOMMU_PAGE_MASK) + return H_PARAMETER; + + tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list); + if (tces == ERROR_ADDR) + return H_TOO_HARD; + + /* Emulated IO */ + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) { + if (get_user(vcpu->arch.tce_tmp[i], tces + i)) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]); + if (ret) + return ret; + } + + for (i = 0; i < npages; ++i) + kvmppc_emulated_put_tce(tt, + ioba + (i << IOMMU_PAGE_SHIFT), + vcpu->arch.tce_tmp[i]); + + return H_SUCCESS; +} + +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i, ret; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(tce_value); + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) + return H_PARAMETER; + + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) + kvmppc_emulated_put_tce(tt, ioba, tce_value); + + return H_SUCCESS; +} diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index 30c2f3b..c68d538 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -14,6 +14,7 @@ * * Copyright 2010 Paul Mackerras, IBM Corp. 
<paulus@au1.ibm.com> * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> */ #include <linux/types.h> @@ -35,42 +36,249 @@ #include <asm/ppc-opcode.h> #include <asm/kvm_host.h> #include <asm/udbg.h> +#include <asm/iommu.h> +#include <asm/tce.h> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) +#define ERROR_ADDR (~(unsigned long)0x0) -/* WARNING: This will be called in real-mode on HV KVM and virtual - * mode on PR KVM +/* Finds a TCE table descriptor by LIOBN */ +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu, + unsigned long liobn) +{ + struct kvmppc_spapr_tce_table *tt; + + list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) { + if (tt->liobn == liobn) + return tt; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table); + +/* + * Validate TCE address. + * At the moment only flags are validated + * as other checks would significantly slow down + * or even make it impossible to handle TCE requests + * in real mode. + */ +long kvmppc_emulated_validate_tce(unsigned long tce) +{ + if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ)) + return H_PARAMETER; + + return H_SUCCESS; +} +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce); + +/* + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated + * by QEMU. It puts guest TCE values into the table and expects + * QEMU to convert them later in the QEMU device implementation. + * Works in both real and virtual modes. + * It cannot fail so kvmppc_emulated_validate_tce must be called before it. 
*/ +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, + unsigned long ioba, unsigned long tce) +{ + unsigned long idx = ioba >> SPAPR_TCE_SHIFT; + struct page *page; + u64 *tbl; + + /* + * Note on the use of page_address() in real mode: + * + * It is safe to use page_address() in real mode on ppc64 because + * page_address() is always defined as lowmem_page_address() + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is an arithmetic + * operation and does not access the page struct. + * + * Theoretically page_address() could be defined differently + * but then either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL + * would have to be enabled. + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64, + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP + * is not expected to be enabled on ppc32, page_address() + * is safe for ppc32 as well. + */ +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL) +#error TODO: fix to avoid page_address() here +#endif + page = tt->pages[idx / TCES_PER_PAGE]; + tbl = (u64 *)page_address(page); + + /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ + tbl[idx % TCES_PER_PAGE] = tce; +} +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce); + +#ifdef CONFIG_KVM_BOOK3S_64_HV + +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, + unsigned long *pte_sizep) +{ + pte_t *ptep; + unsigned int shift = 0; + pte_t pte, tmp, ret; + + ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift); + if (!ptep) + return __pte(0); + if (shift) + *pte_sizep = 1ul << shift; + else + *pte_sizep = PAGE_SIZE; + + if (!pte_present(*ptep)) + return __pte(0); + + /* wait until _PAGE_BUSY is clear then set it atomically */ + __asm__ __volatile__ ( + "1: ldarx %0,0,%3\n" + " andi. %1,%0,%4\n" + " bne- 1b\n" + " ori %1,%0,%4\n" + " stdcx. 
%1,0,%3\n" + " bne- 1b" + : "=&r" (pte), "=&r" (tmp), "=m" (*ptep) + : "r" (ptep), "i" (_PAGE_BUSY) + : "cc"); + + ret = pte; + + return ret; +} + +/* + * Converts guest physical address into host physical address. + * Also returns pte and page size if the page is present in page table. + */ +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, + unsigned long gpa) +{ + struct kvm_memory_slot *memslot; + pte_t pte; + unsigned long hva, hpa, pg_size = 0, offset; + unsigned long gfn = gpa >> PAGE_SHIFT; + bool writing = gpa & TCE_PCI_WRITE; + + /* Find a KVM memslot */ + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); + if (!memslot) + return ERROR_ADDR; + + /* Convert guest physical address to host virtual */ + hva = __gfn_to_hva_memslot(memslot, gfn); + + /* Find a PTE and determine the size */ + pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva, + writing, &pg_size); + if (!pte) + return ERROR_ADDR; + + /* Calculate host phys address keeping flags and offset in the page */ + offset = gpa & (pg_size - 1); + + /* pte_pfn(pte) should return an address aligned to pg_size */ + hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset; + + return hpa; +} + long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvm *kvm = vcpu->kvm; - struct kvmppc_spapr_tce_table *stt; - - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ - /* liobn, ioba, tce); */ - - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { - if (stt->liobn == liobn) { - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; - struct page *page; - u64 *tbl; - - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ - /* liobn, stt, stt->window_size); */ - if (ioba >= stt->window_size) - return H_PARAMETER; - - page = stt->pages[idx / TCES_PER_PAGE]; - tbl = (u64 *)page_address(page); - - /* FIXME: Need to validate the TCE itself */ - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ - 
tbl[idx % TCES_PER_PAGE] = tce; - return H_SUCCESS; - } + long ret; + struct kvmppc_spapr_tce_table *tt; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to virtual space */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + if (ioba >= tt->window_size) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(tce); + if (ret) + return ret; + + kvmppc_emulated_put_tce(tt, ioba, tce); + + return H_SUCCESS; +} + +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i, ret; + unsigned long *tces; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to virtual space */ + if (!tt) + return H_TOO_HARD; + + /* + * The spec says that the maximum size of the list is 512 TCEs so + * so the whole table addressed resides in 4K page + */ + if (npages > 512) + return H_PARAMETER; + + if (tce_list & ~IOMMU_PAGE_MASK) + return H_PARAMETER; + + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list); + if ((unsigned long)tces == ERROR_ADDR) + return H_TOO_HARD; + + /* Emulated IO */ + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) { + ret = kvmppc_emulated_validate_tce(tces[i]); + if (ret) + return ret; } - /* Didn't find the liobn, punt it to userspace */ - return H_TOO_HARD; + for (i = 0; i < npages; ++i) + kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT), + tces[i]); + + return H_SUCCESS; +} + +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i, ret; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to virtual space */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > 
tt->window_size) + return H_PARAMETER; + + ret = kvmppc_emulated_validate_tce(tce_value); + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) + return H_PARAMETER; + + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) + kvmppc_emulated_put_tce(tt, ioba, tce_value); + + return H_SUCCESS; } +#endif /* CONFIG_KVM_BOOK3S_64_HV */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 550f592..a39039a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) ret = kvmppc_xics_hcall(vcpu, req); break; } /* fallthrough */ + return RESUME_HOST; + case H_PUT_TCE: + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), + kvmppc_get_gpr(vcpu, 5), + kvmppc_get_gpr(vcpu, 6)); + if (ret == H_TOO_HARD) + return RESUME_HOST; + break; + case H_PUT_TCE_INDIRECT: + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), + kvmppc_get_gpr(vcpu, 5), + kvmppc_get_gpr(vcpu, 6), + kvmppc_get_gpr(vcpu, 7)); + if (ret == H_TOO_HARD) + return RESUME_HOST; + break; + case H_STUFF_TCE: + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), + kvmppc_get_gpr(vcpu, 5), + kvmppc_get_gpr(vcpu, 6), + kvmppc_get_gpr(vcpu, 7)); + if (ret == H_TOO_HARD) + return RESUME_HOST; + break; default: return RESUME_HOST; } @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) vcpu->arch.cpu_type = KVM_CPU_3S_64; kvmppc_sanity_check(vcpu); + /* + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT + * half executed, we first read TCEs from the user, check them and + * return error if something went wrong and only then put TCEs into + * the TCE table. + * + * tce_tmp is a cache for TCEs to avoid stack allocation or + * kmalloc as the whole TCE list can take up to 512 items 8 bytes + * each (4096 bytes). 
+ */ + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); + if (!vcpu->arch.tce_tmp) + goto free_vcpu; + return vcpu; free_vcpu: @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); spin_unlock(&vcpu->arch.vpa_update_lock); + kfree(vcpu->arch.tce_tmp); kvm_vcpu_uninit(vcpu); kmem_cache_free(kvm_vcpu_cache, vcpu); } diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index b02f91e..d35554e 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1490,6 +1490,12 @@ hcall_real_table: .long 0 /* 0x11c */ .long 0 /* 0x120 */ .long .kvmppc_h_bulk_remove - hcall_real_table + .long 0 /* 0x128 */ + .long 0 /* 0x12c */ + .long 0 /* 0x130 */ + .long 0 /* 0x134 */ + .long .kvmppc_h_stuff_tce - hcall_real_table + .long .kvmppc_h_put_tce_indirect - hcall_real_table hcall_real_table_end: ignore_hdec: diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c index da0e0bc..91d4b45 100644 --- a/arch/powerpc/kvm/book3s_pr_papr.c +++ b/arch/powerpc/kvm/book3s_pr_papr.c @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) unsigned long tce = kvmppc_get_gpr(vcpu, 6); long rc; - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); + if (rc == H_TOO_HARD) + return EMULATE_FAIL; + kvmppc_set_gpr(vcpu, 3, rc); + return EMULATE_DONE; +} + +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) +{ + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); + unsigned long tce = kvmppc_get_gpr(vcpu, 6); + unsigned long npages = kvmppc_get_gpr(vcpu, 7); + long rc; + + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, + tce, npages); + if (rc == H_TOO_HARD) + return EMULATE_FAIL; + kvmppc_set_gpr(vcpu, 3, rc); + return EMULATE_DONE; +} + +static int 
kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) +{ + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); + unsigned long npages = kvmppc_get_gpr(vcpu, 7); + long rc; + + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); if (rc == H_TOO_HARD) return EMULATE_FAIL; kvmppc_set_gpr(vcpu, 3, rc); @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) return kvmppc_h_pr_bulk_remove(vcpu); case H_PUT_TCE: return kvmppc_h_pr_put_tce(vcpu); + case H_PUT_TCE_INDIRECT: + return kvmppc_h_pr_put_tce_indirect(vcpu); + case H_STUFF_TCE: + return kvmppc_h_pr_stuff_tce(vcpu); case H_CEDE: vcpu->arch.shared->msr |= MSR_EE; kvm_vcpu_block(vcpu); diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 6316ee3..8465c2a 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) r = 1; break; #endif + case KVM_CAP_SPAPR_MULTITCE: + r = 1; + break; default: r = 0; break; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a5c86fc..fc0d6b9 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_IRQ_MPIC 90 #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 +#define KVM_CAP_SPAPR_MULTITCE 93 #ifdef KVM_CAP_IRQ_ROUTING -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 160+ messages in thread
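For readers following the thread: the ordering the patch uses for H_PUT_TCE_INDIRECT (check the whole list first, write to the table only afterwards) can be modelled in plain userspace C. This is a minimal sketch, not the kernel code. Every sketch_* name, the 4K page shift, the flag bit values and the return codes are assumptions made for illustration, and the bounds check is written so that it cannot wrap, which the patch does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants only; the kernel takes these from its own headers. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_MASK  (~((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1))
#define SKETCH_TCE_READ   UINT64_C(0x1)
#define SKETCH_TCE_WRITE  UINT64_C(0x2)
#define SKETCH_MAX_TCES   512

/* Placeholder return codes standing in for H_SUCCESS / H_PARAMETER. */
enum sketch_rc { SKETCH_SUCCESS = 0, SKETCH_PARAMETER = -4 };

struct sketch_tce_table {
	uint64_t window_size;	/* bytes of I/O address space covered */
	uint64_t *entries;	/* one entry per I/O page */
};

/* Flags-only validation: any bit outside address + R/W is rejected. */
static int sketch_validate_tce(uint64_t tce)
{
	if (tce & ~(SKETCH_PAGE_MASK | SKETCH_TCE_READ | SKETCH_TCE_WRITE))
		return SKETCH_PARAMETER;
	return SKETCH_SUCCESS;
}

static int sketch_put_tce_indirect(struct sketch_tce_table *tt, uint64_t ioba,
				   const uint64_t *tces, uint64_t npages)
{
	uint64_t i;

	if (npages > SKETCH_MAX_TCES)
		return SKETCH_PARAMETER;

	/* Window check, arranged so the arithmetic cannot overflow. */
	if (ioba > tt->window_size ||
	    (npages << SKETCH_PAGE_SHIFT) > tt->window_size - ioba)
		return SKETCH_PARAMETER;

	/* Pass 1: reject the whole request before touching the table. */
	for (i = 0; i < npages; ++i)
		if (sketch_validate_tce(tces[i]))
			return SKETCH_PARAMETER;

	/* Pass 2: nothing here can fail, so the table is never half updated. */
	for (i = 0; i < npages; ++i)
		tt->entries[(ioba >> SKETCH_PAGE_SHIFT) + i] = tces[i];

	return SKETCH_SUCCESS;
}
```

A caller would hand in a table descriptor, a page-aligned ioba and the list of TCEs; a bad entry anywhere in the list leaves the table exactly as it was.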
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls 2013-06-05 6:11 ` Alexey Kardashevskiy (?) @ 2013-06-16 4:20 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-16 4:20 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote: > This adds real mode handlers for the H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO > devices or emulated PCI. These calls allow adding multiple entries > (up to 512) into the TCE table in one call which saves time on > transition to/from real mode. > > This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs > (copied from user and verified) before writing the whole list into > the TCE table. This cache will be utilized more in the upcoming > VFIO/IOMMU support to continue TCE list processing in the virtual > mode in the case if the real mode handler failed for some reason. > > This adds a guest physical to host real address converter > and calls the existing H_PUT_TCE handler. The converting function > is going to be fully utilized by upcoming VFIO supporting patches. > > This also implements the KVM_CAP_PPC_MULTITCE capability, > so in order to support the functionality of this patch, QEMU > needs to query for this capability and set the "hcall-multi-tce" > hypertas property only if the capability is present, otherwise > there will be serious performance degradation. 
> > Cc: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > Signed-off-by: Paul Mackerras <paulus@samba.org> > > --- > Changelog: > 2013/06/05: > * fixed mistype about IBMVIO in the commit message > * updated doc and moved it to another section > * changed capability number > > 2013/05/21: > * added kvm_vcpu_arch::tce_tmp > * removed cleanup if put_indirect failed, instead we do not even start > writing to TCE table if we cannot get TCEs from the user and they are > invalid > * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce > and kvmppc_emulated_validate_tce (for the previous item) > * fixed bug with failthrough for H_IPI > * removed all get_user() from real mode handlers > * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) > --- > Documentation/virtual/kvm/api.txt | 17 ++ > arch/powerpc/include/asm/kvm_host.h | 2 + > arch/powerpc/include/asm/kvm_ppc.h | 16 +- > arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ > arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- > arch/powerpc/kvm/book3s_hv.c | 39 +++++ > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + > arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- > arch/powerpc/kvm/powerpc.c | 3 + > include/uapi/linux/kvm.h | 1 + > 10 files changed, 473 insertions(+), 32 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > index 5f91eda..6c082ff 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be > handled. > > > +4.83 KVM_CAP_PPC_MULTITCE > + > +Capability: KVM_CAP_PPC_MULTITCE > +Architectures: ppc > +Type: vm > + > +This capability tells the guest that multiple TCE entry add/remove hypercalls > +handling is supported by the kernel. This significanly accelerates DMA > +operations for PPC KVM guests. 
> + > +Unlike other capabilities in this section, this one does not have an ioctl. > +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and > +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to > +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE > +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). > + > + > 5. The kvm_run structure > ------------------------ > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h > index af326cd..85d8f26 100644 > --- a/arch/powerpc/include/asm/kvm_host.h > +++ b/arch/powerpc/include/asm/kvm_host.h > @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { > spinlock_t tbacct_lock; > u64 busy_stolen; > u64 busy_preempt; > + > + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ > #endif > }; > > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h > index a5287fe..e852921b 100644 > --- a/arch/powerpc/include/asm/kvm_ppc.h > +++ b/arch/powerpc/include/asm/kvm_ppc.h > @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); > > extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > struct kvm_create_spapr_tce *args); > -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > - unsigned long ioba, unsigned long tce); > +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( > + struct kvm_vcpu *vcpu, unsigned long liobn); > +extern long kvmppc_emulated_validate_tce(unsigned long tce); > +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, > + unsigned long ioba, unsigned long tce); > +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce); > +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages); > +extern long 
kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages); > extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, > struct kvm_allocate_rma *rma); > extern struct kvmppc_linear_info *kvm_alloc_rma(void); > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > index b2d3f3b..06b7b20 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -14,6 +14,7 @@ > * > * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com> > * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> > + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> > */ > > #include <linux/types.h> > @@ -36,8 +37,11 @@ > #include <asm/ppc-opcode.h> > #include <asm/kvm_host.h> > #include <asm/udbg.h> > +#include <asm/iommu.h> > +#include <asm/tce.h> > > #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) > +#define ERROR_ADDR ((void *)~(unsigned long)0x0) > > static long kvmppc_stt_npages(unsigned long window_size) > { > @@ -148,3 +152,117 @@ fail: > } > return ret; > } > + > +/* Converts guest physical address into host virtual */ > +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu, > + unsigned long gpa) > +{ > + unsigned long hva, gfn = gpa >> PAGE_SHIFT; > + struct kvm_memory_slot *memslot; > + > + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); > + if (!memslot) > + return ERROR_ADDR; > + > + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK); > + return (void *) hva; > +} > + > +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce) > +{ > + long ret; > + struct kvmppc_spapr_tce_table *tt; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to userspace */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if (ioba >= tt->window_size) > + return H_PARAMETER; > 
+ > + ret = kvmppc_emulated_validate_tce(tce); > + if (ret) > + return ret; > + > + kvmppc_emulated_put_tce(tt, ioba, tce); > + > + return H_SUCCESS; > +} > + > +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + unsigned long __user *tces; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to userspace */ > + if (!tt) > + return H_TOO_HARD; > + > + /* > + * The spec says that the maximum size of the list is 512 TCEs so > + * so the whole table addressed resides in 4K page > + */ > + if (npages > 512) > + return H_PARAMETER; > + > + if (tce_list & ~IOMMU_PAGE_MASK) > + return H_PARAMETER; > + > + tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list); > + if (tces == ERROR_ADDR) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) { > + if (get_user(vcpu->arch.tce_tmp[i], tces + i)) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]); > + if (ret) > + return ret; > + } > + > + for (i = 0; i < npages; ++i) > + kvmppc_emulated_put_tce(tt, > + ioba + (i << IOMMU_PAGE_SHIFT), > + vcpu->arch.tce_tmp[i]); > + > + return H_SUCCESS; > +} > + > +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to userspace */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce_value); > + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) > +
return H_PARAMETER; > + > + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) > + kvmppc_emulated_put_tce(tt, ioba, tce_value); > + > + return H_SUCCESS; > +} > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c > index 30c2f3b..c68d538 100644 > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c > @@ -14,6 +14,7 @@ > * > * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com> > * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> > + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> > */ > > #include <linux/types.h> > @@ -35,42 +36,249 @@ > #include <asm/ppc-opcode.h> > #include <asm/kvm_host.h> > #include <asm/udbg.h> > +#include <asm/iommu.h> > +#include <asm/tce.h> > > #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) > +#define ERROR_ADDR (~(unsigned long)0x0) > > -/* WARNING: This will be called in real-mode on HV KVM and virtual > - * mode on PR KVM > +/* Finds a TCE table descriptor by LIOBN */ > +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu, > + unsigned long liobn) > +{ > + struct kvmppc_spapr_tce_table *tt; > + > + list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) { > + if (tt->liobn == liobn) > + return tt; > + } > + > + return NULL; > +} > +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table); > + > +/* > + * Validate TCE address. > + * At the moment only flags are validated > + * as other check will significantly slow down > + * or can make it even impossible to handle TCE requests > + * in real mode. > + */ > +long kvmppc_emulated_validate_tce(unsigned long tce) > +{ > + if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ)) > + return H_PARAMETER; > + > + return H_SUCCESS; > +} > +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce); > + > +/* > + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated > + * by QEMU. 
It puts guest TCE values into the table and expects > + * the QEMU to convert them later in the QEMU device implementation. > + * Wiorks in both real and virtual modes. > + * It cannot fail so kvmppc_emulated_validate_tce must be called before it. > */ > +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, > + unsigned long ioba, unsigned long tce) > +{ > + unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > + struct page *page; > + u64 *tbl; > + > + /* > + * Note on the use of page_address() in real mode, > + * > + * It is safe to use page_address() in real mode on ppc64 because > + * page_address() is always defined as lowmem_page_address() > + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial > + * operation and does not access page struct. > + * > + * Theoretically page_address() could be defined different > + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL > + * should be enabled. > + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64, > + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only > + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP > + * is not expected to be enabled on ppc32, page_address() > + * is safe for ppc32 as well. 
> + */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +
> +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> +		unsigned long *pte_sizep)
> +{
> +	pte_t *ptep;
> +	unsigned int shift = 0;
> +	pte_t pte, tmp, ret;
> +
> +	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> +	if (!ptep)
> +		return __pte(0);
> +	if (shift)
> +		*pte_sizep = 1ul << shift;
> +	else
> +		*pte_sizep = PAGE_SIZE;
> +
> +	if (!pte_present(*ptep))
> +		return __pte(0);
> +
> +	/* wait until _PAGE_BUSY is clear then set it atomically */
> +	__asm__ __volatile__ (
> +		"1: ldarx %0,0,%3\n"
> +		" andi. %1,%0,%4\n"
> +		" bne- 1b\n"
> +		" ori %1,%0,%4\n"
> +		" stdcx. %1,0,%3\n"
> +		" bne- 1b"
> +		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
> +		: "r" (ptep), "i" (_PAGE_BUSY)
> +		: "cc");
> +
> +	ret = pte;
> +
> +	return ret;
> +}

The test for pte_present() needs to be done again after you lock the
PTE since potentially you could have raced with an invalidation.

More worrisome: You set _PAGE_BUSY above (lock the PTE). But when do
you clear it?

It looks like you rely on _PAGE_BUSY being set to protect yourself
against any concurrent invalidation (or other change to the PTE) while
you access the underlying page. This is *somewhat* ok, though frowned
upon since you end up locking the PTE a lot longer (in real mode) than
we normally do, but in any case, you need to ensure that you release
that lock.

Also you must *not* continue using the resulting physical address after
releasing the lock since the page might be invalidated/freed/swapped_out
etc... at any point once you clear busy.
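The lock, re-check, use, unlock ordering asked for here can be sketched with userspace C11 atomics. This is only a model under stated assumptions: the sketch_* names and bit values are hypothetical, a compare-and-swap loop stands in for the ldarx/stdcx. sequence in the patch, and the "page" is an ordinary array rather than real memory behind a PTE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define SKETCH_PRESENT UINT64_C(0x1)
#define SKETCH_BUSY    UINT64_C(0x2)

/* Spin until BUSY is clear, then set it atomically; returns the value seen. */
static uint64_t sketch_pte_lock(_Atomic uint64_t *ptep)
{
	uint64_t old;

	for (;;) {
		old = atomic_load(ptep);
		if (old & SKETCH_BUSY)
			continue;	/* someone else holds the lock */
		if (atomic_compare_exchange_weak(ptep, &old, old | SKETCH_BUSY))
			return old;
	}
}

/* Drop the lock; after this the translation must not be used any more. */
static void sketch_pte_unlock(_Atomic uint64_t *ptep)
{
	atomic_fetch_and(ptep, ~SKETCH_BUSY);
}

/* Copy one word out of the page, holding the PTE lock only for the copy. */
static bool sketch_read_locked(_Atomic uint64_t *ptep, const uint64_t *page,
			       size_t idx, uint64_t *out)
{
	uint64_t pte = sketch_pte_lock(ptep);
	bool ok = false;

	/* Re-check presence using the value observed while taking the lock. */
	if (pte & SKETCH_PRESENT) {
		*out = page[idx];
		ok = true;
	}

	/* Every path releases the lock before returning. */
	sketch_pte_unlock(ptep);
	return ok;
}
```

The point of the shape is that nothing derived from the PTE escapes the function: the caller gets a copied value, never an address that could go stale once busy is cleared.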

It might be better to use the MMU notifiers here to catch concurrent
invalidations rather than locking the PTE for a long time. If a
concurrent invalidation happens, just return TOO_HARD.

> +
> +/*
> + * Converts guest physical address into host physical address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, hpa, pg_size = 0, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +	bool writing = gpa & TCE_PCI_WRITE;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte)
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return an address aligned to pg_size */
> +	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	return hpa;
> +}

Do you ever test whether the page protection on the PTE allows for
access? At the moment you only use that to read from TCEs so chances
that this is wrong are slim (you wouldn't have PROT_NONE on TCE tables),
but it's still a worry to have code like that.

Also you do not set _PAGE_ACCESSED either, which means the VM doesn't
know the page is being accessed. Not necessarily a huge deal in this
specific case, but still.
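A protection check of the kind asked about here might look like the following userspace sketch. It is an illustration under assumptions, not the kernel's PTE handling: the sketch_* names and bit positions are invented, the PTE is a plain integer, and marking the entry dirty on a write is an addition of this sketch that the review does not mention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, chosen for the example only. */
#define SKETCH_PRESENT  UINT64_C(0x01)
#define SKETCH_RW       UINT64_C(0x02)
#define SKETCH_ACCESSED UINT64_C(0x04)
#define SKETCH_DIRTY    UINT64_C(0x08)

/*
 * Returns true if the requested access is permitted by the entry and,
 * in that case, records the access in the entry's reference bits.
 * A refused access leaves the entry untouched.
 */
static bool sketch_pte_check_access(uint64_t *pte, bool writing)
{
	if (!(*pte & SKETCH_PRESENT))
		return false;		/* nothing mapped, e.g. PROT_NONE */
	if (writing && !(*pte & SKETCH_RW))
		return false;		/* read-only mapping, write refused */

	*pte |= SKETCH_ACCESSED;	/* let the VM know the page is in use */
	if (writing)
		*pte |= SKETCH_DIRTY;	/* assumption: writes also mark dirty */
	return true;
}
```

In a real-mode handler a refused access would presumably be turned into H_TOO_HARD so that the virtual-mode path can fault the page in with the right permissions.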
> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > unsigned long ioba, unsigned long tce) > { > - struct kvm *kvm = vcpu->kvm; > - struct kvmppc_spapr_tce_table *stt; > - > - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ > - /* liobn, ioba, tce); */ > - > - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { > - if (stt->liobn == liobn) { > - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > - struct page *page; > - u64 *tbl; > - > - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ > - /* liobn, stt, stt->window_size); */ > - if (ioba >= stt->window_size) > - return H_PARAMETER; > - > - page = stt->pages[idx / TCES_PER_PAGE]; > - tbl = (u64 *)page_address(page); > - > - /* FIXME: Need to validate the TCE itself */ > - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ > - tbl[idx % TCES_PER_PAGE] = tce; > - return H_SUCCESS; > - } > + long ret; > + struct kvmppc_spapr_tce_table *tt; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if (ioba >= tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce); > + if (ret) > + return ret; > + > + kvmppc_emulated_put_tce(tt, ioba, tce); > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + unsigned long *tces; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* > + * The spec says that the maximum size of the list is 512 TCEs so > + * so the whole table addressed resides in 4K page > + */ > + if (npages > 512) > + return H_PARAMETER; > + > + if (tce_list & ~IOMMU_PAGE_MASK) > + return H_PARAMETER; > + 
> + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list); > + if ((unsigned long)tces == ERROR_ADDR) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) { > + ret = kvmppc_emulated_validate_tce(tces[i]); > + if (ret) > + return ret; > } > > - /* Didn't find the liobn, punt it to userspace */ > - return H_TOO_HARD; > + for (i = 0; i < npages; ++i) > + kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT), > + tces[i]); > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce_value); > + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) > + kvmppc_emulated_put_tce(tt, ioba, tce_value); > + > + return H_SUCCESS; > } > +#endif /* CONFIG_KVM_BOOK3S_64_HV */ > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 550f592..a39039a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) > ret = kvmppc_xics_hcall(vcpu, req); > break; > } /* fallthrough */ > + return RESUME_HOST; > + case H_PUT_TCE: > + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_PUT_TCE_INDIRECT: > + ret = 
kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_STUFF_TCE: > + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > default: > return RESUME_HOST; > } > @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) > vcpu->arch.cpu_type = KVM_CPU_3S_64; > kvmppc_sanity_check(vcpu); > > + /* > + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT > + * half executed, we first read TCEs from the user, check them and > + * return error if something went wrong and only then put TCEs into > + * the TCE table. > + * > + * tce_tmp is a cache for TCEs to avoid stack allocation or > + * kmalloc as the whole TCE list can take up to 512 items 8 bytes > + * each (4096 bytes). 
> + */ > + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); > + if (!vcpu->arch.tce_tmp) > + goto free_vcpu; > + > return vcpu; > > free_vcpu: > @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) > unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); > unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); > spin_unlock(&vcpu->arch.vpa_update_lock); > + kfree(vcpu->arch.tce_tmp); > kvm_vcpu_uninit(vcpu); > kmem_cache_free(kvm_vcpu_cache, vcpu); > } > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b02f91e..d35554e 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -1490,6 +1490,12 @@ hcall_real_table: > .long 0 /* 0x11c */ > .long 0 /* 0x120 */ > .long .kvmppc_h_bulk_remove - hcall_real_table > + .long 0 /* 0x128 */ > + .long 0 /* 0x12c */ > + .long 0 /* 0x130 */ > + .long 0 /* 0x134 */ > + .long .kvmppc_h_stuff_tce - hcall_real_table > + .long .kvmppc_h_put_tce_indirect - hcall_real_table > hcall_real_table_end: > > ignore_hdec: > diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c > index da0e0bc..91d4b45 100644 > --- a/arch/powerpc/kvm/book3s_pr_papr.c > +++ b/arch/powerpc/kvm/book3s_pr_papr.c > @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) > unsigned long tce = kvmppc_get_gpr(vcpu, 6); > long rc; > > - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); > + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); > + if (rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, > + tce, npages); > + if (rc 
== H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); > if (rc == H_TOO_HARD) > return EMULATE_FAIL; > kvmppc_set_gpr(vcpu, 3, rc); > @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) > return kvmppc_h_pr_bulk_remove(vcpu); > case H_PUT_TCE: > return kvmppc_h_pr_put_tce(vcpu); > + case H_PUT_TCE_INDIRECT: > + return kvmppc_h_pr_put_tce_indirect(vcpu); > + case H_STUFF_TCE: > + return kvmppc_h_pr_stuff_tce(vcpu); > case H_CEDE: > vcpu->arch.shared->msr |= MSR_EE; > kvm_vcpu_block(vcpu); > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 6316ee3..8465c2a 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) > r = 1; > break; > #endif > + case KVM_CAP_SPAPR_MULTITCE: > + r = 1; > + break; > default: > r = 0; > break; > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index a5c86fc..fc0d6b9 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { > #define KVM_CAP_IRQ_MPIC 90 > #define KVM_CAP_PPC_RTAS 91 > #define KVM_CAP_IRQ_XICS 92 > +#define KVM_CAP_SPAPR_MULTITCE 93 > > #ifdef KVM_CAP_IRQ_ROUTING > ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls @ 2013-06-16 4:20 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-16 4:20 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote: > This adds real mode handlers for the H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO > devices or emulated PCI. These calls allow adding multiple entries > (up to 512) into the TCE table in one call which saves time on > transition to/from real mode. > > This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs > (copied from user and verified) before writing the whole list into > the TCE table. This cache will be utilized more in the upcoming > VFIO/IOMMU support to continue TCE list processing in the virtual > mode in the case if the real mode handler failed for some reason. > > This adds a guest physical to host real address converter > and calls the existing H_PUT_TCE handler. The converting function > is going to be fully utilized by upcoming VFIO supporting patches. > > This also implements the KVM_CAP_PPC_MULTITCE capability, > so in order to support the functionality of this patch, QEMU > needs to query for this capability and set the "hcall-multi-tce" > hypertas property only if the capability is present, otherwise > there will be serious performance degradation. 
> > Cc: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > Signed-off-by: Paul Mackerras <paulus@samba.org> > > --- > Changelog: > 2013/06/05: > * fixed mistype about IBMVIO in the commit message > * updated doc and moved it to another section > * changed capability number > > 2013/05/21: > * added kvm_vcpu_arch::tce_tmp > * removed cleanup if put_indirect failed, instead we do not even start > writing to TCE table if we cannot get TCEs from the user and they are > invalid > * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce > and kvmppc_emulated_validate_tce (for the previous item) > * fixed bug with fallthrough for H_IPI > * removed all get_user() from real mode handlers > * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) > --- > Documentation/virtual/kvm/api.txt | 17 ++ > arch/powerpc/include/asm/kvm_host.h | 2 + > arch/powerpc/include/asm/kvm_ppc.h | 16 +- > arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ > arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- > arch/powerpc/kvm/book3s_hv.c | 39 +++++ > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + > arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- > arch/powerpc/kvm/powerpc.c | 3 + > include/uapi/linux/kvm.h | 1 + > 10 files changed, 473 insertions(+), 32 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > index 5f91eda..6c082ff 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be > handled. > > > +4.83 KVM_CAP_PPC_MULTITCE > + > +Capability: KVM_CAP_PPC_MULTITCE > +Architectures: ppc > +Type: vm > + > +This capability tells the guest that multiple TCE entry add/remove hypercalls > +handling is supported by the kernel. This significantly accelerates DMA > +operations for PPC KVM guests. 
> + > +Unlike other capabilities in this section, this one does not have an ioctl. > +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and > +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to > +the guest. Otherwise it might be better for the guest to continue using H_PUT_TCE > +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). > + > + > 5. The kvm_run structure > ------------------------ > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h > index af326cd..85d8f26 100644 > --- a/arch/powerpc/include/asm/kvm_host.h > +++ b/arch/powerpc/include/asm/kvm_host.h > @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { > spinlock_t tbacct_lock; > u64 busy_stolen; > u64 busy_preempt; > + > + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hcall */ > #endif > }; > > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h > index a5287fe..e852921b 100644 > --- a/arch/powerpc/include/asm/kvm_ppc.h > +++ b/arch/powerpc/include/asm/kvm_ppc.h > @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); > > extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > struct kvm_create_spapr_tce *args); > -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > - unsigned long ioba, unsigned long tce); > +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( > + struct kvm_vcpu *vcpu, unsigned long liobn); > +extern long kvmppc_emulated_validate_tce(unsigned long tce); > +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, > + unsigned long ioba, unsigned long tce); > +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce); > +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages); > +extern long 
kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages); > extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, > struct kvm_allocate_rma *rma); > extern struct kvmppc_linear_info *kvm_alloc_rma(void); > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > index b2d3f3b..06b7b20 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -14,6 +14,7 @@ > * > * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com> > * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> > + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> > */ > > #include <linux/types.h> > @@ -36,8 +37,11 @@ > #include <asm/ppc-opcode.h> > #include <asm/kvm_host.h> > #include <asm/udbg.h> > +#include <asm/iommu.h> > +#include <asm/tce.h> > > #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) > +#define ERROR_ADDR ((void *)~(unsigned long)0x0) > > static long kvmppc_stt_npages(unsigned long window_size) > { > @@ -148,3 +152,117 @@ fail: > } > return ret; > } > + > +/* Converts guest physical address into host virtual */ > +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu, > + unsigned long gpa) > +{ > + unsigned long hva, gfn = gpa >> PAGE_SHIFT; > + struct kvm_memory_slot *memslot; > + > + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); > + if (!memslot) > + return ERROR_ADDR; > + > + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK); > + return (void *) hva; > +} > + > +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce) > +{ > + long ret; > + struct kvmppc_spapr_tce_table *tt; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to userspace */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if (ioba >= tt->window_size) > + return H_PARAMETER; > 
+ > + ret = kvmppc_emulated_validate_tce(tce); > + if (ret) > + return ret; > + > + kvmppc_emulated_put_tce(tt, ioba, tce); > + > + return H_SUCCESS; > +} > + > +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + unsigned long __user *tces; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to userspace */ > + if (!tt) > + return H_TOO_HARD; > + > + /* > + * The spec says that the maximum size of the list is 512 TCEs so > + * the whole table addressed resides in 4K page > + */ > + if (npages > 512) > + return H_PARAMETER; > + > + if (tce_list & ~IOMMU_PAGE_MASK) > + return H_PARAMETER; > + > + tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list); > + if (tces == ERROR_ADDR) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) { > + if (get_user(vcpu->arch.tce_tmp[i], tces + i)) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]); > + if (ret) > + return ret; > + } > + > + for (i = 0; i < npages; ++i) > + kvmppc_emulated_put_tce(tt, > + ioba + (i << IOMMU_PAGE_SHIFT), > + vcpu->arch.tce_tmp[i]); > + > + return H_SUCCESS; > +} > + > +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to userspace */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce_value); > + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) > 
+ return H_PARAMETER; > + > + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) > + kvmppc_emulated_put_tce(tt, ioba, tce_value); > + > + return H_SUCCESS; > +} > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c > index 30c2f3b..c68d538 100644 > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c > @@ -14,6 +14,7 @@ > * > * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com> > * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com> > + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com> > */ > > #include <linux/types.h> > @@ -35,42 +36,249 @@ > #include <asm/ppc-opcode.h> > #include <asm/kvm_host.h> > #include <asm/udbg.h> > +#include <asm/iommu.h> > +#include <asm/tce.h> > > #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) > +#define ERROR_ADDR (~(unsigned long)0x0) > > -/* WARNING: This will be called in real-mode on HV KVM and virtual > - * mode on PR KVM > +/* Finds a TCE table descriptor by LIOBN */ > +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu, > + unsigned long liobn) > +{ > + struct kvmppc_spapr_tce_table *tt; > + > + list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) { > + if (tt->liobn == liobn) > + return tt; > + } > + > + return NULL; > +} > +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table); > + > +/* > + * Validate TCE address. > + * At the moment only flags are validated > + * as other check will significantly slow down > + * or can make it even impossible to handle TCE requests > + * in real mode. > + */ > +long kvmppc_emulated_validate_tce(unsigned long tce) > +{ > + if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ)) > + return H_PARAMETER; > + > + return H_SUCCESS; > +} > +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce); > + > +/* > + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated > + * by QEMU. 
It puts guest TCE values into the table and expects > + * the QEMU to convert them later in the QEMU device implementation. > + * Works in both real and virtual modes. > + * It cannot fail so kvmppc_emulated_validate_tce must be called before it. > */ > +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, > + unsigned long ioba, unsigned long tce) > +{ > + unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > + struct page *page; > + u64 *tbl; > + > + /* > + * Note on the use of page_address() in real mode, > + * > + * It is safe to use page_address() in real mode on ppc64 because > + * page_address() is always defined as lowmem_page_address() > + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is an arithmetical > + * operation and does not access page struct. > + * > + * Theoretically page_address() could be defined differently > + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL > + * should be enabled. > + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64, > + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only > + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP > + * is not expected to be enabled on ppc32, page_address() > + * is safe for ppc32 as well. 
> + */ > +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL) > +#error TODO: fix to avoid page_address() here > +#endif > + page = tt->pages[idx / TCES_PER_PAGE]; > + tbl = (u64 *)page_address(page); > + > + /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ > + tbl[idx % TCES_PER_PAGE] = tce; > +} > +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce); > + > +#ifdef CONFIG_KVM_BOOK3S_64_HV > + > +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, > + unsigned long *pte_sizep) > +{ > + pte_t *ptep; > + unsigned int shift = 0; > + pte_t pte, tmp, ret; > + > + ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift); > + if (!ptep) > + return __pte(0); > + if (shift) > + *pte_sizep = 1ul << shift; > + else > + *pte_sizep = PAGE_SIZE; > + > + if (!pte_present(*ptep)) > + return __pte(0); > + > + /* wait until _PAGE_BUSY is clear then set it atomically */ > + __asm__ __volatile__ ( > + "1: ldarx %0,0,%3\n" > + " andi. %1,%0,%4\n" > + " bne- 1b\n" > + " ori %1,%0,%4\n" > + " stdcx. %1,0,%3\n" > + " bne- 1b" > + : "=&r" (pte), "=&r" (tmp), "=m" (*ptep) > + : "r" (ptep), "i" (_PAGE_BUSY) > + : "cc"); > + > + ret = pte; > + > + return ret; > +}

The test for pte_present() needs to be done again after you lock the PTE since potentially you could have raced with an invalidation.

More worrisome: You set _PAGE_BUSY above (lock the PTE). But when do you clear it ? It looks like you rely on _PAGE_BUSY being set to protect yourself against any concurrent invalidation (or other change to the PTE) while you access the underlying page. This is *somewhat* ok, though frowned upon since you end up locking the PTE a lot longer (in real mode) than we normally do, but in any case, you need to ensure that you release that lock.

Also you must *not* continue using the resulting physical address after releasing the lock since the page might be invalidated/freed/swapped_out etc... at any point once you clear busy.
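
To make the ordering above concrete, here is a rough user-space sketch in portable C11, not kernel code. It models a PTE as an atomic 64-bit word; the names `demo_pte_t`, `DEMO_PRESENT`, `DEMO_BUSY`, `demo_pte_lock_if_present()` and `demo_pte_unlock()` are all made up for illustration, and the bit values are arbitrary rather than the real ppc64 ones. It only shows the three points: take the busy bit atomically, re-check presence once the lock is held, and always release.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for the example, not the ppc64 values. */
#define DEMO_PRESENT 0x1ull
#define DEMO_BUSY    0x2ull

typedef _Atomic uint64_t demo_pte_t;

/*
 * Wait until the busy bit is clear, then set it atomically.
 * Returns true with the PTE locked and its value in *out only if the
 * entry is still present at the moment the lock was taken; otherwise
 * the lock is dropped again and false is returned.
 */
static bool demo_pte_lock_if_present(demo_pte_t *ptep, uint64_t *out)
{
	uint64_t old = atomic_load(ptep);

	for (;;) {
		if (old & DEMO_BUSY) {
			/* Someone else holds the lock, look again. */
			old = atomic_load(ptep);
			continue;
		}
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ptep, &old, old | DEMO_BUSY))
			break;
	}

	/*
	 * 'old' is the value the PTE had when the lock was taken, so this
	 * is the "test again after locking" step: an invalidation may have
	 * raced with any earlier unlocked check.
	 */
	if (!(old & DEMO_PRESENT)) {
		atomic_fetch_and(ptep, ~DEMO_BUSY);
		return false;
	}

	*out = old;
	return true;
}

/* Release the lock. The translation must not be used after this point. */
static void demo_pte_unlock(demo_pte_t *ptep)
{
	atomic_fetch_and(ptep, ~DEMO_BUSY);
}
```

A caller would do all of its work on the underlying page between a successful `demo_pte_lock_if_present()` and `demo_pte_unlock()`, and would treat a `false` return the same way as a missing PTE.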
It might be better to use the MMU notifiers here to catch concurrent invalidations rather than locking the PTE for a long time. If a concurrent invalidation happens, just return TOO_HARD.

> + > +/* > + * Converts guest physical address into host physical address. > + * Also returns pte and page size if the page is present in page table. > + */ > +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, > + unsigned long gpa) > +{ > + struct kvm_memory_slot *memslot; > + pte_t pte; > + unsigned long hva, hpa, pg_size = 0, offset; > + unsigned long gfn = gpa >> PAGE_SHIFT; > + bool writing = gpa & TCE_PCI_WRITE; > + > + /* Find a KVM memslot */ > + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); > + if (!memslot) > + return ERROR_ADDR; > + > + /* Convert guest physical address to host virtual */ > + hva = __gfn_to_hva_memslot(memslot, gfn); > + > + /* Find a PTE and determine the size */ > + pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva, > + writing, &pg_size); > + if (!pte) > + return ERROR_ADDR; > + > + /* Calculate host phys address keeping flags and offset in the page */ > + offset = gpa & (pg_size - 1); > + > + /* pte_pfn(pte) should return an address aligned to pg_size */ > + hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset; > + > + return hpa; > +}

Do you ever test whether the page protection on the PTE allows for access ? At the moment you only use that to read from TCEs so chances that this is wrong are slim (you wouldn't have PROT_NONE on TCE tables), but it's still a worry to have code like that.

Also you do not set _PAGE_ACCESSED either, which means the VM doesn't know the page is being accessed. Not necessarily a huge deal in this specific case, but still.
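
As a rough illustration of the two points above (note that the quoted kvmppc_lookup_pte() takes a `writing` argument but never consults it), the following C11 sketch checks the protection bits before handing out a translation and marks the entry as accessed in the same atomic update. It is a user-space model only: `perm_pte_t`, `PERM_PRESENT`, `PERM_WRITE`, `PERM_ACCESSED` and `perm_pte_grant()` are hypothetical names with arbitrary bit values, not kernel definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for the example, not the ppc64 ones. */
#define PERM_PRESENT  0x1ull
#define PERM_WRITE    0x2ull
#define PERM_ACCESSED 0x4ull

typedef _Atomic uint64_t perm_pte_t;

/*
 * Refuse the translation if the entry is absent, or if a write is
 * requested and the entry is not writable. Otherwise set the accessed
 * bit atomically and return the updated value in *out.
 */
static bool perm_pte_grant(perm_pte_t *ptep, bool writing, uint64_t *out)
{
	uint64_t old = atomic_load(ptep);

	do {
		if (!(old & PERM_PRESENT))
			return false;
		if (writing && !(old & PERM_WRITE))
			return false;
		/* A failed exchange reloads 'old', so both checks rerun. */
	} while (!atomic_compare_exchange_weak(ptep, &old,
					       old | PERM_ACCESSED));

	*out = old | PERM_ACCESSED;
	return true;
}
```

In the real handler a refusal would presumably map to the same "give up and let virtual mode handle it" path as a missing PTE, but that choice is an assumption here rather than something the patch specifies.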
> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > unsigned long ioba, unsigned long tce) > { > - struct kvm *kvm = vcpu->kvm; > - struct kvmppc_spapr_tce_table *stt; > - > - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ > - /* liobn, ioba, tce); */ > - > - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { > - if (stt->liobn == liobn) { > - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > - struct page *page; > - u64 *tbl; > - > - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ > - /* liobn, stt, stt->window_size); */ > - if (ioba >= stt->window_size) > - return H_PARAMETER; > - > - page = stt->pages[idx / TCES_PER_PAGE]; > - tbl = (u64 *)page_address(page); > - > - /* FIXME: Need to validate the TCE itself */ > - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ > - tbl[idx % TCES_PER_PAGE] = tce; > - return H_SUCCESS; > - } > + long ret; > + struct kvmppc_spapr_tce_table *tt; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if (ioba >= tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce); > + if (ret) > + return ret; > + > + kvmppc_emulated_put_tce(tt, ioba, tce); > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + unsigned long *tces; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* > + * The spec says that the maximum size of the list is 512 TCEs so > + * the whole table addressed resides in 4K page > + */ > + if (npages > 512) > + return H_PARAMETER; > + > + if (tce_list & ~IOMMU_PAGE_MASK) > + return H_PARAMETER; > 
+ > + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list); > + if ((unsigned long)tces == ERROR_ADDR) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) { > + ret = kvmppc_emulated_validate_tce(tces[i]); > + if (ret) > + return ret; > } > > - /* Didn't find the liobn, punt it to userspace */ > - return H_TOO_HARD; > + for (i = 0; i < npages; ++i) > + kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT), > + tces[i]); > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce_value); > + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) > + kvmppc_emulated_put_tce(tt, ioba, tce_value); > + > + return H_SUCCESS; > } > +#endif /* CONFIG_KVM_BOOK3S_64_HV */ > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 550f592..a39039a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) > ret = kvmppc_xics_hcall(vcpu, req); > break; > } /* fallthrough */ > + return RESUME_HOST; > + case H_PUT_TCE: > + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_PUT_TCE_INDIRECT: > + ret = 
kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_STUFF_TCE: > + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > default: > return RESUME_HOST; > } > @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) > vcpu->arch.cpu_type = KVM_CPU_3S_64; > kvmppc_sanity_check(vcpu); > > + /* > + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT > + * half executed, we first read TCEs from the user, check them and > + * return error if something went wrong and only then put TCEs into > + * the TCE table. > + * > + * tce_tmp is a cache for TCEs to avoid stack allocation or > + * kmalloc as the whole TCE list can take up to 512 items 8 bytes > + * each (4096 bytes). 
> + */ > + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); > + if (!vcpu->arch.tce_tmp) > + goto free_vcpu; > + > return vcpu; > > free_vcpu: > @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) > unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); > unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); > spin_unlock(&vcpu->arch.vpa_update_lock); > + kfree(vcpu->arch.tce_tmp); > kvm_vcpu_uninit(vcpu); > kmem_cache_free(kvm_vcpu_cache, vcpu); > } > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b02f91e..d35554e 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -1490,6 +1490,12 @@ hcall_real_table: > .long 0 /* 0x11c */ > .long 0 /* 0x120 */ > .long .kvmppc_h_bulk_remove - hcall_real_table > + .long 0 /* 0x128 */ > + .long 0 /* 0x12c */ > + .long 0 /* 0x130 */ > + .long 0 /* 0x134 */ > + .long .kvmppc_h_stuff_tce - hcall_real_table > + .long .kvmppc_h_put_tce_indirect - hcall_real_table > hcall_real_table_end: > > ignore_hdec: > diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c > index da0e0bc..91d4b45 100644 > --- a/arch/powerpc/kvm/book3s_pr_papr.c > +++ b/arch/powerpc/kvm/book3s_pr_papr.c > @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) > unsigned long tce = kvmppc_get_gpr(vcpu, 6); > long rc; > > - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); > + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); > + if (rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, > + tce, npages); > + if 
(rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); > if (rc == H_TOO_HARD) > return EMULATE_FAIL; > kvmppc_set_gpr(vcpu, 3, rc); > @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) > return kvmppc_h_pr_bulk_remove(vcpu); > case H_PUT_TCE: > return kvmppc_h_pr_put_tce(vcpu); > + case H_PUT_TCE_INDIRECT: > + return kvmppc_h_pr_put_tce_indirect(vcpu); > + case H_STUFF_TCE: > + return kvmppc_h_pr_stuff_tce(vcpu); > case H_CEDE: > vcpu->arch.shared->msr |= MSR_EE; > kvm_vcpu_block(vcpu); > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 6316ee3..8465c2a 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) > r = 1; > break; > #endif > + case KVM_CAP_SPAPR_MULTITCE: > + r = 1; > + break; > default: > r = 0; > break; > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index a5c86fc..fc0d6b9 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { > #define KVM_CAP_IRQ_MPIC 90 > #define KVM_CAP_PPC_RTAS 91 > #define KVM_CAP_IRQ_XICS 92 > +#define KVM_CAP_SPAPR_MULTITCE 93 > > #ifdef KVM_CAP_IRQ_ROUTING > ^ permalink raw reply [flat|nested] 160+ messages in thread
> + */ > +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL) > +#error TODO: fix to avoid page_address() here > +#endif > + page = tt->pages[idx / TCES_PER_PAGE]; > + tbl = (u64 *)page_address(page); > + > + /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ > + tbl[idx % TCES_PER_PAGE] = tce; > +} > +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce); > + > +#ifdef CONFIG_KVM_BOOK3S_64_HV > + > +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, > + unsigned long *pte_sizep) > +{ > + pte_t *ptep; > + unsigned int shift = 0; > + pte_t pte, tmp, ret; > + > + ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift); > + if (!ptep) > + return __pte(0); > + if (shift) > + *pte_sizep = 1ul << shift; > + else > + *pte_sizep = PAGE_SIZE; > + > + if (!pte_present(*ptep)) > + return __pte(0); > + > + /* wait until _PAGE_BUSY is clear then set it atomically */ > + __asm__ __volatile__ ( > + "1: ldarx %0,0,%3\n" > + " andi. %1,%0,%4\n" > + " bne- 1b\n" > + " ori %1,%0,%4\n" > + " stdcx. %1,0,%3\n" > + " bne- 1b" > + : "=&r" (pte), "=&r" (tmp), "=m" (*ptep) > + : "r" (ptep), "i" (_PAGE_BUSY) > + : "cc"); > + > + ret = pte; > + > + return ret; > +} The test for pte_present() needs to be done again after you lock the PTE since potentially you could have raced with an invalidation. More worrisome: You set _PAGE_BUSY above (lock the PTE). But when do you clear it ? It looks like you rely on _PAGE_BUSY being set to protect yourself against any concurrent invalidation (or other change to the PTE) while you access the underlying page. This is *somewhat* ok, though frowned upon since you end up locking the PTE a lot longer (in real mode) than we normally do, but in any case, you need to ensure that you release that lock. Also you must *not* continue using the resulting physical address after releasing the lock since the page might be invalidated/freed/swapped_out etc... at any point once you clear busy. 
It might be better to use the MMU notifiers here to catch concurrent invalidations rather than locking the PTE for a long time. If a concurrent invalidation happens, just return TOO_HARD. > + > +/* > + * Converts guest physical address into host physical address. > + * Also returns pte and page size if the page is present in page table. > + */ > +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, > + unsigned long gpa) > +{ > + struct kvm_memory_slot *memslot; > + pte_t pte; > + unsigned long hva, hpa, pg_size = 0, offset; > + unsigned long gfn = gpa >> PAGE_SHIFT; > + bool writing = gpa & TCE_PCI_WRITE; > + > + /* Find a KVM memslot */ > + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); > + if (!memslot) > + return ERROR_ADDR; > + > + /* Convert guest physical address to host virtual */ > + hva = __gfn_to_hva_memslot(memslot, gfn); > + > + /* Find a PTE and determine the size */ > + pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva, > + writing, &pg_size); > + if (!pte) > + return ERROR_ADDR; > + > + /* Calculate host phys address keeping flags and offset in the page */ > + offset = gpa & (pg_size - 1); > + > + /* pte_pfn(pte) should return an address aligned to pg_size */ > + hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset; > + > + return hpa; > +} Do you ever test whether the page protection on the PTE allows for access ? At the moment you only use that to read from TCEs so chances that this is wrong are slim (you wouldn't have PROT_NONE on TCE tables), but it's still a worry to have code like that. Also you do not set _PAGE_ACCESSED either, which means the VM doesn't know the page is being accessed. Not necessarily a huge deal in this specific case, but still. 
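The two review points above (re-test presence once the PTE is locked, check the protection bits, and always drop the lock before returning) can be illustrated with a small user-space sketch. This is not the kernel code: the bit layout, the `demo_*` names and the use of C11 atomics in place of `ldarx`/`stdcx.` are all assumptions made for illustration. The sketch also punts when the entry is already busy instead of spinning, which roughly mirrors the "just return TOO_HARD" suggestion rather than the original wait loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for the demo; NOT the real ppc64 PTE format. */
#define DEMO_PRESENT   (UINT64_C(1) << 0)
#define DEMO_WRITE     (UINT64_C(1) << 1)
#define DEMO_BUSY      (UINT64_C(1) << 2)
#define DEMO_ACCESSED  (UINT64_C(1) << 3)

/*
 * Try to set the busy bit. The presence and permission checks are made
 * on the very value the compare-and-swap succeeds against, so they hold
 * at the moment the lock is taken (no window for a racing invalidation).
 */
bool demo_pte_trylock(_Atomic uint64_t *ptep, bool writing)
{
	uint64_t old = atomic_load(ptep);

	for (;;) {
		uint64_t locked;

		if (old & DEMO_BUSY)
			return false;	/* someone else holds it: punt */
		if (!(old & DEMO_PRESENT))
			return false;	/* invalidated or never mapped */
		if (writing && !(old & DEMO_WRITE))
			return false;	/* protection does not allow it */

		locked = old | DEMO_BUSY | DEMO_ACCESSED;
		if (atomic_compare_exchange_weak(ptep, &old, locked))
			return true;
		/* CAS failed: 'old' now holds the fresh value, re-check it */
	}
}

/* Clear the busy bit again; every successful trylock must be paired with this. */
void demo_pte_unlock(_Atomic uint64_t *ptep)
{
	atomic_fetch_and(ptep, ~DEMO_BUSY);
}

/*
 * Copy n words out of the page the entry maps while the entry is locked.
 * The data is copied out instead of returning an address, so nothing
 * derived from the mapping is used after the lock is released.
 * Returns 0 on success, -1 if the caller should fall back to a slow path.
 */
int demo_copy_locked(_Atomic uint64_t *ptep, const uint64_t *page,
		     size_t first, size_t n, uint64_t *dst)
{
	size_t i;

	if (!demo_pte_trylock(ptep, false))
		return -1;
	for (i = 0; i < n; i++)
		dst[i] = page[first + i];
	demo_pte_unlock(ptep);
	return 0;
}
```

In this shape the lock is held only for the duration of the copy, and a failure at any check maps naturally onto punting the hypercall to the virtual-mode handler.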
> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > unsigned long ioba, unsigned long tce) > { > - struct kvm *kvm = vcpu->kvm; > - struct kvmppc_spapr_tce_table *stt; > - > - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ > - /* liobn, ioba, tce); */ > - > - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { > - if (stt->liobn == liobn) { > - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > - struct page *page; > - u64 *tbl; > - > - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ > - /* liobn, stt, stt->window_size); */ > - if (ioba >= stt->window_size) > - return H_PARAMETER; > - > - page = stt->pages[idx / TCES_PER_PAGE]; > - tbl = (u64 *)page_address(page); > - > - /* FIXME: Need to validate the TCE itself */ > - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ > - tbl[idx % TCES_PER_PAGE] = tce; > - return H_SUCCESS; > - } > + long ret; > + struct kvmppc_spapr_tce_table *tt; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if (ioba >= tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce); > + if (ret) > + return ret; > + > + kvmppc_emulated_put_tce(tt, ioba, tce); > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_list, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + unsigned long *tces; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* > + * The spec says that the maximum size of the list is 512 TCEs so > + * so the whole table addressed resides in 4K page > + */ > + if (npages > 512) > + return H_PARAMETER; > + > + if (tce_list & ~IOMMU_PAGE_MASK) > + return H_PARAMETER; > 
+ > + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list); > + if ((unsigned long)tces == ERROR_ADDR) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) { > + ret = kvmppc_emulated_validate_tce(tces[i]); > + if (ret) > + return ret; > } > > - /* Didn't find the liobn, punt it to userspace */ > - return H_TOO_HARD; > + for (i = 0; i < npages; ++i) > + kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT), > + tces[i]); > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + struct kvmppc_spapr_tce_table *tt; > + long i, ret; > + > + tt = kvmppc_find_tce_table(vcpu, liobn); > + /* Didn't find the liobn, put it to virtual space */ > + if (!tt) > + return H_TOO_HARD; > + > + /* Emulated IO */ > + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) > + return H_PARAMETER; > + > + ret = kvmppc_emulated_validate_tce(tce_value); > + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE) > + kvmppc_emulated_put_tce(tt, ioba, tce_value); > + > + return H_SUCCESS; > } > +#endif /* CONFIG_KVM_BOOK3S_64_HV */ > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 550f592..a39039a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) > ret = kvmppc_xics_hcall(vcpu, req); > break; > } /* fallthrough */ > + return RESUME_HOST; > + case H_PUT_TCE: > + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_PUT_TCE_INDIRECT: > + ret = 
kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_STUFF_TCE: > + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > default: > return RESUME_HOST; > } > @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) > vcpu->arch.cpu_type = KVM_CPU_3S_64; > kvmppc_sanity_check(vcpu); > > + /* > + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT > + * half executed, we first read TCEs from the user, check them and > + * return error if something went wrong and only then put TCEs into > + * the TCE table. > + * > + * tce_tmp is a cache for TCEs to avoid stack allocation or > + * kmalloc as the whole TCE list can take up to 512 items 8 bytes > + * each (4096 bytes). 
> + */ > + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); > + if (!vcpu->arch.tce_tmp) > + goto free_vcpu; > + > return vcpu; > > free_vcpu: > @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) > unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); > unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); > spin_unlock(&vcpu->arch.vpa_update_lock); > + kfree(vcpu->arch.tce_tmp); > kvm_vcpu_uninit(vcpu); > kmem_cache_free(kvm_vcpu_cache, vcpu); > } > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b02f91e..d35554e 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -1490,6 +1490,12 @@ hcall_real_table: > .long 0 /* 0x11c */ > .long 0 /* 0x120 */ > .long .kvmppc_h_bulk_remove - hcall_real_table > + .long 0 /* 0x128 */ > + .long 0 /* 0x12c */ > + .long 0 /* 0x130 */ > + .long 0 /* 0x134 */ > + .long .kvmppc_h_stuff_tce - hcall_real_table > + .long .kvmppc_h_put_tce_indirect - hcall_real_table > hcall_real_table_end: > > ignore_hdec: > diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c > index da0e0bc..91d4b45 100644 > --- a/arch/powerpc/kvm/book3s_pr_papr.c > +++ b/arch/powerpc/kvm/book3s_pr_papr.c > @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) > unsigned long tce = kvmppc_get_gpr(vcpu, 6); > long rc; > > - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); > + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); > + if (rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, > + tce, npages); > + if 
(rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); > if (rc == H_TOO_HARD) > return EMULATE_FAIL; > kvmppc_set_gpr(vcpu, 3, rc); > @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) > return kvmppc_h_pr_bulk_remove(vcpu); > case H_PUT_TCE: > return kvmppc_h_pr_put_tce(vcpu); > + case H_PUT_TCE_INDIRECT: > + return kvmppc_h_pr_put_tce_indirect(vcpu); > + case H_STUFF_TCE: > + return kvmppc_h_pr_stuff_tce(vcpu); > case H_CEDE: > vcpu->arch.shared->msr |= MSR_EE; > kvm_vcpu_block(vcpu); > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 6316ee3..8465c2a 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) > r = 1; > break; > #endif > + case KVM_CAP_SPAPR_MULTITCE: > + r = 1; > + break; > default: > r = 0; > break; > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index a5c86fc..fc0d6b9 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { > #define KVM_CAP_IRQ_MPIC 90 > #define KVM_CAP_PPC_RTAS 91 > #define KVM_CAP_IRQ_XICS 92 > +#define KVM_CAP_SPAPR_MULTITCE 93 > > #ifdef KVM_CAP_IRQ_ROUTING > ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls 2013-06-05 6:11 ` Alexey Kardashevskiy (?) @ 2013-06-16 22:06 ` Alexander Graf -1 siblings, 0 replies; 160+ messages in thread From: Alexander Graf @ 2013-06-16 22:06 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: > This adds real mode handlers for the H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO > devices or emulated PCI. These calls allow adding multiple entries > (up to 512) into the TCE table in one call which saves time on > transition to/from real mode. > > This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs > (copied from user and verified) before writing the whole list into > the TCE table. This cache will be utilized more in the upcoming > VFIO/IOMMU support to continue TCE list processing in the virtual > mode in the case if the real mode handler failed for some reason. > > This adds a guest physical to host real address converter > and calls the existing H_PUT_TCE handler. The converting function > is going to be fully utilized by upcoming VFIO supporting patches. > > This also implements the KVM_CAP_PPC_MULTITCE capability, > so in order to support the functionality of this patch, QEMU > needs to query for this capability and set the "hcall-multi-tce" > hypertas property only if the capability is present, otherwise > there will be serious performance degradation. > > Cc: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > Signed-off-by: Paul Mackerras <paulus@samba.org> Only a few minor nits. Ben already commented on implementation details. 
> > --- > Changelog: > 2013/06/05: > * fixed mistype about IBMVIO in the commit message > * updated doc and moved it to another section > * changed capability number > > 2013/05/21: > * added kvm_vcpu_arch::tce_tmp > * removed cleanup if put_indirect failed, instead we do not even start > writing to TCE table if we cannot get TCEs from the user and they are > invalid > * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce > and kvmppc_emulated_validate_tce (for the previous item) > * fixed bug with failthrough for H_IPI > * removed all get_user() from real mode handlers > * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) > --- > Documentation/virtual/kvm/api.txt | 17 ++ > arch/powerpc/include/asm/kvm_host.h | 2 + > arch/powerpc/include/asm/kvm_ppc.h | 16 +- > arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ > arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- > arch/powerpc/kvm/book3s_hv.c | 39 +++++ > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + > arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- > arch/powerpc/kvm/powerpc.c | 3 + > include/uapi/linux/kvm.h | 1 + > 10 files changed, 473 insertions(+), 32 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > index 5f91eda..6c082ff 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be > handled. > > > +4.83 KVM_CAP_PPC_MULTITCE > + > +Capability: KVM_CAP_PPC_MULTITCE > +Architectures: ppc > +Type: vm > + > +This capability tells the guest that multiple TCE entry add/remove hypercalls > +handling is supported by the kernel. This significanly accelerates DMA > +operations for PPC KVM guests. > + > +Unlike other capabilities in this section, this one does not have an ioctl. 
> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and > +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to > +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE > +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). While this describes perfectly well what the consequences are of the patches, it does not describe properly what the CAP actually expresses. The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All other consequences are nice to document, but the semantics of the CAP are missing. We also usually try to keep KVM behavior unchanged with regards to older versions until a CAP is enabled. In this case I don't think it matters all that much, so I'm fine with declaring it as enabled by default. Please document that this is a change in behavior versus older KVM versions though. > + > + > 5. The kvm_run structure > ------------------------ > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h > index af326cd..85d8f26 100644 > --- a/arch/powerpc/include/asm/kvm_host.h > +++ b/arch/powerpc/include/asm/kvm_host.h > @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { > spinlock_t tbacct_lock; > u64 busy_stolen; > u64 busy_preempt; > + > + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ > #endif > }; [...] > > [...] > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 550f592..a39039a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) > ret = kvmppc_xics_hcall(vcpu, req); > break; > } /* fallthrough */ The fallthrough comment isn't accurate anymore. 
> + return RESUME_HOST; > + case H_PUT_TCE: > + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_PUT_TCE_INDIRECT: > + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_STUFF_TCE: > + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > default: > return RESUME_HOST; > } > @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) > vcpu->arch.cpu_type = KVM_CPU_3S_64; > kvmppc_sanity_check(vcpu); > > + /* > + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT > + * half executed, we first read TCEs from the user, check them and > + * return error if something went wrong and only then put TCEs into > + * the TCE table. > + * > + * tce_tmp is a cache for TCEs to avoid stack allocation or > + * kmalloc as the whole TCE list can take up to 512 items 8 bytes > + * each (4096 bytes). 
> + */ > + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); > + if (!vcpu->arch.tce_tmp) > + goto free_vcpu; > + > return vcpu; > > free_vcpu: > @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) > unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); > unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); > spin_unlock(&vcpu->arch.vpa_update_lock); > + kfree(vcpu->arch.tce_tmp); > kvm_vcpu_uninit(vcpu); > kmem_cache_free(kvm_vcpu_cache, vcpu); > } > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b02f91e..d35554e 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -1490,6 +1490,12 @@ hcall_real_table: > .long 0 /* 0x11c */ > .long 0 /* 0x120 */ > .long .kvmppc_h_bulk_remove - hcall_real_table > + .long 0 /* 0x128 */ > + .long 0 /* 0x12c */ > + .long 0 /* 0x130 */ > + .long 0 /* 0x134 */ > + .long .kvmppc_h_stuff_tce - hcall_real_table > + .long .kvmppc_h_put_tce_indirect - hcall_real_table > hcall_real_table_end: > > ignore_hdec: > diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c > index da0e0bc..91d4b45 100644 > --- a/arch/powerpc/kvm/book3s_pr_papr.c > +++ b/arch/powerpc/kvm/book3s_pr_papr.c > @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) > unsigned long tce = kvmppc_get_gpr(vcpu, 6); > long rc; > > - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); > + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); > + if (rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, > + tce, npages); > + if (rc == 
H_TOO_HARD)
> + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); > if (rc == H_TOO_HARD) > return EMULATE_FAIL; > kvmppc_set_gpr(vcpu, 3, rc); > @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) > return kvmppc_h_pr_bulk_remove(vcpu); > case H_PUT_TCE: > return kvmppc_h_pr_put_tce(vcpu); > + case H_PUT_TCE_INDIRECT: > + return kvmppc_h_pr_put_tce_indirect(vcpu); > + case H_STUFF_TCE: > + return kvmppc_h_pr_stuff_tce(vcpu); > case H_CEDE: > vcpu->arch.shared->msr |= MSR_EE; > kvm_vcpu_block(vcpu); > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 6316ee3..8465c2a 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) > r = 1; > break; > #endif > + case KVM_CAP_SPAPR_MULTITCE: > + r = 1; This should only be true for book3s. Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls @ 2013-06-16 22:06 ` Alexander Graf 0 siblings, 0 replies; 160+ messages in thread From: Alexander Graf @ 2013-06-16 22:06 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: > This adds real mode handlers for the H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO > devices or emulated PCI. These calls allow adding multiple entries > (up to 512) into the TCE table in one call which saves time on > transition to/from real mode. > > This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs > (copied from user and verified) before writing the whole list into > the TCE table. This cache will be utilized more in the upcoming > VFIO/IOMMU support to continue TCE list processing in the virtual > mode in the case if the real mode handler failed for some reason. > > This adds a guest physical to host real address converter > and calls the existing H_PUT_TCE handler. The converting function > is going to be fully utilized by upcoming VFIO supporting patches. > > This also implements the KVM_CAP_PPC_MULTITCE capability, > so in order to support the functionality of this patch, QEMU > needs to query for this capability and set the "hcall-multi-tce" > hypertas property only if the capability is present, otherwise > there will be serious performance degradation. > > Cc: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > Signed-off-by: Paul Mackerras <paulus@samba.org> Only a few minor nits. Ben already commented on implementation details. 
> > --- > Changelog: > 2013/06/05: > * fixed mistype about IBMVIO in the commit message > * updated doc and moved it to another section > * changed capability number > > 2013/05/21: > * added kvm_vcpu_arch::tce_tmp > * removed cleanup if put_indirect failed, instead we do not even start > writing to TCE table if we cannot get TCEs from the user and they are > invalid > * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce > and kvmppc_emulated_validate_tce (for the previous item) > * fixed bug with failthrough for H_IPI > * removed all get_user() from real mode handlers > * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) > --- > Documentation/virtual/kvm/api.txt | 17 ++ > arch/powerpc/include/asm/kvm_host.h | 2 + > arch/powerpc/include/asm/kvm_ppc.h | 16 +- > arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ > arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- > arch/powerpc/kvm/book3s_hv.c | 39 +++++ > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + > arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- > arch/powerpc/kvm/powerpc.c | 3 + > include/uapi/linux/kvm.h | 1 + > 10 files changed, 473 insertions(+), 32 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > index 5f91eda..6c082ff 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be > handled. > > > +4.83 KVM_CAP_PPC_MULTITCE > + > +Capability: KVM_CAP_PPC_MULTITCE > +Architectures: ppc > +Type: vm > + > +This capability tells the guest that multiple TCE entry add/remove hypercalls > +handling is supported by the kernel. This significanly accelerates DMA > +operations for PPC KVM guests. > + > +Unlike other capabilities in this section, this one does not have an ioctl. 
> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and > +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to > +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE > +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). While this describes perfectly well what the consequences are of the patches, it does not describe properly what the CAP actually expresses. The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All other consequences are nice to document, but the semantics of the CAP are missing. We also usually try to keep KVM behavior unchanged with regards to older versions until a CAP is enabled. In this case I don't think it matters all that much, so I'm fine with declaring it as enabled by default. Please document that this is a change in behavior versus older KVM versions though. > + > + > 5. The kvm_run structure > ------------------------ > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h > index af326cd..85d8f26 100644 > --- a/arch/powerpc/include/asm/kvm_host.h > +++ b/arch/powerpc/include/asm/kvm_host.h > @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { > spinlock_t tbacct_lock; > u64 busy_stolen; > u64 busy_preempt; > + > + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ > #endif > }; [...] > > [...] > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 550f592..a39039a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) > ret = kvmppc_xics_hcall(vcpu, req); > break; > } /* fallthrough */ The fallthrough comment isn't accurate anymore. 
> + return RESUME_HOST; > + case H_PUT_TCE: > + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_PUT_TCE_INDIRECT: > + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > + case H_STUFF_TCE: > + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), > + kvmppc_get_gpr(vcpu, 5), > + kvmppc_get_gpr(vcpu, 6), > + kvmppc_get_gpr(vcpu, 7)); > + if (ret == H_TOO_HARD) > + return RESUME_HOST; > + break; > default: > return RESUME_HOST; > } > @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) > vcpu->arch.cpu_type = KVM_CPU_3S_64; > kvmppc_sanity_check(vcpu); > > + /* > + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT > + * half executed, we first read TCEs from the user, check them and > + * return error if something went wrong and only then put TCEs into > + * the TCE table. > + * > + * tce_tmp is a cache for TCEs to avoid stack allocation or > + * kmalloc as the whole TCE list can take up to 512 items 8 bytes > + * each (4096 bytes). 
> + */ > + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); > + if (!vcpu->arch.tce_tmp) > + goto free_vcpu; > + > return vcpu; > > free_vcpu: > @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) > unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); > unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); > spin_unlock(&vcpu->arch.vpa_update_lock); > + kfree(vcpu->arch.tce_tmp); > kvm_vcpu_uninit(vcpu); > kmem_cache_free(kvm_vcpu_cache, vcpu); > } > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b02f91e..d35554e 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -1490,6 +1490,12 @@ hcall_real_table: > .long 0 /* 0x11c */ > .long 0 /* 0x120 */ > .long .kvmppc_h_bulk_remove - hcall_real_table > + .long 0 /* 0x128 */ > + .long 0 /* 0x12c */ > + .long 0 /* 0x130 */ > + .long 0 /* 0x134 */ > + .long .kvmppc_h_stuff_tce - hcall_real_table > + .long .kvmppc_h_put_tce_indirect - hcall_real_table > hcall_real_table_end: > > ignore_hdec: > diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c > index da0e0bc..91d4b45 100644 > --- a/arch/powerpc/kvm/book3s_pr_papr.c > +++ b/arch/powerpc/kvm/book3s_pr_papr.c > @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) > unsigned long tce = kvmppc_get_gpr(vcpu, 6); > long rc; > > - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); > + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); > + if (rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, > + tce, npages); > + if 
(rc == H_TOO_HARD) > + return EMULATE_FAIL; > + kvmppc_set_gpr(vcpu, 3, rc); > + return EMULATE_DONE; > +} > + > +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) > +{ > + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); > + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); > + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); > + unsigned long npages = kvmppc_get_gpr(vcpu, 7); > + long rc; > + > + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); > if (rc == H_TOO_HARD) > return EMULATE_FAIL; > kvmppc_set_gpr(vcpu, 3, rc); > @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) > return kvmppc_h_pr_bulk_remove(vcpu); > case H_PUT_TCE: > return kvmppc_h_pr_put_tce(vcpu); > + case H_PUT_TCE_INDIRECT: > + return kvmppc_h_pr_put_tce_indirect(vcpu); > + case H_STUFF_TCE: > + return kvmppc_h_pr_stuff_tce(vcpu); > case H_CEDE: > vcpu->arch.shared->msr |= MSR_EE; > kvm_vcpu_block(vcpu); > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 6316ee3..8465c2a 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) > r = 1; > break; > #endif > + case KVM_CAP_SPAPR_MULTITCE: > + r = 1; This should only be true for book3s. Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
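Note on the tce_tmp comment quoted in the message above: it describes a validate-then-commit scheme, in which the whole TCE list is copied and checked before anything is written to the TCE table, so that a bad entry cannot leave the hypercall half executed. A minimal userspace sketch of that idea follows. It is not the patch's implementation: struct tce_table, tce_is_valid(), put_tce_list(), TCE_TABLE_SIZE and the TCE_RESERVED_MASK value are assumptions made up for illustration; only the 512-entry limit comes from the quoted comment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCE_LIST_MAX      512        /* from the quoted comment: 512 * 8 = 4096 bytes */
#define TCE_TABLE_SIZE    1024       /* hypothetical table size for this sketch */
#define TCE_RESERVED_MASK 0xFFCULL   /* hypothetical: bits that must be zero */

/* Hypothetical stand-in for a TCE table. */
struct tce_table {
	uint64_t entries[TCE_TABLE_SIZE];
};

/* Hypothetical validity check: reject entries with reserved bits set. */
static int tce_is_valid(uint64_t tce)
{
	return (tce & TCE_RESERVED_MASK) == 0;
}

/*
 * Returns 0 on success, -1 on failure. On failure the table is untouched,
 * because nothing is written until every entry has been checked.
 * tmp must have room for TCE_LIST_MAX entries (the role tce_tmp plays).
 */
static int put_tce_list(struct tce_table *tbl, size_t first,
			const uint64_t *list, size_t n, uint64_t *tmp)
{
	size_t i;

	if (n > TCE_LIST_MAX || first >= TCE_TABLE_SIZE ||
	    n > TCE_TABLE_SIZE - first)
		return -1;

	/* Phase 1: copy and validate; bail out before touching the table. */
	for (i = 0; i < n; i++) {
		if (!tce_is_valid(list[i]))
			return -1;
		tmp[i] = list[i];
	}

	/* Phase 2: commit the already validated copy. */
	memcpy(&tbl->entries[first], tmp, n * sizeof(tmp[0]));
	return 0;
}
```

Preallocating tmp once per vcpu, as the patch does with tce_tmp, keeps the 4096-byte buffer off the stack and avoids an allocation on every hypercall.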
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls 2013-06-16 22:06 ` Alexander Graf (?) @ 2013-06-17 7:55 ` Alexey Kardashevskiy -1 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-17 7:55 UTC (permalink / raw) To: Alexander Graf Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 06/17/2013 08:06 AM, Alexander Graf wrote: > > On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: > >> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO >> devices or emulated PCI. These calls allow adding multiple entries >> (up to 512) into the TCE table in one call which saves time on >> transition to/from real mode. >> >> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >> (copied from user and verified) before writing the whole list into >> the TCE table. This cache will be utilized more in the upcoming >> VFIO/IOMMU support to continue TCE list processing in the virtual >> mode in the case if the real mode handler failed for some reason. >> >> This adds a guest physical to host real address converter >> and calls the existing H_PUT_TCE handler. The converting function >> is going to be fully utilized by upcoming VFIO supporting patches. >> >> This also implements the KVM_CAP_PPC_MULTITCE capability, >> so in order to support the functionality of this patch, QEMU >> needs to query for this capability and set the "hcall-multi-tce" >> hypertas property only if the capability is present, otherwise >> there will be serious performance degradation. >> >> Cc: David Gibson <david@gibson.dropbear.id.au> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> Signed-off-by: Paul Mackerras <paulus@samba.org> > > Only a few minor nits. Ben already commented on implementation details. 
> >> >> --- >> Changelog: >> 2013/06/05: >> * fixed mistype about IBMVIO in the commit message >> * updated doc and moved it to another section >> * changed capability number >> >> 2013/05/21: >> * added kvm_vcpu_arch::tce_tmp >> * removed cleanup if put_indirect failed, instead we do not even start >> writing to TCE table if we cannot get TCEs from the user and they are >> invalid >> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce >> and kvmppc_emulated_validate_tce (for the previous item) >> * fixed bug with failthrough for H_IPI >> * removed all get_user() from real mode handlers >> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) >> --- >> Documentation/virtual/kvm/api.txt | 17 ++ >> arch/powerpc/include/asm/kvm_host.h | 2 + >> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ >> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- >> arch/powerpc/kvm/book3s_hv.c | 39 +++++ >> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >> arch/powerpc/kvm/powerpc.c | 3 + >> include/uapi/linux/kvm.h | 1 + >> 10 files changed, 473 insertions(+), 32 deletions(-) >> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt >> index 5f91eda..6c082ff 100644 >> --- a/Documentation/virtual/kvm/api.txt >> +++ b/Documentation/virtual/kvm/api.txt >> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be >> handled. >> >> >> +4.83 KVM_CAP_PPC_MULTITCE >> + >> +Capability: KVM_CAP_PPC_MULTITCE >> +Architectures: ppc >> +Type: vm >> + >> +This capability tells the guest that multiple TCE entry add/remove hypercalls >> +handling is supported by the kernel. This significanly accelerates DMA >> +operations for PPC KVM guests. >> + >> +Unlike other capabilities in this section, this one does not have an ioctl. 
>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and >> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to >> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE >> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). > > While this describes perfectly well what the consequences are of the > patches, it does not describe properly what the CAP actually expresses. > The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls directly". All other consequences are nice to > document, but the semantics of the CAP are missing. ? It expresses ability to handle 2 hcalls. What is missing? > We also usually try to keep KVM behavior unchanged with regards to older > versions until a CAP is enabled. In this case I don't think it matters > all that much, so I'm fine with declaring it as enabled by default. > Please document that this is a change in behavior versus older KVM > versions though. Ok! >> + >> + >> 5. The kvm_run structure >> ------------------------ >> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h >> index af326cd..85d8f26 100644 >> --- a/arch/powerpc/include/asm/kvm_host.h >> +++ b/arch/powerpc/include/asm/kvm_host.h >> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { >> spinlock_t tbacct_lock; >> u64 busy_stolen; >> u64 busy_preempt; >> + >> + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ >> #endif >> }; > > [...] >> >> > > [...] > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >> index 550f592..a39039a 100644 >> --- a/arch/powerpc/kvm/book3s_hv.c >> +++ b/arch/powerpc/kvm/book3s_hv.c >> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) >> ret = kvmppc_xics_hcall(vcpu, req); >> break; >> } /* fallthrough */ > > The fallthrough comment isn't accurate anymore. 
> >> + return RESUME_HOST; >> + case H_PUT_TCE: >> + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >> + kvmppc_get_gpr(vcpu, 5), >> + kvmppc_get_gpr(vcpu, 6)); >> + if (ret == H_TOO_HARD) >> + return RESUME_HOST; >> + break; >> + case H_PUT_TCE_INDIRECT: >> + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), >> + kvmppc_get_gpr(vcpu, 5), >> + kvmppc_get_gpr(vcpu, 6), >> + kvmppc_get_gpr(vcpu, 7)); >> + if (ret == H_TOO_HARD) >> + return RESUME_HOST; >> + break; >> + case H_STUFF_TCE: >> + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >> + kvmppc_get_gpr(vcpu, 5), >> + kvmppc_get_gpr(vcpu, 6), >> + kvmppc_get_gpr(vcpu, 7)); >> + if (ret == H_TOO_HARD) >> + return RESUME_HOST; >> + break; >> default: >> return RESUME_HOST; >> } >> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) >> vcpu->arch.cpu_type = KVM_CPU_3S_64; >> kvmppc_sanity_check(vcpu); >> >> + /* >> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT >> + * half executed, we first read TCEs from the user, check them and >> + * return error if something went wrong and only then put TCEs into >> + * the TCE table. >> + * >> + * tce_tmp is a cache for TCEs to avoid stack allocation or >> + * kmalloc as the whole TCE list can take up to 512 items 8 bytes >> + * each (4096 bytes). 
>> + */ >> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); >> + if (!vcpu->arch.tce_tmp) >> + goto free_vcpu; >> + >> return vcpu; >> >> free_vcpu: >> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) >> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); >> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); >> spin_unlock(&vcpu->arch.vpa_update_lock); >> + kfree(vcpu->arch.tce_tmp); >> kvm_vcpu_uninit(vcpu); >> kmem_cache_free(kvm_vcpu_cache, vcpu); >> } >> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >> index b02f91e..d35554e 100644 >> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S >> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >> @@ -1490,6 +1490,12 @@ hcall_real_table: >> .long 0 /* 0x11c */ >> .long 0 /* 0x120 */ >> .long .kvmppc_h_bulk_remove - hcall_real_table >> + .long 0 /* 0x128 */ >> + .long 0 /* 0x12c */ >> + .long 0 /* 0x130 */ >> + .long 0 /* 0x134 */ >> + .long .kvmppc_h_stuff_tce - hcall_real_table >> + .long .kvmppc_h_put_tce_indirect - hcall_real_table >> hcall_real_table_end: >> >> ignore_hdec: >> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c >> index da0e0bc..91d4b45 100644 >> --- a/arch/powerpc/kvm/book3s_pr_papr.c >> +++ b/arch/powerpc/kvm/book3s_pr_papr.c >> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) >> unsigned long tce = kvmppc_get_gpr(vcpu, 6); >> long rc; >> >> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); >> + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); >> + if (rc == H_TOO_HARD) >> + return EMULATE_FAIL; >> + kvmppc_set_gpr(vcpu, 3, rc); >> + return EMULATE_DONE; >> +} >> + >> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) >> +{ >> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >> + unsigned long tce = kvmppc_get_gpr(vcpu, 6); >> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >> + long rc; >> + >> + rc = 
kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, >> + tce, npages); >> + if (rc == H_TOO_HARD) >> + return EMULATE_FAIL; >> + kvmppc_set_gpr(vcpu, 3, rc); >> + return EMULATE_DONE; >> +} >> + >> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) >> +{ >> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); >> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >> + long rc; >> + >> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); >> if (rc == H_TOO_HARD) >> return EMULATE_FAIL; >> kvmppc_set_gpr(vcpu, 3, rc); >> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) >> return kvmppc_h_pr_bulk_remove(vcpu); >> case H_PUT_TCE: >> return kvmppc_h_pr_put_tce(vcpu); >> + case H_PUT_TCE_INDIRECT: >> + return kvmppc_h_pr_put_tce_indirect(vcpu); >> + case H_STUFF_TCE: >> + return kvmppc_h_pr_stuff_tce(vcpu); >> case H_CEDE: >> vcpu->arch.shared->msr |= MSR_EE; >> kvm_vcpu_block(vcpu); >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >> index 6316ee3..8465c2a 100644 >> --- a/arch/powerpc/kvm/powerpc.c >> +++ b/arch/powerpc/kvm/powerpc.c >> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) >> r = 1; >> break; >> #endif >> + case KVM_CAP_SPAPR_MULTITCE: >> + r = 1; > > This should only be true for book3s. We had this discussion with v2. David: === So, in the case of MULTITCE, that's not quite right. PR KVM can emulate a PAPR system on a BookE machine, and there's no reason not to allow TCE acceleration as well. We can't make it dependent on PAPR mode being selected, because that's enabled per-vcpu, whereas these capabilities are queried on the VM before the vcpus are created. === Wrong? -- Alexey ^ permalink raw reply [flat|nested] 160+ messages in thread
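Note on the PR handlers quoted in the message above: each wrapper follows the same pattern. It calls the virtual-mode handler, gives up with EMULATE_FAIL when the handler answers H_TOO_HARD, and otherwise stores the return code in GPR 3 for the guest and reports EMULATE_DONE. The sketch below reproduces only that control flow in standalone C, as an illustration rather than kernel code: struct sketch_vcpu, the hcall_handler type, emulate_hcall() and the three sample handlers are hypothetical, and the numeric constants are placeholders standing in for the kernel's own definitions.

```c
/* Placeholder values for this sketch; the kernel headers define the real ones. */
enum { H_SUCCESS = 0, H_PARAMETER = -4, H_TOO_HARD = 9999 };
enum { EMULATE_DONE = 0, EMULATE_FAIL = 1 };

/* Hypothetical stand-in for the vcpu register file. */
struct sketch_vcpu {
	long gpr[32];
};

/* Hypothetical handler type: returns an hcall status code. */
typedef long (*hcall_handler)(struct sketch_vcpu *vcpu);

/* Sample handlers used only to exercise the dispatcher. */
static long handler_ok(struct sketch_vcpu *vcpu)
{
	(void)vcpu;
	return H_SUCCESS;
}

static long handler_bad_param(struct sketch_vcpu *vcpu)
{
	(void)vcpu;
	return H_PARAMETER;
}

static long handler_too_hard(struct sketch_vcpu *vcpu)
{
	(void)vcpu;
	return H_TOO_HARD;
}

/*
 * Mirrors the shape of the quoted wrappers: H_TOO_HARD means the request
 * could not be completed here, so the guest-visible register is left alone
 * and the caller falls back to another path; any other code is the hcall
 * result and goes into GPR 3.
 */
static int emulate_hcall(struct sketch_vcpu *vcpu, hcall_handler handler)
{
	long rc = handler(vcpu);

	if (rc == H_TOO_HARD)
		return EMULATE_FAIL;
	vcpu->gpr[3] = rc;
	return EMULATE_DONE;
}
```

Note that an error such as H_PARAMETER still counts as a completed emulation: the guest receives the error code in GPR 3, whereas H_TOO_HARD never reaches the guest at all.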
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls @ 2013-06-17 7:55 ` Alexey Kardashevskiy 0 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-17 7:55 UTC (permalink / raw) To: Alexander Graf Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 06/17/2013 08:06 AM, Alexander Graf wrote: > > On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: > >> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO >> devices or emulated PCI. These calls allow adding multiple entries >> (up to 512) into the TCE table in one call which saves time on >> transition to/from real mode. >> >> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >> (copied from user and verified) before writing the whole list into >> the TCE table. This cache will be utilized more in the upcoming >> VFIO/IOMMU support to continue TCE list processing in the virtual >> mode in the case if the real mode handler failed for some reason. >> >> This adds a guest physical to host real address converter >> and calls the existing H_PUT_TCE handler. The converting function >> is going to be fully utilized by upcoming VFIO supporting patches. >> >> This also implements the KVM_CAP_PPC_MULTITCE capability, >> so in order to support the functionality of this patch, QEMU >> needs to query for this capability and set the "hcall-multi-tce" >> hypertas property only if the capability is present, otherwise >> there will be serious performance degradation. >> >> Cc: David Gibson <david@gibson.dropbear.id.au> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> Signed-off-by: Paul Mackerras <paulus@samba.org> > > Only a few minor nits. Ben already commented on implementation details. 
> >> >> --- >> Changelog: >> 2013/06/05: >> * fixed mistype about IBMVIO in the commit message >> * updated doc and moved it to another section >> * changed capability number >> >> 2013/05/21: >> * added kvm_vcpu_arch::tce_tmp >> * removed cleanup if put_indirect failed, instead we do not even start >> writing to TCE table if we cannot get TCEs from the user and they are >> invalid >> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce >> and kvmppc_emulated_validate_tce (for the previous item) >> * fixed bug with failthrough for H_IPI >> * removed all get_user() from real mode handlers >> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) >> --- >> Documentation/virtual/kvm/api.txt | 17 ++ >> arch/powerpc/include/asm/kvm_host.h | 2 + >> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ >> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- >> arch/powerpc/kvm/book3s_hv.c | 39 +++++ >> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >> arch/powerpc/kvm/powerpc.c | 3 + >> include/uapi/linux/kvm.h | 1 + >> 10 files changed, 473 insertions(+), 32 deletions(-) >> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt >> index 5f91eda..6c082ff 100644 >> --- a/Documentation/virtual/kvm/api.txt >> +++ b/Documentation/virtual/kvm/api.txt >> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be >> handled. >> >> >> +4.83 KVM_CAP_PPC_MULTITCE >> + >> +Capability: KVM_CAP_PPC_MULTITCE >> +Architectures: ppc >> +Type: vm >> + >> +This capability tells the guest that multiple TCE entry add/remove hypercalls >> +handling is supported by the kernel. This significanly accelerates DMA >> +operations for PPC KVM guests. >> + >> +Unlike other capabilities in this section, this one does not have an ioctl. 
>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and >> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to >> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE >> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). > > While this describes perfectly well what the consequences are of the > patches, it does not describe properly what the CAP actually expresses. > The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls directly". All other consequences are nice to > document, but the semantics of the CAP are missing. ? It expresses ability to handle 2 hcalls. What is missing? > We also usually try to keep KVM behavior unchanged with regards to older > versions until a CAP is enabled. In this case I don't think it matters > all that much, so I'm fine with declaring it as enabled by default. > Please document that this is a change in behavior versus older KVM > versions though. Ok! >> + >> + >> 5. The kvm_run structure >> ------------------------ >> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h >> index af326cd..85d8f26 100644 >> --- a/arch/powerpc/include/asm/kvm_host.h >> +++ b/arch/powerpc/include/asm/kvm_host.h >> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { >> spinlock_t tbacct_lock; >> u64 busy_stolen; >> u64 busy_preempt; >> + >> + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ >> #endif >> }; > > [...] >> >> > > [...] > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >> index 550f592..a39039a 100644 >> --- a/arch/powerpc/kvm/book3s_hv.c >> +++ b/arch/powerpc/kvm/book3s_hv.c >> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) >> ret = kvmppc_xics_hcall(vcpu, req); >> break; >> } /* fallthrough */ > > The fallthrough comment isn't accurate anymore. 
>
>> +        return RESUME_HOST;
>> +    case H_PUT_TCE:
>> +        ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +                        kvmppc_get_gpr(vcpu, 5),
>> +                        kvmppc_get_gpr(vcpu, 6));
>> +        if (ret == H_TOO_HARD)
>> +            return RESUME_HOST;
>> +        break;
>> +    case H_PUT_TCE_INDIRECT:
>> +        ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +                        kvmppc_get_gpr(vcpu, 5),
>> +                        kvmppc_get_gpr(vcpu, 6),
>> +                        kvmppc_get_gpr(vcpu, 7));
>> +        if (ret == H_TOO_HARD)
>> +            return RESUME_HOST;
>> +        break;
>> +    case H_STUFF_TCE:
>> +        ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +                        kvmppc_get_gpr(vcpu, 5),
>> +                        kvmppc_get_gpr(vcpu, 6),
>> +                        kvmppc_get_gpr(vcpu, 7));
>> +        if (ret == H_TOO_HARD)
>> +            return RESUME_HOST;
>> +        break;
>>      default:
>>          return RESUME_HOST;
>>      }
>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>      vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>      kvmppc_sanity_check(vcpu);
>>
>> +    /*
>> +     * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>> +     * half executed, we first read TCEs from the user, check them and
>> +     * return error if something went wrong and only then put TCEs into
>> +     * the TCE table.
>> +     *
>> +     * tce_tmp is a cache for TCEs to avoid stack allocation or
>> +     * kmalloc as the whole TCE list can take up to 512 items 8 bytes
>> +     * each (4096 bytes).
>> +     */
>> +    vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>> +    if (!vcpu->arch.tce_tmp)
>> +        goto free_vcpu;
>> +
>>      return vcpu;
>>
>>  free_vcpu:
>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>      unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>      unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>      spin_unlock(&vcpu->arch.vpa_update_lock);
>> +    kfree(vcpu->arch.tce_tmp);
>>      kvm_vcpu_uninit(vcpu);
>>      kmem_cache_free(kvm_vcpu_cache, vcpu);
>>  }
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index b02f91e..d35554e 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>      .long 0 /* 0x11c */
>>      .long 0 /* 0x120 */
>>      .long .kvmppc_h_bulk_remove - hcall_real_table
>> +    .long 0 /* 0x128 */
>> +    .long 0 /* 0x12c */
>> +    .long 0 /* 0x130 */
>> +    .long 0 /* 0x134 */
>> +    .long .kvmppc_h_stuff_tce - hcall_real_table
>> +    .long .kvmppc_h_put_tce_indirect - hcall_real_table
>>  hcall_real_table_end:
>>
>>  ignore_hdec:
>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>> index da0e0bc..91d4b45 100644
>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>      unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>      long rc;
>>
>> -    rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>> +    rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>> +    if (rc == H_TOO_HARD)
>> +        return EMULATE_FAIL;
>> +    kvmppc_set_gpr(vcpu, 3, rc);
>> +    return EMULATE_DONE;
>> +}
>> +
>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>> +{
>> +    unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>> +    unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>> +    unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>> +    unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>> +    long rc;
>> +
>> +    rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>> +            tce, npages);
>> +    if (rc == H_TOO_HARD)
>> +        return EMULATE_FAIL;
>> +    kvmppc_set_gpr(vcpu, 3, rc);
>> +    return EMULATE_DONE;
>> +}
>> +
>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>> +{
>> +    unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>> +    unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>> +    unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>> +    unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>> +    long rc;
>> +
>> +    rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>      if (rc == H_TOO_HARD)
>>          return EMULATE_FAIL;
>>      kvmppc_set_gpr(vcpu, 3, rc);
>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>          return kvmppc_h_pr_bulk_remove(vcpu);
>>      case H_PUT_TCE:
>>          return kvmppc_h_pr_put_tce(vcpu);
>> +    case H_PUT_TCE_INDIRECT:
>> +        return kvmppc_h_pr_put_tce_indirect(vcpu);
>> +    case H_STUFF_TCE:
>> +        return kvmppc_h_pr_stuff_tce(vcpu);
>>      case H_CEDE:
>>          vcpu->arch.shared->msr |= MSR_EE;
>>          kvm_vcpu_block(vcpu);
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 6316ee3..8465c2a 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>          r = 1;
>>          break;
>>  #endif
>> +    case KVM_CAP_SPAPR_MULTITCE:
>> +        r = 1;
>
> This should only be true for book3s.


We had this discussion with v2.

David:
===
So, in the case of MULTITCE, that's not quite right. PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well. We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.
===

Wrong?


--
Alexey

^ permalink raw reply [flat|nested] 160+ messages in thread
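For readers following the tce_tmp comment quoted above, the "read everything, check everything, only then write" idea can be sketched in plain userspace C. This is an illustration under stated assumptions, not the patch's code: put_tce_indirect(), tce_is_valid(), the SKETCH_* return codes and the "at least one permission bit set" rule are all hypothetical. Only the 512-entry limit and the goal of never leaving the table half updated come from the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCE_LIST_MAX 512        /* from the thread: up to 512 entries, 8 bytes each */
#define SKETCH_H_SUCCESS   0    /* hypothetical stand-ins for hcall return codes */
#define SKETCH_H_PARAMETER (-4)

/* Hypothetical validity rule for this sketch only: the two low bits are
 * treated as read/write permissions and at least one must be set. */
static int tce_is_valid(uint64_t tce)
{
    return (tce & 0x3) != 0;
}

/* Write npages entries from list into table[first..], all or nothing.
 * tce_tmp is a caller-provided scratch buffer of TCE_LIST_MAX entries,
 * playing the role of the preallocated per-vcpu cache. */
static long put_tce_indirect(uint64_t *table, size_t table_len, size_t first,
                             const uint64_t *list, size_t npages,
                             uint64_t *tce_tmp)
{
    size_t i;

    if (npages > TCE_LIST_MAX || first > table_len ||
        npages > table_len - first)
        return SKETCH_H_PARAMETER;

    /* Pass 1: copy and validate every entry; the table is not touched yet. */
    for (i = 0; i < npages; i++) {
        if (!tce_is_valid(list[i]))
            return SKETCH_H_PARAMETER;
        tce_tmp[i] = list[i];
    }

    /* Pass 2: every entry checked out, so commit the whole list. */
    memcpy(&table[first], tce_tmp, npages * sizeof(*tce_tmp));
    return SKETCH_H_SUCCESS;
}
```

Because the scratch buffer is allocated once up front, the handler itself never has to allocate memory, which appears to be the motivation for the kmalloc() in kvmppc_core_vcpu_create() in the quoted hunk.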
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-06-17 8:02 ` Alexander Graf
  0 siblings, 0 replies; 160+ messages in thread
From: Alexander Graf @ 2013-06-17 8:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:

> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>
>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI. These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>>>
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>>
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>>>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>
>> Only a few minor nits. Ben already commented on implementation details.
>>
>>>
>>> ---
>>> Changelog:
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>>
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even start
>>> writing to TCE table if we cannot get TCEs from the user and they are
>>> invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with failthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>> ---
>>>  Documentation/virtual/kvm/api.txt       |  17 ++
>>>  arch/powerpc/include/asm/kvm_host.h     |   2 +
>>>  arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>  arch/powerpc/kvm/book3s_64_vio.c        | 118 ++++++++++++++
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c     | 266 +++++++++++++++++++++++++++----
>>>  arch/powerpc/kvm/book3s_hv.c            |  39 +++++
>>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>  arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>>  arch/powerpc/kvm/powerpc.c              |   3 +
>>>  include/uapi/linux/kvm.h                |   1 +
>>>  10 files changed, 473 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>> index 5f91eda..6c082ff 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>>  handled.
>>>
>>>
>>> +4.83 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>>> +handling is supported by the kernel. This significanly accelerates DMA
>>> +operations for PPC KVM guests.
>>> +
>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE
>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>>
>
>> While this describes perfectly well what the consequences are of the
>> patches, it does not describe properly what the CAP actually expresses.
>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>> document, but the semantics of the CAP are missing.
>
>
> ? It expresses ability to handle 2 hcalls. What is missing?

You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap.

>
>
>> We also usually try to keep KVM behavior unchanged with regards to older
>> versions until a CAP is enabled. In this case I don't think it matters
>> all that much, so I'm fine with declaring it as enabled by default.
>> Please document that this is a change in behavior versus older KVM
>> versions though.
>
>
> Ok!
>
>
>>> +
>>> +
>>>  5. The kvm_run structure
>>>  ------------------------
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>> index af326cd..85d8f26 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>>      spinlock_t tbacct_lock;
>>>      u64 busy_stolen;
>>>      u64 busy_preempt;
>>> +
>>> +    unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */
>>>  #endif
>>>  };
>>
>> [...]
>>>
>>>
>>
>> [...]
>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 550f592..a39039a 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>>              ret = kvmppc_xics_hcall(vcpu, req);
>>>              break;
>>>          } /* fallthrough */
>>
>> The fallthrough comment isn't accurate anymore.
>>
>>> +        return RESUME_HOST;
>>> +    case H_PUT_TCE:
>>> +        ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> +                        kvmppc_get_gpr(vcpu, 5),
>>> +                        kvmppc_get_gpr(vcpu, 6));
>>> +        if (ret == H_TOO_HARD)
>>> +            return RESUME_HOST;
>>> +        break;
>>> +    case H_PUT_TCE_INDIRECT:
>>> +        ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> +                        kvmppc_get_gpr(vcpu, 5),
>>> +                        kvmppc_get_gpr(vcpu, 6),
>>> +                        kvmppc_get_gpr(vcpu, 7));
>>> +        if (ret == H_TOO_HARD)
>>> +            return RESUME_HOST;
>>> +        break;
>>> +    case H_STUFF_TCE:
>>> +        ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> +                        kvmppc_get_gpr(vcpu, 5),
>>> +                        kvmppc_get_gpr(vcpu, 6),
>>> +                        kvmppc_get_gpr(vcpu, 7));
>>> +        if (ret == H_TOO_HARD)
>>> +            return RESUME_HOST;
>>> +        break;
>>>      default:
>>>          return RESUME_HOST;
>>>      }
>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>>      vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>>      kvmppc_sanity_check(vcpu);
>>>
>>> +    /*
>>> +     * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>> +     * half executed, we first read TCEs from the user, check them and
>>> +     * return error if something went wrong and only then put TCEs into
>>> +     * the TCE table.
>>> +     *
>>> +     * tce_tmp is a cache for TCEs to avoid stack allocation or
>>> +     * kmalloc as the whole TCE list can take up to 512 items 8 bytes
>>> +     * each (4096 bytes).
>>> +     */
>>> +    vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>> +    if (!vcpu->arch.tce_tmp)
>>> +        goto free_vcpu;
>>> +
>>>      return vcpu;
>>>
>>>  free_vcpu:
>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>>      unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>>      unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>>      spin_unlock(&vcpu->arch.vpa_update_lock);
>>> +    kfree(vcpu->arch.tce_tmp);
>>>      kvm_vcpu_uninit(vcpu);
>>>      kmem_cache_free(kvm_vcpu_cache, vcpu);
>>>  }
>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> index b02f91e..d35554e 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>>      .long 0 /* 0x11c */
>>>      .long 0 /* 0x120 */
>>>      .long .kvmppc_h_bulk_remove - hcall_real_table
>>> +    .long 0 /* 0x128 */
>>> +    .long 0 /* 0x12c */
>>> +    .long 0 /* 0x130 */
>>> +    .long 0 /* 0x134 */
>>> +    .long .kvmppc_h_stuff_tce - hcall_real_table
>>> +    .long .kvmppc_h_put_tce_indirect - hcall_real_table
>>>  hcall_real_table_end:
>>>
>>>  ignore_hdec:
>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>> index da0e0bc..91d4b45 100644
>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>>      unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>      long rc;
>>>
>>> -    rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>> +    rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>> +    if (rc == H_TOO_HARD)
>>> +        return EMULATE_FAIL;
>>> +    kvmppc_set_gpr(vcpu, 3, rc);
>>> +    return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>> +{
>>> +    unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>> +    unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>> +    unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>> +    unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>> +    long rc;
>>> +
>>> +    rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>> +            tce, npages);
>>> +    if (rc == H_TOO_HARD)
>>> +        return EMULATE_FAIL;
>>> +    kvmppc_set_gpr(vcpu, 3, rc);
>>> +    return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>> +{
>>> +    unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>> +    unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>> +    unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>> +    unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>> +    long rc;
>>> +
>>> +    rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>>      if (rc == H_TOO_HARD)
>>>          return EMULATE_FAIL;
>>>      kvmppc_set_gpr(vcpu, 3, rc);
>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>>          return kvmppc_h_pr_bulk_remove(vcpu);
>>>      case H_PUT_TCE:
>>>          return kvmppc_h_pr_put_tce(vcpu);
>>> +    case H_PUT_TCE_INDIRECT:
>>> +        return kvmppc_h_pr_put_tce_indirect(vcpu);
>>> +    case H_STUFF_TCE:
>>> +        return kvmppc_h_pr_stuff_tce(vcpu);
>>>      case H_CEDE:
>>>          vcpu->arch.shared->msr |= MSR_EE;
>>>          kvm_vcpu_block(vcpu);
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 6316ee3..8465c2a 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>>          r = 1;
>>>          break;
>>>  #endif
>>> +    case KVM_CAP_SPAPR_MULTITCE:
>>> +        r = 1;
>>
>> This should only be true for book3s.
>
>
> We had this discussion with v2.
>
> David:
> ===
> So, in the case of MULTITCE, that's not quite right. PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well. We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> ===
>
> Wrong?

Partially. BookE can not emulate a PAPR system as it stands today.
The code should of course be generic and be available generically. But your code only patches the hypercall's availability in the book3s hypercall handlers, so that specific kernel version can only handle these hypercalls on book3s. Whether a later version of the kernel will be able to handle them is a different question. Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
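Editorial sketch: the comment quoted above from kvmppc_core_vcpu_create() describes a two-phase scheme, where the whole TCE list is first copied into tce_tmp and checked, and the TCE table is written only if every entry passed. The C fragment below is a minimal user-space illustration of that idea, not the patch's code. All sketch_* names, the 4K page mask, the permission-bit mask and the fixed-size table are assumptions made for the example; only the 512-entry limit and the validate-then-commit ordering come from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; the sketch_ and SKETCH_ names are hypothetical. */
#define SKETCH_H_SUCCESS   0
#define SKETCH_H_PARAMETER (-4)
#define SKETCH_MAX_TCES    512            /* 512 entries * 8 bytes = 4096 bytes */
#define SKETCH_PERM_MASK   0x3ULL         /* assumed read/write permission bits */
#define SKETCH_PAGE_MASK   (~0xfffULL)    /* assumed 4K IOMMU page size */

struct sketch_tce_table {
	uint64_t entries[1024];
	size_t nentries;                  /* number of usable entries */
};

/* Reject a TCE that has bits set outside the page address and permissions. */
static int sketch_validate_tce(uint64_t tce)
{
	if (tce & ~(SKETCH_PAGE_MASK | SKETCH_PERM_MASK))
		return SKETCH_H_PARAMETER;
	return SKETCH_H_SUCCESS;
}

/*
 * Phase 1 copies the caller's list into the staging buffer and validates
 * every entry. Phase 2 writes the table only if phase 1 succeeded for the
 * whole list, so a bad entry never leaves the table half updated.
 */
long sketch_put_tce_indirect(struct sketch_tce_table *tbl, uint64_t *tmp,
			     size_t first, const uint64_t *list, size_t npages)
{
	size_t i;

	if (npages > SKETCH_MAX_TCES || first > tbl->nentries ||
	    npages > tbl->nentries - first)
		return SKETCH_H_PARAMETER;

	for (i = 0; i < npages; i++) {
		tmp[i] = list[i];
		if (sketch_validate_tce(tmp[i]) != SKETCH_H_SUCCESS)
			return SKETCH_H_PARAMETER;
	}

	for (i = 0; i < npages; i++)
		tbl->entries[first + i] = tmp[i];

	return SKETCH_H_SUCCESS;
}
```

The staging buffer is passed in by the caller here to mirror the per-vcpu tce_tmp allocation in the patch, which avoids both a 4096-byte stack frame and an allocation on every hypercall.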
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:02 ` Alexander Graf
  (?)
@ 2013-06-17  8:34 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 160+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  8:34 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 06:02 PM, Alexander Graf wrote:
>
> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>
>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>
>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>
>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>> devices or emulated PCI. These calls allow adding multiple entries
>>>> (up to 512) into the TCE table in one call which saves time on
>>>> transition to/from real mode.
>>>>
>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>> (copied from user and verified) before writing the whole list into
>>>> the TCE table. This cache will be utilized more in the upcoming
>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>> mode in the case if the real mode handler failed for some reason.
>>>>
>>>> This adds a guest physical to host real address converter
>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>
>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>> so in order to support the functionality of this patch, QEMU
>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>> hypertas property only if the capability is present, otherwise
>>>> there will be serious performance degradation.
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>
>>> Only a few minor nits. Ben already commented on implementation details.
>>>
>>>>
>>>> ---
>>>> Changelog:
>>>> 2013/06/05:
>>>> * fixed mistype about IBMVIO in the commit message
>>>> * updated doc and moved it to another section
>>>> * changed capability number
>>>>
>>>> 2013/05/21:
>>>> * added kvm_vcpu_arch::tce_tmp
>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>> invalid
>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>> * fixed bug with failthrough for H_IPI
>>>> * removed all get_user() from real mode handlers
>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>> ---
>>>> Documentation/virtual/kvm/api.txt | 17 ++
>>>> arch/powerpc/include/asm/kvm_host.h | 2 +
>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++----
>>>> arch/powerpc/kvm/book3s_hv.c | 39 +++++
>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>> include/uapi/linux/kvm.h | 1 +
>>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index 5f91eda..6c082ff 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>>> handled.
>>>>
>>>>
>>>> +4.83 KVM_CAP_PPC_MULTITCE
>>>> +
>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>> +Architectures: ppc
>>>> +Type: vm
>>>> +
>>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>>>> +handling is supported by the kernel. This significanly accelerates DMA
>>>> +operations for PPC KVM guests.
>>>> +
>>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE
>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>
>>
>>> While this describes perfectly well what the consequences are of the
>>> patches, it does not describe properly what the CAP actually expresses.
>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>>> document, but the semantics of the CAP are missing.
>>
>>
>> ? It expresses ability to handle 2 hcalls. What is missing?
>
> You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap.

This file does not mention qemu at all. And the interface is - qemu (or
kvmtool could do that) just adds "hcall-multi-tce" to
"ibm,hypertas-functions" but this is for pseries linux and AIX could always
do it (no idea about it). Does it really have to be in this file?

>>> We also usually try to keep KVM behavior unchanged with regards to older
>>> versions until a CAP is enabled. In this case I don't think it matters
>>> all that much, so I'm fine with declaring it as enabled by default.
>>> Please document that this is a change in behavior versus older KVM
>>> versions though.
>>
>>
>> Ok!
>>
>>
>>>> +
>>>> +
>>>> 5. The kvm_run structure
>>>> ------------------------
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index af326cd..85d8f26 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>>> 	spinlock_t tbacct_lock;
>>>> 	u64 busy_stolen;
>>>> 	u64 busy_preempt;
>>>> +
>>>> +	unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */
>>>> #endif
>>>> };
>>>
>>> [...]
>>>>
>>>>
>>>
>>> [...]
>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>> index 550f592..a39039a 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>>> 			ret = kvmppc_xics_hcall(vcpu, req);
>>>> 			break;
>>>> 		} /* fallthrough */
>>>
>>> The fallthrough comment isn't accurate anymore.
>>>
>>>> +		return RESUME_HOST;
>>>> +	case H_PUT_TCE:
>>>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +				kvmppc_get_gpr(vcpu, 5),
>>>> +				kvmppc_get_gpr(vcpu, 6));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> +	case H_PUT_TCE_INDIRECT:
>>>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +				kvmppc_get_gpr(vcpu, 5),
>>>> +				kvmppc_get_gpr(vcpu, 6),
>>>> +				kvmppc_get_gpr(vcpu, 7));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> +	case H_STUFF_TCE:
>>>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +				kvmppc_get_gpr(vcpu, 5),
>>>> +				kvmppc_get_gpr(vcpu, 6),
>>>> +				kvmppc_get_gpr(vcpu, 7));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> 	default:
>>>> 		return RESUME_HOST;
>>>> 	}
>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>>> 	kvmppc_sanity_check(vcpu);
>>>>
>>>> +	/*
>>>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>>> +	 * half executed, we first read TCEs from the user, check them and
>>>> +	 * return error if something went wrong and only then put TCEs into
>>>> +	 * the TCE table.
>>>> +	 *
>>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>>> +	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
>>>> +	 * each (4096 bytes).
>>>> +	 */
>>>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>>> +	if (!vcpu->arch.tce_tmp)
>>>> +		goto free_vcpu;
>>>> +
>>>> 	return vcpu;
>>>>
>>>> free_vcpu:
>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>>> +	kfree(vcpu->arch.tce_tmp);
>>>> 	kvm_vcpu_uninit(vcpu);
>>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>>> }
>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> index b02f91e..d35554e 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>>> 	.long 0 /* 0x11c */
>>>> 	.long 0 /* 0x120 */
>>>> 	.long .kvmppc_h_bulk_remove - hcall_real_table
>>>> +	.long 0 /* 0x128 */
>>>> +	.long 0 /* 0x12c */
>>>> +	.long 0 /* 0x130 */
>>>> +	.long 0 /* 0x134 */
>>>> +	.long .kvmppc_h_stuff_tce - hcall_real_table
>>>> +	.long .kvmppc_h_put_tce_indirect - hcall_real_table
>>>> hcall_real_table_end:
>>>>
>>>> ignore_hdec:
>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>>> index da0e0bc..91d4b45 100644
>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>> 	long rc;
>>>>
>>>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>>> +	if (rc == H_TOO_HARD)
>>>> +		return EMULATE_FAIL;
>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>> +	return EMULATE_DONE;
>>>> +}
>>>> +
>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>> +	long rc;
>>>> +
>>>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>>> +			tce, npages);
>>>> +	if (rc == H_TOO_HARD)
>>>> +		return EMULATE_FAIL;
>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>> +	return EMULATE_DONE;
>>>> +}
>>>> +
>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>> +	long rc;
>>>> +
>>>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>>> 	if (rc == H_TOO_HARD)
>>>> 		return EMULATE_FAIL;
>>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>>> 	case H_PUT_TCE:
>>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>>> +	case H_PUT_TCE_INDIRECT:
>>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>>> +	case H_STUFF_TCE:
>>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>>> 	case H_CEDE:
>>>> 		vcpu->arch.shared->msr |= MSR_EE;
>>>> 		kvm_vcpu_block(vcpu);
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 6316ee3..8465c2a 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>>> 		r = 1;
>>>> 		break;
>>>> #endif
>>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>>> +		r = 1;
>>>
>>> This should only be true for book3s.
>>
>>
>> We had this discussion with v2.
>>
>> David:
>> ===
>> So, in the case of MULTITCE, that's not quite right. PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well. We can't make it dependent on PAPR
>> mode being selected, because that's enabled per-vcpu, whereas these
>> capabilities are queried on the VM before the vcpus are created.
>> ===
>>
>> Wrong?
> Partially. BookE can not emulate a PAPR system as it stands today.

Oh. Ok. So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)?


-- 
Alexey

^ permalink raw reply	[flat|nested] 160+ messages in thread
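Editorial sketch: the compile-time option raised in the closing question above could look roughly like the fragment below, which reports the capability only when the build targets Book3S 64-bit. This is an assumption about one possible shape of the fix, not what was eventually merged. sketch_check_extension() and SKETCH_KVM_CAP_SPAPR_MULTITCE are hypothetical stand-ins, the capability number is a placeholder, and CONFIG_PPC_BOOK3S_64 is defined by hand here only so that the fragment compiles outside a kernel tree.

```c
/*
 * Placeholder value. The real capability number lives in
 * include/uapi/linux/kvm.h and is not reproduced here.
 */
#define SKETCH_KVM_CAP_SPAPR_MULTITCE 1000

/* Simulates a Book3S 64-bit build; in the kernel this comes from Kconfig. */
#define CONFIG_PPC_BOOK3S_64 1

/* Hypothetical stand-in for the capability switch in powerpc.c. */
int sketch_check_extension(long ext)
{
	int r;

	switch (ext) {
#ifdef CONFIG_PPC_BOOK3S_64
	/* Advertised only where the hypercall handlers are actually wired up. */
	case SKETCH_KVM_CAP_SPAPR_MULTITCE:
		r = 1;
		break;
#endif
	default:
		r = 0;
		break;
	}

	return r;
}
```

With the guard in place, a build without CONFIG_PPC_BOOK3S_64 falls through to the default case and reports 0, so userspace would not advertise "hcall-multi-tce" to the guest on such a kernel.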
we want to minimize the chance of having H_PUT_TCE_INDIRECT >>>> + * half executed, we first read TCEs from the user, check them and >>>> + * return error if something went wrong and only then put TCEs into >>>> + * the TCE table. >>>> + * >>>> + * tce_tmp is a cache for TCEs to avoid stack allocation or >>>> + * kmalloc as the whole TCE list can take up to 512 items 8 bytes >>>> + * each (4096 bytes). >>>> + */ >>>> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); >>>> + if (!vcpu->arch.tce_tmp) >>>> + goto free_vcpu; >>>> + >>>> return vcpu; >>>> >>>> free_vcpu: >>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) >>>> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); >>>> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); >>>> spin_unlock(&vcpu->arch.vpa_update_lock); >>>> + kfree(vcpu->arch.tce_tmp); >>>> kvm_vcpu_uninit(vcpu); >>>> kmem_cache_free(kvm_vcpu_cache, vcpu); >>>> } >>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>> index b02f91e..d35554e 100644 >>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>> @@ -1490,6 +1490,12 @@ hcall_real_table: >>>> .long 0 /* 0x11c */ >>>> .long 0 /* 0x120 */ >>>> .long .kvmppc_h_bulk_remove - hcall_real_table >>>> + .long 0 /* 0x128 */ >>>> + .long 0 /* 0x12c */ >>>> + .long 0 /* 0x130 */ >>>> + .long 0 /* 0x134 */ >>>> + .long .kvmppc_h_stuff_tce - hcall_real_table >>>> + .long .kvmppc_h_put_tce_indirect - hcall_real_table >>>> hcall_real_table_end: >>>> >>>> ignore_hdec: >>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c >>>> index da0e0bc..91d4b45 100644 >>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c >>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c >>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) >>>> unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>> long rc; >>>> >>>> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); >>>> + rc = 
kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); >>>> + if (rc == H_TOO_HARD) >>>> + return EMULATE_FAIL; >>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>> + return EMULATE_DONE; >>>> +} >>>> + >>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) >>>> +{ >>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >>>> + unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>> + long rc; >>>> + >>>> + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, >>>> + tce, npages); >>>> + if (rc == H_TOO_HARD) >>>> + return EMULATE_FAIL; >>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>> + return EMULATE_DONE; >>>> +} >>>> + >>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) >>>> +{ >>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >>>> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); >>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>> + long rc; >>>> + >>>> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); >>>> if (rc == H_TOO_HARD) >>>> return EMULATE_FAIL; >>>> kvmppc_set_gpr(vcpu, 3, rc); >>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) >>>> return kvmppc_h_pr_bulk_remove(vcpu); >>>> case H_PUT_TCE: >>>> return kvmppc_h_pr_put_tce(vcpu); >>>> + case H_PUT_TCE_INDIRECT: >>>> + return kvmppc_h_pr_put_tce_indirect(vcpu); >>>> + case H_STUFF_TCE: >>>> + return kvmppc_h_pr_stuff_tce(vcpu); >>>> case H_CEDE: >>>> vcpu->arch.shared->msr |= MSR_EE; >>>> kvm_vcpu_block(vcpu); >>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>> index 6316ee3..8465c2a 100644 >>>> --- a/arch/powerpc/kvm/powerpc.c >>>> +++ b/arch/powerpc/kvm/powerpc.c >>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) >>>> r = 1; >>>> break; >>>> #endif >>>> + case KVM_CAP_SPAPR_MULTITCE: >>>> + r = 1; >>> >>> This should only be true 
for book3s. >> >> >> We had this discussion with v2. >> >> David: >> === >> So, in the case of MULTITCE, that's not quite right. PR KVM can >> emulate a PAPR system on a BookE machine, and there's no reason not to >> allow TCE acceleration as well. We can't make it dependent on PAPR >> mode being selected, because that's enabled per-vcpu, whereas these >> capabilities are queried on the VM before the vcpus are created. >> === >> >> Wrong? > Partially. BookE can not emulate a PAPR system as it stands today. Oh. Ok. So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)? -- Alexey ^ permalink raw reply [flat|nested] 160+ messages in thread
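Editorial sketch of the "validate everything first, then write" approach that the quoted tce_tmp comment describes. This is a user-space illustration under stated assumptions, not the patch's code: every `sketch_` / `SKETCH_` name, the permission-bit layout, and the return codes are hypothetical. Only the 512-entry limit and the two-phase idea come from the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_TCES  512      /* from the patch comment: up to 512 entries */
#define SKETCH_TCE_READ  0x1ULL   /* hypothetical permission bits */
#define SKETCH_TCE_WRITE 0x2ULL
#define SKETCH_PAGE_MASK 0xFFFULL /* assumption: 4K IOMMU pages */

/* Hypothetical return codes, not the real PAPR values. */
enum sketch_rc { SKETCH_OK = 0, SKETCH_PARAMETER = -4 };

struct sketch_tce_table {
    uint64_t *entries;
    size_t    nentries;
};

/* Assumed validity rule: the low 12 bits may only carry permission flags. */
static int sketch_tce_valid(uint64_t tce)
{
    return (tce & SKETCH_PAGE_MASK & ~(SKETCH_TCE_READ | SKETCH_TCE_WRITE)) == 0;
}

/*
 * tmp plays the role of the preallocated per-vcpu scratch buffer and must
 * hold at least SKETCH_MAX_TCES entries.
 */
static int sketch_put_tce_indirect(struct sketch_tce_table *tbl, uint64_t *tmp,
                                   size_t first, const uint64_t *list,
                                   size_t npages)
{
    size_t i;

    if (npages > SKETCH_MAX_TCES || first > tbl->nentries ||
        npages > tbl->nentries - first)
        return SKETCH_PARAMETER;

    /* Phase 1: copy and validate; the table is not touched yet. */
    for (i = 0; i < npages; i++) {
        if (!sketch_tce_valid(list[i]))
            return SKETCH_PARAMETER;
        tmp[i] = list[i];
    }

    /* Phase 2: every entry passed, so commit the whole batch. */
    for (i = 0; i < npages; i++)
        tbl->entries[first + i] = tmp[i];

    return SKETCH_OK;
}
```

With this shape a bad entry anywhere in the list leaves the table unchanged, which is the "not half executed" property the comment is after.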
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls 2013-06-17 8:34 ` Alexey Kardashevskiy (?) @ 2013-06-17 8:40 ` Alexander Graf -1 siblings, 0 replies; 160+ messages in thread From: Alexander Graf @ 2013-06-17 8:40 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote: > On 06/17/2013 06:02 PM, Alexander Graf wrote: >> >> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote: >> >>> On 06/17/2013 08:06 AM, Alexander Graf wrote: >>>> >>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: >>>> >>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO >>>>> devices or emulated PCI. These calls allow adding multiple entries >>>>> (up to 512) into the TCE table in one call which saves time on >>>>> transition to/from real mode. >>>>> >>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >>>>> (copied from user and verified) before writing the whole list into >>>>> the TCE table. This cache will be utilized more in the upcoming >>>>> VFIO/IOMMU support to continue TCE list processing in the virtual >>>>> mode in the case if the real mode handler failed for some reason. >>>>> >>>>> This adds a guest physical to host real address converter >>>>> and calls the existing H_PUT_TCE handler. The converting function >>>>> is going to be fully utilized by upcoming VFIO supporting patches. >>>>> >>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, >>>>> so in order to support the functionality of this patch, QEMU >>>>> needs to query for this capability and set the "hcall-multi-tce" >>>>> hypertas property only if the capability is present, otherwise >>>>> there will be serious performance degradation. 
>>>>> >>>>> Cc: David Gibson <david@gibson.dropbear.id.au> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>> Signed-off-by: Paul Mackerras <paulus@samba.org> >>>> >>>> Only a few minor nits. Ben already commented on implementation details. >>>> >>>>> >>>>> --- >>>>> Changelog: >>>>> 2013/06/05: >>>>> * fixed mistype about IBMVIO in the commit message >>>>> * updated doc and moved it to another section >>>>> * changed capability number >>>>> >>>>> 2013/05/21: >>>>> * added kvm_vcpu_arch::tce_tmp >>>>> * removed cleanup if put_indirect failed, instead we do not even start >>>>> writing to TCE table if we cannot get TCEs from the user and they are >>>>> invalid >>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce >>>>> and kvmppc_emulated_validate_tce (for the previous item) >>>>> * fixed bug with failthrough for H_IPI >>>>> * removed all get_user() from real mode handlers >>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) >>>>> --- >>>>> Documentation/virtual/kvm/api.txt | 17 ++ >>>>> arch/powerpc/include/asm/kvm_host.h | 2 + >>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >>>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ >>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- >>>>> arch/powerpc/kvm/book3s_hv.c | 39 +++++ >>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >>>>> arch/powerpc/kvm/powerpc.c | 3 + >>>>> include/uapi/linux/kvm.h | 1 + >>>>> 10 files changed, 473 insertions(+), 32 deletions(-) >>>>> >>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt >>>>> index 5f91eda..6c082ff 100644 >>>>> --- a/Documentation/virtual/kvm/api.txt >>>>> +++ b/Documentation/virtual/kvm/api.txt >>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be >>>>> handled. 
>>>>> >>>>> >>>>> +4.83 KVM_CAP_PPC_MULTITCE >>>>> + >>>>> +Capability: KVM_CAP_PPC_MULTITCE >>>>> +Architectures: ppc >>>>> +Type: vm >>>>> + >>>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls >>>>> +handling is supported by the kernel. This significanly accelerates DMA >>>>> +operations for PPC KVM guests. >>>>> + >>>>> +Unlike other capabilities in this section, this one does not have an ioctl. >>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and >>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to >>>>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE >>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). >>>> >>> >>>> While this describes perfectly well what the consequences are of the >>>> patches, it does not describe properly what the CAP actually expresses. >>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and >>>> H_STUFF_TCE hypercalls directly". All other consequences are nice to >>>> document, but the semantics of the CAP are missing. >>> >>> >>> ? It expresses ability to handle 2 hcalls. What is missing? >> >> You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap. > > > This file does not mention qemu at all. And the interface is - qemu (or > kvmtool could do that) just adds "hcall-multi-tce" to > "ibm,hypertas-functions" but this is for pseries linux and AIX could always > do it (no idea about it). Does it really have to be in this file? Ok, let's go back a step. What does this CAP describe? Don't look at the description you wrote above. Just write a new one. What exactly can user space expect when it finds this CAP? > > > >>>> We also usually try to keep KVM behavior unchanged with regards to older >>>> versions until a CAP is enabled. 
In this case I don't think it matters >>>> all that much, so I'm fine with declaring it as enabled by default. >>>> Please document that this is a change in behavior versus older KVM >>>> versions though. >>> >>> >>> Ok! >>> >>> >>>>> + >>>>> + >>>>> 5. The kvm_run structure >>>>> ------------------------ >>>>> >>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h >>>>> index af326cd..85d8f26 100644 >>>>> --- a/arch/powerpc/include/asm/kvm_host.h >>>>> +++ b/arch/powerpc/include/asm/kvm_host.h >>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { >>>>> spinlock_t tbacct_lock; >>>>> u64 busy_stolen; >>>>> u64 busy_preempt; >>>>> + >>>>> + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ >>>>> #endif >>>>> }; >>>> >>>> [...] >>>>> >>>>> >>>> >>>> [...] >>>> >>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >>>>> index 550f592..a39039a 100644 >>>>> --- a/arch/powerpc/kvm/book3s_hv.c >>>>> +++ b/arch/powerpc/kvm/book3s_hv.c >>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) >>>>> ret = kvmppc_xics_hcall(vcpu, req); >>>>> break; >>>>> } /* fallthrough */ >>>> >>>> The fallthrough comment isn't accurate anymore. 
>>>> >>>>> + return RESUME_HOST; >>>>> + case H_PUT_TCE: >>>>> + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >>>>> + kvmppc_get_gpr(vcpu, 5), >>>>> + kvmppc_get_gpr(vcpu, 6)); >>>>> + if (ret == H_TOO_HARD) >>>>> + return RESUME_HOST; >>>>> + break; >>>>> + case H_PUT_TCE_INDIRECT: >>>>> + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), >>>>> + kvmppc_get_gpr(vcpu, 5), >>>>> + kvmppc_get_gpr(vcpu, 6), >>>>> + kvmppc_get_gpr(vcpu, 7)); >>>>> + if (ret == H_TOO_HARD) >>>>> + return RESUME_HOST; >>>>> + break; >>>>> + case H_STUFF_TCE: >>>>> + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >>>>> + kvmppc_get_gpr(vcpu, 5), >>>>> + kvmppc_get_gpr(vcpu, 6), >>>>> + kvmppc_get_gpr(vcpu, 7)); >>>>> + if (ret == H_TOO_HARD) >>>>> + return RESUME_HOST; >>>>> + break; >>>>> default: >>>>> return RESUME_HOST; >>>>> } >>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) >>>>> vcpu->arch.cpu_type = KVM_CPU_3S_64; >>>>> kvmppc_sanity_check(vcpu); >>>>> >>>>> + /* >>>>> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT >>>>> + * half executed, we first read TCEs from the user, check them and >>>>> + * return error if something went wrong and only then put TCEs into >>>>> + * the TCE table. >>>>> + * >>>>> + * tce_tmp is a cache for TCEs to avoid stack allocation or >>>>> + * kmalloc as the whole TCE list can take up to 512 items 8 bytes >>>>> + * each (4096 bytes). 
>>>>> + */ >>>>> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); >>>>> + if (!vcpu->arch.tce_tmp) >>>>> + goto free_vcpu; >>>>> + >>>>> return vcpu; >>>>> >>>>> free_vcpu: >>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) >>>>> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); >>>>> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); >>>>> spin_unlock(&vcpu->arch.vpa_update_lock); >>>>> + kfree(vcpu->arch.tce_tmp); >>>>> kvm_vcpu_uninit(vcpu); >>>>> kmem_cache_free(kvm_vcpu_cache, vcpu); >>>>> } >>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>>> index b02f91e..d35554e 100644 >>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>>> @@ -1490,6 +1490,12 @@ hcall_real_table: >>>>> .long 0 /* 0x11c */ >>>>> .long 0 /* 0x120 */ >>>>> .long .kvmppc_h_bulk_remove - hcall_real_table >>>>> + .long 0 /* 0x128 */ >>>>> + .long 0 /* 0x12c */ >>>>> + .long 0 /* 0x130 */ >>>>> + .long 0 /* 0x134 */ >>>>> + .long .kvmppc_h_stuff_tce - hcall_real_table >>>>> + .long .kvmppc_h_put_tce_indirect - hcall_real_table >>>>> hcall_real_table_end: >>>>> >>>>> ignore_hdec: >>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c >>>>> index da0e0bc..91d4b45 100644 >>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c >>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c >>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) >>>>> unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>>> long rc; >>>>> >>>>> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); >>>>> + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); >>>>> + if (rc == H_TOO_HARD) >>>>> + return EMULATE_FAIL; >>>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>>> + return EMULATE_DONE; >>>>> +} >>>>> + >>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) >>>>> +{ >>>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 
5); >>>>> + unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>>> + long rc; >>>>> + >>>>> + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, >>>>> + tce, npages); >>>>> + if (rc == H_TOO_HARD) >>>>> + return EMULATE_FAIL; >>>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>>> + return EMULATE_DONE; >>>>> +} >>>>> + >>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) >>>>> +{ >>>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >>>>> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); >>>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>>> + long rc; >>>>> + >>>>> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); >>>>> if (rc == H_TOO_HARD) >>>>> return EMULATE_FAIL; >>>>> kvmppc_set_gpr(vcpu, 3, rc); >>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) >>>>> return kvmppc_h_pr_bulk_remove(vcpu); >>>>> case H_PUT_TCE: >>>>> return kvmppc_h_pr_put_tce(vcpu); >>>>> + case H_PUT_TCE_INDIRECT: >>>>> + return kvmppc_h_pr_put_tce_indirect(vcpu); >>>>> + case H_STUFF_TCE: >>>>> + return kvmppc_h_pr_stuff_tce(vcpu); >>>>> case H_CEDE: >>>>> vcpu->arch.shared->msr |= MSR_EE; >>>>> kvm_vcpu_block(vcpu); >>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>>> index 6316ee3..8465c2a 100644 >>>>> --- a/arch/powerpc/kvm/powerpc.c >>>>> +++ b/arch/powerpc/kvm/powerpc.c >>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) >>>>> r = 1; >>>>> break; >>>>> #endif >>>>> + case KVM_CAP_SPAPR_MULTITCE: >>>>> + r = 1; >>>> >>>> This should only be true for book3s. >>> >>> >>> We had this discussion with v2. >>> >>> David: >>> === >>> So, in the case of MULTITCE, that's not quite right. PR KVM can >>> emulate a PAPR system on a BookE machine, and there's no reason not to >>> allow TCE acceleration as well. 
We can't make it dependent on PAPR >>> mode being selected, because that's enabled per-vcpu, whereas these >>> capabilities are queried on the VM before the vcpus are created. >>> === >>> >>> Wrong? > >> Partially. BookE can not emulate a PAPR system as it stands today. > > Oh. > Ok. > So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)? The former. Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls @ 2013-06-17 8:40 ` Alexander Graf 0 siblings, 0 replies; 160+ messages in thread From: Alexander Graf @ 2013-06-17 8:40 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote: > On 06/17/2013 06:02 PM, Alexander Graf wrote: >> >> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote: >> >>> On 06/17/2013 08:06 AM, Alexander Graf wrote: >>>> >>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: >>>> >>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO >>>>> devices or emulated PCI. These calls allow adding multiple entries >>>>> (up to 512) into the TCE table in one call which saves time on >>>>> transition to/from real mode. >>>>> >>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >>>>> (copied from user and verified) before writing the whole list into >>>>> the TCE table. This cache will be utilized more in the upcoming >>>>> VFIO/IOMMU support to continue TCE list processing in the virtual >>>>> mode in the case if the real mode handler failed for some reason. >>>>> >>>>> This adds a guest physical to host real address converter >>>>> and calls the existing H_PUT_TCE handler. The converting function >>>>> is going to be fully utilized by upcoming VFIO supporting patches. >>>>> >>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, >>>>> so in order to support the functionality of this patch, QEMU >>>>> needs to query for this capability and set the "hcall-multi-tce" >>>>> hypertas property only if the capability is present, otherwise >>>>> there will be serious performance degradation. 
>>>>> >>>>> Cc: David Gibson <david@gibson.dropbear.id.au> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>> Signed-off-by: Paul Mackerras <paulus@samba.org> >>>> >>>> Only a few minor nits. Ben already commented on implementation details. >>>> >>>>> >>>>> --- >>>>> Changelog: >>>>> 2013/06/05: >>>>> * fixed mistype about IBMVIO in the commit message >>>>> * updated doc and moved it to another section >>>>> * changed capability number >>>>> >>>>> 2013/05/21: >>>>> * added kvm_vcpu_arch::tce_tmp >>>>> * removed cleanup if put_indirect failed, instead we do not even start >>>>> writing to TCE table if we cannot get TCEs from the user and they are >>>>> invalid >>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce >>>>> and kvmppc_emulated_validate_tce (for the previous item) >>>>> * fixed bug with failthrough for H_IPI >>>>> * removed all get_user() from real mode handlers >>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) >>>>> --- >>>>> Documentation/virtual/kvm/api.txt | 17 ++ >>>>> arch/powerpc/include/asm/kvm_host.h | 2 + >>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >>>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ >>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- >>>>> arch/powerpc/kvm/book3s_hv.c | 39 +++++ >>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >>>>> arch/powerpc/kvm/powerpc.c | 3 + >>>>> include/uapi/linux/kvm.h | 1 + >>>>> 10 files changed, 473 insertions(+), 32 deletions(-) >>>>> >>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt >>>>> index 5f91eda..6c082ff 100644 >>>>> --- a/Documentation/virtual/kvm/api.txt >>>>> +++ b/Documentation/virtual/kvm/api.txt >>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be >>>>> handled. 
>>>>> >>>>> >>>>> +4.83 KVM_CAP_PPC_MULTITCE >>>>> + >>>>> +Capability: KVM_CAP_PPC_MULTITCE >>>>> +Architectures: ppc >>>>> +Type: vm >>>>> + >>>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls >>>>> +handling is supported by the kernel. This significanly accelerates DMA >>>>> +operations for PPC KVM guests. >>>>> + >>>>> +Unlike other capabilities in this section, this one does not have an ioctl. >>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and >>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to >>>>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE >>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). >>>> >>> >>>> While this describes perfectly well what the consequences are of the >>>> patches, it does not describe properly what the CAP actually expresses. >>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and >>>> H_STUFF_TCE hypercalls directly". All other consequences are nice to >>>> document, but the semantics of the CAP are missing. >>> >>> >>> ? It expresses ability to handle 2 hcalls. What is missing? >> >> You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap. > > > This file does not mention qemu at all. And the interface is - qemu (or > kvmtool could do that) just adds "hcall-multi-tce" to > "ibm,hypertas-functions" but this is for pseries linux and AIX could always > do it (no idea about it). Does it really have to be in this file? Ok, let's go back a step. What does this CAP describe? Don't look at the description you wrote above. Just write a new one. What exactly can user space expect when it finds this CAP? > > > >>>> We also usually try to keep KVM behavior unchanged with regards to older >>>> versions until a CAP is enabled. 
In this case I don't think it matters >>>> all that much, so I'm fine with declaring it as enabled by default. >>>> Please document that this is a change in behavior versus older KVM >>>> versions though. >>> >>> >>> Ok! >>> >>> >>>>> + >>>>> + >>>>> 5. The kvm_run structure >>>>> ------------------------ >>>>> >>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h >>>>> index af326cd..85d8f26 100644 >>>>> --- a/arch/powerpc/include/asm/kvm_host.h >>>>> +++ b/arch/powerpc/include/asm/kvm_host.h >>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { >>>>> spinlock_t tbacct_lock; >>>>> u64 busy_stolen; >>>>> u64 busy_preempt; >>>>> + >>>>> + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ >>>>> #endif >>>>> }; >>>> >>>> [...] >>>>> >>>>> >>>> >>>> [...] >>>> >>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >>>>> index 550f592..a39039a 100644 >>>>> --- a/arch/powerpc/kvm/book3s_hv.c >>>>> +++ b/arch/powerpc/kvm/book3s_hv.c >>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) >>>>> ret = kvmppc_xics_hcall(vcpu, req); >>>>> break; >>>>> } /* fallthrough */ >>>> >>>> The fallthrough comment isn't accurate anymore. 
>>>> >>>>> + return RESUME_HOST; >>>>> + case H_PUT_TCE: >>>>> + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >>>>> + kvmppc_get_gpr(vcpu, 5), >>>>> + kvmppc_get_gpr(vcpu, 6)); >>>>> + if (ret == H_TOO_HARD) >>>>> + return RESUME_HOST; >>>>> + break; >>>>> + case H_PUT_TCE_INDIRECT: >>>>> + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), >>>>> + kvmppc_get_gpr(vcpu, 5), >>>>> + kvmppc_get_gpr(vcpu, 6), >>>>> + kvmppc_get_gpr(vcpu, 7)); >>>>> + if (ret == H_TOO_HARD) >>>>> + return RESUME_HOST; >>>>> + break; >>>>> + case H_STUFF_TCE: >>>>> + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >>>>> + kvmppc_get_gpr(vcpu, 5), >>>>> + kvmppc_get_gpr(vcpu, 6), >>>>> + kvmppc_get_gpr(vcpu, 7)); >>>>> + if (ret == H_TOO_HARD) >>>>> + return RESUME_HOST; >>>>> + break; >>>>> default: >>>>> return RESUME_HOST; >>>>> } >>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) >>>>> vcpu->arch.cpu_type = KVM_CPU_3S_64; >>>>> kvmppc_sanity_check(vcpu); >>>>> >>>>> + /* >>>>> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT >>>>> + * half executed, we first read TCEs from the user, check them and >>>>> + * return error if something went wrong and only then put TCEs into >>>>> + * the TCE table. >>>>> + * >>>>> + * tce_tmp is a cache for TCEs to avoid stack allocation or >>>>> + * kmalloc as the whole TCE list can take up to 512 items 8 bytes >>>>> + * each (4096 bytes). 
>>>>> + */ >>>>> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); >>>>> + if (!vcpu->arch.tce_tmp) >>>>> + goto free_vcpu; >>>>> + >>>>> return vcpu; >>>>> >>>>> free_vcpu: >>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) >>>>> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); >>>>> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); >>>>> spin_unlock(&vcpu->arch.vpa_update_lock); >>>>> + kfree(vcpu->arch.tce_tmp); >>>>> kvm_vcpu_uninit(vcpu); >>>>> kmem_cache_free(kvm_vcpu_cache, vcpu); >>>>> } >>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>>> index b02f91e..d35554e 100644 >>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>>> @@ -1490,6 +1490,12 @@ hcall_real_table: >>>>> .long 0 /* 0x11c */ >>>>> .long 0 /* 0x120 */ >>>>> .long .kvmppc_h_bulk_remove - hcall_real_table >>>>> + .long 0 /* 0x128 */ >>>>> + .long 0 /* 0x12c */ >>>>> + .long 0 /* 0x130 */ >>>>> + .long 0 /* 0x134 */ >>>>> + .long .kvmppc_h_stuff_tce - hcall_real_table >>>>> + .long .kvmppc_h_put_tce_indirect - hcall_real_table >>>>> hcall_real_table_end: >>>>> >>>>> ignore_hdec: >>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c >>>>> index da0e0bc..91d4b45 100644 >>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c >>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c >>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) >>>>> unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>>> long rc; >>>>> >>>>> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); >>>>> + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); >>>>> + if (rc == H_TOO_HARD) >>>>> + return EMULATE_FAIL; >>>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>>> + return EMULATE_DONE; >>>>> +} >>>>> + >>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) >>>>> +{ >>>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 
5); >>>>> + unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>>> + long rc; >>>>> + >>>>> + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, >>>>> + tce, npages); >>>>> + if (rc == H_TOO_HARD) >>>>> + return EMULATE_FAIL; >>>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>>> + return EMULATE_DONE; >>>>> +} >>>>> + >>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) >>>>> +{ >>>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >>>>> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); >>>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>>> + long rc; >>>>> + >>>>> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); >>>>> if (rc == H_TOO_HARD) >>>>> return EMULATE_FAIL; >>>>> kvmppc_set_gpr(vcpu, 3, rc); >>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) >>>>> return kvmppc_h_pr_bulk_remove(vcpu); >>>>> case H_PUT_TCE: >>>>> return kvmppc_h_pr_put_tce(vcpu); >>>>> + case H_PUT_TCE_INDIRECT: >>>>> + return kvmppc_h_pr_put_tce_indirect(vcpu); >>>>> + case H_STUFF_TCE: >>>>> + return kvmppc_h_pr_stuff_tce(vcpu); >>>>> case H_CEDE: >>>>> vcpu->arch.shared->msr |= MSR_EE; >>>>> kvm_vcpu_block(vcpu); >>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>>> index 6316ee3..8465c2a 100644 >>>>> --- a/arch/powerpc/kvm/powerpc.c >>>>> +++ b/arch/powerpc/kvm/powerpc.c >>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) >>>>> r = 1; >>>>> break; >>>>> #endif >>>>> + case KVM_CAP_SPAPR_MULTITCE: >>>>> + r = 1; >>>> >>>> This should only be true for book3s. >>> >>> >>> We had this discussion with v2. >>> >>> David: >>> === >>> So, in the case of MULTITCE, that's not quite right. PR KVM can >>> emulate a PAPR system on a BookE machine, and there's no reason not to >>> allow TCE acceleration as well. 
We can't make it dependent on PAPR >>> mode being selected, because that's enabled per-vcpu, whereas these >>> capabilities are queried on the VM before the vcpus are created. >>> === >>> >>> Wrong? > >> Partially. BookE can not emulate a PAPR system as it stands today. > > Oh. > Ok. > So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)? The former. Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
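The outcome of this exchange (report the capability only on Book3S 64-bit builds, chosen at compile time rather than at run time) can be sketched roughly as follows. This is a stand-alone illustration, not the kernel code: check_extension_sketch() and SKETCH_CAP_SPAPR_MULTITCE are hypothetical stand-ins, and the capability number is a placeholder since the real one changed between revisions of the series.

```c
/*
 * Sketch only: the names and the number below are placeholders,
 * not the kernel's definitions.
 */
#define SKETCH_CAP_SPAPR_MULTITCE 1000

/* Returns 1 if the capability is reported, 0 otherwise. */
int check_extension_sketch(long ext)
{
	int r;

	switch (ext) {
#ifdef CONFIG_PPC_BOOK3S_64
	/* Only compiled in for Book3S 64-bit, as agreed in the thread. */
	case SKETCH_CAP_SPAPR_MULTITCE:
		r = 1;
		break;
#endif
	default:
		r = 0;
		break;
	}
	return r;
}
```

Built without CONFIG_PPC_BOOK3S_64 defined, the query falls through to the default case, so user space on other platforms would see the capability as absent.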
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls 2013-06-17 8:40 ` Alexander Graf (?) @ 2013-06-17 8:51 ` Alexey Kardashevskiy -1 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-17 8:51 UTC (permalink / raw) To: Alexander Graf Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc On 06/17/2013 06:40 PM, Alexander Graf wrote: > > On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote: > >> On 06/17/2013 06:02 PM, Alexander Graf wrote: >>> >>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote: >>> >>>> On 06/17/2013 08:06 AM, Alexander Graf wrote: >>>>> >>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: >>>>> >>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as >>>>>> IBMVIO devices or emulated PCI. These calls allow adding >>>>>> multiple entries (up to 512) into the TCE table in one call >>>>>> which saves time on transition to/from real mode. >>>>>> >>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >>>>>> (copied from user and verified) before writing the whole list >>>>>> into the TCE table. This cache will be utilized more in the >>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in >>>>>> the virtual mode in the case if the real mode handler failed >>>>>> for some reason. >>>>>> >>>>>> This adds a guest physical to host real address converter and >>>>>> calls the existing H_PUT_TCE handler. The converting function >>>>>> is going to be fully utilized by upcoming VFIO supporting >>>>>> patches. >>>>>> >>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so >>>>>> in order to support the functionality of this patch, QEMU >>>>>> needs to query for this capability and set the >>>>>> "hcall-multi-tce" hypertas property only if the capability is >>>>>> present, otherwise there will be serious performance >>>>>> degradation. 
>>>>>> >>>>>> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul >>>>>> Mackerras <paulus@samba.org> >>>>> >>>>> Only a few minor nits. Ben already commented on implementation >>>>> details. >>>>> >>>>>> >>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the >>>>>> commit message * updated doc and moved it to another section * >>>>>> changed capability number >>>>>> >>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup >>>>>> if put_indirect failed, instead we do not even start writing >>>>>> to TCE table if we cannot get TCEs from the user and they are >>>>>> invalid * kvmppc_emulated_h_put_tce is split to >>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for >>>>>> the previous item) * fixed bug with failthrough for H_IPI * >>>>>> removed all get_user() from real mode handlers * >>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte >>>>>> public) --- Documentation/virtual/kvm/api.txt | 17 ++ >>>>>> arch/powerpc/include/asm/kvm_host.h | 2 + >>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >>>>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ >>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 >>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c >>>>>> | 39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >>>>>> arch/powerpc/kvm/powerpc.c | 3 + >>>>>> include/uapi/linux/kvm.h | 1 + 10 files >>>>>> changed, 473 insertions(+), 32 deletions(-) >>>>>> >>>>>> diff --git a/Documentation/virtual/kvm/api.txt >>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff >>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++ >>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@ >>>>>> calls by the guest for that service will be passed to >>>>>> userspace to be handled. 
>>>>>> >>>>>> >>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability: >>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This >>>>>> capability tells the guest that multiple TCE entry add/remove >>>>>> hypercalls +handling is supported by the kernel. This >>>>>> significanly accelerates DMA +operations for PPC KVM guests. >>>>>> + +Unlike other capabilities in this section, this one does >>>>>> not have an ioctl. +Instead, when the capability is present, >>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be >>>>>> handled in the host kernel and not passed to +the guest. >>>>>> Othwerwise it might be better for the guest to continue using >>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or >>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present). >>>>> >>>> >>>>> While this describes perfectly well what the consequences are of >>>>> the patches, it does not describe properly what the CAP actually >>>>> expresses. The CAP only says "this kernel is able to handle >>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All >>>>> other consequences are nice to document, but the semantics of >>>>> the CAP are missing. >>>> >>>> >>>> ? It expresses ability to handle 2 hcalls. What is missing? >>> >>> You don't describe the kvm <-> qemu interface. You describe some >>> decisions qemu can take from this cap. >> >> >> This file does not mention qemu at all. And the interface is - qemu >> (or kvmtool could do that) just adds "hcall-multi-tce" to >> "ibm,hypertas-functions" but this is for pseries linux and AIX could >> always do it (no idea about it). Does it really have to be in this >> file? > > Ok, let's go back a step. What does this CAP describe? Don't look at the > description you wrote above. Just write a new one. The CAP means the kernel is capable of handling hcalls A and B without passing those into the user space. That accelerates DMA. > What exactly can user space expect when it finds this CAP? 
The user space can expect that its handlers for A and B are not going to be
called if it configures the guest appropriately.

Any better? :)

--
Alexey

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17 8:51 ` Alexey Kardashevskiy
@ 2013-06-17 10:46 ` Alexander Graf
From: Alexander Graf @ 2013-06-17 10:46 UTC
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 10:51 AM, Alexey Kardashevskiy wrote:
> On 06/17/2013 06:40 PM, Alexander Graf wrote:
>> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
>>
>>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>>>
>>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>>> IBMVIO devices or emulated PCI. These calls allow adding
>>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>>> which saves time on transition to/from real mode.
>>>>>>>
>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>> (copied from user and verified) before writing the whole list
>>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>>> for some reason.
>>>>>>>
>>>>>>> This adds a guest physical to host real address converter and
>>>>>>> calls the existing H_PUT_TCE handler. The converting function
>>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>>> patches.
>>>>>>>
>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>>> in order to support the functionality of this patch, QEMU
>>>>>>> needs to query for this capability and set the
>>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>>> present, otherwise there will be serious performance
>>>>>>> degradation.
>>>>>>>
>>>>>>> Cc: David Gibson<david@gibson.dropbear.id.au> Signed-off-by:
>>>>>>> Alexey Kardashevskiy<aik@ozlabs.ru> Signed-off-by: Paul
>>>>>>> Mackerras<paulus@samba.org>
>>>>>> Only a few minor nits. Ben already commented on implementation
>>>>>> details.
>>>>>>
>>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>>> commit message * updated doc and moved it to another section *
>>>>>>> changed capability number
>>>>>>>
>>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>>> to TCE table if we cannot get TCEs from the user and they are
>>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>>> removed all get_user() from real mode handlers *
>>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>>> public) --- Documentation/virtual/kvm/api.txt | 17 ++
>>>>>>> arch/powerpc/include/asm/kvm_host.h | 2 +
>>>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266
>>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>>> | 39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>>>>> include/uapi/linux/kvm.h | 1 + 10 files
>>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>>> calls by the guest for that service will be passed to
>>>>>>> userspace to be handled.
>>>>>>>
>>>>>>>
>>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>>> significanly accelerates DMA +operations for PPC KVM guests.
>>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>>> Othwerwise it might be better for the guest to continue using
>>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>>> While this describes perfectly well what the consequences are of
>>>>>> the patches, it does not describe properly what the CAP actually
>>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>>> other consequences are nice to document, but the semantics of
>>>>>> the CAP are missing.
>>>>>
>>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>>> You don't describe the kvm<-> qemu interface. You describe some
>>>> decisions qemu can take from this cap.
>>>
>>> This file does not mention qemu at all. And the interface is - qemu
>>> (or kvmtool could do that) just adds "hcall-multi-tce" to
>>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>>> always do it (no idea about it). Does it really have to be in this
>>> file?
>> Ok, let's go back a step. What does this CAP describe? Don't look at the
>> description you wrote above. Just write a new one.
> The CAP means the kernel is capable of handling hcalls A and B without
> passing those into the user space. That accelerates DMA.
>
>
>> What exactly can user space expect when it finds this CAP?
> The user space can expect that its handlers for A and B are not going to be
> called if it configures the guest appropriately.
>
> Any better? :)

A lot, yes. This is what the CAP actually means. It's nice to give some
guidance in the documentation of implications (should expose
"ibm,hypertas-functions" to enable the guest to actually use these for
example) but the first paragraph should only indicate what the CAP changes.

Alex

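For illustration only (this is not code from the series): the userspace side discussed above comes down to checking the capability once and advertising "hcall-multi-tce" in "ibm,hypertas-functions" only when the kernel reports it. The sketch below assumes the result of that check is passed in as a boolean; in a real VMM it would presumably come from the KVM_CHECK_EXTENSION ioctl, using the capability number from the patched <linux/kvm.h>. The names build_hypertas() and prop_append() are hypothetical, and "hcall-tce" merely stands in for whatever baseline entries the VMM already advertises.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Append one NUL-terminated string to a property made of NUL-separated
 * strings. Returns the new length, or 0 if the buffer is too small.
 * (Hypothetical helper, for illustration.)
 */
static size_t prop_append(char *buf, size_t len, size_t cap, const char *s)
{
    size_t n = strlen(s) + 1;   /* include the terminating NUL */

    if (n > cap - len)
        return 0;
    memcpy(buf + len, s, n);
    return len + n;
}

/*
 * Build an example "ibm,hypertas-functions" value.
 * kernel_has_multitce is assumed to be the result of the capability
 * check; "hcall-multi-tce" is advertised only when it is true, so the
 * guest never uses the multi-TCE hcalls on a kernel that cannot
 * handle them itself.
 * Returns the property length in bytes, or 0 on overflow.
 */
size_t build_hypertas(char *buf, size_t cap, bool kernel_has_multitce)
{
    size_t len = prop_append(buf, 0, cap, "hcall-tce");

    if (len == 0)
        return 0;
    if (kernel_has_multitce)
        len = prop_append(buf, len, cap, "hcall-multi-tce");
    return len;
}
```

Keeping the decision in one place mirrors the point made in the review: the capability itself only says what the kernel can handle, while exposing the property to the guest remains a userspace choice.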
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17 10:46 ` Alexander Graf
  (?)
@ 2013-06-17 10:48 ` Alexander Graf
  -1 siblings, 0 replies; 160+ messages in thread
From: Alexander Graf @ 2013-06-17 10:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras,
  kvm, linux-kernel, kvm-ppc

On 06/17/2013 12:46 PM, Alexander Graf wrote:
> On 06/17/2013 10:51 AM, Alexey Kardashevskiy wrote:
>> On 06/17/2013 06:40 PM, Alexander Graf wrote:
>>> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
>>>
>>>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>>>>
>>>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>>>> IBMVIO devices or emulated PCI. These calls allow adding
>>>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>>>> which saves time on transition to/from real mode.
>>>>>>>>
>>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>>> (copied from user and verified) before writing the whole list
>>>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>>>> for some reason.
>>>>>>>>
>>>>>>>> This adds a guest physical to host real address converter and
>>>>>>>> calls the existing H_PUT_TCE handler. The converting function
>>>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>>>> patches.
>>>>>>>>
>>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>>>> in order to support the functionality of this patch, QEMU
>>>>>>>> needs to query for this capability and set the
>>>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>>>> present, otherwise there will be serious performance
>>>>>>>> degradation.
>>>>>>>>
>>>>>>>> Cc: David Gibson<david@gibson.dropbear.id.au> Signed-off-by:
>>>>>>>> Alexey Kardashevskiy<aik@ozlabs.ru> Signed-off-by: Paul
>>>>>>>> Mackerras<paulus@samba.org>
>>>>>>> Only a few minor nits. Ben already commented on implementation
>>>>>>> details.
>>>>>>>
>>>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>>>> commit message * updated doc and moved it to another section *
>>>>>>>> changed capability number
>>>>>>>>
>>>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>>>> to TCE table if we cannot get TCEs from the user and they are
>>>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>>>> removed all get_user() from real mode handlers *
>>>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>>>> public) --- Documentation/virtual/kvm/api.txt | 17 ++
>>>>>>>> arch/powerpc/include/asm/kvm_host.h | 2 +
>>>>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
>>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266
>>>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>>>> | 39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>>>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>>>>>> include/uapi/linux/kvm.h | 1 + 10 files
>>>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>>>> calls by the guest for that service will be passed to
>>>>>>>> userspace to be handled.
>>>>>>>>
>>>>>>>>
>>>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>>>> significanly accelerates DMA +operations for PPC KVM guests.
>>>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>>>> Othwerwise it might be better for the guest to continue using
>>>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>>>> While this describes perfectly well what the consequences are of
>>>>>>> the patches, it does not describe properly what the CAP actually
>>>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>>>> other consequences are nice to document, but the semantics of
>>>>>>> the CAP are missing.
>>>>>>
>>>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>>>> You don't describe the kvm<-> qemu interface. You describe some
>>>>> decisions qemu can take from this cap.
>>>>
>>>> This file does not mention qemu at all. And the interface is - qemu
>>>> (or kvmtool could do that) just adds "hcall-multi-tce" to
>>>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>>>> always do it (no idea about it). Does it really have to be in this
>>>> file?
>>> Ok, let's go back a step. What does this CAP describe? Don't look at
>>> the
>>> description you wrote above. Just write a new one.
>> The CAP means the kernel is capable of handling hcalls A and B without
>> passing those into the user space. That accelerates DMA.
>>
>>
>>> What exactly can user space expect when it finds this CAP?
>> The user space can expect that its handlers for A and B are not going
>> to be
>> called if it configures the guest appropriately.

Actually, a nitpick here too. User space can expect that hcalls A and B
are already going to be processed by KVM, regardless of how user space
configures the guest.

Alex

^ permalink raw reply [flat|nested] 160+ messages in thread
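Editorial illustration of the semantics discussed above, as a minimal sketch rather than QEMU or kernel code. It assumes user space has already queried the capability with the KVM_CHECK_EXTENSION ioctl, which returns 0 when a capability is absent and a positive value when it is present, and it follows the commit message in adding "hcall-multi-tce" to the "ibm,hypertas-functions" list only when the capability was found. The helper name hypertas_functions() and its parameters are hypothetical; "hcall-tce" stands in for whatever entries user space already advertises.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of QEMU or the kernel): fills 'out' with
 * the hypertas function names user space would advertise to the guest and
 * returns how many entries were written.
 *
 * cap_result is assumed to be the value returned by
 * ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MULTITCE):
 * 0 means the capability is absent, a positive value means present.
 */
static size_t hypertas_functions(int cap_result, const char **out, size_t max)
{
    size_t n = 0;

    /* Single-entry TCE updates are advertised either way in this sketch. */
    if (n < max)
        out[n++] = "hcall-tce";

    /*
     * The capability only states that the kernel can handle
     * H_PUT_TCE_INDIRECT and H_STUFF_TCE itself. Whether the guest gets
     * to know about them remains a user space decision, taken here.
     */
    if (cap_result > 0 && n < max)
        out[n++] = "hcall-multi-tce";

    return n;
}
```

With cap_result equal to 0 the list holds only "hcall-tce", so the guest keeps using H_PUT_TCE; with a positive value "hcall-multi-tce" is appended.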
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17 7:55 ` Alexey Kardashevskiy
  (?)
@ 2013-06-17 8:37 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-17 8:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, linuxppc-dev, David Gibson, Paul Mackerras, kvm,
  linux-kernel, kvm-ppc

On Mon, 2013-06-17 at 17:55 +1000, Alexey Kardashevskiy wrote:
> David:
> ===
> So, in the case of MULTITCE, that's not quite right. PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well. We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> ===
>
> Wrong?

The capability just tells qemu the kernel supports it, it doesn't have
to depend on PAPR mode, qemu can sort things out no ?

Cheers,
Ben.

^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17 8:37 ` Benjamin Herrenschmidt
  (?)
@ 2013-06-17 8:42 ` Alexander Graf
  -1 siblings, 0 replies; 160+ messages in thread
From: Alexander Graf @ 2013-06-17 8:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras,
  kvm, linux-kernel, kvm-ppc

On 17.06.2013, at 10:37, Benjamin Herrenschmidt wrote:

> On Mon, 2013-06-17 at 17:55 +1000, Alexey Kardashevskiy wrote:
>> David:
>> ===
>> So, in the case of MULTITCE, that's not quite right. PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well. We can't make it dependent on PAPR
>> mode being selected, because that's enabled per-vcpu, whereas these
>> capabilities are queried on the VM before the vcpus are created.
>> ===
>>
>> Wrong?
>
> The capability just tells qemu the kernel supports it, it doesn't have
> to depend on PAPR mode, qemu can sort things out no ?

Yes, this goes hand-in-hand with the documentation bit I'm trying to get
through to Alexey atm. The CAP merely says that if in PAPR mode the kernel
can handle hypercalls X and Y itself.

This is true for all book3s implementations as the patches stand. It is not
true for BookE as the patches stand. Hence the CAP should be limited to
book3s, regardless of its mode :).

Alex

^ permalink raw reply [flat|nested] 160+ messages in thread
* [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-06-05 6:11 ` Alexey Kardashevskiy
  (?)
@ 2013-06-05 6:11 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 160+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
  Paul Mackerras, kvm, linux-kernel, kvm-ppc

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..ce3d8d4 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	if (!atomic_add_unless(&page->_count, -1, 1))
+		return -EAGAIN;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif

-- 
1.7.10.4

^ permalink raw reply related [flat|nested] 160+ messages in thread
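Editorial illustration of the two techniques in this patch, as a user-space sketch under stated assumptions rather than kernel code. The struct backing type and the virt_to_real(), add_unless() and put_ref_no_free() names are hypothetical stand-ins for struct vmemmap_backing, the body of realmode_pfn_to_page(), the kernel's atomic_add_unless() and realmode_put_page(). The list is assumed to be ordered by descending virt_addr, which is what makes "first block that does not start above the address" a valid match, and C11 atomics replace the kernel's atomic_t.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the kernel's struct vmemmap_backing. */
struct backing {
    struct backing *next;
    uintptr_t phys;      /* real address where the block starts */
    uintptr_t virt_addr; /* virtual address where the block is mapped */
};

/*
 * Translate the virtual address of an object of obj_size bytes into a
 * real address by walking the block list. Returns 0 when no block starts
 * at or below va, or when the object would run past the end of its block,
 * mirroring the NULL returns in the patch.
 * Assumption: the list is sorted by descending virt_addr.
 */
static uintptr_t virt_to_real(const struct backing *list, uintptr_t va,
                              size_t obj_size, size_t block_size)
{
    for (const struct backing *b = list; b; b = b->next) {
        if (va < b->virt_addr)
            continue; /* this block starts above va, keep walking */
        if (va + obj_size > b->virt_addr + block_size)
            return 0; /* object is split between blocks */
        return b->phys + (va - b->virt_addr);
    }
    return 0;
}

/* Add a to *v unless *v equals u; returns true when the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
    int c = atomic_load(v);

    while (c != u) {
        /* On failure c is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &c, c + a))
            return true;
    }
    return false;
}

/*
 * Drop one reference, but never the last one: releasing the final
 * reference would mean freeing the page, which this sketch assumes cannot
 * be done in the restricted context, so the caller is told to retry.
 */
static int put_ref_no_free(atomic_int *count)
{
    return add_unless(count, -1, 1) ? 0 : -EAGAIN;
}
```

In both helpers the failure value plays the role described in the patch comment: the fast path gives up and the operation is expected to be completed in a context where the normal API works.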
* [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap @ 2013-06-05 6:11 ` Alexey Kardashevskiy 0 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a "fast" hypercall directly in "real mode" (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement page counter as get_user_pages API used for user mode mapping does not work in the real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. 
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Paul Mackerras <paulus@samba.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> --- Changes: 2013-05-20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() --- arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++ arch/powerpc/mm/init_64.c | 77 +++++++++++++++++++++++++++++- 2 files changed, 80 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index e3d55f6f..7b46e5f 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } #endif /* !CONFIG_HUGETLB_PAGE */ +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index a90b9c4..ce3d8d4 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -297,5 +297,80 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fallback to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of realmode_XXXX functions can fail due to: + * 1) As real sparsemem blocks do not lay in RAM continously (they + * are in virtual address space which is not available in the real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. 
+ * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. + * When 1) or 2) takes place, the API returns an error code to cause + * an exit to kernel virtual mode where the operation will be completed. + */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > + (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + get_page(page); + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + if (!atomic_add_unless(&page->_count, -1, 1)) + return -EAGAIN; + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_put_page); +#endif -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 160+ messages in thread
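The lookup described in the comment above can be sketched in user space as follows. This is only an illustrative model, not the kernel code: `struct backing`, `model_virt_to_phys()` and the explicit `block_size` parameter are hypothetical names standing in for `struct vmemmap_backing`, `realmode_pfn_to_page()` and the vmemmap page size. Unlike the patch, the sketch checks both ends of each block before deciding, so it does not depend on the order of the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct vmemmap_backing. */
struct backing {
	struct backing *next;
	uintptr_t phys;       /* physical address of the block */
	uintptr_t virt_addr;  /* virtual address the block is mapped at */
};

/*
 * Translate the virtual address of an object of obj_size bytes into a
 * physical address. Returns 0 when no single block fully contains the
 * object, which models the "give up and retry in virtual mode" case.
 */
static uintptr_t model_virt_to_phys(const struct backing *list,
				    uintptr_t va, size_t obj_size,
				    size_t block_size)
{
	const struct backing *b;

	assert(block_size > 0);

	for (b = list; b; b = b->next) {
		if (va < b->virt_addr || va >= b->virt_addr + block_size)
			continue;	/* not inside this block */
		if (va + obj_size > b->virt_addr + block_size)
			return 0;	/* object straddles a block boundary */
		return b->phys + (va - b->virt_addr);
	}
	return 0;	/* no block maps this address */
}
```

The walk is linear in the number of blocks, which is the cost the comment refers to when it mentions possibly maintaining a tree later.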
* Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap 2013-06-05 6:11 ` Alexey Kardashevskiy (?) (?) @ 2013-06-16 4:26 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-16 4:26 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc, linux-mm@kvack.org > +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) > +int realmode_get_page(struct page *page) > +{ > + if (PageCompound(page)) > + return -EAGAIN; > + > + get_page(page); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(realmode_get_page); > + > +int realmode_put_page(struct page *page) > +{ > + if (PageCompound(page)) > + return -EAGAIN; > + > + if (!atomic_add_unless(&page->_count, -1, 1)) > + return -EAGAIN; > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(realmode_put_page); > +#endif Several worries here, mostly that if the generic code ever changes (something gets added to get_page() that makes it no-longer safe for use in real mode for example, or some other condition gets added to put_page()), we go out of sync and potentially end up with very hard and very subtle bugs. It might be worth making sure that: - This is reviewed by some generic VM people (and make sure they understand why we need to do that) - A comment is added to get_page() and put_page() to make sure that if they are changed in any way, dbl check the impact on our realmode_get_page() (or "ping" us to make sure things are still ok). Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap 2013-06-16 4:26 ` Benjamin Herrenschmidt (?) (?) @ 2013-06-16 4:31 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 160+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-16 4:31 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc, linux-mm@kvack.org On Sun, 2013-06-16 at 14:26 +1000, Benjamin Herrenschmidt wrote: > > +int realmode_get_page(struct page *page) > > +{ > > + if (PageCompound(page)) > > + return -EAGAIN; > > + > > + get_page(page); > > + > > + return 0; > > +} Shouldn't it be get_page_unless_zero ? Cheers, Ben. ^ permalink raw reply [flat|nested] 160+ messages in thread
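For readers unfamiliar with the distinction Ben is pointing at, here is a minimal user-space sketch of "take a reference unless the count is already zero", written with C11 atomics. The names `struct model_page` and `model_get_unless_zero()` are hypothetical; the kernel's get_page_unless_zero() is not reproduced here, only the idea that an unconditional increment could revive a page whose count has already dropped to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of a page reference counter. */
struct model_page {
	atomic_int count;
};

/* Take a reference only if the counter is not already zero. */
static bool model_get_unless_zero(struct model_page *p)
{
	int old = atomic_load(&p->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
			return true;
	}
	return false;	/* page was already free; do not resurrect it */
}

/* Small self-check showing both outcomes. */
static void model_get_selfcheck(void)
{
	struct model_page live = { 1 };
	struct model_page freed = { 0 };

	assert(model_get_unless_zero(&live));
	assert(atomic_load(&live.count) == 2);
	assert(!model_get_unless_zero(&freed));
	assert(atomic_load(&freed.count) == 0);
}
```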
* Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap 2013-06-16 4:26 ` Benjamin Herrenschmidt (?) (?) @ 2013-06-17 9:17 ` Alexey Kardashevskiy -1 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-17 9:17 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc, linux-mm@kvack.org On 06/16/2013 02:26 PM, Benjamin Herrenschmidt wrote: >> +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) >> +int realmode_get_page(struct page *page) >> +{ >> + if (PageCompound(page)) >> + return -EAGAIN; >> + >> + get_page(page); >> + >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(realmode_get_page); >> + >> +int realmode_put_page(struct page *page) >> +{ >> + if (PageCompound(page)) >> + return -EAGAIN; >> + >> + if (!atomic_add_unless(&page->_count, -1, 1)) >> + return -EAGAIN; >> + >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(realmode_put_page); >> +#endif > > Several worries here, mostly that if the generic code ever changes > (something gets added to get_page() that makes it no-longer safe for use > in real mode for example, or some other condition gets added to > put_page()), we go out of sync and potentially end up with very hard and > very subtle bugs. > > It might be worth making sure that: > > - This is reviewed by some generic VM people (and make sure they > understand why we need to do that) > > - A comment is added to get_page() and put_page() to make sure that if > they are changed in any way, dbl check the impact on our > realmode_get_page() (or "ping" us to make sure things are still ok). After changing get_page() to get_page_unless_zero(), the get_page API I use is: get_page_unless_zero() - basically atomic_inc_not_zero() atomic_add_unless() - just operated with the counter PageCompound() - check if it is a huge page. No usage of get_page or put_page. 
If any of those changes, I would expect it to hit us immediately, no? So it may only make sense to add a comment to PageCompound(). But the comment says "PageCompound is generally not used in hot code paths", and our path is hot. Heh. diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..c70a654 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -329,7 +329,8 @@ static inline void set_page_writeback(struct page *page) * System with lots of page flags available. This allows separate * flags for PageHead() and PageTail() checks of compound pages so that bit * tests can be used in performance sensitive paths. PageCompound is - * generally not used in hot code paths. + * generally not used in hot code paths except arch/powerpc/mm/init_64.c + * which uses it to detect huge pages and avoid handling those in real mode. */ __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) __PAGEFLAG(Tail, tail) So? -- Alexey ^ permalink raw reply related [flat|nested] 160+ messages in thread
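The put side of the primitives listed above can be modelled the same way. The sketch below is a user-space approximation, assuming the usual meaning of atomic_add_unless(v, a, u): add a to v unless v equals u, and report whether the addition happened. `model_add_unless()` and `model_put_unless_last()` are hypothetical names, and -1 stands in for -EAGAIN. Refusing to drop the last reference presumably exists because the final put would have to free the page, which is left to the virtual mode handler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u; returns true if the addition was done. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/*
 * Drop one reference unless it is the last one.
 * Returns 0 on success, -1 (standing in for -EAGAIN) otherwise.
 */
static int model_put_unless_last(atomic_int *count)
{
	return model_add_unless(count, -1, 1) ? 0 : -1;
}

/* Small self-check: the second put hits the last reference and is refused. */
static void model_put_selfcheck(void)
{
	atomic_int count = 2;

	assert(model_put_unless_last(&count) == 0);
	assert(atomic_load(&count) == 1);
	assert(model_put_unless_last(&count) == -1);
	assert(atomic_load(&count) == 1);
}
```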
* [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling 2013-06-05 6:11 ` Alexey Kardashevskiy (?) @ 2013-06-05 6:11 ` Alexey Kardashevskiy -1 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the real mode handler fails to handle a TCE request, it passes the request to the virtual mode handler. If the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (advertised by the KVM_CAP_SPAPR_TCE_IOMMU capability) to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card). Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> --- Changes: 2013/06/05: * changed capability number * changed ioctl number * update the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended.
Now real mode handler puts there translated TCEs, tries realmode_get_page() on those and if it fails, it passes control over the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register TCE table in the kernel, in all other cases the virtual mode handler is expected to do the job --- Documentation/virtual/kvm/api.txt | 28 +++++ arch/powerpc/include/asm/kvm_host.h | 3 + arch/powerpc/include/asm/kvm_ppc.h | 2 + arch/powerpc/include/uapi/asm/kvm.h | 7 ++ arch/powerpc/kvm/book3s_64_vio.c | 198 ++++++++++++++++++++++++++++++++++- arch/powerpc/kvm/book3s_64_vio_hv.c | 193 +++++++++++++++++++++++++++++++++- arch/powerpc/kvm/powerpc.c | 12 +++ include/uapi/linux/kvm.h | 2 + 8 files changed, 439 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6c082ff..e962e3b 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2379,6 +2379,34 @@ the guest. Othwerwise it might be better for the guest to continue using H_PUT_T hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). +4.84 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +No flag is supported at the moment. 
+ +When the guest issues TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. The kvm_run structure ------------------------ diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 85d8f26..ac0e2fe 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp; /* used for IOMMU groups */ struct page *pages[0]; }; @@ -611,6 +612,8 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ + unsigned long tce_reason; /* The reason of switching to the virtmode */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e852921b..934e01d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_validate_tce(unsigned long tce); diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 0fb1a6e..cf82af4 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -319,6 +319,13 @@ struct kvm_create_spapr_tce { __u32 window_size; }; +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + 
__u32 flags; +}; + /* for KVM_ALLOCATE_RMA */ struct kvm_allocate_rma { __u64 rma_size; diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 06b7b20..ffb4698 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -27,6 +27,8 @@ #include <linux/hugetlb.h> #include <linux/list.h> #include <linux/anon_inodes.h> +#include <linux/pci.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/kvm_ppc.h> @@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt) mutex_lock(&kvm->lock); list_del(&stt->list); - for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++) - __free_page(stt->pages[i]); +#ifdef CONFIG_IOMMU_API + if (stt->grp) { + iommu_group_put(stt->grp); + } else +#endif + for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++) + __free_page(stt->pages[i]); kfree(stt); mutex_unlock(&kvm->lock); @@ -153,6 +160,62 @@ fail: return ret; } +#ifdef CONFIG_IOMMU_API +static const struct file_operations kvm_spapr_tce_iommu_fops = { + .release = kvm_spapr_tce_release, +}; + +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args) +{ + struct kvmppc_spapr_tce_table *tt = NULL; + struct iommu_group *grp; + struct iommu_table *tbl; + + /* Find an IOMMU table for the given ID */ + grp = iommu_group_get_by_id(args->iommu_id); + if (!grp) + return -ENXIO; + + tbl = iommu_group_get_iommudata(grp); + if (!tbl) + return -ENXIO; + + /* Check this LIOBN hasn't been previously allocated */ + list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) { + if (tt->liobn == args->liobn) + return -EBUSY; + } + + tt = kzalloc(sizeof(*tt), GFP_KERNEL); + if (!tt) + return -ENOMEM; + + tt->liobn = args->liobn; + tt->kvm = kvm; + tt->grp = grp; + + kvm_get_kvm(kvm); + + mutex_lock(&kvm->lock); + list_add(&tt->list, &kvm->arch.spapr_tce_tables); + + mutex_unlock(&kvm->lock); + + pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n", + 
args->liobn, args->iommu_id, args->flags); + + return anon_inode_getfd("kvm-spapr-tce-iommu", + &kvm_spapr_tce_iommu_fops, tt, O_RDWR); +} +#else +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args) +{ + return -ENOSYS; +} +#endif /* CONFIG_IOMMU_API */ + /* Converts guest physical address into host virtual */ static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu, unsigned long gpa) @@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, if (!tt) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + unsigned long entry = ioba >> IOMMU_PAGE_SHIFT; + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + if (vcpu->arch.tce_reason == H_HARDWARE) { + iommu_clear_tces_and_put_pages(tbl, entry, 1); + return H_HARDWARE; + + } else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) { + if (iommu_tce_clear_param_check(tbl, ioba, 0, 1)) + return H_PARAMETER; + + ret = iommu_clear_tces_and_put_pages(tbl, entry, 1); + } else { + void *hva; + + if (iommu_tce_put_param_check(tbl, ioba, tce)) + return H_PARAMETER; + + hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce); + if (hva == ERROR_ADDR) + return H_HARDWARE; + + ret = iommu_put_tce_user_mode(tbl, + ioba >> IOMMU_PAGE_SHIFT, + (unsigned long) hva); + } + iommu_flush_tce(tbl); + + if (ret) + return H_HARDWARE; + + return H_SUCCESS; + } +#endif /* Emulated IO */ if (ioba >= tt->window_size) return H_PARAMETER; @@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, if (tces == ERROR_ADDR) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + /* Something bad happened, do cleanup and exit */ + if (vcpu->arch.tce_reason == H_HARDWARE) { + i = vcpu->arch.tce_tmp_num; + 
goto fail_clear_tce; + } else if (vcpu->arch.tce_reason != H_TOO_HARD) { + /* + * We get here only in PR KVM mode, otherwise + * the real mode handler would have checked TCEs + * already and failed on guest TCE translation. + */ + for (i = 0; i < npages; ++i) { + if (get_user(vcpu->arch.tce_tmp[i], tces + i)) + return H_HARDWARE; + + if (iommu_tce_put_param_check(tbl, ioba + + (i << IOMMU_PAGE_SHIFT), + vcpu->arch.tce_tmp[i])) + return H_PARAMETER; + } + } /* else: The real mode handler checked TCEs already */ + + /* Translate TCEs */ + for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) { + void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, + vcpu->arch.tce_tmp[i]); + + if (hva == ERROR_ADDR) + goto fail_clear_tce; + + vcpu->arch.tce_tmp[i] = (unsigned long) hva; + } + + /* Do get_page and put TCEs for all pages */ + for (i = 0; i < npages; ++i) { + if (iommu_put_tce_user_mode(tbl, + (ioba >> IOMMU_PAGE_SHIFT) + i, + vcpu->arch.tce_tmp[i])) { + i = npages; + goto fail_clear_tce; + } + } + + iommu_flush_tce(tbl); + + return H_SUCCESS; + +fail_clear_tce: + /* Cannot complete the translation, clean up and exit */ + iommu_clear_tces_and_put_pages(tbl, + ioba >> IOMMU_PAGE_SHIFT, i); + + iommu_flush_tce(tbl); + + return H_HARDWARE; + } +#endif /* Emulated IO */ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) return H_PARAMETER; @@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, if (!tt) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT; + + vcpu->arch.tce_tmp_num = 0; + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + /* PR KVM? 
*/ + if (!vcpu->arch.tce_tmp_num && + (vcpu->arch.tce_reason != H_TOO_HARD) && + iommu_tce_clear_param_check(tbl, ioba, + tce_value, npages)) + return H_PARAMETER; + + /* Do actual cleanup */ + tmp = vcpu->arch.tce_tmp_num; + if (iommu_clear_tces_and_put_pages(tbl, entry + tmp, + npages - tmp)) + return H_PARAMETER; + + return H_SUCCESS; + } +#endif /* Emulated IO */ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) return H_PARAMETER; diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index c68d538..dc4ae32 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -26,6 +26,7 @@ #include <linux/slab.h> #include <linux/hugetlb.h> #include <linux/list.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/kvm_ppc.h> @@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce); #ifdef CONFIG_KVM_BOOK3S_64_HV static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, - unsigned long *pte_sizep) + unsigned long *pte_sizep, bool do_get_page) { pte_t *ptep; unsigned int shift = 0; @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, if (!pte_present(*ptep)) return __pte(0); + /* + * Put huge pages handling to the virtual mode. + * The only exception is for TCE list pages which we + * do need to call get_page() for. 
+ */ + if ((*pte_sizep > PAGE_SIZE) && do_get_page) + return __pte(0); + /* wait until _PAGE_BUSY is clear then set it atomically */ __asm__ __volatile__ ( "1: ldarx %0,0,%3\n" @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, : "cc"); ret = pte; + if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) { + struct page *pg = NULL; + pg = realmode_pfn_to_page(pte_pfn(pte)); + if (realmode_get_page(pg)) { + ret = __pte(0); + } else { + pte = pte_mkyoung(pte); + if (writing) + pte = pte_mkdirty(pte); + } + } + *ptep = pte; /* clears _PAGE_BUSY */ return ret; } @@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, * Also returns pte and page size if the page is present in page table. */ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, - unsigned long gpa) + unsigned long gpa, bool do_get_page) { struct kvm_memory_slot *memslot; pte_t pte; @@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, /* Find a PTE and determine the size */ pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva, - writing, &pg_size); + writing, &pg_size, do_get_page); if (!pte) return ERROR_ADDR; @@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, return hpa; } +#ifdef CONFIG_IOMMU_API +static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + long ret = 0, i; + unsigned long entry = ioba >> IOMMU_PAGE_SHIFT; + + if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages)) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) { + struct page *page; + unsigned long oldtce; + + oldtce = iommu_clear_tce(tbl, entry + i); + if (!oldtce) + continue; + + page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT); + if (!page) { + ret = H_TOO_HARD; + break; + } + + if (oldtce & TCE_PCI_WRITE) + SetPageDirty(page); 
+ + if (realmode_put_page(page)) { + ret = H_TOO_HARD; + break; + } + } + + if (ret == H_TOO_HARD) { + vcpu->arch.tce_tmp_num = i; + vcpu->arch.tce_reason = H_TOO_HARD; + } + /* if (ret < 0) + pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n", + __func__, ioba, tce_value, ret); */ + + return ret; +} +#endif /* CONFIG_IOMMU_API */ + long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { @@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, if (!tt) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + vcpu->arch.tce_reason = 0; + + if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) { + unsigned long hpa, hva; + + if (iommu_tce_put_param_check(tbl, ioba, tce)) + return H_PARAMETER; + + hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true); + if (hpa == ERROR_ADDR) { + vcpu->arch.tce_reason = H_TOO_HARD; + return H_TOO_HARD; + } + + hva = (unsigned long) __va(hpa); + ret = iommu_tce_build(tbl, + ioba >> IOMMU_PAGE_SHIFT, + hva, iommu_tce_direction(hva)); + if (unlikely(ret)) { + struct page *pg = realmode_pfn_to_page(hpa); + BUG_ON(!pg); + if (realmode_put_page(pg)) { + vcpu->arch.tce_reason = H_HARDWARE; + return H_TOO_HARD; + } + return H_HARDWARE; + } + } else { + ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1); + if (ret) + return ret; + } + + iommu_flush_tce(tbl); + + return H_SUCCESS; + } +#endif /* Emulated IO */ if (ioba >= tt->window_size) return H_PARAMETER; @@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, if (tce_list & ~IOMMU_PAGE_MASK) return H_PARAMETER; - tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list); + vcpu->arch.tce_tmp_num = 0; + vcpu->arch.tce_reason = 0; + + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, + tce_list, false); if ((unsigned 
long)tces == ERROR_ADDR) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + /* Check all TCEs */ + for (i = 0; i < npages; ++i) { + if (iommu_tce_put_param_check(tbl, ioba + + (i << IOMMU_PAGE_SHIFT), tces[i])) + return H_PARAMETER; + vcpu->arch.tce_tmp[i] = tces[i]; + } + + /* Translate TCEs and go get_page */ + for (i = 0; i < npages; ++i) { + unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, + vcpu->arch.tce_tmp[i], true); + if (hpa == ERROR_ADDR) { + vcpu->arch.tce_tmp_num = i; + vcpu->arch.tce_reason = H_TOO_HARD; + return H_TOO_HARD; + } + vcpu->arch.tce_tmp[i] = hpa; + } + + /* Put TCEs to the table */ + for (i = 0; i < npages; ++i) { + unsigned long hva = (unsigned long) + __va(vcpu->arch.tce_tmp[i]); + + ret = iommu_tce_build(tbl, + (ioba >> IOMMU_PAGE_SHIFT) + i, + hva, iommu_tce_direction(hva)); + if (ret) { + /* All wrong, go virtmode and do cleanup */ + vcpu->arch.tce_reason = H_HARDWARE; + return H_TOO_HARD; + } + } + + iommu_flush_tce(tbl); + + return H_SUCCESS; + } +#endif /* Emulated IO */ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) return H_PARAMETER; @@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, if (!tt) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + vcpu->arch.tce_reason = 0; + + ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, + tce_value, npages); + if (ret) + return ret; + + iommu_flush_tce(tbl); + + return H_SUCCESS; + } +#endif /* Emulated IO */ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) return H_PARAMETER; diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 8465c2a..da6bf61 100644 --- a/arch/powerpc/kvm/powerpc.c +++ 
b/arch/powerpc/kvm/powerpc.c @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext) break; #endif case KVM_CAP_SPAPR_MULTITCE: + case KVM_CAP_SPAPR_TCE_IOMMU: r = 1; break; default: @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce); goto out; } + case KVM_CREATE_SPAPR_TCE_IOMMU: { + struct kvm_create_spapr_tce_iommu create_tce_iommu; + struct kvm *kvm = filp->private_data; + + r = -EFAULT; + if (copy_from_user(&create_tce_iommu, argp, + sizeof(create_tce_iommu))) + goto out; + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu); + goto out; + } #endif /* CONFIG_PPC_BOOK3S_64 */ #ifdef CONFIG_KVM_BOOK3S_64_HV diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index fc0d6b9..8cf86dc 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_SPAPR_MULTITCE 93 +#define KVM_CAP_SPAPR_TCE_IOMMU 94 #ifdef KVM_CAP_IRQ_ROUTING @@ -922,6 +923,7 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_PPC_ALLOC_HTAB */ #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32) #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce) +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu) /* Available with KVM_CAP_RMA */ #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma) /* Available with KVM_CAP_PPC_HTAB_FD */ -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 160+ messages in thread
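For readers following the api.txt hunk in the patch above, here is a minimal userspace sketch of how the proposed ioctl might be issued. It is an illustration under stated assumptions, not code from the series: the struct layout and the ioctl number (0xaf) are copied from this posting and are redefined locally because installed kernel headers are not assumed to carry them; KVMIO is assumed to be 0xAE as in include/uapi/linux/kvm.h; the helper names make_tce_iommu_args() and link_liobn_to_iommu_group() are hypothetical.

```c
/*
 * Sketch only: mirrors the uapi additions proposed in this patch.
 * The numbers come from this posting and may not match any released kernel.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>	/* _IOW() and the _IOC_*() decoding macros */

#define KVMIO 0xAE		/* assumed: KVM ioctl type from linux/kvm.h */

/* Copied from the patch (arch/powerpc/include/uapi/asm/kvm.h hunk) */
struct kvm_create_spapr_tce_iommu {
	uint64_t liobn;		/* logical I/O bus number used by the guest */
	uint32_t iommu_id;	/* host IOMMU group ID */
	uint32_t flags;		/* no flag is supported at the moment */
};

/* Copied from the patch (include/uapi/linux/kvm.h hunk) */
#define KVM_CREATE_SPAPR_TCE_IOMMU \
	_IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)

/* Hypothetical helper: fill in the ioctl argument with flags cleared. */
static struct kvm_create_spapr_tce_iommu
make_tce_iommu_args(uint64_t liobn, uint32_t iommu_id)
{
	struct kvm_create_spapr_tce_iommu args = {
		.liobn = liobn,
		.iommu_id = iommu_id,
		.flags = 0,
	};
	return args;
}

/*
 * Hypothetical helper: vm_fd is a VM file descriptor obtained from
 * KVM_CREATE_VM. Returns whatever ioctl() returns, -1 with errno set
 * on failure.
 */
static int link_liobn_to_iommu_group(int vm_fd, uint64_t liobn,
				     uint32_t iommu_id)
{
	struct kvm_create_spapr_tce_iommu args =
		make_tce_iommu_args(liobn, iommu_id);

	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}
```

Per the commit message, a LIOBN that userspace has not registered this way is the case in which TCE hypercalls are still passed up to user mode (for example, to QEMU).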
* [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling @ 2013-06-05 6:11 ` Alexey Kardashevskiy 0 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle TCE request, it passes it to the virtual mode. If it the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> --- Changes: 2013/06/05: * changed capability number * changed ioctl number * update the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended. 
Now real mode handler puts there translated TCEs, tries realmode_get_page() on those and if it fails, it passes control over the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register TCE table in the kernel, in all other cases the virtual mode handler is expected to do the job --- Documentation/virtual/kvm/api.txt | 28 +++++ arch/powerpc/include/asm/kvm_host.h | 3 + arch/powerpc/include/asm/kvm_ppc.h | 2 + arch/powerpc/include/uapi/asm/kvm.h | 7 ++ arch/powerpc/kvm/book3s_64_vio.c | 198 ++++++++++++++++++++++++++++++++++- arch/powerpc/kvm/book3s_64_vio_hv.c | 193 +++++++++++++++++++++++++++++++++- arch/powerpc/kvm/powerpc.c | 12 +++ include/uapi/linux/kvm.h | 2 + 8 files changed, 439 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6c082ff..e962e3b 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2379,6 +2379,34 @@ the guest. Othwerwise it might be better for the guest to continue using H_PUT_T hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). +4.84 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +No flag is supported at the moment. 
+ +When the guest issues TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. The kvm_run structure ------------------------ diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 85d8f26..ac0e2fe 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp; /* used for IOMMU groups */ struct page *pages[0]; }; @@ -611,6 +612,8 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ + unsigned long tce_reason; /* The reason of switching to the virtmode */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e852921b..934e01d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_validate_tce(unsigned long tce); diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 0fb1a6e..cf82af4 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -319,6 +319,13 @@ struct kvm_create_spapr_tce { __u32 window_size; }; +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + 
__u32 flags; +}; + /* for KVM_ALLOCATE_RMA */ struct kvm_allocate_rma { __u64 rma_size; diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 06b7b20..ffb4698 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -27,6 +27,8 @@ #include <linux/hugetlb.h> #include <linux/list.h> #include <linux/anon_inodes.h> +#include <linux/pci.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/kvm_ppc.h> @@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt) mutex_lock(&kvm->lock); list_del(&stt->list); - for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++) - __free_page(stt->pages[i]); +#ifdef CONFIG_IOMMU_API + if (stt->grp) { + iommu_group_put(stt->grp); + } else +#endif + for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++) + __free_page(stt->pages[i]); kfree(stt); mutex_unlock(&kvm->lock); @@ -153,6 +160,62 @@ fail: return ret; } +#ifdef CONFIG_IOMMU_API +static const struct file_operations kvm_spapr_tce_iommu_fops = { + .release = kvm_spapr_tce_release, +}; + +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args) +{ + struct kvmppc_spapr_tce_table *tt = NULL; + struct iommu_group *grp; + struct iommu_table *tbl; + + /* Find an IOMMU table for the given ID */ + grp = iommu_group_get_by_id(args->iommu_id); + if (!grp) + return -ENXIO; + + tbl = iommu_group_get_iommudata(grp); + if (!tbl) + return -ENXIO; + + /* Check this LIOBN hasn't been previously allocated */ + list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) { + if (tt->liobn == args->liobn) + return -EBUSY; + } + + tt = kzalloc(sizeof(*tt), GFP_KERNEL); + if (!tt) + return -ENOMEM; + + tt->liobn = args->liobn; + tt->kvm = kvm; + tt->grp = grp; + + kvm_get_kvm(kvm); + + mutex_lock(&kvm->lock); + list_add(&tt->list, &kvm->arch.spapr_tce_tables); + + mutex_unlock(&kvm->lock); + + pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n", + 
args->liobn, args->iommu_id, args->flags); + + return anon_inode_getfd("kvm-spapr-tce-iommu", + &kvm_spapr_tce_iommu_fops, tt, O_RDWR); +} +#else +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args) +{ + return -ENOSYS; +} +#endif /* CONFIG_IOMMU_API */ + /* Converts guest physical address into host virtual */ static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu, unsigned long gpa) @@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, if (!tt) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + unsigned long entry = ioba >> IOMMU_PAGE_SHIFT; + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + if (vcpu->arch.tce_reason == H_HARDWARE) { + iommu_clear_tces_and_put_pages(tbl, entry, 1); + return H_HARDWARE; + + } else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) { + if (iommu_tce_clear_param_check(tbl, ioba, 0, 1)) + return H_PARAMETER; + + ret = iommu_clear_tces_and_put_pages(tbl, entry, 1); + } else { + void *hva; + + if (iommu_tce_put_param_check(tbl, ioba, tce)) + return H_PARAMETER; + + hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce); + if (hva == ERROR_ADDR) + return H_HARDWARE; + + ret = iommu_put_tce_user_mode(tbl, + ioba >> IOMMU_PAGE_SHIFT, + (unsigned long) hva); + } + iommu_flush_tce(tbl); + + if (ret) + return H_HARDWARE; + + return H_SUCCESS; + } +#endif /* Emulated IO */ if (ioba >= tt->window_size) return H_PARAMETER; @@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, if (tces == ERROR_ADDR) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + /* Something bad happened, do cleanup and exit */ + if (vcpu->arch.tce_reason == H_HARDWARE) { + i = 
vcpu->arch.tce_tmp_num; + goto fail_clear_tce; + } else if (vcpu->arch.tce_reason != H_TOO_HARD) { + /* + * We get here only in PR KVM mode, otherwise + * the real mode handler would have checked TCEs + * already and failed on guest TCE translation. + */ + for (i = 0; i < npages; ++i) { + if (get_user(vcpu->arch.tce_tmp[i], tces + i)) + return H_HARDWARE; + + if (iommu_tce_put_param_check(tbl, ioba + + (i << IOMMU_PAGE_SHIFT), + vcpu->arch.tce_tmp[i])) + return H_PARAMETER; + } + } /* else: The real mode handler checked TCEs already */ + + /* Translate TCEs */ + for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) { + void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, + vcpu->arch.tce_tmp[i]); + + if (hva == ERROR_ADDR) + goto fail_clear_tce; + + vcpu->arch.tce_tmp[i] = (unsigned long) hva; + } + + /* Do get_page and put TCEs for all pages */ + for (i = 0; i < npages; ++i) { + if (iommu_put_tce_user_mode(tbl, + (ioba >> IOMMU_PAGE_SHIFT) + i, + vcpu->arch.tce_tmp[i])) { + i = npages; + goto fail_clear_tce; + } + } + + iommu_flush_tce(tbl); + + return H_SUCCESS; + +fail_clear_tce: + /* Cannot complete the translation, clean up and exit */ + iommu_clear_tces_and_put_pages(tbl, + ioba >> IOMMU_PAGE_SHIFT, i); + + iommu_flush_tce(tbl); + + return H_HARDWARE; + } +#endif /* Emulated IO */ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) return H_PARAMETER; @@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, if (!tt) return H_TOO_HARD; +#ifdef CONFIG_IOMMU_API + if (tt->grp) { + struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp); + unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT; + + vcpu->arch.tce_tmp_num = 0; + + /* Return error if the group is being destroyed */ + if (!tbl) + return H_RESCINDED; + + /* PR KVM? 
*/ + if (!vcpu->arch.tce_tmp_num && + (vcpu->arch.tce_reason != H_TOO_HARD) && + iommu_tce_clear_param_check(tbl, ioba, + tce_value, npages)) + return H_PARAMETER; + + /* Do actual cleanup */ + tmp = vcpu->arch.tce_tmp_num; + if (iommu_clear_tces_and_put_pages(tbl, entry + tmp, + npages - tmp)) + return H_PARAMETER; + + return H_SUCCESS; + } +#endif /* Emulated IO */ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) return H_PARAMETER; diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index c68d538..dc4ae32 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -26,6 +26,7 @@ #include <linux/slab.h> #include <linux/hugetlb.h> #include <linux/list.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/kvm_ppc.h> @@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce); #ifdef CONFIG_KVM_BOOK3S_64_HV static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, - unsigned long *pte_sizep) + unsigned long *pte_sizep, bool do_get_page) { pte_t *ptep; unsigned int shift = 0; @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, if (!pte_present(*ptep)) return __pte(0); + /* + * Put huge pages handling to the virtual mode. + * The only exception is for TCE list pages which we + * do need to call get_page() for. 
+ */ + if ((*pte_sizep > PAGE_SIZE) && do_get_page) + return __pte(0); + /* wait until _PAGE_BUSY is clear then set it atomically */ __asm__ __volatile__ ( "1: ldarx %0,0,%3\n" @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, : "cc"); ret = pte; + if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) { + struct page *pg = NULL; + pg = realmode_pfn_to_page(pte_pfn(pte)); + if (realmode_get_page(pg)) { + ret = __pte(0); + } else { + pte = pte_mkyoung(pte); + if (writing) + pte = pte_mkdirty(pte); + } + } + *ptep = pte; /* clears _PAGE_BUSY */ return ret; } @@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, * Also returns pte and page size if the page is present in page table. */ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, - unsigned long gpa) + unsigned long gpa, bool do_get_page) { struct kvm_memory_slot *memslot; pte_t pte; @@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, /* Find a PTE and determine the size */ pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva, - writing, &pg_size); + writing, &pg_size, do_get_page); if (!pte) return ERROR_ADDR; @@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu, return hpa; } +#ifdef CONFIG_IOMMU_API +static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + long ret = 0, i; + unsigned long entry = ioba >> IOMMU_PAGE_SHIFT; + + if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages)) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) { + struct page *page; + unsigned long oldtce; + + oldtce = iommu_clear_tce(tbl, entry + i); + if (!oldtce) + continue; + + page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT); + if (!page) { + ret = H_TOO_HARD; + break; + } + + if (oldtce & TCE_PCI_WRITE) + SetPageDirty(page); 
+
+		if (realmode_put_page(page)) {
+			ret = H_TOO_HARD;
+			break;
+		}
+	}
+
+	if (ret == H_TOO_HARD) {
+		vcpu->arch.tce_tmp_num = i;
+		vcpu->arch.tce_reason = H_TOO_HARD;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
@@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+
+			hva = (unsigned long) __va(hpa);
+			ret = iommu_tce_build(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					hva, iommu_tce_direction(hva));
+			if (unlikely(ret)) {
+				struct page *pg = realmode_pfn_to_page(hpa);
+				BUG_ON(!pg);
+				if (realmode_put_page(pg)) {
+					vcpu->arch.tce_reason = H_HARDWARE;
+					return H_TOO_HARD;
+				}
+				return H_HARDWARE;
+			}
+		} else {
+			ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+			if (ret)
+				return ret;
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	vcpu->arch.tce_tmp_num = 0;
+	vcpu->arch.tce_reason = 0;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Check all TCEs */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_tce_put_param_check(tbl, ioba +
+					(i << IOMMU_PAGE_SHIFT), tces[i]))
+				return H_PARAMETER;
+			vcpu->arch.tce_tmp[i] = tces[i];
+		}
+
+		/* Translate TCEs and go get_page */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+					vcpu->arch.tce_tmp[i], true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_tmp_num = i;
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+			vcpu->arch.tce_tmp[i] = hpa;
+		}
+
+		/* Put TCEs to the table */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hva = (unsigned long)
+					__va(vcpu->arch.tce_tmp[i]);
+
+			ret = iommu_tce_build(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					hva, iommu_tce_direction(hva));
+			if (ret) {
+				/* All wrong, go virtmode and do cleanup */
+				vcpu->arch.tce_reason = H_HARDWARE;
+				return H_TOO_HARD;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+				tce_value, npages);
+		if (ret)
+			return ret;
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fc0d6b9..8cf86dc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE 93
+#define KVM_CAP_SPAPR_TCE_IOMMU 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -922,6 +923,7 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
-- 
1.7.10.4
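For readers who want to see how userspace might drive the new vm ioctl, here is a minimal sketch. It assumes the interface exactly as proposed in this patch (struct layout from the patch's uapi header, ioctl number 0xaf, KVMIO being 0xAE as in linux/kvm.h); none of this is guaranteed to match a released kernel. The helper name link_liobn_to_iommu_group() is hypothetical and only used for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layout copied from the patch; assumed, not taken from installed headers. */
struct kvm_create_spapr_tce_iommu {
	uint64_t liobn;
	uint32_t iommu_id;
	uint32_t flags;
};

#ifndef KVMIO
#define KVMIO 0xAE
#endif

#ifndef KVM_CREATE_SPAPR_TCE_IOMMU
#define KVM_CREATE_SPAPR_TCE_IOMMU \
	_IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
#endif

/*
 * Hypothetical helper: asks KVM to link a LIOBN to an IOMMU group.
 * vm_fd is a VM file descriptor obtained from KVM_CREATE_VM.
 * Returns whatever ioctl() returns: -1 with errno set on failure.
 */
static int link_liobn_to_iommu_group(int vm_fd, uint64_t liobn,
				     uint32_t iommu_id)
{
	struct kvm_create_spapr_tce_iommu args = {
		.liobn = liobn,
		.iommu_id = iommu_id,
		.flags = 0,	/* the proposed documentation defines no flags */
	};

	assert(sizeof(args) == 16);
	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}
```

A caller would normally check KVM_CAP_SPAPR_TCE_IOMMU with KVM_CHECK_EXTENSION first. Note that the proposed documentation says the ioctl returns 0 on success, while the handler in this patch returns the result of anon_inode_getfd(), so treating any non-negative value as success looks like the safer reading.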
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Benjamin Herrenschmidt @ 2013-06-16 4:39 UTC
To: Alexey Kardashevskiy
Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> -			unsigned long *pte_sizep)
> +			unsigned long *pte_sizep, bool do_get_page)
>  {
>  	pte_t *ptep;
>  	unsigned int shift = 0;
> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>  	if (!pte_present(*ptep))
>  		return __pte(0);
>
> +	/*
> +	 * Put huge pages handling to the virtual mode.
> +	 * The only exception is for TCE list pages which we
> +	 * do need to call get_page() for.
> +	 */
> +	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
> +		return __pte(0);
> +
>  	/* wait until _PAGE_BUSY is clear then set it atomically */
>  	__asm__ __volatile__ (
>  		"1:	ldarx	%0,0,%3\n"
> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>  		: "cc");
>
>  	ret = pte;
> +	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
> +		struct page *pg = NULL;
> +		pg = realmode_pfn_to_page(pte_pfn(pte));
> +		if (realmode_get_page(pg)) {
> +			ret = __pte(0);
> +		} else {
> +			pte = pte_mkyoung(pte);
> +			if (writing)
> +				pte = pte_mkdirty(pte);
> +		}
> +	}
> +	*ptep = pte;	/* clears _PAGE_BUSY */
>
>  	return ret;
>  }

So now you are adding the clearing of _PAGE_BUSY that was missing for
your first patch, except that this is not enough since that means that
in the "emulated" case (ie, !do_get_page) you will in essence return
and then use a PTE that is not locked without any synchronization to
ensure that the underlying page doesn't go away... then you'll
dereference that page.

So either make everything use speculative get_page, or make the emulated
case use the MMU notifier to drop the operation in case of collision.

The former looks easier.

Also, any specific reason why you do:

  - Lock the PTE
  - get_page()
  - Unlock the PTE

Instead of

  - Read the PTE
  - get_page_unless_zero
  - re-check PTE

Like get_user_pages_fast() does ?

The former will be two atomic ops, the latter only one (faster), but
maybe you have a good reason why that can't work...

Cheers,
Ben.
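The "read the PTE, get_page_unless_zero, re-check PTE" sequence suggested above can be sketched as a small userspace model. This is only an illustration built on C11 atomics, not kernel code: struct page_model, the PTE bit layout, PFN_SHIFT and all function names are assumptions made up for the example, and the real kernel helpers have different signatures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT	0x1ULL	/* hypothetical bit layout for the model */
#define PFN_SHIFT	12

struct page_model {
	atomic_int refcount;	/* 0 means the page is being freed */
};

/* Models get_page_unless_zero(): take a reference only if one exists. */
static bool get_page_unless_zero_model(struct page_model *pg)
{
	int old = atomic_load(&pg->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-evaluates it. */
		if (atomic_compare_exchange_weak(&pg->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_page_model(struct page_model *pg)
{
	atomic_fetch_sub(&pg->refcount, 1);
}

/*
 * Speculative lookup: one atomic read-modify-write on the page refcount,
 * no lock bit in the PTE. Returns NULL if the caller should fall back
 * to a slower path.
 */
static struct page_model *speculative_get_page(_Atomic uint64_t *ptep,
					       struct page_model *pages)
{
	uint64_t pte;
	struct page_model *pg;

	assert(ptep && pages);

	pte = atomic_load(ptep);		/* 1. read the PTE */
	if (!(pte & PTE_PRESENT))
		return NULL;

	pg = &pages[pte >> PFN_SHIFT];
	if (!get_page_unless_zero_model(pg))	/* 2. speculative reference */
		return NULL;

	if (atomic_load(ptep) != pte) {		/* 3. re-check the PTE */
		put_page_model(pg);		/* raced with a PTE change */
		return NULL;
	}
	return pg;
}
```

The point of the re-check is that a reference taken on a page that was unmapped in the meantime is dropped again, so the caller never uses a stale translation; only the refcount update is an atomic read-modify-write.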
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-06-19 3:17 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote:
>>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>> -			unsigned long *pte_sizep)
>> +			unsigned long *pte_sizep, bool do_get_page)
>>  {
>>  	pte_t *ptep;
>>  	unsigned int shift = 0;
>> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>>  	if (!pte_present(*ptep))
>>  		return __pte(0);
>>
>> +	/*
>> +	 * Put huge pages handling to the virtual mode.
>> +	 * The only exception is for TCE list pages which we
>> +	 * do need to call get_page() for.
>> +	 */
>> +	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
>> +		return __pte(0);
>> +
>>  	/* wait until _PAGE_BUSY is clear then set it atomically */
>>  	__asm__ __volatile__ (
>>  		"1:	ldarx	%0,0,%3\n"
>> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>>  		: "cc");
>>
>>  	ret = pte;
>> +	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
>> +		struct page *pg = NULL;
>> +		pg = realmode_pfn_to_page(pte_pfn(pte));
>> +		if (realmode_get_page(pg)) {
>> +			ret = __pte(0);
>> +		} else {
>> +			pte = pte_mkyoung(pte);
>> +			if (writing)
>> +				pte = pte_mkdirty(pte);
>> +		}
>> +	}
>> +	*ptep = pte;	/* clears _PAGE_BUSY */
>>
>>  	return ret;
>>  }
>
> So now you are adding the clearing of _PAGE_BUSY that was missing for
> your first patch, except that this is not enough since that means that
> in the "emulated" case (ie, !do_get_page) you will in essence return
> and then use a PTE that is not locked without any synchronization to
> ensure that the underlying page doesn't go away... then you'll
> dereference that page.
>
> So either make everything use speculative get_page, or make the emulated
> case use the MMU notifier to drop the operation in case of collision.
>
> The former looks easier.
>
> Also, any specific reason why you do:
>
>   - Lock the PTE
>   - get_page()
>   - Unlock the PTE
>
> Instead of
>
>   - Read the PTE
>   - get_page_unless_zero
>   - re-check PTE
>
> Like get_user_pages_fast() does ?
>
> The former will be two atomic ops, the latter only one (faster), but
> maybe you have a good reason why that can't work...

If we want to set "dirty" and "young" bits for pte then I do not know how
to avoid _PAGE_BUSY.

-- 
Alexey
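To make the locked update being discussed concrete, the following is a simplified userspace model of "take the busy bit, set young/dirty, release by storing the PTE back". It is a sketch under stated assumptions rather than the kernel implementation: the bit values, the function names and the use of C11 compare-and-swap in place of the ldarx/stdcx. loop are all chosen for illustration, and the model omits the page reference counting that the patch does under the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real powerpc PTE layout differs. */
#define PTE_PRESENT	0x01ULL
#define PTE_BUSY	0x02ULL
#define PTE_WRITE	0x04ULL
#define PTE_YOUNG	0x08ULL
#define PTE_DIRTY	0x10ULL

/* Spin until the busy bit is clear, then set it atomically. */
static uint64_t lock_pte_model(_Atomic uint64_t *ptep)
{
	for (;;) {
		uint64_t old = atomic_load(ptep);

		if (old & PTE_BUSY)
			continue;	/* another CPU holds the PTE */
		if (atomic_compare_exchange_weak(ptep, &old, old | PTE_BUSY))
			return old;	/* value seen before locking */
	}
}

/*
 * Returns the updated PTE, or 0 if the access cannot be satisfied.
 * The final store both publishes the new flag bits and clears the
 * busy bit, mirroring the "*ptep = pte" line in the patch.
 */
static uint64_t lookup_and_mark_model(_Atomic uint64_t *ptep, bool writing)
{
	uint64_t pte, ret = 0;

	assert(ptep);

	pte = lock_pte_model(ptep);
	if ((pte & PTE_PRESENT) && (!writing || (pte & PTE_WRITE))) {
		pte |= PTE_YOUNG;
		if (writing)
			pte |= PTE_DIRTY;
		ret = pte;
	}
	atomic_store(ptep, pte);	/* unlock: busy bit is not set in pte */
	return ret;
}
```

In this model the lock costs one atomic operation on the PTE, and any page reference taken while holding it would be a second one, which is the "two atomic ops" cost mentioned earlier in the thread.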
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling @ 2013-06-19 3:17 ` Alexey Kardashevskiy 0 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-19 3:17 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: kvm, linux-kernel, kvm-ppc, Alexander Graf, Paul Mackerras, linuxppc-dev, David Gibson On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote: >> static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, >> - unsigned long *pte_sizep) >> + unsigned long *pte_sizep, bool do_get_page) >> { >> pte_t *ptep; >> unsigned int shift = 0; >> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, >> if (!pte_present(*ptep)) >> return __pte(0); >> >> + /* >> + * Put huge pages handling to the virtual mode. >> + * The only exception is for TCE list pages which we >> + * do need to call get_page() for. >> + */ >> + if ((*pte_sizep > PAGE_SIZE) && do_get_page) >> + return __pte(0); >> + >> /* wait until _PAGE_BUSY is clear then set it atomically */ >> __asm__ __volatile__ ( >> "1: ldarx %0,0,%3\n" >> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, >> : "cc"); >> >> ret = pte; >> + if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) { >> + struct page *pg = NULL; >> + pg = realmode_pfn_to_page(pte_pfn(pte)); >> + if (realmode_get_page(pg)) { >> + ret = __pte(0); >> + } else { >> + pte = pte_mkyoung(pte); >> + if (writing) >> + pte = pte_mkdirty(pte); >> + } >> + } >> + *ptep = pte; /* clears _PAGE_BUSY */ >> >> return ret; >> } > > So now you are adding the clearing of _PAGE_BUSY that was missing for > your first patch, except that this is not enough since that means that > in the "emulated" case (ie, !do_get_page) you will in essence return > and then use a PTE that is not locked without any synchronization to > ensure that the underlying page doesn't go away... then you'll > dereference that page. 
> > So either make everything use speculative get_page, or make the emulated > case use the MMU notifier to drop the operation in case of collision. > > The former looks easier. > > Also, any specific reason why you do: > > - Lock the PTE > - get_page() > - Unlock the PTE > > Instead of > > - Read the PTE > - get_page_unless_zero > - re-check PTE > > Like get_user_pages_fast() does ? > > The former will be two atomic ops, the latter only one (faster), but > maybe you have a good reason why that can't work... If we want to set "dirty" and "young" bits for pte then I do not know how to avoid _PAGE_BUSY. -- Alexey ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-05  6:11   ` Alexey Kardashevskiy
@ 2013-06-16 22:25     ` Alexander Graf
From: Alexander Graf @ 2013-06-16 22:25 UTC
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
>
> Both real and virtual modes are supported - whenever the kernel
> fails to handle TCE request, it passes it to the virtual mode.
> If the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
>
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
>
> ---
>
> Changes:
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
>
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over the virtual mode handler which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register TCE table in the kernel, in all other cases the virtual mode
> handler is expected to do the job
> ---
>  Documentation/virtual/kvm/api.txt   |   28 +++++
>  arch/powerpc/include/asm/kvm_host.h |    3 +
>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>  arch/powerpc/include/uapi/asm/kvm.h |    7 ++
>  arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/powerpc.c          |   12 +++
>  include/uapi/linux/kvm.h            |    2 +
>  8 files changed, 439 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 6c082ff..e962e3b 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2379,6 +2379,34 @@ the guest. Othwerwise it might be better for the guest to continue using H_PUT_T
>  hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>
>
> +4.84 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.

Ok, please walk me through the security model you have in mind here.
Basically what this ioctl does is that it creates a guest TCE table that
reflects its changes into a host TCE table whenever it gets modified. So
far so good.

Now I don't see any checks that verify whether iommu_id is actually good
to use from that user's access rights. Just because I have access to
/dev/kvm I don't necessarily have access to an iommu control device.

So the least I can see would be a local DoS attack where one user space
program with only access to /dev/kvm can simply kill any access to
another process's device by overflowing a host iommu TCE table with junk
entries.

There's even a certain chance of an information disclosure exploit here
where a malicious user space program could get itself all network
traffic DMA'd from another VM.

How does this work on the host level? What is the security token to take
control of a host TCE table?

Alex
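To make the concern above concrete, here is a minimal userspace sketch of the ioctl argument as quoted from the api.txt hunk. The struct mirrors the quoted layout using `<stdint.h>` types in place of `__u64`/`__u32`; `check_create_args()` is a hypothetical helper, and rejecting non-zero flags is an assumption drawn from "No flag is supported at the moment", not something the quoted patch is shown to do.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace mirror of the struct quoted from the api.txt hunk. */
struct kvm_create_spapr_tce_iommu {
	uint64_t liobn;		/* virtual PCI bus ID chosen by userspace */
	uint32_t iommu_id;	/* bare IOMMU group number */
	uint32_t flags;		/* no flag is defined yet */
};

/* 8 + 4 + 4 bytes, no padding expected. */
static_assert(sizeof(struct kvm_create_spapr_tce_iommu) == 16,
	      "unexpected struct layout");

/*
 * Hypothetical validation helper (an assumption, not code from the patch).
 * The only thing that can be validated from the fields alone is flags;
 * iommu_id is just an integer and says nothing about the caller's rights.
 */
static int check_create_args(const struct kvm_create_spapr_tce_iommu *args)
{
	if (args->flags != 0)
		return -EINVAL;
	return 0;
}
```

Nothing in the argument identifies who is allowed to use the group, which is the gap the questions above point at.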
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-05  6:11   ` Alexey Kardashevskiy
@ 2013-06-16 22:39     ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-16 22:39 UTC
To: Alexey Kardashevskiy
Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc, Alex Williamson

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	struct kvmppc_spapr_tce_table *tt = NULL;
> +	struct iommu_group *grp;
> +	struct iommu_table *tbl;
> +
> +	/* Find an IOMMU table for the given ID */
> +	grp = iommu_group_get_by_id(args->iommu_id);
> +	if (!grp)
> +		return -ENXIO;
> +
> +	tbl = iommu_group_get_iommudata(grp);
> +	if (!tbl)
> +		return -ENXIO;

So, as Alex Graf pointed out, there is a security issue here, or are we
missing something?

What prevents a malicious program that has access to /dev/kvm from
taking over random iommu groups (including host used ones) that way?

What is the security model of that whole iommu stuff to begin with?

Cheers,
Ben.
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-16 22:39     ` Benjamin Herrenschmidt
@ 2013-06-17  3:13       ` Alex Williamson
From: Alex Williamson @ 2013-06-17 3:13 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On Mon, 2013-06-17 at 08:39 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> > +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> > +		struct kvm_create_spapr_tce_iommu *args)
> > +{
> > +	struct kvmppc_spapr_tce_table *tt = NULL;
> > +	struct iommu_group *grp;
> > +	struct iommu_table *tbl;
> > +
> > +	/* Find an IOMMU table for the given ID */
> > +	grp = iommu_group_get_by_id(args->iommu_id);
> > +	if (!grp)
> > +		return -ENXIO;
> > +
> > +	tbl = iommu_group_get_iommudata(grp);
> > +	if (!tbl)
> > +		return -ENXIO;
>
> So Alex Graf pointed out here, there is a security issue here, or are we
> missing something ?
>
> What prevents a malicious program that has access to /dev/kvm from
> taking over random iommu groups (including host used ones) that way?
>
> What is the security model of that whole iommu stuff to begin with ?

IOMMU groups themselves don't provide security, they're accessed by
interfaces like VFIO, which provide the security. Given a brief look, I
agree, this looks like a possible backdoor. The typical VFIO way to
handle this would be to pass a VFIO file descriptor here to prove that
the process has access to the IOMMU group. This is how /dev/vfio/vfio
gains the ability to set up an IOMMU domain and do mappings with the
SET_CONTAINER ioctl using a group fd. Thanks,

Alex
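The "pass a file descriptor to prove access" idea can be illustrated with a userspace analogy. The sketch below is not the VFIO API and not what an in-kernel check would look like: `fd_matches_path()` and `claim_group()` are hypothetical helpers that only rely on POSIX `fstat()`/`stat()`. The point is that an already open descriptor shows the process passed the permission check when the node was opened, so a consumer can require the descriptor instead of trusting a bare group number.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if fd is an open descriptor for the object currently at path. */
static bool fd_matches_path(int fd, const char *path)
{
	struct stat by_fd, by_path;

	if (fstat(fd, &by_fd) != 0 || stat(path, &by_path) != 0)
		return false;
	return by_fd.st_dev == by_path.st_dev &&
	       by_fd.st_ino == by_path.st_ino;
}

/*
 * Hypothetical consumer (an assumption for illustration): accept a group
 * only when the caller hands over a descriptor opened on that group's node.
 * Returns 0 on success, -1 when the descriptor does not prove access.
 */
static int claim_group(int group_fd, const char *group_node)
{
	assert(group_node != NULL);

	if (!fd_matches_path(group_fd, group_node))
		return -1;
	return 0;
}
```

A caller would open the group node itself and hand the resulting descriptor to `claim_group()`; a process that cannot open the node has nothing valid to pass.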
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-17  3:13       ` Alex Williamson
@ 2013-06-17  3:56         ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-17 3:56 UTC
To: Alex Williamson
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:

> IOMMU groups themselves don't provide security, they're accessed by
> interfaces like VFIO, which provide the security. Given a brief look, I
> agree, this looks like a possible backdoor. The typical VFIO way to
> handle this would be to pass a VFIO file descriptor here to prove that
> the process has access to the IOMMU group. This is how /dev/vfio/vfio
> gains the ability to set up an IOMMU domain and do mappings with the
> SET_CONTAINER ioctl using a group fd. Thanks,

How do you envision that in the kernel? I.e., I'm in KVM code and get that
vfio fd; what do I do with it?

Basically, KVM needs to know that the user is allowed to use that iommu
group. However, I don't think we want KVM to call into VFIO directly,
right?

Cheers,
Ben.
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-17  3:56         ` Benjamin Herrenschmidt
@ 2013-06-18  2:32           ` Alex Williamson
From: Alex Williamson @ 2013-06-18 2:32 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
>
> > IOMMU groups themselves don't provide security, they're accessed by
> > interfaces like VFIO, which provide the security. Given a brief look, I
> > agree, this looks like a possible backdoor. The typical VFIO way to
> > handle this would be to pass a VFIO file descriptor here to prove that
> > the process has access to the IOMMU group. This is how /dev/vfio/vfio
> > gains the ability to set up an IOMMU domain and do mappings with the
> > SET_CONTAINER ioctl using a group fd. Thanks,
>
> How do you envision that in the kernel ? IE. I'm in KVM code, gets that
> vfio fd, what do I do with it ?
>
> Basically, KVM needs to know that the user is allowed to use that iommu
> group. I don't think we want KVM however to call into VFIO directly
> right ?

Right, we don't want to create dependencies across modules. I don't
have a vision for how this should work. This is effectively a complete
side-band to vfio, so we're really just dealing in the iommu group
space. Maybe there needs to be some kind of registration of ownership
for the group using some kind of token. It would need to include some
kind of notification when that ownership ends. That might also be a
convenient tag to toggle driver probing off for devices in the group.
Other ideas? Thanks,

Alex
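The "registration of ownership using some kind of token, with notification when ownership ends" idea floated above could look roughly like the following. This is a single-threaded userspace sketch under stated assumptions: every `toy_*` name is hypothetical, the token is an opaque pointer, and a real kernel version would need locking and reference counting that are left out here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_GROUPS 8

/* Called when the owner gives up a group (the "ownership ends" notice). */
typedef void (*toy_release_fn)(int group_id, void *owner);

struct toy_group_owner {
	void *owner;			/* opaque token, NULL when unowned */
	toy_release_fn on_release;
};

static struct toy_group_owner toy_groups[TOY_MAX_GROUPS];
static int toy_last_released = -1;

/* Example callback: records which group was released. */
static void toy_note_release(int group_id, void *owner)
{
	assert(group_id >= 0 && group_id < TOY_MAX_GROUPS);
	(void)owner;
	toy_last_released = group_id;
}

/* Claim a group; returns 0, or -1 if invalid or already owned. */
static int toy_group_claim(int group_id, void *owner, toy_release_fn on_release)
{
	if (group_id < 0 || group_id >= TOY_MAX_GROUPS || owner == NULL)
		return -1;
	if (toy_groups[group_id].owner != NULL)
		return -1;
	toy_groups[group_id].owner = owner;
	toy_groups[group_id].on_release = on_release;
	return 0;
}

/* A second subsystem would call this before touching the group. */
static int toy_group_is_owner(int group_id, const void *owner)
{
	if (group_id < 0 || group_id >= TOY_MAX_GROUPS || owner == NULL)
		return 0;
	return toy_groups[group_id].owner == owner;
}

/* Only the current owner may release; the callback is then notified. */
static int toy_group_release(int group_id, void *owner)
{
	toy_release_fn cb;

	if (!toy_group_is_owner(group_id, owner))
		return -1;
	cb = toy_groups[group_id].on_release;
	toy_groups[group_id].owner = NULL;
	toy_groups[group_id].on_release = NULL;
	if (cb)
		cb(group_id, owner);
	return 0;
}
```

In this model the registry lives with the group rather than with either user of it, which matches the stated wish to avoid a direct dependency between the two modules.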
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Benjamin Herrenschmidt @ 2013-06-18 4:38 UTC
To: Alex Williamson
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
    Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:

> Right, we don't want to create dependencies across modules. I don't
> have a vision for how this should work. This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space. Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token. It would need to include some
> kind of notification when that ownership ends. That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas? Thanks,

All of that smells nasty, like it will need a pile of bloody
infrastructure... which makes me think it's too complicated and not the
right approach.

How does access control work today on x86/VFIO? Can you give me a bit
more detail? I didn't get a good grasp from your previous email...

From the look of it, the VFIO file descriptor is what has the "access
control" to the underlying iommu, is this right? So we somehow need to
transfer (or copy) that ownership from the VFIO fd to the KVM VM.

I don't see a way to do that without some cross-layering here...

Rusty, are you aware of some kernel mechanism we can use for that?

Cheers,
Ben.
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alex Williamson @ 2013-06-18 14:48 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
    Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
>
> > Right, we don't want to create dependencies across modules. I don't
> > have a vision for how this should work. This is effectively a complete
> > side-band to vfio, so we're really just dealing in the iommu group
> > space. Maybe there needs to be some kind of registration of ownership
> > for the group using some kind of token. It would need to include some
> > kind of notification when that ownership ends. That might also be a
> > convenient tag to toggle driver probing off for devices in the group.
> > Other ideas? Thanks,
>
> All of that smells nasty, like it will need a pile of bloody
> infrastructure... which makes me think it's too complicated and not the
> right approach.
>
> How does access control work today on x86/VFIO? Can you give me a bit
> more detail? I didn't get a good grasp from your previous email...

The current model is not x86 specific, but it only covers doing iommu
and device access through vfio. The kink here is that we're trying to
do device access and setup through vfio, but iommu manipulation through
kvm. We may want to revisit whether we can do the in-kernel iommu
manipulation through vfio rather than kvm.

For vfio in general, the group is the unit of ownership. A user is
granted access to /dev/vfio/$GROUP through file permissions. The user
opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
on the group. If supported by the platform, multiple groups can be set
to the same container, which allows for iommu domain sharing. Once a
group is associated with a container, an iommu backend can be
initialized for the container. Only then can a device be accessed
through the group.

So even if we were to pass a vfio group file descriptor into kvm and it
matched as some kind of ownership token on the iommu group, it's not
clear that's sufficient to assume we can start programming the iommu.
Thanks,

Alex

> From the look of it, the VFIO file descriptor is what has the "access
> control" to the underlying iommu, is this right? So we somehow need to
> transfer (or copy) that ownership from the VFIO fd to the KVM VM.
>
> I don't see a way to do that without some cross-layering here...
>
> Rusty, are you aware of some kernel mechanism we can use for that?
>
> Cheers,
> Ben.
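[Editorial note] The ownership sequence Alex describes above (open the container and the group, bind the group to the container, initialise an iommu backend, and only then obtain a device) can be sketched from userspace roughly as follows. This is a minimal sketch using the ioctls from <linux/vfio.h>, not code from the thread; the function name vfio_open_device() is hypothetical, and error handling is reduced to "return -1".

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: walk the VFIO ownership sequence and return a
 * device fd, or -1 on any failure. On success the container and group
 * fds are deliberately left open, since the device depends on them.
 */
int vfio_open_device(const char *group_path, int iommu_type,
		     const char *dev_name)
{
	struct vfio_group_status status = { .argsz = sizeof(status) };
	int container, group, device;

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0)
		return -1;

	/* File permissions on the group node are the access check. */
	group = open(group_path, O_RDWR);
	if (group < 0)
		goto out_container;

	/* A group is viable only if no device in it is held by a host driver. */
	if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
	    !(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		goto out_group;

	/* Associate the group with the container. */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0)
		goto out_group;

	/* Initialise an iommu backend for the container. */
	if (ioctl(container, VFIO_SET_IOMMU, iommu_type) < 0)
		goto out_group;

	/* Only now can a device be reached through the group. */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, dev_name);
	if (device >= 0)
		return device;

out_group:
	close(group);
out_container:
	close(container);
	return -1;
}
```

A caller might use vfio_open_device("/dev/vfio/26", VFIO_SPAPR_TCE_IOMMU, "0000:06:0d.0"); the group number and PCI address here are placeholders, not values from this thread. Note that nothing in this sequence involves KVM, which is the gap the thread is trying to close.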
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Benjamin Herrenschmidt @ 2013-06-18 21:58 UTC
To: Alex Williamson
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
    Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Tue, 2013-06-18 at 08:48 -0600, Alex Williamson wrote:
> On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> >
> > > Right, we don't want to create dependencies across modules. I don't
> > > have a vision for how this should work. This is effectively a complete
> > > side-band to vfio, so we're really just dealing in the iommu group
> > > space. Maybe there needs to be some kind of registration of ownership
> > > for the group using some kind of token. It would need to include some
> > > kind of notification when that ownership ends. That might also be a
> > > convenient tag to toggle driver probing off for devices in the group.
> > > Other ideas? Thanks,
> >
> > All of that smells nasty, like it will need a pile of bloody
> > infrastructure... which makes me think it's too complicated and not the
> > right approach.
> >
> > How does access control work today on x86/VFIO? Can you give me a bit
> > more detail? I didn't get a good grasp from your previous email...
>
> The current model is not x86 specific, but it only covers doing iommu
> and device access through vfio. The kink here is that we're trying to
> do device access and setup through vfio, but iommu manipulation through
> kvm. We may want to revisit whether we can do the in-kernel iommu
> manipulation through vfio rather than kvm.

How would that be possible? The hypercalls from the guest arrive in
KVM... either in a very, very specific and restricted environment which
we call real mode (MMU off but still running in guest context), where we
try to do as much as possible, or in virtual mode, where they get
handled as normal KVM exits.

The only way we could handle them "in VFIO" would be if VFIO somehow
registered callbacks with KVM... If we have that sort of
cross-dependency, then we may as well have a simpler one where VFIO
tells KVM what iommu is available for the VM.

> For vfio in general, the group is the unit of ownership. A user is
> granted access to /dev/vfio/$GROUP through file permissions. The user
> opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
> on the group. If supported by the platform, multiple groups can be set
> to the same container, which allows for iommu domain sharing. Once a
> group is associated with a container, an iommu backend can be
> initialized for the container. Only then can a device be accessed
> through the group.
>
> So even if we were to pass a vfio group file descriptor into kvm and it
> matched as some kind of ownership token on the iommu group, it's not
> clear that's sufficient to assume we can start programming the iommu.
> Thanks,

It seems to me that your scheme would have the same problem if you
wanted to do a virtualized iommu...

In any case, this is a big deal. We have a requirement for pass-through.
It cannot work with any remotely usable performance level if we don't
implement the calls in KVM, so it needs to be sorted out one way or
another, and I'm at a loss as to how here...

Ben.

> Alex
>
> > From the look of it, the VFIO file descriptor is what has the "access
> > control" to the underlying iommu, is this right? So we somehow need to
> > transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> >
> > I don't see a way to do that without some cross-layering here...
> >
> > Rusty, are you aware of some kernel mechanism we can use for that?
> >
> > Cheers,
> > Ben.
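[Editorial note] Ben's description of the two hypercall paths (a restricted real-mode handler that does as much as it can, with virtual mode handling the rest as a normal KVM exit) amounts to a "try the fast path, fall back to the slow path" dispatch. The following is a toy userspace model of that shape only. H_SUCCESS, H_PARAMETER and H_TOO_HARD are the hcall return codes as defined in the kernel's asm/hvcall.h (H_TOO_HARD being the kernel-internal "punt to virtual mode" value); every other name, the table layout, and the rule for what real mode may handle are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define H_SUCCESS    0
#define H_PARAMETER  (-4)
#define H_TOO_HARD   9999   /* kernel-internal: retry in virtual mode */

#define TCE_ENTRIES  16     /* hypothetical table size */

/* Hypothetical stand-in for a guest TCE (DMA translation) table. */
struct tce_table {
	uint64_t entry[TCE_ENTRIES];
	bool     fast_ok[TCE_ENTRIES];  /* assumption: real mode may touch this slot */
};

/* Counters so a caller can see which path handled each request. */
int handled_real, handled_virt;

/* Restricted path: handles the cheap case and punts on everything else. */
long put_tce_real(struct tce_table *t, unsigned int idx, uint64_t tce)
{
	if (idx >= TCE_ENTRIES || !t->fast_ok[idx])
		return H_TOO_HARD;
	t->entry[idx] = tce;
	handled_real++;
	return H_SUCCESS;
}

/* Full path: validates the request and can always complete a valid one. */
long put_tce_virt(struct tce_table *t, unsigned int idx, uint64_t tce)
{
	if (idx >= TCE_ENTRIES)
		return H_PARAMETER;
	t->entry[idx] = tce;
	t->fast_ok[idx] = true;  /* later updates may take the fast path */
	handled_virt++;
	return H_SUCCESS;
}

/* Dispatcher: try real mode first, fall back only when it gives up. */
long handle_put_tce(struct tce_table *t, unsigned int idx, uint64_t tce)
{
	long ret = put_tce_real(t, idx, tce);

	if (ret == H_TOO_HARD)
		ret = put_tce_virt(t, idx, tce);
	return ret;
}
```

The point of the model is that both handlers sit behind KVM's hypercall entry, which is why Ben argues the iommu updates cannot simply be moved into VFIO without some hook between the two subsystems.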
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Rusty Russell @ 2013-06-19 3:35 UTC
To: Alex Williamson, Benjamin Herrenschmidt
Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
    Paul Mackerras, linuxppc-dev, David Gibson

Alex Williamson <alex.williamson@redhat.com> writes:
> On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
>> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
>>
>> > IOMMU groups themselves don't provide security, they're accessed by
>> > interfaces like VFIO, which provide the security. Given a brief look, I
>> > agree, this looks like a possible backdoor. The typical VFIO way to
>> > handle this would be to pass a VFIO file descriptor here to prove that
>> > the process has access to the IOMMU group. This is how /dev/vfio/vfio
>> > gains the ability to set up an IOMMU domain and do mappings with the
>> > SET_CONTAINER ioctl using a group fd. Thanks,
>>
>> How do you envision that in the kernel? I.e. I'm in KVM code and get that
>> vfio fd; what do I do with it?
>>
>> Basically, KVM needs to know that the user is allowed to use that iommu
>> group. However, I don't think we want KVM to call into VFIO directly,
>> right?
>
> Right, we don't want to create dependencies across modules. I don't
> have a vision for how this should work. This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space. Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token. It would need to include some
> kind of notification when that ownership ends. That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas? Thanks,

It's actually not that bad.

eg.

	struct vfio_container *vfio_container_from_file(struct file *filp)
	{
		if (filp->f_op != &vfio_device_fops)
			return ERR_PTR(-EINVAL);

		/* OK it really is a vfio fd, return the data. */
		....
	}
	EXPORT_SYMBOL_GPL(vfio_container_from_file);

...

inside KVM_CREATE_SPAPR_TCE_IOMMU:

	struct file *vfio_filp;
	struct vfio_container *(*lookup)(struct file *filp);

	vfio_filp = fget(create_tce_iommu.fd);
	if (!vfio_filp)
		ret = -EBADF;
	lookup = symbol_get(vfio_container_from_file);
	if (!lookup)
		ret = -EINVAL;
	else {
		container = lookup(vfio_filp);
		if (IS_ERR(container))
			ret = PTR_ERR(container);
		else
			...
		symbol_put(vfio_container_from_file);
	}

symbol_get() won't try to load a module; it'll just fail. This is what
you want, since they must have vfio in the kernel to get a valid fd...

Hope that helps,
Rusty.
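[Editorial note] Rusty's sketch leans on two kernel idioms: identifying what kind of file an fd refers to by comparing its f_op pointer against a known operations table, and returning an errno encoded in a pointer (ERR_PTR / IS_ERR / PTR_ERR). The following is a userspace re-creation of both for illustration, assuming simplified stand-ins for the kernel types. The three helpers mirror the behaviour of the ones in the kernel's linux/err.h; struct file, struct file_operations, struct container, vfio_like_fops and the two functions are hypothetical mock-ups, not kernel or VFIO definitions.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top 4095 values of the address space. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Mock-ups of the kernel structures involved (assumptions). */
struct file_operations { const char *name; };
struct file {
	const struct file_operations *f_op;
	void *private_data;
};
struct container { int id; };

const struct file_operations vfio_like_fops = { "vfio-like" };
const struct file_operations other_fops = { "other" };

/*
 * The ops-table address acts as a type tag: only files created by the
 * owning subsystem can carry it, so a match proves what private_data is.
 */
struct container *container_from_file(struct file *filp)
{
	if (filp->f_op != &vfio_like_fops)
		return ERR_PTR(-EINVAL);
	return filp->private_data;
}

/* Caller side: unpack either the container or the encoded errno. */
long attach(struct file *filp, struct container **out)
{
	struct container *c = container_from_file(filp);

	if (IS_ERR(c))
		return PTR_ERR(c);
	*out = c;
	return 0;
}
```

In the kernel version the caller would additionally reach container_from_file() through symbol_get(), so that KVM has no link-time dependency on the vfio module; that part has no direct userspace equivalent and is left out here.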
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19  3:35 ` Rusty Russell
@ 2013-06-19  4:59 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-19  4:59 UTC
To: Alex Williamson
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:

> symbol_get() won't try to load a module; it'll just fail. This is what
> you want, since they must have vfio in the kernel to get a valid fd...

Ok, cool. I suppose what we want here Alexey is slightly higher level,
something like:

        vfio_validate_iommu_id(file, iommu_id)

Which verifies that the file that was passed in is allowed to use
that iommu_id.

That's a simple and flexible interface (ie, it will work even if we
support multiple iommu IDs in the future for a vfio, for example
for DDW windows etc...), the logic to know about the ID remains
in qemu, this is strictly a validation call.

That way we also don't have to expose the containing vfio struct etc...
just that simple function.

Alex, any objection ?

Do we need to make it a get/put interface instead ?

        vfio_validate_and_use_iommu(file, iommu_id);

        vfio_release_iommu(file, iommu_id);

To ensure that the resource remains owned by the process until KVM
is closed as well ?

Or do we want to register with VFIO with a callback so that VFIO can
call us if it needs us to give it up ?

Cheers,
Ben.
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19  4:59 ` Benjamin Herrenschmidt
@ 2013-06-19  9:58 ` Alexander Graf
From: Alexander Graf @ 2013-06-19  9:58 UTC
To: Benjamin Herrenschmidt
Cc: Alex Williamson, Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On 19.06.2013, at 06:59, Benjamin Herrenschmidt wrote:

> On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:
>> symbol_get() won't try to load a module; it'll just fail. This is what
>> you want, since they must have vfio in the kernel to get a valid fd...
>
> Ok, cool. I suppose what we want here Alexey is slightly higher level,
> something like:
>
>       vfio_validate_iommu_id(file, iommu_id)
>
> Which verifies that the file that was passed in is allowed to use
> that iommu_id.
>
> That's a simple and flexible interface (ie, it will work even if we
> support multiple iommu IDs in the future for a vfio, for example
> for DDW windows etc...), the logic to know about the ID remains
> in qemu, this is strictly a validation call.
>
> That way we also don't have to expose the containing vfio struct etc...
> just that simple function.
>
> Alex, any objection ?

Which Alex? :)

I think validate works, it keeps iteration logic out of the kernel which
is a good thing. There still needs to be an interface for getting the
iommu id in VFIO, but I suppose that one's for the other Alex and Jörg
to comment on.

>
> Do we need to make it a get/put interface instead ?
>
>       vfio_validate_and_use_iommu(file, iommu_id);
>
>       vfio_release_iommu(file, iommu_id);
>
> To ensure that the resource remains owned by the process until KVM
> is closed as well ?
>
> Or do we want to register with VFIO with a callback so that VFIO can
> call us if it needs us to give it up ?

Can't we just register a handler on the fd and get notified when it
closes? Can you kill VFIO access without closing the fd?

Alex
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19  9:58 ` Alexander Graf
@ 2013-06-19 14:50 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-19 14:50 UTC
To: Alexander Graf
Cc: Alex Williamson, Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:

> > Alex, any objection ?
>
> Which Alex? :)

Heh, mostly Williamson in this specific case but your input is still
welcome :-)

> I think validate works, it keeps iteration logic out of the kernel
> which is a good thing. There still needs to be an interface for
> getting the iommu id in VFIO, but I suppose that one's for the other
> Alex and Jörg to comment on.

I think getting the iommu fd is already covered by separate patches from
Alexey.

> >
> > Do we need to make it a get/put interface instead ?
> >
> >       vfio_validate_and_use_iommu(file, iommu_id);
> >
> >       vfio_release_iommu(file, iommu_id);
> >
> > To ensure that the resource remains owned by the process until KVM
> > is closed as well ?
> >
> > Or do we want to register with VFIO with a callback so that VFIO can
> > call us if it needs us to give it up ?
>
> Can't we just register a handler on the fd and get notified when it
> closes? Can you kill VFIO access without closing the fd?

That sounds actually harder :-)

The question is basically: When we validate that relationship between a
specific VFIO struct file with an iommu, what is the lifetime of that
and how do we handle this lifetime properly.

There's two ways for that sort of situation: The notification model
where we get notified when the relationship is broken, and the refcount
model where we become a "user" and thus delay the breaking of the
relationship until we have been disposed of as well.

In this specific case, it's hard to tell what is the right model from my
perspective, which is why I would welcome Alex (W.) input.

In the end, the solution will end up being in the form of APIs exposed
by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
as owner of VFIO at this stage, what do you want those to look
like ? :-)

Cheers,
Ben.
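The two lifetime models Ben contrasts can be put side by side in a small userspace sketch. Everything here is hypothetical and invented for illustration (`struct binding`, `binding_get()`, `binding_watch()` and so on); it is single-threaded, has no locking, and is not how VFIO or KVM implement either model. It only shows the behavioural difference: with a refcount the teardown is deferred until the external user lets go, while with a notification the external user is told to stop and the teardown proceeds at once.

```c
#include <assert.h>
#include <stddef.h>

struct binding;
typedef void (*revoke_cb)(struct binding *b, void *ctx);

/* Hypothetical model of "this fd is allowed to use this iommu". */
struct binding {
    int owner_open;       /* 1 while the owning fd is still open */
    int users;            /* refcount model: external users holding on */
    revoke_cb on_revoke;  /* notification model: who to tell on teardown */
    void *ctx;
    int torn_down;        /* 1 once the relationship is really gone */
};

void binding_init(struct binding *b)
{
    b->owner_open = 1;
    b->users = 0;
    b->on_revoke = NULL;
    b->ctx = NULL;
    b->torn_down = 0;
}

/* Refcount model: become a user; fails once the owner has gone away. */
int binding_get(struct binding *b)
{
    if (!b->owner_open)
        return -1;
    b->users++;
    return 0;
}

/* Refcount model: the last user to leave completes a pending teardown. */
void binding_put(struct binding *b)
{
    assert(b->users > 0);
    b->users--;
    if (!b->owner_open && b->users == 0)
        b->torn_down = 1;
}

/* Notification model: ask to be told when the relationship breaks. */
void binding_watch(struct binding *b, revoke_cb cb, void *ctx)
{
    b->on_revoke = cb;
    b->ctx = ctx;
}

/* The owning fd is closed: notify any watcher, tear down if unused. */
void binding_owner_close(struct binding *b)
{
    b->owner_open = 0;
    if (b->on_revoke)
        b->on_revoke(b, b->ctx);
    if (b->users == 0)
        b->torn_down = 1;
}

/* Example callback: records that the watcher was told to stop. */
void mark_revoked(struct binding *b, void *ctx)
{
    (void)b;
    *(int *)ctx = 1;
}
```

A refcount user would call `binding_get()` when it starts using the iommu and `binding_put()` when it is destroyed; a notification user would call `binding_watch()` once and stop using the iommu inside its callback.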
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19 14:50 ` Benjamin Herrenschmidt
@ 2013-06-19 15:49 ` Alex Williamson
From: Alex Williamson @ 2013-06-19 15:49 UTC
To: Benjamin Herrenschmidt
Cc: Alexander Graf, Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
>
> > > Alex, any objection ?
> >
> > Which Alex? :)
>
> Heh, mostly Williamson in this specific case but your input is still
> welcome :-)
>
> > I think validate works, it keeps iteration logic out of the kernel
> > which is a good thing. There still needs to be an interface for
> > getting the iommu id in VFIO, but I suppose that one's for the other
> > Alex and Jörg to comment on.
>
> I think getting the iommu fd is already covered by separate patches from
> Alexey.
>
> > >
> > > Do we need to make it a get/put interface instead ?
> > >
> > >       vfio_validate_and_use_iommu(file, iommu_id);
> > >
> > >       vfio_release_iommu(file, iommu_id);
> > >
> > > To ensure that the resource remains owned by the process until KVM
> > > is closed as well ?
> > >
> > > Or do we want to register with VFIO with a callback so that VFIO can
> > > call us if it needs us to give it up ?
> >
> > Can't we just register a handler on the fd and get notified when it
> > closes? Can you kill VFIO access without closing the fd?
>
> That sounds actually harder :-)
>
> The question is basically: When we validate that relationship between a
> specific VFIO struct file with an iommu, what is the lifetime of that
> and how do we handle this lifetime properly.
>
> There's two ways for that sort of situation: The notification model
> where we get notified when the relationship is broken, and the refcount
> model where we become a "user" and thus delay the breaking of the
> relationship until we have been disposed of as well.
>
> In this specific case, it's hard to tell what is the right model from my
> perspective, which is why I would welcome Alex (W.) input.
>
> In the end, the solution will end up being in the form of APIs exposed
> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
> as owner of VFIO at this stage, what do you want those to look
> like ? :-)

My first thought is that we should use the same reference counting as we
have for vfio devices (group->container_users). An interface for that
might look like:

int vfio_group_add_external_user(struct file *filep)
{
        struct vfio_group *group = filep->private_data;

        if (filep->f_op != &vfio_group_fops)
                return -EINVAL;

        if (!atomic_inc_not_zero(&group->container_users))
                return -EINVAL;

        return 0;
}

void vfio_group_del_external_user(struct file *filep)
{
        struct vfio_group *group = filep->private_data;

        BUG_ON(filep->f_op != &vfio_group_fops);

        vfio_group_try_dissolve_container(group);
}

int vfio_group_iommu_id_from_file(struct file *filep)
{
        struct vfio_group *group = filep->private_data;

        BUG_ON(filep->f_op != &vfio_group_fops);

        return iommu_group_id(group->iommu_group);
}

Would that work? Thanks,

Alex
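The key primitive in the proposal above is "take a reference only if the count is already non-zero", so an external user can never revive a group whose container is gone. A minimal userspace sketch of that idea with C11 atomics follows. The names `struct model_group`, `model_inc_not_zero()`, `model_add_external_user()`, `model_del_external_user()` and `model_group_iommu_id()` are hypothetical, the f_op check is left out, and the release side simply decrements, whereas the proposed kernel function calls `vfio_group_try_dissolve_container()`, whose behaviour is not modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the group; not the kernel's struct vfio_group. */
struct model_group {
    atomic_int container_users;  /* 0 means no container is attached */
    int iommu_group_id;
};

/* Increment *v unless it is zero; true if the increment happened. */
bool model_inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is reloaded with the current value; retry. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Become an external user; refused when nobody else holds the group. */
int model_add_external_user(struct model_group *g)
{
    if (!model_inc_not_zero(&g->container_users))
        return -EINVAL;
    return 0;
}

/* Drop the reference taken by model_add_external_user(). */
void model_del_external_user(struct model_group *g)
{
    int prev = atomic_fetch_sub(&g->container_users, 1);

    assert(prev > 0);  /* unbalanced put is a programming error */
}

int model_group_iommu_id(const struct model_group *g)
{
    return g->iommu_group_id;
}
```

The compare-exchange loop is what makes the check and the increment a single step: a plain "if non-zero then increment" could race with the last user dropping the count to zero in between.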
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19 15:49 ` Alex Williamson
  (?)
@ 2013-06-20  4:58 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 160+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-20  4:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc,
	Rusty Russell, Joerg Roedel

On 06/20/2013 01:49 AM, Alex Williamson wrote:
> On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
>>
>>>> Alex, any objection ?
>>>
>>> Which Alex? :)
>>
>> Heh, mostly Williamson in this specific case but your input is still
>> welcome :-)
>>
>>> I think validate works, it keeps iteration logic out of the kernel
>>> which is a good thing. There still needs to be an interface for
>>> getting the iommu id in VFIO, but I suppose that one's for the other
>>> Alex and Jörg to comment on.
>>
>> I think getting the iommu fd is already covered by separate patches from
>> Alexey.
>>
>>>>
>>>> Do we need to make it a get/put interface instead ?
>>>>
>>>> 	vfio_validate_and_use_iommu(file, iommu_id);
>>>>
>>>> 	vfio_release_iommu(file, iommu_id);
>>>>
>>>> To ensure that the resource remains owned by the process until KVM
>>>> is closed as well ?
>>>>
>>>> Or do we want to register with VFIO with a callback so that VFIO can
>>>> call us if it needs us to give it up ?
>>>
>>> Can't we just register a handler on the fd and get notified when it
>>> closes? Can you kill VFIO access without closing the fd?
>>
>> That sounds actually harder :-)
>>
>> The question is basically: When we validate that relationship between a
>> specific VFIO struct file with an iommu, what is the lifetime of that
>> and how do we handle this lifetime properly.
>>
>> There's two ways for that sort of situation: The notification model
>> where we get notified when the relationship is broken, and the refcount
>> model where we become a "user" and thus delay the breaking of the
>> relationship until we have been disposed of as well.
>>
>> In this specific case, it's hard to tell what is the right model from my
>> perspective, which is why I would welcome Alex (W.) input.
>>
>> In the end, the solution will end up being in the form of APIs exposed
>> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
>> as owner of VFIO at this stage, what do you want those to look
>> like ? :-)
>
> My first thought is that we should use the same reference counting as we
> have for vfio devices (group->container_users).  An interface for that
> might look like:
>
> int vfio_group_add_external_user(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
>
> 	if (filep->f_op != &vfio_group_fops)
> 		return -EINVAL;
>
>
> 	if (!atomic_inc_not_zero(&group->container_users))
> 		return -EINVAL;
>
> 	return 0;
> }
>
> void vfio_group_del_external_user(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
>
> 	BUG_ON(filep->f_op != &vfio_group_fops);
>
> 	vfio_group_try_dissolve_container(group);
> }
>
> int vfio_group_iommu_id_from_file(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
>
> 	BUG_ON(filep->f_op != &vfio_group_fops);
>
> 	return iommu_group_id(group->iommu_group);
> }
>
> Would that work?  Thanks,


Just out of curiosity - would not get_file() and fput_atomic() on a group's
file* do the right job instead of vfio_group_add_external_user() and
vfio_group_del_external_user()?


-- 
Alexey

^ permalink raw reply	[flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  4:58 ` Alexey Kardashevskiy
  (?)
@ 2013-06-20  5:28 ` David Gibson
  -1 siblings, 0 replies; 160+ messages in thread
From: David Gibson @ 2013-06-20  5:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 3827 bytes --]

On Thu, Jun 20, 2013 at 02:58:18PM +1000, Alexey Kardashevskiy wrote:
> On 06/20/2013 01:49 AM, Alex Williamson wrote:
> > On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
> >> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
> >>
> >>>> Alex, any objection ?
> >>>
> >>> Which Alex? :)
> >>
> >> Heh, mostly Williamson in this specific case but your input is still
> >> welcome :-)
> >>
> >>> I think validate works, it keeps iteration logic out of the kernel
> >>> which is a good thing. There still needs to be an interface for
> >>> getting the iommu id in VFIO, but I suppose that one's for the other
> >>> Alex and Jörg to comment on.
> >>
> >> I think getting the iommu fd is already covered by separate patches from
> >> Alexey.
> >>
> >>>>
> >>>> Do we need to make it a get/put interface instead ?
> >>>>
> >>>> 	vfio_validate_and_use_iommu(file, iommu_id);
> >>>>
> >>>> 	vfio_release_iommu(file, iommu_id);
> >>>>
> >>>> To ensure that the resource remains owned by the process until KVM
> >>>> is closed as well ?
> >>>>
> >>>> Or do we want to register with VFIO with a callback so that VFIO can
> >>>> call us if it needs us to give it up ?
> >>>
> >>> Can't we just register a handler on the fd and get notified when it
> >>> closes? Can you kill VFIO access without closing the fd?
> >>
> >> That sounds actually harder :-)
> >>
> >> The question is basically: When we validate that relationship between a
> >> specific VFIO struct file with an iommu, what is the lifetime of that
> >> and how do we handle this lifetime properly.
> >>
> >> There's two ways for that sort of situation: The notification model
> >> where we get notified when the relationship is broken, and the refcount
> >> model where we become a "user" and thus delay the breaking of the
> >> relationship until we have been disposed of as well.
> >>
> >> In this specific case, it's hard to tell what is the right model from my
> >> perspective, which is why I would welcome Alex (W.) input.
> >>
> >> In the end, the solution will end up being in the form of APIs exposed
> >> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
> >> as owner of VFIO at this stage, what do you want those to look
> >> like ? :-)
> >
> > My first thought is that we should use the same reference counting as we
> > have for vfio devices (group->container_users).  An interface for that
> > might look like:
> >
> > int vfio_group_add_external_user(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> >
> > 	if (filep->f_op != &vfio_group_fops)
> > 		return -EINVAL;
> >
> >
> > 	if (!atomic_inc_not_zero(&group->container_users))
> > 		return -EINVAL;
> >
> > 	return 0;
> > }
> >
> > void vfio_group_del_external_user(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> >
> > 	BUG_ON(filep->f_op != &vfio_group_fops);
> >
> > 	vfio_group_try_dissolve_container(group);
> > }
> >
> > int vfio_group_iommu_id_from_file(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> >
> > 	BUG_ON(filep->f_op != &vfio_group_fops);
> >
> > 	return iommu_group_id(group->iommu_group);
> > }
> >
> > Would that work?  Thanks,
>
>
> Just out of curiosity - would not get_file() and fput_atomic() on a group's
> file* do the right job instead of vfio_group_add_external_user() and
> vfio_group_del_external_user()?

I was thinking that too.  Grabbing a file reference would certainly be
the usual way of handling this sort of thing.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  5:28 ` David Gibson
  (?)
@ 2013-06-20  7:47 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 160+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-20  7:47 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Alex Williamson, Alexander Graf, linuxppc-dev,
	Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc,
	Rusty Russell, Joerg Roedel

On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > Just out of curiosity - would not get_file() and fput_atomic() on a
> group's
> > file* do the right job instead of vfio_group_add_external_user() and
> > vfio_group_del_external_user()?
>
> I was thinking that too.  Grabbing a file reference would certainly be
> the usual way of handling this sort of thing.

But that wouldn't prevent the group ownership to be returned to
the kernel or another user would it ?

Ben.

^ permalink raw reply	[flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  7:47 ` Benjamin Herrenschmidt
  (?)
@ 2013-06-20  8:48 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 160+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-20  8:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, Alex Williamson, Alexander Graf, linuxppc-dev,
	Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc,
	Rusty Russell, Joerg Roedel

On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
>>> Just out of curiosity - would not get_file() and fput_atomic() on a
>> group's
>>> file* do the right job instead of vfio_group_add_external_user() and
>>> vfio_group_del_external_user()?
>>
>> I was thinking that too.  Grabbing a file reference would certainly be
>> the usual way of handling this sort of thing.
>
> But that wouldn't prevent the group ownership to be returned to
> the kernel or another user would it ?


Holding the file pointer does not let the group->container_users counter go
to zero and this is exactly what vfio_group_add_external_user() and
vfio_group_del_external_user() do. The difference is only in absolute value
- 2 vs. 3.

No change in behaviour whether I use new vfio API or simply hold file* till
KVM closes fd created when IOMMU was connected to LIOBN.

And while this counter is not zero, QEMU cannot take ownership over the group.

I am definitely still missing the bigger picture...


-- 
Alexey

^ permalink raw reply	[flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling 2013-06-20 8:48 ` Alexey Kardashevskiy (?) @ 2013-06-20 14:55 ` Alex Williamson -1 siblings, 0 replies; 160+ messages in thread From: Alex Williamson @ 2013-06-20 14:55 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt, David Gibson, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: > >>> Just out of curiosity - would not get_file() and fput_atomic() on a > >> group's > >>> file* do the right job instead of vfio_group_add_external_user() and > >>> vfio_group_del_external_user()? > >> > >> I was thinking that too. Grabbing a file reference would certainly be > >> the usual way of handling this sort of thing. > > > > But that wouldn't prevent the group ownership to be returned to > > the kernel or another user would it ? > > > Holding the file pointer does not let the group->container_users counter go > to zero How so? Holding the file pointer means the file won't go away, which means the group release function won't be called. That means the group won't go away, but that doesn't mean it's attached to an IOMMU. A user could call UNSET_CONTAINER. > and this is exactly what vfio_group_add_external_user() and > vfio_group_del_external_user() do. The difference is only in absolute value > - 2 vs. 3. > > No change in behaviour whether I use new vfio API or simply hold file* till > KVM closes fd created when IOMMU was connected to LIOBN. By that notion you could open(/dev/vfio/$GROUP) and you're safe, right? But what about SET_CONTAINER & SET_IOMMU? All that you guarantee holding the file pointer is that the vfio_group exists. > And while this counter is not zero, QEMU cannot take ownership over the group. 
> > I am definitely still missing the bigger picture... The bigger picture is that the group needs to exist AND it needs to be set up and maintained to have IOMMU protection. Actually, my first stab at add_external_user doesn't look sufficient; it needs to look more like vfio_group_get_device_fd, checking group->container->iommu and group_viable(). As written it would allow an external user after SET_CONTAINER without SET_IOMMU. It should also be part of the API that the external user must hold the file reference between add_external_user and del_external_user and do cleanup on any exit paths. Thanks, Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
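The distinction drawn in this message (a file reference keeps the group object alive, while only the container_users count blocks UNSET_CONTAINER) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the vfio code: the `toy_*` names and both structs are hypothetical, and the checks in `toy_add_external_user()` only mirror the ones the message says are needed (container set, IOMMU set, group viable).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the objects named in the thread. */
struct toy_container {
    bool iommu_set;                  /* true once SET_IOMMU has succeeded */
};

struct toy_group {
    int file_refs;                   /* models the struct file refcount */
    int container_users;             /* models group->container_users */
    struct toy_container *container; /* NULL when no container is attached */
    bool viable;                     /* models the result of group_viable() */
};

/* Taking a file reference only keeps the group object from going away. */
void toy_get_file(struct toy_group *g)
{
    g->file_refs++;
}

/* UNSET_CONTAINER is refused only while someone besides the group itself
 * is counted as a container user; file references are not consulted. */
int toy_unset_container(struct toy_group *g)
{
    if (g->container_users == 0)
        return -EINVAL;              /* no container attached */
    if (g->container_users != 1)
        return -EBUSY;               /* a device fd or external user remains */
    g->container_users = 0;
    g->container = NULL;
    return 0;
}

/* External user registration with the checks suggested in the message:
 * a container must be attached, its IOMMU set, and the group viable. */
int toy_add_external_user(struct toy_group *g)
{
    if (g->container_users == 0 || g->container == NULL)
        return -EINVAL;
    if (!g->container->iommu_set || !g->viable)
        return -EINVAL;
    g->container_users++;
    return 0;
}

void toy_del_external_user(struct toy_group *g)
{
    assert(g->container_users > 1);  /* must pair with a successful add */
    g->container_users--;
}
```

In this model a caller that only does `toy_get_file()` still sees `toy_unset_container()` succeed, whereas a caller that registered through `toy_add_external_user()` makes it fail with `-EBUSY` until `toy_del_external_user()` is called.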
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling 2013-06-20 14:55 ` Alex Williamson (?) @ 2013-06-22 8:25 ` Alexey Kardashevskiy -1 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-06-22 8:25 UTC (permalink / raw) To: Alex Williamson Cc: Benjamin Herrenschmidt, David Gibson, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel On 06/21/2013 12:55 AM, Alex Williamson wrote: > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: >> On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: >>> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: >>>>> Just out of curiosity - would not get_file() and fput_atomic() on a >>>> group's >>>>> file* do the right job instead of vfio_group_add_external_user() and >>>>> vfio_group_del_external_user()? >>>> >>>> I was thinking that too. Grabbing a file reference would certainly be >>>> the usual way of handling this sort of thing. >>> >>> But that wouldn't prevent the group ownership to be returned to >>> the kernel or another user would it ? >> >> >> Holding the file pointer does not let the group->container_users counter go >> to zero > > How so? Holding the file pointer means the file won't go away, which > means the group release function won't be called. That means the group > won't go away, but that doesn't mean it's attached to an IOMMU. A user > could call UNSET_CONTAINER. > >> and this is exactly what vfio_group_add_external_user() and >> vfio_group_del_external_user() do. The difference is only in absolute value >> - 2 vs. 3. >> >> No change in behaviour whether I use new vfio API or simply hold file* till >> KVM closes fd created when IOMMU was connected to LIOBN. > > By that notion you could open(/dev/vfio/$GROUP) and you're safe, right? > But what about SET_CONTAINER & SET_IOMMU? All that you guarantee > holding the file pointer is that the vfio_group exists. 
> >> And while this counter is not zero, QEMU cannot take ownership over the group. >> >> I am definitely still missing the bigger picture... > > The bigger picture is that the group needs to exist AND it needs to be > setup and maintained to have IOMMU protection. Actually, my first stab > at add_external_user doesn't look sufficient, it needs to look more like > vfio_group_get_device_fd, checking group->container->iommu and > group_viable(). This makes sense. If you did this, that would be great. Without it, I really cannot see how the proposed inc/dec of container_users is better than simple holding file*. Thanks. > As written it would allow an external user after > SET_CONTAINER without SET_IOMMU. It should also be part of the API that > the external user must hold the file reference between add_external_use > and del_external_user and do cleanup on any exit paths. Thanks, -- Alexey ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling 2013-06-20 14:55 ` Alex Williamson (?) @ 2013-06-22 12:03 ` David Gibson -1 siblings, 0 replies; 160+ messages in thread From: David Gibson @ 2013-06-22 12:03 UTC (permalink / raw) To: Alex Williamson Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel [-- Attachment #1: Type: text/plain, Size: 1950 bytes --] On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote: > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a > > >> group's > > >>> file* do the right job instead of vfio_group_add_external_user() and > > >>> vfio_group_del_external_user()? > > >> > > >> I was thinking that too. Grabbing a file reference would certainly be > > >> the usual way of handling this sort of thing. > > > > > > But that wouldn't prevent the group ownership to be returned to > > > the kernel or another user would it ? > > > > > > Holding the file pointer does not let the group->container_users counter go > > to zero > > How so? Holding the file pointer means the file won't go away, which > means the group release function won't be called. That means the group > won't go away, but that doesn't mean it's attached to an IOMMU. A user > could call UNSET_CONTAINER. Uhh... *thinks*. Ah, I see. I think the interface should not take the group fd, but the container fd. Holding a reference to *that* would keep the necessary things around. But more to the point, it's the right thing semantically: The container is essentially the handle on a host iommu address space, and so that's what should be bound by the KVM call to a particular guest iommu address space. e.g. 
it would make no sense to bind two different groups to different guest iommu address spaces, if they were in the same container - the guest thinks they are different spaces, but if they're in the same container they must be the same space. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 160+ messages in thread
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling 2013-06-22 12:03 ` David Gibson (?) @ 2013-06-22 14:28 ` Alex Williamson -1 siblings, 0 replies; 160+ messages in thread From: Alex Williamson @ 2013-06-22 14:28 UTC (permalink / raw) To: David Gibson Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote: > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote: > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a > > > >> group's > > > >>> file* do the right job instead of vfio_group_add_external_user() and > > > >>> vfio_group_del_external_user()? > > > >> > > > >> I was thinking that too. Grabbing a file reference would certainly be > > > >> the usual way of handling this sort of thing. > > > > > > > > But that wouldn't prevent the group ownership to be returned to > > > > the kernel or another user would it ? > > > > > > > > > Holding the file pointer does not let the group->container_users counter go > > > to zero > > > > How so? Holding the file pointer means the file won't go away, which > > means the group release function won't be called. That means the group > > won't go away, but that doesn't mean it's attached to an IOMMU. A user > > could call UNSET_CONTAINER. > > Uhh... *thinks*. Ah, I see. > > I think the interface should not take the group fd, but the container > fd. Holding a reference to *that* would keep the necessary things > around. 
But more to the point, it's the right thing semantically: > > The container is essentially the handle on a host iommu address space, > and so that's what should be bound by the KVM call to a particular > guest iommu address space. e.g. it would make no sense to bind two > different groups to different guest iommu address spaces, if they were > in the same container - the guest thinks they are different spaces, > but if they're in the same container they must be the same space. While the container is the gateway to the iommu, what empowers the container to maintain an iommu is the group. What happens to a container when all the groups are disconnected or closed? Groups are the unit that indicates hardware access, not containers. Thanks, Alex ^ permalink raw reply [flat|nested] 160+ messages in thread
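The question raised here, what becomes of a container once all of its groups are gone, can be sketched with another toy model. The assumption, labeled as such, is that a container can only hold an IOMMU backend while at least one group is attached and loses it when the last group detaches; the `sketch_*` names are hypothetical and do not correspond to vfio functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical container: it can exist (its fd can stay open) with
 * no groups attached, but then it has no IOMMU backend. */
struct sketch_container {
    int attached_groups;
    bool iommu_active;
};

void sketch_attach_group(struct sketch_container *c)
{
    c->attached_groups++;
}

/* Assumed rule: an IOMMU backend can only be selected once a group is
 * attached, because the groups are what grant hardware access. */
int sketch_set_iommu(struct sketch_container *c)
{
    if (c->attached_groups == 0 || c->iommu_active)
        return -EINVAL;
    c->iommu_active = true;
    return 0;
}

/* Assumed rule: detaching the last group tears the backend down, even
 * if someone still holds a reference to the container itself. */
void sketch_detach_group(struct sketch_container *c)
{
    assert(c->attached_groups > 0);
    if (--c->attached_groups == 0)
        c->iommu_active = false;
}
```

Under these assumptions, holding only the container does not guarantee that an IOMMU address space stays behind it, which is the objection being made to binding KVM to the container fd alone.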
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: David Gibson @ 2013-06-24 3:52 UTC
To: Alex Williamson
Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a group's
> > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > >>> vfio_group_del_external_user()?
> > > > >>
> > > > >> I was thinking that too. Grabbing a file reference would certainly be
> > > > >> the usual way of handling this sort of thing.
> > > > >
> > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > the kernel or another user would it ?
> > > > >
> > > >
> > > >
> > > > Holding the file pointer does not let the group->container_users counter go
> > > > to zero
> > >
> > > How so? Holding the file pointer means the file won't go away, which
> > > means the group release function won't be called. That means the group
> > > won't go away, but that doesn't mean it's attached to an IOMMU. A user
> > > could call UNSET_CONTAINER.
> >
> > Uhh... *thinks*. Ah, I see.
> >
> > I think the interface should not take the group fd, but the container
> > fd. Holding a reference to *that* would keep the necessary things
> > around. But more to the point, it's the right thing semantically:
> >
> > The container is essentially the handle on a host iommu address space,
> > and so that's what should be bound by the KVM call to a particular
> > guest iommu address space. e.g. it would make no sense to bind two
> > different groups to different guest iommu address spaces, if they were
> > in the same container - the guest thinks they are different spaces,
> > but if they're in the same container they must be the same space.
>
> While the container is the gateway to the iommu, what empowers the
> container to maintain an iommu is the group. What happens to a
> container when all the groups are disconnected or closed? Groups are
> the unit that indicates hardware access, not containers. Thanks,

Uh... huh? I'm really not sure what you're getting at.

The operation we're doing for KVM here is binding a guest iommu
address space to a particular host iommu address space. Why would we
not want to use the obvious handle on the host iommu address space,
which is the container fd?

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alex Williamson @ 2013-06-24 4:41 UTC
To: David Gibson
Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
> On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> > On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a group's
> > > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > > >>> vfio_group_del_external_user()?
> > > > > >>
> > > > > >> I was thinking that too. Grabbing a file reference would certainly be
> > > > > >> the usual way of handling this sort of thing.
> > > > > >
> > > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > > the kernel or another user would it ?
> > > > > >
> > > > >
> > > > >
> > > > > Holding the file pointer does not let the group->container_users counter go
> > > > > to zero
> > > >
> > > > How so? Holding the file pointer means the file won't go away, which
> > > > means the group release function won't be called. That means the group
> > > > won't go away, but that doesn't mean it's attached to an IOMMU. A user
> > > > could call UNSET_CONTAINER.
> > >
> > > Uhh... *thinks*. Ah, I see.
> > >
> > > I think the interface should not take the group fd, but the container
> > > fd. Holding a reference to *that* would keep the necessary things
> > > around. But more to the point, it's the right thing semantically:
> > >
> > > The container is essentially the handle on a host iommu address space,
> > > and so that's what should be bound by the KVM call to a particular
> > > guest iommu address space. e.g. it would make no sense to bind two
> > > different groups to different guest iommu address spaces, if they were
> > > in the same container - the guest thinks they are different spaces,
> > > but if they're in the same container they must be the same space.
> >
> > While the container is the gateway to the iommu, what empowers the
> > container to maintain an iommu is the group. What happens to a
> > container when all the groups are disconnected or closed? Groups are
> > the unit that indicates hardware access, not containers. Thanks,
>
> Uh... huh? I'm really not sure what you're getting at.
>
> The operation we're doing for KVM here is binding a guest iommu
> address space to a particular host iommu address space. Why would we
> not want to use the obvious handle on the host iommu address space,
> which is the container fd?

AIUI, the request isn't for an interface through which to do iommu
mappings. The request is for an interface to show that the user has
sufficient privileges to do mappings. Groups are what gives the user
that ability. The iommu is also possibly associated with multiple iommu
groups and I believe what is being asked for here is a way to hold and
lock a single iommu group with iommu protection.

From a practical point of view, the iommu interface is de-privileged
once the groups are disconnected or closed. Holding a reference count
on the iommu fd won't prevent that. That means we'd have to use a
notifier to have KVM stop the side-channel iommu access. Meanwhile
holding the file descriptor for the group and adding an interface that
bumps use counter allows KVM to lock itself in, just as if it had a
device opened itself. Thanks,

Alex
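
The difference between pinning the group's file and bumping a use counter, as argued in the message above, can be modelled in a few lines of userspace C. This is only a toy sketch under stated assumptions, not the VFIO implementation: struct toy_group and every toy_* function are hypothetical names invented for illustration, and the rule that detaching is refused while the counter is non-zero is a simplification. Only the field name container_users and the UNSET_CONTAINER operation are taken from the thread.

```c
#include <stdbool.h>

/* Toy stand-in for a VFIO group. NOT the kernel's data structure. */
struct toy_group {
    int  file_refs;        /* references to the group's open file */
    int  container_users;  /* users that rely on the group staying attached */
    bool attached;         /* is the group attached to a container/IOMMU? */
};

/* Models get_file(): keeps the object alive, says nothing about attachment. */
static inline void toy_get_file(struct toy_group *g) { g->file_refs++; }
static inline void toy_put_file(struct toy_group *g) { g->file_refs--; }

/*
 * Models the proposed "external user" interface (hypothetical semantics):
 * it only succeeds while the group is attached, and it bumps the use counter.
 */
static inline int toy_add_external_user(struct toy_group *g)
{
    if (!g->attached)
        return -1;
    g->container_users++;
    return 0;
}

static inline void toy_del_external_user(struct toy_group *g)
{
    g->container_users--;
}

/*
 * Models UNSET_CONTAINER (assumed rule): refused while anyone still
 * depends on the attachment; file references are not consulted at all.
 */
static inline int toy_unset_container(struct toy_group *g)
{
    if (g->container_users > 0)
        return -1;          /* busy */
    g->attached = false;
    return 0;
}
```

In this model, a caller that only takes toy_get_file() can still see the group detached underneath it, because toy_unset_container() never looks at file_refs. A caller that takes toy_add_external_user() makes toy_unset_container() fail until it calls toy_del_external_user(), which is the "lock itself in" behaviour described above.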
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: David Gibson @ 2013-06-27 11:01 UTC
To: Alex Williamson
Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Sun, Jun 23, 2013 at 10:41:24PM -0600, Alex Williamson wrote:
> On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
> > On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> > > On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > > > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a group's
> > > > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > > > >>> vfio_group_del_external_user()?
> > > > > > >>
> > > > > > >> I was thinking that too. Grabbing a file reference would certainly be
> > > > > > >> the usual way of handling this sort of thing.
> > > > > > >
> > > > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > > > the kernel or another user would it ?
> > > > > > >
> > > > > >
> > > > > >
> > > > > > Holding the file pointer does not let the group->container_users counter go
> > > > > > to zero
> > > > >
> > > > > How so? Holding the file pointer means the file won't go away, which
> > > > > means the group release function won't be called. That means the group
> > > > > won't go away, but that doesn't mean it's attached to an IOMMU. A user
> > > > > could call UNSET_CONTAINER.
> > > >
> > > > Uhh... *thinks*. Ah, I see.
> > > >
> > > > I think the interface should not take the group fd, but the container
> > > > fd. Holding a reference to *that* would keep the necessary things
> > > > around. But more to the point, it's the right thing semantically:
> > > >
> > > > The container is essentially the handle on a host iommu address space,
> > > > and so that's what should be bound by the KVM call to a particular
> > > > guest iommu address space. e.g. it would make no sense to bind two
> > > > different groups to different guest iommu address spaces, if they were
> > > > in the same container - the guest thinks they are different spaces,
> > > > but if they're in the same container they must be the same space.
> > >
> > > While the container is the gateway to the iommu, what empowers the
> > > container to maintain an iommu is the group. What happens to a
> > > container when all the groups are disconnected or closed? Groups are
> > > the unit that indicates hardware access, not containers. Thanks,
> >
> > Uh... huh? I'm really not sure what you're getting at.
> >
> > The operation we're doing for KVM here is binding a guest iommu
> > address space to a particular host iommu address space. Why would we
> > not want to use the obvious handle on the host iommu address space,
> > which is the container fd?
>
> AIUI, the request isn't for an interface through which to do iommu
> mappings. The request is for an interface to show that the user has
> sufficient privileges to do mappings. Groups are what gives the user
> that ability. The iommu is also possibly associated with multiple iommu
> groups and I believe what is being asked for here is a way to hold and
> lock a single iommu group with iommu protection.
>
> From a practical point of view, the iommu interface is de-privileged
> once the groups are disconnected or closed. Holding a reference count
> on the iommu fd won't prevent that. That means we'd have to use a
> notifier to have KVM stop the side-channel iommu access. Meanwhile
> holding the file descriptor for the group and adding an interface that
> bumps use counter allows KVM to lock itself in, just as if it had a
> device opened itself. Thanks,

Ah, good point.

-- 
David Gibson
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-22 12:03 ` David Gibson
@ 2013-06-22 23:28 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-22 23:28 UTC
To: David Gibson
Cc: Alex Williamson, Alexey Kardashevskiy, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:

> I think the interface should not take the group fd, but the container
> fd.  Holding a reference to *that* would keep the necessary things
> around.  But more to the point, it's the right thing semantically:
>
> The container is essentially the handle on a host iommu address space,
> and so that's what should be bound by the KVM call to a particular
> guest iommu address space.  e.g. it would make no sense to bind two
> different groups to different guest iommu address spaces, if they were
> in the same container - the guest thinks they are different spaces,
> but if they're in the same container they must be the same space.

Interestingly, how are we going to extend that when/if we implement
DDW?

DDW means an API by which the guest can request the creation of
additional iommus for a given device (typically, in addition to the
default smallish 32-bit one using 4k pages, the guest can request
a larger window in 64-bit space using a larger page size).

Cheers,
Ben.
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-22 23:28 ` Benjamin Herrenschmidt
@ 2013-06-24  3:54 ` David Gibson
From: David Gibson @ 2013-06-24 3:54 UTC
To: Benjamin Herrenschmidt
Cc: Alex Williamson, Alexey Kardashevskiy, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Sun, Jun 23, 2013 at 09:28:13AM +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > I think the interface should not take the group fd, but the container
> > fd.  Holding a reference to *that* would keep the necessary things
> > around.  But more to the point, it's the right thing semantically:
> >
> > The container is essentially the handle on a host iommu address space,
> > and so that's what should be bound by the KVM call to a particular
> > guest iommu address space.  e.g. it would make no sense to bind two
> > different groups to different guest iommu address spaces, if they were
> > in the same container - the guest thinks they are different spaces,
> > but if they're in the same container they must be the same space.
>
> Interestingly, how are we going to extend that when/if we implement
> DDW?
>
> DDW means an API by which the guest can request the creation of
> additional iommus for a given device (typically, in addition to the
> default smallish 32-bit one using 4k pages, the guest can request
> a larger window in 64-bit space using a larger page size).

So, would a PAPR guest requesting this expect the new window to have
a new liobn, or an existing liobn?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-24  3:54 ` David Gibson
@ 2013-06-24  3:58 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-24 3:58 UTC
To: David Gibson
Cc: Alex Williamson, Alexey Kardashevskiy, Alexander Graf, linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list, open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Mon, 2013-06-24 at 13:54 +1000, David Gibson wrote:
> > DDW means an API by which the guest can request the creation of
> > additional iommus for a given device (typically, in addition to the
> > default smallish 32-bit one using 4k pages, the guest can request
> > a larger window in 64-bit space using a larger page size).
>
> So, would a PAPR guest requesting this expect the new window to have
> a new liobn, or an existing liobn?

New liobn, or there is no way to H_PUT_TCE it (it exists in addition
to the legacy window).

Cheers,
Ben.
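Why a second window needs its own liobn can be sketched as follows. This is only an illustration under stated assumptions: the struct layout, the helper names (toy_find_window, toy_tce_index) and any liobn values, window offsets or page sizes a caller plugs in are invented here and are not taken from PAPR or from the patch series. The point is just that an H_PUT_TCE-style call carries a liobn and an I/O bus address, so the liobn is the only thing that can select which window (and which page size) the address is interpreted against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One DMA window as a guest would address it; all fields are illustrative. */
struct toy_window {
	uint64_t liobn;		/* handle passed with each TCE update */
	uint64_t bus_offset;	/* start of the window in I/O bus space */
	uint64_t size;		/* bytes covered by the window */
	unsigned int page_shift;/* log2 of the window's IOMMU page size */
};

/* Pick the window a TCE update refers to; NULL if the liobn is unknown. */
static const struct toy_window *toy_find_window(const struct toy_window *w,
						size_t n, uint64_t liobn)
{
	for (size_t i = 0; i < n; i++)
		if (w[i].liobn == liobn)
			return &w[i];
	return NULL;
}

/* Index of the TCE covering ioba, or -1 if ioba is outside the window. */
static long toy_tce_index(const struct toy_window *w, uint64_t ioba)
{
	if (ioba < w->bus_offset || ioba - w->bus_offset >= w->size)
		return -1;
	return (long)((ioba - w->bus_offset) >> w->page_shift);
}
```

A caller would keep the default 32-bit window and the larger 64-bit window as two entries with distinct liobn values; the same ioba then resolves to different TCE slots, or to none, depending on which liobn accompanies it.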
* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-06-05  6:11 ` Alexey Kardashevskiy
@ 2013-06-05  6:11 ` Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++++++++++++++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;	/* used for IOMMU groups */
+	struct list_head hugepages;	/* used for IOMMU groups */
+	spinlock_t hugepages_lock;	/* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long gpa;	/* Guest physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..9e2ba4d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long hva, unsigned long gpa,
+		unsigned long pg_size)
+{
+	long ret = 0;
+	struct kvmppc_iommu_hugepage *hp;
+	struct page *p;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte)
+			goto unlock_exit;
+	}
+
+	hva = hva & ~(pg_size - 1);
+	ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+	if ((ret != 1) || !p) {
+		ret = -EFAULT;
+		goto unlock_exit;
+	}
+	ret = 0;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->gpa = gpa & ~(pg_size - 1);
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct kvmppc_iommu_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page); /* one for iommu_put_tce_user_mode */
+		put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
+	pte_t *ptep;
+	unsigned int shift = 0;
 
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
 
 	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+	if (!ptep)
+		return ERROR_ADDR;
+#ifdef CONFIG_IOMMU_API
+	if (tt && (shift > PAGE_SHIFT)) {
+		if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+				hva, gpa, 1 << shift))
+			return ERROR_ADDR;
+	}
+#endif
 	return (void *) hva;
 }
 
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (iommu_tce_put_param_check(tbl, ioba, tce))
 		return H_PARAMETER;
 
-	hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+	hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
 	if (hva == ERROR_ADDR)
 		return H_HARDWARE;
 
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs */
 	for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
-		void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+		void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
 				vcpu->arch.tce_tmp[i]);
 
 		if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	unsigned long hva, hpa, pg_size = 0, offset;
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	bool writing = gpa & TCE_PCI_WRITE;
+	struct kvmppc_iommu_hugepage *hp;
 
+	/*
+	 * Try to find an already used hugepage.
+	 * If it is not there, the kvmppc_lookup_pte() will return zero
+	 * as it won't do get_page() on a huge page in real mode
+	 * and therefore the request will be passed to the virtual mode.
+	 */
+	if (tt) {
+		spin_lock(&tt->hugepages_lock);
+		list_for_each_entry(hp, &tt->hugepages, list) {
+			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+				continue;
+
+			/* Calculate host phys address keeping flags and offset in the page */
+			offset = gpa & (hp->size - 1);
+
+			/* pte_pfn(pte) should return an address aligned to pg_size */
+			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+			spin_unlock(&tt->hugepages_lock);
+
+			return hpa;
+		}
+		spin_unlock(&tt->hugepages_lock);
+	}
 	/* Find a KVM memslot */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/* Do not put a huge page and continue without error */
+		if (PageCompound(page))
+			continue;
+
 		if (realmode_put_page(page)) {
 			ret = H_TOO_HARD;
 			break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (iommu_tce_put_param_check(tbl, ioba, tce))
 		return H_PARAMETER;
 
-	hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+	hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
 	if (hpa == ERROR_ADDR) {
 		vcpu->arch.tce_reason = H_TOO_HARD;
 		return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (unlikely(ret)) {
 		struct page *pg = realmode_pfn_to_page(hpa);
 		BUG_ON(!pg);
+
+		/* Do not put a huge page and return an error */
+		if (!PageCompound(pg))
+			return H_HARDWARE;
+
 		if (realmode_put_page(pg)) {
 			vcpu->arch.tce_reason = H_HARDWARE;
 			return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	vcpu->arch.tce_tmp_num = 0;
 	vcpu->arch.tce_reason = 0;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
 			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs and go get_page */
 	for (i = 0; i < npages; ++i) {
-		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
 				vcpu->arch.tce_tmp[i], true);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_tmp_num = i;
-- 
1.7.10.4
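The real-mode lookup in the patch above can be tried out as a small userspace sketch. It mirrors only the range check and the offset arithmetic of the hugepage list walk; it is not the kernel code. The array-based cache, the names (toy_hugepage, toy_lookup_hpa, TOY_ERROR_ADDR) and the 4K base page shift are assumptions for illustration, and locking as well as the TCE permission bits carried in the low bits of the address are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12		/* assumed base page shift */
#define TOY_ERROR_ADDR	(~(uint64_t)0)	/* stands for "go to virtual mode" */

/* One cached huge page: where the guest sees it and where it lives. */
struct toy_hugepage {
	uint64_t gpa;	/* guest physical address, aligned to size */
	uint64_t pfn;	/* host page frame number of the first subpage */
	uint64_t size;	/* huge page size in bytes, a power of two */
};

/*
 * Translate gpa to a host physical address using only the cache.
 * A miss returns TOY_ERROR_ADDR, the point where the real handler
 * would give up and let the virtual mode handler add the page.
 */
static uint64_t toy_lookup_hpa(const struct toy_hugepage *cache, size_t n,
			       uint64_t gpa)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_hugepage *hp = &cache[i];

		if (gpa < hp->gpa || gpa >= hp->gpa + hp->size)
			continue;

		/* Keep the offset within the huge page */
		return (hp->pfn << TOY_PAGE_SHIFT) + (gpa & (hp->size - 1));
	}
	return TOY_ERROR_ADDR;
}
```

The walk is linear, which matches the commit message's remark that the list stays short (up to 9 entries on the card tested); a larger working set would call for a different structure.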
* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-06-05 6:11 UTC
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
    Paul Mackerras, kvm, linux-kernel, kvm-ppc

This adds special support for huge pages (16MB). The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages. It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens. The list is released at KVM exit. At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive. However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++++++++++++++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long gpa;	/* Guest physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..9e2ba4d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR	((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long hva, unsigned long gpa,
+		unsigned long pg_size)
+{
+	long ret = 0;
+	struct kvmppc_iommu_hugepage *hp;
+	struct page *p;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte)
+			goto unlock_exit;
+	}
+
+	hva = hva & ~(pg_size - 1);
+	ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+	if ((ret != 1) || !p) {
+		ret = -EFAULT;
+		goto unlock_exit;
+	}
+	ret = 0;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->gpa = gpa & ~(pg_size - 1);
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct kvmppc_iommu_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page); /* one for iommu_put_tce_user_mode */
+		put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
+	pte_t *ptep;
+	unsigned int shift = 0;
 
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
 
 	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+	if (!ptep)
+		return ERROR_ADDR;
+#ifdef CONFIG_IOMMU_API
+	if (tt && (shift > PAGE_SHIFT)) {
+		if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+				hva, gpa, 1 << shift))
+			return ERROR_ADDR;
+	}
+#endif
 	return (void *) hva;
 }
 
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (iommu_tce_put_param_check(tbl, ioba, tce))
 		return H_PARAMETER;
 
-	hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+	hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
 	if (hva == ERROR_ADDR)
 		return H_HARDWARE;
 
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs */
 	for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
-		void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+		void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
 				vcpu->arch.tce_tmp[i]);
 
 		if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	unsigned long hva, hpa, pg_size = 0, offset;
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	bool writing = gpa & TCE_PCI_WRITE;
+	struct kvmppc_iommu_hugepage *hp;
 
+	/*
+	 * Try to find an already used hugepage.
+	 * If it is not there, the kvmppc_lookup_pte() will return zero
+	 * as it won't do get_page() on a huge page in real mode
+	 * and therefore the request will be passed to the virtual mode.
+	 */
+	if (tt) {
+		spin_lock(&tt->hugepages_lock);
+		list_for_each_entry(hp, &tt->hugepages, list) {
+			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+				continue;
+
+			/* Calculate host phys address keeping flags and offset in the page */
+			offset = gpa & (hp->size - 1);
+
+			/* pte_pfn(pte) should return an address aligned to pg_size */
+			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+			spin_unlock(&tt->hugepages_lock);
+
+			return hpa;
+		}
+		spin_unlock(&tt->hugepages_lock);
+	}
 	/* Find a KVM memslot */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/* Do not put a huge page and continue without error */
+		if (PageCompound(page))
+			continue;
+
 		if (realmode_put_page(page)) {
 			ret = H_TOO_HARD;
 			break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (iommu_tce_put_param_check(tbl, ioba, tce))
 		return H_PARAMETER;
 
-	hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+	hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
 	if (hpa == ERROR_ADDR) {
 		vcpu->arch.tce_reason = H_TOO_HARD;
 		return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (unlikely(ret)) {
 		struct page *pg = realmode_pfn_to_page(hpa);
 		BUG_ON(!pg);
+
+		/* Do not put a huge page and return an error */
+		if (!PageCompound(pg))
+			return H_HARDWARE;
+
 		if (realmode_put_page(pg)) {
 			vcpu->arch.tce_reason = H_HARDWARE;
 			return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	vcpu->arch.tce_tmp_num = 0;
 	vcpu->arch.tce_reason = 0;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
 			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs and go get_page */
 	for (i = 0; i < npages; ++i) {
-		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
 				vcpu->arch.tce_tmp[i], true);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_tmp_num = i;
-- 
1.7.10.4
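
For readers following the address arithmetic in the real mode lookup of this patch, here is a minimal userspace sketch of the same range check and offset calculation. It is an illustration only, not the kernel code: `struct hugepage_desc`, `hugepage_gpa_to_hpa()` and `LOOKUP_MISS` are hypothetical names, a plain array stands in for the kernel list, and the host physical base is stored directly instead of being derived from a PTE with `pte_pfn()`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvmppc_iommu_hugepage (assumption). */
struct hugepage_desc {
	uint64_t gpa;	/* guest physical base, aligned to size */
	uint64_t hpa;	/* host physical base, aligned to size */
	uint64_t size;	/* page size, a power of two (e.g. 16MB) */
};

/* Returned when no descriptor covers the address (hypothetical sentinel). */
#define LOOKUP_MISS UINT64_MAX

/*
 * Linear search, as in the patch: find the descriptor whose
 * [gpa, gpa + size) range contains the address, then keep the
 * low bits (offset within the huge page) and add the host base.
 */
static uint64_t hugepage_gpa_to_hpa(const struct hugepage_desc *tbl,
				    size_t n, uint64_t gpa)
{
	for (size_t i = 0; i < n; i++) {
		const struct hugepage_desc *hp = &tbl[i];

		if (gpa < hp->gpa || gpa >= hp->gpa + hp->size)
			continue;

		/* size is a power of two, so the mask extracts the offset */
		return hp->hpa + (gpa & (hp->size - 1));
	}
	return LOOKUP_MISS;
}
```

With one 16MB descriptor mapping guest base 0x1000000 to host base 0x40000000, guest address 0x1234567 has offset 0x234567 and translates to 0x40234567; an address outside the range yields the miss value, which corresponds to the case where the patch falls through to the regular path.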
* Re: [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
From: Benjamin Herrenschmidt @ 2013-06-16 4:46 UTC
To: Alexey Kardashevskiy
Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
    linux-kernel, kvm-ppc

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> @@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
>  	unsigned long hva, hpa, pg_size = 0, offset;
>  	unsigned long gfn = gpa >> PAGE_SHIFT;
>  	bool writing = gpa & TCE_PCI_WRITE;
> +	struct kvmppc_iommu_hugepage *hp;
>  
> +	/*
> +	 * Try to find an already used hugepage.
> +	 * If it is not there, the kvmppc_lookup_pte() will return zero
> +	 * as it won't do get_page() on a huge page in real mode
> +	 * and therefore the request will be passed to the virtual mode.
> +	 */
> +	if (tt) {
> +		spin_lock(&tt->hugepages_lock);
> +		list_for_each_entry(hp, &tt->hugepages, list) {
> +			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
> +				continue;
> +
> +			/* Calculate host phys address keeping flags and offset in the page */
> +			offset = gpa & (hp->size - 1);
> +
> +			/* pte_pfn(pte) should return an address aligned to pg_size */
> +			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
> +			spin_unlock(&tt->hugepages_lock);
> +
> +			return hpa;
> +		}
> +		spin_unlock(&tt->hugepages_lock);
> +	}

Wow .... this is run in real mode right? spin_lock() and spin_unlock()
are a big no-no in real mode. If lockdep and/or spinlock debugging are
enabled and something goes pear-shaped they are going to bring your
whole system down in a blink in quite horrible ways.

If you are going to do that, you need some kind of custom low-level
lock.

Also, I see that you are basically using a non-ordered list and doing a
linear search in it every time. That's going to COST!

You should really consider a more efficient data structure. You should
also be able to do something that doesn't require locks for readers.

>  	/* Find a KVM memslot */
>  	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>  	if (!memslot)
> @@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
>  		if (oldtce & TCE_PCI_WRITE)
>  			SetPageDirty(page);
>  
> +		/* Do not put a huge page and continue without error */
> +		if (PageCompound(page))
> +			continue;
> +
>  		if (realmode_put_page(page)) {
>  			ret = H_TOO_HARD;
>  			break;
> @@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (iommu_tce_put_param_check(tbl, ioba, tce))
>  		return H_PARAMETER;
>  
> -	hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
> +	hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
>  	if (hpa == ERROR_ADDR) {
>  		vcpu->arch.tce_reason = H_TOO_HARD;
>  		return H_TOO_HARD;
> @@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (unlikely(ret)) {
>  		struct page *pg = realmode_pfn_to_page(hpa);
>  		BUG_ON(!pg);
> +
> +		/* Do not put a huge page and return an error */
> +		if (!PageCompound(pg))
> +			return H_HARDWARE;
> +
>  		if (realmode_put_page(pg)) {
>  			vcpu->arch.tce_reason = H_HARDWARE;
>  			return H_TOO_HARD;
> @@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	vcpu->arch.tce_tmp_num = 0;
>  	vcpu->arch.tce_reason = 0;
>  
> -	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
> +	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
>  			tce_list, false);
>  	if ((unsigned long)tces == ERROR_ADDR)
>  		return H_TOO_HARD;
> @@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  
>  	/* Translate TCEs and go get_page */
>  	for (i = 0; i < npages; ++i) {
> -		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
> +		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
>  				vcpu->arch.tce_tmp[i], true);
>  		if (hpa == ERROR_ADDR) {
>  			vcpu->arch.tce_tmp_num = i;

Cheers,
Ben.
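
One possible shape for "something that doesn't require locks for readers" is an append-only table whose entry count is published with release/acquire ordering. The sketch below is a userspace illustration in C11 under stated assumptions, not what the review prescribes and not kernel code: `struct hp_table`, `hp_table_add()`, `hp_table_lookup()`, `HP_MAX` and `HP_MISS` are hypothetical names, writers are assumed to be serialized by the caller (for example by a lock taken only in virtual mode), and entries are never removed while readers may run, which mirrors the patch releasing its list only at KVM exit. A kernel version would use the kernel's own barrier or RCU primitives instead of `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define HP_MAX	64		/* hypothetical fixed capacity */
#define HP_MISS	UINT64_MAX	/* hypothetical "not found" value */

struct hp_entry {
	uint64_t gpa;	/* guest physical base, aligned to size */
	uint64_t hpa;	/* host physical base, aligned to size */
	uint64_t size;	/* power of two */
};

struct hp_table {
	struct hp_entry slots[HP_MAX];
	atomic_size_t count;	/* number of fully initialized slots */
};

/*
 * Writer side. Assumption: callers serialize writers themselves.
 * The slot is filled first; the release store on count then makes
 * it visible to readers as a complete entry.
 */
static int hp_table_add(struct hp_table *t, uint64_t gpa, uint64_t hpa,
			uint64_t size)
{
	size_t n = atomic_load_explicit(&t->count, memory_order_relaxed);

	if (n >= HP_MAX)
		return -1;

	t->slots[n].gpa = gpa & ~(size - 1);
	t->slots[n].hpa = hpa & ~(size - 1);
	t->slots[n].size = size;
	atomic_store_explicit(&t->count, n + 1, memory_order_release);
	return 0;
}

/*
 * Reader side: takes no lock. The acquire load pairs with the
 * writer's release store, so every slot below n is fully written.
 */
static uint64_t hp_table_lookup(struct hp_table *t, uint64_t gpa)
{
	size_t n = atomic_load_explicit(&t->count, memory_order_acquire);

	for (size_t i = 0; i < n; i++) {
		const struct hp_entry *e = &t->slots[i];

		if (gpa >= e->gpa && gpa < e->gpa + e->size)
			return e->hpa + (gpa & (e->size - 1));
	}
	return HP_MISS;
}
```

This keeps the linear search, so it addresses only the locking concern; keeping the slots sorted by gpa or hashing on the huge page number would be the natural follow-up for the lookup cost.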
* Re: [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
From: Paolo Bonzini @ 2013-06-17 16:35 UTC
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, Alexander Graf,
    Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 05/06/2013 08:11, Alexey Kardashevskiy wrote:
> +/*
> + * The KVM guest can be backed with 16MB pages (qemu switch
> + * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).

Nitpick: we try to avoid references to QEMU, so perhaps

s/qemu switch/for example, with QEMU you can use the command-line option/

Paolo
* Re: [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling
  2013-06-05  6:11 ` Alexey Kardashevskiy
@ 2013-06-12  3:14 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2013-06-12  3:14 UTC
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> Ben, ping! :)
>
> This series has tiny fixes (capability and ioctl numbers,
> changed documentation, compile errors in some configuration).
> More details are in the commit messages.
> Rebased on v3.10-rc4.

Alex, I assume you'll merge that once I ack it?

Cheers,
Ben.

>
> Alexey Kardashevskiy (4):
>   KVM: PPC: Add support for multiple-TCE hcalls
>   powerpc: Prepare to support kernel handling of IOMMU map/unmap
>   KVM: PPC: Add support for IOMMU in-kernel handling
>   KVM: PPC: Add hugepage support for IOMMU in-kernel handling
>
>  Documentation/virtual/kvm/api.txt        |   45 +++
>  arch/powerpc/include/asm/kvm_host.h      |    7 +
>  arch/powerpc/include/asm/kvm_ppc.h       |   40 ++-
>  arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
>  arch/powerpc/include/uapi/asm/kvm.h      |    7 +
>  arch/powerpc/kvm/book3s_64_vio.c         |  398 ++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c      |  471 ++++++++++++++++++++++++++++--
>  arch/powerpc/kvm/book3s_hv.c             |   39 +++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
>  arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
>  arch/powerpc/kvm/powerpc.c               |   15 +
>  arch/powerpc/mm/init_64.c                |   77 ++++-
>  include/uapi/linux/kvm.h                 |    3 +
>  13 files changed, 1121 insertions(+), 28 deletions(-)
>
* [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling
@ 2013-05-21  3:06 Alexey Kardashevskiy
  2013-05-21  3:06 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 160+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, linux-kernel, kvm, kvm-ppc

This accelerates IOMMU operations in real and virtual mode in the host
kernel for the KVM guest.

The first patch with multitce support is useful for emulated devices as is.
The other patches are designed for VFIO although this series
does not contain any VFIO related code as the connection between
VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.

The series was made and tested against v3.10-rc1.

Alexey Kardashevskiy (4):
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   42 +++
 arch/powerpc/include/asm/kvm_host.h      |    7 +
 arch/powerpc/include/asm/kvm_ppc.h       |   40 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  398 ++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  471 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   39 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 ++++-
 include/uapi/linux/kvm.h                 |    5 +
 13 files changed, 1120 insertions(+), 28 deletions(-)

-- 
1.7.10.4
* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-05-21  3:06 [PATCH 0/4 v2] " Alexey Kardashevskiy
  2013-05-21  3:06 ` Alexey Kardashevskiy
@ 2013-05-21  3:06 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 160+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, linux-kernel, kvm, kvm-ppc

This adds special support for huge pages (16MB). The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages. It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens. The list is released at KVM exit. At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive. However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++++++++++++++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long gpa;	/* Guest physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..c34d63a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long hva, unsigned long gpa,
+		unsigned long pg_size)
+{
+	long ret = 0;
+	struct kvmppc_iommu_hugepage *hp;
+	struct page *p;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte)
+			goto unlock_exit;
+	}
+
+	hva = hva & ~(pg_size - 1);
+	ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+	if ((ret != 1) || !p) {
+		ret = -EFAULT;
+		goto unlock_exit;
+	}
+	ret = 0;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->gpa = gpa & ~(pg_size - 1);
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct kvmppc_iommu_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page); /* one for iommu_put_tce_user_mode */
+		put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
+	pte_t *ptep;
+	unsigned int shift = 0;
 
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
 
 	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+	if (!ptep)
+		return ERROR_ADDR;
+
+	if (tt && (shift > PAGE_SHIFT)) {
+		if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+				hva, gpa, 1 << shift))
+			return ERROR_ADDR;
+	}
+
 	return (void *) hva;
 }
 
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 		if (iommu_tce_put_param_check(tbl, ioba, tce))
 			return H_PARAMETER;
 
-		hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+		hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
 		if (hva == ERROR_ADDR)
 			return H_HARDWARE;
 
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		/* Translate TCEs */
 		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
-			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
 					vcpu->arch.tce_tmp[i]);
 
 			if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	unsigned long hva, hpa, pg_size = 0, offset;
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	bool writing = gpa & TCE_PCI_WRITE;
+	struct kvmppc_iommu_hugepage *hp;
 
+	/*
+	 * Try to find an already used hugepage.
+	 * If it is not there, the kvmppc_lookup_pte() will return zero
+	 * as it won't do get_page() on a huge page in real mode
+	 * and therefore the request will be passed to the virtual mode.
+	 */
+	if (tt) {
+		spin_lock(&tt->hugepages_lock);
+		list_for_each_entry(hp, &tt->hugepages, list) {
+			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+				continue;
+
+			/* Calculate host phys address keeping flags and offset in the page */
+			offset = gpa & (hp->size - 1);
+
+			/* pte_pfn(pte) should return an address aligned to pg_size */
+			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+			spin_unlock(&tt->hugepages_lock);
+
+			return hpa;
+		}
+		spin_unlock(&tt->hugepages_lock);
+	}
 	/* Find a KVM memslot */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/* Do not put a huge page and continue without error */
+		if (PageCompound(page))
+			continue;
+
 		if (realmode_put_page(page)) {
 			ret = H_TOO_HARD;
 			break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		if (iommu_tce_put_param_check(tbl, ioba, tce))
 			return H_PARAMETER;
 
-		hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+		hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_reason = H_TOO_HARD;
 			return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		if (unlikely(ret)) {
 			struct page *pg = realmode_pfn_to_page(hpa);
 			BUG_ON(!pg);
+
+			/* Do not put a huge page and return an error */
+			if (!PageCompound(pg))
+				return H_HARDWARE;
+
 			if (realmode_put_page(pg)) {
 				vcpu->arch.tce_reason = H_HARDWARE;
 				return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	vcpu->arch.tce_tmp_num = 0;
 	vcpu->arch.tce_reason = 0;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
 			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs and go get_page */
 	for (i = 0; i < npages; ++i) {
-		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
 				vcpu->arch.tce_tmp[i], true);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_tmp_num = i;
-- 
1.7.10.4
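The hand-off the commit message describes (real mode only looks pages up, virtual mode registers them once) can be modelled outside the kernel. What follows is a simplified user-space sketch under stated assumptions, not the patch's code: a fixed-size array replaces the linked list, `host_base` stands in for the result of the page table walk and `get_user_pages_fast()`, and the names `fast_translate`, `slow_translate`, `huge_entry` and `MAX_HUGE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HUGE_SIZE	(16UL << 20)	/* 16MB, the only size the patch handles */
#define ERROR_ADDR	(~0UL)		/* all-ones sentinel, as in the patch */
#define MAX_HUGE	16		/* hypothetical capacity for this sketch */

struct huge_entry {
	unsigned long gpa;	/* guest physical base, 16MB aligned */
	unsigned long hpa;	/* host physical base, 16MB aligned */
};

static struct huge_entry table[MAX_HUGE];
static size_t table_len;

/* "Real mode" path: lookup only, never takes a page reference. */
static unsigned long fast_translate(unsigned long gpa)
{
	size_t i;

	for (i = 0; i < table_len; i++) {
		/* Unsigned subtraction also rejects gpa below the base. */
		if (gpa - table[i].gpa < HUGE_SIZE)
			return table[i].hpa + (gpa & (HUGE_SIZE - 1));
	}
	return ERROR_ADDR;	/* caller must retry in "virtual mode" */
}

/* "Virtual mode" path: registers the huge page once, then translates. */
static unsigned long slow_translate(unsigned long gpa, unsigned long host_base)
{
	unsigned long hpa = fast_translate(gpa);

	assert((host_base & (HUGE_SIZE - 1)) == 0);
	if (hpa != ERROR_ADDR)
		return hpa;		/* already registered */
	if (table_len == MAX_HUGE)
		return ERROR_ADDR;	/* sketch-only limit */

	table[table_len].gpa = gpa & ~(HUGE_SIZE - 1);
	table[table_len].hpa = host_base;
	table_len++;
	return fast_translate(gpa);
}
```

In this model the fast path misses exactly once per huge page; every later request for an address inside the same 16MB range is served without leaving the fast path, which is the behaviour the changelog expects from the real mode handler.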
* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling @ 2013-05-21 3:06 ` Alexey Kardashevskiy 0 siblings, 0 replies; 160+ messages in thread From: Alexey Kardashevskiy @ 2013-05-21 3:06 UTC (permalink / raw) To: linuxppc-dev Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras, David Gibson This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per a huge page. Real mode handlers check if the requested page is huge and in the list, then no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> --- Changes: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++++++++++++++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long gpa;	/* Guest physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..c34d63a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long hva, unsigned long gpa,
+		unsigned long pg_size)
+{
+	long ret = 0;
+	struct kvmppc_iommu_hugepage *hp;
+	struct page *p;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte)
+			goto unlock_exit;
+	}
+
+	hva = hva & ~(pg_size - 1);
+	ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+	if ((ret != 1) || !p) {
+		ret = -EFAULT;
+		goto unlock_exit;
+	}
+	ret = 0;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->gpa = gpa & ~(pg_size - 1);
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct kvmppc_iommu_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page); /* one for iommu_put_tce_user_mode */
+		put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
+	pte_t *ptep;
+	unsigned int shift = 0;
 
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
 
 	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+	if (!ptep)
+		return ERROR_ADDR;
+
+	if (tt && (shift > PAGE_SHIFT)) {
+		if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+				hva, gpa, 1 << shift))
+			return ERROR_ADDR;
+	}
+
 	return (void *) hva;
 }
 
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 		if (iommu_tce_put_param_check(tbl, ioba, tce))
 			return H_PARAMETER;
 
-		hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+		hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
 		if (hva == ERROR_ADDR)
 			return H_HARDWARE;
 
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs */
 	for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
-		void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+		void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
 				vcpu->arch.tce_tmp[i]);
 
 		if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	unsigned long hva, hpa, pg_size = 0, offset;
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	bool writing = gpa & TCE_PCI_WRITE;
+	struct kvmppc_iommu_hugepage *hp;
 
+	/*
+	 * Try to find an already used hugepage.
+	 * If it is not there, the kvmppc_lookup_pte() will return zero
+	 * as it won't do get_page() on a huge page in real mode
+	 * and therefore the request will be passed to the virtual mode.
+	 */
+	if (tt) {
+		spin_lock(&tt->hugepages_lock);
+		list_for_each_entry(hp, &tt->hugepages, list) {
+			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+				continue;
+
+			/* Calculate host phys address keeping flags and offset in the page */
+			offset = gpa & (hp->size - 1);
+
+			/* pte_pfn(pte) should return an address aligned to pg_size */
+			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+			spin_unlock(&tt->hugepages_lock);
+
+			return hpa;
+		}
+		spin_unlock(&tt->hugepages_lock);
+	}
 	/* Find a KVM memslot */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/* Do not put a huge page and continue without error */
+		if (PageCompound(page))
+			continue;
+
 		if (realmode_put_page(page)) {
 			ret = H_TOO_HARD;
 			break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		if (iommu_tce_put_param_check(tbl, ioba, tce))
 			return H_PARAMETER;
 
-		hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+		hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_reason = H_TOO_HARD;
 			return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		if (unlikely(ret)) {
 			struct page *pg = realmode_pfn_to_page(hpa);
 			BUG_ON(!pg);
+
+			/* Do not put a huge page and return an error */
+			if (!PageCompound(pg))
+				return H_HARDWARE;
+
 			if (realmode_put_page(pg)) {
 				vcpu->arch.tce_reason = H_HARDWARE;
 				return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	vcpu->arch.tce_tmp_num = 0;
 	vcpu->arch.tce_reason = 0;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
 			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs and go get_page */
 	for (i = 0; i < npages; ++i) {
-		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+		unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
 				vcpu->arch.tce_tmp[i], true);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_tmp_num = i;
-- 
1.7.10.4
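The commit message above describes a list that is filled in virtual mode so that a huge page is pinned once and later lookups need no reference counting. A rough user-space sketch of that "add only if not yet tracked" idea follows. Everything in it is hypothetical and simplified: struct sketch_page is just a counter standing in for struct page, a hand-rolled singly linked list replaces list_head, locking is omitted (single-threaded), and entries are matched by aligned guest address rather than by pte as the patch does. It also takes and drops a single reference, whereas the patch drops two at cleanup.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct page: only a reference counter. */
struct sketch_page {
	int refcount;
};

/* Hypothetical descriptor for one tracked huge page. */
struct sketch_hugepage {
	struct sketch_hugepage *next;
	unsigned long gpa;	/* guest physical base, aligned to size */
	unsigned long size;	/* huge page size, a power of two */
	struct sketch_page *page;
};

struct sketch_table {
	struct sketch_hugepage *head;
};

/*
 * Track a huge page if it is not tracked yet.
 * Returns 0 on success (including "already there"), -1 if out of memory.
 */
static int sketch_hugepage_try_add(struct sketch_table *tt,
		struct sketch_page *page, unsigned long gpa,
		unsigned long size)
{
	unsigned long base = gpa & ~(size - 1);
	struct sketch_hugepage *hp;

	/* Already tracked: no second reference is taken. */
	for (hp = tt->head; hp; hp = hp->next)
		if (hp->gpa == base)
			return 0;

	hp = calloc(1, sizeof(*hp));
	if (!hp)
		return -1;

	page->refcount++;	/* stands in for pinning the page */
	hp->gpa = base;
	hp->size = size;
	hp->page = page;
	hp->next = tt->head;
	tt->head = hp;

	return 0;
}

/* Drop the reference held by each descriptor and free the list. */
static void sketch_hugepages_cleanup(struct sketch_table *tt)
{
	struct sketch_hugepage *hp = tt->head, *next;

	while (hp) {
		next = hp->next;
		hp->page->refcount--;
		free(hp);
		hp = next;
	}
	tt->head = NULL;
}
```

Adding two different guest addresses that fall inside the same 16MB page leaves one descriptor and one reference; cleanup brings the counter back to zero.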