From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alexey Kardashevskiy Subject: Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE Date: Tue, 22 Mar 2016 11:34:55 +1100 Message-ID: <56F0932F.2050708@ozlabs.ru> References: <1457322077-26640-1-git-send-email-aik@ozlabs.ru> <1457322077-26640-10-git-send-email-aik@ozlabs.ru> <20160309054544.GM22546@voom.fritz.box> <56DFEACC.7030508@ozlabs.ru> <20160310052108.GV22546@voom.fritz.box> <56E1FEBE.5090602@ozlabs.ru> <20160315060400.GK15272@voom.fritz.box> <15389a41428.27cb.1ca38dd7e845b990cd13d431eb58563d@ozlabs.ru> <20160321051932.GJ23586@voom.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Cc: "linuxppc-dev@lists.ozlabs.org" , Paul Mackerras , Alex Williamson , kvm-ppc , kvm@vger.kernel.org To: David Gibson Return-path: Received: from mail-pf0-f193.google.com ([209.85.192.193]:35965 "EHLO mail-pf0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753361AbcCVAfE (ORCPT ); Mon, 21 Mar 2016 20:35:04 -0400 Received: by mail-pf0-f193.google.com with SMTP id q129so33293933pfb.3 for ; Mon, 21 Mar 2016 17:35:03 -0700 (PDT) In-Reply-To: <20160321051932.GJ23586@voom.redhat.com> Sender: kvm-owner@vger.kernel.org List-ID: Uff, lost cc: list. Added back. Some comments below. On 03/21/2016 04:19 PM, David Gibson wrote: > On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote: >> On March 15, 2016 17:29:26 David Gibson wrote: >> >>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote: >>>> On 03/10/2016 04:21 PM, David Gibson wrote: >>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote: >>>>>> On 03/09/2016 04:45 PM, David Gibson wrote: >>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote: >>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap >>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU >>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and >>>>>>>> linked to IOMMU groups by the user space. >>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need >>>>>>>> to tell KVM the information about the LIOBN-to-group mapping. >>>>>>>> >>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter >>>>>>>> is added which accepts: >>>>>>>> - a VFIO group fd and IO base address to find the actual hardware >>>>>>>> TCE table; >>>>>>>> - a LIOBN to assign to the found table. >>>>>>>> >>>>>>>> Before notifying KVM about new link, this check the group for being >>>>>>>> registered with KVM device in order to release them at unexpected KVM >>>>>>>> finish. >>>>>>>> >>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user >>>>>>>> space. >>>>>>>> >>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it >>>>>>>> link to a KVM module. >>>>>>>> >>>>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>>>> --- >>>>>>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +++++- >>>>>>>> arch/powerpc/kvm/Kconfig | 1 + >>>>>>>> arch/powerpc/kvm/Makefile | 5 +- >>>>>>>> arch/powerpc/kvm/powerpc.c | 1 + >>>>>>>> include/uapi/linux/kvm.h | 9 +++ >>>>>>>> virt/kvm/vfio.c | 106 >>>> +++++++++++++++++++++++++++++ >>>>>>>> 6 files changed, 140 insertions(+), 3 deletions(-) >>>>>>>> >>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt >>>> b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>> index ef51740..c0d3eb7 100644 >>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>> @@ -16,7 +16,24 @@ Groups: >>>>>>>> >>>>>>>> KVM_DEV_VFIO_GROUP attributes: >>>>>>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking >>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>> + for the VFIO group. >>>>>>> >>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which >>>>>>> case it might be clearer to put them in a separate patch. >>>>>>> >>>>>>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device >>>> tracking >>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>> + for the VFIO group. >>>>>>>> >>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>> -for the VFIO group. >>>>>>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group >>>>>>>> + kvm_device_attr.addr points to a struct: >>>>>>>> + struct kvm_vfio_spapr_tce_liobn { >>>>>>>> + __u32 argsz; >>>>>>>> + __s32 fd; >>>>>>>> + __u32 liobn; >>>>>>>> + __u8 pad[4]; >>>>>>>> + __u64 start_addr; >>>>>>>> + }; >>>>>>>> + where >>>>>>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; >>>>>>>> + @fd is a file descriptor for a VFIO group; >>>>>>>> + @liobn is a logical bus id to be associated with the group; >>>>>>>> + @start_addr is a DMA window offset on the IO (PCI) bus >>>>>>> >>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call >>>>>>> this multiple times with different LIOBNs and the same IOMMU group? >>>>>> >>>>>> >>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit >>>>>> and 64bit windows, this is why @start_addr is there. >>>>>> >>>>>> >>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig >>>>>>>> index 1059846..dfa3488 100644 >>>>>>>> --- a/arch/powerpc/kvm/Kconfig >>>>>>>> +++ b/arch/powerpc/kvm/Kconfig >>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 >>>>>>>> select KVM >>>>>>>> select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE >>>>>>>> select SPAPR_TCE_IOMMU if IOMMU_SUPPORT >>>>>>>> + select KVM_VFIO if VFIO >>>>>>>> ---help--- >>>>>>>> Support running unmodified book3s_64 and book3s_32 guest kernels >>>>>>>> in virtual machines on book3s_64 host processors. >>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile >>>>>>>> index 7f7b6d8..71f577c 100644 >>>>>>>> --- a/arch/powerpc/kvm/Makefile >>>>>>>> +++ b/arch/powerpc/kvm/Makefile >>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm >>>>>>>> KVM := ../../../virt/kvm >>>>>>>> >>>>>>>> common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ >>>>>>>> - $(KVM)/eventfd.o $(KVM)/vfio.o >>>>>>>> + $(KVM)/eventfd.o >>>>>>> >>>>>>> Please don't disable the VFIO device for the non-book3s case. I added >>>>>>> it (even though it didn't do anything until now) so that libvirt >>>>>>> wouldn't choke when it finds it's not available. Obviously the new >>>>>>> ioctl needs to be only for the right IOMMU setup, but the device >>>>>>> itself should be available always. >>>>>> >>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module. >>>>>> >>>>>> >>>>>>>> CFLAGS_e500_mmu.o := -I. >>>>>>>> CFLAGS_e500_mmu_host.o := -I. >>>>>>>> @@ -87,6 +87,9 @@ endif >>>>>>>> kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ >>>>>>>> book3s_xics.o >>>>>>>> >>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ >>>>>>>> + $(KVM)/vfio.o \ >>>>>>>> + >>>>>>>> kvm-book3s_64-module-objs += \ >>>>>>>> $(KVM)/kvm_main.o \ >>>>>>>> $(KVM)/eventfd.o \ >>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>>>>>> index 19aa59b..63f188d 100644 >>>>>>>> --- a/arch/powerpc/kvm/powerpc.c >>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c >>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm >>>> *kvm, long ext) >>>>>>>> #ifdef CONFIG_PPC_BOOK3S_64 >>>>>>>> case KVM_CAP_SPAPR_TCE: >>>>>>>> case KVM_CAP_SPAPR_TCE_64: >>>>>>>> + case KVM_CAP_SPAPR_TCE_VFIO: >>>>>>>> case KVM_CAP_PPC_ALLOC_HTAB: >>>>>>>> case KVM_CAP_PPC_RTAS: >>>>>>>> case KVM_CAP_PPC_FIXUP_HCALL: >>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h >>>>>>>> index 080ffbf..f1abbea 100644 >>>>>>>> --- a/include/uapi/linux/kvm.h >>>>>>>> +++ b/include/uapi/linux/kvm.h >>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr { >>>>>>>> #define KVM_DEV_VFIO_GROUP 1 >>>>>>>> #define KVM_DEV_VFIO_GROUP_ADD 1 >>>>>>>> #define KVM_DEV_VFIO_GROUP_DEL 2 >>>>>>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN 3 >>>>>>>> >>>>>>>> enum kvm_device_type { >>>>>>>> KVM_DEV_TYPE_FSL_MPIC_20 = 1, >>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type { >>>>>>>> KVM_DEV_TYPE_MAX, >>>>>>>> }; >>>>>>>> >>>>>>>> +struct kvm_vfio_spapr_tce_liobn { >>>>>>>> + __u32 argsz; >>>>>>>> + __s32 fd; >>>>>>>> + __u32 liobn; >>>>>>>> + __u8 pad[4]; >>>>>>>> + __u64 start_addr; >>>>>>>> +}; >>>>>>>> + >>>>>>>> /* >>>>>>>> * ioctls for VM fds >>>>>>>> */ >>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c >>>>>>>> index 1dd087d..87c771e 100644 >>>>>>>> --- a/virt/kvm/vfio.c >>>>>>>> +++ b/virt/kvm/vfio.c >>>>>>>> @@ -20,6 +20,10 @@ >>>>>>>> #include >>>>>>>> #include "vfio.h" >>>>>>>> >>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>> +#include >>>>>>>> +#endif >>>>>>>> + >>>>>>>> struct kvm_vfio_group { >>>>>>>> struct list_head node; >>>>>>>> struct vfio_group *vfio_group; >>>>>>>> @@ -60,6 +64,22 @@ static void >>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group) >>>>>>>> symbol_put(vfio_group_put_external_user); >>>>>>>> } >>>>>>>> >>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) >>>>>>>> +{ >>>>>>>> + int (*fn)(struct vfio_group *); >>>>>>>> + int ret = -1; >>>>>>> >>>>>>> Should this be -ESOMETHING? >>>>>>> >>>>>>>> + fn = symbol_get(vfio_external_user_iommu_id); >>>>>>>> + if (!fn) >>>>>>>> + return ret; >>>>>>>> + >>>>>>>> + ret = fn(vfio_group); >>>>>>>> + >>>>>>>> + symbol_put(vfio_external_user_iommu_id); >>>>>>>> + >>>>>>>> + return ret; >>>>>>>> +} >>>>>>>> + >>>>>>>> static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group) >>>>>>>> { >>>>>>>> long (*fn)(struct vfio_group *, unsigned long); >>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct >>>> kvm_device *dev) >>>>>>>> mutex_unlock(&kv->lock); >>>>>>>> } >>>>>>>> >>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm, >>>>>>>> + struct vfio_group *vfio_group) >>>>>>> >>>>>>> >>>>>>> Shouldn't this go in the same patch that introduced the attach >>>>>>> function? >>>>>> >>>>>> Having less patches which touch different maintainers areas is better. I >>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in >>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE >>>>>> table". >>>>>> >>>>>> >>>>>>> >>>>>>>> +{ >>>>>>>> + int group_id; >>>>>>>> + struct iommu_group *grp; >>>>>>>> + >>>>>>>> + group_id = kvm_vfio_external_user_iommu_id(vfio_group); >>>>>>>> + grp = iommu_group_get_by_id(group_id); >>>>>>>> + if (grp) { >>>>>>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); >>>>>>>> + iommu_group_put(grp); >>>>>>>> + } >>>>>>>> +} >>>>>>>> +#endif >>>>>>>> + >>>>>>>> static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg) >>>>>>>> { >>>>>>>> struct kvm_vfio *kv = dev->private; >>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device >>>> *dev, long attr, u64 arg) >>>>>>>> continue; >>>>>>>> >>>>>>>> list_del(&kvg->node); >>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>> >>>>>>> Better to make a no-op version of the call than have to #ifdef at the >>>>>>> callsite. >>>>>> >>>>>> It is questionable. A x86 reader may decide that >>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get >>>>>> confused. >>>>>> >>>>>> >>>>>>> >>>>>>>> + kvm_vfio_spapr_detach_iommu_group(dev->kvm, >>>>>>>> + kvg->vfio_group); >>>>>>>> +#endif >>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group); >>>>>>>> kfree(kvg); >>>>>>>> ret = 0; >>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device >>>> *dev, long attr, u64 arg) >>>>>>>> kvm_vfio_update_coherency(dev); >>>>>>>> >>>>>>>> return ret; >>>>>>>> + >>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: { >>>>>>>> + struct kvm_vfio_spapr_tce_liobn param; >>>>>>>> + unsigned long minsz; >>>>>>>> + struct kvm_vfio *kv = dev->private; >>>>>>>> + struct vfio_group *vfio_group; >>>>>>>> + struct kvm_vfio_group *kvg; >>>>>>>> + struct fd f; >>>>>>>> + >>>>>>>> + minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn, >>>>>>>> + start_addr); >>>>>>>> + >>>>>>>> + if (copy_from_user(¶m, (void __user *)arg, minsz)) >>>>>>>> + return -EFAULT; >>>>>>>> + >>>>>>>> + if (param.argsz < minsz) >>>>>>>> + return -EINVAL; >>>>>>>> + >>>>>>>> + f = fdget(param.fd); >>>>>>>> + if (!f.file) >>>>>>>> + return -EBADF; >>>>>>>> + >>>>>>>> + vfio_group = kvm_vfio_group_get_external_user(f.file); >>>>>>>> + fdput(f); >>>>>>>> + >>>>>>>> + if (IS_ERR(vfio_group)) >>>>>>>> + return PTR_ERR(vfio_group); >>>>>>>> + >>>>>>>> + ret = -ENOENT; >>>>>>> >>>>>>> Shouldn't there be some runtime test for the type of the IOMMU? It's >>>>>>> possible a kernel could be built for a platform supporting multiple >>>>>>> IOMMU types. >>>>>> >>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a >>>>>> VFIO container property, not a group property and here (KVM) we only have >>>>>> groups. >>>>> >>>>> Which, as mentioned previously, is broken. >>>> >>>> Which I am failing to follow you on this. >>>> >>>> What I am trying to achieve here is pretty much referencing a group so it >>>> cannot be reused. Plus LIOBNs. >>> >>> "Plus LIOBNs" is not a trivial change. You are establishing a linkage >> >from LIOBNs to groups. But that doesn't make sense; if mapping in one >>> (guest) LIOBN affects a group it must affect all groups in the >>> container. i.e. LIOBN->container is the natural mapping, *not* LIOBN >>> to group. >> >> I can see your point but i don't see how to proceed now, I'm totally stuck. >> Pass container fd and then implement new api to lock containers somehow and > > I'm not really understanding what the question is about locking containers. > >> enumerate groups when updating TCE table (including real mode)? > > Why do you need to enumerate groups? The groups within the container > share a TCE table, so can't you just update that once? Well, they share a TCE table but they do not share TCE Kill (TCE cache invalidate) register address, it is still per PE but this does not matter here (pnv_pci_link_table_and_group() does that), just mentioned to complete the picture. >> Plus new API when we remove a group from a container as the result of guest >> PCI hot unplug? > > I assume you mean a kernel internal API, since it shouldn't need > anything else visible to userspace. Won't this happen naturally? > When the group is removed from the container, it will get its own TCE > table instead of the previously shared one. > >>>> Passing a container fd does not make much >>>> sense here as the VFIO device would walk through groups, reference them and >>>> that is it, there is no locking on VFIO containters and so far there was no >>>> need to teach KVM about containers. >>>> >>>> What do I miss now? >>> >>> Referencing the groups is essentially just a useful side effect. The >>> important functionality is informing VFIO of the guest LIOBNs; and >>> LIOBNs map to containers, not groups. >> >> No. One liobn maps to one KVM-allocated TCE table, not a container. There >> can be one or many or none containers per liobn. > > Ah, true. So I need to add new kernel API for KVM to get table(s) from VFIO container(s). And invent some locking mechanism to prevent table(s) (or associated container(s)) from going away, like vfio_group_get_external_user/vfio_group_put_external_user but for containers. Correct? -- Alexey