* [RFC 00/11] KVM PCIe/MSI passthrough on ARM/ARM64: re-design with transparent MSI mapping
@ 2016-09-27 20:48 Eric Auger
  [not found] ` <1475009318-2617-6-git-send-email-eric.auger@redhat.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Auger @ 2016-09-27 20:48 UTC (permalink / raw)
  To: linux-arm-kernel

Following Robin's series [1] addressing MSI IOMMU mapping for devices
attached to a DMA ops domain, my previous 3-part series (v12) lost most
of its consistency. The msi-iommu API's role is now handled at the
dma-iommu level, while the MSI doorbell registration API is only used
for the security assessment. The MSI layer part is no longer needed
either, since the mapping is now done directly in the compose callback.

Here I propose an alternative approach, based on [1]. This approach was
discussed at the KVM Forum with Christoffer Dall and Marc Zyngier, and
was suggested by Christoffer.

The idea is that we could let the iommu layer transparently allocate
MSI frame IOVAs in the holes left between the UNMANAGED iova slots set
by the iommu-api user. This series introduces a new IOMMU domain type
that allows mixing unmanaged and managed IOVA slots. We define an IOVA
domain whose aperture covers the GPA address range. Each time the
IOMMU-API user maps iova/pa, we reserve the IOVA range to prevent the
iova allocator from using it for MSI mapping. This simplifies the user
side, which no longer needs to provide an IOVA aperture.

The current series does not address the interrupt safety assessment,
which may be considered a separate issue. Currently the assignment is
considered unsafe on ARM (even with a GICv3 ITS).

Please let me know what you think of this alternative approach.

dependency:
[1] [PATCH v7 00/22] Generic DT bindings for PCI IOMMUs and ARM SMMU
http://www.spinics.net/lists/arm-kernel/msg531110.html

Best Regards

Eric

Testing:
- functional on ARM64 AMD Overdrive HW (single GICv2m frame). Lack of
  contexts prevents me from testing multiple assignments.

Git: complete series available at
https://github.com/eauger/linux/tree/generic-v7-pcie-passthru-redesign-rfc

previous:
https://github.com/eauger/linux/tree/v4.7-rc7-passthrough-v12

The above branch includes a temporary patch to work around a ThunderX
pci bus reset crash (which I think is unrelated to this series):
"vfio: pci: HACK! workaround thunderx pci_try_reset_bus crash"
Do not take this one for other platforms.

Eric Auger (10):
  iommu: Add iommu_domain_msi_geometry and DOMAIN_ATTR_MSI_GEOMETRY
  iommu: Introduce IOMMU_CAP_TRANSLATE_MSI capability
  iommu: Introduce IOMMU_DOMAIN_MIXED
  iommu/dma: iommu_dma_(un)map_mixed
  iommu/arm-smmu: Allow IOMMU_DOMAIN_MIXED domain allocation
  iommu: Use IOMMU_DOMAIN_MIXED typed domain when IOMMU translates MSI
  vfio/type1: Sets the IOVA window in case MSI IOVA need to be allocated
  vfio/type1: Reserve IOVAs for IOMMU_DOMAIN_MIXED domains
  iommu/arm-smmu: Do not advertise IOMMU_CAP_INTR_REMAP
  iommu/arm-smmu: Advertise IOMMU_CAP_TRANSLATE_MSI

Robin Murphy (1):
  iommu/dma: Allow MSI-only cookies

 drivers/iommu/arm-smmu-v3.c     |  8 +++-
 drivers/iommu/arm-smmu.c        |  8 +++-
 drivers/iommu/dma-iommu.c       | 91 +++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommu.c           | 10 ++++-
 drivers/vfio/vfio_iommu_type1.c | 48 ++++++++++++++++++----
 include/linux/dma-iommu.h       | 27 ++++++++++++
 include/linux/iommu.h           | 23 +++++++++++
 7 files changed, 203 insertions(+), 12 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 6+ messages in thread
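As a rough caller-side illustration of the flow described in the cover
letter (a sketch against the series' proposed helpers; the wrapper
function, its name and the values are invented, not code from the
series), the IOMMU-API user routes each explicit mapping through the
mixed-domain helper so that the mapped range is also reserved against
the MSI IOVA allocator:

#include <linux/iommu.h>
#include <linux/dma-iommu.h>

/*
 * Hypothetical wrapper, sketch only: shows how an IOMMU-API user
 * (e.g. vfio type1) is expected to use the helper added by this
 * series for IOMMU_DOMAIN_MIXED domains.
 */
static int map_user_range(struct iommu_domain *domain, unsigned long iova,
			  phys_addr_t paddr, size_t size, int prot)
{
	if (domain->type == IOMMU_DOMAIN_MIXED)
		/*
		 * Maps iova -> paddr and additionally reserves the
		 * IOVA range so the iova allocator never hands it out
		 * for an MSI doorbell mapping.
		 */
		return iommu_dma_map_mixed(domain, iova, paddr, size, prot);

	/* Unchanged path for plain unmanaged domains. */
	return iommu_map(domain, iova, paddr, size, prot);
}

The unmap side would mirror this split through iommu_dma_unmap_mixed(),
which is what the review of patch 5 below is about.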
[parent not found: <1475009318-2617-6-git-send-email-eric.auger@redhat.com>]
* [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed
  [not found] ` <1475009318-2617-6-git-send-email-eric.auger@redhat.com>
@ 2016-09-30 13:24   ` Robin Murphy
  2016-10-02  9:56     ` Christoffer Dall
  2016-10-03  9:38     ` Auger Eric
  0 siblings, 2 replies; 6+ messages in thread
From: Robin Murphy @ 2016-09-30 13:24 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Eric,

On 27/09/16 21:48, Eric Auger wrote:
> iommu_dma_map_mixed and iommu_dma_unmap_mixed operate on
> IOMMU_DOMAIN_MIXED typed domains. On top of standard iommu_map/unmap
> they reserve the IOVA window to prevent the iova allocator to
> allocate in those areas.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  drivers/iommu/dma-iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma-iommu.h | 18 ++++++++++++++++++
>  2 files changed, 66 insertions(+)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 04bbc85..db21143 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -759,3 +759,51 @@ int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>  	return 0;
>  }
>  EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
> +
> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
> +			phys_addr_t paddr, size_t size, int prot)
> +{
> +	struct iova_domain *iovad;
> +	unsigned long lo, hi;
> +	int ret;
> +
> +	if (domain->type != IOMMU_DOMAIN_MIXED)
> +		return -EINVAL;
> +
> +	if (!domain->iova_cookie)
> +		return -EINVAL;
> +
> +	iovad = cookie_iovad(domain);
> +
> +	lo = iova_pfn(iovad, iova);
> +	hi = iova_pfn(iovad, iova + size - 1);
> +	reserve_iova(iovad, lo, hi);

This can't work reliably - reserve_iova() will (for good reason) merge
any adjacent or overlapping entries, so any unmap is liable to free
more IOVA space than actually gets unmapped, and things will get subtly
out of sync and go wrong later.

The more general issue with this whole approach, though, is that it
effectively rules out userspace doing guest memory hotplug or similar,
and I'm not sure we want to paint ourselves into that corner.
Basically, as soon as a device is attached to a guest, the entirety of
the unallocated IPA space becomes reserved, and userspace can never add
anything further to it, because any given address *might* be in use for
an MSI mapping.

I think it still makes most sense to stick with the original approach
of cooperating with userspace to reserve a bounded area - it's just
that we can then let automatic mapping take care of itself within that
area.

Speaking of which, I've realised the same fundamental reservation
problem already applies to PCI without ACS, regardless of MSIs. I just
tried on my Juno with guest memory placed at 0x4000000000, (i.e.
matching the host PA of the 64-bit PCI window), and sure enough when
the guest kicks off some DMA on the passed-through NIC, the root
complex interprets the guest IPA as (unsupported) peer-to-peer DMA to a
BAR claimed by the video card, and it fails. I guess this doesn't get
hit in practice on x86 because the guest memory map is unlikely to be
much different from the host's.

It seems like we basically need a general way of communicating fixed
and movable host reservations to userspace :/

Robin.
> +	ret = iommu_map(domain, iova, paddr, size, prot);
> +	if (ret)
> +		free_iova(iovad, lo);
> +	return ret;
> +}
> +EXPORT_SYMBOL(iommu_dma_map_mixed);
> +
> +size_t iommu_dma_unmap_mixed(struct iommu_domain *domain, unsigned long iova,
> +			     size_t size)
> +{
> +	struct iova_domain *iovad;
> +	unsigned long lo;
> +	size_t ret;
> +
> +	if (domain->type != IOMMU_DOMAIN_MIXED)
> +		return -EINVAL;
> +
> +	if (!domain->iova_cookie)
> +		return -EINVAL;
> +
> +	iovad = cookie_iovad(domain);
> +	lo = iova_pfn(iovad, iova);
> +
> +	ret = iommu_unmap(domain, iova, size);
> +	if (ret == size)
> +		free_iova(iovad, lo);
> +	return ret;
> +}
> +EXPORT_SYMBOL(iommu_dma_unmap_mixed);
> diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
> index 1c55413..f2aa855 100644
> --- a/include/linux/dma-iommu.h
> +++ b/include/linux/dma-iommu.h
> @@ -70,6 +70,12 @@ void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg);
>  int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>  		dma_addr_t base, u64 size);
>
> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
> +			phys_addr_t paddr, size_t size, int prot);
> +
> +size_t iommu_dma_unmap_mixed(struct iommu_domain *domain, unsigned long iova,
> +			     size_t size);
> +
>  #else
>
>  struct iommu_domain;
> @@ -99,6 +105,18 @@ static inline int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>  	return -ENODEV;
>  }
>
> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
> +			phys_addr_t paddr, size_t size, int prot)
> +{
> +	return -ENODEV;
> +}
> +
> +size_t iommu_dma_unmap_mixed(struct iommu_domain *domain, unsigned long iova,
> +			     size_t size)
> +{
> +	return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_DMA */
>  #endif /* __KERNEL__ */
>  #endif /* __DMA_IOMMU_H */
>

^ permalink raw reply	[flat|nested] 6+ messages in thread
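To make the failure mode Robin describes concrete, here is a short
walk-through against the proposed API. It is a sketch only: the
IOVA/PA values and the wrapper function are invented, and the
coalescing behaviour in the comments is taken from Robin's description
above rather than independently verified.

#include <linux/iommu.h>
#include <linux/dma-iommu.h>
#include <linux/sizes.h>

/* Illustrative only: invented addresses, hypothetical caller. */
static void mixed_reserve_hazard(struct iommu_domain *domain)
{
	/* Two back-to-back guest RAM regions mapped by the VFIO user. */
	iommu_dma_map_mixed(domain, 0x40000000, 0x80000000, SZ_2M,
			    IOMMU_READ | IOMMU_WRITE);
	iommu_dma_map_mixed(domain, 0x40200000, 0x80200000, SZ_2M,
			    IOMMU_READ | IOMMU_WRITE);

	/*
	 * Per Robin's point above, the two reservations may now sit in
	 * a single coalesced entry covering 0x40000000-0x403fffff.
	 */

	/* Unmapping only the first region ... */
	iommu_dma_unmap_mixed(domain, 0x40000000, SZ_2M);

	/*
	 * ... ends in free_iova() on whatever entry contains that pfn,
	 * so the still-mapped second region may no longer look
	 * reserved, and the MSI IOVA allocator can then hand out
	 * addresses that collide with it.
	 */
}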
* [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed
  2016-09-30 13:24 ` [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed Robin Murphy
@ 2016-10-02  9:56   ` Christoffer Dall
  2016-10-04 17:18     ` Robin Murphy
  0 siblings, 1 reply; 6+ messages in thread
From: Christoffer Dall @ 2016-10-02 9:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 30, 2016 at 02:24:40PM +0100, Robin Murphy wrote:
> Hi Eric,
>
> On 27/09/16 21:48, Eric Auger wrote:
> > iommu_dma_map_mixed and iommu_dma_unmap_mixed operate on
> > IOMMU_DOMAIN_MIXED typed domains. On top of standard iommu_map/unmap
> > they reserve the IOVA window to prevent the iova allocator to
> > allocate in those areas.
> >
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > ---
> >  drivers/iommu/dma-iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-iommu.h | 18 ++++++++++++++++++
> >  2 files changed, 66 insertions(+)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 04bbc85..db21143 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -759,3 +759,51 @@ int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
> > +
> > +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
> > +			phys_addr_t paddr, size_t size, int prot)
> > +{
> > +	struct iova_domain *iovad;
> > +	unsigned long lo, hi;
> > +	int ret;
> > +
> > +	if (domain->type != IOMMU_DOMAIN_MIXED)
> > +		return -EINVAL;
> > +
> > +	if (!domain->iova_cookie)
> > +		return -EINVAL;
> > +
> > +	iovad = cookie_iovad(domain);
> > +
> > +	lo = iova_pfn(iovad, iova);
> > +	hi = iova_pfn(iovad, iova + size - 1);
> > +	reserve_iova(iovad, lo, hi);
>
> This can't work reliably - reserve_iova() will (for good reason) merge
> any adjacent or overlapping entries, so any unmap is liable to free
> more IOVA space than actually gets unmapped, and things will get
> subtly out of sync and go wrong later.
>
> The more general issue with this whole approach, though, is that it
> effectively rules out userspace doing guest memory hotplug or similar,
> and I'm not sure we want to paint ourselves into that corner.
> Basically, as soon as a device is attached to a guest, the entirety of
> the unallocated IPA space becomes reserved, and userspace can never
> add anything further to it, because any given address *might* be in
> use for an MSI mapping.

Ah, we didn't think of that when discussing this design at KVM Forum,
because the idea was that the IOVA allocator was in charge of that
resource, and the IOVA was a separate concept from the IPA space.

I think what tripped us up is that while the above is true for the MSI
configuration where we trap the bar and do the allocation at VFIO init
time, the guest device driver can program DMA to any address without
trapping, and therefore there's an inherent relationship between the
IOVA and the IPA space. Is that right?

>
> I think it still makes most sense to stick with the original approach
> of cooperating with userspace to reserve a bounded area - it's just
> that we can then let automatic mapping take care of itself within that
> area.

I was thinking that it's also possible to do it the other way around:
to let userspace say where memory may be hotplugged and do the
allocation within the remaining area, but I suppose that's pretty much
the same thing, and it should just depend on what's easiest to
implement and what userspace can best predict.
>
> Speaking of which, I've realised the same fundamental reservation
> problem already applies to PCI without ACS, regardless of MSIs. I just
> tried on my Juno with guest memory placed at 0x4000000000, (i.e.
> matching the host PA of the 64-bit PCI window), and sure enough when
> the guest kicks off some DMA on the passed-through NIC, the root
> complex interprets the guest IPA as (unsupported) peer-to-peer DMA to
> a BAR claimed by the video card, and it fails. I guess this doesn't
> get hit in practice on x86 because the guest memory map is unlikely to
> be much different from the host's.
>
> It seems like we basically need a general way of communicating fixed
> and movable host reservations to userspace :/
>

Yes, this makes sense to me.  Do we have any existing way of
discovering this from userspace or can we think of something?

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed
  2016-10-02  9:56 ` Christoffer Dall
@ 2016-10-04 17:18   ` Robin Murphy
  2016-10-04 17:37     ` Auger Eric
  0 siblings, 1 reply; 6+ messages in thread
From: Robin Murphy @ 2016-10-04 17:18 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/10/16 10:56, Christoffer Dall wrote:
> On Fri, Sep 30, 2016 at 02:24:40PM +0100, Robin Murphy wrote:
>> Hi Eric,
>>
>> On 27/09/16 21:48, Eric Auger wrote:
>>> iommu_dma_map_mixed and iommu_dma_unmap_mixed operate on
>>> IOMMU_DOMAIN_MIXED typed domains. On top of standard iommu_map/unmap
>>> they reserve the IOVA window to prevent the iova allocator to
>>> allocate in those areas.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> ---
>>>  drivers/iommu/dma-iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/dma-iommu.h | 18 ++++++++++++++++++
>>>  2 files changed, 66 insertions(+)
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index 04bbc85..db21143 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -759,3 +759,51 @@ int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>>>  	return 0;
>>>  }
>>>  EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
>>> +
>>> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
>>> +			phys_addr_t paddr, size_t size, int prot)
>>> +{
>>> +	struct iova_domain *iovad;
>>> +	unsigned long lo, hi;
>>> +	int ret;
>>> +
>>> +	if (domain->type != IOMMU_DOMAIN_MIXED)
>>> +		return -EINVAL;
>>> +
>>> +	if (!domain->iova_cookie)
>>> +		return -EINVAL;
>>> +
>>> +	iovad = cookie_iovad(domain);
>>> +
>>> +	lo = iova_pfn(iovad, iova);
>>> +	hi = iova_pfn(iovad, iova + size - 1);
>>> +	reserve_iova(iovad, lo, hi);
>>
>> This can't work reliably - reserve_iova() will (for good reason) merge
>> any adjacent or overlapping entries, so any unmap is liable to free
>> more IOVA space than actually gets unmapped, and things will get
>> subtly out of sync and go wrong later.
>>
>> The more general issue with this whole approach, though, is that it
>> effectively rules out userspace doing guest memory hotplug or similar,
>> and I'm not sure we want to paint ourselves into that corner.
>> Basically, as soon as a device is attached to a guest, the entirety of
>> the unallocated IPA space becomes reserved, and userspace can never
>> add anything further to it, because any given address *might* be in
>> use for an MSI mapping.
>
> Ah, we didn't think of that when discussing this design at KVM Forum,
> because the idea was that the IOVA allocator was in charge of that
> resource, and the IOVA was a separate concept from the IPA space.
>
> I think what tripped us up is that while the above is true for the MSI
> configuration where we trap the bar and do the allocation at VFIO init
> time, the guest device driver can program DMA to any address without
> trapping, and therefore there's an inherent relationship between the
> IOVA and the IPA space. Is that right?

Yes, for anything the guest knows about and/or can touch directly, IOVA
must equal IPA, or DMA is going to go horribly wrong. It's only direct
interactions between device and host behind the guest's back where we
(may) have some freedom with IOVA assignment.

>> I think it still makes most sense to stick with the original approach
>> of cooperating with userspace to reserve a bounded area - it's just
>> that we can then let automatic mapping take care of itself within that
>> area.
>
> I was thinking that it's also possible to do it the other way around:
> to let userspace say where memory may be hotplugged and do the
> allocation within the remaining area, but I suppose that's pretty much
> the same thing, and it should just depend on what's easiest to
> implement and what userspace can best predict.

Indeed, if userspace *is* able to pre-emptively claim everything it
might ever want, that does kind of implicitly solve the "tell me where
I can put this" problem (assuming it doesn't simply claim the whole
address space, of course), but I'm not so sure it works well if there
are any specific restrictions (e.g. if some device is going to require
the MSI range to be 32-bit addressable). It also fails to address the
issue below...

>> Speaking of which, I've realised the same fundamental reservation
>> problem already applies to PCI without ACS, regardless of MSIs. I just
>> tried on my Juno with guest memory placed at 0x4000000000, (i.e.
>> matching the host PA of the 64-bit PCI window), and sure enough when
>> the guest kicks off some DMA on the passed-through NIC, the root
>> complex interprets the guest IPA as (unsupported) peer-to-peer DMA to
>> a BAR claimed by the video card, and it fails. I guess this doesn't
>> get hit in practice on x86 because the guest memory map is unlikely to
>> be much different from the host's.
>>
>> It seems like we basically need a general way of communicating fixed
>> and movable host reservations to userspace :/
>>
>
> Yes, this makes sense to me.  Do we have any existing way of
> discovering this from userspace or can we think of something?

I know virtually nothing about the userspace interface, but I was under
the impression it would require something new. I wasn't even aware you
could do the VFIO-under-QEMU-TCG thing which Eric points out, so it
seems like the general "tell userspace about addresses it can't use"
issue is perhaps the more pressing one. On investigation, QEMU's static
memory map with RAM at 0x40000000 is already busted for VFIO on Juno,
as that results in attempting DMA to config space, which goes about as
well as one might expect.

Robin.

>
> Thanks,
> -Christoffer
>

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed
  2016-10-04 17:18 ` Robin Murphy
@ 2016-10-04 17:37   ` Auger Eric
  0 siblings, 0 replies; 6+ messages in thread
From: Auger Eric @ 2016-10-04 17:37 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robin,

On 04/10/2016 19:18, Robin Murphy wrote:
> On 02/10/16 10:56, Christoffer Dall wrote:
>> On Fri, Sep 30, 2016 at 02:24:40PM +0100, Robin Murphy wrote:
>>> Hi Eric,
>>>
>>> On 27/09/16 21:48, Eric Auger wrote:
>>>> iommu_dma_map_mixed and iommu_dma_unmap_mixed operate on
>>>> IOMMU_DOMAIN_MIXED typed domains. On top of standard iommu_map/unmap
>>>> they reserve the IOVA window to prevent the iova allocator to
>>>> allocate in those areas.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>> ---
>>>>  drivers/iommu/dma-iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/linux/dma-iommu.h | 18 ++++++++++++++++++
>>>>  2 files changed, 66 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>> index 04bbc85..db21143 100644
>>>> --- a/drivers/iommu/dma-iommu.c
>>>> +++ b/drivers/iommu/dma-iommu.c
>>>> @@ -759,3 +759,51 @@ int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>>>>  	return 0;
>>>>  }
>>>>  EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
>>>> +
>>>> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
>>>> +			phys_addr_t paddr, size_t size, int prot)
>>>> +{
>>>> +	struct iova_domain *iovad;
>>>> +	unsigned long lo, hi;
>>>> +	int ret;
>>>> +
>>>> +	if (domain->type != IOMMU_DOMAIN_MIXED)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (!domain->iova_cookie)
>>>> +		return -EINVAL;
>>>> +
>>>> +	iovad = cookie_iovad(domain);
>>>> +
>>>> +	lo = iova_pfn(iovad, iova);
>>>> +	hi = iova_pfn(iovad, iova + size - 1);
>>>> +	reserve_iova(iovad, lo, hi);
>>>
>>> This can't work reliably - reserve_iova() will (for good reason) merge
>>> any adjacent or overlapping entries, so any unmap is liable to free
>>> more IOVA space than actually gets unmapped, and things will get
>>> subtly out of sync and go wrong later.
>>>
>>> The more general issue with this whole approach, though, is that it
>>> effectively rules out userspace doing guest memory hotplug or similar,
>>> and I'm not sure we want to paint ourselves into that corner.
>>> Basically, as soon as a device is attached to a guest, the entirety of
>>> the unallocated IPA space becomes reserved, and userspace can never
>>> add anything further to it, because any given address *might* be in
>>> use for an MSI mapping.
>>
>> Ah, we didn't think of that when discussing this design at KVM Forum,
>> because the idea was that the IOVA allocator was in charge of that
>> resource, and the IOVA was a separate concept from the IPA space.
>>
>> I think what tripped us up is that while the above is true for the MSI
>> configuration where we trap the bar and do the allocation at VFIO init
>> time, the guest device driver can program DMA to any address without
>> trapping, and therefore there's an inherent relationship between the
>> IOVA and the IPA space. Is that right?
>
> Yes, for anything the guest knows about and/or can touch directly, IOVA
> must equal IPA, or DMA is going to go horribly wrong. It's only direct
> interactions between device and host behind the guest's back where we
> (may) have some freedom with IOVA assignment.
>
>>> I think it still makes most sense to stick with the original approach
>>> of cooperating with userspace to reserve a bounded area - it's just
>>> that we can then let automatic mapping take care of itself within that
>>> area.
>>
>> I was thinking that it's also possible to do it the other way around:
>> to let userspace say where memory may be hotplugged and do the
>> allocation within the remaining area, but I suppose that's pretty much
>> the same thing, and it should just depend on what's easiest to
>> implement and what userspace can best predict.
>
> Indeed, if userspace *is* able to pre-emptively claim everything it
> might ever want, that does kind of implicitly solve the "tell me where
> I can put this" problem (assuming it doesn't simply claim the whole
> address space, of course), but I'm not so sure it works well if there
> are any specific restrictions (e.g. if some device is going to require
> the MSI range to be 32-bit addressable). It also fails to address the
> issue below...
>
>>> Speaking of which, I've realised the same fundamental reservation
>>> problem already applies to PCI without ACS, regardless of MSIs. I just
>>> tried on my Juno with guest memory placed at 0x4000000000, (i.e.
>>> matching the host PA of the 64-bit PCI window), and sure enough when
>>> the guest kicks off some DMA on the passed-through NIC, the root
>>> complex interprets the guest IPA as (unsupported) peer-to-peer DMA to
>>> a BAR claimed by the video card, and it fails. I guess this doesn't
>>> get hit in practice on x86 because the guest memory map is unlikely to
>>> be much different from the host's.
>>>
>>> It seems like we basically need a general way of communicating fixed
>>> and movable host reservations to userspace :/
>>>
>>
>> Yes, this makes sense to me.  Do we have any existing way of
>> discovering this from userspace or can we think of something?
>
> I know virtually nothing about the userspace interface, but I was under
> the impression it would require something new. I wasn't even aware you
> could do the VFIO-under-QEMU-TCG thing which Eric points out,

I meant running a non-x86 VM on an x86 host. Quoting Alex:

"x86 isn't problem-free in this space. An x86 VM is going to know that
the 0xfee00000 address range is special, it won't be backed by RAM and
won't be a DMA target, thus we'll never attempt to map it for an iova
address. However, if we run a non-x86 VM or a userspace driver, it
doesn't necessarily know that there's anything special about that range
of iovas. I intend to resolve this with an extension to the iommu info
ioctl that describes the available iova space for the iommu. The
interrupt region would simply be excluded."

In my v12 I added such a VFIO IOMMU info ioctl to retrieve the MSI
topology. Now for the issue you pointed out (PCI without ACS), I
understand this is a generalisation of the same issue, and the VFIO
IOMMU info capability chain API could be used as well. I can submit
something separately.

But anyway, at the QEMU level, due to the static mapping in mach-virt,
at the moment we can just reject the assignment, I am afraid.

Thanks

Eric

> so it
> seems like the general "tell userspace about addresses it can't use"
> issue is perhaps the more pressing one. On investigation, QEMU's static
> memory map with RAM at 0x40000000 is already busted for VFIO on Juno,
> as that results in attempting DMA to config space, which goes about as
> well as one might expect.
>
> Robin.
>
>>
>> Thanks,
>> -Christoffer
>>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread
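For reference, a minimal userspace sketch of the ioctl Eric and Alex
are referring to is shown below. VFIO_IOMMU_GET_INFO and struct
vfio_iommu_type1_info already exist; the capability-chain reporting of
usable IOVA ranges discussed in this thread is only a proposal at this
point, so it appears solely as a comment rather than as real UAPI.

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: query the type1 IOMMU info from a VFIO container fd. */
static int query_iommu_info(int container_fd)
{
	struct vfio_iommu_type1_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);

	if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info))
		return -1;

	printf("supported IOMMU page sizes: 0x%llx\n",
	       (unsigned long long)info.iova_pgsizes);

	/*
	 * With the extension Alex describes, the kernel would report a
	 * larger argsz and chain capability structures describing the
	 * usable IOVA ranges (excluding the MSI doorbell window and,
	 * per Robin's point, host P2P/bridge windows), which QEMU
	 * could check against its guest memory map before committing
	 * to the assignment.
	 */
	return 0;
}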
* [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed
  2016-09-30 13:24 ` [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed Robin Murphy
  2016-10-02  9:56   ` Christoffer Dall
@ 2016-10-03  9:38   ` Auger Eric
  1 sibling, 0 replies; 6+ messages in thread
From: Auger Eric @ 2016-10-03 9:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robin,

On 30/09/2016 15:24, Robin Murphy wrote:
> Hi Eric,
>
> On 27/09/16 21:48, Eric Auger wrote:
>> iommu_dma_map_mixed and iommu_dma_unmap_mixed operate on
>> IOMMU_DOMAIN_MIXED typed domains. On top of standard iommu_map/unmap
>> they reserve the IOVA window to prevent the iova allocator to
>> allocate in those areas.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>  drivers/iommu/dma-iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/dma-iommu.h | 18 ++++++++++++++++++
>>  2 files changed, 66 insertions(+)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 04bbc85..db21143 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -759,3 +759,51 @@ int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
>> +
>> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
>> +			phys_addr_t paddr, size_t size, int prot)
>> +{
>> +	struct iova_domain *iovad;
>> +	unsigned long lo, hi;
>> +	int ret;
>> +
>> +	if (domain->type != IOMMU_DOMAIN_MIXED)
>> +		return -EINVAL;
>> +
>> +	if (!domain->iova_cookie)
>> +		return -EINVAL;
>> +
>> +	iovad = cookie_iovad(domain);
>> +
>> +	lo = iova_pfn(iovad, iova);
>> +	hi = iova_pfn(iovad, iova + size - 1);
>> +	reserve_iova(iovad, lo, hi);
>
> This can't work reliably - reserve_iova() will (for good reason) merge
> any adjacent or overlapping entries, so any unmap is liable to free
> more IOVA space than actually gets unmapped, and things will get
> subtly out of sync and go wrong later.

OK. I did not notice that.

>
> The more general issue with this whole approach, though, is that it
> effectively rules out userspace doing guest memory hotplug or similar,
> and I'm not sure we want to paint ourselves into that corner.
> Basically, as soon as a device is attached to a guest, the entirety of
> the unallocated IPA space becomes reserved, and userspace can never
> add anything further to it, because any given address *might* be in
> use for an MSI mapping.

I fully agree. My bad, I mixed up how/when the PCI MMIO space was iommu
mapped. So we don't have any other solution than having the guest
provide unused and non-reserved GPAs. Back to the original approach
then.

>
> I think it still makes most sense to stick with the original approach
> of cooperating with userspace to reserve a bounded area - it's just
> that we can then let automatic mapping take care of itself within that
> area.

OK, will respin asap.

>
> Speaking of which, I've realised the same fundamental reservation
> problem already applies to PCI without ACS, regardless of MSIs. I just
> tried on my Juno with guest memory placed at 0x4000000000, (i.e.
> matching the host PA of the 64-bit PCI window), and sure enough when
> the guest kicks off some DMA on the passed-through NIC, the root
> complex interprets the guest IPA as (unsupported) peer-to-peer DMA to
> a BAR claimed by the video card, and it fails. I guess this doesn't
> get hit in practice on x86 because the guest memory map is unlikely to
> be much different from the host's.
>
> It seems like we basically need a general way of communicating fixed
> and movable host reservations to userspace :/

Yes, I saw "iommu/dma: Avoid PCI host bridge windows". Well, this looks
like a generalisation of the MSI geometry issue (they also face this
one on x86 with a non-x86 guest). This will also hit the fact that on
QEMU the ARM guest memory map is static.

Thank you for your time

Best Regards

Eric

>
> Robin.
>
>> +	ret = iommu_map(domain, iova, paddr, size, prot);
>> +	if (ret)
>> +		free_iova(iovad, lo);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(iommu_dma_map_mixed);
>> +
>> +size_t iommu_dma_unmap_mixed(struct iommu_domain *domain, unsigned long iova,
>> +			     size_t size)
>> +{
>> +	struct iova_domain *iovad;
>> +	unsigned long lo;
>> +	size_t ret;
>> +
>> +	if (domain->type != IOMMU_DOMAIN_MIXED)
>> +		return -EINVAL;
>> +
>> +	if (!domain->iova_cookie)
>> +		return -EINVAL;
>> +
>> +	iovad = cookie_iovad(domain);
>> +	lo = iova_pfn(iovad, iova);
>> +
>> +	ret = iommu_unmap(domain, iova, size);
>> +	if (ret == size)
>> +		free_iova(iovad, lo);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(iommu_dma_unmap_mixed);
>> diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
>> index 1c55413..f2aa855 100644
>> --- a/include/linux/dma-iommu.h
>> +++ b/include/linux/dma-iommu.h
>> @@ -70,6 +70,12 @@ void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg);
>>  int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>>  		dma_addr_t base, u64 size);
>>
>> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
>> +			phys_addr_t paddr, size_t size, int prot);
>> +
>> +size_t iommu_dma_unmap_mixed(struct iommu_domain *domain, unsigned long iova,
>> +			     size_t size);
>> +
>>  #else
>>
>>  struct iommu_domain;
>> @@ -99,6 +105,18 @@ static inline int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
>>  	return -ENODEV;
>>  }
>>
>> +int iommu_dma_map_mixed(struct iommu_domain *domain, unsigned long iova,
>> +			phys_addr_t paddr, size_t size, int prot)
>> +{
>> +	return -ENODEV;
>> +}
>> +
>> +size_t iommu_dma_unmap_mixed(struct iommu_domain *domain, unsigned long iova,
>> +			     size_t size)
>> +{
>> +	return -ENODEV;
>> +}
>> +
>>  #endif /* CONFIG_IOMMU_DMA */
>>  #endif /* __KERNEL__ */
>>  #endif /* __DMA_IOMMU_H */
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 6+ messages in thread
Thread overview: 6+ messages (end of thread, newest: ~2016-10-04 17:37 UTC)

2016-09-27 20:48 [RFC 00/11] KVM PCIe/MSI passthrough on ARM/ARM64: re-design with transparent MSI mapping Eric Auger
     [not found] ` <1475009318-2617-6-git-send-email-eric.auger@redhat.com>
2016-09-30 13:24   ` [RFC 05/11] iommu/dma: iommu_dma_(un)map_mixed Robin Murphy
2016-10-02  9:56     ` Christoffer Dall
2016-10-04 17:18       ` Robin Murphy
2016-10-04 17:37         ` Auger Eric
2016-10-03  9:38     ` Auger Eric