* DMA region abruptly removed from PCI device
From: Thanos Makatos @ 2020-07-06 10:55 UTC
To: Alex Williamson
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Felipe Franciosi, Liu, Changpeng

We have an issue when using the VFIO-over-socket libmuser PoC
(https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html) instead of
the VFIO kernel module: we notice that DMA regions used by the emulated device
can be abruptly removed while the device is still using them.

The PCI device we've implemented is an NVMe controller using SPDK, so it polls
the submission queues for new requests. We use the latest SeaBIOS, which tries
to boot from the NVMe controller. Several DMA regions are registered
(VFIO_IOMMU_MAP_DMA), and then the admin and a submission queue are created.
From this point SPDK polls both queues. Then the DMA region where the
submission queue lies is removed (VFIO_IOMMU_UNMAP_DMA) and re-added at the
same IOVA but at a different offset. SPDK crashes soon after as it accesses
invalid memory. No other event (e.g. a PCI config space or NVMe register
write) happens between unmapping and re-mapping the DMA region.

My guess is that this behavior is legitimate and that it is solved in the VFIO
kernel module by releasing the DMA region only after all references to it have
been released, which is handled by vfio_pin/unpin_pages, correct? If so, I
suppose we need to implement the same logic in libmuser, but I just want to
make sure I'm not missing anything, as this is a substantial change.
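For concreteness, here is a minimal sketch of the deferred-release scheme Thanos is contemplating. It is purely illustrative; the names are hypothetical and are not the actual libmuser API.

    /* Hypothetical sketch, not the real libmuser API: keep each DMA region
     * alive until the device emulation has dropped every outstanding
     * reference, mirroring what vfio_pin_pages()/vfio_unpin_pages() let an
     * mdev vendor driver do in the kernel. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct dma_region {
        uint64_t iova;
        size_t   len;
        void    *vaddr;          /* region mmap'd into the server */
        int      refcnt;         /* outstanding users, e.g. the SPDK poller */
        bool     unmap_pending;  /* client asked to unmap this region */
    };

    /* Taken while servicing a request that touches this region. */
    static void dma_region_get(struct dma_region *r)
    {
        r->refcnt++;
    }

    static void dma_region_release(struct dma_region *r)
    {
        /* Placeholder for munmap()/bookkeeping; sketch only. */
        free(r);
    }

    /* Dropped when the request completes; the region disappears only once
     * an unmap is pending and nobody is using it any more. */
    static void dma_region_put(struct dma_region *r)
    {
        if (--r->refcnt == 0 && r->unmap_pending) {
            dma_region_release(r);
        }
    }

    /* Called when the unmap message arrives from the client. */
    static void dma_region_unmap(struct dma_region *r)
    {
        r->unmap_pending = true;
        if (r->refcnt == 0) {
            dma_region_release(r);
        }
        /* Otherwise the reply to the client would have to be delayed until
         * the last dma_region_put(), which is exactly the behavioral
         * question raised in this thread. */
    }

Whether the unmap acknowledgement may be delayed at all is the crux of Alex's reply below.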
* Re: DMA region abruptly removed from PCI device
From: Alex Williamson @ 2020-07-06 14:20 UTC
To: Thanos Makatos
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Felipe Franciosi, Liu, Changpeng

On Mon, 6 Jul 2020 10:55:00 +0000
Thanos Makatos <thanos.makatos@nutanix.com> wrote:

> We have an issue when using the VFIO-over-socket libmuser PoC
> (https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html) instead of
> the VFIO kernel module: we notice that DMA regions used by the emulated device
> can be abruptly removed while the device is still using them.
>
> The PCI device we've implemented is an NVMe controller using SPDK, so it polls
> the submission queues for new requests. We use the latest SeaBIOS, which tries
> to boot from the NVMe controller. Several DMA regions are registered
> (VFIO_IOMMU_MAP_DMA), and then the admin and a submission queue are created.
> From this point SPDK polls both queues. Then the DMA region where the
> submission queue lies is removed (VFIO_IOMMU_UNMAP_DMA) and re-added at the
> same IOVA but at a different offset. SPDK crashes soon after as it accesses
> invalid memory. No other event (e.g. a PCI config space or NVMe register
> write) happens between unmapping and re-mapping the DMA region.
>
> My guess is that this behavior is legitimate and that it is solved in the VFIO
> kernel module by releasing the DMA region only after all references to it have
> been released, which is handled by vfio_pin/unpin_pages, correct? If so, I
> suppose we need to implement the same logic in libmuser, but I just want to
> make sure I'm not missing anything, as this is a substantial change.

The vfio_{pin,unpin}_pages() interface only comes into play for mdev
devices, and even then it's an announcement that a given mapping is
going away and the vendor driver is required to release its references.
For normal PCI device assignment, vfio-pci is (aside from a few quirks)
device-agnostic and the IOMMU container mappings are independent of the
device. We do not have any device-specific knowledge to know whether
DMA pages still have device references. The user's unmap request is
absolute: it cannot fail (aside from invalid usage), and upon return
there must be no residual mappings or references to the pages.

If you say there's no config space write, e.g. clearing bus master from
the command register, then something like turning on a vIOMMU might
cause a change in the entire address space accessible by the device.
This would cause the identity map of IOVA to GPA to be replaced by a
new one, perhaps another identity map if iommu=pt, or a more restricted
mapping if the vIOMMU is used for isolation.

It sounds like you have an incomplete device model; physical devices
have their address space adjusted by an IOMMU independently of, but
hopefully in collaboration with, a device driver. If a physical device
manages to bridge this transition, do what it does. Thanks,

Alex
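For reference, the container-level interface Alex describes is the plain VFIO type1 ioctl pair. A minimal userspace sketch follows; error handling is omitted and container_fd is assumed to be an already-configured VFIO container.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map a process virtual address range into the container at 'iova'. */
    static int map_iova(int container_fd, void *vaddr, uint64_t iova,
                        uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;
        map.iova  = iova;
        map.size  = size;
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

    /* Tear the mapping down again.  As described above, this is absolute:
     * when the ioctl returns, the IOMMU mappings and page pins are gone,
     * whether or not the device still has DMA in flight. */
    static int unmap_iova(int container_fd, uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap;

        memset(&unmap, 0, sizeof(unmap));
        unmap.argsz = sizeof(unmap);
        unmap.iova  = iova;
        unmap.size  = size;
        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }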
* Re: DMA region abruptly removed from PCI device
From: Felipe Franciosi @ 2020-07-07 10:38 UTC
To: Alex Williamson
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Thanos Makatos, Liu, Changpeng

> On Jul 6, 2020, at 3:20 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> [...]
>
> It sounds like you have an incomplete device model; physical devices
> have their address space adjusted by an IOMMU independently of, but
> hopefully in collaboration with, a device driver. If a physical device
> manages to bridge this transition, do what it does. Thanks,

Hi,

That's what we are trying to work out. IIUC, the problem we are having
is that a mapping removal was requested but the device was still
operational. We can surely figure out how to handle that gracefully,
but I'm trying to get my head around how real hardware handles that.
Maybe you can add some colour here. :)

What happens when a device tries to write to a physical address that
has no memory behind it? Is it an MCE of sorts?

I haven't really ever looked at memory hot-unplug in detail, but after
reading some QEMU code this is my understanding:

1) QEMU makes an ACPI request to the guest OS for memory unplug
2) The guest OS acks that the memory can be pulled out
3) QEMU pulls the memory from the guest

Before step 3, I'm guessing that QEMU tells all device backends that
this memory is going away. I suppose that in normal operation the guest
OS will have already stopped using the memory (i.e. before step 2), so
there shouldn't be much to it. But I also suppose a malicious guest
could go "ah, you want to remove this DIMM? sure, let me just ask all
these devices to start using it first... ok, there you go."

Is this understanding correct?

I don't think that's the case we're running into, though I think we
need to consider it as well. What's probably happening here is that the
guest went from SeaBIOS to the kernel, a PCI reset happened, and we
didn't plumb that message through correctly. While we are at it, we
should review the memory hot-unplug business.

Thanks,
Felipe
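The software analogue of Felipe's "write to a physical address with no memory behind it" question, in a vfio-user/libmuser-style backend, is a failed IOVA lookup. The defensive sketch below fails the access instead of dereferencing a stale pointer; the helper names are hypothetical and are not taken from libmuser or SPDK.

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct dma_map_entry {
        uint64_t iova;
        size_t   len;
        void    *vaddr;
    };

    /* Table maintained from the client's MAP/UNMAP messages. */
    extern struct dma_map_entry dma_table[];
    extern size_t dma_table_len;

    /* Translate an IOVA on every access instead of caching the result
     * across map/unmap events; return NULL if the region is gone. */
    static void *iova_to_vaddr(uint64_t iova, size_t len)
    {
        for (size_t i = 0; i < dma_table_len; i++) {
            struct dma_map_entry *e = &dma_table[i];
            if (iova >= e->iova && iova + len <= e->iova + e->len) {
                return (char *)e->vaddr + (iova - e->iova);
            }
        }
        return NULL;   /* unmapped: the software equivalent of a dropped DMA */
    }

    /* E.g. the NVMe submission-queue poller would fetch each entry through
     * the lookup rather than through a pointer saved at queue creation. */
    static int read_sq_entry(uint64_t sq_iova, void *sqe_out, size_t sqe_len)
    {
        void *src = iova_to_vaddr(sq_iova, sqe_len);
        if (src == NULL) {
            return -EFAULT;    /* fail the transaction, don't crash */
        }
        memcpy(sqe_out, src, sqe_len);
        return 0;
    }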
* Re: DMA region abruptly removed from PCI device
From: Alex Williamson @ 2020-07-07 15:54 UTC
To: Felipe Franciosi
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Thanos Makatos, Liu, Changpeng

On Tue, 7 Jul 2020 10:38:01 +0000
Felipe Franciosi <felipe@nutanix.com> wrote:

> > On Jul 6, 2020, at 3:20 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > [...]
>
> Hi,
>
> That's what we are trying to work out. IIUC, the problem we are having
> is that a mapping removal was requested but the device was still
> operational. We can surely figure out how to handle that gracefully,
> but I'm trying to get my head around how real hardware handles that.
> Maybe you can add some colour here. :)
>
> What happens when a device tries to write to a physical address that
> has no memory behind it? Is it an MCE of sorts?

It depends on the system: the write might be silently dropped (a), it
might generate an IOMMU fault (b), or firmware-first platform error
handling might freak out from either (a) or (b) and decide to trigger a
fatal error. If mappings are getting removed due to bus master enable
getting cleared, I would expect device-specific behavior; the device
could either stall or drop transactions.

> I haven't really ever looked at memory hot-unplug in detail, but after
> reading some QEMU code this is my understanding:
>
> 1) QEMU makes an ACPI request to the guest OS for memory unplug
> 2) The guest OS acks that the memory can be pulled out
> 3) QEMU pulls the memory from the guest
>
> Before step 3, I'm guessing that QEMU tells all device backends that
> this memory is going away. I suppose that in normal operation the guest
> OS will have already stopped using the memory (i.e. before step 2), so
> there shouldn't be much to it. But I also suppose a malicious guest
> could go "ah, you want to remove this DIMM? sure, let me just ask all
> these devices to start using it first... ok, there you go."
>
> Is this understanding correct?

Memory hot-unplug is cooperative: the guest OS needs to be able to
vacate the necessary range. If it can't do that or doesn't want to do
that, it just rejects the operation. The unplugged memory is removed
from the VM address space, so there's no way it can be malicious.
Devices don't own memory, they just use it. Drivers within the guest OS
having allocations within the requested memory range, especially if
those allocations are for DMA, would be reason for the guest to reject
the unplug operation. Drivers within QEMU have no business getting a
vote in this matter: if the guest OS has completed the unplug
operation, the memory must be unmapped. If the guest OS has overlooked
some in-flight DMA target, that's on the guest, and the above error
handling, or lack thereof, comes into play for those transactions.

> I don't think that's the case we're running into, though I think we
> need to consider it as well. What's probably happening here is that the
> guest went from SeaBIOS to the kernel, a PCI reset happened, and we
> didn't plumb that message through correctly. While we are at it, we
> should review the memory hot-unplug business.

Looking at the IOVAs mapped to the device from the device's
perspective, clearing bus master will remove all the mappings. That
will happen when the guest OS or SeaBIOS sizes the PCI BARs, but the
description above said that no config space accesses were occurring.
Enabling the vIOMMU would also change the entire address space of the
device. In transitioning from SeaBIOS to the guest kernel, why is the
device still active? The normal expectation here would be that SeaBIOS
accesses the device to load the kernel and initrd into memory, the
device is quiesced, the guest OS boots, enumerating the I/O and the
IOMMU, potentially involving multiple address space changes, and then
device drivers load, which should make sure the device is performing
DMA to valid targets. I'll be curious to see what's causing this
mysterious remove-and-shift operation. Thanks,

Alex
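One concrete consequence of Alex's point about bus master: an emulated device could gate its pollers on Bus Master Enable so that the SeaBIOS-to-kernel handover (BAR sizing, command-register toggling) quiesces DMA, as a physical device would be quiesced. A rough sketch, assuming a hypothetical pci_cfg_read16() accessor into the emulated config space:

    #include <stdbool.h>
    #include <stdint.h>

    #define PCI_COMMAND        0x04    /* command register offset */
    #define PCI_COMMAND_MASTER 0x04    /* Bus Master Enable bit */

    /* Hypothetical accessor into the emulated config space. */
    extern uint16_t pci_cfg_read16(uint16_t offset);

    static bool dma_allowed(void)
    {
        return pci_cfg_read16(PCI_COMMAND) & PCI_COMMAND_MASTER;
    }

    /* The queue poller checks this each iteration, so clearing bus master
     * stops all DMA until the guest driver re-enables it and reprograms
     * the queues. */
    static void poll_queues_once(void)
    {
        if (!dma_allowed()) {
            return;    /* device quiesced: no queue fetches, no DMA */
        }
        /* ... translate the SQ IOVA and process new entries here ... */
    }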