* DMA region abruptly removed from PCI device
From: Thanos Makatos @ 2020-07-06 10:55 UTC
To: Alex Williamson
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Felipe Franciosi, Liu, Changpeng

We have an issue when using the VFIO-over-socket libmuser PoC
(https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html) instead of
the VFIO kernel module: we notice that DMA regions used by the emulated device
can be abruptly removed while the device is still using them.

The PCI device we've implemented is an NVMe controller using SPDK, so it polls
the submission queues for new requests. We use the latest SeaBIOS, which tries
to boot from the NVMe controller. Several DMA regions are registered
(VFIO_IOMMU_MAP_DMA), and then the admin and a submission queue are created.
From this point SPDK polls both queues. Then the DMA region where the
submission queue lies is removed (VFIO_IOMMU_UNMAP_DMA) and re-added at the
same IOVA but at a different offset. SPDK crashes soon after as it accesses
invalid memory. No other event (e.g. a PCI config space or NVMe register
write) happens between unmapping and re-mapping the DMA region.

My guess is that this behavior is legitimate and that it is solved in the VFIO
kernel module by releasing the DMA region only after all references to it have
been released, which is handled by vfio_pin/unpin_pages, correct? If so, I
suppose we need to implement the same logic in libmuser, but I just want to
make sure I'm not missing anything, as this is a substantial change.
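For concreteness, here is a minimal sketch of the deferred-release scheme Thanos is contemplating. It is purely illustrative; the names are hypothetical and are not the actual libmuser API.

    /* Hypothetical sketch, not the real libmuser API: keep each DMA region
     * alive until the device emulation has dropped every outstanding
     * reference, mirroring what vfio_pin_pages()/vfio_unpin_pages() let an
     * mdev vendor driver do in the kernel. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct dma_region {
        uint64_t iova;
        size_t   len;
        void    *vaddr;          /* region mmap'd into the server */
        int      refcnt;         /* outstanding users, e.g. the SPDK poller */
        bool     unmap_pending;  /* client asked to unmap this region */
    };

    /* Taken while servicing a request that touches this region. */
    static void dma_region_get(struct dma_region *r)
    {
        r->refcnt++;
    }

    static void dma_region_release(struct dma_region *r)
    {
        /* Placeholder for munmap()/bookkeeping; sketch only. */
        free(r);
    }

    /* Dropped when the request completes; the region disappears only once
     * an unmap is pending and nobody is using it any more. */
    static void dma_region_put(struct dma_region *r)
    {
        if (--r->refcnt == 0 && r->unmap_pending) {
            dma_region_release(r);
        }
    }

    /* Called when the unmap message arrives from the client. */
    static void dma_region_unmap(struct dma_region *r)
    {
        r->unmap_pending = true;
        if (r->refcnt == 0) {
            dma_region_release(r);
        }
        /* Otherwise the reply to the client would have to be delayed until
         * the last dma_region_put(), which is exactly the behavioral
         * question raised in this thread. */
    }

Whether the unmap acknowledgement may be delayed at all is the crux of Alex's reply below.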
* Re: DMA region abruptly removed from PCI device
From: Alex Williamson @ 2020-07-06 14:20 UTC
To: Thanos Makatos
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Felipe Franciosi, Liu, Changpeng

On Mon, 6 Jul 2020 10:55:00 +0000
Thanos Makatos <thanos.makatos@nutanix.com> wrote:

> We have an issue when using the VFIO-over-socket libmuser PoC
> (https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html) instead of
> the VFIO kernel module: we notice that DMA regions used by the emulated device
> can be abruptly removed while the device is still using them.
>
> The PCI device we've implemented is an NVMe controller using SPDK, so it polls
> the submission queues for new requests. We use the latest SeaBIOS, which tries
> to boot from the NVMe controller. Several DMA regions are registered
> (VFIO_IOMMU_MAP_DMA), and then the admin and a submission queue are created.
> From this point SPDK polls both queues. Then the DMA region where the
> submission queue lies is removed (VFIO_IOMMU_UNMAP_DMA) and re-added at the
> same IOVA but at a different offset. SPDK crashes soon after as it accesses
> invalid memory. No other event (e.g. a PCI config space or NVMe register
> write) happens between unmapping and re-mapping the DMA region.
>
> My guess is that this behavior is legitimate and that it is solved in the VFIO
> kernel module by releasing the DMA region only after all references to it have
> been released, which is handled by vfio_pin/unpin_pages, correct? If so, I
> suppose we need to implement the same logic in libmuser, but I just want to
> make sure I'm not missing anything, as this is a substantial change.

The vfio_{pin,unpin}_pages() interface only comes into play for mdev
devices, and even then it's an announcement that a given mapping is
going away and the vendor driver is required to release its references.
For normal PCI device assignment, vfio-pci is (aside from a few quirks)
device-agnostic and the IOMMU container mappings are independent of the
device. We do not have any device-specific knowledge to know whether
DMA pages still have device references. The user's unmap request is
absolute: it cannot fail (aside from invalid usage), and upon return
there must be no residual mappings or references to the pages.

If you say there's no config space write, e.g. clearing bus master from
the command register, then something like turning on a vIOMMU might
cause a change in the entire address space accessible by the device.
This would cause the identity map of IOVA to GPA to be replaced by a
new one, perhaps another identity map if iommu=pt, or a more restricted
mapping if the vIOMMU is used for isolation.

It sounds like you have an incomplete device model; physical devices
have their address space adjusted by an IOMMU independently of, but
hopefully in collaboration with, a device driver. If a physical device
manages to bridge this transition, do what it does. Thanks,

Alex
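For reference, the container-level interface Alex describes is the plain VFIO type1 ioctl pair. A minimal userspace sketch follows; error handling is omitted and container_fd is assumed to be an already-configured VFIO container.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map a process virtual address range into the container at 'iova'. */
    static int map_iova(int container_fd, void *vaddr, uint64_t iova,
                        uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;
        map.iova  = iova;
        map.size  = size;
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

    /* Tear the mapping down again.  As described above, this is absolute:
     * when the ioctl returns, the IOMMU mappings and page pins are gone,
     * whether or not the device still has DMA in flight. */
    static int unmap_iova(int container_fd, uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap;

        memset(&unmap, 0, sizeof(unmap));
        unmap.argsz = sizeof(unmap);
        unmap.iova  = iova;
        unmap.size  = size;
        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }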
* Re: DMA region abruptly removed from PCI device
From: Felipe Franciosi @ 2020-07-07 10:38 UTC
To: Alex Williamson
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Thanos Makatos, Liu, Changpeng

> On Jul 6, 2020, at 3:20 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> [...]
>
> It sounds like you have an incomplete device model; physical devices
> have their address space adjusted by an IOMMU independently of, but
> hopefully in collaboration with, a device driver. If a physical device
> manages to bridge this transition, do what it does. Thanks,

Hi,

That's what we are trying to work out. IIUC, the problem we are having
is that a mapping removal was requested but the device was still
operational. We can surely figure out how to handle that gracefully,
but I'm trying to get my head around how real hardware handles that.
Maybe you can add some colour here. :)

What happens when a device tries to write to a physical address that
has no memory behind it? Is it an MCE of sorts?

I haven't really ever looked at memory hot-unplug in detail, but after
reading some QEMU code this is my understanding:

1) QEMU makes an ACPI request to the guest OS for memory unplug
2) The guest OS acks that the memory can be pulled out
3) QEMU pulls the memory from the guest

Before step 3, I'm guessing that QEMU tells all device backends that
this memory is going away. I suppose that in normal operation the guest
OS will have already stopped using the memory (i.e. before step 2), so
there shouldn't be much to it. But I also suppose a malicious guest
could go "ah, you want to remove this DIMM? sure, let me just ask all
these devices to start using it first... ok, there you go."

Is this understanding correct?

I don't think that's the case we're running into, though I think we
need to consider it as well. What's probably happening here is that the
guest went from SeaBIOS to the kernel, a PCI reset happened, and we
didn't plumb that message through correctly. While we are at it, we
should review the memory hot-unplug business.

Thanks,
Felipe
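The software analogue of Felipe's "write to a physical address with no memory behind it" question, in a vfio-user/libmuser-style backend, is a failed IOVA lookup. The defensive sketch below fails the access instead of dereferencing a stale pointer; the helper names are hypothetical and are not taken from libmuser or SPDK.

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct dma_map_entry {
        uint64_t iova;
        size_t   len;
        void    *vaddr;
    };

    /* Table maintained from the client's MAP/UNMAP messages. */
    extern struct dma_map_entry dma_table[];
    extern size_t dma_table_len;

    /* Translate an IOVA on every access instead of caching the result
     * across map/unmap events; return NULL if the region is gone. */
    static void *iova_to_vaddr(uint64_t iova, size_t len)
    {
        for (size_t i = 0; i < dma_table_len; i++) {
            struct dma_map_entry *e = &dma_table[i];
            if (iova >= e->iova && iova + len <= e->iova + e->len) {
                return (char *)e->vaddr + (iova - e->iova);
            }
        }
        return NULL;   /* unmapped: the software equivalent of a dropped DMA */
    }

    /* E.g. the NVMe submission-queue poller would fetch each entry through
     * the lookup rather than through a pointer saved at queue creation. */
    static int read_sq_entry(uint64_t sq_iova, void *sqe_out, size_t sqe_len)
    {
        void *src = iova_to_vaddr(sq_iova, sqe_len);
        if (src == NULL) {
            return -EFAULT;    /* fail the transaction, don't crash */
        }
        memcpy(sqe_out, src, sqe_len);
        return 0;
    }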
* Re: DMA region abruptly removed from PCI device
From: Alex Williamson @ 2020-07-07 15:54 UTC
To: Felipe Franciosi
Cc: Walker, Benjamin, John G Johnson, Swapnil Ingle, qemu-devel@nongnu.org, Stefan Hajnoczi, Thanos Makatos, Liu, Changpeng

On Tue, 7 Jul 2020 10:38:01 +0000
Felipe Franciosi <felipe@nutanix.com> wrote:

> > On Jul 6, 2020, at 3:20 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > [...]
>
> Hi,
>
> That's what we are trying to work out. IIUC, the problem we are having
> is that a mapping removal was requested but the device was still
> operational. We can surely figure out how to handle that gracefully,
> but I'm trying to get my head around how real hardware handles that.
> Maybe you can add some colour here. :)
>
> What happens when a device tries to write to a physical address that
> has no memory behind it? Is it an MCE of sorts?

It depends on the system: the write might be silently dropped (a), it
might generate an IOMMU fault (b), or firmware-first platform error
handling might freak out from either (a) or (b) and decide to trigger a
fatal error. If mappings are getting removed due to bus master enable
getting cleared, I would expect device-specific behavior; the device
could either stall or drop transactions.

> I haven't really ever looked at memory hot-unplug in detail, but after
> reading some QEMU code this is my understanding:
>
> 1) QEMU makes an ACPI request to the guest OS for memory unplug
> 2) The guest OS acks that the memory can be pulled out
> 3) QEMU pulls the memory from the guest
>
> Before step 3, I'm guessing that QEMU tells all device backends that
> this memory is going away. I suppose that in normal operation the guest
> OS will have already stopped using the memory (i.e. before step 2), so
> there shouldn't be much to it. But I also suppose a malicious guest
> could go "ah, you want to remove this DIMM? sure, let me just ask all
> these devices to start using it first... ok, there you go."
>
> Is this understanding correct?

Memory hot-unplug is cooperative: the guest OS needs to be able to
vacate the necessary range. If it can't do that or doesn't want to do
that, it just rejects the operation. The unplugged memory is removed
from the VM address space, so there's no way it can be malicious.
Devices don't own memory, they just use it. Drivers within the guest OS
having allocations within the requested memory range, especially if
those allocations are for DMA, would be reason for the guest to reject
the unplug operation. Drivers within QEMU have no business getting a
vote in this matter: if the guest OS has completed the unplug
operation, the memory must be unmapped. If the guest OS has overlooked
some in-flight DMA target, that's on the guest, and the above error
handling, or lack thereof, comes into play for those transactions.

> I don't think that's the case we're running into, though I think we
> need to consider it as well. What's probably happening here is that the
> guest went from SeaBIOS to the kernel, a PCI reset happened, and we
> didn't plumb that message through correctly. While we are at it, we
> should review the memory hot-unplug business.

Looking at the IOVAs mapped to the device from the device's
perspective, clearing bus master will remove all the mappings. That
will happen when the guest OS or SeaBIOS sizes the PCI BARs, but the
description above said that no config space accesses were occurring.
Enabling the vIOMMU would also change the entire address space of the
device. In transitioning from SeaBIOS to the guest kernel, why is the
device still active? The normal expectation here would be that SeaBIOS
accesses the device to load the kernel and initrd into memory, the
device is quiesced, the guest OS boots, enumerating the I/O and the
IOMMU, potentially involving multiple address space changes, and then
device drivers load, which should make sure the device is performing
DMA to valid targets. I'll be curious to see what's causing this
mysterious remove-and-shift operation. Thanks,

Alex
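One concrete consequence of Alex's point about bus master: an emulated device could gate its pollers on Bus Master Enable so that the SeaBIOS-to-kernel handover (BAR sizing, command-register toggling) quiesces DMA, as a physical device would be quiesced. A rough sketch, assuming a hypothetical pci_cfg_read16() accessor into the emulated config space:

    #include <stdbool.h>
    #include <stdint.h>

    #define PCI_COMMAND        0x04    /* command register offset */
    #define PCI_COMMAND_MASTER 0x04    /* Bus Master Enable bit */

    /* Hypothetical accessor into the emulated config space. */
    extern uint16_t pci_cfg_read16(uint16_t offset);

    static bool dma_allowed(void)
    {
        return pci_cfg_read16(PCI_COMMAND) & PCI_COMMAND_MASTER;
    }

    /* The queue poller checks this each iteration, so clearing bus master
     * stops all DMA until the guest driver re-enables it and reprograms
     * the queues. */
    static void poll_queues_once(void)
    {
        if (!dma_allowed()) {
            return;    /* device quiesced: no queue fetches, no DMA */
        }
        /* ... translate the SQ IOVA and process new entries here ... */
    }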