From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jag Raman <jag.raman@oracle.com>
Cc: "eduardo@habkost.net" <eduardo@habkost.net>,
"Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
"Daniel P. Berrangé" <berrange@redhat.com>,
"Beraldo Leal" <bleal@redhat.com>,
"John Johnson" <john.g.johnson@oracle.com>,
"quintela@redhat.com" <quintela@redhat.com>,
qemu-devel <qemu-devel@nongnu.org>,
"armbru@redhat.com" <armbru@redhat.com>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@gmail.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"thanos.makatos@nutanix.com" <thanos.makatos@nutanix.com>,
"Eric Blake" <eblake@redhat.com>,
"john.levon@nutanix.com" <john.levon@nutanix.com>,
"Philippe Mathieu-Daudé" <f4bug@amsat.org>
Subject: Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
Date: Thu, 10 Feb 2022 17:53:12 -0500
Message-ID: <20220210175040-mutt-send-email-mst@kernel.org>
In-Reply-To: <9E989878-326F-4E72-85DD-34D1CB72F0F8@oracle.com>

On Thu, Feb 10, 2022 at 10:23:01PM +0000, Jag Raman wrote:
>
>
> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
> >>
> >>
> >>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>
> >>> On Wed, 2 Feb 2022 01:13:22 +0000
> >>> Jag Raman <jag.raman@oracle.com> wrote:
> >>>
> >>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>>
> >>>>> On Tue, 1 Feb 2022 21:24:08 +0000
> >>>>> Jag Raman <jag.raman@oracle.com> wrote:
> >>>>>
> >>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
> >>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>
> >>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> >>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> >>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?
> >>>>>>>>>>
> >>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
> >>>>>>>>>> processes from QEMU with shared memory access to RAM but no direct
> >>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>>>>>>>> requests.
> >>>>>>>>>>
> >>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>>>>>>>> protocol already has messages that vfio-user servers can use as a
> >>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>>>>>>>> accesses.
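
(Just to make that fallback concrete, something along these lines on the
server side -- the lookup and send helpers are made-up names standing in
for whatever the server actually uses, not libvfio-user API:)

    /* Hypothetical server-side DMA read path: try the regions the client
     * shared via VFIO_USER_DMA_MAP first, and fall back to a
     * VFIO_USER_DMA_READ message when the target is not mapped (e.g. a
     * non-RAM MemoryRegion such as the virtiofs DAX window). */
    static int server_dma_read(Server *s, uint64_t iova, void *buf, size_t len)
    {
        void *hva = iova_to_mapped_hva(s, iova, len);   /* made-up lookup */

        if (hva) {
            memcpy(buf, hva, len);       /* direct shared-memory access */
            return 0;
        }
        return send_vfio_user_dma_read(s, iova, buf, len);  /* made-up send */
    }
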
> >>>>>>>>>>
> >>>>>>>>>>> In
> >>>>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>>>>>>>> address space per BDF. Is the dynamic mapping overhead too much? What
> >>>>>>>>>>> physical hardware properties or specifications could we leverage to
> >>>>>>>>>>> restrict p2p mappings to a device? Should it be governed by machine
> >>>>>>>>>>> type to provide consistency between devices? Should each "isolated"
> >>>>>>>>>>> bus be in a separate root complex? Thanks,
> >>>>>>>>>>
> >>>>>>>>>> There is a separate issue in this patch series regarding isolating the
> >>>>>>>>>> address space where BAR accesses are made (i.e. the global
> >>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
> >>>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
> >>>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>>>>>>>> BARs to the same address.
> >>>>>>>>>>
> >>>>>>>>>> I think the separate root complex idea is a good solution. This
> >>>>>>>>>> patch series takes a different approach by adding the concept of
> >>>>>>>>>> isolated address spaces into hw/pci/.
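
(In QEMU terms the isolation boils down to giving each server instance its
own root MemoryRegion and AddressSpace pair instead of the global
address_space_memory/address_space_io -- a rough sketch, not the actual
code from the series:)

    /* Sketch (QEMU-internal, needs "exec/memory.h"): one isolated memory
     * and I/O address space per vfio-user server instance, so BARs mapped
     * by different clients at the same address cannot collide. */
    typedef struct ServerInstance {
        MemoryRegion root_mem;      /* container for RAM + BAR mappings */
        MemoryRegion root_io;       /* container for port I/O           */
        AddressSpace as_mem;
        AddressSpace as_io;
    } ServerInstance;

    static void server_instance_init_as(ServerInstance *inst)
    {
        memory_region_init(&inst->root_mem, NULL, "isol-mem", UINT64_MAX);
        memory_region_init(&inst->root_io,  NULL, "isol-io",  0x10000);
        address_space_init(&inst->as_mem, &inst->root_mem, "isol-mem");
        address_space_init(&inst->as_io,  &inst->root_io,  "isol-io");
    }
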
> >>>>>>>>>
> >>>>>>>>> This all still seems pretty sketchy. BARs cannot overlap within the
> >>>>>>>>> same vCPU address space, perhaps with the exception of when they're
> >>>>>>>>> being sized, but DMA should be disabled during sizing.
> >>>>>>>>>
> >>>>>>>>> Devices within the same VM context with identical BARs would need to
> >>>>>>>>> operate in different address spaces. For example a translation offset
> >>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
> >>>>>>>>> perhaps using the translation offset bits to address a root complex and
> >>>>>>>>> masking those bits for downstream transactions.
> >>>>>>>>>
> >>>>>>>>> In general, the device simply operates in an address space, ie. an
> >>>>>>>>> IOVA. When a mapping is made within that address space, we perform a
> >>>>>>>>> translation as necessary to generate a guest physical address. The
> >>>>>>>>> IOVA itself is only meaningful within the context of the address space,
> >>>>>>>>> there is no requirement or expectation for it to be globally unique.
> >>>>>>>>>
> >>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
> >>>>>>>>> are unique across all devices, that seems very, very wrong. Thanks,
> >>>>>>>>
> >>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> >>>>>>>>
> >>>>>>>> The issue is that there can be as many guest physical address spaces as
> >>>>>>>> there are vfio-user clients connected, so per-client isolated address
> >>>>>>>> spaces are required. This patch series has a solution to that problem
> >>>>>>>> with the new pci_isol_as_mem/io() API.
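
(For reference, my reading of the series is that the new hooks have roughly
this shape -- simplified, the real signatures may differ:)

    /* Per-device accessors (as I understand patch 03/18): accesses that
     * previously went to the global address_space_memory /
     * address_space_io can be routed through these instead when the
     * device sits on an isolated bus. */
    AddressSpace *pci_isol_as_mem(PCIDevice *dev);
    AddressSpace *pci_isol_as_io(PCIDevice *dev);
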
> >>>>>>>
> >>>>>>> Sorry, this still doesn't follow for me. A server that hosts multiple
> >>>>>>> devices across many VMs (I'm not sure if you're referring to the device
> >>>>>>> or the VM as a client) needs to deal with different address spaces per
> >>>>>>> device. The server needs to be able to uniquely identify every DMA,
> >>>>>>> which must be part of the interface protocol. But I don't see how that
> >>>>>>> imposes a requirement of an isolated address space. If we want the
> >>>>>>> device isolated because we don't trust the server, that's where an IOMMU
> >>>>>>> provides per device isolation. What is the restriction of the
> >>>>>>> per-client isolated address space and why do we need it? The server
> >>>>>>> needing to support multiple clients is not a sufficient answer to
> >>>>>>> impose new PCI bus types with an implicit restriction on the VM.
> >>>>>>
> >>>>>> Hi Alex,
> >>>>>>
> >>>>>> I believe there are two separate problems with running PCI devices in
> >>>>>> the vfio-user server. The first one concerns memory isolation and the
> >>>>>> second one concerns the vectoring of BAR accesses (as explained below).
> >>>>>>
> >>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
> >>>>>> spaces. But we still had trouble with the vectoring. So we implemented
> >>>>>> separate address spaces for each PCIBus to tackle both problems
> >>>>>> simultaneously, based on the feedback we got.
> >>>>>>
> >>>>>> The following gives an overview of issues concerning vectoring of
> >>>>>> BAR accesses.
> >>>>>>
> >>>>>> The device’s BAR regions are mapped into the guest physical address
> >>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
> >>>>>> registers. To access the BAR regions of the device, QEMU uses
> >>>>>> address_space_rw() which vectors the physical address access to the
> >>>>>> device BAR region handlers.
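
(For concreteness, the vectoring described above is just the normal memory
API dispatch -- bar_gpa below stands for whatever guest physical address the
guest programmed into the BAR:)

    /* A vCPU-side access at a guest physical address inside a BAR is
     * looked up in the address space's flat view and dispatched to the
     * owning MemoryRegion's read/write ops. */
    uint32_t val = 0;
    MemTxResult res = address_space_rw(&address_space_memory, bar_gpa,
                                       MEMTXATTRS_UNSPECIFIED, &val,
                                       sizeof(val), false /* read */);
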
> >>>>>
> >>>>> The guest physical address written to the BAR is irrelevant from the
> >>>>> device perspective, this only serves to assign the BAR an offset within
> >>>>> the address_space_mem, which is used by the vCPU (and possibly other
> >>>>> devices depending on their address space). There is no reason for the
> >>>>> device itself to care about this address.
> >>>>
> >>>> Thank you for the explanation, Alex!
> >>>>
> >>>> The confusion on my part is whether we are already inside the device when
> >>>> the server receives a request to access the BAR region of a device. Based on
> >>>> your explanation, I gather that in your view the BAR access request has
> >>>> already propagated into the device, whereas I was under the impression
> >>>> that the request is still on the CPU side of the PCI root complex.
> >>>
> >>> If you are getting an access through your MemoryRegionOps, all the
> >>> translations have been made, you simply need to use the hwaddr as the
> >>> offset into the MemoryRegion for the access. Perform the read/write to
> >>> your device, no further translations required.
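
(I.e. a BAR handler never sees a bus or guest physical address at all --
a minimal MemoryRegionOps sketch, with MyDevState and the register offsets
being made up:)

    /* 'addr' is already the offset within this BAR; no further
     * translation is needed or possible here. */
    static uint64_t mydev_bar_read(void *opaque, hwaddr addr, unsigned size)
    {
        MyDevState *s = opaque;         /* hypothetical device state */

        return addr == 0x0 ? s->status : 0;
    }

    static void mydev_bar_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
    {
        MyDevState *s = opaque;

        if (addr == 0x4) {              /* e.g. a doorbell register */
            s->doorbell = val;
        }
    }

    static const MemoryRegionOps mydev_bar_ops = {
        .read = mydev_bar_read,
        .write = mydev_bar_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };
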
> >>>
> >>>> Your view makes sense to me - once the BAR access request reaches the
> >>>> client (on the other side), we could consider that the request has reached
> >>>> the device.
> >>>>
> >>>> On a separate note, if devices don’t care about the values in BAR
> >>>> registers, why do the default PCI config handlers intercept and map
> >>>> the BAR region into address_space_mem?
> >>>> (pci_default_write_config() -> pci_update_mappings())
> >>>
> >>> This is the part that's actually placing the BAR MemoryRegion as a
> >>> sub-region into the vCPU address space. I think if you track it,
> >>> you'll see PCIDevice.io_regions[i].address_space is actually
> >>> system_memory, which is used to initialize address_space_memory.
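
(The device side of that is only the registration; the placement happens
later and the device never consumes the address. A sketch, reusing the
hypothetical mydev_bar_ops from above:)

    static void mydev_realize(PCIDevice *pdev, Error **errp)
    {
        MyDevState *s = MYDEV(pdev);    /* hypothetical QOM cast */

        memory_region_init_io(&s->bar0, OBJECT(s), &mydev_bar_ops, s,
                              "mydev-bar0", 0x1000);
        pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar0);

        /* pci_default_write_config() -> pci_update_mappings() is what later
         * places &s->bar0 (tracked via io_regions[0]) into the bus's memory
         * space at the guest-chosen offset, once the guest programs the BAR
         * and enables memory decoding; device code never needs that value. */
    }
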
> >>>
> >>> The machine assembles PCI devices onto buses as instructed by the
> >>> command line or hot plug operations. It's the responsibility of the
> >>> guest firmware and guest OS to probe those devices, size the BARs, and
> >>> place the BARs into the memory hierarchy of the PCI bus, ie. system
> >>> memory. The BARs are necessarily in the "guest physical memory" for
> >>> vCPU access, but it's essentially only coincidental that PCI devices
> >>> might be in an address space that provides a mapping to their own BAR.
> >>> There's no reason to ever use it.
> >>>
> >>> In the vIOMMU case, we can't know that the device address space
> >>> includes those BAR mappings or if they do, that they're identity mapped
> >>> to the physical address. Devices really need to not infer anything
> >>> about an address. Think about real hardware, a device is told by
> >>> driver programming to perform a DMA operation. The device doesn't know
> >>> the target of that operation, it's the guest driver's responsibility to
> >>> make sure the IOVA within the device address space is valid and maps to
> >>> the desired target. Thanks,
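
(Which, on the QEMU model side, is why device code does its DMA through the
device's own address space rather than address_space_memory -- e.g.
something like the following, where 'iova' is whatever the guest driver
programmed:)

    /* The pci_dma_*() helpers go through pci_get_address_space(pdev),
     * which is the vIOMMU address space when one is present, so any
     * IOVA-to-GPA translation happens below the device model. */
    uint32_t desc[4];

    if (pci_dma_read(pdev, iova, desc, sizeof(desc)) != MEMTX_OK) {
        /* DMA fault: surface it however the device reports errors */
    }
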
> >>
> >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> >> helped to clarify this problem.
> >>
> >> We have implemented the memory isolation based on the discussion in the
> >> thread. We will send the patches out shortly.
> >>
> >> Devices such as “nvme” and “e1000” worked fine. But I’d like to note that
> >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> >> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> >> the device to access other BAR regions by using the BAR address programmed
> >> in the PCI config space. This happens even without the vfio-user patches. For example,
> >> we could enable the IOMMU using the “-device intel-iommu” QEMU option and also
> >> add the following to the kernel command line: “intel_iommu=on iommu=nopt”.
> >> In this case, we could see an IOMMU fault.
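
(For anyone wanting to reproduce the fault without any vfio-user bits, an
invocation along these lines should do it -- the disk, kernel and initrd
paths are obviously placeholders:)

    qemu-system-x86_64 -machine q35 -device intel-iommu \
        -device lsi53c895a,id=scsi0 \
        -drive if=none,id=hd0,file=disk.img,format=raw \
        -device scsi-hd,bus=scsi0.0,drive=hd0 \
        -kernel vmlinuz -initrd initrd.img \
        -append "root=/dev/sda console=ttyS0 intel_iommu=on iommu=nopt"
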
> >
> > So, a device accessing its own BAR is different. Basically, these
> > transactions never go out on the bus at all, never mind reach the IOMMU.
>
> Hi Michael,
>
> In the LSI case, I did notice that it went to the IOMMU.
Hmm, do you mean you analyzed how a physical device works?
Or do you mean in QEMU?
> The device is reading the BAR
> address as if it was a DMA address.
I got that. My understanding of PCI was that a device cannot be both
the master and the target of the same transaction, though I could not
find this in the spec; maybe I remember incorrectly.
> > I think it's just used as a handle to address internal device memory.
> > This kind of trick is not universal, but not terribly unusual.
> >
> >
> >> Unfortunately, we started off our project with the LSI device. So that led to all the
> >> confusion about what is expected at the server end in terms of
> >> vectoring/address-translation. It gave the impression that the request was still on
> >> the CPU side of the PCI root complex, but the actual problem was with the
> >> device driver itself.
> >>
> >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> >> device’s BAR into the IOVA space, at the same address programmed in the BAR registers?
> >> This would help devices such as LSI circumvent the problem. One concern
> >> with this approach is that it has the potential to collide with another legitimate
> >> IOVA address. Kindly share your thoughts on this.
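
(If I follow the proposal, on the server side it would look roughly like
the following whenever the client programs a BAR -- purely a sketch of the
idea, ignoring teardown and the collision caveat above:)

    /* Alias the BAR MemoryRegion into the device's DMA address space at
     * the same address the guest wrote into the BAR register, so a driver
     * that (incorrectly) hands that address to the device as a DMA pointer
     * still reaches the BAR contents. */
    static void map_bar_into_iova(PCIDevice *pdev, MemoryRegion *dma_root,
                                  MemoryRegion *bar, uint64_t bar_addr)
    {
        MemoryRegion *alias = g_new0(MemoryRegion, 1);  /* kept/freed by caller */

        memory_region_init_alias(alias, OBJECT(pdev), "bar-iova-alias",
                                 bar, 0, memory_region_size(bar));
        memory_region_add_subregion_overlap(dma_root, bar_addr, alias, 1);
    }
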
> >>
> >> Thank you!
> >
> > I am not 100% sure what you plan to do, but it sounds fine: even if
> > it collides, with traditional PCI a device must never initiate cycles
>
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
>
> Thank you!
> --
> Jag
>
> > within its own BAR range, and PCIe is software-compatible with PCI. So
> > devices won't be able to access this IOVA even if it was programmed in
> > the IOMMU.
> >
> > As was mentioned elsewhere on this thread, devices accessing each
> > other's BAR is a different matter.
> >
> > I do not remember which rules apply to multiple functions of a
> > multi-function device, though. I think in traditional PCI
> > they will never go out on the bus, but with e.g. SR-IOV they
> > probably would go out? Alex, any idea?
> >
> >
> >> --
> >> Jag
> >>
> >>>
> >>> Alex
> >>>
> >>
> >
>