From: Alex Williamson <alex.williamson@redhat.com>
To: Peter Xu <zhexu@redhat.com>
Cc: brijesh.singh@amd.com, mst@redhat.com, qemu-devel@nongnu.org,
eric.auger@redhat.com, Suravee.Suthikulpanit@amd.com
Subject: Re: [Qemu-devel] [for-4.2 PATCH 2/2] hw/i386: AMD-Vi IVRS DMA alias support
Date: Mon, 29 Jul 2019 13:04:41 -0600
Message-ID: <20190729130441.5657a535@x1.home>
In-Reply-To: <20190729082646.GA19232@xz-x1>
On Mon, 29 Jul 2019 16:26:46 +0800
Peter Xu <zhexu@redhat.com> wrote:
> On Fri, Jul 26, 2019 at 06:55:53PM -0600, Alex Williamson wrote:
> > When we account for DMA aliases in the PCI address space, we can no
> > longer use a single IVHD entry in the IVRS covering all devices. We
> > instead need to walk the PCI bus and create alias ranges when we find
> > a conventional bus. These alias ranges cannot overlap with a "Select
> > All" range (as currently implemented), so we also need to enumerate
> > each device with IVHD entries.
> >
> > Importantly, the IVHD entries used here include a Device ID, which is
> > simply the PCI BDF (Bus/Device/Function). The guest firmware is
> > responsible for programming bus numbers, so the final revision of this
> > table depends on the update mechanism (acpi_build_update) to be called
> > after guest PCI enumeration.
>
> Ouch... so the ACPI build procedure runs after the guest PCI code!
> Could I ask how you found this? :)  It certainly seems much easier
> this way...
I believe this is what MST was referring to with the MCFG update:
acpi_build() is called from both acpi_setup() and acpi_build_update(),
and the latter is set up in numerous places to be called via a
mechanism that rewrites the tables on certain guest actions. For
instance, acpi_add_rom_blob() passes this function as a callback,
which turns into a select_cb in fw_cfg, such that (aiui) the tables
are updated before the guest reads them. I added some fprintfs in the
PCI config write path to confirm that the update callback occurs after
PCI enumeration for both SeaBIOS and OVMF. Since we seem to have
other dependencies on this ordering elsewhere, I don't think the IVRS
update is unique in this regard.
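
For the curious, the wiring looks roughly like this (a condensed
sketch, not the exact code; acpi_add_rom_blob() and friends have
shuffled signatures across QEMU versions, so take the parameter lists
as illustrative only):

    /*
     * Sketch: acpi_setup() builds the tables once up front, then
     * registers the blob with fw_cfg so that acpi_build_update()
     * becomes the select callback.  Any guest read of the table file
     * therefore re-runs acpi_build() -- by which time firmware has
     * programmed the bus numbers the IVRS depends on.
     */
    static void acpi_setup_sketch(MachineState *machine)
    {
        AcpiBuildTables tables;
        AcpiBuildState *build_state = g_new0(AcpiBuildState, 1);

        acpi_build_tables_init(&tables);
        acpi_build(&tables, machine);            /* initial build */

        /* acpi_build_update ends up as the fw_cfg select_cb */
        acpi_add_rom_blob(acpi_build_update, build_state,
                          tables.table_data, ACPI_BUILD_TABLE_FILE);
    }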
> This looks very nice to me already, though I still have a few
> questions; please see below.
>
> [...]
>
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_BRIDGE)) {
> > +        PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(dev));
> > +        uint8_t sec = pci_bus_num(sec_bus);
> > +        uint8_t sub = dev->config[PCI_SUBORDINATE_BUS];
> > +
> > +        if (pci_bus_is_express(sec_bus)) {
> > +            /*
> > +             * Walk the bus if there are subordinates, otherwise use a range
> > +             * to cover an entire leaf bus. We could potentially also use a
> > +             * range for traversed buses, but we'd need to take care not to
> > +             * create both Select and Range entries covering the same device.
> > +             * This is easier and potentially more compact.
> > +             *
> > +             * An example bare metal system seems to use Select entries for
> > +             * root ports without a slot (ie. built-ins) and Range entries
> > +             * when there is a slot. The same system also only hard-codes
> > +             * the alias range for an onboard PCIe-to-PCI bridge, apparently
> > +             * making no effort to support nested bridges. We attempt to
> > +             * be more thorough here.
> > +             */
> > +            if (sec == sub) { /* leaf bus */
> > +                /* "Start of Range" IVHD entry, type 0x3 */
> > +                entry = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0)) << 8 | 0x3;
> > +                build_append_int_noprefix(table_data, entry, 4);
> > +                /* "End of Range" IVHD entry, type 0x4 */
> > +                entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> > +                build_append_int_noprefix(table_data, entry, 4);
> > +            } else {
> > +                pci_for_each_device(sec_bus, sec, insert_ivhd, table_data);
> > +            }
> > +        } else {
> > +            /*
> > +             * If the secondary bus is conventional, then we need to create an
> > +             * Alias range for everything downstream. The range covers the
> > +             * first devfn on the secondary bus to the last devfn on the
> > +             * subordinate bus. The alias target depends on legacy versus
> > +             * express bridges, just as in pci_device_iommu_address_space().
> > +             * DeviceIDa vs DeviceIDb as per the AMD IOMMU spec.
> > +             */
> > +            uint16_t dev_id_a, dev_id_b;
> > +
> > +            dev_id_a = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0));
> > +
> > +            if (pci_is_express(dev) &&
> > +                pcie_cap_get_type(dev) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +                dev_id_b = dev_id_a;
> > +            } else {
> > +                dev_id_b = PCI_BUILD_BDF(pci_bus_num(bus), dev->devfn);
> > +            }
> > +
> > +            /* "Alias Start of Range" IVHD entry, type 0x43, 8 bytes */
> > +            build_append_int_noprefix(table_data, dev_id_a << 8 | 0x43, 4);
> > +            build_append_int_noprefix(table_data, dev_id_b << 8 | 0x0, 4);
> > +
> > +            /* "End of Range" IVHD entry, type 0x4 */
> > +            entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> > +            build_append_int_noprefix(table_data, entry, 4);
> > +        }
>
> We've implemented similar logic multiple times:
>
> - When we want to do DMA (pci_requester_id)
> - When we want to fetch the DMA address space (the previous patch)
> - When we fill in the AMD ACPI table (this patch)
>
> Do you think we can generalize them somehow?  I'm wondering whether
> we could directly fetch the RID in the 2nd/3rd use cases using
> pci_requester_id() (which already exists) and simply use that?
For this patch, I think we could use pci_requester_id() for dev_id_b
above, but we still need to walk the buses and devices, so it really
only simplifies the 'if (pci_is_express...' code block above, and
barely at that, since we're already at the point in the topology where
such a decision is made. For the previous patch, pci_requester_id()
returns a BDF, and that code is executed before bus numbers are
programmed. We might still make use of requester_id_cache, but a
different interface would be necessary. Also note that
pci_req_id_cache_get() assumes we're looking for the requester ID as
seen from the root bus, while pci_device_iommu_address_space() works
from the bus hosting iommu_fn. While these are generally the same in
practice, it's not required. I'd therefore tend to leave that to some
future consolidation. I can update the comment in the previous patch
to include that justification.
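
For reference, the walk that requester_id_cache captures looks roughly
like this (condensed from pci_req_id_cache_get() as I read it; treat
it as a sketch rather than the verbatim code):

    /*
     * Walk from the device toward the root bus, recording the alias
     * in effect at each bridge crossing.  The crossing nearest the
     * root wins, which is why the result is the requester ID "as
     * seen from the root bus" rather than from an arbitrary bus.
     */
    while (!pci_bus_is_root(bus)) {
        PCIDevice *parent = bus->parent_dev;   /* bridge owning bus */

        if (pci_is_express(parent)) {
            if (pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
                /* PCIe-to-PCI bridge: alias to secondary bus, devfn 0 */
                cache.type = PCI_REQ_ID_SECONDARY_BUS;
                cache.dev = parent;
            }
        } else {
            /* Conventional bridge: alias to the bridge's own BDF */
            cache.type = PCI_REQ_ID_BDF;
            cache.dev = parent;
        }
        bus = pci_get_bus(parent);
    }

A variant that stopped at the bus hosting iommu_fn rather than the
root bus would cover the address space case, but that's exactly the
consolidation I'd defer.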
> > +    /*
> > +     * A PCI bus walk, for each PCI host bridge, is necessary to create a
> > +     * complete set of IVHD entries.  Build these into a separate blob so
> > +     * that we can calculate the total IVRS table length here and then
> > +     * append the new blob further below.  Fall back to an entry covering
> > +     * all devices, which is sufficient when no aliases are present.
> > +     */
> > +    object_child_foreach_recursive(object_get_root(),
> > +                                   ivrs_host_bridges, ivhd_blob);
> > +
> > +    if (!ivhd_blob->len) {
> > +        /*
> > +         * Type 1 device entry reporting all devices: a single 4-byte
> > +         * "Select All" entry.  Refer to Spec - Table 95: IVHD Device
> > +         * Entry Type Codes (4-byte).
> > +         */
> > +        build_append_int_noprefix(ivhd_blob, 0x0000001, 4);
> > +    }
>
> Is there a real use case for ivhd_blob->len==0?
It was mostly paranoia, but I believe it's really only an Intel
convention that the PCI host bridge appears as a device on the bus. It
seems possible that we could have a host bridge with no devices, in
which case we'd insert this select-all entry to make the IVRS valid.
Perhaps in combination with AMD exposing their IOMMU as a device on the
PCI bus this is not really an issue, but it's a trivial safety net.
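
As a decoding aid for the entries in the patch: the 4-byte IVHD device
entries put the type code in the low byte and the 16-bit DeviceID (the
BDF) in bits 8-23, so the encodings above reduce to something like
this (a sketch mirroring the patch's arithmetic; IVHD_ENTRY is a
made-up helper, not part of the patch):

    /* 4-byte IVHD device entry: bits [23:8] DeviceID, bits [7:0] type */
    #define IVHD_ENTRY(type, bdf)   (((uint32_t)(bdf) << 8) | (type))

    IVHD_ENTRY(0x1, 0);                        /* Select All          */
    IVHD_ENTRY(0x3, PCI_BUILD_BDF(sec, 0));    /* Start of Range      */
    IVHD_ENTRY(0x4, PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7))); /* End     */

The 0x0000001 constant in the fallback path is just the first of
these: type 0x1 with a zero DeviceID.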
Thanks,
Alex