* Re: [Qemu-devel] A question about PCI device address spaces
From: Paolo Bonzini @ 2016-12-22 10:24 UTC
To: Peter Xu, QEMU Devel Mailing List; +Cc: David Gibson, Marcel Apfelbaum
On 22/12/2016 10:42, Peter Xu wrote:
> Hello,
>
> Since this is a general topic, I picked it out from the VT-d
> discussion and put it here, just to make it clearer.
>
> The issue is whether we have exposed too much address space for
> emulated PCI devices.
>
> Now for each PCI device, we have PCIDevice::bus_master_as as
> the device-visible address space, which is derived from
> pci_device_iommu_address_space():
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> {
>     PCIBus *bus = PCI_BUS(dev->bus);
>     PCIBus *iommu_bus = bus;
>
>     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>         iommu_bus = PCI_BUS(iommu_bus->parent_dev->bus);
>     }
>     if (iommu_bus && iommu_bus->iommu_fn) {
>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>     }
>     return &address_space_memory;
> }
>
> By default (for the no-IOMMU case), it points to the system memory
> space, which includes MMIO, and that looks wrong - a PCI device should
> not be able to write to MMIO regions.
Hmm, I think that unless you describe that with ACS, PCI devices should
be able to write to MMIO regions. Possibly not _all_ of them, I don't
know (maybe they cannot write to MMCONFIG?) but after all they can write
to the MSI region, and that is an MMIO region.
If the IOMMU translate callback included the MemTxAttrs, ACS Source
Validation could probably be implemented with an IOMMU region on the
root complex. Most of the other ACS features either do not apply, or
they are already implicit in the way that all PCI devices' address
spaces are based on address_space_memory.
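For illustration only, here is a rough sketch of the shape this could
take. This is not existing QEMU code: the translate signature extended
with MemTxAttrs is exactly the missing piece described above, and the
rc_acs_translate/requester_allowed names are hypothetical.

/* Hypothetical sketch: today's MemoryRegionIOMMUOps.translate does not
 * receive MemTxAttrs, so the extra parameter and the helper below are
 * illustrative only. */
static IOMMUTLBEntry rc_acs_translate(MemoryRegion *iommu, hwaddr addr,
                                      bool is_write, MemTxAttrs attrs)
{
    /* Identity-map the page by default, as a no-IOMMU root complex would. */
    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = addr & TARGET_PAGE_MASK,
        .translated_addr = addr & TARGET_PAGE_MASK,
        .addr_mask = TARGET_PAGE_SIZE - 1,
        .perm = IOMMU_RW,
    };

    /* ACS Source Validation would vet the originator here; identifying
     * the requester is precisely why MemTxAttrs would need to be
     * plumbed through (requester_allowed() is a made-up helper). */
    if (!requester_allowed(attrs)) {
        entry.perm = IOMMU_NONE;
    }
    return entry;
}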
Paolo
> As an example, if we dump a PCI device address space into detail on
> x86_64 system, we can see (this is address space for a virtio-net-pci
> device on an Q35 machine with 6G memory):
>
> 0000000000000000-000000000009ffff (prio 0, RW): pc.ram
> 00000000000a0000-00000000000affff (prio 1, RW): vga.vram
> 00000000000b0000-00000000000bffff (prio 1, RW): vga-lowmem
> 00000000000c0000-00000000000c9fff (prio 0, RW): pc.ram
> 00000000000ca000-00000000000ccfff (prio 0, RW): pc.ram
> 00000000000cd000-00000000000ebfff (prio 0, RW): pc.ram
> 00000000000ec000-00000000000effff (prio 0, RW): pc.ram
> 00000000000f0000-00000000000fffff (prio 0, RW): pc.ram
> 0000000000100000-000000007fffffff (prio 0, RW): pc.ram
> 00000000b0000000-00000000bfffffff (prio 0, RW): pcie-mmcfg-mmio
> 00000000fd000000-00000000fdffffff (prio 1, RW): vga.vram
> 00000000fe000000-00000000fe000fff (prio 0, RW): virtio-pci-common
> 00000000fe001000-00000000fe001fff (prio 0, RW): virtio-pci-isr
> 00000000fe002000-00000000fe002fff (prio 0, RW): virtio-pci-device
> 00000000fe003000-00000000fe003fff (prio 0, RW): virtio-pci-notify
> 00000000febd0400-00000000febd041f (prio 0, RW): vga ioports remapped
> 00000000febd0500-00000000febd0515 (prio 0, RW): bochs dispi interface
> 00000000febd0600-00000000febd0607 (prio 0, RW): qemu extended regs
> 00000000febd1000-00000000febd102f (prio 0, RW): msix-table
> 00000000febd1800-00000000febd1807 (prio 0, RW): msix-pba
> 00000000febd2000-00000000febd2fff (prio 1, RW): ahci
> 00000000fec00000-00000000fec00fff (prio 0, RW): kvm-ioapic
> 00000000fed00000-00000000fed003ff (prio 0, RW): hpet
> 00000000fed1c000-00000000fed1ffff (prio 1, RW): lpc-rcrb-mmio
> 00000000fee00000-00000000feefffff (prio 4096, RW): kvm-apic-msi
> 00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
> 0000000100000000-00000001ffffffff (prio 0, RW): pc.ram
>
> So here are the "pc.ram" regions the only ones that we should expose
> to PCI devices? (it should contain all of them, including the low-mem
> ones and the >=4g one)
>
> And, should this rule work for all platforms? Or say, would it be a
> problem if I directly change address_space_memory in
> pci_device_iommu_address_space() into something else, which only
> contains RAM? (of course this won't affect any platform that has
> an IOMMU, i.e. a customized PCIBus::iommu_fn function)
>
> (Btw, I'd appreciate it if anyone has a quick answer on why we have lots of
> contiguous "pc.ram" regions in the low 2G range - from can_merge() I guess they
> seem to have different dirty_log_mask, romd_mode, etc., but I still
> would like to know why they have these differences. Anyway, this
> is totally an "optional question", just to satisfy my own curiosity :)
>
> Thanks,
>
> -- peterx
>
* Re: [Qemu-devel] A question about PCI device address spaces
From: David Gibson @ 2016-12-23 0:02 UTC
To: Peter Xu; +Cc: QEMU Devel Mailing List, Marcel Apfelbaum, Paolo Bonzini
On Thu, Dec 22, 2016 at 05:42:40PM +0800, Peter Xu wrote:
> Hello,
>
> Since this is a general topic, I picked it out from the VT-d
> discussion and put it here, just to make it clearer.
>
> The issue is whether we have exposed too much address space for
> emulated PCI devices.
>
> Now for each PCI device, we have PCIDevice::bus_master_as as
> the device-visible address space, which is derived from
> pci_device_iommu_address_space():
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> {
>     PCIBus *bus = PCI_BUS(dev->bus);
>     PCIBus *iommu_bus = bus;
>
>     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>         iommu_bus = PCI_BUS(iommu_bus->parent_dev->bus);
>     }
>     if (iommu_bus && iommu_bus->iommu_fn) {
>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>     }
>     return &address_space_memory;
> }
>
> By default (for the no-IOMMU case), it points to the system memory
> space, which includes MMIO, and that looks wrong - a PCI device should
> not be able to write to MMIO regions.
Sorry, I've realized my earlier comments were a bit misleading.
I'm pretty sure the inbound (==DMA) window(s) will be less than the
full 64-bit address space. However, that doesn't necessarily mean it
won't cover *any* MMIO.
Plus, of course, any MMIO that's provided by PCI (or legacy ISA)
devices - and on the PC platform, that's nearly everything - will also
be visible in PCI space, since it doesn't need to go through the
inbound window for that at all. Strictly speaking PCI-provided MMIO
may not appear at the same address in PCI space as it does in the
system memory space, but on PC it does. By platform convention
the outbound windows are also identity mappings.
Part of the reason I was misleading was that I was thinking of non-PC
platforms, which often have more "native" MMIO devices on the CPU side
of the PCI host bridge.
> As an example, if we dump a PCI device address space into detail on
> x86_64 system, we can see (this is address space for a virtio-net-pci
> device on an Q35 machine with 6G memory):
>
> 0000000000000000-000000000009ffff (prio 0, RW): pc.ram
> 00000000000a0000-00000000000affff (prio 1, RW): vga.vram
> 00000000000b0000-00000000000bffff (prio 1, RW): vga-lowmem
> 00000000000c0000-00000000000c9fff (prio 0, RW): pc.ram
> 00000000000ca000-00000000000ccfff (prio 0, RW): pc.ram
> 00000000000cd000-00000000000ebfff (prio 0, RW): pc.ram
> 00000000000ec000-00000000000effff (prio 0, RW): pc.ram
> 00000000000f0000-00000000000fffff (prio 0, RW): pc.ram
> 0000000000100000-000000007fffffff (prio 0, RW): pc.ram
> 00000000b0000000-00000000bfffffff (prio 0, RW): pcie-mmcfg-mmio
> 00000000fd000000-00000000fdffffff (prio 1, RW): vga.vram
> 00000000fe000000-00000000fe000fff (prio 0, RW): virtio-pci-common
> 00000000fe001000-00000000fe001fff (prio 0, RW): virtio-pci-isr
> 00000000fe002000-00000000fe002fff (prio 0, RW): virtio-pci-device
> 00000000fe003000-00000000fe003fff (prio 0, RW): virtio-pci-notify
> 00000000febd0400-00000000febd041f (prio 0, RW): vga ioports remapped
> 00000000febd0500-00000000febd0515 (prio 0, RW): bochs dispi interface
> 00000000febd0600-00000000febd0607 (prio 0, RW): qemu extended regs
> 00000000febd1000-00000000febd102f (prio 0, RW): msix-table
> 00000000febd1800-00000000febd1807 (prio 0, RW): msix-pba
> 00000000febd2000-00000000febd2fff (prio 1, RW): ahci
> 00000000fec00000-00000000fec00fff (prio 0, RW): kvm-ioapic
> 00000000fed00000-00000000fed003ff (prio 0, RW): hpet
> 00000000fed1c000-00000000fed1ffff (prio 1, RW): lpc-rcrb-mmio
> 00000000fee00000-00000000feefffff (prio 4096, RW): kvm-apic-msi
> 00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
> 0000000100000000-00000001ffffffff (prio 0, RW): pc.ram
>
> So here are the "pc.ram" regions the only ones that we should expose
> to PCI devices? (it should contain all of them, including the low-mem
> ones and the >=4g one)
>
> And, should this rule work for all platforms? Or say, would it be a
> problem if I directly change address_space_memory in
> pci_device_iommu_address_space() into something else, which only
> contains RAM? (of course this won't affect any platform that has
> an IOMMU, i.e. a customized PCIBus::iommu_fn function)
No, the arrangement of both inbound and outbound windows is certainly
platform dependent (strictly speaking, dependent on the model and
configuration of the host bridge, but that tends to be tied strongly
to the platform). I think address_space_memory is the closest
approximation we're going to get that works for multiple platforms -
having both inbound and outbound windows identity mapped is pretty
common, I believe, even if they don't strictly speaking cover the
whole address space.
> (Btw, I'd appreciate it if anyone has a quick answer on why we have lots of
> contiguous "pc.ram" regions in the low 2G range - from can_merge() I guess they
> seem to have different dirty_log_mask, romd_mode, etc., but I still
> would like to know why they have these differences. Anyway, this
> is totally an "optional question", just to satisfy my own curiosity :)
I don't know PC well enough to be sure, but I suspect those low
regions have special meaning for the BIOS.
Note also the large gap between the pc.ram at 1M..2G and 4G..up. This
is the so-called "memory hole". You'll notice that all the IO regions
are in that range - that's for backwards compatibility with
32-bit machines where there was obviously nowhere else to put them.
Many 64-bit native platforms (including PAPR) don't have such a thing
and instead have RAM contiguous at 0 and the IO well above 4G in CPU
address space.
The PC PCI host bridge must clearly have an outgoing IO window from
2G..4G (mapping to the same addresses in PCI space) to handle these
devices. I'm pretty sure there must also be another window much
higher up, to handle 64-bit PCI devices with really big BARs (which
you probably don't have any of on your example system).
What I don't know is whether the 2G..4G range in PCI space will be
specifically excluded from the incoming (DMA) windows on the host
bridge. It might be that it is, or it might just be that the host
bridge will forward things to the CPU bus only if they don't get
picked up by a device BAR first. And I guess it's further complicated
by the fact that on PCI-E "up-bound" and "down-bound" transactions can
be distinguished, and the fact that at least some PCI-to-PCI or
PCIe-to-PCI bridges also have configurable inbound and outbound
windows. I'm not sure if that includes the implicit bridges in PCIe
root ports or switch ports.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [Qemu-devel] A question about PCI device address spaces
From: Peter Maydell @ 2016-12-23 11:21 UTC
To: Peter Xu
Cc: QEMU Devel Mailing List, Marcel Apfelbaum, Paolo Bonzini,
David Gibson
On 22 December 2016 at 09:42, Peter Xu <peterx@redhat.com> wrote:
> Hello,
>
> Since this is a general topic, I picked it out from the VT-d
> discussion and put it here, just to make it clearer.
>
> The issue is whether we have exposed too much address space for
> emulated PCI devices.
>
> Now for each PCI device, we have PCIDevice::bus_master_as as
> the device-visible address space, which is derived from
> pci_device_iommu_address_space():
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> {
>     PCIBus *bus = PCI_BUS(dev->bus);
>     PCIBus *iommu_bus = bus;
>
>     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>         iommu_bus = PCI_BUS(iommu_bus->parent_dev->bus);
>     }
>     if (iommu_bus && iommu_bus->iommu_fn) {
>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>     }
>     return &address_space_memory;
> }
>
> By default (for the no-IOMMU case), it points to the system memory
> space, which includes MMIO, and that looks wrong - a PCI device should
> not be able to write to MMIO regions.
This is just legacy, I think, i.e. a combination of "this used to
be system memory space so let's not break things" and "PC works
mostly like this". It should be possible for the PCI host bridge
emulation to set things up so that the device's visible address
space is whatever it feels like. The PCI APIs we have for doing
this have "iommu" in the name but they work just as well even
if the host bridge doesn't actually have an iommu and is just
setting up a fixed or slightly configurable mapping.
I think it just hasn't been implemented because for guests which
aren't misbehaving it doesn't make any difference.
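As a concrete illustration, here is a minimal sketch of a host bridge
doing exactly that - registering a fixed, RAM-only DMA address space
through the "iommu" hook even though there is no IOMMU at all. This is
not code from any existing board; the my_* names are made up and the
guest RAM region is assumed to be passed in by the caller.

static MemoryRegion my_bm_root;   /* container exposed for bus-master DMA */
static MemoryRegion my_ram_alias; /* alias of guest RAM placed inside it */
static AddressSpace my_bm_as;

static AddressSpace *my_host_bridge_dma_as(PCIBus *bus, void *opaque, int devfn)
{
    /* Fixed mapping: every device on the bus gets the same address space. */
    return &my_bm_as;
}

static void my_host_bridge_setup_dma(PCIBus *bus, Object *owner,
                                     MemoryRegion *guest_ram)
{
    memory_region_init(&my_bm_root, owner, "bus-master-root", UINT64_MAX);
    /* Expose RAM only: alias it into the container (one alias per RAM
     * region if the machine has several). */
    memory_region_init_alias(&my_ram_alias, owner, "bm-ram", guest_ram,
                             0, memory_region_size(guest_ram));
    memory_region_add_subregion(&my_bm_root, 0, &my_ram_alias);
    address_space_init(&my_bm_as, &my_bm_root, "bus-master");
    pci_setup_iommu(bus, my_host_bridge_dma_as, NULL);
}

The same hook is what a real IOMMU implementation fills in, which is
why nothing else in the PCI core would need to change.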
thanks
-- PMM
* Re: [Qemu-devel] A question about PCI device address spaces
From: Peter Xu @ 2016-12-26 6:53 UTC
To: Peter Maydell, Paolo Bonzini, David Gibson
Cc: QEMU Devel Mailing List, Marcel Apfelbaum
On Fri, Dec 23, 2016 at 11:21:53AM +0000, Peter Maydell wrote:
> On 22 December 2016 at 09:42, Peter Xu <peterx@redhat.com> wrote:
> > Hello,
> >
> > Since this is a general topic, I picked it out from the VT-d
> > discussion and put it here, just to make it clearer.
> >
> > The issue is whether we have exposed too much address space for
> > emulated PCI devices.
> >
> > Now for each PCI device, we have PCIDevice::bus_master_as as
> > the device-visible address space, which is derived from
> > pci_device_iommu_address_space():
> >
> > AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> > {
> >     PCIBus *bus = PCI_BUS(dev->bus);
> >     PCIBus *iommu_bus = bus;
> >
> >     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >         iommu_bus = PCI_BUS(iommu_bus->parent_dev->bus);
> >     }
> >     if (iommu_bus && iommu_bus->iommu_fn) {
> >         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> >     }
> >     return &address_space_memory;
> > }
> >
> > By default (for the no-IOMMU case), it points to the system memory
> > space, which includes MMIO, and that looks wrong - a PCI device should
> > not be able to write to MMIO regions.
>
> This is just legacy, I think, ie a combination of "this used to
> be system memory space so let's not break things" and "PC works
> mostly like this". It should be possible for the PCI host bridge
> emulation to set things up so that the device's visible address
> space is whatever it feels like. The PCI APIs we have for doing
> this have "iommu" in the name but they work just as well even
> if the host bridge doesn't actually have an iommu and is just
> setting up a fixed or slightly configurable mapping.
> I think it just hasn't been implemented because for guests which
> aren't misbehaving it doesn't make any difference.
Hmm, yes, I see ppc e500 uses that to set up its own address space;
possibly x86 can leverage it too when we really need it. For now, I
see no strong reason for this enhancement, so let me keep it as it is
and wait until we have both a strong reason and a PCI guru.
Thank you for your answer! (to Paolo/David as well)
-- peterx
* Re: [Qemu-devel] A question about PCI device address spaces
From: Marcel Apfelbaum @ 2016-12-26 11:01 UTC
To: Peter Xu, QEMU Devel Mailing List; +Cc: David Gibson, Paolo Bonzini
On 12/22/2016 11:42 AM, Peter Xu wrote:
> Hello,
>
Hi Peter,
> Since this is a general topic, I picked it out from the VT-d
> discussion and put it here, just to make it clearer.
>
> The issue is whether we have exposed too much address space for
> emulated PCI devices.
>
> Now for each PCI device, we have PCIDevice::bus_master_as as
> the device-visible address space, which is derived from
> pci_device_iommu_address_space():
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> {
>     PCIBus *bus = PCI_BUS(dev->bus);
>     PCIBus *iommu_bus = bus;
>
>     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>         iommu_bus = PCI_BUS(iommu_bus->parent_dev->bus);
>     }
>     if (iommu_bus && iommu_bus->iommu_fn) {
>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>     }
>     return &address_space_memory;
> }
>
> By default (for the no-IOMMU case), it points to the system memory
> space, which includes MMIO, and that looks wrong - a PCI device should
> not be able to write to MMIO regions.
>
Why? As far as I know a PCI device can start a read/write transaction
to virtually any address; it doesn't matter if it 'lands' in RAM or an MMIO
region mapped by another device. But I might be wrong, I need to read the spec again...
The PCI transaction will eventually reach the Root Complex/PCI host bridge
where an IOMMU or some other hw entity can sanitize/translate it, but that is out of
the scope of the device itself.
The Root Complex will 'translate' the transaction into a memory read/write
on behalf of the device and pass it to the memory controller.
If the transaction target is another device, I am not sure if the
Root Complex will re-route it by itself or pass it to the Memory Controller.
> As an example, if we dump a PCI device address space into detail on
> x86_64 system, we can see (this is address space for a virtio-net-pci
> device on an Q35 machine with 6G memory):
>
> 0000000000000000-000000000009ffff (prio 0, RW): pc.ram
> 00000000000a0000-00000000000affff (prio 1, RW): vga.vram
> 00000000000b0000-00000000000bffff (prio 1, RW): vga-lowmem
> 00000000000c0000-00000000000c9fff (prio 0, RW): pc.ram
> 00000000000ca000-00000000000ccfff (prio 0, RW): pc.ram
> 00000000000cd000-00000000000ebfff (prio 0, RW): pc.ram
> 00000000000ec000-00000000000effff (prio 0, RW): pc.ram
> 00000000000f0000-00000000000fffff (prio 0, RW): pc.ram
> 0000000000100000-000000007fffffff (prio 0, RW): pc.ram
> 00000000b0000000-00000000bfffffff (prio 0, RW): pcie-mmcfg-mmio
> 00000000fd000000-00000000fdffffff (prio 1, RW): vga.vram
> 00000000fe000000-00000000fe000fff (prio 0, RW): virtio-pci-common
> 00000000fe001000-00000000fe001fff (prio 0, RW): virtio-pci-isr
> 00000000fe002000-00000000fe002fff (prio 0, RW): virtio-pci-device
> 00000000fe003000-00000000fe003fff (prio 0, RW): virtio-pci-notify
> 00000000febd0400-00000000febd041f (prio 0, RW): vga ioports remapped
> 00000000febd0500-00000000febd0515 (prio 0, RW): bochs dispi interface
> 00000000febd0600-00000000febd0607 (prio 0, RW): qemu extended regs
> 00000000febd1000-00000000febd102f (prio 0, RW): msix-table
> 00000000febd1800-00000000febd1807 (prio 0, RW): msix-pba
> 00000000febd2000-00000000febd2fff (prio 1, RW): ahci
> 00000000fec00000-00000000fec00fff (prio 0, RW): kvm-ioapic
> 00000000fed00000-00000000fed003ff (prio 0, RW): hpet
> 00000000fed1c000-00000000fed1ffff (prio 1, RW): lpc-rcrb-mmio
> 00000000fee00000-00000000feefffff (prio 4096, RW): kvm-apic-msi
> 00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
> 0000000100000000-00000001ffffffff (prio 0, RW): pc.ram
>
> So here are the "pc.ram" regions the only ones that we should expose
> to PCI devices? (it should contain all of them, including the low-mem
> ones and the >=4g one)
>
As I previously said, it does not have to be RAM only, but let's wait
also for Michael's opinion.
> And, should this rule work for all platforms?
The PCI rules should be generic for all platforms, but I don't know
the other platforms.
Thanks,
Marcel
> Or say, would it be a
> problem if I directly change address_space_memory in
> pci_device_iommu_address_space() into something else, which only
> contains RAM? (of course this won't affect any platform that has
> an IOMMU, i.e. a customized PCIBus::iommu_fn function)
>
> (Btw, I'd appreciate it if anyone has a quick answer on why we have lots of
> contiguous "pc.ram" regions in the low 2G range - from can_merge() I guess they
> seem to have different dirty_log_mask, romd_mode, etc., but I still
> would like to know why they have these differences. Anyway, this
> is totally an "optional question", just to satisfy my own curiosity :)
>
> Thanks,
>
> -- peterx
>
* Re: [Qemu-devel] A question about PCI device address spaces
From: David Gibson @ 2016-12-26 11:40 UTC
To: Marcel Apfelbaum; +Cc: Peter Xu, QEMU Devel Mailing List, Paolo Bonzini
On Mon, Dec 26, 2016 at 01:01:34PM +0200, Marcel Apfelbaum wrote:
> On 12/22/2016 11:42 AM, Peter Xu wrote:
> > Hello,
> >
>
> Hi Peter,
>
> > Since this is a general topic, I picked it out from the VT-d
> > discussion and put it here, just to make it clearer.
> >
> > The issue is whether we have exposed too much address space for
> > emulated PCI devices.
> >
> > Now for each PCI device, we have PCIDevice::bus_master_as as
> > the device-visible address space, which is derived from
> > pci_device_iommu_address_space():
> >
> > AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> > {
> >     PCIBus *bus = PCI_BUS(dev->bus);
> >     PCIBus *iommu_bus = bus;
> >
> >     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >         iommu_bus = PCI_BUS(iommu_bus->parent_dev->bus);
> >     }
> >     if (iommu_bus && iommu_bus->iommu_fn) {
> >         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> >     }
> >     return &address_space_memory;
> > }
> >
> > By default (for the no-IOMMU case), it points to the system memory
> > space, which includes MMIO, and that looks wrong - a PCI device should
> > not be able to write to MMIO regions.
> >
>
> Why? As far as I know a PCI device can start a read/write transaction
> to virtually any address; it doesn't matter if it 'lands' in RAM or an MMIO
> region mapped by another device. But I might be wrong, I need to read the spec again...
So as I noted in another mail, my earlier comment, which led Peter to
say that, was misleading. In particular I was talking about *non-PCI*
MMIO devices, which barely exist on x86 (and even there the statement
won't necessarily be true).
> The PCI transaction will eventually reach the Root Complex/PCI host bridge
> where an IOMMU or some other hw entity can sanitize/translate it, but that is out of
> the scope of the device itself.
Right, but we're not talking about the device, or purely within PCI
address space. We're explicitly talking about what addresses the
RC/host bridge will translate between PCI space and CPU address space.
I'm betting that even on x86, it won't be the whole 64-bit address
space (otherwise how would the host bridge know whether another PCI
device might be listening on that address).
> The Root Complex will 'translate' the transaction into a memory read/write
> on behalf of the device and pass it to the memory controller.
> If the transaction target is another device, I am not sure if the
> Root Complex will re-route it by itself or pass it to the Memory Controller.
It will either re-route it itself, or simply drop it, possibly depending
on configuration. I'm sure the MC won't be bouncing transactions back
to PCI space. Note that for vanilla PCI the question is moot - the
cycle will be broadcast on the bus segment and something will pick it
up - either a device or the host bridge. If multiple things try to
respond to the same addresses, things will go badly wrong.
> > As an example, if we dump a PCI device address space into detail on
> > x86_64 system, we can see (this is address space for a virtio-net-pci
> > device on an Q35 machine with 6G memory):
> >
> > 0000000000000000-000000000009ffff (prio 0, RW): pc.ram
> > 00000000000a0000-00000000000affff (prio 1, RW): vga.vram
> > 00000000000b0000-00000000000bffff (prio 1, RW): vga-lowmem
> > 00000000000c0000-00000000000c9fff (prio 0, RW): pc.ram
> > 00000000000ca000-00000000000ccfff (prio 0, RW): pc.ram
> > 00000000000cd000-00000000000ebfff (prio 0, RW): pc.ram
> > 00000000000ec000-00000000000effff (prio 0, RW): pc.ram
> > 00000000000f0000-00000000000fffff (prio 0, RW): pc.ram
> > 0000000000100000-000000007fffffff (prio 0, RW): pc.ram
> > 00000000b0000000-00000000bfffffff (prio 0, RW): pcie-mmcfg-mmio
> > 00000000fd000000-00000000fdffffff (prio 1, RW): vga.vram
> > 00000000fe000000-00000000fe000fff (prio 0, RW): virtio-pci-common
> > 00000000fe001000-00000000fe001fff (prio 0, RW): virtio-pci-isr
> > 00000000fe002000-00000000fe002fff (prio 0, RW): virtio-pci-device
> > 00000000fe003000-00000000fe003fff (prio 0, RW): virtio-pci-notify
> > 00000000febd0400-00000000febd041f (prio 0, RW): vga ioports remapped
> > 00000000febd0500-00000000febd0515 (prio 0, RW): bochs dispi interface
> > 00000000febd0600-00000000febd0607 (prio 0, RW): qemu extended regs
> > 00000000febd1000-00000000febd102f (prio 0, RW): msix-table
> > 00000000febd1800-00000000febd1807 (prio 0, RW): msix-pba
> > 00000000febd2000-00000000febd2fff (prio 1, RW): ahci
> > 00000000fec00000-00000000fec00fff (prio 0, RW): kvm-ioapic
> > 00000000fed00000-00000000fed003ff (prio 0, RW): hpet
> > 00000000fed1c000-00000000fed1ffff (prio 1, RW): lpc-rcrb-mmio
> > 00000000fee00000-00000000feefffff (prio 4096, RW): kvm-apic-msi
> > 00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
> > 0000000100000000-00000001ffffffff (prio 0, RW): pc.ram
> >
> > So, are the "pc.ram" regions here the only ones that we should expose
> > to PCI devices? (It should contain all of them, including the low-mem
> > ones and the >=4G one)
> >
>
> As I previously said, it does not have to be RAM only, but let's wait
> also for Michael's opinion.
>
> > And, should this rule work for all platforms?
>
> The PCI rules should be generic for all platforms, but I don't know
> the other platforms.
The rules *within the PCI address space* will be common across
platforms. But we're discussing the host bridge and the rules across
the PCI/host interface. This behaviour - what address ranges will be
forwarded in which direction, for example - can and does vary
significantly by platform.
>
> Thanks,
> Marcel
>
> > Or say, would it be a
> > problem if I directly change address_space_memory in
> > pci_device_iommu_address_space() into something else, which only
> > contains RAM? (of course this won't affect any platform that has
> > an IOMMU, i.e. a customized PCIBus::iommu_fn function)
> >
> > (Btw, I'd appreciate it if anyone has a quick answer on why we have lots of
> > contiguous "pc.ram" regions in the low 2G range - from can_merge() I guess they
> > seem to have different dirty_log_mask, romd_mode, etc., but I still
> > would like to know why they have these differences. Anyway, this
> > is totally an "optional question", just to satisfy my own curiosity :)
> >
> > Thanks,
> >
> > -- peterx
> >
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson