From: Alex Williamson <alex.williamson@redhat.com>
To: Alexander Graf <agraf@suse.de>
Cc: Peter Maydell <peter.maydell@linaro.org>,
"Michael S. Tsirkin" <mst@redhat.com>,
Alexey Kardashevskiy <aik@ozlabs.ru>,
QEMU Developers <qemu-devel@nongnu.org>,
Luiz Capitulino <lcapitulino@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
Date: Mon, 13 Jan 2014 14:39:04 -0700
Message-ID: <1389649144.3209.394.camel@bling.home>
In-Reply-To: <B6BE8908-3699-403A-B501-DA2E0FC1533D@suse.de>

On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>
> >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>
> >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The region it gets
> >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>> address space (see pci_bus_init). Due to a typo that's only 2^63-1,
> >>>>>>>>>> not 2^64. But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>> bits of the physical address. In address_space_translate_internal then
> >>>>>>>>>>
> >>>>>>>>>> diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>> *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>
> >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>
> >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>
> >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>> ---
> >>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>> --- a/exec.c
> >>>>>>>>>> +++ b/exec.c
> >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>
> >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables. */
> >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>
> >>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>> {
> >>>>>>>>>> system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>
> >>>>>>>>>> - assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>> -
> >>>>>>>>>> - memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>> - ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>> - UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>> + memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>> address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>
> >>>>>>>>>> system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>
> >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>
> >>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>> don't detect BAR being disabled?
> >>>>>>>
> >>>>>>> See the trace below, the BARs are not disabled. QEMU pci-core is doing
> >>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>> pass-through here.
> >>>>>>
> >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>> while I/O & memory are enabled in the command register. Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>>
> >>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>
> >>>> Unfortunately
> >>>>
> >>>>>>>>> After this patch I get vfio
> >>>>>>>>> traces like this:
> >>>>>>>>>
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>> (write mask to BAR)
> >>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>> (memory region gets unmapped)
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>> (read size mask)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>> (restore BAR)
> >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>> (memory region re-mapped)
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>> (write mask to BAR)
> >>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>> (memory region gets unmapped)
> >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>
> >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>
> >>>>> Why can't you? Generally memory core lets you find out easily.
> >>>>
> >>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>> out anything that's not memory_region_is_ram(). This still gets
> >>>> through, so how do I easily find out?
> >>>>
> >>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>> know it's MMIO.
> >>>>
> >>>> How so? I have a MemoryListener as described above and pass everything
> >>>> through to the IOMMU. I suppose I could look through all the
> >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>> ugly.
> >>>>
> >>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>> bar though, like ivshmem?
> >>>>
> >>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>
> >>> Must be a 64 bit BAR to trigger the issue though.
> >>>
> >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>
> >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>> window. This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>
> >>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>> a device? The CPU doesn't have the physical bits to access it. I have
> >>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>> either. Maybe in some limited scenario where the devices are on the
> >>>>>>> same conventional PCI bus. In the typical case, PCI addresses are
> >>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>> in ACPI). Even if I wanted to filter these out as noise in vfio, how
> >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>> programmed. PCI has this knowledge, I hope. VFIO doesn't. Thanks,
> >>>>>>>
> >>>>>>> Alex
> >>>>>>
> >>>>>
> >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>> test suites normally check that it actually does work
> >>>>> if it happens.
> >>>>
> >>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>> routing, that's more what I'm referring to. There are generally only
> >>>> fixed address windows for RAM vs MMIO.
> >>>
> >>> The physical chipset? Likely - in the presence of IOMMU.
> >>> Without that, devices can talk to each other without going
> >>> through chipset, and bridge spec is very explicit that
> >>> full 64 bit addressing must be supported.
> >>>
> >>> So as long as we don't emulate an IOMMU,
> >>> guest will normally think it's okay to use any address.
> >>>
> >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>> windows would protect you, but pci already does this filtering:
> >>>>> if you see this address in the memory map this means
> >>>>> your virtual device is on root bus.
> >>>>>
> >>>>> So I think it's the other way around: if VFIO requires specific
> >>>>> address ranges to be assigned to devices, it should give this
> >>>>> info to qemu and qemu can give this to guest.
> >>>>> Then anything outside that range can be ignored by VFIO.
> >>>>
> >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO. There's
> >>>> currently no way to find out the address width of the IOMMU. We've been
> >>>> getting by because it's safely close enough to the CPU address width to
> >>>> not be a concern until we start exposing things at the top of the 64bit
> >>>> address space. Maybe I can safely ignore anything above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks,
> >>>>
> >>>> Alex
> >>>
> >>> I think it's not related to target CPU at all - it's a host limitation.
> >>> So just make up your own constant, maybe depending on host architecture.
> >>> Long term add an ioctl to query it.
> >>
> >> It's a hardware limitation which I'd imagine has some loose ties to the
> >> physical address bits of the CPU.
> >>
> >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>> placing BARs above some address.
> >>
> >> That doesn't help this case, it's a spurious mapping caused by sizing
> >> the BARs with them enabled. We may still want such a thing to feed into
> >> building ACPI tables though.
> >
> > Well the point is that if you want BIOS to avoid
> > specific addresses, you need to tell it what to avoid.
> > But neither BIOS nor ACPI actually cover the range above
> > 2^48 ATM so it's not a high priority.
> >
> >>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>> lines of vfio_get_addr_space_bits(void).
> >>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>
> >> It's an IOMMU hardware limitation, legacy assignment has the same
> >> problem. It looks like legacy will abort() in QEMU for the failed
> >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >> mappings. In the short term, I think I'll ignore any mappings above
> >> TARGET_PHYS_ADDR_SPACE_BITS,
> >
> > That seems very wrong. It will still fail on an x86 host if we are
> > emulating a CPU with full 64 bit addressing. The limitation is on the
> > host side there's no real reason to tie it to the target.

I doubt vfio would be the only thing broken in that case.

> >> long term vfio already has an IOMMU info
> >> ioctl that we could use to return this information, but we'll need to
> >> figure out how to get it out of the IOMMU driver first.
> >> Thanks,
> >>
> >> Alex
> >
> > Short term, just assume 48 bits on x86.

I hate to pick an arbitrary value since we have a very specific mapping
we're trying to avoid. Perhaps a better option is to skip anything
where:

    MemoryRegionSection.offset_within_address_space >
    ~MemoryRegionSection.offset_within_address_space

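
The comparison above can be read as a test on the top address bit: for
an unsigned 64-bit value, x > ~x holds exactly when bit 63 of x is set.
A minimal sketch of that predicate follows. The helper name is
hypothetical and is not an existing QEMU or vfio function; the plain
uint64_t argument stands in for
MemoryRegionSection.offset_within_address_space.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * For unsigned 64-bit x, ~x == UINT64_MAX - x, so x > ~x is true
 * exactly when x >= 2^63, i.e. when bit 63 is set.  A section that
 * starts in the upper half of the 64-bit address space would be
 * skipped; everything below 2^63 would still be mapped.
 */
static inline bool section_start_in_upper_half(uint64_t start)
{
    return start > ~start;
}
```

With the addresses from the trace, the stray mapping at
0xfffffffffebe0000 would be skipped, while the real BAR location at
0xfebe0000 would still be passed through to the IOMMU.
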
> > We need to figure out what's the limitation on ppc and arm -
> > maybe there's none and it can address full 64 bit range.
>
> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
>
> Or did I misunderstand the question?

Sounds right. If either BAR mappings outside the window will not be
realized in the memory space or the IOMMU has a full 64bit address
space, there's no problem. Here we have an intermediate step in the BAR
sizing producing a stray mapping that the IOMMU hardware can't handle.
Even if we could handle it, it's not clear that we want to. On AMD-Vi
the IOMMU page tables can grow to 6 levels deep. A stray mapping like
this then causes space and time overhead until the tables are pruned
back down. Thanks,

Alex
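
To make the overhead argument concrete, here is a rough sketch of how
the stray address arises and how many page-table levels it would need.
It assumes 4 KiB pages and 9 address bits resolved per table level;
that geometry is an assumption of this sketch, not something stated in
the thread. The helper and macro names are hypothetical, and the BAR
values are the ones from the vfio trace quoted earlier.

```c
#include <stdint.h>

/* Assumed geometry: 4 KiB pages, 9 address bits per table level. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_LEVEL_BITS 9

/*
 * Combine the two 32-bit halves of a 64-bit memory BAR into an
 * address, masking off the low four flag bits of the lower half.
 */
static inline uint64_t sketch_bar64_address(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (lo & ~0xfULL);
}

/*
 * Count how many table levels are needed to reach 'iova'.  Level 1
 * covers the low 9 bits of the page frame number; each additional
 * level covers 9 more.  A 64-bit address has at most 52 frame-number
 * bits, so the loop stops at 6 levels and the shift stays below 64.
 */
static inline unsigned sketch_levels_needed(uint64_t iova)
{
    uint64_t pfn = iova >> SKETCH_PAGE_SHIFT;
    unsigned levels = 1;

    while ((pfn >> (SKETCH_LEVEL_BITS * levels)) != 0) {
        levels++;
    }
    return levels;
}
```

Under these assumptions, the moment the guest writes 0xffffffff to the
upper half while the lower half still reads 0xfebe0004, the BAR
decodes to 0xfffffffffebe0000, which needs all 6 levels. The real
location at 0xfebe0000 needs only 3, and any 48-bit address fits in 4.
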