From: Avi Kivity
Date: Tue, 14 Jan 2014 12:24:24 +0200
Subject: Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
Message-ID: <52D51058.9000701@cloudius-systems.com>
In-Reply-To: <1389653291.3209.410.camel@bling.home>
To: Alex Williamson, Alexander Graf
Cc: Peter Maydell, "Michael S. Tsirkin", Alexey Kardashevskiy, QEMU Developers,
 Luiz Capitulino, Paolo Bonzini, David Gibson

On 01/14/2014 12:48 AM, Alex Williamson wrote:
> On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
>>> On 13.01.2014 at 22:39, Alex Williamson wrote:
>>>
>>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin wrote:
>>>>>
>>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> From: Paolo Bonzini
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The region it gets
>>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>>>> address space (see pci_bus_init). Due to a typo that's only 2^63-1,
>>>>>>>>>>>>>> not 2^64.
>>>>>>>>>>>>>> But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>>>> bits of the physical address. In address_space_translate_internal then
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>>> *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reported-by: Luiz Capitulino
>>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini
>>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  exec.c | 8 ++------
>>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>>>>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  /* Size of the L2 (and L3, etc) page tables. */
>>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  #define P_L2_BITS 10
>>>>>>>>>>>>>>  #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>      system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>>>      address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>>>> don't detect the BAR being disabled?
>>>>>>>>>>> See the trace below, the BARs are not disabled. QEMU pci-core is doing
>>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
>>>>>>>>>>> pass-through here.
>>>>>>>>>> Sorry, not in the trace below, but yes, the sizing seems to be happening
>>>>>>>>>> while I/O & memory are enabled in the command register. Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> OK, then from QEMU's POV this BAR value is not special at all.
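For reference, the sizing sequence the guest performs on a 64-bit memory BAR
looks roughly like the sketch below. This is generic pseudo-driver code, not
taken from QEMU or the kernel; pci_cfg_read32()/pci_cfg_write32() and struct
pci_dev are placeholder names. The point is that with memory decode left
enabled, the step where the upper dword holds 0xffffffff makes the BAR
transiently decode at an address like 0xfffffffffebe0000, which is exactly
what shows up in the vfio trace quoted below.

    /* Sketch of 64-bit memory BAR sizing at config offset 0x10, with memory
     * decode left enabled.  Accessors and struct pci_dev are hypothetical. */
    static uint64_t size_bar64(struct pci_dev *dev)
    {
        uint32_t lo = pci_cfg_read32(dev, 0x10);       /* save lower dword  */
        uint32_t hi = pci_cfg_read32(dev, 0x14);       /* save upper dword  */

        pci_cfg_write32(dev, 0x10, 0xffffffff);        /* write size mask   */
        uint32_t lo_mask = pci_cfg_read32(dev, 0x10);  /* e.g. 0xffffc004   */
        pci_cfg_write32(dev, 0x10, lo);                /* restore lower     */

        pci_cfg_write32(dev, 0x14, 0xffffffff);        /* while this is set,
                                                          the BAR decodes at
                                                          0xfffffffffebe0000 */
        uint32_t hi_mask = pci_cfg_read32(dev, 0x14);
        pci_cfg_write32(dev, 0x14, hi);                /* restore upper     */

        uint64_t mask = ((uint64_t)hi_mask << 32) | (lo_mask & ~0xfULL);
        return ~mask + 1;                              /* 0x4000 in this case */
    }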
>>>>>>>> Unfortunately
>>>>>>>>
>>>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>>>> traces like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>>>> (read size mask)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>>>> (restore BAR)
>>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>>> Two reasons: first, I can't tell the difference between RAM and MMIO.
>>>>>>>>> Why can't you? Generally the memory core lets you find out easily.
>>>>>>>> My MemoryListener is set up for &address_space_memory and I then filter
>>>>>>>> out anything that's not memory_region_is_ram(). This still gets
>>>>>>>> through, so how do I easily find out?
>>>>>>>>
>>>>>>>>> But in this case it's the vfio device itself that is sized, so for sure
>>>>>>>>> you know it's MMIO.
>>>>>>>> How so? I have a MemoryListener as described above and pass everything
>>>>>>>> through to the IOMMU. I suppose I could look through all the
>>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>>>> ugly.
>>>>>>>>
>>>>>>>>> Maybe you will have the same issue if there's another device with a
>>>>>>>>> 64 bit BAR though, like ivshmem?
>>>>>>>> Perhaps; I suspect I'll see anything that registers its BAR MemoryRegion
>>>>>>>> via memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>>>
>>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>>>
>>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>>>> window. This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>>>> True, the CPU can't access this address, but other PCI devices can.
>>>>>>>>>>> What happens on real hardware when an address like this is programmed
>>>>>>>>>>> to a device? The CPU doesn't have the physical bits to access it.
>>>>>>>>>>> I have serious doubts that another PCI device would be able to
>>>>>>>>>>> access it either. Maybe in some limited scenario where the devices
>>>>>>>>>>> are on the same conventional PCI bus. In the typical case, PCI
>>>>>>>>>>> addresses are always limited by some kind of aperture, whether that's
>>>>>>>>>>> explicit in bridge windows or implicit in hardware design (and perhaps
>>>>>>>>>>> made explicit in ACPI). Even if I wanted to filter these out as noise
>>>>>>>>>>> in vfio, how would I do it in a way that still allows real 64bit MMIO
>>>>>>>>>>> to be programmed? PCI has this knowledge, I hope. VFIO doesn't. Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>> AFAIK PCI doesn't have that knowledge as such. The PCI spec is explicit
>>>>>>>>> that full 64 bit addresses must be allowed, and hardware validation
>>>>>>>>> test suites normally check that it actually does work
>>>>>>>>> if it happens.
>>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>>>> routing; that's more what I'm referring to. There are generally only
>>>>>>>> fixed address windows for RAM vs MMIO.
>>>>>>> The physical chipset? Likely - in the presence of an IOMMU.
>>>>>>> Without that, devices can talk to each other without going
>>>>>>> through the chipset, and the bridge spec is very explicit that
>>>>>>> full 64 bit addressing must be supported.
>>>>>>>
>>>>>>> So as long as we don't emulate an IOMMU,
>>>>>>> the guest will normally think it's okay to use any address.
>>>>>>>
>>>>>>>>> Yes, if there's a bridge somewhere on the path, that bridge's
>>>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>>>> if you see this address in the memory map, this means
>>>>>>>>> your virtual device is on the root bus.
>>>>>>>>>
>>>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>>>> info to qemu and qemu can give this to the guest.
>>>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO. There's
>>>>>>>> currently no way to find out the address width of the IOMMU. We've been
>>>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>>>> address space. Maybe I can safely ignore anything above
>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> I think it's not related to the target CPU at all - it's a host limitation.
>>>>>>> So just make up your own constant, maybe depending on the host architecture.
>>>>>>> Long term, add an ioctl to query it.
>>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>>>> physical address bits of the CPU.
>>>>>>
>>>>>>> Also, we can add a fwcfg interface to tell the BIOS that it should avoid
>>>>>>> placing BARs above some address.
>>>>>> That doesn't help this case; it's a spurious mapping caused by sizing
>>>>>> the BARs with them enabled. We may still want such a thing to feed into
>>>>>> building ACPI tables though.
>>>>> Well, the point is that if you want the BIOS to avoid
>>>>> specific addresses, you need to tell it what to avoid.
>>>>> But neither BIOS nor ACPI actually cover the range above
>>>>> 2^48 ATM, so it's not a high priority.
>>>>>
>>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>>>> lines of vfio_get_addr_space_bits(void).
>>>>>>> (Is this true btw?
>>>>>>> legacy assignment doesn't have this problem?)
>>>>>> It's an IOMMU hardware limitation; legacy assignment has the same
>>>>>> problem. It looks like legacy will abort() in QEMU for the failed
>>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>>>> mappings. In the short term, I think I'll ignore any mappings above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>>> That seems very wrong. It will still fail on an x86 host if we are
>>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>>>> host side; there's no real reason to tie it to the target.
>>> I doubt vfio would be the only thing broken in that case.
>>>
>>>>>> long term vfio already has an IOMMU info
>>>>>> ioctl that we could use to return this information, but we'll need to
>>>>>> figure out how to get it out of the IOMMU driver first.
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>> Short term, just assume 48 bits on x86.
>>> I hate to pick an arbitrary value since we have a very specific mapping
>>> we're trying to avoid. Perhaps a better option is to skip anything
>>> where:
>>>
>>>   MemoryRegionSection.offset_within_address_space >
>>>   ~MemoryRegionSection.offset_within_address_space
>>>
>>>>> We need to figure out what's the limitation on ppc and arm -
>>>>> maybe there's none and it can address the full 64 bit range.
>>>> IIUC on PPC and ARM you always have BAR windows where things can get
>>>> mapped into, unlike x86 where the full physical address range can be
>>>> overlaid by BARs.
>>>>
>>>> Or did I misunderstand the question?
>>> Sounds right. If either BAR mappings outside the window will not be
>>> realized in the memory space or the IOMMU has a full 64bit address
>>> space, there's no problem. Here we have an intermediate step in the BAR
>>> sizing producing a stray mapping that the IOMMU hardware can't handle.
>>> Even if we could handle it, it's not clear that we want to. On AMD-Vi
>>> the IOMMU page tables can grow to 6 levels deep. A stray mapping like
>>> this then causes space and time overhead until the tables are pruned
>>> back down. Thanks,
>> I thought sizing is hard-defined as setting the BAR to -1? Can't we check
>> for that one special case and treat it as "not mapped, but tell the guest
>> the size in config space"?
> PCI doesn't want to handle this as anything special to differentiate a
> sizing mask from a valid BAR address. I agree though, I'd prefer to
> never see a spurious address like this in my MemoryListener.
>

Can't you just ignore regions that cannot be mapped? Oh, and teach the
BIOS and/or Linux to disable memory access while sizing.
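To make that concrete, here is a sketch of the kind of filter being discussed,
written against QEMU's MemoryListener API; it is illustrative only, not the
actual vfio listener code. Skipping any section whose start address has the
high bit set catches the transient 0xfffffffffebe0000 mapping from BAR sizing
without hard-coding a host IOMMU width.

    /* Illustrative sketch against the MemoryListener API (exec/memory.h);
     * not the real vfio implementation. */
    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        hwaddr addr = section->offset_within_address_space;

        /* Only RAM-backed regions are mapped; note that vfio BARs created
         * with memory_region_init_ram_ptr() still pass this test. */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /* The check proposed above: ignore anything in the upper half of
         * the 64-bit address space, such as the stray mapping that appears
         * while the upper BAR dword holds 0xffffffff during sizing. */
        if (addr > ~addr) {
            return;
        }

        /* ... hand the section to the IOMMU as before ... */
    }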