From: Avi Kivity
Date: Tue, 14 Jan 2014 12:24:24 +0200
Subject: Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
Message-ID: <52D51058.9000701@cloudius-systems.com>
In-Reply-To: <1389653291.3209.410.camel@bling.home>
To: Alex Williamson, Alexander Graf
Cc: Peter Maydell, "Michael S. Tsirkin", Alexey Kardashevskiy, QEMU Developers,
 Luiz Capitulino, Paolo Bonzini, David Gibson

On 01/14/2014 12:48 AM, Alex Williamson wrote:
> On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
>>> On 13.01.2014 at 22:39, Alex Williamson wrote:
>>>
>>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin wrote:
>>>>>
>>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> From: Paolo Bonzini
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The region it gets
>>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>>>> address space (see pci_bus_init). Due to a typo that's only 2^63-1,
>>>>>>>>>>>>>> not 2^64.
>>>>>>>>>>>>>> But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>>>> bits of the physical address. In address_space_translate_internal then
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>>> *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reported-by: Luiz Capitulino
>>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini
>>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  exec.c | 8 ++------
>>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>>>>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  /* Size of the L2 (and L3, etc) page tables. */
>>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  #define P_L2_BITS 10
>>>>>>>>>>>>>>  #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>      system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>>>      address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>>>> don't detect the BAR being disabled?
>>>>>>>>>>> See the trace below, the BARs are not disabled. QEMU pci-core is doing
>>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
>>>>>>>>>>> pass-through here.
>>>>>>>>>> Sorry, not in the trace below, but yes, the sizing seems to be happening
>>>>>>>>>> while I/O & memory are enabled in the command register. Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> OK, then from QEMU's POV this BAR value is not special at all.
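For reference, the sizing sequence the guest performs on a 64-bit memory BAR
looks roughly like the sketch below. This is generic pseudo-driver code, not
taken from QEMU or the kernel; pci_cfg_read32()/pci_cfg_write32() and struct
pci_dev are placeholder names. The point is that with memory decode left
enabled, the step where the upper dword holds 0xffffffff makes the BAR
transiently decode at an address like 0xfffffffffebe0000, which is exactly
what shows up in the vfio trace quoted below.

    /* Sketch of 64-bit memory BAR sizing at config offset 0x10, with memory
     * decode left enabled.  Accessors and struct pci_dev are hypothetical. */
    static uint64_t size_bar64(struct pci_dev *dev)
    {
        uint32_t lo = pci_cfg_read32(dev, 0x10);       /* save lower dword  */
        uint32_t hi = pci_cfg_read32(dev, 0x14);       /* save upper dword  */

        pci_cfg_write32(dev, 0x10, 0xffffffff);        /* write size mask   */
        uint32_t lo_mask = pci_cfg_read32(dev, 0x10);  /* e.g. 0xffffc004   */
        pci_cfg_write32(dev, 0x10, lo);                /* restore lower     */

        pci_cfg_write32(dev, 0x14, 0xffffffff);        /* while this is set,
                                                          the BAR decodes at
                                                          0xfffffffffebe0000 */
        uint32_t hi_mask = pci_cfg_read32(dev, 0x14);
        pci_cfg_write32(dev, 0x14, hi);                /* restore upper     */

        uint64_t mask = ((uint64_t)hi_mask << 32) | (lo_mask & ~0xfULL);
        return ~mask + 1;                              /* 0x4000 in this case */
    }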
>>>>>>>> Unfortunately
>>>>>>>>
>>>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>>>> traces like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>>>> (read size mask)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>>>> (restore BAR)
>>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>>> Two reasons: first, I can't tell the difference between RAM and MMIO.
>>>>>>>>> Why can't you? Generally the memory core lets you find out easily.
>>>>>>>> My MemoryListener is set up for &address_space_memory and I then filter
>>>>>>>> out anything that's not memory_region_is_ram(). This still gets
>>>>>>>> through, so how do I easily find out?
>>>>>>>>
>>>>>>>>> But in this case it's the vfio device itself that is sized, so for sure
>>>>>>>>> you know it's MMIO.
>>>>>>>> How so? I have a MemoryListener as described above and pass everything
>>>>>>>> through to the IOMMU. I suppose I could look through all the
>>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>>>> ugly.
>>>>>>>>
>>>>>>>>> Maybe you will have the same issue if there's another device with a
>>>>>>>>> 64 bit BAR though, like ivshmem?
>>>>>>>> Perhaps; I suspect I'll see anything that registers its BAR MemoryRegion
>>>>>>>> via memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>>>
>>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>>>
>>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>>>> window. This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>>>> True, the CPU can't access this address, but other PCI devices can.
>>>>>>>>>>> What happens on real hardware when an address like this is programmed
>>>>>>>>>>> to a device? The CPU doesn't have the physical bits to access it.
>>>>>>>>>>> I have serious doubts that another PCI device would be able to
>>>>>>>>>>> access it either. Maybe in some limited scenario where the devices
>>>>>>>>>>> are on the same conventional PCI bus. In the typical case, PCI
>>>>>>>>>>> addresses are always limited by some kind of aperture, whether that's
>>>>>>>>>>> explicit in bridge windows or implicit in hardware design (and perhaps
>>>>>>>>>>> made explicit in ACPI). Even if I wanted to filter these out as noise
>>>>>>>>>>> in vfio, how would I do it in a way that still allows real 64bit MMIO
>>>>>>>>>>> to be programmed? PCI has this knowledge, I hope. VFIO doesn't. Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>> AFAIK PCI doesn't have that knowledge as such. The PCI spec is explicit
>>>>>>>>> that full 64 bit addresses must be allowed, and hardware validation
>>>>>>>>> test suites normally check that it actually does work
>>>>>>>>> if it happens.
>>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>>>> routing; that's more what I'm referring to. There are generally only
>>>>>>>> fixed address windows for RAM vs MMIO.
>>>>>>> The physical chipset? Likely - in the presence of an IOMMU.
>>>>>>> Without that, devices can talk to each other without going
>>>>>>> through the chipset, and the bridge spec is very explicit that
>>>>>>> full 64 bit addressing must be supported.
>>>>>>>
>>>>>>> So as long as we don't emulate an IOMMU,
>>>>>>> the guest will normally think it's okay to use any address.
>>>>>>>
>>>>>>>>> Yes, if there's a bridge somewhere on the path, that bridge's
>>>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>>>> if you see this address in the memory map, this means
>>>>>>>>> your virtual device is on the root bus.
>>>>>>>>>
>>>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>>>> info to qemu and qemu can give this to the guest.
>>>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO. There's
>>>>>>>> currently no way to find out the address width of the IOMMU. We've been
>>>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>>>> address space. Maybe I can safely ignore anything above
>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> I think it's not related to the target CPU at all - it's a host limitation.
>>>>>>> So just make up your own constant, maybe depending on the host architecture.
>>>>>>> Long term, add an ioctl to query it.
>>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>>>> physical address bits of the CPU.
>>>>>>
>>>>>>> Also, we can add a fwcfg interface to tell the BIOS that it should avoid
>>>>>>> placing BARs above some address.
>>>>>> That doesn't help this case; it's a spurious mapping caused by sizing
>>>>>> the BARs with them enabled. We may still want such a thing to feed into
>>>>>> building ACPI tables though.
>>>>> Well, the point is that if you want the BIOS to avoid
>>>>> specific addresses, you need to tell it what to avoid.
>>>>> But neither BIOS nor ACPI actually cover the range above
>>>>> 2^48 ATM, so it's not a high priority.
>>>>>
>>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>>>> lines of vfio_get_addr_space_bits(void).
>>>>>>> (Is this true btw?
>>>>>>> legacy assignment doesn't have this problem?)
>>>>>> It's an IOMMU hardware limitation; legacy assignment has the same
>>>>>> problem. It looks like legacy will abort() in QEMU for the failed
>>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>>>> mappings. In the short term, I think I'll ignore any mappings above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>>> That seems very wrong. It will still fail on an x86 host if we are
>>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>>>> host side; there's no real reason to tie it to the target.
>>> I doubt vfio would be the only thing broken in that case.
>>>
>>>>>> long term vfio already has an IOMMU info
>>>>>> ioctl that we could use to return this information, but we'll need to
>>>>>> figure out how to get it out of the IOMMU driver first.
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>> Short term, just assume 48 bits on x86.
>>> I hate to pick an arbitrary value since we have a very specific mapping
>>> we're trying to avoid. Perhaps a better option is to skip anything
>>> where:
>>>
>>>   MemoryRegionSection.offset_within_address_space >
>>>   ~MemoryRegionSection.offset_within_address_space
>>>
>>>>> We need to figure out what's the limitation on ppc and arm -
>>>>> maybe there's none and it can address the full 64 bit range.
>>>> IIUC on PPC and ARM you always have BAR windows where things can get
>>>> mapped into, unlike x86 where the full physical address range can be
>>>> overlaid by BARs.
>>>>
>>>> Or did I misunderstand the question?
>>> Sounds right. If either BAR mappings outside the window will not be
>>> realized in the memory space or the IOMMU has a full 64bit address
>>> space, there's no problem. Here we have an intermediate step in the BAR
>>> sizing producing a stray mapping that the IOMMU hardware can't handle.
>>> Even if we could handle it, it's not clear that we want to. On AMD-Vi
>>> the IOMMU page tables can grow to 6 levels deep. A stray mapping like
>>> this then causes space and time overhead until the tables are pruned
>>> back down. Thanks,
>> I thought sizing is hard-defined as setting the BAR to -1? Can't we check
>> for that one special case and treat it as "not mapped, but tell the guest
>> the size in config space"?
> PCI doesn't want to handle this as anything special to differentiate a
> sizing mask from a valid BAR address. I agree though, I'd prefer to
> never see a spurious address like this in my MemoryListener.
>

Can't you just ignore regions that cannot be mapped? Oh, and teach the
BIOS and/or Linux to disable memory access while sizing.
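To make that concrete, here is a sketch of the kind of filter being discussed,
written against QEMU's MemoryListener API; it is illustrative only, not the
actual vfio listener code. Skipping any section whose start address has the
high bit set catches the transient 0xfffffffffebe0000 mapping from BAR sizing
without hard-coding a host IOMMU width.

    /* Illustrative sketch against the MemoryListener API (exec/memory.h);
     * not the real vfio implementation. */
    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        hwaddr addr = section->offset_within_address_space;

        /* Only RAM-backed regions are mapped; note that vfio BARs created
         * with memory_region_init_ram_ptr() still pass this test. */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /* The check proposed above: ignore anything in the upper half of
         * the 64-bit address space, such as the stray mapping that appears
         * while the upper BAR dword holds 0xffffffff during sizing. */
        if (addr > ~addr) {
            return;
        }

        /* ... hand the section to the IOMMU as before ... */
    }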