Date: Tue, 14 Jan 2014 12:28:58 +0200
From: "Michael S. Tsirkin"
To: Alexander Graf
Cc: Peter Maydell, Alexey Kardashevskiy, QEMU Developers, Luiz Capitulino,
 Alex Williamson, Paolo Bonzini, David Gibson
Subject: Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
Message-ID: <20140114102858.GC2846@redhat.com>
In-Reply-To: <86B6CCBE-A534-4C7C-9363-840B3308D728@suse.de>

On Tue, Jan 14, 2014 at 10:20:57AM +0100, Alexander Graf wrote:
> 
> On 14.01.2014, at 09:18, Michael S. Tsirkin wrote:
> 
> > On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> >>
> >>
> >>> On 13.01.2014 at 22:39, Alex Williamson wrote:
> >>>
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin wrote:
> >>>>>
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>> not 2^64.
> >>>>>>>>>>>>>> But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>     diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>  exec.c | 8 ++------
> >>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  /* Size of the L2 (and L3, etc) page tables. */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  #define P_L2_BITS 10
> >>>>>>>>>>>>>>  #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>      system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>      address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>      system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>>
> >>>>>>>>>>>> BARs are often disabled during sizing.  Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>>
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>>
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>>
> >>>>>>>> Unfortunately
> >>>>>>>>
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>>
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>>
> >>>>>>>>> Why can't you?  Generally memory core lets you find out easily.
> >>>>>>>>
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>>
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>> know it's MMIO.
> >>>>>>>>
> >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>> ugly.
> >>>>>>>>
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>>
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>>
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>>
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>>
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Alex
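The sizing sequence in that trace boils down to the following standalone toy
model; it is not vfio or QEMU code, just a sketch with the 16 KiB size and the
febe0004 value taken from the trace, showing why a transient fffffffffebe0000
mapping appears between writing all 1's to the upper dword and restoring it:

    #include <stdint.h>
    #include <stdio.h>

    #define BAR_SIZE   0x4000ULL      /* 16 KiB region, as in the trace */
    #define BAR_FLAGS  0x4ULL         /* "64-bit memory BAR" type bits  */

    static uint64_t bar = 0xfebe0000ULL | BAR_FLAGS;

    /* Device decode of a config write to one half of the BAR: address bits
     * below the region size don't stick, type bits always read back. */
    static void bar_write(int hi, uint32_t val)
    {
        uint64_t mask = hi ? 0xffffffff00000000ULL : 0xffffffffULL;

        bar = (bar & ~mask) | (((uint64_t)val << (hi ? 32 : 0)) & mask);
        bar = (bar & ~(BAR_SIZE - 1)) | BAR_FLAGS;
    }

    static uint32_t bar_read(int hi)
    {
        return (uint32_t)(bar >> (hi ? 32 : 0));
    }

    int main(void)
    {
        uint32_t lo = bar_read(0), hi = bar_read(1);   /* save febe0004, 0    */

        bar_write(0, 0xffffffff);                      /* size low half       */
        uint32_t lo_mask = bar_read(0);                /* reads back ffffc004 */
        bar_write(0, lo);                              /* restore low half    */

        bar_write(1, 0xffffffff);                      /* size high half      */
        /* Until the high half is restored, the BAR sits at fffffffffebe0004,
         * so the region mapped from it is fffffffffebe0000 - fffffffffebe3fff. */
        printf("transient BAR: 0x%016llx\n", (unsigned long long)bar);
        uint32_t hi_mask = bar_read(1);
        bar_write(1, hi);                              /* restore high half   */

        uint64_t size_mask = ((uint64_t)hi_mask << 32) | (lo_mask & ~0xfu);
        printf("BAR size: 0x%llx\n", (unsigned long long)(~size_mask + 1));
        return 0;
    }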
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>>
> >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed?  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such.  PCI spec is explicit that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>>
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>>
> >>>>>>> The physical chipset?  Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>>
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>>
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>>
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>>
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>
> >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>>>> Long term, add an ioctl to query it.
> >>>>>>
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
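For what it's worth, the CPU half of those "loose ties" is easy to read on an
x86 host; a minimal sketch using the compiler's <cpuid.h> helpers (this gives
the CPU's MAXPHYADDR, not the IOMMU's limit, which is the number that actually
matters here and still has no query interface):

    #include <cpuid.h>
    #include <stdio.h>

    /* Read MAXPHYADDR from CPUID.80000008H:EAX[7:0]; fall back to a
     * conservative 36 bits if that leaf is not available. */
    static unsigned host_phys_bits(void)
    {
        unsigned eax, ebx, ecx, edx;

        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            return eax & 0xff;
        }
        return 36;
    }

    int main(void)
    {
        printf("host physical address bits: %u\n", host_phys_bits());
        return 0;
    }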
> >>>>>>
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>>
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>> building ACPI tables though.
> >>>>>
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>>
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>>
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>>
> >>>>> That seems very wrong.  It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing.  The limitation is on the
> >>>>> host side; there's no real reason to tie it to the target.
> >>>
> >>> I doubt vfio would be the only thing broken in that case.
> >>>
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>>
> >>>>> Short term, just assume 48 bits on x86.
> >>>
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>>
> >>>   MemoryRegionSection.offset_within_address_space >
> >>>   ~MemoryRegionSection.offset_within_address_space
> >>>
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address the full 64 bit range.
> >>>>
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get
> >>>> mapped into.  Unlike x86, where the full physical address range can be
> >>>> overlaid by BARs.
> >>>>
> >>>> Or did I misunderstand the question?
> >>>
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >>
> >> I thought sizing is hard defined as setting the BAR to
> >> -1?  Can't we check for that one special case and treat it as "not mapped,
> >> but tell the guest the size in config space"?
> >>
> >> Alex
> >
> > We already have a work-around like this and it works for 32 bit BARs
> > or after software writes the full 64 bit register:
> >
> >     if (last_addr <= new_addr || new_addr == 0 ||
> >         last_addr == PCI_BAR_UNMAPPED) {
> >         return PCI_BAR_UNMAPPED;
> >     }
> >
> >     if (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
> >         return PCI_BAR_UNMAPPED;
> >     }
> >
> >
> > But for 64 bit BARs the software writes all 1's
> > in the high 32 bit register before writing the low register
> > (see trace above).
> > This makes it impossible to distinguish between
> > setting the BAR at fffffffffebe0000 and this intermediate sizing step.
> 
> Well, at least according to the AMD manual there's only support for 52 bits
> of physical address space:
> 
>   • Long Mode - This mode is unique to the AMD64 architecture.  This mode
>     supports up to 4 petabytes of physical-address space using 52-bit
>     physical addresses.
> 
> Intel seems to agree:
> 
>   • CPUID.80000008H:EAX[7:0] reports the physical-address width supported
>     by the processor.  (For processors that do not support CPUID function
>     80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1
>     and 32 otherwise.)  This width is referred to as MAXPHYADDR.
>     MAXPHYADDR is at most 52.
> 
> Of course there's potential for future extensions to allow for more bits,
> but at least the current generation x86_64 (and x86) specification clearly
> only supports 52 bits of physical address space.  And non-x86(_64) don't
> care about bigger address spaces either because they use BAR windows which
> are very unlikely to grow bigger than 52 bits ;).
> 
> 
> Alex

Yes, but that's from the CPU's point of view.
I think that devices can still access each other's BARs
using full 64 bit addresses.

-- 
MST
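For completeness, the filter Alex Williamson proposed above (skip any section
whose offset_within_address_space is greater than its own complement, i.e. the
top bit is set) comes down to a check like the sketch below.  The struct and
the helper name are stand-ins, not QEMU's MemoryRegionSection or the real vfio
listener:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t hwaddr;

    struct section {                        /* stand-in for MemoryRegionSection  */
        hwaddr offset_within_address_space;
        bool   is_ram;                      /* memory_region_is_ram(section->mr) */
    };

    static bool dma_map_skip_section(const struct section *s)
    {
        if (!s->is_ram) {
            return true;    /* plain MMIO is already filtered out today */
        }
        /* x > ~x is true exactly when the top bit of x is set, so this skips
         * the transient fffffffffebe0000 mapping from 64-bit BAR sizing while
         * leaving ordinary RAM and low MMIO mappings alone. */
        return s->offset_within_address_space > ~s->offset_within_address_space;
    }

    int main(void)
    {
        struct section ok  = { 0x00000000febe0000ULL, true };
        struct section bad = { 0xfffffffffebe0000ULL, true };

        printf("%d %d\n", dma_map_skip_section(&ok),    /* prints: 0 1 */
                          dma_map_skip_section(&bad));
        return 0;
    }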