Date: Thu, 23 Jun 2016 01:15:57 +0200
From: Andrea Arcangeli
To: "Michael S. Tsirkin"
Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum, "Dr. David Alan Gilbert", qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
Message-ID: <20160622231557.GP30202@redhat.com>
In-Reply-To: <20160622224042.GA29641@redhat.com>

On Thu, Jun 23, 2016 at 01:40:42AM +0300, Michael S. Tsirkin wrote:
> Where's a problem then?

If EPT/NPT is enabled, the guest pagetables are walked by the hardware, not by the KVM shadow MMU in software. The hardware speaks host phys bits, and AFAIK it will behave differently depending on the host phys bits. In fact the guest could probe the host phys bits anyway.

Now, the breakage with guest phys bits < host phys bits happens only with EPT/NPT, and only if the guest, instead of "probing" the host phys bits, just runs cpuid and assumes the value it receives matches the effect of a "probe". The guest could then do important stuff depending on the guest phys bits returned by cpuid (e.g. expecting a GPF which won't materialize if the host phys bits are > guest phys bits).

The guest must do something weird for any breakage to happen (notably changing the pagetable format as a function of the cpuid return value). The guest could of course also be changed to stop being weird, and then it wouldn't break anymore.

So, just in case there's a weird proprietary OS like that, we can still add a -cpu=force_host_phys_bits fallback to prevent the discrepancy between cpuid and the probing effect, in turn eliminating any risk of guest failures. (But then we should also prevent live migration if source host phys bits != destination host phys bits, to provide the same guarantee to the weird guest across live migration.)

> So I think that all we need is a way to let libvirt control
> the _CRS range. Teach it that _CRS must fit within what
> host can support. Also check and fail kvm init if _CRS exceeds
> what host can support.

Right. The production solution is such a simple patch that I certainly agree it can be applied first, along with the mtrr fix. The complexity of dealing with _CRS, and of teaching all the upper layers about this subtle phys bits detail so they can calculate the highest possible guest physical address, is what makes the production solution attractive in the short term.

Then, if we implement the "soft" guest phys bits exercise, that is all about adding robustness to live migration (and save/restore). So, for it not to be futile, it'd be nice if the phys bits checks were all contained inside qemu. Initially libvirt/ovirt/OpenStack would just return a generic live migration error to the user in the unlikely case of a phys bits mismatch during live migration or restore (i.e. "soft" guest phys bits > destination host phys bits). That will still spare us weird bug reports and, way more important, it'll avoid any risk of unexpected guest crashes for customers.

The managers that load-balance guests in the cloud can still, if they want, do their own host/guest phys bits calculation matching the qemu internal one, and so guarantee themselves that they'll never run into the qemu live migration error because of too low destination host phys bits (either that, or they can check for a proper error from the migration command).

Thanks,
Andrea