From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:43693)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aarcange@redhat.com>) id 1bFrNN-0007eR-4T
	for qemu-devel@nongnu.org; Wed, 22 Jun 2016 19:16:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <aarcange@redhat.com>) id 1bFrNI-00029T-RD
	for qemu-devel@nongnu.org; Wed, 22 Jun 2016 19:16:03 -0400
Received: from mx1.redhat.com ([209.132.183.28]:49910)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aarcange@redhat.com>) id 1bFrNI-00029P-JD
	for qemu-devel@nongnu.org; Wed, 22 Jun 2016 19:16:00 -0400
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
	(int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id E606D3B730
	for <qemu-devel@nongnu.org>; Wed, 22 Jun 2016 23:15:59 +0000 (UTC)
Date: Thu, 23 Jun 2016 01:15:57 +0200
From: Andrea Arcangeli <aarcange@redhat.com>
Message-ID: <20160622231557.GP30202@redhat.com>
References: <20160616202449.GY18662@thinpad.lan.raisama.net>
	<20160617081505.GA2273@work-vm>
	<20160617131815.GA18662@thinpad.lan.raisama.net>
	<a9a5a663-91d4-b8bc-7502-6e9520a0763f@redhat.com>
	<20160617151900.GE18662@thinpad.lan.raisama.net>
	<deb30133-e84e-2c09-cc22-b34d1e9f3d90@redhat.com>
	<20160617154905.GH18662@thinpad.lan.raisama.net>
	<20160621194440.GN17952@thinpad.lan.raisama.net>
	<9b76415a-23e6-3ded-4dbc-42838cc164b0@redhat.com>
	<20160622224042.GA29641@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160622224042.GA29641@redhat.com>
Subject: Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5]
 x86: Allow physical address bits to be set)
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, Eduardo Habkost <ehabkost@redhat.com>, Marcel Apfelbaum <marcel@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, qemu-devel@nongnu.org

On Thu, Jun 23, 2016 at 01:40:42AM +0300, Michael S. Tsirkin wrote:
> Where's a problem then?

If EPT/NPT is enabled, the guest pagetables are parsed by the hardware
and not by the KVM shadow MMU in software. The hardware speaks host
phys bits and AFIK the hardware will behave different depending on the
host phys bits.

In fact the guest could probe those the host phys bits anyway.

Now the breakage with guest phys bits < host phys bits happens only
with EPT/NPT if the guest instead of "probing" the host phys bits, it
just runs cpuid and it assumes the value it receives would be the same
as the effect of a "probe".

Then guest could assume the probing effect would match the guest phys
bits returned by the guest cpuid insn, and do important stuff in
function of that (i.e. expecting a GPF which won't materialize if the
host phys bits is > guest phys bits).

The guest must do somewhat weird for any breakage to happen (notably
changing pagetable format in function of cpuid retval). The guest of
course could also be changed to stop being weird and then it wouldn't
break anymore.

So just in case there's any weird proprietary OS like that, we can
still add a -cpu=force_host_phys_bits fallback, to prevent the
discrepancy between cpuid and probing effect, in turn eliminating any
risk of guest failures (but then we should also prevent live migration
if source host phys bits != destination host phys bits to provide the
same guarantee to the weird guest, through live migration).

> So I think that all we need is a way to let libvirt control
> the _CRS range. Teach it that _CRS must fit within what
> host can support. Also check and fail kvm init if _CRS exceeds
> what host can support.

Right.

The production solution is such a simple patch that I certainly agree
it can be applied first, along with the mtrr fix.

The complexity in dealing with _CSR and all the up layers about this
subtle phys bits detail, to calculate the highest possible guest
physical address, is what makes the production solution attractive in
the short term.

Then if we implement the "soft" guest phys bits exercise, that is all
about adding robustness to live migration (and save/restore). So for
it not to risk to be futile, it'd be nice if the phys bits checks were
all contained inside qemu.

Initially libvirt/ovirt/OpenStack would just return some live
migration generic error to the user, in the unlikely case there's a
phys bits mismatch during the live migration or restore (i.e. "soft"
guest phys bits > destination host phys bits). That will still avoid
us getting weird bugreports and way more important it'll avoid any
risk of customer unexpected guest crashes.

The managers that load-balance the load in the cloud if they want they
can still do their own calculation on the host/guest phys bits
matching the qemu internal calculation and guarantee themselves that
they'll never run into the qemu live migration error because of too
low destination host phys bits (either that or they can check a proper
error from the migration command).

Thanks,
Andrea