Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Paolo Bonzini <pbonzini@redhat.com>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	qemu-devel@nongnu.org, Gleb Natapov <gleb@redhat.com>,
	qemu-stable@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
Date: Wed, 17 Jul 2013 16:01:55 +0200	[thread overview]
Message-ID: <51E6A3D3.4020908@redhat.com> (raw)
In-Reply-To: <20130717133721.GH11420@otherpad.lan.raisama.net>

Il 17/07/2013 15:39, Eduardo Habkost ha scritto:
> On Wed, Jul 17, 2013 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> Il 16/07/2013 21:42, Eduardo Habkost ha scritto:
>>> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
>>>> Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
>>>>> For physical bit size, what about extending it in a backwards-compatible
>>>>> way? Something like this:
>>>>>
>>>>>     *eax = 0x0003000; /* 48 bits virtual */
>>>>>     if (ram_size < 1TB) {
>>>>>         physical_size = 40; /* Keeping backwards compatibility */
>>>>>     } else if (ram_size < 4TB) {
>>>>>         physical_size = 42;
>>>>
>>>> Why not go straight up to 44?
>>>
>>> I simply trusted the comment saying: "The physical address space is
>>> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
>>> somewhere else.
>>
>> Yeah, that's obsolete.  We now can go up to 64 (but actually only
>> support 52 because that's what Intel says will be the limit----4PB RAM
>> should be good for everyone, as Bill Gates used to say).
>>
>> So far Intel has been upgrading the physical RAM size in x16 steps
>> (MAXPHYADDR was 36, then 40, then 44).  MAXPHYADDR is how Intel calls
>> what you wrote as physical_size.
> 
> Then if ram_size is too large, we could round it up to a multiple of 4?

Yeah, that would be closer to real processors.

>>      if (supported_host_physical_size() < msb(ram_size)) {
>>          abort();
>>      }
> 
> What if the host max physical size is smaller than the MAXPHYADDR we're
> setting for the VM (but still larger than msb(ram_size))? Will the CPU
> or KVM complain, or will it work?

That would only happen with a buggy guest that points page tables to
non-existent RAM, i.e.:

- host has MAXPHYADDR = 36, corresponding to 64 GB physical address space

- guest has 2 GB RAM, but still we set MAXPHYADDR = 40 for backwards
compatibility

- guest creates a page table for RAM beyond 64GB just because it can

KVM would complain, real hardware probably would return all-ones or
something like that.  I think we can ignore it.

> In other words, do we need a check for
>   (supported_host_physical_size() < physical_size)
> below, as well?

That would have to be conditional on !some_compat_prop, too, because
MAXPHYADDR=40 is beyond what older VMX-enabled CPUs support (for example
Core 2 has MAXPHYADDR=36).  I don't think it's important though.  The
less stringent test vs. msb(ram_size) should be enough.

>>      if (ram_size < 64GB && !some_compat_prop) {
>>          physical_size = 36;
>>      } else if (ram_size < 1TB) {
>>          physical_size = 40;
>>      } else {
>>          physical_size = 44;
>>      }
>>
>> What do you think?
> 
> Why stop at 44? What if ram_size is larger than (1 << 44)?

You can certainly add more.  However, no processors exist yet that have
MAXPHYADDR > 44, so you would have already failed the check above.  I
didn't go beyond 44 because who knows, maybe Intel will go from 44 to 46
and break the tradition of increasing by 4.

> We must never start a VM with physical_size > msb(ram_size), right?
> 
> But I am not sure if we should we simply increase physical_size
> automatically, or abort on some cases where we physical_size ends up
> being too small. (I believe we should simply increase it automatically)

Increasing automatically is fine, that's basically what those "if"s do.

>>>> This makes sense too.  Though the best would be of course to use CPUID
>>>> values coming from the real processors, and only using 40 for backwards
>>>> compatibility.
>>>
>>> We can't use the values coming from the real processors directly, or
>>> we will break live migration.
>>
>> I said real processors, not host processors. :)
> 
> So, you mean setting per-cpu-model values?

Yes.

>> So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has 44,
>> and so on.
> 
> That would work as well (and on pc-1.5 and older we could keep
> physical_size=40 or use the above algorithm). But: I wonder if it would
> be usable. If somebody is using "-cpu Nehalem -m 8T", I believe it would
> make sense to increase the physical address size automatically instead
> of aborting. What do you think?

I think I agree.  This leaf is used so little by the guest (the
hypervisor uses it) that mimicking real CPU models in everything is of
little value.

> If the VM is already broken and has MAXPHYADDR < msb(ram_size), I don't
> worry about ABI stability, because those VMs are not even supposed to be
> running properly. If somebody have such a broken VM running, it seems
> better to make the MAXPHYADDR bits suddenly change to a reasonable value
> (so that MAXPHYADDR >= msb(ram_size)) during live-migration than
> aborting migration, or keeping the broken value.
> 
> So if we use an algorithm that always increase MAXPHYADDR automatically
> (breaking the ABI in the cases where the current is already broken), we
> are not going to abort migration unless the host really can't run our
> guest (this seems to be a less serious problem than aborting because the
> VM configuration is broken and needs to be manually adjusted).

Agreed.  So perhaps per-cpu-model values aren't too useful.

Paolo

next prev parent reply	other threads:[~2013-07-17 14:02 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-16 17:22 [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests Andrea Arcangeli
2013-07-16 17:26 ` Paolo Bonzini
2013-07-16 17:38 ` Eduardo Habkost
2013-07-16 17:46   ` Paolo Bonzini
2013-07-16 17:48     ` Paolo Bonzini
2013-07-16 18:06       ` Andrea Arcangeli
2013-07-16 18:11     ` Eduardo Habkost
2013-07-16 19:24       ` Paolo Bonzini
2013-07-16 19:42         ` Eduardo Habkost
2013-07-17  8:09           ` Paolo Bonzini
2013-07-17 13:39             ` Eduardo Habkost
2013-07-17 14:01               ` Paolo Bonzini [this message]
2013-07-17 15:19           ` Gleb Natapov
2013-07-17 21:20             ` Eduardo Habkost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=51E6A3D3.4020908@redhat.com \
    --to=pbonzini@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=ehabkost@redhat.com \
    --cc=gleb@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-stable@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).