All of lore.kernel.org
 help / color / mirror / Atom feed
From: Paolo Bonzini <pbonzini@redhat.com>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	qemu-devel@nongnu.org, Gleb Natapov <gleb@redhat.com>,
	qemu-stable@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
Date: Wed, 17 Jul 2013 16:01:55 +0200	[thread overview]
Message-ID: <51E6A3D3.4020908@redhat.com> (raw)
In-Reply-To: <20130717133721.GH11420@otherpad.lan.raisama.net>

Il 17/07/2013 15:39, Eduardo Habkost ha scritto:
> On Wed, Jul 17, 2013 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> Il 16/07/2013 21:42, Eduardo Habkost ha scritto:
>>> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
>>>> Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
>>>>> For physical bit size, what about extending it in a backwards-compatible
>>>>> way? Something like this:
>>>>>
>>>>>     *eax = 0x0003000; /* 48 bits virtual */
>>>>>     if (ram_size < 1TB) {
>>>>>         physical_size = 40; /* Keeping backwards compatibility */
>>>>>     } else if (ram_size < 4TB) {
>>>>>         physical_size = 42;
>>>>
>>>> Why not go straight up to 44?
>>>
>>> I simply trusted the comment saying: "The physical address space is
>>> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
>>> somewhere else.
>>
>> Yeah, that's obsolete.  We now can go up to 64 (but actually only
>> support 52 because that's what Intel says will be the limit----4PB RAM
>> should be good for everyone, as Bill Gates used to say).
>>
>> So far Intel has been upgrading the physical RAM size in x16 steps
>> (MAXPHYADDR was 36, then 40, then 44).  MAXPHYADDR is how Intel calls
>> what you wrote as physical_size.
> 
> Then if ram_size is too large, we could round it up to a multiple of 4?

Yeah, that would be closer to real processors.

>>      if (supported_host_physical_size() < msb(ram_size)) {
>>          abort();
>>      }
> 
> What if the host max physical size is smaller than the MAXPHYADDR we're
> setting for the VM (but still larger than msb(ram_size))? Will the CPU
> or KVM complain, or will it work?

That would only happen with a buggy guest that points page tables to
non-existent RAM, i.e.:

- host has MAXPHYADDR = 36, corresponding to 64 GB physical address space

- guest has 2 GB RAM, but still we set MAXPHYADDR = 40 for backwards
compatibility

- guest creates a page table for RAM beyond 64GB just because it can

KVM would complain, real hardware probably would return all-ones or
something like that.  I think we can ignore it.

> In other words, do we need a check for
>   (supported_host_physical_size() < physical_size)
> below, as well?

That would have to be conditional on !some_compat_prop, too, because
MAXPHYADDR=40 is beyond what older VMX-enabled CPUs support (for example
Core 2 has MAXPHYADDR=36).  I don't think it's important though.  The
less stringent test vs. msb(ram_size) should be enough.

>>      if (ram_size < 64GB && !some_compat_prop) {
>>          physical_size = 36;
>>      } else if (ram_size < 1TB) {
>>          physical_size = 40;
>>      } else {
>>          physical_size = 44;
>>      }
>>
>> What do you think?
> 
> Why stop at 44? What if ram_size is larger than (1 << 44)?

You can certainly add more.  However, no processors exist yet that have
MAXPHYADDR > 44, so you would have already failed the check above.  I
didn't go beyond 44 because who knows, maybe Intel will go from 44 to 46
and break the tradition of increasing by 4.

> We must never start a VM with physical_size > msb(ram_size), right?
> 
> But I am not sure if we should we simply increase physical_size
> automatically, or abort on some cases where we physical_size ends up
> being too small. (I believe we should simply increase it automatically)

Increasing automatically is fine, that's basically what those "if"s do.

>>>> This makes sense too.  Though the best would be of course to use CPUID
>>>> values coming from the real processors, and only using 40 for backwards
>>>> compatibility.
>>>
>>> We can't use the values coming from the real processors directly, or
>>> we will break live migration.
>>
>> I said real processors, not host processors. :)
> 
> So, you mean setting per-cpu-model values?

Yes.

>> So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has 44,
>> and so on.
> 
> That would work as well (and on pc-1.5 and older we could keep
> physical_size=40 or use the above algorithm). But: I wonder if it would
> be usable. If somebody is using "-cpu Nehalem -m 8T", I believe it would
> make sense to increase the physical address size automatically instead
> of aborting. What do you think?

I think I agree.  This leaf is used so little by the guest (the
hypervisor uses it) that mimicking real CPU models in everything is of
little value.

> If the VM is already broken and has MAXPHYADDR < msb(ram_size), I don't
> worry about ABI stability, because those VMs are not even supposed to be
> running properly. If somebody have such a broken VM running, it seems
> better to make the MAXPHYADDR bits suddenly change to a reasonable value
> (so that MAXPHYADDR >= msb(ram_size)) during live-migration than
> aborting migration, or keeping the broken value.
> 
> So if we use an algorithm that always increase MAXPHYADDR automatically
> (breaking the ABI in the cases where the current is already broken), we
> are not going to abort migration unless the host really can't run our
> guest (this seems to be a less serious problem than aborting because the
> VM configuration is broken and needs to be manually adjusted).

Agreed.  So perhaps per-cpu-model values aren't too useful.

Paolo

  reply	other threads:[~2013-07-17 14:02 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-16 17:22 [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests Andrea Arcangeli
2013-07-16 17:26 ` Paolo Bonzini
2013-07-16 17:38 ` Eduardo Habkost
2013-07-16 17:46   ` Paolo Bonzini
2013-07-16 17:48     ` Paolo Bonzini
2013-07-16 18:06       ` Andrea Arcangeli
2013-07-16 18:11     ` Eduardo Habkost
2013-07-16 19:24       ` Paolo Bonzini
2013-07-16 19:42         ` Eduardo Habkost
2013-07-17  8:09           ` Paolo Bonzini
2013-07-17 13:39             ` Eduardo Habkost
2013-07-17 14:01               ` Paolo Bonzini [this message]
2013-07-17 15:19           ` Gleb Natapov
2013-07-17 21:20             ` Eduardo Habkost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=51E6A3D3.4020908@redhat.com \
    --to=pbonzini@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=ehabkost@redhat.com \
    --cc=gleb@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-stable@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.