qemu-devel.nongnu.org archive mirror
From: George Dunlap <george.dunlap@eu.citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Yongjie Ren <yongjie.ren@intel.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Keir Fraser <keir@xen.org>, Hanweidong <hanweidong@huawei.com>,
	Xudong Hao <xudong.hao@intel.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Tim Deegan <tim@xen.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Yanqiangjun <yanqiangjun@huawei.com>,
	Wangzhenguo <wangzhenguo@huawei.com>,
	YangXiaowei <xiaowei.yang@huawei.com>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	Jan Beulich <JBeulich@suse.com>,
	YongweiX Xu <yongweix.xu@intel.com>,
	Luonengjun <luonengjun@huawei.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	SongtaoX Liu <songtaox.liu@intel.com>
Subject: Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
Date: Fri, 14 Jun 2013 15:14:23 +0100	[thread overview]
Message-ID: <51BB253F.8060706@eu.citrix.com> (raw)
In-Reply-To: <1371209687.3967.4.camel@zakaz.uk.xensource.com>

On 14/06/13 12:34, Ian Campbell wrote:
> On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
>> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>>>
>>>>>> We could have a xenstore flag somewhere that enables the old behaviour
>>>>>> so that people can revert back to qemu-xen-traditional and make the pci
>>>>>> hole below 4G even bigger than 448MB, but I think that keeping the old
>>>>>> behaviour around is going to make the code more difficult to maintain.
>>>>> The downside of that is that things which worked with the old scheme may
>>>>> not work with the new one though. Early in a release cycle when we have
>>>>> time to discover what has broken then that might be OK, but is post rc4
>>>>> really the time to be risking it?
>>>> Yes, you are right: there are some scenarios that would have worked
>>>> before that wouldn't work anymore with the new scheme.
>>>> Are they important enough to have a workaround, pretty difficult to
>>>> identify for a user?
>>> That question would be reasonable early in the development cycle. At rc4
>>> the question should be: do we think this problem is so critical that we
>>> want to risk breaking something else which currently works for people.
>>>
>>> Remember that we are invalidating whatever passthrough testing people
>>> have already done up to this point of the release.
>>>
>>> It is also worth noting that the things which this change ends up
>>> breaking may for all we know be equally difficult for a user to identify
>>> (they are after all approximately the same class of issue).
>>>
>>> The problem here is that the risk is difficult to evaluate, we just
>>> don't know what will break with this change, and we don't know therefore
>>> if the cure is worse than the disease. The conservative approach at this
>>> point in the release would be to not change anything, or to change the
>>> minimal possible number of things (which would preclude changes which
>>> impact qemu-trad IMHO).
>>>
>>
>>> WRT pretty difficult to identify -- the root of this thread suggests the
>>> guest entered a reboot loop with "No bootable device", that sounds
>>> eminently release-notable to me. I also note that it was changing the
>>> size of the PCI hole which caused the issue -- which does somewhat
>>> underscore the risks involved in this sort of change.
>> But that bug was a bug in the first attempt to fix the root problem.
>> The root problem shows up as qemu crashing at some point because it
>> tried to access invalid guest gpfn space; see
>> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
>>
>> Stefano tried to fix it with the above patch, just changing the hole
>> to start at 0xe; but that was incomplete, as it didn't match with
>> hvmloader and seabios's view of the world.  That's what this bug
>> report is about.  This thread is an attempt to find a better fix.
>>
>> So the root problem is that if we revert this patch, and someone
>> passes through a pci device using qemu-xen (the default) and the MMIO
>> hole is resized, at some point in the future qemu will randomly die.
> Right, I see, thanks for explaining.
>
>> If it's a choice between users experiencing, "My VM randomly crashes"
>> and experiencing, "I tried to pass through this device but the guest
>> OS doesn't see it", I'd rather choose the latter.
> All other things being equal, obviously we all would. But the point I've
> been trying to make is that we don't know the other consequences of
> making that fix -- e.g. on existing working configurations. So the
> choice is "some VMs randomly crash, but other stuff works fine and we
> have had a reasonable amount of user testing" and "those particular VMs
> don't crash any more, but we don't know what other stuff no longer works
> and the existing test base has been at least partially invalidated".
>
> I think that at post rc4 in a release we ought to be being pretty
> conservative about the risks of this sort of change, especially wrt
> invalidating testing and the unknowns involved.
>
> Aren't the configurations which might trip over this issue going to
> be in the minority compared to those which we risk breaking?

So there are the technical proposals we've been discussing, each of 
which has different risks.

1. Set the default MMIO hole size to 0xe0000000.
2. If possible, relocate PCI devices that don't fit in the hole to the 
64-bit hole.
  - Here "if possible" will mean a) the device has a 64-bit BAR, and b) 
this hasn't been disabled by libxl (probably via a xenstore key).
3. If possible, resize the MMIO hole; otherwise refuse to map the device
  - Currently "if possible" is always true; the new thing here would be 
making it possible for libxl to disable this, probably via a xenstore key.
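To make the interaction between #2 and #3 concrete, here is a rough sketch of the placement policy being proposed (Python pseudocode; the function, constants, and flag names are illustrative stand-ins, not hvmloader's actual interface -- the flags stand for whatever libxl would set via xenstore):

```python
# Hypothetical sketch of the proposed BAR-placement policy (#2 and #3).
# All names here are illustrative, not hvmloader's real code.

PCI_HOLE_BASE_DEFAULT = 0xF0000000   # usual start of the 32-bit MMIO hole
PCI_HOLE_BASE_MIN     = 0xE0000000   # lowest base the hole may be resized down to

def place_device(bar_size, bar_is_64bit, hole_base, hole_free,
                 allow_64bit_reloc, allow_hole_resize):
    """Return (placement, new_hole_base) for one BAR."""
    if bar_size <= hole_free:
        return ("32-bit hole", hole_base)
    # Option #2: relocate to the 64-bit hole, if the BAR is 64-bit capable
    # and libxl has not disabled relocation (e.g. for qemu-traditional).
    if bar_is_64bit and allow_64bit_reloc:
        return ("64-bit hole", hole_base)
    # Option #3: grow the 32-bit hole if permitted; otherwise refuse to map,
    # so the guest sees "no device" rather than qemu crashing later.
    deficit = bar_size - hole_free
    if allow_hole_resize and hole_base - deficit >= PCI_HOLE_BASE_MIN:
        return ("32-bit hole (resized)", hole_base - deficit)
    return ("not mapped", hole_base)
```

The key point the sketch captures is the ordering: relocation (#2) is tried before resizing (#3), and when neither is allowed the device is left unmapped instead of silently overlapping guest RAM.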

Each of these will have different risks for qemu-traditional and qemu-xen.

Implementing #3 would have no risk for qemu-traditional, because we 
won't be changing the way anything works; what works will still work, 
what is broken (if anything) will still be broken.

Implementing #3 for qemu-xen only changes one kind of failure for 
another.  If you resize the MMIO hole for qemu-xen, then you *will* 
eventually crash.  So this will not break existing working 
configurations -- it will only change the failure from "qemu crashes at 
some point" to "the guest OS cannot see the device".  This is a uniform 
improvement.

So #3 is very low risk, as far as I can tell, and has a solid benefit.  
I think we should definitely implement it.

I think #2 should have no impact on qemu-traditional, because xl should 
disable it by default.

For qemu-xen, the only devices that are relocated are devices that would 
otherwise be disabled by #3; so remember that the alternative for this one 
(assuming we implement #3) is "not visible to the OS". There are several 
potential sub-cases here:
  a. Guest OSes that can't access the 64-bit region.  In that case the 
device will not be visible, which is the same failure as not 
implementing this at all.
  b. I think most devices and operating systems will just work; this 
codepath in QEMU and guest OSes is fairly well tested with KVM.
  c. It may be that there are devices that would have worked with 
qemu-traditional when placed in a resized MMIO hole, but that will break 
with qemu-xen when placed in the 64-bit MMIO hole.  For these 
devices, the failure mode will change from "not visible to guest OS" to 
"fails in some other unforeseen way".
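For background on the "if possible" condition in #2: whether a device can be relocated above 4GiB at all is determined by the type bits of its BAR in PCI config space (for a memory BAR, bit 0 is 0 and bits [2:1] encode the width, with 0b10 meaning 64-bit). A minimal sketch of that check (the helper name is hypothetical):

```python
def bar_is_64bit_memory(bar_value):
    """True if this BAR is a 64-bit memory BAR, i.e. placeable above 4GiB."""
    if bar_value & 0x1:                    # bit 0 set: I/O BAR, always 32-bit
        return False
    return (bar_value >> 1) & 0x3 == 0x2   # type field 0b10: 64-bit memory BAR
```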

The main risk I see from this one is c.  However, I think from a 
cost-benefit perspective it's still pretty low -- we get the benefit of most 
people transparently being able to just use pci-passthrough, at the 
potential cost of a handful of people having "weird crash" failures 
instead of "can't use the device" failures.

I suppose that there is a potential risk for b as well, in that we 
haven't tested relocating to the 64-bit MMIO hole *with Xen*.  There may 
be assumptions about 32-bit paddrs baked in somewhere that we don't know 
about.  If we implement this change on top of #3, and there are problems 
with b, then it will change "device not visible to guest OS" failures 
back into "may crash in a weird way" failures.

On the whole I would be inclined to implement #2 if it's not too 
difficult, but I can certainly see the point of saying that it's too 
risky and that we shouldn't do it.

Then there's #1.  This should in theory be low-risk, because in theory 
hvmloader might have chosen 0xe as the start of the MMIO hole anyway.

For qemu-traditional, this change has no benefits.  The benefits for 
qemu-xen depend on what else gets implemented.  If nothing else is 
implemented, this changes some "random qemu crash" failures into 
successes (while leaving other ones to keep crashing).  If #3 is 
implemented, then it changes some "guest OS can't see device" failures 
into successes.  If both #3 and #2 are implemented, then some of the
"guest OS can't see 64-bit MMIO hole" failures will change into successes.

However, as you say, the number of times hvmloader *actually* chooses 
that at the moment is fairly low.  The vast majority of VMs and 
configurations at the moment will *not* use 0xe as a base; at most only 
a handful of people will have tested that configuration.  So there is a 
fairly significant risk that there *is* some configuration for which, 
*had* hvmloader chosen 0xe, then it *would* have caused a problem.  For 
those configurations, #1 will change "will break but only if you have a 
very large PCI device" to "will break no matter what".

This would be fine if we hadn't started RCs; but we have.  Our most 
important userbase is people who are *not* doing pci pass-through; so I 
think I agree with you, that this introduces a probably unacceptable 
risk for very little gain -- particularly if we implement #3.

So my recommendation is:
* Implement #3
* Consider implementing #2
* Don't implement #1.

Thoughts?

  -George
