From: Anthony Wright <anthony@overnetdata.com>
To: Ian Campbell <Ian.Campbell@eu.citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
Keir Fraser <keir@xen.org>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
David Vrabel <david.vrabel@citrix.com>,
Todd Deshane <todd.deshane@xen.org>
Subject: Re: Kernel bug from 3.0 (was phy disks and vifs timing out in DomU)
Date: Fri, 23 Sep 2011 13:35:01 +0100 [thread overview]
Message-ID: <4E7C7CF5.1090907@overnetdata.com> (raw)
In-Reply-To: <1315045639.19389.1683.camel@dagon.hellion.org.uk>
On 03/09/2011 11:27, Ian Campbell wrote:
> On Fri, 2011-09-02 at 21:26 +0100, Jeremy Fitzhardinge wrote:
>> On 09/02/2011 12:17 AM, Ian Campbell wrote:
>>> On Thu, 2011-09-01 at 21:34 +0100, Jeremy Fitzhardinge wrote:
>>>> On 09/01/2011 12:21 PM, Ian Campbell wrote:
>>>>> On Thu, 2011-09-01 at 18:32 +0100, Jeremy Fitzhardinge wrote:
>>>>>> On 09/01/2011 12:42 AM, Ian Campbell wrote:
>>>>>>> On Wed, 2011-08-31 at 18:07 +0100, Konrad Rzeszutek Wilk wrote:
>>>>>>>> On Wed, Aug 31, 2011 at 05:58:43PM +0100, David Vrabel wrote:
>>>>>>>>> On 26/08/11 15:44, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>> So while I am still looking at the hypervisor code to figure out why
>>>>>>>>>> it would give me [when trying to map a grant page]:
>>>>>>>>>>
>>>>>>>>>> (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
>>>>>>>>> It is failing in guest_map_l1e() because the page for the vmalloc'd
>>>>>>>>> virtual address PTEs is not present.
>>>>>>>>>
>>>>>>>>> The test that fails is:
>>>>>>>>>
>>>>>>>>> (l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT
>>>>>>>>>
>>>>>>>>> I think this is because the GNTTABOP_map_grant_ref hypercall is done
>>>>>>>>> when task->active_mm != &init_mm and alloc_vm_area() only adds PTEs into
>>>>>>>>> init_mm so when Xen looks in the page tables it doesn't find the entries
>>>>>>>>> because they're not there yet.
>>>>>>>>>
>>>>>>>>> Putting a call to vmalloc_sync_all() after alloc_vm_area() and before
>>>>>>>>> the hypercall makes it work for me. Classic Xen kernels used to have
>>>>>>>>> such a call.
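
[The fix being described here can be sketched as the following hypothetical
kernel-side fragment; it is not the actual patch, and error handling and the
grant-table setup are elided:]

/*
 * alloc_vm_area() installs the new PTEs only in init_mm, and Xen
 * cannot fault them into the current page table while it is inside a
 * hypercall, so sync them into all page tables first.
 */
struct vm_struct *area;

area = alloc_vm_area(PAGE_SIZE);        /* PTEs land in init_mm only */
if (!area)
        return -ENOMEM;

vmalloc_sync_all();                     /* propagate the PTEs everywhere */

/* ... now safe to issue GNTTABOP_map_grant_ref against area->addr ... */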
>>>>>>>> That sounds quite reasonable.
>>>>>>> I was wondering why upstream was missing the vmalloc_sync_all() in
>>>>>>> alloc_vm_area() since the out-of-tree kernels did have it and the
>>>>>>> function was added by us. I found this:
>>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ef691947d8a3d479e67652312783aedcf629320a
>>>>>>>
>>>>>>> commit ef691947d8a3d479e67652312783aedcf629320a
>>>>>>> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>>>>>>> Date: Wed Dec 1 15:45:48 2010 -0800
>>>>>>>
>>>>>>> vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
>>>>>>>
>>>>>>> There's no need for it: it will get faulted into the current pagetable
>>>>>>> as needed.
>>>>>>>
>>>>>>> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>>>>>>>
>>>>>>> The flaw in the reasoning here is that you cannot take a kernel fault
>>>>>>> while processing a hypercall, so hypercall arguments must have been
>>>>>>> faulted in beforehand and that is what the sync_all was for.
>>>>>> That's a good point. (Maybe Xen should have generated pagefaults when
>>>>>> hypercall arg pointers are bad...)
>>>>> I think it would be a bit tricky to do in practice, you'd either have to
>>>>> support recursive hypercalls in the middle of other hypercalls (because
>>>>> the page fault handler is surely going to want to do some) or proper
>>>>> hypercall restart (so you can fully return to guest context to handle
>>>>> the fault then retry) or something along those lines, complicating the
>>>>> hypervisor one way or another. Probably not impossible if you were
>>>>> building something from the ground up, but not trivial.
>>>> Well, Xen already has the continuation machinery for dealing with
>>>> hypercall restart, so that could be reused.
>>> That requires special support beyond just calling the continuation in
>>> each hypercall (often extending into the ABI) for pickling progress and
>>> picking it up again, only a small number of (usually long running)
>>> hypercalls have that support today. It also uses the guest context to
>>> store the state which perhaps isn't helpful if you want to return to the
>>> guest, although I suppose building a nested frame would work.
>> I guess it depends on how many hypercalls do work before touching guest
>> memory, but any hypercall should be like that anyway, or at least be
>> able to wind back work done if a later read EFAULTs.
>>
>> I was vaguely speculating about a scheme on the lines of:
>>
>> 1. In copy_to/from_user, if we touch a bad address, save it in a
>> per-vcpu "bad_guest_addr"
>> 2. when returning to the guest, if the errno is EFAULT and
>> bad_guest_addr is set, then generate a memory fault frame with cr2 =
>> bad_guest_addr, and with the exception return restarting the hypercall
>>
>> Perhaps there should be a EFAULT_RETRY error return to trigger this
>> behaviour, rather than doing it for all EFAULTs, so the faulting
>> behaviour can be added incrementally.
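
[The scheme being speculated about here might look something like the
following; this is purely illustrative, none of these names exist in Xen,
and EFAULT_RETRY is the hypothetical new error code mentioned above:]

/* Hypothetical per-vcpu state for step 1. */
struct vcpu {
        /* ... */
        unsigned long bad_guest_addr;   /* last faulting guest address */
};

/* Step 1: the copy helpers record where the guest access failed. */
static int copy_from_guest_checked(struct vcpu *v, void *dst,
                                   const void *src, size_t len)
{
        if (!guest_range_ok(src, len)) {        /* hypothetical check */
                v->bad_guest_addr = (unsigned long)src;
                return -EFAULT_RETRY;           /* hypothetical errno */
        }
        raw_copy_from_guest(dst, src, len);
        return 0;
}

/* Step 2: on the return-to-guest path, convert the failure into a
 * guest page fault whose exception return restarts the hypercall. */
static void maybe_inject_fault(struct vcpu *v, long rc)
{
        if (rc == -EFAULT_RETRY && v->bad_guest_addr) {
                inject_page_fault(v, v->bad_guest_addr); /* cr2 = addr */
                rewind_guest_rip_to_hypercall(v);        /* retry on iret */
                v->bad_guest_addr = 0;
        }
}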
> The kernel uses -ERESTARTSYS for something similar, doesn't it?
>
> Does this scheme work if the hypercall causing the exception was itself
> running in an exception handler? I guess it depends on the architecture
> and OS's handling of nested faults.
>
>> Maybe this is a lost cause for x86, but perhaps it's worth considering
>> for new ports?
> Certainly worth thinking about.
>
>>> The guys doing paging and sharing etc looked into this and came to the
>>> conclusion that it would be intractably difficult to do this fully --
>>> hence we now have the ability to sleep in hypercalls, which works
>>> because the pager/sharer is in a different domain/vcpu.
>> Hmm. Were they looking at injecting faults back into the guest, or
>> forwarding "missing page" events off to another domain?
> Sharing and swapping are transparent to the domain, another domain runs
> the swapper/unshare process (actually, unshare might be in the h/v
> itself, not sure).
>
>>>> And accesses to guest
>>>> memory are already special events which must be checked so that EFAULT
>>>> can be returned. If, rather than failing with EFAULT Xen set up a
>>>> pagefault exception for the guest CPU with the return set up to retry
>>>> the hypercall, it should all work...
>>>>
>>>> Of course, if the guest isn't expecting that - or it's buggy - then it
>>>> could end up in an infinite loop. But maybe a flag (set a high bit in
>>>> the hypercall number?), or a feature, or something? Might be worthwhile
>>>> if it saves guests having to do something expensive (like a
>>>> vmalloc_sync_all), even if they have to also deal with old hypervisors.
>>> The vmalloc_sync_all is a pretty event even on Xen though, isn't it?
>> Looks like an important word is missing there. But it's very expensive,
>> if that's what you're saying.
> Oops. "rare" was the missing word.
Is there any progress on an official patch for this? I have my own
unofficial patch which places a vmalloc_sync_all() after every
alloc_vm_area() call and it works, but from the thread it sounds like
there should be a more sophisticated solution to the problem.
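
[The unofficial patch described here amounts to this pattern at each
alloc_vm_area() call site; a sketch only, with the real call sites being
the grant-table and xenbus users of alloc_vm_area():]

 	area = alloc_vm_area(PAGE_SIZE * nr_gframes);
 	if (area == NULL)
 		return -ENOMEM;
+	/*
+	 * Xen cannot fault vmalloc PTEs in during a hypercall, so make
+	 * sure they are present in all page tables before mapping.
+	 */
+	vmalloc_sync_all();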
Thread overview: 53+ messages
[not found] <29902981.10.1311837224851.JavaMail.root@zimbra.overnetdata.com>
2011-07-28 7:24 ` phy disks and vifs timing out in DomU Anthony Wright
2011-07-28 15:01 ` Todd Deshane
2011-07-28 15:36 ` Anthony Wright
2011-07-28 15:46 ` Todd Deshane
2011-07-28 16:00 ` Anthony Wright
2011-07-29 15:55 ` Konrad Rzeszutek Wilk
2011-07-29 18:40 ` Anthony Wright
2011-07-29 20:01 ` Konrad Rzeszutek Wilk
2011-07-30 17:05 ` Anthony Wright
2011-08-01 11:03 ` Anthony Wright
2011-07-28 16:28 ` Ian Campbell
2011-07-29 7:53 ` Kernel bug from 3.0 (was phy disks and vifs timing out in DomU) Anthony Wright
2011-08-03 15:28 ` Konrad Rzeszutek Wilk
2011-08-09 16:35 ` Konrad Rzeszutek Wilk
2011-08-19 10:22 ` Anthony Wright
2011-08-19 12:56 ` Konrad Rzeszutek Wilk
2011-08-22 11:02 ` Anthony Wright
2011-08-25 20:31 ` Anthony Wright
2011-08-26 14:26 ` Konrad Rzeszutek Wilk
2011-08-26 14:44 ` Konrad Rzeszutek Wilk
2011-08-29 12:13 ` Anthony Wright
2011-08-31 16:58 ` David Vrabel
2011-08-31 17:07 ` Konrad Rzeszutek Wilk
2011-09-01 7:42 ` Ian Campbell
2011-09-01 14:23 ` Konrad Rzeszutek Wilk
2011-09-01 15:12 ` David Vrabel
2011-09-01 15:37 ` Konrad Rzeszutek Wilk
2011-09-01 15:43 ` Ian Campbell
2011-09-01 16:07 ` Konrad Rzeszutek Wilk
2011-09-07 12:57 ` Anthony Wright
2011-09-07 18:35 ` Konrad Rzeszutek Wilk
2011-09-01 15:12 ` Ian Campbell
2011-09-01 15:38 ` Konrad Rzeszutek Wilk
2011-09-01 15:44 ` Ian Campbell
2011-09-01 17:34 ` Jeremy Fitzhardinge
2011-09-01 19:19 ` Ian Campbell
2011-09-01 17:32 ` Jeremy Fitzhardinge
2011-09-01 19:21 ` Ian Campbell
2011-09-01 20:34 ` Jeremy Fitzhardinge
2011-09-02 7:17 ` Ian Campbell
2011-09-02 20:26 ` Jeremy Fitzhardinge
2011-09-03 10:27 ` Ian Campbell
2011-09-23 12:35 ` Anthony Wright [this message]
2011-09-23 12:49 ` David Vrabel
2011-08-29 17:33 ` Anthony Wright
2011-08-25 21:11 ` Anthony Wright
2011-08-26 7:10 ` Sander Eikelenboom
2011-08-26 11:23 ` Pasi Kärkkäinen
2011-08-26 12:16 ` Stefano Stabellini
2011-08-26 12:15 ` Anthony Wright
2011-08-26 12:32 ` Stefano Stabellini
2011-07-29 15:48 ` phy disks and vifs timing out in DomU (only on certain hardware) Anthony Wright
2011-07-29 16:06 ` Konrad Rzeszutek Wilk