From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ian Campbell Subject: Re: Kernel bug from 3.0 (was phy disks and vifs timing out in DomU) Date: Sat, 3 Sep 2011 11:27:19 +0100 Message-ID: <1315045639.19389.1683.camel@dagon.hellion.org.uk> References: <4E31820C.5030200@overnetdata.com> <1311870512.24408.153.camel@cthulhu.hellion.org.uk> <4E3266DE.9000606@overnetdata.com> <20110803152841.GA2860@dumpdata.com> <4E4E3957.1040007@overnetdata.com> <20110819125615.GA26558@dumpdata.com> <4E56B132.9050708@overnetdata.com> <20110826142606.GA25511@dumpdata.com> <20110826144438.GA24836@dumpdata.com> <4E5E6843.7050206@citrix.com> <20110831170711.GB13642@dumpdata.com> <1314862972.28989.74.camel@zakaz.uk.xensource.com> <4E5FC197.7040004@goop.org> <1314904905.19389.28.camel@dagon.hellion.org.uk> <4E5FEC71.4090504@goop.org> <1314947875.28989.155.camel@zakaz.uk.xensource.com> <4E613BEF.30908@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <4E613BEF.30908@goop.org> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Jeremy Fitzhardinge Cc: "xen-devel@lists.xensource.com" , Keir Fraser , Anthony Wright , Konrad Rzeszutek Wilk , David Vrabel , Todd Deshane List-Id: xen-devel@lists.xenproject.org On Fri, 2011-09-02 at 21:26 +0100, Jeremy Fitzhardinge wrote: > On 09/02/2011 12:17 AM, Ian Campbell wrote: > > On Thu, 2011-09-01 at 21:34 +0100, Jeremy Fitzhardinge wrote: > >> On 09/01/2011 12:21 PM, Ian Campbell wrote: > >>> On Thu, 2011-09-01 at 18:32 +0100, Jeremy Fitzhardinge wrote: > >>>> On 09/01/2011 12:42 AM, Ian Campbell wrote: > >>>>> On Wed, 2011-08-31 at 18:07 +0100, Konrad Rzeszutek Wilk wrote: > >>>>>> On Wed, Aug 31, 2011 at 05:58:43PM +0100, David Vrabel wrote: > >>>>>>> On 26/08/11 15:44, Konrad Rzeszutek Wilk wrote: > >>>>>>>> So while I am still looking at the hypervisor code to figure out why > >>>>>>>> it would give me [when trying to map a grant page]: > >>>>>>>> > >>>>>>>> (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000 > >>>>>>> It is failing in guest_map_l1e() because the page for the vmalloc'd > >>>>>>> virtual address PTEs is not present. > >>>>>>> > >>>>>>> The test that fails is: > >>>>>>> > >>>>>>> (l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT > >>>>>>> > >>>>>>> I think this is because the GNTTABOP_map_grant_ref hypercall is done > >>>>>>> when task->active_mm != &init_mm and alloc_vm_area() only adds PTEs into > >>>>>>> init_mm so when Xen looks in the page tables it doesn't find the entries > >>>>>>> because they're not there yet. > >>>>>>> > >>>>>>> Putting a call to vmalloc_sync_all() after create_vm_area() and before > >>>>>>> the hypercall makes it work for me. Classic Xen kernels used to have > >>>>>>> such a call. > >>>>>> That sounds quite reasonable. > >>>>> I was wondering why upstream was missing the vmalloc_sync_all() in > >>>>> alloc_vm_area() since the out-of-tree kernels did have it and the > >>>>> function was added by us. I found this: > >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ef691947d8a3d479e67652312783aedcf629320a > >>>>> > >>>>> commit ef691947d8a3d479e67652312783aedcf629320a > >>>>> Author: Jeremy Fitzhardinge > >>>>> Date: Wed Dec 1 15:45:48 2010 -0800 > >>>>> > >>>>> vmalloc: remove vmalloc_sync_all() from alloc_vm_area() > >>>>> > >>>>> There's no need for it: it will get faulted into the current pagetable > >>>>> as needed. > >>>>> > >>>>> Signed-off-by: Jeremy Fitzhardinge > >>>>> > >>>>> The flaw in the reasoning here is that you cannot take a kernel fault > >>>>> while processing a hypercall, so hypercall arguments must have been > >>>>> faulted in beforehand and that is what the sync_all was for. > >>>> That's a good point. (Maybe Xen should have generated pagefaults when > >>>> hypercall arg pointers are bad...) > >>> I think it would be a bit tricky to do in practice, you'd either have to > >>> support recursive hypercalls in the middle of other hypercalls (because > >>> the page fault handler is surely going to want to do some) or proper > >>> hypercall restart (so you can fully return to guest context to handle > >>> the fault then retry) or something along those and complexifying up the > >>> hypervisor one way or another. Probably not impossible if you were > >>> building something form the ground up, but not trivial. > >> Well, Xen already has the continuation machinery for dealing with > >> hypercall restart, so that could be reused. > > That requires special support beyond just calling the continuation in > > each hypercall (often extending into the ABI) for pickling progress and > > picking it up again, only a small number of (usually long running) > > hypercalls have that support today. It also uses the guest context to > > store the state which perhaps isn't helpful if you want to return to the > > guest, although I suppose building a nested frame would work. > > I guess it depends on how many hypercalls do work before touching guest > memory, but any hypercall should be like that anyway, or at least be > able to wind back work done if a later read EFAULTs. > > I was vaguely speculating about a scheme on the lines of: > > 1. In copy_to/from_user, if we touch a bad address, save it in a > per-vcpu "bad_guest_addr" > 2. when returning to the guest, if the errno is EFAULT and > bad_guest_addr is set, then generate a memory fault frame with cr2 = > bad_guest_addr, and with the exception return restarting the hypercall > > Perhaps there should be a EFAULT_RETRY error return to trigger this > behaviour, rather than doing it for all EFAULTs, so the faulting > behaviour can be added incrementally. The kernel uses -ERESTARTSSYS for something similar, doesn't it? Does this scheme work if the hypercall causing the exception was itself runnnig in an exception handler? I guess it depends on the architecture +OSes handling of nested faults. > Maybe this is a lost cause for x86, but perhaps its worth considering > for new ports? Certainly worth thinking about. > > The guys doing paging and sharing etc looked into this and came to the > > conclusion that it would be intractably difficult to do this fully -- > > hence we now have the ability to sleep in hypercalls, which works > > because the pager/sharer is in a different domain/vcpu. > > Hmm. Were they looking at injecting faults back into the guest, or > forwarding "missing page" events off to another domain? Sharing and swapping are transparent to the domain, another domain runs the swapper/unshare process (actually, unshare might be in the h/v itself, not sure). > >> And accesses to guest > >> memory are already special events which must be checked so that EFAULT > >> can be returned. If, rather than failing with EFAULT Xen set up a > >> pagefault exception for the guest CPU with the return set up to retry > >> the hypercall, it should all work... > >> > >> Of course, if the guest isn't expecting that - or its buggy - then it > >> could end up in an infinite loop. But maybe a flag (set a high bit in > >> the hypercall number?), or a feature, or something? Might be worthwhile > >> if it saves guests having to do something expensive (like a > >> vmalloc_sync_all), even if they have to also deal with old hypervisors. > > The vmalloc_sync_all is a pretty event even on Xen though, isn't it? > > Looks like an important word is missing there. But its very expensive, > if that's what you're saying. Oops. "rare" was the missing word. > > J