From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751373AbbG3UUU (ORCPT ); Thu, 30 Jul 2015 16:20:20 -0400
Received: from aserp1040.oracle.com ([141.146.126.69]:31834 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750865AbbG3UUS (ORCPT ); Thu, 30 Jul 2015 16:20:18 -0400
Message-ID: <55BA86AA.8000202@oracle.com>
Date: Thu, 30 Jul 2015 16:18:50 -0400
From: Boris Ostrovsky
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Andy Lutomirski
CC: Andrew Cooper, David Vrabel, "security@kernel.org", Peter Zijlstra, X86 ML, "linux-kernel@vger.kernel.org", Steven Rostedt, xen-devel, Borislav Petkov, David Vrabel, Jan Beulich, Sasha Levin
Subject: Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
References: <55B7B791.2050208@oracle.com> <55B822B8.3090608@citrix.com> <55B841FF.2000102@oracle.com> <55B8E16C.2050406@citrix.com> <55B8E68B.2030305@oracle.com> <55B9236B.9090507@citrix.com> <55B94451.8040600@oracle.com> <55B947AF.7020404@citrix.com> <55B94F9D.3000405@citrix.com> <55B957DE.60405@cantab.net> <55B95863.2000102@oracle.com> <55B95B70.8010902@citrix.com> <55B96FE0.6010600@citrix.com> <55BA72E1.4050809@citrix.com> <55BA828E.8070304@oracle.com>
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Source-IP: userv0022.oracle.com [156.151.31.74]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 07/30/2015 04:05 PM, Andy Lutomirski wrote:
> On Thu, Jul 30, 2015 at 1:01 PM, Boris Ostrovsky wrote:
>> On 07/30/2015 02:54 PM, Andrew Cooper wrote:
>>> On 30/07/15 19:30, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper wrote:
>>>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper wrote:
>>>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper wrote:
>>>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky wrote:
>>>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>>>> Good and bad news. This bug has nothing to do with LDTs themselves.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>>>                 BUG();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>>>>>>> available yet?
>>>>>>>>>>>> Quick and dirty?
>>>>>>>>>>>>
>>>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where we are
>>>>>>>>>>>> certain v exists, is kernel memory and is expected to have a backing
>>>>>>>>>>>> page. I don't know offhand how many of current
>>>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>>>>>>> in the wings. Keep in mind that we need this for -stable, and it's
>>>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>>>
>>>>>>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>>>>>>> is the correct fix to unblock this LDT series. I see no reason why this
>>>>>>>>> cannot be backported.
>>>>>>>> To properly fix it should include batching and that is not something
>>>>>>>> that I think we should target for stable.
>>>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>>>> mmu_update_normal_pt. After all, update_va_mapping isn't batched.
>>>>>>>
>>>>>>> However this isn't the first issue we have had with lazy mmu faulting,
>>>>>>> and I doubt it is the last. There are not many callsites of
>>>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>>>> issues are lurking elsewhere.
>>>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>>>> access to fault. Is this something we should be worried about?
>>>>> Yes. update_va_mapping() will function perfectly well taking one RW
>>>>> mapping to RO even if there is a second RW mapping. In such a case, the
>>>>> next LDT access will fault.
>>>> Which is a problem because that alias might still exist, and also
>>>> because Linux really doesn't expect that fault.
>>>>
>>>>> On closer inspection, Xen is rather unhelpful with the fault. Xen's
>>>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>>>> in the range passed to set_ldt(). The error code however will be
>>>>> unmodified (and limited only by not-user and not-reserved), so will
>>>>> appear as a non-present read or write supervisor access to an address
>>>>> which the kernel has a valid read mapping of.
>>>> More yuck.
>>>>
>>>> I think I'm just going to stick an unconditional vm_flush_aliases in
>>>> alloc_ldt.
>>>>
>>>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>>>> mappings to the frames used to make up the LDT. It could proactively
>>>>> fault them in by accessing one descriptor in each page inside the limit,
>>>>> but by the time a fault is received it is probably too late to work out
>>>>> where the other mapping is which prevented the typechange (or indeed,
>>>>> whether Xen objected to one of the descriptors instead).
>>>> This seems like overkill.
>>>>
>>>> I'm still a bit confused, though: the failure is in xen_free_ldt. How
>>>> do we make it all the way to xen_free_ldt without the vmapped page
>>>> existing in the guest's page tables? After all, we had to survive
>>>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>>>> way.
>>> (Summarising part of a discussion which has just occurred on IRC)
>>>
>>> I presume that xen_free_ldt() is called while in the context of an mm
>>> which doesn't have the particular area of the vmalloc() space faulted in.
>>
>> This is exactly what's happening --- the bug is only triggered during exit
>> and xen_free_ldt() is called from someone else's context, e.g.:
>>
>> [ 53.986677] Call Trace:
>> [ 53.986677] [] xen_free_ldt+0x2d/0x40
>> [ 53.986677] [] free_ldt_struct.part.1+0x10/0x40
>> [ 53.986677] [] destroy_context+0x25/0x40
>> [ 53.986677] [] __mmdrop+0x1e/0xc0
>> [ 53.986677] [] finish_task_switch+0xd8/0x1a0
>> [ 53.986677] [] __schedule+0x316/0x950
>> [ 53.986677] [] schedule+0x26/0x70
>> [ 53.986677] [] do_wait+0x1b3/0x200
>> [ 53.986677] [] SyS_waitpid+0x67/0xd0
>> [ 53.986677] [] ? task_stopped_code+0x50/0x50
>> [ 53.986677] [] syscall_call+0x7/0x7
>>
>> But that would imply that this other context has mm->context.ldt of
>> ldt_gdt_32. How is that possible?
>>
> It's freed via destroy_context, which destroys someone else's LDT, right?
>

Yes, that's what it appears to be.

-boris