From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751373AbbG3UUU (ORCPT <rfc822;w@1wt.eu>);
	Thu, 30 Jul 2015 16:20:20 -0400
Received: from aserp1040.oracle.com ([141.146.126.69]:31834 "EHLO
	aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750865AbbG3UUS (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 30 Jul 2015 16:20:18 -0400
Message-ID: <55BA86AA.8000202@oracle.com>
Date: Thu, 30 Jul 2015 16:18:50 -0400
From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Andy Lutomirski <luto@amacapital.net>
CC: Andrew Cooper <andrew.cooper3@citrix.com>,
        David Vrabel <dvrabel@cantab.net>,
        "security@kernel.org" <security@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Steven Rostedt <rostedt@goodmis.org>,
        xen-devel <xen-devel@lists.xen.org>, Borislav Petkov <bp@alien8.de>,
        David Vrabel <david.vrabel@citrix.com>,
        Jan Beulich <jbeulich@suse.com>, Sasha Levin <sasha.levin@oracle.com>
Subject: Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test,
 and config option
References: <cover.1437802102.git.luto@kernel.org> <55B7B791.2050208@oracle.com> <CALCETrXH5_PMqfH1en_5c+5gUpq8SjCnQ3Xaz-K6ej6FgBgLDQ@mail.gmail.com> <55B822B8.3090608@citrix.com> <55B841FF.2000102@oracle.com> <CALCETrWkMRb+Y3FsJ7+kNYmPxtupM3ZPOeOPwagXytgBqM6tJQ@mail.gmail.com> <55B8E16C.2050406@citrix.com> <55B8E68B.2030305@oracle.com> <55B9236B.9090507@citrix.com> <55B94451.8040600@oracle.com> <CALCETrWA=hAyqqp=yzZ2r_S=9U9hLkd6dZEuNefew8hyLVA_eQ@mail.gmail.com> <55B947AF.7020404@citrix.com> <CALCETrXp_DV-_Uvekwv7xLHO-5P8Oxkgn6OeXG-6tVOD4RkKMw@mail.gmail.com> <55B94F9D.3000405@citrix.com> <55B957DE.60405@cantab.net> <55B95863.2000102@oracle.com> <55B95B70.8010902@citrix.com> <CALCETrWy93qobHmMWzTfqFN+0Y7DGyM7viwpPMGOeSiXEP0Z6w@mail.gmail.com> <55B96FE0.6010600@citrix.com> <CALCETrUi2GBdGP2OX+3PwSf0UYjKuf2+DugENe3Y6mUoy-Rfkw@mail.gmail.com> <55BA72E1.4050809@citrix.com> <55BA828E.8070304@oracle.com> <CALCETrUsFn23tKf418VSbGCgXoXXRq8dk41ZfM3F55=_xWPQhw@mail.gmail.com>
In-Reply-To: <CALCETrUsFn23tKf418VSbGCgXoXXRq8dk41ZfM3F55=_xWPQhw@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Source-IP: userv0022.oracle.com [156.151.31.74]
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 07/30/2015 04:05 PM, Andy Lutomirski wrote:
> On Thu, Jul 30, 2015 at 1:01 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/30/2015 02:54 PM, Andrew Cooper wrote:
>>> On 30/07/15 19:30, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
>>>> <andrew.cooper3@citrix.com> wrote:
>>>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v,
>>>>>>>>>>>>>>> pgprot_t prot)
>>>>>>>>>>>>>>>               pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>>>>      +       (void)*(volatile int*)v;
>>>>>>>>>>>>>>>             if (HYPERVISOR_update_va_mapping((unsigned long)v,
>>>>>>>>>>>>>>> pte, 0)) {
>>>>>>>>>>>>>>>                     pr_err("set_aliased_prot va update failed
>>>>>>>>>>>>>>> w/
>>>>>>>>>>>>>>> lazy mode
>>>>>>>>>>>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>>>                     BUG();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>>>>> this is the
>>>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix
>>>>>>>>>>>>> that
>>>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix
>>>>>>>>>>>>> isn't
>>>>>>>>>>>>> available yet?
>>>>>>>>>>>> Quick and dirty?
>>>>>>>>>>>>
>>>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>>>>> we are
>>>>>>>>>>>> certain v exists, is kernel memory and is expected to have a
>>>>>>>>>>>> backing
>>>>>>>>>>>> page.  I don't know offhand how many of current
>>>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something
>>>>>>>>>>> better
>>>>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and
>>>>>>>>>>> it's
>>>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>>>> Hmm - something like that tucked inside
>>>>>>>>>> HYPERVISOR_update_va_mapping()
>>>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>>>
>>>>>>>>>> Altering the hypercall used is certainly not something to backport,
>>>>>>>>>> nor
>>>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>>>> Changing this one use of update_va_mapping to use
>>>>>>>>> mmu_update_normal_pt
>>>>>>>>> is the correct fix to unblock this LDT series.  I see no reason why
>>>>>>>>> this
>>>>>>>>> cannot be backported.
>>>>>>>> To properly fix it should include batching and that is not something
>>>>>>>> that I think we should target for stable.
>>>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>>>>
>>>>>>> However this isn't the first issue issue we have had lazy mmu
>>>>>>> faulting,
>>>>>>> and I doubt it is the last.  There are not many callsites of
>>>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>>>> issues are lurking elsewhere.
>>>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>>>> access to fault.  Is this something we should be worried about?
>>>>> Yes.  update_va_mapping() will function perfectly well taking one RW
>>>>> mapping to RO even if there is a second RW mapping.  In such a case, the
>>>>> next LDT access will fault.
>>>> Which is a problem because that alias might still exist, and also
>>>> because Linux really doesn't expect that fault.
>>>>
>>>>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>>>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>>>> in the range passed to set_ldt().  The error code however will be
>>>>> unmodified (and limited only by not-user and not-reserved), so will
>>>>> appear as a non-present read or write supervisor access to an address
>>>>> which the kernel has a valid read mapping of.
>>>> More yuck.
>>>>
>>>> I think I'm just going to stick an unconditional vm_flush_aliases in
>>>> alloc_ldt.
>>>>
>>>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>>>> mappings to the frames used to make up the LDT.  It could proactively
>>>>> fault them in by accessing one descriptor in each page inside the limit,
>>>>> but by the time a fault is received it is probably too late to work out
>>>>> where the other mapping is which prevented the typechange (or indeed,
>>>>> whether Xen objected to one of the descriptors instead).
>>>> This seems like overkill.
>>>>
>>>> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
>>>> do we make it all the way to xen_free_ldt without the vmapped page
>>>> existing in the guest's page tables?  After all, we had to survive
>>>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>>>> way.
>>> (Summarising part of a discussion which has just occurred on IRC)
>>>
>>> I presume that xen_free_ldt() is called while in the context of an mm
>>> which doesn't have the particular area of the vmalloc() space faulted in.
>>
>> This is exactly what's happening --- the bug is only triggered during exit
>> and xen_free_ldt() is called from someone else's context, e.g.:
>>
>> [   53.986677] Call Trace:
>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>
>> But that would imply that this other context has mm->context.ldt of
>> ldt_gdt_32. How is that possible?
>>
> It's freed via destroy_context, which destroys someone else's LDT, right?
>

Yes, that's what it appears to be.

-boris