From mboxrd@z Thu Jan  1 00:00:00 1970
From: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Subject: Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
Date: Fri, 30 May 2014 11:46:01 -0500
Message-ID: <5388B5C9.20100@amd.com>
References: <1401215048-17154-1-git-send-email-aravind.gopalakrishnan@amd.com>
	<53852405.9010704@citrix.com> <5385FDD3.8020307@amd.com>
	<53862348.1060400@oracle.com> <5388AB6A.7010803@amd.com>
	<5388B005.2070108@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <5388B005.2070108@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper <andrew.cooper3@citrix.com>, boris ostrovsky <boris.ostrovsky@oracle.com>
Cc: keir@xen.org, JBeulich@suse.com, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 5/30/2014 11:21 AM, Andrew Cooper wrote:
>
> On 30/05/2014 17:01, Aravind Gopalakrishnan wrote:
>> On 5/28/2014 12:56 PM, boris ostrovsky wrote:
>>>
>>> On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
>>>> On 5/27/2014 6:47 PM, Andrew Cooper wrote:
>>>>> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
>>>>>> When HW tries to load a corrupted patch, it generates #GP
>>>>>> and hangs the system. Use wrmsr_safe instead so that we
>>>>>> fail to load microcode gracefully.
>>>>>>
>>>>>> Also, massage error handling around apply_microcode to keep
>>>>>> in tune with error handling style of other parts of the code.
>>>>>>
>>>>>> Example on a Fam15h system-
>>>>>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
>>>>>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
>>>>>> 0x6012 rev 0x6000637
>>>>>> (XEN) microcode: CPU0 found a matching microcode update with version
>>>>>> 0x6000637 (current=0x6000626)
>>>>>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
>>>>>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 
>>>>>> failed
>>>>>> ^^^^^^^^^^^^^^^^^^^^^^
>>>>>> As shown, the log message above has the two revisions reversed. 
>>>>>> Fix this
>>>>>>
>>>>>> Changes in V2:
>>>>>>      - Do not ignore return value from wrmsr_safe
>>>>>>      - Flip revision numbers as shown above
>>>>>>
>>>>>> Signed-off-by: Aravind Gopalakrishnan 
>>>>>> <aravind.gopalakrishnan@amd.com>
>>>>>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>>> I thought we had identified that the hangs were to do with your 
>>>>> use of
>>>>> 'noreboot' on the Xen command line.
>>>>>
>>>>
>>>> Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot 
>>>> into dom0 without
>>>> having to run through reboot loops. (lazy alternative I guess)
>>>>
>>>> Nevermind then. Thanks for the comments (Jan and Andrew). Will keep 
>>>> in mind for the future.
>>>
>>> I don't understand --- the fact that you had noreboot option meant 
>>> that your system wouldn't reboot (duh!) when a patch is corrupted 
>>> (aka "it will hang"). But I'd think we don't want a reboot neither 
>>> --- we want to safely skip the patch (and possibly backlist it).
>>>
>>>
>>
>> So, allowing the reboot to happen in turn allows entry to grub.
>> Where we can simply remove 'ucode=' option and this is same as
>> 'skipping' the patch using wrmsr_safe right?
>>
>> Except this is now explicitly done by the sysadmin
>> while wrmsr_safe just works without anyone doing some extra work;
>> and printing a log message informs user that update went wrong
>> for whatever reason.
>>
>> My understanding from earlier comments is that Andrew (and Jan)
>> would rather not see a change from wrmsrl to wrmsr_safe if it
>> is needless because there is already a way someone can circumvent the
>> corrupt patch: just don't provide it on the grub menu.
>>
>> Thanks,
>> -Aravind.
>
> Sorry - I think my comment confused the issue.  Let me retry.
>
> Originally, the bug was described approximately as "A corrupt ucode 
> will cause a GP fault, causing the server to hang".
>
> The unhandled #GP fault certainly should be wrapped with wrmsr_safe(), 
> and an error/warning presented to the user.  In the case that a bad 
> ucode is discovered, it should be discarded and the server allowed to 
> boot.  It is substantially more useful for the server to come up and 
> say "I couldn't load that bit of microcode you wanted me to", than to 
> sit in a reboot loop because you made a typo in the bootloader config, 
> and have to get someone in the datacenter to poke the physical server.
>
>
> My objection was to the wording of the comment alone.  Unhandled #GP 
> faults do not "hang" Xen unless you ask for them to behave in that 
> way, given "noreboot" on the command line.
>

Ah. Okay, let me look into this again then and fix issues with V2.

-Aravind.