From mboxrd@z Thu Jan 1 00:00:00 1970 From: Aravind Gopalakrishnan Subject: Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP Date: Fri, 30 May 2014 11:46:01 -0500 Message-ID: <5388B5C9.20100@amd.com> References: <1401215048-17154-1-git-send-email-aravind.gopalakrishnan@amd.com> <53852405.9010704@citrix.com> <5385FDD3.8020307@amd.com> <53862348.1060400@oracle.com> <5388AB6A.7010803@amd.com> <5388B005.2070108@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5388B005.2070108@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Andrew Cooper , boris ostrovsky Cc: keir@xen.org, JBeulich@suse.com, xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On 5/30/2014 11:21 AM, Andrew Cooper wrote: > > On 30/05/2014 17:01, Aravind Gopalakrishnan wrote: >> On 5/28/2014 12:56 PM, boris ostrovsky wrote: >>> >>> On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote: >>>> On 5/27/2014 6:47 PM, Andrew Cooper wrote: >>>>> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote: >>>>>> When HW tries to load a corrupted patch, it generates #GP >>>>>> and hangs the system. Use wrmsr_safe instead so that we >>>>>> fail to load microcode gracefully. >>>>>> >>>>>> Also, massage error handling around apply_microcode to keep >>>>>> in tune with error handling style of other parts of the code. >>>>>> >>>>>> Example on a Fam15h system- >>>>>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626 >>>>>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID >>>>>> 0x6012 rev 0x6000637 >>>>>> (XEN) microcode: CPU0 found a matching microcode update with version >>>>>> 0x6000637 (current=0x6000626) >>>>>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8 >>>>>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 >>>>>> failed >>>>>> ^^^^^^^^^^^^^^^^^^^^^^ >>>>>> As shown, the log message above has the two revisions reversed. >>>>>> Fix this >>>>>> >>>>>> Changes in V2: >>>>>> - Do not ignore return value from wrmsr_safe >>>>>> - Flip revision numbers as shown above >>>>>> >>>>>> Signed-off-by: Aravind Gopalakrishnan >>>>>> >>>>>> Reviewed-by: Boris Ostrovsky >>>>> I thought we had identified that the hangs were to do with your >>>>> use of >>>>> 'noreboot' on the Xen command line. >>>>> >>>> >>>> Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot >>>> into dom0 without >>>> having to run through reboot loops. (lazy alternative I guess) >>>> >>>> Nevermind then. Thanks for the comments (Jan and Andrew). Will keep >>>> in mind for the future. >>> >>> I don't understand --- the fact that you had noreboot option meant >>> that your system wouldn't reboot (duh!) when a patch is corrupted >>> (aka "it will hang"). But I'd think we don't want a reboot neither >>> --- we want to safely skip the patch (and possibly backlist it). >>> >>> >> >> So, allowing the reboot to happen in turn allows entry to grub. >> Where we can simply remove 'ucode=' option and this is same as >> 'skipping' the patch using wrmsr_safe right? >> >> Except this is now explicitly done by the sysadmin >> while wrmsr_safe just works without anyone doing some extra work; >> and printing a log message informs user that update went wrong >> for whatever reason. >> >> My understanding from earlier comments is that Andrew (and Jan) >> would rather not see a change from wrmsrl to wrmsr_safe if it >> is needless because there is already a way someone can circumvent the >> corrupt patch: just don't provide it on the grub menu. >> >> Thanks, >> -Aravind. > > Sorry - I think my comment confused the issue. Let me retry. > > Originally, the bug was described approximately as "A corrupt ucode > will cause a GP fault, causing the server to hang". > > The unhandled #GP fault certainly should be wrapped with wrmsr_safe(), > and an error/warning presented to the user. In the case that a bad > ucode is discovered, it should be discarded and the server allowed to > boot. It is substantially more useful for the server to come up and > say "I couldn't load that bit of microcode you wanted me to", than to > sit in a reboot loop because you made a typo in the bootloader config, > and have to get someone in the datacenter to poke the physical server. > > > My objection was to the wording of the comment alone. Unhandled #GP > faults do not "hang" Xen unless you ask for them to behave in that > way, given "noreboot" on the command line. > Ah. Okay, let me look into this again then and fix issues with V2. -Aravind.