xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V2] x86, amd_ucode: Safeguard against #GP
@ 2014-05-27 18:24 Aravind Gopalakrishnan
  2014-05-27 23:47 ` Andrew Cooper
  2014-05-28  7:22 ` Jan Beulich
  0 siblings, 2 replies; 10+ messages in thread
From: Aravind Gopalakrishnan @ 2014-05-27 18:24 UTC (permalink / raw)
  To: JBeulich, xen-devel
  Cc: andrew.cooper3, boris.ostrovsky, keir, Aravind Gopalakrishnan

When HW tries to load a corrupted patch, it generates #GP
and hangs the system. Use wrmsr_safe instead so that we
fail to load microcode gracefully.

Also, massage error handling around apply_microcode to keep
in tune with error handling style of other parts of the code.

Example on a Fam15h system-
(XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
(XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
0x6012 rev 0x6000637
(XEN) microcode: CPU0 found a matching microcode update with version
0x6000637 (current=0x6000626)
(XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
(XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
                                           ^^^^^^^^^^^^^^^^^^^^^^
As shown, the log message above has the two revisions reversed. Fix this

Changes in V2:
    - Do not ignore return value from wrmsr_safe
    - Flip revision numbers as shown above

Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 xen/arch/x86/microcode_amd.c |   25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/microcode_amd.c b/xen/arch/x86/microcode_amd.c
index e83f4b6..1db8a0d 100644
--- a/xen/arch/x86/microcode_amd.c
+++ b/xen/arch/x86/microcode_amd.c
@@ -178,32 +178,39 @@ static int apply_microcode(int cpu)
     uint32_t rev;
     struct microcode_amd *mc_amd = uci->mc.mc_amd;
     struct microcode_header_amd *hdr;
+    int error = -EINVAL;
 
     /* We should bind the task to the CPU */
     BUG_ON(raw_smp_processor_id() != cpu);
 
     if ( mc_amd == NULL )
-        return -EINVAL;
+        goto apply_err1;
 
     hdr = mc_amd->mpb;
     if ( hdr == NULL )
-        return -EINVAL;
+        goto apply_err1;
 
     spin_lock_irqsave(&microcode_update_lock, flags);
 
-    wrmsrl(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
+    error = wrmsr_safe(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
 
     /* get patch id after patching */
     rdmsrl(MSR_AMD_PATCHLEVEL, rev);
 
     spin_unlock_irqrestore(&microcode_update_lock, flags);
 
+    /* Catch HW patch application failure */
+    if ( error ) {
+        printk(KERN_ERR "microcode: CPU%d ucode patch application failed HW tests. "
+               "HW returned #GP\n", cpu);
+        goto apply_err2;
+    }
+
     /* check current patch id and patch's id for match */
     if ( rev != hdr->patch_id )
     {
-        printk(KERN_ERR "microcode: CPU%d update from revision "
-               "%#x to %#x failed\n", cpu, hdr->patch_id, rev);
-        return -EIO;
+        error = -EIO;
+        goto apply_err2;
     }
 
     printk(KERN_WARNING "microcode: CPU%d updated from revision %#x to %#x\n",
@@ -212,6 +219,12 @@ static int apply_microcode(int cpu)
     uci->cpu_sig.rev = rev;
 
     return 0;
+
+apply_err2:
+    printk(KERN_ERR "microcode: CPU%d update from revision "
+           "%#x to %#x failed\n", cpu, rev, hdr->patch_id);
+apply_err1:
+    return error;
 }
 
 static int get_ucode_from_buffer_amd(
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-27 18:24 [PATCH V2] x86, amd_ucode: Safeguard against #GP Aravind Gopalakrishnan
@ 2014-05-27 23:47 ` Andrew Cooper
  2014-05-28 15:16   ` Aravind Gopalakrishnan
  2014-05-28  7:22 ` Jan Beulich
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2014-05-27 23:47 UTC (permalink / raw)
  To: Aravind Gopalakrishnan, JBeulich, xen-devel; +Cc: boris.ostrovsky, keir

On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
> When HW tries to load a corrupted patch, it generates #GP
> and hangs the system. Use wrmsr_safe instead so that we
> fail to load microcode gracefully.
>
> Also, massage error handling around apply_microcode to keep
> in tune with error handling style of other parts of the code.
>
> Example on a Fam15h system-
> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
> 0x6012 rev 0x6000637
> (XEN) microcode: CPU0 found a matching microcode update with version
> 0x6000637 (current=0x6000626)
> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
>                                            ^^^^^^^^^^^^^^^^^^^^^^
> As shown, the log message above has the two revisions reversed. Fix this
>
> Changes in V2:
>     - Do not ignore return value from wrmsr_safe
>     - Flip revision numbers as shown above
>
> Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

I thought we had identified that the hangs were to do with your use of
'noreboot' on the Xen command line.

~Andrew


> ---
>  xen/arch/x86/microcode_amd.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/xen/arch/x86/microcode_amd.c b/xen/arch/x86/microcode_amd.c
> index e83f4b6..1db8a0d 100644
> --- a/xen/arch/x86/microcode_amd.c
> +++ b/xen/arch/x86/microcode_amd.c
> @@ -178,32 +178,39 @@ static int apply_microcode(int cpu)
>      uint32_t rev;
>      struct microcode_amd *mc_amd = uci->mc.mc_amd;
>      struct microcode_header_amd *hdr;
> +    int error = -EINVAL;
>  
>      /* We should bind the task to the CPU */
>      BUG_ON(raw_smp_processor_id() != cpu);
>  
>      if ( mc_amd == NULL )
> -        return -EINVAL;
> +        goto apply_err1;
>  
>      hdr = mc_amd->mpb;
>      if ( hdr == NULL )
> -        return -EINVAL;
> +        goto apply_err1;
>  
>      spin_lock_irqsave(&microcode_update_lock, flags);
>  
> -    wrmsrl(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
> +    error = wrmsr_safe(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
>  
>      /* get patch id after patching */
>      rdmsrl(MSR_AMD_PATCHLEVEL, rev);
>  
>      spin_unlock_irqrestore(&microcode_update_lock, flags);
>  
> +    /* Catch HW patch application failure */
> +    if ( error ) {
> +        printk(KERN_ERR "microcode: CPU%d ucode patch application failed HW tests. "
> +               "HW returned #GP\n", cpu);
> +        goto apply_err2;

This...

> +    }
> +
>      /* check current patch id and patch's id for match */
>      if ( rev != hdr->patch_id )
>      {
> -        printk(KERN_ERR "microcode: CPU%d update from revision "
> -               "%#x to %#x failed\n", cpu, hdr->patch_id, rev);
> -        return -EIO;
> +        error = -EIO;
> +        goto apply_err2;
>      }
>  
>      printk(KERN_WARNING "microcode: CPU%d updated from revision %#x to %#x\n",
> @@ -212,6 +219,12 @@ static int apply_microcode(int cpu)
>      uci->cpu_sig.rev = rev;
>  
>      return 0;
> +
> +apply_err2:
> +    printk(KERN_ERR "microcode: CPU%d update from revision "
> +           "%#x to %#x failed\n", cpu, rev, hdr->patch_id);

... combined with this will result in two error messages being printed.

This seems over overkill for the circumstance.

~Andrew.

> +apply_err1:
> +    return error;
>  }
>  
>  static int get_ucode_from_buffer_amd(

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-27 18:24 [PATCH V2] x86, amd_ucode: Safeguard against #GP Aravind Gopalakrishnan
  2014-05-27 23:47 ` Andrew Cooper
@ 2014-05-28  7:22 ` Jan Beulich
  1 sibling, 0 replies; 10+ messages in thread
From: Jan Beulich @ 2014-05-28  7:22 UTC (permalink / raw)
  To: Aravind Gopalakrishnan; +Cc: andrew.cooper3, boris.ostrovsky, keir, xen-devel

>>> On 27.05.14 at 20:24, <aravind.gopalakrishnan@amd.com> wrote:
> When HW tries to load a corrupted patch, it generates #GP
> and hangs the system. Use wrmsr_safe instead so that we
> fail to load microcode gracefully.

Leaving aside the question of whether this is needed at all (see
Andrew's earlier reply), a couple of mechanical comments:

> --- a/xen/arch/x86/microcode_amd.c
> +++ b/xen/arch/x86/microcode_amd.c
> @@ -178,32 +178,39 @@ static int apply_microcode(int cpu)
>      uint32_t rev;
>      struct microcode_amd *mc_amd = uci->mc.mc_amd;
>      struct microcode_header_amd *hdr;
> +    int error = -EINVAL;
>  
>      /* We should bind the task to the CPU */
>      BUG_ON(raw_smp_processor_id() != cpu);
>  
>      if ( mc_amd == NULL )
> -        return -EINVAL;
> +        goto apply_err1;

Why?

>  
>      hdr = mc_amd->mpb;
>      if ( hdr == NULL )
> -        return -EINVAL;
> +        goto apply_err1;

And again, why?

>  
>      spin_lock_irqsave(&microcode_update_lock, flags);
>  
> -    wrmsrl(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
> +    error = wrmsr_safe(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
>  
>      /* get patch id after patching */
>      rdmsrl(MSR_AMD_PATCHLEVEL, rev);
>  
>      spin_unlock_irqrestore(&microcode_update_lock, flags);
>  
> +    /* Catch HW patch application failure */
> +    if ( error ) {

Coding style.

> +        printk(KERN_ERR "microcode: CPU%d ucode patch application failed HW tests. "
> +               "HW returned #GP\n", cpu);
> +        goto apply_err2;

No need to print two messages.

> +    }
> +
>      /* check current patch id and patch's id for match */
>      if ( rev != hdr->patch_id )
>      {
> -        printk(KERN_ERR "microcode: CPU%d update from revision "
> -               "%#x to %#x failed\n", cpu, hdr->patch_id, rev);
> -        return -EIO;
> +        error = -EIO;
> +        goto apply_err2;

And with the above, this change isn't warranted too.

> @@ -212,6 +219,12 @@ static int apply_microcode(int cpu)
>      uci->cpu_sig.rev = rev;
>  
>      return 0;
> +
> +apply_err2:
> +    printk(KERN_ERR "microcode: CPU%d update from revision "
> +           "%#x to %#x failed\n", cpu, rev, hdr->patch_id);
> +apply_err1:
> +    return error;

While as a consequence this change will go away too, for the
future - if you add any labels, they ought to be indented by at
least on space.

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-27 23:47 ` Andrew Cooper
@ 2014-05-28 15:16   ` Aravind Gopalakrishnan
  2014-05-28 17:56     ` boris ostrovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Aravind Gopalakrishnan @ 2014-05-28 15:16 UTC (permalink / raw)
  To: Andrew Cooper, JBeulich, xen-devel; +Cc: boris.ostrovsky, keir

On 5/27/2014 6:47 PM, Andrew Cooper wrote:
> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
>> When HW tries to load a corrupted patch, it generates #GP
>> and hangs the system. Use wrmsr_safe instead so that we
>> fail to load microcode gracefully.
>>
>> Also, massage error handling around apply_microcode to keep
>> in tune with error handling style of other parts of the code.
>>
>> Example on a Fam15h system-
>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
>> 0x6012 rev 0x6000637
>> (XEN) microcode: CPU0 found a matching microcode update with version
>> 0x6000637 (current=0x6000626)
>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
>>                                             ^^^^^^^^^^^^^^^^^^^^^^
>> As shown, the log message above has the two revisions reversed. Fix this
>>
>> Changes in V2:
>>      - Do not ignore return value from wrmsr_safe
>>      - Flip revision numbers as shown above
>>
>> Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> I thought we had identified that the hangs were to do with your use of
> 'noreboot' on the Xen command line.
>

Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot into 
dom0 without
having to run through reboot loops. (lazy alternative I guess)

Nevermind then. Thanks for the comments (Jan and Andrew). Will keep in 
mind for the future.

-Aravind.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-28 15:16   ` Aravind Gopalakrishnan
@ 2014-05-28 17:56     ` boris ostrovsky
  2014-05-30 16:01       ` Aravind Gopalakrishnan
  0 siblings, 1 reply; 10+ messages in thread
From: boris ostrovsky @ 2014-05-28 17:56 UTC (permalink / raw)
  To: Aravind Gopalakrishnan; +Cc: Andrew Cooper, keir, JBeulich, xen-devel


On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
> On 5/27/2014 6:47 PM, Andrew Cooper wrote:
>> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
>>> When HW tries to load a corrupted patch, it generates #GP
>>> and hangs the system. Use wrmsr_safe instead so that we
>>> fail to load microcode gracefully.
>>>
>>> Also, massage error handling around apply_microcode to keep
>>> in tune with error handling style of other parts of the code.
>>>
>>> Example on a Fam15h system-
>>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
>>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
>>> 0x6012 rev 0x6000637
>>> (XEN) microcode: CPU0 found a matching microcode update with version
>>> 0x6000637 (current=0x6000626)
>>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
>>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 
>>> failed
>>> ^^^^^^^^^^^^^^^^^^^^^^
>>> As shown, the log message above has the two revisions reversed. Fix 
>>> this
>>>
>>> Changes in V2:
>>>      - Do not ignore return value from wrmsr_safe
>>>      - Flip revision numbers as shown above
>>>
>>> Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
>>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> I thought we had identified that the hangs were to do with your use of
>> 'noreboot' on the Xen command line.
>>
>
> Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot into 
> dom0 without
> having to run through reboot loops. (lazy alternative I guess)
>
> Nevermind then. Thanks for the comments (Jan and Andrew). Will keep in 
> mind for the future.

I don't understand --- the fact that you had noreboot option meant that 
your system wouldn't reboot (duh!) when a patch is corrupted (aka "it 
will hang"). But I'd think we don't want a reboot neither --- we want to 
safely skip the patch (and possibly backlist it).

-boris

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-28 17:56     ` boris ostrovsky
@ 2014-05-30 16:01       ` Aravind Gopalakrishnan
  2014-05-30 16:21         ` Andrew Cooper
  0 siblings, 1 reply; 10+ messages in thread
From: Aravind Gopalakrishnan @ 2014-05-30 16:01 UTC (permalink / raw)
  To: boris ostrovsky; +Cc: Andrew Cooper, keir, JBeulich, xen-devel

On 5/28/2014 12:56 PM, boris ostrovsky wrote:
>
> On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
>> On 5/27/2014 6:47 PM, Andrew Cooper wrote:
>>> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
>>>> When HW tries to load a corrupted patch, it generates #GP
>>>> and hangs the system. Use wrmsr_safe instead so that we
>>>> fail to load microcode gracefully.
>>>>
>>>> Also, massage error handling around apply_microcode to keep
>>>> in tune with error handling style of other parts of the code.
>>>>
>>>> Example on a Fam15h system-
>>>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
>>>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
>>>> 0x6012 rev 0x6000637
>>>> (XEN) microcode: CPU0 found a matching microcode update with version
>>>> 0x6000637 (current=0x6000626)
>>>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
>>>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 
>>>> failed
>>>> ^^^^^^^^^^^^^^^^^^^^^^
>>>> As shown, the log message above has the two revisions reversed. Fix 
>>>> this
>>>>
>>>> Changes in V2:
>>>>      - Do not ignore return value from wrmsr_safe
>>>>      - Flip revision numbers as shown above
>>>>
>>>> Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
>>>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>> I thought we had identified that the hangs were to do with your use of
>>> 'noreboot' on the Xen command line.
>>>
>>
>> Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot into 
>> dom0 without
>> having to run through reboot loops. (lazy alternative I guess)
>>
>> Nevermind then. Thanks for the comments (Jan and Andrew). Will keep 
>> in mind for the future.
>
> I don't understand --- the fact that you had noreboot option meant 
> that your system wouldn't reboot (duh!) when a patch is corrupted (aka 
> "it will hang"). But I'd think we don't want a reboot neither --- we 
> want to safely skip the patch (and possibly backlist it).
>
>

So, allowing the reboot to happen in turn allows entry to grub.
Where we can simply remove 'ucode=' option and this is same as
'skipping' the patch using wrmsr_safe right?

Except this is now explicitly done by the sysadmin
while wrmsr_safe just works without anyone doing some extra work;
and printing a log message informs user that update went wrong
for whatever reason.

My understanding from earlier comments is that Andrew (and Jan)
would rather not see a change from wrmsrl to wrmsr_safe if it
is needless because there is already a way someone can circumvent the
corrupt patch: just don't provide it on the grub menu.

Thanks,
-Aravind.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-30 16:01       ` Aravind Gopalakrishnan
@ 2014-05-30 16:21         ` Andrew Cooper
  2014-05-30 16:46           ` Aravind Gopalakrishnan
  2014-06-02  7:31           ` Jan Beulich
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Cooper @ 2014-05-30 16:21 UTC (permalink / raw)
  To: Aravind Gopalakrishnan, boris ostrovsky; +Cc: keir, JBeulich, xen-devel


On 30/05/2014 17:01, Aravind Gopalakrishnan wrote:
> On 5/28/2014 12:56 PM, boris ostrovsky wrote:
>>
>> On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
>>> On 5/27/2014 6:47 PM, Andrew Cooper wrote:
>>>> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
>>>>> When HW tries to load a corrupted patch, it generates #GP
>>>>> and hangs the system. Use wrmsr_safe instead so that we
>>>>> fail to load microcode gracefully.
>>>>>
>>>>> Also, massage error handling around apply_microcode to keep
>>>>> in tune with error handling style of other parts of the code.
>>>>>
>>>>> Example on a Fam15h system-
>>>>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
>>>>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
>>>>> 0x6012 rev 0x6000637
>>>>> (XEN) microcode: CPU0 found a matching microcode update with version
>>>>> 0x6000637 (current=0x6000626)
>>>>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
>>>>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 
>>>>> failed
>>>>> ^^^^^^^^^^^^^^^^^^^^^^
>>>>> As shown, the log message above has the two revisions reversed. 
>>>>> Fix this
>>>>>
>>>>> Changes in V2:
>>>>>      - Do not ignore return value from wrmsr_safe
>>>>>      - Flip revision numbers as shown above
>>>>>
>>>>> Signed-off-by: Aravind Gopalakrishnan 
>>>>> <aravind.gopalakrishnan@amd.com>
>>>>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>> I thought we had identified that the hangs were to do with your use of
>>>> 'noreboot' on the Xen command line.
>>>>
>>>
>>> Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot into 
>>> dom0 without
>>> having to run through reboot loops. (lazy alternative I guess)
>>>
>>> Nevermind then. Thanks for the comments (Jan and Andrew). Will keep 
>>> in mind for the future.
>>
>> I don't understand --- the fact that you had noreboot option meant 
>> that your system wouldn't reboot (duh!) when a patch is corrupted 
>> (aka "it will hang"). But I'd think we don't want a reboot neither 
>> --- we want to safely skip the patch (and possibly backlist it).
>>
>>
>
> So, allowing the reboot to happen in turn allows entry to grub.
> Where we can simply remove 'ucode=' option and this is same as
> 'skipping' the patch using wrmsr_safe right?
>
> Except this is now explicitly done by the sysadmin
> while wrmsr_safe just works without anyone doing some extra work;
> and printing a log message informs user that update went wrong
> for whatever reason.
>
> My understanding from earlier comments is that Andrew (and Jan)
> would rather not see a change from wrmsrl to wrmsr_safe if it
> is needless because there is already a way someone can circumvent the
> corrupt patch: just don't provide it on the grub menu.
>
> Thanks,
> -Aravind.

Sorry - I think my comment confused the issue.  Let me retry.

Originally, the bug was described approximately as "A corrupt ucode will 
cause a GP fault, causing the server to hang".

The unhandled #GP fault certainly should be wrapped with wrmsr_safe(), 
and an error/warning presented to the user.  In the case that a bad 
ucode is discovered, it should be discarded and the server allowed to 
boot.  It is substantially more useful for the server to come up and say 
"I couldn't load that bit of microcode you wanted me to", than to sit in 
a reboot loop because you made a typo in the bootloader config, and have 
to get someone in the datacenter to poke the physical server.


My objection was to the wording of the comment alone.  Unhandled #GP 
faults do not "hang" Xen unless you ask for them to behave in that way, 
given "noreboot" on the command line.

~Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-30 16:21         ` Andrew Cooper
@ 2014-05-30 16:46           ` Aravind Gopalakrishnan
  2014-06-02  7:31           ` Jan Beulich
  1 sibling, 0 replies; 10+ messages in thread
From: Aravind Gopalakrishnan @ 2014-05-30 16:46 UTC (permalink / raw)
  To: Andrew Cooper, boris ostrovsky; +Cc: keir, JBeulich, xen-devel

On 5/30/2014 11:21 AM, Andrew Cooper wrote:
>
> On 30/05/2014 17:01, Aravind Gopalakrishnan wrote:
>> On 5/28/2014 12:56 PM, boris ostrovsky wrote:
>>>
>>> On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
>>>> On 5/27/2014 6:47 PM, Andrew Cooper wrote:
>>>>> On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
>>>>>> When HW tries to load a corrupted patch, it generates #GP
>>>>>> and hangs the system. Use wrmsr_safe instead so that we
>>>>>> fail to load microcode gracefully.
>>>>>>
>>>>>> Also, massage error handling around apply_microcode to keep
>>>>>> in tune with error handling style of other parts of the code.
>>>>>>
>>>>>> Example on a Fam15h system-
>>>>>> (XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
>>>>>> (XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
>>>>>> 0x6012 rev 0x6000637
>>>>>> (XEN) microcode: CPU0 found a matching microcode update with version
>>>>>> 0x6000637 (current=0x6000626)
>>>>>> (XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
>>>>>> (XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 
>>>>>> failed
>>>>>> ^^^^^^^^^^^^^^^^^^^^^^
>>>>>> As shown, the log message above has the two revisions reversed. 
>>>>>> Fix this
>>>>>>
>>>>>> Changes in V2:
>>>>>>      - Do not ignore return value from wrmsr_safe
>>>>>>      - Flip revision numbers as shown above
>>>>>>
>>>>>> Signed-off-by: Aravind Gopalakrishnan 
>>>>>> <aravind.gopalakrishnan@amd.com>
>>>>>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>>> I thought we had identified that the hangs were to do with your 
>>>>> use of
>>>>> 'noreboot' on the Xen command line.
>>>>>
>>>>
>>>> Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot 
>>>> into dom0 without
>>>> having to run through reboot loops. (lazy alternative I guess)
>>>>
>>>> Nevermind then. Thanks for the comments (Jan and Andrew). Will keep 
>>>> in mind for the future.
>>>
>>> I don't understand --- the fact that you had noreboot option meant 
>>> that your system wouldn't reboot (duh!) when a patch is corrupted 
>>> (aka "it will hang"). But I'd think we don't want a reboot neither 
>>> --- we want to safely skip the patch (and possibly backlist it).
>>>
>>>
>>
>> So, allowing the reboot to happen in turn allows entry to grub.
>> Where we can simply remove 'ucode=' option and this is same as
>> 'skipping' the patch using wrmsr_safe right?
>>
>> Except this is now explicitly done by the sysadmin
>> while wrmsr_safe just works without anyone doing some extra work;
>> and printing a log message informs user that update went wrong
>> for whatever reason.
>>
>> My understanding from earlier comments is that Andrew (and Jan)
>> would rather not see a change from wrmsrl to wrmsr_safe if it
>> is needless because there is already a way someone can circumvent the
>> corrupt patch: just don't provide it on the grub menu.
>>
>> Thanks,
>> -Aravind.
>
> Sorry - I think my comment confused the issue.  Let me retry.
>
> Originally, the bug was described approximately as "A corrupt ucode 
> will cause a GP fault, causing the server to hang".
>
> The unhandled #GP fault certainly should be wrapped with wrmsr_safe(), 
> and an error/warning presented to the user.  In the case that a bad 
> ucode is discovered, it should be discarded and the server allowed to 
> boot.  It is substantially more useful for the server to come up and 
> say "I couldn't load that bit of microcode you wanted me to", than to 
> sit in a reboot loop because you made a typo in the bootloader config, 
> and have to get someone in the datacenter to poke the physical server.
>
>
> My objection was to the wording of the comment alone.  Unhandled #GP 
> faults do not "hang" Xen unless you ask for them to behave in that 
> way, given "noreboot" on the command line.
>

Ah. Okay, let me look into this again then and fix issues with V2.

-Aravind.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-05-30 16:21         ` Andrew Cooper
  2014-05-30 16:46           ` Aravind Gopalakrishnan
@ 2014-06-02  7:31           ` Jan Beulich
  2014-06-02 14:13             ` Boris Ostrovsky
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2014-06-02  7:31 UTC (permalink / raw)
  To: Aravind Gopalakrishnan, Andrew Cooper, boris ostrovsky; +Cc: keir, xen-devel

>>> On 30.05.14 at 18:21, <andrew.cooper3@citrix.com> wrote:
> The unhandled #GP fault certainly should be wrapped with wrmsr_safe(), 
> and an error/warning presented to the user.  In the case that a bad 
> ucode is discovered, it should be discarded and the server allowed to 
> boot.  It is substantially more useful for the server to come up and say 
> "I couldn't load that bit of microcode you wanted me to", than to sit in 
> a reboot loop because you made a typo in the bootloader config, and have 
> to get someone in the datacenter to poke the physical server.

But this isn't due to a typo somewhere, but due to a corrupted
microcode blob.

Besides that no matter which BKDG I look at, I can't seem to find any
indication of there being room for a #GP here if the MSR itself is
implemented. While I don't question its presence in reality, I'd prefer
if this was documented properly for a patch to recover from it to go
in.

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] x86, amd_ucode: Safeguard against #GP
  2014-06-02  7:31           ` Jan Beulich
@ 2014-06-02 14:13             ` Boris Ostrovsky
  0 siblings, 0 replies; 10+ messages in thread
From: Boris Ostrovsky @ 2014-06-02 14:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, keir, Aravind Gopalakrishnan, xen-devel

On 06/02/2014 03:31 AM, Jan Beulich wrote:
>>>> On 30.05.14 at 18:21, <andrew.cooper3@citrix.com> wrote:
>> The unhandled #GP fault certainly should be wrapped with wrmsr_safe(),
>> and an error/warning presented to the user.  In the case that a bad
>> ucode is discovered, it should be discarded and the server allowed to
>> boot.  It is substantially more useful for the server to come up and say
>> "I couldn't load that bit of microcode you wanted me to", than to sit in
>> a reboot loop because you made a typo in the bootloader config, and have
>> to get someone in the datacenter to poke the physical server.
> But this isn't due to a typo somewhere, but due to a corrupted
> microcode blob.

Right, but the argument that we don't want to be stuck in the reboot 
loop still holds.

> Besides that no matter which BKDG I look at, I can't seem to find any
> indication of there being room for a #GP here if the MSR itself is
> implemented. While I don't question its presence in reality, I'd prefer
> if this was documented properly for a patch to recover from it to go
> in.

Unfortunately the whole microcode patching procedure is, to put it 
mildly, not well documented, particularly the #GP part. We had an email 
exchange with an AMD HW architect and he confirmed that corrupted patch 
results in #GP.

-boris

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-06-02 14:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-27 18:24 [PATCH V2] x86, amd_ucode: Safeguard against #GP Aravind Gopalakrishnan
2014-05-27 23:47 ` Andrew Cooper
2014-05-28 15:16   ` Aravind Gopalakrishnan
2014-05-28 17:56     ` boris ostrovsky
2014-05-30 16:01       ` Aravind Gopalakrishnan
2014-05-30 16:21         ` Andrew Cooper
2014-05-30 16:46           ` Aravind Gopalakrishnan
2014-06-02  7:31           ` Jan Beulich
2014-06-02 14:13             ` Boris Ostrovsky
2014-05-28  7:22 ` Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).