* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
From: Andrew Cooper @ 2012-12-04 18:25 UTC
To: xen-devel@lists.xen.org
On 04/12/12 18:16, Andrew Cooper wrote:
> Experimentally, certain crash kernels will triple fault very early after
> starting if started with NMIs disabled. This was discovered when
> experimenting with a debug keyhandler which deliberately created a
> reentrant NMI, causing stack corruption.
>
> Because of this bug, and because the future changes to the NMI
> handling will make the kexec path more fragile, take the time now to
> bullet-proof the kexec behaviour so it is safe in all circumstances.
>
> This patch adds three new low level routines:
> * trap_nop
> This is a no-op handler which irets immediately
> * nmi_crash
> This is a special NMI handler for use during a kexec crash
> * enable_nmis
> This function enables NMIs by executing an iret-to-self
>
> As a result, the new behaviour of the kexec_crash path is:
>
> nmi_shootdown_cpus() will:
>
> * Disable interrupt stack tables.
> Disabling ISTs will prevent stack corruption from future NMIs and
> MCEs. As Xen is going to crash, we are not concerned with the
> security race condition with 'sysret'; any guest which managed to
> benefit from the race condition will only be able to execute a handful
> of instructions before it will be killed anyway, and a guest is
> still unable to trigger the race condition in the first place.
>
> * Swap the NMI trap handlers.
> The crashing pcpu gets the nop handler, to prevent it getting stuck in
> an NMI context and hanging rather than crashing. The non-crashing
> pcpus all get the nmi_crash handler, which is designed never to
> return.
>
> do_nmi_crash() will:
>
> * Save the crash notes and shut the pcpu down.
> There is now an extra per-cpu variable to prevent us from executing
> this multiple times. If we re-enter midway through, attempt the
> whole operation again, in preference to leaving it incomplete.
>
> * Set up another NMI at the LAPIC.
> Even when the LAPIC has been disabled, the ID and command registers
> are still usable. As a result, we can deliberately queue up a new
> NMI to re-interrupt us later if NMIs get unlatched. Because of the
> call to __stop_this_cpu(), we have to hand-craft self_nmi() to be
> safe from General Protection Faults.
>
> * Fall into infinite loop.
>
> machine_kexec() will:
>
> * Swap the MCE handlers to be a nop.
> We cannot prevent MCEs from being delivered when we pass off to the
> crash kernel, and the less Xen context is touched, the better.
>
> * Explicitly enable NMIs.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,97 @@
>
> static atomic_t waiting_for_crash_ipi;
> static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
> {
> - /* Don't do anything if this handler is invoked on crashing cpu.
> - * Otherwise, system will completely hang. Crashing cpu can get
> - * an NMI if system was initially booted with nmi_watchdog parameter.
> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> + ASSERT( crashing_cpu != smp_processor_id() );
> +
> + /* Save crash information and shut down CPU. Attempt only once. */
> + if ( ! this_cpu(crash_save_done) )
> + {
> + kexec_crash_save_cpu();
> + __stop_this_cpu();
> +
> + this_cpu(crash_save_done) = 1;
And I have already found a bug. The
"atomic_dec(&waiting_for_crash_ipi);" from below should appear here.
Without it, we just wait the full second in nmi_shootdown_cpus() and
continue anyway.
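I.e. something like the following (sketch only; the decrement is the
only change from the patch above):

    /* Save crash information and shut down CPU.  Attempt only once. */
    if ( ! this_cpu(crash_save_done) )
    {
        kexec_crash_save_cpu();
        __stop_this_cpu();

        this_cpu(crash_save_done) = 1;

        /* Tell nmi_shootdown_cpus() this pcpu is done, so it does not
         * have to sit out its full timeout. */
        atomic_dec(&waiting_for_crash_ipi);
    }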
~Andrew
> + }
> +
> + /* Poor man's self_nmi(). __stop_this_cpu() has reverted the LAPIC
> + * back to its boot state, so we are unable to rely on the regular
> + * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> + * (The likely scenario is that we have reverted from x2apic mode to
> + * xapic, at which point #GPFs will occur if we use the apic_*
> + * functions)
> + *
> + * The ICR and APIC ID of the LAPIC are still valid even during
> + * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
> + * queue up another NMI, to force us back into this loop if we exit.
> */
> - if ( cpu == crashing_cpu )
> - return 1;
> - local_irq_disable();
> + switch ( current_local_apic_mode() )
> + {
> + u32 apic_id;
>
> - kexec_crash_save_cpu();
> + case APIC_MODE_X2APIC:
> + apic_id = apic_rdmsr(APIC_ID);
>
> - __stop_this_cpu();
> + apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> + break;
>
> - atomic_dec(&waiting_for_crash_ipi);
> + case APIC_MODE_XAPIC:
> + apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> + while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> + cpu_relax();
> +
> + apic_mem_write(APIC_ICR2, apic_id << 24);
> + apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> + break;
> +
> + default:
> + break;
> + }
>
> for ( ; ; )
> halt();
> -
> - return 1;
> }
>
> static void nmi_shootdown_cpus(void)
> {
> unsigned long msecs;
> + int i;
> + int cpu = smp_processor_id();
>
> local_irq_disable();
>
> - crashing_cpu = smp_processor_id();
> + crashing_cpu = cpu;
> local_irq_count(crashing_cpu) = 0;
>
> atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> - /* Would it be better to replace the trap vector here? */
> - set_nmi_callback(crash_nmi_callback);
> +
> + /* Change NMI trap handlers. Non-crashing pcpus get nmi_crash,
> + * which invokes do_nmi_crash (above), causing them to save state
> + * and fall into a loop. The crashing pcpu gets the nop handler,
> + * to cause it to return to this function ASAP.
> + *
> + * Furthermore, disable stack tables for NMIs and MCEs to prevent
> + * race conditions resulting in corrupt stack frames. As Xen is
> + * about to die, we no longer care about the security-related race
> + * condition with 'sysret' which ISTs are designed to solve.
> + */
> + for ( i = 0; i < nr_cpu_ids; ++i )
> + if ( idt_tables[i] )
> + {
> + idt_tables[i][TRAP_nmi].a &= ~(7UL << 32);
> + idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
> +
> + if ( i == cpu )
> + _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> + else
> + _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
> + }
> +
> /* Ensure the new callback function is set before sending out the NMI. */
> wmb();
>
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
> */
> local_irq_disable();
>
> + /* Now regular interrupts are disabled, we need to reduce the impact
> + * of interrupts not disabled by 'cli'. NMIs have already been
> + * dealt with by machine_crash_shutdown().
> + *
> + * Set every pcpu's MCE handler to be a no-op. */
> + set_intr_gate(TRAP_machine_check, &trap_nop);
> +
> + /* Explicitly enable NMIs on this CPU. Some crashdump kernels do
> + * not like running with NMIs disabled. */
> + enable_nmis();
> +
> /*
> * compat_machine_kexec() returns to idle pagetables, which requires us
> * to be running on a static GDT mapping (idle pagetables have no GDT
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,42 @@ ENTRY(nmi)
> movl $TRAP_nmi,4(%rsp)
> jmp handle_ist_exception
>
> +ENTRY(nmi_crash)
> + cli
> + pushq $0
> + movl $TRAP_nmi,4(%rsp)
> + SAVE_ALL
> + movq %rsp,%rdi
> + callq do_nmi_crash /* Does not return */
> + ud2
> +
> ENTRY(machine_check)
> pushq $0
> movl $TRAP_machine_check,4(%rsp)
> jmp handle_ist_exception
>
> +/* No op trap handler. Required for kexec path. */
> +ENTRY(trap_nop)
> + iretq
> +
> +/* Enable NMIs. No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> + push %rax
> + movq %rsp, %rax /* Grab RSP before pushing */
> +
> + /* Set up stack frame */
> + pushq $0 /* SS */
> + pushq %rax /* RSP */
> + pushfq /* RFLAGS */
> + pushq $__HYPERVISOR_CS /* CS */
> + leaq 1f(%rip),%rax
> + pushq %rax /* RIP */
> +
> + iretq /* Disable the hardware NMI latch */
> +1:
> + popq %rax
> + retq
> +
> .section .rodata, "a", @progbits
>
> ENTRY(exception_table)
> diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
> DECLARE_TRAP_HANDLER(divide_error);
> DECLARE_TRAP_HANDLER(debug);
> DECLARE_TRAP_HANDLER(nmi);
> +DECLARE_TRAP_HANDLER(nmi_crash);
> DECLARE_TRAP_HANDLER(int3);
> DECLARE_TRAP_HANDLER(overflow);
> DECLARE_TRAP_HANDLER(bounds);
> @@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
> DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
> #undef DECLARE_TRAP_HANDLER
>
> +void trap_nop(void);
> +void enable_nmis(void);
> +
> void syscall_enter(void);
> void sysenter_entry(void);
> void sysenter_eflags_saved(void);
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
From: Jan Beulich @ 2012-12-05 9:17 UTC
To: Andrew Cooper; +Cc: xen-devel
>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,97 @@
>
> static atomic_t waiting_for_crash_ipi;
> static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
> {
> - /* Don't do anything if this handler is invoked on crashing cpu.
> - * Otherwise, system will completely hang. Crashing cpu can get
> - * an NMI if system was initially booted with nmi_watchdog parameter.
> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> + ASSERT( crashing_cpu != smp_processor_id() );
> +
> + /* Save crash information and shut down CPU. Attempt only once. */
> + if ( ! this_cpu(crash_save_done) )
> + {
> + kexec_crash_save_cpu();
> + __stop_this_cpu();
> +
> + this_cpu(crash_save_done) = 1;
> + }
> +
> + /* Poor man's self_nmi(). __stop_this_cpu() has reverted the LAPIC
> + * back to its boot state, so we are unable to rely on the regular
> + * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> + * (The likely scenario is that we have reverted from x2apic mode to
> + * xapic, at which point #GPFs will occur if we use the apic_*
> + * functions)
> + *
> + * The ICR and APIC ID of the LAPIC are still valid even during
> + * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
> + * queue up another NMI, to force us back into this loop if we exit.
> */
> - if ( cpu == crashing_cpu )
> - return 1;
> - local_irq_disable();
> + switch ( current_local_apic_mode() )
> + {
> + u32 apic_id;
>
> - kexec_crash_save_cpu();
> + case APIC_MODE_X2APIC:
> + apic_id = apic_rdmsr(APIC_ID);
>
> - __stop_this_cpu();
> + apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
As in the description you talk about the LAPIC being (possibly)
disabled here - did you check that this would not #GP in that
case?
> + break;
>
> - atomic_dec(&waiting_for_crash_ipi);
> + case APIC_MODE_XAPIC:
> + apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> + while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> + cpu_relax();
> +
> + apic_mem_write(APIC_ICR2, apic_id << 24);
> + apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> + break;
> +
> + default:
> + break;
> + }
>
> for ( ; ; )
> halt();
> -
> - return 1;
> }
>...
> ENTRY(machine_check)
> pushq $0
> movl $TRAP_machine_check,4(%rsp)
> jmp handle_ist_exception
>
> +/* No op trap handler. Required for kexec path. */
> +ENTRY(trap_nop)
I'd prefer you to re-use an existing IRETQ (e.g. the one you add
below) here - this is not performance critical, and hence there's
no need for it to be aligned.
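For example (a sketch only; using a plain label plus .globl rather
than ENTRY(), so no alignment padding lands in the fall-through path):

    /* Enable NMIs.  No special register assumptions, and all preserved. */
    ENTRY(enable_nmis)
            push %rax
            movq %rsp, %rax          /* Grab RSP before pushing */

            /* Set up stack frame */
            pushq $0                 /* SS */
            pushq %rax               /* RSP */
            pushfq                   /* RFLAGS */
            pushq $__HYPERVISOR_CS   /* CS */
            leaq  1f(%rip),%rax
            pushq %rax               /* RIP */

            /* No-op trap handler: entered via the IDT, this IRETQ just
             * returns from the trap; fallen into from above, it pops the
             * crafted frame and disables the hardware NMI latch. */
            .globl trap_nop
    trap_nop:
            iretq
    1:
            popq %rax
            retq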
Jan
> + iretq
> +
> +/* Enable NMIs. No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> + push %rax
> + movq %rsp, %rax /* Grab RSP before pushing */
> +
> + /* Set up stack frame */
> + pushq $0 /* SS */
> + pushq %rax /* RSP */
> + pushfq /* RFLAGS */
> + pushq $__HYPERVISOR_CS /* CS */
> + leaq 1f(%rip),%rax
> + pushq %rax /* RIP */
> +
> + iretq /* Disable the hardware NMI latch */
> +1:
> + popq %rax
> + retq
> +
> .section .rodata, "a", @progbits
>
> ENTRY(exception_table)
* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
From: Andrew Cooper @ 2012-12-05 10:05 UTC
To: Jan Beulich; +Cc: xen-devel@lists.xen.org
On 05/12/12 09:17, Jan Beulich wrote:
>>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/arch/x86/crash.c
>> +++ b/xen/arch/x86/crash.c
>> @@ -32,41 +32,97 @@
>>
>> static atomic_t waiting_for_crash_ipi;
>> static unsigned int crashing_cpu;
>> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>>
>> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
>> +/* This becomes the NMI handler for non-crashing CPUs. */
>> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>> {
>> - /* Don't do anything if this handler is invoked on crashing cpu.
>> - * Otherwise, system will completely hang. Crashing cpu can get
>> - * an NMI if system was initially booted with nmi_watchdog parameter.
>> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
>> + ASSERT( crashing_cpu != smp_processor_id() );
>> +
>> + /* Save crash information and shut down CPU. Attempt only once. */
>> + if ( ! this_cpu(crash_save_done) )
>> + {
>> + kexec_crash_save_cpu();
>> + __stop_this_cpu();
>> +
>> + this_cpu(crash_save_done) = 1;
>> + }
>> +
>> + /* Poor man's self_nmi(). __stop_this_cpu() has reverted the LAPIC
>> + * back to its boot state, so we are unable to rely on the regular
>> + * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
>> + * (The likely scenario is that we have reverted from x2apic mode to
>> + * xapic, at which point #GPFs will occur if we use the apic_*
>> + * functions)
>> + *
>> + * The ICR and APIC ID of the LAPIC are still valid even during
>> + * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
>> + * queue up another NMI, to force us back into this loop if we exit.
>> */
>> - if ( cpu == crashing_cpu )
>> - return 1;
>> - local_irq_disable();
>> + switch ( current_local_apic_mode() )
>> + {
>> + u32 apic_id;
>>
>> - kexec_crash_save_cpu();
>> + case APIC_MODE_X2APIC:
>> + apic_id = apic_rdmsr(APIC_ID);
>>
>> - __stop_this_cpu();
>> + apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> As in the description you talk about the LAPIC being (possibly)
> disabled here - did you check that this would not #GP in that
> case?
Yes - we are switching on current_local_apic_mode() which reads the
APICBASE MSR to work out which mode we are in, and returns an enum
apic_mode.
With this information, all the apic_*msr accesses are in the x2apic
case, and all the apic_mem_* accesses are in the xapic case.
If the apic_mode is disabled, then we skip trying to set up the next NMI.
The patch is sadly rather less legible than I would have hoped, but the
code post patch is far more logical to read.
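For reference, the mode detection boils down to something like this (a
sketch only -- I am assuming the exact spellings of the APICBASE bit
constants and of the 'disabled' enumerator here; the real helper
decodes the same two bits):

    enum apic_mode current_local_apic_mode(void)
    {
        u64 msr_contents;

        rdmsrl(MSR_IA32_APICBASE, msr_contents);

        /* x2apic mode implies the xapic enable bit is also set, so test
         * the EXTD bit (bit 10) before the EN bit (bit 11). */
        if ( msr_contents & MSR_IA32_APICBASE_EXTD )
            return APIC_MODE_X2APIC;

        if ( msr_contents & MSR_IA32_APICBASE_ENABLE )
            return APIC_MODE_XAPIC;

        return APIC_MODE_DISABLED;
    }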
>> + break;
>>
>> - atomic_dec(&waiting_for_crash_ipi);
>> + case APIC_MODE_XAPIC:
>> + apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
>> +
>> + while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
>> + cpu_relax();
>> +
>> + apic_mem_write(APIC_ICR2, apic_id << 24);
>> + apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
>> + break;
>> +
>> + default:
>> + break;
>> + }
>>
>> for ( ; ; )
>> halt();
>> -
>> - return 1;
>> }
>> ...
>> ENTRY(machine_check)
>> pushq $0
>> movl $TRAP_machine_check,4(%rsp)
>> jmp handle_ist_exception
>>
>> +/* No op trap handler. Required for kexec path. */
>> +ENTRY(trap_nop)
> I'd prefer you to re-use an existing IRETQ (e.g. the one you add
> below) here - this is not performance critical, and hence there's
> no need for it to be aligned.
>
> Jan
Certainly.
~Andrew
>
>> + iretq
>> +
>> +/* Enable NMIs. No special register assumptions, and all preserved. */
>> +ENTRY(enable_nmis)
>> + push %rax
>> + movq %rsp, %rax /* Grab RSP before pushing */
>> +
>> + /* Set up stack frame */
>> + pushq $0 /* SS */
>> + pushq %rax /* RSP */
>> + pushfq /* RFLAGS */
>> + pushq $__HYPERVISOR_CS /* CS */
>> + leaq 1f(%rip),%rax
>> + pushq %rax /* RIP */
>> +
>> + iretq /* Disable the hardware NMI latch */
>> +1:
>> + popq %rax
>> + retq
>> +
>> .section .rodata, "a", @progbits
>>
>> ENTRY(exception_table)
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
From: Tim Deegan @ 2012-12-06 11:36 UTC
To: Andrew Cooper; +Cc: xen-devel
At 18:16 +0000 on 04 Dec (1354644968), Andrew Cooper wrote:
> do_nmi_crash() will:
>
> * Save the crash notes and shut the pcpu down.
> There is now an extra per-cpu variable to prevent us from executing
> this multiple times. If we re-enter midway through, attempt the
> whole operation again, in preference to leaving it incomplete.
>
> * Set up another NMI at the LAPIC.
> Even when the LAPIC has been disabled, the ID and command registers
> are still usable. As a result, we can deliberately queue up a new
> NMI to re-interrupt us later if NMIs get unlatched.
Can you please include this reasoning in the code itself, as well as in
the commit message?
> machine_kexec() will:
>
> * Swap the MCE handlers to be a nop.
> We cannot prevent MCEs from being delivered when we pass off to the
> crash kernel, and the less Xen context is touched, the better.
>
> * Explicitly enable NMIs.
Would it be cleaner to have this path explicitly set the IDT entry to
invoke the noop handler? Or do we know this is always the case when we
reach this point?
I'm just wondering about the case where we get here with an NMI pending.
Presumably if we have some other source of NMIs active while we kexec,
the post-exec kernel will crash if it overwrites our handlers &c before
setting up its own IDT. But if kexec()ing with NMIs _disabled_ also
fails then we haven't much choice. :|
Otherwise, this looks good to me.
Tim.
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,97 @@
>
> static atomic_t waiting_for_crash_ipi;
> static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
> {
> - /* Don't do anything if this handler is invoked on crashing cpu.
> - * Otherwise, system will completely hang. Crashing cpu can get
> - * an NMI if system was initially booted with nmi_watchdog parameter.
> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> + ASSERT( crashing_cpu != smp_processor_id() );
> +
> + /* Save crash information and shut down CPU. Attempt only once. */
> + if ( ! this_cpu(crash_save_done) )
> + {
> + kexec_crash_save_cpu();
> + __stop_this_cpu();
> +
> + this_cpu(crash_save_done) = 1;
> + }
> +
> + /* Poor man's self_nmi(). __stop_this_cpu() has reverted the LAPIC
> + * back to its boot state, so we are unable to rely on the regular
> + * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> + * (The likely scenario is that we have reverted from x2apic mode to
> + * xapic, at which point #GPFs will occur if we use the apic_*
> + * functions)
> + *
> + * The ICR and APIC ID of the LAPIC are still valid even during
> + * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
> + * queue up another NMI, to force us back into this loop if we exit.
> */
> - if ( cpu == crashing_cpu )
> - return 1;
> - local_irq_disable();
> + switch ( current_local_apic_mode() )
> + {
> + u32 apic_id;
>
> - kexec_crash_save_cpu();
> + case APIC_MODE_X2APIC:
> + apic_id = apic_rdmsr(APIC_ID);
>
> - __stop_this_cpu();
> + apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> + break;
>
> - atomic_dec(&waiting_for_crash_ipi);
> + case APIC_MODE_XAPIC:
> + apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> + while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> + cpu_relax();
> +
> + apic_mem_write(APIC_ICR2, apic_id << 24);
> + apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> + break;
> +
> + default:
> + break;
> + }
>
> for ( ; ; )
> halt();
> -
> - return 1;
> }
>
> static void nmi_shootdown_cpus(void)
> {
> unsigned long msecs;
> + int i;
> + int cpu = smp_processor_id();
>
> local_irq_disable();
>
> - crashing_cpu = smp_processor_id();
> + crashing_cpu = cpu;
> local_irq_count(crashing_cpu) = 0;
>
> atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> - /* Would it be better to replace the trap vector here? */
> - set_nmi_callback(crash_nmi_callback);
> +
> + /* Change NMI trap handlers. Non-crashing pcpus get nmi_crash,
> + * which invokes do_nmi_crash (above), causing them to save state
> + * and fall into a loop. The crashing pcpu gets the nop handler,
> + * to cause it to return to this function ASAP.
> + *
> + * Furthermore, disable stack tables for NMIs and MCEs to prevent
> + * race conditions resulting in corrupt stack frames. As Xen is
> + * about to die, we no longer care about the security-related race
> + * condition with 'sysret' which ISTs are designed to solve.
> + */
> + for ( i = 0; i < nr_cpu_ids; ++i )
> + if ( idt_tables[i] )
> + {
> + idt_tables[i][TRAP_nmi].a &= ~(7UL << 32);
> + idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
> +
> + if ( i == cpu )
> + _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> + else
> + _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
> + }
> +
> /* Ensure the new callback function is set before sending out the NMI. */
> wmb();
>
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
> */
> local_irq_disable();
>
> + /* Now regular interrupts are disabled, we need to reduce the impact
> + * of interrupts not disabled by 'cli'. NMIs have already been
> + * dealt with by machine_crash_shutdown().
> + *
> + * Set every pcpu's MCE handler to be a no-op. */
> + set_intr_gate(TRAP_machine_check, &trap_nop);
> +
> + /* Explicitly enable NMIs on this CPU. Some crashdump kernels do
> + * not like running with NMIs disabled. */
> + enable_nmis();
> +
> /*
> * compat_machine_kexec() returns to idle pagetables, which requires us
> * to be running on a static GDT mapping (idle pagetables have no GDT
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,42 @@ ENTRY(nmi)
> movl $TRAP_nmi,4(%rsp)
> jmp handle_ist_exception
>
> +ENTRY(nmi_crash)
> + cli
> + pushq $0
> + movl $TRAP_nmi,4(%rsp)
> + SAVE_ALL
> + movq %rsp,%rdi
> + callq do_nmi_crash /* Does not return */
> + ud2
> +
> ENTRY(machine_check)
> pushq $0
> movl $TRAP_machine_check,4(%rsp)
> jmp handle_ist_exception
>
> +/* No op trap handler. Required for kexec path. */
> +ENTRY(trap_nop)
> + iretq
> +
> +/* Enable NMIs. No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> + push %rax
> + movq %rsp, %rax /* Grab RSP before pushing */
> +
> + /* Set up stack frame */
> + pushq $0 /* SS */
> + pushq %rax /* RSP */
> + pushfq /* RFLAGS */
> + pushq $__HYPERVISOR_CS /* CS */
> + leaq 1f(%rip),%rax
> + pushq %rax /* RIP */
> +
> + iretq /* Disable the hardware NMI latch */
> +1:
> + popq %rax
> + retq
> +
> .section .rodata, "a", @progbits
>
> ENTRY(exception_table)
> diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
> DECLARE_TRAP_HANDLER(divide_error);
> DECLARE_TRAP_HANDLER(debug);
> DECLARE_TRAP_HANDLER(nmi);
> +DECLARE_TRAP_HANDLER(nmi_crash);
> DECLARE_TRAP_HANDLER(int3);
> DECLARE_TRAP_HANDLER(overflow);
> DECLARE_TRAP_HANDLER(bounds);
> @@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
> DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
> #undef DECLARE_TRAP_HANDLER
>
> +void trap_nop(void);
> +void enable_nmis(void);
> +
> void syscall_enter(void);
> void sysenter_entry(void);
> void sysenter_eflags_saved(void);
>
* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
From: Andrew Cooper @ 2012-12-06 11:58 UTC
To: Tim Deegan; +Cc: xen-devel@lists.xen.org
On 06/12/12 11:36, Tim Deegan wrote:
> At 18:16 +0000 on 04 Dec (1354644968), Andrew Cooper wrote:
>> do_nmi_crash() will:
>>
>> * Save the crash notes and shut the pcpu down.
>> There is now an extra per-cpu variable to prevent us from executing
>> this multiple times. If we re-enter midway through, attempt the
>> whole operation again, in preference to leaving it incomplete.
>>
>> * Set up another NMI at the LAPIC.
>> Even when the LAPIC has been disabled, the ID and command registers
>> are still usable. As a result, we can deliberately queue up a new
>> NMI to re-interrupt us later if NMIs get unlatched.
> Can you please include this reasoning in the code itself, as well as in
> the commit message?
There is a comment alluding to this: the final sentence in the block
beginning "Poor man's self_nmi()". I will reword it to make it clearer.
>
>> machine_kexec() will:
>>
>> * Swap the MCE handlers to be a nop.
>> We cannot prevent MCEs from being delivered when we pass off to the
>> crash kernel, and the less Xen context is touched, the better.
>>
>> * Explicitly enable NMIs.
> Would it be cleaner to have this path explicitly set the IDT entry to
> invoke the noop handler? Or do we know this is always the case when we
> reach this point?
Early versions of this patch did change the NMI entry here, but that was
before I decided that nmi_shootdown_cpus() needed redoing.
As it currently stands, all pcpus other than us have the nmi_crash
handler, and we have the nop handler because the call graph looks like:
kexec_crash()
    ...
    machine_crash_shutdown()
        nmi_shootdown_cpus()
    machine_kexec()
I will however clarify this NMI setup with a comment at this location.
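Roughly (exact wording still to be decided):

    /* In the crash case we arrive here from kexec_crash() via
     * machine_crash_shutdown(), so nmi_shootdown_cpus() has already
     * installed the nop NMI handler on this (the crashing) pcpu and
     * nmi_crash on all the others.  Only the MCE handler still needs
     * swapping here. */
    set_intr_gate(TRAP_machine_check, &trap_nop);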
>
> I'm just wondering about the case where we get here with an NMI pending.
> Presumably if we have some other source of NMIs active while we kexec,
> the post-exec kernel will crash if it overwrites our handlers &c before
> setting up its own IDT. But if kexec()ing with NMIs _disabled_ also
> fails then we haven't much choice. :|
>
> Otherwise, this looks good to me.
>
> Tim.
We do have another source of NMIs active; the watchdog disable routine
only tells the NMI handler to ignore watchdog NMIs, rather than
disabling the source itself. It is another item on my hitlist.
The other problem we have with the kexec mechanism is to do with
taking NMIs and MCEs between purgatory changing addressing schemes and
setting up its own IDT. I believe that compat_machine_kexec() does this
in an unsafe way even before jumping to purgatory. I was wondering
whether it was possible to monopolise some very low pages which could be
identity mapped in any paging scheme, but I can't think of a nice way to
fix a move into real mode, at which point the loaded IDT is completely
invalid. Another item for the hitlist.
~Andrew
>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
>> --- a/xen/arch/x86/crash.c
>> +++ b/xen/arch/x86/crash.c
>> @@ -32,41 +32,97 @@
>>
>> static atomic_t waiting_for_crash_ipi;
>> static unsigned int crashing_cpu;
>> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>>
>> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
>> +/* This becomes the NMI handler for non-crashing CPUs. */
>> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>> {
>> - /* Don't do anything if this handler is invoked on crashing cpu.
>> - * Otherwise, system will completely hang. Crashing cpu can get
>> - * an NMI if system was initially booted with nmi_watchdog parameter.
>> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
>> + ASSERT( crashing_cpu != smp_processor_id() );
>> +
>> + /* Save crash information and shut down CPU. Attempt only once. */
>> + if ( ! this_cpu(crash_save_done) )
>> + {
>> + kexec_crash_save_cpu();
>> + __stop_this_cpu();
>> +
>> + this_cpu(crash_save_done) = 1;
>> + }
>> +
>> + /* Poor man's self_nmi(). __stop_this_cpu() has reverted the LAPIC
>> + * back to its boot state, so we are unable to rely on the regular
>> + * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
>> + * (The likely scenario is that we have reverted from x2apic mode to
>> + * xapic, at which point #GPFs will occur if we use the apic_*
>> + * functions)
>> + *
>> + * The ICR and APIC ID of the LAPIC are still valid even during
>> + * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
>> + * queue up another NMI, to force us back into this loop if we exit.
>> */
>> - if ( cpu == crashing_cpu )
>> - return 1;
>> - local_irq_disable();
>> + switch ( current_local_apic_mode() )
>> + {
>> + u32 apic_id;
>>
>> - kexec_crash_save_cpu();
>> + case APIC_MODE_X2APIC:
>> + apic_id = apic_rdmsr(APIC_ID);
>>
>> - __stop_this_cpu();
>> + apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
>> + break;
>>
>> - atomic_dec(&waiting_for_crash_ipi);
>> + case APIC_MODE_XAPIC:
>> + apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
>> +
>> + while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
>> + cpu_relax();
>> +
>> + apic_mem_write(APIC_ICR2, apic_id << 24);
>> + apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
>> + break;
>> +
>> + default:
>> + break;
>> + }
>>
>> for ( ; ; )
>> halt();
>> -
>> - return 1;
>> }
>>
>> static void nmi_shootdown_cpus(void)
>> {
>> unsigned long msecs;
>> + int i;
>> + int cpu = smp_processor_id();
>>
>> local_irq_disable();
>>
>> - crashing_cpu = smp_processor_id();
>> + crashing_cpu = cpu;
>> local_irq_count(crashing_cpu) = 0;
>>
>> atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
>> - /* Would it be better to replace the trap vector here? */
>> - set_nmi_callback(crash_nmi_callback);
>> +
>> + /* Change NMI trap handlers. Non-crashing pcpus get nmi_crash,
>> + * which invokes do_nmi_crash (above), causing them to save state
>> + * and fall into a loop. The crashing pcpu gets the nop handler,
>> + * to cause it to return to this function ASAP.
>> + *
>> + * Furthermore, disable stack tables for NMIs and MCEs to prevent
>> + * race conditions resulting in corrupt stack frames. As Xen is
>> + * about to die, we no longer care about the security-related race
>> + * condition with 'sysret' which ISTs are designed to solve.
>> + */
>> + for ( i = 0; i < nr_cpu_ids; ++i )
>> + if ( idt_tables[i] )
>> + {
>> + idt_tables[i][TRAP_nmi].a &= ~(7UL << 32);
>> + idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
>> +
>> + if ( i == cpu )
>> + _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
>> + else
>> + _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
>> + }
>> +
>> /* Ensure the new callback function is set before sending out the NMI. */
>> wmb();
>>
>> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
>> --- a/xen/arch/x86/machine_kexec.c
>> +++ b/xen/arch/x86/machine_kexec.c
>> @@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
>> */
>> local_irq_disable();
>>
>> + /* Now regular interrupts are disabled, we need to reduce the impact
>> + * of interrupts not disabled by 'cli'. NMIs have already been
>> + * dealt with by machine_crash_shutdown().
>> + *
>> + * Set every pcpu's MCE handler to be a no-op. */
>> + set_intr_gate(TRAP_machine_check, &trap_nop);
>> +
>> + /* Explicitly enable NMIs on this CPU. Some crashdump kernels do
>> + * not like running with NMIs disabled. */
>> + enable_nmis();
>> +
>> /*
>> * compat_machine_kexec() returns to idle pagetables, which requires us
>> * to be running on a static GDT mapping (idle pagetables have no GDT
>> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -635,11 +635,42 @@ ENTRY(nmi)
>> movl $TRAP_nmi,4(%rsp)
>> jmp handle_ist_exception
>>
>> +ENTRY(nmi_crash)
>> + cli
>> + pushq $0
>> + movl $TRAP_nmi,4(%rsp)
>> + SAVE_ALL
>> + movq %rsp,%rdi
>> + callq do_nmi_crash /* Does not return */
>> + ud2
>> +
>> ENTRY(machine_check)
>> pushq $0
>> movl $TRAP_machine_check,4(%rsp)
>> jmp handle_ist_exception
>>
>> +/* No op trap handler. Required for kexec path. */
>> +ENTRY(trap_nop)
>> + iretq
>> +
>> +/* Enable NMIs. No special register assumptions, and all preserved. */
>> +ENTRY(enable_nmis)
>> + push %rax
>> + movq %rsp, %rax /* Grab RSP before pushing */
>> +
>> + /* Set up stack frame */
>> + pushq $0 /* SS */
>> + pushq %rax /* RSP */
>> + pushfq /* RFLAGS */
>> + pushq $__HYPERVISOR_CS /* CS */
>> + leaq 1f(%rip),%rax
>> + pushq %rax /* RIP */
>> +
>> + iretq /* Disable the hardware NMI latch */
>> +1:
>> + popq %rax
>> + retq
>> +
>> .section .rodata, "a", @progbits
>>
>> ENTRY(exception_table)
>> diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
>> DECLARE_TRAP_HANDLER(divide_error);
>> DECLARE_TRAP_HANDLER(debug);
>> DECLARE_TRAP_HANDLER(nmi);
>> +DECLARE_TRAP_HANDLER(nmi_crash);
>> DECLARE_TRAP_HANDLER(int3);
>> DECLARE_TRAP_HANDLER(overflow);
>> DECLARE_TRAP_HANDLER(bounds);
>> @@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
>> DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
>> #undef DECLARE_TRAP_HANDLER
>>
>> +void trap_nop(void);
>> +void enable_nmis(void);
>> +
>> void syscall_enter(void);
>> void sysenter_entry(void);
>> void sysenter_eflags_saved(void);
>>
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com