Xen-4.3 - curious crash

* Xen-4.3 - curious crash
@ 2014-01-28 20:25 Andrew Cooper
  2014-01-29  8:43 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2014-01-28 20:25 UTC
  To: Xen-devel List, Jan Beulich

Hello,

Last night, XenRT discovered an interesting host crash.  The crash
itself is somewhat concerning, but the lack of information also highlights
an area which could do with easier debuggability.

Here are the results from the serial console.  The server in question is
a Supermicro Xeon X5376 system which has not exhibited stability issues
in the past, and has seemed fine in tests today.

I have linearised the stack and added notes alongside.

----[ Xen-4.3.1-xs82408-d  x86_64  debug=y  Not tainted ]----
CPU:    4
RIP:    e008:[<ffff82c4c0235a92>] compat_create_bounce_frame+0x8/0xec
RFLAGS: 0000000000010046   CONTEXT: hypervisor
rax: 0000000000000061   rbx: ffff8300cfafa000   rcx: ffff82c4c02ffd80
rdx: ffff8300cfafa570   rsi: ffff83022eacfd00   rdi: ffff8300cfafa000
rbp: ffff83022eacfd60   rsp: ffff83022eacff08   r8:  0000000000000000
r9:  0000000000000000   r10: ffff83022ead32e8   r11: 00001ac42042804f
r12: ffff8300cfafa000   r13: 0000000000000004   r14: ffff8300cfd3f000
r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000026f0
cr3: 0000000228dde000   cr2: 00000000b74e4f10
ds: 007b   es: 007b   fs: 00d8   gs: 00e0   ss: 0000   cs: e008
Xen stack trace from rsp=ffff83022eacff08:
    0000000000000093 | rflags from pushfq in ASSERT_INTERRUPTS_ENABLED
    ffff82c4c02358d8 | RA? compat/entry.S:123 in compat_test_all_events()
    0000000000000001 | r15
    ffff8300cfd3f000 | r14
    0000000000000004 | r13
    ffff8300cfafa000 | r12
    00000000c1695ec0 | ebp
    00000000deadbeef | ebx
    0000000000000000 | r11
    00000000deadbeef | r10
    ffff8300cfafa060 | r9
    0000000000000000 | r8
    0000000000000000 | eax
    00000000deadbeef | ecx
    00000000ee8507a0 | edx
    00000000c23a7000 | esi
    0000000000000000 | edi
    0002010000000000 | TRAP_syscall | TRAP_regs_dirty
    00000000c10013a7 + (hypercall page) __HYPERCALL_sched_op
    0000000000000061 |
    0000000000000246 | Exception frame from ring1 kernel
    00000000c1695eb0 |
    0000000000000069 +
    0000000000000000 | es
    0000000000000000 | ds
    0000000000000000 | fs
    0000000000000000 | gs
    0000000000000004 | cpu_info.processor_id
    ffff8300cfafa000 | cpu_info.current_vcpu
    0000003d6e797180 | cpu_info.per_cpu_offset
    0000000000000000 +

Xen call trace:
   [<ffff82c4c0235a92>] compat_create_bounce_frame+0x8/0xec
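
The "Exception frame from ring1 kernel" annotation in the stack dump above can be cross-checked from the saved selectors.  The following is a minimal sketch, assuming the five annotated words are a standard IRET frame (rip, cs, rflags, rsp, ss), so that 0x0061 is the guest cs and 0x0069 the guest ss.  The helper name decode_selector is made up for illustration.

```python
def decode_selector(sel):
    """Split an x86 segment selector into (index, table, rpl)."""
    rpl = sel & 0x3                          # bits 0-1: requested privilege level
    table = "LDT" if sel & 0x4 else "GDT"    # bit 2: table indicator
    index = sel >> 3                         # bits 3-15: descriptor index
    return index, table, rpl


# Values taken from the crash dump; the labels are assumptions about
# which stack word is which.
for name, sel in (("guest cs", 0x0061), ("guest ss", 0x0069), ("Xen cs", 0xe008)):
    index, table, rpl = decode_selector(sel)
    print(f"{name}: {sel:#06x} -> {table} index {index}, RPL {rpl}")
```

Both guest selectors decode to RPL 1, which matches a 32-bit PV kernel running in ring 1, while the hypervisor cs (e008) decodes to RPL 0.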

Xen has failed the ASSERT_INTERRUPTS_ENABLED check at the very top of
compat_create_bounce_frame.  The check itself lacks a bugframe, which is
why it is not automatically recognised as an assertion.

Following the code back using what I presume to be a return address as
the penultimate word on the stack, the codeflow looks like:

compat_test_all_events:
  ...
  sti
  leaq ...
  5x mov ...
  call compat_create_bounce_frame
  jmp  compat_test_all_events

compat_create_bounce_frame:
  pushfq
  testb
  jnz
  ud2

What I presume has happened is that after 'sti', Xen has taken an
interrupt, which has caused some form of corruption.  Judging from the
top word on the stack, rflags looks quite corrupt.  Unfortunately, this
is all the available information.  (The crash kernel failed to boot,
which is another issue I am looking into.)

For crashes like this, particularly when attempting to leave Xen context
and return back to a guest, the information provided by the stack trace
is quite lacking.  The interesting information is what has just
been popped off the stack (which I am hoping would have been another
exception frame).

Would it be sensible to have some indication that we are on the way out
of Xen, so errors in situations like this can take a chance to print
some of the recently popped stack values? I know it won't be a
heavily used debugging aid, but I think it is probably worth the effort
for situations like this where there is simply not enough information to
diagnose the issue.

~Andrew

* Re: Xen-4.3 - curious crash
  2014-01-28 20:25 Xen-4.3 - curious crash Andrew Cooper
@ 2014-01-29  8:43 ` Jan Beulich
  2014-01-29  8:51   ` Ian Campbell
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2014-01-29  8:43 UTC
  To: Andrew Cooper; +Cc: Xen-devel List

>>> On 28.01.14 at 21:25, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>     0000000000000093 | rflags from pushfq in ASSERT_INTERRUPTS_ENABLED
>     ffff82c4c02358d8 | RA? compat/entry.S:123 in compat_test_all_events()
>     0000000000000001 | r15
>     ffff8300cfd3f000 | r14
>     0000000000000004 | r13
>     ffff8300cfafa000 | r12
>     00000000c1695ec0 | ebp
>     00000000deadbeef | ebx
>     0000000000000000 | r11
>     00000000deadbeef | r10
>     ffff8300cfafa060 | r9
>     0000000000000000 | r8
>     0000000000000000 | eax
>     00000000deadbeef | ecx
>     00000000ee8507a0 | edx
>     00000000c23a7000 | esi
>     0000000000000000 | edi
>     0002010000000000 | TRAP_syscall | TRAP_regs_dirty
>     00000000c10013a7 + (hypercall page) __HYPERCALL_sched_op
>     0000000000000061 |
>     0000000000000246 | Exception frame from ring1 kernel
>     00000000c1695eb0 |
>     0000000000000069 +
>     0000000000000000 | es
>     0000000000000000 | ds
>     0000000000000000 | fs
>     0000000000000000 | gs
>     0000000000000004 | cpu_info.processor_id
>     ffff8300cfafa000 | cpu_info.current_vcpu
>     0000003d6e797180 | cpu_info.per_cpu_offset
>     0000000000000000 +
> 
> Xen call trace:
>    [<ffff82c4c0235a92>] compat_create_bounce_frame+0x8/0xec
> 
> 
> Xen has failed the ASSERT_INTERRUPTS_ENABLED check at the very top of
> compat_create_bounce_frame, which itself lacks a bugframe which is why
> it is not automatically recognised as an assertion.
> 
> Following the code back using what I presume to be a return address as
> the penultimate word on the stack, the codeflow looks like:
> 
> compat_test_all_events:
>   ...
>   sti
>   leaq ...
>   5x mov ...
>   call compat_create_bounce_frame
>   jmp  compat_test_all_events
> 
> compat_create_bounce_frame:
>   pushfq
>   testb
>   jnz
>   ud2
> 
> 
> What I presume has happened is that after 'sti', Xen has taken an
> interrupt, which has caused some form of corruption.  Judging from the
> top word on the stack, rflags looks quite corrupt.

Other than IF being clear, I see no obvious corruption:
CF, AF, and SF (and the reserved bit 1) are set, and all other flags
are clear. That seems quite a reasonable state after the
"cmpl  $0xfe,%eax" (the most recent instruction that affected the flags).
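
The flag reading above can be reproduced mechanically.  This is a small sketch that decodes the single-bit architectural flags of an RFLAGS value; the table and the helper name decode_rflags are illustrative, and multi-bit fields such as IOPL are deliberately left out.

```python
# Single-bit RFLAGS positions (bit 1 is reserved and always reads as 1).
RFLAGS_BITS = {
    0: "CF", 1: "reserved(1)", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF", 14: "NT", 16: "RF",
    17: "VM", 18: "AC",
}


def decode_rflags(value):
    """Return the names of the set flags, lowest bit first."""
    return [name for bit, name in sorted(RFLAGS_BITS.items())
            if value & (1 << bit)]


# 0x93 is the word pushed by pushfq; 0x246 is the guest's saved rflags.
for value in (0x93, 0x246):
    print(f"{value:#06x}: {' '.join(decode_rflags(value))}")
```

For 0x93 this lists CF, the reserved bit, AF and SF, with IF (bit 9) absent.  The guest frame value 0x246 decodes to the reserved bit, PF, ZF and IF.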

An interrupt not properly restoring EFLAGS.IF (or actually one not
properly restoring all of EFLAGS) would be very odd. About as odd
as a cosmic radiation induced bit flip resulting in some other
misbehavior. This hasn't been seen more than once I suppose?

> For crashes like this, particularly when attempting to leave Xen context
> and return back to a guest, the information provided by the stack trace
> is quite lacking; The interesting information is what is what has just
> been popped off the stack (which I am hoping would have been another
> exception frame)
> 
> Would it be sensible to have some indication that we are on the way out
> of Xen, so errors in situations like this can take a chance to print
> some of the recently popped stack values? I know it wont be terribly
> heavily used debugging, but think it is probably worth the effort for
> situations like this where there is simply not enough information to
> diagnose the issue.

While I realize that in a case like this seeing stack contents below the
stack pointer may be useful (but there's no guarantee it would be), I
don't think it is reasonable to get the code prepared for all kinds of
extremely unlikely scenarios to be debuggable. If the issue here is
reproducible, I'm sure you'll be able to instrument the code such that
you can get further information out of the system (and that's not
necessarily just stack contents - presumably you'd want to track
other state or state changes in some kind of static buffer, which
you'd then also want to dump out at the point of the crash).
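
As a rough illustration of the "static buffer" idea, the sketch below models a fixed-size trace ring that keeps only the most recent events and is dumped at the point of failure.  It is a Python model of the concept only: real instrumentation would live in the hypervisor's C or assembly code, and the class name, event tags and values here are all hypothetical.

```python
from collections import deque


class TraceBuffer:
    """Fixed-capacity event log; the oldest entries are overwritten."""

    def __init__(self, capacity=8):
        self._events = deque(maxlen=capacity)

    def record(self, tag, value):
        # Cheap enough to call on hot paths: one append, no allocation growth.
        self._events.append((tag, value))

    def dump(self):
        # Called from the crash path: oldest surviving event first.
        return list(self._events)


trace = TraceBuffer(capacity=3)
trace.record("irq_entry", 0x246)    # hypothetical event tags and values
trace.record("irq_exit", 0x246)
trace.record("sti", 0x246)
trace.record("bounce_frame", 0x93)

for tag, value in trace.dump():
    print(f"{tag}: {value:#x}")
```

With a capacity of three, the first event has already been overwritten by the time of the dump, which is the usual trade-off of such a buffer: bounded cost in exchange for a bounded history.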

Jan

* Re: Xen-4.3 - curious crash
  2014-01-29  8:43 ` Jan Beulich
@ 2014-01-29  8:51   ` Ian Campbell
  2014-01-29  9:01     ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2014-01-29  8:51 UTC
  To: Jan Beulich; +Cc: Andrew Cooper, Xen-devel List

On Wed, 2014-01-29 at 08:43 +0000, Jan Beulich wrote:
> An interrupt not properly restoring EFLAGS.IF (or actually one not
> properly restoring all of EFLAGS) would be very odd. About as odd
> as a cosmic radiation induced bit flip resulting in some other
> misbehavior.

Isn't it also the effect of a missing spin_unlock(_irqrestore)? Or does
something else catch that first?

Ian.

* Re: Xen-4.3 - curious crash
  2014-01-29  8:51   ` Ian Campbell
@ 2014-01-29  9:01     ` Jan Beulich
  2014-01-29  9:25       ` Ian Campbell
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2014-01-29  9:01 UTC
  To: Ian Campbell; +Cc: Andrew Cooper, Xen-devel List

>>> On 29.01.14 at 09:51, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2014-01-29 at 08:43 +0000, Jan Beulich wrote:
>> An interrupt not properly restoring EFLAGS.IF (or actually one not
>> properly restoring all of EFLAGS) would be very odd. About as odd
>> as a cosmic radiation induced bit flip resulting in some other
>> misbehavior.
> 
> Isn't it also the affect of a missing spin_unlock(_irqrestore)? Or does
> something else catch that first?

A missing plain spin_unlock() wouldn't have any effect on IF. And
a missing spin_unlock_irqrestore() would have an effect on IF in
the interrupt handler, but with the return being through an IRET
something would need to actively modify the flags on the stack
that IRET uses in order to affect the interrupted code's EFLAGS.
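
The argument above can be put into a toy model.  This sketch assumes delivery through an interrupt gate (IF cleared on entry) and an IRET that reloads EFLAGS from the image saved on the stack.  It is not Xen code, and every name in it is invented for illustration.

```python
IF = 1 << 9  # EFLAGS.IF


def run_interrupt(interrupted_flags, handler):
    """Model one interrupt: push EFLAGS, run the handler, IRET."""
    frame = {"eflags": interrupted_flags}   # image pushed on entry
    live = interrupted_flags & ~IF          # interrupt gate clears IF
    handler(frame, live)                    # whatever the handler does to
                                            # its own live flags is discarded
    return frame["eflags"]                  # IRET reloads from the image


def handler_missing_irqrestore(frame, live):
    # Leaves interrupts disabled in the handler, never touches the frame.
    live &= ~IF


def handler_corrupting_frame(frame, live):
    # The only way to change the interrupted code's IF: edit the image.
    frame["eflags"] &= ~IF


before = 0x246  # IF set, as after 'sti'
print(bool(run_interrupt(before, handler_missing_irqrestore) & IF))  # True
print(bool(run_interrupt(before, handler_corrupting_frame) & IF))    # False
```

In this model a handler that merely forgets to re-enable interrupts returns with IF still set in the interrupted context, and only a handler that overwrites the saved image yields the IF-clear state seen in the crash.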

Jan

* Re: Xen-4.3 - curious crash
  2014-01-29  9:01     ` Jan Beulich
@ 2014-01-29  9:25       ` Ian Campbell
  2014-01-29  9:42         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2014-01-29  9:25 UTC
  To: Jan Beulich; +Cc: Andrew Cooper, Xen-devel List

On Wed, 2014-01-29 at 09:01 +0000, Jan Beulich wrote:
> >>> On 29.01.14 at 09:51, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2014-01-29 at 08:43 +0000, Jan Beulich wrote:
> >> An interrupt not properly restoring EFLAGS.IF (or actually one not
> >> properly restoring all of EFLAGS) would be very odd. About as odd
> >> as a cosmic radiation induced bit flip resulting in some other
> >> misbehavior.
> > 
> > Isn't it also the affect of a missing spin_unlock(_irqrestore)? Or does
> > something else catch that first?
> 
> A missing plain spin_unlock() wouldn't have any effect of IF. And
> a missing spin_unlock_irqrestore() would have an effect on IF in
> the interrupt handler, but with the return being through an IRET
> something would need to actively modify the flags on the stack
> that IRET uses in order to affect the interrupted code's EFLAGS.

Ah, I mistakenly thought that this issue was happening on that return
path (i.e. before the IRET).

Ian.

* Re: Xen-4.3 - curious crash
  2014-01-29  9:25       ` Ian Campbell
@ 2014-01-29  9:42         ` Jan Beulich
  2014-01-29 10:30           ` Andrew Cooper
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2014-01-29  9:42 UTC
  To: Ian Campbell; +Cc: Andrew Cooper, Xen-devel List

>>> On 29.01.14 at 10:25, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2014-01-29 at 09:01 +0000, Jan Beulich wrote:
>> >>> On 29.01.14 at 09:51, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > On Wed, 2014-01-29 at 08:43 +0000, Jan Beulich wrote:
>> >> An interrupt not properly restoring EFLAGS.IF (or actually one not
>> >> properly restoring all of EFLAGS) would be very odd. About as odd
>> >> as a cosmic radiation induced bit flip resulting in some other
>> >> misbehavior.
>> > 
>> > Isn't it also the affect of a missing spin_unlock(_irqrestore)? Or does
>> > something else catch that first?
>> 
>> A missing plain spin_unlock() wouldn't have any effect of IF. And
>> a missing spin_unlock_irqrestore() would have an effect on IF in
>> the interrupt handler, but with the return being through an IRET
>> something would need to actively modify the flags on the stack
>> that IRET uses in order to affect the interrupted code's EFLAGS.
> 
> Ah, I mistakenly thought that this issue was happening on that return
> path (i.e. before the IRET).

Right - the problem is that we're having two return paths to
consider here: The outer one (wanting to return to the guest)
explicitly used STI a few instructions before the crash. And it
would need to be an inner one (hardware interrupt) that would
have to fail to restore IF properly, and for that to happen the
EFLAGS image used by that exit path's IRET would need to get
corrupted.

Jan

* Re: Xen-4.3 - curious crash
  2014-01-29  9:42         ` Jan Beulich
@ 2014-01-29 10:30           ` Andrew Cooper
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2014-01-29 10:30 UTC
  To: Jan Beulich; +Cc: Ian Campbell, Xen-devel List

On 29/01/14 09:42, Jan Beulich wrote:
>>>> On 29.01.14 at 10:25, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> On Wed, 2014-01-29 at 09:01 +0000, Jan Beulich wrote:
>>>>>> On 29.01.14 at 09:51, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>>> On Wed, 2014-01-29 at 08:43 +0000, Jan Beulich wrote:
>>>>> An interrupt not properly restoring EFLAGS.IF (or actually one not
>>>>> properly restoring all of EFLAGS) would be very odd. About as odd
>>>>> as a cosmic radiation induced bit flip resulting in some other
>>>>> misbehavior.
>>>> Isn't it also the affect of a missing spin_unlock(_irqrestore)? Or does
>>>> something else catch that first?
>>> A missing plain spin_unlock() wouldn't have any effect of IF. And
>>> a missing spin_unlock_irqrestore() would have an effect on IF in
>>> the interrupt handler, but with the return being through an IRET
>>> something would need to actively modify the flags on the stack
>>> that IRET uses in order to affect the interrupted code's EFLAGS.
>> Ah, I mistakenly thought that this issue was happening on that return
>> path (i.e. before the IRET).
> Right - the problem is that we're having two return paths to
> consider here: The outer one (wanting to return to the guest)
> explicitly used STI a few instructions before the crash. And it
> would need to be an inner one (hardware interrupt) that would
> have to fail to restore IF properly, and for that to happen the
> EFLAGS image used by that exit path's IRET would need to get
> corrupted.
>
> Jan
>

This issue has been seen exactly once, on an otherwise perfectly stable
server, which has been running stably since.  I certainly have no
evidence to rule out cosmic radiation.

I suppose all that can be done at this point is to wait and see whether
it reoccurs.

~Andrew

Thread overview: 7+ messages
2014-01-28 20:25 Xen-4.3 - curious crash Andrew Cooper
2014-01-29  8:43 ` Jan Beulich
2014-01-29  8:51   ` Ian Campbell
2014-01-29  9:01     ` Jan Beulich
2014-01-29  9:25       ` Ian Campbell
2014-01-29  9:42         ` Jan Beulich
2014-01-29 10:30           ` Andrew Cooper
