From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: HVM domains crash after upgrade from XEN 4.5.1 to
 4.5.2
Date: Thu, 19 Nov 2015 10:38:20 +0000
Message-ID: <564DA69C.9070809@citrix.com>
References: <5644A248.1060505@web2web.at> <5644C1CD.3020202@citrix.com>
	<56451A2B.9090706@web2web.at>
	<56459E5F02000078000B4944@prv-mh.provo.novell.com>
	<5645B6BC.6030603@citrix.com> <56467D44.5040205@web2web.at>
	<56479A6B.6080102@citrix.com> <5647CE57.50209@web2web.at>
	<5648E727.6080204@cardoe.com> <56492BDF.5030208@web2web.at>
	<20151116153107.GD13720@char.us.oracle.com>	<564A2B91.2090501@web2web.at>
	<564A6064.4080800@citrix.com> <564A626E.6010305@web2web.at>
	<564D00E0.8080004@web2web.at> <564D0720.9020506@citrix.com>
	<564DB17C02000078000B6B19@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <564DB17C02000078000B6B19@prv-mh.provo.novell.com>
To: Jan Beulich <JBeulich@suse.com>, Atom2 <ariel.atom2@web2web.at>
Cc: Doug Goldstein <cardoe@cardoe.com>, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 19/11/15 10:24, Jan Beulich wrote:
>>>> On 19.11.15 at 00:17, <andrew.cooper3@citrix.com> wrote:
>> The disassembly of do_IRQ now looks like a plausible function, but the
>> consistently faulting address has no plausible way of generating a
>> double fault.  I suspect therefore that something has caused memory
>> corruption in Xen .text section.
> Dump of assembler code for function do_IRQ:
>    0xffff82d080176577 <+0>:	push   %rbp
>    0xffff82d080176578 <+1>:	mov    %rsp,%rbp
>    0xffff82d08017657b <+4>:	push   %r15
>    0xffff82d08017657d <+6>:	push   %r14
>    0xffff82d08017657f <+8>:	push   %r13
>    0xffff82d080176581 <+10>:	push   %r12
>    0xffff82d080176583 <+12>:	push   %rbx
>    0xffff82d080176584 <+13>:	lea    -0x1058(%rsp),%rsp
>    0xffff82d08017658c <+21>:	orq    $0x0,(%rsp)
>    0xffff82d080176591 <+26>:	lea    0x1020(%rsp),%rsp
>
> The orq surely has potential for causing a double fault, if %rsp is
> near the stack limit. The two LEAs look suspect, presumably a
> result of some non-standard option passed to gcc. Removing that
> option might already be a step forward.

Actually yes - that is a huge quantity of stack usage.

(The actual behaviour looks very suspect - it appears to be completely
pointless).
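
For reference, the arithmetic on the two LEAs can be sketched as below.  The
displacements are copied from the disassembly above; the function names are
made up for illustration, and reading the sequence as a probe of the stack
ahead of the real frame is only an assumption, since the gcc option
responsible has not been identified in this thread.

```python
# Displacements copied from the two LEAs in the quoted do_IRQ prologue.
PROBE_DROP = 0x1058     # lea -0x1058(%rsp),%rsp
PROBE_RESTORE = 0x1020  # lea 0x1020(%rsp),%rsp


def probe_depth_bytes(drop: int) -> int:
    """Distance below the pre-LEA %rsp at which the orq touches memory."""
    return drop


def net_frame_bytes(drop: int, restore: int) -> int:
    """Stack still reserved once both LEAs have executed."""
    return drop - restore


if __name__ == "__main__":
    print(f"orq touches memory {probe_depth_bytes(PROBE_DROP)} bytes below %rsp")
    print(f"net adjustment: {net_frame_bytes(PROBE_DROP, PROBE_RESTORE)} bytes")
```

So the orq writes a little over 4K below the stack pointer, even though the
net adjustment left behind by the two LEAs is far smaller.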

The #DF handler reports that %rsp in the exception frame is within
range.  Having said that,

(XEN) [    2.788209] rbp: ffff83080ca8ed78   rsp: ffff83080ca8dcf8   r8:  ffff83080ca9d558
...
(XEN) [    2.837474] Valid stack range: ffff83080ca8e000-ffff83080ca90000, sp=ffff83080ca8dcf8, tss.esp0=ffff83080ca8ffc0
(XEN) [    2.848969] No stack overflow detected. Skipping stack trace.

In this case, the stack pointer *is* out of range: sp=ffff83080ca8dcf8 lies
0x308 bytes below the lower bound ffff83080ca8e000, so it has hit the guard
page.
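
A quick way to double-check the numbers from the #DF report is sketched
below.  All addresses are copied from the log lines above; the helper names
are hypothetical, and treating the upper bound as exclusive is an assumption
rather than a statement about how Xen's own check is written.  The last
helper replays the do_IRQ prologue (five pushes after %rbp is set up, then
the first LEA) to see where %rsp would sit when the orq executes.

```python
# Addresses copied from the #DF report quoted above.
STACK_LOW = 0xFFFF83080CA8E000
STACK_HIGH = 0xFFFF83080CA90000
REPORTED_SP = 0xFFFF83080CA8DCF8
REPORTED_BP = 0xFFFF83080CA8ED78


def in_stack_range(sp: int, low: int, high: int) -> bool:
    """True if sp lies in [low, high); exclusive upper bound is an assumption."""
    return low <= sp < high


def bytes_below(sp: int, low: int) -> int:
    """How far sp has dropped below the lower bound (0 if it has not)."""
    return max(0, low - sp)


def sp_at_probe(rbp: int) -> int:
    """%rsp at the orq: %rbp minus five 8-byte pushes minus the 0x1058 LEA."""
    return rbp - 5 * 8 - 0x1058


if __name__ == "__main__":
    print("stack size:", hex(STACK_HIGH - STACK_LOW))
    print("sp in range:", in_stack_range(REPORTED_SP, STACK_LOW, STACK_HIGH))
    print("bytes below lower bound:", hex(bytes_below(REPORTED_SP, STACK_LOW)))
    print("sp at orq matches report:", sp_at_probe(REPORTED_BP) == REPORTED_SP)
```

With the values from the log, the reported rsp equals rbp minus 0x1080, which
is exactly where the orq in the do_IRQ prologue would touch memory.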

This means:
1) There is some bug in the stack overflow detection in the #DF handler.
2) Whatever options Gentoo compiles Xen with are sufficient to overflow
the 8K hypervisor stack.

~Andrew