From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Cooper
Subject: Re: HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2
Date: Thu, 19 Nov 2015 10:38:20 +0000
Message-ID: <564DA69C.9070809@citrix.com>
References: <5644A248.1060505@web2web.at> <5644C1CD.3020202@citrix.com>
 <56451A2B.9090706@web2web.at> <56459E5F02000078000B4944@prv-mh.provo.novell.com>
 <5645B6BC.6030603@citrix.com> <56467D44.5040205@web2web.at>
 <56479A6B.6080102@citrix.com> <5647CE57.50209@web2web.at>
 <5648E727.6080204@cardoe.com> <56492BDF.5030208@web2web.at>
 <20151116153107.GD13720@char.us.oracle.com> <564A2B91.2090501@web2web.at>
 <564A6064.4080800@citrix.com> <564A626E.6010305@web2web.at>
 <564D00E0.8080004@web2web.at> <564D0720.9020506@citrix.com>
 <564DB17C02000078000B6B19@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <564DB17C02000078000B6B19@prv-mh.provo.novell.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich, Atom2
Cc: Doug Goldstein, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 19/11/15 10:24, Jan Beulich wrote:
>>>> On 19.11.15 at 00:17, wrote:
>> The disassembly of do_IRQ now looks like a plausible function, but the
>> consistently faulting address has no plausible way of generating a
>> double fault. I suspect therefore that something has caused memory
>> corruption in Xen .text section.
>
> Dump of assembler code for function do_IRQ:
>    0xffff82d080176577 <+0>:     push   %rbp
>    0xffff82d080176578 <+1>:     mov    %rsp,%rbp
>    0xffff82d08017657b <+4>:     push   %r15
>    0xffff82d08017657d <+6>:     push   %r14
>    0xffff82d08017657f <+8>:     push   %r13
>    0xffff82d080176581 <+10>:    push   %r12
>    0xffff82d080176583 <+12>:    push   %rbx
>    0xffff82d080176584 <+13>:    lea    -0x1058(%rsp),%rsp
>    0xffff82d08017658c <+21>:    orq    $0x0,(%rsp)
>    0xffff82d080176591 <+26>:    lea    0x1020(%rsp),%rsp
>
> The orq surely has potential for causing a double fault, if %rsp is
> near the stack limit. The two LEAs look suspect, presumably a
> result of some non-standard option passed to gcc. Removing that
> option might already be a step forward.

Actually yes - that is a huge quantity of stack usage. (The actual
behaviour looks very suspect - it appears to be completely pointless).

The #DF handler reports that %rsp in the exception frame is within
range. Having said that,

(XEN) [    2.788209] rbp: ffff83080ca8ed78   rsp: ffff83080ca8dcf8   r8: ffff83080ca9d558
...
(XEN) [    2.837474] Valid stack range: ffff83080ca8e000-ffff83080ca90000, sp=ffff83080ca8dcf8, tss.esp0=ffff83080ca8ffc0
(XEN) [    2.848969] No stack overflow detected. Skipping stack trace.

In this case, the stack pointer *is* out of range, and has hit the guard
page.

This means:
1) There is some bug in the stack overflow detection in the #DF handler.
2) Whatever options Gentoo compiles Xen with are sufficient to overflow
   the 8K hypervisor stack.

~Andrew
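
For anyone wanting to double-check the arithmetic behind the two points
above, here is a minimal sketch in Python. It is not Xen's overflow
check: the helper names (sp_in_range, bytes_below) are made up for
illustration, and treating the upper bound of the range as exclusive is
an assumption. The constants are copied from the (XEN) log lines and
from the two LEA displacements in the quoted disassembly.

```python
# Values copied from the (XEN) log lines quoted in this mail.
STACK_LOW = 0xffff83080ca8e000
STACK_HIGH = 0xffff83080ca90000
FAULT_SP = 0xffff83080ca8dcf8

# Displacements of the two LEAs in the quoted do_IRQ prologue.
PROBE_DROP = 0x1058      # lea -0x1058(%rsp),%rsp
PROBE_RESTORE = 0x1020   # lea  0x1020(%rsp),%rsp


def sp_in_range(sp, low, high):
    """Hypothetical helper (not Xen's code): True if low <= sp < high."""
    return low <= sp < high


def bytes_below(sp, low):
    """How far sp sits below the bottom of the range (0 if it does not)."""
    return max(low - sp, 0)


# Size of the reported valid range, and where the faulting sp lies.
stack_size = STACK_HIGH - STACK_LOW
in_range = sp_in_range(FAULT_SP, STACK_LOW, STACK_HIGH)
overshoot = bytes_below(FAULT_SP, STACK_LOW)

# The orq touches memory PROBE_DROP bytes below the post-push %rsp,
# while the frame left behind after the second LEA is much smaller.
net_frame = PROBE_DROP - PROBE_RESTORE

print(f"stack size: {stack_size:#x}")
print(f"sp in range: {in_range}")
print(f"sp below range by: {overshoot:#x}")
print(f"probe depth: {PROBE_DROP:#x}, net frame: {net_frame:#x}")
```

With these values the range is 0x2000 bytes (8K), sp sits 0x308 bytes
below its bottom, and the prologue touches memory 0x1058 bytes below
%rsp while keeping a net frame of only 0x38 bytes.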