From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: xenstored crashes with SIGSEGV
Date: Tue, 16 Dec 2014 09:51:14 +0000
Message-ID: <1418723474.16425.193.camel@citrix.com>
References: <546461A2.2070908@univention.de>
	<1415869951.31613.26.camel@citrix.com> <548B1472.5080302@univention.de>
	<1418401932.16425.34.camel@citrix.com> <548B1BA8.3090504@univention.de>
	<1418403387.16425.38.camel@citrix.com> <548B23FA.6070108@univention.de>
	<1418407116.16425.53.camel@citrix.com>
	<1418649458.16425.108.camel@citrix.com> <548EEDF5.20808@univention.de>
	<1418655014.16425.138.camel@citrix.com>
	<1418665524.16425.171.camel@citrix.com>
	<548F60BF.4020901@univention.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <548F60BF.4020901@univention.de>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Philipp Hahn <hahn@univention.de>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>, Xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 18:45, Ian Campbell wrote:
> > On Mon, 2014-12-15 at 14:50 +0000, Ian Campbell wrote:
> >> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> >>> I just noticed something strange:
> >>>
> >>>> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> >>>> 0xff00000000 out of bounds>, hash_size=0,
> >>>>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> ...
> > I'm reasonably convinced now that this is just a weird artefact of
> > running gdb on an optimised binary, probably a shortcoming in the debug
> > info leading to gdb getting confused.
> > 
> > Unfortunately this also calls into doubt the parameter to talloc_free,
> > perhaps in that context 0xff0000000 is a similar artefact.
> > 
> > Please can you print the entire contents of tdb in the second frame
> > ("print *tdb" ought to do it). I'm curious whether it is all sane or
> > not.
> 
> (gdb) print *tdb
> $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
> 16711680,
>   locked = 0xff0000000000,

So it really does seem to be 0xff0000000000 in memory.

> flags = 0,
> travlocks = {
>     next = 0xff0000, off = 0, hash = 65280}, next = 0xff0000,
>   device = 280375465082880, inode = 16711680, log_fn = 0x4093b0
> <null_log_fn>,
>   hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}

And here we can see tdb->{flags,open_flags} == 0 and 2, contrary to what
the stack trace says we were called with, which was nonsense. Since 0
and 2 are sensible and correspond to what the caller passes I think the
stack trace is just confused.

> (gdb) info registers
> rax            0x0      0
> rbx            0x16bff70        23854960
> rcx            0xffffffffffffffff       -1
> rdx            0x40ecd0 4254928
> rsi            0x0      0
> rdi            0xff0000000000   280375465082880

And here it is in the registers.

> rbp            0x7fcaed6c96a8   0x7fcaed6c96a8
> rsp            0x7fff9dc86330   0x7fff9dc86330
> r8             0x7fcaece54c08   140509534571528
> r9             0xff00000000000000       -72057594037927936
> r10            0x7fcaed08c14c   140509536895308
> r11            0x246    582
> r12            0xd      13
> r13            0xff0000000000   280375465082880

And again.

> r14            0x4093b0 4232112
> r15            0x167d620        23582240
> rip            0x4075c4 0x4075c4 <talloc_chunk_from_ptr+4>

This must be the faulting address.

> eflags         0x10206  [ PF IF RF ]
> cs             0x33     51
> ss             0x2b     43
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> fctrl          0x0      0
> fstat          0x0      0
> ftag           0x0      0
> fiseg          0x0      0
> fioff          0x0      0
> foseg          0x0      0
> fooff          0x0      0
> fop            0x0      0
> mxcsr          0x0      [ ]
> 
> (gdb) disassemble
> Dump of assembler code for function talloc_chunk_from_ptr:
> 0x00000000004075c0 <talloc_chunk_from_ptr+0>:   sub    $0x8,%rsp
> 0x00000000004075c4 <talloc_chunk_from_ptr+4>:   mov    -0x8(%rdi),%edx

This is the line corresponding to %rip above which is doing a read via %
rdi, which is 0xff0000000000.

It's reading tc->flags. It's been optimised, tc = pp - SIZE, so it is
loading *(pp-SIZE+offsetof(flags)), which is pp-8 (flags is the last
field in the struct).

So rdi contains pp which == the ptr given as an argument to the
function, so ptr was bogus.

So it seems we really do have tdb->locked containing 0xff0000000000.

This is only allocated in one place which is:
	tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type,
					tdb->header.hash_size+1);
midway through tdb_open_ex. It might be worth inserting a check+log for
this returning  0xff, 0xff00, 0xff0000 ... 0xff0000000000 etc.

> 0x00000000004075c7 <talloc_chunk_from_ptr+7>:   lea    -0x50(%rdi),%rax

This is actually calculating tc, ready for return upon success.

> 0x00000000004075cb <talloc_chunk_from_ptr+11>:  mov    %edx,%ecx
> 0x00000000004075cd <talloc_chunk_from_ptr+13>:  and    $0xfffffffffffffff0,%ecx
> 0x00000000004075d0 <talloc_chunk_from_ptr+16>:  cmp    $0xe814ec70,%ecx
> 0x00000000004075d6 <talloc_chunk_from_ptr+22>:  jne    0x4075e2 <talloc_chunk_from_ptr+34>

(tc->flags & ~0xF) != TALLOC_MAGIC

> 0x00000000004075d8 <talloc_chunk_from_ptr+24>:  and    $0x1,%edx
> 0x00000000004075db <talloc_chunk_from_ptr+27>:  jne    0x4075e2 <talloc_chunk_from_ptr+34>

tc->flags & TALLOC_FLAG_FREE

> 0x00000000004075dd <talloc_chunk_from_ptr+29>:  add    $0x8,%rsp
> 0x00000000004075e1 <talloc_chunk_from_ptr+33>:  retq

Success, return.

> 0x00000000004075e2 <talloc_chunk_from_ptr+34>:  nopw   0x0(%rax,%rax,1)
> 0x00000000004075e8 <talloc_chunk_from_ptr+40>:  callq  0x401b98 <abort@plt>

The two TALLOC_ABORTS both end up here if the checks above fail.

> > Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
> > This ought to be the actual faulting address, which ought to give a hint
> > on how much we can trust the parameters in the stack trace.
> 
> Hmm, my gdb refused to access $_siginfo:
> (gdb) show convenience
> $_siginfo = Unable to read siginfo

That's ok, I think I've convinced myself above what the crash is.

Ian.