From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ian Campbell Subject: Re: xenstored crashes with SIGSEGV Date: Tue, 16 Dec 2014 09:51:14 +0000 Message-ID: <1418723474.16425.193.camel@citrix.com> References: <546461A2.2070908@univention.de> <1415869951.31613.26.camel@citrix.com> <548B1472.5080302@univention.de> <1418401932.16425.34.camel@citrix.com> <548B1BA8.3090504@univention.de> <1418403387.16425.38.camel@citrix.com> <548B23FA.6070108@univention.de> <1418407116.16425.53.camel@citrix.com> <1418649458.16425.108.camel@citrix.com> <548EEDF5.20808@univention.de> <1418655014.16425.138.camel@citrix.com> <1418665524.16425.171.camel@citrix.com> <548F60BF.4020901@univention.de> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <548F60BF.4020901@univention.de> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Philipp Hahn Cc: Ian Jackson , Xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote: > Hello Ian, > > On 15.12.2014 18:45, Ian Campbell wrote: > > On Mon, 2014-12-15 at 14:50 +0000, Ian Campbell wrote: > >> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote: > >>> I just noticed something strange: > >>> > >>>> #3 0x000000000040a684 in tdb_open (name=0xff00000000
>>>> 0xff00000000 out of bounds>, hash_size=0, > >>>> tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773 > ... > > I'm reasonably convinced now that this is just a weird artefact of > > running gdb on an optimised binary, probably a shortcoming in the debug > > info leading to gdb getting confused. > > > > Unfortunately this also calls into doubt the parameter to talloc_free, > > perhaps in that context 0xff0000000 is a similar artefact. > > > > Please can you print the entire contents of tdb in the second frame > > ("print *tdb" ought to do it). I'm curious whether it is all sane or > > not. > > (gdb) print *tdb > $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only = > 16711680, > locked = 0xff0000000000, So it really does seem to be 0xff0000000000 in memory. > flags = 0, > travlocks = { > next = 0xff0000, off = 0, hash = 65280}, next = 0xff0000, > device = 280375465082880, inode = 16711680, log_fn = 0x4093b0 > , > hash_fn = 0x4092f0 , open_flags = 2} And here we can see tdb->{flags,open_flags} == 0 and 2, contrary to what the stack trace says we were called with, which was nonsense. Since 0 and 2 are sensible and correspond to what the caller passes I think the stack trace is just confused. > (gdb) info registers > rax 0x0 0 > rbx 0x16bff70 23854960 > rcx 0xffffffffffffffff -1 > rdx 0x40ecd0 4254928 > rsi 0x0 0 > rdi 0xff0000000000 280375465082880 And here it is in the registers. > rbp 0x7fcaed6c96a8 0x7fcaed6c96a8 > rsp 0x7fff9dc86330 0x7fff9dc86330 > r8 0x7fcaece54c08 140509534571528 > r9 0xff00000000000000 -72057594037927936 > r10 0x7fcaed08c14c 140509536895308 > r11 0x246 582 > r12 0xd 13 > r13 0xff0000000000 280375465082880 And again. > r14 0x4093b0 4232112 > r15 0x167d620 23582240 > rip 0x4075c4 0x4075c4 This must be the faulting address. > eflags 0x10206 [ PF IF RF ] > cs 0x33 51 > ss 0x2b 43 > ds 0x0 0 > es 0x0 0 > fs 0x0 0 > gs 0x0 0 > fctrl 0x0 0 > fstat 0x0 0 > ftag 0x0 0 > fiseg 0x0 0 > fioff 0x0 0 > foseg 0x0 0 > fooff 0x0 0 > fop 0x0 0 > mxcsr 0x0 [ ] > > (gdb) disassemble > Dump of assembler code for function talloc_chunk_from_ptr: > 0x00000000004075c0 : sub $0x8,%rsp > 0x00000000004075c4 : mov -0x8(%rdi),%edx This is the line corresponding to %rip above which is doing a read via % rdi, which is 0xff0000000000. It's reading tc->flags. It's been optimised, tc = pp - SIZE, so it is loading *(pp-SIZE+offsetof(flags)), which is pp-8 (flags is the last field in the struct). So rdi contains pp which == the ptr given as an argument to the function, so ptr was bogus. So it seems we really do have tdb->locked containing 0xff0000000000. This is only allocated in one place which is: tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type, tdb->header.hash_size+1); midway through tdb_open_ex. It might be worth inserting a check+log for this returning 0xff, 0xff00, 0xff0000 ... 0xff0000000000 etc. > 0x00000000004075c7 : lea -0x50(%rdi),%rax This is actually calculating tc, ready for return upon success. > 0x00000000004075cb : mov %edx,%ecx > 0x00000000004075cd : and $0xfffffffffffffff0,%ecx > 0x00000000004075d0 : cmp $0xe814ec70,%ecx > 0x00000000004075d6 : jne 0x4075e2 (tc->flags & ~0xF) != TALLOC_MAGIC > 0x00000000004075d8 : and $0x1,%edx > 0x00000000004075db : jne 0x4075e2 tc->flags & TALLOC_FLAG_FREE > 0x00000000004075dd : add $0x8,%rsp > 0x00000000004075e1 : retq Success, return. > 0x00000000004075e2 : nopw 0x0(%rax,%rax,1) > 0x00000000004075e8 : callq 0x401b98 The two TALLOC_ABORTS both end up here if the checks above fail. > > Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0). > > This ought to be the actual faulting address, which ought to give a hint > > on how much we can trust the parameters in the stack trace. > > Hmm, my gdb refused to access $_siginfo: > (gdb) show convenience > $_siginfo = Unable to read siginfo That's ok, I think I've convinced myself above what the crash is. Ian.