From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: xenstored crashes with SIGSEGV
Date: Mon, 15 Dec 2014 14:50:14 +0000
Message-ID: <1418655014.16425.138.camel@citrix.com>
References: <546461A2.2070908@univention.de>
	<1415869951.31613.26.camel@citrix.com> <548B1472.5080302@univention.de>
	<1418401932.16425.34.camel@citrix.com> <548B1BA8.3090504@univention.de>
	<1418403387.16425.38.camel@citrix.com> <548B23FA.6070108@univention.de>
	<1418407116.16425.53.camel@citrix.com>
	<1418649458.16425.108.camel@citrix.com> <548EEDF5.20808@univention.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <548EEDF5.20808@univention.de>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Philipp Hahn <hahn@univention.de>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>, Xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 14:17, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:58 +0000, Ian Campbell wrote:
> >>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> >>> On 12.12.2014 17:56, Ian Campbell wrote:
> >>>> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> >>>>> On 12.12.2014 17:32, Ian Campbell wrote:
> >>>>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> ...
> >>> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> >>>
> >>> (gdb) bt full
> >>> #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
> >>>         tc = <value optimized out>
> >>> #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
> >>>         tc = <value optimized out>
> >>> #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
> >>> "/var/lib/xenstored/tdb.0x1935bb0",
> 
> I just noticed something strange:
> 
> > #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> > 0xff00000000 out of bounds>, hash_size=0,
> >     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> > #4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> > "/var/lib/xenstored/tdb.0x1935bb0")
> 
> Why does gdb-7.0.1 print "name=0xff000000" here for frame 3, but for
> frame 2 and 4 the pointers are correct again?
> Verifying the values with an explicit "print" shows them as correct.

I had just noticed that and was wondering about the same thing. I'm
starting to worry that 0xff00000000 might just be a gdb artifact, similar
to <value optimized out>, but infinitely more misleading.

I've also noticed in
https://forge.univention.org/bugzilla/show_bug.cgi?id=35104 that the
constant can be either 0xff000000, 0xff00000000 or 0xff0000000000 (6, 8
or 10 zeroes).
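
As a quick sanity check on those three constants, here is a minimal C
sketch. The helper names are hypothetical and are not taken from
xenstored, tdb or gdb. It only shows that each constant has the same
shape: a single 0xff byte sitting at byte offset 3, 4 or 5 of a 64-bit
word, with every other byte zero.

```c
#include <stdint.h>

/* Hypothetical helper: a 64-bit value with a single 0xff byte at the
 * given byte offset (0 = least significant byte), all other bytes 0. */
static uint64_t ff_at_byte(unsigned int byte_offset)
{
    return (uint64_t)0xff << (8 * byte_offset);
}

/* Hypothetical helper: if value consists of exactly one 0xff byte and
 * is zero elsewhere, return that byte's offset; otherwise return -1. */
static int lone_ff_offset(uint64_t value)
{
    for (unsigned int i = 0; i < 8; i++)
        if (value == ff_at_byte(i))
            return (int)i;
    return -1;
}
```

With these, lone_ff_offset() of the three constants above gives 3, 4
and 5 respectively, while a plausible heap pointer such as 0x1941fb0
gives -1. That says nothing about where the byte comes from, only that
the three values differ by where one 0xff byte lands.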

> >>>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> >>> optimized out>, mode=<value optimized out>,
> >>>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
> >>> tdb.c:1958
> > 
> > Please can you confirm what is at line 1958 of your copy of tdb.c. I
> > think it will be tdb->locked, but I'd like to be sure.
> 
> Yes, that's the line:
> # sed -ne 1958p tdb.c
>         SAFE_FREE(tdb->locked);

Good, thanks.

> > You are running a 64-bit dom0, correct?
> 
> yes: x86_64

Thanks for confirming. I'm resurrecting the 64-bit root partition on my
test box (which it turns out was still Debian Squeeze!)

> 
> > I've only just noticed that
> > 0xff00000000 is >32bits. My testing so far was 32-bit, I don't think it
> > should matter wrt use of uninitialised data etc.
> > 
> > I can't help feeling that 0xff00000000 must be some sort of magic
> > sentinel value to someone. I can't figure out what though.
> 
> 0xff is too much for bit flip errors, and two crashes on different
> machines in the same location pretty much rule out any HW error for me.
> 
> My 2nd idea was that someone decremented 0 one too many, but then that
> would have to be an 8 bit value - reading the code I didn't see anything
> like that.

I was wondering if it was an overflow or sign-extension thing, but it
doesn't seem likely, not enough high bits set for one thing.
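
To make the two guesses concrete, a minimal C sketch follows. The
helper names are hypothetical and nothing here comes from the xenstored
sources. It assumes the usual two's complement behaviour when
converting an out-of-range unsigned value to int32_t.

```c
#include <stdint.h>

/* Sign-extend a 32-bit value to 64 bits, as happens when a negative
 * int is converted to a wider signed type. Assumes the common two's
 * complement conversion from uint32_t to int32_t. */
static uint64_t sign_extend_32(uint32_t v)
{
    return (uint64_t)(int64_t)(int32_t)v;
}

/* Decrement an 8-bit counter; 0 wraps around to 0xff. */
static uint8_t dec8(uint8_t v)
{
    return (uint8_t)(v - 1);
}
```

sign_extend_32(0xff000000) yields 0xffffffffff000000, with all of the
upper 32 bits set, whereas 0xff00000000 has only bits 32 to 39 set.
dec8(0) does yield 0xff, but in the lowest byte, so it would still
have to be stored at the right byte offset to produce the value seen
in the trace.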

> One more thing we noticed: /var/lib/xenstored/ contained the tdb file
> and two bit-identical copies after the crash, so I would read that as two
> transactions being in progress at the time of the crash. Might be that
> this is important.

It's certainly worth noting, thanks.

Ian.