From: Dave Chinner <david@fromorbit.com>
To: "Arkadiusz Bubała" <arkadiusz.bubala@open-e.com>
Cc: xfs@oss.sgi.com
Subject: Re: [BUG] Call trace during snapshot start/stop sequence
Date: Thu, 28 Nov 2013 10:06:08 +1100 [thread overview]
Message-ID: <20131127230608.GJ10988@dastard> (raw)
In-Reply-To: <20131127221923.GI10988@dastard>
On Thu, Nov 28, 2013 at 09:19:23AM +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote:
> > Hello,
> >
> > we're running a test script that starts and stops
> > snapshots in a loop while overfilling them. After a few days of
> > running, the system hangs. We've captured the following call trace:
> >
> > [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06
> > ("xlog_iodone") error 5 buf count 1024
> > [116649.947247] XFS (dm-42): Log I/O Error Detected. Shutting down
> > filesystem
> > [116650.073881] XFS (dm-42): Please umount the filesystem and rectify
> > the problem(s)
>
> So, an EIO error on a log IO, resulting in a shutdown....
>
> > [116650.207186] BUG: unable to handle kernel paging request at
> > 00000000000010a8
>
> That's an interesting offset - quite large for a null pointer
> dereference.
>
> > [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20
> > [116650.451052] PGD 0
> > [116650.518151] Oops: 0002 [#1] SMP
> > [116650.599477] CPU 0
> > [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O)
> > drbd(O) twofish_x86_64 twofish_generic twofish_common
> > serpent_sse2_x86_64 lrw xts gf1]
> > [116651.479730]
> > [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G O
> > 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL
>
> Running a custom-built 3.4.63 kernel with a bunch of out-of-tree
> modules installed. Can you reproduce this on a vanilla 3.12 kernel?
>
> > [116653.923833] Call Trace:
> > [116653.995006] [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10
> > [116654.103462] [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0
> > [116654.221716] [<ffffffff81051843>] ? process_one_work+0xf3/0x320
> > [116654.333195] [<ffffffff810534f2>] ? worker_thread+0xe2/0x280
> > [116654.441031] [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80
> > [116654.553512] [<ffffffff8105776b>] ? kthread+0x9b/0xb0
>
> Which is this line:
>
> STATIC void
> xlog_state_done_syncing(
> xlog_in_core_t *iclog,
> int aborted)
> {
> struct xlog *log = iclog->ic_log;
>
> spin_lock(&log->l_icloglock);
>
> So, the icloglock is at offset 296 bytes into the struct xlog, and
> the iclog structure is only 256 bytes in size itself, so that
> structure offset is way outside anything the code should be trying
> to access (ignoring the null pointer issue). Even if we assume that
> the 0x1000 bit is a memory corruption, offset 0xa8 lands in a hole
> in the struct xlog_in_core, and is in the middle of a bunch of log
> size constants in the struct xlog (l_sectBBsize to be exact).
>
> So this doesn't make much sense to me.
>
> BTW, you should compile your kernels with frame pointers enabled so
> that the kernel emits stack traces that can be trusted rather than
> just dumping a list of symbols found on the stack...
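
For reference, this is the .config fragment that enables frame pointers
(option names as in mainline Kconfig; anything else here is a sketch):

```
# Reliable stack unwinding -- traces show real call chains instead of
# every stale symbol found on the stack (the "?" entries above).
CONFIG_FRAME_POINTER=y
```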
>
> > It looks like a race condition.
>
> Looks more like memory corruption to me....
>
> > Test script source:
>
> I'll see if I can reproduce it locally.
The script is full of bugs, and I don't have time to debug it - it
hard codes /dev/sda in places despite taking the device as a CLI
parameter. It has hard coded mount points. It sometimes fails to
make the filesystem on the base LV after it's been created.
start_snap() appears to fail for some reason, as it doesn't result
in mounted snapshots. stop_snap fails as well:
Starting snap19 : Thursday 28 November 10:01:26 EST 2013
Logical volume lv1+snap19 converted to snapshot.
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ OK ] lv1+snap19 activated.
Starting time : 37 s.
---------------------------
Stopping snap0 : Thursday 28 November 10:02:06 EST 2013
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] lv0+snap00 still active !!!
[ OK ] lv0+snap00 umounted.
Stopping time : 0 s.
I've got no idea if this is intended behaviour, but it sure doesn't
seem right to me...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Thread overview: 6+ messages
2013-11-27 10:01 [BUG] Call trace during snapshot start/stop sequence Arkadiusz Bubała
2013-11-27 22:19 ` Dave Chinner
2013-11-27 23:06 ` Dave Chinner [this message]
2013-11-28 10:00 ` Arkadiusz Bubała
2013-11-28 21:16 ` Dave Chinner
2013-12-05 8:36 ` Arkadiusz Bubała