From: Dave Chinner <david@fromorbit.com>
To: "Arkadiusz Bubała" <arkadiusz.bubala@open-e.com>
Cc: xfs@oss.sgi.com
Subject: Re: [BUG] Call trace during snapshot start/stop sequence
Date: Thu, 28 Nov 2013 10:06:08 +1100 [thread overview]
Message-ID: <20131127230608.GJ10988@dastard> (raw)
In-Reply-To: <20131127221923.GI10988@dastard>
On Thu, Nov 28, 2013 at 09:19:23AM +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote:
> > Hello,
> >
> > we're running a test script that starts and stops
> > snapshots in a loop while overfilling them. After a few days of
> > running, the system hangs. We've captured the following call trace:
> >
> > [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06
> > ("xlog_iodone") error 5 buf count 1024
> > [116649.947247] XFS (dm-42): Log I/O Error Detected. Shutting down
> > filesystem
> > [116650.073881] XFS (dm-42): Please umount the filesystem and rectify
> > the problem(s)
>
> So, an EIO error on a log IO, resulting in a shutdown....
>
> > [116650.207186] BUG: unable to handle kernel paging request at
> > 00000000000010a8
>
> That's an interesting offset - quite large for a null pointer
> dereference.
>
> > [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20
> > [116650.451052] PGD 0
> > [116650.518151] Oops: 0002 [#1] SMP
> > [116650.599477] CPU 0
> > [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O)
> > drbd(O) twofish_x86_64 twofish_generic twofish_common
> > serpent_sse2_x86_64 lrw xts gf1]
> > [116651.479730]
> > [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G O
> > 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL
>
> Running a custom-built 3.4.63 kernel with a bunch of out-of-tree
> modules installed. Can you reproduce this on a vanilla 3.12 kernel?
>
> > [116653.923833] Call Trace:
> > [116653.995006] [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10
> > [116654.103462] [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0
> > [116654.221716] [<ffffffff81051843>] ? process_one_work+0xf3/0x320
> > [116654.333195] [<ffffffff810534f2>] ? worker_thread+0xe2/0x280
> > [116654.441031] [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80
> > [116654.553512] [<ffffffff8105776b>] ? kthread+0x9b/0xb0
>
> Which is this line:
>
> STATIC void
> xlog_state_done_syncing(
> xlog_in_core_t *iclog,
> int aborted)
> {
> struct xlog *log = iclog->ic_log;
>
> spin_lock(&log->l_icloglock);
>
> So, the icloglock is at offset 296 bytes into the struct xlog, and
> the iclog structure is only 256 bytes in size itself, so that
> structure offset is way outside anything the code should be trying
> to access (ignoring the null pointer issue). Even if we assume that
> the 0x1000 bit is a memory corruption, offset 0xa8 lands in a hole
> in the struct xlog_in_core, and is in the middle of a bunch of log
> size constants in the struct xlog (l_sectBBsize to be exact).
>
> So this doesn't make much sense to me.
>
> BTW, you should compile your kernels with frame pointers enabled so
> that the kernel emits stack traces that can be trusted rather than
> just dumping a list of symbols found on the stack...
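
For reference, this is the .config fragment that enables frame pointers
(option names as in mainline Kconfig; anything else here is a sketch):

```
# Reliable stack unwinding -- traces show real call chains instead of
# every stale symbol found on the stack (the "?" entries above).
CONFIG_FRAME_POINTER=y
```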
>
> > It looks like a race condition.
>
> Looks more like memory corruption to me....
>
> > Test script source:
>
> I'll see if I can reproduce it locally.
The script is full of bugs, and I don't have time to debug it - it
hard codes /dev/sda in places despite taking the device as a CLI
parameter. It has hard coded mount points. It sometimes fails to
make the filesystem on the base LV after it's been created.
start_snap() appears to fail for some reason, as it doesn't result
in mounted snapshots. stop_snap fails as well:
Starting snap19 : Thursday 28 November 10:01:26 EST 2013
Logical volume lv1+snap19 converted to snapshot.
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ OK ] lv1+snap19 activated.
Starting time : 37 s.
---------------------------
Stopping snap0 : Thursday 28 November 10:02:06 EST 2013
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] lv0+snap00 still active !!!
[ OK ] lv0+snap00 umounted.
Stopping time : 0 s.
I've got no idea if this is intended behaviour, but it sure doesn't
seem right to me...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Thread overview: 6+ messages
2013-11-27 10:01 [BUG] Call trace during snapshot start/stop sequence Arkadiusz Bubała
2013-11-27 22:19 ` Dave Chinner
2013-11-27 23:06 ` Dave Chinner [this message]
2013-11-28 10:00 ` Arkadiusz Bubała
2013-11-28 21:16 ` Dave Chinner
2013-12-05 8:36 ` Arkadiusz Bubała