All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Arkadiusz Bubała" <arkadiusz.bubala@open-e.com>
Cc: xfs@oss.sgi.com
Subject: Re: [BUG] Call trace during snapshot start/stop sequence
Date: Thu, 28 Nov 2013 10:06:08 +1100	[thread overview]
Message-ID: <20131127230608.GJ10988@dastard> (raw)
In-Reply-To: <20131127221923.GI10988@dastard>

On Thu, Nov 28, 2013 at 09:19:23AM +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote:
> > Hello,
> > 
> > we're running test script that starts and stops
> > snapshots in a loop while overfilling them.  After a few days of running
> > system hangs. We've captured following call trace:
> > 
> > [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06
> > ("xlog_iodone") error 5 buf count 1024
> > [116649.947247] XFS (dm-42): Log I/O Error Detected.  Shutting down
> > filesystem
> > [116650.073881] XFS (dm-42): Please umount the filesystem and rectify
> > the problem(s)
> 
> So, an EIO error on a log IO, resulting in a shutdown....
> 
> > [116650.207186] BUG: unable to handle kernel paging request at
> > 00000000000010a8
> 
> That's an interesting offset - quite large for a null pointer
> dereference.
> 
> > [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20
> > [116650.451052] PGD 0
> > [116650.518151] Oops: 0002 [#1] SMP
> > [116650.599477] CPU 0
> > [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O)
> > drbd(O) twofish_x86_64 twofish_generic twofish_common
> > serpent_sse2_x86_64 lrw xts gf1]
> > [116651.479730]
> > [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G           O
> > 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL
> 
> Running a custom built 3.4.63 kernel with a bunch of out of tree
> modules installed. can you reproduce this on a vanilla 3.12 kernel?
> 
> > [116653.923833] Call Trace:
> > [116653.995006]  [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10
> > [116654.103462]  [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0
> > [116654.221716]  [<ffffffff81051843>] ? process_one_work+0xf3/0x320
> > [116654.333195]  [<ffffffff810534f2>] ? worker_thread+0xe2/0x280
> > [116654.441031]  [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80
> > [116654.553512]  [<ffffffff8105776b>] ? kthread+0x9b/0xb0
> 
> Which is this line:
> 
> STATIC void
> xlog_state_done_syncing(
>         xlog_in_core_t  *iclog,
>         int             aborted)
> {
>         struct xlog        *log = iclog->ic_log;
> 
>         spin_lock(&log->l_icloglock);
> 
> So, the icloglock is at offset 296 bytes into the struct xlog, and
> the iclog structure is only 256 bytes in size itself, so that
> structure offset is way outside anything the code should be trying
> to access (ignoring the null pointer issue). Even if we assume that
> the 0x1000 bit is a memory corruption, offset 0xa8 lands in a hole
> in the struct xlog_in_core, and isin the middle of a bunch of log
> size constants in the struct xlog (l_sectBBsize to be exact).
> 
> So this doesn't make much sense to me.
> 
> BTW, you should compile you kernels with frame pointers enabled so
> that the kernel emits stack traces that can be trusted rather than
> just dumping a list of symbols found on the stack...
> 
> > It looks like a race condition.
> 
> Looks more like memory corruption to me....
> 
> > Test script source:
> 
> I'll see if I can reproduce it locally.

The script is full of bugs, and i don't have time to debug it - it
hard codes /dev/sda in places despite taking the device as a CLI
parameter. It has hard coded mount points.  It sometimes fails to
make the filesystem on the base LV after it's been created.
start_snap() appears to fail for some reason, as it doesn't result
in mounted snapshots. stop_snap fails as well:

Starting snap19 : Thursday 28 November  10:01:26 EST 2013
  Logical volume lv1+snap19 converted to snapshot.
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ OK ] lv1+snap19 activated.
Starting time : 37 s.
---------------------------
Stopping snap0 : Thursday 28 November  10:02:06 EST 2013
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] lv0+snap00 still active !!!
[ OK ] lv0+snap00 umounted.
Stopping time : 0 s.

I've got no idea is this is intended behaviour, but it sure doesn't
seem right to me...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2013-11-27 23:06 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-27 10:01 [BUG] Call trace during snapshot start/stop sequence Arkadiusz Bubała
2013-11-27 22:19 ` Dave Chinner
2013-11-27 23:06   ` Dave Chinner [this message]
2013-11-28 10:00     ` Arkadiusz Bubała
2013-11-28 21:16       ` Dave Chinner
2013-12-05  8:36         ` Arkadiusz Bubała

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131127230608.GJ10988@dastard \
    --to=david@fromorbit.com \
    --cc=arkadiusz.bubala@open-e.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.