public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: "Arkadiusz Bubała" <arkadiusz.bubala@open-e.com>
Cc: xfs@oss.sgi.com
Subject: Re: [BUG] Call trace during snapshot start/stop sequence
Date: Thu, 28 Nov 2013 10:06:08 +1100
Message-ID: <20131127230608.GJ10988@dastard>
In-Reply-To: <20131127221923.GI10988@dastard>

On Thu, Nov 28, 2013 at 09:19:23AM +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote:
> > Hello,
> > 
> > we're running a test script that starts and stops
> > snapshots in a loop while overfilling them.  After a few days of running,
> > the system hangs. We've captured the following call trace:
> > 
> > [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06
> > ("xlog_iodone") error 5 buf count 1024
> > [116649.947247] XFS (dm-42): Log I/O Error Detected.  Shutting down
> > filesystem
> > [116650.073881] XFS (dm-42): Please umount the filesystem and rectify
> > the problem(s)
> 
> So, an EIO error on a log IO, resulting in a shutdown....
> 
> > [116650.207186] BUG: unable to handle kernel paging request at
> > 00000000000010a8
> 
> That's an interesting offset - quite large for a null pointer
> dereference.
> 
> > [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20
> > [116650.451052] PGD 0
> > [116650.518151] Oops: 0002 [#1] SMP
> > [116650.599477] CPU 0
> > [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O)
> > drbd(O) twofish_x86_64 twofish_generic twofish_common
> > serpent_sse2_x86_64 lrw xts gf1]
> > [116651.479730]
> > [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G           O
> > 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL
> 
> Running a custom-built 3.4.63 kernel with a bunch of out-of-tree
> modules installed. Can you reproduce this on a vanilla 3.12 kernel?
> 
> > [116653.923833] Call Trace:
> > [116653.995006]  [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10
> > [116654.103462]  [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0
> > [116654.221716]  [<ffffffff81051843>] ? process_one_work+0xf3/0x320
> > [116654.333195]  [<ffffffff810534f2>] ? worker_thread+0xe2/0x280
> > [116654.441031]  [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80
> > [116654.553512]  [<ffffffff8105776b>] ? kthread+0x9b/0xb0
> 
> Which is this line:
> 
> STATIC void
> xlog_state_done_syncing(
>         xlog_in_core_t  *iclog,
>         int             aborted)
> {
>         struct xlog        *log = iclog->ic_log;
> 
>         spin_lock(&log->l_icloglock);
> 
> So, the icloglock is at offset 296 bytes into the struct xlog, and
> the iclog structure is only 256 bytes in size itself, so that
> structure offset is way outside anything the code should be trying
> to access (ignoring the null pointer issue). Even if we assume that
> the 0x1000 bit is due to memory corruption, offset 0xa8 lands in a hole
> in the struct xlog_in_core, and is in the middle of a bunch of log
> size constants in the struct xlog (l_sectBBsize to be exact).
> 
> So this doesn't make much sense to me.
> 
> BTW, you should compile your kernels with frame pointers enabled so
> that the kernel emits stack traces that can be trusted rather than
> just dumping a list of symbols found on the stack...
> 
> > It looks like a race condition.
> 
> Looks more like memory corruption to me....
> 
> > Test script source:
> 
> I'll see if I can reproduce it locally.

The script is full of bugs, and I don't have time to debug it - it
hard-codes /dev/sda in places despite taking the device as a CLI
parameter. It has hard-coded mount points.  It sometimes fails to
make the filesystem on the base LV after it's been created.
start_snap() appears to fail for some reason, as it doesn't result
in mounted snapshots. stop_snap() fails as well:

Starting snap19 : Thursday 28 November  10:01:26 EST 2013
  Logical volume lv1+snap19 converted to snapshot.
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ OK ] lv1+snap19 activated.
Starting time : 37 s.
---------------------------
Stopping snap0 : Thursday 28 November  10:02:06 EST 2013
[ FAIL ] Can't umount snapshot
[ FAIL ] Can't remove snapshot
[ FAIL ] lv0+snap00 still active !!!
[ OK ] lv0+snap00 umounted.
Stopping time : 0 s.

I've got no idea whether this is intended behaviour, but it sure doesn't
seem right to me...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

Thread overview: 6+ messages
2013-11-27 10:01 [BUG] Call trace during snapshot start/stop sequence Arkadiusz Bubała
2013-11-27 22:19 ` Dave Chinner
2013-11-27 23:06   ` Dave Chinner [this message]
2013-11-28 10:00     ` Arkadiusz Bubała
2013-11-28 21:16       ` Dave Chinner
2013-12-05  8:36         ` Arkadiusz Bubała