Re: Kernel oops when using XFS

From: Brian Foster <bfoster@redhat.com>
To: Tal Maoz <magogo200@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Kernel oops when using XFS
Date: Wed, 1 Jul 2015 11:13:01 -0400
Message-ID: <20150701151301.GA31994@bfoster.bfoster>
In-Reply-To: <CAD6tmdA97Vuqu=Y6na138cSAYyS=vmJX=jpf6jXpjfQkYb7nTA@mail.gmail.com>

On Wed, Jul 01, 2015 at 05:32:48PM +0300, Tal Maoz wrote:
> Hey all,
> 
> 
> 
> I’m using XFS on an Ubuntu 14.04.1 x86_64 machine and getting kernel oops
> messages from the XFS module almost daily.
> 
> My machine is a 6-core Core-i7 with 24GB RAM, an mdadm software raid5 over
> 4 disks with an XFS fs on it.
> 
> Here is a typical error message:
> 
> 
> 
> [Wed Jul  1 07:14:30 2015] XFS (md1p3): Access to block zero in inode
> 70952984 start_block: 0 start_off: 0 blkcnt: 5 extent-state: 0 lastx: 8
> 
> [Wed Jul  1 07:14:30 2015] XFS (md1p3): xfs_dabuf_map: bno 8388608 dir:
> inode 70952984
> 
> [Wed Jul  1 07:14:30 2015] XFS (md1p3): [00] br_startoff 8388608
> br_startblock -2 br_blockcount 1 br_state 0
> 
> [Wed Jul  1 07:14:30 2015] XFS (md1p3): Internal error xfs_da_do_buf(1) at
> line 2521 of file
> /build/buildd/linux-lts-utopic-3.16.0/fs/xfs/xfs_da_btree.c.  Caller
> xfs_da_read_buf+0x50/0xf0 [xfs]
> 

This is a distro kernel, so we don't know exactly which upstream patches
it does or doesn't include. The error is most likely the check in
xfs_dabuf_map() finding a hole where one is not expected.
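
For reference, the "br_startblock -2" in the quoted log is the sentinel the
kernel uses for a hole (HOLESTARTBLOCK); -1 is the delayed-allocation
sentinel (DELAYSTARTBLOCK). A minimal sketch for pulling the extent record
out of dmesg text and classifying it could look like the following. The
helper names parse_extent() and classify_startblock() are made up for
illustration, and the regex assumes the field layout shown in the log above.

```python
import re

# Sentinel values of br_startblock as printed (signed) in the log.
DELAYSTARTBLOCK = -1  # delayed allocation: no disk block assigned yet
HOLESTARTBLOCK = -2   # hole: nothing is mapped at this file offset

# Assumed layout, taken from the log line quoted in this thread.
EXTENT_RE = re.compile(
    r"br_startoff\s+(?P<startoff>-?\d+)\s+"
    r"br_startblock\s+(?P<startblock>-?\d+)\s+"
    r"br_blockcount\s+(?P<blockcount>\d+)\s+"
    r"br_state\s+(?P<state>\d+)"
)


def parse_extent(text):
    """Return the first extent record in text as a dict of ints, or None.

    Hypothetical helper. \\s+ also matches newlines, so a record that was
    wrapped across two lines by a mail client still parses.
    """
    match = EXTENT_RE.search(text)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def classify_startblock(startblock):
    """Map a br_startblock value to 'hole', 'delalloc' or 'mapped'."""
    if startblock == HOLESTARTBLOCK:
        return "hole"
    if startblock == DELAYSTARTBLOCK:
        return "delalloc"
    return "mapped"


if __name__ == "__main__":
    line = ("XFS (md1p3): [00] br_startoff 8388608\n"
            "br_startblock -2 br_blockcount 1 br_state 0")
    extent = parse_extent(line)
    print(extent, classify_startblock(extent["startblock"]))
```

Strip any "> " quoting before feeding mail text to it; raw dmesg output
works as-is.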

> [Wed Jul  1 07:14:30 2015] CPU: 11 PID: 12649 Comm: updatedb.mlocat
> Tainted: P    B      OE 3.16.0-36-generic #48~14.04.1-Ubuntu
> 
> [Wed Jul  1 07:14:30 2015] Hardware name: MSI MS-7666/Big Bang-XPower
> (MS-7666), BIOS V1.6 03/29/2011
> 
> [Wed Jul  1 07:14:30 2015]  ffff88018fa1fa40 ffff88018fa1f9e0
> ffffffff81764a5f ffff8806199f0000
> 
> [Wed Jul  1 07:14:30 2015]  ffff88018fa1f9f8 ffffffffc076e1eb
> ffffffffc07a27e0 ffff88018fa1fa88
> 
> [Wed Jul  1 07:14:30 2015]  ffffffffc07a186f ffff880100000000
> ffff880598a36c00 ffff88018fa1fac0
> 
> [Wed Jul  1 07:14:30 2015] Call Trace:
> 
> [Wed Jul  1 07:14:30 2015]  [<ffffffff81764a5f>] dump_stack+0x45/0x56
> 
> [Wed Jul  1 07:14:30 2015]  [<ffffffffc076e1eb>] xfs_error_report+0x3b/0x40
> [xfs]
> 
> [Wed Jul  1 07:14:30 2015]  [<ffffffffc07a27e0>] ?
> xfs_da_read_buf+0x50/0xf0 [xfs]
> 
> [Wed Jul  1 07:14:30 2015]  [<ffffffffc07a186f>]
> xfs_dabuf_map.constprop.16+0x16f/0x370 [xfs]
> 
> [Wed Jul  1 07:14:30 2015]  [<ffffffffc07a27e0>] xfs_da_read_buf+0x50/0xf0
> [xfs]
> 
> [Wed Jul  1 07:14:30 2015]  [<ffffffff81282ebb>] ?
> __ext4_handle_dirty_metadata+0x8b/0x200
> 

The stack does not appear to be reliable (and the line wrapping makes it
difficult to read, fwiw). Note the ext4 frame above in the middle of an
XFS trace. You might need to run with a debug variant kernel package to
get a reliable stack.
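
As a rough aid for reading a wrapped trace like the one above, something
along these lines can rejoin the frames and flag the ones the kernel marked
with "?" (addresses found on the stack that the unwinder could not confirm
as part of the call chain). This is only a sketch: parse_frames() is a
made-up helper, and the regex assumes the "[<addr>] symbol+0xoff/0xsize
[module]" frame layout seen in this log.

```python
import re

# Assumed frame layout: [<address>] [?] symbol+0xoff/0xsize [module]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return one dict per stack frame found in text (hypothetical helper).

    Whitespace, including line breaks added by mail wrapping, is collapsed
    first so that a frame split over two lines is matched as one.
    """
    flat = " ".join(text.split())
    frames = []
    for match in FRAME_RE.finditer(flat):
        frames.append({
            "addr": match.group("addr"),
            "symbol": match.group("symbol"),
            "module": match.group("module"),         # None for built-in code
            "reliable": match.group("guess") is None,  # False if marked "?"
        })
    return frames


if __name__ == "__main__":
    trace = """
    [Wed Jul  1 07:14:30 2015]  [<ffffffffc07a27e0>] xfs_da_read_buf+0x50/0xf0
    [xfs]
    [Wed Jul  1 07:14:30 2015]  [<ffffffff81282ebb>] ?
    __ext4_handle_dirty_metadata+0x8b/0x200
    """
    for frame in parse_frames(trace):
        mark = " " if frame["reliable"] else "?"
        print(mark, frame["symbol"], frame["module"] or "(built-in)")
```

A frame flagged as unreliable that belongs to an unrelated subsystem, such
as the ext4 one here, is a hint that the trace should not be taken at face
value.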

...
> 
> 
> 
> I have gathered some information that I hope can help figure out what the
> problem is. FYI, this sort of problem also happened when I had an Ubuntu
> 12.04 installed with the same raid configuration:
> 

The kernel version and which additional upstream patches it includes
are more relevant than the distro release (I have no idea what kernel
an Ubuntu 12.04 box runs).

> 
> 
> 
> 
...
> 
> What could be the problem?
> 
> I have another mdadm raid5 array that used to have XFS and gave the same
> errors. I use it for backup and noticed that each backup from the main
> array to the second one used to take a long time and gave these errors. I
> switched from XFS to ext4 and not only did the errors go away
> (obviously...) but the backup time went down from almost 23 hours to 1 hour!
> 

There has been one recent instance of an error similar to above that I'm
aware of, but it's just a guess since I don't know what's in your kernel
and we don't have an actual stacktrace from the error.

If you can determine that your kernel has the following patch:

	xfs: xfs_attr_inactive leaves inconsistent attr fork state behind

... without this one:

	xfs: don't truncate attribute extents if no extents exist

... you might need to include the latter or revert to a kernel prior to
the former being introduced. Specifically, the former introduced, and the
latter subsequently fixed, a regression that could leave inode extended
attribute forks in a weird state. In some tests, I recall seeing
xfs_dabuf_map() errors on inode reclaim as a side effect.
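
One way to make that check mechanical, assuming you can get a list of
commit subject lines for your kernel (for example from the distro's kernel
git tree or package changelog; how you obtain them is left open here), is a
small sketch like this. attr_fork_patch_state() is a made-up name, and
matching on the subject text is only a heuristic, since a distro backport
may reword or prefix the subject.

```python
# Subjects of the two upstream patches named above.
REGRESSION = "xfs: xfs_attr_inactive leaves inconsistent attr fork state behind"
FIX = "xfs: don't truncate attribute extents if no extents exist"


def attr_fork_patch_state(subjects):
    """Classify a kernel by which of the two patches its history contains.

    subjects is any iterable of commit subject lines (hypothetical input).
    Substring matching is used so that distro prefixes do not hide a match.
    Returns 'affected', 'fixed' or 'not applicable'.
    """
    subjects = list(subjects)
    has_regression = any(REGRESSION in s for s in subjects)
    has_fix = any(FIX in s for s in subjects)
    if has_regression and not has_fix:
        return "affected"        # regression present, fix missing
    if has_regression and has_fix:
        return "fixed"           # both patches present
    return "not applicable"      # regression patch was never applied


if __name__ == "__main__":
    history = [
        "UBUNTU: example unrelated change",  # placeholder entry, not real
        REGRESSION,
    ]
    print(attr_fork_patch_state(history))
```

An "affected" result corresponds to the case described above: include the
second patch or go back to a kernel from before the first one.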

Also, have you run 'xfs_repair -n' to see if anything is wrong on-disk?

Brian

> 
> Thanks,
> 
> Tal

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
