Re: XFS filesystem triggering assert in _repair 3.1.11

From: Jay Ashworth <jra@baylink.com>
To: xfs@oss.sgi.com
Subject: Re: XFS filesystem triggering assert in _repair 3.1.11
Date: Sun, 14 Jul 2013 12:14:21 -0400 (EDT)
Message-ID: <11307192.1374.1373818461931.JavaMail.root@benjamin.baylink.com>
In-Reply-To: <51E2C912.8080305@sandeen.net>

----- Original Message -----
> From: "Eric Sandeen" <sandeen@sandeen.net>

> On 7/13/13 4:29 PM, Jay Ashworth wrote:
> ...
> 
> > That's where I am right now: the drive was throwing a kernel oops if
> > I mounted it,
> 
> That shouldn't happen, for starters - was this on the older 2.6.37
> kernel?

Correct.  It also threw btree errors on that kernel *and* the 3.7 liveCD, 
but never oopsed the 3.7.

> > and xfs_repair would just lock up. I had to do a -L on
> > it
> 
> ok, so much for debugging the oops ...

Yeah, sorry.  Thankfully, it's summer hiatus, but it is a production
box, which sometimes limits how long I can keep problems around before
brute forcing them.  I *have* the oops, but no longer the FS that caused
it.

> > after which it would mount and unmount cleanly, and xfs_repair runs
> > and finds problems, but then fails an assert at the end and dies.
> >
> > Here's that entire repair run:
> >
> > =============================================================
> > plaintain:/var/log/mythtv # xfs_repair /dev/sdc2
> > Phase 1 - find and verify superblock...
> > Not enough RAM available for repair to enable prefetching.
> 
> ...
> 
> > entry "1011_20130509205900.mpg" at block 13 offset 4016 in directory
> > inode 1073789184 references free inode 1137017084
> >         clearing inode number in entry at offset 4016...
> > bad back (left) sibling pointer (saw 16140901064495857663 should be
> > NULL (0))
> ^^^ 0xDFFFFFFFFFFFFFFF i.e. -2
> 
> #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL) ?
> 
> >         in inode 1115989006 (data fork) bmap btree block 107963248
> > xfs_repair: dinode.c:2136: process_inode_data_fork: Assertion `err
> > == 0' failed.
> 
> This means we were in the check_dups path, and one of the process_*()
> functions failed, due to that "bad back (left) sibling pointer ..."
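
For what it's worth, here is a quick check of the arithmetic on that sibling
pointer. It is only a sketch: the one input is the decimal value xfs_repair
printed, and the helper names (as_u64, differing_bits) are made up for
illustration. It shows that the value is indeed 0xDFFFFFFFFFFFFFFF, that a
64-bit -2 would be 0xFFFFFFFFFFFFFFFE instead, and that the value differs
from all-ones (-1, which as far as I know is what XFS uses for a null block
pointer) in exactly one bit.

```python
MASK64 = (1 << 64) - 1

# Value printed by xfs_repair in the "bad back (left) sibling pointer" line.
reported = 16140901064495857663


def as_u64(value):
    """Two's-complement view of a (possibly negative) integer in 64 bits."""
    return value & MASK64


def differing_bits(a, b):
    """Bit positions (0 = least significant) at which two 64-bit values differ."""
    diff = (a ^ b) & MASK64
    return [bit for bit in range(64) if (diff >> bit) & 1]


print(hex(reported))                             # 0xdfffffffffffffff
print(hex(as_u64(-2)))                           # 0xfffffffffffffffe
print(differing_bits(reported, as_u64(-1)))      # [61]
print(differing_bits(reported, as_u64(-2)))      # [0, 61]
```

A single cleared bit relative to all-ones is at least consistent with a
one-bit flip in an otherwise null pointer, though that alone does not say
where the flip happened.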
> 
> If I had time to work on this, I'd ask for an xfs_metadump image of
> the filesystem to be able to reproduce it and look further into the
> problem...
> 
> It might shed some light on things to use xfs_db to look at inode
> 1115989006
> 
> # xfs_db /dev/sdc2
> xfs_db> inode 1115989006
> xfs_db> p

xfs_db> inode 1115989006
xfs_db> p
core.magic = 0x494e
core.mode = 0100666
core.version = 2
core.format = 3 (btree)
core.nlinkv2 = 1
core.onlink = 0
core.projid_lo = 0
core.projid_hi = 0
core.uid = 111
core.gid = 33
core.flushiter = 18
core.atime.sec = Wed Jul  3 19:28:22 2013
core.atime.nsec = 956870002
core.mtime.sec = Tue Jan 29 20:00:10 2013
core.mtime.nsec = 466912274
core.ctime.sec = Fri Jul 12 13:37:43 2013
core.ctime.nsec = 217838130
core.size = 916961916
core.nblocks = 223869
core.extsize = 0
core.nextents = 16
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.filestream = 0
core.gen = 3501711335
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:107963248
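
A rough sanity check on those core fields, as a sketch only. It assumes
4096-byte filesystem blocks (the block size is not in the output above), a
file with no holes and nothing allocated past EOF, and exactly one bmap btree
block, which is what level 1 with a single root pointer implies. The function
name expected_nblocks is made up for illustration.

```python
# Values copied from the xfs_db output above.
SIZE_BYTES = 916961916   # core.size
NBLOCKS = 223869         # core.nblocks
BTREE_BLOCKS = 1         # u.bmbt.level = 1, u.bmbt.numrecs = 1: one leaf block

# Assumption: 4096-byte filesystem blocks (not shown in the output).
BLOCK_SIZE = 4096


def expected_nblocks(size_bytes, block_size, btree_blocks):
    """Blocks a fully allocated file of this size needs, plus btree blocks."""
    data_blocks = -(-size_bytes // block_size)   # integer ceiling division
    return data_blocks + btree_blocks


expected = expected_nblocks(SIZE_BYTES, BLOCK_SIZE, BTREE_BLOCKS)
print(expected, NBLOCKS, expected == NBLOCKS)
```

Under those assumptions the expected count matches core.nblocks, which would
suggest (not prove) that the inode core itself is sane and the damage is in
the btree block it points to.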

> looking at bmap btree block 107963248 might also be interesting; like
> this I think but I'm rusty:
> 
> xfs_db> fsblock 107963248
> xfs_db> type bmapbt

Well, the manpage says that's a type, but my xfs_db, v 3.1.11, says it's
not.  Huh?

> xfs_db> p
> 
> > Aborted
> > =============================================================
> >
> > This is xfs_repair 3.1.11, from xfsprogs 3.1.11 from tarball,
> > compiled on
> > the machine in question, which is a 32-bit OS with 512MB of ram (the
> > mobo, an old MSI KT6V, pukes if we try to put more ram on it for
> > some
> > reason). I have run memtest+ on the ram and multiple passes come
> > back clean as a whistle; the SATA controller is a SiI 3114, which we
> > had to buy to talk to the 3TB drives; boot is from the VT6420 on the
> > motherboard and a dedicated 40G Samsung.
> >
> > I have done some work on this repair booted from a Suse 12.1 rescue
> > disk
> > with a 3.7 kernel, on the theory that the XFS drivers in the kernel
> > might help; I found that mounting and unmounting in between multiple
> > repair runs made me have to do less of them -- though I'm sure more
> > than two dirty runs before one sees a clean one ought to be Right
> > Out
> > anyway.
> 
> Eek, so you thrashed about, in other words. ;)

I've been at this over a week.  Yes, there's been some thrashing.  I have a
2TB drive that I need to dedupe and re-mkfs so I have space to work in; that
process is itself hung up on a *different* XFS problem on a different
filesystem.  (Specifically, I have one bad inode on that FS that repair
doesn't seem to want to touch.  It's been lower priority because that data's
duped, but as I need the free space more, its priority is rising.)

I hate power supplies.

> > I've seen suggestions on the mailing list archives and other places
> > that (some) assertion fails were for things fixed in earlier tools
> > releases, but that one's not helping me...
> 
> well, not always true, esp. in userspace.
> 
> > I have space to move this data off and remake the filesystem,
> > if I can get it to mount reliably and stay that way long enough.
> 
> you can always mount it & copy as much as possible until you hit
> corruption. But until repair succeeds you'll have corruption lurking
> that you'll hit which will probably cause the fs to shut down
> (gracefully, in theory).

Well, the bottom half shuts down, but then the top half keeps going, 
throwing error 5's all night.  
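
(Error 5 is EIO, "Input/output error", on Linux.)  Since a plain recursive
copy may give up or make a mess once those start, the "copy as much as
possible" approach could be sketched roughly like this. It is an illustration
under assumptions, not a tested recovery tool: salvage_copy is a made-up
name, and the source and destination paths are whatever you pass in. It
copies file by file, records every file or directory that fails along with
the errno name, and carries on.

```python
import errno
import os
import shutil


def salvage_copy(src_root, dst_root):
    """Copy every readable file under src_root into dst_root.

    Returns a list of (path, errno_name) for everything that failed.
    """
    failures = []

    def errno_name(exc):
        return errno.errorcode.get(exc.errno, str(exc.errno))

    def note_walk_error(exc):
        # os.walk silently skips unreadable directories unless told otherwise.
        failures.append((exc.filename, errno_name(exc)))

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=note_walk_error):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError as exc:
                # Typically EIO once the filesystem has shut down.
                failures.append((src, errno_name(exc)))
    return failures
```

If the failure list is all EIO from some point onward, the filesystem has
most likely shut down and the run needs restarting after an unmount and
remount; a partially copied file may be left behind at the destination for
any entry in the list.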

Cheers,
-- jra
-- 
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA               #natog                      +1 727 647 1274
