* something very strange w/ filestreams...
@ 2007-09-22 4:39 Eric Sandeen
2007-09-23 9:24 ` David Chinner
2007-09-24 4:22 ` Barry Naujok
0 siblings, 2 replies; 8+ messages in thread
From: Eric Sandeen @ 2007-09-22 4:39 UTC (permalink / raw)
To: xfs-oss
if I do:
for I in 173 174 178; do ./check $I; done
it's not terribly interesting, things seem to go ok, just normal
filestreams failures ;-)
if I do:
./check 173 174 178
things go very badly; the very first repair in 178 finds a horribly
corrupted filesystem, and repair tips over (memory appears corrupted, as
witnessed by):
> xfs_repair: zone calloc failed (, 572662388 bytes): Cannot allocate memory
hm, no zone name, length of 0x22222274?
I already provided a metadump image to Barry, but I wonder why the
timing(?) seems to make a difference here... first sign of things going
awry in repair is:
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
bad length 131072 for agf 0, should be 4096
bad length # 131072 for agi 0, should be 4096
would reset bad agf for ag 0
would reset bad agi for ag 0
....
not sure what's going on here, but it only seems to happen if I do those
2 filestreams test immediately before 178...
oh, and this is over LVM, just for fun.
-Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: something very strange w/ filestreams...
2007-09-22 4:39 something very strange w/ filestreams Eric Sandeen
@ 2007-09-23 9:24 ` David Chinner
2007-09-24 5:50 ` Barry Naujok
2007-09-24 4:22 ` Barry Naujok
1 sibling, 1 reply; 8+ messages in thread
From: David Chinner @ 2007-09-23 9:24 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs-oss

On Fri, Sep 21, 2007 at 11:39:28PM -0500, Eric Sandeen wrote:
> if I do:
>
> for I in 173 174 178; do ./check $I; done
>
> it's not terribly interesting, things seem to go ok, just normal
> filestreams failures ;-)
>
> if I do:
>
> ./check 173 174 178
>
> things go very badly; the very first repair in 178 finds a horribly
> corrupted filesystem, and repair tips over (memory appears corrupted, as
> witnessed by):

Well, I get:

budgie:~/dgc/xfstests # ./check -l 173 174 178
FSTYP -- xfs (debug)
PLATFORM -- Linux/ia64 budgie 2.6.23-rc4-dgc-xfs
MKFS_OPTIONS -- -f -bsize=4096 /dev/sdb9
MOUNT_OPTIONS -- /dev/sdb9 /mnt/scratch
173 75s ...
174 16s ...
178 *** glibc detected *** /sbin/xfs_repair: double free or corruption (!prev): 0x600000000000ebc0 ***
======= Backtrace: =========
/lib/libc.so.6.1[0x20000000001a2f50]
/lib/libc.so.6.1(__libc_free+0x15f0e8)[0x20000000001a6ce0]
/sbin/xfs_repair[0x40000000000320e0]
/sbin/xfs_repair[0x4000000000043a90]
/sbin/xfs_repair[0x400000000006d230]
/lib/libc.so.6.1(__libc_start_main+0xb4018)[0x20000000000fbc20]
/sbin/xfs_repair[0x4000000000003520]
======= Memory map: ========
.....

Just executing ./check -l 174 178 isn't sufficient, but ./check -l 172 174 178
triggers it. 172,173,178 does not trigger it, so it's something to do with
test 174 running after another filestreams test but before 178.

Well, what does test 178 do? Oh, it mkfs's a new filesystem on the scratch
device and then hoses the superblock and tries to use secondary superblocks
to reconstruct it successfully. I'm guessing that it is finding a superblock
from a previous test and incorrectly using that, finding stuff all nasty and
inconsistent due to the more recent mkfs....

Given this error:

bad length 156382 for agf 0, should be LENGTH
bad length # 156382 for agi 0, should be LENGTH

I think that is what is happening - those messages only come up when the
agf/agi lengths don't match the superblock, and that points to using the
wrong superblock for recovery. Especially as:

# mkfs.xfs -f /dev/sdb9
meta-data=/dev/sdb9              isize=256    agcount=8, agsize=156382 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=1251056, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=2560, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

An AG length of 156382 is correct.

Hmmm - just a plain:

# ./check -l 172 174 ; mkfs.xfs -f /dev/sdb9; dd if=/dev/zero of=/dev/sdb9 bs=512 count=1 ; xfs_repair /dev/sdb9

reproduces the problem.

Barry - I think xfs_repair might be finding the incorrect superblock
for the repair. Tests 172, 173 and 174 use less than the whole disk,
so there are going to be stale superblocks all over the place....

> hm, no zone name, length of 0x22222274?
>
> I already provided a metadump image to Barry, but I wonder why the
> timing(?) seems to make a difference here... first sign of things going
> awry in repair is:
>
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> bad length 131072 for agf 0, should be 4096
> bad length # 131072 for agi 0, should be 4096

Yes - test 173 uses 1GB filesystem with 64x16MB AGs - 4096 * 4k block
size = 16MB AG. definitely looks like a stale superblock being
found.

Barry, I think that the secondary superblock needs better verification
(e.g. that there really are AG headers where the sb says there
are supposed to be and all the lengths match up).

Eric - you can relax. Filestreams is not hosing your filesystem;
xfs_repair is....

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 8+ messages in thread
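The geometry arithmetic in Dave's reply can be checked with a short sketch. It is illustrative only: the names `ag_bytes` and `lengths_agree` are hypothetical and are not xfsprogs functions, and the numbers are simply the ones quoted in this thread.

```python
BLOCK_SIZE = 4096          # bytes per block (bsize=4096 in the mkfs output)
MIB = 1024 * 1024

def ag_bytes(agblocks, block_size=BLOCK_SIZE):
    """Size of one allocation group in bytes."""
    return agblocks * block_size

# Test 173 geometry as described in the thread: 64 AGs of 4096 blocks.
small_ag = ag_bytes(4096)
small_fs = 64 * small_ag

# Fresh mkfs on /dev/sdb9: 8 AGs of 156382 blocks.
big_fs_blocks = 8 * 156382

def lengths_agree(sb_agblocks, agf_length):
    """Hypothetical check for AG 0: the AGF length should equal the
    AG size recorded in the superblock being used for the repair."""
    return sb_agblocks == agf_length

print(small_ag // MIB, "MiB per AG")
print(small_fs // MIB, "MiB filesystem")
print(big_fs_blocks, "blocks in the new filesystem")
print(lengths_agree(4096, 131072))
```

The mismatch in the last line is the same shape as the "bad length 131072 for agf 0, should be 4096" message: the AG header on disk and the superblock chosen for the repair describe different geometries.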
* Re: something very strange w/ filestreams...
2007-09-23 9:24 ` David Chinner
@ 2007-09-24 5:50 ` Barry Naujok
2007-09-24 12:40 ` Eric Sandeen
0 siblings, 1 reply; 8+ messages in thread
From: Barry Naujok @ 2007-09-24 5:50 UTC (permalink / raw)
To: David Chinner, Eric Sandeen; +Cc: xfs-oss

On Sun, 23 Sep 2007 19:24:44 +1000, David Chinner <dgc@sgi.com> wrote:

> Barry - I think xfs_repair might be finding the incorrect superblock
> for the repair. Tests 172, 173 and 174 use less than the whole disk,
> so there are going to be stale superblocks all over the place....
>
>> hm, no zone name, length of 0x22222274?
>>
>> I already provided a metadump image to Barry, but I wonder why the
>> timing(?) seems to make a difference here... first sign of things going
>> awry in repair is:
>>
>> Phase 2 - using internal log
>> - zero log...
>> - scan filesystem freespace and inode maps...
>> bad length 131072 for agf 0, should be 4096
>> bad length # 131072 for agi 0, should be 4096
>
> Yes - test 173 uses 1GB filesystem with 64x16MB AGs - 4096 * 4k block
> size = 16MB AG. definitely looks like a stale superblock being
> found.
>
> Barry, I think that the secondary superblock needs better verification
> (e.g. that there really are AG headers where the sb says there
> are supposed to be and all the lengths match up).
>
> Eric - you can relax. Filestreams is not hosing your filesystem;
> xfs_repair is....

Test 178 is designed to test mkfs.xfs in
http://oss.sgi.com/archives/xfs/2007-07/msg00139.html and
will still make xfs_repair go bananas if there are other
old AG headers.

So, before running this test, you should make sure your test
partitions are completely zeroed from mkfs's that occurred
before that recent version of mkfs.xfs was installed.

I tried on my test box and sure enough, xfs_repair barfed.
After zeroing the devices, 172, 174 & 178 sequence succeeded.

If you have failures after the zeroing and ONLY using the
latest mkfs.xfs then something else is wrong. Also,
xfs_copy/xfs_mdrestore of different images could still
trigger the problem.

There is a TODO to improve xfs_repair's handling of this
scenario.

Barry.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: something very strange w/ filestreams...
2007-09-24 5:50 ` Barry Naujok
@ 2007-09-24 12:40 ` Eric Sandeen
2007-09-25 0:17 ` Barry Naujok
0 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2007-09-24 12:40 UTC (permalink / raw)
To: Barry Naujok; +Cc: David Chinner, xfs-oss

Barry Naujok wrote:
> On Sun, 23 Sep 2007 19:24:44 +1000, David Chinner <dgc@sgi.com> wrote:
>
>> Barry - I think xfs_repair might be finding the incorrect superblock
>> for the repair. Tests 172, 173 and 174 use less than the whole disk,
>> so there are going to be stale superblocks all over the place....
>>
>>> hm, no zone name, length of 0x22222274?
>>>
>>> I already provided a metadump image to Barry, but I wonder why the
>>> timing(?) seems to make a difference here... first sign of things going
>>> awry in repair is:
>>>
>>> Phase 2 - using internal log
>>> - zero log...
>>> - scan filesystem freespace and inode maps...
>>> bad length 131072 for agf 0, should be 4096
>>> bad length # 131072 for agi 0, should be 4096
>> Yes - test 173 uses 1GB filesystem with 64x16MB AGs - 4096 * 4k block
>> size = 16MB AG. definitely looks like a stale superblock being
>> found.
>>
>> Barry, I think that the secondary superblock needs better verification
>> (e.g. that there really are AG headers where the sb says there
>> are supposed to be and all the lengths match up).
>>
>> Eric - you can relax. Filestreams is not hosing your filesystem;
>> xfs_repair is....
>
> Test 178 is designed to test mkfs.xfs in
> http://oss.sgi.com/archives/xfs/2007-07/msg00139.html and
> will still make xfs_repair go bananas if there are other
> old AG headers.
>
> So, before running this test, you should make sure your test
> partitions are completely zeroed from mkfs's that occurred
> before that recent version of mkfs.xfs was installed.

I dd'd over the whole test partition, ran the sequence, and hit the
problem.

> I tried on my test box and sure enough, xfs_repair barfed.
> After zeroing the devices, 172, 174 & 178 sequence succeeded.
>
> If you have failures after the zeroing and ONLY using the
> latest mkfs.xfs then something else is wrong. Also,
> xfs_copy/xfs_mdrestore of different images could still
> trigger the problem.
>
> There is a TODO to improve xfs_repair's handling of this
> scenario.

I do have the patch installed that you mentioned, as long as it's in 2.9.3.

but if xfs_repair is double-freeing, then something else is still wrong

-Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: something very strange w/ filestreams...
2007-09-24 12:40 ` Eric Sandeen
@ 2007-09-25 0:17 ` Barry Naujok
2007-09-25 1:32 ` Eric Sandeen
0 siblings, 1 reply; 8+ messages in thread
From: Barry Naujok @ 2007-09-25 0:17 UTC (permalink / raw)
To: Eric Sandeen; +Cc: David Chinner, xfs-oss

[-- Attachment #1: Type: text/plain, Size: 2104 bytes --]

On Mon, 24 Sep 2007 22:40:45 +1000, Eric Sandeen <sandeen@sandeen.net> wrote:

> Barry Naujok wrote:
>> On Sun, 23 Sep 2007 19:24:44 +1000, David Chinner <dgc@sgi.com> wrote:
>>
>>> Barry - I think xfs_repair might be finding the incorrect superblock
>>> for the repair. Tests 172, 173 and 174 use less than the whole disk,
>>> so there are going to be stale superblocks all over the place....
>>>
>>>> hm, no zone name, length of 0x22222274?
>>>>
>>>> I already provided a metadump image to Barry, but I wonder why the
>>>> timing(?) seems to make a difference here... first sign of things
>>>> going
>>>> awry in repair is:
>>>>
>>>> Phase 2 - using internal log
>>>> - zero log...
>>>> - scan filesystem freespace and inode maps...
>>>> bad length 131072 for agf 0, should be 4096
>>>> bad length # 131072 for agi 0, should be 4096
>>> Yes - test 173 uses 1GB filesystem with 64x16MB AGs - 4096 * 4k block
>>> size = 16MB AG. definitely looks like a stale superblock being
>>> found.
>>>
>>> Barry, I think that the secondary superblock needs better verification
>>> (e.g. that there really are AG headers where the sb says there
>>> are supposed to be and all the lengths match up).
>>>
>>> Eric - you can relax. Filestreams is not hosing your filesystem;
>>> xfs_repair is....
>>
>> Test 178 is designed to test mkfs.xfs in
>> http://oss.sgi.com/archives/xfs/2007-07/msg00139.html and
>> will still make xfs_repair go bananas if there are other
>> old AG headers.
>>
>> So, before running this test, you should make sure your test
>> partitions are completely zeroed from mkfs's that occurred
>> before that recent version of mkfs.xfs was installed.
>
> I dd'd over the whole test partition, ran the sequence, and hit the
> problem.

Yeah, worked it out yesterday but never got around to doing another
email. It's a combination of the two filestreams tests which do
small filesystems and mkfs.xfs doesn't wipe beyond the new
filesystem size. Zero the disk, try the attached patch and see
if that fixes the problem.

Barry.

[-- Attachment #2: better_ag_zeroing_in_mkfs.patch --]
[-- Type: application/octet-stream, Size: 881 bytes --]

--- a/xfsprogs/mkfs/xfs_mkfs.c 2007-09-25 10:12:42.000000000 +1000
+++ b/xfsprogs/mkfs/xfs_mkfs.c 2007-09-25 10:11:36.955516729 +1000
@@ -558,15 +558,12 @@ zero_old_xfs_structures(
 		goto done;
 
 	/*
-	 * block size and basic geometry seems alright, zero the secondaries,
-	 * but don't go beyond the end of the new filesystem.
+	 * block size and basic geometry seems alright, zero the secondaries.
 	 */
 	bzero(buf, new_sb->sb_sectsize);
 	off = 0;
 	for (i = 1; i < sb.sb_agcount; i++) {
 		off += sb.sb_agblocks;
-		if (off >= new_sb->sb_dblocks)
-			break;
 		if (pwrite64(xi->dfd, buf, new_sb->sb_sectsize,
 					off << sb.sb_blocklog) == -1)
 			break;
@@ -2115,6 +2112,7 @@ an AG size that is one stripe unit small
 				BTOBB(WHACK_SIZE));
 		bzero(XFS_BUF_PTR(buf), WHACK_SIZE);
 		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_purgebuf(buf);
 	}
 
 	/*
^ permalink raw reply [flat|nested] 8+ messages in thread
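The effect of removing the `off >= new_sb->sb_dblocks` check in the first hunk can be modelled in a few lines. This is a rough sketch of the loop's arithmetic only, not the mkfs.xfs code: `secondary_offsets` is a hypothetical helper, and the example geometry (an old 8 x 156382-block filesystem overwritten by a new 1 GiB one) is an assumption pieced together from numbers mentioned in this thread.

```python
def secondary_offsets(agcount, agblocks, blocklog, limit_blocks=None):
    """Byte offsets of the old secondary superblocks the loop would zero.

    Mirrors the loop in the patch: start at AG 1 and step by agblocks.
    If limit_blocks is given, stop at the first offset at or beyond it,
    which models the check that the patch removes.
    """
    offsets = []
    off = 0
    for _ in range(1, agcount):
        off += agblocks
        if limit_blocks is not None and off >= limit_blocks:
            break
        offsets.append(off << blocklog)
    return offsets

# Assumed old geometry: 8 AGs of 156382 blocks, 4 KiB blocks (blocklog 12).
old = dict(agcount=8, agblocks=156382, blocklog=12)

# Assumed new filesystem: 1 GiB, i.e. 262144 blocks of 4 KiB.
before_patch = secondary_offsets(limit_blocks=262144, **old)
after_patch = secondary_offsets(**old)

print(len(before_patch), "old secondaries zeroed with the size check")
print(len(after_patch), "old secondaries zeroed without it")
```

Under these assumed numbers, the size check leaves six of the seven old secondary superblocks on disk beyond the end of the small filesystem, which is the kind of stale metadata the thread suspects xfs_repair of picking up later.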
* Re: something very strange w/ filestreams...
2007-09-25 0:17 ` Barry Naujok
@ 2007-09-25 1:32 ` Eric Sandeen
2007-09-25 4:41 ` Eric Sandeen
0 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2007-09-25 1:32 UTC (permalink / raw)
To: Barry Naujok; +Cc: David Chinner, xfs-oss

Barry Naujok wrote:

>>> So, before running this test, you should make sure your test
>>> partitions are completely zeroed from mkfs's that occurred
>>> before that recent version of mkfs.xfs was installed.
>> I dd'd over the whole test partition, ran the sequence, and hit the
>> problem.
>
> Yeah, worked it out yesterday but never got around to doing another
> email. It's a combination of the two filestreams tests which do
> small filesystems and mkfs.xfs doesn't wipe beyond the new
> filesystem size. Zero the disk, try the attached patch and see
> if that fixes the problem.
>
> Barry.

Ok, but what about that double free?

-Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: something very strange w/ filestreams...
2007-09-25 1:32 ` Eric Sandeen
@ 2007-09-25 4:41 ` Eric Sandeen
0 siblings, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2007-09-25 4:41 UTC (permalink / raw)
To: Barry Naujok; +Cc: David Chinner, xfs-oss

Eric Sandeen wrote:
> Barry Naujok wrote:
>
>>>> So, before running this test, you should make sure your test
>>>> partitions are completely zeroed from mkfs's that occurred
>>>> before that recent version of mkfs.xfs was installed.
>>> I dd'd over the whole test partition, ran the sequence, and hit the
>>> problem.
>> Yeah, worked it out yesterday but never got around to doing another
>> email. It's a combination of the two filestreams tests which do
>> small filesystems and mkfs.xfs doesn't wipe beyond the new
>> filesystem size. Zero the disk, try the attached patch and see
>> if that fixes the problem.
>>
>> Barry.
>
> Ok, but what about that double free?
>
> -Eric
>
>

I have a bit of a clue about what's going wrong.

first we get the buffer zone allocated:

new zone 0x80efd68 for "xfs_buffer", size=116

set a watchpoint on that, also break on setup_bmap:

(gdb) watch *((int *)0x80efd68)
Hardware watchpoint 1: *(int *) 135200104
(gdb) break setup_bmap
(gdb) cont

ba_bmap gets allocated, based on some particular sb_agblocks count at
the time:

setup_bmap(agcount, mp->m_sb.sb_agblocks, mp->m_sb.sb_rextents);

on this filesystem it's 4096 at this point, like so:

Breakpoint 3, setup_bmap (agno=64, numblocks=4096, rtblocks=0) at incore.c:59

and from some debugging the size of ba_bmap[i] ends up as 2048:

...
ba_bmap[31] at 0x80edc58 size 2048
ba_bmap[32] at 0x80ee460 size 2048
ba_bmap[33] at 0x80eec68 size 2048
...

so I set a watch on the zone that ends up corrupted, and:

Hardware watchpoint 4: *(int *) 135200104

Old value = 116
New value = 372
0x08063a2f in set_agbno_state (mp=0xbf999188, agno=32, ag_blockno=12818, state=1) at incore.c:278
278 *addr = (((*addr) &
(gdb) bt
#0 0x08063a2f in set_agbno_state (mp=0xbf999188, agno=32, ag_blockno=12818, state=1) at incore.c:278
#1 0x0807d752 in scanfunc_bno (ablock=0x8187200, level=0, bno=1, agno=32, suspect=0, isroot=1) at scan.c:548
#2 0x0807c017 in scan_sbtree (root=1, nlevels=1, agno=32, suspect=0, func=0x807d430 <scanfunc_bno>, isroot=1) at scan.c:66
#3 0x0807d19a in scan_ag (agno=32) at ../include/xfs/swab.h:126
#4 0x0806751b in phase2 (mp=0xbf999188) at phase2.c:148
#5 0x08080d77 in main (argc=Cannot access memory at address 0x8
) at xfs_repair.c:619

so at this point it looks like we're trying to use an ag_blockno of
12818, when we only allocated based on expecting 4096 blocks per ag?

So I guess we've stumbled across another piece of the older, larger
filesystem and those values cause us to walk off the end of the ba_bmap
array?

Not sure where it goes from here, but bedtime for me. :)

-Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
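The numbers in the gdb session above line up if the per-AG bitmap stores 4 bits of state per block, packed into 64-bit words. That packing is an assumption here: it is consistent with the 2048-byte bitmap reported for 4096 blocks, but the thread does not show the actual macros. A worked sketch under that assumption, with hypothetical helper names:

```python
STATE_BITS = 4                       # assumed bits of state per block
WORD_BITS = 64                       # assumed width of one bitmap word
PER_WORD = WORD_BITS // STATE_BITS   # 16 blocks tracked per word
STATE_MASK = (1 << STATE_BITS) - 1

def bitmap_bytes(numblocks):
    """Bytes needed to track numblocks blocks at STATE_BITS each."""
    return numblocks * STATE_BITS // 8

def locate(base, ag_blockno):
    """Return (word address, bit shift) touched when a block's state is set."""
    word_index = ag_blockno // PER_WORD
    shift = (ag_blockno % PER_WORD) * STATE_BITS
    return base + word_index * (WORD_BITS // 8), shift

def set_state(old_word, shift, state):
    """Clear the block's state bits in the word, then store the new state."""
    return (old_word & ~(STATE_MASK << shift)) | (state << shift)

base = 0x80ee460    # ba_bmap[32] from the debug output
zone = 0x80efd68    # the "xfs_buffer" zone under the watchpoint

size = bitmap_bytes(4096)
addr, shift = locate(base, 12818)
new_value = set_state(116, shift, 1)

print(size, hex(addr), shift, new_value)
```

Under these assumptions the write for block 12818 of AG 32 lands exactly on the watched zone address and turns the size field from 116 into 372, matching the watchpoint report. The watched value is a 32-bit int on x86, so the low half of the 64-bit word is what gdb shows.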
* Re: something very strange w/ filestreams...
2007-09-22 4:39 something very strange w/ filestreams Eric Sandeen
2007-09-23 9:24 ` David Chinner
@ 2007-09-24 4:22 ` Barry Naujok
1 sibling, 0 replies; 8+ messages in thread
From: Barry Naujok @ 2007-09-24 4:22 UTC (permalink / raw)
To: Eric Sandeen, xfs-oss

On Sat, 22 Sep 2007 14:39:28 +1000, Eric Sandeen <sandeen@sandeen.net> wrote:

> if I do:
>
> for I in 173 174 178; do ./check $I; done
>
> it's not terribly interesting, things seem to go ok, just normal
> filestreams failures ;-)
>
> if I do:
>
> ./check 173 174 178
>
> things go very badly; the very first repair in 178 finds a horribly
> corrupted filesystem, and repair tips over (memory appears corrupted, as
> witnessed by):
>
>> xfs_repair: zone calloc failed (, 572662388 bytes): Cannot allocate
>> memory
>
> hm, no zone name, length of 0x22222274?
>
> I already provided a metadump image to Barry, but I wonder why the
> timing(?) seems to make a difference here... first sign of things going
> awry in repair is:
>
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> bad length 131072 for agf 0, should be 4096
> bad length # 131072 for agi 0, should be 4096
> would reset bad agf for ag 0
> would reset bad agi for ag 0
> ....
>
> not sure what's going on here, but it only seems to happen if I do those
> 2 filestreams test immediately before 178...
>
> oh, and this is over LVM, just for fun.

Eric, you have this patch installed don't you?

http://oss.sgi.com/archives/xfs/2007-07/msg00139.html
^ permalink raw reply [flat|nested] 8+ messages in thread