* xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
@ 2007-11-02 2:08 Jay Sullivan
2007-11-02 5:18 ` David Chinner
0 siblings, 1 reply; 21+ messages in thread
From: Jay Sullivan @ 2007-11-02 2:08 UTC (permalink / raw)
To: xfs
(Sorry if this is a dupe to the list; it has been a long day.)
I have an XFS filesystem that has had the following happen twice in 3
months, both times an impossibly large block number was requested.
Unfortunately my logs don't go back far enough for me to know if it
was the _exact_ same block both times.
I'm running xfsprogs 2.8.21.
Excerpt from syslog (hostname obfuscated to servername to protect
the innocent):
###
Nov 1 14:06:32 servername dm-1: rw=0, want=39943195856896,
limit=7759462400
Nov 1 14:06:32 servername I/O error in filesystem ("dm-1") meta-data
dev dm-1 block 0x245400000ff8 ("xfs_trans_read_buf") error 5 buf
count 4096
Nov 1 14:06:32 servername xfs_force_shutdown(dm-1,0x1) called from
line 415 of file fs/xfs/xfs_trans_buf.c. Return address = 0xc02baa25
Nov 1 14:06:32 servername Filesystem "dm-1": I/O Error Detected.
Shutting down filesystem: dm-1
Nov 1 14:06:32 servername Please umount the filesystem, and rectify
the problem(s)
###
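For scale, the want= and limit= values above are both counts of 512-byte sectors, so the overshoot can be sanity-checked with a short script. This is just a sketch using the numbers copied from the log:

```python
# Sketch: decode the kernel's "want=" / "limit=" values (512-byte
# sectors) from the syslog excerpt above, to see how far past the end
# of the device the requested block is.
SECTOR = 512

want = 39943195856896        # sectors requested (rw=0, i.e. a read)
limit = 7759462400           # device size in sectors

def sectors_to_tib(sectors):
    """Convert a sector count to TiB for readability."""
    return sectors * SECTOR / 2**40

print(f"device size : {sectors_to_tib(limit):10.2f} TiB")
print(f"requested at: {sectors_to_tib(want):10.2f} TiB")
print(f"overshoot   : {want // limit}x past the end of the device")

# The hex block number in the I/O error line is the same address:
assert int("0x245400000ff8", 16) <= want
```

A ~3.6 TiB device being asked for a block around the 18,000 TiB mark is why this smells like a corrupted on-disk pointer rather than a bad sector.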
I ran xfs_repair -L on the FS and it could be mounted again, but how
long until it happens a third time? What concerns me is that this is
a FS smaller than 4TB and 39943195856896 (or 0x245400000ff8) seems
like a block that I would only have if my FS was muuuuuch larger. The
following is output from some pertinent programs:
###
servername ~ # xfs_info /mnt/san
meta-data=/dev/servername-sanvg01/servername-sanlv01 isize=256 agcount=5, agsize=203161600 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=969932800, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
servername ~ # mount
/dev/sda3 on / type ext3 (rw,noatime,acl)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
udev on /dev type tmpfs (rw,nosuid)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs
(rw,noexec,nosuid,devmode=0664,devgid=85)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc
(rw,noexec,nosuid,nodev)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/mapper/servername--sanvg01-servername--sanlv01 on /mnt/san type
xfs (rw,noatime,nodiratime,logbufs=8,attr2)
/dev/mapper/servername--sanvg01-servername--rendersharelv01 on /mnt/
san/rendershare type xfs (rw,noatime,nodiratime,logbufs=8,attr2)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
servername ~ # uname -a
Linux servername 2.6.20-gentoo-r8 #7 SMP Fri Jun 29 14:46:02 EDT 2007
i686 Intel(R) Xeon(TM) CPU 3.20GHz GenuineIntel GNU/Linux
###
Does anyone know if this points to a bad block on a disk or if
something is corrupted and can be fixed with some expert knowledge of
xfs_db?
~Jay
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 2:08 xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c Jay Sullivan
@ 2007-11-02 5:18 ` David Chinner
0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-11-02 5:18 UTC (permalink / raw)
To: Jay Sullivan; +Cc: xfs
On Thu, Nov 01, 2007 at 10:08:09PM -0400, Jay Sullivan wrote:
> (Sorry if this is a dupe to the list; it has been a long day.)
>
> I have an XFS filesystem that has had the following happen twice in 3
> months, both times an impossibly large block number was requested.
....
Sure sign of a corrupted btree.
> I ran xfs_repair -L on the FS and it could be mounted again, but how
> long until it happens a third time?
<shrug>
What was the problem that xfs_repair fixed?
BTW, why did you run xfs_repair -L?
Also, when it happens next, what does xfs_check tell you is broken?
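In script form, the sequence being suggested might look like the sketch below. The device path and mount point are placeholders, and DRY_RUN defaults to only printing the commands, because xfs_check and xfs_repair must not run against a mounted filesystem:

```shell
# Sketch of the check sequence; /dev/vg/lv and /mnt/san are
# placeholders for the affected volume and its mount point.
DEV="${DEV:-/dev/vg/lv}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # With DRY_RUN=1 (the default) just print what would be executed.
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run umount /mnt/san
run xfs_check "$DEV"      # read-only structural check
run xfs_repair -n "$DEV"  # "no modify" mode: report, change nothing
```

Running xfs_repair without -n (and only then, if mounting fails, with -L) comes after you have captured the output of the read-only passes.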
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
@ 2009-02-24 13:04 Federico Sevilla III
2009-02-24 22:46 ` Dave Chinner
0 siblings, 1 reply; 21+ messages in thread
From: Federico Sevilla III @ 2009-02-24 13:04 UTC (permalink / raw)
To: Linux XFS
Hi,
Recently, we had two file servers crash during periods of increased load
(increased access from workstations in the LAN). After the crash, the
XFS file systems would no longer mount. The mount process would just
stay in state D, with no progress, and no significant disk activity.
The first pass of xfs_repair (without -L) successfully found a secondary
super block. The primary was corrupted for some reason. The file system
could still not be mounted after this first xfs_repair, and xfs_repair
would no longer continue because the log had to be replayed. Running
xfs_repair a second time with the -L option worked.
Unfortunately we don't have the output of these runs of xfs_repair to
share with the list. The above narrative is the same for the crash on
both servers, though.
On one of the servers now, on the same file system that had trouble, we
are having the following messages (the system otherwise remains usable,
though, which is weird):
attempt to access beyond end of device
sda7: rw=0, want=154858897362229008, limit=3885978852
I/O error in filesystem ("sda7") meta-data dev sda7 block 0x2262b58bf959708 ("xfs_trans_read_buf") error 5 buf count 4096
On this server which has been up for ~6 hours, we have 348 of the above
messages, and they are all identical.
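For what it's worth, repeats like this can be tallied mechanically. A small sketch (the sample here inlines two of the lines quoted above; in practice you would read a saved dmesg/syslog capture instead):

```python
# Sketch: count how often each distinct XFS I/O error line repeats in
# a captured log. "sample" stands in for the real log file contents.
from collections import Counter

sample = """\
attempt to access beyond end of device
sda7: rw=0, want=154858897362229008, limit=3885978852
I/O error in filesystem ("sda7") meta-data dev sda7 block 0x2262b58bf959708 ("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda7") meta-data dev sda7 block 0x2262b58bf959708 ("xfs_trans_read_buf") error 5 buf count 4096
"""

counts = Counter(
    line for line in sample.splitlines() if "xfs_trans_read_buf" in line
)
for line, n in counts.most_common():
    print(f"{n}x: {line}")
```

All 348 messages being byte-identical (same block address every time) is itself diagnostic: it points at one corrupted metadata pointer being re-read, not at scattered media errors.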
Both servers use CentOS5 with the Linux 2.6.18-92.1.22.el5 kernel. For
the server currently spewing the above messages, the underlying storage
(quoted directly from dmesg) is:
megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
SCSI subsystem initialized
megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
megaraid: probe new device 0x1000:0x1960:0x1000:0x0523: bus 4:slot 3:func 0
ACPI: PCI Interrupt 0000:04:03.0[A] -> GSI 25 (level, low) -> IRQ 201
megaraid: fw version:[713S] bios version:[G121]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
Vendor: MegaRAID Model: LD 0 RAID5 1907G Rev: 713S
Type: Direct-Access ANSI SCSI revision: 02
SCSI device sda: 3906242560 512-byte hdwr sectors (1999996 MB)
The information of the file system on /dev/sda7 is as follows:
meta-data=/dev/sda7 isize=256 agcount=32, agsize=15179616 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=485747328, imaxpct=25
= sunit=32 swidth=160 blks, unwritten=1
naming =version 2 bsize=4096
log =internal bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=32 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0
Mount options are:
rw,logbufs=8,logbsize=256k,sunit=256,swidth=1280,nobarrier
Write cache is disabled by the hardware RAID controller for all the
drives, and caching on the controller is set to write-through. These
servers are not new, and before setting them up we ran multiple passes
of memtest86+ with no issue.
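A hypothetical parsing sketch for pulling the want/limit pairs out of such messages automatically (the sample lines are copied from reports in this thread):

```python
# Sketch: extract "want=" and "limit=" from captured kernel messages
# and flag requests that land past the end of the device.
import re

LOG = """\
sda7: rw=0, want=154858897362229008, limit=3885978852
dm-1: rw=0, want=6763361770196172808, limit=7759462400
"""

PAT = re.compile(r"(\S+): rw=\d+, want=(\d+), limit=(\d+)")

for m in PAT.finditer(LOG):
    dev, want, limit = m.group(1), int(m.group(2)), int(m.group(3))
    if want > limit:
        print(f"{dev}: request needs {want.bit_length()} bits, "
              f"device limit fits in {limit.bit_length()} bits")
```

A valid sector address on these sub-4 TB devices fits in roughly 32-33 bits; the bad requests need 46-58 bits, which is consistent with a corrupted extent or btree pointer being interpreted as a block number.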
Please help me understand what the cause of this problem could be. I
have been searching and it remains unclear. Suggestions on what can be
done to fix it would also be appreciated.
Thanks!
On 2008-01-03 15:55, Jay Sullivan wrote:
> I'm still seeing a lot of the following in my dmesg. Any ideas? See
> below for what I have already tried (including moving data to a fresh
> XFS volume).
>
>
>
> Tons of these; sometimes the want= changes, but it is always huge.
>
> ###
>
> attempt to access beyond end of device
>
> dm-0: rw=0, want=68609558288793608, limit=8178892800
>
> I/O error in filesystem ("dm-0") meta-data dev dm-0 block
> 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
>
> ###
>
>
>
> Occasionally some of these:
>
> ###
>
> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4533 of file
> fs/xfs/xfs_bmap.c. Caller 0xc028c5a2
> [<c026bc58>] xfs_bmap_read_extents+0x3bd/0x498
> [<c028c5a2>] xfs_iread_extents+0x74/0xe1
> [<c028fb02>] xfs_iext_realloc_direct+0xa4/0xe7
> [<c028f3ef>] xfs_iex t;c028c5a2>] xfs_iread_extents+0x74/0xe1
> [<c026befd>] xfs_bmapi+0x1ca/0x173f
> [<c02e2d7e>] elv_rb_add+0x6f/0x88
> [<c02eb843>] as_update_rq+0x32/0x72
> [<c02ec08b>] as_add_request+0x76/0xa4
> [<c02e330c>] elv_insert+0xd5/0x142
> [<c02e70ad>] __make_request+0xc8/0x305
> [<c02e7480>] generic_make_request+0x122/0x1d9
> [<c03ee0e3>] __map_bio+0x33/0xa9
> [<c03ee36c>] __clone_and_map+0xda/0x34c
> [<c0148fce>] mempool_alloc+0x2a/0xdb
> [<c028aa3c>] xfs_ilock+0x58/0xa0
> [<c029168b>] xfs_iomap+0x216/0x4b7
> [<c02b2000>] __xfs_get_blocks+0x6b/0x226
> [<c02f2792>] radix_tree_node_alloc+0x16/0x57
> [<c02f2997>] radix_tree_insert+0xb0/0x126
> [<c02b21e3>] xfs_get_blocks+0x28/0x2d
> [<c0183a32>] block_read_full_page+0x192/0x346
> [<c02b21bb>] xfs_get_blocks+0x0/0x2d
> [<c028a667>] xfs_iget+0x145/0x150
> [<c018982d>] do_mpage_readpag 28aba1>] xfs_iunlock+0x43/0x84
> [<c02a8096>] xfs_vget+0xe1/0xf2
> [<c020a578>] find_exported_dentry+0x71/0x4b6
> [<c014c4a4>] __do_page_cache_readahead+0x88/0x153
> [<c0189aa4>] mpage_readpage+0x4b/0x5e
> [<c02b21bb>] xfs_get_blocks+0x0/0x2d
> [<c014c69d>] blockable_page_cache_readahead+0x4d/0xb9
> [<c014c942>] page_cache_readahead+0x174/0x1a3
> [<c014630f>] find_get_page+0x18/0x3a
> [<c014684e>] do_generic_mapping_read+0x1b5/0x535
> [<c012621a>] __capable+0x8/0x1b
> [<c0146f6c>] generic_file_sendfile+0x68/0x83
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c02b822f>] xfs_sendfile+0x94/0x164
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c0211325>] nfsd_permission+0x6e/0x103
> [<c02b4868>] xfs_file_sendfile+0x4c/0x5c
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c020f445>] nfsd_vfs_read+0x344/0x361
> [<c020eff2>] nfsd_read_actor+0x0/0x ] nfsd_read+0xd8/0xf9
> [<c021548e>] nfsd3_proc_read+0xb0/0x174
> [<c02170b4>] nfs3svc_decode_readargs+0x0/0xf7
> [<c020b535>] nfsd_dispatch+0x8a/0x1f5
> [<c048c43e>] svcauth_unix_set_client+0x11d/0x175
> [<c0488d73>] svc_process+0x4fd/0x681
> [<c020b39b>] nfsd+0x163/0x273
> [<c020b238>] nfsd+0x0/0x273
> [<c01037fb>] kernel_thread_helper+0x7/0x10
> ###
>
>
>
> Thanks!
>
>
>
> ~Jay
>
>
>
> From: Jay Sullivan [mailto:jpspgd@???]
> Sent: Thursday, December 20, 2007 9:01 PM
> To: xfs@???
> Cc: Jay Sullivan
> Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
>
>
>
> I'm still seeing problems. =(
>
>
>
> Most recently I have copied all of the data off of the suspect XFS
> volume onto another fresh XFS volume. A few days later I saw the same
> messages show up in dmesg. I haven't had a catastrophic failure that
> makes the kernel remount the FS RO, but I don't want to wait for that to
> happen.
>
>
>
> Today I upgraded to the latest stable kernel in Gentoo (2.6.23-r3) and
> I'm still on xfsprogs 2.9.4, also the latest stable release. A few
> hours after rebooting to load the new kernel, I saw the following in
> dmesg:
>
>
>
> ####################
>
> attempt to access beyond end of device
>
> dm-0: rw=0, want=68609558288793608, limit=8178892800
>
> I/O error in filesystem ("dm-0") meta-data dev dm-0 block
> 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
>
> attempt to access beyond end of device
>
> dm-0: rw=0, want=68609558288793608, limit=8178892800
>
> I/O error in filesystem ("dm-0") meta-data dev dm-0 block
> 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
>
> attempt to access beyond end of device
>
> dm-0: rw=0, want=68609558288793608, limit=8178892800
>
> I/O error in filesystem ("dm-0") meta-data dev dm-0 block
> 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
>
> attempt to access beyond end of device
>
> dm-0: rw=0, want=68609558288793608, limit=8178892800
>
> I/O error in filesystem ("dm-0") meta-data dev dm-0 block
> 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
>
> ###################
>
>
>
> These are the same messages (attempts to access a block that is WAAAY
> outside of the range of my drives) that I was seeing before, the last
> time my FS got remounted read-only by the kernel.
>
>
>
> Any ideas? What other information can I gather that would help with
> troubleshooting? Here are some more specifics:
>
>
>
> This is a Dell PowerEdge 1850 with a FusionMPT/LSI fibre channel card.
> The XFS volume is a 3.9TB logical volume in LVM. The volume group is
> spread across LUNs of Apple XServe RAIDs which are connected o'er FC
> to our fabric. I just swapped FC switches (to a different brand even)
> and the problem was showing before and after the switch switch, so
> that's not it. I have also swapped FC cards, upgraded FC card firmware,
> updated BIOSes, etc. This server sees heavy NFS (v3) and samba
> (currently 3.0.24 until the current regression bug is squashed and
> stable) traffic. It usually sees 200-300Mbps throughput 24/7, although
> sometimes more.
>
>
>
> Far-fetched: Is there any way that a particular file on my FS, when
> accessed, is causing the problem?
>
>
>
> I have a very similar system (Dell PE 2650, same FC card, same type of
> RAID, same SFP cables, same GPT scheme, same kernel) but instead with an
> ext3 (full journal) FS in a 5.[something]TB logical volume (LVM) with no
> problems. Oh, and it sees system load values in the mid-20s just about
> all day.
>
>
>
> Grasping at straws. I need XFS to work because we'll soon be requiring
> seriously large filesystems with non-sucky extended attribute and ACL
> support. Plus it's fast and I like it.
>
>
>
> Can the XFS community help? I don't want to have to turn to that guy
> that a =P
>
>
>
> ~Jay
>
>
>
>
>
> On Nov 14, 2007, at 10:05 AM, Jay Sullivan wrote:
>
>
>
>
>
> Of course this had to happen one more time before my scheduled
> maintenance window... Anyways, here's all of the good stuff I
> collected. Can anyone make sense of it? Oh, and I upgraded to xfsprogs
> 2.9.4 last week, so all output you see is with that version.
>
> Thanks!
>
> ###################
>
> dmesg output
>
> ###################
>
> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4533 of file
> fs/xfs/xfs_bmap.c. Caller 0xc028c5a2
> [<c026bc58>] xfs_bmap_read_extents+0x3bd/0x498
> [<c028c5a2>] xfs_iread_extents+0x74/0xe1
> [<c028fb02>] xfs_iext_realloc_direct+0xa4/0xe7
> [<c028f3ef>] xfs_iex t;c028c5a2>] xfs_iread_extents+0x74/0xe1
> [<c026befd>] xfs_bmapi+0x1ca/0x173f
> [<c02e2d7e>] elv_rb_add+0x6f/0x88
> [<c02eb843>] as_update_rq+0x32/0x72
> [<c02ec08b>] as_add_request+0x76/0xa4
> [<c02e330c>] elv_insert+0xd5/0x142
> [<c02e70ad>] __make_request+0xc8/0x305
> [<c02e7480>] generic_make_request+0x122/0x1d9
> [<c03ee0e3>] __map_bio+0x33/0xa9
> [<c03ee36c>] __clone_and_map+0xda/0x34c
> [<c0148fce>] mempool_alloc+0x2a/0xdb
> [<c028aa3c>] xfs_ilock+0x58/0xa0
> [<c029168b>] xfs_iomap+0x216/0x4b7
> [<c02b2000>] __xfs_get_blocks+0x6b/0x226
> [<c02f2792>] radix_tree_node_alloc+0x16/0x57
> [<c02f2997>] radix_tree_insert+0xb0/0x126
> [<c02b21e3>] xfs_get_blocks+0x28/0x2d
> [<c0183a32>] block_read_full_page+0x192/0x346
> [<c02b21bb>] xfs_get_blocks+0x0/0x2d
> [<c028a667>] xfs_iget+0x145/0x150
> [<c018982d>] do_mpage_readpag 28aba1>] xfs_iunlock+0x43/0x84
> [<c02a8096>] xfs_vget+0xe1/0xf2
> [<c020a578>] find_exported_dentry+0x71/0x4b6
> [<c014c4a4>] __do_page_cache_readahead+0x88/0x153
> [<c0189aa4>] mpage_readpage+0x4b/0x5e
> [<c02b21bb>] xfs_get_blocks+0x0/0x2d
> [<c014c69d>] blockable_page_cache_readahead+0x4d/0xb9
> [<c014c942>] page_cache_readahead+0x174/0x1a3
> [<c014630f>] find_get_page+0x18/0x3a
> [<c014684e>] do_generic_mapping_read+0x1b5/0x535
> [<c012621a>] __capable+0x8/0x1b
> [<c0146f6c>] generic_file_sendfile+0x68/0x83
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c02b822f>] xfs_sendfile+0x94/0x164
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c0211325>] nfsd_permission+0x6e/0x103
> [<c02b4868>] xfs_file_sendfile+0x4c/0x5c
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c020f445>] nfsd_vfs_read+0x344/0x361
> [<c020eff2>] nfsd_read_actor+0x0/0x ] nfsd_read+0xd8/0xf9
> [<c021548e>] nfsd3_proc_read+0xb0/0x174
> [<c02170b4>] nfs3svc_decode_readargs+0x0/0xf7
> [<c020b535>] nfsd_dispatch+0x8a/0x1f5
> [<c048c43e>] svcauth_unix_set_client+0x11d/0x175
> [<c0488d73>] svc_process+0x4fd/0x681
> [<c020b39b>] nfsd+0x163/0x273
> [<c020b238>] nfsd+0x0/0x273
> [<c01037fb>] kernel_thread_helper+0x7/0x10
> =======================
> attempt to access beyond end of device
> dm-1: rw=0, want=6763361770196172808, limit=7759462400
> I/O error in filesystem ("dm-1") meta-data dev dm-1 block
> 0x5ddc49b238000000 ("xfs_trans_read_buf") error 5 buf count 4096
> xfs_force_shutdown(dm-1,0x1) called from line 415 of file
> fs/xfs/xfs_trans_buf.c. Return address = 0xc02baa25
> Filesystem "dm-1": I/O Error Detected. Shutting down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
>
>
> #################### I umount'ed and mount'ed the FS several times, but
> xfs_repair still told me to use -L... Any ideas?
>
> #######################
>
> server-files ~ # umount /mnt/san/
> server-files ~ # mount /mnt/san/
> server-files ~ # umount /mnt/san/
> server-files ~ # xfs_repair
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> ERROR: The filesystem has valuable metadata changes in a log which
> needs to be replayed. Mount the filesystem to replay the log, and
> unmount it before re-running xfs_repair. If you are unable to mount
> the filesystem, then use the -L option to destroy the log and attempt
> a repair. Note that destroying the log may cause corruption -- please
> attempt a mount of the filesystem before doing this.
> server-files ~ # xfs_repair -L
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> ALERT: The filesystem has valuable metadata changes in a log which is
> being destroyed because the -L option was used.
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> 4002: Badness in key lookup (length)
> bp=(bno 2561904, len 16384 bytes) key=(bno 2561904, len 8192 bytes)
> 8003: Badness in key lookup (length)
> bp=(bno 0, len 512 bytes) key=(bno 0, len 4096 bytes)
> bad bmap btree ptr 0x5f808b0400000000 in ino 5123809
> bad data fork in inode 5123809
> cleared inode 5123809
> bad magic # in inode 7480148 (data fork) bmbt block 0
> bad data fork in inode 7480148
> cleared inode 7480148
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> entry "Fuller_RotoscopeCorrected.mov" at block 0 offset 184 in directory
> inode 8992373 references free inode 7480148
> clearing inode number in entry at offset 184...
> Phase 5 - rebuild AG headers and trees...
> - reset superblock...
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> bad hash table for directory inode 8992373 (no data entry): rebuilding
> rebuilding directory inode 8992373
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify and correct link counts...
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> done
> server-files ~ # mount /mnt/san
> server-files ~ # umount /mnt/san
> server-files ~ # xfs_repair -L
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
>
> server-files ~ # xfs_repair
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> XFS: totally zeroed log
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> Phase 5 - rebuild AG headers and trees...
> - reset superblock...
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify and correct link counts...
> done
>
> ################
>
> So that's it for now. Next week I'll be rsyncing all of the data off of
> this volume to another array. I still want to know what's happening,
> though... *pout*
>
> Anyways, thanks a lot for everyone's help.
>
> ~Jay
>
>
> -----Original Message-----
> From: xfs-bounce@??? [mailto:xfs-bounce@???] On Behalf
> Of Jay Sullivan
> Sent: Friday, November 02, 2007 10:49 AM
> To: xfs@???
> Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
>
> What can I say about Murphy and his silly laws? I just had a drive fail
> on my array. I wonder if this is the root of my problems... Yay
> parity.
>
> ~Jay
>
> -----Original Message-----
> From: xfs-bounce@??? [mailto:xfs-bounce@???] On Behalf
> Of Jay Sullivan
> Sent: Friday, November 02, 2007 10:00 AM
> To: xfs@???
> Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
>
> I lost the xfs_repair output on an xterm with only four lines of
> scrollback... I'll definitely be more careful to preserve more
> 'evidence' next time. =( "Pics or it didn't happen", right?
>
> I just upgraded xfsprogs and will scan the disk during my next scheduled
> downtime (probably in about 2 weeks). I'm tempted to just wipe the
> volume and start over: I have enough ' to copy
> everything out to a fresh XFS volume.
>
> Regarding "areca": I'm using hardware RAID built into Apple XServe
> RAIDs o'er LSI FC929X cards.
>
> Someone else offered the likely explanation that the btree is corrupted.
> Isn't this something xfs_repair should be able to fix? Would it be
> easier, safer, and faster to move the data to a new volume (and restore
> corrupted files if/as I find them from backup)? We're talking about
> just less than 4TB of data which used to take about 6 hours to fsck (one
> pass) with ext3. Restoring the whole shebang from backups would
> probably take the better part of 12 years (waiting for compression,
> resetting ACLs, etc.)...
>
> FWIW, another (way less important,) much busier and significantly larger
> logical volume on the same array has been totally fine. Murphy--go
> figure.
>
> Thanks!
>
> -----Original Message-----
> From: Eric Sandeen [mailto:sandeen@???]
> Sent: Thursday, November 01, 2007 10:30 PM
> To: Jay Sullivan
> Cc: xfs@???
> Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
>
> Jay Sullivan wrote:
>
>
>
> > Good eye: it wasn't mountable, thus the -L flag. No recent
> > (unplanned) power outages. The machine and the array that holds the
> > disks are both on serious batteries/UPS and the array's cache
> > batteries are in good health.
>
>
> Did you have the xfs_repair output to see what it found? You might also
> grab the very latest xfsprogs (2.9.4) in case it's catching more cases.
>
> I hate it when people suggest running memtest86, but I might do that
> anyway. :)
>
> What controller are you using? If you say "areca" I might be on to
> something I've seen...
>
> -Eric
>
>
>
>
>
>
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2009-02-24 13:04 Federico Sevilla III
@ 2009-02-24 22:46 ` Dave Chinner
2009-02-25 10:00 ` Federico Sevilla III
0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2009-02-24 22:46 UTC (permalink / raw)
To: Federico Sevilla III; +Cc: Linux XFS
[please don't reply to year old message threads with a new problem]
On Tue, Feb 24, 2009 at 09:04:21PM +0800, Federico Sevilla III wrote:
> Hi,
>
> Recently, we had two file servers crash during periods of increased load
> (increased access from workstations in the LAN). After the crash, the
> XFS file systems would no longer mount. The mount process would just
> stay in state D, with no progress, and no significant disk activity.
What was the cause of the crash?
....
> On one of the servers now, on the same file system that had trouble, we
> are having the following messages (the system otherwise remains usable,
> though, which is weird):
>
> attempt to access beyond end of device
> sda7: rw=0, want=154858897362229008, limit=3885978852
> I/O error in filesystem ("sda7") meta-data dev sda7 block 0x2262b58bf959708 ("xfs_trans_read_buf") error 5 buf count 4096
A corrupted extent pointer of some kind. xfs_repair should have
found this. Can you run xfs_repair again? If it doesn't find
anything, please upgrade xfs_repair to the latest version and
try again.
> Both servers use CentOS5 with the Linux 2.6.18-92.1.22.el5 kernel. For
Oh. XFS is not really supported on that platform because it is pretty much
completely untested on RHEL based kernels.
> Please help me understand what the cause of this problem could be.
Could be anything. Knowing what caused your systems to crash in the
first place would be handy....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2009-02-24 22:46 ` Dave Chinner
@ 2009-02-25 10:00 ` Federico Sevilla III
2009-02-25 11:51 ` Michael Monnerie
2009-02-25 18:47 ` Federico Sevilla III
0 siblings, 2 replies; 21+ messages in thread
From: Federico Sevilla III @ 2009-02-25 10:00 UTC (permalink / raw)
To: Linux XFS
On Wed, 2009-02-25 at 09:46 +1100, Dave Chinner wrote:
> On Tue, Feb 24, 2009 at 09:04:21PM +0800, Federico Sevilla III wrote:
> > Hi,
> >
> > Recently, we had two file servers crash during periods of increased load
> > (increased access from workstations in the LAN). After the crash, the
> > XFS file systems would no longer mount. The mount process would just
> > stay in state D, with no progress, and no significant disk activity.
>
> What was the cause of the crash?
The engineers on site were unable to copy the exact error message, but
what could be read from a photograph of the screen seems to point to
xfs_trans_read_buf error 5 issues similar to the ones we are beginning
to see now.
We are unsure of the cause of the crash but know that the system load
was higher than usual because of some larger files that people were busy
working on at the time.
> > attempt to access beyond end of device
> > sda7: rw=0, want=154858897362229008, limit=3885978852
> > I/O error in filesystem ("sda7") meta-data dev sda7 block 0x2262b58bf959708 ("xfs_trans_read_buf") error 5 buf count 4096
>
> A corrupted extent pointer of some kind. xfs_repair should have
> found this. Can you run xfs_repair again? If it doesn't find
> anything, please upgrade xfs_repair to the latest version and
> try again.
Will do, and will revert to the list again with the results.
> > Both servers use CentOS5 with the Linux 2.6.18-92.1.22.el5 kernel. For
>
> Oh. XFS is not really supported on that platform because it is pretty much
> completely untested on RHEL based kernels.
What would be the "community endorsed" approach to using our favorite
file system on CentOS 5? Would you recommend we go with the CentOSPlus
kernels instead?
Thank you very much.
Cheers!
--
Federico Sevilla III
F S 3 Consulting Inc.
http://www.fs3.ph
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2009-02-25 10:00 ` Federico Sevilla III
@ 2009-02-25 11:51 ` Michael Monnerie
2009-02-25 18:47 ` Federico Sevilla III
1 sibling, 0 replies; 21+ messages in thread
From: Michael Monnerie @ 2009-02-25 11:51 UTC (permalink / raw)
To: xfs
On Wednesday, 25 February 2009, Federico Sevilla III wrote:
> What would be the "community endorsed" approach to using our favorite
> file system on CentOS 5? Would you recommend we go with the
> CentOSPlus kernels instead?
We run openSUSE, but many servers with a vanilla kernel. So you can
upgrade whenever you like. Currently, 2.6.28.x is the newest version.
Problem could be that CentOS relies on a patched kernel in some place (I
don't use it so don't know), but if that doesn't hit you, it doesn't
matter. For example, openSUSE kernels support a nice graphical boot
instead of the text, but we don't need that.
You must also take care of kernel bugs and updates yourself from the
moment you use your own compiled kernel.
Kind regards, zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2009-02-25 10:00 ` Federico Sevilla III
2009-02-25 11:51 ` Michael Monnerie
@ 2009-02-25 18:47 ` Federico Sevilla III
1 sibling, 0 replies; 21+ messages in thread
From: Federico Sevilla III @ 2009-02-25 18:47 UTC (permalink / raw)
To: Linux XFS
On Wed, 2009-02-25 at 18:00 +0800, Federico Sevilla III wrote:
> On Wed, 2009-02-25 at 09:46 +1100, Dave Chinner wrote:
> > On Tue, Feb 24, 2009 at 09:04:21PM +0800, Federico Sevilla III wrote:
>
> > > attempt to access beyond end of device
> > > sda7: rw=0, want=154858897362229008, limit=3885978852
> > > I/O error in filesystem ("sda7") meta-data dev sda7 block 0x2262b58bf959708 ("xfs_trans_read_buf") error 5 buf count 4096
> >
> > A corrupted extent pointer of some kind. xfs_repair should have
> > found this. Can you run xfs_repair again? If it doesn't find
> > anything, please upgrade xfs_repair to the latest version and
> > try again.
I have attached the output of xfs_repair. You are correct, it did find
errors with the file system, and repaired them. I don't know why this
wasn't caught the first time, but I guess the lesson learned here is to
re-run xfs_repair until it finds no further errors.
Does the output of xfs_repair help give you an idea of what could have
been the root cause of the crash? (I know it's a long shot, but maybe
you recognize a pattern in the messages.)
Thank you very much.
Cheers!
--
Federico Sevilla III
F S 3 Consulting Inc.
http://www.fs3.ph
[-- Attachment #1.1.2: xfs_repair_sda7.log --]
[-- Type: text/x-log, Size: 2588 bytes --]
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
bad bmap btree ptr 0xc4c025347495f41 in ino 536960951
bad data fork in inode 536960951
cleared inode 536960951
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
entry "in_out of INK_LO_BG.xls" at block 0 offset 2352 in directory inode 537005595 references free inode 536960951
clearing inode number in entry at offset 2352...
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
bad hash table for directory inode 537005595 (no data entry): rebuilding
rebuilding directory inode 537005595
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <B3EDBE0F860AF74BAA82EF17A7CDEDC660BE05A3@svits26.main.ad.rit.edu>]
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
[not found] <B3EDBE0F860AF74BAA82EF17A7CDEDC660BE05A3@svits26.main.ad.rit.edu>
@ 2007-12-21 2:01 ` Jay Sullivan
2008-01-03 15:55 ` Jay Sullivan
2008-08-04 16:55 ` Richard Freeman
0 siblings, 2 replies; 21+ messages in thread
From: Jay Sullivan @ 2007-12-21 2:01 UTC (permalink / raw)
To: xfs; +Cc: Jay Sullivan
I'm still seeing problems. =(
Most recently I have copied all of the data off of the suspect XFS
volume onto another fresh XFS volume. A few days later I saw the same
messages show up in dmesg. I haven't had a catastrophic failure that
makes the kernel remount the FS RO, but I don't want to wait for that
to happen.
Today I upgraded to the latest stable kernel in Gentoo (2.6.23-r3) and
I'm still on xfsprogs 2.9.4, also the latest stable release. A few
hours after rebooting to load the new kernel, I saw the following in
dmesg:
####################
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
###################
These are the same types of messages (trying to access a block that is
WAAAY outside of the range of my drives) that I was seeing before the
last time my FS got remounted read-only by the colonel.
Any ideas? What other information can I gather that would help with
troubleshooting? Here are some more specifics:
This is a Dell PowerEdge 1850 with a FusionMPT/LSI fibre channel
card. The XFS volume is a 3.9TB logical volume in LVM. The volume
group is spread across LUNs on Apple XServe RAIDs, which are
connected o'er FC to our fabric. I just swapped FC switches (to a
different brand even) and the problem was showing before and after the
switch switch, so that's not it. I have also swapped FC cards,
upgraded FC card firmware, updated BIOSs, etc.. This server sees
heavy NFS (v3) and samba (currently 3.0.24 until the current
regression bug is squashed and stable) traffic. 'Heavy traffic' means
it usually sees 200-300Mbps throughput 24/7, although sometimes more.
Far-fetched: Is there any way that a particular file on my FS, when
accessed, is causing the problem?
I have a very similar system (Dell PE 2650, same FC card, same type of
RAID, same SFP cables, same GPT scheme, same kernel) but instead with
an ext3 (full journal) FS in a 5.[something]TB logical volume (LVM)
with no problems. Oh, and it sees system load values in the mid-20s
just about all day.
Grasping at straws. I need XFS to work because we'll soon be
requiring seriously large filesystems with non-sucky extended
attribute and ACL support. Plus it's fast and I like it.
Can the XFS community help? I don't want to have to turn to that guy
that allegedly killed his wife. =P
~Jay
On Nov 14, 2007, at 10:05 AM, Jay Sullivan wrote:
> Of course this had to happen one more time before my scheduled
> maintenance window... Anyways, here's all of the good stuff I
> collected. Can anyone make sense of it? Oh, and I upgraded to
> xfsprogs
> 2.9.4 last week, so all output you see is with that version.
>
> Thanks!
>
> ###################
>
> dmesg output
>
> ###################
>
> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4533 of file
> fs/xfs/xfs_bmap.c. Caller 0xc028c5a2
> [<c026bc58>] xfs_bmap_read_extents+0x3bd/0x498
> [<c028c5a2>] xfs_iread_extents+0x74/0xe1
> [<c028fb02>] xfs_iext_realloc_direct+0xa4/0xe7
> [<c028f3ef>] xfs_iext_add+0x138/0x272
> [<c028c5a2>] xfs_iread_extents+0x74/0xe1
> [<c026befd>] xfs_bmapi+0x1ca/0x173f
> [<c02e2d7e>] elv_rb_add+0x6f/0x88
> [<c02eb843>] as_update_rq+0x32/0x72
> [<c02ec08b>] as_add_request+0x76/0xa4
> [<c02e330c>] elv_insert+0xd5/0x142
> [<c02e70ad>] __make_request+0xc8/0x305
> [<c02e7480>] generic_make_request+0x122/0x1d9
> [<c03ee0e3>] __map_bio+0x33/0xa9
> [<c03ee36c>] __clone_and_map+0xda/0x34c
> [<c0148fce>] mempool_alloc+0x2a/0xdb
> [<c028aa3c>] xfs_ilock+0x58/0xa0
> [<c029168b>] xfs_iomap+0x216/0x4b7
> [<c02b2000>] __xfs_get_blocks+0x6b/0x226
> [<c02f2792>] radix_tree_node_alloc+0x16/0x57
> [<c02f2997>] radix_tree_insert+0xb0/0x126
> [<c02b21e3>] xfs_get_blocks+0x28/0x2d
> [<c0183a32>] block_read_full_page+0x192/0x346
> [<c02b21bb>] xfs_get_blocks+0x0/0x2d
> [<c028a667>] xfs_iget+0x145/0x150
> [<c018982d>] do_mpage_readpage+0x530/0x621
> [<c028aba1>] xfs_iunlock+0x43/0x84
> [<c02a8096>] xfs_vget+0xe1/0xf2
> [<c020a578>] find_exported_dentry+0x71/0x4b6
> [<c014c4a4>] __do_page_cache_readahead+0x88/0x153
> [<c0189aa4>] mpage_readpage+0x4b/0x5e
> [<c02b21bb>] xfs_get_blocks+0x0/0x2d
> [<c014c69d>] blockable_page_cache_readahead+0x4d/0xb9
> [<c014c942>] page_cache_readahead+0x174/0x1a3
> [<c014630f>] find_get_page+0x18/0x3a
> [<c014684e>] do_generic_mapping_read+0x1b5/0x535
> [<c012621a>] __capable+0x8/0x1b
> [<c0146f6c>] generic_file_sendfile+0x68/0x83
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c02b822f>] xfs_sendfile+0x94/0x164
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c0211325>] nfsd_permission+0x6e/0x103
> [<c02b4868>] xfs_file_sendfile+0x4c/0x5c
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c020f445>] nfsd_vfs_read+0x344/0x361
> [<c020eff2>] nfsd_read_actor+0x0/0x10f
> [<c020f862>] nfsd_read+0xd8/0xf9
> [<c021548e>] nfsd3_proc_read+0xb0/0x174
> [<c02170b4>] nfs3svc_decode_readargs+0x0/0xf7
> [<c020b535>] nfsd_dispatch+0x8a/0x1f5
> [<c048c43e>] svcauth_unix_set_client+0x11d/0x175
> [<c0488d73>] svc_process+0x4fd/0x681
> [<c020b39b>] nfsd+0x163/0x273
> [<c020b238>] nfsd+0x0/0x273
> [<c01037fb>] kernel_thread_helper+0x7/0x10
> =======================
> attempt to access beyond end of device
> dm-1: rw=0, want=6763361770196172808, limit=7759462400
> I/O error in filesystem ("dm-1") meta-data dev dm-1 block
> 0x5ddc49b238000000 ("xfs_trans_read_buf") error 5 buf count 4096
> xfs_force_shutdown(dm-1,0x1) called from line 415 of file
> fs/xfs/xfs_trans_buf.c. Return address = 0xc02baa25
> Filesystem "dm-1": I/O Error Detected. Shutting down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
>
>
> #######################
>
> At this point I umount'ed and mount'ed the FS several times, but
> xfs_repair still told me to use -L... Any ideas?
>
> #######################
>
> server-files ~ # umount /mnt/san/
> server-files ~ # mount /mnt/san/
> server-files ~ # umount /mnt/san/
> server-files ~ # xfs_repair
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> ERROR: The filesystem has valuable metadata changes in a log which
> needs
> to
> be replayed. Mount the filesystem to replay the log, and unmount it
> before
> re-running xfs_repair. If you are unable to mount the filesystem,
> then
> use
> the -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a
> mount
> of the filesystem before doing this.
> server-files ~ # xfs_repair -L
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> ALERT: The filesystem has valuable metadata changes in a log which is
> being
> destroyed because the -L option was used.
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> 4002: Badness in key lookup (length)
> bp=(bno 2561904, len 16384 bytes) key=(bno 2561904, len 8192 bytes)
> 8003: Badness in key lookup (length)
> bp=(bno 0, len 512 bytes) key=(bno 0, len 4096 bytes)
> bad bmap btree ptr 0x5f808b0400000000 in ino 5123809
> bad data fork in inode 5123809
> cleared inode 5123809
> bad magic # 0x58465342 in inode 7480148 (data fork) bmbt block 0
> bad data fork in inode 7480148
> cleared inode 7480148
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> entry "Fuller_RotoscopeCorrected.mov" at block 0 offset 184 in
> directory
> inode 8992373 references free inode 7480148
> clearing inode number in entry at offset 184...
> Phase 5 - rebuild AG headers and trees...
> - reset superblock...
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> bad hash table for directory inode 8992373 (no data entry): rebuilding
> rebuilding directory inode 8992373
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify and correct link counts...
> 4000: Badness in key lookup (length)
> bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
> done
> server-files ~ # mount /mnt/san
> server-files ~ # umount /mnt/san
> server-files ~ # xfs_repair -L
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
>
> server-files ~ # xfs_repair
> /dev/server-files-sanvg01/server-files-sanlv01
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> XFS: totally zeroed log
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> Phase 5 - rebuild AG headers and trees...
> - reset superblock...
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify and correct link counts...
> done
>
> ################
>
> So that's it for now. Next week I'll be rsyncing all of the data
> off of
> this volume to another array. I still want to know what's happening,
> though... *pout*
>
> Anyways, thanks a lot for everyone's help.
>
> ~Jay
>
>
> -----Original Message-----
> From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
> Of Jay Sullivan
> Sent: Friday, November 02, 2007 10:49 AM
> To: xfs@oss.sgi.com
> Subject: RE: xfs_force_shutdown called from file fs/xfs/
> xfs_trans_buf.c
>
> What can I say about Murphy and his silly laws? I just had a drive
> fail
> on my array. I wonder if this is the root of my problems... Yay
> parity.
>
> ~Jay
>
> -----Original Message-----
> From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
> Of Jay Sullivan
> Sent: Friday, November 02, 2007 10:00 AM
> To: xfs@oss.sgi.com
> Subject: RE: xfs_force_shutdown called from file fs/xfs/
> xfs_trans_buf.c
>
> I lost the xfs_repair output on an xterm with only four lines of
> scrollback... I'll definitely be more careful to preserve more
> 'evidence' next time. =( "Pics or it didn't happen", right?
>
> I just upgraded xfsprogs and will scan the disk during my next
> scheduled
> downtime (probably in about 2 weeks). I'm tempted to just wipe the
> volume and start over: I have enough 'spare' space lying around to
> copy
> everything out to a fresh XFS volume.
>
> Regarding "areca": I'm using hardware RAID built into Apple XServe
> RAIDs o'er LSI FC929X cards.
>
> Someone else offered the likely explanation that the btree is
> corrupted.
> Isn't this something xfs_repair should be able to fix? Would it be
> easier, safer, and faster to move the data to a new volume (and
> restore
> corrupted files if/as I find them from backup)? We're talking about
> just less than 4TB of data which used to take about 6 hours to fsck
> (one
> pass) with ext3. Restoring the whole shebang from backups would
> probably take the better part of 12 years (waiting for compression,
> resetting ACLs, etc.)...
>
> FWIW, another (way less important,) much busier and significantly
> larger
> logical volume on the same array has been totally fine. Murphy--go
> figure.
>
> Thanks!
>
> -----Original Message-----
> From: Eric Sandeen [mailto:sandeen@sandeen.net]
> Sent: Thursday, November 01, 2007 10:30 PM
> To: Jay Sullivan
> Cc: xfs@oss.sgi.com
> Subject: Re: xfs_force_shutdown called from file fs/xfs/
> xfs_trans_buf.c
>
> Jay Sullivan wrote:
>> Good eye: it wasn't mountable, thus the -L flag. No recent
>> (unplanned) power outages. The machine and the array that holds the
>> disks are both on serious batteries/UPS and the array's cache
>> batteries are in good health.
>
> Did you have the xfs_repair output to see what it found? You might
> also
> grab the very latest xfsprogs (2.9.4) in case it's catching more
> cases.
>
> I hate it when people suggest running memtest86, but I might do that
> anyway. :)
>
> What controller are you using? If you say "areca" I might be on to
> something with some other bugs I've seen...
>
> -Eric
>
>
>
>
--
Jay Sullivan
PC Systems Administrator
College of Imaging Arts and Sciences
Rochester Institute of Technology
7A-1320 :: 585.475.4688
--
Privacy at RIT:
http://www.rit.edu/privacy/
--
[[HTML alternate version deleted]]
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-12-21 2:01 ` Jay Sullivan
@ 2008-01-03 15:55 ` Jay Sullivan
2008-08-04 16:55 ` Richard Freeman
1 sibling, 0 replies; 21+ messages in thread
From: Jay Sullivan @ 2008-01-03 15:55 UTC (permalink / raw)
To: xfs; +Cc: Jay Sullivan
I'm still seeing a lot of the following in my dmesg. Any ideas? See
below for what I have already tried (including moving data to a fresh
XFS volume).
Tons of these; sometimes the want= changes, but it is always huge.
###
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
###
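Incidentally, the huge want= values are internally consistent with the XFS error lines: want appears to be a 512-byte sector number, and it equals the hex disk address from the XFS message plus the 8-sector length of the 4096-byte buffer. That's easy to verify with pure arithmetic on the numbers quoted above (the interpretation of the fields is an inference from the log, not something stated in the thread):

```python
# Numbers taken verbatim from the dm-0 messages above.
want = 68609558288793608     # sector the block layer refused
limit = 8178892800           # device size in 512-byte sectors (~3.9TB)
daddr = 0xf3c0079e000000     # disk address in the XFS error message
buf_sectors = 4096 // 512    # "buf count 4096" = 8 sectors

# The refused sector is exactly the end of the requested buffer.
assert want == daddr + buf_sectors

# And it is millions of device-lengths past the end of dm-0, which is
# why a garbage extent pointer (not a real allocation) is the suspect.
print(want // limit)
```

So the "impossibly large" number is not noise: it is a well-formed read issued for a corrupt 64-bit block pointer.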
Occasionally some of these:
###
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4533 of file
fs/xfs/xfs_bmap.c. Caller 0xc028c5a2
[<c026bc58>] xfs_bmap_read_extents+0x3bd/0x498
[<c028c5a2>] xfs_iread_extents+0x74/0xe1
[<c028fb02>] xfs_iext_realloc_direct+0xa4/0xe7
[<c028f3ef>] xfs_iext_add+0x138/0x272
[<c028c5a2>] xfs_iread_extents+0x74/0xe1
[<c026befd>] xfs_bmapi+0x1ca/0x173f
[<c02e2d7e>] elv_rb_add+0x6f/0x88
[<c02eb843>] as_update_rq+0x32/0x72
[<c02ec08b>] as_add_request+0x76/0xa4
[<c02e330c>] elv_insert+0xd5/0x142
[<c02e70ad>] __make_request+0xc8/0x305
[<c02e7480>] generic_make_request+0x122/0x1d9
[<c03ee0e3>] __map_bio+0x33/0xa9
[<c03ee36c>] __clone_and_map+0xda/0x34c
[<c0148fce>] mempool_alloc+0x2a/0xdb
[<c028aa3c>] xfs_ilock+0x58/0xa0
[<c029168b>] xfs_iomap+0x216/0x4b7
[<c02b2000>] __xfs_get_blocks+0x6b/0x226
[<c02f2792>] radix_tree_node_alloc+0x16/0x57
[<c02f2997>] radix_tree_insert+0xb0/0x126
[<c02b21e3>] xfs_get_blocks+0x28/0x2d
[<c0183a32>] block_read_full_page+0x192/0x346
[<c02b21bb>] xfs_get_blocks+0x0/0x2d
[<c028a667>] xfs_iget+0x145/0x150
[<c018982d>] do_mpage_readpage+0x530/0x621
[<c028aba1>] xfs_iunlock+0x43/0x84
[<c02a8096>] xfs_vget+0xe1/0xf2
[<c020a578>] find_exported_dentry+0x71/0x4b6
[<c014c4a4>] __do_page_cache_readahead+0x88/0x153
[<c0189aa4>] mpage_readpage+0x4b/0x5e
[<c02b21bb>] xfs_get_blocks+0x0/0x2d
[<c014c69d>] blockable_page_cache_readahead+0x4d/0xb9
[<c014c942>] page_cache_readahead+0x174/0x1a3
[<c014630f>] find_get_page+0x18/0x3a
[<c014684e>] do_generic_mapping_read+0x1b5/0x535
[<c012621a>] __capable+0x8/0x1b
[<c0146f6c>] generic_file_sendfile+0x68/0x83
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c02b822f>] xfs_sendfile+0x94/0x164
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c0211325>] nfsd_permission+0x6e/0x103
[<c02b4868>] xfs_file_sendfile+0x4c/0x5c
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c020f445>] nfsd_vfs_read+0x344/0x361
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c020f862>] nfsd_read+0xd8/0xf9
[<c021548e>] nfsd3_proc_read+0xb0/0x174
[<c02170b4>] nfs3svc_decode_readargs+0x0/0xf7
[<c020b535>] nfsd_dispatch+0x8a/0x1f5
[<c048c43e>] svcauth_unix_set_client+0x11d/0x175
[<c0488d73>] svc_process+0x4fd/0x681
[<c020b39b>] nfsd+0x163/0x273
[<c020b238>] nfsd+0x0/0x273
[<c01037fb>] kernel_thread_helper+0x7/0x10
###
Thanks!
~Jay
From: Jay Sullivan [mailto:jpspgd@rit.edu]
Sent: Thursday, December 20, 2007 9:01 PM
To: xfs@oss.sgi.com
Cc: Jay Sullivan
Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
I'm still seeing problems. =(
Most recently I have copied all of the data off of the suspect XFS
volume onto another fresh XFS volume. A few days later I saw the same
messages show up in dmesg. I haven't had a catastrophic failure that
makes the kernel remount the FS RO, but I don't want to wait for that to
happen.
Today I upgraded to the latest stable kernel in Gentoo (2.6.23-r3) and
I'm still on xfsprogs 2.9.4, also the latest stable release. A few
hours after rebooting to load the new kernel, I saw the following in
dmesg:
####################
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000
("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
###################
These are the same types of messages (trying to access a block that is WAAAY outside of the range
of my drives) that I was seeing before the last time my FS got remounted
read-only by the colonel.
Any ideas? What other information can I gather that would help with
troubleshooting? Here are some more specifics:
This is a Dell PowerEdge 1850 with a FusionMPT/LSI fibre channel card.
The XFS volume is a 3.9TB logical volume in LVM. The volume group is
spread across LUNs on Apple XServe RAIDs, which are connected o'er FC
to our fabric. I just swapped FC switches (to a different brand even)
and the problem was showing before and after the switch switch, so
that's not it. I have also swapped FC cards, upgraded FC card firmware,
updated BIOSs, etc.. This server sees heavy NFS (v3) and samba
(currently 3.0.24 until the current regression bug is squashed and
stable) traffic. 'Heavy traffic' means it usually sees 200-300Mbps throughput 24/7, although
sometimes more.
Far-fetched: Is there any way that a particular file on my FS, when
accessed, is causing the problem?
I have a very similar system (Dell PE 2650, same FC card, same type of
RAID, same SFP cables, same GPT scheme, same kernel) but instead with an
ext3 (full journal) FS in a 5.[something]TB logical volume (LVM) with no
problems. Oh, and it sees system load values in the mid-20s just about
all day.
Grasping at straws. I need XFS to work because we'll soon be requiring
seriously large filesystems with non-sucky extended attribute and ACL
support. Plus it's fast and I like it.
Can the XFS community help? I don't want to have to turn to that guy
that allegedly killed his wife. =P
~Jay
On Nov 14, 2007, at 10:05 AM, Jay Sullivan wrote:
Of course this had to happen one more time before my scheduled
maintenance window... Anyways, here's all of the good stuff I
collected. Can anyone make sense of it? Oh, and I upgraded to xfsprogs
2.9.4 last week, so all output you see is with that version.
Thanks!
###################
dmesg output
###################
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4533 of file
fs/xfs/xfs_bmap.c. Caller 0xc028c5a2
[<c026bc58>] xfs_bmap_read_extents+0x3bd/0x498
[<c028c5a2>] xfs_iread_extents+0x74/0xe1
[<c028fb02>] xfs_iext_realloc_direct+0xa4/0xe7
[<c028f3ef>] xfs_iext_add+0x138/0x272
[<c028c5a2>] xfs_iread_extents+0x74/0xe1
[<c026befd>] xfs_bmapi+0x1ca/0x173f
[<c02e2d7e>] elv_rb_add+0x6f/0x88
[<c02eb843>] as_update_rq+0x32/0x72
[<c02ec08b>] as_add_request+0x76/0xa4
[<c02e330c>] elv_insert+0xd5/0x142
[<c02e70ad>] __make_request+0xc8/0x305
[<c02e7480>] generic_make_request+0x122/0x1d9
[<c03ee0e3>] __map_bio+0x33/0xa9
[<c03ee36c>] __clone_and_map+0xda/0x34c
[<c0148fce>] mempool_alloc+0x2a/0xdb
[<c028aa3c>] xfs_ilock+0x58/0xa0
[<c029168b>] xfs_iomap+0x216/0x4b7
[<c02b2000>] __xfs_get_blocks+0x6b/0x226
[<c02f2792>] radix_tree_node_alloc+0x16/0x57
[<c02f2997>] radix_tree_insert+0xb0/0x126
[<c02b21e3>] xfs_get_blocks+0x28/0x2d
[<c0183a32>] block_read_full_page+0x192/0x346
[<c02b21bb>] xfs_get_blocks+0x0/0x2d
[<c028a667>] xfs_iget+0x145/0x150
[<c018982d>] do_mpage_readpage+0x530/0x621
[<c028aba1>] xfs_iunlock+0x43/0x84
[<c02a8096>] xfs_vget+0xe1/0xf2
[<c020a578>] find_exported_dentry+0x71/0x4b6
[<c014c4a4>] __do_page_cache_readahead+0x88/0x153
[<c0189aa4>] mpage_readpage+0x4b/0x5e
[<c02b21bb>] xfs_get_blocks+0x0/0x2d
[<c014c69d>] blockable_page_cache_readahead+0x4d/0xb9
[<c014c942>] page_cache_readahead+0x174/0x1a3
[<c014630f>] find_get_page+0x18/0x3a
[<c014684e>] do_generic_mapping_read+0x1b5/0x535
[<c012621a>] __capable+0x8/0x1b
[<c0146f6c>] generic_file_sendfile+0x68/0x83
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c02b822f>] xfs_sendfile+0x94/0x164
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c0211325>] nfsd_permission+0x6e/0x103
[<c02b4868>] xfs_file_sendfile+0x4c/0x5c
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c020f445>] nfsd_vfs_read+0x344/0x361
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c020f862>] nfsd_read+0xd8/0xf9
[<c021548e>] nfsd3_proc_read+0xb0/0x174
[<c02170b4>] nfs3svc_decode_readargs+0x0/0xf7
[<c020b535>] nfsd_dispatch+0x8a/0x1f5
[<c048c43e>] svcauth_unix_set_client+0x11d/0x175
[<c0488d73>] svc_process+0x4fd/0x681
[<c020b39b>] nfsd+0x163/0x273
[<c020b238>] nfsd+0x0/0x273
[<c01037fb>] kernel_thread_helper+0x7/0x10
=======================
attempt to access beyond end of device
dm-1: rw=0, want=6763361770196172808, limit=7759462400
I/O error in filesystem ("dm-1") meta-data dev dm-1 block
0x5ddc49b238000000 ("xfs_trans_read_buf") error 5 buf count 4096
xfs_force_shutdown(dm-1,0x1) called from line 415 of file
fs/xfs/xfs_trans_buf.c. Return address = 0xc02baa25
Filesystem "dm-1": I/O Error Detected. Shutting down filesystem: dm-1
Please umount the filesystem, and rectify the problem(s)
####################
At this point I umount'ed and mount'ed the FS several times, but
xfs_repair still told me to use -L... Any ideas?
#######################
server-files ~ # umount /mnt/san/
server-files ~ # mount /mnt/san/
server-files ~ # umount /mnt/san/
server-files ~ # xfs_repair
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs
to
be replayed. Mount the filesystem to replay the log, and unmount it
before
re-running xfs_repair. If you are unable to mount the filesystem, then
use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a
mount
of the filesystem before doing this.
server-files ~ # xfs_repair -L
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is
being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
4002: Badness in key lookup (length)
bp=(bno 2561904, len 16384 bytes) key=(bno 2561904, len 8192 bytes)
8003: Badness in key lookup (length)
bp=(bno 0, len 512 bytes) key=(bno 0, len 4096 bytes)
bad bmap btree ptr 0x5f808b0400000000 in ino 5123809
bad data fork in inode 5123809
cleared inode 5123809
bad magic # 0x58465342 in inode 7480148 (data fork) bmbt block 0
bad data fork in inode 7480148
cleared inode 7480148
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
entry "Fuller_RotoscopeCorrected.mov" at block 0 offset 184 in directory
inode 8992373 references free inode 7480148
clearing inode number in entry at offset 184...
Phase 5 - rebuild AG headers and trees...
- reset superblock...
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
bad hash table for directory inode 8992373 (no data entry): rebuilding
rebuilding directory inode 8992373
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
done
server-files ~ # mount /mnt/san
server-files ~ # umount /mnt/san
server-files ~ # xfs_repair -L
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
server-files ~ # xfs_repair
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
XFS: totally zeroed log
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
################
So that's it for now. Next week I'll be rsyncing all of the data off of
this volume to another array. I still want to know what's happening,
though... *pout*
Anyways, thanks a lot for everyone's help.
~Jay
-----Original Message-----
From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
Of Jay Sullivan
Sent: Friday, November 02, 2007 10:49 AM
To: xfs@oss.sgi.com
Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
What can I say about Murphy and his silly laws? I just had a drive fail
on my array. I wonder if this is the root of my problems... Yay
parity.
~Jay
-----Original Message-----
From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
Of Jay Sullivan
Sent: Friday, November 02, 2007 10:00 AM
To: xfs@oss.sgi.com
Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
I lost the xfs_repair output on an xterm with only four lines of
scrollback... I'll definitely be more careful to preserve more
'evidence' next time. =( "Pics or it didn't happen", right?
I just upgraded xfsprogs and will scan the disk during my next scheduled
downtime (probably in about 2 weeks). I'm tempted to just wipe the
volume and start over: I have enough 'spare' space lying around to copy
everything out to a fresh XFS volume.
Regarding "areca": I'm using hardware RAID built into Apple XServe
RAIDs o'er LSI FC929X cards.
Someone else offered the likely explanation that the btree is corrupted.
Isn't this something xfs_repair should be able to fix? Would it be
easier, safer, and faster to move the data to a new volume (and restore
corrupted files if/as I find them from backup)? We're talking about
just less than 4TB of data which used to take about 6 hours to fsck (one
pass) with ext3. Restoring the whole shebang from backups would
probably take the better part of 12 years (waiting for compression,
resetting ACLs, etc.)...
FWIW, another (way less important,) much busier and significantly larger
logical volume on the same array has been totally fine. Murphy--go
figure.
Thanks!
-----Original Message-----
From: Eric Sandeen [mailto:sandeen@sandeen.net]
Sent: Thursday, November 01, 2007 10:30 PM
To: Jay Sullivan
Cc: xfs@oss.sgi.com
Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
Jay Sullivan wrote:
Good eye: it wasn't mountable, thus the -L flag. No recent
(unplanned) power outages. The machine and the array that holds the
disks are both on serious batteries/UPS and the array's cache
batteries are in good health.
Did you have the xfs_repair output to see what it found? You might also
grab the very latest xfsprogs (2.9.4) in case it's catching more cases.
I hate it when people suggest running memtest86, but I might do that
anyway. :)
What controller are you using? If you say "areca" I might be on to
something with some other bugs I've seen...
-Eric
[[HTML alternate version deleted]]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-12-21 2:01 ` Jay Sullivan
2008-01-03 15:55 ` Jay Sullivan
@ 2008-08-04 16:55 ` Richard Freeman
1 sibling, 0 replies; 21+ messages in thread
From: Richard Freeman @ 2008-08-04 16:55 UTC (permalink / raw)
To: linux-xfs
Jay Sullivan <jpspgd <at> rit.edu> writes:
>
> Today I upgraded to the latest stable kernel in Gentoo (2.6.23-r3) and
> I'm still on xfsprogs 2.9.4, also the latest stable release. A few
> hours after rebooting to load the new kernel, I saw the following in
> dmesg:
>
> ####################
> attempt to access beyond end of device
> dm-0: rw=0, want=68609558288793608, limit=8178892800
I just started getting these on an ext3 filesystem also on gentoo, with the
latest stable kernel. I suspect there is an lvm bug of some kind that is
responsible. I ran an e2fsck on the filesystem and managed to corrupt not
only that filesystem, but also several others on the same RAID. I'm probably
going to have to try to salvage what I can from the no-longer-booting system
and rebuild from scratch/backups.
Either lvm has some major bug, or somehow e2fsck is bypassing the lvm layer
and writing directly to the drives. It shouldn't be possible to write to one
logical volume and modify data stored in a different logical volume on the
same md raid-5 device. A check of the underlying RAID turns up no issues - I
suspect the problem is in the lvm layer.
Googling around for "access beyond end of device" turns up other reports of
similar issues. Obviously the problem is rare.
^ permalink raw reply [flat|nested] 21+ messages in thread
* xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
@ 2007-11-01 20:06 Jay Sullivan
2007-11-02 2:14 ` Eric Sandeen
0 siblings, 1 reply; 21+ messages in thread
From: Jay Sullivan @ 2007-11-01 20:06 UTC (permalink / raw)
To: xfs
I have an XFS filesystem that has had the following happen twice in 3
months, both times an impossibly large block number was requested.
Unfortunately my logs don't go back far enough for me to know if it was
the _exact_ same block both times... I'm running xfsprogs 2.8.21.
Excerpt from syslog (hostname obfuscated to 'servername' to protect the
innocent):
##
Nov 1 14:06:32 servername dm-1: rw=0, want=39943195856896,
limit=7759462400
Nov 1 14:06:32 servername I/O error in filesystem ("dm-1") meta-data
dev dm-1 block 0x245400000ff8 ("xfs_trans_read_buf") error 5 buf
count 4096
Nov 1 14:06:32 servername xfs_force_shutdown(dm-1,0x1) called from line
415 of file fs/xfs/xfs_trans_buf.c. Return address = 0xc02baa25
Nov 1 14:06:32 servername Filesystem "dm-1": I/O Error Detected.
Shutting down filesystem: dm-1
Nov 1 14:06:32 servername Please umount the filesystem, and rectify the
problem(s)
###
I ran xfs_repair -L on the FS and it could be mounted again, but how
long until it happens a third time? What concerns me is that this is a
FS smaller than 4TB and 39943195856896 (or 0x245400000ff8) seems like a
block that I would only have if my FS was muuuuuch larger. The
following is output from some pertinent programs:
###
servername ~ # xfs_info /mnt/san
meta-data=/dev/servername-sanvg01/servername-sanlv01 isize=256
agcount=5, agsize=203161600 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=969932800,
imaxpct=25
= sunit=0 swidth=0 blks, unwritten=1
naming =version 2 bsize=4096
log =internal bsize=4096 blocks=32768, version=1
= sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0
servername ~ # mount
/dev/sda3 on / type ext3 (rw,noatime,acl)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
udev on /dev type tmpfs (rw,nosuid)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs
(rw,noexec,nosuid,devmode=0664,devgid=85)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc
(rw,noexec,nosuid,nodev)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/mapper/servername--sanvg01-servername--sanlv01 on /mnt/san type xfs
(rw,noatime,nodiratime,logbufs=8,attr2)
/dev/mapper/servername--sanvg01-servername--rendersharelv01 on
/mnt/san/rendershare type xfs (rw,noatime,nodiratime,logbufs=8,attr2)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
servername ~ # uname -a
Linux servername 2.6.20-gentoo-r8 #7 SMP Fri Jun 29 14:46:02 EDT 2007
i686 Intel(R) Xeon(TM) CPU 3.20GHz GenuineIntel GNU/Linux
###
Does anyone know if this points to a bad block on a disk or if something
is corrupted and can be fixed with some expert knowledge of xfs_db?
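A quick sanity check of these numbers, using only the values quoted above: the kernel's "limit" is exactly the filesystem size from xfs_info, and the "want" is exactly the reported metadata address plus one 4 KiB buffer, so the failure is entirely in the block address itself:

```python
# Values taken verbatim from the syslog and xfs_info output above.
BSIZE, SECTOR = 4096, 512
fs_blocks = 969932800                  # data blocks from xfs_info
limit = fs_blocks * BSIZE // SECTOR    # device size in 512-byte sectors
assert limit == 7759462400             # matches the kernel's "limit="

daddr = 0x245400000ff8                 # failing metadata address (sectors)
want = daddr + BSIZE // SECTOR         # address plus one 8-sector buffer
assert want == 39943195856896          # matches the kernel's "want="

# Split the address into 32-bit halves: the low half (0xff8) would be a
# plausible on-disk address, while the high half (0x2454) is what pushes
# the request roughly 5000x past the end of the device.
hi, lo = daddr >> 32, daddr & 0xFFFFFFFF
print(f"hi={hi:#x} lo={lo:#x} overshoot={want // limit}x")
```

That pattern -- sane low bits under garbage high bits in a 64-bit block pointer -- reads more like pointer corruption than a bad sector, though that is only an inference from the two numbers.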
~Jay
[[HTML alternate version deleted]]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-01 20:06 Jay Sullivan
@ 2007-11-02 2:14 ` Eric Sandeen
2007-11-02 2:22 ` Jay Sullivan
2007-11-02 4:37 ` Timothy Shimmin
0 siblings, 2 replies; 21+ messages in thread
From: Eric Sandeen @ 2007-11-02 2:14 UTC (permalink / raw)
To: Jay Sullivan; +Cc: xfs
Jay Sullivan wrote:
> I ran xfs_repair -L on the FS and it could be mounted again,
Was it not even mountable before this, or why did you use the -L flag?
If the log is corrupted that points to more problems... perhaps you've
had some power loss & your write caches evaporated, and lvm doesn't do
barriers?
-eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 2:14 ` Eric Sandeen
@ 2007-11-02 2:22 ` Jay Sullivan
2007-11-02 2:30 ` Eric Sandeen
2007-11-02 4:37 ` Timothy Shimmin
1 sibling, 1 reply; 21+ messages in thread
From: Jay Sullivan @ 2007-11-02 2:22 UTC (permalink / raw)
To: xfs
Good eye: it wasn't mountable, thus the -L flag. No recent
(unplanned) power outages. The machine and the array that holds the
disks are both on serious batteries/UPS and the array's cache
batteries are in good health.
~Jay
On Nov 1, 2007, at 10:14 PM, Eric Sandeen wrote:
> Jay Sullivan wrote:
>
> > I ran xfs_repair -L on the FS and it could be mounted again,
>
> Was it not even mountable before this, or why did you use the -L flag?
> If the log is corrupted that points to more problems... perhaps you've
> had some power loss & your write caches evaporated, and lvm doesn't do
> barriers?
>
> -eric
>
>
[[HTML alternate version deleted]]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 2:22 ` Jay Sullivan
@ 2007-11-02 2:30 ` Eric Sandeen
2007-11-02 9:07 ` Ralf Gross
2007-11-02 14:00 ` Jay Sullivan
0 siblings, 2 replies; 21+ messages in thread
From: Eric Sandeen @ 2007-11-02 2:30 UTC (permalink / raw)
To: Jay Sullivan; +Cc: xfs
Jay Sullivan wrote:
> Good eye: it wasn't mountable, thus the -L flag. No recent
> (unplanned) power outages. The machine and the array that holds the
> disks are both on serious batteries/UPS and the array's cache
> batteries are in good health.
Did you have the xfs_repair output to see what it found? You might also
grab the very latest xfsprogs (2.9.4) in case it's catching more cases.
I hate it when people suggest running memtest86, but I might do that
anyway. :)
What controller are you using? If you say "areca" I might be on to
something with some other bugs I've seen...
-Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 2:30 ` Eric Sandeen
@ 2007-11-02 9:07 ` Ralf Gross
2007-11-02 16:10 ` Eric Sandeen
2007-11-02 14:00 ` Jay Sullivan
1 sibling, 1 reply; 21+ messages in thread
From: Ralf Gross @ 2007-11-02 9:07 UTC (permalink / raw)
To: xfs
Eric Sandeen schrieb:
> ...
> What controller are you using? If you say "areca" I might be on to
> something with some other bugs I've seen...
I use areca controllers with xfs, but had no problems yet. Can you
explain what bugs might hit me?
Ralf
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 9:07 ` Ralf Gross
@ 2007-11-02 16:10 ` Eric Sandeen
0 siblings, 0 replies; 21+ messages in thread
From: Eric Sandeen @ 2007-11-02 16:10 UTC (permalink / raw)
To: Ralf Gross; +Cc: xfs
Ralf Gross wrote:
> Eric Sandeen schrieb:
>> ...
>> What controller are you using? If you say "areca" I might be on to
>> something with some other bugs I've seen...
>
> I use areca controllers with xfs, but had no problems yet. Can you
> explain what bugs might hit me?
maybe none, it was just a wild guess. :) I've seen a bug on ext3,
volumes > 2T corrupted, on an areca controller. Due to the 2T
threshold, it seems more like a lower layer IO issue (2^32 x 512) than a
filesystem issue... googling a bit I found others with problems on
areca, but then that's what I googled for, so I might have
self-selected. So, maybe nothing, I was just looking for a 3rd data point.
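The 2T threshold mentioned here lands exactly where a 32-bit sector counter overflows; a one-line check of the (2^32 x 512) arithmetic:

```python
# 2^32 sectors of 512 bytes is exactly 2 TiB: any layer that truncates
# sector numbers to 32 bits will corrupt addresses only on volumes
# larger than this, matching the >2T-only symptom described above.
SECTOR = 512
limit_bytes = (2**32) * SECTOR
assert limit_bytes == 2199023255552    # bytes
assert limit_bytes // 2**40 == 2       # exactly 2 TiB
```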
-Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 2:30 ` Eric Sandeen
2007-11-02 9:07 ` Ralf Gross
@ 2007-11-02 14:00 ` Jay Sullivan
2007-11-02 14:49 ` Jay Sullivan
1 sibling, 1 reply; 21+ messages in thread
From: Jay Sullivan @ 2007-11-02 14:00 UTC (permalink / raw)
To: xfs
I lost the xfs_repair output on an xterm with only four lines of
scrollback... I'll definitely be more careful to preserve more
'evidence' next time. =( "Pics or it didn't happen", right?
I just upgraded xfsprogs and will scan the disk during my next scheduled
downtime (probably in about 2 weeks). I'm tempted to just wipe the
volume and start over: I have enough 'spare' space lying around to copy
everything out to a fresh XFS volume.
Regarding "areca": I'm using hardware RAID built into Apple XServe
RAIDs o'er LSI FC929X cards.
Someone else offered the likely explanation that the btree is corrupted.
Isn't this something xfs_repair should be able to fix? Would it be
easier, safer, and faster to move the data to a new volume (and restore
corrupted files if/as I find them from backup)? We're talking about
just less than 4TB of data which used to take about 6 hours to fsck (one
pass) with ext3. Restoring the whole shebang from backups would
probably take the better part of 12 years (waiting for compression,
resetting ACLs, etc.)...
FWIW, another (way less important,) much busier and significantly larger
logical volume on the same array has been totally fine. Murphy--go
figure.
Thanks!
-----Original Message-----
From: Eric Sandeen [mailto:sandeen@sandeen.net]
Sent: Thursday, November 01, 2007 10:30 PM
To: Jay Sullivan
Cc: xfs@oss.sgi.com
Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
Jay Sullivan wrote:
> Good eye: it wasn't mountable, thus the -L flag. No recent
> (unplanned) power outages. The machine and the array that holds the
> disks are both on serious batteries/UPS and the array's cache
> batteries are in good health.
Did you have the xfs_repair output to see what it found? You might also
grab the very latest xfsprogs (2.9.4) in case it's catching more cases.
I hate it when people suggest running memtest86, but I might do that
anyway. :)
What controller are you using? If you say "areca" I might be on to
something with some other bugs I've seen...
-Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 14:00 ` Jay Sullivan
@ 2007-11-02 14:49 ` Jay Sullivan
2007-11-14 15:05 ` Jay Sullivan
0 siblings, 1 reply; 21+ messages in thread
From: Jay Sullivan @ 2007-11-02 14:49 UTC (permalink / raw)
To: xfs
What can I say about Murphy and his silly laws? I just had a drive fail
on my array. I wonder if this is the root of my problems... Yay
parity.
~Jay
-----Original Message-----
From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
Of Jay Sullivan
Sent: Friday, November 02, 2007 10:00 AM
To: xfs@oss.sgi.com
Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
I lost the xfs_repair output on an xterm with only four lines of
scrollback... I'll definitely be more careful to preserve more
'evidence' next time. =( "Pics or it didn't happen", right?
I just upgraded xfsprogs and will scan the disk during my next scheduled
downtime (probably in about 2 weeks). I'm tempted to just wipe the
volume and start over: I have enough 'spare' space lying around to copy
everything out to a fresh XFS volume.
Regarding "areca": I'm using hardware RAID built into Apple XServe
RAIDs o'er LSI FC929X cards.
Someone else offered the likely explanation that the btree is corrupted.
Isn't this something xfs_repair should be able to fix? Would it be
easier, safer, and faster to move the data to a new volume (and restore
corrupted files if/as I find them from backup)? We're talking about
just less than 4TB of data which used to take about 6 hours to fsck (one
pass) with ext3. Restoring the whole shebang from backups would
probably take the better part of 12 years (waiting for compression,
resetting ACLs, etc.)...
FWIW, another (way less important,) much busier and significantly larger
logical volume on the same array has been totally fine. Murphy--go
figure.
Thanks!
-----Original Message-----
From: Eric Sandeen [mailto:sandeen@sandeen.net]
Sent: Thursday, November 01, 2007 10:30 PM
To: Jay Sullivan
Cc: xfs@oss.sgi.com
Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
Jay Sullivan wrote:
> Good eye: it wasn't mountable, thus the -L flag. No recent
> (unplanned) power outages. The machine and the array that holds the
> disks are both on serious batteries/UPS and the array's cache
> batteries are in good health.
Did you have the xfs_repair output to see what it found? You might also
grab the very latest xfsprogs (2.9.4) in case it's catching more cases.
I hate it when people suggest running memtest86, but I might do that
anyway. :)
What controller are you using? If you say "areca" I might be on to
something with some other bugs I've seen...
-Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 14:49 ` Jay Sullivan
@ 2007-11-14 15:05 ` Jay Sullivan
2007-11-15 3:26 ` Eric Sandeen
0 siblings, 1 reply; 21+ messages in thread
From: Jay Sullivan @ 2007-11-14 15:05 UTC (permalink / raw)
To: xfs; +Cc: Jay Sullivan
Of course this had to happen one more time before my scheduled
maintenance window... Anyways, here's all of the good stuff I
collected. Can anyone make sense of it? Oh, and I upgraded to xfsprogs
2.9.4 last week, so all output you see is with that version.
Thanks!
###################
dmesg output
###################
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4533 of file
fs/xfs/xfs_bmap.c. Caller 0xc028c5a2
[<c026bc58>] xfs_bmap_read_extents+0x3bd/0x498
[<c028c5a2>] xfs_iread_extents+0x74/0xe1
[<c028fb02>] xfs_iext_realloc_direct+0xa4/0xe7
[<c028f3ef>] xfs_iext_add+0x138/0x272
[<c028c5a2>] xfs_iread_extents+0x74/0xe1
[<c026befd>] xfs_bmapi+0x1ca/0x173f
[<c02e2d7e>] elv_rb_add+0x6f/0x88
[<c02eb843>] as_update_rq+0x32/0x72
[<c02ec08b>] as_add_request+0x76/0xa4
[<c02e330c>] elv_insert+0xd5/0x142
[<c02e70ad>] __make_request+0xc8/0x305
[<c02e7480>] generic_make_request+0x122/0x1d9
[<c03ee0e3>] __map_bio+0x33/0xa9
[<c03ee36c>] __clone_and_map+0xda/0x34c
[<c0148fce>] mempool_alloc+0x2a/0xdb
[<c028aa3c>] xfs_ilock+0x58/0xa0
[<c029168b>] xfs_iomap+0x216/0x4b7
[<c02b2000>] __xfs_get_blocks+0x6b/0x226
[<c02f2792>] radix_tree_node_alloc+0x16/0x57
[<c02f2997>] radix_tree_insert+0xb0/0x126
[<c02b21e3>] xfs_get_blocks+0x28/0x2d
[<c0183a32>] block_read_full_page+0x192/0x346
[<c02b21bb>] xfs_get_blocks+0x0/0x2d
[<c028a667>] xfs_iget+0x145/0x150
[<c018982d>] do_mpage_readpage+0x530/0x621
[<c028aba1>] xfs_iunlock+0x43/0x84
[<c02a8096>] xfs_vget+0xe1/0xf2
[<c020a578>] find_exported_dentry+0x71/0x4b6
[<c014c4a4>] __do_page_cache_readahead+0x88/0x153
[<c0189aa4>] mpage_readpage+0x4b/0x5e
[<c02b21bb>] xfs_get_blocks+0x0/0x2d
[<c014c69d>] blockable_page_cache_readahead+0x4d/0xb9
[<c014c942>] page_cache_readahead+0x174/0x1a3
[<c014630f>] find_get_page+0x18/0x3a
[<c014684e>] do_generic_mapping_read+0x1b5/0x535
[<c012621a>] __capable+0x8/0x1b
[<c0146f6c>] generic_file_sendfile+0x68/0x83
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c02b822f>] xfs_sendfile+0x94/0x164
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c0211325>] nfsd_permission+0x6e/0x103
[<c02b4868>] xfs_file_sendfile+0x4c/0x5c
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c020f445>] nfsd_vfs_read+0x344/0x361
[<c020eff2>] nfsd_read_actor+0x0/0x10f
[<c020f862>] nfsd_read+0xd8/0xf9
[<c021548e>] nfsd3_proc_read+0xb0/0x174
[<c02170b4>] nfs3svc_decode_readargs+0x0/0xf7
[<c020b535>] nfsd_dispatch+0x8a/0x1f5
[<c048c43e>] svcauth_unix_set_client+0x11d/0x175
[<c0488d73>] svc_process+0x4fd/0x681
[<c020b39b>] nfsd+0x163/0x273
[<c020b238>] nfsd+0x0/0x273
[<c01037fb>] kernel_thread_helper+0x7/0x10
=======================
attempt to access beyond end of device
dm-1: rw=0, want=6763361770196172808, limit=7759462400
I/O error in filesystem ("dm-1") meta-data dev dm-1 block
0x5ddc49b238000000 ("xfs_trans_read_buf") error 5 buf count 4096
xfs_force_shutdown(dm-1,0x1) called from line 415 of file
fs/xfs/xfs_trans_buf.c. Return address = 0xc02baa25
Filesystem "dm-1": I/O Error Detected. Shutting down filesystem: dm-1
Please umount the filesystem, and rectify the problem(s)
#######################
At this point I umount'ed and mount'ed the FS several times, but
xfs_repair still told me to use -L... Any ideas?
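For what it's worth, this second shutdown has the same arithmetic shape as the first one reported in this thread: "want" is again exactly the metadata address plus one 8-sector buffer, and the address is again only impossible because of its upper bits:

```python
# Values taken verbatim from the dmesg output above.
daddr = 0x5ddc49b238000000             # failing metadata address (sectors)
assert daddr + 4096 // 512 == 6763361770196172808  # the kernel's "want="

# High word vs low word of the 64-bit address.
hi, lo = daddr >> 32, daddr & 0xFFFFFFFF
print(f"hi={hi:#x} lo={lo:#x}")        # hi=0x5ddc49b2 lo=0x38000000
```

Two incidents, each a wildly out-of-range 64-bit pointer containing plausible-looking fragments, point more toward pointer corruption than a single bad sector -- again only an inference from the logged values.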
#######################
server-files ~ # umount /mnt/san/
server-files ~ # mount /mnt/san/
server-files ~ # umount /mnt/san/
server-files ~ # xfs_repair
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
server-files ~ # xfs_repair -L
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
4002: Badness in key lookup (length)
bp=(bno 2561904, len 16384 bytes) key=(bno 2561904, len 8192 bytes)
8003: Badness in key lookup (length)
bp=(bno 0, len 512 bytes) key=(bno 0, len 4096 bytes)
bad bmap btree ptr 0x5f808b0400000000 in ino 5123809
bad data fork in inode 5123809
cleared inode 5123809
bad magic # 0x58465342 in inode 7480148 (data fork) bmbt block 0
bad data fork in inode 7480148
cleared inode 7480148
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
entry "Fuller_RotoscopeCorrected.mov" at block 0 offset 184 in directory
inode 8992373 references free inode 7480148
clearing inode number in entry at offset 184...
Phase 5 - rebuild AG headers and trees...
- reset superblock...
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
bad hash table for directory inode 8992373 (no data entry): rebuilding
rebuilding directory inode 8992373
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
4000: Badness in key lookup (length)
bp=(bno 0, len 4096 bytes) key=(bno 0, len 512 bytes)
done
server-files ~ # mount /mnt/san
server-files ~ # umount /mnt/san
server-files ~ # xfs_repair -L
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
server-files ~ # xfs_repair
/dev/server-files-sanvg01/server-files-sanlv01
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
XFS: totally zeroed log
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
################
So that's it for now. Next week I'll be rsyncing all of the data off of
this volume to another array. I still want to know what's happening,
though... *pout*
Anyways, thanks a lot for everyone's help.
~Jay
-----Original Message-----
From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
Of Jay Sullivan
Sent: Friday, November 02, 2007 10:49 AM
To: xfs@oss.sgi.com
Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
What can I say about Murphy and his silly laws? I just had a drive fail
on my array. I wonder if this is the root of my problems... Yay
parity.
~Jay
-----Original Message-----
From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
Of Jay Sullivan
Sent: Friday, November 02, 2007 10:00 AM
To: xfs@oss.sgi.com
Subject: RE: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
I lost the xfs_repair output on an xterm with only four lines of
scrollback... I'll definitely be more careful to preserve more
'evidence' next time. =( "Pics or it didn't happen", right?
I just upgraded xfsprogs and will scan the disk during my next scheduled
downtime (probably in about 2 weeks). I'm tempted to just wipe the
volume and start over: I have enough 'spare' space lying around to copy
everything out to a fresh XFS volume.
Regarding "areca": I'm using hardware RAID built into Apple XServe
RAIDs o'er LSI FC929X cards.
Someone else offered the likely explanation that the btree is corrupted.
Isn't this something xfs_repair should be able to fix? Would it be
easier, safer, and faster to move the data to a new volume (and restore
corrupted files if/as I find them from backup)? We're talking about
just less than 4TB of data which used to take about 6 hours to fsck (one
pass) with ext3. Restoring the whole shebang from backups would
probably take the better part of 12 years (waiting for compression,
resetting ACLs, etc.)...
FWIW, another (way less important,) much busier and significantly larger
logical volume on the same array has been totally fine. Murphy--go
figure.
Thanks!
-----Original Message-----
From: Eric Sandeen [mailto:sandeen@sandeen.net]
Sent: Thursday, November 01, 2007 10:30 PM
To: Jay Sullivan
Cc: xfs@oss.sgi.com
Subject: Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
Jay Sullivan wrote:
> Good eye: it wasn't mountable, thus the -L flag. No recent
> (unplanned) power outages. The machine and the array that holds the
> disks are both on serious batteries/UPS and the array's cache
> batteries are in good health.
Did you have the xfs_repair output to see what it found? You might also
grab the very latest xfsprogs (2.9.4) in case it's catching more cases.
I hate it when people suggest running memtest86, but I might do that
anyway. :)
What controller are you using? If you say "areca" I might be on to
something with some other bugs I've seen...
-Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-14 15:05 ` Jay Sullivan
@ 2007-11-15 3:26 ` Eric Sandeen
0 siblings, 0 replies; 21+ messages in thread
From: Eric Sandeen @ 2007-11-15 3:26 UTC (permalink / raw)
To: Jay Sullivan; +Cc: xfs
Jay Sullivan wrote:
> Of course this had to happen one more time before my scheduled
> maintenance window... Anyways, here's all of the good stuff I
> collected. Can anyone make sense of it? Oh, and I upgraded to xfsprogs
> 2.9.4 last week, so all output you see is with that version.
>
Forgot to ask, are you running w/ 4k stacks?
And/or, do you have stack usage debugging enabled? That's quite a
backtrace you've got there... just a shot in the dark.
-Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c
2007-11-02 2:14 ` Eric Sandeen
2007-11-02 2:22 ` Jay Sullivan
@ 2007-11-02 4:37 ` Timothy Shimmin
1 sibling, 0 replies; 21+ messages in thread
From: Timothy Shimmin @ 2007-11-02 4:37 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Jay Sullivan, xfs
Eric Sandeen wrote:
> Jay Sullivan wrote:
>
>> I ran xfs_repair -L on the FS and it could be mounted again,
>
> Was it not even mountable before this, or why did you use the -L flag?
> If the log is corrupted that points to more problems... perhaps you've
> had some power loss & your write caches evaporated, and lvm doesn't do
> barriers?
>
> -eric
>
BTW, I occasionally wonder about the reason for log corruptions.
If we have an "evaporated" write cache that would stop a write from
going but it wouldn't do a partial sector (< 512 byte) write, would it?
I have presumed that sector writes complete or not and that is what
the log code is based on.
OOI, Jay, how did it fail to mount - what was the log msg?
I presume you couldn't mount such that even the log couldn't be
replayed? Did it fail during replay?
--Tim
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2009-02-25 18:48 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-02 2:08 xfs_force_shutdown called from file fs/xfs/xfs_trans_buf.c Jay Sullivan
2007-11-02 5:18 ` David Chinner
-- strict thread matches above, loose matches on Subject: below --
2009-02-24 13:04 Federico Sevilla III
2009-02-24 22:46 ` Dave Chinner
2009-02-25 10:00 ` Federico Sevilla III
2009-02-25 11:51 ` Michael Monnerie
2009-02-25 18:47 ` Federico Sevilla III
[not found] <B3EDBE0F860AF74BAA82EF17A7CDEDC660BE05A3@svits26.main.ad.rit.edu>
2007-12-21 2:01 ` Jay Sullivan
2008-01-03 15:55 ` Jay Sullivan
2008-08-04 16:55 ` Richard Freeman
2007-11-01 20:06 Jay Sullivan
2007-11-02 2:14 ` Eric Sandeen
2007-11-02 2:22 ` Jay Sullivan
2007-11-02 2:30 ` Eric Sandeen
2007-11-02 9:07 ` Ralf Gross
2007-11-02 16:10 ` Eric Sandeen
2007-11-02 14:00 ` Jay Sullivan
2007-11-02 14:49 ` Jay Sullivan
2007-11-14 15:05 ` Jay Sullivan
2007-11-15 3:26 ` Eric Sandeen
2007-11-02 4:37 ` Timothy Shimmin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox