public inbox for linux-xfs@vger.kernel.org
* 88TB filesystem going off-line without warning
@ 2013-04-02 18:44 L Ox
  2013-04-03 19:40 ` Emmanuel Florac
  2013-04-04  4:35 ` Dave Chinner
  0 siblings, 2 replies; 3+ messages in thread
From: L Ox @ 2013-04-02 18:44 UTC (permalink / raw)
  To: xfs


Hi,

We have a new Linux/XFS deployment (about a month old) and randomly without
warning the XFS filesystem will go off-line. We are running Scientific
Linux release 5.9 with the latest updates.

# uname -a
Linux node24 2.6.18-348.3.1.el5 #1 SMP Mon Mar 11 15:43:13 EDT 2013 x86_64
x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
Scientific Linux release 5.9 (Boron)

Here are the errors we see in /var/log/messages after the initial off-line
event:

-- snip --

Apr  2 07:50:28 node24 kernel: xfs_iunlink_remove: xfs_inotobp()  returned an error 22 on dm-6.  Returning error.
Apr  2 07:50:28 node24 kernel: xfs_inactive:  xfs_ifree() returned an error = 22 on dm-6
Apr  2 07:50:28 node24 kernel: xfs_force_shutdown(dm-6,0x1) called from line 1419 of file fs/xfs/xfs_vnodeops.c.  Return address = 0xffffffff8855b86b
Apr  2 07:50:28 node24 kernel: Filesystem dm-6: I/O Error Detected. Shutting down filesystem: dm-6
Apr  2 07:50:28 node24 kernel: Please umount the filesystem, and rectify the problem(s)
Apr  2 07:50:52 node24 kernel: Filesystem dm-6: xfs_log_force: error 5 returned.
Apr  2 07:51:52 node24 last message repeated 2 times

-- snip --

Here are the messages after I umount/xfs_repair/mount the filesystem:

-- snip --

Apr  2 10:23:04 node24 kernel: xfs_force_shutdown(dm-6,0x1) called from line 420 of file fs/xfs/xfs_rw.c.  Return address = 0xffffffff8855c0fe
Apr  2 10:23:07 node24 kernel: Filesystem dm-6: xfs_log_force: error 5 returned.
Apr  2 10:23:07 node24 last message repeated 4 times
Apr  2 10:24:08 node24 kernel: Filesystem dm-6: Disabling barriers, trial barrier write failed
Apr  2 10:24:08 node24 kernel: XFS mounting filesystem dm-6
Apr  2 10:24:08 node24 kernel: Starting XFS recovery on filesystem: dm-6 (logdev: internal)
Apr  2 10:24:10 node24 kernel: Ending XFS recovery on filesystem: dm-6 (logdev: internal)
Apr  2 10:24:17 node24 multipathd: dm-6: umount map (uevent)
Apr  2 10:58:54 node24 kernel: Filesystem dm-6: Disabling barriers, trial barrier write failed
Apr  2 10:58:54 node24 kernel: XFS mounting filesystem dm-6

-- snip --

We are taking 6 devices from a SAN and using LVM to effectively create a
RAID0 block device, which XFS sits on top of. We do not see any multipathd
errors.

I created the filesystem with this command:

# mkfs.xfs -f -d su=256k,sw=6,sectsize=4096,unwritten=0 -i attr=2 -l sectsize=4096,lazy-count=1 -r extsize=4096 /dev/mapper/vol_d24-root

Here are the mount options:

# cat /etc/fstab | grep xfs
/dev/mapper/vol_d24-root            /archive/d24       xfs     defaults,inode64        0 9

# mount | grep xfs
/dev/mapper/vol_d24-root on /archive/d24 type xfs (rw,inode64)

Here is the output of xfs_info:

# xfs_info /dev/mapper/vol_d24-root
meta-data=/dev/mapper/vol_d24-root isize=256    agcount=88, agsize=268435392 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=23441774592, imaxpct=25
         =                       sunit=64     swidth=384 blks, unwritten=0
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
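
[Editor's note: as a sanity check on the geometry above, the mkfs options and
the xfs_info report should agree. This short sketch (not part of the original
mail) converts su=256k and sw=6 into the block-based sunit/swidth values that
xfs_info prints:]

```python
# Editor's sketch: check that mkfs.xfs's su/sw options map to the
# sunit/swidth values reported by xfs_info (both counted in fs blocks).
su_bytes = 256 * 1024   # -d su=256k: stripe unit written to each SAN device
block_size = 4096       # bsize=4096 from the xfs_info output above
stripe_members = 6      # -d sw=6: six LVM-striped devices

sunit_blocks = su_bytes // block_size
swidth_blocks = sunit_blocks * stripe_members
print(f"sunit={sunit_blocks} blks, swidth={swidth_blocks} blks")
# prints: sunit=64 blks, swidth=384 blks
```

This matches the "sunit=64     swidth=384 blks" line above, so the filesystem
is at least aligned to the stripe geometry it was given.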

After the initial off-line event I:
- umount
- ran xfs_repair (it told me to mount/umount and then re-run xfs_repair)
- mount
- umount
- xfs_repair

Here is the output of xfs_repair:

-- snip --

# xfs_repair /dev/mapper/vol_d24-root
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
2acde2416940: Badness in key lookup (length)
bp=(bno 14657493984, len 16384 bytes) key=(bno 14657493984, len 8192 bytes)
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
2acde2416940: Badness in key lookup (length)
bp=(bno 26065183200, len 16384 bytes) key=(bno 26065183200, len 8192 bytes)
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
2acde2e17940: Badness in key lookup (length)
bp=(bno 43039175488, len 16384 bytes) key=(bno 43039175488, len 8192 bytes)
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - agno = 32
        - agno = 33
        - agno = 34
        - agno = 35
        - agno = 36
        - agno = 37
        - agno = 38
        - agno = 39
        - agno = 40
        - agno = 41
        - agno = 42
        - agno = 43
        - agno = 44
        - agno = 45
        - agno = 46
        - agno = 47
2acde0613940: Badness in key lookup (length)
bp=(bno 101051527232, len 16384 bytes) key=(bno 101051527232, len 8192 bytes)
2acde0613940: Badness in key lookup (length)
bp=(bno 101081120768, len 16384 bytes) key=(bno 101081120768, len 8192 bytes)
2acde0613940: Badness in key lookup (length)
bp=(bno 102336613216, len 16384 bytes) key=(bno 102336613216, len 8192 bytes)
        - agno = 48
        - agno = 49
2acde2416940: Badness in key lookup (length)
bp=(bno 107185599392, len 16384 bytes) key=(bno 107185599392, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107606543312, len 16384 bytes) key=(bno 107606543312, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107674994560, len 16384 bytes) key=(bno 107674994560, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078656, len 16384 bytes) key=(bno 107675078656, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078688, len 16384 bytes) key=(bno 107675078688, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078720, len 16384 bytes) key=(bno 107675078720, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675175008, len 16384 bytes) key=(bno 107675175008, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107704942624, len 16384 bytes) key=(bno 107704942624, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107763211904, len 16384 bytes) key=(bno 107763211904, len 8192 bytes)
        - agno = 50
2acde1014940: Badness in key lookup (length)
bp=(bno 109436122656, len 16384 bytes) key=(bno 109436122656, len 8192 bytes)
2acde2e17940: Badness in key lookup (length)
bp=(bno 110466056352, len 16384 bytes) key=(bno 110466056352, len 8192 bytes)
2acde2e17940: Badness in key lookup (length)
bp=(bno 110603835392, len 16384 bytes) key=(bno 110603835392, len 8192 bytes)
        - agno = 51
        - agno = 52
        - agno = 53
        - agno = 54
        - agno = 55
        - agno = 56
        - agno = 57
        - agno = 58
        - agno = 59
        - agno = 60
        - agno = 61
2acde2416940: Badness in key lookup (length)
bp=(bno 132435472416, len 16384 bytes) key=(bno 132435472416, len 8192 bytes)
        - agno = 62
2acde2416940: Badness in key lookup (length)
bp=(bno 135330780000, len 16384 bytes) key=(bno 135330780000, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 135508074496, len 16384 bytes) key=(bno 135508074496, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 135675982432, len 16384 bytes) key=(bno 135675982432, len 8192 bytes)
        - agno = 63
        - agno = 64
        - agno = 65
        - agno = 66
        - agno = 67
        - agno = 68
        - agno = 69
        - agno = 70
        - agno = 71
        - agno = 72
        - agno = 73
        - agno = 74
        - agno = 75
        - agno = 76
        - agno = 77
        - agno = 78
        - agno = 79
        - agno = 80
        - agno = 81
        - agno = 82
        - agno = 83
        - agno = 84
        - agno = 85
        - agno = 86
        - agno = 87
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 2
        - agno = 3
        - agno = 8
        - agno = 9
        - agno = 4
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 19
        - agno = 20
        - agno = 18
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - agno = 32
        - agno = 33
        - agno = 34
        - agno = 35
        - agno = 36
        - agno = 37
        - agno = 38
        - agno = 39
        - agno = 40
        - agno = 41
        - agno = 42
        - agno = 43
        - agno = 44
        - agno = 45
        - agno = 46
        - agno = 47
        - agno = 48
        - agno = 49
        - agno = 50
        - agno = 51
        - agno = 52
        - agno = 53
        - agno = 54
        - agno = 55
        - agno = 56
        - agno = 57
        - agno = 58
        - agno = 59
        - agno = 60
        - agno = 61
        - agno = 62
        - agno = 63
        - agno = 64
        - agno = 65
        - agno = 66
        - agno = 67
        - agno = 68
        - agno = 69
        - agno = 70
        - agno = 71
        - agno = 72
        - agno = 73
        - agno = 74
        - agno = 75
        - agno = 76
        - agno = 77
        - agno = 78
        - agno = 79
        - agno = 80
        - agno = 81
        - agno = 82
        - agno = 83
        - agno = 84
        - agno = 85
        - agno = 86
        - agno = 87
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 202102936036, moving to lost+found
disconnected inode 215350040250, moving to lost+found
disconnected inode 215350208634, moving to lost+found
disconnected inode 271016406074, moving to lost+found
Phase 7 - verify and correct link counts...
done

-- snip --

Any ideas?

Thanks

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: 88TB filesystem going off-line without warning
  2013-04-02 18:44 88TB filesystem going off-line without warning L Ox
@ 2013-04-03 19:40 ` Emmanuel Florac
  2013-04-04  4:35 ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Emmanuel Florac @ 2013-04-03 19:40 UTC (permalink / raw)
  To: L Ox; +Cc: xfs

On Tue, 2 Apr 2013 11:44:15 -0700, you wrote:

> Apr  2 07:50:28 node24 kernel: Filesystem dm-6: I/O Error Detected.
> Shutting down filesystem: dm-6

"IO Error" means there's probably a problem with underlying hardware
(bad controller, cable or disk drive). You should start with fixing the
hardware problem.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: 88TB filesystem going off-line without warning
  2013-04-02 18:44 88TB filesystem going off-line without warning L Ox
  2013-04-03 19:40 ` Emmanuel Florac
@ 2013-04-04  4:35 ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2013-04-04  4:35 UTC (permalink / raw)
  To: L Ox; +Cc: xfs

On Tue, Apr 02, 2013 at 11:44:15AM -0700, L Ox wrote:
> Hi,
> 
> We have a new Linux/XFS deployment (about a month old) and randomly without
> warning the XFS filesystem will go off-line. We are running Scientific
> Linux release 5.9 with the latest updates.
> 
> # uname -a
> Linux node24 2.6.18-348.3.1.el5 #1 SMP Mon Mar 11 15:43:13 EDT 2013 x86_64
> x86_64 x86_64 GNU/Linux
> 
> # cat /etc/redhat-release
> Scientific Linux release 5.9 (Boron)
> 
> Here are the errors we see in /var/log/messages after the initial off-line
> event:
> 
> -- snip --
> 
> Apr  2 07:50:28 node24 kernel: xfs_iunlink_remove: xfs_inotobp()  returned
> an error 22 on dm-6.  Returning error.
> Apr  2 07:50:28 node24 kernel: xfs_inactive:  xfs_ifree() returned an error
> = 22 on dm-6

#define EINVAL          22      /* Invalid argument */

That tends to imply a corrupt inode number in the unlinked list
chain.
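
[Editor's note: the raw errno values in these logs can be decoded with
Python's standard errno and os modules. A quick lookup sketch, not part of
the original reply:]

```python
# Decode the errno numbers appearing in the logs above:
# "error 22" (EINVAL) at shutdown time, "error 5" (EIO) afterwards.
import errno
import os

for num in (22, 5):
    print(num, errno.errorcode[num], "-", os.strerror(num))
# prints:
# 22 EINVAL - Invalid argument
# 5 EIO - Input/output error
```

The later "xfs_log_force: error 5" messages are plain EIO: once the
filesystem has shut itself down, every subsequent log write fails with an
I/O error.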

> Here are the messages after I umount/xfs_repair/mount the filesystem:

What did xfs_repair detect/fix?

> # xfs_repair /dev/mapper/vol_d24-root
> Phase 1 - find and verify superblock...
.....
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> disconnected inode 202102936036, moving to lost+found
> disconnected inode 215350040250, moving to lost+found
> disconnected inode 215350208634, moving to lost+found
> disconnected inode 271016406074, moving to lost+found

Some inodes that had been unlinked from the directory structure
but not freed. They were probably on an unlinked inode list that
couldn't be walked.

> Any ideas?

If the problem is a one-off, there isn't anything that can be done.
If you can reproduce it, try to narrow it down to the simplest case
you can...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
