* 88TB filesystem going off-line without warning
@ 2013-04-02 18:44 L Ox
2013-04-03 19:40 ` Emmanuel Florac
2013-04-04 4:35 ` Dave Chinner
0 siblings, 2 replies; 3+ messages in thread
From: L Ox @ 2013-04-02 18:44 UTC (permalink / raw)
To: xfs
Hi,
We have a new Linux/XFS deployment (about a month old), and without
warning the XFS filesystem randomly goes offline. We are running Scientific
Linux release 5.9 with the latest updates.
# uname -a
Linux node24 2.6.18-348.3.1.el5 #1 SMP Mon Mar 11 15:43:13 EDT 2013 x86_64
x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Scientific Linux release 5.9 (Boron)
Here are the errors we see in /var/log/messages after the initial off-line
event:
-- snip --
Apr 2 07:50:28 node24 kernel: xfs_iunlink_remove: xfs_inotobp() returned
an error 22 on dm-6. Returning error.
Apr 2 07:50:28 node24 kernel: xfs_inactive: xfs_ifree() returned an error
= 22 on dm-6
Apr 2 07:50:28 node24 kernel: xfs_force_shutdown(dm-6,0x1) called from
line 1419 of file fs/xfs/xfs_vnodeops.c. Return address =
0xffffffff8855b86b
Apr 2 07:50:28 node24 kernel: Filesystem dm-6: I/O Error Detected.
Shutting down filesystem: dm-6
Apr 2 07:50:28 node24 kernel: Please umount the filesystem, and rectify
the problem(s)
Apr 2 07:50:52 node24 kernel: Filesystem dm-6: xfs_log_force: error 5
returned.
Apr 2 07:51:52 node24 last message repeated 2 times
-- snip --
Here are the messages after I umount/xfs_repair/mount the filesystem:
-- snip --
Apr 2 10:23:04 node24 kernel: xfs_force_shutdown(dm-6,0x1) called from
line 420 of file fs/xfs/xfs_rw.c. Return address = 0xffffffff8855c0fe
Apr 2 10:23:07 node24 kernel: Filesystem dm-6: xfs_log_force: error 5
returned.
Apr 2 10:23:07 node24 last message repeated 4 times
Apr 2 10:24:08 node24 kernel: Filesystem dm-6: Disabling barriers, trial
barrier write failed
Apr 2 10:24:08 node24 kernel: XFS mounting filesystem dm-6
Apr 2 10:24:08 node24 kernel: Starting XFS recovery on filesystem: dm-6
(logdev: internal)
Apr 2 10:24:10 node24 kernel: Ending XFS recovery on filesystem: dm-6
(logdev: internal)
Apr 2 10:24:17 node24 multipathd: dm-6: umount map (uevent)
Apr 2 10:58:54 node24 kernel: Filesystem dm-6: Disabling barriers, trial
barrier write failed
Apr 2 10:58:54 node24 kernel: XFS mounting filesystem dm-6
-- snip --
We are taking six devices from a SAN and using LVM to create what is
effectively a RAID0 block device, with XFS sitting on top. We do not see any
multipathd errors.
I created the filesystem with this command:
# mkfs.xfs -f -d su=256k,sw=6,sectsize=4096,unwritten=0 -i attr=2 -l sectsize=4096,lazy-count=1 -r extsize=4096 /dev/mapper/vol_d24-root
Here are the mount options:
# cat /etc/fstab | grep xfs
/dev/mapper/vol_d24-root /archive/d24 xfs
defaults,inode64 0 9
# mount | grep xfs
/dev/mapper/vol_d24-root on /archive/d24 type xfs (rw,inode64)
Here is the output of xfs_info:
# xfs_info /dev/mapper/vol_d24-root
meta-data=/dev/mapper/vol_d24-root isize=256    agcount=88, agsize=268435392 blks
         =                         sectsz=4096  attr=2
data     =                         bsize=4096   blocks=23441774592, imaxpct=25
         =                         sunit=64     swidth=384 blks, unwritten=0
naming   =version 2                bsize=4096
log      =internal                 bsize=4096   blocks=32768, version=2
         =                         sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                     extsz=4096   blocks=0, rtextents=0
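(Editorial aside, not part of the original mail: the sunit/swidth values that
xfs_info reports can be cross-checked against the su/sw parameters passed to
mkfs.xfs. A minimal sketch, assuming the 4096-byte block size shown above:)

```python
# Convert the mkfs.xfs stripe geometry (su=256k, sw=6) into the
# sunit/swidth values that xfs_info reports in filesystem blocks.
BLOCK_SIZE = 4096          # bsize from the xfs_info output above
STRIPE_UNIT = 256 * 1024   # su=256k from the mkfs.xfs command
STRIPE_MEMBERS = 6         # sw=6: six LVM stripe members

sunit_blocks = STRIPE_UNIT // BLOCK_SIZE          # stripe unit in fs blocks
swidth_blocks = sunit_blocks * STRIPE_MEMBERS     # full stripe width in fs blocks

print(sunit_blocks, swidth_blocks)  # 64 384, matching sunit=64 swidth=384
```

So the reported geometry (sunit=64, swidth=384 blocks) is consistent with the
mkfs parameters; the stripe alignment itself is not the obvious suspect here.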
After the initial off-line event I:
- umount
- ran xfs_repair (it told me to mount/umount and then re-run xfs_repair)
- mount
- umount
- xfs_repair
Here is the output of xfs_repair:
-- snip --
# xfs_repair /dev/mapper/vol_d24-root
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
2acde2416940: Badness in key lookup (length)
bp=(bno 14657493984, len 16384 bytes) key=(bno 14657493984, len 8192 bytes)
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
2acde2416940: Badness in key lookup (length)
bp=(bno 26065183200, len 16384 bytes) key=(bno 26065183200, len 8192 bytes)
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
2acde2e17940: Badness in key lookup (length)
bp=(bno 43039175488, len 16384 bytes) key=(bno 43039175488, len 8192 bytes)
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- agno = 32
- agno = 33
- agno = 34
- agno = 35
- agno = 36
- agno = 37
- agno = 38
- agno = 39
- agno = 40
- agno = 41
- agno = 42
- agno = 43
- agno = 44
- agno = 45
- agno = 46
- agno = 47
2acde0613940: Badness in key lookup (length)
bp=(bno 101051527232, len 16384 bytes) key=(bno 101051527232, len 8192 bytes)
2acde0613940: Badness in key lookup (length)
bp=(bno 101081120768, len 16384 bytes) key=(bno 101081120768, len 8192 bytes)
2acde0613940: Badness in key lookup (length)
bp=(bno 102336613216, len 16384 bytes) key=(bno 102336613216, len 8192 bytes)
- agno = 48
- agno = 49
2acde2416940: Badness in key lookup (length)
bp=(bno 107185599392, len 16384 bytes) key=(bno 107185599392, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107606543312, len 16384 bytes) key=(bno 107606543312, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107674994560, len 16384 bytes) key=(bno 107674994560, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078656, len 16384 bytes) key=(bno 107675078656, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078688, len 16384 bytes) key=(bno 107675078688, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078720, len 16384 bytes) key=(bno 107675078720, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675175008, len 16384 bytes) key=(bno 107675175008, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107704942624, len 16384 bytes) key=(bno 107704942624, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107763211904, len 16384 bytes) key=(bno 107763211904, len 8192 bytes)
- agno = 50
2acde1014940: Badness in key lookup (length)
bp=(bno 109436122656, len 16384 bytes) key=(bno 109436122656, len 8192 bytes)
2acde2e17940: Badness in key lookup (length)
bp=(bno 110466056352, len 16384 bytes) key=(bno 110466056352, len 8192 bytes)
2acde2e17940: Badness in key lookup (length)
bp=(bno 110603835392, len 16384 bytes) key=(bno 110603835392, len 8192 bytes)
- agno = 51
- agno = 52
- agno = 53
- agno = 54
- agno = 55
- agno = 56
- agno = 57
- agno = 58
- agno = 59
- agno = 60
- agno = 61
2acde2416940: Badness in key lookup (length)
bp=(bno 132435472416, len 16384 bytes) key=(bno 132435472416, len 8192 bytes)
- agno = 62
2acde2416940: Badness in key lookup (length)
bp=(bno 135330780000, len 16384 bytes) key=(bno 135330780000, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 135508074496, len 16384 bytes) key=(bno 135508074496, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 135675982432, len 16384 bytes) key=(bno 135675982432, len 8192 bytes)
- agno = 63
- agno = 64
- agno = 65
- agno = 66
- agno = 67
- agno = 68
- agno = 69
- agno = 70
- agno = 71
- agno = 72
- agno = 73
- agno = 74
- agno = 75
- agno = 76
- agno = 77
- agno = 78
- agno = 79
- agno = 80
- agno = 81
- agno = 82
- agno = 83
- agno = 84
- agno = 85
- agno = 86
- agno = 87
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 5
- agno = 6
- agno = 7
- agno = 2
- agno = 3
- agno = 8
- agno = 9
- agno = 4
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 19
- agno = 20
- agno = 18
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- agno = 32
- agno = 33
- agno = 34
- agno = 35
- agno = 36
- agno = 37
- agno = 38
- agno = 39
- agno = 40
- agno = 41
- agno = 42
- agno = 43
- agno = 44
- agno = 45
- agno = 46
- agno = 47
- agno = 48
- agno = 49
- agno = 50
- agno = 51
- agno = 52
- agno = 53
- agno = 54
- agno = 55
- agno = 56
- agno = 57
- agno = 58
- agno = 59
- agno = 60
- agno = 61
- agno = 62
- agno = 63
- agno = 64
- agno = 65
- agno = 66
- agno = 67
- agno = 68
- agno = 69
- agno = 70
- agno = 71
- agno = 72
- agno = 73
- agno = 74
- agno = 75
- agno = 76
- agno = 77
- agno = 78
- agno = 79
- agno = 80
- agno = 81
- agno = 82
- agno = 83
- agno = 84
- agno = 85
- agno = 86
- agno = 87
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
disconnected inode 202102936036, moving to lost+found
disconnected inode 215350040250, moving to lost+found
disconnected inode 215350208634, moving to lost+found
disconnected inode 271016406074, moving to lost+found
Phase 7 - verify and correct link counts...
done
-- snip --
Any ideas?
Thanks
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: 88TB filesystem going off-line without warning
2013-04-02 18:44 88TB filesystem going off-line without warning L Ox
@ 2013-04-03 19:40 ` Emmanuel Florac
2013-04-04 4:35 ` Dave Chinner
1 sibling, 0 replies; 3+ messages in thread
From: Emmanuel Florac @ 2013-04-03 19:40 UTC (permalink / raw)
To: L Ox; +Cc: xfs
On Tue, 2 Apr 2013 11:44:15 -0700, you wrote:
> Apr 2 07:50:28 node24 kernel: Filesystem dm-6: I/O Error Detected.
> Shutting down filesystem: dm-6
"I/O Error" means there is probably a problem with the underlying hardware
(a bad controller, cable, or disk drive). You should start by fixing the
hardware problem.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: 88TB filesystem going off-line without warning
2013-04-02 18:44 88TB filesystem going off-line without warning L Ox
2013-04-03 19:40 ` Emmanuel Florac
@ 2013-04-04 4:35 ` Dave Chinner
1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2013-04-04 4:35 UTC (permalink / raw)
To: L Ox; +Cc: xfs
On Tue, Apr 02, 2013 at 11:44:15AM -0700, L Ox wrote:
> Hi,
>
> We have a new Linux/XFS deployment (about a month old), and without
> warning the XFS filesystem randomly goes offline. We are running Scientific
> Linux release 5.9 with the latest updates.
>
> # uname -a
> Linux node24 2.6.18-348.3.1.el5 #1 SMP Mon Mar 11 15:43:13 EDT 2013 x86_64
> x86_64 x86_64 GNU/Linux
>
> # cat /etc/redhat-release
> Scientific Linux release 5.9 (Boron)
>
> Here are the errors we see in /var/log/messages after the initial off-line
> event:
>
> -- snip --
>
> Apr 2 07:50:28 node24 kernel: xfs_iunlink_remove: xfs_inotobp() returned
> an error 22 on dm-6. Returning error.
> Apr 2 07:50:28 node24 kernel: xfs_inactive: xfs_ifree() returned an error
> = 22 on dm-6
#define EINVAL 22 /* Invalid argument */
That tends to imply a corrupt inode number in the unlinked list
chain.
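(Editorial aside, not part of Dave's reply: the same errno mapping is visible
from userspace; a one-liner sketch, assuming a Linux errno table:)

```python
import errno
import os

# errno 22 is EINVAL on Linux -- the "error 22" that the kernel messages
# above report from xfs_inotobp() and xfs_ifree().
print(errno.EINVAL)                 # 22
print(os.strerror(errno.EINVAL))    # "Invalid argument" on Linux
```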
> Here are the messages after I umount/xfs_repair/mount the filesystem:
What did xfs_repair detect/fix?
> # xfs_repair /dev/mapper/vol_d24-root
> Phase 1 - find and verify superblock...
.....
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> disconnected inode 202102936036, moving to lost+found
> disconnected inode 215350040250, moving to lost+found
> disconnected inode 215350208634, moving to lost+found
> disconnected inode 271016406074, moving to lost+found
Some inodes that had been unlinked from the directory structure
but not freed. They were probably on an unlinked inode list that
couldn't be walked.
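(Editorial aside, a toy model and not XFS code: each AG keeps a singly linked
list of unlinked-but-not-yet-freed inodes, and walking it stops as soon as a
chain entry does not resolve to a valid inode, stranding everything behind the
corrupt pointer. All names and inode numbers below are made up:)

```python
# Hypothetical unlinked-inode chain for one AG: maps each unlinked inode
# to the next one in the chain; NULLAGINO terminates the list.
NULLAGINO = -1

# Inode 205's pointer was corrupted: it should point at 310 but reads 999,
# so inode 310 stays allocated yet unreachable from the list head.
next_unlinked = {100: 205, 205: 999, 310: NULLAGINO}

def walk_unlinked(head):
    """Collect inodes reachable from the AGI unlinked-list head.
    Stops early (analogous to xfs_iunlink_remove() returning EINVAL)
    when an entry does not resolve to a known inode."""
    reached, ino = [], head
    while ino != NULLAGINO:
        if ino not in next_unlinked:   # corrupt inode number in the chain
            break
        reached.append(ino)
        ino = next_unlinked[ino]
    return reached

print(walk_unlinked(100))  # [100, 205] -- inode 310 is stranded
```

In this picture, the stranded inode is exactly what xfs_repair later finds as
a "disconnected inode" and moves to lost+found.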
> Any ideas?
If the problem is a one-off, there isn't anything that can be done.
If you can reproduce it, try to narrow it down to the simplest case
you can...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com