* XFS corruption on 3ware RAID6-volume
@ 2011-02-23 13:27 Erik Gulliksson
2011-02-23 14:46 ` Emmanuel Florac
2011-02-23 16:56 ` Stan Hoeppner
0 siblings, 2 replies; 13+ messages in thread
From: Erik Gulliksson @ 2011-02-23 13:27 UTC (permalink / raw)
To: xfs
Dear XFS people,
I have bumped into a corruption problem with one of my XFS filesystems.
The filesystem lives on a RAID6 volume on a 3ware 9650SE-24M8 with
battery backup and write cache enabled. The RAID6 configuration is 11
2.0TB WD15EARS disks, and the volume is reported as OK by the RAID card.
I believe the corruption below happened when the RAID card reset itself,
due to disk timeouts on another RAID6 volume on the same controller
card (different story). I have tried to gather some relevant
information, in the hope that someone can point me in the right
direction for repairing this corruption.
Kernel: 2.6.26-2-amd64
OS: Debian Linux lenny 64-bit
xfsprogs: 2.9.8
Output from xfs_info:
meta-data=/dev/sda1              isize=256    agcount=13, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3295874295, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
Mounting the filesystem gives the following in dmesg:
[858397.713452] Starting XFS recovery on filesystem: sda1 (logdev: internal)
[858403.841603] Filesystem "sda1": XFS internal error
xfs_btree_check_sblock at line 334 of file fs/xfs/xfs_btree.c. Caller
0xffffffffa0138321
[858403.841603] Pid: 31433, comm: mount Not tainted 2.6.26-2-amd64 #1
[858403.841603]
[858403.841603] Call Trace:
[858403.841603] [<ffffffffa0138321>] :xfs:xfs_alloc_lookup+0x133/0x34f
[858403.841603] [<ffffffffa014c7fb>] :xfs:xfs_btree_check_sblock+0xaf/0xbf
[858403.841603] [<ffffffffa0138321>] :xfs:xfs_alloc_lookup+0x133/0x34f
[858403.841603] [<ffffffffa014c322>] :xfs:xfs_btree_init_cursor+0x31/0x1ae
[858403.841603] [<ffffffffa0135d17>] :xfs:xfs_free_ag_extent+0x63/0x6b5
[858403.841603] [<ffffffff8042a354>] __down_read+0x12/0xa1
[858403.841603] [<ffffffffa01379dd>] :xfs:xfs_free_extent+0xa9/0xc9
[858403.841603] [<ffffffffa01694b3>] :xfs:xlog_recover_process_efi+0x10e/0x167
[858403.841603] [<ffffffffa016a6a4>] :xfs:xlog_recover_process_efis+0x4b/0x85
[858403.841603] [<ffffffffa016a6f3>] :xfs:xlog_recover_finish+0x15/0xb5
[858403.841603] [<ffffffffa016f2f7>] :xfs:xfs_mountfs+0x475/0x5ac
[858403.841603] [<ffffffffa017a311>] :xfs:kmem_alloc+0x60/0xc4
[858403.841603] [<ffffffffa0174eb4>] :xfs:xfs_mount+0x29b/0x347
[858403.841603] [<ffffffffa01833e6>] :xfs:xfs_fs_fill_super+0x0/0x1ee
[858403.841603] [<ffffffffa018349b>] :xfs:xfs_fs_fill_super+0xb5/0x1ee
[858403.841603] [<ffffffff8029d334>] get_sb_bdev+0xf8/0x145
[858403.841603] [<ffffffff8029cd58>] vfs_kern_mount+0x93/0x11b
[858403.841603] [<ffffffff8029ce33>] do_kern_mount+0x43/0xe3
[858403.841603] [<ffffffff802b18c9>] do_new_mount+0x5b/0x95
[858403.841603] [<ffffffff802b1ac0>] do_mount+0x1bd/0x1e7
[858403.841603] [<ffffffff802769a1>] __alloc_pages_internal+0xd6/0x3bf
[858403.841603] [<ffffffff802b1b74>] sys_mount+0x8a/0xce
[858403.841603] [<ffffffff8020beca>] system_call_after_swapgs+0x8a/0x8f
[858403.841603]
[858403.841603] Failed to recover EFIs on filesystem: sda1
[858403.841603] XFS: log mount finish failed
Output from xfs_check -v:
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_check. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
The output from "xfs_repair -n" was a 2GB file, so this is a cleaned-up
summary:
- block (1,499522) already used, state 7 (3_636_499 of these)
- block (7,480993) multiply claimed by bno space tree, state -
(26_547_241 of these)
- bno freespace btree block claimed (state 1), agno 7, bno 65565,
suspect 0 (158 of these)
- bcnt freespace btree block claimed (state 1), agno 7, bno 567395,
suspect 0 (175 of these)
- data fork in ino 84753919 claims free block 291349280 (4_580_113 of these)
- would have junked entry "foo" in directory inode 136 (10_095 of these)
- would have corrected i8 count in directory 136 from 2 to 1 (9_016 of these)
- entry "foo" at block 0 offset 72 in directory inode 16955069
references non-existent inode 30065663864
would clear inode number in entry at offset 72... (43_379 of these)
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
bad magic # 0x26c4 in btcnt block 1/302903
expected level 0 got 514 in btcnt block 1/302903
bad magic # 0x26c4 in btbno block 7/604731
expected level 0 got 256 in btbno block 7/604731
bad magic # 0x26c4 in btbno block 9/8428277
expected level 0 got 59755 in btbno block 9/8428277
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
found inodes not in the inode allocation tree
found inodes not in the inode allocation tree
found inodes not in the inode allocation tree
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
bad directory block magic # 0x26c4 in block 0 for directory inode 6159645547
corrupt block 0 in directory inode 6159645547
would junk block
no . entry for directory 6159645547
no .. entry for directory 6159645547
problem with directory contents in inode 6159645547
would have cleared inode 6159645547
- agno = 2
- agno = 3
- agno = 4
bad directory block magic # 0x6173733d in block 0 for directory inode 19126674939
corrupt block 0 in directory inode 19126674939
would junk block
no . entry for directory 19126674939
no .. entry for directory 19126674939
problem with directory contents in inode 19126674939
would have cleared inode 19126674939
- agno = 5
- agno = 6
- agno = 7
42189950: Badness in key lookup (length)
bp=(bno 15170340024, len 16384 bytes) key=(bno 15170340024, len 8192 bytes)
- agno = 8
bad directory block magic # 0x45b419cb in block 0 for directory inode 35775783660
corrupt block 0 in directory inode 35775783660
would junk block
no . entry for directory 35775783660
no .. entry for directory 35775783660
problem with directory contents in inode 35775783660
would have cleared inode 35775783660
- agno = 9
- agno = 10
bad nblocks 20513 for inode 43585639210, would reset to 15192
bad nextents 37 for inode 43585639210, would reset to 32
- agno = 11
- agno = 12
bad directory block magic # 0x58443244 in block 0 for directory inode 51803060746
corrupt block 0 in directory inode 51803060746
would junk block
no . entry for directory 51803060746
no .. entry for directory 51803060746
problem with directory contents in inode 51803060746
would have cleared inode 51803060746
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
bad directory block magic # 0x26c4 in block 0 for directory inode 6159645547
corrupt block 0 in directory inode 6159645547
would junk block
no . entry for directory 6159645547
no .. entry for directory 6159645547
problem with directory contents in inode 6159645547
would have cleared inode 6159645547
- agno = 2
- agno = 3
- agno = 4
bad directory block magic # 0x6173733d in block 0 for directory inode 19126674939
corrupt block 0 in directory inode 19126674939
would junk block
no . entry for directory 19126674939
no .. entry for directory 19126674939
problem with directory contents in inode 19126674939
would have cleared inode 19126674939
- agno = 5
- agno = 6
- agno = 7
- agno = 8
bad directory block magic # 0x45b419cb in block 0 for directory inode 35775783660
corrupt block 0 in directory inode 35775783660
would junk block
no . entry for directory 35775783660
no .. entry for directory 35775783660
problem with directory contents in inode 35775783660
would have cleared inode 35775783660
- agno = 9
- agno = 10
bad nblocks 20513 for inode 43585639210, would reset to 15192
bad nextents 37 for inode 43585639210, would reset to 32
- agno = 11
- agno = 12
bad directory block magic # 0x58443244 in block 0 for directory inode 51803060746
corrupt block 0 in directory inode 51803060746
would junk block
no . entry for directory 51803060746
no .. entry for directory 51803060746
problem with directory contents in inode 51803060746
would have cleared inode 51803060746
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.
I did run "xfs_repair -L" on an image of the filesystem on another
server and I ended up with about 50000 entries in lost+found (~750000
entries recursively). Attaching output from "xfs_logprint -t" and a
xfs_metadump can be made available. Is there any way to diagnose and
salvage this? Any and all help is much appreciated.
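For reference, this kind of repair experiment can be repeated against a
metadata-only image instead of a full block copy; the sketch below uses
placeholder paths, and the use of xfs_metadump/xfs_mdrestore here is only
a suggestion, not necessarily how the earlier image was made:

  # capture the filesystem metadata while /dev/sda1 is unmounted (paths are placeholders)
  xfs_metadump /dev/sda1 /backup/sda1.metadump
  # restore the dump into an image file and dry-run the repair on the copy
  xfs_mdrestore /backup/sda1.metadump /backup/sda1.img
  xfs_repair -f -n /backup/sda1.img
  # only once that looks sane, try zeroing the log on the copy
  xfs_repair -f -L /backup/sda1.img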
Best regards
Erik Gulliksson
[-- Attachment #2: xfs_logprint-t.txt.gz --]
[-- Type: application/x-gzip, Size: 132474 bytes --]
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 13:27 XFS corruption on 3ware RAID6-volume Erik Gulliksson
@ 2011-02-23 14:46 ` Emmanuel Florac
2011-02-23 15:01 ` Erik Gulliksson
2011-02-23 16:56 ` Stan Hoeppner
1 sibling, 1 reply; 13+ messages in thread
From: Emmanuel Florac @ 2011-02-23 14:46 UTC (permalink / raw)
To: Erik Gulliksson; +Cc: xfs
On Wed, 23 Feb 2011 14:27:27 +0100
Erik Gulliksson <erik@gulliksson.org> wrote:
> I have bumped in to a corruption problem with one a XFS filesystems.
> The filesystem lives on a RAID6-volume on a 3ware 9650SE-24M8 with
> battery backup and writecache enabled.
What firmware version are you using?
( tw_cli /cX show firmware )
> RAID6-configuration is 11 2.0TB
> WD15EARS disks and the volume is reported as OK by the RAID-card.
Augh. That sounds pretty bad. What does " tw_cli /cX/uY show all" look
like?
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 14:46 ` Emmanuel Florac
@ 2011-02-23 15:01 ` Erik Gulliksson
2011-02-23 15:23 ` Emmanuel Florac
2011-02-23 15:29 ` Justin Piszcz
0 siblings, 2 replies; 13+ messages in thread
From: Erik Gulliksson @ 2011-02-23 15:01 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs
Hi Emmanuel,
Thanks for your prompt reply.
On Wed, Feb 23, 2011 at 3:46 PM, Emmanuel Florac <eflorac@intellique.com> wrote:
>
> What firmware version are you using?
>
> ( tw_cli /cX show firmware )
# tw_cli /c0 show firmware
/c0 Firmware Version = FE9X 4.10.00.007
>
> Augh. That sounds pretty bad. What does " tw_cli /cX/uY show all" look
> like?
Yes, it is bad - a decision has been made to replace these disks with
"enterprise" versions (without the TLER/ERC problems, etc.). tw_cli
produces this output for the volume:
# tw_cli /c0/u0 show all
/c0/u0 status = OK
/c0/u0 is not rebuilding, its current state is OK
/c0/u0 is not verifying, its current state is OK
/c0/u0 is initialized.
/c0/u0 Write Cache = on
/c0/u0 Read Cache = Intelligent
/c0/u0 volume(s) = 1
/c0/u0 name = xxx
/c0/u0 serial number = yyy
/c0/u0 Ignore ECC policy = off
/c0/u0 Auto Verify Policy = off
/c0/u0 Storsave Policy = protection
/c0/u0 Command Queuing Policy = on
/c0/u0 Rapid RAID Recovery setting = all
/c0/u0 Parity Number = 2
Unit   UnitType  Status  %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0     RAID-6    OK      -       -       -     256K    12572.8
u0-0   DISK      OK      -       -       p12   -       1396.97
u0-1   DISK      OK      -       -       p21   -       1396.97
u0-2   DISK      OK      -       -       p14   -       1396.97
u0-3   DISK      OK      -       -       p15   -       1396.97
u0-4   DISK      OK      -       -       p16   -       1396.97
u0-5   DISK      OK      -       -       p17   -       1396.97
u0-6   DISK      OK      -       -       p18   -       1396.97
u0-7   DISK      OK      -       -       p19   -       1396.97
u0-8   DISK      OK      -       -       p20   -       1396.97
u0-9   DISK      OK      -       -       p0    -       1396.97
u0-10  DISK      OK      -       -       p22   -       1396.97
u0/v0  Volume    -       -       -       -     -       12572.8
Best regards
Erik Gulliksson
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:01 ` Erik Gulliksson
@ 2011-02-23 15:23 ` Emmanuel Florac
2011-02-24 10:20 ` Erik Gulliksson
2011-02-23 15:29 ` Justin Piszcz
1 sibling, 1 reply; 13+ messages in thread
From: Emmanuel Florac @ 2011-02-23 15:23 UTC (permalink / raw)
To: Erik Gulliksson; +Cc: xfs
On Wed, 23 Feb 2011 16:01:09 +0100
Erik Gulliksson <erik@gulliksson.org> wrote:
> Hi Emmanuel,
>
> Thanks for your prompt reply.
>
> On Wed, Feb 23, 2011 at 3:46 PM, Emmanuel Florac
> <eflorac@intellique.com> wrote:
> >
> > What firmware version are you using?
> >
> > ( tw_cli /cX show firmware )
>
> # tw_cli /c0 show firmware
> /c0 Firmware Version = FE9X 4.10.00.007
>
OK so this is the latest, or close.
>
> >
> > Augh. That sounds pretty bad. What does " tw_cli /cX/uY show all"
> > look like?
>
> Yes, it is bad - a decision has been made to replace these disks with
> "enterprise"-versions (without TLER/ERC problems etc).
A typical error, alas: saving a couple hundred euros on cheap drives to
store terabytes of data worth far more.
> Tw_cli produces
> this output for the volume:
>
> # tw_cli /c0/u0 show all
> /c0/u0 status = OK
> /c0/u0 is not rebuilding, its current state is OK
> /c0/u0 is not verifying, its current state is OK
> /c0/u0 is initialized.
> /c0/u0 Write Cache = on
> /c0/u0 Read Cache = Intelligent
> /c0/u0 volume(s) = 1
> /c0/u0 name = xxx
> /c0/u0 serial number = yyy
> /c0/u0 Ignore ECC policy = off
> /c0/u0 Auto Verify Policy = off
> /c0/u0 Storsave Policy = protection
> /c0/u0 Command Queuing Policy = on
> /c0/u0 Rapid RAID Recovery setting = all
> /c0/u0 Parity Number = 2
>
> Unit UnitType Status %RCmpl %V/I/M Port Stripe
> Size(GB)
> ------------------------------------------------------------------------
> u0 RAID-6 OK - - - 256K
> 12572.8
So the RAID array looks OK; the RAID controller doesn't report any
particular problem. You said it was reported as 0 K. Where did you see
0 K reported?
What does "dmesg | grep 3w-9xxx" give? And "tw_cli alarms"? Was the
filesystem under heavy write load when the problem occurred?
I'd start by launching a RAID verify, to detect and correct possible
on-disk coherency problems (it can't hurt anyway):
tw_cli /c0/u0 start verify
Then "tail -f /var/log/messages | grep 3w-9xxx" ...
I suspect no new problems will be discovered; most probably, IOs to the
array were lost because of the bus reset.
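Putting it together, an untested sketch (the 60-second interval is
arbitrary):

  # kick off a background verify of the unit
  tw_cli /c0/u0 start verify
  # the %V/I/M column in the unit listing shows verify progress
  watch -n 60 "tw_cli /c0/u0 show all"
  # watch the 3ware driver for ECC/medium errors while the verify runs
  tail -f /var/log/messages | grep 3w-9xxx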
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:01 ` Erik Gulliksson
2011-02-23 15:23 ` Emmanuel Florac
@ 2011-02-23 15:29 ` Justin Piszcz
2011-02-23 15:37 ` Emmanuel Florac
2011-02-24 10:35 ` Erik Gulliksson
1 sibling, 2 replies; 13+ messages in thread
From: Justin Piszcz @ 2011-02-23 15:29 UTC (permalink / raw)
To: Erik Gulliksson; +Cc: xfs
On Wed, 23 Feb 2011, Erik Gulliksson wrote:
> Hi Emmanuel,
>
> Thanks for your prompt reply.
>
> On Wed, Feb 23, 2011 at 3:46 PM, Emmanuel Florac <eflorac@intellique.com> wrote:
>>
>> What firmware version are you using?
>>
>> ( tw_cli /cX show firmware )
>
> # tw_cli /c0 show firmware
> /c0 Firmware Version = FE9X 4.10.00.007
The latest is:
9.5.1-9650-Upgrade.zip
9.5.2-9650-Upgrade.zip
9.5.3-9650-Upgrade.zip
9650SE_9690SA_firmware_beta_fw4.10.00.016.zip
9650SE_9690SA_firmware_beta_fw_4.10.00.019.zip <- latest
>
>
>>
>> Augh. That sounds pretty bad. What does " tw_cli /cX/uY show all" look
>> like?
>
> Yes, it is bad - a decision has been made to replace these disks with
> "enterprise"-versions (without TLER/ERC problems etc). Tw_cli produces
> this output for the volume:
This would seem to be the problem; you should go with Hitachi next time.
You can use regular non-enterprise drives (Hitachi) and they just work.
Seagate is a question mark.
Samsung is a question mark.
WD needs TLER.
>
> # tw_cli /c0/u0 show all
> /c0/u0 status = OK
> /c0/u0 is not rebuilding, its current state is OK
> /c0/u0 is not verifying, its current state is OK
> /c0/u0 is initialized.
> /c0/u0 Write Cache = on
> /c0/u0 Read Cache = Intelligent
> /c0/u0 volume(s) = 1
> /c0/u0 name = xxx
> /c0/u0 serial number = yyy
> /c0/u0 Ignore ECC policy = off
> /c0/u0 Auto Verify Policy = off
> /c0/u0 Storsave Policy = protection
> /c0/u0 Command Queuing Policy = on
> /c0/u0 Rapid RAID Recovery setting = all
> /c0/u0 Parity Number = 2
>
> Unit UnitType Status %RCmpl %V/I/M Port Stripe Size(GB)
> ------------------------------------------------------------------------
> u0 RAID-6 OK - - - 256K 12572.8
> u0-0 DISK OK - - p12 - 1396.97
> u0-1 DISK OK - - p21 - 1396.97
> u0-2 DISK OK - - p14 - 1396.97
> u0-3 DISK OK - - p15 - 1396.97
> u0-4 DISK OK - - p16 - 1396.97
> u0-5 DISK OK - - p17 - 1396.97
> u0-6 DISK OK - - p18 - 1396.97
> u0-7 DISK OK - - p19 - 1396.97
> u0-8 DISK OK - - p20 - 1396.97
> u0-9 DISK OK - - p0 - 1396.97
> u0-10 DISK OK - - p22 - 1396.97
> u0/v0 Volume - - - - - 12572.8
As for the problem at hand, I do not know of a good way to fix it unless
you had "ls -lRi /raid_array" output so you could map the inodes to their
original locations. Sorry, I don't have a better answer.
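For next time, a periodic snapshot along these lines is cheap insurance
(the output path is just a placeholder):

  # save an inode -> path map while the filesystem is still healthy
  ls -lRi /raid_array > /root/raid_array-inode-map.$(date +%Y%m%d)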
Justin.
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:29 ` Justin Piszcz
@ 2011-02-23 15:37 ` Emmanuel Florac
2011-02-23 15:42 ` Justin Piszcz
2011-02-24 10:35 ` Erik Gulliksson
1 sibling, 1 reply; 13+ messages in thread
From: Emmanuel Florac @ 2011-02-23 15:37 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Erik Gulliksson, xfs
On Wed, 23 Feb 2011 10:29:15 -0500 (EST)
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > /c0 Firmware Version = FE9X 4.10.00.007
> The latest is:
>
> 9.5.1-9650-Upgrade.zip
> 9.5.2-9650-Upgrade.zip
> 9.5.3-9650-Upgrade.zip
> 9650SE_9690SA_firmware_beta_fw4.10.00.016.zip
> 9650SE_9690SA_firmware_beta_fw_4.10.00.019.zip <- latest
Hmm, well, that one is a beta; I'd rather stick to released firmware :) His
firmware is actually the latest stable version for the 9650.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:37 ` Emmanuel Florac
@ 2011-02-23 15:42 ` Justin Piszcz
2011-02-24 10:25 ` Erik Gulliksson
0 siblings, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2011-02-23 15:42 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Erik Gulliksson, xfs
On Wed, 23 Feb 2011, Emmanuel Florac wrote:
> On Wed, 23 Feb 2011 10:29:15 -0500 (EST)
> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>>> /c0 Firmware Version = FE9X 4.10.00.007
>> The latest is:
>>
>> 9.5.1-9650-Upgrade.zip
>> 9.5.2-9650-Upgrade.zip
>> 9.5.3-9650-Upgrade.zip
>> 9650SE_9690SA_firmware_beta_fw4.10.00.016.zip
>> 9650SE_9690SA_firmware_beta_fw_4.10.00.019.zip <- latest
>
> Hum well, this is beta, I'd rather stick to released firmware :) His
> firmware is actually the latest stable version for the 9650.
Yes, but it specifically addresses issues with resets:
019:
SCR FIRM03219 Pchip reset interrupt not handled
SCR FIRM03220 Put parity segment to free state completely when deallocate unused parity segment
016:
SCR 2196: Unexpected controller soft resets
Fixed an issue with regards to deferral of write and read commands to help eliminate unexpected soft resets.
SCR 2217: Incomplete unit after multiple power failures when a RAID 1 unit is initializing after a rebuild.
This issue is fixed in this firmware version.
SCR 819: Controller may assert when cache is disabled in a RAID5/RAID6 configuration under heavy I/O load.
This issue is fixed in this firmware version.
SCR 2214: Performance drops during SES polling.
Fixed the issue where performance drops when a Storage Enclosure Processor (SEP) takes up to 700-800 msec to respond to SES polling.
Justin.
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 13:27 XFS corruption on 3ware RAID6-volume Erik Gulliksson
2011-02-23 14:46 ` Emmanuel Florac
@ 2011-02-23 16:56 ` Stan Hoeppner
1 sibling, 0 replies; 13+ messages in thread
From: Stan Hoeppner @ 2011-02-23 16:56 UTC (permalink / raw)
To: Erik Gulliksson; +Cc: xfs
Erik Gulliksson put forth on 2/23/2011 7:27 AM:
> Dear XFS people,
>
> I have bumped in to a corruption problem with one a XFS filesystems.
> The filesystem lives on a RAID6-volume on a 3ware 9650SE-24M8 with
> battery backup and writecache enabled. RAID6-configuration is 11 2.0TB
> WD15EARS disks and the volume is reported as OK by the RAID-card.
There is a reason WD has an enterprise line of drives for use in RAID
applications. The EARS series, along with the entire WD desktop drive
series, is not suitable for use with hardware RAID controllers, mainly
because it doesn't support TLER. The 2TB low-RPM WD Green at Newegg is
$90. The 2TB 7.2k RPM enterprise RE4 is $270, exactly 3 times the
price. Is your data, and reliable fast access to it, worth the extra
$1,980 (11 drives x $180 each)?
I think too many folks are blinded by low acquisition cost and thus
can't see the overall life cycle cost, or TCO, of their storage
infrastructure.
Please Google the XFS list archives for the horror story at UC Santa
Cruz WRT hardware arrays using the WD 2TB Green drives. The sysop lost
12TB of a 60TB XFS filesystem, and damn near lost the entire 60TB, if
not for luck. The lost 12TB of data was Ph. D. students' research data.
That kind of data is definitely worth the extra $$ for the right type
of drives which would have avoided the problem.
--
Stan
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:23 ` Emmanuel Florac
@ 2011-02-24 10:20 ` Erik Gulliksson
0 siblings, 0 replies; 13+ messages in thread
From: Erik Gulliksson @ 2011-02-24 10:20 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs
Thanks for your comments, Emmanuel.
> So the RAID array looks OK, the RAID controller doesn't report any
> particular problem. You said it was reported as 0 K. Where did you see
> 0 K reported?
No, I meant it is "OK" with an "O" :)
> What gives "dmesg | grep 3w-9xxx" ? and "tw_cli alarms" ? Was the
> filesystem under heavy write when the problem occured ?
The server has been restarted since the problems started, so there is
nothing notable in "tw_cli alarms" or dmesg. The controller was
performing a rebuild on the other unit when it happened; however, I
don't think this XFS filesystem was under any particular load.
>
> I'd start with launching a RAID verify, to detect and correct possible
> on-disk coherency problems (it can't hurt anyway):
>
> tw_cli /c0/u0 start verify
>
> Then "tail -f /var/log/messages | grep 3w-9xxx" ...
I will try this overnight and see if anything is reported.
> I suppose that there are no problems to be discovered. Most probably
> IOs to the array were lost because of the bus reset.
That's what I am afraid of too.
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:42 ` Justin Piszcz
@ 2011-02-24 10:25 ` Erik Gulliksson
0 siblings, 0 replies; 13+ messages in thread
From: Erik Gulliksson @ 2011-02-24 10:25 UTC (permalink / raw)
To: Justin Piszcz; +Cc: xfs
Hi Justin,
On Wed, Feb 23, 2011 at 4:42 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Wed, 23 Feb 2011, Emmanuel Florac wrote:
>
>> On Wed, 23 Feb 2011 10:29:15 -0500 (EST)
>> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>>> /c0 Firmware Version = FE9X 4.10.00.007
>>>
>>> The latest is:
>>>
>>> 9.5.1-9650-Upgrade.zip
>>> 9.5.2-9650-Upgrade.zip
>>> 9.5.3-9650-Upgrade.zip
>>> 9650SE_9690SA_firmware_beta_fw4.10.00.016.zip
>>> 9650SE_9690SA_firmware_beta_fw_4.10.00.019.zip <- latest
>>
>> Hum well, this is beta, I'd rather stick to released firmware :) His
>> firmware is actually the latest stable version for the 9650.
>
> Yes, but it specifically addresses issue with resets:
I think I'll stick with replacing the disks with enterprise versions.
There might be other reasons why 3ware/LSI still has those firmware
fixes in beta.
Best regards
Erik Gulliksson
* Re: XFS corruption on 3ware RAID6-volume
2011-02-23 15:29 ` Justin Piszcz
2011-02-23 15:37 ` Emmanuel Florac
@ 2011-02-24 10:35 ` Erik Gulliksson
2011-02-24 11:01 ` Justin Piszcz
2011-02-24 12:14 ` Emmanuel Florac
1 sibling, 2 replies; 13+ messages in thread
From: Erik Gulliksson @ 2011-02-24 10:35 UTC (permalink / raw)
To: Justin Piszcz; +Cc: xfs
Hi again Justin,
> This would seem to be the problem; you should go with Hitachi next time.
> You can use regular non-enterprise drives (Hitachi) and they just work.
> Seagate is a question mark.
> Samsung is a question mark.
> WD needs TLER.
This is valuable information; we might consider Hitachi disks next
time if we want to go cheap again.
> As far as the problem at hand, I do not know of a good way to fix it unless
> you had ls -lRi /raid_array output so you could map the inodes to their
> original locations. Sorry don't have a better answer..
No, I don't have full filename-to-inode mappings from before the
corruption. I guess such a list from the filesystem mounted with -o
"ro,norecovery" won't help here?
If the journal is corrupt, is there any way to salvage only that part
so that the log replay will proceed further (i.e., something like an
"xfs_log_repair")? I'm not nearly expert enough to parse the stack
trace triggered by mounting the filesystem, but it seems some calls to
xlog_recover_* functions are involved, which makes me think that the
log is corrupt.
Best regards
Erik Gulliksson
* Re: XFS corruption on 3ware RAID6-volume
2011-02-24 10:35 ` Erik Gulliksson
@ 2011-02-24 11:01 ` Justin Piszcz
2011-02-24 12:14 ` Emmanuel Florac
1 sibling, 0 replies; 13+ messages in thread
From: Justin Piszcz @ 2011-02-24 11:01 UTC (permalink / raw)
To: Erik Gulliksson; +Cc: xfs
On Thu, 24 Feb 2011, Erik Gulliksson wrote:
> Hi again Justin,
>
>> This would seem to be the problem; you should go with Hitachi next time.
>> You can use regular non-enterprise drives (Hitachi) and they just work.
>> Seagate is a question mark.
>> Samsung is a question mark.
>> WD needs TLER.
>
> This is valuable information, we might consider Hitatchi-disks next
> time if we want to play cheap again.
>
>
>> As far as the problem at hand, I do not know of a good way to fix it unless
>> you had ls -lRi /raid_array output so you could map the inodes to their
>> original locations. Sorry don't have a better answer..
>
> No, I don't have full filename-inode mappings from before the
> corruption. I guess such a list from the filesystem mounted with -o
> "ro,norecovery" won't help here?
>
> If the journal is corrupt, is there anyway to salvage only that part
> so that the log replay will proceed further (ie "xfs_log_repair")? I'm
> not nearly expert enough to parse the stacktrace triggered by mounting
> the filesystem, but it seems some calls to xlog_recover_-functions are
> involved, which make me think that the log is corrupt.
This is probably best answered by an XFS expert.
Could you also show "tw_cli /c0 show diag"?
Justin.
* Re: XFS corruption on 3ware RAID6-volume
2011-02-24 10:35 ` Erik Gulliksson
2011-02-24 11:01 ` Justin Piszcz
@ 2011-02-24 12:14 ` Emmanuel Florac
1 sibling, 0 replies; 13+ messages in thread
From: Emmanuel Florac @ 2011-02-24 12:14 UTC (permalink / raw)
To: Erik Gulliksson; +Cc: xfs
On Thu, 24 Feb 2011 11:35:30 +0100
Erik Gulliksson <erik@gulliksson.org> wrote:
> If the journal is corrupt, is there anyway to salvage only that part
> so that the log replay will proceed further (ie "xfs_log_repair")?
Your best bet is to try booting a bleeding-edge kernel with the very
latest XFS version. Log recovery could possibly get somewhat further
(that is, if the log is not complete garbage).
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------