* Repeated XFS corruption on RAID-10 on Adaptec 51245
@ 2012-06-08 9:45 Christian J. Dietrich
From: Christian J. Dietrich @ 2012-06-08 9:45 UTC (permalink / raw)
To: xfs
Hey all,
I am having problems with an XFS volume. After discovering the message
"kernel: XFS (sda3): corrupt inode 3714097 (bad size 16437 for local
fork, size = 60)",
I ran xfs_repair /dev/sda3 (with /dev/sda3 unmounted). It reported
having fixed some errors.
However, after a while of normal operation, another XFS corruption
occurred on /dev/sda3. I noticed that repeated runs of xfs_repair
always report and fix new errors, even if the volume is not mounted in
between, e.g., "rebuilding directory inode XXX" with different (new)
values of XXX.
/dev/sda is a 12 TB RAID-10 volume on an Adaptec 51245 controller. All
disks are online and none is reported faulty.
Naively, I would assume that a single run of xfs_repair would fix all
errors. My guess is that the underlying RAID volume (Adaptec 51245
RAID 10) is somehow inconsistent (although I cannot find any indicators
confirming this). Any suggestions?
I am running CentOS 6.2 (=RHEL 6.2) with kernel
2.6.32-220.17.1.el6.x86_64 (most recent) and all OS updates installed.
Controller Firmware is the most recent (18948), driver version is 1.1-5.
HDDs are 2x WD2001FASS, 10x WD2002FAEX.
Thanks in advance,
Chris
--
Christian J. Dietrich
Institute for Internet Security - if(is)
Westfälische Hochschule University of Applied Sciences
https://www.internet-sicherheit.de
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Repeated XFS corruption on RAID-10 on Adaptec 51245
From: Emmanuel Florac @ 2012-06-08 10:51 UTC (permalink / raw)
To: Christian J. Dietrich; +Cc: xfs
On Fri, 08 Jun 2012 11:45:28 +0200,
"Christian J. Dietrich" <dietrich@internet-sicherheit.de> wrote:
> I am running CentOS 6.2 (=RHEL 6.2) with kernel
> 2.6.32-220.17.1.el6.x86_64 (most recent) and all OS updates installed.
> Controller Firmware is the most recent (18948), driver version is
> 1.1-5. HDDs are 2x WD2001FASS, 10x WD2002FAEX.
>
I currently manage many servers (about 50) with Adaptec 5xx5 RAID
cards. The only case of data corruption I've encountered was with WD
drives. Therefore it's most probably related to the WD drives. WD
desktop drives are well known for being (deliberately) crippled for
RAID operation. They will almost always create all sorts of weird
problems under high I/O load.
It is of utmost importance that you at least set TLER to the correct
mode on the drives. Beware that newer WD drives apparently no longer
allow setting TLER at all (hence "crippled").
http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery
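For drives that still honor it, the error-recovery timeout can be inspected and set from Linux with smartmontools; a minimal sketch, assuming the drive is visible to smartctl (behind a RAID controller a passthrough option such as -d sat may be needed, and the device name is illustrative):

```shell
# Query the current SCT Error Recovery Control (a.k.a. TLER) setting;
# drives that support it report the read/write timeouts in deciseconds.
smartctl -l scterc /dev/sda

# Set a 7-second timeout (70 deciseconds) for both reads and writes,
# a value commonly used for drives behind hardware RAID.
smartctl -l scterc,70,70 /dev/sda
```

Note that on many drives this setting is volatile and has to be reapplied after every power cycle, e.g. from a boot script.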
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Repeated XFS corruption on RAID-10 on Adaptec 51245
From: Christian J. Dietrich @ 2012-06-08 19:36 UTC (permalink / raw)
To: xfs
Emmanuel,
On 08.06.2012 12:51, Emmanuel Florac wrote:
> "Christian J. Dietrich" <dietrich@internet-sicherheit.de> wrote:
>
>> I am running CentOS 6.2 (=RHEL 6.2) with kernel
>> 2.6.32-220.17.1.el6.x86_64 (most recent) and all OS updates installed.
>> Controller Firmware is the most recent (18948), driver version is
>> 1.1-5. HDDs are 2x WD2001FASS, 10x WD2002FAEX.
>>
>
> I currently manage many servers (about 50) with Adaptec 5xx5 RAID
> cards. The only case of data corruption I've encountered was with WD
> drives.
>
> Therefore it's most probably related to the WD drives. WD desktop
> drives are well known for being (deliberately) crippled for RAID
> operation. They will almost always create all sorts of weird problems
> under high I/O load.
>
> It is of utmost importance that you at least set TLER to the correct
> mode on the drives. Beware that newer WD drives apparently no longer
> allow setting TLER at all (hence "crippled").
>
> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery
Indeed, it does seem to be related to the WD HDDs; I am in contact
with Adaptec support and will dig deeper. A couple of disks show
comparatively high CommandAbort counts.
I will probably rebuild the RAID volume using disks with proper TLER
support and activate TLER.
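For reference, the controller-side per-drive error counters can be read with Adaptec's arcconf CLI; a rough sketch, assuming arcconf is installed and the controller is number 1 (the controller number and exact output fields vary by setup and firmware):

```shell
# Print the physical-device section of the controller configuration;
# per-drive counters (aborted commands, medium errors, link failures)
# appear here on most firmware revisions.
arcconf getconfig 1 pd

# Dump the controller's device log, which records per-drive error events.
arcconf getlogs 1 device
```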
Thanks for your help,
Chris
--
Christian J. Dietrich
Institute for Internet Security - if(is)
Westfälische Hochschule University of Applied Sciences
https://www.internet-sicherheit.de
* Re: Repeated XFS corruption on RAID-10 on Adaptec 51245
From: Emmanuel Florac @ 2012-06-09 13:23 UTC (permalink / raw)
To: Christian J. Dietrich; +Cc: xfs
On Fri, 08 Jun 2012 21:36:58 +0200, you wrote:
> Indeed, it does seem to be related to the WD HDDs; I am in contact
> with Adaptec support and will dig deeper. A couple of disks show
> comparatively high CommandAbort counts.
I'm afraid there's not much they can do, but let us know if they come
back with any constructive suggestions.
> I will probably rebuild the RAID volume using disks with proper TLER
> support and activate TLER.
I suppose you could turn off the system, connect the drives to a PC,
and run the TLER tool; after that it should work better without even
rebuilding the array.
Good luck,
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Repeated XFS corruption on RAID-10 on Adaptec 51245
From: Stan Hoeppner @ 2012-06-10 3:58 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Christian J. Dietrich, xfs
On 6/9/2012 8:23 AM, Emmanuel Florac wrote:
> On Fri, 08 Jun 2012 21:36:58 +0200, you wrote:
>
>> Indeed, it does seem to be related to the WD HDDs; I am in contact
>> with Adaptec support and will dig deeper. A couple of disks show
>> comparatively high CommandAbort counts.
>
> I'm afraid there's not much they can do, but let us know if they come
> back with any constructive suggestion.
>
>> I will probably rebuild the RAID volume using disks with proper TLER
>> support and activate TLER.
There's nothing to activate. If a drive has TLER/ERC/etc. (any SAS or
enterprise SATA drive), it's enabled at the factory.
> I suppose you may turn off the system, connect the drives to some PC
> and run the TLER tool; after that it should work better without even
> rebuilding the array.
WD's TLER utility (wdtler, not wdidle, which sets the idle timer) will
only change the TLER timeout on firmware that allows it, which in
practice means older drives. Many, if not most, of WD's newer consumer
drives apparently do not allow this.
Worth noting: the price premium of the 1TB RE4 over the 1TB Black is
currently ZERO at Newegg; both drives are $120. There is no premium for
TLER and the other enterprise features.
The 2TB RE4 is $230 and the 2TB Black is $210, a $20 premium for the
enterprise features.
I didn't compare against the WD Green drives because they are squarely
low-performance consumer junk, designed NOT to be accessed more than TO
be accessed. WD's anticipated mode of operation for such drives is to
sit idle with the heads parked over 99% of the time. This simply is not
suitable for RAID use, and WD boldly tells us so. Many ignore the
warning and then shed tears when their arrays die...
At Newegg's current prices, and likely other vendors' as well, there is
no meaningful price premium for a WD 7.2k enterprise SATA drive over a
WD 7.2k performance-oriented consumer drive. Thus, if the preference is
for WD gear, no sysadmin has a valid excuse for not buying the RE-series
drives.
--
Stan
* Re: Repeated XFS corruption on RAID-10 on Adaptec 51245
From: Eric Sandeen @ 2012-06-08 13:56 UTC (permalink / raw)
To: Christian J. Dietrich; +Cc: xfs
On 6/8/12 4:45 AM, Christian J. Dietrich wrote:
>
> Hey all,
>
> I am having problems with an XFS volume. After discovering the message
> "kernel: XFS (sda3): corrupt inode 3714097 (bad size 16437 for local
> fork, size = 60)",
> I ran xfs_repair /dev/sda3 (with /dev/sda3 unmounted). It reported
> having fixed some errors.
> However, after a while of normal operation, another XFS corruption
> occurred on /dev/sda3. I noticed that repeated runs of xfs_repair
> always report and fix new errors, even if the volume is not mounted in
> between, e.g., "rebuilding directory inode XXX" with different (new)
> values of XXX.
>
> /dev/sda is a 12 TB RAID-10 volume on an Adaptec 51245 controller. All
> disks are online and none is reported faulty.
>
> Naively, I would assume that a single run of xfs_repair would fix all
> errors. My guess is that the underlying RAID volume (Adaptec 51245
> RAID 10) is somehow inconsistent (although I cannot find any
> indicators confirming this). Any suggestions?
If you suspect a problem with repair, you can try:
# umount /dev/sda3
# xfs_metadump -o /dev/sda3 - | xfs_mdrestore - filesystem.img
# xfs_repair filesystem.img
# xfs_repair filesystem.img
The image shouldn't take too much space, but 12T might take a little while
to dump.
If repair doesn't fix everything the first time please let us know.
> I am running CentOS 6.2 (=RHEL 6.2) with kernel
> 2.6.32-220.17.1.el6.x86_64 (most recent) and all OS updates installed.
Please make sure that you aren't running the old kmod-xfs (or was it
xfs-kmod?) RPM.
-Eric
> Controller Firmware is the most recent (18948), driver version is 1.1-5.
> HDDs are 2x WD2001FASS, 10x WD2002FAEX.
>
> Thanks in advance,
> Chris
>
* Re: Repeated XFS corruption on RAID-10 on Adaptec 51245
From: Christian J. Dietrich @ 2012-06-08 19:33 UTC (permalink / raw)
To: xfs
Eric,
thanks for your suggestions.
On 08.06.2012 15:56, Eric Sandeen wrote:
>> Naively, I would assume that a single run of xfs_repair would fix
>> all errors. My guess is that the underlying RAID volume (Adaptec
>> 51245 RAID 10) is somehow inconsistent (although I cannot find any
>> indicators confirming this). Any suggestions?
>
> If you suspect a problem with repair, you can try:
>
> # umount /dev/sda3
> # xfs_metadump -o /dev/sda3 - | xfs_mdrestore - filesystem.img
> # xfs_repair filesystem.img
> # xfs_repair filesystem.img
>
> The image shouldn't take too much space, but 12T might take a little while
> to dump.
> If repair doesn't fix everything the first time please let us know.
Thanks, I can confirm that xfs_repair fixes everything on the first run.
> please make sure that you aren't running the old kmod-xfs (or was it
> xfs-kmod?) rpm.
No, I am using the default driver that comes with the distribution.
Thanks again. I now consider the problem to be related to the
combination of controller and HDDs (and no longer to XFS) and will dig
deeper in that direction.
Chris
--
Christian J. Dietrich
Institute for Internet Security - if(is)
Westfälische Hochschule University of Applied Sciences
https://www.internet-sicherheit.de
Thread overview: 7+ messages
2012-06-08 9:45 Repeated XFS corruption on RAID-10 on Adaptec 51245 Christian J. Dietrich
2012-06-08 10:51 ` Emmanuel Florac
2012-06-08 19:36 ` Christian J. Dietrich
2012-06-09 13:23 ` Emmanuel Florac
2012-06-10 3:58 ` Stan Hoeppner
2012-06-08 13:56 ` Eric Sandeen
2012-06-08 19:33 ` Christian J. Dietrich