XFS data corruption with high I/O even on hardware raid

* XFS data corruption with high I/O even on hardware raid
@ 2010-01-14  1:11 Steve Costaras
  2010-01-14  2:24 ` Dave Chinner
  2010-01-14  9:08 ` XFS data corruption with high I/O even on " Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Steve Costaras @ 2010-01-14  1:11 UTC (permalink / raw)
  To: xfs

[-- Attachment #1.1: Type: text/plain, Size: 12978 bytes --]

Ok, I've been seeing a problem here since had to move over to XFS from 
JFS due to file system size issues.   I am seeing XFS Data corruption 
under ?heavy io?   Basically, what happens is that under heavy load 
(i.e. if I'm doing say a xfs_fsr (which nearly always triggers the 
freeze issue) on a volume the system hovers around 90% utilization for 
the dm device for a while (sometimes an hour+, sometimes minutes) the 
subsystem goes into 100% utilization and then freezes solid forcing me 
to do a hard reboot of the box.  When coming back up generally the XFS 
volumes are really screwed up (see below).    Areca cards all have BBU's 
and the only write cache is on the BBU (drive cache disabled).   Systems 
are all UPS protected as well.    These freezes have happened too 
frequently and unfortunatly nothing is logged anywhere.    It's not 
worth doing a repair as the amount of corruption is too extensive so 
requires a complete restore from backup.    I just mention xfs_fsr here 
as that /seems/ to generate an I/O pattern that nearly always results in 
a freeze.   I have done it with other high-i/o functions though not as 
reliably.

I don't know what else can be done to remove this issue (and not really 
sure it's really directly related to XFS, as LVM and the areca driver 
are also involved) however the main result is that XFS gets really 
screwed up.   I did NOT have these issues w/ JFS (same subsystem lvm + 
areca set up so it /seems/ to point to XFS or at least it's tied in 
there somewhere) unfortunately JFS has issues with file systems larger 
than 32TiB so the only file system I can use is XFS.

Since I'm using hardware raid w/ BBU when I reboot and it comes back up 
the raid controller writes out to the drives any outstanding data in 
it's cache and from the hardware point of view (as well as lvm's point 
of view) the array is ok.    The file system however generally can't be 
mounted (about 4 out of 5 times, some times it does get auto-mounted but 
when I then run an xfs_repair -n -v in those cases there are pages of 
errors (badly aligned inode rec, bad starting inode #'s, dubious inode 
btree block headers among others).    When I let a repair actually run 
in one case out of 4,500,000 files it linked about 2,000,000 or so but 
there was no way to identify and verify file integrity.  The others were 
just lost.

This is not limited to large volume sizes I have seen similar on small 
~2TiB file systems as well.  Also when it happened in a couple cases the 
file system that was taking the I/O (say xfs_fsr -v /home ) another XFS 
filesystem on the same system which was NOT taking much if any I/O gets 
badly corrupted (say /var/test ).   Both would be using the same areca 
controllers and same physical discs (same PV's and same VG's but 
different LV's).

Any suggestions on how to isolate or eliminate this would be greatly 
appreciated.

Steve

Technical data is below:
==============
$iostat -m -x 15
(IOSTAT capture right up to a freeze event:)
(system sits here for a long bit hovering around 90% for the DM device 
and about 30% for the the PV's)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     7.80    0.07    2.00     0.00     0.04    
38.19     0.00    2.26   0.97   0.20
sdb             120.07    34.47  253.00  706.67    24.98    28.96   
115.11     1.06    1.10   0.28  26.87
sdc              48.80    28.93  324.73  730.87    24.98    28.94   
104.62     1.19    1.13   0.29  30.60
sdd             121.73    33.13  251.60  700.40    24.99    28.94   
116.01     1.11    1.17   0.29  27.40
sde              49.00    28.60  324.33  731.47    24.99    28.95   
104.65     1.22    1.15   0.26  27.53
sdf             120.27    33.20  253.00  701.00    24.99    28.97   
115.84     1.14    1.20   0.33  31.67
sdg              48.80    29.07  324.73  731.80    25.00    28.95   
104.59     1.37    1.29   0.35  36.93
sdh             120.47    33.47  252.73  702.53    25.00    28.96   
115.68     1.24    1.30   0.35  33.67
sdi              50.73    28.27  322.73  735.13    24.99    29.01   
104.54     1.34    1.26   0.31  32.27
dm-0              0.00     0.00    0.13    0.13     0.00     0.00    
12.00     0.01   25.00  25.00   0.67
dm-1              0.00     0.00 1602.67  992.73   199.93   231.69   
340.59     4.12    1.59   0.34  88.40
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     
0.00     0.00    0.00   0.00   0.00

(Then it jumps up to 99-100% for the majority of devices (here sdf,sdg, 
sdh, sdi are all on the same physical areca card).
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.60   24.71    0.00   74.69

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.07    0.00    1.13     0.00     0.02    
36.71     0.00    1.18   1.18   0.13
sdb               2.07     1.93    8.00   17.00     0.63     0.84   
120.33     0.04    1.49   0.35   0.87
sdc               2.87     1.20    7.40   22.13     0.63     0.83   
101.86     0.04    1.49   0.25   0.73
sdd               2.13     1.80    8.07   17.20     0.63     0.84   
119.64     0.04    1.45   0.32   0.80
sde               2.93     1.07    7.20   21.80     0.63     0.83   
103.65     0.05    1.89   0.34   1.00
sdf               1.93     1.87    8.13   13.67     0.63     0.64   
119.78    46.58    2.35  45.63  99.47
sdg               2.87     1.00    7.13   17.80     0.62     0.64   
104.04    64.12    2.41  39.84  99.33
sdh               2.07     1.67    7.93   13.47     0.62     0.64   
121.22    47.85    2.12  46.39  99.27
sdi               2.93     1.07    7.07   18.47     0.62     0.64   
101.77    62.15    2.32  38.83  99.13
dm-0              0.00     0.00    0.20    0.07     0.00     0.00    
10.00     0.00    2.50   2.50   0.07
dm-1              0.00     0.00   40.20   30.13     5.03     6.68   
340.96    74.73    2.13  14.19  99.80
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     
0.00     0.00    0.00   0.00   0.00

(Then here it hits 100% and the system locks)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.81   24.95    0.00   74.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     8.40    0.00    2.13     0.00     0.04    
39.50     0.00    1.88   0.63   0.13
sdb               0.00     0.00    0.00    0.00     0.00     0.00     
0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     
0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.07     0.00     0.00    
16.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     
0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     
0.00    50.00    0.00   0.00 100.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     
0.00    69.00    0.00   0.00 100.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     
0.00    50.00    0.00   0.00 100.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     
0.00    65.00    0.00   0.00 100.00
dm-0              0.00     0.00    0.00    0.07     0.00     0.00    
16.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     
0.00    85.00    0.00   0.00 100.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     
0.00     0.00    0.00   0.00   0.00

============ (System)
(Ubuntu 8.04.3 LTS):
Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64 
GNU/Linux

--------------
xfs_repair version 2.9.4

============= (modinfo's)
filename:       /lib/modules/2.6.24-26-server/kernel/fs/xfs/xfs.ko
license:        GPL
description:    SGI XFS with ACLs, security attributes, realtime, large 
block/inode numbers, no debug enabled
author:         Silicon Graphics, Inc.
srcversion:     A2E6459B3A4C96355F95E61
depends:
vermagic:       2.6.24-26-server SMP mod_unload
============
filename:       
/lib/modules/2.6.24-26-server/kernel/drivers/scsi/arcmsr/arcmsr.ko
version:        Driver Version 1.20.00.15 2007/08/30
license:        Dual BSD/GPL
description:    ARECA (ARC11xx/12xx/13xx/16xx) SATA/SAS RAID HOST Adapter
author:         Erich Chen <support@areca.com.tw>
srcversion:     38E576EB40C1A58E8B9E007
alias:          pci:v000017D3d00001681sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001680sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001381sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001380sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001280sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001270sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001260sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001230sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001220sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001210sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001202sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001201sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001200sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001170sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001160sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001130sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001120sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001110sv*sd*bc*sc*i*
depends:        scsi_mod
vermagic:       2.6.24-26-server SMP mod_unload
===========
filename:       /lib/modules/2.6.24-26-server/kernel/drivers/md/dm-mod.ko
license:        GPL
author:         Joe Thornber <dm-devel@redhat.com>
description:    device-mapper driver
srcversion:     A7E89E997173E41CB6AAF04
depends:
vermagic:       2.6.24-26-server SMP mod_unload
parm:           major:The major number of the device mapper (uint)
===========

============
mounted with:
/dev/vg_media/lv_ftpshare       /var/ftp        xfs     
defaults,relatime,nobarrier,logbufs=8,logbsize=256k,sunit=256,swidth=2048,inode64,noikeep,largeio,swalloc,allocsize=128k       
0       2

============
XFS info:
meta-data=/dev/mapper/vg_media-lv_ftpshare isize=2048   agcount=41, 
agsize=268435424 blks
          =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=10737418200, imaxpct=1
          =                       sunit=32     swidth=256 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
          =                       sectsz=512   sunit=32 blks, lazy-count=0
realtime =none                   extsz=1048576 blocks=0, rtextents=0

=============
XFS is running on top of LVM:
  --- Logical volume ---
   LV Name                /dev/vg_media/lv_ftpshare
   VG Name                vg_media
   LV UUID                MgEBWv-x9fn-KUoJ-3y5X-snlk-7F9E-A3CiHh
   LV Write Access        read/write
   LV Status              available
   # open                 1
   LV Size                40.00 TB
   Current LE             40960
   Segments               1
   Allocation             inherit
   Read ahead sectors     0
   Block device           254:1

==============
LVM is using as it's base physical volumes 8 hardware raids 
(MediaVol00-70 inclusive):
[  175.320738] ARECA RAID ADAPTER4: FIRMWARE VERSION V1.47 2009-07-16
[  175.336238] scsi4 : Areca SAS Host Adapter RAID Controller( RAID6 
capable)
[  175.336239]  Driver Version 1.20.00.15 2007/08/30
[  175.336387] ACPI: PCI Interrupt 0000:0a:00.0[A] -> GSI 17 (level, 
low) -> IRQ 17
[  175.336395] PCI: Setting latency timer of device 0000:0a:00.0 to 64
[  175.336990] scsi 4:0:0:0: Direct-Access     Areca    BootVol#00       
R001 PQ: 0 ANSI: 5
[  175.337096] scsi 4:0:0:1: Direct-Access     Areca    MediaVol#00      
R001 PQ: 0 ANSI: 5
[  175.337169] scsi 4:0:0:2: Direct-Access     Areca    MediaVol#10      
R001 PQ: 0 ANSI: 5
[  175.337240] scsi 4:0:0:3: Direct-Access     Areca    MediaVol#20      
R001 PQ: 0 ANSI: 5
[  175.337312] scsi 4:0:0:4: Direct-Access     Areca    MediaVol#30      
R001 PQ: 0 ANSI: 5
[  175.337907] scsi 4:0:16:0: Processor         Areca    RAID 
controller  R001 PQ: 0 ANSI: 0
[  175.356231] ARECA RAID ADAPTER5: FIRMWARE VERSION V1.47 2009-10-22
[  175.376144] scsi5 : Areca SAS Host Adapter RAID Controller( RAID6 
capable)
[  175.376145]  Driver Version 1.20.00.15 2007/08/30
[  175.377354] scsi 5:0:0:5: Direct-Access     Areca    MediaVol#40      
R001 PQ: 0 ANSI: 5
[  175.377434] scsi 5:0:0:6: Direct-Access     Areca    MediaVol#50      
R001 PQ: 0 ANSI: 5
[  175.377495] scsi 5:0:0:7: Direct-Access     Areca    MediaVol#60      
R001 PQ: 0 ANSI: 5
[  175.377587] scsi 5:0:1:0: Direct-Access     Areca    MediaVol#70      
R001 PQ: 0 ANSI: 5
[  175.378156] scsi 5:0:16:0: Processor         Areca    RAID 
controller  R001 PQ: 0 ANSI: 0

=================

[-- Attachment #1.2: Type: text/html, Size: 26553 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread