* XFS data corruption with high I/O even on hardware raid
@ 2010-01-14 1:11 Steve Costaras
2010-01-14 2:24 ` Dave Chinner
2010-01-14 9:08 ` XFS data corruption with high I/O even on " Andi Kleen
0 siblings, 2 replies; 9+ messages in thread
From: Steve Costaras @ 2010-01-14 1:11 UTC (permalink / raw)
To: xfs
Ok, I've been seeing a problem here since I had to move over to XFS from
JFS due to file system size issues. I am seeing XFS data corruption
under heavy I/O. Basically, what happens is that under heavy load
(e.g. if I'm running an xfs_fsr, which nearly always triggers the
freeze issue) on a volume, the system hovers around 90% utilization for
the dm device for a while (sometimes an hour+, sometimes minutes), then the
subsystem goes to 100% utilization and freezes solid, forcing me
to do a hard reboot of the box. When it comes back up, the XFS
volumes are generally really screwed up (see below). The Areca cards all have BBUs
and the only write cache is on the BBU (drive cache disabled). The systems
are all UPS protected as well. These freezes have happened too
frequently, and unfortunately nothing is logged anywhere. It's not
worth doing a repair as the amount of corruption is too extensive, so
it requires a complete restore from backup. I just mention xfs_fsr here
as that /seems/ to generate an I/O pattern that nearly always results in
a freeze; I have triggered it with other high-I/O workloads as well, though not as
reliably.
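(For reference, the sort of thing that reproduces this here is simply kicking
off a defrag pass against one of the volumes and watching device utilization
alongside it; the mount point below is just an example:)

# xfs_fsr -v /var/ftp &
# iostat -m -x 15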
I don't know what else can be done to eliminate this issue (and I'm not really
sure it's directly related to XFS, as LVM and the Areca driver
are also involved); however, the main result is that XFS gets really
screwed up. I did NOT have these issues with JFS (same LVM +
Areca subsystem setup, so it /seems/ to point to XFS, or at least it's tied in
there somewhere). Unfortunately JFS has issues with file systems larger
than 32TiB, so the only file system I can use is XFS.
Since I'm using hardware RAID with BBU, when I reboot and the system comes back up
the RAID controller writes out to the drives any outstanding data in
its cache, and from the hardware point of view (as well as LVM's point
of view) the array is OK. The file system, however, generally can't be
mounted (about 4 out of 5 times; sometimes it does get auto-mounted, but
when I then run an xfs_repair -n -v in those cases there are pages of
errors: badly aligned inode rec, bad starting inode #'s, dubious inode
btree block headers, among others). When I let a repair actually run,
in one case out of 4,500,000 files it relinked about 2,000,000 or so, but
there was no way to identify and verify file integrity. The others were
just lost.
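(For reference, the check/repair sequence I'm describing is roughly the
following; the LV path is the one from the setup data further down, and this
is only a sketch:)

# umount /var/ftp
# xfs_repair -n -v /dev/vg_media/lv_ftpshare    (dry run, only reports the damage)
# xfs_repair -v /dev/vg_media/lv_ftpshare       (actual repair, last resort before the restore from backup)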
This is not limited to large volume sizes; I have seen similar corruption on small
~2TiB file systems as well. Also, in a couple of cases when it happened, in
addition to the file system that was taking the I/O (say, xfs_fsr -v /home), another XFS
filesystem on the same system which was NOT taking much if any I/O got
badly corrupted (say, /var/test). Both would be using the same Areca
controllers and the same physical discs (same PVs and same VGs, but
different LVs).
Any suggestions on how to isolate or eliminate this would be greatly
appreciated.
Steve
Technical data is below:
==============
$iostat -m -x 15
(IOSTAT capture right up to a freeze event:)
(System sits here for a long bit hovering around 90% for the DM device
and about 30% for the PVs.)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await svctm %util
sda 0.00 7.80 0.07 2.00 0.00 0.04
38.19 0.00 2.26 0.97 0.20
sdb 120.07 34.47 253.00 706.67 24.98 28.96
115.11 1.06 1.10 0.28 26.87
sdc 48.80 28.93 324.73 730.87 24.98 28.94
104.62 1.19 1.13 0.29 30.60
sdd 121.73 33.13 251.60 700.40 24.99 28.94
116.01 1.11 1.17 0.29 27.40
sde 49.00 28.60 324.33 731.47 24.99 28.95
104.65 1.22 1.15 0.26 27.53
sdf 120.27 33.20 253.00 701.00 24.99 28.97
115.84 1.14 1.20 0.33 31.67
sdg 48.80 29.07 324.73 731.80 25.00 28.95
104.59 1.37 1.29 0.35 36.93
sdh 120.47 33.47 252.73 702.53 25.00 28.96
115.68 1.24 1.30 0.35 33.67
sdi 50.73 28.27 322.73 735.13 24.99 29.01
104.54 1.34 1.26 0.31 32.27
dm-0 0.00 0.00 0.13 0.13 0.00 0.00
12.00 0.01 25.00 25.00 0.67
dm-1 0.00 0.00 1602.67 992.73 199.93 231.69
340.59 4.12 1.59 0.34 88.40
dm-2 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
(Then it jumps up to 99-100% for the majority of devices; here sdf, sdg,
sdh, and sdi are all on the same physical Areca card.)
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.60 24.71 0.00 74.69
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await svctm %util
sda 0.00 4.07 0.00 1.13 0.00 0.02
36.71 0.00 1.18 1.18 0.13
sdb 2.07 1.93 8.00 17.00 0.63 0.84
120.33 0.04 1.49 0.35 0.87
sdc 2.87 1.20 7.40 22.13 0.63 0.83
101.86 0.04 1.49 0.25 0.73
sdd 2.13 1.80 8.07 17.20 0.63 0.84
119.64 0.04 1.45 0.32 0.80
sde 2.93 1.07 7.20 21.80 0.63 0.83
103.65 0.05 1.89 0.34 1.00
sdf 1.93 1.87 8.13 13.67 0.63 0.64
119.78 46.58 2.35 45.63 99.47
sdg 2.87 1.00 7.13 17.80 0.62 0.64
104.04 64.12 2.41 39.84 99.33
sdh 2.07 1.67 7.93 13.47 0.62 0.64
121.22 47.85 2.12 46.39 99.27
sdi 2.93 1.07 7.07 18.47 0.62 0.64
101.77 62.15 2.32 38.83 99.13
dm-0 0.00 0.00 0.20 0.07 0.00 0.00
10.00 0.00 2.50 2.50 0.07
dm-1 0.00 0.00 40.20 30.13 5.03 6.68
340.96 74.73 2.13 14.19 99.80
dm-2 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
(Then here it hits 100% and the system locks)
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.81 24.95 0.00 74.24
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await svctm %util
sda 0.00 8.40 0.00 2.13 0.00 0.04
39.50 0.00 1.88 0.63 0.13
sdb 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.07 0.00 0.00
16.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00
0.00 50.00 0.00 0.00 100.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00
0.00 69.00 0.00 0.00 100.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00
0.00 50.00 0.00 0.00 100.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00
0.00 65.00 0.00 0.00 100.00
dm-0 0.00 0.00 0.00 0.07 0.00 0.00
16.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 85.00 0.00 0.00 100.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
============ (System)
(Ubuntu 8.04.3 LTS):
Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64
GNU/Linux
--------------
xfs_repair version 2.9.4
============= (modinfo's)
filename: /lib/modules/2.6.24-26-server/kernel/fs/xfs/xfs.ko
license: GPL
description: SGI XFS with ACLs, security attributes, realtime, large
block/inode numbers, no debug enabled
author: Silicon Graphics, Inc.
srcversion: A2E6459B3A4C96355F95E61
depends:
vermagic: 2.6.24-26-server SMP mod_unload
============
filename:
/lib/modules/2.6.24-26-server/kernel/drivers/scsi/arcmsr/arcmsr.ko
version: Driver Version 1.20.00.15 2007/08/30
license: Dual BSD/GPL
description: ARECA (ARC11xx/12xx/13xx/16xx) SATA/SAS RAID HOST Adapter
author: Erich Chen <support@areca.com.tw>
srcversion: 38E576EB40C1A58E8B9E007
alias: pci:v000017D3d00001681sv*sd*bc*sc*i*
alias: pci:v000017D3d00001680sv*sd*bc*sc*i*
alias: pci:v000017D3d00001381sv*sd*bc*sc*i*
alias: pci:v000017D3d00001380sv*sd*bc*sc*i*
alias: pci:v000017D3d00001280sv*sd*bc*sc*i*
alias: pci:v000017D3d00001270sv*sd*bc*sc*i*
alias: pci:v000017D3d00001260sv*sd*bc*sc*i*
alias: pci:v000017D3d00001230sv*sd*bc*sc*i*
alias: pci:v000017D3d00001220sv*sd*bc*sc*i*
alias: pci:v000017D3d00001210sv*sd*bc*sc*i*
alias: pci:v000017D3d00001202sv*sd*bc*sc*i*
alias: pci:v000017D3d00001201sv*sd*bc*sc*i*
alias: pci:v000017D3d00001200sv*sd*bc*sc*i*
alias: pci:v000017D3d00001170sv*sd*bc*sc*i*
alias: pci:v000017D3d00001160sv*sd*bc*sc*i*
alias: pci:v000017D3d00001130sv*sd*bc*sc*i*
alias: pci:v000017D3d00001120sv*sd*bc*sc*i*
alias: pci:v000017D3d00001110sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.24-26-server SMP mod_unload
===========
filename: /lib/modules/2.6.24-26-server/kernel/drivers/md/dm-mod.ko
license: GPL
author: Joe Thornber <dm-devel@redhat.com>
description: device-mapper driver
srcversion: A7E89E997173E41CB6AAF04
depends:
vermagic: 2.6.24-26-server SMP mod_unload
parm: major:The major number of the device mapper (uint)
===========
============
mounted with:
/dev/vg_media/lv_ftpshare /var/ftp xfs defaults,relatime,nobarrier,logbufs=8,logbsize=256k,sunit=256,swidth=2048,inode64,noikeep,largeio,swalloc,allocsize=128k 0 2
============
XFS info:
meta-data=/dev/mapper/vg_media-lv_ftpshare isize=2048 agcount=41, agsize=268435424 blks
= sectsz=512 attr=0
data = bsize=4096 blocks=10737418200, imaxpct=1
= sunit=32 swidth=256 blks, unwritten=1
naming =version 2 bsize=4096
log =internal bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=32 blks, lazy-count=0
realtime =none extsz=1048576 blocks=0, rtextents=0
=============
XFS is running on top of LVM:
--- Logical volume ---
LV Name /dev/vg_media/lv_ftpshare
VG Name vg_media
LV UUID MgEBWv-x9fn-KUoJ-3y5X-snlk-7F9E-A3CiHh
LV Write Access read/write
LV Status available
# open 1
LV Size 40.00 TB
Current LE 40960
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 254:1
==============
LVM is using as its base physical volumes 8 hardware RAID volumes
(MediaVol00-70 inclusive):
[ 175.320738] ARECA RAID ADAPTER4: FIRMWARE VERSION V1.47 2009-07-16
[ 175.336238] scsi4 : Areca SAS Host Adapter RAID Controller( RAID6
capable)
[ 175.336239] Driver Version 1.20.00.15 2007/08/30
[ 175.336387] ACPI: PCI Interrupt 0000:0a:00.0[A] -> GSI 17 (level,
low) -> IRQ 17
[ 175.336395] PCI: Setting latency timer of device 0000:0a:00.0 to 64
[ 175.336990] scsi 4:0:0:0: Direct-Access Areca BootVol#00
R001 PQ: 0 ANSI: 5
[ 175.337096] scsi 4:0:0:1: Direct-Access Areca MediaVol#00
R001 PQ: 0 ANSI: 5
[ 175.337169] scsi 4:0:0:2: Direct-Access Areca MediaVol#10
R001 PQ: 0 ANSI: 5
[ 175.337240] scsi 4:0:0:3: Direct-Access Areca MediaVol#20
R001 PQ: 0 ANSI: 5
[ 175.337312] scsi 4:0:0:4: Direct-Access Areca MediaVol#30
R001 PQ: 0 ANSI: 5
[ 175.337907] scsi 4:0:16:0: Processor Areca RAID
controller R001 PQ: 0 ANSI: 0
[ 175.356231] ARECA RAID ADAPTER5: FIRMWARE VERSION V1.47 2009-10-22
[ 175.376144] scsi5 : Areca SAS Host Adapter RAID Controller( RAID6
capable)
[ 175.376145] Driver Version 1.20.00.15 2007/08/30
[ 175.377354] scsi 5:0:0:5: Direct-Access Areca MediaVol#40
R001 PQ: 0 ANSI: 5
[ 175.377434] scsi 5:0:0:6: Direct-Access Areca MediaVol#50
R001 PQ: 0 ANSI: 5
[ 175.377495] scsi 5:0:0:7: Direct-Access Areca MediaVol#60
R001 PQ: 0 ANSI: 5
[ 175.377587] scsi 5:0:1:0: Direct-Access Areca MediaVol#70
R001 PQ: 0 ANSI: 5
[ 175.378156] scsi 5:0:16:0: Processor Areca RAID
controller R001 PQ: 0 ANSI: 0
=================
* Re: XFS data corruption with high I/O even on hardware raid
2010-01-14 1:11 XFS data corruption with high I/O even on hardware raid Steve Costaras
@ 2010-01-14 2:24 ` Dave Chinner
2010-01-14 2:33 ` Steve Costaras
2010-01-15 0:52 ` XFS data corruption with high I/O even on Areca " Steve Costaras
2010-01-14 9:08 ` XFS data corruption with high I/O even on " Andi Kleen
1 sibling, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2010-01-14 2:24 UTC (permalink / raw)
To: Steve Costaras; +Cc: xfs
On Wed, Jan 13, 2010 at 07:11:27PM -0600, Steve Costaras wrote:
> Ok, I've been seeing a problem here since had to move over to XFS from
> JFS due to file system size issues. I am seeing XFS Data corruption
> under heavy I/O. Basically, what happens is that under heavy load
> (i.e. if I'm doing say a xfs_fsr (which nearly always triggers the
> freeze issue) on a volume the system hovers around 90% utilization for
> the dm device for a while (sometimes an hour+, sometimes minutes) the
> subsystem goes into 100% utilization and then freezes solid forcing me
> to do a hard reboot of the box.
xfs_fsr can cause a *large* amount of IO to be done, so it is no
surprise that it can trigger high-load bugs in hardware and
software. XFS can trigger high-load problems on hardware more
readily than other filesystems because, using direct IO (like xfs_fsr
does), it can push far, far higher throughput to the storage subsystem
than any other Linux filesystem can.
The fact that the IO subsystem is freezing at 100% elevator queue
utilisation points to an IO never completing. This immediately makes
me point a finger at either the RAID hardware or the driver - a bug
in XFS is highly unlikely to cause this symptom as those stats are
generated at layers lower than XFS.
Next time you get a freeze, the output of:
# echo w > /proc/sysrq-trigger
will tell us what the system is waiting on (i.e. why it is stuck).
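(The dump lands in the kernel log, so assuming the box stays responsive long
enough, something like the following - file name just an example - will save
it for posting:)

# echo w > /proc/sysrq-trigger
# dmesg > /tmp/blocked-tasks.log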
...
> Since I'm using hardware raid w/ BBU when I reboot and it comes back up
> the raid controller writes out to the drives any outstanding data in
> it's cache and from the hardware point of view (as well as lvm's point
> of view) the array is ok. The file system however generally can't be
> mounted (about 4 out of 5 times, some times it does get auto-mounted but
> when I then run an xfs_repair -n -v in those cases there are pages of
> errors (badly aligned inode rec, bad starting inode #'s, dubious inode
> btree block headers among others). When I let a repair actually run
> in one case out of 4,500,000 files it linked about 2,000,000 or so but
> there was no way to identify and verify file integrity. The others were
> just lost.
>
> This is not limited to large volume sizes I have seen similar on small
> ~2TiB file systems as well. Also when it happened in a couple cases the
> file system that was taking the I/O (say xfs_fsr -v /home ) another XFS
> filesystem on the same system which was NOT taking much if any I/O gets
> badly corrupted (say /var/test ). Both would be using the same areca
> controllers and same physical discs (same PV's and same VG's but
> different LV's).
These symptoms really point to a problem outside XFS - the only time
I've seen this sort of behaviour is on buggy hardware. The
cross-volume corruption is the smoking gun, but proving it is damn
near impossible without expensive lab equipment and a lot of time.
> Any suggestions on how to isolate or eliminate this would be greatly
> appreciated.
I'd start by not running xfs_fsr as a short term workaround to keep
the load below the problem threshold.
Looking at the iostat output - the volumes sd[f-i] all lock up at
100% utilisation at the same time. Then looking at this:
> ==============
> LVM is using as it's base physical volumes 8 hardware raids
> (MediaVol00-70 inclusive):
> [ 175.320738] ARECA RAID ADAPTER4: FIRMWARE VERSION V1.47 2009-07-16
> [ 175.336238] scsi4 : Areca SAS Host Adapter RAID Controller( RAID6
> capable)
....
> [ 175.356231] ARECA RAID ADAPTER5: FIRMWARE VERSION V1.47 2009-10-22
> [ 175.376144] scsi5 : Areca SAS Host Adapter RAID Controller( RAID6
> capable)
You've got 4 luns on each controller, and it looks like all the luns
on one controller have locked up. Everything is pointing at the
raid controller as being the problem....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS data corruption with high I/O even on hardware raid
2010-01-14 2:24 ` Dave Chinner
@ 2010-01-14 2:33 ` Steve Costaras
2010-01-15 0:52 ` XFS data corruption with high I/O even on Areca " Steve Costaras
1 sibling, 0 replies; 9+ messages in thread
From: Steve Costaras @ 2010-01-14 2:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 01/13/2010 20:24, Dave Chinner wrote:
> On Wed, Jan 13, 2010 at 07:11:27PM -0600, Steve Costaras wrote:
>
>> Ok, I've been seeing a problem here since had to move over to XFS from
>> JFS due to file system size issues. I am seeing XFS Data corruption
>> under heavy I/O. Basically, what happens is that under heavy load
>> (i.e. if I'm doing say a xfs_fsr (which nearly always triggers the
>> freeze issue) on a volume the system hovers around 90% utilization for
>> the dm device for a while (sometimes an hour+, sometimes minutes) the
>> subsystem goes into 100% utilization and then freezes solid forcing me
>> to do a hard reboot of the box.
>>
> xfs_fsr can cause a *large* amount of IO to be done, so it is no
> surprise that it can trigger high load bugs in hardware and
> software. XFS can trigger high load problems on hardware more
> readily than other filesystems because using direct IO (like xfs_fsr
> does) it can push far, far higher throughput to the storage subsystem
> than any other linux filesystem can.
>
> The fact that the IO subsystem is freezing at 100% elevator queue
> utilisation points to an IO never completing. This immediately makes
> me point a finger at either the RAID hardware or the driver - a bug
> in XFS is highly unlikely to cause this symptom as those stats are
> generated at layers lower than XFS.
>
> Next time you get a freeze, the output of:
>
> # echo w> /proc/sysrq-trigger
>
> will tell us what the system is waiting on (i.e. why it is stuck)
>
> ...
>
Thanks, will try that; sometimes I do have enough time to issue a couple
of commands before the kernel hard-locks and no user input is accepted.
>> Since I'm using hardware raid w/ BBU when I reboot and it comes back up
>> the raid controller writes out to the drives any outstanding data in
>> it's cache and from the hardware point of view (as well as lvm's point
>> of view) the array is ok. The file system however generally can't be
>> mounted (about 4 out of 5 times, some times it does get auto-mounted but
>> when I then run an xfs_repair -n -v in those cases there are pages of
>> errors (badly aligned inode rec, bad starting inode #'s, dubious inode
>> btree block headers among others). When I let a repair actually run
>> in one case out of 4,500,000 files it linked about 2,000,000 or so but
>> there was no way to identify and verify file integrity. The others were
>> just lost.
>>
>> This is not limited to large volume sizes I have seen similar on small
>> ~2TiB file systems as well. Also when it happened in a couple cases the
>> file system that was taking the I/O (say xfs_fsr -v /home ) another XFS
>> filesystem on the same system which was NOT taking much if any I/O gets
>> badly corrupted (say /var/test ). Both would be using the same areca
>> controllers and same physical discs (same PV's and same VG's but
>> different LV's).
>>
> These symptoms really point to a problem outside XFS - the only time
> I've seen this sort of behaviour is on buggy hardware. The
> cross-volume corruption is the smoking gun, but proving it is damn
> near impossible without expensive lab equipment and a lot of time.
>
That's what I figured, given both the high I/O (JFS did not produce as much
I/O as I see under XFS) and the utilization reaching 100% on a
particular card.
Would enabling write buffering have any positive effect here, at least
to minimize the data-loss issues?
>> Any suggestions on how to isolate or eliminate this would be greatly
>> appreciated.
>>
> I'd start by not running xfs_fsr as a short term workaround to keep
> the load below the problem threshold.
>
> Looking at the iostat output - the volumes sd[f-i] all lock up at
> 100% utilisation at the same time. Then looking at this:
>
Already planning on it. The 'sole' benefit of this corruption is that at
least the full volume restore has much less fragmentation (kind of a
killer way to defragment, but it does work).
Steve
* Re: XFS data corruption with high I/O even on Areca hardware raid
2010-01-14 2:24 ` Dave Chinner
2010-01-14 2:33 ` Steve Costaras
@ 2010-01-15 0:52 ` Steve Costaras
2010-01-15 1:35 ` Dave Chinner
1 sibling, 1 reply; 9+ messages in thread
From: Steve Costaras @ 2010-01-15 0:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 01/13/2010 20:24, Dave Chinner wrote:
> On Wed, Jan 13, 2010 at 07:11:27PM -0600, Steve Costaras wrote:
>
>> Ok, I've been seeing a problem here since had to move over to XFS from
>> JFS due to file system size issues. I am seeing XFS Data corruption
>> under heavy I/O. Basically, what happens is that under heavy load
>> (i.e. if I'm doing say a xfs_fsr (which nearly always triggers the
>> freeze issue) on a volume the system hovers around 90% utilization for
>> the dm device for a while (sometimes an hour+, sometimes minutes) the
>> subsystem goes into 100% utilization and then freezes solid forcing me
>> to do a hard reboot of the box.
>>
> xfs_fsr can cause a *large* amount of IO to be done, so it is no
> surprise that it can trigger high load bugs in hardware and
> software. XFS can trigger high load problems on hardware more
> readily than other filesystems because using direct IO (like xfs_fsr
> does) it can push far, far higher throughput to the storage subsystem
> than any other linux filesystem can.
>
> The fact that the IO subsystem is freezing at 100% elevator queue
> utilisation points to an IO never completing. This immediately makes
> me point a finger at either the RAID hardware or the driver - a bug
> in XFS is highly unlikely to cause this symptom as those stats are
> generated at layers lower than XFS.
>
> Next time you get a freeze, the output of:
>
> # echo w> /proc/sysrq-trigger
>
> will tell us what the system is waiting on (i.e. why it is stuck)
>
> ...
>
Didn't want this to happen so soon, but another freeze just occurred. I
have the output you mentioned above included below. Don't know if this helps
pinpoint where the problem is more accurately. I don't
like the abort device commands going to arcmsr; I still have not heard anything
back from Areca support about looking at it.
-------------
[ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
[ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
[ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
[ 3580.294587] SysRq : Show Blocked State
[ 3580.294646] task PC stack pid father
[ 3580.294653] pdflush D 0000000000000000 0 221 2
[ 3580.294658] ffff8102330dfd80 0000000000000046 ffff81023ef6ce48
ffff81023ef6d008
[ 3580.294664] ffff81023ef6d0e8 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.294668] ffffffff8067f0a0 ffffffff80682c80 ffff8102330dc230
ffff8102330dfd4c
[ 3580.294673] Call Trace:
[ 3580.294699] [<ffffffff80248812>] __mod_timer+0xc2/0xe0
[ 3580.294706] [<ffffffff80470dbf>] schedule_timeout+0x5f/0xd0
[ 3580.294711] [<ffffffff802482d0>] process_timeout+0x0/0x10
[ 3580.294715] [<ffffffff80470dba>] schedule_timeout+0x5a/0xd0
[ 3580.294721] [<ffffffff8028e250>] pdflush+0x0/0x230
[ 3580.294724] [<ffffffff80470ca8>] io_schedule_timeout+0x28/0x40
[ 3580.294728] [<ffffffff80293cdb>] congestion_wait+0x6b/0x90
[ 3580.294732] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.294736] [<ffffffff802d67c8>] writeback_inodes+0xb8/0xd0
[ 3580.294739] [<ffffffff8028e250>] pdflush+0x0/0x230
[ 3580.294743] [<ffffffff8028dc1f>] background_writeout+0x3f/0xe0
[ 3580.294751] [<ffffffff8028e250>] pdflush+0x0/0x230
[ 3580.294755] [<ffffffff8028e3a0>] pdflush+0x150/0x230
[ 3580.294759] [<ffffffff8028dbe0>] background_writeout+0x0/0xe0
[ 3580.294764] [<ffffffff8025416b>] kthread+0x4b/0x80
[ 3580.294768] [<ffffffff8020d1b8>] child_rip+0xa/0x12
[ 3580.294777] [<ffffffff80254120>] kthread+0x0/0x80
[ 3580.294780] [<ffffffff8020d1ae>] child_rip+0x0/0x12
[ 3580.294784]
[ 3580.294786] pdflush D 0000000000000001 0 222 2
[ 3580.294791] ffff8102330e1d80 0000000000000046 ffff81023f22c2c8
ffff81023ef6d008
[ 3580.294797] ffff81023ef6d0e8 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.294801] ffffffff8067f0a0 ffffffff80682c80 ffff8102330d8a20
ffff8102330e1d4c
[ 3580.294805] Call Trace:
[ 3580.294817] [<ffffffff80248812>] __mod_timer+0xc2/0xe0
[ 3580.294823] [<ffffffff80470dbf>] schedule_timeout+0x5f/0xd0
[ 3580.294827] [<ffffffff802482d0>] process_timeout+0x0/0x10
[ 3580.294832] [<ffffffff80470dba>] schedule_timeout+0x5a/0xd0
[ 3580.294837] [<ffffffff8028e250>] pdflush+0x0/0x230
[ 3580.294840] [<ffffffff80470ca8>] io_schedule_timeout+0x28/0x40
[ 3580.294845] [<ffffffff80293cdb>] congestion_wait+0x6b/0x90
[ 3580.294849] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.294852] [<ffffffff802d67c8>] writeback_inodes+0xb8/0xd0
[ 3580.294856] [<ffffffff8028e250>] pdflush+0x0/0x230
[ 3580.294860] [<ffffffff8028dc1f>] background_writeout+0x3f/0xe0
[ 3580.294868] [<ffffffff8028e250>] pdflush+0x0/0x230
[ 3580.294872] [<ffffffff8028e3a0>] pdflush+0x150/0x230
[ 3580.294875] [<ffffffff8028dbe0>] background_writeout+0x0/0xe0
[ 3580.294880] [<ffffffff8025416b>] kthread+0x4b/0x80
[ 3580.294884] [<ffffffff8020d1b8>] child_rip+0xa/0x12
[ 3580.294893] [<ffffffff80254120>] kthread+0x0/0x80
[ 3580.294896] [<ffffffff8020d1ae>] child_rip+0x0/0x12
[ 3580.294900]
[ 3580.294905] scsi_eh_6 D ffff8100010549a0 0 3058 2
[ 3580.294908] ffff810231c89db0 0000000000000046 ffff810231c89d78
3031343839372e39
[ 3580.294915] ffffffff8000205d ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.294919] ffffffff8067f0a0 ffffffff80682c80 ffff810232978230
ffff810231c89d7c
[ 3580.294923] Call Trace:
[ 3580.294938] [<ffffffff80470dbf>] schedule_timeout+0x5f/0xd0
[ 3580.294943] [<ffffffff802482d0>] process_timeout+0x0/0x10
[ 3580.294947] [<ffffffff80470dba>] schedule_timeout+0x5a/0xd0
[ 3580.294952] [<ffffffff80248847>] msleep+0x17/0x30
[ 3580.294958] [<ffffffff8817430e>] :arcmsr:arcmsr_abort+0x8e/0x340
[ 3580.294974] [<ffffffff8812a379>]
:scsi_mod:scsi_error_handler+0x399/0x570
[ 3580.294991] [<ffffffff88129fe0>] :scsi_mod:scsi_error_handler+0x0/0x570
[ 3580.294995] [<ffffffff8025416b>] kthread+0x4b/0x80
[ 3580.295000] [<ffffffff8020d1b8>] child_rip+0xa/0x12
[ 3580.295009] [<ffffffff80254120>] kthread+0x0/0x80
[ 3580.295012] [<ffffffff8020d1ae>] child_rip+0x0/0x12
[ 3580.295017]
[ 3580.295021] xfsdatad/0 D 0000000000000000 0 5741 2
[ 3580.295024] ffff8102034f7e00 0000000000000046 0000000000000000
ffff810001033880
[ 3580.295029] ffff810233177818 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.295034] ffffffff8067f0a0 ffffffff80682c80 ffff810233177a00
ffff8102034f7dcc
[ 3580.295038] Call Trace:
[ 3580.295053] [<ffffffff80471dd5>] __down_write_nested+0x75/0xb0
[ 3580.295075] [<ffffffff884bf0bf>] :xfs:xfs_ilock+0x6f/0xa0
[ 3580.295091] [<ffffffff884e3810>] :xfs:xfs_setfilesize+0x40/0xc0
[ 3580.295104] [<ffffffff884e3960>] :xfs:xfs_end_bio_written+0x0/0x20
[ 3580.295118] [<ffffffff884e3970>] :xfs:xfs_end_bio_written+0x10/0x20
[ 3580.295123] [<ffffffff8024fddc>] run_workqueue+0xcc/0x170
[ 3580.295126] [<ffffffff80250970>] worker_thread+0x0/0x110
[ 3580.295129] [<ffffffff80250970>] worker_thread+0x0/0x110
[ 3580.295132] [<ffffffff80250a13>] worker_thread+0xa3/0x110
[ 3580.295137] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.295141] [<ffffffff80250970>] worker_thread+0x0/0x110
[ 3580.295146] [<ffffffff80250970>] worker_thread+0x0/0x110
[ 3580.295150] [<ffffffff8025416b>] kthread+0x4b/0x80
[ 3580.295153] [<ffffffff8020d1b8>] child_rip+0xa/0x12
[ 3580.295162] [<ffffffff80254120>] kthread+0x0/0x80
[ 3580.295166] [<ffffffff8020d1ae>] child_rip+0x0/0x12
[ 3580.295170]
[ 3580.295174] xfssyncd D 0000000000000000 0 5758 2
[ 3580.295177] ffff810202521b30 0000000000000046 0000000000000000
ffff810075ca8c00
[ 3580.295183] ffff810075ca8c00 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.295186] ffffffff8067f0a0 ffffffff80682c80 ffff81023317b210
ffff810202521afc
[ 3580.295190] Call Trace:
[ 3580.295202] [<ffffffff803419b4>] get_request+0x1a4/0x350
[ 3580.295208] [<ffffffff80470ce8>] io_schedule+0x28/0x40
[ 3580.295212] [<ffffffff8034278c>] get_request_wait+0xec/0x190
[ 3580.295221] [<ffffffff8804f661>] :dm_mod:__map_bio+0xd1/0x130
[ 3580.295226] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.295234] [<ffffffff880501bf>] :dm_mod:__split_bio+0xef/0x3e0
[ 3580.295240] [<ffffffff80343b1c>] __make_request+0x8c/0x680
[ 3580.295246] [<ffffffff803526b1>] __up_read+0x21/0xb0
[ 3580.295252] [<ffffffff80340486>] generic_make_request+0x1e6/0x3e0
[ 3580.295262] [<ffffffff803406f5>] submit_bio+0x75/0x100
[ 3580.295278] [<ffffffff884e5d88>] :xfs:_xfs_buf_ioapply+0x198/0x350
[ 3580.295297] [<ffffffff884e6a89>] :xfs:xfs_buf_iorequest+0x29/0x80
[ 3580.295312] [<ffffffff884c9d68>] :xfs:xlog_bdstrat_cb+0x38/0x40
[ 3580.295327] [<ffffffff884cb10e>] :xfs:xlog_sync+0x1fe/0x460
[ 3580.295346] [<ffffffff884cc732>] :xfs:xlog_state_sync_all+0x1f2/0x220
[ 3580.295352] [<ffffffff802486d2>] try_to_del_timer_sync+0x52/0x60
[ 3580.295370] [<ffffffff884ccbfd>] :xfs:_xfs_log_force+0x5d/0x80
[ 3580.295386] [<ffffffff884dba5b>] :xfs:xfs_syncsub+0x4b/0x300
[ 3580.295402] [<ffffffff884eceb7>] :xfs:xfs_sync_worker+0x17/0x40
[ 3580.295416] [<ffffffff884ed18d>] :xfs:xfssyncd+0x13d/0x1c0
[ 3580.295433] [<ffffffff884ed050>] :xfs:xfssyncd+0x0/0x1c0
[ 3580.295438] [<ffffffff8025416b>] kthread+0x4b/0x80
[ 3580.295443] [<ffffffff8020d1b8>] child_rip+0xa/0x12
[ 3580.295451] [<ffffffff80220720>] lapic_next_event+0x0/0x10
[ 3580.295457] [<ffffffff80254120>] kthread+0x0/0x80
[ 3580.295461] [<ffffffff8020d1ae>] child_rip+0x0/0x12
[ 3580.295466]
[ 3580.295479] vmware-vmx D 0000000000000000 0 7317 1
[ 3580.295485] ffff8101f1b53a48 0000000000000086 0000000000000000
ffff810005d8ae38
[ 3580.295490] ffff810075ca8600 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.295493] ffffffff8067f0a0 ffffffff80682c80 ffff810231d9fa00
ffff8101f1b53a14
[ 3580.295497] Call Trace:
[ 3580.295512] [<ffffffff80285a20>] sync_page+0x0/0x50
[ 3580.295515] [<ffffffff80470ce8>] io_schedule+0x28/0x40
[ 3580.295520] [<ffffffff80285a57>] sync_page+0x37/0x50
[ 3580.295523] [<ffffffff80470fdf>] __wait_on_bit+0x4f/0x80
[ 3580.295529] [<ffffffff80285d4d>] wait_on_page_bit+0x6d/0x80
[ 3580.295534] [<ffffffff80254560>] wake_bit_function+0x0/0x30
[ 3580.295539] [<ffffffff8028cc1a>] __writepage+0xa/0x30
[ 3580.295542] [<ffffffff8028d2a2>] write_cache_pages+0x2b2/0x340
[ 3580.295544] [<ffffffff8028cc10>] __writepage+0x0/0x30
[ 3580.295556] [<ffffffff80286f3b>] generic_file_buffered_write+0x1fb/0x6b0
[ 3580.295562] [<ffffffff8028d380>] do_writepages+0x20/0x40
[ 3580.295566] [<ffffffff802867b2>] __filemap_fdatawrite_range+0x52/0x60
[ 3580.295574] [<ffffffff80286974>] sync_page_range+0x74/0xe0
[ 3580.295590] [<ffffffff884ec745>] :xfs:xfs_write+0x775/0x910
[ 3580.295608] [<ffffffff802b5179>] do_sync_write+0xd9/0x120
[ 3580.295617] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.295622] [<ffffffff8024aa45>] group_send_sig_info+0x25/0x90
[ 3580.295628] [<ffffffff8024ab01>] kill_pid_info+0x51/0x90
[ 3580.295631] [<ffffffff80250d43>] find_pid_ns+0x3/0xd0
[ 3580.295638] [<ffffffff802b5abd>] vfs_write+0xed/0x190
[ 3580.295643] [<ffffffff802b6324>] sys_pwrite64+0x84/0xa0
[ 3580.295649] [<ffffffff8020c39e>] system_call+0x7e/0x83
[ 3580.295657]
[ 3580.295662] bacula-fd D 0000000000000000 0 7573 1
[ 3580.295666] ffff810161529b98 0000000000000086 0000000000000000
ffffffff8812cdce
[ 3580.295674] ffffffff884e4450 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.295681] ffffffff8067f0a0 ffffffff80682c80 ffff810232c45a00
ffff810161529b64
[ 3580.295687] Call Trace:
[ 3580.295698] [<ffffffff8812cdce>] :scsi_mod:scsi_request_fn+0x25e/0x3d0
[ 3580.295712] [<ffffffff884e4450>] :xfs:xfs_get_blocks+0x0/0x10
[ 3580.295724] [<ffffffff80285a20>] sync_page+0x0/0x50
[ 3580.295727] [<ffffffff80470ce8>] io_schedule+0x28/0x40
[ 3580.295732] [<ffffffff80285a57>] sync_page+0x37/0x50
[ 3580.295735] [<ffffffff80470eba>] __wait_on_bit_lock+0x4a/0x80
[ 3580.295742] [<ffffffff802859ff>] __lock_page+0x5f/0x70
[ 3580.295746] [<ffffffff80254560>] wake_bit_function+0x0/0x30
[ 3580.295750] [<ffffffff802862b0>] do_generic_mapping_read+0x1c0/0x3c0
[ 3580.295754] [<ffffffff802856d0>] file_read_actor+0x0/0x160
[ 3580.295764] [<ffffffff80287d3f>] generic_file_aio_read+0xff/0x1b0
[ 3580.295784] [<ffffffff884ebe9c>] :xfs:xfs_read+0x11c/0x250
[ 3580.295792] [<ffffffff802b5299>] do_sync_read+0xd9/0x120
[ 3580.295801] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.295808] [<ffffffff804706b0>] thread_return+0x3a/0x59a
[ 3580.295817] [<ffffffff802b5c4d>] vfs_read+0xed/0x190
[ 3580.295822] [<ffffffff802b6133>] sys_read+0x53/0x90
[ 3580.295828] [<ffffffff8020c39e>] system_call+0x7e/0x83
[ 3580.295836]
[ 3580.295841] smbd D 0000000000000000 0 7596 6572
[ 3580.295846] ffff8101400b7578 0000000000000086 0000000000000000
ffffffff803401cd
[ 3580.295853] ffff8101f1de7480 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.295860] ffffffff8067f0a0 ffffffff80682c80 ffff81016154e230
ffff8101400b7544
[ 3580.295865] Call Trace:
[ 3580.295869] [<ffffffff803401cd>] blk_recount_segments+0x3d/0x80
[ 3580.295878] [<ffffffff803419b4>] get_request+0x1a4/0x350
[ 3580.295885] [<ffffffff80470ce8>] io_schedule+0x28/0x40
[ 3580.295890] [<ffffffff8034278c>] get_request_wait+0xec/0x190
[ 3580.295897] [<ffffffff8804f661>] :dm_mod:__map_bio+0xd1/0x130
[ 3580.295902] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.295906] [<ffffffff80341326>] ll_back_merge_fn+0x146/0x240
[ 3580.295912] [<ffffffff80343b1c>] __make_request+0x8c/0x680
[ 3580.295918] [<ffffffff803526b1>] __up_read+0x21/0xb0
[ 3580.295923] [<ffffffff80340486>] generic_make_request+0x1e6/0x3e0
[ 3580.295927] [<ffffffff8028f5f9>] __pagevec_release+0x19/0x30
[ 3580.295935] [<ffffffff803406f5>] submit_bio+0x75/0x100
[ 3580.295939] [<ffffffff802de566>] __bio_add_page+0x1a6/0x210
[ 3580.295955] [<ffffffff884e40ee>] :xfs:xfs_submit_ioend_bio+0x1e/0x30
[ 3580.295970] [<ffffffff884e41de>] :xfs:xfs_submit_ioend+0xbe/0xe0
[ 3580.295986] [<ffffffff884e50e6>] :xfs:xfs_page_state_convert+0x3a6/0x660
[ 3580.296012] [<ffffffff884e54df>] :xfs:xfs_vm_writepage+0x6f/0x120
[ 3580.296017] [<ffffffff8028cc1a>] __writepage+0xa/0x30
[ 3580.296021] [<ffffffff8028d23e>] write_cache_pages+0x24e/0x340
[ 3580.296024] [<ffffffff8028cc10>] __writepage+0x0/0x30
[ 3580.296039] [<ffffffff8028d380>] do_writepages+0x20/0x40
[ 3580.296043] [<ffffffff802d5bf0>] __writeback_single_inode+0xb0/0x380
[ 3580.296054] [<ffffffff802d62b9>] sync_sb_inodes+0x1f9/0x300
[ 3580.296061] [<ffffffff802d67a2>] writeback_inodes+0x92/0xd0
[ 3580.296065] [<ffffffff8028d9bc>]
balance_dirty_pages_ratelimited_nr+0x25c/0x360
[ 3580.296078] [<ffffffff80286f3b>] generic_file_buffered_write+0x1fb/0x6b0
[ 3580.296105] [<ffffffff884ec646>] :xfs:xfs_write+0x676/0x910
[ 3580.296117] [<ffffffff80315e22>] __key_link+0x162/0x330
[ 3580.296125] [<ffffffff802b5179>] do_sync_write+0xd9/0x120
[ 3580.296134] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.296142] [<ffffffff802c7204>] fcntl_setlk+0x54/0x2f0
[ 3580.296150] [<ffffffff802b5abd>] vfs_write+0xed/0x190
[ 3580.296155] [<ffffffff802b6324>] sys_pwrite64+0x84/0xa0
[ 3580.296159] [<ffffffff802c2ade>] sys_fcntl+0x6e/0x370
[ 3580.296164] [<ffffffff8020c39e>] system_call+0x7e/0x83
[ 3580.296172]
[ 3580.296174] sync D 0000000000000000 0 7639 7636
[ 3580.296179] ffff810146023bb8 0000000000000086 ffff8100378d55a0
ffff810202540228
[ 3580.296187] ffff810231808000 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.296193] ffffffff8067f0a0 ffffffff80682c80 ffff8101dd1caa20
ffffffff8033e44b
[ 3580.296199] Call Trace:
[ 3580.296207] [<ffffffff8033e44b>] elv_next_request+0x5b/0x1f0
[ 3580.296221] [<ffffffff8812cdce>] :scsi_mod:scsi_request_fn+0x25e/0x3d0
[ 3580.296228] [<ffffffff80472287>] __down+0xa7/0x11f
[ 3580.296233] [<ffffffff80236540>] default_wake_function+0x0/0x10
[ 3580.296246] [<ffffffff884e5d88>] :xfs:_xfs_buf_ioapply+0x198/0x350
[ 3580.296252] [<ffffffff80471f06>] __down_failed+0x35/0x3a
[ 3580.296262] [<ffffffff8812cb70>] :scsi_mod:scsi_request_fn+0x0/0x3d0
[ 3580.296268] [<ffffffff8034f100>] kobject_release+0x0/0x10
[ 3580.296283] [<ffffffff884e56a6>] :xfs:xfs_buf_iowait+0x46/0x50
[ 3580.296297] [<ffffffff884e7b76>] :xfs:xfs_buf_read_flags+0x66/0xa0
[ 3580.296313] [<ffffffff884d87b4>] :xfs:xfs_trans_read_buf+0x64/0x340
[ 3580.296332] [<ffffffff884c1f11>] :xfs:xfs_itobp+0x81/0x1e0
[ 3580.296336] [<ffffffff8028d0fc>] write_cache_pages+0x10c/0x340
[ 3580.296355] [<ffffffff884c48fe>] :xfs:xfs_iflush+0xfe/0x520
[ 3580.296359] [<ffffffff80285864>] find_get_pages_tag+0x34/0x90
[ 3580.296363] [<ffffffff80352612>] __down_read_trylock+0x42/0x60
[ 3580.296380] [<ffffffff884df0d9>] :xfs:xfs_inode_flush+0x179/0x1b0
[ 3580.296396] [<ffffffff884ecd38>] :xfs:xfs_fs_write_inode+0x28/0x70
[ 3580.296401] [<ffffffff802d5dec>] __writeback_single_inode+0x2ac/0x380
[ 3580.296405] [<ffffffff8028ecfa>] pagevec_lookup_tag+0x1a/0x30
[ 3580.296409] [<ffffffff80286673>] wait_on_page_writeback_range+0x73/0x140
[ 3580.296417] [<ffffffff802d62b9>] sync_sb_inodes+0x1f9/0x300
[ 3580.296424] [<ffffffff802d6454>] sync_inodes_sb+0x94/0xb0
[ 3580.296430] [<ffffffff802d64e9>] __sync_inodes+0x79/0xc0
[ 3580.296435] [<ffffffff802d6541>] sync_inodes+0x11/0x30
[ 3580.296438] [<ffffffff802d9672>] do_sync+0x12/0x70
[ 3580.296441] [<ffffffff802d96de>] sys_sync+0xe/0x20
[ 3580.296445] [<ffffffff8020c39e>] system_call+0x7e/0x83
[ 3580.296453]
[ 3580.296455] sh D 0000000000000000 0 7640 7638
[ 3580.296461] ffff810161fd1b18 0000000000000082 0000000000000000
ffffffff802e2e60
[ 3580.296468] ffffffff884e4450 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.296474] ffffffff8067f0a0 ffffffff80682c80 ffff810190c46a20
ffff810161fd1ae4
[ 3580.296480] Call Trace:
[ 3580.296485] [<ffffffff802e2e60>] mpage_readpages+0xe0/0x100
[ 3580.296499] [<ffffffff884e4450>] :xfs:xfs_get_blocks+0x0/0x10
[ 3580.296511] [<ffffffff80285a20>] sync_page+0x0/0x50
[ 3580.296515] [<ffffffff80470ce8>] io_schedule+0x28/0x40
[ 3580.296519] [<ffffffff80285a57>] sync_page+0x37/0x50
[ 3580.296522] [<ffffffff80470eba>] __wait_on_bit_lock+0x4a/0x80
[ 3580.296528] [<ffffffff802859ff>] __lock_page+0x5f/0x70
[ 3580.296532] [<ffffffff80254560>] wake_bit_function+0x0/0x30
[ 3580.296537] [<ffffffff802862b0>] do_generic_mapping_read+0x1c0/0x3c0
[ 3580.296540] [<ffffffff80232311>] update_curr+0x71/0x100
[ 3580.296543] [<ffffffff802856d0>] file_read_actor+0x0/0x160
[ 3580.296552] [<ffffffff80287d3f>] generic_file_aio_read+0xff/0x1b0
[ 3580.296572] [<ffffffff884ebe9c>] :xfs:xfs_read+0x11c/0x250
[ 3580.296575] [<ffffffff8029352d>] zone_statistics+0x7d/0x80
[ 3580.296583] [<ffffffff802b5299>] do_sync_read+0xd9/0x120
[ 3580.296591] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
[ 3580.296596] [<ffffffff80334e44>] aa_register_find+0x54/0x170
[ 3580.296602] [<ffffffff80335e2a>] aa_register+0x19a/0x500
[ 3580.296611] [<ffffffff802b5c4d>] vfs_read+0xed/0x190
[ 3580.296616] [<ffffffff802b9be8>] kernel_read+0x38/0x50
[ 3580.296620] [<ffffffff802bb8f5>] do_execve+0x165/0x220
[ 3580.296626] [<ffffffff8020ae74>] sys_execve+0x44/0xb0
[ 3580.296631] [<ffffffff8020c767>] stub_execve+0x67/0xb0
[ 3580.296642]
[ 3580.296644] smbd D 0000000000000000 0 7641 6572
[ 3580.296649] ffff8101a08fd878 0000000000000082 0000000000000000
ffff810231c70000
[ 3580.296655] ffff8101f1de7500 ffffffff80682c80 ffffffff80682c80
ffffffff80682c80
[ 3580.296662] ffffffff8067f0a0 ffffffff80682c80 ffff81016154f210
ffff8101a08fd844
[ 3580.296667] Call Trace:
[ 3580.296682] [<ffffffff80472287>] __down+0xa7/0x11f
[ 3580.296686] [<ffffffff80236540>] default_wake_function+0x0/0x10
[ 3580.296701] [<ffffffff884e5d88>] :xfs:_xfs_buf_ioapply+0x198/0x350
[ 3580.296706] [<ffffffff80471f06>] __down_failed+0x35/0x3a
[ 3580.296723] [<ffffffff884e56a6>] :xfs:xfs_buf_iowait+0x46/0x50
[ 3580.296738] [<ffffffff884e7b76>] :xfs:xfs_buf_read_flags+0x66/0xa0
[ 3580.296754] [<ffffffff884d87b4>] :xfs:xfs_trans_read_buf+0x64/0x340
[ 3580.296772] [<ffffffff884c1f11>] :xfs:xfs_itobp+0x81/0x1e0
[ 3580.296792] [<ffffffff884c5245>] :xfs:xfs_iread+0x85/0x220
[ 3580.296797] [<ffffffff802af0cd>] __slab_alloc+0x1bd/0x410
[ 3580.296815] [<ffffffff884bf831>] :xfs:xfs_iget_core+0x1e1/0x710
[ 3580.296839] [<ffffffff884bfe47>] :xfs:xfs_iget+0xe7/0x160
[ 3580.296858] [<ffffffff884d9a1d>] :xfs:xfs_dir_lookup_int+0x8d/0xf0
[ 3580.296876] [<ffffffff884ddc15>] :xfs:xfs_lookup+0x75/0xa0
[ 3580.296893] [<ffffffff884eaea1>] :xfs:xfs_vn_lookup+0x31/0x70
[ 3580.296898] [<ffffffff802bda46>] do_lookup+0x1b6/0x250
[ 3580.296906] [<ffffffff802bf8cb>] __link_path_walk+0x14b/0xe90
[ 3580.296915] [<ffffffff802c066b>] link_path_walk+0x5b/0x100
[ 3580.296923] [<ffffffff802c82af>] dput+0x1f/0x130
[ 3580.296927] [<ffffffff802cda57>] mntput_no_expire+0x27/0xb0
[ 3580.296931] [<ffffffff802c0691>] link_path_walk+0x81/0x100
[ 3580.296937] [<ffffffff802c092a>] do_path_lookup+0x8a/0x250
[ 3580.296941] [<ffffffff802bf4c9>] getname+0x1a9/0x220
[ 3580.296946] [<ffffffff802c158b>] __user_walk_fd+0x4b/0x80
[ 3580.296951] [<ffffffff802b917f>] vfs_stat_fd+0x2f/0x80
[ 3580.296957] [<ffffffff802c82af>] dput+0x1f/0x130
[ 3580.296961] [<ffffffff802cda57>] mntput_no_expire+0x27/0xb0
[ 3580.296966] [<ffffffff802b48e0>] sys_chdir+0x60/0x90
[ 3580.296969] [<ffffffff802b9267>] sys_newstat+0x27/0x50
[ 3580.296975] [<ffffffff8024cf52>] set_user+0xa2/0xc0
[ 3580.296979] [<ffffffff8024f074>] sys_setresuid+0x194/0x200
[ 3580.296985] [<ffffffff8020c39e>] system_call+0x7e/0x83
[ 3580.296994]
[ 3580.296997] Sched Debug Version: v0.07, 2.6.24-26-server #1
[ 3580.296999] now at 3429623.454505 msecs
[ 3580.297001] .sysctl_sched_latency : 60.000000
[ 3580.297004] .sysctl_sched_min_granularity : 12.000000
[ 3580.297006] .sysctl_sched_wakeup_granularity : 30.000000
[ 3580.297008] .sysctl_sched_batch_wakeup_granularity : 30.000000
[ 3580.297010] .sysctl_sched_child_runs_first : 0.000001
[ 3580.297013] .sysctl_sched_features : 7
[ 3580.297015]
[ 3580.297016] cpu#0, 2404.113 MHz
[ 3580.297018] .nr_running : 1
[ 3580.297020] .load : 1024
[ 3580.297022] .nr_switches : 7933994
[ 3580.297024] .nr_load_updates : 242407
[ 3580.297026] .nr_uninterruptible : -68690
[ 3580.297028] .jiffies : 4295280258
[ 3580.297029] .next_balance : 4295.280276
[ 3580.297032] .curr->pid : 7288
[ 3580.297033] .clock : 2424072.408800
[ 3580.297035] .idle_clock : 0.000000
[ 3580.297037] .prev_clock_raw : 3580293.550258
[ 3580.297039] .clock_warps : 0
[ 3580.297041] .clock_overflows : 2476352
[ 3580.297044] .clock_deep_idle_events : 0
[ 3580.297045] .clock_max_delta : 9.999980
[ 3580.297047] .cpu_load[0] : 1024
[ 3580.297049] .cpu_load[1] : 1024
[ 3580.297051] .cpu_load[2] : 1024
[ 3580.297053] .cpu_load[3] : 1024
[ 3580.297055] .cpu_load[4] : 1024
[ 3580.297057]
[ 3580.297058] cfs_rq
[ 3580.297059] .exec_clock : 0.000000
[ 3580.297062] .MIN_vruntime : 0.000001
[ 3580.297064] .min_vruntime : 552420.036852
[ 3580.297066] .max_vruntime : 0.000001
[ 3580.297068] .spread : 0.000000
[ 3580.297070] .spread0 : 0.000000
[ 3580.297072] .nr_running : 1
[ 3580.297074] .load : 1024
[ 3580.297075] .nr_spread_over : 0
[ 3580.297077]
[ 3580.297078] cfs_rq
[ 3580.297080] .exec_clock : 0.000000
[ 3580.297082] .MIN_vruntime : 0.000001
[ 3580.297084] .min_vruntime : 552420.036852
[ 3580.297087] .max_vruntime : 0.000001
[ 3580.297089] .spread : 0.000000
[ 3580.297091] .spread0 : 0.000000
[ 3580.297092] .nr_running : 1
[ 3580.297094] .load : 1024
[ 3580.297096] .nr_spread_over : 0
[ 3580.297098]
[ 3580.297099] runnable tasks:
[ 3580.297099] task PID tree-key switches
prio exec-runtime sum-exec sum-sleep
[ 3580.297100]
----------------------------------------------------------------------------------------------------------
[ 3580.297106] R vmware-vmx 7288 518608.234067 320744
120 0 0 0.000000
0.000000 0.000000
[ 3580.297114]
[ 3580.297114] cpu#1, 2404.113 MHz
[ 3580.297116] .nr_running : 2
[ 3580.297118] .load : 2048
[ 3580.297120] .nr_switches : 5422548
[ 3580.297121] .nr_load_updates : 244834
[ 3580.297123] .nr_uninterruptible : 56347
[ 3580.297126] .jiffies : 4295280258
[ 3580.297128] .next_balance : 4295.280257
[ 3580.297130] .curr->pid : 7338
[ 3580.297132] .clock : 2448350.962052
[ 3580.297134] .idle_clock : 0.000000
[ 3580.297136] .prev_clock_raw : 3580296.984746
[ 3580.297137] .clock_warps : 0
[ 3580.297139] .clock_overflows : 1027944
[ 3580.297141] .clock_deep_idle_events : 0
[ 3580.297143] .clock_max_delta : 9.999794
[ 3580.297145] .cpu_load[0] : 0
[ 3580.297147] .cpu_load[1] : 0
[ 3580.297149] .cpu_load[2] : 139
[ 3580.297151] .cpu_load[3] : 1532
[ 3580.297153] .cpu_load[4] : 3045
[ 3580.297155]
[ 3580.297155] cfs_rq
[ 3580.297157] .exec_clock : 0.000000
[ 3580.297159] .MIN_vruntime : 0.000001
[ 3580.297161] .min_vruntime : 437151.747709
[ 3580.297163] .max_vruntime : 0.000001
[ 3580.297165] .spread : 0.000000
[ 3580.297167] .spread0 : -115268.289143
[ 3580.297169] .nr_running : 1
[ 3580.297171] .load : 1024
[ 3580.297173] .nr_spread_over : 0
[ 3580.297175]
[ 3580.297175] cfs_rq
[ 3580.297177] .exec_clock : 0.000000
[ 3580.297179] .MIN_vruntime : 406855.054398
[ 3580.297181] .min_vruntime : 437151.747709
[ 3580.297183] .max_vruntime : 406855.054398
[ 3580.297185] .spread : 0.000000
[ 3580.297188] .spread0 : -115268.289143
[ 3580.297190] .nr_running : 2
[ 3580.297191] .load : 2048
[ 3580.297193] .nr_spread_over : 0
[ 3580.297195]
[ 3580.297196] runnable tasks:
[ 3580.297196] task PID tree-key switches
prio exec-runtime sum-exec sum-sleep
[ 3580.297197]
----------------------------------------------------------------------------------------------------------
[ 3580.297200] syslogd 6216 406855.054398 2895
120 0 0 0.000000
0.000000 0.000000
[ 3580.297207] R bash 7338 406855.172049 1400
120 0 0 0.000000
0.000000 0.000000
[ 3580.297215]
[ 3580.297215] cpu#2, 2404.113 MHz
[ 3580.297217] .nr_running : 1
[ 3580.297219] .load : 1024
[ 3580.297221] .nr_switches : 1440663
[ 3580.297223] .nr_load_updates : 276188
[ 3580.297225] .nr_uninterruptible : 6380
[ 3580.297227] .jiffies : 4295280258
[ 3580.297229] .next_balance : 4295.280283
[ 3580.297231] .curr->pid : 6235
[ 3580.297233] .clock : 2761880.177500
[ 3580.297235] .idle_clock : 0.000000
[ 3580.297237] .prev_clock_raw : 3580296.056586
[ 3580.297239] .clock_warps : 0
[ 3580.297242] .clock_overflows : 214421
[ 3580.297243] .clock_deep_idle_events : 0
[ 3580.297245] .clock_max_delta : 9.999999
[ 3580.297247] .cpu_load[0] : 1024
[ 3580.297249] .cpu_load[1] : 512
[ 3580.297251] .cpu_load[2] : 256
[ 3580.297253] .cpu_load[3] : 128
[ 3580.297255] .cpu_load[4] : 64
[ 3580.297257]
[ 3580.297257] cfs_rq
[ 3580.297259] .exec_clock : 0.000000
[ 3580.297261] .MIN_vruntime : 0.000001
[ 3580.297264] .min_vruntime : 102512.690078
[ 3580.297265] .max_vruntime : 0.000001
[ 3580.297268] .spread : 0.000000
[ 3580.297269] .spread0 : -449907.346774
[ 3580.297271] .nr_running : 1
[ 3580.297273] .load : 1024
[ 3580.297275] .nr_spread_over : 0
[ 3580.297277]
[ 3580.297278] cfs_rq
[ 3580.297279] .exec_clock : 0.000000
[ 3580.297282] .MIN_vruntime : 0.000001
[ 3580.297284] .min_vruntime : 102512.690078
[ 3580.297286] .max_vruntime : 0.000001
[ 3580.297288] .spread : 0.000000
[ 3580.297290] .spread0 : -449907.346774
[ 3580.297292] .nr_running : 1
[ 3580.297294] .load : 1024
[ 3580.297296] .nr_spread_over : 0
[ 3580.297298]
[ 3580.297298] runnable tasks:
[ 3580.297299] task PID tree-key switches
prio exec-runtime sum-exec sum-sleep
[ 3580.297300]
----------------------------------------------------------------------------------------------------------
[ 3580.297303] R dd 6235 115167.502387 274
120 0 0 0.000000
0.000000 0.000000
[ 3580.297311]
[ 3580.297311] cpu#3, 2404.113 MHz
[ 3580.297313] .nr_running : 0
[ 3580.297315] .load : 0
[ 3580.297317] .nr_switches : 1591451
[ 3580.297320] .nr_load_updates : 227045
[ 3580.297322] .nr_uninterruptible : 5974
[ 3580.297323] .jiffies : 4295280258
[ 3580.297325] .next_balance : 4295.280259
[ 3580.297327] .curr->pid : 0
[ 3580.297329] .clock : 2270450.190151
[ 3580.297331] .idle_clock : 0.000000
[ 3580.297334] .prev_clock_raw : 3580297.313151
[ 3580.297336] .clock_warps : 0
[ 3580.297337] .clock_overflows : 226883
[ 3580.297339] .clock_deep_idle_events : 0
[ 3580.297341] .clock_max_delta : 9.999865
[ 3580.297343] .cpu_load[0] : 0
[ 3580.297345] .cpu_load[1] : 0
[ 3580.297347] .cpu_load[2] : 0
[ 3580.297349] .cpu_load[3] : 0
[ 3580.297351] .cpu_load[4] : 0
[ 3580.297353]
[ 3580.297353] cfs_rq
[ 3580.297355] .exec_clock : 0.000000
[ 3580.297358] .MIN_vruntime : 0.000001
[ 3580.297360] .min_vruntime : 118343.625154
[ 3580.297362] .max_vruntime : 0.000001
[ 3580.297364] .spread : 0.000000
[ 3580.297366] .spread0 : -434076.411698
[ 3580.297368] .nr_running : 0
[ 3580.297370] .load : 0
[ 3580.297372] .nr_spread_over : 0
[ 3580.297374]
[ 3580.297374] cfs_rq
[ 3580.297376] .exec_clock : 0.000000
[ 3580.297378] .MIN_vruntime : 0.000001
[ 3580.297380] .min_vruntime : 118343.625154
[ 3580.297383] .max_vruntime : 0.000001
[ 3580.297385] .spread : 0.000000
[ 3580.297387] .spread0 : -434076.411698
[ 3580.297389] .nr_running : 0
[ 3580.297391] .load : 0
[ 3580.297393] .nr_spread_over : 0
[ 3580.297395]
[ 3580.297395] runnable tasks:
[ 3580.297396] task PID tree-key switches
prio exec-runtime sum-exec sum-sleep
[ 3580.297397]
----------------------------------------------------------------------------------------------------------
[ 3580.297401]
----------------
iostat -m -x 5 output:
avg-cpu: %user %nice %system %iowait %steal %idle
10.28 0.00 11.55 7.56 0.00 70.61
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await svctm %util
sda 0.00 8.40 0.20 3.00 0.00 0.04
29.00 0.00 0.62 0.62 0.20
sdb 10.20 36.00 120.60 34.40 8.18 5.25
177.44 0.23 1.48 0.92 14.20
sdc 53.80 30.60 77.80 36.60 8.18 5.34
242.03 0.24 2.10 1.19 13.60
sdd 10.00 36.60 120.80 37.60 8.18 5.31
174.36 0.27 1.69 0.90 14.20
sde 54.60 33.40 76.40 36.80 8.19 5.39
245.68 0.25 2.23 1.22 13.80
sdf 9.60 36.80 121.40 25.80 8.18 3.99
169.23 5.16 0.99 2.08 30.60
sdg 54.80 32.20 76.00 24.80 8.18 4.01
247.63 4.53 1.39 2.98 30.00
sdh 9.80 39.00 121.00 24.00 8.18 3.99
171.88 4.86 1.10 2.25 32.60
sdi 54.40 33.40 76.40 23.20 8.18 4.02
250.78 4.65 1.57 3.13 31.20
dm-0 0.00 0.00 523.60 0.00 65.45 0.00
256.00 0.77 0.70 0.97 50.60
dm-1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 1.00 222.60 0.01 42.50
389.30 20.40 3.54 1.18 26.40
avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 6.51 30.68 0.00 62.39
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await svctm %util
sda 0.00 10.00 0.00 8.20 0.00 0.07
17.76 0.00 0.00 0.00 0.00
sdb 0.00 17.80 0.00 14.80 0.00 2.42
334.27 0.12 7.84 3.51 5.20
sdc 0.00 18.20 0.00 13.00 0.00 2.40
378.34 0.13 10.15 4.31 5.60
sdd 0.00 17.80 0.00 16.20 0.00 2.42
306.17 0.13 8.27 3.33 5.40
sde 0.00 16.60 0.00 16.80 0.00 2.43
295.90 0.12 7.02 3.10 5.20
sdf 0.00 18.80 0.00 0.00 0.00 0.00
0.00 99.53 0.00 0.00 100.00
sdg 0.00 17.40 0.00 0.00 0.00 0.00
0.00 96.46 0.00 0.00 100.00
sdh 0.00 18.20 0.00 0.00 0.00 0.00
0.00 101.26 0.00 0.00 100.00
sdi 0.00 19.00 0.00 0.00 0.00 0.00
0.00 103.05 0.00 0.00 100.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 2.00 0.00 0.00 100.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 112.20 0.00 19.53
356.53 476.04 2.76 8.91 100.00
-------------
* Re: XFS data corruption with high I/O even on Areca hardware raid
2010-01-15 0:52 ` XFS data corruption with high I/O even on Areca " Steve Costaras
@ 2010-01-15 1:35 ` Dave Chinner
2010-01-15 2:15 ` Steve Costaras
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2010-01-15 1:35 UTC (permalink / raw)
To: Steve Costaras; +Cc: xfs
On Thu, Jan 14, 2010 at 06:52:15PM -0600, Steve Costaras wrote:
> On 01/13/2010 20:24, Dave Chinner wrote:
>> On Wed, Jan 13, 2010 at 07:11:27PM -0600, Steve Costaras wrote:
>> The fact that the IO subsystem is freezing at 100% elevator queue
>> utilisation points to an IO never completing. This immediately makes
>> me point a finger at either the RAID hardware or the driver - a bug
>> in XFS is highly unlikely to cause this symptom as those stats are
>> generated at layers lower than XFS.
>>
>> Next time you get a freeze, the output of:
>>
>> # echo w> /proc/sysrq-trigger
>>
>> will tell us what the system is waiting on (i.e. why it is stuck)
>>
>> ...
>
> didn't want this to happen so soon, but another freeze just occurred. I
> have the output you mentioned above below. Don't know if this helps
> pinpoint anything as to where the problem is more accurately.
The stack traces do - everything is waiting on IO completion to
occur. The elevator queues are full, the block device is congested,
and lots of XFS code is waiting on IO completion to occur.
> I don't
> like the abort device commands to arcmsr, still have not heard anything
> from Areca support for them to look at it.
>
> -------------
> [ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
> [ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
> [ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
> [ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
> [ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
> [ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
> [ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
> [ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
> [ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
> [ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
> [ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
Yeah, that looks bad - the driver appears to have aborted some IOs
(no idea why) but probably hasn't handled the error correctly and
completed the aborted IOs with an error status (which would have caused
XFS to shut down the filesystem, but not freeze like this). So it
looks like there is a buggy error-handling path in the driver being
triggered by some kind of hardware problem.
I note that it is the same RAID controller that had the problems
in the last report. It might just be a bad RAID card or bad SATA cables
from that RAID card. I'd work out which card it is, replace it
and the cables, and see if that makes the problem go away....
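(As a rough sketch of how to map the hung devices back to a host adapter,
assuming the usual sysfs layout on this kernel:)

# for d in sdf sdg sdh sdi; do echo -n "$d -> "; readlink -f /sys/block/$d/device; done

The resolved paths show the PCI address and SCSI host number, which can be
matched against the adapter lines in the boot log.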
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS data corruption with high I/O even on Areca hardware raid
2010-01-15 1:35 ` Dave Chinner
@ 2010-01-15 2:15 ` Steve Costaras
0 siblings, 0 replies; 9+ messages in thread
From: Steve Costaras @ 2010-01-15 2:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 01/14/2010 19:35, Dave Chinner wrote:
> The stack traces do - everything is waiting on IO completion to
> occur. The elevator queues are full, the block device is congested
> and lots of XFS code is waiting on IO completion to occur.
>
>
>> I don't
>> like the abort device commands to arcmsr, still have not heard anything
>> from Areca support for them to look at it.
>>
>> -------------
>> [ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
>>
> Yea, that looks bad - the driver appears to have aborted some IOs
> (no idea why) but probably hasn't handled the error correctly and
> completed the IOs it aborted with an error status (which would cause
> XFS to shut down the filesystem but not freeze like this). So it
> looks like there is a buggy error handling path in the driver being
> triggered by some kind of hardware problem.
>
> I note that it is the same raid controller that has had problems
> as the last report. It might just be a bad RAID card or SATA cables
> from that RAID card. I'd work out which card it is, replace it
> and the cables and see if that makes the problem go away....
>
Yeah, actually this IS a new RAID card & cables since the last failure,
so I don't think (statistically) that it's hardware. Could be firmware
or driver. I've forwarded this over to Areca and hopefully they can
come up with something.
Right now I'm testing with the RAID in write-through mode in the hope that,
if it doesn't avoid the problem, it at least minimizes the effect of it. Thanks
to the abort messages above, I also found some references about lengthening the
SCSI command timeout (/sys/class/scsi_device/{?}/device/timeout), which may help
when under 'heavy' load; it looks to default to 30 seconds, at least on my kernel.
I haven't read up in detail yet on why it is set to 30, or whether that was
just some arbitrary number someone picked; to me it seems long (considering
we're talking 10-13ms service times generally on a 7200rpm drive, but who knows).
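(A minimal sketch of what I mean, using the sysfs path above and an arbitrary
120-second value purely as an example:)

# for t in /sys/class/scsi_device/*/device/timeout; do echo "$t: $(cat $t)"; done
# for t in /sys/class/scsi_device/*/device/timeout; do echo 120 > $t; done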
Steve
* Re: XFS data corruption with high I/O even on hardware raid
2010-01-14 1:11 XFS data corruption with high I/O even on hardware raid Steve Costaras
2010-01-14 2:24 ` Dave Chinner
@ 2010-01-14 9:08 ` Andi Kleen
2010-01-14 11:19 ` Steve Costaras
1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2010-01-14 9:08 UTC (permalink / raw)
To: Steve Costaras; +Cc: xfs
Steve Costaras <stevecs@chaven.com> writes:
>
> ============ (System)
> (Ubuntu 8.04.3 LTS):
> Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64 GNU/Linux
That's a really old kernel. Perhaps try it with a newer one?
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: XFS data corruption with high I/O even on hardware raid
2010-01-14 9:08 ` XFS data corruption with high I/O even on " Andi Kleen
@ 2010-01-14 11:19 ` Steve Costaras
2010-01-14 11:36 ` Andi Kleen
0 siblings, 1 reply; 9+ messages in thread
From: Steve Costaras @ 2010-01-14 11:19 UTC (permalink / raw)
To: Andi Kleen; +Cc: xfs
It's the current kernel for the distribution; however, I take your
point. I did have a 2.6.28 kernel running on another system that also
exhibited the same problem, but that system has been rebuilt to
standardize them all on 8.04.3 LTS. I am contacting Areca to see if
they have any suggestions, and will also try a newer kernel.
On 01/14/2010 03:08, Andi Kleen wrote:
> Steve Costaras<stevecs@chaven.com> writes:
>
>> ============ (System)
>> (Ubuntu 8.04.3 LTS):
>> Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64 GNU/Linux
>>
> That's a really old kernel. Perhaps try it with a newer one?
>
> -Andi
>
>
* Re: XFS data corruption with high I/O even on hardware raid
2010-01-14 11:19 ` Steve Costaras
@ 2010-01-14 11:36 ` Andi Kleen
0 siblings, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2010-01-14 11:36 UTC (permalink / raw)
To: Steve Costaras; +Cc: Andi Kleen, xfs
On Thu, Jan 14, 2010 at 05:19:49AM -0600, Steve Costaras wrote:
> It's the current kernel for the distribution, however I take your point.
The deal is usually -- when you stay with old distribution kernels
you have to contact the distribution for any problems.
> I did have a 2.6.28 kernel running on another system that also exhibited
That's also a very old kernel in mainline terms. Current is 2.6.32
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
Thread overview: 9+ messages
2010-01-14 1:11 XFS data corruption with high I/O even on hardware raid Steve Costaras
2010-01-14 2:24 ` Dave Chinner
2010-01-14 2:33 ` Steve Costaras
2010-01-15 0:52 ` XFS data corruption with high I/O even on Areca " Steve Costaras
2010-01-15 1:35 ` Dave Chinner
2010-01-15 2:15 ` Steve Costaras
2010-01-14 9:08 ` XFS data corruption with high I/O even on " Andi Kleen
2010-01-14 11:19 ` Steve Costaras
2010-01-14 11:36 ` Andi Kleen