* possible data corruption on ICH8 or WD raptor
@ 2008-07-14 21:48 Janos Haar
2008-08-01 4:29 ` Tejun Heo
0 siblings, 1 reply; 5+ messages in thread
From: Janos Haar @ 2008-07-14 21:48 UTC (permalink / raw)
To: linux-ide
Hello list,
I have a (planned) production-ready server with an Intel DP35DP motherboard
and 6 drives:
2x 500GB WD SATA (not interesting)
4x 300GB WD Velociraptor, SATA2
While testing the server I saw one error report in dmesg:
ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xa frozen
ata3.00: irq_stat 0x00400040, connection status changed
ata3: SError: { PHYRdyChg DevExch }
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:20:40:de:90/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3: hard resetting link
ata3: softreset failed (device not ready)
ata3: hard resetting link
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
sd 2:0:0:0: [sdc] 586072368 512-byte hardware sectors (300069 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
I ran some more tests and found that this happens only occasionally, and only
on the Raptors.
I tested copying an 80GB file from my raid10 array to another raid10 array
on the 4 Raptors, and saw data corruption once (checked with cmp).
I can reproduce the issue within about 1-2 hours of testing.
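
A minimal sketch of the copy-and-compare check described above, for readers who want the offset of the first mismatch rather than a pass/fail result. The function name `first_difference` and the chunk size are illustrative assumptions; the original test used plain cmp.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Find the exact byte inside the mismatching chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other: they differ where
                # the shorter one ends.
                return offset + min(len(a), len(b))
            if not a:
                # Both files ended at the same point with no mismatch.
                return None
            offset += len(a)
```

Running this in a loop after each copy, and logging the offset together with the dmesg timestamp of any ata error, may help show whether the corrupted region lines up with a link reset.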
[root@gl-bh2k8 host2]# uname -a
Linux gl-bh2k8 2.6.25.9 #1 SMP Thu Jul 10 17:31:31 CEST 2008 x86_64 x86_64
x86_64 GNU/Linux
Can I help to fix it, if it is fixable in software?
Thanks,
Janos Haar
dmesg cut:
ahci 0000:00:1f.2: version 3.0
ACPI: PCI Interrupt 0000:00:1f.2[A] -> GSI 21 (level, low) -> IRQ 21
ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA
mode
ahci 0000:00:1f.2: flags: 64bit ncq sntf led clo pmp pio slum part
PCI: Setting latency timer of device 0000:00:1f.2 to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
scsi4 : ahci
scsi5 : ahci
ata1: SATA max UDMA/133 abar m2048@0xe2121000 port 0xe2121100 irq 21
ata2: SATA max UDMA/133 abar m2048@0xe2121000 port 0xe2121180 irq 21
ata3: SATA max UDMA/133 abar m2048@0xe2121000 port 0xe2121200 irq 21
ata4: SATA max UDMA/133 abar m2048@0xe2121000 port 0xe2121280 irq 21
ata5: SATA max UDMA/133 abar m2048@0xe2121000 port 0xe2121300 irq 21
ata6: SATA max UDMA/133 abar m2048@0xe2121000 port 0xe2121380 irq 21
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: WDC WD5000AAKS-00A7B0, 01.03B01, max UDMA/133
ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-8: WDC WD5000AAKS-00A7B0, 01.03B01, max UDMA/133
ata2.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata2.00: configured for UDMA/133
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: ATA-8: WDC WD3000GLFS-01F8U0, 03.03V01, max UDMA/133
ata3.00: 586072368 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata3.00: configured for UDMA/133
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: ATA-8: WDC WD3000GLFS-01F8U0, 03.03V01, max UDMA/133
ata4.00: 586072368 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata4.00: configured for UDMA/133
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: ATA-8: WDC WD3000GLFS-01F8U0, 03.03V01, max UDMA/133
ata5.00: 586072368 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata5.00: configured for UDMA/133
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: ATA-8: WDC WD3000GLFS-01F8U0, 03.03V01, max UDMA/133
ata6.00: 586072368 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata6.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA WDC WD5000AAKS-0 01.0 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sda: sda1 sda2 sda3 sda4
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 1:0:0:0: Direct-Access ATA WDC WD5000AAKS-0 01.0 PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: [sdb] Attached SCSI disk
sd 1:0:0:0: Attached scsi generic sg1 type 0
scsi 2:0:0:0: Direct-Access ATA WDC WD3000GLFS-0 03.0 PQ: 0 ANSI: 5
sd 2:0:0:0: [sdc] 586072368 512-byte hardware sectors (300069 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sd 2:0:0:0: [sdc] 586072368 512-byte hardware sectors (300069 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sdc: sdc1 sdc2 sdc3
sd 2:0:0:0: [sdc] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg2 type 0
scsi 3:0:0:0: Direct-Access ATA WDC WD3000GLFS-0 03.0 PQ: 0 ANSI: 5
sd 3:0:0:0: [sdd] 586072368 512-byte hardware sectors (300069 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sd 3:0:0:0: [sdd] 586072368 512-byte hardware sectors (300069 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sdd: sdd1 sdd2 sdd3
sd 3:0:0:0: [sdd] Attached SCSI disk
sd 3:0:0:0: Attached scsi generic sg3 type 0
scsi 4:0:0:0: Direct-Access ATA WDC WD3000GLFS-0 03.0 PQ: 0 ANSI: 5
sd 4:0:0:0: [sde] 586072368 512-byte hardware sectors (300069 MB)
sd 4:0:0:0: [sde] Write Protect is off
sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sd 4:0:0:0: [sde] 586072368 512-byte hardware sectors (300069 MB)
sd 4:0:0:0: [sde] Write Protect is off
sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sde: sde1 sde2 sde3
sd 4:0:0:0: [sde] Attached SCSI disk
sd 4:0:0:0: Attached scsi generic sg4 type 0
scsi 5:0:0:0: Direct-Access ATA WDC WD3000GLFS-0 03.0 PQ: 0 ANSI: 5
sd 5:0:0:0: [sdf] 586072368 512-byte hardware sectors (300069 MB)
sd 5:0:0:0: [sdf] Write Protect is off
sd 5:0:0:0: [sdf] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sd 5:0:0:0: [sdf] 586072368 512-byte hardware sectors (300069 MB)
sd 5:0:0:0: [sdf] Write Protect is off
sd 5:0:0:0: [sdf] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
sdf: sdf1 sdf2 sdf3
sd 5:0:0:0: [sdf] Attached SCSI disk
sd 5:0:0:0: Attached scsi generic sg5 type 0
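
The `SStatus 123 SControl 300` values in the link-up lines are hexadecimal register dumps. A minimal sketch of how they break down, assuming the standard SATA field layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). The table and function names are illustrative and not taken from the kernel.

```python
# Field meanings for SStatus, per the SATA register layout.
DET = {
    0: "no device detected, PHY communication not established",
    1: "device detected, PHY communication not established",
    3: "device detected, PHY communication established",
    4: "PHY offline",
}
SPD = {0: "no speed negotiated", 1: "Gen1 (1.5 Gbps)", 2: "Gen2 (3.0 Gbps)"}
IPM_STATE = {0: "device not present", 1: "active", 2: "partial", 6: "slumber"}


def split_fields(hex_text):
    """Split a hex register dump such as '123' into its three 4-bit fields."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,
        "spd": (value >> 4) & 0xF,
        "ipm": (value >> 8) & 0xF,
    }


def describe_sstatus(hex_text):
    """Return (detection, speed, power state) descriptions for an SStatus dump."""
    f = split_fields(hex_text)
    return (
        DET.get(f["det"], "reserved"),
        SPD.get(f["spd"], "reserved"),
        IPM_STATE.get(f["ipm"], "reserved"),
    )


print(describe_sstatus("123"))
```

For SControl the same three fields are requests rather than status: `300` splits into DET 0 (no reset requested), SPD 0 (no speed limit) and IPM 3 (transitions to both partial and slumber disabled).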
* Re: possible data corruption on ICH8 or WD raptor
2008-07-14 21:48 possible data corruption on ICH8 or WD raptor Janos Haar
@ 2008-08-01 4:29 ` Tejun Heo
2008-08-01 10:40 ` Janos Haar
0 siblings, 1 reply; 5+ messages in thread
From: Tejun Heo @ 2008-08-01 4:29 UTC (permalink / raw)
To: Janos Haar; +Cc: linux-ide
Janos Haar wrote:
> Hello list,
>
> I have a (planned) production-ready server with an Intel DP35DP
> motherboard and 6 drives:
>
> 2x 500GB WD SATA (not interesting)
> 4x 300GB WD Velociraptor, SATA2
>
> While testing the server I saw one error report in dmesg:
>
> ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xa frozen
> ata3.00: irq_stat 0x00400040, connection status changed
> ata3: SError: { PHYRdyChg DevExch }
> ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> res 40/00:20:40:de:90/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
You're getting a PHY event on flush, which is a pretty strong indication
that you're having a power problem. The disk goes to write out the data in
its buffer to the platter and draws more power from the cable. For some
reason, power is not maintained properly. The disk drops out momentarily,
causing the PHY event and losing the data in its buffer. Try connecting
the hard drive to a separate PSU and see whether the problem goes away.
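
A minimal sketch of how the quoted error lines map to "PHY event on flush", assuming the standard SATA SError bit positions and the short names libata prints. The first byte of the `cmd` line is the ATA opcode, and `0xea` is FLUSH CACHE EXT. The helper names are illustrative.

```python
# SError bit positions (SATA spec) with the short names libata prints.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}

# Only the two flush opcodes relevant here; not a full ATA command table.
FLUSH_OPCODES = {0xE7: "FLUSH CACHE", 0xEA: "FLUSH CACHE EXT"}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def command_opcode(cmd_field):
    """Extract the opcode from a libata 'cmd' field such as 'ea/00:00:...'."""
    return int(cmd_field.split("/", 1)[0], 16)


print(decode_serror(0x4010000))
print(FLUSH_OPCODES.get(command_opcode("ea/00:00:00:00:00/00:00:00:00:00/a0")))
```

Both set bits (PhyRdy change and device exchanged) describe the link dropping and coming back, not a transfer error such as a bad CRC.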
--
tejun
* Re: possible data corruption on ICH8 or WD raptor
2008-08-01 4:29 ` Tejun Heo
@ 2008-08-01 10:40 ` Janos Haar
2008-08-01 15:36 ` Tejun Heo
2008-08-01 22:40 ` Alan Cox
0 siblings, 2 replies; 5+ messages in thread
From: Janos Haar @ 2008-08-01 10:40 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-ide
----- Original Message -----
From: "Tejun Heo" <tj@kernel.org>
To: "Janos Haar" <djani22@netcenter.hu>
Cc: <linux-ide@vger.kernel.org>
Sent: Friday, August 01, 2008 6:29 AM
Subject: Re: possible data corruption on ICH8 or WD raptor
> Janos Haar wrote:
>> Hello list,
>>
>> I have a (planned) production-ready server with an Intel DP35DP
>> motherboard and 6 drives:
>>
>> 2x 500GB WD SATA (not interesting)
>> 4x 300GB WD Velociraptor, SATA2
>>
>> While testing the server I saw one error report in dmesg:
>>
>> ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xa frozen
>> ata3.00: irq_stat 0x00400040, connection status changed
>> ata3: SError: { PHYRdyChg DevExch }
>> ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> res 40/00:20:40:de:90/00:00:07:00:00/40 Emask 0x10 (ATA bus
>> error)
>
> You're getting a PHY event on flush, which is a pretty strong indication
> that you're having a power problem. The disk goes to write out the data in
> its buffer to the platter and draws more power from the cable. For some
> reason, power is not maintained properly. The disk drops out momentarily,
> causing the PHY event and losing the data in its buffer. Try connecting
> the hard drive to a separate PSU and see whether the problem goes away.
Hello,
Thank you for the answer.
This server is now a production system, and runs an important application.
The problem still exists, but it seems to appear only when I am testing
transfers with big files
(the application does not do that).
About the power:
This PC has one 650W Chieftech PSU, 1 quad-core CPU, and 6 HDDs.
I previously measured the power draw on the line, and the PC uses
only 100-120W at peak.
The problem appears only on the 4 Raptor HDDs, and each of these drives uses
only 6W (according to the documentation).
It is hard to try a separate PSU or some other hardware solution.
Also, in general I think it is not a power issue; I am 90% sure.
Are you sure this cannot be a software issue?
If you say yes, I will go into the server room and try another PSU
anyway....
More info:
The PC has 8GB of RAM, and memtest previously ran for 4 days continuously
without error.
Thanks,
Janos Haar
* Re: possible data corruption on ICH8 or WD raptor
2008-08-01 10:40 ` Janos Haar
@ 2008-08-01 15:36 ` Tejun Heo
2008-08-01 22:40 ` Alan Cox
1 sibling, 0 replies; 5+ messages in thread
From: Tejun Heo @ 2008-08-01 15:36 UTC (permalink / raw)
To: Janos Haar; +Cc: linux-ide
Janos Haar wrote:
>> You're getting a PHY event on flush, which is a pretty strong indication
>> that you're having a power problem. The disk goes to write out the data in
>> its buffer to the platter and draws more power from the cable. For some
>> reason, power is not maintained properly. The disk drops out momentarily,
>> causing the PHY event and losing the data in its buffer. Try connecting
>> the hard drive to a separate PSU and see whether the problem goes away.
>
> Thank you for the answer.
>
> This server is now a production system, and runs an important application.
> The problem still exists, but it seems to appear only when I am testing
> transfers with big files
> (the application does not do that).
>
> About the power:
> This PC has one 650W Chieftech PSU, 1 quad-core CPU, and 6 HDDs.
> I previously measured the power draw on the line, and the PC uses
> only 100-120W at peak.
>
> The problem appears only on the 4 Raptor HDDs, and each of these drives uses
> only 6W (according to the documentation).
>
> It is hard to try a separate PSU or some other hardware solution.
> Also, in general I think it is not a power issue; I am 90% sure.
Don't be too sure. Power problems seem pretty common. We (or rather I)
often suggest ruling out a power problem first, and an unexpectedly
high portion of weird problems turn out to be caused by power. In
most of those cases, the wattage or brand printed on the PSU didn't mean
much.
> Are you sure this cannot be a software issue?
> If you say yes, I will go into the server room and try another PSU
> anyway....
No, I'm not sure at all that it can't be a software issue. What I know is...
* FLUSH is one of the commands less likely to trigger a state
machine or transfer logic problem. It's a command without any data.
It's pretty difficult to get that wrong while getting the others right.
* Without ruling out a power problem, debugging is really difficult, as
power problems can manifest in unpredictable ways. Plus, ruling out
a power problem isn't too difficult. Just hook up a separate PSU and
connect the problematic hard drives to it.
* For some reason, we've been seeing a good portion of weird link-related
or data corruption problems following a timeout or PHY event turn out to
be power-related. I understand the link problems, as high-speed serial
links are highly susceptible to interference. I don't know why
there suddenly seem to be more machines where the disk loses data due to
power instability. Maybe SATA made it cheap and easy to hook up more
disks to a machine. Maybe those multi-lane power supplies just suck. I
don't know.
If you can't hook up a separate PSU, can you please run "smartctl -a
/dev/sdX" right after boot and again after the phy error occurs and
report the results?
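
A minimal sketch of how the two requested `smartctl -a` reports could be compared, assuming the usual ten-column attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function names are illustrative. The thread does not say which attributes matter, so the sketch simply reports every raw value that changed.

```python
import re

# One attribute row: id, name, flag, value, worst, thresh, type, updated,
# when_failed, then the leading integer of the raw value.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute name -> raw value for every attribute row in the report."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def changed_attributes(before_text, after_text):
    """Return {name: (before, after)} for attributes whose raw value changed."""
    before = parse_attributes(before_text)
    after = parse_attributes(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

The two inputs would be the saved outputs of `smartctl -a /dev/sdX`, one taken right after boot and one after the error. A power-cycle or retract counter that increases across a PHY event without a reboot would be consistent with the drive briefly losing power, though that reading is an assumption and not something stated in the thread.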
--
tejun
* Re: possible data corruption on ICH8 or WD raptor
2008-08-01 10:40 ` Janos Haar
2008-08-01 15:36 ` Tejun Heo
@ 2008-08-01 22:40 ` Alan Cox
1 sibling, 0 replies; 5+ messages in thread
From: Alan Cox @ 2008-08-01 22:40 UTC (permalink / raw)
To: Janos Haar; +Cc: Tejun Heo, linux-ide
> The problem appears only on the 4 Raptor HDDs, and each of these drives uses
> only 6W (according to the documentation).
Are all the drives off the same cable from the PSU? If so, you may find
that just shuffling some onto different lines is enough.