Linux ATA/IDE development
* understanding the cause of ATA failures
@ 2010-03-18 21:50 Ludovico Cavedon
  2010-03-18 22:00 ` Tim Small
  2010-03-22  3:37 ` Robert Hancock
  0 siblings, 2 replies; 11+ messages in thread
From: Ludovico Cavedon @ 2010-03-18 21:50 UTC (permalink / raw)
  To: linux-ide

Hi,

I am trying to understand what might have been the cause for the
following two errors. The machine has 6 SATA drives, configured with
software RAID6.


> [513080.136611] ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
> [513080.136632] ata5: irq_stat 0x00400040, connection status changed
> [513080.136648] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
> [513080.136666] ata5: hard resetting link
> [513080.878347] ata5: SATA link down (SStatus 0 SControl 300)
> [513085.869812] ata5: hard resetting link
> [513086.219198] ata5: SATA link down (SStatus 0 SControl 300)
> [513086.219206] ata5: limiting SATA link speed to 1.5 Gbps
> [513091.210623] ata5: hard resetting link
> [513091.560036] ata5: SATA link down (SStatus 0 SControl 310)
> [513091.560044] ata5.00: disabled
> [513091.560055] ata5: EH complete
> [513091.560128] ata5.00: detaching (SCSI 4:0:0:0)
> [513091.560492] sd 4:0:0:0: [sde] Stopping disk
> [513091.560522] sd 4:0:0:0: [sde] START_STOP FAILED
> [513091.560524] sd 4:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [513659.777152] ata5: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
> [513659.777173] ata5: irq_stat 0x00000040, connection status changed
> [513659.777189] ata5: SError: { CommWake DevExch }
> [513659.777206] ata5: hard resetting link
> [513665.555794] ata5: link is slow to respond, please be patient (ready=0)
> [513669.808493] ata5: COMRESET failed (errno=-16)
> [513669.808509] ata5: hard resetting link
> [513672.593726] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [513674.832573] ata5.00: ATA-8: WDC WD20EADS-00S2B0, 01.00A01, max UDMA/133
> [513674.832577] ata5.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> [513674.835549] ata5.00: configured for UDMA/133
> [513674.835557] ata5: EH complete
> [513674.835716] scsi 4:0:0:0: Direct-Access     ATA      WDC WD20EADS-00S 01.0 PQ: 0 ANSI: 5
> [513674.835860] sd 4:0:0:0: Attached scsi generic sg4 type 0
> [513674.836739] sd 4:0:0:0: [sde] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
> [513674.836783] sd 4:0:0:0: [sde] Write Protect is off
> [513674.836786] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
> [513674.836807] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [513674.836936]  sde: unknown partition table
> [513674.849972] sd 4:0:0:0: [sde] Attached SCSI disk

One month later

> [2953663.906081] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [2953663.906136] ata3.00: cmd 61/08:00:9d:87:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
> [2953663.906137]          res 40/00:14:1d:69:81/00:00:77:00:00/40 Emask 0x4 (timeout)
> [2953663.906226] ata3.00: status: { DRDY }
> [2953663.906254] ata3: hard resetting link
> [2953669.287889] ata3: link is slow to respond, please be patient (ready=0)
> [2953673.900888] ata3: COMRESET failed (errno=-16)
> [2953673.900917] ata3: hard resetting link
> [2953679.282709] ata3: link is slow to respond, please be patient (ready=0)
> [2953683.895706] ata3: COMRESET failed (errno=-16)
> [2953683.895735] ata3: hard resetting link
> [2953689.277538] ata3: link is slow to respond, please be patient (ready=0)
> [2953718.872602] ata3: COMRESET failed (errno=-16)
> [2953718.872632] ata3: limiting SATA link speed to 1.5 Gbps
> [2953718.872635] ata3: hard resetting link
> [2953723.894975] ata3: COMRESET failed (errno=-16)
> [2953723.895005] ata3: reset failed, giving up
> [2953723.895030] ata3.00: disabled
> [2953723.895040] ata3: EH complete
> [2953723.895053] sd 2:0:0:0: [sdc] Unhandled error code
> [2953723.895056] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [2953723.895060] end_request: I/O error, dev sdc, sector 3907028893

I believe that the same error also happened for the other drives. The
RAID6 failed because other drives were removed as faulty. I have no
logs though.
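
For reference, the SErr values in the logs above can be decoded by hand.
The following is a minimal Python sketch, assuming the standard SATA
SError bit layout and the short names libata prints in its
"SError: { ... }" lines. The function names decode_serror and
decode_sstatus are made up for illustration.

```python
# Bit positions follow the SATA SError register layout; the short names
# mirror the ones printed in the "SError: { ... }" kernel log lines.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


def decode_sstatus(value):
    """Split SStatus into its DET, SPD and IPM fields (4 bits each)."""
    return {
        "det": value & 0xF,          # device detection / PHY state
        "spd": (value >> 4) & 0xF,   # negotiated link speed generation
        "ipm": (value >> 8) & 0xF,   # interface power management state
    }


if __name__ == "__main__":
    # SErr values from the two ata5 exceptions (the kernel prints them in hex).
    for serr in (0x4090800, 0x4040000):
        print(hex(serr), decode_serror(serr))
    # SStatus values from the "link down" and "link up" lines.
    for sstatus in (0x0, 0x123):
        print(hex(sstatus), decode_sstatus(sstatus))
```

For the values logged here, 0x4090800 decodes to HostInt, PHYRdyChg,
10B8B, DevExch and 0x4040000 to CommWake, DevExch, which matches the
kernel's own output. SStatus 123 gives DET=3, SPD=2, IPM=1 (link
established at 3.0 Gbps, as the log states), while SStatus 0 means no
device was detected on the port.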

Here is some info from the kernel log.
> [    3.115992] ahci 0000:00:1f.2: version 3.0
> [    3.116003]   alloc irq_desc for 19 on node 0
> [    3.116004]   alloc kstat_irqs on node 0
> [    3.116008] ahci 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
> [    3.116045]   alloc irq_desc for 58 on node 0
> [    3.116047]   alloc kstat_irqs on node 0
> [    3.116052] ahci 0000:00:1f.2: irq 58 for MSI/MSI-X
> [    3.116081] ahci: SSS flag set, parallel bus scan disabled
> [    3.116116] ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
> [    3.116119] ahci 0000:00:1f.2: flags: 64bit ncq sntf stag pm led clo pio slum part ems 
> [    3.116122] ahci 0000:00:1f.2: setting latency timer to 64
> [    3.220868] scsi0 : ahci
> [    3.220942] scsi1 : ahci
> [    3.220987] scsi2 : ahci
> [    3.221032] scsi3 : ahci
> [    3.221078] scsi4 : ahci
> [    3.221119] scsi5 : ahci
> [    3.221215] ata1: SATA max UDMA/133 abar m2048@0xfbed6000 port 0xfbed6100 irq 58
> [    3.221218] ata2: SATA max UDMA/133 abar m2048@0xfbed6000 port 0xfbed6180 irq 58
> [    3.221220] ata3: SATA max UDMA/133 abar m2048@0xfbed6000 port 0xfbed6200 irq 58
> [    3.221222] ata4: SATA max UDMA/133 abar m2048@0xfbed6000 port 0xfbed6280 irq 58
> [    3.221225] ata5: SATA max UDMA/133 abar m2048@0xfbed6000 port 0xfbed6300 irq 58
> [    3.221227] ata6: SATA max UDMA/133 abar m2048@0xfbed6000 port 0xfbed6380 irq 58
> [...]
> [    5.117816] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [    5.121331] ata3.00: ATA-8: WDC WD20EADS-00S2B0, 01.00A01, max UDMA/133
> [    5.121335] ata3.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> [    5.124641] ata3.00: configured for UDMA/133
> [    5.137847] scsi 2:0:0:0: Direct-Access     ATA      WDC WD20EADS-00S 01.0 PQ: 0 ANSI: 5
> [    5.137947] sd 2:0:0:0: Attached scsi generic sg2 type 0
> [    5.137968] sd 2:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
> [    5.137991] sd 2:0:0:0: [sdc] Write Protect is off
> [    5.137993] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [    5.138005] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [    5.138072]  sdc: sdc1 sdc2
> [    5.196726] sd 2:0:0:0: [sdc] Attached SCSI disk

The full log is at
http://www.cs.ucsb.edu/~cavedon/dmesg.log.gz

Controller, from lspci:
> 00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller

Is there any way to understand what caused the failures? Is it possible
to rule out the hard drive, the cable, the controller, or a kernel
fault?

Thank you in advance for any hint,
Cheers,
Ludovico






* Re: understanding the cause of ATA failures
  2010-03-18 21:50 understanding the cause of ATA failures Ludovico Cavedon
@ 2010-03-18 22:00 ` Tim Small
  2010-03-18 22:13   ` Ludovico Cavedon
  2010-03-22  3:37 ` Robert Hancock
  1 sibling, 1 reply; 11+ messages in thread
From: Tim Small @ 2010-03-18 22:00 UTC (permalink / raw)
  To: Ludovico Cavedon; +Cc: linux-ide

Ludovico Cavedon wrote:

> Is there any way to understand what caused the failures? Is it possible
> to rule out the hard drive, the cable, the controller, or a kernel
> fault?
>   

Do the drives have any SMART errors logged?  Any reallocated sectors? 
Do you run smartd, or any other smart data collection?  I've had a load
of trouble with WD drives when smart data collection was enabled. 
Haven't had time to get to the bottom of it, but I suspect a firmware bug.

Tim.


* Re: understanding the cause of ATA failures
  2010-03-18 22:00 ` Tim Small
@ 2010-03-18 22:13   ` Ludovico Cavedon
  2010-03-18 22:33     ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovico Cavedon @ 2010-03-18 22:13 UTC (permalink / raw)
  To: Tim Small; +Cc: linux-ide

Tim Small wrote:
> Do the drives have any SMART errors logged?  Any reallocated sectors? 

No SMART errors logged.
No reallocated sectors on any of the hard drives.

Well, I forgot to mention another weird thing. This is what happened:
* sdc failed and got removed from the RAID array (I pasted the log in my
previous email)
* sda got removed (no logs available)
* sdb got removed (no logs available)

When I realized the machine was down, I found that *sdd* was giving
I/O errors. So I had to replace sdd, but sda, sdb and sdc, which were the
drives that "failed" first, are working fine.

> Do you run smartd, or any other smart data collection?  I've had a load
> of trouble with WD drives when smart data collection was enabled. 

No smartd running.

> Haven't had time to get to the bottom of it, but I suspect a firmware bug.

SATA controller firmware bug?
Do you think changing the controller mode from "AHCI" to "IDE" in the
BIOS might help to prevent these errors?

Thanks for your answer,
Ludovico


* Re: understanding the cause of ATA failures
  2010-03-18 22:13   ` Ludovico Cavedon
@ 2010-03-18 22:33     ` Stan Hoeppner
  2010-03-18 23:03       ` Ludovico Cavedon
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2010-03-18 22:33 UTC (permalink / raw)
  To: linux-ide

Is there a SATA backplane involved or is each drive cabled directly to the
controller?  If backplane, is it active or passive?  Whose product is it?

Is this a relatively new machine or has it been running for some months
without problems until recently?

-- 
Stan


Ludovico Cavedon put forth on 3/18/2010 5:13 PM:
> Tim Small wrote:
>> Do the drives have any SMART errors logged?  Any reallocated sectors? 
> 
> No SMART errors logged.
> No reallocated sectors on any of the hard drives.
> 
> Well, I forgot to mention another weird thing. This is what happened:
> * sdc failed and got removed from the RAID array (I pasted the log in my
> previous email)
> * sda got removed (no logs available)
> * sdb got removed (no logs available)
> 
> When I realized the machine was down, I found that *sdd* was giving
> I/O errors. So I had to replace sdd, but sda, sdb and sdc, which were the
> drives that "failed" first, are working fine.
> 
>> Do you run smartd, or any other smart data collection?  I've had a load
>> of trouble with WD drives when smart data collection was enabled. 
> 
> No smartd running.
> 
>> Haven't had time to get to the bottom of it, but I suspect a firmware bug.
> 
> SATA controller firmware bug?
> Do you think changing the controller mode from "AHCI" to "IDE" in the
> BIOS might help to prevent these errors?
> 
> Thanks for your answer,
> Ludovico



* Re: understanding the cause of ATA failures
  2010-03-18 22:33     ` Stan Hoeppner
@ 2010-03-18 23:03       ` Ludovico Cavedon
  2010-03-18 23:39         ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovico Cavedon @ 2010-03-18 23:03 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-ide

Stan Hoeppner wrote:
> Is there a SATA backplane involved or is each drive cabled directly to the
> controller?  If backplane, is it active or passive?  Whose product is it?

No backplane.
This is the machine
http://www.supermicro.com/products/system/2U/6026/SYS-6026T-URF.cfm

> Is this a relatively new machine or has it been running for some months
> without problems until recently?

It is a new machine, running for only two months.

Thanks,
Ludovico



* Re: understanding the cause of ATA failures
  2010-03-18 23:03       ` Ludovico Cavedon
@ 2010-03-18 23:39         ` Stan Hoeppner
  2010-03-19  3:38           ` Ludovico Cavedon
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2010-03-18 23:39 UTC (permalink / raw)
  To: linux-ide

Ludovico Cavedon put forth on 3/18/2010 6:03 PM:
> Stan Hoeppner wrote:
>> Is there a SATA backplane involved or is each drive cabled directly to the
>> controller?  If backplane, is it active or passive?  Whose product is it?
> 
> No backplane.
> This is the machine
> http://www.supermicro.com/products/system/2U/6026/SYS-6026T-URF.cfm

It most certainly does have a backplane, and an active backplane at that.
Defective or marginal backplanes are known to cause intermittent problems of
the nature you're describing, especially active backplanes.  This is why I
asked.  "Enclosure Management" below is a feature of only active backplanes.
 The difference between active and passive is that active units have one or
more ASICs (chips) on the circuit board to control various functions of the
backplane such as fan control, alarms, drive monitoring circuits to sense
drive failures, etc.  Have you configured an I2C module to monitor the
backplane?  If so, check those logs.  If not, do so now.  It's possible that
the backplane controller is erroneously kicking the drives off-line.  This
could explain the SATA bus errors.  It's also possible there is a problem
with the backplane controller chip itself or other circuitry on the PCB
causing problems.

SAS Backplane
1x 2U SAS backplane w/ Enclosure Management

http://www.supermicro.com/products/chassis/2U/825/SC825TQ-R720U.cfm

>> Is this a relatively new machine or has it been running for some months
>> without problems until recently?
> 
> It is a new machine, running for only two months.

You need to call SuperMicro support and tell them about your issue.
Backplane boards are relatively cheap.  Get them to send you a warranty
advance replacement backplane and see if that fixes the problem.  If you're
not a hardware person, replacing it may not be a job for you.  In that case,
I'm not sure what to tell you, as last I knew SuperMicro doesn't offer
onsite service.  If indeed the backplane is the problem, you may have to
ship the unit back for repair.  This is the main reason (lack of onsite
service) that most companies stick with IBM, Dell, HP, etc. servers.

-- 
Stan


* Re: understanding the cause of ATA failures
  2010-03-18 23:39         ` Stan Hoeppner
@ 2010-03-19  3:38           ` Ludovico Cavedon
  2010-03-19 10:26             ` Stan Hoeppner
  2010-03-25  0:52             ` Tejun Heo
  0 siblings, 2 replies; 11+ messages in thread
From: Ludovico Cavedon @ 2010-03-19  3:38 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-ide

Stan Hoeppner wrote:
> Ludovico Cavedon put forth on 3/18/2010 6:03 PM:
>> Stan Hoeppner wrote:
>>> Is there a SATA backplane involved or is each drive cabled directly to the
>>> controller?  If backplane, is it active or passive?  Whose product is it?
>> No backplane.
>> This is the machine
>> http://www.supermicro.com/products/system/2U/6026/SYS-6026T-URF.cfm
> 
> It most certainly does have a backplane, and an active backplane at that.

Uhm, yes, you are right, it has a backplane with SES-2 over I2C support.
At first I thought that backplanes were used only on external enclosures.

> drive failures, etc.  Have you configured an I2C module to monitor the
> backplane?  If so, check those logs.  If not, do so now.  It's possible that

Unfortunately the I2C connectors on the backplane are not connected to
anything. The motherboard has an "IPMB I2C" connector, but I guess I
cannot connect a generic I2C device... I should probably get a USB-I2C
module...

Btw, once I am able to access the I2C device on the backplane, what tool
is able to query the state? I am having trouble finding documentation
about that. lm-sensors does not seem to mention that... Is the ses
kernel module able to work over i2c?

> the backplane controller is erroneously kicking the drives off-line.  This
> could explain the SATA bus errors.  It's also possible there is a problem
> with the backplane controller chip itself or other circuitry on the PCB
> causing problems.

I see.

>>> Is this a relatively new machine or has it been running for some months
>>> without problems until recently?
>> It is a new machine, running for only two months.
> 
> You need to call SuperMicro support and tell them about your issue.

Makes sense.

> Backplane boards are relatively cheap.  Get them to send you a warranty
> advance replacement backplane and see if that fixes the problem.  If you're
> not a hardware person, replacing it may not be a job for you.  In that case,

That would not be a problem.


Thank you for the information!

Cheers,
Ludovico


* Re: understanding the cause of ATA failures
  2010-03-19  3:38           ` Ludovico Cavedon
@ 2010-03-19 10:26             ` Stan Hoeppner
  2010-03-25  0:52             ` Tejun Heo
  1 sibling, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2010-03-19 10:26 UTC (permalink / raw)
  To: linux-ide

Ludovico Cavedon put forth on 3/18/2010 10:38 PM:

> Unfortunately the I2C connectors on the backplane are not connected to
> anything. The motherboard has an "IPMB I2C" connector, but I guess I
> cannot connect a generic I2C device... I should probably get a USB-I2C
> module...

Normally this backplane I2C connector would be cabled to a (real) RAID card
which would speak the right language.  Your motherboard has no such
dedicated I2C port for the SAS/SATA backplane.  I2C is a bus protocol, so
any I2C compliant device "should" be able to talk on that bus.  The problem
you'll probably run into is that there isn't any generic Linux software
designed to talk to an I2C chip such as the MG9072 on your backplane which
speaks SES-2 over I2C.  And according to various SM docs I found you have to
be running SAS drives/controller in order to use SES-2 over I2C since SES-2
uses the SCSI command set (which SATA obviously lacks).

In short, I don't know how it all needs to be hooked up, what the specific
device combination needs to be, or what Linux modules you need to
communicate with the MG9072 chip on that backplane.  Again, call your
vendor.  It's their product.  They should have the answers.

> Btw, once I am able to access the I2C device on the backplane, what tool
> is able to query the state? I am having trouble finding documentation
> about that. lm-sensors does not seem to mention that... Is the ses
> kernel module able to work over i2c?

lm-sensors isn't the right tool.  You need to talk SES-2 to that chip.  As I
said, short of having a real SAS RAID card, I'm not sure at this point if
you will be able to poll it at all.  Ask SuperMicro.

>> the backplane controller is erroneously kicking the drives off-line.  This
>> could explain the SATA bus errors.  It's also possible there is a problem
>> with the backplane controller chip itself or other circuitry on the PCB
>> causing problems.
> 
> I see.

The manufacturing cost of SCSI/SAS/SATA backplane PCBs is usually less than
$10 USD.  Retail price for a new replacement unit is only $69 at

http://www.atacom.com/program/atacom.cgi?KEYWORDS=RAAC_SUPE_AD_01&USER_ID=www&SEARCH=SEARCH_ALL&CODE=7581A0317

SuperMicro probably pays around $15 for this PCB and sells it to
distributors for $40 who then price it at approximately double their cost.

Ever heard the old saying "you get what you pay for"?  Ultra low cost items
don't get the quality control care that they should.  These backplane PCBs
are all made in China today by the lowest bid PCB manufacturer.  The QC on
disk drives is usually 2-3 orders of magnitude greater than these
backplanes, same goes for mainboards.  This is why backplanes are always the
first suspect when weird intermittent drive behavior is observed.

I'm not guaranteeing your problem is due to the backplane.  What I am saying
is that it's the most likely cause, historically, and thus the first place
to start troubleshooting.  I dealt with more than my share of backplane
problems back when SCSI RAID was king a little over a decade ago.  Mylex
DAC960s and AMI MegaRAID controllers tended to be very finicky about SCSI
bus signal quality.  We had quite a few problems with mid grade single drive
cages and 3-6 drive backplanes from various manufacturers.  IIRC about 1 in
10 backplanes showed problems on the bench while exercising the arrays
during system burn in and required replacement.  1 in 100+ was the norm for
most other products, from mainboards to disk drives.  We never had a bad
RAID controller.  Then again, at those prices back then, they better not
have been bad out of the box, at $350-$1000 each.

> Thank you for the information!

You're welcome.  Glad to pass on some of my experience if it can help someone
else.  I wish there was more I could do at this point, but it's pretty much
up to you now.  Hope I've helped steer you in the right direction.  As
always, don't put all your eggs in this one troubleshooting basket.  The
cause of the problem could also lie elsewhere so keep an open mind and
don't throw up your hands in frustration if this track doesn't pan out.

-- 
Stan


* Re: understanding the cause of ATA failures
  2010-03-18 21:50 understanding the cause of ATA failures Ludovico Cavedon
  2010-03-18 22:00 ` Tim Small
@ 2010-03-22  3:37 ` Robert Hancock
  1 sibling, 0 replies; 11+ messages in thread
From: Robert Hancock @ 2010-03-22  3:37 UTC (permalink / raw)
  To: Ludovico Cavedon; +Cc: linux-ide

On 03/18/2010 03:50 PM, Ludovico Cavedon wrote:
> Hi,
>
> I am trying to understand what might have been the cause for the
> following two errors. The machine has 6 SATA drives, configured with
> software RAID6.
>
>
>> [513080.136611] ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
>> [513080.136632] ata5: irq_stat 0x00400040, connection status changed
>> [513080.136648] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
>> [513080.136666] ata5: hard resetting link
>> [513080.878347] ata5: SATA link down (SStatus 0 SControl 300)
>> [513085.869812] ata5: hard resetting link
>> [513086.219198] ata5: SATA link down (SStatus 0 SControl 300)
>> [513086.219206] ata5: limiting SATA link speed to 1.5 Gbps
>> [513091.210623] ata5: hard resetting link
>> [513091.560036] ata5: SATA link down (SStatus 0 SControl 310)
>> [513091.560044] ata5.00: disabled
>> [513091.560055] ata5: EH complete
>> [513091.560128] ata5.00: detaching (SCSI 4:0:0:0)
>> [513091.560492] sd 4:0:0:0: [sde] Stopping disk
>> [513091.560522] sd 4:0:0:0: [sde] START_STOP FAILED
>> [513091.560524] sd 4:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> [513659.777152] ata5: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
>> [513659.777173] ata5: irq_stat 0x00000040, connection status changed
>> [513659.777189] ata5: SError: { CommWake DevExch }
>> [513659.777206] ata5: hard resetting link
>> [513665.555794] ata5: link is slow to respond, please be patient (ready=0)
>> [513669.808493] ata5: COMRESET failed (errno=-16)
>> [513669.808509] ata5: hard resetting link
>> [513672.593726] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [513674.832573] ata5.00: ATA-8: WDC WD20EADS-00S2B0, 01.00A01, max UDMA/133
>> [513674.832577] ata5.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> [513674.835549] ata5.00: configured for UDMA/133
>> [513674.835557] ata5: EH complete
>> [513674.835716] scsi 4:0:0:0: Direct-Access     ATA      WDC WD20EADS-00S 01.0 PQ: 0 ANSI: 5
>> [513674.835860] sd 4:0:0:0: Attached scsi generic sg4 type 0
>> [513674.836739] sd 4:0:0:0: [sde] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
>> [513674.836783] sd 4:0:0:0: [sde] Write Protect is off
>> [513674.836786] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
>> [513674.836807] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> [513674.836936]  sde: unknown partition table
>> [513674.849972] sd 4:0:0:0: [sde] Attached SCSI disk
>
> One month later
>
>> [2953663.906081] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>> [2953663.906136] ata3.00: cmd 61/08:00:9d:87:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
>> [2953663.906137]          res 40/00:14:1d:69:81/00:00:77:00:00/40 Emask 0x4 (timeout)
>> [2953663.906226] ata3.00: status: { DRDY }
>> [2953663.906254] ata3: hard resetting link
>> [2953669.287889] ata3: link is slow to respond, please be patient (ready=0)
>> [2953673.900888] ata3: COMRESET failed (errno=-16)
>> [2953673.900917] ata3: hard resetting link
>> [2953679.282709] ata3: link is slow to respond, please be patient (ready=0)
>> [2953683.895706] ata3: COMRESET failed (errno=-16)
>> [2953683.895735] ata3: hard resetting link
>> [2953689.277538] ata3: link is slow to respond, please be patient (ready=0)
>> [2953718.872602] ata3: COMRESET failed (errno=-16)
>> [2953718.872632] ata3: limiting SATA link speed to 1.5 Gbps
>> [2953718.872635] ata3: hard resetting link
>> [2953723.894975] ata3: COMRESET failed (errno=-16)
>> [2953723.895005] ata3: reset failed, giving up
>> [2953723.895030] ata3.00: disabled
>> [2953723.895040] ata3: EH complete
>> [2953723.895053] sd 2:0:0:0: [sdc] Unhandled error code
>> [2953723.895056] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> [2953723.895060] end_request: I/O error, dev sdc, sector 3907028893
>
> I believe that the same error also happened for the other drives. The
> RAID6 failed because other drives were removed as faulty. I have no
> logs though.

Well, this shows that the outstanding request timed out and that the
SATA link appeared to be down after that. It sounds rather like a hardware
problem (cable, drive, backplane, etc.). I can't really tell anything much
more specific than that.
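
To make the "outstanding request" concrete, the taskfile in the "cmd"
line can be unpacked. A minimal Python sketch follows, assuming the field
order libata uses when printing a failed command
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device)
and the NCQ convention that the sector count travels in the feature
fields and the tag in the upper bits of nsect. parse_taskfile is a
hypothetical helper, not part of any existing tool.

```python
import re

# Twelve hex bytes separated by '/' or ':' as printed after "cmd".
TF_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

# The two NCQ data commands; other opcodes are treated as non-NCQ here.
NCQ_COMMANDS = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def parse_taskfile(line):
    """Decode the taskfile bytes from a libata 'cmd' log line."""
    match = TF_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = (int(b, 16) for b in re.split(r"[/:]", match.group(1)))

    # 48-bit LBA: the "hob" (high order byte) fields hold bits 24..47.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)

    if command in NCQ_COMMANDS:
        # NCQ: count lives in the feature fields, tag in nsect bits 7:3.
        # (A raw count of 0 would mean 65536 sectors; not handled here.)
        sectors = hob_feature << 8 | feature
        tag = nsect >> 3
    else:
        sectors = hob_nsect << 8 | nsect
        tag = None

    return {
        "command": command,
        "name": NCQ_COMMANDS.get(command, "unknown"),
        "lba": lba,
        "sectors": sectors,
        "tag": tag,
        "device": device,
    }


if __name__ == "__main__":
    line = "ata3.00: cmd 61/08:00:9d:87:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out"
    print(parse_taskfile(line))
```

Applied to the ata3.00 line quoted above, this yields command 0x61
(WRITE FPDMA QUEUED), tag 0, and 8 sectors (4096 bytes) at LBA
3907028893, the same sector reported in the final end_request I/O error.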


* Re: understanding the cause of ATA failures
  2010-03-19  3:38           ` Ludovico Cavedon
  2010-03-19 10:26             ` Stan Hoeppner
@ 2010-03-25  0:52             ` Tejun Heo
  2010-03-26  2:22               ` Ludovico Cavedon
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2010-03-25  0:52 UTC (permalink / raw)
  To: Ludovico Cavedon; +Cc: Stan Hoeppner, linux-ide

Hello,

On 03/19/2010 12:38 PM, Ludovico Cavedon wrote:
> Uhm, yes, you are right, it has a backplane with SES-2 over I2C support.
> At first I thought that backplanes were used only on external enclosures.

If you have the machine locally, one thing you can try is to pull out
all the drives and hook them up directly to the motherboard bypassing
the backplane and see whether anything changes.

-- 
tejun


* Re: understanding the cause of ATA failures
  2010-03-25  0:52             ` Tejun Heo
@ 2010-03-26  2:22               ` Ludovico Cavedon
  0 siblings, 0 replies; 11+ messages in thread
From: Ludovico Cavedon @ 2010-03-26  2:22 UTC (permalink / raw)
  To: Tejun Heo, Stan Hoeppner; +Cc: linux-ide

Tejun Heo wrote:
> If you have the machine locally, one thing you can try is to pull out
> all the drives and hook them up directly to the motherboard bypassing
> the backplane and see whether anything changes.

Yes, I think I'll try to go this way. Unfortunately (or not :) the issue
happens once a month... I'll see whether increasing the load on the
machine makes it happen more frequently.
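
While waiting for a recurrence, it may help to tally the error-handler
events per port so that a pattern across ports (or the lack of one) is
visible. The following is a rough Python sketch, assuming kernel messages
in the format quoted in this thread. count_eh_events and the list of
matched phrases are illustrative choices, not an exhaustive catalogue of
libata messages.

```python
import re
from collections import Counter

# Phrases taken from the log excerpts in this thread; extend as needed.
EVENT_RE = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: "
    r"(exception|hard resetting link|COMRESET failed|SATA link down|"
    r"reset failed|disabled)"
)


def count_eh_events(lines):
    """Count matching libata error-handler messages per (port, event)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpts earlier in the thread.
    sample = [
        "[2953663.906081] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen",
        "[2953663.906254] ata3: hard resetting link",
        "[2953673.900888] ata3: COMRESET failed (errno=-16)",
        "[513080.878347] ata5: SATA link down (SStatus 0 SControl 300)",
        "[513091.560044] ata5.00: disabled",
    ]
    for (port, event), n in sorted(count_eh_events(sample).items()):
        print(f"{port:5} {event:22} {n}")
```

To run it over a saved log, pass any iterable of lines, for example an
open file object for wherever the dmesg output was saved.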

Stan Hoeppner wrote:
> In short, I don't know how it all needs to be hooked up, what the specific
> device combination needs to be, or what Linux modules you need to
> communicate with the MG9072 chip on that backplane.  Again, call your
> vendor.  It's their product.  They should have the answers.

Yes, I talked to Supermicro and they suggested basically what Tejun did:
bypass the backplane and see if the problem persists.

> You're welcome.  Glad to pass on some of my experience if it can help someone
> else.  I wish there was more I could do at this point, but it's pretty much
> up to you now.  Hope I've helped steer you in the right direction.  As
> always, don't put all your eggs in this one troubleshooting basket.  The
> cause of the problem could also lie elsewhere so keep an open mind and
> don't throw up your hands in frustration if this track doesn't pan out.

Very valuable info, thank you.

Cheers,
Ludovico
