SATA drive reset/disable events on ICH7 ata_piix when polling SMART info

linux-ide.vger.kernel.org archive mirror

* SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
@ 2010-02-05 14:07 Tim Small
  2010-02-05 14:17 ` [smartmontools-support] " Justin Piszcz
  0 siblings, 1 reply; 13+ messages in thread
From: Tim Small @ 2010-02-05 14:07 UTC (permalink / raw)
  To: smartmontools-support@lists.sourceforge.net, linux-ide

Hi,

I have a couple of Debian Lenny ("2.6.26-2-amd64") boxes on rented 
hardware, each has a couple of SATA drives:

One has 2x 1TB Seagate Barracuda 7200.11 model ST31000333AS firmware SD35

The other has 2x 2TB WD Caviar Green model WDC WD20EADS-00R6B0 firmware 
01.00A01

... the machines are currently set up to run smartd, and also log HDD 
temp via munin.  ata_piix is the driver in use.

The WD machine did this sort of thing a couple of times, which got my 
attention.

[119061.717865] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[119061.717865] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[119061.717865] ata1.00: status: { DRDY }
[119071.117368] ata1: link is slow to respond, please be patient (ready=0)
[119079.800059] ata1: device not ready (errno=-16), forcing hardreset
[119079.800091] ata1: soft resetting link
[119087.950128] ata1: link is slow to respond, please be patient (ready=0)
[119097.895803] ata1: SRST failed (errno=-16)
[119097.895881] ata1: soft resetting link
[119107.170874] ata1: link is slow to respond, please be patient (ready=0)
[119114.902193] ata1: SRST failed (errno=-16)
[119114.902219] ata1: soft resetting link
[119123.749111] ata1: link is slow to respond, please be patient (ready=0)
[119176.735727] ata1: SRST failed (errno=-16)
[119176.735761] ata1: soft resetting link
[119185.513569] ata1: SRST failed (errno=-16)
[119185.513593] ata1: reset failed, giving up
[119185.513622] ata1.00: disabled
[119185.513643] ata1.01: disabled
[119185.513680] end_request: I/O error, dev sda, sector 39069887
[119185.516684] ata1: EH complete
[119186.013456] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[119186.013456] end_request: I/O error, dev sda, sector 36525807


If I run a continuous "dd of=file ; sync ; rm file ; sync" to a file on 
the RAID1 mirror of both drives, while also running a continuous 
"smartctl -s on -a /dev/sdX > /dev/null || echo failed" loop, then:

1. The smartctl command fails about once in 20 times, and I get a lot of 
this happening:

[93058.989603] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93058.989645] ata1.01: cmd 35/00:00:a4:f2:51/00:04:03:00:00/f0 tag 0 dma 524288 out
[93058.993582] ata1.01: status: { DRDY }
[93058.993582] ata1: soft resetting link

[93090.804353] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93090.804395] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[93090.804427] ata1.01: status: { DRDY }
[93090.804458] ata1: soft resetting link

[93252.493902] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93252.493913] ata1.01: cmd c8/00:80:4c:d0:83/00:00:00:00:00/fa tag 0 dma 65536 in
[93252.493913] ata1.01: status: { DRDY }
[93252.493913] ata1: soft resetting link

[96265.917847] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96265.917889] ata1.01: cmd c8/00:80:4c:2c:c1/00:00:00:00:00/fa tag 0 dma 65536 in
[96265.921800] ata1.01: status: { DRDY }
[96265.921800] ata1: soft resetting link

[96405.491834] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96405.491834] ata1.01: cmd 25/00:00:cc:a6:c3/00:04:0a:00:00/f0 tag 0 dma 524288 in
[96405.491834] ata1.01: status: { DRDY }
[96413.900149] ata1: link is slow to respond, please be patient (ready=0)

[99772.901861] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[99772.901861] ata1.01: cmd ca/00:08:cc:d3:54/00:00:00:00:00/f3 tag 0 dma 4096 out
[99772.901861] ata1.01: status: { DRDY }
[99783.604235] ata1: link is slow to respond, please be patient (ready=0)

[100012.860158] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100012.860201] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[100012.860247] ata1.01: status: { DRDY }
[100012.860281] ata1: soft resetting link

[100256.314912] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100256.314950] ata1.01: cmd c8/00:80:cc:12:13/00:00:00:00:00/fb tag 0 dma 65536 in
[100256.314997] ata1.01: status: { DRDY }
[100256.315025] ata1: soft resetting link

[101528.503318] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[101528.503318] ata1.01: cmd c8/00:00:4c:c4:2c/00:00:00:00:00/fb tag 0 dma 131072 in
[101528.503318] ata1.01: status: { DRDY }
[101535.883662] ata1: link is slow to respond, please be patient (ready=0)

[107747.382563] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107747.382605] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107747.386545] ata1.01: status: { DRDY }
[107747.386545] ata1: soft resetting link

[107918.831736] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107918.831736] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107918.831736] ata1.01: status: { DRDY }
[107918.831736] ata1: soft resetting link


Sometimes the "resetting link" happens a few times, and if it happens 
enough times, then ata_piix gives up and disables BOTH drives (like the 
first time), which is a bit annoying - this reset-fails behaviour 
normally seems to happen when the drives are not doing much (i.e. in 
normal operation rather than under-test).
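
The polling half of the test described above can be sketched in Python. This is only an illustration, not the loop that was actually run: count_failures, its run parameter and SMARTCTL_ARGV are names invented here, and /dev/sdX is a placeholder exactly as in the original command. Like the "|| echo failed" in the shell version, it treats any non-zero exit status as a failure.

```python
import subprocess


def count_failures(argv, iterations, run=subprocess.call):
    """Run argv `iterations` times and return how many runs exited non-zero.

    `run` is injectable so the loop can be exercised without real hardware.
    """
    failures = 0
    for _ in range(iterations):
        # Discard the report itself, as "> /dev/null" does in the shell loop.
        if run(argv, stdout=subprocess.DEVNULL) != 0:
            failures += 1
    return failures


# The command quoted in the message; /dev/sdX is a placeholder device.
SMARTCTL_ARGV = ["smartctl", "-s", "on", "-a", "/dev/sdX"]
```

Calling count_failures(SMARTCTL_ARGV, 100) while the dd loop runs in another shell would give a number that can be compared against the "about once in 20 times" figure.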

If I disable smart data collection (smartd and munin), then the errors 
seem to stop - which I can do obviously, but would prefer not to.

smartctl -x reports the following interesting-looking stuff on the 
device which I've been stressing with smartctl:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x8000  4        79322  Vendor specific

and this on the one where I haven't:

0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x8000  4         6779  Vendor specific



... so I would suspect that this is a bug in the WD drives, except that 
the same thing seems to occasionally happen on the machine with the 
Seagate drives:

[1718254.879156] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1718254.879211] ata1.00: cmd c8/00:08:3c:f1:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1718254.879213]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1718254.879316] ata1.00: status: { DRDY }
[1718262.237404] ata1: link is slow to respond, please be patient (ready=0)
[1718270.057698] ata1: device not ready (errno=-16), forcing hardreset
[1718270.057732] ata1: soft resetting link
[1718277.841779] ata1: link is slow to respond, please be patient (ready=0)
[1718281.134473] ata1.00: configured for UDMA/133
[1718281.192815] ata1.01: configured for UDMA/133
[1718281.192815] ata1: EH complete

[1729049.865692] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1729049.865692] ata1.00: cmd c8/00:08:dc:b3:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1729049.865692]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1729049.865692] ata1.00: status: { DRDY }
[1729059.627313] ata1: link is slow to respond, please be patient (ready=0)
[1729068.499782] ata1: device not ready (errno=-16), forcing hardreset
[1729068.499823] ata1: soft resetting link
[1729078.434813] ata1: link is slow to respond, please be patient (ready=0)
[1729088.807850] ata1: SRST failed (errno=-16)
[1729088.807881] ata1: soft resetting link
[1729089.582856] ata1.00: configured for UDMA/133


with this on the stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           10  Device-to-host register FISes sent due to a COMRESET


and this on the non-stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
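
To compare the COMRESET counter between a stressed and a non-stressed drive without reading the tables by eye, something like the following Python sketch could be used. It assumes the four-column layout shown in the excerpts above (ID, size, value, description); parse_phy_counters and comreset_delta are names made up for this illustration, not part of smartmontools.

```python
import re

# One row of the "SATA Phy Event Counters" table as shown in the excerpts:
# ID (hex), size in bytes, value, description.
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_phy_counters(text):
    """Return {counter_id: value} for every table row found in text."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counters[int(match.group(1), 16)] = int(match.group(3))
    return counters


def comreset_delta(stressed_text, idle_text, counter_id=0x000A):
    """Difference in counter 0x000a (COMRESET FISes) between two drives."""
    stressed = parse_phy_counters(stressed_text)
    idle = parse_phy_counters(idle_text)
    return stressed.get(counter_id, 0) - idle.get(counter_id, 0)
```

Header lines and the "..." elision do not match the row pattern and are skipped, so the smartctl -x excerpts can be pasted in unchanged.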


I'd be happy to put a newer kernel on one or both machines to see if 
that'd have any effect.  I also tried doing "hdparm -I" instead of 
"smartctl -a" for a few hours but that didn't elicit any "frozen" 
messages (although I should probably run it for a bit longer to have 
more confidence in that statement).

So, err I suppose that this could be a bug in:

. smartctl
. both HD firmwares
. ata_piix (certainly disabling both drives seems a bit drastic, but I 
don't know if this is a function of the hardware)
. the ICH7 hardware

unfortunately as I don't own the hardware, I'm not in a position to get 
a different SATA controller in the boxes to eliminate the last two.

Any ideas welcome....

Cheers,

Tim.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-05 14:07 SATA drive reset/disable events on ICH7 ata_piix when polling SMART info Tim Small
@ 2010-02-05 14:17 ` Justin Piszcz
  2010-02-05 14:31   ` Tim Small
  0 siblings, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2010-02-05 14:17 UTC (permalink / raw)
  To: Tim Small; +Cc: smartmontools-support@lists.sourceforge.net, linux-ide

On Fri, 5 Feb 2010, Tim Small wrote:

> Hi,
>
> I have a couple of Debian Lenny ("2.6.26-2-amd64") boxes on rented
> hardware, each has a couple of SATA drives:
>
> One has 2x 1TB Seagate Barracuda 7200.11 model ST31000333AS firmware SD35
>
> The other has 2x 2TB WD Caviar Green model WDC WD20EADS-00R6B0 firmware
> 01.00A01
>
> ... the machines are currently set up to run smartd, and also log HDD
> temp via munin.  ata_piix is the driver in use.
>
> The WD machine did this sort of thing a couple of times, which got my
> attention.
>

I have seen people report similar problems with the following drives:

1 - Velociraptors (me/others) (don't work at all in raid correctly)
http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/
2 - Green Drives (search this list, there are similar problems) in Linux.

I have Caviar Black and WD RE3, they work OK in Linux.

--

The WD Velociraptors do not work (at least in RAID).
As for the Greens: again, search the list; I recall seeing people have problems.

--

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-05 14:17 ` [smartmontools-support] " Justin Piszcz
@ 2010-02-05 14:31   ` Tim Small
  2010-02-05 14:48     ` Justin Piszcz
  2010-02-05 21:47     ` Mark Lord
  0 siblings, 2 replies; 13+ messages in thread
From: Tim Small @ 2010-02-05 14:31 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: smartmontools-support@lists.sourceforge.net, linux-ide

Justin Piszcz wrote:
> I have seen people report similar problems with the following drives:
>
> 1 - Velociraptors (me/others) (don't work at all in raid correctly)
> http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/ 
>
> 2 - Green Drives (search this list, there are similar problems) in Linux.
>
> I have Caviar Black and WD RE3, they work OK in Linux.

OK, but:

1. Unlike the link you sent, there's nothing suspicious in any of the 
SMART attributes on any of the four drives - no bad sectors or other 
errors i.e. the following raw values are all zero on the WD drive which 
I've been stressing:

Raw_Read_Error_Rate
Reallocated_Sector_Ct
Seek_Error_Rate
Spin_Retry_Count
Calibration_Retry_Count
Reallocated_Event_Count
Current_Pending_Sector
Offline_Uncorrectable
UDMA_CRC_Error_Count
Multi_Zone_Error_Rate

... as well as empty SMART errors logs.


2. A few failures were seen with the Seagate drives as well (see last 
bits of the email), similarly with no apparent bad SMART attributes.
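
The check in point 1 can be automated along these lines. This is a sketch that assumes the usual ten-column attribute table printed by smartctl -A (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); nonzero_raw_values and WATCHED are names invented for the example. Raw-value encoding is vendor specific, so a non-zero raw value is not necessarily an error on every drive model.

```python
# The attributes listed in point 1 above.
WATCHED = (
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Spin_Retry_Count",
    "Calibration_Retry_Count",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
)


def nonzero_raw_values(attribute_table, names=WATCHED):
    """Return {name: raw_value} for watched attributes whose raw value is not 0."""
    flagged = {}
    for line in attribute_table.splitlines():
        # The raw value is the tenth column and may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or something else entirely
        name, raw = fields[1], fields[9].strip()
        if name in names and raw.split()[0] != "0":
            flagged[name] = raw
    return flagged
```

An empty result corresponds to the situation described above, where all of the listed raw values are zero.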


Thanks,

Tim.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-05 14:31   ` Tim Small
@ 2010-02-05 14:48     ` Justin Piszcz
  2010-02-05 21:47     ` Mark Lord
  1 sibling, 0 replies; 13+ messages in thread
From: Justin Piszcz @ 2010-02-05 14:48 UTC (permalink / raw)
  To: Tim Small; +Cc: smartmontools-support@lists.sourceforge.net, linux-ide

On Fri, 5 Feb 2010, Tim Small wrote:

> Justin Piszcz wrote:
>> I have seen people report similar problems with the following drives:
>> 
>> 1 - Velociraptors (me/others) (don't work at all in raid correctly)
>> http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/ 
>> 2 - Green Drives (search this list, there are similar problems) in Linux.
>> 
>> I have Caviar Black and WD RE3, they work OK in Linux.
>
> OK, but:
>
> 1. Unlike the link you sent, there's nothing suspicious in any of the SMART 
> attributes on any of the four drives - no bad sectors or other errors i.e. 
> the following raw values are all zero on the WD drive which I've been 
> stressing:
>
> Raw_Read_Error_Rate
> Reallocated_Sector_Ct
> Seek_Error_Rate
> Spin_Retry_Count
> Calibration_Retry_Count
> Reallocated_Event_Count
> Current_Pending_Sector
> Offline_Uncorrectable
> UDMA_CRC_Error_Count
> Multi_Zone_Error_Rate
>
> ... as well as empty SMART errors logs.

Hi,

They seem to have this problem with and without errors, but you should 
run the following (the last one from a boot CD or a recovery image, if you
have console access):

1. smartctl -t short
2. smartctl -t long
3. smartctl -t offline # and don't touch the host/machine for the amount
                        # of time it recommends
4. then show smartctl -a output
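
The list above leaves out the device argument that smartctl needs. A small Python sketch that spells out the full invocations, with self_test_plan and as_shell being names invented here and the device path left as a parameter:

```python
import shlex


def self_test_plan(device):
    """Return the smartctl invocations from the list above, in order."""
    return [
        ["smartctl", "-t", "short", device],
        ["smartctl", "-t", "long", device],
        ["smartctl", "-t", "offline", device],
        ["smartctl", "-a", device],
    ]


def as_shell(plan):
    """Render the plan as copy-pasteable shell lines."""
    return [shlex.join(argv) for argv in plan]
```

Each "-t" invocation only starts the test on the drive and returns straight away, so the steps should not be run back to back: wait for the time smartctl reports before starting the next one.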

The obvious things are:

1. Try/ask to get the cables replaced/check connectors.
2. Check to make sure the PSU is ok (w/ lm sensors etc)

--

When these errors occur and/or when you reboot do you ever notice any 
corruption or files in /lost+found?

--

Does it happen if you leave the drives alone (i.e. do not poll them with SMART)?

--

Some other misc/info that seems like it might be useful:

http://www.newegg.com/Product/ProductReview.aspx?Item=22-136-351&SortField=0&SummaryType=0&Pagesize=10&SelectedRating=-1&PurchaseMark=&VideoOnlyMark=False&VendorMark=&Page=1&Keywords=linux

Cons: WD changed the firmware Oct 2009 to disable SCT ERC (Error Recovery Control). These drives are desktop drives that have a 2 minute ERC setting. Most hardware RAID controllers require a maximum of 7 seconds ERC or your drives will be kicked out of the RAID array if it takes too long to recover a sector. Note the RAID controller can recover the troubled sector itself from the parity disk so it doesn't need the drives to try very hard to recover. This timing is the main difference between RAID drives and desktop drives. Prior to Oct 2009 you could use a WD (leaked) utility to enable the WD equivalent of ERC (called TLER - Time Limited Error Recovery) that would then set the recovery timer to 7 seconds. On the newer drives, you should only use them as desktop drives.

Other Thoughts: In addition, the green feature of the drive parks the head very often (every 8 seconds I think). If you use the drive as a Linux OS drive, chances are the drive head will be parked/unparked so often that it will exceed the rated 300,000 load cycles in less than a year. There is another WD (leaked) utility that allows you to set the park timer from 8 seconds to a maximum of 5 minutes. It uses a little more power but prolongs the drive life span. If you use the drive mainly as storage, then there should be nothing to worry about.
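
The load-cycle claim in the review can be checked with a little arithmetic. The sketch below assumes the worst case, one park/unpark cycle per timer interval around the clock, which a real workload is unlikely to reach; days_to_reach is a name made up for the example, and the 8 second, 5 minute and 300,000 figures are the ones quoted in the review.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_reach(rated_cycles, seconds_per_cycle):
    """Days until rated_cycles is hit at one load cycle per interval."""
    cycles_per_day = SECONDS_PER_DAY / seconds_per_cycle
    return rated_cycles / cycles_per_day


worst_case_8s = days_to_reach(300_000, 8)         # about 27.8 days
worst_case_5min = days_to_reach(300_000, 5 * 60)  # about 1041.7 days
```

So "less than a year" holds even if the heads park far less often than every 8 seconds, and at the 5 minute setting the same worst case stretches to nearly three years.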

Are you using them in raid or as a single disk?

http://doug.warner.fm/d/blog/2009/11/Western-Digital-15TB-Green-Drives-Not-your-Linux-Software-RAID

I'm not sure what to do with these WD drives; while they seem to work fine independently, they don't perform correctly at all when put into a RAID array.  I'm beginning to get afraid that as the hard drives get larger and larger the complexity of the firmware is growing too quickly for drive manufacturers to keep them performing reliably.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-05 14:31   ` Tim Small
  2010-02-05 14:48     ` Justin Piszcz
@ 2010-02-05 21:47     ` Mark Lord
  2010-02-06  3:39       ` Tejun Heo
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Lord @ 2010-02-05 21:47 UTC (permalink / raw)
  To: Tim Small
  Cc: Justin Piszcz, smartmontools-support@lists.sourceforge.net,
	linux-ide

Tim Small wrote:
> Justin Piszcz wrote:
>> I have seen people report similar problems with the following drives:
>>
>> 1 - Velociraptors (me/others) (don't work at all in raid correctly)
>> http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/ 
>>
>> 2 - Green Drives (search this list, there are similar problems) in Linux.
>>
>> I have Caviar Black and WD RE3, they work OK in Linux.
> 
> OK, but:
> 
> 1. Unlike the link you sent, there's nothing suspicious in any of the 
> SMART attributes on any of the four drives - no bad sectors or other 
> errors i.e. the following raw values are all zero on the WD drive which 
> I've been stressing:
> 
> Raw_Read_Error_Rate
> Reallocated_Sector_Ct
> Seek_Error_Rate
> Spin_Retry_Count
> Calibration_Retry_Count
> Reallocated_Event_Count
> Current_Pending_Sector
> Offline_Uncorrectable
> UDMA_CRC_Error_Count
> Multi_Zone_Error_Rate
> 
> ... as well as empty SMART errors logs.
> 
> 
> 2. A few failures were seen with the Seagate drives as well (see last 
> bits of the email), similarly with no apparent bad SMART attributes.
..

I have observed (and reported) the same issue in the past,
on Hitachi and Seagate drives.

The only constants seem to be libata and ICH7/8.
We must have a bug somewhere in there.

-ml

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-05 21:47     ` Mark Lord
@ 2010-02-06  3:39       ` Tejun Heo
  2010-02-06 15:26         ` Tim Small
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2010-02-06  3:39 UTC (permalink / raw)
  To: Mark Lord
  Cc: Tim Small, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Hello,

On 02/06/2010 06:47 AM, Mark Lord wrote:
>> 2. A few failures were seen with the Seagate drives as well (see last
>> bits of the email), similarly with no apparent bad SMART attributes.
> ..
> 
> I have observed (and reported) the same issue in the past,
> on Hitachi and Seagate drives.
> 
> The only constants seem to be libata and ICH7/8.
> We must have a bug somewhere in there.

In piix mode or ahci mode?  If in piix mode, ich7 and 8 would behave
quite differently.  ICH8 has SIDPR so it can hardreset while 7 can't.
ICH SIDPR access had a hardware problem where a write to SControl to
clear DET was sometimes ignored, which led to occasional hardreset
failures; that got fixed recently.  The reason why ICHs are involved
in those incidents could just be that they are extremely popular.

Things to try after such a complete drive shutdown are...

* Disconnect the drive from the host but do not remove power.
  Reconnect the drive to a different port and/or controller, does the
  drive work there?

* Power-cycle the drive (and issue manual rescan if necessary).  Does
  the drive get recognized again?

* Disconnect the drive and connect a different drive to the port.
  Does the port work?

* Soft reset the machine.  Can BIOS recognize the drive?

In many cases I've seen, it's usually that the drive's firmware is
completely hung and only power cycling the drive brought it back.  But
then again, there have been some number of cases which didn't get
diagnosed properly, so it's definitely possible that we're doing
something wrong in the driver.

Anyways, if it happens again, please try the above and try to find out
whether the controller or the drive is hung.  Also, please keep in
mind that timeouts on 0xEA (flush) are very often indicative of power
related issues.  FLUSH spikes power consumption and surprisingly many
PSUs fail to sustain proper voltage over that, so powering up a
separate PSU, connecting only the hard drive to it and seeing what
happens is often interesting too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-06  3:39       ` Tejun Heo
@ 2010-02-06 15:26         ` Tim Small
  2010-02-06 17:30           ` Mark Lord
  0 siblings, 1 reply; 13+ messages in thread
From: Tim Small @ 2010-02-06 15:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mark Lord, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Tejun Heo wrote:
>> The only constants seem to be libata and ICH7/8.
>> We must have a bug somewhere in there.
>>     
>
> In piix mode or ahci mode?  If in piix mode, ich7 and 8 would behave
> quite differently.  ICH8 has SIDPR so it can hardreset while 7 can't.
> ICH SIDPR access had a hardware problem where write to SControl to
> clear DET is sometimes ignored which led to occasional hardreset
> failure which got fixed recently.  The reason why ich's are involved
> in those incidents could just be that they are extremely popular.
>   

It's a non-AHCI capable ICH7, so it's in piix mode.

> Things to try after such completely drive shutdown are...
>   

Unfortunately I can't do much with this box, as it's a rented box in a
datacentre, however....

> * Soft reset the machine.  Can BIOS recognize the drive?
>   

Yes, if I 'echo b > /proc/sysrq-trigger', then the BIOS
recognises the drive, and the box reboots normally.

> In many cases I've seen, it's usually that the drive's firmware is
> completely hung and only power cycling the drive brought it back.  But
> then again, there have been some number of cases which didn't get
> diagnosed properly, so it's definitely possible that we're doing
> something wrong in the driver.
>
> Anyways, if it happens again, please try the above and try to find out
> whether the controller or the drive is hung.  Also, please keep in
> mind that timeouts on 0xEA (flush) is very often indicative of power
>   

OK, I didn't think I was seeing those - is it possible to tell from the
detail which I posted in my original message?  As for the potential for
PSU shenanigans - I don't have access to the box to fiddle with that,
unfortunately, but I believe I can stress the I/O subsystem quite
heavily with dd and/or bonnie, but it's only when polling for SMART
status that these errors show up.  I've just started dd (to RAID mirror)
+ hdparm -I again to check...

Do the SMART error counters in the OP make this suspicious?  Is there
likely to be any difference between running smartctl -a and hdparm -I in
terms of code path taken through the kernel, or timings on the hardware,
as far as you know?
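
On whether the 0xEA timeouts can be spotted in the posted detail: the first hex byte after "cmd" in a libata error line is the ATA opcode, and the first trace in the original message shows "cmd ea/...", which is FLUSH CACHE EXT. The later traces show b0 (SMART) and DMA reads and writes instead. A rough Python sketch for tallying this from a dmesg excerpt follows; decode_cmd, failed_commands and the OPCODES table are invented for this example, and the table covers only the opcodes seen in this thread plus plain FLUSH CACHE.

```python
import re
from collections import Counter

# libata prints a failed taskfile as
#   cmd <command>/<feature>:<nsect>:<lbal>:<lbam>:<lbah>/.../<device>
# so the first two hex bytes are the ATA opcode and the feature register.
CMD_LINE = re.compile(r"cmd ([0-9a-f]{2})/([0-9a-f]{2}):")

# Deliberately incomplete: just the opcodes relevant to this thread.
OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0xB0: "SMART",
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}

SMART_READ_LOG = 0xD5  # feature register value used with opcode 0xB0


def decode_cmd(line):
    """Return (opcode, name) for a libata "cmd" line, or None if absent."""
    match = CMD_LINE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    feature = int(match.group(2), 16)
    name = OPCODES.get(opcode, "unknown")
    if opcode == 0xB0 and feature == SMART_READ_LOG:
        name = "SMART READ LOG"
    return opcode, name


def failed_commands(dmesg_text):
    """Count failed commands by name across a whole dmesg excerpt."""
    decoded = (decode_cmd(line) for line in dmesg_text.splitlines())
    return Counter(name for _, name in filter(None, decoded))
```

Feeding the saved kernel log to failed_commands gives a per-command tally, which makes it easy to see how many of the frozen commands were flushes and how many were SMART reads.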

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-06 15:26         ` Tim Small
@ 2010-02-06 17:30           ` Mark Lord
  2010-02-06 22:22             ` Tim Small
  2010-02-08  2:49             ` Tejun Heo
  0 siblings, 2 replies; 13+ messages in thread
From: Mark Lord @ 2010-02-06 17:30 UTC (permalink / raw)
  To: Tim Small
  Cc: Tejun Heo, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Tim Small wrote:
> Tejun Heo wrote:
>>> The only constants seem to be libata and ICH7/8.
>>> We must have a bug somewhere in there.
>>>     
>> In piix mode or ahci mode?  If in piix mode, ich7 and 8 would behave
>> quite differently.  ICH8 has SIDPR so it can hardreset while 7 can't.
>> ICH SIDPR access had a hardware problem where write to SControl to
>> clear DET is sometimes ignored which led to occasional hardreset
>> failure which got fixed recently.  The reason why ich's are involved
>> in those incidents could just be that they are extremely popular.
>>   
> 
> It's a non-AHCI capable ICH7, so it's in piix mode.
> 
>> Things to try after such completely drive shutdown are...
>>   
> 
> Unfortunately I can't do much with this box, as it's a rented box in a
> datacentre, however....
> 
>> * Soft reset the machine.  Can BIOS recognize the drive?
>>   
> 
> Yes, if I 'echo b > /proc/sysrq-trigger', then the BIOS
> recognises the drive, and the box reboots normally.
> 
>> In many cases I've seen, it's usually that the drive's firmware is
>> completely hung and only power cycling the drive brought it back.  But
>> then again, there have been some number of cases which didn't get
>> diagnosed properly, so it's definitely possible that we're doing
>> something wrong in the driver.
>>
>> Anyways, if it happens again, please try the above and try to find out
>> whether the controller or the drive is hung.  Also, please keep in
>> mind that timeouts on 0xEA (flush) is very often indicative of power
>>   
> 
> OK, I didn't think I was seeing those - is it possible to tell from the
> detail which I posted in my original message?  As for the potential for
> PSU shenanigans - I don't have access to the box to fiddle with that,
> unfortunately, but I believe I can stress the I/O subsystem quite
> heavily with dd and/or bonnie, but it's only when polling for SMART
> status that these errors show up.  I've just started dd (to RAID mirror)
> + hdparm -I again to check...
> 
> Do the SMART error counters in the OP make this suspicious?  Is there
> likely to be any difference between running smartctl -a and hdparm -I in
> terms of code path taken through the kernel, or timings on the hardware,
> as far as you know?
..


My theory on the problem when I first had it here, was that doing
a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
the problem.  This was never explored further (by me or others).

Cheers


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-06 17:30           ` Mark Lord
@ 2010-02-06 22:22             ` Tim Small
  2010-02-07  4:51               ` Mark Lord
  2010-02-08  2:49             ` Tejun Heo
  1 sibling, 1 reply; 13+ messages in thread
From: Tim Small @ 2010-02-06 22:22 UTC (permalink / raw)
  To: Mark Lord
  Cc: Tejun Heo, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Mark Lord wrote:
> My theory on the problem when I first had it here, was that doing
> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
> the problem.  This was never explored further (by me or others).
>

Would using "options libata force=pio4" be a simple way to start to test
this hypothesis?

Ta,

Tim.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-06 22:22             ` Tim Small
@ 2010-02-07  4:51               ` Mark Lord
  2010-02-08  2:40                 ` Tejun Heo
  2010-02-08 13:03                 ` Tim Small
  0 siblings, 2 replies; 13+ messages in thread
From: Mark Lord @ 2010-02-07  4:51 UTC (permalink / raw)
  To: Tim Small
  Cc: Tejun Heo, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Tim Small wrote:
> Mark Lord wrote:
>> My theory on the problem when I first had it here, was that doing
>> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
>> the problem.  This was never explored further (by me or others).
>>
> 
> Would using "option libata force=pio4" be a simple way to start to test
> this hypothesis?
..

Yup.  If the hypothesis is FALSE, then you'll still see trouble.
Otherwise, it *might* be correct.  ;)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-07  4:51               ` Mark Lord
@ 2010-02-08  2:40                 ` Tejun Heo
  2010-02-08 13:03                 ` Tim Small
  1 sibling, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2010-02-08  2:40 UTC (permalink / raw)
  To: Mark Lord
  Cc: Tim Small, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Hello,

On 02/07/2010 01:51 PM, Mark Lord wrote:
> Tim Small wrote:
>> Mark Lord wrote:
>>> My theory on the problem when I first had it here, was that doing
>>> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
>>> the problem.  This was never explored further (by me or others).
>>>
>>
>> Would using "option libata force=pio4" be a simple way to start to test
>> this hypothesis?
> ..
> 
> Yup.  If the hypothesis is FALSE, then you'll still see trouble.
> Otherwise, it *might* be correct.  ;)

But that would be a big *might*.  The effect of PIO is a bit too
drastic to indicate anything positively (as opposed to ruling stuff out).

Anyways, yeap, no harm in trying.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-06 17:30           ` Mark Lord
  2010-02-06 22:22             ` Tim Small
@ 2010-02-08  2:49             ` Tejun Heo
  1 sibling, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2010-02-08  2:49 UTC (permalink / raw)
  To: Mark Lord
  Cc: Tim Small, Justin Piszcz,
	smartmontools-support@lists.sourceforge.net, linux-ide

Hello,

On 02/07/2010 02:30 AM, Mark Lord wrote:
>>> * Soft reset the machine.  Can BIOS recognize the drive?
>>
>> Yes, if I 'echo b > /proc/sysrq-trigger', then the BIOS
>> recognises the drive, and the box reboots normally.

Hmmm... this means one of the following.

1. The controller side is hung and needs some sort of reset or
   reinitialization to get working again.

2. The drive is hung, requiring a hardreset to continue.  ata_piix
   currently can't do hardresets on ich7 but resetting the machine
   will definitely generate hardresets.

3. The BIOS actually power-cycles the machine when told to reboot.
   Some BIOSen do this.

No chance you can access the machine there?

>>> Anyways, if it happens again, please try the above and try to find out
>>> whether the controller or the drive is hung.  Also, please keep in
>>> mind that timeouts on 0xEA (flush) is very often indicative of power
>>>   
>>
>> OK, I didn't think I was seeing those - is it possible to tell from the
>> detail which I posted in my original message?  As for the potential for
>> PSU shenanigans - I don't have access to the box to fiddle with that,
>> unfortunately, but I believe I can stress the I/O subsystem quite
>> heavily with dd and/or bonnie, but it's only when polling for SMART
>> status that these errors show up.  I've just started dd (to RAID mirror)
>> + hdparm -I again to check...

Oh... if that's the case, PSU problem wouldn't be very probable.

>> Do the SMART error counters in the OP make this suspicious?  Is there
>> likely to be any difference between running smartctl -a and hdparm -I in
>> terms of code path taken through the kernel, or timings on the hardware,
>> as far as you know?

From the driver's POV, hdparm and SMART commands behave pretty much the
same.  They travel through the same high/mid layer paths and get
issued using the same command protocol.  From the drive's POV, I imagine
it can be pretty different tho.

> My theory on the problem when I first had it here, was that doing
> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
> the problem.  This was never explored further (by me or others).

If that's the case, what would that mean?  Would it be some nasty
interaction inside the drive firmware?

Thanks.

-- 
tejun


* Re: [smartmontools-support] SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
  2010-02-07  4:51               ` Mark Lord
  2010-02-08  2:40                 ` Tejun Heo
@ 2010-02-08 13:03                 ` Tim Small
  1 sibling, 0 replies; 13+ messages in thread
From: Tim Small @ 2010-02-08 13:03 UTC (permalink / raw)
  To: Mark Lord, Tejun Heo, linux-ide
  Cc: Justin Piszcz, smartmontools-support@lists.sourceforge.net

Mark Lord wrote:
> Tim Small wrote:
>> Mark Lord wrote:
>>> My theory on the problem when I first had it here, was that doing
>>> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
>>> the problem.  This was never explored further (by me or others).
>>>
>>
>> Would using "option libata force=pio4" be a simple way to start to test
>> this hypothesis?
> ..
>
> Yup.  If the hypothesis is FALSE, then you'll still see trouble.
> Otherwise, it *might* be correct.  ;)

It looks like it is false then....

[59745.632984] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 
frozen
[59745.633036] ata1.01: cmd 34/00:00:87:c6:f7/00:04:00:00:00/f0 tag 0 
pio 524288 out
[59745.633086] ata1.01: status: { DRDY }
[59745.633117] ata1: soft resetting link
[59747.094498] ata1.00: FORCE: xfer_mask set to pio4
[59747.094498] ata1.01: FORCE: xfer_mask set to pio4
[59747.102353] ata1.00: configured for PIO4
[59747.108610] ata1.01: configured for PIO4
[59747.108610] ata1: EH complete
[59747.437125] sd 0:0:0:0: [sda] 3907029168 512-byte hardware sectors 
(2000399 MB)
[59747.499739] sd 0:0:0:0: [sda] Write Protect is off
[59747.499739] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[59747.844755] sd 0:0:0:0: [sda] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA
[59748.047834] sd 0:0:1:0: [sdb] 3907029168 512-byte hardware sectors 
(2000399 MB)
...


7 14:20:32: [101181.209812] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
7 14:20:32: [101181.209865] ata1.01: cmd 
34/00:00:0f:4d:f0/00:04:00:00:00/f0 tag 0 pio 524288 out
7 14:20:32: [101181.209909] ata1.01: status: { DRDY }
7 14:20:32: [101181.209946] ata1: soft resetting link
--
7 15:54:12: [110247.451925] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
7 15:54:12: [110247.451979] ata1.01: cmd 
34/00:00:bf:8e:e8/00:04:00:00:00/f0 tag 0 pio 524288 out
7 15:54:12: [110247.452028] ata1.01: status: { DRDY }
7 15:54:12: [110247.452062] ata1: soft resetting link
--
7 23:47:13: [155689.544839] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
7 23:47:13: [155689.544892] ata1.01: cmd 
34/00:00:d7:0f:fe/00:04:00:00:00/f0 tag 0 pio 524288 out
7 23:47:13: [155689.544935] ata1.01: status: { DRDY }
7 23:47:13: [155689.544974] ata1: soft resetting link
--
8 00:59:30: [162616.848048] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 00:59:30: [162616.848099] ata1.01: cmd 
34/00:00:5f:6b:e9/00:04:00:00:00/f0 tag 0 pio 524288 out
8 00:59:30: [162616.848143] ata1.01: status: { DRDY }
8 00:59:30: [162616.848175] ata1: soft resetting link
--
8 01:01:22: [162789.662299] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 01:01:22: [162789.662338] ata1.01: cmd 
34/00:00:5f:6c:ed/00:04:00:00:00/f0 tag 0 pio 524288 out
8 01:01:22: [162789.662381] ata1.01: status: { DRDY }
8 01:01:22: [162789.662418] ata1: soft resetting link
--
8 01:14:43: [164059.753030] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 01:14:43: [164059.753082] ata1.01: cmd 
ec/00:00:00:00:00/00:00:00:00:00/10 tag 0 pio 512 in
8 01:14:43: [164059.753129] ata1.01: status: { DRDY }
8 01:14:48: [164067.298313] ata1: link is slow to respond, please be 
patient (ready=0)
--
8 01:56:33: [168105.660062] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 01:56:33: [168105.660115] ata1.01: cmd 
34/00:00:0f:2f:e6/00:04:00:00:00/f0 tag 0 pio 524288 out
8 01:56:33: [168105.660164] ata1.01: status: { DRDY }
8 01:56:33: [168105.660193] ata1: soft resetting link
--
8 02:11:42: [169562.773251] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 02:11:42: [169562.773303] ata1.01: cmd 
34/00:00:87:8c:ef/00:04:00:00:00/f0 tag 0 pio 524288 out
8 02:11:42: [169562.773352] ata1.01: status: { DRDY }
8 02:11:42: [169562.773386] ata1: soft resetting link
--
8 04:35:16: [183417.972749] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 04:35:16: [183417.972749] ata1.01: cmd 
34/00:40:a7:7f:fc/00:01:00:00:00/f0 tag 0 pio 163840 out
8 04:35:16: [183417.972749] ata1.01: status: { DRDY }
8 04:35:16: [183417.972749] ata1: soft resetting link
--
8 07:11:47: [198460.847454] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 07:11:47: [198460.847507] ata1.01: cmd 
34/00:00:67:2c:ef/00:04:00:00:00/f0 tag 0 pio 524288 out
8 07:11:47: [198460.847555] ata1.01: status: { DRDY }
8 07:11:47: [198460.847583] ata1: soft resetting link
--
8 07:40:48: [201232.970903] ata1.01: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
8 07:40:48: [201232.970903] ata1.01: cmd 
34/00:00:c7:2d:e5/00:04:00:00:00/f0 tag 0 pio 524288 out
8 07:40:48: [201232.970903] ata1.01: status: { DRDY }
8 07:40:48: [201232.970903] ata1: soft resetting link
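
For reading the dumps above: the first byte after "cmd" is the ATA
command opcode, followed by the taskfile registers.  A minimal Python
sketch for decoding such a line, assuming the field order libata prints
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:
hob_lbam:hob_lbah/device).  The function name and the small opcode
table are illustrative only and far from exhaustive; a line that was
wrapped by the mailer needs to be rejoined first.

```python
import re

# Small, non-exhaustive opcode table (illustrative; see the ATA command set).
ATA_OPCODES = {
    0x34: "WRITE SECTOR(S) EXT",
    0x61: "WRITE FPDMA QUEUED",
    0xB0: "SMART",
    0xEA: "FLUSH CACHE EXT",
    0xEC: "IDENTIFY DEVICE",
}

# cmd CC/FF:NN:LL:MM:HH/ff:nn:ll:mm:hh/DD  (all fields are two hex digits)
CMD_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})",
    re.IGNORECASE,
)


def decode_cmd(line):
    """Decode a libata 'cmd ...' log line; return None if it does not match."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    opcode = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    return {
        "opcode": opcode,
        "name": ATA_OPCODES.get(opcode, "unknown"),
        "feature": feature,
        # 16-bit sector count and 48-bit LBA; only meaningful for LBA48
        # read/write commands such as 0x34 (raw register value, so a
        # count of 0 is reported as 0).
        "count": (hob_nsect << 8) | nsect,
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal,
    }


if __name__ == "__main__":
    line = "ata1.01: cmd 34/00:00:87:c6:f7/00:04:00:00:00/f0 tag 0 pio 524288 out"
    print(decode_cmd(line))
```

Run over the rejoined lines above, all but one of the timed-out
commands decode as 0x34 (a PIO write, with count * 512 matching the
"pio 524288 out" byte count), and the remaining one as 0xec (IDENTIFY
DEVICE), rather than as SMART (0xb0).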

... but, it turns out that I have another box at home which I've been 
able to provoke into doing similar things:

16:46:49: [1130032.307185] ata1.00: exception Emask 0x10 SAct 0x0 SErr 
0x4000000 action 0xe frozen
16:46:49: [1130032.307197] ata1.00: irq_stat 0x00000040, connection 
status changed
16:46:49: [1130032.307200] ata1: SError: { DevExch }
16:46:49: [1130032.307205] ata1.00: cmd 
b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
16:46:49: [1130032.307207]          res 
40/00:4c:1f:fa:9a/00:00:06:00:00/40 Emask 0x10 (ATA bus error)
16:46:49: [1130032.307210] ata1.00: status: { DRDY }
16:46:49: [1130032.307219] ata1: hard resetting link
16:46:55: [1130038.083028] ata1: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
16:47:25: [1130068.090133] ata1.00: qc timeout (cmd 0xec)
16:47:25: [1130068.090148] ata1.00: failed to IDENTIFY (I/O error, 
err_mask=0x5)
16:47:25: [1130068.090152] ata1.00: revalidation failed (errno=-5)
16:47:25: [1130068.090156] ata1: failed to recover some devices, 
retrying in 5 secs
16:47:30: [1130073.094116] ata1: hard resetting link
16:47:30: [1130073.414133] ata1: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
16:47:30: [1130073.436396] ata1.00: configured for UDMA/133
16:47:30: [1130073.436396] ata1: EH complete
16:47:30: [1130073.436396] sd 0:0:0:0: [sda] 976773168 512-byte hardware 
sectors (500108 MB)
16:47:30: [1130073.436396] sd 0:0:0:0: [sda] Write Protect is off
16:47:30: [1130073.436396] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00

17:21:21: [1132149.195367] ata1.00: exception Emask 0x10 SAct 0x0 SErr 
0x4040000 action 0xe frozen
17:21:21: [1132149.195378] ata1.00: irq_stat 0x00000040, connection 
status changed
17:21:21: [1132149.195384] ata1: SError: { CommWake DevExch }
17:21:21: [1132149.195394] ata1.00: cmd 
b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
17:21:21: [1132149.195397]          res 
40/00:2c:77:ad:63/00:00:06:00:00/40 Emask 0x10 (ATA bus error)
17:21:21: [1132149.195403] ata1.00: status: { DRDY }
--
18:28:29: [1136257.076898] ata1.00: exception Emask 0x0 SAct 0x7fffffff 
SErr 0x0 action 0x6 frozen
18:28:29: [1136257.076898] ata1.00: cmd 
61/00:00:27:b5:89/04:00:06:00:00/40 tag 0 ncq 524288 out
18:28:29: [1136257.076898]          res 
40/00:f4:27:b1:89/00:00:06:00:00/40 Emask 0x4 (timeout)
18:28:29: [1136257.076898] ata1.00: status: { DRDY }
18:28:29: [1136257.076898] ata1.00: cmd 
61/00:08:27:b9:89/04:00:06:00:00/40 tag 1 ncq 524288 out
18:28:29: [1136257.076898]          res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
--
18:53:19: [1137768.517637] ata1.00: exception Emask 0x10 SAct 0x0 SErr 
0x4040000 action 0xe frozen
18:53:19: [1137768.517637] ata1.00: irq_stat 0x00000040, connection 
status changed
18:53:19: [1137768.517637] ata1: SError: { CommWake DevExch }
18:53:19: [1137768.517637] ata1.00: cmd 
b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
18:53:19: [1137768.517637]          res 
40/00:0c:7b:99:09/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
18:53:19: [1137768.517637] ata1.00: status: { DRDY }
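
The SErr values in these dumps are bitmasks over the SATA SError
register, and the "SError: { ... }" line is just the decoded form.  A
minimal Python sketch of that decoding, assuming the bit positions from
the SATA specification and the short names libata prints; the table and
function names here are illustrative only.

```python
# SError bit positions -> short names as printed by libata.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def decode_sstatus(value):
    """Split an SStatus value into its DET, SPD and IPM fields."""
    return {
        "det": value & 0xF,         # 3: device present, PHY communication up
        "spd": (value >> 4) & 0xF,  # 1: Gen1 (1.5 Gbps)
        "ipm": (value >> 8) & 0xF,  # 1: interface in active state
    }


if __name__ == "__main__":
    print(decode_serror(0x4040000))
    print(decode_sstatus(0x113))
```

For the values logged above, 0x4000000 is bit 26 alone (DevExch) and
0x4040000 adds bit 18 (CommWake), which agrees with the decoded
SError lines; "SStatus 113" splits into DET=3, SPD=1, IPM=1.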


This also has an ICH7, but it's in AHCI mode, so ata_piix would seem to 
be off the hook in this case.

I have a couple of other SATA controllers in that box (JMicron 
20360/20363 and a SiI 3132), so I should be able to put the drive on 
those controllers instead to see if the same thing happens.  Annoyingly 
(but only from the PoV of that issue), I'm about to go on holiday, but 
I'll try and do this before I go....

Cheers,

Tim.


end of thread, other threads:[~2010-02-08 13:03 UTC | newest]

Thread overview: 13+ messages
2010-02-05 14:07 SATA drive reset/disable events on ICH7 ata_piix when polling SMART info Tim Small
2010-02-05 14:17 ` [smartmontools-support] " Justin Piszcz
2010-02-05 14:31   ` Tim Small
2010-02-05 14:48     ` Justin Piszcz
2010-02-05 21:47     ` Mark Lord
2010-02-06  3:39       ` Tejun Heo
2010-02-06 15:26         ` Tim Small
2010-02-06 17:30           ` Mark Lord
2010-02-06 22:22             ` Tim Small
2010-02-07  4:51               ` Mark Lord
2010-02-08  2:40                 ` Tejun Heo
2010-02-08 13:03                 ` Tim Small
2010-02-08  2:49             ` Tejun Heo
