From: Tim Small <tim@buttersideup.com>
To: "smartmontools-support@lists.sourceforge.net"
	<smartmontools-support@lists.sourceforge.net>,
	linux-ide@vger.kernel.org
Subject: SATA drive reset/disable events on ICH7 ata_piix when polling SMART info
Date: Fri, 05 Feb 2010 14:07:49 +0000
Message-ID: <4B6C2635.105@buttersideup.com>

Hi,

I have a couple of Debian Lenny ("2.6.26-2-amd64") boxes on rented 
hardware, each has a couple of SATA drives:

One has 2x 1TB Seagate Barracuda 7200.11 model ST31000333AS firmware SD35

The other has 2x 2TB WD Caviar Green model WDC WD20EADS-00R6B0 firmware 
01.00A01

... the machines are currently set up to run smartd, and also log HDD 
temp via munin.  ata_piix is the driver in use.

The WD machine did this sort of thing a couple of times, which got my 
attention.

[119061.717865] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[119061.717865] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[119061.717865] ata1.00: status: { DRDY }
[119071.117368] ata1: link is slow to respond, please be patient (ready=0)
[119079.800059] ata1: device not ready (errno=-16), forcing hardreset
[119079.800091] ata1: soft resetting link
[119087.950128] ata1: link is slow to respond, please be patient (ready=0)
[119097.895803] ata1: SRST failed (errno=-16)
[119097.895881] ata1: soft resetting link
[119107.170874] ata1: link is slow to respond, please be patient (ready=0)
[119114.902193] ata1: SRST failed (errno=-16)
[119114.902219] ata1: soft resetting link
[119123.749111] ata1: link is slow to respond, please be patient (ready=0)
[119176.735727] ata1: SRST failed (errno=-16)
[119176.735761] ata1: soft resetting link
[119185.513569] ata1: SRST failed (errno=-16)
[119185.513593] ata1: reset failed, giving up
[119185.513622] ata1.00: disabled
[119185.513643] ata1.01: disabled
[119185.513680] end_request: I/O error, dev sda, sector 39069887
[119185.516684] ata1: EH complete
[119186.013456] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[119186.013456] end_request: I/O error, dev sda, sector 36525807


If I run a continuous "dd of=file ; sync ; rm file ; sync" to a file on 
the RAID1 mirror of both drives, at the same time as running a continuous 
"smartctl -s on -a /dev/sdX > /dev/null || echo failed", then:

1. The smartctl command fails about once in 20 times, and I get a lot of 
this happening:

[93058.989603] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93058.989645] ata1.01: cmd 35/00:00:a4:f2:51/00:04:03:00:00/f0 tag 0 dma 524288 out
[93058.993582] ata1.01: status: { DRDY }
[93058.993582] ata1: soft resetting link

[93090.804353] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93090.804395] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[93090.804427] ata1.01: status: { DRDY }
[93090.804458] ata1: soft resetting link

[93252.493902] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93252.493913] ata1.01: cmd c8/00:80:4c:d0:83/00:00:00:00:00/fa tag 0 dma 65536 in
[93252.493913] ata1.01: status: { DRDY }
[93252.493913] ata1: soft resetting link

[96265.917847] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96265.917889] ata1.01: cmd c8/00:80:4c:2c:c1/00:00:00:00:00/fa tag 0 dma 65536 in
[96265.921800] ata1.01: status: { DRDY }
[96265.921800] ata1: soft resetting link

[96405.491834] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96405.491834] ata1.01: cmd 25/00:00:cc:a6:c3/00:04:0a:00:00/f0 tag 0 dma 524288 in
[96405.491834] ata1.01: status: { DRDY }
[96413.900149] ata1: link is slow to respond, please be patient (ready=0)

[99772.901861] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[99772.901861] ata1.01: cmd ca/00:08:cc:d3:54/00:00:00:00:00/f3 tag 0 dma 4096 out
[99772.901861] ata1.01: status: { DRDY }
[99783.604235] ata1: link is slow to respond, please be patient (ready=0)

[100012.860158] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100012.860201] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[100012.860247] ata1.01: status: { DRDY }
[100012.860281] ata1: soft resetting link

[100256.314912] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100256.314950] ata1.01: cmd c8/00:80:cc:12:13/00:00:00:00:00/fb tag 0 dma 65536 in
[100256.314997] ata1.01: status: { DRDY }
[100256.315025] ata1: soft resetting link

[101528.503318] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[101528.503318] ata1.01: cmd c8/00:00:4c:c4:2c/00:00:00:00:00/fb tag 0 dma 131072 in
[101528.503318] ata1.01: status: { DRDY }
[101535.883662] ata1: link is slow to respond, please be patient (ready=0)

[107747.382563] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107747.382605] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107747.386545] ata1.01: status: { DRDY }
[107747.386545] ata1: soft resetting link

[107918.831736] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107918.831736] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107918.831736] ata1.01: status: { DRDY }
[107918.831736] ata1: soft resetting link
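
For anyone reading along, the first byte after "cmd" in the log lines above is the ATA command opcode, and the byte after the slash is the feature register. A rough Python sketch for decoding them follows. The function name decode_cmd and the two tables are my own illustration, and the tables only cover the opcodes that appear in these logs, so treat it as a reading aid rather than a complete decoder:

```python
import re

# Only the opcodes seen in the logs above; not a complete ATA command table.
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0xB0: "SMART",
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
    0xEA: "FLUSH CACHE EXT",
}

# SMART subcommands are selected by the feature register (partial list).
SMART_FEATURES = {
    0xD0: "READ DATA",
    0xD5: "READ LOG",
    0xD8: "ENABLE OPERATIONS",
}

# Matches the leading "cmd <opcode>/<feature>:" part of a libata log line.
CMD_RE = re.compile(r"cmd ([0-9a-f]{2})/([0-9a-f]{2}):")


def decode_cmd(line):
    """Return a readable name for the command in a libata 'cmd' line, or None."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    feature = int(match.group(2), 16)
    name = ATA_OPCODES.get(opcode, "unknown opcode 0x%02x" % opcode)
    if opcode == 0xB0:
        name += " " + SMART_FEATURES.get(feature, "feature 0x%02x" % feature)
    return name
```

Run over the excerpts above, this suggests that the commands which time out are a mix of SMART READ LOG requests and ordinary reads and writes, not SMART commands alone.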


Sometimes the "resetting link" step happens a few times, and if it happens 
enough times, then ata_piix gives up and disables BOTH drives (as in the 
first log above), which is a bit annoying.  This failed-reset behaviour 
normally seems to happen when the drives are not doing much (i.e. in 
normal operation rather than under test).
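
To put a firmer number on the "about once in 20 times" smartctl failure rate mentioned earlier, something like the following could be used in place of the shell loop. This is only a sketch: count_failures is a hypothetical helper name, and it treats any non-zero exit status as a failure, just as the "|| echo failed" in my shell loop does. Note that smartctl's exit status is a bitmask, so a non-zero value does not always mean that a command to the drive failed.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times, discard its output, and count non-zero exits."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Example use (needs root and a real device, so it is left commented out):
#   failed = count_failures(["smartctl", "-s", "on", "-a", "/dev/sda"], 100)
#   print("%d of 100 runs failed" % failed)
```

The write load ("dd ; sync ; rm ; sync") would still need to run alongside this in a separate shell.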

If I disable smart data collection (smartd and munin), then the errors 
seem to stop - which I can do obviously, but would prefer not to.

smartctl -x reports the following interesting-looking stuff on the 
device which I've been stressing with smartctl:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x8000  4        79322  Vendor specific

and this on the one where I haven't:

0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x8000  4         6779  Vendor specific



... so I would suspect that this is a bug in the WD drives, except that 
the same thing seems to occasionally happen on the machine with the 
Seagate drives:

[1718254.879156] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1718254.879211] ata1.00: cmd c8/00:08:3c:f1:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1718254.879213]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1718254.879316] ata1.00: status: { DRDY }
[1718262.237404] ata1: link is slow to respond, please be patient (ready=0)
[1718270.057698] ata1: device not ready (errno=-16), forcing hardreset
[1718270.057732] ata1: soft resetting link
[1718277.841779] ata1: link is slow to respond, please be patient (ready=0)
[1718281.134473] ata1.00: configured for UDMA/133
[1718281.192815] ata1.01: configured for UDMA/133
[1718281.192815] ata1: EH complete

[1729049.865692] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1729049.865692] ata1.00: cmd c8/00:08:dc:b3:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1729049.865692]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1729049.865692] ata1.00: status: { DRDY }
[1729059.627313] ata1: link is slow to respond, please be patient (ready=0)
[1729068.499782] ata1: device not ready (errno=-16), forcing hardreset
[1729068.499823] ata1: soft resetting link
[1729078.434813] ata1: link is slow to respond, please be patient (ready=0)
[1729088.807850] ata1: SRST failed (errno=-16)
[1729088.807881] ata1: soft resetting link
[1729089.582856] ata1.00: configured for UDMA/133


with this on the stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           10  Device-to-host register FISes sent due to a COMRESET


and this on the non-stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
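
Comparing these counters by eye gets tedious, so here is a small Python sketch that parses the "ID  Size  Value  Description" rows and subtracts one drive's values from another's. The helper names parse_phy_counters and counter_delta are made up for this example, and the row format is assumed to match the excerpts quoted in this mail; other smartctl versions may print the table differently.

```python
import re

# One table row: hex ID, size in bytes, decimal value, free-text description.
ROW_RE = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.*)$")


def parse_phy_counters(text):
    """Map counter ID -> value for every row that looks like a counter line."""
    counters = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            counters[int(match.group(1), 16)] = int(match.group(3))
    return counters


def counter_delta(stressed, idle):
    """Per-ID difference (stressed minus idle); IDs missing from idle count as 0."""
    return {cid: value - idle.get(cid, 0) for cid, value in stressed.items()}


# Values copied from the WD excerpts earlier in this mail.
wd_stressed = parse_phy_counters("""
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x8000  4        79322  Vendor specific
""")
wd_idle = parse_phy_counters("""
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x8000  4         6779  Vendor specific
""")
wd_delta = counter_delta(wd_stressed, wd_idle)
```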


I'd be happy to put a newer kernel on one or both machines to see if 
that'd have any effect.  I also tried doing "hdparm -I" instead of 
"smartctl -a" for a few hours but that didn't elicit any "frozen" 
messages (although I should probably run it for a bit longer to have 
more confidence in that statement).

So, err I suppose that this could be a bug in:

. smartctl
. both HD firmwares
. ata_piix (certainly disabling both drives seems a bit drastic, but I 
don't know if this is a function of the hardware)
. the ICH7 hardware

Unfortunately, as I don't own the hardware, I'm not in a position to get 
a different SATA controller in the boxes to eliminate the last two.

Any ideas welcome....

Cheers,

Tim.

