linux-ide.vger.kernel.org archive mirror
* SATA resets via SMART selftest
@ 2008-10-10 20:38 Scott Beardsley
  2008-10-10 21:08 ` Alan Cox
From: Scott Beardsley @ 2008-10-10 20:38 UTC
  To: linux-ide

I originally posted this to smartmontools but was redirected here.

I am running into a problem where short or long smartctl selftests
cause a disk reset. I'm using kernel.org v2.6.27 (I've also tried a
few CentOS kernels) and smartmontools v5.38 (the latest of each). When I
initiate a short selftest it runs fine for a couple of seconds, then the
iowait jumps while the disk resets. I don't think this is a disk
issue, since I have 36 identical machines and they all show the same
reproducible behavior. I also don't think it is a power problem, because
the disks are seated correctly and the machine stays online when the
CPUs are at 100%. The disks seem to be functioning normally: I can read
and write the whole disk. After I request a short selftest I
get this in dmesg:

sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ca/00:40:bf:58:00/00:00:00:00:00/e0 tag 0 dma 32768 out
           res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: link is slow to respond, please be patient (ready=0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1.00: configured for UDMA/133
ata1: EH complete
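
For readers decoding the exception above, here is a minimal Python sketch that unpacks the taskfile bytes on the `cmd` line. It assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah, then the five HOB bytes, then device) and an LBA28 command; the function name `decode_cmd` and the small opcode table are illustrative, not part of any tool.

```python
import re

# Small illustrative subset of ATA opcodes; extend as needed.
ATA_OPCODES = {0xB0: "SMART", 0xC8: "READ DMA", 0xCA: "WRITE DMA"}

TASKFILE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_cmd(line, sector_size=512):
    """Decode the taskfile on a libata 'cmd' line (LBA28 layout assumed)."""
    m = TASKFILE.search(line)
    if m is None:
        raise ValueError("no taskfile found in line")
    b = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    command, _feature, nsect, lbal, lbam, lbah = b[:6]
    device = b[11]
    # LBA28: the top four address bits live in the low nibble of the device byte.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    # In 28-bit commands a sector count of 0 means 256 sectors.
    count = nsect or 256
    return {
        "name": ATA_OPCODES.get(command, "opcode 0x%02x" % command),
        "lba": lba,
        "sectors": count,
        "bytes": count * sector_size,
    }


if __name__ == "__main__":
    print(decode_cmd(
        "ata1.00: cmd ca/00:40:bf:58:00/00:00:00:00:00/e0 tag 0 dma 32768 out"))
```

Read this way, the command that timed out is an ordinary 64-sector WRITE DMA (matching the "dma 32768 out" on the same line), not the SMART command itself.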

Here is some smartctl info that might be helpful (you can see it reset 
three times):

# ./smartctl -d ata -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3500320NS
Serial Number:    9QM6YX2A
Firmware Version: SN05
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Oct 10 12:33:59 2008 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                          was completed without error.
                                          Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41) The self-test routine was interrupted
                                          by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 ( 634) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                          Auto Offline data collection on/off support.
                                          Suspend Offline collection upon new
                                          command.
                                          Offline surface scan supported.
                                          Self-test supported.
                                          Conveyance Self-test supported.
                                          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                          power-saving mode.
                                          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                          General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 114) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                          SCT Feature Control supported.
                                          SCT Data Table supported.
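
The "Self-test execution status" value of 41 shown above can be unpacked by hand. The sketch below assumes the ATA layout as I understand it: the high nibble of the status byte is the result code and the low nibble is the remaining portion of the test in 10% steps. The function name and the shortened meaning strings are illustrative only.

```python
# Result codes from the high nibble of the self-test execution status byte.
# Wording is abbreviated; codes not listed are reported as "other".
SELFTEST_CODES = {
    0x0: "completed without error or never run",
    0x1: "aborted by the host",
    0x2: "interrupted by the host with a hard or soft reset",
    0x3: "fatal or unknown error, test could not complete",
    0xF: "self-test in progress",
}


def decode_selftest_status(value):
    """Split the status byte (printed in decimal by smartctl) into its nibbles."""
    code = (value >> 4) & 0x0F
    remaining = (value & 0x0F) * 10  # low nibble counts in 10% increments
    return {
        "code": code,
        "meaning": SELFTEST_CODES.get(code, "other"),
        "remaining_pct": remaining,
    }


if __name__ == "__main__":
    print(decode_selftest_status(41))
```

For 41 (0x29) this gives code 2, which agrees with the "interrupted by the host with a hard or soft reset" text that smartctl prints.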

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   072   069   044    Pre-fail  Always       -       20837137
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       644433
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       212
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       8
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   070   045    Old_age   Always       -       28 (Lifetime Min/Max 28/28)
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 25 0 0)
195 Hardware_ECC_Recovered  0x001a   035   032   000    Old_age   Always       -       20837137
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      00%       212         -
# 2  Short offline       Interrupted (host reset)      00%       211         -
# 3  Extended offline    Interrupted (host reset)      00%       166         -
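
With 36 machines showing the same behavior, counting the interrupted entries by script may be easier than reading each log. A rough Python sketch follows; it assumes the column layout of the self-test log above (fields separated by two or more spaces) and ignores the trailing LBA column, so it should tolerate mail-wrapped rows. The function names are hypothetical.

```python
import re

# "# N  <description>  <status>  NN%  <hours>"; the LBA column is not parsed.
ENTRY = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)")


def parse_selftest_log(text):
    """Return one dict per self-test log entry found in smartctl output."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            num, desc, status, remaining, hours = m.groups()
            entries.append({
                "num": int(num),
                "test": desc,
                "status": status,
                "remaining_pct": int(remaining),
                "hours": int(hours),
            })
    return entries


def count_interrupted(entries):
    """Count entries whose status reports an interruption."""
    return sum(1 for e in entries if e["status"].startswith("Interrupted"))


SAMPLE = """\
# 1  Short offline       Interrupted (host reset)      00%       212         -
# 2  Short offline       Interrupted (host reset)      00%       211         -
# 3  Extended offline    Interrupted (host reset)      00%       166         -
"""

if __name__ == "__main__":
    log = parse_selftest_log(SAMPLE)
    print(count_interrupted(log), [e["hours"] for e in log])
```

The text could come from `smartctl -l selftest /dev/sda` captured on each host; how the output is collected is left out of the sketch.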

SMART Selective self-test log data structure revision number 1
   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
      1        0        0  Not_testing
      2        0        0  Not_testing
      3        0        0  Not_testing
      4        0        0  Not_testing
      5        0        0  Not_testing
Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Any ideas?
Scott



Thread overview: 4+ messages
2008-10-10 20:38 SATA resets via SMART selftest Scott Beardsley
2008-10-10 21:08 ` Alan Cox
2008-10-13 17:18   ` Scott Beardsley
2008-10-14  0:19     ` Scott Beardsley
