SATA exceptions

public inbox for linux-kernel@vger.kernel.org

* SATA exceptions
From: S.Çağlar Onur @ 2007-07-05 21:46 UTC
  To: LKML; +Cc: Tejun Heo, Jeff Garzik

Hi;

For a while now I've been seeing the following messages in dmesg on an HP
Pavilion dv2385ea (according to kern.log, they started with 2.6.22-rc4):

...
[ 4260.278408] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 4260.278417] ata1.00: (irq_stat 0x40000001)
[ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
[ 4260.278430]          res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)
[ 4260.911247] ata1.00: configured for UDMA/100
[ 4260.911263] ata1: EH complete
[ 4260.911809] sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
[ 4260.912127] sd 0:0:0:0: [sda] Write Protect is off
[ 4260.912135] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 4260.912672] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
...

zangetsu log # grep "ata1.00: exception" kern.log
Jun 10 23:23:33 localhost kernel: [ 3472.867317] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 12 17:09:56 localhost kernel: [ 2470.530793] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 12 17:11:19 localhost kernel: [ 2553.874662] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 13 12:08:46 localhost kernel: [ 2235.664683] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 17 16:59:23 localhost kernel: [ 9208.673909] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 22 13:35:56 localhost kernel: [ 1719.191725] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 24 14:13:46 localhost kernel: [ 5822.239007] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 26 15:24:11 localhost kernel: [ 1315.455726] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 26 15:36:46 localhost kernel: [ 2069.003291] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 26 15:37:01 localhost kernel: [ 2082.955499] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 26 15:37:21 localhost kernel: [ 2103.400411] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 26 15:37:38 localhost kernel: [ 2120.251088] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 28 10:23:55 localhost kernel: [  383.355017] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul  5 22:05:14 localhost kernel: [ 4260.278408] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul  5 22:05:52 localhost kernel: [ 4297.784773] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
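
The grep output above is easier to read once it is grouped by day. A minimal Python sketch, assuming each kern.log entry is a single syslog line in the format shown (the regex and the `tally_by_day` helper are illustrative, not part of any existing tool):

```python
import re
from collections import Counter

# Syslog prefix as it appears in kern.log, e.g.
# "Jul  5 22:05:14 localhost kernel: [ 4260.278408] ata1.00: exception Emask 0x0 ..."
EXCEPTION_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2}) \d\d:\d\d:\d\d \S+ kernel: "
    r"\[\s*[\d.]+\] (?P<dev>\S+): exception"
)


def tally_by_day(lines):
    """Count libata 'exception' lines per (month, day, device); other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = EXCEPTION_RE.match(line)
        if m:
            counts[(m["month"], int(m["day"]), m["dev"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jun 26 15:24:11 localhost kernel: [ 1315.455726] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
        "Jun 26 15:36:46 localhost kernel: [ 2069.003291] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
        "Jul  5 22:05:14 localhost kernel: [ 4260.278408] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
    ]
    # Counter keeps insertion order, so days print in the order they first appear.
    for (month, day, dev), n in tally_by_day(sample).items():
        print(f"{month} {day:2d} {dev}: {n}")
```

Run over all fifteen entries, a tally like this would make the cluster of five exceptions on Jun 26 stand out.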

Neither fsck nor badblocks reported anything wrong, and as far as I know these
messages haven't caused any real problems so far, but I'm not sure whether they
are harmless or indicate a hardware error, so the dmesg, smartctl -a,
/proc/interrupts and lspci -vv outputs can be found at [1].

If anything else is needed, please just tell me.

[1] http://cekirdek.pardus.org.tr/~caglar/SATA/

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

* Re: SATA exceptions
From: Tejun Heo @ 2007-07-06  1:52 UTC
  To: caglar; +Cc: LKML, Jeff Garzik

Hello,

S.Çağlar Onur wrote:
> [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
> [ 4260.278430]          res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)

That's a media error on sector 247236823 on WRITE.  Media errors on write
are bad signs - it usually means the drive even failed to remap the
sector because extra space ran out.  I'm not sure this is the case here
though - the SMART log is clean.  Please run SMART short/long tests and see
what they say.
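
To show where the sector number comes from, here is a small Python sketch that decodes the two taskfile dumps from the report. It assumes the register order libata uses when printing taskfiles (command/status, feature/error, nsect, lbal, lbam, lbah, the five HOB registers, then device) and that WRITE DMA (0xca) is a 28-bit command, so LBA bits 27-24 come from the low nibble of the device register. The bit tables are a partial, best-effort transcription of the ATA status/error bits and libata's AC_ERR_* flags; verify them against include/linux/ata.h and include/linux/libata.h for the kernel in question.

```python
import re

# Order in which libata prints a taskfile such as "ca/00:08:d0:88:bc/00:00:00:00:00/ee".
# In the "res" (result) line, the first two slots hold the status and error registers.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah", "device")

# Partial tables: a few ATA status/error register bits and libata Emask (AC_ERR_*) bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
EMASK_BITS = {0x01: "AC_ERR_DEV", 0x02: "AC_ERR_HSM", 0x04: "AC_ERR_TIMEOUT",
              0x08: "AC_ERR_MEDIA", 0x10: "AC_ERR_ATA_BUS", 0x20: "AC_ERR_HOST_BUS",
              0x40: "AC_ERR_SYSTEM", 0x80: "AC_ERR_INVALID"}


def parse_taskfile(text):
    """Turn 'ca/00:08:d0:88:bc/00:00:00:00:00/ee' into {register name: int}."""
    values = [int(v, 16) for v in re.split(r"[/:]", text.strip())]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} registers, got {len(values)}")
    return dict(zip(FIELDS, values))


def lba28(tf):
    """28-bit LBA: device[3:0] holds bits 27-24, followed by lbah, lbam, lbal."""
    return ((tf["device"] & 0x0F) << 24) | (tf["lbah"] << 16) | (tf["lbam"] << 8) | tf["lbal"]


def bit_names(value, table):
    """Names of all bits from `table` that are set in `value`, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


cmd = parse_taskfile("ca/00:08:d0:88:bc/00:00:00:00:00/ee")  # WRITE DMA, 8 sectors
res = parse_taskfile("51/40:01:d7:88:bc/00:00:0e:00:00/ee")  # result taskfile

first = lba28(cmd)
print(f"write covered LBA {first}..{first + cmd['nsect'] - 1}")  # 247236816..247236823
print(f"failing LBA: {lba28(res)}")                              # 247236823
print("status:", bit_names(res["command"], STATUS_BITS))         # ['DRDY', 'DSC', 'ERR']
print("error: ", bit_names(res["feature"], ERROR_BITS))          # ['UNC']
print("Emask: ", bit_names(0x9, EMASK_BITS))                     # ['AC_ERR_MEDIA', 'AC_ERR_DEV']
```

Decoded this way, the failing sector is the last of the eight sectors covered by the write, and the error register reads UNC (uncorrectable), which matches the "(media error)" annotation in the log.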

-- 
tejun

* Re: SATA exceptions
From: S.Çağlar Onur @ 2007-07-06 11:43 UTC
  To: Tejun Heo; +Cc: LKML, Jeff Garzik

Hi;

On Friday 06 July 2007, Tejun Heo wrote:
> S.Çağlar Onur wrote:
> > [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
> > [ 4260.278430]          res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)
>
> That's a media error on sector 247236823 on WRITE.  Media errors on write
> are bad signs - it usually means the drive even failed to remap the
> sector because extra space ran out.

Hmm, more than 50GB is empty on disk :)

> I'm not sure this is the case here
> though - the SMART log is clean.  Please run SMART short/long tests and see
> what they say.

Both completed without a problem:

zangetsu ~ # smartctl -l selftest /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       357         -
# 2  Short offline       Completed without error       00%       355         -

If you want me to try something else, please just say :)

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

* Re: SATA exceptions
From: Robert Hancock @ 2007-07-07 18:05 UTC
  To: caglar; +Cc: Tejun Heo, LKML, Jeff Garzik

S.Çağlar Onur wrote:
> On Friday 06 July 2007, Tejun Heo wrote:
>> S.Çağlar Onur wrote:
>>> [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
>>> [ 4260.278430]          res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)
>> That's a media error on sector 247236823 on WRITE.  Media errors on write
>> are bad signs - it usually means the drive even failed to remap the
>> sector because extra space ran out.
> 
> Hmm, more than 50GB is empty on disk :)

It's not the free space on the drive that matters, it's the number of 
free sectors in the spare sector pool on the drive, which is invisible 
to software.

Your SMART log shows 309 reallocated sectors. That seems somewhat high..
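
The distinction between filesystem free space and the drive's spare pool can be illustrated with a toy Python model. Everything in it is hypothetical: the `ToyDrive` class, the pool size, and the rule that a write to a bad sector is silently remapped while spares remain. Real firmware policies are vendor-specific and, as noted above, the pool is invisible to software. The only point is that once the pool is exhausted the host sees a media error on WRITE, however empty the filesystem is.

```python
class ToyDrive:
    """Hypothetical drive with a fixed-size spare-sector pool (toy model only)."""

    def __init__(self, spare_sectors):
        self.spares_left = spare_sectors  # hidden from the OS on real drives
        self.remapped = set()             # LBAs already redirected to a spare
        self.reallocated = 0              # stands in for the raw Reallocated_Sector_Ct

    def write(self, lba, sector_is_bad):
        """Write one sector; remap it if the physical sector is bad."""
        if not sector_is_bad or lba in self.remapped:
            return "ok"
        if self.spares_left == 0:
            # No spare left to remap into: the host sees a media error on WRITE.
            raise IOError(f"media error on LBA {lba}")
        self.spares_left -= 1
        self.remapped.add(lba)
        self.reallocated += 1
        return "remapped"


drive = ToyDrive(spare_sectors=2)
print(drive.write(100, sector_is_bad=True))   # remapped
print(drive.write(200, sector_is_bad=True))   # remapped
print(drive.write(100, sector_is_bad=True))   # ok (already remapped)
try:
    drive.write(300, sector_is_bad=True)
except IOError as exc:
    print(exc)                                # media error on LBA 300
print("reallocated:", drive.reallocated)      # reallocated: 2
```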

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


* Re: SATA exceptions
From: S.Çağlar Onur @ 2007-07-07 21:35 UTC
  To: Robert Hancock; +Cc: Tejun Heo, LKML, Jeff Garzik

Hi;

On Saturday 07 July 2007, Robert Hancock wrote:
> It's not the free space on the drive that matters, it's the number of
> free sectors in the spare sector pool on the drive, which is invisible
> to software.
>
> Your SMART log shows 309 reallocated sectors. That seems somewhat high..

Ah, sorry for misinterpreting that :). It's quite a new piece of hardware (at
most ~1.5 months old), and "Reallocated_Event_Count" keeps increasing (it's
currently at 313). Although I'm not 100 percent sure, these errors only
occurred with kernels newer than 2.6.18 (or 2.6.18 simply didn't report them,
since according to kern.log they are only visible with 2.6.22+).

We bought three HP Pavilion dv2385ea laptops; one of them runs only 2.6.18,
and its smartctl output follows for reference:

smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HM160JI
Serial Number:    S0W6J10P331479
Firmware Version: AD100-16
User Capacity:    160.041.885.696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Jul  8 00:22:21 2007 EEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (5391) seconds.
Offline data collection
capabilities: 			 (0x51) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  89) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       2880
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2648
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   253   253   000    Old_age   Always       -       236
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       1
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   047   040   040    Old_age   Always   In_the_past 1008009269
191 G-Sense_Error_Rate      0x0012   100   100   000    Old_age   Always       -       5396
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       40
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
194 Temperature_Celsius     0x0022   047   040   000    Old_age   Always       -       53 (Lifetime Min/Max 0/15381)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       98037
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x0012   253   253   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0012   100   100   000    Old_age   Always       -       2
225 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
255 Unknown_Attribute       0x000a   253   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

What would you suggest in this case: should I try replacing the hardware,
assuming it's a hardware fault, or could this be a regression introduced by
kernels newer than 2.6.18?

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

* Re: SATA exceptions
From: Tejun Heo @ 2007-07-09 18:37 UTC
  To: caglar; +Cc: Robert Hancock, LKML, Jeff Garzik

Hello,

S.Çağlar Onur wrote:
> On Saturday 07 July 2007, Robert Hancock wrote:
>> It's not the free space on the drive that matters, it's the number of
>> free sectors in the spare sector pool on the drive, which is invisible
>> to software.
>>
>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> 
> Ah, sorry for misinterpreting that :). It's quite a new piece of hardware (at
> most ~1.5 months old), and "Reallocated_Event_Count" keeps increasing (it's
> currently at 313). Although I'm not 100 percent sure, these errors only
> occurred with kernels newer than 2.6.18 (or 2.6.18 simply didn't report them,
> since according to kern.log they are only visible with 2.6.22+).

The OS and driver can't really do much about reallocation events.  Some
number of reallocations is okay, but if you see it going up constantly, you
probably have a dying disk.
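
This rule of thumb (some reallocations are fine, a steadily rising count is not) can be checked mechanically by recording the raw reallocation count from time to time. A minimal Python sketch; the helper names and the `min_increases` cutoff are arbitrary assumptions, not a SMART or smartmontools convention. The sample readings are the values reported in this thread: 309 in the first SMART log, 313 on Jul 7 (strictly, that reading was Reallocated_Event_Count), and 314 on Jul 13, when both reallocation attributes read 314.

```python
def deltas(readings):
    """Change in the raw count between consecutive (label, raw_value) readings."""
    return [(later[0], later[1] - earlier[1]) for earlier, later in zip(readings, readings[1:])]


def keeps_growing(readings, min_increases=2):
    """Heuristic flag: the count rose in at least `min_increases` intervals."""
    return sum(1 for _, d in deltas(readings) if d > 0) >= min_increases


readings = [("first SMART log", 309), ("Jul 7", 313), ("Jul 13", 314)]
print(deltas(readings))         # [('Jul 7', 4), ('Jul 13', 1)]
print(keeps_growing(readings))  # True
```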

> We bought three HP Pavilion dv2385ea laptops; one of them runs only 2.6.18,
> and its smartctl output follows for reference:
>
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   

Hmm... This is pretty high too.  Do the counts increase on this machine too?

-- 
tejun

* Re: SATA exceptions
From: S.Çağlar Onur @ 2007-07-09 19:06 UTC
  To: Tejun Heo
  Cc: Robert Hancock, LKML, Jeff Garzik, Onur Küçük,
	İsmail Dönmez

Hi;

On Monday 09 July 2007, Tejun Heo wrote:
> > On Saturday 07 July 2007, Robert Hancock wrote:
> >> It's not the free space on the drive that matters, it's the number of
> >> free sectors in the spare sector pool on the drive, which is invisible
> >> to software.
> >>
> >> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> >
> > Ah, sorry for misinterpreting that :). It's quite a new piece of hardware
> > (at most ~1.5 months old), and "Reallocated_Event_Count" keeps
> > increasing (it's currently at 313). Although I'm not 100 percent sure,
> > these errors only occurred with kernels newer than 2.6.18 (or 2.6.18
> > simply didn't report them, since according to kern.log they are only
> > visible with 2.6.22+).
>
> The OS and driver can't really do much about reallocation events.  Some
> number of reallocations is okay, but if you see it going up constantly, you
> probably have a dying disk.

Hmm, that's really interesting; then it means three ~1.5-month-old laptops are
dying of the same disease :) or they were somehow defective to begin with (or
we are damaging them, but mine has been sitting happily on my table all that
time :P)

> > We bought three HP Pavilion dv2385ea laptops; one of them runs only
> > 2.6.18, and its smartctl output follows for reference:
> >
> >   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
> > 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age
>
> Hmm... This is pretty high too.  Do the counts increase on this machine
> too?

Yes, it seems so (I'm adding Onur and İsmail to CC as the owners of the other
machines). Here are the SMART logs for the three machines. Interestingly,
İsmail and I run 2.6.22 (over 300 reallocations occurred on both of our
machines), while Onur uses 2.6.18 (no reallocations occurred on his).

[1] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.caglar
[2] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.ismail
[3] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.onur

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

* Re: SATA exceptions
From: Bill Davidsen @ 2007-07-11 17:20 UTC
  To: linux-kernel; +Cc: caglar, Robert Hancock, LKML, Jeff Garzik

Tejun Heo wrote:
> Hello,
> 
> S.Çağlar Onur wrote:
>> On Saturday 07 July 2007, Robert Hancock wrote:
>>> It's not the free space on the drive that matters, it's the number of
>>> free sectors in the spare sector pool on the drive, which is invisible
>>> to software.
>>>
>>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>> Ah, sorry for misinterpreting that :). It's quite a new piece of hardware (at
>> most ~1.5 months old), and "Reallocated_Event_Count" keeps increasing (it's
>> currently at 313). Although I'm not 100 percent sure, these errors only
>> occurred with kernels newer than 2.6.18 (or 2.6.18 simply didn't report them,
>> since according to kern.log they are only visible with 2.6.22+).
> 
> The OS and driver can't really do much about reallocation events.  Some
> number of reallocations is okay, but if you see it going up constantly, you
> probably have a dying disk.
> 
Or, as I learned the hard way, if you have the problem on all drives 
sharing a power supply, a power issue.

>> We bought three HP Pavilion dv2385ea laptops; one of them runs only 2.6.18,
>> and its smartctl output follows for reference:
>>
>>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  
>> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   
> 
> Hmm... This is pretty high too.  Do the counts increase on this machine too?
> 


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: SATA exceptions
From: Pavel Machek @ 2007-07-12 19:52 UTC
  To: Tejun Heo; +Cc: caglar, Robert Hancock, LKML, Jeff Garzik

Hi!

> >> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> > 
> > Ah, sorry for misinterpreting that :). It's quite a new piece of hardware (at
> > most ~1.5 months old), and "Reallocated_Event_Count" keeps increasing (it's
> > currently at 313). Although I'm not 100 percent sure, these errors only
> > occurred with kernels newer than 2.6.18 (or 2.6.18 simply didn't report them,
> > since according to kern.log they are only visible with 2.6.22+).
> 
> The OS and driver can't really do much about reallocation events.  Some
> number of reallocations is okay, but if you see it going up constantly, you
> probably have a dying disk.

Hmm... cutting the power while writing is doable from the OS and might force
reallocations?

You might want to check whether the number of reallocated sectors increases
with shutdowns/reboots.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: SATA exceptions
From: Tejun Heo @ 2007-07-13  3:12 UTC
  To: Pavel Machek; +Cc: caglar, Robert Hancock, LKML, Jeff Garzik

Pavel Machek wrote:
>>>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>>> Ah, sorry for misinterpreting that :). It's quite a new piece of hardware (at
>>> most ~1.5 months old), and "Reallocated_Event_Count" keeps increasing (it's
>>> currently at 313). Although I'm not 100 percent sure, these errors only
>>> occurred with kernels newer than 2.6.18 (or 2.6.18 simply didn't report them,
>>> since according to kern.log they are only visible with 2.6.22+).
>> The OS and driver can't really do much about reallocation events.  Some
>> number of reallocations is okay, but if you see it going up constantly, you
>> probably have a dying disk.
> 
> Hmm... cutting the power while writing is doable from the OS and might force
> reallocations?

Hmmm... We don't have any pending writes when the power goes out, and I don't
think an emergency unload can directly increase the reallocation count.  It can
shorten the lifespan of the head though.

> You might want to check whether the number of reallocated sectors increases
> with shutdowns/reboots.

I'm curious too.

-- 
tejun

* Re: SATA exceptions
From: S.Çağlar Onur @ 2007-07-13  7:44 UTC
  To: Tejun Heo; +Cc: Pavel Machek, Robert Hancock, LKML, Jeff Garzik

On Friday 13 July 2007, Tejun Heo wrote:
> >> The OS and driver can't really do much about reallocation events.  Some
> >> number of reallocations is okay, but if you see it going up constantly, you
> >> probably have a dying disk.
> >
> > Hmm... cutting the power while writing is doable from the OS and might force
> > reallocations?
>
> Hmmm... We don't have any pending writes when the power goes out, and I don't
> think an emergency unload can directly increase the reallocation count.  It can
> shorten the lifespan of the head though.
>
> > You might want to check whether the number of reallocated sectors increases
> > with shutdowns/reboots.
>
> I'm curious too.

It seems reboots and shutdowns have no effect on reallocated sectors. After 5
reboots and 5 shutdowns, the count didn't change at all.

zangetsu ~ # smartctl -a /dev/sda | grep Reall
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314
196 Reallocated_Event_Count 0x0032   067   067   000    Old_age   Always       -       314
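
The reboot test can be scripted by saving the attribute rows before and after a batch of reboots and comparing the raw values. A minimal Python sketch, assuming each attribute row is a single line as smartctl prints it and that RAW_VALUE starts with an integer; `raw_values` and `changed` are hypothetical helper names. With the numbers reported here both snapshots read 314, so nothing is flagged.

```python
import re

# One smartctl attribute row: ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED,
# WHEN_FAILED, then RAW_VALUE (only its leading integer is captured).
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def raw_values(smartctl_text):
    """Map attribute name -> leading integer of RAW_VALUE; non-attribute lines are skipped."""
    out = {}
    for line in smartctl_text.splitlines():
        m = ATTR_RE.match(line)
        if m:
            out[m.group(2)] = int(m.group(6))
    return out


def changed(before, after, names=("Reallocated_Sector_Ct", "Reallocated_Event_Count")):
    """Return {name: (before, after)} for the watched attributes that differ."""
    return {n: (before.get(n), after.get(n)) for n in names if before.get(n) != after.get(n)}


snapshot = """\
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314
196 Reallocated_Event_Count 0x0032   067   067   000    Old_age   Always       -       314
"""
before = raw_values(snapshot)  # saved before the reboots
after = raw_values(snapshot)   # the thread reports the same values afterwards
print(before)                  # {'Reallocated_Sector_Ct': 314, 'Reallocated_Event_Count': 314}
print(changed(before, after))  # {}
```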

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

* Re: SATA exceptions
From: Mark Lord @ 2007-07-11 20:31 UTC
  To: caglar; +Cc: Robert Hancock, Tejun Heo, LKML, Jeff Garzik

S.Çağlar Onur wrote:
> Hi;
> 
> On Saturday 07 July 2007, Robert Hancock wrote:
>> It's not the free space on the drive that matters, it's the number of
>> free sectors in the spare sector pool on the drive, which is invisible
>> to software.
>>
>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> 
> Ah, sorry for misinterpreting that :). It's quite a new piece of hardware (at
> most ~1.5 months old), and "Reallocated_Event_Count" keeps increasing (it's
> currently at 313). Although I'm not 100 percent sure, these errors only
> occurred with kernels newer than 2.6.18 (or 2.6.18 simply didn't report them,
> since according to kern.log they are only visible with 2.6.22+).
> 
> We bought three HP Pavilion dv2385ea laptops; one of them runs only 2.6.18,
> and its smartctl output follows for reference:
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       2880
>   4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2648
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
>   9 Power_On_Hours          0x0032   253   253   000    Old_age   Always       -       236
>  10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       1
>  11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
> 187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
> 188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
> 190 Temperature_Celsius     0x0022   047   040   040    Old_age   Always   In_the_past 1008009269
> 191 G-Sense_Error_Rate      0x0012   100   100   000    Old_age   Always       -       5396
> 192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       40
> 193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
> 194 Temperature_Celsius     0x0022   047   040   000    Old_age   Always       -       53 (Lifetime Min/Max 0/15381)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       98037
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x0012   253   253   000    Old_age   Always       -       0
> 223 Load_Retry_Count        0x0012   100   100   000    Old_age   Always       -       2
> 225 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
> 255 Unknown_Attribute       0x000a   253   100   000    Old_age   Always       -       0
..

I'm not even sure how to interpret those numbers.
It seems rather odd that nearly all fields are either "100" or "253",
so those are probably pre-programmed numbers rather than actual counts.
The raw value at the end of the line (for the various "Reallocated*" fields)
is probably the real value here.

Tejun ??

* Re: SATA exceptions
From: Tejun Heo @ 2007-07-12  3:13 UTC
  To: Mark Lord; +Cc: caglar, Robert Hancock, LKML, Jeff Garzik

Mark Lord wrote:
> I'm not even sure how to interpret those numbers.
> It seems rather odd that nearly all fields are either "100" or "253",
> so those are probably pre-programmed numbers rather than actual counts.
> The raw value at the end of the line (for the various "Reallocated*" fields)
> is probably the real value here.

I dunno exactly either.  Different vendors seem to use different metrics
anyway, but an increasing raw number on the reallocation counter is pretty easy
to interpret.

-- 
tejun
