Re: 2.6.24.3: regular sata drive resets - worrisome?

From: Roger Heflin <rogerheflin@gmail.com>
To: Hans-Peter Jansen <hpj@urpla.net>
Cc: Tejun Heo <htejun@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?
Date: Sun, 30 Mar 2008 07:41:09 -0500
Message-ID: <47EF8A65.1010005@gmail.com>
In-Reply-To: <200803301400.10766.hpj@urpla.net>

Hans-Peter Jansen wrote:
> On Sunday, 30 March 2008, Tejun Heo wrote:
>> Hello,
>>
>> Hans-Peter Jansen wrote:
>>>>>> Should I be worried? smartd doesn't show anything suspicious on
>>>>>> those.
>>>> Can you please post the result of "smartctl -a /dev/sdX"?
>>> Here's the last smart report from two of the offending drives. As noted
>>> before, I did the hardware reorganization, replaced the dog slow 3ware
>>> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
>>> drives for now, but a nephew already showed interest. What do you
>>> think, can I cede those drives with a clear conscience? The
>>> Hardware_ECC_Recovered values are really worrisome, aren't they?
>> Different vendors use different scales for the raw values.  The value is
>> still pegged at the highest so it could be those raw values are okay or
>> that the vendor just doesn't update value field accordingly.  My P120
>> says 0 for the raw value and 904635 for hardware ECC recovered so there
>> is some difference.  What do other non-failing drives say about those
>> values?
> 
> The only non-failing drive was sdf as it was running in standby mode in this 
> md raid 5 ensemble:
> 
> 20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> 20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011337-sdc.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011337-sdc.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
> 20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011338-sdd.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011338-sdd.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
> 20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011338-sde.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011338-sde.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
> 20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011339-sdf.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011339-sdf.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 
>> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
>> reallocation counters and maybe some pending counts.  Aieee.. weird.
> 
> But there are no reallocations nor any pending sectors on any of them.
> 
>>>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>>>> FLUSH_EXT timing out usually indicates that the drive is having
>>>> problem writing out what it has in its cache to the media.  There was
>>>> one case where FLUSH_EXT timeout was caused by the driver failing to
>>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
>>>> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
>>> Hmm, I didn't notice any data distortions, and if there were, they
>>> live on as copies in their new home..
>> It should have appeared as read errors.  Maybe the drive successfully
>                              ^^^^
>                              write (I guess)
>> wrote those sectors after 30+ secs timeout.
> 
> That would point to some driver issue, wouldn't it? Roger Heflin also
> experienced similar behavior with that controller, which wasn't 
> reproducible with another. 
> 
> I can offer to you rebuilding that md in a test environment, and giving 
> you access to it, if you're interested.
> 
> Anyway, thanks for caring Tejun,
> Pete
> 
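
The Hardware_ECC_Recovered rows quoted above are easier to compare side by side once the raw values are pulled out. The following is only a minimal sketch: it assumes the `<logfile>:<smartctl row>` grep layout shown in the quote, and the regular expression and the `raw_values` helper are hypothetical names introduced for illustration. As noted in the quoted discussion, raw values use vendor-specific scales, so the ratio is a relative comparison between identical drives and not an absolute health measure.

```python
import re

# Attribute-195 rows copied from the quoted grep output above.
SMART_LINES = """\
20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
"""

# Hypothetical pattern: device name from the log file name, attribute id,
# attribute name, and the last whitespace-separated number as the raw value.
ROW = re.compile(
    r"^\S+-(?P<dev>sd[a-z]+)\.log:(?P<id>\d+)\s+(?P<name>\S+)\s+.*\s(?P<raw>\d+)\s*$"
)


def raw_values(text, attr_id):
    """Return {device: raw_value} for one SMART attribute id."""
    out = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m and int(m.group("id")) == attr_id:
            out[m.group("dev")] = int(m.group("raw"))
    return out


ecc = raw_values(SMART_LINES, 195)
lowest = min(ecc.values())
for dev, raw in sorted(ecc.items()):
    # Ratio against the lowest raw value seen among the drives.
    print(f"{dev}: raw={raw:>10}  ratio to lowest={raw / lowest:,.0f}")
```

On these numbers the three active drives report raw values on the order of 10^5 times that of sdf, which was in standby and so presumably saw far less I/O.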

Here are the errors I get, though looking at it more closely, I don't appear to
be getting the reset, just this error from time to time:

sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
sd 9:0:0:0: [sde] Write Protect is off
sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
ata8.00: BMDMA2 stat 0x687d8009
ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 196608 in
          res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
ata8.00: configured for UDMA/100
ata8: EH complete
sd 7:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I have 4 identical disks. With all 4 connected to the SIL controller, all of
them give some errors; moving 2 of the disks to a Promise controller makes the
errors go away on the 2 connected to the Promise controller. All drives are
part of a software raid5 array.
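
The "cmd", "res" and "SErr" fields in the exception above can be unpacked into something more readable. This is a hedged sketch, not a diagnosis: it assumes the field order libata uses when printing a taskfile (first byte / feature:nsect:lbal:lbam:lbah / the same five "hob" high-order bytes / device), and the helper names `parse_taskfile` and `set_bits` are hypothetical. The bit names follow the SError DIAG field and the ATA status and error registers; status bit 4 is left out because its meaning depends on the command.

```python
import re

# SError DIAG bits (bits 16-26 of the SATA SError register).
SERR_DIAG_BITS = {
    16: "PHYRDY change",
    17: "PHY internal error",
    18: "COMM wake",
    19: "10b-to-8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
    24: "transport state transition error",
    25: "unrecognized FIS type",
    26: "device exchanged",
}
# ATA status and error register bits (bit 4 of status deliberately omitted).
STATUS_BITS = {0: "ERR", 3: "DRQ", 5: "DF", 6: "DRDY", 7: "BSY"}
ERROR_BITS = {2: "ABRT", 4: "IDNF", 6: "UNC", 7: "ICRC"}

# Assumed layout: XX/XX:XX:XX:XX:XX/XX:XX:XX:XX:XX/XX (all hex bytes).
TASKFILE = re.compile(
    r"([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)


def set_bits(value, names):
    """List the names of all bits set in value, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


def parse_taskfile(text):
    """Split a libata cmd/res line into its registers, 48-bit LBA and count."""
    m = TASKFILE.search(text)
    if m is None:
        raise ValueError("no taskfile found in: " + text)
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    _, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in m.group(3).split(":")
    )
    return {
        "first": int(m.group(1), 16),  # command for "cmd", status for "res"
        "feature": feature,            # feature for "cmd", error for "res"
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal,
        "count": (hob_nsect << 8) | nsect,
        "device": int(m.group(4), 16),
    }


cmd = parse_taskfile("cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 196608 in")
res = parse_taskfile("res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)")

print(f"command 0x{cmd['first']:02x}: LBA {cmd['lba']}, "
      f"{cmd['count']} sectors ({cmd['count'] * 512} bytes)")
print(f"status 0x{res['first']:02x}: {set_bits(res['first'], STATUS_BITS)}")
print(f"error  0x{res['feature']:02x}: {set_bits(res['feature'], ERROR_BITS)}")
print(f"SErr   0x280000: {set_bits(0x280000, SERR_DIAG_BITS)}")
```

Under those assumptions the command byte 0x25 is READ DMA EXT for 384 sectors, which matches the "data 196608 in" in the log, and the result is ERR with ABRT. SErr 0x280000 has bits 19 and 21 set (10b-to-8b decode error and CRC error), which are link-level flags; that may hint at the path between controller and drive more than at the media, though the log alone does not settle it.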

Startup looks like this:
sata_sil 0000:00:09.0: version 2.3
ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 16 (level, low) -> IRQ 20
sata_sil 0000:00:09.0: Applying R_ERR on DMA activate FIS errata fix
scsi7 : sata_sil
scsi8 : sata_sil
scsi9 : sata_sil
scsi10 : sata_sil
ata8: SATA max UDMA/100 cmd 0xf8942080 ctl 0xf894208a bmdma 0xf8942000 irq 20
ata9: SATA max UDMA/100 cmd 0xf89420c0 ctl 0xf89420ca bmdma 0xf8942008 irq 20
ata10: SATA max UDMA/100 cmd 0xf8942280 ctl 0xf894228a bmdma 0xf8942200 irq 20
ata11: SATA max UDMA/100 cmd 0xf89422c0 ctl 0xf89422ca bmdma 0xf8942208 irq 20

Right now I am running 2.6.23.15-80.fc7, but I have also seen the errors under 2.6.23.1.

                                     Roger
