Re: 2.6.24.3: regular sata drive resets - worrisome?

From: Hein-Pieter van Braam <hp@tmm.cx>
To: Roger Heflin <rogerheflin@gmail.com>
Cc: linux-ide@vger.kernel.org
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?
Date: Sun, 30 Mar 2008 23:02:15 +0200
Message-ID: <1206910935.12026.2.camel@liza>
In-Reply-To: <47EF8A65.1010005@gmail.com>

On Sun, 2008-03-30 at 07:41 -0500, Roger Heflin wrote:
> Hans-Peter Jansen wrote:
> > On Sunday, 30 March 2008, Tejun Heo wrote:
> >> Hello,
> >>
> >> Hans-Peter Jansen wrote:
> >>>>>> Should I be worried? smartd doesn't show anything suspicious on
> >>>>>> those.
> >>>> Can you please post the result of "smartctl -a /dev/sdX"?
> >>> Here's the last smart report from two of the offending drives. As noted
> >>> before, I did the hardware reorganization, replaced the dog slow 3ware
> >>> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
> >>> drives for now, but a nephew already showed interest. What do you
> >>> think, can I cede those drives with a clear conscience? The
> >>> Hardware_ECC_Recovered values are really worrisome, aren't they?
> >> Different vendors use different scales for the raw values.  The value is
> >> still pegged at the highest so it could be those raw values are okay or
> >> that the vendor just doesn't update value field accordingly.  My P120
> >> says 0 for the raw value and 904635 for hardware ECC recovered so there
> >> is some difference.  What do other non-failing drives say about those
> >> values?
> > 
> > The only non-failing drive was sdf, which was running in standby mode in
> > this md RAID 5 array:
> > 
> > 20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> > 20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011337-sdc.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011337-sdc.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
> > 20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sdd.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011338-sdd.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
> > 20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011338-sde.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011338-sde.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
> > 20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> > 20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> > 20080323-011339-sdf.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> > 20080323-011339-sdf.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> > 
> >> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
> >> reallocation counters and maybe some pending counts.  Aieee.. weird.
> > 
> > But there are no reallocations nor any pending sectors on any of them.
> > 
> >>>>>> In all, it's been 4 Samsung drives hanging off a SATA SiI 3124:
> >>>> FLUSH_EXT timing out usually indicates that the drive is having
> >>>> problem writing out what it has in its cache to the media.  There was
> >>>> one case where FLUSH_EXT timeout was caused by the driver failing to
> >>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
> >>>> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
> >>> Hmm, I didn't notice any data corruption, and if there was any, it
> >>> lives on as copies in its new home.
> >> It should have appeared as read errors.  Maybe the drive successfully
> >                              ^^^^
> >                              write (I guess)
> >> wrote those sectors after 30+ secs timeout.
> > 
> > That would point to some driver issue, wouldn't it? Roger Heflin also
> > experienced similar behavior with that controller, which wasn't 
> > reproducible with another. 
> > 
> > I can offer to rebuild that md array in a test environment and give
> > you access to it, if you're interested.
> > 
> > Anyway, thanks for caring, Tejun,
> > Pete
> > 
> 
> Here are the errors I get, though looking at it more closely, I don't appear
> to be getting the reset, just this error from time to time:
> 
> sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
> sd 9:0:0:0: [sde] Write Protect is off
> sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
> sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
> ata8.00: BMDMA2 stat 0x687d8009
> ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 196608 in
>           res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
> ata8.00: configured for UDMA/100
> ata8: EH complete
> sd 7:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
> sd 7:0:0:0: [sdd] Write Protect is off
> sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> I have 4 identical disks. With all 4 connected to the SiI controller, all of
> them give some errors; moving 2 of the disks to a Promise controller makes the
> errors go away on the 2 connected to the Promise controller. All drives are
> part of a software RAID 5 array.
> 
> Startup looks like this:
> sata_sil 0000:00:09.0: version 2.3
> ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 16 (level, low) -> IRQ 20
> sata_sil 0000:00:09.0: Applying R_ERR on DMA activate FIS errata fix
> scsi7 : sata_sil
> scsi8 : sata_sil
> scsi9 : sata_sil
> scsi10 : sata_sil
> ata8: SATA max UDMA/100 cmd 0xf8942080 ctl 0xf894208a bmdma 0xf8942000 irq 20
> ata9: SATA max UDMA/100 cmd 0xf89420c0 ctl 0xf89420ca bmdma 0xf8942008 irq 20
> ata10: SATA max UDMA/100 cmd 0xf8942280 ctl 0xf894228a bmdma 0xf8942200 irq 20
> ata11: SATA max UDMA/100 cmd 0xf89422c0 ctl 0xf89422ca bmdma 0xf8942208 irq 20
> 
> Right now I am running 2.6.23.15-80.fc7, but have also got the errors under 2.6.23.1
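
For anyone reading the "res 51/04:..." line quoted above: the first two
bytes of the libata result taskfile are the ATA status and error registers.
Below is a minimal sketch of decoding them, assuming the standard ATA bit
names. The function names (flags, decode_res) are made up for illustration
and only the commonly used bits are listed, so treat it as a reading aid
rather than a complete decoder.

```python
import re

# Commonly used ATA status register bits (standard bit names).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}

# Commonly used ATA error register bits (only meaningful when ERR is set).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


def decode_res(line):
    """Decode status/error from a libata 'res ss/ee:...' line.

    Returns (status_flags, error_flags), or None if the line does not match.
    """
    m = re.search(r"res\s+([0-9a-fA-F]{2})/([0-9a-fA-F]{2}):", line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    error = int(m.group(2), 16)
    return flags(status, STATUS_BITS), flags(error, ERROR_BITS)


if __name__ == "__main__":
    sample = "res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)"
    print(decode_res(sample))
```

For the line above this gives DRDY, DSC and ERR in the status register and
ABRT in the error register, i.e. the drive aborted the command (cmd 0x25,
READ DMA EXT) rather than reporting an uncorrectable media error (UNC) or
an interface CRC error (ICRC).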

I know this is probably not too helpful, but I had the same or similar
problems on a sata_nv based controller back in the 2.6.20-ish days. Sadly,
I never reported it, but I managed to make them go away by disabling
ADMA on the controller.
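
In case it helps anyone on sata_nv: the driver has an "adma" module
parameter, so something like "options sata_nv adma=0" in the modprobe
configuration is one way to turn it off (check "modinfo sata_nv" on your
kernel first, since the default has changed between versions). Below is a
minimal sketch for checking the current state. It assumes the parameter is
exported under /sys/module/sata_nv/parameters/adma, and the function name
adma_state is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the sata_nv "adma" module parameter.
ADMA_PARAM = Path("/sys/module/sata_nv/parameters/adma")


def adma_state(param_path=ADMA_PARAM):
    """Return True/False for ADMA on/off, or None if it cannot be read."""
    try:
        raw = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, or the parameter is not exported on this kernel.
        return None
    if raw in ("Y", "y", "1"):
        return True
    if raw in ("N", "n", "0"):
        return False
    return None  # unexpected content


if __name__ == "__main__":
    messages = {
        True: "sata_nv: ADMA enabled",
        False: "sata_nv: ADMA disabled",
        None: "sata_nv adma parameter not found",
    }
    print(messages[adma_state()])
```

This only reports the setting; changing it still requires reloading the
module or rebooting with the option in place.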

Just my 2 cents :)

> 
>                                      Roger

Thread overview: 12+ messages
2008-03-20 14:18 2.6.24.3: regular sata drive resets - worrisome? Hans-Peter Jansen
2008-03-21  4:48 ` Andrew Morton
2008-03-21 18:32   ` Roger Heflin
2008-03-21 23:06     ` Hans-Peter Jansen
2008-03-29 12:58   ` Tejun Heo
2008-03-30  0:14     ` Hans-Peter Jansen
2008-03-30  0:54       ` Tejun Heo
2008-03-30 12:00         ` Hans-Peter Jansen
2008-03-30 12:41           ` Roger Heflin
2008-03-30 21:02             ` Hein-Pieter van Braam [this message]
2008-03-31  4:33             ` Tejun Heo
2008-04-01 19:27               ` Roger Heflin
