From: Erik Bourget <erik@midmaine.com>
To: John Bradford <john@grabjohn.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: CMD680, kernel 2.4.21, and heartache
Date: Fri, 03 Oct 2003 08:48:06 -0400 [thread overview]
Message-ID: <87he2qtrll.fsf@loki.odinnet> (raw)
In-Reply-To: <87lls2tspv.fsf@loki.odinnet> (Erik Bourget's message of "Fri, 03 Oct 2003 08:23:56 -0400")
Erik Bourget <erik@midmaine.com> writes:
> John Bradford <john@grabjohn.com> writes:
>
>>> Some factors that are definitely NOT a problem: - Faulty run of drives.
>>> This has also happened to Hitachi 80GB drives in the same configurations.
>>>
>>> - Heat. They're in a chilly room. The cases haven't overheated. We've had
>>> guys checking this every few hours after the first one went bonkers.
>>>
>>> Possible problems -
>>> - Simple software problem that somebody can fix and save the day. :)
>>> - All Dell Poweredge 650 servers are broken. :/
>>
>>> Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady
>>> SeekComplete Error } Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr:
>>> error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442,
>>> sector=35220864
>>
>> That is definitely an error from the drive. If you're absolutely sure
>> it's not a faulty batch of drives or a cooling issue, maybe you have
>> power supply problems? Does SMART give you any useful information?
>>
>> John.
>
> Not power supply problems; two of the machines that have this problem are
> located in different facilities even. What's SMART?
>
Figured out SMART. Looks bad:
mailstore2-1:/home/erik# smartctl -a /dev/hda
Device: IC35L120AVV207-0 Supports ATA Version 6
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.
General Smart Values:
Off-line data collection status: (0x85) Offline data collection activity was
aborted by an interrupting command
Self-test execution status: ( 245) Self-test routine in progess
50% of test remaining
Total time to complete off-line
data collection: (2855) Seconds
Offline data collection
Capabilities: (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
Automatic timer ON/OFF support
Suspend Offline Collection upon new
command
Offline surface scan supported
Self-test supported
Smart Capablilities: (0x0003) Saves SMART data before entering
power-saving mode
Supports SMART auto save timer
Error logging capability: (0x01) Error logging supported
Short self-test routine
recommended polling time: ( 1) Minutes
Extended self-test routine
recommended polling time: ( 48) Minutes
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute Flag Value Worst Threshold Raw Value
( 1)Raw Read Error Rate 0x000b 095 095 060 458761
( 2)Throughput Performance 0x0005 148 148 050 264
( 3)Spin Up Time 0x0007 100 100 024 291
( 4)Start Stop Count 0x0012 100 100 000 6
( 5)Reallocated Sector Ct 0x0033 100 100 005 7
( 7)Seek Error Rate 0x000b 100 100 067 0
( 8)Seek Time Preformance 0x0005 123 123 000 37
( 9)Power On Hours 0x0012 100 100 000 709
( 10)Spin Retry Count 0x0013 100 100 060 0
( 12)Power Cycle Count 0x0032 100 100 000 6
(192)Power-Off Retract Count 0x0032 100 100 050 21
(193)Load Cycle Count 0x0012 100 100 050 21
(194)Temperature 0x0002 196 196 000 1441854
(196)Reallocated Event Count 0x0032 100 100 000 7
(197)Current Pending Sector 0x0022 100 100 000 3
(198)Offline Uncorrectable 0x0008 100 100 000 3
(199)UDMA CRC Error Count 0x000a 200 200 000 0
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 01
ATA Error Count: 1
Non-Fatal Count: 0
Error Log Structure 1:
DCR FR SC SN CL SH D/H CR Timestamp
00 00 08 22 1d 3f e0 25 1851604
00 00 08 aa 2b 3f e0 25 1851604
00 00 08 6a 1d 3f e0 25 1851604
00 00 08 02 96 3f e0 25 1851604
00 00 08 9a 2d 3f e0 25 1851604
00 40 08 9a 2d 3f e2 51 0
Error condition: 0 Error State: 3
Number of Hours in Drive Life: 660 (life of the drive in hours)
Eep.
Are these errors set by the drive itself, or could a faulty harddrive
controller / driver cause them? FWIW, I spoke offline to somebody about this
last week who seemed to think that it was an Alan Cox APIC bug.
- Erik
next prev parent reply other threads:[~2003-10-03 12:49 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2003-10-03 11:23 CMD680, kernel 2.4.21, and heartache Erik Bourget
2003-10-03 11:59 ` John Bradford
2003-10-03 12:23 ` Erik Bourget
2003-10-03 12:40 ` John Bradford
2003-10-03 12:48 ` Erik Bourget [this message]
2003-10-03 13:11 ` John Bradford
2003-10-03 18:10 ` Tomasz Rola
2003-10-03 18:22 ` Erik Bourget
2003-10-03 18:47 ` John Bradford
2003-10-04 1:57 ` jimbleferret
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87he2qtrll.fsf@loki.odinnet \
--to=erik@midmaine.com \
--cc=john@grabjohn.com \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox