* raid/device failure
@ 2013-02-11 1:27 Thomas Fjellstrom
2013-02-11 2:09 ` Phil Turmel
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 1:27 UTC (permalink / raw)
To: linux-raid
I've re-configured my NAS box (still haven't put it into "production") to be a
raid5 over 7 2TB consumer Seagate Barracuda drives, and with some tweaking,
performance was looking stellar.
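(For reference, the array is /dev/md0, a raid5 across sda, sdb, sdc, sde, sdf,
sdg and sdh, as the RAID conf printout further down shows. A rough sketch of the
creation, with chunk size and other options omitted since those were part of the
tweaking:
  mdadm --create /dev/md0 --level=5 --raid-devices=7 \
        /dev/sda /dev/sdb /dev/sdc /dev/sde /dev/sdf /dev/sdg /dev/sdh
)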
Unfortunately I started seeing some messages in dmesg that worried me:
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Now, nothing actually seemed amiss other than those messages at that point. But
much later down the line I got the following: http://pastebin.com/a5uTs5fT
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 3a 3e 30 08 00 02 98 00
end_request: I/O error, dev sdh, sector 977154056
md/raid:md0: read error corrected (8 sectors at 977154056 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154064 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154072 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154080 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154088 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154096 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154104 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154112 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154120 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154128 on sdh)
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 c4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587659776
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 d4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587663872
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 e0 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587666944
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 e4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587667968
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 ec 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587670016
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 f0 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587671040
raid5_end_read_request: 73 callbacks suppressed
md/raid:md0: read error corrected (8 sectors at 1587660768 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587660776 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587660784 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587660792 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663872 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663880 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663888 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663896 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663904 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663912 on sdh)
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 e4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587667968
md/raid:md0: read error NOT corrected!! (sector 1587667968 on sdh).
md/raid:md0: Disk failure on sdh, disabling device.
md/raid:md0: Operation continuing on 6 devices.
md/raid:md0: read error not correctable (sector 1587667976 on sdh).
md/raid:md0: read error not correctable (sector 1587667984 on sdh).
md/raid:md0: read error not correctable (sector 1587667992 on sdh).
md/raid:md0: read error not correctable (sector 1587668000 on sdh).
md/raid:md0: read error not correctable (sector 1587668008 on sdh).
md/raid:md0: read error not correctable (sector 1587668016 on sdh).
md/raid:md0: read error not correctable (sector 1587668024 on sdh).
md/raid:md0: read error not correctable (sector 1587668032 on sdh).
md/raid:md0: read error not correctable (sector 1587668040 on sdh).
md/raid:md0: read error not correctable (sector 1587668048 on sdh).
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a2 08 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587677184
RAID conf printout:
--- level:5 rd:7 wd:6
disk 0, o:1, dev:sda
disk 1, o:1, dev:sdb
disk 2, o:1, dev:sdc
disk 3, o:1, dev:sde
disk 4, o:1, dev:sdf
disk 5, o:1, dev:sdg
disk 6, o:0, dev:sdh
RAID conf printout:
--- level:5 rd:7 wd:6
disk 0, o:1, dev:sda
disk 1, o:1, dev:sdb
disk 2, o:1, dev:sdc
disk 3, o:1, dev:sde
disk 4, o:1, dev:sdf
disk 5, o:1, dev:sdg
I've run full S.M.A.R.T. tests (except the conveyance test; I'll probably run
that tonight and see what happens) on all drives in the array, and there are no
obvious warnings or errors in the S.M.A.R.T. results at all, including
reallocated (pending or not) sectors.
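(The conveyance test is just another smartctl self-test mode; a rough sketch of
what I'd run, using sdh, the drive that got kicked, as the example:
  smartctl -t conveyance /dev/sdh    # short test aimed at transport/handling damage
  smartctl -l selftest /dev/sdh      # read the self-test log once it finishes
)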
While searching for possible causes, I've seen references to people hitting
this error with faulty cables or SAS backplanes. Is that a likely scenario?
The cables are brand new, but anything is possible.
The card is an IBM M1015 8-port HBA flashed with the LSI 9211-8i IT firmware,
and no BIOS.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
@ 2013-02-11 2:09 ` Phil Turmel
2013-02-11 2:52 ` EJ Vincent
2013-02-11 2:55 ` Thomas Fjellstrom
2013-02-11 3:22 ` Brad Campbell
` (2 subsequent siblings)
3 siblings, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2013-02-11 2:09 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> I've re-configured my NAS box (still haven't put it into "production") to be a
> raid5 over 7 2TB consumer seagate barracuda drives, and with some tweaking,
> performance was looking stellar.
>
> Unfortunately I started seeing some messages in dmesg that worried me:
[trim /]
The MD subsystem keeps a count of read errors on each device, corrected
or not, and kicks the drive out when the count reaches twenty (20).
Every hour, the accumulated count is cut in half to allow for general
URE "maintenenance" in regular scrubs. This behavior and the count are
hardcoded in the kernel source.
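(The default of 20 is indeed hardcoded, but newer kernels also expose it per
array through sysfs; a quick sketch, assuming the array is md0 and the kernel is
recent enough to have the knob:
  cat /sys/block/md0/md/max_read_errors        # shows the current threshold, default 20
  echo 50 > /sys/block/md0/md/max_read_errors  # e.g. raise it while diagnosing a flaky link
)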
> I've run full S.M.A.R.T. tests (except the conveyance test, probably run that
> tonight and see what happens) on all drives in the array, and there are no
> obvious warnings or errors in the S.M.A.R.T. results at all. Including
> reallocated (pending or not) sectors.
MD fixed most of these errors, so I wouldn't expect to see them in SMART
unless the fix triggered a reallocation. But some weren't corrected--so I
would be concerned that MD and SMART don't agree.
Have these drives ever been scrubbed? (I vaguely recall you mentioning
new drives...) If they are new and already had a URE, I'd be concerned
about mishandling during shipping. If they aren't new, I'd
destructively exercise them and retest.
> I've seen references while searching for possible causes, where people had
> this error occur with faulty cables, or SAS backplanes. Is this a likely
> scenario? The cables are brand new, but anything is possible.
>
> The card is an IBM M1015 8 port HBA flashed with the LSI 9211-8i IT firmware,
> and no BIOS.
It might not hurt to recheck your power supply rating vs. load. If you
can't find anything else, a data-logging voltmeter with min/max capture
would be my tool of choice.
http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058
Phil
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 2:09 ` Phil Turmel
@ 2013-02-11 2:52 ` EJ Vincent
2013-02-11 3:44 ` Phil Turmel
2013-02-11 2:55 ` Thomas Fjellstrom
1 sibling, 1 reply; 11+ messages in thread
From: EJ Vincent @ 2013-02-11 2:52 UTC (permalink / raw)
Cc: linux-raid
On 2/10/2013 9:09 PM, Phil Turmel wrote:
> Have these drives ever been scrubbed? (I vaguely recall you mentioning
> new drives...) If they are new and already had a URE, I'd be concerned
> about mishandling during shipping. If they aren't new, I'd
> destructively exercise them and retest.
Hi Phil,
Could you elaborate on procedures and tools to thoroughly exercise newly
purchased drives? Are you talking about programs such as 'badblocks'?
Thanks.
-EJ
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 2:09 ` Phil Turmel
2013-02-11 2:52 ` EJ Vincent
@ 2013-02-11 2:55 ` Thomas Fjellstrom
1 sibling, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 2:55 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid
On February 10, 2013, Phil Turmel wrote:
> On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> > I've re-configured my NAS box (still haven't put it into "production") to
> > be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> > tweaking, performance was looking stellar.
>
> > Unfortunately I started seeing some messages in dmesg that worried me:
> [trim /]
>
> The MD subsystem keeps a count of read errors on each device, corrected
> or not, and kicks the drive out when the count reaches twenty (20).
> Every hour, the accumulated count is cut in half to allow for general
> URE "maintenenance" in regular scrubs. This behavior and the count are
> hardcoded in the kernel source.
>
Interesting. That's good to know.
> > I've run full S.M.A.R.T. tests (except the conveyance test, probably run
> > that tonight and see what happens) on all drives in the array, and there
> > are no obvious warnings or errors in the S.M.A.R.T. results at all.
> > Including reallocated (pending or not) sectors.
>
> MD fixed most of these errors, so I wouldn't expect to see them in SMART
> unless the fix triggered a reallocation. But some weren't corrected--so I
> would be concerned that MD and SMART don't agree.
That is what I was wondering. I thought an uncorrected read error meant it
wrote the data back out, and then a subsequent read of that data was still wrong.
> Have these drives ever been scrubbed? (I vaguely recall you mentioning
> new drives...) If they are new and already had a URE, I'd be concerned
> about mishandling during shipping. If they aren't new, I'd
> destructively exercise them and retest.
They are new in that they haven't been used very much at all yet, and I
haven't done a full scrub over every sector. I have run some lengthy tests
using iozone over 32GB or more of space (individually, and as part of a raid6),
but since a bunch of parameters have changed from my last setup (raid5 vs raid6,
xfs inode32 vs inode64), and xfs/md may or may not have allocated the test
files from different areas of the device, I can't be sure that the same
general areas of the disks were being accessed.
I did think that a full destructive write test might be in order, just to make
sure. I've seen a drive throw errors at me, refuse to reallocate a sector
until it was written over manually, and then work fine afterwards.
> > I've seen references while searching for possible causes, where people
> > had this error occur with faulty cables, or SAS backplanes. Is this a
> > likely scenario? The cables are brand new, but anything is possible.
> >
> > The card is an IBM M1015 8 port HBA flashed with the LSI 9211-8i IT
> > firmware, and no BIOS.
>
> It might not hurt to recheck your power supply rating vs. load. If you
> can't find anything else, a data-logging voltmeter with min/max capture
> would be my tool of choice.
>
> http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058
The PSU is over-specced if anything, but that doesn't mean it's not faulty in
some way. It's a Seasonic G series 450W 80+ Gold PSU. The system at full load
should come in at just over half of that (Core i3 2120, Intel S1200KP mini-ITX
board, HBA, 7 HDDs, 2 SSDs, 2 x 8GB DDR3-1333 ECC RAM).
I have an Agilent U1253B ( http://goo.gl/kl1aC ) which should be adequate to
test with.
The NAS is on a 1000VA (600W?) UPS, so incoming power should be decently
clean and even (assuming the UPS isn't bad).
> Phil
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
2013-02-11 2:09 ` Phil Turmel
@ 2013-02-11 3:22 ` Brad Campbell
2013-02-11 7:55 ` Thomas Fjellstrom
2013-02-11 8:29 ` Roy Sigurd Karlsbakk
2013-02-12 22:31 ` Thomas Fjellstrom
3 siblings, 1 reply; 11+ messages in thread
From: Brad Campbell @ 2013-02-11 3:22 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
On 11/02/13 09:27, Thomas Fjellstrom wrote:
> sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
The CRC error there is the key. Check your cables, backplane & PSU.
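(A quick way to see the drive-side view of link trouble is the interface CRC
counter in SMART, usually attribute 199, UDMA_CRC_Error_Count, though the name
varies by vendor; a sketch, assuming smartctl can talk to the drive through the
HBA:
  smartctl -A /dev/sdh | grep -i crc
A raw value that keeps climbing points at the link, i.e. cable, backplane or
power, rather than at the platters.)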
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 2:52 ` EJ Vincent
@ 2013-02-11 3:44 ` Phil Turmel
2013-02-11 20:28 ` EJ Vincent
0 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2013-02-11 3:44 UTC (permalink / raw)
To: EJ Vincent
On 02/10/2013 09:52 PM, EJ Vincent wrote:
> On 2/10/2013 9:09 PM, Phil Turmel wrote:
>> Have these drives ever been scrubbed? (I vaguely recall you mentioning
>> new drives...) If they are new and already had a URE, I'd be concerned
>> about mishandling during shipping. If they aren't new, I'd
>> destructively exercise them and retest.
>
> Hi Phil,
>
> Could you elaborate on procedures and tools to thoroughly exercise newly
> purchased drives? Are you talking about programs such as 'badblocks'?
Yes, badblocks is convenient because it is part of e2fsprogs, which
pretty much ships by default in all distros.
What I recommend:
Record the complete drive status as unpacked:
1) smartctl -x /dev/sdX >xxxx-as-received.smart.txt
Userspace surface check:
2) badblocks -w -b 4096 /dev/sdX
Security erase (vital for SSDs):
3a) hdparm --user-master u --security-set-pass password /dev/sdX
3b) hdparm --user-master u --security-erase password /dev/sdX
(wait until done)
Long drive self-test:
4) smartctl -t long /dev/sdX
(wait until done)
Record the complete drive status post-test:
5) smartctl -x /dev/sdX >xxxx-as-tested.smart.txt
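For a whole batch of drives the steps wrap up into a loop easily enough. A
sketch only, with the security-erase step left out for brevity and a device
list you would obviously adjust; everything here is destructive:
  #!/bin/sh
  for d in sdb sdc sdd; do
      smartctl -x /dev/$d > $d-as-received.smart.txt
      badblocks -w -b 4096 -s /dev/$d
      smartctl -t long /dev/$d
      # the long self-test runs inside the drive; poll until it is done
      while smartctl -a /dev/$d | grep -q 'in progress'; do sleep 600; done
      smartctl -x /dev/$d > $d-as-tested.smart.txt
  done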
I won't accept any reallocations in a new drive, and no more than single
digits in older drives. In my (subjective) experience, once
reallocations get into double digits, their incidence seems to accelerate.
I also pay extra attention to desktop drives with 30,000+ hours, as I
haven't had any get to 40,000.
Phil
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 3:22 ` Brad Campbell
@ 2013-02-11 7:55 ` Thomas Fjellstrom
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 7:55 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
On February 10, 2013, Brad Campbell wrote:
> On 11/02/13 09:27, Thomas Fjellstrom wrote:
> > sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
>
> The CRC error there is the key. Check your cables, backplane & PSU.
>
Very good to know, thank you. :) Google had hinted at the same thing. I've gone
into the machine and made sure everything was snug, just in case it was a loose
cable or connection on the backplane.
I've --add'ed the drive back to the array; I think that should make a simple
test to see if anything reallocates. After that, I'll try some more read and
mixed read/write tests to see if the error is going to pop up again, but so far,
after a few hours, not even a single warning*.
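(Concretely that was just the plain add plus keeping an eye on the rebuild; a
sketch of the obvious commands:
  mdadm /dev/md0 --add /dev/sdh
  watch -n 30 cat /proc/mdstat
)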
Should the problem come back, I'll follow up with a power test, then some
cable tests. It seemed to always error out on the same drive (not 100% sure,
but it looks that way), so if it's not likely a drive problem, it may be one
SFF-8087 breakout cable. Swapping the cables should then change which drive
the errors happen on; if it does, I'll know it's the cable.
* except for this: "[ 1052.626900] The scan_unevictable_pages sysctl/node-interface
has been disabled for lack of a legitimate use case. If you have one, please send
an email to linux-mm@kvack.org." I have absolutely no idea what caused that at
this point in time, and I don't think it's applicable here.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
2013-02-11 2:09 ` Phil Turmel
2013-02-11 3:22 ` Brad Campbell
@ 2013-02-11 8:29 ` Roy Sigurd Karlsbakk
2013-02-11 9:13 ` Thomas Fjellstrom
2013-02-12 22:31 ` Thomas Fjellstrom
3 siblings, 1 reply; 11+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-11 8:29 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
> I've re-configured my NAS box (still haven't put it into "production") to
> be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> tweaking, performance was looking stellar.
Using that many drives in raid-5 is a bit risky. Better use raid-6…
Just my 2c
Vennlige hilsener / Best regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 8:29 ` Roy Sigurd Karlsbakk
@ 2013-02-11 9:13 ` Thomas Fjellstrom
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 9:13 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: linux-raid
On February 11, 2013, Roy Sigurd Karlsbakk wrote:
> > I've re-configured my NAS box (still haven't put it into "production") to
> > be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> > tweaking, performance was looking stellar.
>
> Using that many drives in raid-5 is a bit risky. Better use raid-6…
>
> Just my 2c
Yeah, I had first set it up as raid6, but I will have a separate backup array
for this one. Nothing fancy, just a linear concat of some 3TB+1TB disks,
though it may eventually get upgraded to a raid5 of 3TB disks.
I figure that if I lose three disks within the 5 hours it takes to do a full
rebuild, I probably have larger problems to worry about than losing my
collection of random linux isos, even more random downloaded files, a debian
apt mirror, one copy of my important backed-up files, and miscellaneous media
files. I'll be running with the bitmap enabled, so if there are minor
problems, the array will take minutes to resync rather than hours.
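(For anyone following along, a write-intent bitmap can be added to an existing
array after the fact; a sketch, assuming an internal bitmap on md0:
  mdadm --grow /dev/md0 --bitmap=internal
)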
Also, since I've gotten decent power supplies and UPSes, I haven't had very
many issues with hard drives. I personally haven't had a drive failure in years
that wasn't within the first couple of weeks of purchase, and I always try to
stress test disks before putting them into service to weed out SIDS
(infant-mortality failures).
This way I get the 10TB+ I was aiming for when I decided to build a NAS. In
raid6, I'd get 9.2TB or so. Very close, I know ;)
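(The arithmetic: 7 x 2TB in raid5 leaves (7-1) x 2TB = 12TB, roughly 10.9TiB
usable; raid6 would leave (7-2) x 2TB = 10TB, roughly 9.1TiB.)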
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 3:44 ` Phil Turmel
@ 2013-02-11 20:28 ` EJ Vincent
0 siblings, 0 replies; 11+ messages in thread
From: EJ Vincent @ 2013-02-11 20:28 UTC (permalink / raw)
Cc: linux-raid
On 2/10/2013 10:44 PM, Phil Turmel wrote:
[trim /]
Very clear. Thanks!
-EJ
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
` (2 preceding siblings ...)
2013-02-11 8:29 ` Roy Sigurd Karlsbakk
@ 2013-02-12 22:31 ` Thomas Fjellstrom
3 siblings, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-12 22:31 UTC (permalink / raw)
To: linux-raid
On Sun February 10, 2013, Thomas Fjellstrom wrote:
> I've re-configured my NAS box (still haven't put it into "production") to
> be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> tweaking, performance was looking stellar.
>
> Unfortunately I started seeing some messages in dmesg that worried me:
>
> mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11),
> sub_code(0x0d01)
>
[snip]
So the more serious issues are gone after some tests, but the above message
is back after a while:
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
I tried looking through the mpt2sas code for sub-code 0x0d01, but it didn't
seem to be in the code at all. Does anyone know what it means?
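(The sub-code doesn't appear to be in the mainline tree at all, e.g.:
  grep -rn "0x0d01" drivers/scsi/mpt2sas/
My understanding is that the sub-code tables live in LSI's separately
distributed MPI 2.0 log headers rather than in the kernel, which would explain
the empty search.)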
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread