* raid/device failure
@ 2013-02-11 1:27 Thomas Fjellstrom
2013-02-11 2:09 ` Phil Turmel
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 1:27 UTC (permalink / raw)
To: linux-raid
I've re-configured my NAS box (still haven't put it into "production") to be a
raid5 over 7 2TB consumer Seagate Barracuda drives, and with some tweaking,
performance was looking stellar.
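(For reference, the array is /dev/md0, a raid5 across sda, sdb, sdc, sde, sdf,
sdg and sdh, as the RAID conf printout further down shows. A rough sketch of the
creation, with chunk size and other options omitted since those were part of the
tweaking:
  mdadm --create /dev/md0 --level=5 --raid-devices=7 \
        /dev/sda /dev/sdb /dev/sdc /dev/sde /dev/sdf /dev/sdg /dev/sdh
)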
Unfortunately I started seeing some messages in dmesg that worried me:
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Now, nothing actually seemed amiss other than those messages at that point. But
much later down the line I got the following: http://pastebin.com/a5uTs5fT
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 3a 3e 30 08 00 02 98 00
end_request: I/O error, dev sdh, sector 977154056
md/raid:md0: read error corrected (8 sectors at 977154056 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154064 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154072 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154080 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154088 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154096 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154104 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154112 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154120 on sdh)
md/raid:md0: read error corrected (8 sectors at 977154128 on sdh)
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 c4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587659776
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 d4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587663872
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 e0 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587666944
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 e4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587667968
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 ec 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587670016
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 f0 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587671040
raid5_end_read_request: 73 callbacks suppressed
md/raid:md0: read error corrected (8 sectors at 1587660768 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587660776 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587660784 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587660792 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663872 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663880 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663888 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663896 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663904 on sdh)
md/raid:md0: read error corrected (8 sectors at 1587663912 on sdh)
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a1 e4 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587667968
md/raid:md0: read error NOT corrected!! (sector 1587667968 on sdh).
md/raid:md0: Disk failure on sdh, disabling device.
md/raid:md0: Operation continuing on 6 devices.
md/raid:md0: read error not correctable (sector 1587667976 on sdh).
md/raid:md0: read error not correctable (sector 1587667984 on sdh).
md/raid:md0: read error not correctable (sector 1587667992 on sdh).
md/raid:md0: read error not correctable (sector 1587668000 on sdh).
md/raid:md0: read error not correctable (sector 1587668008 on sdh).
md/raid:md0: read error not correctable (sector 1587668016 on sdh).
md/raid:md0: read error not correctable (sector 1587668024 on sdh).
md/raid:md0: read error not correctable (sector 1587668032 on sdh).
md/raid:md0: read error not correctable (sector 1587668040 on sdh).
md/raid:md0: read error not correctable (sector 1587668048 on sdh).
sd 0:0:7:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:7:0: [sdh] Sense Key : Aborted Command [current]
sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 5e a2 08 00 00 04 00 00
end_request: I/O error, dev sdh, sector 1587677184
RAID conf printout:
--- level:5 rd:7 wd:6
disk 0, o:1, dev:sda
disk 1, o:1, dev:sdb
disk 2, o:1, dev:sdc
disk 3, o:1, dev:sde
disk 4, o:1, dev:sdf
disk 5, o:1, dev:sdg
disk 6, o:0, dev:sdh
RAID conf printout:
--- level:5 rd:7 wd:6
disk 0, o:1, dev:sda
disk 1, o:1, dev:sdb
disk 2, o:1, dev:sdc
disk 3, o:1, dev:sde
disk 4, o:1, dev:sdf
disk 5, o:1, dev:sdg
I've run full S.M.A.R.T. tests (except the conveyance test; I'll probably run
that tonight and see what happens) on all drives in the array, and there are no
obvious warnings or errors in the S.M.A.R.T. results at all, including
reallocated (pending or not) sectors.
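(The conveyance test is just another smartctl self-test mode; a rough sketch of
what I'd run, using sdh, the drive that got kicked, as the example:
  smartctl -t conveyance /dev/sdh    # short test aimed at transport/handling damage
  smartctl -l selftest /dev/sdh      # read the self-test log once it finishes
)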
While searching for possible causes, I've seen references to people hitting
this error with faulty cables or SAS backplanes. Is that a likely scenario?
The cables are brand new, but anything is possible.
The card is an IBM M1015 8-port HBA flashed with the LSI 9211-8i IT firmware,
and no BIOS.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
@ 2013-02-11 2:09 ` Phil Turmel
2013-02-11 2:52 ` EJ Vincent
2013-02-11 2:55 ` Thomas Fjellstrom
2013-02-11 3:22 ` Brad Campbell
` (2 subsequent siblings)
3 siblings, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2013-02-11 2:09 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> I've re-configured my NAS box (still haven't put it into "production") to be a
> raid5 over 7 2TB consumer seagate barracuda drives, and with some tweaking,
> performance was looking stellar.
>
> Unfortunately I started seeing some messages in dmesg that worried me:
[trim /]
The MD subsystem keeps a count of read errors on each device, corrected
or not, and kicks the drive out when the count reaches twenty (20).
Every hour, the accumulated count is cut in half to allow for general
URE "maintenenance" in regular scrubs. This behavior and the count are
hardcoded in the kernel source.
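(The default of 20 is indeed hardcoded, but newer kernels also expose it per
array through sysfs; a quick sketch, assuming the array is md0 and the kernel is
recent enough to have the knob:
  cat /sys/block/md0/md/max_read_errors        # shows the current threshold, default 20
  echo 50 > /sys/block/md0/md/max_read_errors  # e.g. raise it while diagnosing a flaky link
)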
> I've run full S.M.A.R.T. tests (except the conveyance test, probably run that
> tonight and see what happens) on all drives in the array, and there are no
> obvious warnings or errors in the S.M.A.R.T. results at all. Including
> reallocated (pending or not) sectors.
MD fixed most of these errors, so I wouldn't expect to see them in SMART
unless the fix triggered a reallocation. But some weren't corrected--so I
would be concerned that MD and SMART don't agree.
Have these drives ever been scrubbed? (I vaguely recall you mentioning
new drives...) If they are new and already had a URE, I'd be concerned
about mishandling during shipping. If they aren't new, I'd
destructively exercise them and retest.
> I've seen references while searching for possible causes, where people had
> this error occur with faulty cables, or SAS backplanes. Is this a likely
> scenario? The cables are brand new, but anything is possible.
>
> The card is an IBM M1015 8 port HBA flashed with the LSI 9211-8i IT firmware,
> and no BIOS.
It might not hurt to recheck your power supply rating vs. load. If you
can't find anything else, a data-logging voltmeter with min/max capture
would be my tool of choice.
http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058
Phil
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 2:09 ` Phil Turmel
@ 2013-02-11 2:52 ` EJ Vincent
2013-02-11 3:44 ` Phil Turmel
2013-02-11 2:55 ` Thomas Fjellstrom
1 sibling, 1 reply; 11+ messages in thread
From: EJ Vincent @ 2013-02-11 2:52 UTC (permalink / raw)
Cc: linux-raid
On 2/10/2013 9:09 PM, Phil Turmel wrote:
> Have these drives ever been scrubbed? (I vaguely recall you mentioning
> new drives...) If they are new and already had a URE, I'd be concerned
> about mishandling during shipping. If they aren't new, I'd
> destructively exercise them and retest.
Hi Phil,
Could you elaborate on procedures and tools to thoroughly exercise newly
purchased drives? Are you talking about programs such as 'badblocks'?
Thanks.
-EJ
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 2:09 ` Phil Turmel
2013-02-11 2:52 ` EJ Vincent
@ 2013-02-11 2:55 ` Thomas Fjellstrom
1 sibling, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 2:55 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid
On February 10, 2013, Phil Turmel wrote:
> On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> > I've re-configured my NAS box (still haven't put it into "production") to
> > be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> > tweaking, performance was looking stellar.
>
> > Unfortunately I started seeing some messages in dmesg that worried me:
> [trim /]
>
> The MD subsystem keeps a count of read errors on each device, corrected
> or not, and kicks the drive out when the count reaches twenty (20).
> Every hour, the accumulated count is cut in half to allow for general
> URE "maintenenance" in regular scrubs. This behavior and the count are
> hardcoded in the kernel source.
>
Interesting. That's good to know.
> > I've run full S.M.A.R.T. tests (except the conveyance test, probably run
> > that tonight and see what happens) on all drives in the array, and there
> > are no obvious warnings or errors in the S.M.A.R.T. results at all.
> > Including reallocated (pending or not) sectors.
>
> MD fixed most of these errors, so I wouldn't expect to see them in SMART
> unless the fix triggered a reallocation. But some weren't corrected--so I
> would be concerned that MD and SMART don't agree.
That is what I was wondering. I thought an uncorrected read error meant it
wrote the data back out, and then a subsequent read of that data was still wrong.
> Have these drives ever been scrubbed? (I vaguely recall you mentioning
> new drives...) If they are new and already had a URE, I'd be concerned
> about mishandling during shipping. If they aren't new, I'd
> destructively exercise them and retest.
They are new in that they haven't been used very much at all yet, and I
haven't done a full scrub over every sector. I have run some lengthy tests
using iozone over 32GB or more of space (individually, and as part of a raid6),
but since a bunch of parameters have changed from my last setup (raid5 vs raid6,
xfs inode32 vs inode64), and xfs/md may or may not have allocated the test
files from different areas of the device, I can't be sure that the same
general areas of the disks were being accessed.
I did think that a full destructive write test might be in order, just to make
sure. I've seen a drive throw errors at me, refuse to reallocate a sector
until it was written over manually, and then work fine afterwards.
> > I've seen references while searching for possible causes, where people
> > had this error occur with faulty cables, or SAS backplanes. Is this a
> > likely scenario? The cables are brand new, but anything is possible.
> >
> > The card is an IBM M1015 8 port HBA flashed with the LSI 9211-8i IT
> > firmware, and no BIOS.
>
> It might not hurt to recheck your power supply rating vs. load. If you
> can't find anything else, a data-logging voltmeter with min/max capture
> would be my tool of choice.
>
> http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058
The PSU is over-specced if anything, but that doesn't mean it's not faulty in
some way. It's a Seasonic G series 450W 80+ Gold PSU. The system at full load
should come in at just over half of that (Core i3 2120, Intel S1200KP mini-ITX
board, HBA, 7 HDDs, 2 SSDs, 2 x 8GB DDR3-1333 ECC RAM).
I have an Agilent U1253B ( http://goo.gl/kl1aC ) which should be adequate to
test with.
The NAS is on a 1000VA (600W?) UPS, so incoming power should be decently
clean and even (assuming the UPS isn't bad).
> Phil
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
2013-02-11 2:09 ` Phil Turmel
@ 2013-02-11 3:22 ` Brad Campbell
2013-02-11 7:55 ` Thomas Fjellstrom
2013-02-11 8:29 ` Roy Sigurd Karlsbakk
2013-02-12 22:31 ` Thomas Fjellstrom
3 siblings, 1 reply; 11+ messages in thread
From: Brad Campbell @ 2013-02-11 3:22 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
On 11/02/13 09:27, Thomas Fjellstrom wrote:
> sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
The CRC error there is the key. Check your cables, backplane & PSU.
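(A quick way to see the drive-side view of link trouble is the interface CRC
counter in SMART, usually attribute 199, UDMA_CRC_Error_Count, though the name
varies by vendor; a sketch, assuming smartctl can talk to the drive through the
HBA:
  smartctl -A /dev/sdh | grep -i crc
A raw value that keeps climbing points at the link, i.e. cable, backplane or
power, rather than at the platters.)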
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 2:52 ` EJ Vincent
@ 2013-02-11 3:44 ` Phil Turmel
2013-02-11 20:28 ` EJ Vincent
0 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2013-02-11 3:44 UTC (permalink / raw)
To: EJ Vincent
On 02/10/2013 09:52 PM, EJ Vincent wrote:
> On 2/10/2013 9:09 PM, Phil Turmel wrote:
>> Have these drives ever been scrubbed? (I vaguely recall you mentioning
>> new drives...) If they are new and already had a URE, I'd be concerned
>> about mishandling during shipping. If they aren't new, I'd
>> destructively exercise them and retest.
>
> Hi Phil,
>
> Could you elaborate on procedures and tools to thoroughly exercise newly
> purchased drives? Are you talking about programs such as 'badblocks'?
Yes, badblocks is convenient because it is part of e2fsprogs, which
pretty much ships by default in all distros.
What I recommend:
Record the complete drive status as unpacked:
1) smartctl -x /dev/sdX >xxxx-as-received.smart.txt
Userspace surface check:
2) badblocks -w -b 4096 /dev/sdX
Security erase (vital for SSDs):
3a) hdparm --user-master u --security-set-pass password /dev/sdX
3b) hdparm --user-master u --security-erase password /dev/sdX
(wait until done)
Long drive self-test:
4) smartctl -t long /dev/sdX
(wait until done)
Record the complete drive status post-test:
5) smartctl -x /dev/sdX >xxxx-as-tested.smart.txt
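For a whole batch of drives the steps wrap up into a loop easily enough. A
sketch only, with the security-erase step left out for brevity and a device
list you would obviously adjust; everything here is destructive:
  #!/bin/sh
  for d in sdb sdc sdd; do
      smartctl -x /dev/$d > $d-as-received.smart.txt
      badblocks -w -b 4096 -s /dev/$d
      smartctl -t long /dev/$d
      # the long self-test runs inside the drive; poll until it is done
      while smartctl -a /dev/$d | grep -q 'in progress'; do sleep 600; done
      smartctl -x /dev/$d > $d-as-tested.smart.txt
  done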
I won't accept any reallocations in a new drive, and no more than single
digits in older drives. In my (subjective) experience, once
reallocations get into double digits, their incidence seems to accelerate.
I also pay extra attention to desktop drives with 30,000+ hours, as I
haven't had any get to 40,000.
Phil
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 3:22 ` Brad Campbell
@ 2013-02-11 7:55 ` Thomas Fjellstrom
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 7:55 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
On February 10, 2013, Brad Campbell wrote:
> On 11/02/13 09:27, Thomas Fjellstrom wrote:
> > sd 0:0:7:0: [sdh] Add. Sense: Information unit iuCRC error detected
>
> The CRC error there is the key. Check your cables, backplane & PSU.
>
Very good to know, thank you. :) Google had hinted at the same thing. I've gone
into the machine and made sure everything was snug, just in case it was a loose
cable or connection on the backplane.
I've --add'ed the drive back to the array; I think that should make a simple
test to see if anything reallocates. After that, I'll try some more read and
mixed read/write tests to see if the error is going to pop up again, but so far,
after a few hours, not even a single warning*.
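(Concretely that was just the plain add plus keeping an eye on the rebuild; a
sketch of the obvious commands:
  mdadm /dev/md0 --add /dev/sdh
  watch -n 30 cat /proc/mdstat
)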
Should the problem come back, I'll follow up with a power test, then some
cable tests. It seemed to always error out on the same drive (not 100% sure,
but it looks that way), so if it's not likely a drive problem, it may be one
SFF-8087 breakout cable. Swapping the cables should then change which drive
the errors happen on; if it does, I'll know it's the cable.
* except for this: "[ 1052.626900] The scan_unevictable_pages sysctl/node-interface
has been disabled for lack of a legitimate use case. If you have one, please send
an email to linux-mm@kvack.org." I have absolutely no idea what caused that at
this point in time, and I don't think it's applicable here.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
2013-02-11 2:09 ` Phil Turmel
2013-02-11 3:22 ` Brad Campbell
@ 2013-02-11 8:29 ` Roy Sigurd Karlsbakk
2013-02-11 9:13 ` Thomas Fjellstrom
2013-02-12 22:31 ` Thomas Fjellstrom
3 siblings, 1 reply; 11+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-11 8:29 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
> I've re-configured my NAS box (still haven't put it into "production") to
> be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> tweaking, performance was looking stellar.
Using that many drives in raid-5 is a bit risky. Better use raid-6…
Just my 2c
Vennlige hilsener / Best regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 8:29 ` Roy Sigurd Karlsbakk
@ 2013-02-11 9:13 ` Thomas Fjellstrom
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-11 9:13 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: linux-raid
On February 11, 2013, Roy Sigurd Karlsbakk wrote:
> > I've re-configured my NAS box (still haven't put it into "production") to
> > be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> > tweaking, performance was looking stellar.
>
> Using that many drives in raid-5 is a bit risky. Better use raid-6…
>
> Just my 2c
Yeah, I had first set it up as raid6, but I will have a separate backup array
for this one. Nothing fancy, just a linear concat of some 3TB+1TB disks,
though it may eventually get upgraded to a raid5 of 3TB disks.
I figure that if I lose three disks within the 5 hours it takes to do a full
rebuild, I probably have larger problems to worry about than losing my
collection of random linux isos, even more random downloaded files, a debian
apt mirror, one copy of my important backed-up files, and miscellaneous media
files. I'll be running with the bitmap enabled, so if there are minor
problems, the array will take minutes to resync rather than hours.
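(For anyone following along, a write-intent bitmap can be added to an existing
array after the fact; a sketch, assuming an internal bitmap on md0:
  mdadm --grow /dev/md0 --bitmap=internal
)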
Also, since I've gotten decent power supplies and UPSes, I haven't had very
many issues with hard drives. I personally haven't had a drive failure in years
that wasn't within the first couple of weeks of purchase, and I always try to
stress test disks before putting them into service to weed out SIDS
(infant-mortality failures).
This way I get the 10TB+ I was aiming for when I decided to build a NAS. In
raid6, I'd get 9.2TB or so. Very close, I know ;)
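(The arithmetic: 7 x 2TB in raid5 leaves (7-1) x 2TB = 12TB, roughly 10.9TiB
usable; raid6 would leave (7-2) x 2TB = 10TB, roughly 9.1TiB.)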
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 3:44 ` Phil Turmel
@ 2013-02-11 20:28 ` EJ Vincent
0 siblings, 0 replies; 11+ messages in thread
From: EJ Vincent @ 2013-02-11 20:28 UTC (permalink / raw)
Cc: linux-raid
On 2/10/2013 10:44 PM, Phil Turmel wrote:
[trim /]
Very clear. Thanks!
-EJ
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: raid/device failure
2013-02-11 1:27 raid/device failure Thomas Fjellstrom
` (2 preceding siblings ...)
2013-02-11 8:29 ` Roy Sigurd Karlsbakk
@ 2013-02-12 22:31 ` Thomas Fjellstrom
3 siblings, 0 replies; 11+ messages in thread
From: Thomas Fjellstrom @ 2013-02-12 22:31 UTC (permalink / raw)
To: linux-raid
On Sun February 10, 2013, Thomas Fjellstrom wrote:
> I've re-configured my NAS box (still haven't put it into "production") to
> be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> tweaking, performance was looking stellar.
>
> Unfortunately I started seeing some messages in dmesg that worried me:
>
> mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11),
> sub_code(0x0d01)
>
[snip]
So the more serious issues are gone after some tests, but the above message
is back after a while:
mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
I tried looking through the mpt2sas code for sub-code 0x0d01, but it didn't
seem to be in the code at all. Does anyone know what it means?
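(The sub-code doesn't appear to be in the mainline tree at all, e.g.:
  grep -rn "0x0d01" drivers/scsi/mpt2sas/
My understanding is that the sub-code tables live in LSI's separately
distributed MPI 2.0 log headers rather than in the kernel, which would explain
the empty search.)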
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 11+ messages in thread