* RAID1 == two different ARRAY in scan, and Q on read error corrected
From: Phil Lobbes @ 2008-04-18 19:35 UTC
To: linux-raid
Hi,
I have been lurking for a little while on the mailing list and have been
doing some investigation on my own. I don't mean to impose, and hopefully
this is the right forum for these questions. If anyone has
suggestions/recommendations/guidance on the following two questions, I'm
all ears!
_________________________________________________________________
Q1: RAID1 == two different ARRAY in scan
I recently upgraded my server from Fedora Core 5 to Fedora 8, and along
with that I noticed something that I either overlooked before or that was
perhaps caused during the upgrade. On that system I have a 300G RAID1 mirror:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[0] sdd1[1]
293049600 blocks [2/2] [UU]
unused devices: <none>
When I use mdadm --examine --scan, my 300G RAID1 mirror shows up as two
separate arrays, each with its own UUID and its own set of devices:
* (correct) the full-disk partitions, i.e. /dev/sd{c,d}1
* (bogus) the entire devices, i.e. /dev/sd{c,d}
# mdadm --examine --scan --verbose
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=12c2d7a3:0b791468:9e965247:f4354b36
devices=/dev/sdd,/dev/sdc
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=7b879b21:7cc83b9c:765dd3f3:2af46d19
devices=/dev/sdd1,/dev/sdc1
I didn't find a match in a FAQ or other posting so I was hoping to get
some insight/pointers here.
Should I:
a. Ignore this?
b. Zero out the superblock on sd{c,d}? I'm no expert here, so I'm not
positive this is a good option. My theory is that the superblock for
sdc must be different from the superblock for sdc1, so if that is
correct the "fix" might be something like:
# mdadm --zero-superblock /dev/sdc /dev/sdd
Is this correct and safe? No worries about it somehow impacting
/dev/sdc1, /dev/sdd1, and the good mirror, right? (A way to verify
this before zeroing anything is sketched after this list.)
c. Something else altogether?
For what it's worth, I suppose there is a chance I may have caused this
by trying to 'rename' the md# used by the ARRAY /dev/md0 => /dev/md3.
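My tentative plan for verifying that before zeroing anything (just a
sketch, assuming the stale metadata really does live on the whole
devices) would be to compare UUIDs first and only then zero the
whole-device superblocks:
# mdadm --detail /dev/md0 | grep UUID        (UUID of the array actually running)
# mdadm --examine /dev/sdc | grep UUID       (superblock found on the whole device)
# mdadm --examine /dev/sdc1 | grep UUID      (superblock of the real mirror member)
# mdadm --zero-superblock /dev/sdc /dev/sdd  (only if the whole-device UUIDs are the bogus ones)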
-----------------------------------------------------------------
* Disk/Partition info:
NOTE: The valid mirror is built on the partitions /dev/sd{c,d}1, not on
the whole devices /dev/sd{c,d}.
# fdisk -l /dev/sdc /dev/sdd
Disk /dev/sdc: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/sdc1 1 36483 293049666 fd Linux raid autodetect
Disk /dev/sdd: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/sdd1 1 36483 293049666 fd Linux raid autodetect
_________________________________________________________________
* Q2: On "read error corrected" messages
On an unrelated note, during/after the upgrade I noticed that I'm now
seeing a few of these events logged:
Apr 15 11:07:14 kernel: raid1: sdc1: rescheduling sector 517365296
Apr 15 11:07:54 kernel: raid1:md0: read error corrected (8 sectors at 517365296 on sdc1)
Apr 15 11:07:54 kernel: raid1: sdc1: redirecting sector 517365296 to another mirror
Apr 15 11:08:32 kernel: raid1: sdc1: rescheduling sector 517365472
Apr 15 11:09:09 kernel: raid1:md0: read error corrected (8 sectors at 517365472 on sdc1)
Apr 15 11:09:09 kernel: raid1: sdc1: redirecting sector 517365472 to another mirror
And also more of these:
Apr 18 14:01:45 smartd[2104]: Device: /dev/sdc, 3 Currently unreadable (pending) sectors
Apr 18 14:01:45 smartd[2104]: Device: /dev/sdc, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 240 to 241
Apr 18 14:01:45 smartd[2104]: Device: /dev/sdd, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 238 to 239
Here's some info from smartctl:
# smartctl -a /dev/sdc
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Maxtor DiamondMax 10 family (ATA/133 and SATA/150)
Device Model: Maxtor 6B300S0
Serial Number: B60370HH
Firmware Version: BANC1980
User Capacity: 300,090,728,448 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Fri Apr 18 15:09:02 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SMART Error Log Version: 1
ATA Error Count: 36 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 36 occurred at disk power-on lifetime: 27108 hours (1129 days + 12 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
5e 00 00 00 00 00 a0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 00 00 00 00 00 a0 00 18d+12:45:51.593 NOP [Abort queued commands]
00 00 08 1f 5f d6 e0 00 18d+12:45:48.339 NOP [Abort queued commands]
00 00 00 00 00 00 e0 00 18d+12:45:48.338 NOP [Abort queued commands]
00 00 00 00 00 00 a0 00 18d+12:45:48.335 NOP [Abort queued commands]
00 03 46 00 00 00 a0 00 18d+12:45:48.332 NOP [Reserved subcommand]
Luckily, I'm not an expert on hard drives (nor their failures), but I'm
hoping that somebody might be able to give me some insight into any of
this: should I be concerned, or should I just consider these unreadable
sectors "normal" in the life of the drive?
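In case it helps, my plan (just a sketch, and my understanding is that a
SMART self-test only reads, so it should not touch the data) is to run a
long self-test on the suspect drive and see whether those pending sectors
are genuinely unreadable:
# smartctl -t long /dev/sdc      (starts the offline long self-test; it runs for an hour or two)
# smartctl -l selftest /dev/sdc  (shows the result, including the first failing LBA, once it finishes)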
Sincerely,
Phil
* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
From: Richard Scobie @ 2008-04-18 22:02 UTC
To: Linux RAID Mailing List
Phil Lobbes wrote:
____________________________________________________________
>
> Apr 15 11:07:14 kernel: raid1: sdc1: rescheduling sector 517365296
> Apr 15 11:07:54 kernel: raid1:md0: read error corrected (8 sectors at 517365296 on sdc1)
> Apr 15 11:07:54 kernel: raid1: sdc1: redirecting sector 517365296 to another mirror
> Apr 15 11:08:32 kernel: raid1: sdc1: rescheduling sector 517365472
> Apr 15 11:09:09 kernel: raid1:md0: read error corrected (8 sectors at 517365472 on sdc1)
> Apr 15 11:09:09 kernel: raid1: sdc1: redirecting sector 517365472 to another mirror
These entries,
> Apr 18 14:01:45 smartd[2104]: Device: /dev/sdc, 3 Currently unreadable (pending) sectors
and this, indicate that sdc is losing sectors, so you probably want a
backup of the array.
Depending on how important the array is, you could fail and remove sdc
from the array, run dd if=/dev/zero of=/dev/sdc bs=1M to force the drive
to reallocate the pending sectors, and then re-add it.
It may then be fine for some time, but if it continues to gather pending
sectors in the short term, it is probably dying.
Otherwise just replace it with a new one.
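Roughly, the sequence would be something like this (a sketch only; the dd
destroys everything on sdc, including its partition table, so triple-check
the device names first):
# mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
# dd if=/dev/zero of=/dev/sdc bs=1M
# (recreate the single "fd" partition on sdc, then)
# mdadm /dev/md0 --add /dev/sdc1
# cat /proc/mdstat                 (watch the resync back from sdd1)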
Regards,
Richard
* RE: RAID1 == two different ARRAY in scan, and Q on read error corrected
From: David Lethe @ 2008-04-18 23:49 UTC
To: Richard Scobie, Linux RAID Mailing List
You can't assume a disk is losing sectors and failing without running
some diagnostics. If you had an improper shutdown (i.e., power loss, not
a crash) while the disks were writing, then you can get ECC errors. That
does not indicate the disk is bad.
Of course, you must always have a backup, even if both drives are
perfectly fine; RAID1 doesn't protect you from typing rm -rf * tmp
instead of rm -rf *.tmp.
I strongly advise against using Richard's dd if=/dev/zero suggestion.
It puts you at risk, as you only have one online copy of the data while
the zeroed disk rebuilds, unless you have a current backup from which
you can easily do a bare-metal restore. Not worth the risk if you ask me.
Run dd if=/dev/md0 of=/dev/null instead; reading every block forces md to
rewrite any unreadable sectors from the other mirror (there is no parity
in RAID1, but the effect is the same kind of repair). You do this with
both disks online in the RAID1. Furthermore, you can get a report of
which blocks were bad. There is likely also an mdadm or md-level
rescan/repair action, but you'd have to look up the syntax; that is
preferable to using the dd command.
Neither technique will technically check all physical blocks on both
disks, but they will take considerably less clock time, and will
protect your data.
Warning: a block-level dd read of every block in md0 will not necessarily
trigger that repair on all kernels. You probably have to do something to
temporarily bypass the cache; I don't know offhand.
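A sketch of what I mean, assuming your kernel exposes the md sysfs
interface (a Fedora 8 kernel should); echoing "repair" instead of "check"
also rewrites any mismatches it finds:
# echo check > /sys/block/md0/md/sync_action      (read-only scrub of the whole array)
# cat /proc/mdstat                                (shows the check progressing)
# cat /sys/block/md0/md/mismatch_cnt              (non-zero means differences were found)
# dd if=/dev/md0 of=/dev/null bs=1M iflag=direct  (the dd variant; iflag=direct bypasses the page cache)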
Good luck -
David @ santools ^com
http://www.santools.com/smart/unix/manual
(Use a smaller block size with dd if you want a finer-grained report of
which blocks are bad.)
* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
From: Richard Scobie @ 2008-04-19 3:15 UTC
To: Linux RAID Mailing List
David Lethe wrote:
> You can't assume a disk is losing sectors and failing without running
> some diagnostics. If you had an improper shutdown (i.e, power loss, not
> a crash), and disks were writing, then you can get ECC errors. That
> does not indicate the disk is bad.
The error was:
Apr 18 14:01:45 smartd[2104]: Device: /dev/sdc, 3 Currently unreadable
(pending) sectors
These are not ECC errors and the definition I have for pending sector
errors is, "Current count of unstable sectors (waiting for remapping)."
It would be useful to see the Reallocated Sectors Count from the
smartctl output.
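Something like this should pull just the relevant attributes (smartctl -A
prints the vendor attribute table):
# smartctl -A /dev/sdc | egrep -i 'Reallocated|Pending|Offline_Unc'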
Regards,
Richard
* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
From: Phil Lobbes @ 2008-04-19 17:26 UTC
To: Linux RAID Mailing List
> The error was:
>
> Apr 18 14:01:45 smartd[2104]: Device: /dev/sdc, 3 Currently
> unreadable (pending) sectors
>
> These are not ECC errors and the definition I have for pending sector
> errors is, "Current count of unstable sectors (waiting for
> remapping)."
>
> It would be useful to see the Reallocated Sectors Count from the
> smartctl output.
Here's the requested count:
5 Reallocated_Sector_Ct 0x0033 251 251 063 Pre-fail Always - 22
And the full set in case it helps give a clearer picture...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0027 186 181 063 Pre-fail Always - 24376
4 Start_Stop_Count 0x0032 252 252 000 Old_age Always - 1988
5 Reallocated_Sector_Ct 0x0033 251 251 063 Pre-fail Always - 22
6 Read_Channel_Margin 0x0001 253 253 100 Pre-fail Offline - 0
7 Seek_Error_Rate 0x000a 253 252 000 Old_age Always - 0
8 Seek_Time_Performance 0x0027 240 224 187 Pre-fail Always - 64525
9 Power_On_Minutes 0x0032 170 170 000 Old_age Always - 614h+37m
10 Spin_Retry_Count 0x002b 253 252 157 Pre-fail Always - 0
11 Calibration_Retry_Count 0x002b 253 252 223 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 253 253 000 Old_age Always - 6
192 Power-Off_Retract_Count 0x0032 253 253 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 253 253 000 Old_age Always - 0
194 Temperature_Celsius 0x0032 001 253 000 Old_age Always - 53
195 Hardware_ECC_Recovered 0x000a 252 204 000 Old_age Always - 39547
196 Reallocated_Event_Count 0x0008 253 253 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 3
198 Offline_Uncorrectable 0x0008 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0008 199 199 000 Old_age Offline - 0
200 Multi_Zone_Error_Rate 0x000a 253 252 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 253 252 000 Old_age Always - 0
202 TA_Increase_Count 0x000a 253 001 000 Old_age Always - 0
203 Run_Out_Cancel 0x000b 253 076 180 Pre-fail Always In_the_past 0
204 Shock_Count_Write_Opern 0x000a 253 252 000 Old_age Always - 0
205 Shock_Rate_Write_Opern 0x000a 253 252 000 Old_age Always - 0
207 Spin_High_Current 0x002a 253 252 000 Old_age Always - 0
208 Spin_Buzz 0x002a 253 252 000 Old_age Always - 0
209 Offline_Seek_Performnce 0x0024 239 239 000 Old_age Offline - 169
210 Unknown_Attribute 0x0032 253 252 000 Old_age Always - 0
211 Unknown_Attribute 0x0032 253 252 000 Old_age Always - 0
212 Unknown_Attribute 0x0032 253 253 000 Old_age Always - 0
* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
From: Richard Scobie @ 2008-04-19 19:58 UTC
To: Linux RAID Mailing List
Phil Lobbes wrote:
> Here's the requested count:
>
> 5 Reallocated_Sector_Ct 0x0033 251 251 063 Pre-fail Always - 22
I'm sure David can give a better interpretation, but as I read it, the
drive has failed and reallocated 22 sectors and there are 3 more
(Current_Pending_Sector) waiting to be reallocated.
I have seen some drives that have reallocated some sectors in a burst
and then operated for years without losing any more, but these were the
minority.
Usually these counts continue to increase over the short term as the
drive fails. If the data is important, I would replace it.
Regards,
Richard
* RE: RAID1 == two different ARRAY in scan, and Q on read error corrected
From: David Lethe @ 2008-04-19 20:43 UTC
To: Richard Scobie, Linux RAID Mailing List
Yes, that is correct. Statistically speaking, you have a 20% chance that
a disk will fail within 3 months of having this particular error. Since
you already had (22 - 3) reallocated sectors before this happened, the
probability of failure is probably (I don't know off the top of my head)
more like 50%.
Go shopping. Leave now.
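Once the new drive is in, a minimal sketch of bringing it into the mirror
(assuming it shows up as /dev/sdc again; verify the device name before
touching anything):
# sfdisk -d /dev/sdd | sfdisk /dev/sdc     (copy the partition layout from the good disk)
# mdadm /dev/md0 --add /dev/sdc1           (md resyncs the new member from sdd1)
# cat /proc/mdstat                         (follow the rebuild)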