linux-raid.vger.kernel.org archive mirror
* RAID1 == two different ARRAY in scan, and Q on read error corrected
@ 2008-04-18 19:35 Phil Lobbes
  2008-04-18 22:02 ` Richard Scobie
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Lobbes @ 2008-04-18 19:35 UTC (permalink / raw)
  To: linux-raid

Hi,

I have been lurking for a little while on the mail list and been doing
some investigation on my own.  I don't mean to impose and hopefully this
is the right forum for these questions.  If anyone has some
suggestions/recommendations/guidance on the following two questions I'm
all ears!

_________________________________________________________________
Q1: RAID1 == two different ARRAY in scan

I recently upgraded my server from Fedora Core 5 to Fedora 8, and along
with that I noticed something that I either overlooked before or that was
perhaps caused during the upgrade.  On that system I have a 300G RAID1
mirror:

  # cat /proc/mdstat
  Personalities : [raid1]
  md0 : active raid1 sdc1[0] sdd1[1]
        293049600 blocks [2/2] [UU]

  unused devices: <none>

When I use mdadm --examine --scan, my 300G RAID1 mirror returns two
separate UUIDs, with different devices listed for each:
* (correct) the full-disk partitions, aka /dev/sd{c,d}1
* (bogus) the entire raw devices, aka /dev/sd{c,d}

  # mdadm --examine --scan --verbose
  ARRAY /dev/md0 level=raid1 num-devices=2 UUID=12c2d7a3:0b791468:9e965247:f4354b36
     devices=/dev/sdd,/dev/sdc
  ARRAY /dev/md0 level=raid1 num-devices=2 UUID=7b879b21:7cc83b9c:765dd3f3:2af46d19
     devices=/dev/sdd1,/dev/sdc1


I didn't find a match in any FAQ or other posting, so I was hoping to
get some insight/pointers here.

Should I:
a. Ignore this?

b. Zero out the superblock on sd{c,d}?  I'm no expert here, so I'm not
   positive this is a good option.  My theory is that the superblock for
   sdc must be separate from the superblock for sdc1, so if that is
   correct the "fix" might be something like:

   # mdadm --zero-superblock /dev/sdc /dev/sdd

   Is this correct and safe?  No worries about it somehow impacting
   /dev/sdc1, /dev/sdd1, and the good mirror, right?  (A quick check I
   could run first is sketched just after this list.)

c. Something else altogether?
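
(The check I have in mind, and this is just my own sketch rather than
anything I've verified, is to compare the superblocks that mdadm sees on
the whole device vs. the partition before zeroing anything:

  # mdadm --examine /dev/sdc  | grep UUID
  # mdadm --examine /dev/sdc1 | grep UUID

If the whole-device superblock carries the "bogus" UUID and the partition
carries the UUID that /proc/mdstat is actually using, then zeroing only
/dev/sdc and /dev/sdd ought to leave the live mirror untouched.)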

For what it's worth, I suppose there is a chance I may have caused this
by trying to 'rename' the md device used by the array (/dev/md0 =>
/dev/md3).

-----------------------------------------------------------------
* Disk/Partition info:

NOTE: The valid mirror uses the partitions /dev/sd{c,d}1 (not the whole
devices /dev/sd{c,d})

# fdisk -l /dev/sdc /dev/sdd

Disk /dev/sdc: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       36483   293049666   fd  Linux raid autodetect

Disk /dev/sdd: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1       36483   293049666   fd  Linux raid autodetect


_________________________________________________________________
Q2: On "read error corrected" messages

On an unrelated note, during/after the upgrade I noticed that I'm now
seeing a few of these events logged:

Apr 15 11:07:14  kernel: raid1: sdc1: rescheduling sector 517365296
Apr 15 11:07:54  kernel: raid1:md0: read error corrected (8 sectors at 517365296 on sdc1)
Apr 15 11:07:54  kernel: raid1: sdc1: redirecting sector 517365296 to another mirror
Apr 15 11:08:32  kernel: raid1: sdc1: rescheduling sector 517365472
Apr 15 11:09:09  kernel: raid1:md0: read error corrected (8 sectors at 517365472 on sdc1)
Apr 15 11:09:09  kernel: raid1: sdc1: redirecting sector 517365472 to another mirror


And also more of these:

Apr 18 14:01:45  smartd[2104]: Device: /dev/sdc, 3 Currently unreadable (pending) sectors
Apr 18 14:01:45  smartd[2104]: Device: /dev/sdc, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 240 to 241
Apr 18 14:01:45  smartd[2104]: Device: /dev/sdd, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 238 to 239


Here's some info from smartctl:

# smartctl -a /dev/sdc
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax 10 family (ATA/133 and SATA/150)
Device Model:     Maxtor 6B300S0
Serial Number:    B60370HH
Firmware Version: BANC1980
User Capacity:    300,090,728,448 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Fri Apr 18 15:09:02 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...

SMART Error Log Version: 1
ATA Error Count: 36 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 36 occurred at disk power-on lifetime: 27108 hours (1129 days + 12 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  5e 00 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 00 00 00 00 a0 00  18d+12:45:51.593  NOP [Abort queued commands]
  00 00 08 1f 5f d6 e0 00  18d+12:45:48.339  NOP [Abort queued commands]
  00 00 00 00 00 00 e0 00  18d+12:45:48.338  NOP [Abort queued commands]
  00 00 00 00 00 00 a0 00  18d+12:45:48.335  NOP [Abort queued commands]
  00 03 46 00 00 00 a0 00  18d+12:45:48.332  NOP [Reserved subcommand]


Luckily, I'm not an expert on hard drives (nor their failures), but I'm
hoping that somebody might be able to give me some insight on any of
this: should I be concerned, or should I just consider these unreadable
sectors "normal" in the life of the drive?

Sincerely,
Phil


* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
  2008-04-18 19:35 RAID1 == two different ARRAY in scan, and Q on read error corrected Phil Lobbes
@ 2008-04-18 22:02 ` Richard Scobie
  2008-04-18 23:49   ` David Lethe
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Scobie @ 2008-04-18 22:02 UTC (permalink / raw)
  To: Linux RAID Mailing List

Phil Lobbes wrote:
____________________________________________________________

> 
> Apr 15 11:07:14  kernel: raid1: sdc1: rescheduling sector 517365296
> Apr 15 11:07:54  kernel: raid1:md0: read error corrected (8 sectors at 517365296 on sdc1)
> Apr 15 11:07:54  kernel: raid1: sdc1: redirecting sector 517365296 to another mirror
> Apr 15 11:08:32  kernel: raid1: sdc1: rescheduling sector 517365472
> Apr 15 11:09:09  kernel: raid1:md0: read error corrected (8 sectors at 517365472 on sdc1)
> Apr 15 11:09:09  kernel: raid1: sdc1: redirecting sector 517365472 to another mirror

These entries,

> Apr 18 14:01:45  smartd[2104]: Device: /dev/sdc, 3 Currently unreadable (pending) sectors

and this, indicate that sdc is losing sectors, so you probably want a 
backup of the array.

Depending on how important the array is, you could fail and remove sdc
from the array, run dd if=/dev/zero of=/dev/sdc bs=1M, and re-add it.
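
Roughly, and treat this only as a sketch (note that the array member is
actually the partition sdc1, so zeroing the whole disk also wipes the
partition table and you would need to recreate it before re-adding):

# mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
# dd if=/dev/zero of=/dev/sdc bs=1M
# fdisk /dev/sdc                    # recreate the single type fd partition
# mdadm /dev/md0 --add /dev/sdc1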

It may then be fine for some time, but if it continues to gather pending 
sectors in the short term, it is probably dying.

Otherwise just replace it with a new one.

Regards,

Richard


* RE: RAID1 == two different ARRAY in scan, and Q on read error corrected
  2008-04-18 22:02 ` Richard Scobie
@ 2008-04-18 23:49   ` David Lethe
  2008-04-19  3:15     ` Richard Scobie
  0 siblings, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-04-18 23:49 UTC (permalink / raw)
  To: Richard Scobie, Linux RAID Mailing List

You can't assume a disk is losing sectors and failing without running
some diagnostics.  If you had an improper shutdown (i.e., power loss,
not a crash) while the disks were writing, then you can get ECC errors.
That does not indicate the disk is bad.

Of course, you must always have a backup; even if both drives are
perfectly fine, RAID1 doesn't protect you from typing rm -rf * tmp
instead of rm -rf *.tmp

I strongly advise against using Richard's dd if=/dev/zero suggestion.
It puts you at risk because you then have only one online copy of the
data, unless you have a current backup and can easily do a bare-metal
restore.  Not worth the risk, if you ask me.

Enter dd if=/dev/md0 of=/dev/null instead; any unreadable block it hits
will be rewritten from the other mirror.  You do this with both disks
online in RAID1.  Furthermore, you get a report (in the kernel log) of
which blocks were bad.  There is likely also an mdadm rescan or rebuild
operation, but you'd have to look up the syntax; either would be
preferable to using the dd command.

Neither technique will technically check every physical block on both
disks, but they will take considerably less clock time and will protect
your data.

Warning: a block-level dd read of every block in md0 will not
necessarily trigger those corrections on all kernels.  You may have to
do something to temporarily disable caching; I don't know.
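
For what it's worth, reasonably recent md kernels expose a scrub directly
through sysfs, which sidesteps the caching question; assuming your array
really is md0 and your kernel has the sync_action interface, something
like this reads and compares both mirrors:

# echo check > /sys/block/md0/md/sync_action
# cat /proc/mdstat                        # watch the progress
# cat /sys/block/md0/md/mismatch_cnt      # mismatches found by the check

Writing "repair" instead of "check" will also rewrite inconsistent
blocks, but check your kernel's md documentation before relying on it.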

Good luck - 

David @ santools ^com
http://www.santools.com/smart/unix/manual



(Use a smaller blocksize for dd if you want to pin down the bad blocks
more precisely.)


* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
  2008-04-18 23:49   ` David Lethe
@ 2008-04-19  3:15     ` Richard Scobie
  2008-04-19 17:26       ` Phil Lobbes
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Scobie @ 2008-04-19  3:15 UTC (permalink / raw)
  To: Linux RAID Mailing List

David Lethe wrote:
> You can't assume a disk is losing sectors and failing without running
> some diagnostics. If you had an improper shutdown (i.e, power loss, not
> a crash), and disks were writing, then you can get ECC errors.  That
> does not indicate the disk is bad.

The error was:

Apr 18 14:01:45  smartd[2104]: Device: /dev/sdc, 3 Currently unreadable (pending) sectors

These are not ECC errors and the definition I have for pending sector 
errors is, "Current count of unstable sectors (waiting for remapping)."

It would be useful to see the Reallocated Sectors Count from the 
smartctl output.
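
Something along these lines should pull out the interesting attributes
(assuming a smartctl 5.x that supports -A):

# smartctl -A /dev/sdc | egrep 'Reallocated|Pending|Offline_Unc'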

Regards,

Richard


* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
  2008-04-19  3:15     ` Richard Scobie
@ 2008-04-19 17:26       ` Phil Lobbes
  2008-04-19 19:58         ` Richard Scobie
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Lobbes @ 2008-04-19 17:26 UTC (permalink / raw)
  To: Linux RAID Mailing List

> The error was:
> 
> Apr 18 14:01:45  smartd[2104]: Device: /dev/sdc, 3 Currently
> unreadable (pending) sectors
> 
> These are not ECC errors and the definition I have for pending sector
> errors is, "Current count of unstable sectors (waiting for
> remapping)."
>
> It would be useful to see the Reallocated Sectors Count from the
> smartctl output.

Here's the requested count:

  5 Reallocated_Sector_Ct   0x0033   251   251   063    Pre-fail  Always       -       22

And the full set in case it helps give a clearer picture...

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   186   181   063    Pre-fail  Always       -       24376
  4 Start_Stop_Count        0x0032   252   252   000    Old_age   Always       -       1988
  5 Reallocated_Sector_Ct   0x0033   251   251   063    Pre-fail  Always       -       22
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   240   224   187    Pre-fail  Always       -       64525
  9 Power_On_Minutes        0x0032   170   170   000    Old_age   Always       -       614h+37m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   001   253   000    Old_age   Always       -       53
195 Hardware_ECC_Recovered  0x000a   252   204   000    Old_age   Always       -       39547
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       3
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       0
202 TA_Increase_Count       0x000a   253   001   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   076   180    Pre-fail  Always   In_the_past 0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   239   239   000    Old_age   Offline      -       169
210 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
211 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
212 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0


* Re: RAID1 == two different ARRAY in scan, and Q on read error corrected
  2008-04-19 17:26       ` Phil Lobbes
@ 2008-04-19 19:58         ` Richard Scobie
  2008-04-19 20:43           ` David Lethe
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Scobie @ 2008-04-19 19:58 UTC (permalink / raw)
  To: Linux RAID Mailing List

Phil Lobbes wrote:

> Here's the requested count:
> 
>   5 Reallocated_Sector_Ct   0x0033   251   251   063    Pre-fail  Always       -       22

I'm sure David can give a better interpretation, but as I read it, the 
drive has failed and reallocated 22 sectors and there are 3 more 
(Current_Pending_Sector) waiting to be reallocated.

I have seen some drives that have reallocated some sectors in a burst 
and then operated for years without losing any more, but these were the 
minority.

Usually these counts continue to increase over the short term as the 
drive fails. If the data is important, I would replace it.
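
If you want more evidence before spending the money, a long SMART
self-test will read every sector and log the first failing LBA; roughly
(the test runs inside the drive and can take a couple of hours on a
300GB disk):

# smartctl -t long /dev/sdc
# smartctl -l selftest /dev/sdc     # check the result once it finishes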

Regards,

Richard


* RE: RAID1 == two different ARRAY in scan, and Q on read error corrected
  2008-04-19 19:58         ` Richard Scobie
@ 2008-04-19 20:43           ` David Lethe
  0 siblings, 0 replies; 7+ messages in thread
From: David Lethe @ 2008-04-19 20:43 UTC (permalink / raw)
  To: Richard Scobie, Linux RAID Mailing List

Yes, that is correct.  Statistically speaking, you have a 20% chance
that a disk will fail within 3 months of having this particular error.
Since you already had (22 - 3) reallocated sectors before this happened,
the probability of failure is probably (I don't know off the top of my
head) more like 50%.

Go shopping. Leave now.
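
When the new drive arrives, the swap is the usual RAID1 procedure;
roughly, and assuming the replacement comes up as sdc again and gets an
identical partition layout copied from the good disk:

# mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
(power down, swap the drive, boot)
# sfdisk -d /dev/sdd | sfdisk /dev/sdc    # copy partition table from sdd
# mdadm /dev/md0 --add /dev/sdc1
# cat /proc/mdstat                        # watch the resync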

