Uncorrectable errors: how do I fix it?

* Uncorrectable errors: how do I fix it?
@ 2008-11-28 18:21 John Robinson
  2008-11-28 21:03 ` Justin Piszcz
  2008-11-28 21:53 ` NeilBrown
  0 siblings, 2 replies; 3+ messages in thread
From: John Robinson @ 2008-11-28 18:21 UTC
  To: Linux RAID

One of the drives in my RAID-5 array is showing uncorrectable errors:
Nov 28 17:52:36 beast smartd[8184]: Device: /dev/sdc, 1 Currently unreadable (pending) sectors
Nov 28 17:52:36 beast smartd[8184]: Device: /dev/sdc, 1 Offline uncorrectable sectors

And it fails a self-test:
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%       931         1953520763

Now that's not good but it's probably not bad enough to get the drive 
replaced. (Opinions?) Anyway, rewriting the sector ought to "cure" it, 
so how do I do that?
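
The LBA in the self-test log is counted from the start of the whole disk, whereas the array member is the partition sdc2, so it may help to check where the sector falls before rewriting anything. A minimal sketch, assuming the partition's start and size (in 512-byte sectors) are read from /sys/block/sdc/sdc2/start and /sys/block/sdc/sdc2/size; the function name lba_in_partition is hypothetical, and the partition geometry is not given anywhere in this thread:

```shell
#!/bin/sh
# Hypothetical helper: map a whole-disk LBA onto a partition.
# Usage: lba_in_partition LBA PART_START PART_SIZE   (all in 512-byte sectors)
# Prints the sector offset inside the partition, or "outside" (exit status 1)
# when the LBA does not belong to the partition.
lba_in_partition() {
    lba=$1
    start=$2
    size=$3
    end=$((start + size))          # first sector after the partition
    if [ "$lba" -ge "$start" ] && [ "$lba" -lt "$end" ]; then
        echo $((lba - start))
    else
        echo "outside"
        return 1
    fi
}

# On the real machine the geometry would come from sysfs, for example:
#   lba_in_partition 1953520763 \
#       "$(cat /sys/block/sdc/sdc2/start)" "$(cat /sys/block/sdc/sdc2/size)"
```

If the resulting offset lies beyond the Used Dev Size shown below (976655360 KiB, which is 1953310720 sectors), the sector is probably outside the area md reads and writes, in which case a repair or rebuild would not touch it.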

Here are the details of my array:
[root@beast md]# mdadm --detail /dev/md1
/dev/md1:
         Version : 00.90.03
   Creation Time : Mon Jul 28 15:49:09 2008
      Raid Level : raid5
      Array Size : 1953310720 (1862.82 GiB 2000.19 GB)
   Used Dev Size : 976655360 (931.41 GiB 1000.10 GB)
    Raid Devices : 3
   Total Devices : 3
Preferred Minor : 1
     Persistence : Superblock is persistent

   Intent Bitmap : Internal

     Update Time : Fri Nov 28 17:56:22 2008
           State : active
  Active Devices : 3
Working Devices : 3
  Failed Devices : 0
   Spare Devices : 0

          Layout : left-symmetric
      Chunk Size : 256K

            UUID : d8c57a89:166ee722:23adec48:1574b5fc
          Events : 0.6112

     Number   Major   Minor   RaidDevice State
        0       8        2        0      active sync   /dev/sda2
        1       8       18        1      active sync   /dev/sdb2
        2       8       34        2      active sync   /dev/sdc2

I tried:
[root@beast md]# mdadm /dev/md1 --fail /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md1
[root@beast md]# mdadm /dev/md1 --remove /dev/sdc2
mdadm: hot removed /dev/sdc2
[root@beast md]# mdadm /dev/md1 --add /dev/sdc2
mdadm: re-added /dev/sdc2

but that finished instantly. I guess it would since the array has a 
write-intent bitmap and it's noticed that sdc2 is being re-added. I 
could tell the system to do a complete resync with:
# echo repair > /sys/block/md1/md/sync_action

but really I want to tell the system to rebuild entirely from sda2 and 
sdb2, onto sdc2. At least I think I do. I've a feeling the answer is to 
zero the superblock, but I'm not confident about doing that because I'm 
not sure if re-adding the thing without a superblock will either work or 
do the Right Thing[tm].

Cheers,

John.


* Re: Uncorrectable errors: how do I fix it?
From: Justin Piszcz @ 2008-11-28 21:03 UTC
  To: John Robinson; +Cc: Linux RAID



On Fri, 28 Nov 2008, John Robinson wrote:

> One of the drives in my RAID-5 array is showing uncorrectable errors:
>
> I tried:
> [root@beast md]# mdadm /dev/md1 --fail /dev/sdc2
> mdadm: set /dev/sdc2 faulty in /dev/md1
> [root@beast md]# mdadm /dev/md1 --remove /dev/sdc2
> mdadm: hot removed /dev/sdc2
> [root@beast md]# mdadm /dev/md1 --add /dev/sdc2
> mdadm: re-added /dev/sdc2
>
> but that finished instantly. I guess it would since the array has a 
> write-intent bitmap and it's noticed that sdc2 is being re-added. I could 
> tell the system to do a complete resync with:
> # echo repair > /sys/block/md1/md/sync_action
>
> but really I want to tell the system to rebuild entirely from sda2 and sdb2, 
> onto sdc2. At least I think I do. I've a feeling the answer is to zero the 
> superblock, but I'm not confident about doing that because I'm not sure if 
> re-adding the thing without a superblock will either work or do the Right 
> Thing[tm].

Before you do this, you should back up your data elsewhere, just in case you
cannot rebuild the array later.

You may be able to fix it using the hdparm --write-sector command if you
know what you are doing. Otherwise, the quickest way is to fail it from
the array as you did before and remove it from the RAID set.

Zero the disk:
dd if=/dev/zero of=/dev/disk

Run badblocks on it a few times, then run the short and long SMART
self-tests when it's all done.

Re-check the SMART statistics; the self-tests should now complete
successfully.

Then copy the partition table over from a good disk:
sfdisk -d /dev/sda | sfdisk /dev/sdc
and pop it back into the array with mdadm /dev/md1 -a /dev/sdc2
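
The whole sequence could be sketched as a dry run along these lines. It only prints the commands, since several of them are destructive, and zeroing the whole disk also wipes every other partition on it. The function name wipe_and_readd_plan, the bs=1M block size, and the badblocks and smartctl options are assumptions chosen for illustration, not something specified above:

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands for the procedure above
# instead of running them, because several of the steps destroy data.
# Usage: wipe_and_readd_plan ARRAY GOOD_DISK BAD_DISK PART_NUMBER
wipe_and_readd_plan() {
    array=$1
    good=$2
    bad=$3
    part=$4
    # One command per line, in the order they would be run. The SMART
    # self-tests run in the background on the drive, so each one has to
    # finish before the next step is started.
    cat <<EOF
mdadm $array --fail $bad$part
mdadm $array --remove $bad$part
dd if=/dev/zero of=$bad bs=1M
badblocks -wsv $bad
smartctl -t short $bad
smartctl -t long $bad
smartctl -a $bad
sfdisk -d $good | sfdisk $bad
mdadm $array -a $bad$part
EOF
}

# Device names taken from the original post.
wipe_and_readd_plan /dev/md1 /dev/sda /dev/sdc 2
```

Printing the plan first makes it easy to double-check the device names before anything is typed for real.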

Justin.


* Re: Uncorrectable errors: how do I fix it?
From: NeilBrown @ 2008-11-28 21:53 UTC
  To: John Robinson; +Cc: Linux RAID

On Sat, November 29, 2008 5:21 am, John Robinson wrote:
> One of the drives in my RAID-5 array is showing uncorrectable errors:
> Nov 28 17:52:36 beast smartd[8184]: Device: /dev/sdc, 1 Currently
> unreadable (pending) sectors
> Nov 28 17:52:36 beast smartd[8184]: Device: /dev/sdc, 1 Offline
> uncorrectable sectors
>
> And it fails a self-test:
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision
> number = 1
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed: read failure       20%       931
>    1953520763
>
> Now that's not good but it's probably not bad enough to get the drive
> replaced. (Opinions?) Anyway, rewriting the sector ought to "cure" it,
> so how do I do that?
..
> I tried:
> [root@beast md]# mdadm /dev/md1 --fail /dev/sdc2
> mdadm: set /dev/sdc2 faulty in /dev/md1
> [root@beast md]# mdadm /dev/md1 --remove /dev/sdc2
> mdadm: hot removed /dev/sdc2
> [root@beast md]# mdadm /dev/md1 --add /dev/sdc2
> mdadm: re-added /dev/sdc2
>
> but that finished instantly. I guess it would since the array has a
> write-intent bitmap and it's noticed that sdc2 is being re-added. I
> could tell the system to do a complete resync with:
> # echo repair > /sys/block/md1/md/sync_action
>
> but really I want to tell the system to rebuild entirely from sda2 and
> sdb2, onto sdc2. At least I think I do. I've a feeling the answer is to
> zero the superblock, but I'm not confident about doing that because I'm
> not sure if re-adding the thing without a superblock will either work or
> do the Right Thing[tm].

I would recommend the "echo repair" approach.  It won't write every block
on sdc, but you don't really need that.
And if you hit a bad block on some other drive, it will cope much better
than removing a drive and adding it back in.
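
A minimal sketch of the repair approach, assuming the array is md1 as in the original post. The function names are made up for illustration; sync_action and mismatch_cnt are the md attributes found under /sys/block/mdX/md. The sysfs directory is a parameter so that the functions can be tried against a scratch directory first:

```shell
#!/bin/sh
# Hypothetical wrappers around the md sysfs interface.
# The directory argument defaults to the array from this thread.

# Start a repair pass: md reads every stripe, rewrites blocks it could not
# read using data from the other members, and corrects parity mismatches.
start_repair() {
    md_dir=${1:-/sys/block/md1/md}
    if [ ! -w "$md_dir/sync_action" ]; then
        echo "cannot write $md_dir/sync_action (not root, or no such array?)" >&2
        return 1
    fi
    echo repair > "$md_dir/sync_action"
}

# Report the current action (it reads "idle" once the pass has finished)
# and the mismatch counter, if that attribute is present.
repair_state() {
    md_dir=${1:-/sys/block/md1/md}
    echo "action=$(cat "$md_dir/sync_action")"
    if [ -r "$md_dir/mismatch_cnt" ]; then
        echo "mismatch_cnt=$(cat "$md_dir/mismatch_cnt")"
    fi
}
```

Running start_repair as root and then checking repair_state (or /proc/mdstat) from time to time shows when the pass is done; the SMART counters on sdc can be re-checked afterwards.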

However if you really want to write all of sdc and you are willing to
risk the possibility of a bad block on sda or sdb, then zeroing the
superblock on sdc before adding it back in will do what you expect.
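
That second option might look roughly like the following dry run, which prints the commands rather than executing them. The function name full_rebuild_plan is hypothetical, and the fail/remove steps are carried over from the original post:

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands for a full rebuild of one
# array member; nothing is executed.
# Usage: full_rebuild_plan ARRAY MEMBER
full_rebuild_plan() {
    array=$1
    member=$2
    # With its old superblock erased, the member is added as a fresh device,
    # so md performs a full recovery instead of a bitmap-based re-add.
    printf '%s\n' \
        "mdadm $array --fail $member" \
        "mdadm $array --remove $member" \
        "mdadm --zero-superblock $member" \
        "mdadm $array --add $member" \
        "cat /proc/mdstat"
}

# Device names taken from the original post.
full_rebuild_plan /dev/md1 /dev/sdc2
```

While the recovery runs the array has no redundancy, which is the risk mentioned above if sda or sdb turns out to have a bad block.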

The suggestion made by Justin of always having backups is, of course,
a good one.


NeilBrown


