* Broken harddisk
From: T. Ermlich
Date: 2005-01-29 0:22 UTC
To: linux-raid

Hello there,

I just got here from http://cgi.cse.unsw.edu.au/~neilb/Contact ...
Hopefully I'm more or less in the right place.

Several months ago I set up a RAID1 using mdadm. Two drives (/dev/sda
and /dev/sdb, each a 160GB Samsung SATA disk) are used, and now provide
/dev/md0, /dev/md1, /dev/md2 and /dev/md3. In November 2004 I upgraded
to mdadm 1.8.1.

This afternoon, about 9 hours ago, /dev/sda broke down ... no chance to
get it working again .. :(

My question now is: what do I have to do now? The system is up and
running, so I'll make a fresh backup of the most important data ... but
how do I 'replace' the broken drive and 'restore' the data onto it
(sorry, as English is not my native language I have no idea how to
explain it correctly)? Is there a way to do so, or do I have to create a
RAID1 from scratch and copy all data from /dev/md0-3 there manually?

Thanks in advance
Torsten
* Re: Broken harddisk
From: Gordon Henderson
Date: 2005-01-29 12:46 UTC
To: T. Ermlich; Cc: linux-raid

On Sat, 29 Jan 2005, T. Ermlich wrote:

> Several months ago I set up a RAID1 using mdadm. Two drives (/dev/sda
> and /dev/sdb, each a 160GB Samsung SATA disk) are used, and now
> provide /dev/md0, /dev/md1, /dev/md2 and /dev/md3. In November 2004 I
> upgraded to mdadm 1.8.1.

Drop 1.8.1 and get 1.8.0. I understand 1.8.1 has some experimental code
and is not designed to be used for real.

> This afternoon, about 9 hours ago, /dev/sda broke down ... no chance
> to get it working again .. :(
>
> My question now is: what do I have to do now?

Well, go through the procedure to remove the disk and put a new one back
in...

> The system is up and running, so I'll make a fresh backup of the most
> important data ... but how do I 'replace' the broken drive and
> 'restore' the data onto it? Is there a way to do so, or do I have to
> create a RAID1 from scratch and copy all data from /dev/md0-3 there
> manually?

You should not have to copy it - that's the whole point of it all.
However, RAID is not a substitute for proper backups, so make sure you
do those backups now and regularly in the future.

OK - here are the basic steps - you may have to modify them, as you
haven't posted enough detail for me to work out your exact system.

I'm assuming that you have partitioned each disk with 4 partitions, that
both disks are partitioned identically, and that you are combining the
same partition of each device into the md devices (e.g. /dev/md0 is made
from /dev/sda1 and /dev/sdb1). This is reasonably "sane" and I'm sure
lots of people do it this way (I do, but I'm a small sample :). If you
aren't doing it this way, then this won't work for you, but you may be
able to adapt it for your needs.

Firstly, get mdadm 1.8.0 as I mentioned above.

Look at /proc/mdstat and see if all 4 md devices have a failed device in
them. If the disk is really dead, this is likely to be the case; if it's
not, then you'll need to fail each partition in each md device. So, to
make sure that each md device has the failed disk really failed, you can
do:

  mdadm --fail /dev/md0 /dev/sda1
  mdadm --fail /dev/md1 /dev/sda2
  mdadm --fail /dev/md2 /dev/sda3
  mdadm --fail /dev/md3 /dev/sda4

Next, you need to remove the failed disk from each array:

  mdadm --remove /dev/md0 /dev/sda1
  mdadm --remove /dev/md1 /dev/sda2
  mdadm --remove /dev/md2 /dev/sda3
  mdadm --remove /dev/md3 /dev/sda4

Strictly speaking, you don't have to do this - you can just power down
and put a new disk in - but I feel this is "cleaner" and hopefully
leaves the system in a stable and known state when you do power down.

At this point you can power down the machine, physically remove the
drive and replace it with a new, identical unit.

Reboot your PC. If it would normally boot off sda, you have to persuade
it to boot off sdb. You might need to alter the BIOS to do this, or
maybe not... All BIOSes and controllers have their own little ideas
about how this is done.

If it boots off another drive (e.g. an IDE drive) then you should be
fine. If it does boot off sda, then I hope you used the raid-extra-boot
option in lilo.conf (and tested it...). If you are using grub, I can't
be of any assistance there as I don't use it.

You should now have the system running with the data intact on sdb and
all the md devices working and mounted as normal.

Now you have to re-partition the new sda identically to sdb. If they are
the same make and size, you can use this:

  sfdisk -d /dev/sdb | sfdisk /dev/sda

Now, tell the RAID code to re-mirror the drives:

  mdadm --add /dev/md0 /dev/sda1
  mdadm --add /dev/md1 /dev/sda2
  mdadm --add /dev/md2 /dev/sda3
  mdadm --add /dev/md3 /dev/sda4

then run:

  watch -n1 cat /proc/mdstat

and wait for it to finish; the system is fully usable all during this
process.

If you can't power the machine down, and have hot-swappable drives in
proper caddies, then there is a way to tell the kernel that you are
removing the drive and adding a new one, but it's probably safer if you
can do it while powered down.

If this doesn't make sense, post back the output of /proc/mdstat and
fdisk -l.

Good luck!

Gordon
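A minimal sketch tying the steps above into one sequence. It assumes
Gordon's example layout (sda1..sda4 backing md0..md3); Torsten's real
system turns out to use sda5..sda8, so the PARTS list is an assumption
to adjust before running anything:

  #!/bin/sh
  # Fail and detach every slice of the dead disk (sda), assuming
  # partition N backs /dev/md(N-1) as in the example above.
  PARTS="1 2 3 4"
  for n in $PARTS; do
      md=/dev/md$((n - 1))
      mdadm --fail   $md /dev/sda$n
      mdadm --remove $md /dev/sda$n
  done
  # ... power down, swap in the new disk, boot from the surviving sdb ...
  sfdisk -d /dev/sdb | sfdisk /dev/sda   # clone sdb's partition table
  for n in $PARTS; do
      mdadm --add /dev/md$((n - 1)) /dev/sda$n   # kick off each resync
  done
  watch -n1 cat /proc/mdstat             # watch the rebuild progress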
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 15:34 UTC
To: Gordon Henderson; Cc: linux-raid

Hi,

I'd like to say thanks to everyone who has replied so far! :-)

Gordon Henderson scribbled on 29.01.2005 13:46:

> Drop 1.8.1 and get 1.8.0. I understand 1.8.1 has some experimental
> code and is not designed to be used for real.
>
> Well, go through the procedure to remove the disk and put a new one
> back in...

OK ... as the broken disk stopped the system, and the system hung
during the boot procedure, I had to remove it (disconnected the cables).

> You should not have to copy it - that's the whole point of it all.
> However, RAID is not a substitute for proper backups, so make sure
> you do those backups now and regularly in the future.

Backups are done every night (3 am), so I just made a backup of the
latest changes (between ~3 am and 15:30).

> I'm assuming that you have partitioned each disk with 4 partitions,
> that both disks are partitioned identically, and that you are
> combining the same partition of each device into the md devices.

That's right: each hard disk is partitioned absolutely identically,
like:

      0 - 19456 - /dev/sda1 - extended partition
      1 -  6528 - /dev/sda5 - /dev/md0
   6529 -  9138 - /dev/sda6 - /dev/md1
   9139 - 16970 - /dev/sda7 - /dev/md2
  16971 - 19456 - /dev/sda8 - /dev/md3

And after doing that partitioning I 'combined' them to act as RAID1.

> So, to make sure that each md device has the failed disk really
> failed, you can do:
>
>   mdadm --fail /dev/md0 /dev/sda1
>   [...]
>
> Next, you need to remove the failed disk from each array:
>
>   mdadm --remove /dev/md0 /dev/sda1
>   [...]
>
> Strictly speaking, you don't have to do this - you can just power
> down and put a new disk in - but I feel this is "cleaner" and
> hopefully leaves the system in a stable and known state when you do
> power down.

Haven't done that, because the system was already down ...

> At this point you can power down the machine, physically remove the
> drive and replace it with a new, identical unit.

So I did: replaced the broken one (Samsung SP1614C) with an identical
drive.

> Reboot your PC. If it would normally boot off sda, you have to
> persuade it to boot off sdb. [...] If it boots off another drive
> (e.g. an IDE drive) then you should be fine.

I have two additional IDE drives in that system. /dev/hda contains some
data and is the boot drive; /dev/hdb contains some less important data.

> Now you have to re-partition the new sda identically to sdb. If they
> are the same make and size, you can use this:
>
>   sfdisk -d /dev/sdb | sfdisk /dev/sda

This didn't work properly, so I partitioned the new drive manually.

> Now, tell the RAID code to re-mirror the drives:
>
>   mdadm --add /dev/md0 /dev/sda1
>   mdadm --add /dev/md1 /dev/sda2
>   mdadm --add /dev/md2 /dev/sda3
>   mdadm --add /dev/md3 /dev/sda4

Now some new trouble starts ...?

'mdadm --add /dev/md0 /dev/sda1' started just fine - but exactly at 50%
it started giving tons of errors, like:

  Jan 29 16:10:24 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 16:10:24 suse92 kernel: end_request: I/O error, dev sdb, sector 52460420
  Jan 29 16:10:25 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
  Jan 29 16:10:25 suse92 kernel: ata2: error=0x40 { UncorrectableError }
  Jan 29 16:10:25 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b 85 00 02 f9 00
  Jan 29 16:10:25 suse92 kernel: Current sdb: sense key Medium Error
  Jan 29 16:10:25 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 16:10:25 suse92 kernel: end_request: I/O error, dev sdb, sector 52460421
  [... the same pattern repeats for sectors 52460422, 52460423, ...]

> then run:
>
>   watch -n1 cat /proc/mdstat
>
> and wait for it to finish; the system is fully usable all during this
> process.

  Every 1.0s: cat /proc/mdstat                Sat Jan 29 16:08:50 2005

  Personalities : [raid1]
  md3 : active raid1 sdb8[1]
        19960640 blocks [2/1] [_U]

  md2 : active raid1 sdb7[1]
        62910400 blocks [2/1] [_U]

  md1 : active raid1 sdb6[1]
        20964672 blocks [2/1] [_U]

  md0 : active raid1 sdb5[1] sda5[2]
        52436032 blocks [2/1] [_U]
        [==========>..........]  recovery = 50.0% (26230016/52436032) finish=121.7min speed=1050K/sec

  unused devices: <none>

Can I stop that process for /dev/md0 and start with /dev/md1 (just to
compare whether it's a problem with that partition only or a general
problem, i.e. whether the second drive has problems, too)?

btw: does mdadm also format the partitions?

Have a nice day
Torsten
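The kernel log above names absolute sector numbers on sdb. One way to
see which partition (and hence which md device) a logged bad sector
falls in is to list the partition table in sectors and compare by hand -
a sketch:

  # Print partition start/end in 512-byte sectors:
  sfdisk -l -uS /dev/sdb
  # "sector 52460420" in the log is counted from the start of the disk;
  # find the partition whose [start, end] range contains it. The offset
  # inside that partition is 52460420 minus the partition's start sector.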
* Re: Broken harddisk
From: Gordon Henderson
Date: 2005-01-29 15:56 UTC
To: T. Ermlich; Cc: linux-raid

On Sat, 29 Jan 2005, T. Ermlich wrote:

> That's right: each hard disk is partitioned absolutely identically,
> like:
>       0 - 19456 - /dev/sda1 - extended partition
>       1 -  6528 - /dev/sda5 - /dev/md0
>    6529 -  9138 - /dev/sda6 - /dev/md1
>    9139 - 16970 - /dev/sda7 - /dev/md2
>   16971 - 19456 - /dev/sda8 - /dev/md3
> And after doing that partitioning I 'combined' them to act as RAID1.
>
> I have two additional IDE drives in that system. /dev/hda contains
> some data and is the boot drive; /dev/hdb contains some less
> important data.

Just as a point of note - if the boot disk goes down it will be harder
to recover the data... Consider making the boot disk mirrored too!

> > mdadm --add /dev/md0 /dev/sda1
> > [...]
>
> Now some new trouble starts ...?
> 'mdadm --add /dev/md0 /dev/sda1' started just fine - but exactly at
> 50% it started giving tons of errors, like:

You should be using:

  mdadm --add /dev/md0 /dev/sda5

> Jan 29 16:10:24 suse92 kernel: Additional sense: Unrecovered read
> error - auto reallocate failed
> Jan 29 16:10:24 suse92 kernel: end_request: I/O error, dev sdb,
> sector 52460420

That is a read error from /dev/sdb. What it's saying is that sdb has
bad sectors which can't be recovered.

You have 2 bad drives in a RAID-1 - and that's really bad )-:

> Can I stop that process for /dev/md0 and start with /dev/md1 (just to
> compare whether it's a problem with that partition only or a general
> problem, i.e. whether the second drive has problems, too)?

Yes - just fail and remove the drive partition:

  mdadm --fail /dev/md0 /dev/sda5
  mdadm --remove /dev/md0 /dev/sda5

At this point, I'd run badblocks on the other partitions before doing
the resync:

  badblocks -s -c 256 /dev/sdb6
  badblocks -s -c 256 /dev/sdb7
  badblocks -s -c 256 /dev/sdb8

If these pass, you can do the hot-add; however, it looks like the sdb
disk is also faulty.

At this point, I'd be looking to replace both disks and restore from
backup. But if you can re-sync the other 3 partitions, then remove the
also-faulty sdb, replace it with a new one, re-sync the 3 good
partitions, and you only have to restore the '5' partition (md0) from
backup.

You could try mkfs'ing the new partition sda5, mounting it, and copying
the data on md0 over to it - there's a chance the bad sectors on sdb
lie outside the filing system... This would save you having to restore
from backup; however, it then becomes trickier, as you then have to
re-create the RAID set on a new disk with a missing drive and copy it
again.

> btw: does mdadm also format the partitions?

No... You don't need to format/mkfs the partitions, as the RAID resync
will take care of making it a mirror of the existing working disk.

Gordon
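A sketch of the copy-it-over idea Gordon describes, assuming md0
carries an ext3 filesystem mounted at /mnt/md0data (both the filesystem
type and the mount point are placeholders, not from the thread):

  mkfs.ext3 /dev/sda5                 # fresh filesystem on the new partition
  mkdir -p /mnt/rescue
  mount /dev/sda5 /mnt/rescue
  cp -a /mnt/md0data/. /mnt/rescue/   # copy, preserving owners/permissions
  umount /mnt/rescue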
* RE: Broken harddisk
From: Guy
Date: 2005-01-29 16:19 UTC
To: 'Gordon Henderson', 'T. Ermlich'; Cc: linux-raid

For future reference:

Everyone should do a nightly disk test to prevent bad blocks from
hiding undetected. smartd, badblocks or dd can be used. Example:

  dd if=/dev/sda of=/dev/null bs=64k

Just create a nice little script that emails you the output. Put this
script in a nightly cron job to run while the system is idle.

Guy
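A sketch of the kind of script Guy means; the disk list, the script
path in the crontab line, and the mail recipient are placeholders:

  #!/bin/sh
  # Read every sector of each disk; dd reports errors and throughput on
  # stderr, so 2>&1 folds those into the mailed output.
  DISKS="/dev/sda /dev/sdb"
  for d in $DISKS; do
      echo "=== $d ==="
      dd if=$d of=/dev/null bs=64k 2>&1
  done | mail -s "nightly disk read test" root
  # Example crontab entry, running at 3:30 am while the box is idle:
  #   30 3 * * * /usr/local/sbin/disktest.sh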
* Re: Broken harddisk
From: Mike Hardy
Date: 2005-01-29 18:31 UTC
To: Guy; Cc: 'Gordon Henderson', 'T. Ermlich', linux-raid

Guy wrote:
> For future reference:
>
> Everyone should do a nightly disk test to prevent bad blocks from
> hiding undetected. smartd, badblocks or dd can be used. Example:
>   dd if=/dev/sda of=/dev/null bs=64k
>
> Just create a nice little script that emails you the output. Put this
> script in a nightly cron job to run while the system is idle.

While I agree with your purpose 100%, Guy, I respectfully disagree with
the method. If at all possible, you should use tools that access the
SMART capabilities of the device, so that you get more than a read
test - you also get statistics on the various other health parameters
the drive checks, some of which can serve as fair warning of impending
death before you get bad blocks.

http://smartmontools.sf.net is the source for fresh packages, and
smartd can be set up with a config file to do tests on any schedule you
like, emailing you urgent results as it gets them, or just putting
information of general interest in the logs that Logwatch picks up.

If your drives don't talk SMART (older ones don't, and it doesn't work
through all interfaces either), then by all means take Guy's advice - a
'dd' test is certainly valuable. But if they do talk SMART, I think
it's better.

-Mike
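A sketch of what such a smartd configuration could look like, going by
the smartd.conf directives documented by smartmontools (the self-test
schedule, the -d ata note and the mail address are assumptions; check
smartd.conf(5) for your version):

  # /etc/smartd.conf -- monitor both halves of the mirror.
  # -a      monitor all SMART attributes and overall health status
  # -o on   enable the drive's automatic offline testing
  # -S on   enable attribute autosave
  # -s ...  short self-test daily at 2am, long self-test Saturdays at 3am
  # -m ...  mail this address when something looks bad
  /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost
  /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost
  # On SATA disks driven by libata you may also need "-d ata".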
* Re: Broken harddisk
From: berk walker
Date: 2005-01-29 23:30 UTC
To: Mike Hardy, Guy; Cc: 'Gordon Henderson', 'T. Ermlich', linux-raid

I think it might be a good idea to check memory and power supply too. I
have had several motherboards where the IDE channels went bad.

I have become a believer in not using exactly the same drives in an
array, because of today's quality control in manufacturing (not design
nor testing). Clones may have very similar degradation rates -
suddenly, it seems, dying together. It is not possible to get the
manufacturer's name and lot for the platters, but _maybe_ buying
similar drives from different manufacturers might cut down the multiple
failure rates.

RAID1 your boot disk. Another box sometimes helps as a control for
checking hardware.

Think about spending the extra bucks for another RAID box, and have
them rsync'd. You _can_ have a stable, automatic, online backup of all
of your data (don't forget the UPSes). If the house burns down, it will
be trivial that you lost data - or you could share mirroring with a
remotely located friend, so if one house burnt to the ground, maybe the
other still has the data.

Doing the continuous circle of backups, level 1 through 9, by hand
feeding tapes just sucks, and might not even work on restore. I knew a
business that backed up every day; the system fried - and the tapes
could not be read.

Sorry - too much from me (just thinking that it's too damned bad that
we can't go to the local whatever store and walk out with SCSI stuff).

b-

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
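A sketch of the rsync'd second box berk mentions; the hostname and
paths are invented for illustration:

  # Nightly one-way mirror of /data to a second machine over ssh.
  # --delete makes the copy track removals too, so keep real backups
  # as well -- a deleted or corrupted file gets mirrored faithfully.
  rsync -a --delete /data/ backupbox:/backup/data/
  # crontab entry:
  #   0 4 * * * rsync -a --delete /data/ backupbox:/backup/data/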
* Re: Broken harddisk
From: Robin Bowes
Date: 2005-02-01 15:02 UTC
To: linux-raid

Mike Hardy wrote:
> If your drives don't talk SMART (older ones don't, and it doesn't
> work through all interfaces either), then by all means take Guy's
> advice - a 'dd' test is certainly valuable. But if they do talk
> SMART, I think it's better.

True enough. However, until SMART support makes it into the Linux SATA
drivers, I'm pretty much stuck with dd!

R.
--
http://robinbowes.com
* Re: Broken harddisk
From: Luca Berra
Date: 2005-02-01 22:57 UTC
To: linux-raid

On Tue, Feb 01, 2005 at 03:02:54PM +0000, Robin Bowes wrote:
> True enough. However, until SMART support makes it into the Linux
> SATA drivers, I'm pretty much stuck with dd!

http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
* Re: Broken harddisk
From: Robin Bowes
Date: 2005-02-01 23:24 UTC
To: linux-raid

Luca Berra wrote:
> On Tue, Feb 01, 2005 at 03:02:54PM +0000, Robin Bowes wrote:
>> True enough. However, until SMART support makes it into the Linux
>> SATA drivers, I'm pretty much stuck with dd!
>
> http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/

I avoid patching kernels, preferring to use the stock Fedora releases.
Perhaps I should re-phrase the above statement to read "until the
Fedora Core 3 kernel includes SMART support in libata ..." :)

Actually, I'm running 2.6.10 - how can I tell if SMART support is
included in libata?

R.
--
http://robinbowes.com
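One empirical way to probe this, assuming smartmontools is installed:
force ATA passthrough and see whether the drive answers.

  # Ask for the drive's identity through the SCSI/libata layer:
  smartctl -i -d ata /dev/sda
  # If libata passes ATA commands through, this prints the model,
  # serial number and SMART support status; if not, it fails with an
  # error, and a newer kernel (or smartmontools) is needed.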
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 16:47 UTC
To: Gordon Henderson; Cc: linux-raid

Hi again,

well, due to those really handy hints I subscribed to the list ... ;)

Gordon Henderson scribbled on 29.01.2005 16:56:

> Just as a point of note - if the boot disk goes down it will be
> harder to recover the data... Consider making the boot disk mirrored
> too!

Yeah .. I thought about that in the past ... and decided to buy a 3ware
controller (9500S-4LP) for those things in ~2-3 months (as I don't have
the money yet). Currently I'm using the onboard SATA controller (Asus
A7V8X with a Promise controller).

> You should be using:
>
>   mdadm --add /dev/md0 /dev/sda5

Yes, I did - I just made a mistake when writing the command above.

> That is a read error from /dev/sdb. What it's saying is that sdb has
> bad sectors which can't be recovered.
>
> You have 2 bad drives in a RAID-1 - and that's really bad )-:

All I have ... better than nothing ... will be improved in the future ;)

> Yes - just fail and remove the drive partition:
>
>   mdadm --fail /dev/md0 /dev/sda5
>   mdadm --remove /dev/md0 /dev/sda5
>
> At this point, I'd run badblocks on the other partitions before doing
> the resync:
>
>   badblocks -s -c 256 /dev/sdb6
>   badblocks -s -c 256 /dev/sdb7
>   badblocks -s -c 256 /dev/sdb8
>
> If these pass, you can do the hot-add; however, it looks like the sdb
> disk is also faulty.
>
> At this point, I'd be looking to replace both disks and restore from
> backup. But if you can re-sync the other 3 partitions, then remove
> the also-faulty sdb, replace it with a new one, re-sync the 3 good
> partitions, and you only have to restore the '5' partition (md0) from
> backup.
>
> You could try mkfs'ing the new partition sda5, mounting it, and
> copying the data on md0 over to it - there's a chance the bad sectors
> on sdb lie outside the filing system... This would save you having to
> restore from backup; however, it then becomes trickier, as you then
> have to re-create the RAID set on a new disk with a missing drive and
> copy it again.

OK, I'll do that. I attached an older 80GB hard disk (/dev/hdc), and
right now I'm copying the content of /dev/md0 there, using 'cp -a'.
Once that's finished I'll start checking for bad blocks ... and I guess
the backups I made in the past might be full of possibly damaged
data ... :-(

Should I delete /dev/md0 completely after the copy process has
finished? Or just check it for bad blocks and continue using it?

> > btw: does mdadm also format the partitions?
>
> No... You don't need to format/mkfs the partitions, as the RAID
> resync will take care of making it a mirror of the existing working
> disk.

Ah .. OK. :-)

Thanks a lot!!
Torsten
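Before trusting the rescue copy, it may be worth comparing it against
the original; a sketch, assuming md0 is mounted at /mnt/md0data and the
copy lives under /mnt/hdc1/md0 (both paths are placeholders):

  # Checksum-compare without changing anything (-n = dry run,
  # -c = compare by checksum); files that differ, or that hit a bad
  # sector while being read back, get listed.
  rsync -anc --delete /mnt/md0data/ /mnt/hdc1/md0/
  # A plain recursive diff is an alternative:
  diff -r /mnt/md0data /mnt/hdc1/md0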
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 18:18 UTC
To: Gordon Henderson; Cc: linux-raid

Some more info.

T. Ermlich scribbled on 29.01.2005 17:47:
[...]
> > badblocks -s -c 256 /dev/sdb6
> > badblocks -s -c 256 /dev/sdb7
> > badblocks -s -c 256 /dev/sdb8

  suse92:/ # badblocks -s -c 256 /dev/sda6
  Checking for bad blocks (read-only mode): done
  suse92:/ # badblocks -s -c 256 /dev/sda7
  Checking for bad blocks (read-only mode): done
  suse92:/ # badblocks -s -c 256 /dev/sda8
  Checking for bad blocks (read-only mode): done
  suse92:/ # badblocks -s -c 256 /dev/sda5
  Checking for bad blocks (read-only mode): 26188800/ 52436097

Then the system hangs. I tried it twice, and twice it hung. The last
messages in /var/log/messages are:

  Jan 29 19:06:11 suse92 kernel: Current sda: sense key Medium Error
  Jan 29 19:06:11 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 19:06:11 suse92 kernel: end_request: I/O error, dev sda, sector 52460485
  Jan 29 19:06:12 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
  Jan 29 19:06:12 suse92 kernel: ata2: error=0x40 { UncorrectableError }
  Jan 29 19:06:12 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b c6 00 00 b8 00
  Jan 29 19:06:12 suse92 kernel: Current sda: sense key Medium Error
  Jan 29 19:06:12 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 19:06:12 suse92 kernel: end_request: I/O error, dev sda, sector 52460486
  [... the same pattern repeats for sectors 52460487 and 52460488 ...]

Well, now I think, as only sda5 is involved, I'll re-format it, check
it for bad blocks, and copy the previously copied data back onto it ...

c u
Torsten
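A commonly suggested (and destructive) way to coax a drive into
remapping bad sectors is to write over the affected area, so the
firmware can substitute spares. A sketch, only for a partition whose
contents are already saved elsewhere, and no cure if the drive keeps
growing new defects:

  # WARNING: destroys everything on /dev/sda5.
  badblocks -w -s /dev/sda5       # write-mode test over the whole partition
  badblocks -s -c 256 /dev/sda5   # then re-check read-only
  # If errors persist, or keep appearing, replace the drive instead.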
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 23:12 UTC
To: linux-raid

Got it up and running again. I had to delete /dev/md0 and /dev/md1 and
create them anew. Thanks to my backups, not a single file was lost ... :-)

But now I get these messages when booting the system (I don't know if
they came up in the past, too ... I never noticed them):

  ...
  <6>md: Autodetecting RAID arrays.
  <3>md: could not bd_claim sda5.
  <3>md: could not bd_claim sda6.
  <3>md: could not bd_claim sdb5.
  <3>md: could not bd_claim sdb6.
  <3>md: could not bd_claim sdb7.
  <3>md: could not bd_claim sdb8.
  <6>md: autorun ...
  ...
  Disk /dev/md0 does not contain a valid partition table
  Disk /dev/md1 does not contain a valid partition table
  Disk /dev/md2 does not contain a valid partition table
  Disk /dev/md3 does not contain a valid partition table

What else do I have to do?

Thanks a lot!!
Torsten
Thread overview: 13+ messages (newest: 2005-02-01 23:24 UTC)

  2005-01-29  0:22 Broken harddisk            T. Ermlich
  2005-01-29 12:46 ` Gordon Henderson
  2005-01-29 15:34   ` T. Ermlich
  2005-01-29 15:56     ` Gordon Henderson
  2005-01-29 16:19       ` Guy
  2005-01-29 18:31         ` Mike Hardy
  2005-01-29 23:30           ` berk walker
  2005-02-01 15:02           ` Robin Bowes
  2005-02-01 22:57             ` Luca Berra
  2005-02-01 23:24               ` Robin Bowes
  2005-01-29 16:47       ` T. Ermlich
  2005-01-29 18:18         ` T. Ermlich
  2005-01-29 23:12           ` T. Ermlich