* Broken harddisk
From: T. Ermlich
Date: 2005-01-29 0:22 UTC
To: linux-raid

Hello there,

I just got here from http://cgi.cse.unsw.edu.au/~neilb/Contact ...
Hopefully I'm more or less in the right place.

Several months ago I set up a RAID1 using mdadm. Two drives (/dev/sda
and /dev/sdb, each a 160GB Samsung SATA disk) are used, and now provide
/dev/md0, /dev/md1, /dev/md2 and /dev/md3. In November 2004 I upgraded
to mdadm 1.8.1.

This afternoon, about 9 hours ago, /dev/sda broke down ... no chance to
get it working again .. :(

My question now is: what do I have to do now? The system is up and
running, so I'll make a fresh backup of the most important data ... but
how do I 'replace' the broken drive and 'restore' the data onto it
(sorry, as English is not my native language I have no idea how to
explain it correctly)? Is there a way to do so, or do I have to create a
RAID1 from scratch and copy all data from /dev/md0-3 there manually?

Thanks in advance
Torsten
* Re: Broken harddisk
From: Gordon Henderson
Date: 2005-01-29 12:46 UTC
To: T. Ermlich; Cc: linux-raid

On Sat, 29 Jan 2005, T. Ermlich wrote:

> Several months ago I set up a RAID1 using mdadm. Two drives (/dev/sda
> and /dev/sdb, each a 160GB Samsung SATA disk) are used, and now
> provide /dev/md0, /dev/md1, /dev/md2 and /dev/md3. In November 2004 I
> upgraded to mdadm 1.8.1.

Drop 1.8.1 and get 1.8.0. I understand 1.8.1 has some experimental code
and is not designed to be used for real.

> This afternoon, about 9 hours ago, /dev/sda broke down ... no chance
> to get it working again .. :(
>
> My question now is: what do I have to do now?

Well, go through the procedure to remove the disk and put a new one back
in...

> The system is up and running, so I'll make a fresh backup of the most
> important data ... but how do I 'replace' the broken drive and
> 'restore' the data onto it? Is there a way to do so, or do I have to
> create a RAID1 from scratch and copy all data from /dev/md0-3 there
> manually?

You should not have to copy it - that's the whole point of it all.
However, RAID is not a substitute for proper backups, so make sure you
do those backups now and regularly in the future.

OK - here are the basic steps - you may have to modify them, as you
haven't posted enough detail for me to work out your exact system.

I'm assuming that you have partitioned each disk with 4 partitions, that
both disks are partitioned identically, and that you are combining the
same partition of each device into the md devices (e.g. /dev/md0 is made
from /dev/sda1 and /dev/sdb1). This is reasonably "sane" and I'm sure
lots of people do it this way (I do, but I'm a small sample :). If you
aren't doing it this way, then this won't work for you, but you may be
able to adapt it for your needs.

Firstly, get mdadm 1.8.0 as I mentioned above.

Look at /proc/mdstat and see if all 4 md devices have a failed device in
them. If the disk is really dead, this is likely to be the case; if it's
not, then you'll need to fail each partition in each md device. So, to
make sure that each md device has the failed disk really failed, you can
do:

  mdadm --fail /dev/md0 /dev/sda1
  mdadm --fail /dev/md1 /dev/sda2
  mdadm --fail /dev/md2 /dev/sda3
  mdadm --fail /dev/md3 /dev/sda4

Next, you need to remove the failed disk from each array:

  mdadm --remove /dev/md0 /dev/sda1
  mdadm --remove /dev/md1 /dev/sda2
  mdadm --remove /dev/md2 /dev/sda3
  mdadm --remove /dev/md3 /dev/sda4

Strictly speaking, you don't have to do this - you can just power down
and put a new disk in - but I feel this is "cleaner" and hopefully
leaves the system in a stable and known state when you do power down.

At this point you can power down the machine, physically remove the
drive and replace it with a new, identical unit.

Reboot your PC. If it would normally boot off sda, you have to persuade
it to boot off sdb. You might need to alter the BIOS to do this, or
maybe not... All BIOSes and controllers have their own little ideas
about how this is done.

If it boots off another drive (e.g. an IDE drive) then you should be
fine. If it does boot off sda, then I hope you used the raid-extra-boot
option in lilo.conf (and tested it...). If you are using grub, I can't
be of any assistance there as I don't use it.

You should now have the system running with the data intact on sdb and
all the md devices working and mounted as normal.

Now you have to re-partition the new sda identically to sdb. If they are
the same make and size, you can use this:

  sfdisk -d /dev/sdb | sfdisk /dev/sda

Now, tell the RAID code to re-mirror the drives:

  mdadm --add /dev/md0 /dev/sda1
  mdadm --add /dev/md1 /dev/sda2
  mdadm --add /dev/md2 /dev/sda3
  mdadm --add /dev/md3 /dev/sda4

then run:

  watch -n1 cat /proc/mdstat

and wait for it to finish; the system is fully usable all during this
process.

If you can't power the machine down, and have hot-swappable drives in
proper caddies, then there is a way to tell the kernel that you are
removing the drive and adding a new one, but it's probably safer if you
can do it while powered down.

If this doesn't make sense, post back the output of /proc/mdstat and
fdisk -l.

Good luck!

Gordon
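A minimal sketch tying the steps above into one sequence. It assumes
Gordon's example layout (sda1..sda4 backing md0..md3); Torsten's real
system turns out to use sda5..sda8, so the PARTS list is an assumption
to adjust before running anything:

  #!/bin/sh
  # Fail and detach every slice of the dead disk (sda), assuming
  # partition N backs /dev/md(N-1) as in the example above.
  PARTS="1 2 3 4"
  for n in $PARTS; do
      md=/dev/md$((n - 1))
      mdadm --fail   $md /dev/sda$n
      mdadm --remove $md /dev/sda$n
  done
  # ... power down, swap in the new disk, boot from the surviving sdb ...
  sfdisk -d /dev/sdb | sfdisk /dev/sda   # clone sdb's partition table
  for n in $PARTS; do
      mdadm --add /dev/md$((n - 1)) /dev/sda$n   # kick off each resync
  done
  watch -n1 cat /proc/mdstat             # watch the rebuild progress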
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 15:34 UTC
To: Gordon Henderson; Cc: linux-raid

Hi,

I'd like to say thanks to everyone who has replied so far! :-)

Gordon Henderson scribbled on 29.01.2005 13:46:

> Drop 1.8.1 and get 1.8.0. I understand 1.8.1 has some experimental
> code and is not designed to be used for real.
>
> Well, go through the procedure to remove the disk and put a new one
> back in...

OK ... as the broken disk stopped the system, and the system hung
during the boot procedure, I had to remove it (disconnected the cables).

> You should not have to copy it - that's the whole point of it all.
> However, RAID is not a substitute for proper backups, so make sure
> you do those backups now and regularly in the future.

Backups are done every night (3 am), so I just made a backup of the
latest changes (between ~3 am and 15:30).

> I'm assuming that you have partitioned each disk with 4 partitions,
> that both disks are partitioned identically, and that you are
> combining the same partition of each device into the md devices.

That's right: each hard disk is partitioned absolutely identically,
like:

      0 - 19456 - /dev/sda1 - extended partition
      1 -  6528 - /dev/sda5 - /dev/md0
   6529 -  9138 - /dev/sda6 - /dev/md1
   9139 - 16970 - /dev/sda7 - /dev/md2
  16971 - 19456 - /dev/sda8 - /dev/md3

And after doing that partitioning I 'combined' them to act as RAID1.

> So, to make sure that each md device has the failed disk really
> failed, you can do:
>
>   mdadm --fail /dev/md0 /dev/sda1
>   [...]
>
> Next, you need to remove the failed disk from each array:
>
>   mdadm --remove /dev/md0 /dev/sda1
>   [...]
>
> Strictly speaking, you don't have to do this - you can just power
> down and put a new disk in - but I feel this is "cleaner" and
> hopefully leaves the system in a stable and known state when you do
> power down.

Haven't done that, because the system was already down ...

> At this point you can power down the machine, physically remove the
> drive and replace it with a new, identical unit.

So I did: replaced the broken one (Samsung SP1614C) with an identical
drive.

> Reboot your PC. If it would normally boot off sda, you have to
> persuade it to boot off sdb. [...] If it boots off another drive
> (e.g. an IDE drive) then you should be fine.

I have two additional IDE drives in that system. /dev/hda contains some
data and is the boot drive; /dev/hdb contains some less important data.

> Now you have to re-partition the new sda identically to sdb. If they
> are the same make and size, you can use this:
>
>   sfdisk -d /dev/sdb | sfdisk /dev/sda

This didn't work properly, so I partitioned the new drive manually.

> Now, tell the RAID code to re-mirror the drives:
>
>   mdadm --add /dev/md0 /dev/sda1
>   mdadm --add /dev/md1 /dev/sda2
>   mdadm --add /dev/md2 /dev/sda3
>   mdadm --add /dev/md3 /dev/sda4

Now some new trouble starts ...?

'mdadm --add /dev/md0 /dev/sda1' started just fine - but exactly at 50%
it started giving tons of errors, like:

  Jan 29 16:10:24 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 16:10:24 suse92 kernel: end_request: I/O error, dev sdb, sector 52460420
  Jan 29 16:10:25 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
  Jan 29 16:10:25 suse92 kernel: ata2: error=0x40 { UncorrectableError }
  Jan 29 16:10:25 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b 85 00 02 f9 00
  Jan 29 16:10:25 suse92 kernel: Current sdb: sense key Medium Error
  Jan 29 16:10:25 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 16:10:25 suse92 kernel: end_request: I/O error, dev sdb, sector 52460421
  [... the same pattern repeats for sectors 52460422, 52460423, ...]

> then run:
>
>   watch -n1 cat /proc/mdstat
>
> and wait for it to finish; the system is fully usable all during this
> process.

  Every 1.0s: cat /proc/mdstat                Sat Jan 29 16:08:50 2005

  Personalities : [raid1]
  md3 : active raid1 sdb8[1]
        19960640 blocks [2/1] [_U]

  md2 : active raid1 sdb7[1]
        62910400 blocks [2/1] [_U]

  md1 : active raid1 sdb6[1]
        20964672 blocks [2/1] [_U]

  md0 : active raid1 sdb5[1] sda5[2]
        52436032 blocks [2/1] [_U]
        [==========>..........]  recovery = 50.0% (26230016/52436032) finish=121.7min speed=1050K/sec

  unused devices: <none>

Can I stop that process for /dev/md0 and start with /dev/md1 (just to
compare whether it's a problem with that partition only or a general
problem, i.e. whether the second drive has problems, too)?

btw: does mdadm also format the partitions?

Have a nice day
Torsten
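The kernel log above names absolute sector numbers on sdb. One way to
see which partition (and hence which md device) a logged bad sector
falls in is to list the partition table in sectors and compare by hand -
a sketch:

  # Print partition start/end in 512-byte sectors:
  sfdisk -l -uS /dev/sdb
  # "sector 52460420" in the log is counted from the start of the disk;
  # find the partition whose [start, end] range contains it. The offset
  # inside that partition is 52460420 minus the partition's start sector.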
* Re: Broken harddisk
From: Gordon Henderson
Date: 2005-01-29 15:56 UTC
To: T. Ermlich; Cc: linux-raid

On Sat, 29 Jan 2005, T. Ermlich wrote:

> That's right: each hard disk is partitioned absolutely identically,
> like:
>       0 - 19456 - /dev/sda1 - extended partition
>       1 -  6528 - /dev/sda5 - /dev/md0
>    6529 -  9138 - /dev/sda6 - /dev/md1
>    9139 - 16970 - /dev/sda7 - /dev/md2
>   16971 - 19456 - /dev/sda8 - /dev/md3
> And after doing that partitioning I 'combined' them to act as RAID1.
>
> I have two additional IDE drives in that system. /dev/hda contains
> some data and is the boot drive; /dev/hdb contains some less
> important data.

Just as a point of note - if the boot disk goes down it will be harder
to recover the data... Consider making the boot disk mirrored too!

> > mdadm --add /dev/md0 /dev/sda1
> > [...]
>
> Now some new trouble starts ...?
> 'mdadm --add /dev/md0 /dev/sda1' started just fine - but exactly at
> 50% it started giving tons of errors, like:

You should be using:

  mdadm --add /dev/md0 /dev/sda5

> Jan 29 16:10:24 suse92 kernel: Additional sense: Unrecovered read
> error - auto reallocate failed
> Jan 29 16:10:24 suse92 kernel: end_request: I/O error, dev sdb,
> sector 52460420

That is a read error from /dev/sdb. What it's saying is that sdb has
bad sectors which can't be recovered.

You have 2 bad drives in a RAID-1 - and that's really bad )-:

> Can I stop that process for /dev/md0 and start with /dev/md1 (just to
> compare whether it's a problem with that partition only or a general
> problem, i.e. whether the second drive has problems, too)?

Yes - just fail and remove the drive partition:

  mdadm --fail /dev/md0 /dev/sda5
  mdadm --remove /dev/md0 /dev/sda5

At this point, I'd run badblocks on the other partitions before doing
the resync:

  badblocks -s -c 256 /dev/sdb6
  badblocks -s -c 256 /dev/sdb7
  badblocks -s -c 256 /dev/sdb8

If these pass, you can do the hot-add; however, it looks like the sdb
disk is also faulty.

At this point, I'd be looking to replace both disks and restore from
backup. But if you can re-sync the other 3 partitions, then remove the
also-faulty sdb, replace it with a new one, re-sync the 3 good
partitions, and you only have to restore the '5' partition (md0) from
backup.

You could try mkfs'ing the new partition sda5, mounting it, and copying
the data on md0 over to it - there's a chance the bad sectors on sdb
lie outside the filing system... This would save you having to restore
from backup; however, it then becomes trickier, as you then have to
re-create the RAID set on a new disk with a missing drive and copy it
again.

> btw: does mdadm also format the partitions?

No... You don't need to format/mkfs the partitions, as the RAID resync
will take care of making it a mirror of the existing working disk.

Gordon
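A sketch of the copy-it-over idea Gordon describes, assuming md0
carries an ext3 filesystem mounted at /mnt/md0data (both the filesystem
type and the mount point are placeholders, not from the thread):

  mkfs.ext3 /dev/sda5                 # fresh filesystem on the new partition
  mkdir -p /mnt/rescue
  mount /dev/sda5 /mnt/rescue
  cp -a /mnt/md0data/. /mnt/rescue/   # copy, preserving owners/permissions
  umount /mnt/rescue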
* RE: Broken harddisk
From: Guy
Date: 2005-01-29 16:19 UTC
To: 'Gordon Henderson', 'T. Ermlich'; Cc: linux-raid

For future reference:

Everyone should do a nightly disk test to prevent bad blocks from
hiding undetected. smartd, badblocks or dd can be used. Example:

  dd if=/dev/sda of=/dev/null bs=64k

Just create a nice little script that emails you the output. Put this
script in a nightly cron job to run while the system is idle.

Guy
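A sketch of the kind of script Guy means; the disk list, the script
path in the crontab line, and the mail recipient are placeholders:

  #!/bin/sh
  # Read every sector of each disk; dd reports errors and throughput on
  # stderr, so 2>&1 folds those into the mailed output.
  DISKS="/dev/sda /dev/sdb"
  for d in $DISKS; do
      echo "=== $d ==="
      dd if=$d of=/dev/null bs=64k 2>&1
  done | mail -s "nightly disk read test" root
  # Example crontab entry, running at 3:30 am while the box is idle:
  #   30 3 * * * /usr/local/sbin/disktest.sh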
* Re: Broken harddisk
From: Mike Hardy
Date: 2005-01-29 18:31 UTC
To: Guy; Cc: 'Gordon Henderson', 'T. Ermlich', linux-raid

Guy wrote:
> For future reference:
>
> Everyone should do a nightly disk test to prevent bad blocks from
> hiding undetected. smartd, badblocks or dd can be used. Example:
>   dd if=/dev/sda of=/dev/null bs=64k
>
> Just create a nice little script that emails you the output. Put this
> script in a nightly cron job to run while the system is idle.

While I agree with your purpose 100%, Guy, I respectfully disagree with
the method. If at all possible, you should use tools that access the
SMART capabilities of the device, so that you get more than a read
test - you also get statistics on the various other health parameters
the drive checks, some of which can serve as fair warning of impending
death before you get bad blocks.

http://smartmontools.sf.net is the source for fresh packages, and
smartd can be set up with a config file to do tests on any schedule you
like, emailing you urgent results as it gets them, or just putting
information of general interest in the logs that Logwatch picks up.

If your drives don't talk SMART (older ones don't, and it doesn't work
through all interfaces either), then by all means take Guy's advice - a
'dd' test is certainly valuable. But if they do talk SMART, I think
it's better.

-Mike
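A sketch of what such a smartd configuration could look like, going by
the smartd.conf directives documented by smartmontools (the self-test
schedule, the -d ata note and the mail address are assumptions; check
smartd.conf(5) for your version):

  # /etc/smartd.conf -- monitor both halves of the mirror.
  # -a      monitor all SMART attributes and overall health status
  # -o on   enable the drive's automatic offline testing
  # -S on   enable attribute autosave
  # -s ...  short self-test daily at 2am, long self-test Saturdays at 3am
  # -m ...  mail this address when something looks bad
  /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost
  /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost
  # On SATA disks driven by libata you may also need "-d ata".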
* Re: Broken harddisk
From: berk walker
Date: 2005-01-29 23:30 UTC
To: Mike Hardy, Guy; Cc: 'Gordon Henderson', 'T. Ermlich', linux-raid

I think it might be a good idea to check memory and power supply too. I
have had several motherboards where the IDE channels went bad.

I have become a believer in not using exactly the same drives in an
array, because of today's quality control in manufacturing (not design
nor testing). Clones may have very similar degradation rates -
suddenly, it seems, dying together. It is not possible to get the
manufacturer's name and lot for the platters, but _maybe_ buying
similar drives from different manufacturers might cut down the multiple
failure rates.

RAID1 your boot disk. Another box sometimes helps as a control for
checking hardware.

Think about spending the extra bucks for another RAID box, and have
them rsync'd. You _can_ have a stable, automatic, online backup of all
of your data (don't forget the UPSes). If the house burns down, it will
be trivial that you lost data - or you could share mirroring with a
remotely located friend, so if one house burnt to the ground, maybe the
other still has the data.

Doing the continuous circle of backups, level 1 through 9, by hand
feeding tapes just sucks, and might not even work on restore. I knew a
business that backed up every day; the system fried - and the tapes
could not be read.

Sorry - too much from me (just thinking that it's too damned bad that
we can't go to the local whatever store and walk out with SCSI stuff).

b-

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
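A sketch of the rsync'd second box berk mentions; the hostname and
paths are invented for illustration:

  # Nightly one-way mirror of /data to a second machine over ssh.
  # --delete makes the copy track removals too, so keep real backups
  # as well -- a deleted or corrupted file gets mirrored faithfully.
  rsync -a --delete /data/ backupbox:/backup/data/
  # crontab entry:
  #   0 4 * * * rsync -a --delete /data/ backupbox:/backup/data/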
* Re: Broken harddisk
From: Robin Bowes
Date: 2005-02-01 15:02 UTC
To: linux-raid

Mike Hardy wrote:
> If your drives don't talk SMART (older ones don't, and it doesn't
> work through all interfaces either), then by all means take Guy's
> advice - a 'dd' test is certainly valuable. But if they do talk
> SMART, I think it's better.

True enough. However, until SMART support makes it into the Linux SATA
drivers, I'm pretty much stuck with dd!

R.
--
http://robinbowes.com
* Re: Broken harddisk
From: Luca Berra
Date: 2005-02-01 22:57 UTC
To: linux-raid

On Tue, Feb 01, 2005 at 03:02:54PM +0000, Robin Bowes wrote:
> True enough. However, until SMART support makes it into the Linux
> SATA drivers, I'm pretty much stuck with dd!

http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
* Re: Broken harddisk
From: Robin Bowes
Date: 2005-02-01 23:24 UTC
To: linux-raid

Luca Berra wrote:
> On Tue, Feb 01, 2005 at 03:02:54PM +0000, Robin Bowes wrote:
>> True enough. However, until SMART support makes it into the Linux
>> SATA drivers, I'm pretty much stuck with dd!
>
> http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/

I avoid patching kernels, preferring to use the stock Fedora releases.
Perhaps I should re-phrase the above statement to read "until the
Fedora Core 3 kernel includes SMART support in libata ..." :)

Actually, I'm running 2.6.10 - how can I tell if SMART support is
included in libata?

R.
--
http://robinbowes.com
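One empirical way to probe this, assuming smartmontools is installed:
force ATA passthrough and see whether the drive answers.

  # Ask for the drive's identity through the SCSI/libata layer:
  smartctl -i -d ata /dev/sda
  # If libata passes ATA commands through, this prints the model,
  # serial number and SMART support status; if not, it fails with an
  # error, and a newer kernel (or smartmontools) is needed.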
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 16:47 UTC
To: Gordon Henderson; Cc: linux-raid

Hi again,

well, due to those really handy hints I subscribed to the list ... ;)

Gordon Henderson scribbled on 29.01.2005 16:56:

> Just as a point of note - if the boot disk goes down it will be
> harder to recover the data... Consider making the boot disk mirrored
> too!

Yeah .. I thought about that in the past ... and decided to buy a 3ware
controller (9500S-4LP) for those things in ~2-3 months (as I don't have
the money yet). Currently I'm using the onboard SATA controller (Asus
A7V8X with a Promise controller).

> You should be using:
>
>   mdadm --add /dev/md0 /dev/sda5

Yes, I did - I just made a mistake when writing the command above.

> That is a read error from /dev/sdb. What it's saying is that sdb has
> bad sectors which can't be recovered.
>
> You have 2 bad drives in a RAID-1 - and that's really bad )-:

All I have ... better than nothing ... will be improved in the future ;)

> Yes - just fail and remove the drive partition:
>
>   mdadm --fail /dev/md0 /dev/sda5
>   mdadm --remove /dev/md0 /dev/sda5
>
> At this point, I'd run badblocks on the other partitions before doing
> the resync:
>
>   badblocks -s -c 256 /dev/sdb6
>   badblocks -s -c 256 /dev/sdb7
>   badblocks -s -c 256 /dev/sdb8
>
> If these pass, you can do the hot-add; however, it looks like the sdb
> disk is also faulty.
>
> At this point, I'd be looking to replace both disks and restore from
> backup. But if you can re-sync the other 3 partitions, then remove
> the also-faulty sdb, replace it with a new one, re-sync the 3 good
> partitions, and you only have to restore the '5' partition (md0) from
> backup.
>
> You could try mkfs'ing the new partition sda5, mounting it, and
> copying the data on md0 over to it - there's a chance the bad sectors
> on sdb lie outside the filing system... This would save you having to
> restore from backup; however, it then becomes trickier, as you then
> have to re-create the RAID set on a new disk with a missing drive and
> copy it again.

OK, I'll do that. I attached an older 80GB hard disk (/dev/hdc), and
right now I'm copying the content of /dev/md0 there, using 'cp -a'.
Once that's finished I'll start checking for bad blocks ... and I guess
the backups I made in the past might be full of possibly damaged
data ... :-(

Should I delete /dev/md0 completely after the copy process has
finished? Or just check it for bad blocks and continue using it?

> > btw: does mdadm also format the partitions?
>
> No... You don't need to format/mkfs the partitions, as the RAID
> resync will take care of making it a mirror of the existing working
> disk.

Ah .. OK. :-)

Thanks a lot!!
Torsten
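Before trusting the rescue copy, it may be worth comparing it against
the original; a sketch, assuming md0 is mounted at /mnt/md0data and the
copy lives under /mnt/hdc1/md0 (both paths are placeholders):

  # Checksum-compare without changing anything (-n = dry run,
  # -c = compare by checksum); files that differ, or that hit a bad
  # sector while being read back, get listed.
  rsync -anc --delete /mnt/md0data/ /mnt/hdc1/md0/
  # A plain recursive diff is an alternative:
  diff -r /mnt/md0data /mnt/hdc1/md0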
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 18:18 UTC
To: Gordon Henderson; Cc: linux-raid

Some more info.

T. Ermlich scribbled on 29.01.2005 17:47:
[...]
> > badblocks -s -c 256 /dev/sdb6
> > badblocks -s -c 256 /dev/sdb7
> > badblocks -s -c 256 /dev/sdb8

  suse92:/ # badblocks -s -c 256 /dev/sda6
  Checking for bad blocks (read-only mode): done
  suse92:/ # badblocks -s -c 256 /dev/sda7
  Checking for bad blocks (read-only mode): done
  suse92:/ # badblocks -s -c 256 /dev/sda8
  Checking for bad blocks (read-only mode): done
  suse92:/ # badblocks -s -c 256 /dev/sda5
  Checking for bad blocks (read-only mode): 26188800/ 52436097

Then the system hangs. I tried it twice, and twice it hung. The last
messages in /var/log/messages are:

  Jan 29 19:06:11 suse92 kernel: Current sda: sense key Medium Error
  Jan 29 19:06:11 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 19:06:11 suse92 kernel: end_request: I/O error, dev sda, sector 52460485
  Jan 29 19:06:12 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
  Jan 29 19:06:12 suse92 kernel: ata2: error=0x40 { UncorrectableError }
  Jan 29 19:06:12 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b c6 00 00 b8 00
  Jan 29 19:06:12 suse92 kernel: Current sda: sense key Medium Error
  Jan 29 19:06:12 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
  Jan 29 19:06:12 suse92 kernel: end_request: I/O error, dev sda, sector 52460486
  [... the same pattern repeats for sectors 52460487 and 52460488 ...]

Well, now I think, as only sda5 is involved, I'll re-format it, check
it for bad blocks, and copy the previously copied data back onto it ...

c u
Torsten
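A commonly suggested (and destructive) way to coax a drive into
remapping bad sectors is to write over the affected area, so the
firmware can substitute spares. A sketch, only for a partition whose
contents are already saved elsewhere, and no cure if the drive keeps
growing new defects:

  # WARNING: destroys everything on /dev/sda5.
  badblocks -w -s /dev/sda5       # write-mode test over the whole partition
  badblocks -s -c 256 /dev/sda5   # then re-check read-only
  # If errors persist, or keep appearing, replace the drive instead.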
* Re: Broken harddisk
From: T. Ermlich
Date: 2005-01-29 23:12 UTC
To: linux-raid

Got it up and running again. I had to delete /dev/md0 and /dev/md1 and
create them anew. Thanks to my backups, not a single file was lost ... :-)

But now I get these messages when booting the system (I don't know if
they came up in the past, too ... I never noticed them):

  ...
  <6>md: Autodetecting RAID arrays.
  <3>md: could not bd_claim sda5.
  <3>md: could not bd_claim sda6.
  <3>md: could not bd_claim sdb5.
  <3>md: could not bd_claim sdb6.
  <3>md: could not bd_claim sdb7.
  <3>md: could not bd_claim sdb8.
  <6>md: autorun ...
  ...
  Disk /dev/md0 does not contain a valid partition table
  Disk /dev/md1 does not contain a valid partition table
  Disk /dev/md2 does not contain a valid partition table
  Disk /dev/md3 does not contain a valid partition table

What else do I have to do?

Thanks a lot!!
Torsten
Thread overview: 13+ messages (newest: 2005-02-01 23:24 UTC)

  2005-01-29  0:22 Broken harddisk            T. Ermlich
  2005-01-29 12:46 ` Gordon Henderson
  2005-01-29 15:34   ` T. Ermlich
  2005-01-29 15:56     ` Gordon Henderson
  2005-01-29 16:19       ` Guy
  2005-01-29 18:31         ` Mike Hardy
  2005-01-29 23:30           ` berk walker
  2005-02-01 15:02           ` Robin Bowes
  2005-02-01 22:57             ` Luca Berra
  2005-02-01 23:24               ` Robin Bowes
  2005-01-29 16:47       ` T. Ermlich
  2005-01-29 18:18         ` T. Ermlich
  2005-01-29 23:12           ` T. Ermlich