* Hot-swapping: what's that? (and 3ware 9650SE)
From: kwick @ 2009-08-18 19:09 UTC
To: linux-raid

Hello raid-ers
I have been reading previous posts in this ML regarding hot-swapping
capability of controllers or the lack thereof.

One question remains: ok but what is hot-swap anyway?

If I have a controller that does not notify the operating system of
drive removal and insertions, but will give an error if one writes to the
device while the HDD is disconnected, and will correctly write to the
disk if the HDD is later reconnected, would this be "hot-swap"?

Actually we have a 3ware 9650SE controller in JBOD mode here, and it
behaves very similarly to what I have described. Two differences:
1- it actually notifies the OS of drive removal, I can see that in
dmesg, but the block device special-files are still not deleted from the
system in response to this. However if you read or write from those
block devices it immediately returns an error (dd stops immediately)
2- you need to use the tw_cli commandline prog to rescan the controller
after the drives' insertion (doing this you can decide whether to notify
the OS or not, but this apparently does not make a substantial
difference, it just gets logged in dmesg). The newly inserted drives will
get the old drive letters (block-device files), i.e. the drive letters
stay attached to the physical slots, except the case in which you are
reordering drives; in this latter case the controller will try to remap
the "units" so that the drive letters follow the HDDs (it uses the
serial numbers to identify the HDDs).

So is this a "hot-swap" controller or not? What is hot-swap more than this?

BTW I'd have a few more questions which are important for us, related to
hot-swapping. I understand these might be offtopic here, but I see you
are knowledgeable over this matter, so I hope I can ask:

With the 9650SE as described before, I would like to reliably flush all
data to a drive before removing the drive manually.
- Do you confirm that "umount" is not enough for flushing the block device?
- Is the bash "sync" command / sync() syscall what I have to use? (after
umount)
- Is the sync() enough anyway?

Thank you
kwick
* Re: Hot-swapping: what's that? (and 3ware 9650SE)
From: Greg Freemyer @ 2009-08-18 20:12 UTC
To: kwick; +Cc: linux-raid

On Tue, Aug 18, 2009 at 3:09 PM, kwick<kwick@shiftmail.org> wrote:
> Hello raid-ers
> I have been reading previous posts in this ML regarding hot-swapping
> capability of controllers or the lack thereof.
>
> One question remains: ok but what is hot-swap anyway?

I think originally hot-swap meant swapping components with power
connected, i.e. you don't have to power down the whole computer to swap
components.

At least on the sata list they now tend to use hot-swap to mean you can
walk up to a chassis and just pull the drive and plug in another and the
system will recognize it is a new drive automatically. But they ONLY
mean it at the sata level.

They talk about warm-swap if you have to have user interaction involved.
So if a controller is warm-swap only you have to manually trigger a sata
bus scan before the new drive is visible.

> If I have a controller that does not notify the operating systems of
> drive removal and insertions, but will give error if one writes to the
> device while the HDD is disconnected, and will correctly write to the
> disk if the HDD is later reconnected, would this be "hot-swap"?

It is hot-swap at the hardware level anyway. I don't know if mdraid and
the low level sata drivers are integrated enough to call it hot-swap
from a user's perspective, i.e. does the user have to manually trigger a
reconfig / rebuild, or does it happen automatically?

> Actually we have a 3ware 9650SE controller in JBOD mode here, and it
> behaves very similarly to what I have described. Two differences:
> 1- it actually notifies the OS of drive removal, I can see that in
> dmesg, but the block device special-files are still not deleted from the
> system in response to this. However if you read or write from those
> block devices it immediately returns error (dd stops immediately)
> 2- you need to use the tw_cli commandline prog to rescan the controller
> after the drives' insertion (doing this you can decide whether notify
> the OS or not, but this apparently does not make a substantial
> difference, it just gets logged in dmesg). The new inserted drives will
> get the old drive letters (block-device files), i.e. the drive letters
> stay attached to the physical slots, except the case in which you are
> reordering drives, in this latter case the controller will try to remap
> the "units" so that the drive letters follow the HDDs (it uses the
> serial numbers to identify the HDDs).
>
> So is this a "hot-swap" controller or not? What is hot-swap more than this?

You had to use tw_cli to rescan the controller. I would call that warm
swap even at the hardware level.

> BTW I'd have a few more questions which are important for us, related to
> hot-swapping. I understand these might be offtopic here, but I see you
> are knowledgeable over this matter, so I hope I can ask:
>
> With the 9650SE as described before, I would like to reliably flush all
> data to a drive before removing the drive manually.
> - Do you confirm that "umount" is not enough for flushing the block device?
> - Is the bash "sync" command / sync() syscall what I have to use? (after
> umount)
> - Is the sync() enough anyway?

I personally think of umount as being stronger than just sync. So I
umount a drive before pulling it out. I'm fairly confident that umount
does an internal drive sync at the end of the process, so there is no
need for "umount; sync".

A sync by itself should be good enough IF you know for sure no processes
are still writing to the drive and IF you have a journaled filesystem in
use. Even with a journaled filesystem the unapplied journal entries will
have to be applied the next time you mount the drive.

fyi: fat has a unique mount option that is much more efficient than full
"sync" mode on mount, but that keeps the disk cache written to disk most
of the time. It was written to support thumbdrives that people want to
plug in, use, and unplug with no other user action.

fyi2: I would call thumbdrives that mount automatically with that mount
option truly hot-swap all the way up and down the stack.

>
> Thank you
> kwick

Greg
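
The flush step discussed above can be sketched in a few lines. This is a minimal illustration, not something posted in the thread: `flush_device` is a hypothetical helper name, and it assumes a Linux system where `fsync()` on an already-unmounted block device (or on any file) pushes the kernel's cached data for it out to the hardware. Whether the controller's own write cache is also emptied depends on the kernel and the controller, so treat that part as unverified.

```python
import os


def flush_device(path: str) -> None:
    """Ask the kernel to write out cached data for `path` (hypothetical helper).

    Intended to be called after umount and before the drive is pulled.
    """
    # System-wide sync, the same request the "sync" command makes.
    os.sync()
    # Then wait specifically for this device/file to be flushed.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

Typical use would be something like `flush_device("/dev/sdX")` run as root once nothing is writing to the device any more; the device name is a placeholder.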
* Re: Hot-swapping: what's that? (and 3ware 9650SE)
From: Drew @ 2009-08-18 22:49 UTC
To: kwick, Linux RAID Mailing List

> One question remains: ok but what is hot-swap anyway?

A "Hot Swappable Component" refers to any system component which can
be replaced without shutting down the machine. My servers at work have
hot swappable PCI slots, for example. Most often though you have to
tell the OS the device is about to vanish, otherwise things break.

It can refer to non-raid controllers that allow you to remove drives
without hanging the bus they attach to. If it's in use you still have
to tell the OS it's about to vanish, unmount file systems, etc. I have
an SAS/SATA controller at home that does this.

In the context of RAID, "hot swap" typically refers to any system
which allows drives to be changed out on a live system without having
to interact with the operating system beforehand. IBM's ServeRAID
controllers are a good example. Replacing a failed drive is as simple
as walking over to the server, pulling out the drive identified as
defective, and inserting a replacement. The raid controller recognizes
the replacement and begins to integrate it back into the array within
30secs.

Hope that helps.

--
Drew

"Nothing in life is to be feared. It is only to be understood."
--Marie Curie
* Re: Hot-swapping: what's that? (and 3ware 9650SE)
From: John Robinson @ 2009-08-19 2:31 UTC
To: Drew; +Cc: Linux RAID Mailing List

On 18/08/2009 23:49, Drew wrote:
>> One question remains: ok but what is hot-swap anyway?
[...]
> In the context of RAID, "hot swap" typically refers to any system
> which allows drives to be changed out on a live system without having
> to interact with the operating system beforehand. IBM's ServeRAID
> controllers are a good example. Replacing a failed drive is as simple
> as walking over to the server, pulling out the drive identified as
> defective, and inserting a replacement. The raid controller recognizes
> the replacement and begins to integrate it back into the array within
> 30secs.

By the above definition, md RAID doesn't do hot swap. My hardware does
hot swap (ICH10R SATA, SuperMicro drive cage), and I just tried yanking
one of my drives:

Aug 19 02:21:56 beast kernel: ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Aug 19 02:21:56 beast kernel: ata3: irq_stat 0x00400040, connection status changed
Aug 19 02:21:56 beast kernel: ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
Aug 19 02:21:56 beast kernel: ata3: hard resetting link
Aug 19 02:21:57 beast kernel: ata3: SATA link down (SStatus 0 SControl 300)
Aug 19 02:21:57 beast kernel: ata3: failed to recover some devices, retrying in 5 secs
Aug 19 02:22:02 beast kernel: ata3: hard resetting link
Aug 19 02:22:02 beast kernel: ata3: SATA link down (SStatus 0 SControl 300)
Aug 19 02:22:02 beast kernel: ata3: failed to recover some devices, retrying in 5 secs
Aug 19 02:22:07 beast kernel: ata3: hard resetting link
Aug 19 02:22:07 beast kernel: ata3: SATA link down (SStatus 0 SControl 300)
Aug 19 02:22:07 beast kernel: ata3.00: disabled
Aug 19 02:22:07 beast kernel: sd 2:0:0:0: rejecting I/O to offline device
Aug 19 02:22:08 beast last message repeated 2 times
Aug 19 02:22:08 beast kernel: raid5: Disk failure on sda2, disabling device. Operation continuing on 2 devices
Aug 19 02:22:08 beast kernel: RAID5 conf printout:
Aug 19 02:22:08 beast kernel: --- rd:3 wd:2 fd:1
Aug 19 02:22:08 beast kernel: disk 0, o:0, dev:sda2
Aug 19 02:22:08 beast kernel: disk 1, o:1, dev:sdb2
Aug 19 02:22:08 beast kernel: disk 2, o:1, dev:sdc2
Aug 19 02:22:08 beast kernel: RAID5 conf printout:
Aug 19 02:22:08 beast kernel: --- rd:3 wd:2 fd:1
Aug 19 02:22:08 beast kernel: disk 1, o:1, dev:sdb2
Aug 19 02:22:08 beast kernel: disk 2, o:1, dev:sdc2
Aug 19 02:22:08 beast kernel: ata3: EH complete
Aug 19 02:22:08 beast kernel: ata3.00: detaching (SCSI 2:0:0:0)

So that all went well. Then I plugged it in again:

Aug 19 02:22:48 beast kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
Aug 19 02:22:48 beast kernel: ata3: irq_stat 0x00000040, connection status changed
Aug 19 02:22:48 beast kernel: ata3: SError: { CommWake DevExch }
Aug 19 02:22:48 beast kernel: ata3: hard resetting link
Aug 19 02:22:55 beast kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 19 02:22:58 beast kernel: ata3: softreset failed (device not ready)
Aug 19 02:22:58 beast kernel: ata3: hard resetting link
Aug 19 02:23:00 beast kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 19 02:23:00 beast kernel: ata3.00: ATA-7: SAMSUNG HD103UJ, 1AA01112, max UDMA7
Aug 19 02:23:00 beast kernel: ata3.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
Aug 19 02:23:00 beast kernel: ata3.00: configured for UDMA/133
Aug 19 02:23:00 beast kernel: ata3: EH complete
Aug 19 02:23:00 beast kernel: Vendor: ATA Model: SAMSUNG HD103UJ Rev: 1AA0
Aug 19 02:23:00 beast kernel: Type: Direct-Access ANSI SCSI revision: 05
Aug 19 02:23:00 beast kernel: SCSI device sdd: 1953525168 512-byte hdwr sectors (1000205 MB)
Aug 19 02:23:00 beast kernel: sdd: Write Protect is off
Aug 19 02:23:00 beast kernel: SCSI device sdd: drive cache: write back
Aug 19 02:23:00 beast kernel: SCSI device sdd: 1953525168 512-byte hdwr sectors (1000205 MB)
Aug 19 02:23:00 beast kernel: sdd: Write Protect is off
Aug 19 02:23:00 beast kernel: SCSI device sdd: drive cache: write back
Aug 19 02:23:00 beast kernel: sdd: sdd1 sdd2
Aug 19 02:23:00 beast kernel: sd 2:0:0:0: Attached scsi disk sdd
Aug 19 02:23:00 beast kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0

I waited for a bit to see if anything else would happen automatically.
It didn't, so I manually re-added sdd2 to md1:

Aug 19 02:24:05 beast kernel: md: bind<sdd2>
Aug 19 02:24:05 beast kernel: RAID5 conf printout:
Aug 19 02:24:05 beast kernel: --- rd:3 wd:2 fd:1
Aug 19 02:24:05 beast kernel: disk 0, o:1, dev:sdd2
Aug 19 02:24:05 beast kernel: disk 1, o:1, dev:sdb2
Aug 19 02:24:05 beast kernel: disk 2, o:1, dev:sdc2
Aug 19 02:24:05 beast kernel: md: syncing RAID array md1
Aug 19 02:24:05 beast kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug 19 02:24:05 beast kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug 19 02:24:05 beast kernel: md: using 128k window, over a total of 976655360 blocks.
Aug 19 02:24:09 beast kernel: md: md1: sync done.
Aug 19 02:24:10 beast kernel: RAID5 conf printout:
Aug 19 02:24:10 beast kernel: --- rd:3 wd:3 fd:0
Aug 19 02:24:10 beast kernel: disk 0, o:1, dev:sdd2
Aug 19 02:24:10 beast kernel: disk 1, o:1, dev:sdb2
Aug 19 02:24:10 beast kernel: disk 2, o:1, dev:sdc2

Then I realised that md0 hadn't noticed sda1 was missing. I re-added
sdd1 anyway; it said it was adding it, not re-adding it, and this is
what was logged:

Aug 19 02:24:12 beast kernel: md: export_rdev(sdd1)
Aug 19 02:24:12 beast kernel: md: bind<sdd1>
Aug 19 02:24:29 beast kernel: scsi 2:0:0:0: rejecting I/O to dead device
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208512
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208514
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208516
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208518
Aug 19 02:24:29 beast kernel: scsi 2:0:0:0: rejecting I/O to dead device
Aug 19 02:24:29 beast kernel: scsi 2:0:0:0: rejecting I/O to dead device
Aug 19 02:24:29 beast kernel: raid1: Disk failure on sda1, disabling device.
Aug 19 02:24:29 beast kernel: Operation continuing on 2 devices
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208512 to another mirror
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208514 to another mirror
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208516 to another mirror
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208518 to another mirror
Aug 19 02:24:29 beast kernel: RAID1 conf printout:
Aug 19 02:24:29 beast kernel: --- wd:2 rd:3
Aug 19 02:24:29 beast kernel: disk 0, wo:1, o:0, dev:sda1
Aug 19 02:24:29 beast kernel: disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:29 beast kernel: disk 2, wo:0, o:1, dev:sdc1
Aug 19 02:24:29 beast kernel: RAID1 conf printout:
Aug 19 02:24:29 beast kernel: --- wd:2 rd:3
Aug 19 02:24:29 beast kernel: disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:29 beast kernel: disk 2, wo:0, o:1, dev:sdc1
Aug 19 02:24:30 beast kernel: RAID1 conf printout:
Aug 19 02:24:30 beast kernel: --- wd:2 rd:3
Aug 19 02:24:30 beast kernel: disk 0, wo:1, o:1, dev:sdd1
Aug 19 02:24:30 beast kernel: disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:30 beast kernel: disk 2, wo:0, o:1, dev:sdc1
Aug 19 02:24:30 beast kernel: md: syncing RAID array md0
Aug 19 02:24:30 beast kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug 19 02:24:30 beast kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug 19 02:24:30 beast kernel: md: using 128k window, over a total of 104320 blocks.
Aug 19 02:24:32 beast kernel: md: md0: sync done.
Aug 19 02:24:32 beast kernel: RAID1 conf printout:
Aug 19 02:24:32 beast kernel: --- wd:3 rd:3
Aug 19 02:24:32 beast kernel: disk 0, wo:0, o:1, dev:sdd1
Aug 19 02:24:32 beast kernel: disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:32 beast kernel: disk 2, wo:0, o:1, dev:sdc1

So that all worked perfectly. Now is there a tool out there I can use in
conjunction with udev (for hotplugging) and md/mdadm to do this
automatically (including recreating my partition table if it's a fresh
disc)? I like IBM ServeRAID, and more to the point I would like to be
able to have rebuilds begin as soon as the operator in the data centre
has changed a dead drive.

I've just done a spot of Googling etc. and found scsirastools but it
looks like it's a year since anything was done with it, it talks about
kernel patches to make it work, it bundles mdadm 1.3.0 and its SRPM
doesn't build on CentOS 5, so I'm not sure that's quite the thing!

Cheers,

John.
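
The "conf printout" status lines in the logs above are regular enough to check from a script. The following is a minimal sketch, not a tool mentioned in the thread: `parse_md_status` and `is_degraded` are hypothetical names, and reading `rd` as the number of disks the array wants and `wd` as the number currently working is an interpretation of these logs (it matches the before and after values shown), not something stated by the posters.

```python
import re

MARKER = "kernel: --- "


def parse_md_status(line: str):
    """Return the counters from a '--- rd:3 wd:2 fd:1' log line, else None.

    The key order differs between the RAID5 and RAID1 printouts, so the
    pairs are collected into a dict rather than matched positionally.
    """
    idx = line.find(MARKER)
    if idx < 0:
        return None
    pairs = re.findall(r"(\w+):(\d+)", line[idx + len(MARKER):])
    return {key: int(value) for key, value in pairs} or None


def is_degraded(status: dict) -> bool:
    # Assumption: rd = disks the array should have, wd = working disks.
    return status["wd"] < status["rd"]
```

Fed the line `--- rd:3 wd:2 fd:1` logged at 02:22:08, this reports the array as degraded; fed the `--- rd:3 wd:3 fd:0` line logged after the resync, it does not.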
* Re: Hot-swapping: what's that? (and 3ware 9650SE)
From: NeilBrown @ 2009-08-19 3:28 UTC
To: John Robinson; +Cc: Drew, Linux RAID Mailing List

On Wed, August 19, 2009 12:31 pm, John Robinson wrote:
> On 18/08/2009 23:49, Drew wrote:
>>> One question remains: ok but what is hot-swap anyway?
> [...]
>> In the context of RAID, "hot swap" typically refers to any system
>> which allows drives to be changed out on a live system without having
>> to interact with the operating system beforehand. IBM's ServeRAID
>> controllers are a good example. Replacing a failed drive is as simple
>> as walking over to the server, pulling out the drive identified as
>> defective, and inserting a replacement. The raid controller recognizes
>> the replacement and begins to integrate it back into the array within
>> 30secs.
>
> By the above definition, md RAID doesn't do hot swap. My hardware does
> hot swap (ICH10R SATA, SuperMicro drive cage), and I just tried yanking
> one of my drives:

Correct. mdraid does not do hotswap by that definition.

However all the bits you need to implement hotswap exist, you just
need some scripts to tie it together.

Specifically, you need to write something that udev runs whenever it
sees a new device. Somehow it has to decide what should be done with
that device, and do it.

The "do it" part would simply be a call to mdadm, possibly
"mdadm --add ..."
The "Somehow ... decide" is the tricky part. What you want probably
isn't what I want or what someone else wants.

If I were to integrate hotswap into md raid, I would probably connect
it to the "mdadm --incremental" functionality. I would add some sort
of "policy" information to /etc/mdadm.conf telling mdadm that if it
found an unrecognised device on some particular controller (or on any
controller), it should add it to some particular 'spare group'.
That would just be very basic hotswap.

It has been suggested earlier in this thread that you might like a
RAID1 to convert to a RAID5 as soon as a third drive was available.
This is functionality I would probably put in "mdadm --monitor" (if at
all). Again there would need to be some policy rule in
/etc/mdadm.conf, maybe describing the ideal configuration of an array,
which would include minimum number of spares, maximum number of disks
in the array, number of disks at which to switch from RAID5 to RAID6.

But before doing something like this, I would really like someone to
try it out themselves with a few ad-hoc scripts and report the
results. There is nothing like real concrete experience to give
useful guidance to designing this sort of functionality.

NeilBrown
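
As one possible shape for the ad-hoc script suggested above, here is a minimal sketch of the "decide" step only. Everything in it is hypothetical: the `POLICY` table, its glob patterns, the `decide` function and the example controller path are illustration, not an existing mdadm.conf syntax or mdadm feature. It also simplifies the 'spare group' idea to a single target array per controller pattern, and it only builds the mdadm command line without running it.

```python
from fnmatch import fnmatch

# Hypothetical policy table: (glob matched against the new device's
# controller path, array an unrecognised device there should join).
POLICY = [
    ("pci-0000:00:1f.2-*", "/dev/md1"),
]


def decide(controller_path, device, recognised, policy=POLICY):
    """Return the mdadm argv for a newly appeared device, or None.

    `recognised` means the device already carries md metadata, in which
    case incremental assembly is left to handle it.
    """
    if recognised:
        return ["mdadm", "--incremental", device]
    for pattern, array in policy:
        if fnmatch(controller_path, pattern):
            # Unrecognised device on a controller the policy covers.
            return ["mdadm", "--manage", array, "--add", device]
    # No rule matches: leave the device alone.
    return None
```

A udev-triggered wrapper would call `decide()` with details of the new device and, if it returns a command, run it; that wiring, and recreating a partition table on a fresh disc, are left out here.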