Hot-swapping: what's that? (and 3ware 9650SE)

public inbox for linux-raid@vger.kernel.org

* Hot-swapping: what's that? (and 3ware 9650SE)
@ 2009-08-18 19:09 kwick
  2009-08-18 20:12 ` Greg Freemyer
  2009-08-18 22:49 ` Drew
  0 siblings, 2 replies; 5+ messages in thread
From: kwick @ 2009-08-18 19:09 UTC (permalink / raw)
  To: linux-raid

Hello raid-ers
I have been reading previous posts in this ML regarding hot-swapping
capability of controllers or the lack thereof.

One question remains: ok but what is hot-swap anyway?

If I have a controller that does not notify the operating system of
drive removals and insertions, but will give an error if one writes to the
device while the HDD is disconnected, and will correctly write to the
disk if the HDD is later reconnected, would this be "hot-swap"?

Actually we have a 3ware 9650SE controller in JBOD mode here, and it
behaves very similarly to what I have described. Two differences:
1- it actually notifies the OS of drive removal (I can see that in
dmesg), but the block device special files are still not deleted from the
system in response. However, if you read from or write to those
block devices, it immediately returns an error (dd stops immediately).
2- you need to use the tw_cli command-line program to rescan the controller
after inserting the drives (when doing this you can decide whether to notify
the OS or not, but this apparently does not make a substantial
difference; it just gets logged in dmesg). The newly inserted drives will
get the old drive letters (block-device files), i.e. the drive letters
stay attached to the physical slots, except when you are
reordering drives; in that case the controller will try to remap
the "units" so that the drive letters follow the HDDs (it uses the
serial numbers to identify the HDDs).

So is this a "hot-swap" controller or not? What is hot-swap more than this?

BTW I'd have a few more questions which are important for us, related to
hot-swapping. I understand these might be offtopic here, but I see you
are knowledgeable over this matter, so I hope I can ask:

With the 9650SE as described before, I would like to reliably flush all
data to a drive before removing the drive manually.
- Do you confirm that "umount" is not enough for flushing the block device?
- Is the bash "sync" command / sync() syscall what I have to use? (after
umount)
- Is the sync() enough anyway?

Thank you
kwick


* Re: Hot-swapping: what's that? (and 3ware 9650SE)
  2009-08-18 19:09 Hot-swapping: what's that? (and 3ware 9650SE) kwick
@ 2009-08-18 20:12 ` Greg Freemyer
  2009-08-18 22:49 ` Drew
  1 sibling, 0 replies; 5+ messages in thread
From: Greg Freemyer @ 2009-08-18 20:12 UTC (permalink / raw)
  To: kwick; +Cc: linux-raid

On Tue, Aug 18, 2009 at 3:09 PM, kwick<kwick@shiftmail.org> wrote:
> Hello raid-ers
> I have been reading previous posts in this ML regarding hot-swapping
> capability of controllers or the lack thereof.
>
> One question remains: ok but what is hot-swap anyway?

I think originally hot-swap meant swapping components with power
connected, i.e. you don't have to power down the whole computer to
swap components.

At least on the sata list they now tend to use hot-swap to mean you
can walk up to a chassis and just pull the drive and plug in another
and the system will recognize it is a new drive automatically.  But
they ONLY mean it at the sata level.

They talk about warm-swap if you have to have user interaction
involved.  So if a controller is warm-swap only you have to manually
trigger a sata bus scan before the new drive is visible.
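
As a rough illustration of the manual step that separates warm-swap from
hot-swap, the sketch below asks every SCSI host to rescan its bus through
sysfs. It assumes a Linux system that exposes
/sys/class/scsi_host/hostN/scan (libata ports show up there as SCSI hosts).
The function name and the sysfs_root parameter are made up for this
example, and a controller with its own management tool may not need or
honour this interface.

```python
import os


def rescan_scsi_hosts(sysfs_root="/sys/class/scsi_host"):
    """Ask every SCSI host under sysfs_root to rescan its bus.

    Writing "- - -" (wildcard channel, target and LUN) to a host's
    `scan` attribute requests a full rescan. Returns the names of the
    hosts that were written to. Needs root on a real system; sysfs_root
    is a parameter only so the function can be tried on a scratch
    directory.
    """
    rescanned = []
    for host in sorted(os.listdir(sysfs_root)):
        scan_file = os.path.join(sysfs_root, host, "scan")
        if not os.path.exists(scan_file):
            continue  # not a host entry, or no scan attribute
        with open(scan_file, "w") as f:
            f.write("- - -\n")
        rescanned.append(host)
    return rescanned
```

If the new drive only becomes visible after a call like this, that is
warm-swap in the sense described above.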

> If I have a controller that does not notify the operating system of
> drive removals and insertions, but will give an error if one writes to the
> device while the HDD is disconnected, and will correctly write to the
> disk if the HDD is later reconnected, would this be "hot-swap"?

It is hot-swap at the hardware level anyway.

I don't know if mdraid and the low level sata drivers are integrated
enough to call it hot-swap from a user's perspective.

i.e. does the user have to manually trigger a reconfig / rebuild, or
does it happen automatically?

> Actually we have a 3ware 9650SE controller in JBOD mode here, and it
> behaves very similarly to what I have described. Two differences:
> 1- it actually notifies the OS of drive removal (I can see that in
> dmesg), but the block device special files are still not deleted from the
> system in response. However, if you read from or write to those
> block devices, it immediately returns an error (dd stops immediately).
> 2- you need to use the tw_cli command-line program to rescan the controller
> after inserting the drives (when doing this you can decide whether to notify
> the OS or not, but this apparently does not make a substantial
> difference; it just gets logged in dmesg). The newly inserted drives will
> get the old drive letters (block-device files), i.e. the drive letters
> stay attached to the physical slots, except when you are
> reordering drives; in that case the controller will try to remap
> the "units" so that the drive letters follow the HDDs (it uses the
> serial numbers to identify the HDDs).
>
> So is this a "hot-swap" controller or not? What is hot-swap more than this?

You had to use tw_cli to rescan the controller.  I would call
that warm-swap even at the hardware level.

> BTW I'd have a few more questions which are important for us, related to
> hot-swapping. I understand these might be offtopic here, but I see you
> are knowledgeable over this matter, so I hope I can ask:
>
> With the 9650SE as described before, I would like to reliably flush all
> data to a drive before removing the drive manually.
> - Do you confirm that "umount" is not enough for flushing the block device?
> - Is the bash "sync" command / sync() syscall what I have to use? (after
> umount)
> - Is the sync() enough anyway?

I personally think of umount as being stronger than just sync.  So I
umount a drive before pulling it out.  I'm fairly confident that umount
does an internal drive sync at the end of the process, so there is no
need to run "umount; sync".

sync by itself should be good enough IF you know for sure no processes
are still writing to the drive and IF you have a journaled filesystem
in use.  Even with a journaled filesystem the unapplied journal
entries will have to be applied the next time you mount the drive.
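
A minimal sketch of the flush step discussed above, assuming Linux and
Python 3.3 or later (for os.sync). The function name is made up for this
example. It is meant to be run after umount, it does not settle whether
umount alone would have been enough, and whether a controller's own write
cache is emptied is outside what this sketch shows.

```python
import os


def flush_device(path):
    """Flush dirty data for `path` (a block device such as /dev/sdX).

    os.sync() asks the kernel to write out all dirty buffers; fsync()
    on the device node then waits for that device's own buffered data
    to be written. Opening a device node normally needs root.
    """
    os.sync()
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

Calling flush_device("/dev/sdX") once nothing else has the device open is
the belt-and-braces version of "umount, then sync".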

fyi: fat has a unique mount option that is much more efficient than
mounting in full "sync" mode, yet keeps the disk cache written to disk
most of the time.  It was written to support thumbdrives that people
want to plug in, use, and unplug with no other user action.

fyi2: I would call thumbdrives that mount automatically with that mount
option truly hot-swap all the way up and down the stack.

>
> Thank you
> kwick

Greg


* Re: Hot-swapping: what's that? (and 3ware 9650SE)
  2009-08-18 19:09 Hot-swapping: what's that? (and 3ware 9650SE) kwick
  2009-08-18 20:12 ` Greg Freemyer
@ 2009-08-18 22:49 ` Drew
  2009-08-19  2:31   ` John Robinson
  1 sibling, 1 reply; 5+ messages in thread
From: Drew @ 2009-08-18 22:49 UTC (permalink / raw)
  To: kwick, Linux RAID Mailing List

> One question remains: ok but what is hot-swap anyway?

A "Hot Swappable Component" refers to any system component which can
be replaced without shutting down the machine. My servers at work have
hot swappable PCI slots, for example. Most often though you have to
tell the OS the device is about to vanish otherwise things break.

It can refer to non-raid controllers that allow you to remove drives
without hanging the bus they attach to. If a drive is in use you still have
to tell the OS it's about to vanish, unmount file systems, etc. I have an
SAS/SATA controller at home that does this.

In the context of RAID, "hot swap" typically refers to any system
which allows drives to be changed out on a live system without having
to interact with the operating system beforehand. IBM's ServeRAID
controllers are a good example. Replacing a failed drive is as simple
as walking over to the server, pulling out the drive identified as
defective, and inserting a replacement. The raid controller recognizes
the replacement and begins to integrate it back into the array within
30secs.

Hope that helps.

-- 
Drew

"Nothing in life is to be feared. It is only to be understood."
--Marie Curie


* Re: Hot-swapping: what's that? (and 3ware 9650SE)
  2009-08-18 22:49 ` Drew
@ 2009-08-19  2:31   ` John Robinson
  2009-08-19  3:28     ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: John Robinson @ 2009-08-19  2:31 UTC (permalink / raw)
  To: Drew; +Cc: Linux RAID Mailing List

On 18/08/2009 23:49, Drew wrote:
>> One question remains: ok but what is hot-swap anyway?
[...]
> In the context of RAID, "hot swap" typically refers to any system
> which allows drives to be changed out on a live system without having
> to interact with the operating system beforehand. IBM's ServeRAID
> controllers are a good example. Replacing a failed drive is as simple
> as walking over to the server, pulling out the drive identified as
> defective, and inserting a replacement. The raid controller recognizes
> the replacement and begins to integrate it back into the array within
> 30secs.

By the above definition, md RAID doesn't do hot swap. My hardware does 
hot swap (ICH10R SATA, SuperMicro drive cage), and I just tried yanking 
one of my drives:

Aug 19 02:21:56 beast kernel: ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Aug 19 02:21:56 beast kernel: ata3: irq_stat 0x00400040, connection status changed
Aug 19 02:21:56 beast kernel: ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
Aug 19 02:21:56 beast kernel: ata3: hard resetting link
Aug 19 02:21:57 beast kernel: ata3: SATA link down (SStatus 0 SControl 300)
Aug 19 02:21:57 beast kernel: ata3: failed to recover some devices, retrying in 5 secs
Aug 19 02:22:02 beast kernel: ata3: hard resetting link
Aug 19 02:22:02 beast kernel: ata3: SATA link down (SStatus 0 SControl 300)
Aug 19 02:22:02 beast kernel: ata3: failed to recover some devices, retrying in 5 secs
Aug 19 02:22:07 beast kernel: ata3: hard resetting link
Aug 19 02:22:07 beast kernel: ata3: SATA link down (SStatus 0 SControl 300)
Aug 19 02:22:07 beast kernel: ata3.00: disabled
Aug 19 02:22:07 beast kernel: sd 2:0:0:0: rejecting I/O to offline device
Aug 19 02:22:08 beast last message repeated 2 times
Aug 19 02:22:08 beast kernel: raid5: Disk failure on sda2, disabling device. Operation continuing on 2 devices
Aug 19 02:22:08 beast kernel: RAID5 conf printout:
Aug 19 02:22:08 beast kernel:  --- rd:3 wd:2 fd:1
Aug 19 02:22:08 beast kernel:  disk 0, o:0, dev:sda2
Aug 19 02:22:08 beast kernel:  disk 1, o:1, dev:sdb2
Aug 19 02:22:08 beast kernel:  disk 2, o:1, dev:sdc2
Aug 19 02:22:08 beast kernel: RAID5 conf printout:
Aug 19 02:22:08 beast kernel:  --- rd:3 wd:2 fd:1
Aug 19 02:22:08 beast kernel:  disk 1, o:1, dev:sdb2
Aug 19 02:22:08 beast kernel:  disk 2, o:1, dev:sdc2
Aug 19 02:22:08 beast kernel: ata3: EH complete
Aug 19 02:22:08 beast kernel: ata3.00: detaching (SCSI 2:0:0:0)

So that all went well. Then I plugged it in again:

Aug 19 02:22:48 beast kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
Aug 19 02:22:48 beast kernel: ata3: irq_stat 0x00000040, connection status changed
Aug 19 02:22:48 beast kernel: ata3: SError: { CommWake DevExch }
Aug 19 02:22:48 beast kernel: ata3: hard resetting link
Aug 19 02:22:55 beast kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 19 02:22:58 beast kernel: ata3: softreset failed (device not ready)
Aug 19 02:22:58 beast kernel: ata3: hard resetting link
Aug 19 02:23:00 beast kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 19 02:23:00 beast kernel: ata3.00: ATA-7: SAMSUNG HD103UJ, 1AA01112, max UDMA7
Aug 19 02:23:00 beast kernel: ata3.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
Aug 19 02:23:00 beast kernel: ata3.00: configured for UDMA/133
Aug 19 02:23:00 beast kernel: ata3: EH complete
Aug 19 02:23:00 beast kernel:   Vendor: ATA       Model: SAMSUNG HD103UJ    Rev: 1AA0
Aug 19 02:23:00 beast kernel:   Type:   Direct-Access     ANSI SCSI revision: 05
Aug 19 02:23:00 beast kernel: SCSI device sdd: 1953525168 512-byte hdwr sectors (1000205 MB)
Aug 19 02:23:00 beast kernel: sdd: Write Protect is off
Aug 19 02:23:00 beast kernel: SCSI device sdd: drive cache: write back
Aug 19 02:23:00 beast kernel: SCSI device sdd: 1953525168 512-byte hdwr sectors (1000205 MB)
Aug 19 02:23:00 beast kernel: sdd: Write Protect is off
Aug 19 02:23:00 beast kernel: SCSI device sdd: drive cache: write back
Aug 19 02:23:00 beast kernel:  sdd: sdd1 sdd2
Aug 19 02:23:00 beast kernel: sd 2:0:0:0: Attached scsi disk sdd
Aug 19 02:23:00 beast kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0

I waited for a bit to see if anything else would happen automatically. 
It didn't, so I manually re-added sdd2 to md1:

Aug 19 02:24:05 beast kernel: md: bind<sdd2>
Aug 19 02:24:05 beast kernel: RAID5 conf printout:
Aug 19 02:24:05 beast kernel:  --- rd:3 wd:2 fd:1
Aug 19 02:24:05 beast kernel:  disk 0, o:1, dev:sdd2
Aug 19 02:24:05 beast kernel:  disk 1, o:1, dev:sdb2
Aug 19 02:24:05 beast kernel:  disk 2, o:1, dev:sdc2
Aug 19 02:24:05 beast kernel: md: syncing RAID array md1
Aug 19 02:24:05 beast kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug 19 02:24:05 beast kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug 19 02:24:05 beast kernel: md: using 128k window, over a total of 976655360 blocks.
Aug 19 02:24:09 beast kernel: md: md1: sync done.
Aug 19 02:24:10 beast kernel: RAID5 conf printout:
Aug 19 02:24:10 beast kernel:  --- rd:3 wd:3 fd:0
Aug 19 02:24:10 beast kernel:  disk 0, o:1, dev:sdd2
Aug 19 02:24:10 beast kernel:  disk 1, o:1, dev:sdb2
Aug 19 02:24:10 beast kernel:  disk 2, o:1, dev:sdc2
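
The "conf printout" lines in these logs can be checked mechanically. Below
is a small sketch that parses them, reading rd, wd and fd as raid disks,
working disks and failed disks, and o as the per-disk "operational" flag.
That reading is my assumption from the logs, and the function names are
made up for this example.

```python
import re

# Matches both the RAID5 form ("disk 0, o:1, dev:sdd2") and the
# RAID1 form ("disk 0, wo:0, o:1, dev:sdd1").
DISK = re.compile(r"disk (\d+), (?:wo:\d+, )?o:(\d+), dev:(\S+)")


def parse_conf_printout(lines):
    """Return (summary, disks) from one block of conf printout lines.

    summary maps counter names to ints, e.g. {"rd": 3, "wd": 2, "fd": 1}.
    disks maps slot number to (device name, operational flag).
    """
    summary, disks = {}, {}
    for line in lines:
        if "---" in line:
            # Only look after the "---" so the timestamp is ignored.
            counters = re.findall(r"(\w+):(\d+)", line.split("---", 1)[1])
            summary = {name: int(value) for name, value in counters}
        else:
            m = DISK.search(line)
            if m:
                disks[int(m.group(1))] = (m.group(3), m.group(2) == "1")
    return summary, disks


def is_degraded(summary):
    """True when fewer disks are working than the array is meant to have."""
    return summary["wd"] < summary["rd"]
```

Fed the block logged at 02:22:08, is_degraded() comes back true; fed the
block logged at 02:24:10 after the resync, it comes back false.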

Then I realised that md0 hadn't noticed sda1 was missing. I re-added 
sdd1 anyway; it said it was adding it, not re-adding it, and this is 
what was logged:

Aug 19 02:24:12 beast kernel: md: export_rdev(sdd1)
Aug 19 02:24:12 beast kernel: md: bind<sdd1>
Aug 19 02:24:29 beast kernel: scsi 2:0:0:0: rejecting I/O to dead device
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208512
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208514
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208516
Aug 19 02:24:29 beast kernel: raid1: sda1: rescheduling sector 208518
Aug 19 02:24:29 beast kernel: scsi 2:0:0:0: rejecting I/O to dead device
Aug 19 02:24:29 beast kernel: scsi 2:0:0:0: rejecting I/O to dead device
Aug 19 02:24:29 beast kernel: raid1: Disk failure on sda1, disabling device.
Aug 19 02:24:29 beast kernel:   Operation continuing on 2 devices
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208512 to another mirror
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208514 to another mirror
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208516 to another mirror
Aug 19 02:24:29 beast kernel: raid1: sdb1: redirecting sector 208518 to another mirror
Aug 19 02:24:29 beast kernel: RAID1 conf printout:
Aug 19 02:24:29 beast kernel:  --- wd:2 rd:3
Aug 19 02:24:29 beast kernel:  disk 0, wo:1, o:0, dev:sda1
Aug 19 02:24:29 beast kernel:  disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:29 beast kernel:  disk 2, wo:0, o:1, dev:sdc1
Aug 19 02:24:29 beast kernel: RAID1 conf printout:
Aug 19 02:24:29 beast kernel:  --- wd:2 rd:3
Aug 19 02:24:29 beast kernel:  disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:29 beast kernel:  disk 2, wo:0, o:1, dev:sdc1
Aug 19 02:24:30 beast kernel: RAID1 conf printout:
Aug 19 02:24:30 beast kernel:  --- wd:2 rd:3
Aug 19 02:24:30 beast kernel:  disk 0, wo:1, o:1, dev:sdd1
Aug 19 02:24:30 beast kernel:  disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:30 beast kernel:  disk 2, wo:0, o:1, dev:sdc1
Aug 19 02:24:30 beast kernel: md: syncing RAID array md0
Aug 19 02:24:30 beast kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug 19 02:24:30 beast kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug 19 02:24:30 beast kernel: md: using 128k window, over a total of 104320 blocks.
Aug 19 02:24:32 beast kernel: md: md0: sync done.
Aug 19 02:24:32 beast kernel: RAID1 conf printout:
Aug 19 02:24:32 beast kernel:  --- wd:3 rd:3
Aug 19 02:24:32 beast kernel:  disk 0, wo:0, o:1, dev:sdd1
Aug 19 02:24:32 beast kernel:  disk 1, wo:0, o:1, dev:sdb1
Aug 19 02:24:32 beast kernel:  disk 2, wo:0, o:1, dev:sdc1

So that all worked perfectly. Now is there a tool out there I can use in 
conjunction with udev (for hotplugging) and md/mdadm to do this 
automatically (including recreating my partition table if it's a fresh 
disc)? I like IBM ServeRAID, and more to the point I would like to be 
able to have rebuilds begin as soon as the operator in the data centre 
has changed a dead drive.

I've just done a spot of Googling etc. and found scsirastools but it 
looks like it's a year since anything was done with it, it talks about 
kernel patches to make it work, it bundles mdadm 1.3.0 and its SRPM 
doesn't build on CentOS 5, so I'm not sure that's quite the thing!

Cheers,

John.


* Re: Hot-swapping: what's that? (and 3ware 9650SE)
  2009-08-19  2:31   ` John Robinson
@ 2009-08-19  3:28     ` NeilBrown
  0 siblings, 0 replies; 5+ messages in thread
From: NeilBrown @ 2009-08-19  3:28 UTC (permalink / raw)
  To: John Robinson; +Cc: Drew, Linux RAID Mailing List

On Wed, August 19, 2009 12:31 pm, John Robinson wrote:
> On 18/08/2009 23:49, Drew wrote:
>>> One question remains: ok but what is hot-swap anyway?
> [...]
>> In the context of RAID, "hot swap" typically refers to any system
>> which allows drives to be changed out on a live system without having
>> to interact with the operating system beforehand. IBM's ServeRAID
>> controllers are a good example. Replacing a failed drive is as simple
>> as walking over to the server, pulling out the drive identified as
>> defective, and inserting a replacement. The raid controller recognizes
>> the replacement and begins to integrate it back into the array within
>> 30secs.
>
> By the above definition, md RAID doesn't do hot swap. My hardware does
> hot swap (ICH10R SATA, SuperMicro drive cage), and I just tried yanking
> one of my drives:

Correct.  mdraid does not do hotswap by that definition.

However, all the bits you need to implement hotswap exist; you just
need some scripts to tie them together.
Specifically, you need to write something that udev runs whenever it
sees a new device.  Somehow it has to decide what should be done with
that device, and do it.
The "do it" part would simply be a call to mdadm, possibly "mdadm --add ..."
The "Somehow ... decide" is the tricky part.  What you want probably isn't
what I want or what someone else wants.
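
To make the "decide" and "do it" split concrete, here is one possible
shape for such an ad-hoc helper. Everything in it is an assumption for
illustration: the POLICY table simply mirrors the layout in John's logs
(sdX1 in md0, sdX2 in md1), the function names are invented, and a real
policy would need to be far more careful about which devices it touches.

```python
import re
import subprocess

# Hypothetical policy: partition number -> array it should be added to.
POLICY = {"1": "/dev/md0", "2": "/dev/md1"}


def mdadm_add_command(kernel_name, policy=POLICY):
    """Return the mdadm argv for a newly appeared partition, or None.

    kernel_name is the device name udev reports, e.g. "sdd2".
    """
    m = re.fullmatch(r"sd[a-z]+(\d+)", kernel_name)
    if m is None or m.group(1) not in policy:
        return None  # whole disk, or a partition the policy does not cover
    return ["mdadm", policy[m.group(1)], "--add", "/dev/" + kernel_name]


def handle_new_device(kernel_name, run=subprocess.run):
    """The "do it" part: run mdadm if the policy names an array."""
    cmd = mdadm_add_command(kernel_name)
    if cmd is not None:
        run(cmd, check=False)
    return cmd
```

A udev rule matching ACTION=="add" on block devices could pass the kernel
name (%k) to a small wrapper script that calls handle_new_device(). The
rule and the wrapper are left out here because they depend on the
distribution.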

If I were to integrate hotswap into md raid, I would probably connect
it to the "mdadm --incremental" functionality.
I would add some sort of "policy" information to /etc/mdadm.conf
telling mdadm that if it found an unrecognised device on some
particular controller (or on any controller), it should add it
to some particular 'spare group'.

That would just be very basic hotswap.  It has been suggested earlier in this
thread that you might like a RAID1 to convert to a RAID5 as soon as a
third drive was available.  This is functionality I would probably
put in "mdadm --monitor" (if at all).  Again there would need to be
some policy rule in /etc/mdadm.conf, maybe describing the ideal
configuration of an array, which would include minimum number of spares,
maximum number of disks in the array, number of disks at which to
switch from RAID5 to RAID6.
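
Purely as a thought experiment, the kind of policy rule described above
could be modelled like this. None of it corresponds to an existing mdadm
feature or to any /etc/mdadm.conf syntax; the class, the field names and
the returned labels are all invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ArrayPolicy:
    min_spares: int  # spares to keep before growing the array
    max_disks: int   # never grow beyond this many active disks
    raid6_at: int    # disk count at which to switch from RAID5 to RAID6


def plan_for_new_disk(policy, level, active_disks, spares):
    """Decide what to do with one newly available disk.

    `spares` counts the spares already attached, not the new disk.
    Returns a label only; acting on it is left to the caller.
    """
    if spares < policy.min_spares:
        return "keep-as-spare"
    if active_disks >= policy.max_disks:
        return "keep-as-spare"
    if level == 5 and active_disks + 1 >= policy.raid6_at:
        return "grow-to-raid6"
    return "grow-array"
```

Trying a few rules like these by hand is the sort of ad-hoc experiment
that would show which knobs actually matter.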

But before doing something like this, I would really like someone
to try it out themselves with a few ad-hoc scripts and report the
results.  There is nothing like real concrete experience to give
useful guidance to designing this sort of functionality.

NeilBrown

