linux-raid.vger.kernel.org archive mirror
* Using linux software raid (mdadm) in a shared-disk cluster.
@ 2009-04-14  8:58 John Hughes
  2009-04-22  8:54 ` Goswin von Brederlow
  0 siblings, 1 reply; 4+ messages in thread
From: John Hughes @ 2009-04-14  8:58 UTC (permalink / raw)
  To: linux-raid

I've got a little shared disk cluster (parallel SCSI, external DELL 
PV210 disk cabinet).

I've used linux raid to make a nice RAID10 on the external disks.

I can access this from either machine in the cluster, only one at a time 
of course, it works very well and I'm happy.

Now I'm running XEN and I want to be able to migrate a XEN domU from one 
machine to the other while the domU is using the RAID10 device.  I can 
make this "work" using XEN's migration hooks - it calls a script when it 
has stopped the running domU and I can start the raid device on the 
destination node, ready for the arrival of the domU.

There is one small problem - I can't stop the RAID10 on the source node 
until the domU has finished, so it seems to me there is a window that 
could lead to data corruption:

Source node                             Destination node

mdadm --assemble /dev/md0 ....
Start migrate
domU suspended
call migration script
               \-------------------->   mdadm --assemble /dev/md0 ...
                                        domU starts running
...
domU destroyed
mdadm --stop /dev/md0


It seems to me that the source node could still be messing with the 
bitmap and resyncing between the moment the destination node
starts the RAID10 and the source node stops it[*].

Am I right?  Is there a window?

If there is a window it could be closed if there were some kind of mdadm 
--freeze command to stop the sync activity, which could be run on the 
source node before doing the assemble on the destination node.
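Later kernels expose something close to this via sysfs: writing "frozen" 
to an array's sync_action attribute halts resync activity.  A sketch of 
the source-node half (the sysfs path is the assumed md layout; mdadm 
itself has no --freeze option):

```shell
#!/bin/sh
# Sketch only: mdadm has no --freeze command; on kernels whose md sysfs
# interface accepts the "frozen" sync_action value, the source node
# could halt resync activity before the destination assembles.
freeze_resync() {
    attr="/sys/block/$1/md/sync_action"     # assumed sysfs layout
    if [ -w "$attr" ]; then
        echo frozen > "$attr"               # suspend resync/recovery
    else
        echo "no md sysfs attribute for $1" >&2
        return 1
    fi
}
# e.g. freeze_resync md0, then assemble on the destination node,
# then mdadm --stop /dev/md0 on the source.
```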

([*] - imagine some block is marked unsynced in the bitmap.  The 
destination node does the assemble, so now its in-memory bitmap has the 
block marked.  The source node syncs the block and updates the on-disk 
bitmap.  Now the destination node happens to write that block; it 
thinks the block is already marked unsynced on disk, so it doesn't 
bother updating the bitmap.  If the destination node crashes at this 
point there is a block on the disk that is unsynced, but the bitmap 
claims it's in sync.)


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Using linux software raid (mdadm) in a shared-disk cluster.
  2009-04-14  8:58 Using linux software raid (mdadm) in a shared-disk cluster John Hughes
@ 2009-04-22  8:54 ` Goswin von Brederlow
  2009-04-23  9:22   ` John Hughes
  0 siblings, 1 reply; 4+ messages in thread
From: Goswin von Brederlow @ 2009-04-22  8:54 UTC (permalink / raw)
  To: John Hughes; +Cc: linux-raid

John Hughes <john@Calva.COM> writes:

> I've got a little shared disk cluster (parallel SCSI, external DELL
> PV210 disk cabinet).
>
> I've used linux raid to make a nice RAID10 on the external disks.
>
> I can access this from either machine in the cluster, only one at a
> time of course, it works very well and I'm happy.
>
> Now I'm running XEN and I want to be able to migrate a XEN domU from
> one machine to the other while the domU is using the RAID10 device.  I
> can make this "work" using XEN's migration hooks - it calls a script
> when it has stopped the running domU and I can start the raid device
> on the destination node, ready for the arrival of the domU.
>
> There is one small problem - I can't stop the RAID10 on the source
> node until the domU has finished, so it seems to me there is a window
> that could lead to data corruption:

Can you put it into read-only mode?

> Source node                             Destination node
>
> mdadm --assemble /dev/md0 ....
> Start migrate
> domU suspended
> call migration script
>               \-------------------->   mdadm --assemble /dev/md0 ...
>                                        domU starts running
> ...
> domU destroyed
> mdadm --stop /dev/md0
>
>
> It seems to me that the source node could still be messing with the
> bitmap and resyncing between the moment the destination node
> starts the RAID10 and the source node stops it[*].
>
> Am I right?  Is there a window?

Certainly.

> If there is a window it could be closed if there were some kind of
> mdadm --freeze command to stop the sync activity, which could be run
> on the source node before doing the assemble on the destination node.

> ([*] - imagine some block is marked unsynced in the bitmap.  The
> destination node does the assemble, so now its in-memory bitmap has
> the block marked.  The source node syncs the block and updates the
> on-disk bitmap.  Now the destination node happens to write that block;
> it thinks the block is already marked unsynced on disk, so it doesn't
> bother updating the bitmap.  If the destination node crashes at this
> point there is a block on the disk that is unsynced, but the bitmap
> claims it's in sync.)

Source node                             Destination node

read block X for sync
                                        Write block X
                                        Write mirror of block X
write mirror of block X

Now block X and its mirror have different content while being marked
in sync.

I'm not even sure putting a raid in read-only mode will stop
background syncing.



As an alternative approach how about running the raid10 inside the
domU?

MfG
        Goswin


* Re: Using linux software raid (mdadm) in a shared-disk cluster.
  2009-04-22  8:54 ` Goswin von Brederlow
@ 2009-04-23  9:22   ` John Hughes
  2009-04-23 20:30     ` Goswin von Brederlow
  0 siblings, 1 reply; 4+ messages in thread
From: John Hughes @ 2009-04-23  9:22 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-raid

Goswin von Brederlow wrote:
> John Hughes <john@Calva.COM> writes:
>   
>> I've got a little shared disk cluster (parallel SCSI, external DELL
>> PV210 disk cabinet).
>>
>> I've used linux raid to make a nice RAID10 on the external disks.
>>
>> I can access this from either machine in the cluster, only one at a
>> time of course, it works very well and I'm happy.
>>
>> Now I'm running XEN and I want to be able to migrate a XEN domU from
>> one machine to the other while the domU is using the RAID10 device.  I
>> can make this "work" using XEN's migration hooks - it calls a script
>> when it has stopped the running domU and I can start the raid device
>> on the destination node, ready for the arrival of the domU.
>>
>> There is one small problem - I can't stop the RAID10 on the source
>> node until the domU has finished, so it seems to me there is a window
>> that could lead to data corruption:
>>     
>
> Can you put it into read-only mode?
>   

How do I do that?

Ah, by "mdadm --readonly"[*].

> I'm not even sure putting a raid in read-only mode will stop
> background syncing.
>   

Hang on a sec, I'll try it.

... run bonnie on mounted /dev/md0, kill it, umount the device ...

# cat /proc/mdstat
Personalities : [raid10] 
md0 : active raid10 sda3[0] sde2[4](S) sdd3[3] sdc3[2] sdb3[1]
      32067456 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 66/123 pages [264KB], 128KB chunk

# mdadm --readonly /dev/md0
# cat /proc/mdstat 
Personalities : [raid10] 
md0 : active (read-only) raid10 sda3[0] sde2[4](S) sdd3[3] sdc3[2] sdb3[1]
      32067456 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 66/123 pages [264KB], 128KB chunk

... wait a bit
# cat /proc/mdstat
Personalities : [raid10] 
md0 : active (read-only) raid10 sda3[0] sde2[4](S) sdd3[3] sdc3[2] sdb3[1]
      32067456 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 0/123 pages [0KB], 128KB chunk

No, it seems to keep on syncing.  Bummer.
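For scripting that check, the dirty-page count can be pulled out of 
mdstat-style text; a small sketch (the parsing is my own, assuming the 
mdstat layout shown above):

```shell
#!/bin/sh
# Sketch: extract the dirty-bitmap page count for one array from
# mdstat-style text on stdin ("bitmap: 66/123 pages" -> 66); a count
# of 0 means the in-memory bitmap has been flushed clean.
dirty_pages() {
    awk -v md="$1" '
        $1 == md          { inmd = 1; next }  # found our array stanza
        inmd && /bitmap:/ { split($2, a, "/"); print a[1]; exit }'
}
# usage: dirty_pages md0 < /proc/mdstat
```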

> As an alternative approach how about running the raid10 inside the
> domU?
>   

Didn't want to do that as it requires exporting loadsa devices to the 
domU instead of one.

But maybe it's the only way (without hacking the md code).


* Re: Using linux software raid (mdadm) in a shared-disk cluster.
  2009-04-23  9:22   ` John Hughes
@ 2009-04-23 20:30     ` Goswin von Brederlow
  0 siblings, 0 replies; 4+ messages in thread
From: Goswin von Brederlow @ 2009-04-23 20:30 UTC (permalink / raw)
  To: John Hughes; +Cc: Goswin von Brederlow, linux-raid

John Hughes <john@Calva.COM> writes:

> Goswin von Brederlow wrote:
>> John Hughes <john@Calva.COM> writes:
>>
>>> I've got a little shared disk cluster (parallel SCSI, external DELL
>>> PV210 disk cabinet).
>>>
>>> I've used linux raid to make a nice RAID10 on the external disks.
>>>
>>> I can access this from either machine in the cluster, only one at a
>>> time of course, it works very well and I'm happy.
>>>
>>> Now I'm running XEN and I want to be able to migrate a XEN domU from
>>> one machine to the other while the domU is using the RAID10 device.  I
>>> can make this "work" using XEN's migration hooks - it calls a script
>>> when it has stopped the running domU and I can start the raid device
>>> on the destination node, ready for the arrival of the domU.
>>>
>>> There is one small problem - I can't stop the RAID10 on the source
>>> node until the domU has finished, so it seems to me there is a window
>>> that could lead to data corruption:
>>>
>>
>> Can you put it into read-only mode?
>>
>
> How do I do that?
>
> Ah, by "mdadm --readonly"[*].
>
>> I'm not even sure putting a raid in read-only mode will stop
>> background syncing.
>>
>
> Hang on a sec, I'll try it.
>
> ... run bonnie on mounted /dev/md0, kill it, umount the device ...
>
> # cat /proc/mdstat
> Personalities : [raid10] 
> md0 : active raid10 sda3[0] sde2[4](S) sdd3[3] sdc3[2] sdb3[1]
>       32067456 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>       bitmap: 66/123 pages [264KB], 128KB chunk
>
> # mdadm --readonly /dev/md0
> # cat /proc/mdstat
> Personalities : [raid10] 
> md0 : active (read-only) raid10 sda3[0] sde2[4](S) sdd3[3] sdc3[2] sdb3[1]
>       32067456 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>       bitmap: 66/123 pages [264KB], 128KB chunk
>
> ... wait a bit
> # cat /proc/mdstat
> Personalities : [raid10] 
> md0 : active (read-only) raid10 sda3[0] sde2[4](S) sdd3[3] sdc3[2] sdb3[1]
>       32067456 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>       bitmap: 0/123 pages [0KB], 128KB chunk
>
> No, it seems to keep on syncing.  Bummer.
>
>> As an alternative approach how about running the raid10 inside the
>> domU?
>>
>
> Didn't want to do that as it requires exporting loadsa devices to the
> domU instead of one.
>
> But maybe it's the only way (without hacking the md code).

You could wait for the syncs to complete after the domU is
suspended. But that could mean hours if there is a resync or repair
running. For those you really need a "pause" state.
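The waiting could at least be scripted; a sketch, assuming the md sysfs 
layout where sync_action reads "idle" when the array is quiet:

```shell
#!/bin/sh
# Sketch: block until the named array reports no sync/recovery activity,
# assuming the md sysfs layout where sync_action reads "idle" when quiet.
wait_md_idle() {
    attr="/sys/block/$1/md/sync_action"
    [ -r "$attr" ] || return 1          # array not assembled on this node
    while [ "$(cat "$attr")" != "idle" ]; do
        sleep 5                         # a full resync can take hours
    done
}
# usage: wait_md_idle md0 && mdadm --stop /dev/md0
```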

MfG
        Goswin

