linux-raid.vger.kernel.org archive mirror
* [PATCH 00/24] Clustered MD RAID1
@ 2014-12-18 16:14 Goldwyn Rodrigues
  2014-12-23 17:24 ` Goldwyn Rodrigues
  2015-02-06  2:39 ` NeilBrown
  0 siblings, 2 replies; 10+ messages in thread
From: Goldwyn Rodrigues @ 2014-12-18 16:14 UTC (permalink / raw)
  To: neilb; +Cc: lzhong, linux-raid

Hello,

This is an attempt to make MD-RAID cluster-aware. Device-level
redundancy helps highly available systems improve their uptime.
Currently, the implementation is limited to RAID1, but with further work
(and some positive feedback) we could extend it to other compatible
RAID scenarios.

The design document (first patch) is pretty descriptive of how
the md has been made cluster-aware and how DLM is used to safeguard data
and communication.

This work requires some patches to the mdadm tool [1].

A quick howto:

1. With your corosync/pacemaker-based cluster running, execute:
# mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean /dev/sda /dev/sdb

2. On other nodes, issue:
# mdadm --assemble md0 /dev/sda /dev/sdb
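
To check that the mirror is up on each node, the usual md tooling
applies (nothing cluster-specific here; device names follow the example
above):

# cat /proc/mdstat
# mdadm --detail /dev/md0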

References:
[1] mdadm tool changes: https://github.com/goldwynr/mdadm branch:cluster-md
[2] Patches against stable 3.14: https://github.com/goldwynr/linux branch: cluster-md-devel

Regards,

-- 
Goldwyn


* Re: [PATCH 00/24] Clustered MD RAID1
  2014-12-18 16:14 [PATCH 00/24] Clustered MD RAID1 Goldwyn Rodrigues
@ 2014-12-23 17:24 ` Goldwyn Rodrigues
  2014-12-23 21:47   ` John Stoffel
  2014-12-26 22:35   ` Peter Kieser
  2015-02-06  2:39 ` NeilBrown
  1 sibling, 2 replies; 10+ messages in thread
From: Goldwyn Rodrigues @ 2014-12-23 17:24 UTC (permalink / raw)
  To: neilb; +Cc: lzhong, john, linux-raid

Since I missed some important setup details, I am sending an addendum to
explain in more detail how to set up a clustered md.

Requirements
============
You will need a multi-node cluster set up using corosync and
pacemaker. You can read more about how to set up a cluster in one of the
available guides [3]. Make sure you are using a corosync version greater
than 2.3.1.

You need to have the Distributed Lock Manager (DLM) service running on
all the nodes of the cluster. A simple CRM configuration on my virtual
cluster is:

node 1084752262: node3
node 1084752351: node1
node 1084752358: node2
primitive dlm ocf:pacemaker:controld \
         op start interval=0 timeout=90 \
         op stop interval=0 timeout=100 \
         op monitor interval=60 timeout=60
primitive stone stonith:external/libvirt \
         params hostlist="node1,node2,node3" \
         hypervisor_uri="qemu+tcp://vmhnost/system" \
         op start timeout=120s interval=0 \
         op stop timeout=120s interval=0 \
         op monitor interval=40s timeout=120s \
         meta target-role=Started
group base-group dlm
clone base-clone base-group \
         meta interleave=true target-role=Started

Note: The configuration may be different for your scenario.
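
For example, if you keep the snippet above in a file, it can be loaded
with the crm shell (assuming crmsh is installed; the file name is
arbitrary):

# crm configure load update dlm-stonith.crm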

This work requires some patches to the mdadm tool [1]. The changes in
mdadm are just enough to get clustered md up and running; a couple of
option checks are still missing, so use with care. You will need the
corosync libraries to compile the cluster-related parts of mdadm.

Download and install the patched mdadm.
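
A minimal sketch of doing that from the repository in [1] (assuming git,
a toolchain and the corosync development libraries are already
installed):

# git clone -b cluster-md https://github.com/goldwynr/mdadm.git
# cd mdadm
# make
# make install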

A howto to use cluster-md:

1. With your corosync/pacemaker-based cluster and the DLM running, execute:
# mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean <device1> <device2>

With --bitmap=clustered, mdadm automatically creates multiple bitmaps
(one for each node). The default is currently 4 nodes; you can change it
with the --nodes=<number> option.
mdadm also detects the cluster name, which is required for creating a
clustered md. To specify it explicitly, use --cluster-name=<name>
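
For example, to create the array for a 6-node cluster named "mycluster"
(both values here are purely illustrative):

# mdadm --create md0 --bitmap=clustered --nodes=6 --cluster-name=mycluster \
        --raid-devices=2 --level=mirror --assume-clean <device1> <device2>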

2. On other nodes, issue:
# mdadm --assemble md0 <device1> <device2>

This md device can be used as a regular shared device. There are no
restrictions on the type of filesystem or LVM you can use, as long as
you observe the clustering rules for using a shared device.

There is only one special case compared to a regular non-clustered md,
which is adding a device. This is because all nodes must be able to
"see" the device before it is added.

You can (hot) add a spare device by issuing the regular --add command.

# mdadm --manage /dev/md0 --add <device3>

The other nodes must acknowledge that they see the device by issuing:

# mdadm --manage /dev/md0 --cluster-confirm 2:<device3>

where 2 is the RAID slot number. This step can be automated
using a udev script, because the module sends a udev event when another
node issues an --add. The uevent carries the usual device name
parameters plus:

EVENT=ADD_DEVICE
DEVICE_UUID=<uuid of the device>
RAID_DISK=<slot number>

Usually, you would use blkid to find the device's UUID and then issue
the --cluster-confirm command.
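
As an illustration only, the automation could be wired up roughly as
follows; the rule file name, script path and the exact uevent matching
are assumptions, not something shipped with these patches:

/etc/udev/rules.d/90-md-cluster-confirm.rules (hypothetical):

SUBSYSTEM=="block", ACTION=="change", ENV{EVENT}=="ADD_DEVICE", \
    RUN+="/usr/local/sbin/md-cluster-confirm $env{DEVNAME} $env{DEVICE_UUID} $env{RAID_DISK}"

/usr/local/sbin/md-cluster-confirm (hypothetical helper):

#!/bin/sh
# Arguments passed in by the udev rule above:
#   $1 = md device that raised the uevent
#   $2 = DEVICE_UUID from the uevent
#   $3 = RAID_DISK (slot number) from the uevent
MD_DEV="$1"
UUID="$2"
SLOT="$3"

# Try to locate a local block device carrying the advertised UUID.
# (Adjust the lookup if blkid reports the member UUID differently.)
DEV="$(blkid -U "$UUID" 2>/dev/null)"

if [ -n "$DEV" ]; then
        mdadm --manage "$MD_DEV" --cluster-confirm "$SLOT:$DEV"
else
        mdadm --manage "$MD_DEV" --cluster-confirm "$SLOT:missing"
fi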

If the node does not "see" the device, it must issue the following (or let the request time out):

# mdadm --manage /dev/md0 --cluster-confirm 2:missing

References:
[1] mdadm tool changes: https://github.com/goldwynr/mdadm branch:cluster-md
[2] Patches against stable 3.14: https://github.com/goldwynr/linux branch: cluster-md-devel
[3] A guide to setting up a cluster using corosync/pacemaker:
https://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html
(Note: the basic concepts of this guide should work with any distro)





-- 
Goldwyn


* Re: [PATCH 00/24] Clustered MD RAID1
  2014-12-23 17:24 ` Goldwyn Rodrigues
@ 2014-12-23 21:47   ` John Stoffel
  2014-12-26 22:35   ` Peter Kieser
  1 sibling, 0 replies; 10+ messages in thread
From: John Stoffel @ 2014-12-23 21:47 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: neilb, lzhong, john, linux-raid

This is much improved, thanks for taking the time to listen to my
comments.  But I do have some questions, and some clarifications I think
you need to make.


Goldwyn> A howto to use cluster-md:

Goldwyn> 1. With your corosync/pacemaker based cluster with DLM
Goldwyn> running execute: # mdadm --create md0 --bitmap=clustered
Goldwyn> --raid-devices=2 --level=mirror --assume-clean <device1>
Goldwyn> <device2>

What are <device1> and <device2> in terms of block devices?  Are
they local?  Shared across the cluster?  Are they iSCSI or FibreChannel
block devices accessible from all nodes at the same time?  It's not
clear, and this is a *key* issue to address and make sure end users
understand.  

Goldwyn> With the option of --bitmap=clustered, it automatically
Goldwyn> creates multiple bitmaps (one for each node). The default
Goldwyn> currently is set to 4 nodes.  However, you can set it by
Goldwyn> --nodes=<number> option.  It also detects the cluster name
Goldwyn> which is required for creating a clustered md. In order to
Goldwyn> specify that, use --cluster-name=<name>

Goldwyn> 2. On other nodes, issue: # mdadm --assemble md0 <device1>
Goldwyn> <device2>

Same comment here, what are the limits/restrictions/expectations of
devices?  

Goldwyn> This md device can be used as a regular shared device. There
Goldwyn> are no restrictions on the type of filesystem or LVM you can
Goldwyn> use, as long as you observe clustering rules of using a
Goldwyn> shared device.

Another place where you need to be more explicit.  For example, if I'm
running ext3, I assume my cluster needs to be running in an
Active/Passive mode, so that only one node is accessing the filesystem
at a time, correct?

But if I'm running glusterfs, I could be using the block device and
the filesystem in an Active/Active mode?  

Goldwyn> There is only one special case as opposed to a regular
Goldwyn> non-clustered md, which is to add a device. This is because
Goldwyn> all nodes should be able to "see" the device before adding
Goldwyn> it.

So this little snippet implies that the answer to my question above
about device restrictions is that you MUST be able to see each device
from all nodes, correct?  

Goldwyn> You can (hot) add a spare device by issuing the regular --add
Goldwyn> command.

Goldwyn> # mdadm --manage /dev/md0 --add <device3>

Again, this device needs to be visible to all nodes.  

Goldwyn> The other nodes must acknowledge that they see the device by
Goldwyn> issuing:

Goldwyn> # mdadm --manage /dev/md0 --cluster-confirm 2:<device3>

Goldwyn> where 2 is the raid slot number. This step can be automated
Goldwyn> using a udev script because the module sends a udev event
Goldwyn> when another node issues an --add. The uevent is with the
Goldwyn> usual device name parameters and:

Goldwyn> EVENT=ADD_DEVICE DEVICE_UUID=<uuid of the device>
Goldwyn> RAID_DISK=<slot number>

Goldwyn> Usually, you would use blkid to find the devices uuid and
Goldwyn> issue the --cluster-confirm command.

Goldwyn> If the node does not "see" the device, it must issue (or
Goldwyn> timeout):

Goldwyn> # mdadm --manage /dev/md0 --cluster-confirm 2:missing

This seems wrong: if the other node doesn't see the new device, how
will the other nodes know whether it's there or not?  What is the
communication channel between nodes?  Do they use the DLM?  


And what are the performance impacts and benefits of using this setup?
Why not use a cluster aware filesystem which already does RAID in some
manner?  I'm honestly clueless about glusterFS or other Linux based
cluster filesystems, haven't had any need to run them, nor the time to
play with them.

I think you've got a good start now on the documentation, but you need
to expand more on the why, and on what is expected of the infrastructure.
Just a note that all devices must be visible on all nodes would be a
key thing to include. 

Thanks,
John


* Re: [PATCH 00/24] Clustered MD RAID1
  2014-12-23 17:24 ` Goldwyn Rodrigues
  2014-12-23 21:47   ` John Stoffel
@ 2014-12-26 22:35   ` Peter Kieser
  2014-12-26 22:49     ` Goldwyn Rodrigues
  2014-12-27  0:47     ` Marcus
  1 sibling, 2 replies; 10+ messages in thread
From: Peter Kieser @ 2014-12-26 22:35 UTC (permalink / raw)
  To: Goldwyn Rodrigues, neilb; +Cc: lzhong, john, linux-raid



On 2014-12-23 9:24 AM, Goldwyn Rodrigues wrote:
> This md device can be used as a regular shared device. There are no
> restrictions on the type of filesystem or LVM you can use, as long as
> you observe clustering rules of using a shared device.

This is very vague. I feel like this is a huge layering violation. LVM 
or filesystem on top of mdadm should not have to be aware of what is 
going on below. Otherwise, I feel like other clustered filesystems that 
include RAID are better suited. As it stands, what are the benefits of 
using this over $CLUSTER_FS?

-Peter




* Re: [PATCH 00/24] Clustered MD RAID1
  2014-12-26 22:35   ` Peter Kieser
@ 2014-12-26 22:49     ` Goldwyn Rodrigues
  2014-12-27  0:47     ` Marcus
  1 sibling, 0 replies; 10+ messages in thread
From: Goldwyn Rodrigues @ 2014-12-26 22:49 UTC (permalink / raw)
  To: Peter Kieser, neilb; +Cc: lzhong, john, linux-raid

Hi Peter,

On 12/26/2014 04:35 PM, Peter Kieser wrote:
>
> On 2014-12-23 9:24 AM, Goldwyn Rodrigues wrote:
>> This md device can be used as a regular shared device. There are no
>> restrictions on the type of filesystem or LVM you can use, as long as
>> you observe clustering rules of using a shared device.
>
> This is very vague. I feel like this is a huge layering violation. LVM
> or filesystem on top of mdadm should not have to be aware of what is
> going on below.

How is this a layering violation? I said there are _no_ restrictions as
long as you follow the clustering rules for using a shared device. The
device can be used as a regular device, with the filesystem unaware of
how the data is handled underneath. The only thing the md device ensures
is that the data (between mirrors) is consistent across the cluster.

By clustering rules I mean rules such as:
	- Use a clustered filesystem
	- Use cLVM (as opposed to usual LVM) on top of MD-RAID.
	- Use local filesystems in active/passive mode
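
Purely as an illustration of the last rule (the device and mount point
are made up), active/passive use of a local filesystem simply means that
only one node has it mounted at any given time:

node1# mkfs.ext4 /dev/md0
node1# mount /dev/md0 /srv/data
   ... node1 is taken out of service or fails ...
node2# mount /dev/md0 /srv/data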

> Otherwise, I feel like other clustered filesystems that
> include RAID are better suited.

such as?


> As it stands, what are the benefits of
> using this over $CLUSTER_FS?

This is not a cluster filesystem. This is RAID which has been made
cluster-aware for your shared storage. It provides redundancy at the
device level for a clustered filesystem. Your shared cluster storage is a
single point of failure (in case hardware-based RAID is not used), and
this software-based solution avoids making it a single point of failure.

An example usage scenario (though not the only one) could be using it
across two iSCSI devices to ensure higher availability, as sketched below.
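
A rough sketch of that scenario (the target names and portal addresses
below are invented; substitute your own):

On every node, log in to both iSCSI targets:

# iscsiadm -m node -T iqn.2014-12.example:lun1 -p 192.168.1.10 --login
# iscsiadm -m node -T iqn.2014-12.example:lun2 -p 192.168.1.11 --login

Then create the mirror on one node and assemble it on the others exactly
as in the howto above, using the two iSCSI disks as <device1> and
<device2> (preferably via their stable /dev/disk/by-path names).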

-- 
Goldwyn


* Re: [PATCH 00/24] Clustered MD RAID1
  2014-12-26 22:35   ` Peter Kieser
  2014-12-26 22:49     ` Goldwyn Rodrigues
@ 2014-12-27  0:47     ` Marcus
  1 sibling, 0 replies; 10+ messages in thread
From: Marcus @ 2014-12-27  0:47 UTC (permalink / raw)
  To: Peter Kieser; +Cc: Goldwyn Rodrigues, Neil Brown, lzhong, john, linux-raid

I don't think it's any different than exposing the same block device
to multiple servers, like an iSCSI or FC volume. It just allows you to
also treat a RAID device like you would any other SAN-type block
device by keeping the MD metadata consistent.  You still have to
either put something on top of it that will be aware of the fact that
other hosts can see the block device and maintains its own consistency
(clustered FS), or be sure to only use it on one host at a time, again
like you would treat any other SAN volume.

On Fri, Dec 26, 2014 at 2:35 PM, Peter Kieser <peter@kieser.ca> wrote:
>
> On 2014-12-23 9:24 AM, Goldwyn Rodrigues wrote:
>>
>> This md device can be used as a regular shared device. There are no
>> restrictions on the type of filesystem or LVM you can use, as long as
>> you observe clustering rules of using a shared device.
>
>
> This is very vague. I feel like this is a huge layering violation. LVM or
> filesystem on top of mdadm should not have to be aware of what is going on
> below. Otherwise, I feel like other clustered filesystems that include RAID
> are better suited. As it stands, what are the benefits of using this over
> $CLUSTER_FS?
>
> -Peter
>


* Re: [PATCH 00/24] Clustered MD RAID1
  2014-12-18 16:14 [PATCH 00/24] Clustered MD RAID1 Goldwyn Rodrigues
  2014-12-23 17:24 ` Goldwyn Rodrigues
@ 2015-02-06  2:39 ` NeilBrown
  2015-02-10 17:00   ` Goldwyn Rodrigues
  1 sibling, 1 reply; 10+ messages in thread
From: NeilBrown @ 2015-02-06  2:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: lzhong, linux-raid


On Thu, 18 Dec 2014 10:14:57 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:

> Hello,
> 
> This is an attempt to make MD-RAID cluster-aware. The advantage of
> redundancy can help highly available systems to improve uptime.
> Currently, the implementation is limited to RAID1 but with further work
> (and some positive feedback), we could extend this to other compatible
> RAID scenarios.
> 
> The design document (first patch) is pretty descriptive of how
> the md has been made cluster-aware and how DLM is used to safeguard data
> and communication.
> 
> This work requires some patches to the mdadm tool [1]
> 
> A quick howto:
> 
> 1. With your corosync/pacemaker based cluster running execute:
> # mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean /dev/sda /dev/sdb
> 
> 2. On other nodes, issue:
> # mdadm --assemble md0 /dev/sda /dev/sdb
> 
> References:
> [1] mdadm tool changes: https://github.com/goldwynr/mdadm branch:cluster-md
> [2] Patches against stable 3.14: https://github.com/goldwynr/linux branch: cluster-md-devel
> 
> Regards,
> 


hi Goldwyn,
 thanks for these - and sorry for the long delay.   Lots of leave over
 southern summer, and the lots of email etc to deal with.

This patch set is very close and I am tempted to just apply it and then
fix things up with subsequent patches.  In order to allow that, could you
please:
  - rebase against current upstream
  - fix the checkpatch.pl errors and warnings.
    The "WARNING: line over 80 characters" are often a judgement call
    so I'm not particularly worried about those.  Most, if not all, of
    the others should be followed just to have consistent layout.

Then I'll queue them up for 3.21, providing I don't find anything that would
hurt non-cluster usage ....
 On that topic: why initialise rv to -EINVAL in "metadata_update sends
 message...".  That looks wrong.

I noticed that a number of times a patch will revert something that a
previous patch added.  It would be much nicer to fold these changes back into
the original patch.  Often this is just extra blank lines, but occasionally
variable names are changed (md -> mddev).  It should be given the final name
when introduced.  Every chunk in every patch should be directly relevant to
that patch.

Some other issues, that could possibly be fixed up afterwards:

 - Is a clustername 64 bytes or 63 bytes?  I would have thought 64,
   but the use of strlcpy makes it 63 plus a nul.  Is that really what is
   wanted?

 - Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add
   "default n" to Kconfig, and possibly add a WARN() if anyone tries to use
   the code.

 - I'm a bit concerned about the behaviour on node failure.
   When a node fails, two things must happen w.r.t the bits in that node's
   bitmap.
   1/ The corresponding regions of the array need to be resynced.  You do have
      code to do this.
   2/ Other nodes must avoid read-balancing on those regions until the
      resync has completed.  

   You do have code for this second bit, but it looks wrong.  It avoids
   read-balancing if ->area_resyncing().  That isn't sufficient.
   The "area_resyncing" is always (I assume) a relatively small region of
   the array which will be completely resynced quite quickly.  It must be
   because writes are blocked to this area.  However the region in which
   we must disable re-balancing can be much larger.  It covers *all* bits
   that are set in any unsynced bitmap.  So it isn't just the area that is
   currently being synced, but all areas that will be synced.

 - I think md_reload_sb() might be too simple.  It probably should check that
   nothing serious has changed.  The "mddev->raid_disks = 0" looks suspicious.
   I'll have to think about this a bit more.

That's all I can see for now.  I'll have another look once I have it all in my tree.

Thanks,
NeilBrown



* Re: [PATCH 00/24] Clustered MD RAID1
  2015-02-06  2:39 ` NeilBrown
@ 2015-02-10 17:00   ` Goldwyn Rodrigues
  2015-02-11  4:17     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Goldwyn Rodrigues @ 2015-02-10 17:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: lzhong, linux-raid

Hi Neil,

>
>
> hi Goldwyn,
>   thanks for these - and sorry for the long delay.   Lots of leave over
>   southern summer, and the lots of email etc to deal with.
>
> This patch set is very close and I am tempted to just apply it and then
> fix things up with subsequent patches.  In order to allow that, could you
> please:
>    - rebase against current upstream
>    - fix the checkpatch.pl errors and warnings.
>      The "WARNING: line over 80 characters" are often a judgement call
>      so I'm not particularly worried about those.  Most, if not all, of
>      the others should be followed just to have consistent layout.

Done.

>
> Then I'll queue them up for 3.21, providing I don't find anything that would
> hurt non-cluster usage ....
>   On that topic: why initialise rv to -EINVAL in "metadata_update sends
>   message...".  That looks wrong.

Yes, this is fixed.

>
> I noticed that a number of times a patch will revert something that a
> previous patch added.  It would be much nicer to fold these changes back into
> the original patch.  Often this is just extra blank lines, but occasionally
> variable names are changed (md -> mddev).  It should be given the final name
> when introduced.  Every chunk in every patch should be directly relevant to
> that patch.

I have cross-checked this and I did not find anything with respect to 
variable names. I did some cleanup with respect to the code though.

There is one instance where I have introduced a variable,
cluster_setup_done, and then removed it later. I think it is required for
understanding the patch and for a smooth transition to the subsequent
patches. However, if you want me to remove that part aggressively, I
should be able to do so.

>
> Some other issues, that could possibly be fixed up afterwards:
>
>   - Is a clustername 64 bytes or 63 bytes?  I would have thought 64,
>     but the use of strlcpy make is 63 plus a nul.  Is that really what is
>     wanted?

Yes, it is 64 bytes. I haven't fixed this as yet.

>
>   - Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add
>     "default n" to Kconfig, and possible add a WARN() if anyone tries to use
>     the code.

Done. Added pr_warn while loading the module.

>
>   - I'm a bit concerned about the behaviour on node failure.
>     When a node fails, two things must happen w.r.t the bits in that node's
>     bitmap.
>     1/ The corresponding regions of the array need to be resynced.  You do have
>        code to do this.
>     2/ Other nodes must avoid read-balancing on those regions until the
>        resync has completed.
>
>     You do have code for this second bit, but it looks wrong.  It avoids
>     read-balancing if ->area_resyncing().  That isn't sufficient.
>     The "area_resyncing" is always (I assume) a relatively small region of
>     the array which will be completely resynced quite quickly.  It must be
>     because writes are blocked to this area.  However the region in which
>     we must disable re-balancing can be much larger.  It covers *all* bits
>     that are set in any unsynced bitmap.  So it isn't just the area that is
>     currently being synced, but all areas that will be synced.

What are unsynced bitmaps? Are they bitmaps associated with an active
node, or dirty bitmaps of dead nodes? If it is the former, I agree this
is not enough. If it is the latter, all nodes maintain a linked list of
all the nodes which are currently performing a resync (there can be more
than one, because multiple nodes may have died simultaneously). One node
performs the recovery (aka bitmap resync) of exactly one "dead" node at a
time. area_resyncing() goes through all the nodes which are performing a
resync.

>
>   - I think md_reload_sb() might be too simple.  It probably should check that
>     nothing serious has changed.  The "mddev->raid_disks = 0" look suspicious.
>     I'll have to think about this a bit more.

Yes, I get that feeling as well. However, I am not sure how to perform 
an exact comparison to understand what has changed. Perhaps it needs a 
new flag?

>
> That's all I can see for now.  I'll have another look once I have it all in my tree.
>

I have put all the changes in my git:
https://github.com/goldwynr/linux
The branch cluster-md is against the latest upstream. I also performed a 
small sanity test to check everything is working properly.

Let me know if you want me to repost the entire patchset to the
mailing list.


-- 
Goldwyn


* Re: [PATCH 00/24] Clustered MD RAID1
  2015-02-10 17:00   ` Goldwyn Rodrigues
@ 2015-02-11  4:17     ` NeilBrown
  2015-02-11 17:25       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2015-02-11  4:17 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: lzhong, linux-raid


On Tue, 10 Feb 2015 11:00:31 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:

> Hi Neil,
> 
> >
> >
> > hi Goldwyn,
> >   thanks for these - and sorry for the long delay.   Lots of leave over
> >   southern summer, and the lots of email etc to deal with.
> >
> > This patch set is very close and I am tempted to just apply it and then
> > fix things up with subsequent patches.  In order to allow that, could you
> > please:
> >    - rebase against current upstream
> >    - fix the checkpatch.pl errors and warnings.
> >      The "WARNING: line over 80 characters" are often a judgement call
> >      so I'm not particularly worried about those.  Most, if not all, of
> >      the others should be followed just to have consistent layout.
> 
> Done.

ERROR: code indent should use tabs where possible

  when you use spaces, they show up in red for me!!
Ditto for
WARNING: please, no space before tabs

WARNING: quoted string split across lines
   It really is best to fix those, even though it makes the line long.
   When grepping to find out where a message comes from, it is very annoying
   if the grep fails because the line was split.

WARNING: Missing a blank line after declarations
   Worth fixing I think.

WARNING: printk() should include KERN_ facility level
   Definitely should be fixed, maybe make it pr_warn()??

> 
> >
> > Then I'll queue them up for 3.21, providing I don't find anything that would
> > hurt non-cluster usage ....
> >   On that topic: why initialise rv to -EINVAL in "metadata_update sends
> >   message...".  That looks wrong.
> 
> Yes, this is fixed.
> 
> >
> > I noticed that a number of times a patch will revert something that a
> > previous patch added.  It would be much nicer to fold these changes back into
> > the original patch.  Often this is just extra blank lines, but occasionally
> > variable names are changed (md -> mddev).  It should be given the final name
> > when introduced.  Every chunk in every patch should be directly relevant to
> > that patch.
> 
> I have cross-checked this and I did not find anything with respect to 
> variable names. I did some cleanup with respect to the code though.
> 
> There is one instance where I have used a variable: cluster_setup_done 
> and then removed it. I think this is required to understand the patch 
> and a smooth transition to subsequent patches. However, if you want me 
> to aggressively remove that part, I should be able to do that.

No, "cluster_setup_done" makes sense. It is scaffolding that you later need
to remove.

I'm probably letting my OCD tendencies get carried away, but these are
some of the things that I noticed:


 "Introduce md_cluster_info" moves 'bast' to a new location in
 dlm_lock_resource for no apparent reason.
 Also 'leave()' has a parameter which is changed from 'md' to 'mddev',
 as does 'join'.

 "Add node recovery callbacks" adds a comment to the 'nodes' field of 'struct
 mddev'.  Why not add the comment when the field is added?
 Oh, and it mis-spells "unmber".


 In "Gather on-going resync information of other nodes" you have:

 static struct md_cluster_operations cluster_ops = {
        .join   = join,
        .leave  = leave,
-       .slot_number = slot_number
+       .slot_number = slot_number,
+       .resync_info_update = resync_info_update
 };


It is really best to put a comma at the end of each entry, even the last.
Then the patch would have been:

 static struct md_cluster_operations cluster_ops = {
        .join   = join,
        .leave  = leave,
        .slot_number = slot_number,
+       .resync_info_update = resync_info_update,
 };

which is much nicer to read.  You finally get this right in
  "Suspend writes in RAID1 if within range" :-)

> 
> >
> > Some other issues, that could possibly be fixed up afterwards:
> >
> >   - Is a clustername 64 bytes or 63 bytes?  I would have thought 64,
> >     but the use of strlcpy make is 63 plus a nul.  Is that really what is
> >     wanted?
> 
> Yes, it is 64 bytes. I haven't fixed this as yet.
> 
> >
> >   - Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add
> >     "default n" to Kconfig, and possible add a WARN() if anyone tries to use
> >     the code.
> 
> Done. Added pr_warn while loading the module.
> 
> >
> >   - I'm a bit concerned about the behaviour on node failure.
> >     When a node fails, two things must happen w.r.t the bits in that node's
> >     bitmap.
> >     1/ The corresponding regions of the array need to be resynced.  You do have
> >        code to do this.
> >     2/ Other nodes must avoid read-balancing on those regions until the
> >        resync has completed.
> >
> >     You do have code for this second bit, but it looks wrong.  It avoids
> >     read-balancing if ->area_resyncing().  That isn't sufficient.
> >     The "area_resyncing" is always (I assume) a relatively small region of
> >     the array which will be completely resynced quite quickly.  It must be
> >     because writes are blocked to this area.  However the region in which
> >     we must disable re-balancing can be much larger.  It covers *all* bits
> >     that are set in any unsynced bitmap.  So it isn't just the area that is
> >     currently being synced, but all areas that will be synced.
> 
> What are unsynced bitmaps? Are they bitmaps which are associated with an 
> active node or dirty bitmaps with dead nodes? If it is the former, I 
> agree this is not enough. If it is latter, all nodes maintain a linked 
> list of all the nodes which are currently performing resync (probably 
> because of multiple nodes died simultaneously). One node performs the 
> recovery (aka bitmap resync) of exactly one "dead" node at a time. 
> area_resyncing goes through all the nodes which are performing resync.

The latter - bitmaps associated with a dead node.
Bitmaps associated with an active node contain transient information, and the
filesystem will ensure that it never reads from somewhere that someone else
might be writing (or if it does, it will know that the data cannot be
trusted).

I looked at the code again, and discovered that I had the problem backwards.
But there is still a problem.

When any node is resyncing, your code blocks writes for the entire span
of the array, from wherever the resync is up to, to the end of the device.
So a write to a location near the end of the device will hang until all
resyncs finish.  This could be a much longer time than you would like
writes to hang for.

I think that the resyncing host should only report that it is resyncing a
relatively small range of the array, maybe 100Meg.  Maybe 1G.
Then that would only block access to that small part of the array, which
should clear in just a few seconds at most.

This information on the range being synced is not enough to limit
read-balancing.
I imagined that *every* node would read the bitmap for a failed node, and
would use that information to limit read-balancing.  There are some
complexities in this though.


So the current code isn't "wrong" exactly, but I think it could cause
severe delays in some (unusual) circumstances.




> 
> >
> >   - I think md_reload_sb() might be too simple.  It probably should check that
> >     nothing serious has changed.  The "mddev->raid_disks = 0" look suspicious.
> >     I'll have to think about this a bit more.
> 
> Yes, I get that feeling as well. However, I am not sure how to perform 
> an exact comparison to understand what has changed. Perhaps it needs a 
> new flag?

Probably.  I haven't thought much about it.

> 
> >
> > That's all I can see for now.  I'll have another look once I have it all in my tree.
> >
> 
> I have put all the changes in my git:
> https://github.com/goldwynr/linux
> The branch cluster-md is against the latest upstream. I also performed a 
> small sanity test to check everything is working properly.
> 
> Let me know if you would want me to repost the entire patchset to the 
> mailing list.
> 
> 

I don't think there is any need for that.
I won't pull it in just yet - to give you a chance to resolve the last of the
checkpatch problems.
Then I'll double check that there is no risk to non-cluster users and try to
get it into -next after 3.20-rc2 is out.

Thanks,
NeilBrown



* Re: [PATCH 00/24] Clustered MD RAID1
  2015-02-11  4:17     ` NeilBrown
@ 2015-02-11 17:25       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 10+ messages in thread
From: Goldwyn Rodrigues @ 2015-02-11 17:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: lzhong, linux-raid

Hi Neil,

On 02/10/2015 10:17 PM, NeilBrown wrote:
> On Tue, 10 Feb 2015 11:00:31 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
>> Hi Neil,
>>
>>>
>>>
>>> hi Goldwyn,
>>>    thanks for these - and sorry for the long delay.   Lots of leave over
>>>    southern summer, and the lots of email etc to deal with.
>>>
>>> This patch set is very close and I am tempted to just apply it and then
>>> fix things up with subsequent patches.  In order to allow that, could you
>>> please:
>>>     - rebase against current upstream
>>>     - fix the checkpatch.pl errors and warnings.
>>>       The "WARNING: line over 80 characters" are often a judgement call
>>>       so I'm not particularly worried about those.  Most, if not all, of
>>>       the others should be followed just to have consistent layout.
>>
>> Done.
>
> ERROR: code indent should use tabs where possible
>
>    when you use spaces, they show up in red for me!!
> Ditto for
> WARNING: please, no space before tabs
>
> WARNING: quoted string split across lines
>     It really is best to fix those, even though it makes the line long.
>     When grepping to find out where a message comes from, it is very annoying
>     if the grep fails because the line was split.
>
> WARNING: Missing a blank line after declarations
>     Worth fixing I think.
>
> WARNING: printk() should include KERN_ facility level
>     Definitely should be fixed, maybe make it pr_warn()??
>

Ok, my review was not good. I have incorporated all of these and put 
them in the same branch. Sorry for the trouble.

>   "Introduce md_cluster_info" moves 'bast' to a new location in
>   dlm_lock_resource for no apparent reason.
>   Also 'leave()' has a parameter which is changed from 'md' to 'mddev',
>   as does 'join'.
>
>   "Add node recovery callbacks" adds a comment to the 'nodes' field of 'struct
>   mddev'.  Why not add the comment when the field is added?
>   Oh, and it mis-spells "unmber".
>
>
>   In "Gather on-going resync information of other nodes" you have:
>
>   static struct md_cluster_operations cluster_ops = {
>          .join   = join,
>          .leave  = leave,
> -       .slot_number = slot_number
> +       .slot_number = slot_number,
> +       .resync_info_update = resync_info_update
>   };
>
>
> It is really best to put a comma at the end of each entry, even the last.
> Then the patch would have been:
>
>   static struct md_cluster_operations cluster_ops = {
>          .join   = join,
>          .leave  = leave,
>          .slot_number = slot_number,
> +       .resync_info_update = resync_info_update,
>   };
>
> which is much nicer to read.  You finally get this right in
>    "Suspend writes in RAID1 if within range" :-)

Ok, I have fixed them in all the patches.

>>>
>>>    - I'm a bit concerned about the behaviour on node failure.
>>>      When a node fails, two things must happen w.r.t the bits in that node's
>>>      bitmap.
>>>      1/ The corresponding regions of the array need to be resynced.  You do have
>>>         code to do this.
>>>      2/ Other nodes must avoid read-balancing on those regions until the
>>>         resync has completed.
>>>
>>>      You do have code for this second bit, but it looks wrong.  It avoids
>>>      read-balancing if ->area_resyncing().  That isn't sufficient.
>>>      The "area_resyncing" is always (I assume) a relatively small region of
>>>      the array which will be completely resynced quite quickly.  It must be
>>>      because writes are blocked to this area.  However the region in which
>>>      we must disable re-balancing can be much larger.  It covers *all* bits
>>>      that are set in any unsynced bitmap.  So it isn't just the area that is
>>>      currently being synced, but all areas that will be synced.
>>
>> What are unsynced bitmaps? Are they bitmaps which are associated with an
>> active node or dirty bitmaps with dead nodes? If it is the former, I
>> agree this is not enough. If it is latter, all nodes maintain a linked
>> list of all the nodes which are currently performing resync (probably
>> because of multiple nodes died simultaneously). One node performs the
>> recovery (aka bitmap resync) of exactly one "dead" node at a time.
>> area_resyncing goes through all the nodes which are performing resync.
>
> The later - bitmaps associated with a dead node.
> Bitmaps associated with an active node contain transient information, and the
> filesystem will ensure that it never reads from somewhere that someone else
> might be writing (or if it does, it will know that the data cannot be
> trusted).
>
> I looked at the code again, and discovered that I had the problem backwards.
> But there is still a problem.
>
> when any node is resyncing, your code blocks writes for the entire span of the
> array from where-ever the resync is up to, to the end of the device.
> So a write to a location near the end of the  device will hang until all
> resyncs finish.  This could be a much longer time than you would like writes
> to hang for.
>
> I think that the resyncing host should only report that it is resyncing a
> relatively small range of the array, maybe 100Meg.  Maybe 1G.
> Then that would only block access to that small part of the array, which
> should clear in just a few seconds at most.
>
> This information on the range being synced is not enough to limit
> read-balancing.
> I imagined that *every* node would read the bitmap for a failed node, and
> would use that information to limit read-balancing.  There are some
> complexities in this though.
>
>
> So the current code isn't "wrong" exactly, but it think it could cause sever
> delays in some (unusual) circumstances.


I thought about this as well and this is what I think we should do for a 
node failure:

Attempt a PW lock (DLM_LKF_NOQUEUE) on the failed node's bitmap. If
successful:
	a. Read the bitmap.
	b. Update the bitmap LVB with the resync details.
	c. Send the RESYNCING message.
	d. Perform the resync and update the LVB as we proceed. (This means
	   we will have to write another resync function, independent of
	   md_do_sync.)

area_resyncing:
	a. Check the resyncing node list. If no entry is found, or the I/O
	   is out of range, return 0.
	b. If an entry exists with an overlapping I/O range, take a CR lock
	   on bitmap-<nodenum>.
	c. Read the LVB for the limits and update the limits in the
	   resyncing node list.
	d. If the read/write range is not within the new range details,
	   unlock the CR lock and return 0.
	e. If the read/write is within range, read the bitmap and check for
	   interfering bitmaps. If not interfering, return 0; else return 1.
	   Perhaps we could reserve the bitmap.

Using steps c and d may make it a wee bit faster, without the need to go
to the disk.

The problem I see in this approach is cascading failures. What should we
do if the node performing the resync for a recently failed node also
fails? How do we detect that the failed node was performing a resync for
another node? One option is that we could add that information to the
RESYNCING message. Can you think of something better?

The way it is done now is that we take the bitmap lock aggressively
(without NOQUEUE), and the resyncing node copies the bits into its own
bitmap and clears the failed node's bitmap before releasing the bitmap
lock.

Finally, should we put some sort of versioning in the messages for 
rolling upgrades? (One node is on a higher kernel version with respect 
to the rest of the nodes)

-- 
Goldwyn

