* [PATCH 00/24] Clustered MD RAID1
From: Goldwyn Rodrigues @ 2014-12-18 16:14 UTC
To: neilb; +Cc: lzhong, linux-raid

Hello,

This is an attempt to make MD-RAID cluster-aware. The added redundancy helps highly available systems improve uptime. Currently the implementation is limited to RAID1, but with further work (and some positive feedback) we could extend it to other compatible RAID scenarios.

The design document (first patch) describes in detail how md has been made cluster-aware and how DLM is used to safeguard data and communication.

This work requires some patches to the mdadm tool [1].

A quick howto:

1. With your corosync/pacemaker based cluster running, execute:

   # mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean /dev/sda /dev/sdb

2. On the other nodes, issue:

   # mdadm --assemble md0 /dev/sda /dev/sdb

References:
[1] mdadm tool changes: https://github.com/goldwynr/mdadm branch: cluster-md
[2] Patches against stable 3.14: https://github.com/goldwynr/linux branch: cluster-md-devel

Regards,

--
Goldwyn
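For reference, a minimal sketch of fetching and building the patched mdadm from the branch in [1]; the build and install steps below simply follow the standard mdadm Makefile and are assumptions, not part of the patch set (the corosync development libraries must be installed first, as the follow-up mail below notes):

   # git clone https://github.com/goldwynr/mdadm.git
   # cd mdadm
   # git checkout cluster-md
   # make
   # make install

Note that "make install" replaces the distribution's mdadm binary, so keep a backup copy if you want to be able to roll back.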
* Re: [PATCH 00/24] Clustered MD RAID1
From: Goldwyn Rodrigues @ 2014-12-23 17:24 UTC
To: neilb; +Cc: lzhong, john, linux-raid

Since I missed some important setup details, I am sending an addendum to explain in more detail how to set up a cluster-md.

Requirements
============
You require a multi-node cluster setup using corosync and pacemaker. You can read more about how to set up a cluster from one of the guides available [3]. Make sure that you are using a corosync version greater than 2.3.1.

You need to have the Distributed Lock Manager (DLM) service running on all the nodes of the cluster. A simple CRM configuration on my virtual cluster is:

node 1084752262: node3
node 1084752351: node1
node 1084752358: node2
primitive dlm ocf:pacemaker:controld \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=60 timeout=60
primitive stone stonith:external/libvirt \
        params hostlist="node1,node2,node3" hypervisor_uri="qemu+tcp://vmhnost/system" \
        op start timeout=120s interval=0 \
        op stop timeout=120s interval=0 \
        op monitor interval=40s timeout=120s \
        meta target-role=Started
group base-group dlm
clone base-clone base-group \
        meta interleave=true target-role=Started

Note: The configuration may be different for your scenario.

This work requires some patches for the mdadm tool [1]. The changes in mdadm are basic enough to get the clustered md up and running; a couple of option checks are still missing, so use with care. You need the corosync libraries to compile the cluster-related code in mdadm. Download and install the patched mdadm.

A howto to use cluster-md:

1. With your corosync/pacemaker based cluster with DLM running, execute:

   # mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean <device1> <device2>

   With --bitmap=clustered, mdadm automatically creates multiple bitmaps (one for each node). The default is currently 4 nodes, but you can change it with the --nodes=<number> option. mdadm also detects the cluster name, which is required for creating a clustered md; to specify it explicitly, use --cluster-name=<name>.

2. On the other nodes, issue:

   # mdadm --assemble md0 <device1> <device2>

This md device can be used as a regular shared device. There are no restrictions on the type of filesystem or LVM you can use, as long as you observe the clustering rules for using a shared device.

There is only one special case compared to a regular non-clustered md: adding a device. This is because all nodes should be able to "see" the device before it is added.

You can (hot) add a spare device by issuing the regular --add command:

   # mdadm --manage /dev/md0 --add <device3>

The other nodes must acknowledge that they see the device by issuing:

   # mdadm --manage /dev/md0 --cluster-confirm 2:<device3>

where 2 is the raid slot number. This step can be automated with a udev script, because the module sends a udev event when another node issues an --add. The uevent carries the usual device name parameters plus:

EVENT=ADD_DEVICE
DEVICE_UUID=<uuid of the device>
RAID_DISK=<slot number>

Usually, you would use blkid to find the device's UUID and then issue the --cluster-confirm command.
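A rough sketch of such a udev hook is shown below. The rule file path, the helper name, the assumption that the event arrives as a "change" uevent on the md device, and the exact blkid matching are all illustrative guesses, not part of the patch set:

   # /etc/udev/rules.d/90-md-cluster-confirm.rules  (hypothetical)
   ACTION=="change", KERNEL=="md*", ENV{EVENT}=="ADD_DEVICE", RUN+="/usr/local/sbin/md-cluster-confirm $env{DEVNAME} $env{DEVICE_UUID} $env{RAID_DISK}"

   #!/bin/sh
   # /usr/local/sbin/md-cluster-confirm  (hypothetical helper)
   # Confirm (or reject) a device that another node hot-added to the clustered array.
   MD_DEV="$1" UUID="$2" SLOT="$3"
   # Look for a local block device carrying the UUID announced in the uevent.
   DEV=$(blkid -t UUID="$UUID" -o device 2>/dev/null | head -n 1)
   if [ -n "$DEV" ]; then
       mdadm --manage "$MD_DEV" --cluster-confirm "$SLOT:$DEV"
   else
       # This node cannot see the device; report it as missing.
       # In practice you might want to retry for a while before giving up.
       mdadm --manage "$MD_DEV" --cluster-confirm "$SLOT:missing"
   fi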
If the node does not "see" the device, it must issue (or timeout): # mdadm --manage /dev/md0 --cluster-confirm 2:missing References: [1] mdadm tool changes: https://github.com/goldwynr/mdadm branch:cluster-md [2] Patches against stable 3.14: https://github.com/goldwynr/linux branch: cluster-md-devel [3] A guide to setup a cluster using corosync/pacemaker https://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html (Note: The basic concepts of this guide should work with any distro) -- -- Goldwyn ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 00/24] Clustered MD RAID1
From: John Stoffel @ 2014-12-23 21:47 UTC
To: Goldwyn Rodrigues; +Cc: neilb, lzhong, john, linux-raid

This is much improved, thanks for taking the time to listen to my comments. But I do have some questions and clarifications I think you need to make.

Goldwyn> A howto to use cluster-md:

Goldwyn> 1. With your corosync/pacemaker based cluster with DLM running, execute:
Goldwyn>    # mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean <device1> <device2>

What are <device1> and <device2> in terms of block devices? Are they local? Shared across the cluster? Are they iSCSI or FibreChannel block devices accessible from all nodes at the same time? It's not clear, and this is a *key* issue to address and make sure end users understand.

Goldwyn> With --bitmap=clustered, mdadm automatically creates multiple bitmaps (one for each node). The default is currently 4 nodes, but you can change it with the --nodes=<number> option. mdadm also detects the cluster name, which is required for creating a clustered md; to specify it explicitly, use --cluster-name=<name>.

Goldwyn> 2. On the other nodes, issue:
Goldwyn>    # mdadm --assemble md0 <device1> <device2>

Same comment here: what are the limits/restrictions/expectations of the devices?

Goldwyn> This md device can be used as a regular shared device. There are no restrictions on the type of filesystem or LVM you can use, as long as you observe the clustering rules for using a shared device.

Another place where you need to be more explicit. For example, if I'm running ext3, I assume my cluster needs to be running in an Active/Passive mode, so that only one node is accessing the filesystem at a time, correct? But if I'm running glusterfs, I could be using the block device and the filesystem in an Active/Active mode?

Goldwyn> There is only one special case compared to a regular non-clustered md: adding a device. This is because all nodes should be able to "see" the device before it is added.

So this little snippet implies, regarding my question above about device restrictions, that you MUST be able to see each device from all nodes, correct?

Goldwyn> You can (hot) add a spare device by issuing the regular --add command:
Goldwyn>    # mdadm --manage /dev/md0 --add <device3>

Again, this device needs to be visible to all nodes.

Goldwyn> The other nodes must acknowledge that they see the device by issuing:
Goldwyn>    # mdadm --manage /dev/md0 --cluster-confirm 2:<device3>

Goldwyn> where 2 is the raid slot number. This step can be automated with a udev script, because the module sends a udev event when another node issues an --add. The uevent carries the usual device name parameters plus:

Goldwyn> EVENT=ADD_DEVICE
Goldwyn> DEVICE_UUID=<uuid of the device>
Goldwyn> RAID_DISK=<slot number>

Goldwyn> Usually, you would use blkid to find the device's UUID and then issue the --cluster-confirm command.

Goldwyn> If the node does not "see" the device, it must issue (or time out):
Goldwyn>    # mdadm --manage /dev/md0 --cluster-confirm 2:missing

This seems wrong: if the other node doesn't see the new device, how will the other nodes know whether it's there or not?
What is the communication channel between nodes? Do they use the DLM?

And what are the performance impacts and benefits of using this setup? Why not use a cluster-aware filesystem which already does RAID in some manner? I'm honestly clueless about glusterFS or other Linux based cluster filesystems; I haven't had any need to run them, nor the time to play with them.

I think you've got a good start now on the documentation, but you need to expand more on the why and on what is expected as infrastructure. Just a note that all devices must be visible on all nodes would be a key thing to include.

Thanks,
John
* Re: [PATCH 00/24] Clustered MD RAID1
From: Peter Kieser @ 2014-12-26 22:35 UTC
To: Goldwyn Rodrigues, neilb; +Cc: lzhong, john, linux-raid

On 2014-12-23 9:24 AM, Goldwyn Rodrigues wrote:
> This md device can be used as a regular shared device. There are no restrictions on the type of filesystem or LVM you can use, as long as you observe the clustering rules for using a shared device.

This is very vague. I feel like this is a huge layering violation. LVM or filesystem on top of mdadm should not have to be aware of what is going on below. Otherwise, I feel like other clustered filesystems that include RAID are better suited. As it stands, what are the benefits of using this over $CLUSTER_FS?

-Peter
* Re: [PATCH 00/24] Clustered MD RAID1
From: Goldwyn Rodrigues @ 2014-12-26 22:49 UTC
To: Peter Kieser, neilb; +Cc: lzhong, john, linux-raid

Hi Peter,

On 12/26/2014 04:35 PM, Peter Kieser wrote:
> On 2014-12-23 9:24 AM, Goldwyn Rodrigues wrote:
>> This md device can be used as a regular shared device. There are no restrictions on the type of filesystem or LVM you can use, as long as you observe the clustering rules for using a shared device.
>
> This is very vague. I feel like this is a huge layering violation. LVM or filesystem on top of mdadm should not have to be aware of what is going on below.

How is this a layering violation? I said there are _no_ restrictions as long as you follow the clustering rules of using a shared device. This device can be used as a regular device, with the filesystem being unaware of how the data is handled underneath. The only thing the md device ensures is that the data (between mirrors) is consistent across the cluster.

By clustering rules I mean rules such as:
 - Use a clustered filesystem.
 - Use cLVM (as opposed to the usual LVM) on top of MD-RAID.
 - Use local filesystems in active/passive mode.

> Otherwise, I feel like other clustered filesystems that include RAID are better suited.

such as?

> As it stands, what are the benefits of using this over $CLUSTER_FS?

This is not a cluster filesystem. This is a RAID that has been made cluster-aware for your shared storage. It provides redundancy at the device level for a clustered filesystem. Your shared cluster storage is a single point of failure (if hardware-based RAID is not used), and this software-based solution avoids making it one. An example usage scenario (not limited to this) could be using it across two iSCSI devices to ensure higher availability.

--
Goldwyn
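That scenario could look roughly like the sketch below, mirroring two iSCSI LUNs and putting a clustered filesystem on top. The target names, portal addresses, device names and the choice of OCFS2 are illustrative assumptions, not part of the patch set:

   On every node, log in to both iSCSI targets so the same two LUNs are visible everywhere:
   # iscsiadm -m node -T iqn.2014-12.example:lun0 -p 192.168.122.10 --login
   # iscsiadm -m node -T iqn.2014-12.example:lun1 -p 192.168.122.11 --login

   On one node, mirror the two LUNs with a clustered bitmap (a 3-node cluster is assumed):
   # mdadm --create md0 --bitmap=clustered --nodes=3 --raid-devices=2 --level=mirror /dev/sdc /dev/sdd

   On the remaining nodes:
   # mdadm --assemble md0 /dev/sdc /dev/sdd

   A clustered filesystem (or cLVM) goes on top, created once and then mounted on all nodes:
   # mkfs.ocfs2 -N 3 /dev/md0
   # mount /dev/md0 /srv/shared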
* Re: [PATCH 00/24] Clustered MD RAID1
From: Marcus @ 2014-12-27 0:47 UTC
To: Peter Kieser; +Cc: Goldwyn Rodrigues, Neil Brown, lzhong, john, linux-raid

I don't think it's any different than exposing the same block device to multiple servers, like an iSCSI or FC volume. It just allows you to also treat a RAID device like you would any other SAN-type block device by keeping the MD metadata consistent. You still have to either put something on top of it that will be aware of the fact that other hosts can see the block device and maintains its own consistency (a clustered FS), or be sure to only use it on one host at a time, again like you would treat any other SAN volume.

On Fri, Dec 26, 2014 at 2:35 PM, Peter Kieser <peter@kieser.ca> wrote:
> On 2014-12-23 9:24 AM, Goldwyn Rodrigues wrote:
>> This md device can be used as a regular shared device. There are no restrictions on the type of filesystem or LVM you can use, as long as you observe the clustering rules for using a shared device.
>
> This is very vague. I feel like this is a huge layering violation. LVM or filesystem on top of mdadm should not have to be aware of what is going on below. Otherwise, I feel like other clustered filesystems that include RAID are better suited. As it stands, what are the benefits of using this over $CLUSTER_FS?
>
> -Peter
* Re: [PATCH 00/24] Clustered MD RAID1
From: NeilBrown @ 2015-02-06 2:39 UTC
To: Goldwyn Rodrigues; +Cc: lzhong, linux-raid

On Thu, 18 Dec 2014 10:14:57 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:

> Hello,
>
> This is an attempt to make MD-RAID cluster-aware. The added redundancy helps highly available systems improve uptime. Currently the implementation is limited to RAID1, but with further work (and some positive feedback) we could extend it to other compatible RAID scenarios.
>
> The design document (first patch) describes in detail how md has been made cluster-aware and how DLM is used to safeguard data and communication.
>
> This work requires some patches to the mdadm tool [1].
>
> A quick howto:
>
> 1. With your corosync/pacemaker based cluster running, execute:
>    # mdadm --create md0 --bitmap=clustered --raid-devices=2 --level=mirror --assume-clean /dev/sda /dev/sdb
>
> 2. On the other nodes, issue:
>    # mdadm --assemble md0 /dev/sda /dev/sdb
>
> References:
> [1] mdadm tool changes: https://github.com/goldwynr/mdadm branch: cluster-md
> [2] Patches against stable 3.14: https://github.com/goldwynr/linux branch: cluster-md-devel
>
> Regards,

hi Goldwyn,
thanks for these - and sorry for the long delay. Lots of leave over southern summer, and lots of email etc to deal with.

This patch set is very close and I am tempted to just apply it and then fix things up with subsequent patches. In order to allow that, could you please:
- rebase against current upstream
- fix the checkpatch.pl errors and warnings.
  The "WARNING: line over 80 characters" are often a judgement call so I'm not particularly worried about those. Most, if not all, of the others should be followed just to have consistent layout.

Then I'll queue them up for 3.21, providing I don't find anything that would hurt non-cluster usage ....
On that topic: why initialise rv to -EINVAL in "metadata_update sends message...". That looks wrong.

I noticed that a number of times a patch will revert something that a previous patch added. It would be much nicer to fold these changes back into the original patch. Often this is just extra blank lines, but occasionally variable names are changed (md -> mddev). It should be given the final name when introduced. Every chunk in every patch should be directly relevant to that patch.

Some other issues, that could possibly be fixed up afterwards:

- Is a clustername 64 bytes or 63 bytes? I would have thought 64, but the use of strlcpy makes it 63 plus a nul. Is that really what is wanted?

- Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add "default n" to Kconfig, and possibly add a WARN() if anyone tries to use the code.

- I'm a bit concerned about the behaviour on node failure.
  When a node fails, two things must happen w.r.t the bits in that node's bitmap.
  1/ The corresponding regions of the array need to be resynced. You do have code to do this.
  2/ Other nodes must avoid read-balancing on those regions until the resync has completed.

  You do have code for this second bit, but it looks wrong. It avoids read-balancing if ->area_resyncing(). That isn't sufficient.
The "area_resyncing" is always (I assume) a relatively small region of the array which will be completely resynced quite quickly. It must be because writes are blocked to this area. However the region in which we must disable re-balancing can be much larger. It covers *all* bits that are set in any unsynced bitmap. So it isn't just the area that is currently being synced, but all areas that will be synced. - I think md_reload_sb() might be too simple. It probably should check that nothing serious has changed. The "mddev->raid_disks = 0" look suspicious. I'll have to think about this a bit more. That's all I can see for now. I'll have another look once I have it all in my tree. Thanks, NeilBrown [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 00/24] Clustered MD RAID1
From: Goldwyn Rodrigues @ 2015-02-10 17:00 UTC
To: NeilBrown; +Cc: lzhong, linux-raid

Hi Neil,

> hi Goldwyn,
> thanks for these - and sorry for the long delay. Lots of leave over southern summer, and lots of email etc to deal with.
>
> This patch set is very close and I am tempted to just apply it and then fix things up with subsequent patches. In order to allow that, could you please:
> - rebase against current upstream
> - fix the checkpatch.pl errors and warnings.
>   The "WARNING: line over 80 characters" are often a judgement call so I'm not particularly worried about those. Most, if not all, of the others should be followed just to have consistent layout.

Done.

> Then I'll queue them up for 3.21, providing I don't find anything that would hurt non-cluster usage ....
> On that topic: why initialise rv to -EINVAL in "metadata_update sends message...". That looks wrong.

Yes, this is fixed.

> I noticed that a number of times a patch will revert something that a previous patch added. It would be much nicer to fold these changes back into the original patch. Often this is just extra blank lines, but occasionally variable names are changed (md -> mddev). It should be given the final name when introduced. Every chunk in every patch should be directly relevant to that patch.

I have cross-checked this and I did not find anything with respect to variable names. I did some cleanup with respect to the code though.

There is one instance where I have used a variable, cluster_setup_done, and then removed it. I think this is required to understand the patch and for a smooth transition to subsequent patches. However, if you want me to aggressively remove that part, I should be able to do that.

> Some other issues, that could possibly be fixed up afterwards:
>
> - Is a clustername 64 bytes or 63 bytes? I would have thought 64, but the use of strlcpy makes it 63 plus a nul. Is that really what is wanted?

Yes, it is 64 bytes. I haven't fixed this as yet.

> - Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add "default n" to Kconfig, and possibly add a WARN() if anyone tries to use the code.

Done. Added pr_warn while loading the module.

> - I'm a bit concerned about the behaviour on node failure.
>   When a node fails, two things must happen w.r.t the bits in that node's bitmap.
>   1/ The corresponding regions of the array need to be resynced. You do have code to do this.
>   2/ Other nodes must avoid read-balancing on those regions until the resync has completed.
>
>   You do have code for this second bit, but it looks wrong. It avoids read-balancing if ->area_resyncing(). That isn't sufficient.
>   The "area_resyncing" is always (I assume) a relatively small region of the array which will be completely resynced quite quickly. It must be because writes are blocked to this area. However the region in which we must disable re-balancing can be much larger. It covers *all* bits that are set in any unsynced bitmap. So it isn't just the area that is currently being synced, but all areas that will be synced.

What are unsynced bitmaps? Are they bitmaps which are associated with an active node or dirty bitmaps with dead nodes? If it is the former, I agree this is not enough.
If it is the latter, all nodes maintain a linked list of all the nodes which are currently performing resync (probably because multiple nodes died simultaneously). One node performs the recovery (aka bitmap resync) of exactly one "dead" node at a time. area_resyncing goes through all the nodes which are performing resync.

> - I think md_reload_sb() might be too simple. It probably should check that nothing serious has changed. The "mddev->raid_disks = 0" looks suspicious. I'll have to think about this a bit more.

Yes, I get that feeling as well. However, I am not sure how to perform an exact comparison to understand what has changed. Perhaps it needs a new flag?

> That's all I can see for now. I'll have another look once I have it all in my tree.

I have put all the changes in my git: https://github.com/goldwynr/linux
The branch cluster-md is against the latest upstream. I also performed a small sanity test to check everything is working properly.

Let me know if you would want me to repost the entire patchset to the mailing list.

--
Goldwyn
* Re: [PATCH 00/24] Clustered MD RAID1
From: NeilBrown @ 2015-02-11 4:17 UTC
To: Goldwyn Rodrigues; +Cc: lzhong, linux-raid

On Tue, 10 Feb 2015 11:00:31 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:

> Hi Neil,
>
>> hi Goldwyn,
>> thanks for these - and sorry for the long delay. Lots of leave over southern summer, and lots of email etc to deal with.
>>
>> This patch set is very close and I am tempted to just apply it and then fix things up with subsequent patches. In order to allow that, could you please:
>> - rebase against current upstream
>> - fix the checkpatch.pl errors and warnings.
>>   The "WARNING: line over 80 characters" are often a judgement call so I'm not particularly worried about those. Most, if not all, of the others should be followed just to have consistent layout.
>
> Done.

ERROR: code indent should use tabs where possible
  when you use spaces, they show up in red for me!! Ditto for
WARNING: please, no space before tabs

WARNING: quoted string split across lines
  It really is best to fix those, even though it makes the line long. When grepping to find out where a message comes from, it is very annoying if the grep fails because the line was split.

WARNING: Missing a blank line after declarations
  Worth fixing I think.

WARNING: printk() should include KERN_ facility level
  Definitely should be fixed, maybe make it pr_warn()??

>> Then I'll queue them up for 3.21, providing I don't find anything that would hurt non-cluster usage ....
>> On that topic: why initialise rv to -EINVAL in "metadata_update sends message...". That looks wrong.
>
> Yes, this is fixed.
>
>> I noticed that a number of times a patch will revert something that a previous patch added. It would be much nicer to fold these changes back into the original patch. Often this is just extra blank lines, but occasionally variable names are changed (md -> mddev). It should be given the final name when introduced. Every chunk in every patch should be directly relevant to that patch.
>
> I have cross-checked this and I did not find anything with respect to variable names. I did some cleanup with respect to the code though.
>
> There is one instance where I have used a variable, cluster_setup_done, and then removed it. I think this is required to understand the patch and for a smooth transition to subsequent patches. However, if you want me to aggressively remove that part, I should be able to do that.

No, "cluster_setup_done" makes sense. It is scaffolding that you later need to remove. I'm probably letting my OCD tendencies get carried away, but these are some of the things that I noticed:

"Introduce md_cluster_info" moves 'bast' to a new location in dlm_lock_resource for no apparent reason. Also 'leave()' has a parameter which is changed from 'md' to 'mddev', as does 'join'.

"Add node recovery callbacks" adds a comment to the 'nodes' field of 'struct mddev'. Why not add the comment when the field is added? Oh, and it mis-spells "unmber".
In "Gather on-going resync information of other nodes" you have: static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, - .slot_number = slot_number + .slot_number = slot_number, + .resync_info_update = resync_info_update }; It is really best to put a comma at the end of each entry, even the last. Then the patch would have been: static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, .slot_number = slot_number, + .resync_info_update = resync_info_update, }; which is much nicer to read. You finally get this right in "Suspend writes in RAID1 if within range" :-) > > > > > Some other issues, that could possibly be fixed up afterwards: > > > > - Is a clustername 64 bytes or 63 bytes? I would have thought 64, > > but the use of strlcpy make is 63 plus a nul. Is that really what is > > wanted? > > Yes, it is 64 bytes. I haven't fixed this as yet. > > > > > - Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add > > "default n" to Kconfig, and possible add a WARN() if anyone tries to use > > the code. > > Done. Added pr_warn while loading the module. > > > > > - I'm a bit concerned about the behaviour on node failure. > > When a node fails, two things must happen w.r.t the bits in that node's > > bitmap. > > 1/ The corresponding regions of the array need to be resynced. You do have > > code to do this. > > 2/ Other nodes must avoid read-balancing on those regions until the > > resync has completed. > > > > You do have code for this second bit, but it looks wrong. It avoids > > read-balancing if ->area_resyncing(). That isn't sufficient. > > The "area_resyncing" is always (I assume) a relatively small region of > > the array which will be completely resynced quite quickly. It must be > > because writes are blocked to this area. However the region in which > > we must disable re-balancing can be much larger. It covers *all* bits > > that are set in any unsynced bitmap. So it isn't just the area that is > > currently being synced, but all areas that will be synced. > > What are unsynced bitmaps? Are they bitmaps which are associated with an > active node or dirty bitmaps with dead nodes? If it is the former, I > agree this is not enough. If it is latter, all nodes maintain a linked > list of all the nodes which are currently performing resync (probably > because of multiple nodes died simultaneously). One node performs the > recovery (aka bitmap resync) of exactly one "dead" node at a time. > area_resyncing goes through all the nodes which are performing resync. The later - bitmaps associated with a dead node. Bitmaps associated with an active node contain transient information, and the filesystem will ensure that it never reads from somewhere that someone else might be writing (or if it does, it will know that the data cannot be trusted). I looked at the code again, and discovered that I had the problem backwards. But there is still a problem. when any node is resyncing, your code blocks writes for the entire span of the array from where-ever the resync is up to, to the end of the device. So a write to a location near the end of the device will hang until all resyncs finish. This could be a much longer time than you would like writes to hang for. I think that the resyncing host should only report that it is resyncing a relatively small range of the array, maybe 100Meg. Maybe 1G. Then that would only block access to that small part of the array, which should clear in just a few seconds at most. 
This information on the range being synced is not enough to limit read-balancing. I imagined that *every* node would read the bitmap for a failed node, and would use that information to limit read-balancing. There are some complexities in this though.

So the current code isn't "wrong" exactly, but I think it could cause severe delays in some (unusual) circumstances.

>> - I think md_reload_sb() might be too simple. It probably should check that nothing serious has changed. The "mddev->raid_disks = 0" looks suspicious. I'll have to think about this a bit more.
>
> Yes, I get that feeling as well. However, I am not sure how to perform an exact comparison to understand what has changed. Perhaps it needs a new flag?

Probably. I haven't thought much about it.

>> That's all I can see for now. I'll have another look once I have it all in my tree.
>
> I have put all the changes in my git: https://github.com/goldwynr/linux
> The branch cluster-md is against the latest upstream. I also performed a small sanity test to check everything is working properly.
>
> Let me know if you would want me to repost the entire patchset to the mailing list.

I don't think there is any need for that. I won't pull it in just yet - to give you a chance to resolve the last of the checkpatch problems. Then I'll double check that there is no risk to non-cluster users and try to get it into -next after 3.20-rc2 is out.

Thanks,
NeilBrown
* Re: [PATCH 00/24] Clustered MD RAID1
From: Goldwyn Rodrigues @ 2015-02-11 17:25 UTC
To: NeilBrown; +Cc: lzhong, linux-raid

Hi Neil,

On 02/10/2015 10:17 PM, NeilBrown wrote:
> On Tue, 10 Feb 2015 11:00:31 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
>> Hi Neil,
>>
>>> hi Goldwyn,
>>> thanks for these - and sorry for the long delay. Lots of leave over southern summer, and lots of email etc to deal with.
>>>
>>> This patch set is very close and I am tempted to just apply it and then fix things up with subsequent patches. In order to allow that, could you please:
>>> - rebase against current upstream
>>> - fix the checkpatch.pl errors and warnings.
>>>   The "WARNING: line over 80 characters" are often a judgement call so I'm not particularly worried about those. Most, if not all, of the others should be followed just to have consistent layout.
>>
>> Done.
>
> ERROR: code indent should use tabs where possible
>   when you use spaces, they show up in red for me!! Ditto for
> WARNING: please, no space before tabs
>
> WARNING: quoted string split across lines
>   It really is best to fix those, even though it makes the line long. When grepping to find out where a message comes from, it is very annoying if the grep fails because the line was split.
>
> WARNING: Missing a blank line after declarations
>   Worth fixing I think.
>
> WARNING: printk() should include KERN_ facility level
>   Definitely should be fixed, maybe make it pr_warn()??

Ok, my review was not good. I have incorporated all of these and put them in the same branch. Sorry for the trouble.

> "Introduce md_cluster_info" moves 'bast' to a new location in dlm_lock_resource for no apparent reason. Also 'leave()' has a parameter which is changed from 'md' to 'mddev', as does 'join'.
>
> "Add node recovery callbacks" adds a comment to the 'nodes' field of 'struct mddev'. Why not add the comment when the field is added? Oh, and it mis-spells "unmber".
>
> In "Gather on-going resync information of other nodes" you have:
>
>  static struct md_cluster_operations cluster_ops = {
>  	.join = join,
>  	.leave = leave,
> -	.slot_number = slot_number
> +	.slot_number = slot_number,
> +	.resync_info_update = resync_info_update
>  };
>
> It is really best to put a comma at the end of each entry, even the last. Then the patch would have been:
>
>  static struct md_cluster_operations cluster_ops = {
>  	.join = join,
>  	.leave = leave,
>  	.slot_number = slot_number,
> +	.resync_info_update = resync_info_update,
>  };
>
> which is much nicer to read. You finally get this right in "Suspend writes in RAID1 if within range" :-)

Ok, I have fixed them in all the patches.

>>> - I'm a bit concerned about the behaviour on node failure.
>>>   When a node fails, two things must happen w.r.t the bits in that node's bitmap.
>>>   1/ The corresponding regions of the array need to be resynced. You do have code to do this.
>>>   2/ Other nodes must avoid read-balancing on those regions until the resync has completed.
>>>
>>>   You do have code for this second bit, but it looks wrong. It avoids read-balancing if ->area_resyncing(). That isn't sufficient.
>>>   The "area_resyncing" is always (I assume) a relatively small region of the array which will be completely resynced quite quickly. It must be because writes are blocked to this area.
>>>   However the region in which we must disable re-balancing can be much larger. It covers *all* bits that are set in any unsynced bitmap. So it isn't just the area that is currently being synced, but all areas that will be synced.
>>
>> What are unsynced bitmaps? Are they bitmaps which are associated with an active node or dirty bitmaps with dead nodes? If it is the former, I agree this is not enough. If it is the latter, all nodes maintain a linked list of all the nodes which are currently performing resync (probably because multiple nodes died simultaneously). One node performs the recovery (aka bitmap resync) of exactly one "dead" node at a time. area_resyncing goes through all the nodes which are performing resync.
>
> The latter - bitmaps associated with a dead node. Bitmaps associated with an active node contain transient information, and the filesystem will ensure that it never reads from somewhere that someone else might be writing (or if it does, it will know that the data cannot be trusted).
>
> I looked at the code again, and discovered that I had the problem backwards. But there is still a problem.
>
> When any node is resyncing, your code blocks writes for the entire span of the array from wherever the resync is up to, to the end of the device. So a write to a location near the end of the device will hang until all resyncs finish. This could be a much longer time than you would like writes to hang for.
>
> I think that the resyncing host should only report that it is resyncing a relatively small range of the array, maybe 100Meg. Maybe 1G. Then that would only block access to that small part of the array, which should clear in just a few seconds at most.
>
> This information on the range being synced is not enough to limit read-balancing. I imagined that *every* node would read the bitmap for a failed node, and would use that information to limit read-balancing. There are some complexities in this though.
>
> So the current code isn't "wrong" exactly, but I think it could cause severe delays in some (unusual) circumstances.

I thought about this as well, and this is what I think we should do for a node failure:

Attempt a PW lock (DLM_LKF_NOQUEUE) on the failed node's bitmap. If successful:
 a. Read the bitmap.
 b. Update the bitmap LVB with the resync details.
 c. Send the RESYNCING message.
 d. Perform the resync and update the LVB as we proceed. (This means we will have to write another resync function independent of md_do_sync.)

area_resyncing:
 a. Check the resyncing node list. If not found, or out of range, return 0.
 b. If an entry exists with an overlapping range of I/O, take a CR lock on bitmap-<nodenum>.
 c. Read the LVB for limits and update the limits in the resyncing node list.
 d. If the read/write range is not within the new range details, unlock CR and return 0.
 e. If the read/write is within range, read the bitmap and check for interfering bitmaps. If not interfering, return 0, else return 1.

Perhaps we could reserve the bitmap. Using steps c and d may make it a wee bit faster without needing to go to the disk.

The problem I see in this approach is cascading failures. What should we do if the node performing the resync for a recently failed node also fails? How do we detect that the failed node was performing a resync for another node? One option is that we could add that information in the RESYNCING message. Can you think of something better?
The way it is being done now is that we aggressively (no NOQUEUE) take the bitmap lock, and the resyncing node copies the bits into its own bitmap and clears the failed bitmap before releasing the bitmap lock.

Finally, should we put some sort of versioning in the messages for rolling upgrades? (One node is on a higher kernel version with respect to the rest of the nodes.)

--
Goldwyn