From mboxrd@z Thu Jan 1 00:00:00 1970
From: lhh@sourceware.org
Date: 21 Jul 2006 17:56:16 -0000
Subject: [Cluster-devel] cluster/cman man/Makefile qdisk/README man/mkq ...
Message-ID: <20060721175616.31057.qmail@sourceware.org>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

CVSROOT: /cvs/cluster
Module name: cluster
Branch: RHEL4U4
Changes by: lhh at sourceware.org 2006-07-21 17:56:15

Modified files:
	cman/man : Makefile
	cman/qdisk : README
Added files:
	cman/man : mkqdisk.8 qdisk.5 qdiskd.8

Log message:
	Add man pages for qdisk

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/mkqdisk.8.diff?cvsroot=cluster&only_with_tag=RHEL4U4&r1=NONE&r2=1.2.2.1
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdisk.5.diff?cvsroot=cluster&only_with_tag=RHEL4U4&r1=NONE&r2=1.2.2.1
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdiskd.8.diff?cvsroot=cluster&only_with_tag=RHEL4U4&r1=NONE&r2=1.2.2.1
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/Makefile.diff?cvsroot=cluster&only_with_tag=RHEL4U4&r1=1.1&r2=1.1.14.1
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/README.diff?cvsroot=cluster&only_with_tag=RHEL4U4&r1=1.1.2.1.2.1&r2=1.1.2.1.2.2

/cvs/cluster/cluster/cman/man/mkqdisk.8,v --> standard output
revision 1.2.2.1
--- cluster/cman/man/mkqdisk.8
+++ - 2006-07-21 17:56:15.747890000 +0000
@@ -0,0 +1,23 @@
+.TH "mkqdisk" "8" "July 2006" "" "Quorum Disk Management"
+.SH "NAME"
+mkqdisk \- Cluster Quorum Disk Utility
+.SH "WARNING"
+Use of this command can cause the cluster to malfunction.
+.SH "SYNOPSIS"
+\fBmkqdisk [\-?|\-h] | [\-L] | [\-f \fPlabel\fB] [\-c \fPdevice \fB -l \fPlabel\fB]
+.SH "DESCRIPTION"
+.PP
+The \fBmkqdisk\fP command is used to create a new quorum disk or display
+existing quorum disks accessible from a given cluster node.
+.SH "OPTIONS"
+.IP "\-c device \-l label"
+Initialize a new cluster quorum disk. This will destroy all data on the given
+device. If a cluster is currently using that device as a quorum disk, the
+entire cluster will malfunction. Do not ru
+.IP "\-f label"
+Find the cluster quorum disk with the given label and display information about it..
+.IP "\-L"
+Display information on all accessible cluster quorum disks.
+
+.SH "SEE ALSO"
+qdisk(5) qdiskd(8)
/cvs/cluster/cluster/cman/man/qdisk.5,v --> standard output
revision 1.2.2.1
--- cluster/cman/man/qdisk.5
+++ - 2006-07-21 17:56:15.834011000 +0000
@@ -0,0 +1,309 @@
+.TH "QDisk" "8" "July 2006" "" "Cluster Quorum Disk"
+.SH "NAME"
+QDisk 1.0 \- a disk-based quorum daemon for CMAN / Linux-Cluster
+.SH "1. Overview"
+.SH "1.1 Problem"
+In some situations, it may be necessary or desirable to sustain
+a majority node failure of a cluster without introducing the need for
+asymmetric cluster configurations (e.g. client-server, or heavily-weighted
+voting nodes).
+
+.SH "1.2. Design Requirements"
+* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
+danger of a simple network partition causing a split brain. That is, we
+need to be able to ensure that the majority failure case is not merely
+the result of a network partition.
+
+* Ability to use external reasons for deciding which partition is the
+the quorate partition in a partitioned cluster. For example, a user may
+have a service running on one node, and that node must always be the master
+in the event of a network partition. Or, a node might lose all network
+connectivity except the cluster communication path - in which case, a
+user may wish that node to be evicted from the cluster.
+
+* Integration with CMAN. We must not require CMAN to run with us (or
+without us). Linux-Cluster does not require a quorum disk normally -
+introducing new requirements on the base of how Linux-Cluster operates
+is not allowed.
+
+* Data integrity. In order to recover from a majority failure, fencing
+is required. The fencing subsystem is already provided by Linux-Cluster.
+
+* Non-reliance on hardware or protocol specific methods (i.e. SCSI
+reservations). This ensures the quorum disk algorithm can be used on the
+widest range of hardware configurations possible.
+
+* Little or no memory allocation after initialization. In critical paths
+during failover, we do not want to have to worry about being killed during
+a memory pressure situation because we request a page fault, and the Linux
+OOM killer responds...
+
+.SH "1.3. Hardware Considerations and Requirements"
+.SH "1.3.1. Concurrent, Synchronous, Read/Write Access"
+This quorum daemon requires a shared block device with concurrent read/write
+access from all nodes in the cluster. The shared block device can be
+a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a RAIDed iSCSI
+target, or even GNBD. The quorum daemon uses O_DIRECT to write to the
+device.
+
+.SH "1.3.2. Bargain-basement JBODs need not apply"
+There is a minimum performance requirement inherent when using disk-based
+cluster quorum algorithms, so design your cluster accordingly. Using a
+cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause
+problems at the first load spike. Plan your loads accordingly; a node's
+inability to write to the quorum disk in a timely manner will cause the
+cluster to evict the node. Using host-RAID or multi-initiator parallel
+SCSI configurations with the qdisk daemon is unlikely to work, and will
+probably cause administrators a lot of frustration. That having been
+said, because the timeouts are configurable, most hardware should work
+if the timeouts are set high enough.
+
+.SH "1.3.3. Fencing is Required"
+In order to maintain data integrity under all failure scenarios, use of
+this quorum daemon requires adequate fencing, preferrably power-based
+fencing. Watchdog timers and software-based solutions to reboot the node
+internally, while possibly sufficient, are not considered 'fencing' for
+the purposes of using the quorum disk.
+
+.SH "1.4. Limitations"
+* At this time, this daemon supports a maximum of 16 nodes. This is
+primarily a scalability issue: As we increase the node count, we increase
+the amount of synchronous I/O contention on the shared quorum disk.
+
+* Cluster node IDs must be statically configured in cluster.conf and
+must be numbered from 1..16 (there can be gaps, of course).
+
+* Cluster node votes should be more or less equal.
+
+* CMAN must be running before the qdisk program can start.
+
+* CMAN's eviction timeout should be at least 2x the quorum daemon's
+to give the quorum daemon adequate time to converge on a master during a
+failure + load spike situation.
+
+* The total number of votes assigned to the quorum device should be
+equal to or greater than the total number of node-votes in the cluster.
+While it is possible to assign only one (or a few) votes to the quorum
+device, the effects of doing so have not been explored.
+
+* Currently, the quorum disk daemon is difficult to use with CLVM if
+the quorum disk resides on a CLVM logical volume. CLVM requires a
+quorate cluster to correctly operate, which introduces a chicken-and-egg
+problem for starting the cluster: CLVM needs quorum, but the quorum daemon
+needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
+One way to work around this is to *not* set the cluster's expected votes
+to include the quorum daemon's votes. Bring all nodes online, and start
+the quorum daemon *after* the whole cluster is running. This will allow
+the expected votes to increase naturally.
+
+.SH "2. Algorithms"
+.SH "2.1. Heartbeating & Liveliness Determination"
+Nodes update individual status blocks on the quorum disk at a user-
+defined rate. Each write of a status block alters the timestamp, which
+is what other nodes use to decide whether a node has hung or not. If,
+after a user-defined number of 'misses' (that is, failure to update a
+timestamp), a node is declared offline. After a certain number of 'hits'
+(changed timestamp + "i am alive" state), the node is declared online.
+
+The status block contains additional information, such as a bitmask of
+the nodes that node believes are online. Some of this information is
+used by the master - while some is just for performace recording, and
+may be used at a later time. The most important pieces of information
+a node writes to its status block are:
+
+.in 12
+- Timestamp
+.br
+- Internal state (available / not available)
+.br
+- Score
+.br
+- Known max score (may be used in the future to detect invalid configurations)
+.br
+- Vote/bid messages
+.br
+- Other nodes it thinks are online
+.in 0
+
+.SH "2.2. Scoring & Heuristics"
+The administrator can configure up to 10 purely arbitrary heuristics, and
+must exercise caution in doing so. At least one administrator-
+defined heuristic is required for operation, but it is generally a good
+idea to have more than one heuristic. By default, only nodes scoring over
+1/2 of the total maximum score will claim they are available via the
+quorum disk, and a node (master or otherwise) whose score drops too low
+will remove itself (usually, by rebooting).
+
+The heuristics themselves can be any command executable by 'sh -c'. For
+example, in early testing the following was used:
+
+.ti 12
+<\fBheuristic \fP\fIprogram\fP\fB="\fP[ -f /quorum ]\fB" \fP\fIscore\fP\fB="\fP10\fB" \fP\fIinterval\fP\fB="\fP2\fB"/>\fP
+
+This is a literal sh-ism which tests for the existence of a file called
+"/quorum". Without that file, the node would claim it was unavailable.
+This is an awful example, and should never, ever be used in production,
+but is provided as an example as to what one could do...
+
+Typically, the heuristics should be snippets of shell code or commands which
+help determine a node's usefulness to the cluster or clients. Ideally, you
+want to add traces for all of your network paths (e.g. check links, or
+ping routers), and methods to detect availability of shared storage.
+
+.SH "2.3. Master Election"
+Only one master is present at any one time in the cluster, regardless of
+how many partitions exist within the cluster itself. The master is
+elected by a simple voting scheme in which the lowest node which believes
+it is capable of running (i.e. scores high enough) bids for master status.
+If the other nodes agree, it becomes the master. This algorithm is
+run whenever no master is present.
+
+If another node comes online with a lower node ID while a node is still
+bidding for master status, it will rescind its bid and vote for the lower
+node ID. If a master dies or a bidding node dies, the voting algorithm
+is started over. The voting algorithm typically takes two passes to
+complete.
+
+Master deaths take marginally longer to recover from than non-master
+deaths, because a new master must be elected before the old master can
+be evicted & fenced.
+
+.SH "2.4. Master Duties"
+The master node decides who is or is not in the master partition, as
+well as handles eviction of dead nodes (both via the quorum disk and via
+the linux-cluster fencing system by using the cman_kill_node() API).
+
+.SH "2.5. How it All Ties Together"
+When a master is present, and if the master believes a node to be online,
+that node will advertise to CMAN that the quorum disk is available. The
+master will only grant a node membership if:
+
+.in 12
+(a) CMAN believes the node to be online, and
+.br
+(b) that node has made enough consecutive, timely writes
+.in 16
+to the quorum disk, and
+.in 12
+(c) the node has a high enough score to consider itself online.
+.in 0
+
+.SH "3. Configuration"
+.SH "3.1. The tag"
+This tag is a child of the top-level tag.
+
+.in 8
+\fB\fP
+.in 12
+This overrides the device field if present. If specified, the quorum
+daemon will read /proc/partitions and check for qdisk signatures
+on every block device found, comparing the label against the specified
+label. This is useful in configurations where the block device name
+differs on a per-node basis.
+.in 0
+
+.SH "3.2. The tag"
+This tag is a child of the tag.
+
+.in 8
+\fB\fP
+.in 12
+This is the frequency at which we poll the heuristic.
+.in 0
+
+.SH "3.3. Example"
+.in 8
+
+.in 12
+
+.br
+
+.br
+
+.br
+.in 8
+
+.in 0
+
+.SH "3.4. Heuristic score considerations"
+* Heuristic timeouts should be set high enough to allow the previous run
+of a given heuristic to complete.
+
+* Heuristic scripts returning anything except 0 as their return code
+are considered failed.
+
+* The worst-case for improperly configured quorum heuristics is a race
+to fence where two partitions simultaneously try to kill each other.
+
+.SH "3.5. Creating a quorum disk partition"
+The mkqdisk utility can create and list currently configured quorum disks
+visible to the local node; see
+.B mkqdisk(8)
+for more details.
+
+.SH "SEE ALSO"
+mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5)
/cvs/cluster/cluster/cman/man/qdiskd.8,v --> standard output
revision 1.2.2.1
--- cluster/cman/man/qdiskd.8
+++ - 2006-07-21 17:56:15.915107000 +0000
@@ -0,0 +1,20 @@
+.TH "qdiskd" "8" "July 2006" "" "Quorum Disk Management"
+.SH "NAME"
+qdiskd \- Cluster Quorum Disk Daemon
+.SH "SYNOPSIS"
+\fBqdiskd [\-f] [\-d]
+.SH "DESCRIPTION"
+.PP
+The \fBqdiskd\fP daemon talks to CMAN and provides a mechanism for determining
+node-fitness in a cluster environment. See
+.B
+qdisk(5)
+for configuration information.
+.SH "OPTIONS"
+.IP "\-f"
+Run in the foreground (do not fork / daemonize).
+.IP "\-d"
+Enable debug output.
+
+.SH "SEE ALSO"
+mkqdisk(8), qdisk(5), cman(5)
--- cluster/cman/man/Makefile 2004/08/13 06:38:22 1.1
+++ cluster/cman/man/Makefile 2006/07/21 17:56:15 1.1.14.1
@@ -18,10 +18,10 @@
 install:
 	install -d ${mandir}/man5
 	install -d ${mandir}/man8
-	install cman.5 ${mandir}/man5
-	install cman_tool.8 ${mandir}/man8
+	install cman.5 qdisk.5 ${mandir}/man5
+	install cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8
 
 uninstall:
-	${UNINSTALL} cman.5 ${mandir}/man5
-	${UNINSTALL} cman_tool.8 ${mandir}/man8
+	${UNINSTALL} cman.5 qdisk.5 ${mandir}/man5
+	${UNINSTALL} cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8
--- cluster/cman/qdisk/README 2006/06/23 16:02:01 1.1.2.1.2.1
+++ cluster/cman/qdisk/README 2006/07/21 17:56:15 1.1.2.1.2.2
@@ -1,274 +1 @@
-qdisk 1.0 - a disk-based quorum algorithm for Linux-Cluster
-
-(C) 2006 Red Hat, Inc.
-
-1. Overview
-
-1.1. Problem
-
-In some situations, it may be necessary or desirable to sustain
-a majority node failure of a cluster without introducing the need for
-asymmetric (client-server, or heavy-weighted voting nodes).
-
-1.2. Design Requirements
-
-* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
-danger of a simple network partition causing a split brain. That is, we
-need to be able to ensure that the majority failure case is not merely
-the result of a network partition.
-
-* Ability to use external reasons for deciding which partition is the
-the quorate partition in a partitioned cluster. For example, a user may
-have a service running on one node, and that node must always be the master
-in the event of a network partition. Or, a node might lose all network
-connectivity except the cluster communication path - in which case, a
-user may wish that node to be evicted from the cluster.
-
-* Integration with CMAN. We must not require CMAN to run with us (or
-without us). Linux-Cluster does not require a quorum disk normally -
-introducing new requirements on the base of how Linux-Cluster operates
-is not allowed.
-
-* Data integrity. In order to recover from a majority failure, fencing
-is required. The fencing subsystem is already provided by Linux-Cluster.
-
-* Non-reliance on hardware or protocol specific methods (i.e. SCSI
-reservations). This ensures the quorum disk algorithm can be used on the
-widest range of hardware configurations possible.
-
-* Little or no memory allocation after initialization. In critical paths
-during failover, we do not want to have to worry about being killed during
-a memory pressure situation because we request a page fault, and the Linux
-OOM killer responds...
-
-
-1.3. Hardware Configuration Considerations
-
-1.3.1. Concurrent, Synchronous, Read/Write Access
-
-This daemon requires a shared block device with concurrent read/write
-access from all nodes in the cluster. The shared block device can be
-a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a RAIDed iSCSI
-target, or even GNBD. The quorum daemon uses O_DIRECT to write to the
-device.
-
-1.3.2. Bargain-basement JBODs need not apply
-
-There is a minimum performance requirement inherent when using disk-based
-cluster quorum algorithms, so design your cluster accordingly. Using a
-cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause
-problems at the first load spike. Plan your loads accordingly; a node's
-inability to write to the quorum disk in a timely manner will cause the
-cluster to evict the node. Using host-RAID or multi-initiator parallel
-SCSI configurations with the qdisk daemon is unlikely to work, and will
-probably cause administrators a lot of frustration. That having been
-said, because the timeouts are configurable, most hardware should work
-if the timeouts are set high enough.
-
-1.3.3. Fencing is Required
-
-In order to maintain data integrity under all failure scenarios, use of
-this quorum daemon requires adequate fencing, preferrably power-based
-fencing.
-
-
-1.4. Limitations
-
-* At this time, this daemon only supports a maximum of 16 nodes.
-
-* Cluster node IDs must be statically configured in cluster.conf and
-must be numbered from 1..16 (there can be gaps, of course).
-
-* Cluster node votes should be more or less equal.
-
-* CMAN must be running before the qdisk program can start. This
-limitation will be removed before a production release.
-
-* CMAN's eviction timeout should be at least 2x the quorum daemon's
-to give the quorum daemon adequate time to converge on a master during a
-failure + load spike situation.
-
-* The total number of votes assigned to the quorum device should be
-equal to or greater than the total number of node-votes in the cluster.
-While it is possible to assign only one (or a few) votes to the quorum
-device, the effects of doing so have not been explored.
-
-* Currently, the quorum disk daemon is difficult to use with CLVM if
-the quorum disk resides on a CLVM logical volume. CLVM requires a
-quorate cluster to correctly operate, which introduces a chicken-and-egg
-problem for starting the cluster: CLVM needs quorum, but the quorum daemon
-needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
-One way to work around this is to *not* set the cluster's expected votes
-to include theh quorum daemon's votes. Bring all nodes online, and start
-the quorum daemon *after* the whole cluster is running. This will allow
-the expected votes to increase naturally.
-
-
-2. Algorithms
-
-2.1. Heartbeating & Liveliness Determination
-
-Nodes update individual status blocks on the quorum disk at a user-
-defined rate. Each write of a status block alters the timestamp, which
-is what other nodes use to decide whether a node has hung or not. If,
-after a user-defined number of 'misses' (that is, failure to update a
-timestamp), a node is declared offline. After a certain number of 'hits'
-(changed timestamp + "i am alive" state), the node is declared online.
-
-The status block contains additional information, such as a bitmask of
-the nodes that node believes are online. Some of this information is
-used by the master - while some is just for performace recording, and
-may be used at a later time. The most important pieces of information
-a node writes to its status block are:
-
-	- timestamp
-	- internal state (available / not available)
-	- score
-	- max score
-	- vote/bid messages
-	- other nodes it thinks are online
-
-
-2.2. Scoring & Heuristics
-
-The administrator can configure up to 10 purely arbitrary heuristics, and
-must exercise caution in doing so. By default, only nodes scoring over
-1/2 of the total maximum score will claim they are available via the
-quorum disk, and a node (master or otherwise) whose score drops too low
-will remove itself (usually, by rebooting).
-
-The heuristics themselves can be any command executable by 'sh -c'. For
-example, in early testing, I used this:
-
-
-
-This is a literal sh-ism which tests for the existence of a file called
-"/quorum". Without that file, the node would claim it was unavailable.
-This is an awful example, and should never, ever be used in production,
-but is provided as an example as to what one could do...
-
-Typically, the heuristics should be snippets of shell code or commands which
-help determine a node's usefulness to the cluster or clients. Ideally, you
-want to add traces for all of your network paths (e.g. check links, or
-ping routers), and methods to detect availability of shared storage.
-
-
-2.3. Master Election
-
-Only one master is present at any one time in the cluster, regardless of
-how many partitions exist within the cluster itself. The master is
-elected by a simple voting scheme in which the lowest node which believes
-it is capable of running (i.e. scores high enough) bids for master status.
-If the other nodes agree, it becomes the master. This algorithm is
-run whenever no master is present.
-
-If another node comes online with a lower node ID while a node is still
-bidding for master status, it will rescind its bid and vote for the lower
-node ID. If a master dies or a bidding node dies, the voting algorithm
-is started over. The voting algorithm typically takes two passes to
-complete.
-
-Master deaths take marginally longer to recover from than non-master
-deaths, because a new master must be elected before the old master can
-be evicted & fenced.
-
-
-2.4. Master Duties
-
-The master node decides who is or is not in the master partition, as
-well as handles eviction of dead nodes (both via the quorum disk and via
-the linux-cluster fencing system by using the cman_kill_node() API).
-
-
-2.5. How it All Ties Together
-
-When a master is present, and if the master believes a node to be online,
-that node will advertise to CMAN that the quorum disk is avilable. The
-master will only grant a node membership if:
-
-  (a) CMAN believes the node to be online, and
-  (b) that node has made enough consecutive, timely writes to the quorum
-      disk.
-
-
-3. Configuration
-
-3.1. The tag
-
-This tag is a child of the top-level tag.
-
-    This overrides the device field if present.
-    If specified, the quorum daemon will read
-    /proc/partitions and check for qdisk signatures
-    on every block device found, comparing the label
-    against the specified label. This is useful in
-    configurations where the block device name
-    differs on a per-node basis.
-
-
-3.2. The tag
-
-This tag is a child of the tag.
-
-    This is the frequency at which we poll the
-    heuristic.
-
-3.3. Example
-
-
-
-
-
-
-
-3.4. Heuristic score considerations
-
-* Heuristic timeouts should be set high enough to allow the previous run
-of a given heuristic to complete.
-
-* Heuristic scripts returning anything except 0 as their return code
-are considered failed.
-
-* The worst-case for improperly configured quorum heuristics is a race
-to fence where two partitions simultaneously try to kill each other.
-
-3.5. Creating a quorum disk partition
-
-3.5.1. The mkqdisk utility.
-
-The mkqdisk utility can create and list currently configured quorum disks
-visible to the local node.
-
-  mkqdisk -L    List available quorum disks.
-
-  mkqdisk -f