From mboxrd@z Thu Jan 1 00:00:00 1970
From: lhh@sourceware.org
Date: 21 Jul 2006 17:55:06 -0000
Subject: [Cluster-devel] cluster/cman/man Makefile mkqdisk.8 qdisk.5 qd ...
Message-ID: <20060721175506.29405.qmail@sourceware.org>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

CVSROOT: /cvs/cluster
Module name: cluster
Changes by: lhh at sourceware.org 2006-07-21 17:55:04

Modified files:
 cman/man : Makefile
Added files:
 cman/man : mkqdisk.8 qdisk.5 qdiskd.8

Log message:
 Add man pages for qdisk

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/mkqdisk.8.diff?cvsroot=cluster&r1=1.1&r2=1.2
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdisk.5.diff?cvsroot=cluster&r1=1.1&r2=1.2
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdiskd.8.diff?cvsroot=cluster&r1=1.1&r2=1.2
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/Makefile.diff?cvsroot=cluster&r1=1.1&r2=1.2

--- cluster/cman/man/mkqdisk.8 2006/07/21 17:53:08 1.1
+++ cluster/cman/man/mkqdisk.8 2006/07/21 17:55:04 1.2
@@ -0,0 +1,23 @@
+.TH "mkqdisk" "8" "July 2006" "" "Quorum Disk Management"
+.SH "NAME"
+mkqdisk \- Cluster Quorum Disk Utility
+.SH "WARNING"
+Use of this command can cause the cluster to malfunction.
+.SH "SYNOPSIS"
+\fBmkqdisk [\-?|\-h] | [\-L] | [\-f \fPlabel\fB] [\-c \fPdevice \fB \-l \fPlabel\fB]
+.SH "DESCRIPTION"
+.PP
+The \fBmkqdisk\fP command is used to create a new quorum disk or display
+existing quorum disks accessible from a given cluster node.
+.SH "OPTIONS"
+.IP "\-c device \-l label"
+Initialize a new cluster quorum disk. This will destroy all data on the given
+device. If a cluster is currently using that device as a quorum disk, the
+entire cluster will malfunction. Do not ru
+.IP "\-f label"
+Find the cluster quorum disk with the given label and display information about it.
+.IP "\-L"
+Display information on all accessible cluster quorum disks.
+
+.SH "SEE ALSO"
+qdisk(5) qdiskd(8)
--- cluster/cman/man/qdisk.5 2006/07/21 17:53:08 1.1
+++ cluster/cman/man/qdisk.5 2006/07/21 17:55:04 1.2
@@ -0,0 +1,309 @@
+.TH "QDisk" "5" "July 2006" "" "Cluster Quorum Disk"
+.SH "NAME"
+QDisk 1.0 \- a disk-based quorum daemon for CMAN / Linux-Cluster
+.SH "1. Overview"
+.SH "1.1. Problem"
+In some situations, it may be necessary or desirable to sustain
+a majority node failure of a cluster without introducing the need for
+asymmetric cluster configurations (e.g. client-server, or heavily-weighted
+voting nodes).
+
+.SH "1.2. Design Requirements"
+* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
+danger of a simple network partition causing a split brain. That is, we
+need to be able to ensure that the majority failure case is not merely
+the result of a network partition.
+
+* Ability to use external reasons for deciding which partition is
+the quorate partition in a partitioned cluster. For example, a user may
+have a service running on one node, and that node must always be the master
+in the event of a network partition. Or, a node might lose all network
+connectivity except the cluster communication path - in which case, a
+user may wish that node to be evicted from the cluster.
+
+* Integration with CMAN. We must not require CMAN to run with us (or
+without us). Linux-Cluster does not require a quorum disk normally -
+introducing new requirements on the base of how Linux-Cluster operates
+is not allowed.
+
+* Data integrity. In order to recover from a majority failure, fencing
+is required. The fencing subsystem is already provided by Linux-Cluster.
+
+* Non-reliance on hardware or protocol specific methods (e.g. SCSI
+reservations). This ensures the quorum disk algorithm can be used on the
+widest range of hardware configurations possible.
+
+* Little or no memory allocation after initialization. In critical paths
+during failover, we do not want to have to worry about being killed during
+a memory pressure situation because we trigger a page fault, and the Linux
+OOM killer responds...
+
+.SH "1.3. Hardware Considerations and Requirements"
+.SH "1.3.1. Concurrent, Synchronous, Read/Write Access"
+This quorum daemon requires a shared block device with concurrent read/write
+access from all nodes in the cluster. The shared block device can be
+a multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI
+target, or even GNBD. The quorum daemon uses O_DIRECT to write to the
+device.
+
+.SH "1.3.2. Bargain-basement JBODs need not apply"
+There is a minimum performance requirement inherent when using disk-based
+cluster quorum algorithms, so design your cluster accordingly. Using a
+cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause
+problems at the first load spike. Plan your loads accordingly; a node's
+inability to write to the quorum disk in a timely manner will cause the
+cluster to evict the node. Using host-RAID or multi-initiator parallel
+SCSI configurations with the qdisk daemon is unlikely to work, and will
+probably cause administrators a lot of frustration. That having been
+said, because the timeouts are configurable, most hardware should work
+if the timeouts are set high enough.
+
+.SH "1.3.3. Fencing is Required"
+In order to maintain data integrity under all failure scenarios, use of
+this quorum daemon requires adequate fencing, preferably power-based
+fencing. Watchdog timers and software-based solutions to reboot the node
+internally, while possibly sufficient, are not considered 'fencing' for
+the purposes of using the quorum disk.
+
+.SH "1.4. Limitations"
+* At this time, this daemon supports a maximum of 16 nodes. This is
+primarily a scalability issue: As we increase the node count, we increase
+the amount of synchronous I/O contention on the shared quorum disk.
+
+* Cluster node IDs must be statically configured in cluster.conf and
+must be numbered from 1..16 (there can be gaps, of course).
+
+* Cluster node votes should be more or less equal.
+
+* CMAN must be running before the qdisk program can start.
+
+* CMAN's eviction timeout should be at least 2x the quorum daemon's
+to give the quorum daemon adequate time to converge on a master during a
+failure + load spike situation.
+
+* The total number of votes assigned to the quorum device should be
+equal to or greater than the total number of node-votes in the cluster.
+While it is possible to assign only one (or a few) votes to the quorum
+device, the effects of doing so have not been explored.
+
+* Currently, the quorum disk daemon is difficult to use with CLVM if
+the quorum disk resides on a CLVM logical volume. CLVM requires a
+quorate cluster to correctly operate, which introduces a chicken-and-egg
+problem for starting the cluster: CLVM needs quorum, but the quorum daemon
+needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
+One way to work around this is to *not* set the cluster's expected votes
+to include the quorum daemon's votes. Bring all nodes online, and start
+the quorum daemon *after* the whole cluster is running. This will allow
+the expected votes to increase naturally.
+
+.SH "2. Algorithms"
+.SH "2.1. Heartbeating & Liveness Determination"
+Nodes update individual status blocks on the quorum disk at a
+user-defined rate. Each write of a status block alters the timestamp, which
+is what other nodes use to decide whether a node has hung or not. After
+a user-defined number of 'misses' (that is, failures to update a
+timestamp), a node is declared offline. After a certain number of 'hits'
+(changed timestamp + "i am alive" state), the node is declared online.
+
+The status block contains additional information, such as a bitmask of
+the nodes that node believes are online. Some of this information is
+used by the master - while some is just for performance recording, and
+may be used at a later time. The most important pieces of information
+a node writes to its status block are:
+
+.in 12
+- Timestamp
+.br
+- Internal state (available / not available)
+.br
+- Score
+.br
+- Known max score (may be used in the future to detect invalid configurations)
+.br
+- Vote/bid messages
+.br
+- Other nodes it thinks are online
+.in 0
+
+.SH "2.2. Scoring & Heuristics"
+The administrator can configure up to 10 purely arbitrary heuristics, and
+must exercise caution in doing so. At least one
+administrator-defined heuristic is required for operation, but it is
+generally a good idea to have more than one heuristic. By default, only
+nodes scoring over 1/2 of the total maximum score will claim they are
+available via the quorum disk, and a node (master or otherwise) whose
+score drops too low will remove itself (usually, by rebooting).
+
+The heuristics themselves can be any command executable by 'sh -c'. For
+example, in early testing the following was used:
+
+.ti 12
+<\fBheuristic \fP\fIprogram\fP\fB="\fP[ -f /quorum ]\fB" \fP\fIscore\fP\fB="\fP10\fB" \fP\fIinterval\fP\fB="\fP2\fB"/>\fP
+
+This is a literal sh-ism which tests for the existence of a file called
+"/quorum". Without that file, the node would claim it was unavailable.
+This is an awful example, and should never, ever be used in production,
+but is provided as an example of what one could do...
+
+Typically, the heuristics should be snippets of shell code or commands which
+help determine a node's usefulness to the cluster or clients. Ideally, you
+want to add traces for all of your network paths (e.g. check links, or
+ping routers), and methods to detect availability of shared storage.
+
+.SH "2.3. Master Election"
+Only one master is present at any one time in the cluster, regardless of
+how many partitions exist within the cluster itself. The master is
+elected by a simple voting scheme in which the node with the lowest ID
+which believes it is capable of running (i.e. scores high enough) bids
+for master status. If the other nodes agree, it becomes the master.
+This algorithm is run whenever no master is present.
+
+If another node comes online with a lower node ID while a node is still
+bidding for master status, the bidding node will rescind its bid and vote
+for the lower node ID. If a master dies or a bidding node dies, the voting
+algorithm is started over. The voting algorithm typically takes two passes
+to complete.
+
+Master deaths take marginally longer to recover from than non-master
+deaths, because a new master must be elected before the old master can
+be evicted & fenced.
+
+.SH "2.4. Master Duties"
+The master node decides who is or is not in the master partition, and
+handles eviction of dead nodes (both via the quorum disk and via
+the linux-cluster fencing system by using the cman_kill_node() API).
+
+.SH "2.5. How it All Ties Together"
+When a master is present, and if the master believes a node to be online,
+that node will advertise to CMAN that the quorum disk is available. The
+master will only grant a node membership if:
+
+.in 12
+(a) CMAN believes the node to be online, and
+.br
+(b) that node has made enough consecutive, timely writes
+.in 16
+to the quorum disk, and
+.in 12
+(c) the node has a high enough score to consider itself online.
+.in 0
+
+.SH "3. Configuration"
+.SH "3.1. The tag"
+This tag is a child of the top-level tag.
+
+.in 8
+\fB\fP
+.in 12
+This overrides the device field if present. If specified, the quorum
+daemon will read /proc/partitions and check for qdisk signatures
+on every block device found, comparing the label against the specified
+label. This is useful in configurations where the block device name
+differs on a per-node basis.
+.in 0
+
+.SH "3.2. The tag"
+This tag is a child of the tag.
+
+.in 8
+\fB\fP
+.in 12
+This is the frequency at which we poll the heuristic.
+.in 0
+
+.SH "3.3. Example"
+.in 8
+
+.in 12
+
+.br
+
+.br
+
+.br
+.in 8
+
+.in 0
+
+.SH "3.4. Heuristic score considerations"
+* Heuristic timeouts should be set high enough to allow the previous run
+of a given heuristic to complete.
+
+* Heuristic scripts returning anything except 0 as their return code
+are considered failed.
+
+* The worst-case for improperly configured quorum heuristics is a race
+to fence where two partitions simultaneously try to kill each other.
+
+.SH "3.5. Creating a quorum disk partition"
+The mkqdisk utility can create and list currently configured quorum disks
+visible to the local node; see
+.B mkqdisk(8)
+for more details.
+
+.SH "SEE ALSO"
+mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5)
--- cluster/cman/man/qdiskd.8 2006/07/21 17:53:08 1.1
+++ cluster/cman/man/qdiskd.8 2006/07/21 17:55:04 1.2
@@ -0,0 +1,20 @@
+.TH "qdiskd" "8" "July 2006" "" "Quorum Disk Management"
+.SH "NAME"
+qdiskd \- Cluster Quorum Disk Daemon
+.SH "SYNOPSIS"
+\fBqdiskd [\-f] [\-d]
+.SH "DESCRIPTION"
+.PP
+The \fBqdiskd\fP daemon talks to CMAN and provides a mechanism for determining
+node-fitness in a cluster environment. See
+.B
+qdisk(5)
+for configuration information.
+.SH "OPTIONS"
+.IP "\-f"
+Run in the foreground (do not fork / daemonize).
+.IP "\-d"
+Enable debug output.
+
+.SH "SEE ALSO"
+mkqdisk(8), qdisk(5), cman(5)
--- cluster/cman/man/Makefile 2004/08/13 06:38:22 1.1
+++ cluster/cman/man/Makefile 2006/07/21 17:55:04 1.2
@@ -18,10 +18,10 @@
 install:
 	install -d ${mandir}/man5
 	install -d ${mandir}/man8
-	install cman.5 ${mandir}/man5
-	install cman_tool.8 ${mandir}/man8
+	install cman.5 qdisk.5 ${mandir}/man5
+	install cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8
 uninstall:
-	${UNINSTALL} cman.5 ${mandir}/man5
-	${UNINSTALL} cman_tool.8 ${mandir}/man8
+	${UNINSTALL} cman.5 qdisk.5 ${mandir}/man5
+	${UNINSTALL} cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8
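
Note for readers of the qdisk.5 text in the patch above: the rules in its
sections 2.1 to 2.3 (hit/miss liveness counting, the "over 1/2 of the maximum
score" default, and the lowest-ID bidder) can be sketched in Python as below.
This is only an illustration and not how qdiskd is implemented. All names
(NodeLiveness, claims_available, bidder) are hypothetical, and two details are
assumptions the man page does not state: hits and misses are counted
consecutively, and a block that does not carry the "i am alive" state is
treated as a miss.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NodeLiveness:
    """Hit/miss bookkeeping for one node's status block (hypothetical)."""
    misses_allowed: int      # user-defined number of misses before "offline"
    hits_required: int       # number of hits before "online"
    last_timestamp: int = -1
    hits: int = 0
    misses: int = 0
    online: bool = False

    def observe(self, timestamp: int, alive: bool) -> bool:
        """Record one polling cycle and return the node's online state."""
        if alive and timestamp != self.last_timestamp:
            # Hit: changed timestamp plus "i am alive" state.
            self.hits += 1
            self.misses = 0  # assumption: counters are consecutive
            if self.hits >= self.hits_required:
                self.online = True
        else:
            # Miss: timestamp not updated (or, by assumption, not alive).
            self.misses += 1
            self.hits = 0
            if self.misses >= self.misses_allowed:
                self.online = False
        self.last_timestamp = timestamp
        return self.online


def claims_available(score: int, max_score: int) -> bool:
    """Default rule from section 2.2: score must be over half the maximum."""
    return 2 * score > max_score


def bidder(scores: Dict[int, int], max_score: int) -> Optional[int]:
    """Lowest node ID that scores high enough to bid (section 2.3)."""
    eligible = [node for node, s in scores.items()
                if claims_available(s, max_score)]
    return min(eligible) if eligible else None
```

With three nodes scoring {3: 10, 1: 4, 2: 8} against a maximum of 10, node 1
scores too low to bid, so bidder() returns node 2. The sketch omits the vote
exchange among the other nodes.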