From mboxrd@z Thu Jan 1 00:00:00 1970 From: lhh@sourceware.org Date: 21 Jul 2006 18:01:40 -0000 Subject: [Cluster-devel] cluster/cman Makefile init.d/Makefile man/Make ... Message-ID: <20060721180140.1996.qmail@sourceware.org> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit CVSROOT: /cvs/cluster Module name: cluster Branch: STABLE Changes by: lhh at sourceware.org 2006-07-21 18:01:38 Modified files: cman : Makefile cman/init.d : Makefile cman/man : Makefile Added files: cman/init.d : qdiskd cman/man : mkqdisk.8 qdisk.5 qdiskd.8 cman/qdisk : Makefile README bitmap.c clulog.c clulog.h crc32.c disk.c disk.h disk_util.c gettid.c gettid.h main.c mkqdisk.c platform.h proc.c score.c score.h Log message: Merge from RHEL4 branch; add QDisk Patches: http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/Makefile.diff?cvsroot=cluster&only_with_tag=STABLE&r1=1.4.8.1&r2=1.4.8.2 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/init.d/qdiskd.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/init.d/Makefile.diff?cvsroot=cluster&only_with_tag=STABLE&r1=1.1&r2=1.1.8.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/mkqdisk.8.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.4.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdisk.5.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.4.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdiskd.8.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.4.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/Makefile.diff?cvsroot=cluster&only_with_tag=STABLE&r1=1.1&r2=1.1.8.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/Makefile.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.5.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/README.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.4.2.1 
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/bitmap.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/clulog.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/clulog.h.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/crc32.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/disk.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.4.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/disk.h.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.3.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/disk_util.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/gettid.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.4.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/gettid.h.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/main.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.3.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/mkqdisk.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.3.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/platform.h.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/proc.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/score.c.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/score.h.diff?cvsroot=cluster&only_with_tag=STABLE&r1=NONE&r2=1.2.2.1 --- cluster/cman/Makefile 2005/07/05 16:01:29 1.4.8.1 
+++ cluster/cman/Makefile 2006/07/21 18:01:37 1.4.8.2 @@ -14,14 +14,17 @@ all: cd cman_tool && ${MAKE} all cd lib && ${MAKE} all + cd qdisk && ${MAKE} all copytobin: cd cman_tool && ${MAKE} copytobin + cd qdisk && ${MAKE} copytobin cd lib && ${MAKE} copytobin clean: cd bin && ${MAKE} clean cd cman_tool && ${MAKE} clean + cd qdisk && ${MAKE} clean cd lib && ${MAKE} clean distclean: clean @@ -31,10 +34,12 @@ cd man && ${MAKE} install cd cman_tool && ${MAKE} install cd lib && ${MAKE} install + cd qdisk && ${MAKE} install cd init.d && ${MAKE} install uninstall: cd cman_tool && ${MAKE} uninstall cd lib && ${MAKE} uninstall cd man && ${MAKE} uninstall + cd qdisk && ${MAKE} uninstall cd init.d && ${MAKE} uninstall /cvs/cluster/cluster/cman/init.d/qdiskd,v --> standard output revision 1.2.2.1 --- cluster/cman/init.d/qdiskd +++ - 2006-07-21 18:01:38.720108000 +0000 @@ -0,0 +1,63 @@ +#!/bin/bash +# +# chkconfig: 345 22 78 +# description: Starts and stops the quorum disk daemon +# +# +### BEGIN INIT INFO +# Provides: +### END INIT INFO + +. /etc/init.d/functions +[ -f /etc/sysconfig/cluster ] && . /etc/sysconfig/cluster + +LOCK_FILE="/var/lock/subsys/qdiskd" + +rtrn=1 +retries=0 + +# See how we were called. +case "$1" in + start) + action "Starting the Quorum Disk Daemon:" qdiskd + rtrn=$? + [ $rtrn = 0 ] && touch $LOCK_FILE + ;; + + stop) + echo -n "Stopping the Quorum Disk Daemon:" + killproc qdiskd + while [ -n "`pidof qdiskd`" ] && [ $retries -lt 5 ]; do + sleep 1 + killproc qdiskd + ((retries++)) + done + if [ -z "`pidof qdiskd`" ]; then + echo_success + echo + rtrn=0 + rm -f $LOCK_FILE + else + echo_failure + echo + rtrn=1 + fi + ;; + + restart) + $0 stop || exit $? + $0 start + rtrn=$? + ;; + + status) + status qdiskd + rtrn=$? 
+ ;; + + *) + echo $"Usage: $0 {start|stop|restart|status}" + ;; +esac + +exit $rtrn --- cluster/cman/init.d/Makefile 2004/12/17 20:07:59 1.1 +++ cluster/cman/init.d/Makefile 2006/07/21 18:01:38 1.1.8.1 @@ -10,7 +10,7 @@ ############################################################################### ############################################################################### -TARGET= cman +TARGET= cman qdiskd UNINSTALL=${top_srcdir}/scripts/uninstall.pl /cvs/cluster/cluster/cman/man/mkqdisk.8,v --> standard output revision 1.2.4.1 --- cluster/cman/man/mkqdisk.8 +++ - 2006-07-21 18:01:38.882836000 +0000 @@ -0,0 +1,23 @@ +.TH "mkqdisk" "8" "July 2006" "" "Quorum Disk Management" +.SH "NAME" +mkqdisk \- Cluster Quorum Disk Utility +.SH "WARNING" +Use of this command can cause the cluster to malfunction. +.SH "SYNOPSIS" +\fBmkqdisk [\-?|\-h] | [\-L] | [\-f \fPlabel\fB] [\-c \fPdevice \fB -l \fPlabel\fB] +.SH "DESCRIPTION" +.PP +The \fBmkqdisk\fP command is used to create a new quorum disk or display +existing quorum disks accessible from a given cluster node. +.SH "OPTIONS" +.IP "\-c device \-l label" +Initialize a new cluster quorum disk. This will destroy all data on the given +device. If a cluster is currently using that device as a quorum disk, the +entire cluster will malfunction. Do not ru +.IP "\-f label" +Find the cluster quorum disk with the given label and display information about it. +.IP "\-L" +Display information on all accessible cluster quorum disks. + +.SH "SEE ALSO" +qdisk(5) qdiskd(8) /cvs/cluster/cluster/cman/man/qdisk.5,v --> standard output revision 1.2.4.1 --- cluster/cman/man/qdisk.5 +++ - 2006-07-21 18:01:38.970862000 +0000 @@ -0,0 +1,309 @@ +.TH "QDisk" "5" "July 2006" "" "Cluster Quorum Disk" +.SH "NAME" +QDisk 1.0 \- a disk-based quorum daemon for CMAN / Linux-Cluster +.SH "1. 
Overview" +.SH "1.1 Problem" +In some situations, it may be necessary or desirable to sustain +a majority node failure of a cluster without introducing the need for +asymmetric cluster configurations (e.g. client-server, or heavily-weighted +voting nodes). + +.SH "1.2. Design Requirements" +* Ability to sustain 1..(n-1)/n simultaneous node failures, without the +danger of a simple network partition causing a split brain. That is, we +need to be able to ensure that the majority failure case is not merely +the result of a network partition. + +* Ability to use external reasons for deciding which partition is +the quorate partition in a partitioned cluster. For example, a user may +have a service running on one node, and that node must always be the master +in the event of a network partition. Or, a node might lose all network +connectivity except the cluster communication path - in which case, a +user may wish that node to be evicted from the cluster. + +* Integration with CMAN. We must not require CMAN to run with us (or +without us). Linux-Cluster does not require a quorum disk normally - +introducing new requirements on the base of how Linux-Cluster operates +is not allowed. + +* Data integrity. In order to recover from a majority failure, fencing +is required. The fencing subsystem is already provided by Linux-Cluster. + +* Non-reliance on hardware or protocol specific methods (i.e. SCSI +reservations). This ensures the quorum disk algorithm can be used on the +widest range of hardware configurations possible. + +* Little or no memory allocation after initialization. In critical paths +during failover, we do not want to have to worry about being killed during +a memory pressure situation because we request a page fault, and the Linux +OOM killer responds... + +.SH "1.3. Hardware Considerations and Requirements" +.SH "1.3.1. 
Concurrent, Synchronous, Read/Write Access" +This quorum daemon requires a shared block device with concurrent read/write +access from all nodes in the cluster. The shared block device can be +a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a RAIDed iSCSI +target, or even GNBD. The quorum daemon uses O_DIRECT to write to the +device. + +.SH "1.3.2. Bargain-basement JBODs need not apply" +There is a minimum performance requirement inherent when using disk-based +cluster quorum algorithms, so design your cluster accordingly. Using a +cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause +problems at the first load spike. Plan your loads accordingly; a node's +inability to write to the quorum disk in a timely manner will cause the +cluster to evict the node. Using host-RAID or multi-initiator parallel +SCSI configurations with the qdisk daemon is unlikely to work, and will +probably cause administrators a lot of frustration. That having been +said, because the timeouts are configurable, most hardware should work +if the timeouts are set high enough. + +.SH "1.3.3. Fencing is Required" +In order to maintain data integrity under all failure scenarios, use of +this quorum daemon requires adequate fencing, preferably power-based +fencing. Watchdog timers and software-based solutions to reboot the node +internally, while possibly sufficient, are not considered 'fencing' for +the purposes of using the quorum disk. + +.SH "1.4. Limitations" +* At this time, this daemon supports a maximum of 16 nodes. This is +primarily a scalability issue: As we increase the node count, we increase +the amount of synchronous I/O contention on the shared quorum disk. + +* Cluster node IDs must be statically configured in cluster.conf and +must be numbered from 1..16 (there can be gaps, of course). + +* Cluster node votes should be more or less equal. + +* CMAN must be running before the qdisk program can start. 
+ +* CMAN's eviction timeout should be at least 2x the quorum daemon's +to give the quorum daemon adequate time to converge on a master during a +failure + load spike situation. + +* The total number of votes assigned to the quorum device should be +equal to or greater than the total number of node-votes in the cluster. +While it is possible to assign only one (or a few) votes to the quorum +device, the effects of doing so have not been explored. + +* Currently, the quorum disk daemon is difficult to use with CLVM if +the quorum disk resides on a CLVM logical volume. CLVM requires a +quorate cluster to correctly operate, which introduces a chicken-and-egg +problem for starting the cluster: CLVM needs quorum, but the quorum daemon +needs CLVM (if and only if the quorum device lies on CLVM-managed storage). +One way to work around this is to *not* set the cluster's expected votes +to include the quorum daemon's votes. Bring all nodes online, and start +the quorum daemon *after* the whole cluster is running. This will allow +the expected votes to increase naturally. + +.SH "2. Algorithms" +.SH "2.1. Heartbeating & Liveliness Determination" +Nodes update individual status blocks on the quorum disk at a user- +defined rate. Each write of a status block alters the timestamp, which +is what other nodes use to decide whether a node has hung or not. +After a user-defined number of 'misses' (that is, failure to update a +timestamp), a node is declared offline. After a certain number of 'hits' +(changed timestamp + "i am alive" state), the node is declared online. + +The status block contains additional information, such as a bitmask of +the nodes that node believes are online. Some of this information is +used by the master - while some is just for performance recording, and +may be used at a later time. 
The most important pieces of information +a node writes to its status block are: + +.in 12 +- Timestamp +.br +- Internal state (available / not available) +.br +- Score +.br +- Known max score (may be used in the future to detect invalid configurations) +.br +- Vote/bid messages +.br +- Other nodes it thinks are online +.in 0 + +.SH "2.2. Scoring & Heuristics" +The administrator can configure up to 10 purely arbitrary heuristics, and +must exercise caution in doing so. At least one administrator- +defined heuristic is required for operation, but it is generally a good +idea to have more than one heuristic. By default, only nodes scoring over +1/2 of the total maximum score will claim they are available via the +quorum disk, and a node (master or otherwise) whose score drops too low +will remove itself (usually, by rebooting). + +The heuristics themselves can be any command executable by 'sh -c'. For +example, in early testing the following was used: + +.ti 12 +<\fBheuristic \fP\fIprogram\fP\fB="\fP[ -f /quorum ]\fB" \fP\fIscore\fP\fB="\fP10\fB" \fP\fIinterval\fP\fB="\fP2\fB"/>\fP + +This is a literal sh-ism which tests for the existence of a file called +"/quorum". Without that file, the node would claim it was unavailable. +This is an awful example, and should never, ever be used in production, +but is provided as an example as to what one could do... + +Typically, the heuristics should be snippets of shell code or commands which +help determine a node's usefulness to the cluster or clients. Ideally, you +want to add traces for all of your network paths (e.g. check links, or +ping routers), and methods to detect availability of shared storage. + +.SH "2.3. Master Election" +Only one master is present at any one time in the cluster, regardless of +how many partitions exist within the cluster itself. The master is +elected by a simple voting scheme in which the lowest node which believes +it is capable of running (i.e. scores high enough) bids for master status. 
+If the other nodes agree, it becomes the master. This algorithm is +run whenever no master is present. + +If another node comes online with a lower node ID while a node is still +bidding for master status, it will rescind its bid and vote for the lower +node ID. If a master dies or a bidding node dies, the voting algorithm +is started over. The voting algorithm typically takes two passes to +complete. + +Master deaths take marginally longer to recover from than non-master +deaths, because a new master must be elected before the old master can +be evicted & fenced. + +.SH "2.4. Master Duties" +The master node decides who is or is not in the master partition, as +well as handles eviction of dead nodes (both via the quorum disk and via +the linux-cluster fencing system by using the cman_kill_node() API). + +.SH "2.5. How it All Ties Together" +When a master is present, and if the master believes a node to be online, +that node will advertise to CMAN that the quorum disk is available. The +master will only grant a node membership if: + +.in 12 +(a) CMAN believes the node to be online, and +.br +(b) that node has made enough consecutive, timely writes +.in 16 +to the quorum disk, and +.in 12 +(c) the node has a high enough score to consider itself online. +.in 0 + +.SH "3. Configuration" +.SH "3.1. The tag" +This tag is a child of the top-level tag. + +.in 8 +\fB\fP +.in 12 +This overrides the device field if present. If specified, the quorum +daemon will read /proc/partitions and check for qdisk signatures +on every block device found, comparing the label against the specified +label. This is useful in configurations where the block device name +differs on a per-node basis. +.in 0 + +.SH "3.2. The tag" +This tag is a child of the tag. + +.in 8 +\fB\fP +.in 12 +This is the frequency at which we poll the heuristic. +.in 0 + +.SH "3.3. Example" +.in 8 + +.in 12 + +.br + +.br + +.br +.in 8 + +.in 0 + +.SH "3.4. 
Heuristic score considerations" +* Heuristic timeouts should be set high enough to allow the previous run +of a given heuristic to complete. + +* Heuristic scripts returning anything except 0 as their return code +are considered failed. + +* The worst-case for improperly configured quorum heuristics is a race +to fence where two partitions simultaneously try to kill each other. + +.SH "3.5. Creating a quorum disk partition" +The mkqdisk utility can create and list currently configured quorum disks +visible to the local node; see +.B mkqdisk(8) +for more details. + +.SH "SEE ALSO" +mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5) /cvs/cluster/cluster/cman/man/qdiskd.8,v --> standard output revision 1.2.4.1 --- cluster/cman/man/qdiskd.8 +++ - 2006-07-21 18:01:39.053646000 +0000 @@ -0,0 +1,20 @@ +.TH "qdiskd" "8" "July 2006" "" "Quorum Disk Management" +.SH "NAME" +qdiskd \- Cluster Quorum Disk Daemon +.SH "SYNOPSIS" +\fBqdiskd [\-f] [\-d] +.SH "DESCRIPTION" +.PP +The \fBqdiskd\fP daemon talks to CMAN and provides a mechanism for determining +node-fitness in a cluster environment. See +.B +qdisk(5) +for configuration information. +.SH "OPTIONS" +.IP "\-f" +Run in the foreground (do not fork / daemonize). +.IP "\-d" +Enable debug output. 
+ +.SH "SEE ALSO" +mkqdisk(8), qdisk(5), cman(5) --- cluster/cman/man/Makefile 2004/08/13 06:38:22 1.1 +++ cluster/cman/man/Makefile 2006/07/21 18:01:38 1.1.8.1 @@ -18,10 +18,10 @@ install: install -d ${mandir}/man5 install -d ${mandir}/man8 - install cman.5 ${mandir}/man5 - install cman_tool.8 ${mandir}/man8 + install cman.5 qdisk.5 ${mandir}/man5 + install cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8 uninstall: - ${UNINSTALL} cman.5 ${mandir}/man5 - ${UNINSTALL} cman_tool.8 ${mandir}/man8 + ${UNINSTALL} cman.5 qdisk.5 ${mandir}/man5 + ${UNINSTALL} cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8 /cvs/cluster/cluster/cman/qdisk/Makefile,v --> standard output revision 1.5.2.1 --- cluster/cman/qdisk/Makefile +++ - 2006-07-21 18:01:39.244798000 +0000 @@ -0,0 +1,49 @@ +############################################################################### +############################################################################### +## +## Copyright (C) 2004-2006 Red Hat, Inc. All rights reserved. +## +## This copyrighted material is made available to anyone wishing to use, +## modify, copy, or redistribute it subject to the terms and conditions +## of the GNU General Public License v.2. +## +############################################################################### +############################################################################### + +top_srcdir=.. +UNINSTALL=${top_srcdir}/scripts/uninstall.pl + +include ${top_srcdir}/make/defines.mk + +INCLUDES+=-I. 
-I../lib +CFLAGS +=-I${incdir} -I${top_srcdir}/config \ + -Wall -Werror -Wstrict-prototypes -Wshadow -D_GNU_SOURCE -g + +TARGET=qdiskd mkqdisk + +all: ${TARGET} + +copytobin: all + cp ${TARGET} ${top_srcdir}/bin + +install: ${TARGET} + install -d ${sbindir} + install ${TARGET} ${sbindir} + +qdiskd: disk.o crc32.o disk_util.o main.o score.o bitmap.o clulog.o \ + gettid.o proc.o ../lib/libcman.a + gcc -o $@ $^ -lpthread -L../lib -lccs + +mkqdisk: disk.o crc32.o disk_util.o \ + proc.o mkqdisk.o + gcc -o $@ $^ + + +%.o: %.c + $(CC) -c -o $@ $^ $(INCLUDES) $(CFLAGS) + +clean: + rm -f *.o ${TARGET} + +uninstall: + ${UNINSTALL} ${TARGET} ${sbindir} /cvs/cluster/cluster/cman/qdisk/README,v --> standard output revision 1.4.2.1 --- cluster/cman/qdisk/README +++ - 2006-07-21 18:01:39.324979000 +0000 @@ -0,0 +1,274 @@ +qdisk 1.0 - a disk-based quorum algorithm for Linux-Cluster + +(C) 2006 Red Hat, Inc. + +1. Overview + +1.1. Problem + +In some situations, it may be necessary or desirable to sustain +a majority node failure of a cluster without introducing the need for +asymmetric cluster configurations (e.g. client-server, or heavily-weighted voting nodes). + +1.2. Design Requirements + +* Ability to sustain 1..(n-1)/n simultaneous node failures, without the +danger of a simple network partition causing a split brain. That is, we +need to be able to ensure that the majority failure case is not merely +the result of a network partition. + +* Ability to use external reasons for deciding which partition is +the quorate partition in a partitioned cluster. For example, a user may +have a service running on one node, and that node must always be the master +in the event of a network partition. Or, a node might lose all network +connectivity except the cluster communication path - in which case, a +user may wish that node to be evicted from the cluster. + +* Integration with CMAN. We must not require CMAN to run with us (or +without us). 
Linux-Cluster does not require a quorum disk normally - +introducing new requirements on the base of how Linux-Cluster operates +is not allowed. + +* Data integrity. In order to recover from a majority failure, fencing +is required. The fencing subsystem is already provided by Linux-Cluster. + +* Non-reliance on hardware or protocol specific methods (i.e. SCSI +reservations). This ensures the quorum disk algorithm can be used on the +widest range of hardware configurations possible. + +* Little or no memory allocation after initialization. In critical paths +during failover, we do not want to have to worry about being killed during +a memory pressure situation because we request a page fault, and the Linux +OOM killer responds... + + +1.3. Hardware Configuration Considerations + +1.3.1. Concurrent, Synchronous, Read/Write Access + +This daemon requires a shared block device with concurrent read/write +access from all nodes in the cluster. The shared block device can be +a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a RAIDed iSCSI +target, or even GNBD. The quorum daemon uses O_DIRECT to write to the +device. + +1.3.2. Bargain-basement JBODs need not apply + +There is a minimum performance requirement inherent when using disk-based +cluster quorum algorithms, so design your cluster accordingly. Using a +cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause +problems at the first load spike. Plan your loads accordingly; a node's +inability to write to the quorum disk in a timely manner will cause the +cluster to evict the node. Using host-RAID or multi-initiator parallel +SCSI configurations with the qdisk daemon is unlikely to work, and will +probably cause administrators a lot of frustration. That having been +said, because the timeouts are configurable, most hardware should work +if the timeouts are set high enough. + +1.3.3. 
Fencing is Required + +In order to maintain data integrity under all failure scenarios, use of +this quorum daemon requires adequate fencing, preferably power-based +fencing. + + +1.4. Limitations + +* At this time, this daemon only supports a maximum of 16 nodes. + +* Cluster node IDs must be statically configured in cluster.conf and +must be numbered from 1..16 (there can be gaps, of course). + +* Cluster node votes should be more or less equal. + +* CMAN must be running before the qdisk program can start. This +limitation will be removed before a production release. + +* CMAN's eviction timeout should be at least 2x the quorum daemon's +to give the quorum daemon adequate time to converge on a master during a +failure + load spike situation. + +* The total number of votes assigned to the quorum device should be +equal to or greater than the total number of node-votes in the cluster. +While it is possible to assign only one (or a few) votes to the quorum +device, the effects of doing so have not been explored. + +* Currently, the quorum disk daemon is difficult to use with CLVM if +the quorum disk resides on a CLVM logical volume. CLVM requires a +quorate cluster to correctly operate, which introduces a chicken-and-egg +problem for starting the cluster: CLVM needs quorum, but the quorum daemon +needs CLVM (if and only if the quorum device lies on CLVM-managed storage). +One way to work around this is to *not* set the cluster's expected votes +to include the quorum daemon's votes. Bring all nodes online, and start +the quorum daemon *after* the whole cluster is running. This will allow +the expected votes to increase naturally. + + +2. Algorithms + +2.1. Heartbeating & Liveliness Determination + +Nodes update individual status blocks on the quorum disk at a user- +defined rate. Each write of a status block alters the timestamp, which +is what other nodes use to decide whether a node has hung or not. 
After +a user-defined number of 'misses' (that is, failure to update a +timestamp), a node is declared offline. After a certain number of 'hits' +(changed timestamp + "i am alive" state), the node is declared online. + +The status block contains additional information, such as a bitmask of +the nodes that node believes are online. Some of this information is +used by the master - while some is just for performance recording, and +may be used at a later time. The most important pieces of information +a node writes to its status block are: + + - timestamp + - internal state (available / not available) + - score + - max score + - vote/bid messages + - other nodes it thinks are online + + +2.2. Scoring & Heuristics + +The administrator can configure up to 10 purely arbitrary heuristics, and +must exercise caution in doing so. By default, only nodes scoring over +1/2 of the total maximum score will claim they are available via the +quorum disk, and a node (master or otherwise) whose score drops too low +will remove itself (usually, by rebooting). + +The heuristics themselves can be any command executable by 'sh -c'. For +example, in early testing, I used this: + + + +This is a literal sh-ism which tests for the existence of a file called +"/quorum". Without that file, the node would claim it was unavailable. +This is an awful example, and should never, ever be used in production, +but is provided as an example as to what one could do... + +Typically, the heuristics should be snippets of shell code or commands which +help determine a node's usefulness to the cluster or clients. Ideally, you +want to add traces for all of your network paths (e.g. check links, or +ping routers), and methods to detect availability of shared storage. + + +2.3. Master Election + +Only one master is present at any one time in the cluster, regardless of +how many partitions exist within the cluster itself. 
The master is +elected by a simple voting scheme in which the lowest node which believes +it is capable of running (i.e. scores high enough) bids for master status. +If the other nodes agree, it becomes the master. This algorithm is +run whenever no master is present. + +If another node comes online with a lower node ID while a node is still +bidding for master status, it will rescind its bid and vote for the lower +node ID. If a master dies or a bidding node dies, the voting algorithm +is started over. The voting algorithm typically takes two passes to +complete. + +Master deaths take marginally longer to recover from than non-master +deaths, because a new master must be elected before the old master can +be evicted & fenced. + + +2.4. Master Duties + +The master node decides who is or is not in the master partition, as +well as handles eviction of dead nodes (both via the quorum disk and via +the linux-cluster fencing system by using the cman_kill_node() API). + + +2.5. How it All Ties Together + +When a master is present, and if the master believes a node to be online, +that node will advertise to CMAN that the quorum disk is available. The +master will only grant a node membership if: + + (a) CMAN believes the node to be online, and + (b) that node has made enough consecutive, timely writes to the quorum + disk. + + +3. Configuration + +3.1. The tag + +This tag is a child of the top-level tag. + + This overrides the device field if present. + If specified, the quorum daemon will read + /proc/partitions and check for qdisk signatures + on every block device found, comparing the label + against the specified label. This is useful in + configurations where the block device name + differs on a per-node basis. + + +3.2. The tag + +This tag is a child of the tag. + + This is the frequency at which we poll the + heuristic. + +3.3. Example + + + + + + + +3.4. 
Heuristic score considerations + +* Heuristic timeouts should be set high enough to allow the previous run +of a given heuristic to complete. + +* Heuristic scripts returning anything except 0 as their return code +are considered failed. + +* The worst-case for improperly configured quorum heuristics is a race +to fence where two partitions simultaneously try to kill each other. + +3.5. Creating a quorum disk partition + +3.5.1. The mkqdisk utility. + +The mkqdisk utility can create and list currently configured quorum disks +visible to the local node. + + mkqdisk -L List available quorum disks. + + mkqdisk -f