From: lhh@sourceware.org
Date: 21 Jul 2006 17:55:20 -0000
Subject: [Cluster-devel] cluster/cman/qdisk README
Message-ID: <20060721175520.29843.qmail@sourceware.org>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

CVSROOT:	/cvs/cluster
Module name:	cluster
Changes by:	lhh at sourceware.org	2006-07-21 17:55:19

Modified files:
	cman/qdisk : README

Log message:
	Add man pages for qdisk

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/README.diff?cvsroot=cluster&r1=1.3&r2=1.4

--- cluster/cman/qdisk/README	2006/06/23 16:05:33	1.3
+++ cluster/cman/qdisk/README	2006/07/21 17:55:19	1.4
@@ -1,274 +1 @@

qdisk 1.0 - a disk-based quorum algorithm for Linux-Cluster

(C) 2006 Red Hat, Inc.

1. Overview

1.1. Problem

In some situations, it may be necessary or desirable to sustain a majority
node failure of a cluster without introducing the need for asymmetric
cluster configurations (e.g. client-server, or heavily weighted voting
nodes).

1.2. Design Requirements

* Ability to sustain the failure of up to (n-1) of n nodes simultaneously,
without the danger of a simple network partition causing a split brain.
That is, we need to be able to ensure that a majority failure is not
merely the result of a network partition.

* Ability to use external criteria for deciding which partition is the
quorate partition in a partitioned cluster.  For example, a user may have
a service running on one node, and that node must always be the master in
the event of a network partition.  Or, a node might lose all network
connectivity except the cluster communication path - in which case, a
user may wish that node to be evicted from the cluster.

* Integration with CMAN.  CMAN must be able to run with us or without us.
Linux-Cluster does not normally require a quorum disk, and introducing new
requirements on how the base Linux-Cluster stack operates is not allowed.

* Data integrity.  In order to recover from a majority failure, fencing
is required.  The fencing subsystem is already provided by Linux-Cluster.

* Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
reservations).  This ensures the quorum disk algorithm can be used on the
widest possible range of hardware configurations.

* Little or no memory allocation after initialization.  In critical paths
during failover, we do not want to risk being killed because an
allocation triggers a page fault under memory pressure and the Linux OOM
killer responds...

1.3. Hardware Configuration Considerations

1.3.1. Concurrent, Synchronous, Read/Write Access

This daemon requires a shared block device with concurrent read/write
access from all nodes in the cluster.  The shared block device can be a
multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI
target, or even GNBD.  The quorum daemon uses O_DIRECT to write to the
device.

1.3.2. Bargain-basement JBODs need not apply

There is a minimum performance requirement inherent in using disk-based
cluster quorum algorithms, so design your cluster accordingly.  Using a
cheap JBOD with old SCSI-2 disks on a multi-initiator bus will cause
problems at the first load spike.  Plan your loads accordingly; a node's
inability to write to the quorum disk in a timely manner will cause the
cluster to evict the node.  Using host-RAID or multi-initiator parallel
SCSI configurations with the qdisk daemon is unlikely to work, and will
probably cause administrators a lot of frustration.  That having been
said, because the timeouts are configurable, most hardware should work
if the timeouts are set high enough.
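
Where storage is known to be slow, one way the timeouts might be relaxed
is sketched below.  This is a minimal illustration only: the attribute
names (interval, tko, label) and all values are assumptions chosen for
this example, and the <quorumd> tag itself is described in section 3.

    <!-- Illustration only: poll the quorum disk every 3 seconds and
         tolerate 23 missed updates (roughly 69 seconds of stalled I/O)
         before declaring a node dead.  CMAN's own eviction timeout
         would then need to be at least twice that (see section 1.4). -->
    <quorumd interval="3" tko="23" label="slow_array_qdisk"/>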
1.3.3. Fencing is Required

In order to maintain data integrity under all failure scenarios, use of
this quorum daemon requires adequate fencing, preferably power-based
fencing.

1.4. Limitations

* At this time, this daemon supports a maximum of 16 nodes.

* Cluster node IDs must be statically configured in cluster.conf and
must be numbered from 1..16 (there can be gaps, of course).

* Cluster node votes should be more or less equal.

* CMAN must be running before the qdisk program can start.  This
limitation will be removed before a production release.

* CMAN's eviction timeout should be at least 2x the quorum daemon's, to
give the quorum daemon adequate time to converge on a master during a
combined failure + load spike situation.

* The total number of votes assigned to the quorum device should be
equal to or greater than the total number of node votes in the cluster.
While it is possible to assign only one (or a few) votes to the quorum
device, the effects of doing so have not been explored.

* Currently, the quorum disk daemon is difficult to use with CLVM if the
quorum disk resides on a CLVM logical volume.  CLVM requires a quorate
cluster to operate correctly, which introduces a chicken-and-egg problem
for starting the cluster: CLVM needs quorum, but the quorum daemon needs
CLVM (if and only if the quorum device lies on CLVM-managed storage).
One way to work around this is to *not* set the cluster's expected votes
to include the quorum daemon's votes.  Bring all nodes online, and start
the quorum daemon *after* the whole cluster is running.  This will allow
the expected votes to increase naturally.

2. Algorithms

2.1. Heartbeating & Liveliness Determination

Nodes update individual status blocks on the quorum disk at a
user-defined rate.  Each write of a status block alters the timestamp,
which is what other nodes use to decide whether a node has hung or not.
After a user-defined number of 'misses' (that is, failures to update the
timestamp), a node is declared offline.  After a certain number of 'hits'
(changed timestamp + "I am alive" state), the node is declared online.

The status block contains additional information, such as a bitmask of
the nodes which that node believes are online.  Some of this information
is used by the master, while some is just for performance recording and
may be used at a later time.  The most important pieces of information a
node writes to its status block are:

 - timestamp
 - internal state (available / not available)
 - score
 - max score
 - vote/bid messages
 - other nodes it thinks are online

2.2. Scoring & Heuristics

The administrator can configure up to 10 purely arbitrary heuristics, and
must exercise caution in doing so.  By default, only nodes scoring over
1/2 of the total maximum score will claim they are available via the
quorum disk, and a node (master or otherwise) whose score drops too low
will remove itself (usually, by rebooting).

The heuristics themselves can be any command executable by 'sh -c'.  For
example, in early testing, I used this:

    [ -f /quorum ]

This is a literal sh-ism which tests for the existence of a file called
"/quorum".  Without that file, the node would claim it was unavailable.
This is an awful example, and should never, ever be used in production,
but is provided to illustrate what one could do...

Typically, the heuristics should be snippets of shell code or commands
which help determine a node's usefulness to the cluster or clients.
Ideally, you want to add checks for all of your network paths (e.g.
check links, or ping routers), and methods to detect availability of
shared storage.

2.3. Master Election

Only one master is present at any one time in the cluster, regardless of
how many partitions exist within the cluster itself.  The master is
elected by a simple voting scheme in which the node with the lowest node
ID which believes it is capable of running (i.e. scores high enough) bids
for master status.  If the other nodes agree, it becomes the master.
This algorithm is run whenever no master is present.

If another node comes online with a lower node ID while a node is still
bidding for master status, the bidding node will rescind its bid and vote
for the lower node ID.  If a master dies or a bidding node dies, the
voting algorithm is started over.  The voting algorithm typically takes
two passes to complete.

Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master can
be evicted & fenced.

2.4. Master Duties

The master node decides who is or is not in the master partition, and
handles eviction of dead nodes (both via the quorum disk and via the
linux-cluster fencing system, using the cman_kill_node() API).

2.5. How it All Ties Together

When a master is present, and if the master believes a node to be online,
that node will advertise to CMAN that the quorum disk is available.  The
master will only grant a node membership if:

   (a) CMAN believes the node to be online, and
   (b) that node has made enough consecutive, timely writes to the
       quorum disk.

3. Configuration

3.1. The <quorumd> tag

This tag is a child of the top-level <cluster> tag.

  label
     This overrides the device field if present.  If specified, the
     quorum daemon will read /proc/partitions and check for qdisk
     signatures on every block device found, comparing the label against
     the specified label.  This is useful in configurations where the
     block device name differs on a per-node basis.

3.2. The <heuristic> tag

This tag is a child of the <quorumd> tag.

  interval
     This is the frequency at which we poll the heuristic.

3.3. Example
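
What follows is an illustrative sketch of how the <quorumd> and
<heuristic> tags described in 3.1 and 3.2 might be combined in
cluster.conf.  Apart from label and the heuristic interval, the attribute
names shown (interval, tko, votes, program, score) and every value,
address, and command are assumptions chosen for illustration.

    <cluster name="example_cluster" config_version="1">
        <quorumd interval="1" tko="10" votes="3" label="testing">
            <!-- Ping the default gateway (placeholder address). -->
            <heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
            <!-- Verify the storage network link is up (placeholder check). -->
            <heuristic program="ip link show eth1 | grep -q UP" score="1" interval="2"/>
        </quorumd>
        <!-- <clusternodes>, <cman>, and fencing configuration omitted. -->
    </cluster>

With two heuristics of score 1 each, the default threshold described in
section 2.2 (over 1/2 of the maximum score) means both checks must pass
before the node claims it is available.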
3.4. Heuristic score considerations

* Heuristic timeouts should be set high enough to allow the previous run
of a given heuristic to complete.

* Heuristic scripts returning anything except 0 as their exit code are
considered failed.

* The worst case for improperly configured quorum heuristics is a fence
race, in which two partitions simultaneously try to kill each other.

3.5. Creating a quorum disk partition

3.5.1. The mkqdisk utility

The mkqdisk utility can create and list currently configured quorum
disks visible to the local node.

  mkqdisk -L
     List available quorum disks.

  mkqdisk -f
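
To tie the vote accounting from section 1.4 to the configuration above,
the sketch below shows one way a hypothetical three-node cluster might
weight the quorum device.  Everything here is an assumption made for
illustration (node names, the address, and all attribute values), and the
<clusternode> and <quorumd> attribute names beyond label are likewise
assumed rather than taken from this document.

    <cluster name="example_cluster" config_version="1">
        <clusternodes>
            <!-- Three nodes, one vote each, with static node IDs in the
                 1..16 range required in section 1.4. -->
            <clusternode name="node1" nodeid="1" votes="1"/>
            <clusternode name="node2" nodeid="2" votes="1"/>
            <clusternode name="node3" nodeid="3" votes="1"/>
        </clusternodes>
        <!-- The quorum device carries as many votes as all nodes
             combined (section 1.4), so one surviving node plus the
             quorum disk still holds 4 of 6 votes and retains quorum. -->
        <quorumd interval="1" tko="10" votes="3" label="testing">
            <heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
        </quorumd>
        <!-- Fence devices and per-node fencing omitted; fencing is
             required (section 1.3.3). -->
    </cluster>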