ocfs2-devel.oss.oracle.com archive mirror
From: Sunil Mushran <sunil.mushran@oracle.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH 17/22] ocfs2/cluster: Maintain bitmap of failed regions
Date: Thu,  7 Oct 2010 17:15:31 -0700
Message-ID: <1286496936-17072-18-git-send-email-sunil.mushran@oracle.com>
In-Reply-To: <1286496936-17072-1-git-send-email-sunil.mushran@oracle.com>

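In global heartbeat mode, maintain a bitmap of the quorum regions that
have seen heartbeat write timeouts. A region's bit is set when its write
timeout fires and cleared when the write timeout is re-armed. We fence if
the number of failed regions is greater than or equal to half the number
of quorum regions.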

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
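Note for readers (not part of the commit message): the fencing rule above
can be sketched in userspace C roughly as follows. This is an illustrative
sketch only, not the kernel code. MAX_REGIONS, struct region_maps and the
map_*/region_* helpers are hypothetical names invented for this example;
the patch itself uses O2NM_MAX_REGIONS, the kernel bitops, and takes
o2hb_live_lock around the bitmap accesses, which the sketch omits.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REGIONS	32	/* hypothetical stand-in for O2NM_MAX_REGIONS */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS	((MAX_REGIONS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical container for the two bitmaps the rule depends on. */
struct region_maps {
	unsigned long quorum[MAP_WORDS];	/* regions counted for quorum */
	unsigned long failed[MAP_WORDS];	/* quorum regions that timed out */
};

static bool map_test(const unsigned long *map, unsigned int bit)
{
	assert(bit < MAX_REGIONS);
	return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}

static void map_set(unsigned long *map, unsigned int bit)
{
	assert(bit < MAX_REGIONS);
	map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

static void map_clear(unsigned long *map, unsigned int bit)
{
	assert(bit < MAX_REGIONS);
	map[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
}

/* Count the set bits in the first nbits bits of the map. */
static int map_weight(const unsigned long *map, unsigned int nbits)
{
	int pop = 0;

	for (unsigned int i = 0; i < nbits; i++)
		if (map_test(map, i))
			pop++;
	return pop;
}

/*
 * A write timeout fired on 'region'. Only quorum regions are marked
 * failed. Returns true when the node should fence, i.e. when the
 * failed count is at least half the quorum count.
 */
static bool region_write_timeout(struct region_maps *m, unsigned int region)
{
	int failed, quorum;

	if (map_test(m->quorum, region))
		map_set(m->failed, region);
	failed = map_weight(m->failed, MAX_REGIONS);
	quorum = map_weight(m->quorum, MAX_REGIONS);

	return (failed << 1) >= quorum;
}

/* Re-arming the write timeout clears the region's failed bit. */
static void region_rearm(struct region_maps *m, unsigned int region)
{
	map_clear(m->failed, region);
}
```

With three quorum regions, one timeout does not fence but a second one
does; with two quorum regions, a single timeout already fences because
"half" is inclusive. With no quorum regions at all the comparison is
0 >= 0, so any write timeout fences, which matches the early-return test
in the patch.
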
 fs/ocfs2/cluster/heartbeat.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 62a8af2..f890656 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -68,10 +68,12 @@ static DECLARE_WAIT_QUEUE_HEAD(o2hb_steady_queue);
  * 	- o2hb_live_region_bitmap tracks live regions (seen steady iterations).
  * 	- o2hb_quorum_region_bitmap tracks live regions that have seen all nodes
  * 		heartbeat on it.
+ * 	- o2hb_failed_region_bitmap tracks the regions that have seen io timeouts.
  */
 static unsigned long o2hb_region_bitmap[BITS_TO_LONGS(O2NM_MAX_REGIONS)];
 static unsigned long o2hb_live_region_bitmap[BITS_TO_LONGS(O2NM_MAX_REGIONS)];
 static unsigned long o2hb_quorum_region_bitmap[BITS_TO_LONGS(O2NM_MAX_REGIONS)];
+static unsigned long o2hb_failed_region_bitmap[BITS_TO_LONGS(O2NM_MAX_REGIONS)];
 
 #define O2HB_DB_TYPE_LIVENODES		0
 struct o2hb_debug_buf {
@@ -217,8 +219,19 @@ struct o2hb_bio_wait_ctxt {
 	int               wc_error;
 };
 
+static int o2hb_pop_count(void *map, int count)
+{
+	int i = -1, pop = 0;
+
+	while ((i = find_next_bit(map, count, i + 1)) < count)
+		pop++;
+	return pop;
+}
+
 static void o2hb_write_timeout(struct work_struct *work)
 {
+	int failed, quorum;
+	unsigned long flags;
 	struct o2hb_region *reg =
 		container_of(work, struct o2hb_region,
 			     hr_write_timeout_work.work);
@@ -226,6 +239,28 @@ static void o2hb_write_timeout(struct work_struct *work)
 	mlog(ML_ERROR, "Heartbeat write timeout to device %s after %u "
 	     "milliseconds\n", reg->hr_dev_name,
 	     jiffies_to_msecs(jiffies - reg->hr_last_timeout_start));
+
+	if (o2hb_global_heartbeat_active()) {
+		spin_lock_irqsave(&o2hb_live_lock, flags);
+		if (test_bit(reg->hr_region_num, o2hb_quorum_region_bitmap))
+			set_bit(reg->hr_region_num, o2hb_failed_region_bitmap);
+		failed = o2hb_pop_count(&o2hb_failed_region_bitmap,
+					O2NM_MAX_REGIONS);
+		quorum = o2hb_pop_count(&o2hb_quorum_region_bitmap,
+					O2NM_MAX_REGIONS);
+		spin_unlock_irqrestore(&o2hb_live_lock, flags);
+
+		mlog(ML_HEARTBEAT, "Number of regions %d, failed regions %d\n",
+		     quorum, failed);
+
+		/*
+		 * Fence if the number of failed regions >= half the number
+		 * of quorum regions
+		 */
+		if ((failed << 1) < quorum)
+			return;
+	}
+
 	o2quo_disk_timeout();
 }
 
@@ -234,6 +269,11 @@ static void o2hb_arm_write_timeout(struct o2hb_region *reg)
 	mlog(ML_HEARTBEAT, "Queue write timeout for %u ms\n",
 	     O2HB_MAX_WRITE_TIMEOUT_MS);
 
+	if (o2hb_global_heartbeat_active()) {
+		spin_lock(&o2hb_live_lock);
+		clear_bit(reg->hr_region_num, o2hb_failed_region_bitmap);
+		spin_unlock(&o2hb_live_lock);
+	}
 	cancel_delayed_work(&reg->hr_write_timeout_work);
 	reg->hr_last_timeout_start = jiffies;
 	schedule_delayed_work(&reg->hr_write_timeout_work,
@@ -1173,6 +1213,7 @@ int o2hb_init(void)
 	memset(o2hb_region_bitmap, 0, sizeof(o2hb_region_bitmap));
 	memset(o2hb_live_region_bitmap, 0, sizeof(o2hb_live_region_bitmap));
 	memset(o2hb_quorum_region_bitmap, 0, sizeof(o2hb_quorum_region_bitmap));
+	memset(o2hb_failed_region_bitmap, 0, sizeof(o2hb_failed_region_bitmap));
 
 	return o2hb_debug_init();
 }
-- 
1.7.0.4

Thread overview: 30+ messages
2010-10-08  0:15 [Ocfs2-devel] O2CB global heartbeat - hopefully final drop! Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 01/22] ocfs2/cluster: Add heartbeat mode configfs parameter Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 02/22] ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFO Sunil Mushran
2010-10-08 23:11   ` Mark Fasheh
2010-10-08 23:26     ` Sunil Mushran
2010-10-09  0:15       ` Joel Becker
2010-10-09 11:41         ` Sunil Mushran
2010-10-09 17:35           ` Sunil Mushran
2010-10-11 21:46             ` Joel Becker
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 03/22] ocfs2: Add support for heartbeat=global mount option Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 04/22] ocfs2/dlm: Expose dlm_protocol in dlm_state Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 05/22] ocfs2/cluster: Get all heartbeat regions Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 06/22] ocfs2/dlm: Add message DLM_QUERY_REGION Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 07/22] ocfs2: Print message if user mounts without starting global heartbeat Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 08/22] ocfs2/dlm: Add message DLM_QUERY_NODEINFO Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 09/22] ocfs2/cluster: Print messages when adding/removing heartbeat regions Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 10/22] ocfs2/cluster: Print messages when adding/removing nodes Sunil Mushran
2010-10-08  0:33   ` Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 11/22] ocfs2/cluster: Check slots for unconfigured live nodes Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 12/22] ocfs2/cluster: Reorganize o2hb debugfs init Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 13/22] ocfs2/cluster: Maintain live node bitmap per heartbeat region Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 14/22] ocfs2/cluster: Track number of global heartbeat regions Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 15/22] ocfs2/cluster: Track bitmap of live " Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 16/22] ocfs2/cluster: Maintain bitmap of quorum regions Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 17/22] ocfs2/cluster: Maintain bitmap of failed regions Sunil Mushran [this message]
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 18/22] ocfs2/cluster: Create debugfs files for live, quorum and failed region bitmaps Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 19/22] ocfs2/cluster: Create debugfs dir/files for each region Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 20/22] ocfs2/cluster: Add mlogs for heartbeat up/down events Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 21/22] ocfs2/cluster: Show per region heartbeat elapsed time Sunil Mushran
2010-10-08  0:15 ` [Ocfs2-devel] [PATCH 22/22] ocfs2/dlm: Bump up dlm protocol to version 1.1 Sunil Mushran
