From: jbrassow@sourceware.org <jbrassow@sourceware.org>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] cluster/cmirror-kernel/src dm-cmirror-client.c ...
Date: 14 Mar 2007 04:28:33 -0000
Message-ID: <20070314042833.28019.qmail@sourceware.org>
CVSROOT: /cvs/cluster
Module name: cluster
Branch: RHEL4
Changes by: jbrassow at sourceware.org 2007-03-14 04:28:32
Modified files:
cmirror-kernel/src: dm-cmirror-client.c dm-cmirror-server.c
Log message:
Bug 231230: leg failure on cmirrors causes devices to be stuck in SUSPE...
The problem here appears to be timeouts related to clvmd.
During failures under heavy load, clvmd commands (suspend/resume/
activate/deactivate) can take a long time. Clvmd assumes too quickly
that they have failed. This results in the fault handling being left
half done. Further calls to vgreduce (by hand or by dmeventd) will
not help because the _on-disk_ version of the meta-data is consistent -
that is, the faulty device has been removed.
The most significant change in this patch is the removal of the
'is_remote_recovering' function. This function was designed to check
if a remote node was recovering a region so that writes to the region
could be delayed. However, even with this function, it was possible
for a remote node to begin recovery on a region _after_ the function
was called, but before the write (mark request) took place. Because
of this, checking is done during the mark request stage - rendering
the call to 'is_remote_recovering' meaningless. Given the useless
nature of this function, it has been pulled. The benefits of its
removal are increased performance and much faster (more than an
order of magnitude) response during the mirror suspend process.
The faster suspend process leads to fewer clvmd timeouts and
reduced probability that bug 231230 will be triggered.
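
The race described above can be sketched in a few lines of C. This is
only an illustration of the check-then-act problem, not code from
dm-cmirror: struct toy_log and every toy_* function are hypothetical
names, and the interleaving is replayed in a single thread so the
result is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_REGIONS 8

/* Hypothetical stand-in for the server's per-region state. */
struct toy_log {
	bool recovering[NR_REGIONS];	/* a remote node is recovering it */
	bool marked[NR_REGIONS];	/* a write has marked it dirty    */
};

/* Advisory query, playing the role of the removed early check. */
static bool toy_is_recovering(const struct toy_log *log, size_t region)
{
	assert(region < NR_REGIONS);
	return log->recovering[region];
}

/* A remote node begins recovery on a region. */
static void toy_start_recovery(struct toy_log *log, size_t region)
{
	assert(region < NR_REGIONS);
	log->recovering[region] = true;
}

/*
 * Mark request: the recovery check and the state change happen in
 * one step, so the answer cannot go stale before it is used.
 * Returns false when the write must be delayed.
 */
static bool toy_mark_region(struct toy_log *log, size_t region)
{
	assert(region < NR_REGIONS);
	if (log->recovering[region])
		return false;
	log->marked[region] = true;
	return true;
}

/*
 * Replays the interleaving from the log message.  Returns true when
 * the early check reported "not recovering" and the mark request
 * still had to refuse, i.e. the early answer was stale.
 */
static bool toy_replay_race(void)
{
	struct toy_log log = {0};
	size_t region = 3;

	bool early = toy_is_recovering(&log, region);	/* check      */
	toy_start_recovery(&log, region);		/* remote node */
	bool mark_ok = toy_mark_region(&log, region);	/* act        */

	return !early && !mark_ok;
}
```

In this sketch the early query adds a round trip without adding any
guarantee, since the mark request has to repeat the check anyway.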
However, when a mirror device is reconfigured, the mirror sub-devices
are removed. This is done by activating them cluster-wide before
their removal. With high enough load during recovery, these operations
can still take a long time - even though they are linear devices.
This too has the potential for causing clvmd to time out and trigger
bug 231230. There is no cluster logging fix for this issue. The
cause of the delay on the linear devices must still be determined. A temporary
work-around would be to increase the timeout of clvmd (e.g. clvmd -t #).
Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cmirror-kernel/src/dm-cmirror-client.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.1.2.40&r2=1.1.2.41
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cmirror-kernel/src/dm-cmirror-server.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.1.2.25&r2=1.1.2.26
--- cluster/cmirror-kernel/src/Attic/dm-cmirror-client.c 2007/03/02 22:31:14 1.1.2.40
+++ cluster/cmirror-kernel/src/Attic/dm-cmirror-client.c 2007/03/14 04:28:32 1.1.2.41
@@ -647,12 +647,12 @@
}
}
+ list_add(&lc->log_list, &log_list_head);
+ spin_unlock(&log_list_lock);
DMDEBUG("Creating %s (%d)",
lc->uuid + (strlen(lc->uuid) - 8),
lc->uuid_ref);
- list_add(&lc->log_list, &log_list_head);
- spin_unlock(&log_list_lock);
INIT_LIST_HEAD(&lc->region_users);
lc->server_id = 0xDEAD;
@@ -767,6 +767,11 @@
list_del_init(&lc->log_list);
spin_unlock(&log_list_lock);
+ if ((lc->server_id == my_id) && !atomic_read(&lc->suspended))
+ consult_server(lc, 0, LRT_MASTER_LEAVING, NULL);
+
+ sock_release(lc->client_sock);
+
spin_lock(&region_state_lock);
list_for_each_entry_safe(rs, tmp_rs, &clear_region_list, rs_list) {
@@ -781,11 +786,6 @@
spin_unlock(&region_state_lock);
- if ((lc->server_id == my_id) && !atomic_read(&lc->suspended))
- consult_server(lc, 0, LRT_MASTER_LEAVING, NULL);
-
- sock_release(lc->client_sock);
-
if (lc->log_dev)
disk_dtr(log);
else
@@ -844,7 +844,6 @@
lc->sync_search = 0;
resume_server_requests();
atomic_set(&lc->suspended, 0);
- consult_server(lc, 0, LRT_IN_SYNC, NULL);
return 0;
}
@@ -1354,7 +1353,7 @@
.resume = cluster_resume,
.get_region_size = cluster_get_region_size,
.is_clean = cluster_is_clean,
- .is_remote_recovering = cluster_is_remote_recovering,
+/* .is_remote_recovering = cluster_is_remote_recovering,*/
.in_sync = cluster_in_sync,
.flush = cluster_flush,
.mark_region = cluster_mark_region,
@@ -1376,7 +1375,7 @@
.resume = cluster_resume,
.get_region_size = cluster_get_region_size,
.is_clean = cluster_is_clean,
- .is_remote_recovering = cluster_is_remote_recovering,
+/* .is_remote_recovering = cluster_is_remote_recovering,*/
.in_sync = cluster_in_sync,
.flush = cluster_flush,
.mark_region = cluster_mark_region,
--- cluster/cmirror-kernel/src/Attic/dm-cmirror-server.c 2007/02/26 17:38:06 1.1.2.25
+++ cluster/cmirror-kernel/src/Attic/dm-cmirror-server.c 2007/03/14 04:28:32 1.1.2.26
@@ -619,7 +619,7 @@
lc->sync_count++;
}
} else if (log_test_bit(lc->sync_bits, lr->u.lr_region)) {
- lc->sync_count--;
+ /* gone again: lc->sync_count--;*/
log_clear_bit(lc, lc->sync_bits, lr->u.lr_region);
}