* [Cluster-devel] qdiskd hangs cluster activity
@ 2007-10-29 7:19 Fabio Massimo Di Nitto
2007-10-30 13:39 ` Fabio Massimo Di Nitto
0 siblings, 1 reply; 2+ messages in thread
From: Fabio Massimo Di Nitto @ 2007-10-29 7:19 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi Lon,
I found a very interesting bug that manages to hang the entire cluster.
The setup is a 3-node cluster with nothing fancy running at all (I will be able
to show it to you next Wed, as it lives on my laptop ;)).
<quorumd label="test1">
<heuristic program="ping 192.168.1.1 -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>
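As a rough illustration of what the score, interval and tko attributes above are meant to control, here is a minimal sketch in C. It is not qdiskd's code: the struct, the function names and the reading of tko as "consecutive failed runs tolerated before the heuristic counts as down" are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one heuristic; fields mirror the XML attributes. */
struct heuristic {
    int score;   /* points contributed while the heuristic is up */
    int tko;     /* assumed: consecutive failures before it is marked down */
    int misses;  /* consecutive failures seen so far */
    bool up;     /* current state */
};

/* Record one run of the heuristic program (exit status 0 means success). */
void heuristic_update(struct heuristic *h, int exit_status)
{
    assert(h != NULL);
    if (exit_status == 0) {
        h->misses = 0;
        h->up = true;
    } else if (++h->misses >= h->tko) {
        h->up = false;
    }
}

/* Sum the scores of all heuristics that are currently up. */
int total_score(const struct heuristic *hs, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        if (hs[i].up)
            sum += hs[i].score;
    return sum;
}
```

With score=1 and tko=3 as in the configuration above, this model keeps the node's score at 1 through two failed pings and drops it to 0 on the third consecutive failure; a single successful ping restores it.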
test1 is a 1 GB AoE device shared between the 3 nodes.
The cluster starts without problems. After firing up qdiskd -f -d:
qdiskd -f -d
[12681] debug: Loading configuration information
[12681] debug: Heuristic: 'ping 192.168.1.1 -c1 -t1' score=1 interval=2 tko=3
[12681] debug: 1 heuristics loaded
[12681] debug: Quorum Daemon: 1 heuristics, 1 interval, 10 tko, 0 votes
open_partition: seek: Invalid argument
qdisk_validate: open of /dev/sda2 for RDWR failed: Illegal seek
qdisk_verify: Illegal seek
[12681] info: Quorum Partition: /dev/etherd/e1.0 Label: test1
[12681] info: Quorum Daemon Initializing
[12682] info: Heuristic: 'ping 192.168.1.1 -c1 -t1' UP
[12681] debug: Node 2 is UP
[12681] debug: Node 3 is UP
[12681] info: Initial score 1/1
[12681] info: Initialization complete
[12681] notice: Score sufficient for master operation (1/1; required=1); upgrading
[12681] debug: Making bid for master
[12681] info: Assuming master role
A few seconds after the node assumes the master role, it hangs. The others
follow within a matter of seconds.
aisexec is stalled in recv(..
There is no way to recover; kill -9 on everything is required.
Attached is a qdiskd strace from all 3 nodes, started at exactly the same
time.
Fabio
PS I wonder if we are hitting this:
from qdisk/disk.c:
/*
* All IOs must be of size which is a multiple of 512. Here we
* just add in enough extra to accommodate.
* XXX - if the on-disk offsets don't provide enough room we're cooked!
*/
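To make the concern in that comment concrete, here is a minimal sketch in C of rounding an I/O length up to a multiple of 512 and checking whether the padded length still fits in the space available before the next on-disk record. The names IO_BLOCK_SIZE, round_up_to_block and padded_io_fits are hypothetical and are not taken from qdisk/disk.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the 512-byte granularity the comment refers to. */
#define IO_BLOCK_SIZE ((size_t)512)

/* Round len up to the next multiple of IO_BLOCK_SIZE (unchanged if aligned). */
size_t round_up_to_block(size_t len)
{
    /* Guard against wrap-around in the addition below. */
    assert(len <= SIZE_MAX - (IO_BLOCK_SIZE - 1));
    return (len + IO_BLOCK_SIZE - 1) / IO_BLOCK_SIZE * IO_BLOCK_SIZE;
}

/*
 * Return 1 if a record of len bytes, once padded, fits in a slot of
 * slot_size bytes (the gap between two on-disk offsets), 0 otherwise.
 * The 0 case is the "not enough room" situation the XXX note warns about.
 */
int padded_io_fits(size_t len, size_t slot_size)
{
    return round_up_to_block(len) <= slot_size;
}
```

For example, a 513-byte record is padded to 1024 bytes, so it fits in a 1024-byte slot but would spill into the neighbouring record if the offsets were only 512 bytes apart.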
--
I'm going to make him an offer he can't refuse.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qdisk.logs.tar.bz2
Type: application/x-bzip
Size: 10859 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20071029/300ad994/attachment.bin>
* [Cluster-devel] qdiskd hangs cluster activity
2007-10-29 7:19 [Cluster-devel] qdiskd hangs cluster activity Fabio Massimo Di Nitto
@ 2007-10-30 13:39 ` Fabio Massimo Di Nitto
0 siblings, 0 replies; 2+ messages in thread
From: Fabio Massimo Di Nitto @ 2007-10-30 13:39 UTC (permalink / raw)
To: cluster-devel.redhat.com
The culprit is a patch missing from the openais 0.82 release.
https://bugzilla.redhat.com/show_bug.cgi?id=314641
Fabio