From mboxrd@z Thu Jan 1 00:00:00 1970
From: Fabio Massimo Di Nitto
Date: Tue, 30 Oct 2007 14:39:02 +0100
Subject: [Cluster-devel] qdiskd hangs cluster activity
In-Reply-To: <47258982.9060400@ubuntu.com>
References: <47258982.9060400@ubuntu.com>
Message-ID: <472733F6.8070702@ubuntu.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

The culprit is a patch missing from the openais 0.82 release.

https://bugzilla.redhat.com/show_bug.cgi?id=314641

Fabio

Fabio Massimo Di Nitto wrote:
> Hi Lon,
>
> I found a very interesting bug that manages to hang the entire cluster.
>
> Setup is a 3-node cluster, with no fancy stuff running at all (I will be able
> to show it to you next Wed as it lives on my laptop ;)).
>
>
> test1 is a 1GB shared AOE device between the 3 nodes.
>
> The cluster starts without problems. After firing up qdiskd -f -d:
>
> qdiskd -f -d
> [12681] debug: Loading configuration information
> [12681] debug: Heuristic: 'ping 192.168.1.1 -c1 -t1' score=1 interv
> =2 tko=3
> [12681] debug: 1 heuristics loaded
> [12681] debug: Quorum Daemon: 1 heuristics, 1 interval, 10 tko, 0 votes
> open_partition: seek: Invalid argument
> qdisk_validate: open of /dev/sda2 for RDWR failed: Illegal seek
> qdisk_verify: Illegal seek
> [12681] info: Quorum Partition: /dev/etherd/e1.0 Label: test1
> [12681] info: Quorum Daemon Initializing
> [12682] info: Heuristic: 'ping 192.168.1.1 -c1 -t1' UP
> [12681] debug: Node 2 is UP
> [12681] debug: Node 3 is UP
> [12681] info: Initial score 1/1
> [12681] info: Initialization complete
> [12681] notice: Score sufficient for master operation (1/1; required=1); upgra
> ng
> [12681] debug: Making bid for master
> [12681] info: Assuming master role
>
> A few seconds after the node assumes the master role, it hangs. The others
> follow in a matter of seconds.
>
> aisexec is stalled in recv(..
>
> No way to recover. kill -9 all over is required.
>
> In attachment is a qdiskd strace from all the 3 nodes started at the exact same
> time.
>
> Fabio
>
> PS I wonder if we are hitting this:
>
> from qdisk/disk.c:
>
> /*
>  * All IOs must be of size which is a multiple of 512. Here we
>  * just add in enough extra to accommodate.
>  * XXX - if the on-disk offsets don't provide enough room we're cooked!
>  */
>

--
I'm going to make him an offer he can't refuse.