From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mathieu Avila
Date: Mon, 16 Oct 2006 16:07:38 +0200
Subject: [Cluster-devel] Panic when stopping gulm.
Message-ID: <20061016160738.4b35b93e@mathieu.toulouse>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hello,

I occasionally get a kernel panic when stopping gulm on my whole
cluster. It does not happen often.
The panic occurs inside a function of the "ipv6" module, called from
one of the gulm kernel threads.

Process gulm_res_recvd (pid: 5029, threadinfo 0000010021300000, task 000001003f60f030)
Stack: 0000000000004034 0000000000000000 0000001e124dd670 000001001533d380
       0000010021301d08 0000000000000000 0000010021301e18 000001003c3a5e00
       00000100124dd670 0000000023222120
Call Trace:{:ipv6:tcp_v6_xmit+611} {autoremove_wake_function+0}
       {:lock_gulm:do_tfer+252} {:lock_gulm:xdr_send+34}
       {:lock_gulm:xdr_enc_flush+44} {:lock_gulm:xdr_enc_release+19}
       {:lock_gulm:lg_core_handle_messages+394}
       {:lock_gulm:cm_io_recving_thread+73}
       {child_rip+8} {:lock_gulm:cm_io_recving_thread+0}
       {child_rip+0}

I looked at the code of the function "do_tfer" in src/gulm/xdr_io.c
and found something strange:

---------------------------------------------------
for (;;) {
        m.msg_iov = iov;
        m.msg_iovlen = n;
        m.msg_flags = MSG_NOSIGNAL;

        if (dir)
                rv = sock_sendmsg (sock, &m, size - moved);
        else
                rv = sock_recvmsg (sock, &m, size - moved, 0);

        if (rv <= 0)
                goto out_err;
        moved += rv;

        if (moved >= size)
                break;

        /* adjust iov's for next transfer */
        while (iov->iov_len == 0) {
                iov++;
                n--;
        }
---------------------------------------------------

In my opinion, when "sock_sendmsg" does not transfer the exact size
that was requested, we reach

        while (iov->iov_len == 0) {
                iov++;
                n--;
        }

and this loop can advance "iov" even if we are already at the last
buffer, because "n", the number of buffers in the "iov" array, is
never checked.
"sock_sendmsg" is then called with an invalid buffer pointer.... (m.msg_iov = iov) I don't know if this is of any interest, since "n" always equals "1", wherever "do_tfer" is called. Anyway, this couldn't happen if "n" was checked: --------------------------- while ( (n>1)&&(iov->iov_len == 0) { iov++; n--; } if (n<=1) break; --------------------------- This still doesn't guarantee that the message will be sent as a whole. Using : m.msg_flags = MSG_NOSIGNAL | MSG_WAITALL; and a loop over sock_sendmsg till the full message is sent is the solution, maybe. Any idea on this ? Thanks in advance, -- Mathieu Avila