Message-ID: <5301BA28.4070709@linux.vnet.ibm.com>
Date: Mon, 17 Feb 2014 15:28:40 +0800
From: "Michael R. Hines"
References: <20140206122611.GD3013@work-vm>
In-Reply-To: <20140206122611.GD3013@work-vm>
Subject: Re: [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
To: "Dr. David Alan Gilbert", yamahata@private.email.ne.jp, qemu-devel@nongnu.org, mrhines@us.ibm.com
Cc: hinesmr@cn.ibm.com, quintela@redhat.com

On 02/06/2014 08:26 PM, Dr. David Alan Gilbert wrote:
> Hi Isaku,
>    I hit a seg in qemu_rdma_cleanup in the code changed by your
> '[PATCH] rdma: clean up of qemu_rdma_cleanup()'
>
> migration-rdma.c ~ 2241
>
>     if (rdma->qp) {
>         rdma_destroy_qp(rdma->cm_id);
>         rdma->qp = NULL;
>     }
>
> Your patch changed that to free cm_id at that point rather than
> qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.
>
> given that there is a :
>
>     if (rdma->cm_id) {
>         rdma_destroy_id(rdma->cm_id);
>         rdma->cm_id = NULL;
>     }
>
> later down, and there is now no longer any destroy of rdma->qp
> I don't understand your change.
>
> Your change text says:
> '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
>    by rdma_destroy_qp(). not ibv_destroy_qp()'
>
> but the diff is:
>      if (rdma->qp) {
> -        ibv_destroy_qp(rdma->qp);
> +        rdma_destroy_qp(rdma->cm_id);
>          rdma->qp = NULL;
>
> should that have been rdma_destroy_qp(rdma->qp)?
>
> Dave (who doesn't yet know enough RDMA to be dangerous)
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

Responding for Isaku.....

Thanks for reporting the bug, but I need some help in tracking down
the cause of the bug, see below.

Actually, the parameter "rdma->cm_id" to the function is correct,
it's just that the variable never got initialized in the first place,
which means that either the connection never got established or an
early error happened during the migration that required cleaning up
the identifier.

Can you describe the conditions of the migration and the environment?

1. Did you migrate only one VM? Was the host under heavy load?
2. Did your migration lose connectivity? Did one of the hosts crash?
3. Was the connection abruptly broken for some reason?
4. Did you ever cancel the migration at some point and restart?
5. Did you use libvirt?

A simple fix would be to surround the "rdma_destroy_qp()" call with a
check to see if rdma->cm_id is valid, but that doesn't answer why
rdma->cm_id would be invalid in the first place.

I need some additional information to try to reproduce the conditions
of the bug.

Thanks!

- Michael Hines
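
For concreteness, here is a minimal sketch of the guard described above. It is not a patch against migration-rdma.c: the struct and the fake_* functions are hypothetical stand-ins so that it compiles without librdmacm. In the real code the calls would be rdma_destroy_qp() and rdma_destroy_id() on the real cm_id. It only avoids the crash; it does not explain why cm_id is NULL.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the librdmacm types; not the real definitions. */
struct fake_cm_id { int unused; };
struct fake_qp    { int unused; };

/* Counters so a caller can see which teardown paths actually ran. */
static int qp_destroy_calls;
static int id_destroy_calls;

/* Stand-in for rdma_destroy_qp(): the real call takes the cm_id, not the qp. */
static void fake_destroy_qp(struct fake_cm_id *id)
{
    (void)id;
    qp_destroy_calls++;
}

/* Stand-in for rdma_destroy_id(). */
static void fake_destroy_id(struct fake_cm_id *id)
{
    (void)id;
    id_destroy_calls++;
}

/* Only the two fields discussed in this thread; the real RDMAContext has more. */
struct sketch_rdma_context {
    struct fake_cm_id *cm_id;
    struct fake_qp    *qp;
};

static void sketch_rdma_cleanup(struct sketch_rdma_context *rdma)
{
    if (rdma->qp) {
        /* The guard: skip the QP teardown when cm_id was never set up. */
        if (rdma->cm_id) {
            fake_destroy_qp(rdma->cm_id);
        }
        rdma->qp = NULL;
    }

    /* Unchanged from the quoted code: the id is destroyed later on. */
    if (rdma->cm_id) {
        fake_destroy_id(rdma->cm_id);
        rdma->cm_id = NULL;
    }
}
```

Calling sketch_rdma_cleanup() with cm_id == NULL and a non-NULL qp (the case from the report) clears qp without touching either destroy function. With both pointers set, each destroy function runs once.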