From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:36200) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WFZmm-0000Di-BC for qemu-devel@nongnu.org; Mon, 17 Feb 2014 20:47:57 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WFZmd-0007Gn-Bd for qemu-devel@nongnu.org; Mon, 17 Feb 2014 20:47:48 -0500
Received: from e7.ny.us.ibm.com ([32.97.182.137]:60416) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WFZmd-0007GI-7H for qemu-devel@nongnu.org; Mon, 17 Feb 2014 20:47:39 -0500
Received: from /spool/local by e7.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 17 Feb 2014 20:47:38 -0500
Received: from b01cxnp23034.gho.pok.ibm.com (b01cxnp23034.gho.pok.ibm.com [9.57.198.29]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id EB50A38C803B for ; Mon, 17 Feb 2014 20:47:35 -0500 (EST)
Received: from d01av05.pok.ibm.com (d01av05.pok.ibm.com [9.56.224.195]) by b01cxnp23034.gho.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id s1I1lZNX3342828 for ; Tue, 18 Feb 2014 01:47:35 GMT
Received: from d01av05.pok.ibm.com (localhost [127.0.0.1]) by d01av05.pok.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id s1I1lZDs009844 for ; Mon, 17 Feb 2014 20:47:35 -0500
Message-ID: <5302BBB2.4010704@linux.vnet.ibm.com>
Date: Tue, 18 Feb 2014 09:47:30 +0800
From: "Michael R. Hines"
MIME-Version: 1.0
References: <20140206122611.GD3013@work-vm> <5301BA28.4070709@linux.vnet.ibm.com> <20140217090602.GA2978@work-vm>
In-Reply-To: <20140217090602.GA2978@work-vm>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: "Dr.
 David Alan Gilbert"
Cc: yamahata@private.email.ne.jp, quintela@redhat.com, hinesmr@cn.ibm.com, qemu-devel@nongnu.org, mrhines@us.ibm.com

On 02/17/2014 05:06 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>> On 02/06/2014 08:26 PM, Dr. David Alan Gilbert wrote:
>>> Hi Isaku,
>>> I hit a seg in qemu_rdma_cleanup in the code changed by your
>>> '[PATCH] rdma: clean up of qemu_rdma_cleanup()'
>>>
>>> migration-rdma.c ~ 2241
>>>
>>>     if (rdma->qp) {
>>>         rdma_destroy_qp(rdma->cm_id);
>>>         rdma->qp = NULL;
>>>     }
>>>
>>> Your patch changed that to free cm_id at that point rather than
>>> qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.
>>>
>>> given that there is a :
>>>
>>>     if (rdma->cm_id) {
>>>         rdma_destroy_id(rdma->cm_id);
>>>         rdma->cm_id = NULL;
>>>     }
>>>
>>> later down, and there is now no longer any destroy of rdma->qp
>>> I don't understand your change.
>>>
>>> Your change text says:
>>> '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
>>> by rdma_destroy_qp(). not ibv_destroy_qp()'
>>>
>>> but the diff is:
>>>      if (rdma->qp) {
>>> -        ibv_destroy_qp(rdma->qp);
>>> +        rdma_destroy_qp(rdma->cm_id);
>>>          rdma->qp = NULL;
>>>
>>> should that have been rdma_destroy_qp(rdma->qp)?
>>>
>>> Dave (who doesn't yet know enough RDMA to be dangerous)
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> Hi Michael,
>
>> Responding for Isaku..... Thanks for reporting the bug, but I need some help
>> in tracking down the cause of the bug, see below.
>
>> Actually, the parameter "rdma->cm_id" to the function is correct, it's just
>> that the variable never got initialized in the first place, which means
>> that either the connection never got established or an early error
>> happened during the migration that required cleaning up the identifier.
>>
>> Can you describe the conditions of the migration and the environment?
>> 1. Did you migrate only one VM?
>> Was the host under heavy load?
>> 2. Did your migration lose connectivity? Did one of the hosts crash?
>> 3. Was the connection abruptly broken for some reason?
>> 4. Did you ever cancel the migration at some point and restart?
>> 5. Did you use libvirt?
> This is my 1st attempt with RDMA and I'm using softiwarp and
> getting an early error, I've not tracked down why yet, hence why I'm
> only really reporting the cleanup seg.
>
> outgoing:
> rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
> RDMA ERROR: connecting to destination!
> migration: setting error state
> migration: setting error state
> migrate: RDMA ERROR: connecting to destination!
> (qemu)
>
> incoming:
> ibv_poll_cq wc.status=5 Work Request Flushed Error!
> ibv_poll_cq wrid=CONTROL RECV!
> messages from qemu_rdma_poll
>
> I'm not 100% sure which side fails first yet, but I believe that
> incoming fails after outgoing calls rdma_connect but before it calls
> rdma_get_cm_event; but as I say I'm new to RDMA and it's my 1st time
> trying to debug it.

OK, yes, that explains it. The cm_id was never successfully connected
to begin with, so I'll go ahead with a patch that checks for NULL
properly.

And regarding softiwarp: I recommend first making sure that the
standard RDMA helper utilities from OFED, like 'ucmatose' and
rdma_read/write and so forth, work cleanly between the two machines
you're trying to use. I've successfully migrated over softiwarp
before, but only after making sure the utilities were working.

- Michael
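
For readers following the thread, here is a rough sketch of what such a
NULL check could look like. It is not the actual patch. Every name
prefixed with mock_ is a stand-in invented for illustration (the real
code uses struct rdma_cm_id, rdma_destroy_qp() and rdma_destroy_id()
from librdmacm), and the stub asserts where, per the report above, the
real call would seg on a NULL cm_id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the librdmacm/libibverbs types. */
struct mock_qp { int unused; };
struct mock_cm_id { struct mock_qp *qp; };

/* Hypothetical, cut-down version of the migration context. */
struct mock_rdma_context {
    struct mock_cm_id *cm_id;  /* stays NULL if the connect fails early */
    struct mock_qp *qp;
};

/* Counters so a caller can observe which teardown calls happened. */
int destroy_qp_calls = 0;
int destroy_id_calls = 0;

/* Stub standing in for rdma_destroy_qp(): it needs a valid id. */
void mock_destroy_qp(struct mock_cm_id *id)
{
    assert(id != NULL);
    id->qp = NULL;
    destroy_qp_calls++;
}

/* Stub standing in for rdma_destroy_id(). */
void mock_destroy_id(struct mock_cm_id *id)
{
    assert(id != NULL);
    destroy_id_calls++;
}

/* Cleanup with the guard: the qp is only destroyed through cm_id
 * when cm_id actually exists (assumed fix, see the note above). */
void mock_cleanup(struct mock_rdma_context *rdma)
{
    if (rdma->qp) {
        if (rdma->cm_id) {
            mock_destroy_qp(rdma->cm_id);
        }
        rdma->qp = NULL;
    }
    if (rdma->cm_id) {
        mock_destroy_id(rdma->cm_id);
        rdma->cm_id = NULL;
    }
}
```

Calling mock_cleanup() on a context whose cm_id is NULL skips both
destroy calls and just clears the pointers, which is the early-error
case described above. Whether a qp can exist at all without a cm_id in
the real code is left open here.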