* [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
From: Dr. David Alan Gilbert @ 2014-02-06 12:26 UTC
To: yamahata, qemu-devel, mrhines; +Cc: quintela

Hi Isaku,
  I hit a seg in qemu_rdma_cleanup in the code changed by your
'[PATCH] rdma: clean up of qemu_rdma_cleanup()'

migration-rdma.c ~ 2241

    if (rdma->qp) {
        rdma_destroy_qp(rdma->cm_id);
        rdma->qp = NULL;
    }

Your patch changed that to free cm_id at that point rather than
qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.

given that there is a :

    if (rdma->cm_id) {
        rdma_destroy_id(rdma->cm_id);
        rdma->cm_id = NULL;
    }

later down, and there is now no longer any destroy of rdma->qp
I don't understand your change.

Your change text says:
  '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
     by rdma_destroy_qp(). not ibv_destroy_qp()'

but the diff is:
     if (rdma->qp) {
-        ibv_destroy_qp(rdma->qp);
+        rdma_destroy_qp(rdma->cm_id);
         rdma->qp = NULL;

should that have been rdma_destroy_qp(rdma->qp)?

Dave (who doesn't yet know enough RDMA to be dangerous)
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
From: Michael R. Hines @ 2014-02-17 7:28 UTC
To: Dr. David Alan Gilbert, yamahata, qemu-devel, mrhines; +Cc: hinesmr, quintela

On 02/06/2014 08:26 PM, Dr. David Alan Gilbert wrote:
> Hi Isaku,
>   I hit a seg in qemu_rdma_cleanup in the code changed by your
> '[PATCH] rdma: clean up of qemu_rdma_cleanup()'
>
> migration-rdma.c ~ 2241
>
>     if (rdma->qp) {
>         rdma_destroy_qp(rdma->cm_id);
>         rdma->qp = NULL;
>     }
>
> Your patch changed that to free cm_id at that point rather than
> qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.
>
> given that there is a :
>
>     if (rdma->cm_id) {
>         rdma_destroy_id(rdma->cm_id);
>         rdma->cm_id = NULL;
>     }
>
> later down, and there is now no longer any destroy of rdma->qp
> I don't understand your change.
>
> Your change text says:
>   '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
>      by rdma_destroy_qp(). not ibv_destroy_qp()'
>
> but the diff is:
>      if (rdma->qp) {
> -        ibv_destroy_qp(rdma->qp);
> +        rdma_destroy_qp(rdma->cm_id);
>          rdma->qp = NULL;
>
> should that have been rdma_destroy_qp(rdma->qp)?
>
> Dave (who doesn't yet know enough RDMA to be dangerous)
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

Responding for Isaku..... Thanks for reporting the bug, but I need some help
in tracking down the cause of the bug, see below.

Actually, the parameter "rdma->cm_id" to the function is correct, it's just
that the variable never got initialized in the first place, which means that
either the connection never got established or an early error happened during
the migration that required cleaning up the identifier.

Can you describe the conditions of the migration and the environment?

1. Did you migrate only one VM? Was the host under heavy load?
2. Did your migration lose connectivity? Did one of the hosts crash?
3. Was the connection abruptly broken for some reason?
4. Did you ever cancel the migration at some point and restart?
5. Did you use libvirt?

A simple fix would be to surround the "rdma_destroy_qp()" call with a check
to see if rdma->cm_id is valid, but that doesn't answer why rdma->cm_id would
be invalid in the first place.

I need some additional information to try to reproduce the conditions of
the bug.

Thanks!
- Michael Hines

* Re: [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
From: Dr. David Alan Gilbert @ 2014-02-17 9:06 UTC
To: Michael R. Hines; +Cc: yamahata, quintela, hinesmr, qemu-devel, mrhines

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/06/2014 08:26 PM, Dr. David Alan Gilbert wrote:
> >Hi Isaku,
> >   I hit a seg in qemu_rdma_cleanup in the code changed by your
> >'[PATCH] rdma: clean up of qemu_rdma_cleanup()'
> >
> >migration-rdma.c ~ 2241
> >
> >    if (rdma->qp) {
> >        rdma_destroy_qp(rdma->cm_id);
> >        rdma->qp = NULL;
> >    }
> >
> >Your patch changed that to free cm_id at that point rather than
> >qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.
> >
> >given that there is a :
> >
> >    if (rdma->cm_id) {
> >        rdma_destroy_id(rdma->cm_id);
> >        rdma->cm_id = NULL;
> >    }
> >
> >later down, and there is now no longer any destroy of rdma->qp
> >I don't understand your change.
> >
> >Your change text says:
> >   '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
> >      by rdma_destroy_qp(). not ibv_destroy_qp()'
> >
> >but the diff is:
> >     if (rdma->qp) {
> >-        ibv_destroy_qp(rdma->qp);
> >+        rdma_destroy_qp(rdma->cm_id);
> >         rdma->qp = NULL;
> >
> >should that have been rdma_destroy_qp(rdma->qp)?
> >
> >Dave (who doesn't yet know enough RDMA to be dangerous)
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Hi Michael,

> Responding for Isaku..... Thanks for reporting the bug, but I need some help
> in tracking down the cause of the bug, see below.
>
> Actually, the parameter "rdma->cm_id" to the function is correct, it's just
> that the variable never got initialized in the first place, which
> means that either
> the connection never got established or an early error happened during
> the migration that required cleaning up the identifier.
>
> Can you describe the conditions of the migration and the environment?
> 1. Did you migrate only one VM? Was the host under heavy load?
> 2. Did your migration lose connectivity? Did one of the hosts crash?
> 3. Was the connection abruptly broken for some reason?
> 4. Did you ever cancel the migration at some point and restart?
> 5. Did you use libvirt?

This is my 1st attempt with RDMA and I'm using softiwarp and
getting an early error, I've not tracked down why yet, hence why I'm
only really reporting the cleanup seg.

outgoing:
rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
RDMA ERROR: connecting to destination!
migration: setting error state
migration: setting error state
migrate: RDMA ERROR: connecting to destination!
(qemu)

incoming:
ibv_poll_cq wc.status=5 Work Request Flushed Error!
ibv_poll_cq wrid=CONTROL RECV!
messages from qemu_rdma_poll

I've not 100% sure which side fails first yet, but I believe that
incoming fails after outgoing calls rdma_connect but before it calls
rdma_get_cm_event; but as I say I'm new to RDMA and it's my 1st time
trying to debug it.

> A simple fix would be to surround the "rdma_destroy_qp()" call with a check
> to see if rdma->cm_id is valid, but that doesn't answer why
> rdma->cm_id would be invalid
> in the first place.

Yeh, I think just adding the NULL check is best.

> I need some additional information to try to reproduce the
> conditions of the bug.
>
> Thanks!
> - Michael Hines

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
From: Michael R. Hines @ 2014-02-18 1:47 UTC
To: Dr. David Alan Gilbert; +Cc: yamahata, quintela, hinesmr, qemu-devel, mrhines

On 02/17/2014 05:06 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>> On 02/06/2014 08:26 PM, Dr. David Alan Gilbert wrote:
>>> Hi Isaku,
>>>   I hit a seg in qemu_rdma_cleanup in the code changed by your
>>> '[PATCH] rdma: clean up of qemu_rdma_cleanup()'
>>>
>>> migration-rdma.c ~ 2241
>>>
>>>     if (rdma->qp) {
>>>         rdma_destroy_qp(rdma->cm_id);
>>>         rdma->qp = NULL;
>>>     }
>>>
>>> Your patch changed that to free cm_id at that point rather than
>>> qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.
>>>
>>> given that there is a :
>>>
>>>     if (rdma->cm_id) {
>>>         rdma_destroy_id(rdma->cm_id);
>>>         rdma->cm_id = NULL;
>>>     }
>>>
>>> later down, and there is now no longer any destroy of rdma->qp
>>> I don't understand your change.
>>>
>>> Your change text says:
>>>   '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
>>>      by rdma_destroy_qp(). not ibv_destroy_qp()'
>>>
>>> but the diff is:
>>>      if (rdma->qp) {
>>> -        ibv_destroy_qp(rdma->qp);
>>> +        rdma_destroy_qp(rdma->cm_id);
>>>          rdma->qp = NULL;
>>>
>>> should that have been rdma_destroy_qp(rdma->qp)?
>>>
>>> Dave (who doesn't yet know enough RDMA to be dangerous)
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> Hi Michael,
>
>> Responding for Isaku..... Thanks for reporting the bug, but I need some help
>> in tracking down the cause of the bug, see below.
>>
>> Actually, the parameter "rdma->cm_id" to the function is correct, it's just
>> that the variable never got initialized in the first place, which
>> means that either
>> the connection never got established or an early error happened during
>> the migration that required cleaning up the identifier.
>>
>> Can you describe the conditions of the migration and the environment?
>> 1. Did you migrate only one VM? Was the host under heavy load?
>> 2. Did your migration lose connectivity? Did one of the hosts crash?
>> 3. Was the connection abruptly broken for some reason?
>> 4. Did you ever cancel the migration at some point and restart?
>> 5. Did you use libvirt?
> This is my 1st attempt with RDMA and I'm using softiwarp and
> getting an early error, I've not tracked down why yet, hence why I'm
> only really reporting the cleanup seg.
>
> outgoing:
> rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
> RDMA ERROR: connecting to destination!
> migration: setting error state
> migration: setting error state
> migrate: RDMA ERROR: connecting to destination!
> (qemu)
>
> incoming:
> ibv_poll_cq wc.status=5 Work Request Flushed Error!
> ibv_poll_cq wrid=CONTROL RECV!
> messages from qemu_rdma_poll
>
> I've not 100% sure which side fails first yet, but I believe that
> incoming fails after outgoing calls rdma_connect but before it calls
> rdma_get_cm_event; but as I say I'm new to RDMA and it's my 1st time
> trying to debug it.

OK, yes, that explains it. That means the cm_id was never successfully
connected to begin with, so I'll just go ahead with a patch to check for
NULL properly.

And regarding softiwarp - I recommend making sure that the standard
RDMA helper utilities from OFED are working cleanly first, like 'ucmatose'
and rdma_read/write and so forth between the two machines you're trying to use.

I've successfully migrated over softiwarp before - but only after making sure
the utilities were working.....

- Michael
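
For readers following the thread, the NULL check discussed above could look roughly like the sketch below. It is only an illustration, not the patch Michael went on to post. To keep it compilable without librdmacm, every type and function here is a hypothetical stand-in: `fake_rdma_destroy_qp()` and `fake_rdma_destroy_id()` take the place of `rdma_destroy_qp()` and `rdma_destroy_id()`, and `RDMAContextSketch` is a reduced, invented version of QEMU's `RDMAContext`. Clearing `rdma->qp` even when `cm_id` is NULL is an assumption as well; the thread does not say what should happen to the QP pointer in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the librdmacm types (the real ones live in
 * <rdma/rdma_cma.h>). Only the fields this sketch needs are modelled. */
struct fake_qp { int unused; };
struct fake_cm_id { struct fake_qp *qp; };

/* Reduced, invented version of QEMU's RDMAContext. */
typedef struct {
    struct fake_cm_id *cm_id;
    struct fake_qp *qp;
} RDMAContextSketch;

/* Call counters so the behaviour can be observed without real hardware. */
static int qp_destroyed;
static int id_destroyed;

/* Stand-in for rdma_destroy_qp(): it dereferences id, so passing NULL
 * would crash, which is the segfault reported in the thread. */
static void fake_rdma_destroy_qp(struct fake_cm_id *id)
{
    id->qp = NULL;
    qp_destroyed++;
}

/* Stand-in for rdma_destroy_id(). */
static void fake_rdma_destroy_id(struct fake_cm_id *id)
{
    (void)id;
    id_destroyed++;
}

/* Cleanup with the extra guard: the QP is only destroyed through cm_id
 * when cm_id is actually set. */
static void cleanup_sketch(RDMAContextSketch *rdma)
{
    if (rdma->qp) {
        if (rdma->cm_id) {
            fake_rdma_destroy_qp(rdma->cm_id);
        }
        /* Assumption: drop the stale pointer even if cm_id was NULL. */
        rdma->qp = NULL;
    }
    if (rdma->cm_id) {
        fake_rdma_destroy_id(rdma->cm_id);
        rdma->cm_id = NULL;
    }
}
```

Calling `cleanup_sketch()` on a context whose `qp` is set but whose `cm_id` is NULL (the situation from the report) now returns without touching `cm_id`. As Michael notes, the guard only avoids the crash; it does not explain why `cm_id` was NULL in the first place.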