qemu-devel.nongnu.org archive mirror
From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: yamahata@private.email.ne.jp, quintela@redhat.com,
	hinesmr@cn.ibm.com, qemu-devel@nongnu.org, mrhines@us.ibm.com
Subject: Re: [Qemu-devel] qemu_rdma_cleanup seg - related to 5a91337?
Date: Tue, 18 Feb 2014 09:47:30 +0800
Message-ID: <5302BBB2.4010704@linux.vnet.ibm.com>
In-Reply-To: <20140217090602.GA2978@work-vm>

On 02/17/2014 05:06 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>> On 02/06/2014 08:26 PM, Dr. David Alan Gilbert wrote:
>>> Hi Isaku,
>>>     I hit a seg in qemu_rdma_cleanup in the code changed by your
>>> '[PATCH] rdma: clean up of qemu_rdma_cleanup()'
>>>
>>> migration-rdma.c ~ 2241
>>>
>>>      if (rdma->qp) {
>>>          rdma_destroy_qp(rdma->cm_id);
>>>          rdma->qp = NULL;
>>>      }
>>>
>>> Your patch changed that to free cm_id at that point rather than
>>> qp; but in my case cm_id is NULL and so rdma_destroy_qp segs.
>>>
>>> given that there is a :
>>>
>>>      if (rdma->cm_id) {
>>>          rdma_destroy_id(rdma->cm_id);
>>>          rdma->cm_id = NULL;
>>>      }
>>>
>>> later down, and there is now no longer any destroy of rdma->qp
>>> I don't understand your change.
>>>
>>> Your change text says:
>>>    '- RDMAContext::qp is created by rdma_create_qp() so that it should be destroyed
>>>     by rdma_destroy_qp(). not ibv_destroy_qp()'
>>>
>>> but the diff is:
>>>        if (rdma->qp) {
>>> -        ibv_destroy_qp(rdma->qp);
>>> +        rdma_destroy_qp(rdma->cm_id);
>>>            rdma->qp = NULL;
>>>
>>> should that have been rdma_destroy_qp(rdma->qp)?
>>>
>>> Dave (who doesn't yet know enough RDMA to be dangerous)
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> Hi Michael,
>
>> Responding for Isaku..... Thanks for reporting the bug, but I need some help
>> in tracking down the cause of the bug, see below.
>
>> Actually, the parameter "rdma->cm_id" to the function is correct, it's
>> just that the variable never got initialized in the first place, which
>> means that either the connection never got established or an early
>> error happened during the migration that required cleaning up the
>> identifier.
>>
>> Can you describe the conditions of the migration and the environment?
>> 1. Did you migrate only one VM? Was the host under heavy load?
>> 2. Did your migration lose connectivity? Did one of the hosts crash?
>> 3. Was the connection abruptly broken for some reason?
>> 4. Did you ever cancel the migration at some point and restart?
>> 5. Did you use libvirt?
> This is my 1st attempt with RDMA and I'm using softiwarp and
> getting an early error, I've not tracked down why yet, hence why I'm
> only really reporting the cleanup seg.
>
> outgoing:
> rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
> RDMA ERROR: connecting to destination!
> migration: setting error state
> migration: setting error state
> migrate: RDMA ERROR: connecting to destination!
> (qemu)
>
> incoming:
> ibv_poll_cq wc.status=5 Work Request Flushed Error!
> ibv_poll_cq wrid=CONTROL RECV!
>      messages from qemu_rdma_poll
>
> I'm not 100% sure which side fails first yet, but I believe that
> incoming fails after outgoing calls rdma_connect but before it calls
> rdma_get_cm_event; but as I say I'm new to RDMA and it's my 1st time
> trying to debug it.

OK, yes, that explains it. That means the cm_id was never successfully
connected to begin with, so I'll just go ahead with a patch to check
for NULL properly.
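
As a rough illustration of that kind of guard (not the actual patch),
here is a minimal sketch. The types and the stub_destroy_* helpers are
hypothetical stand-ins for the librdmacm structures and for
rdma_destroy_qp()/rdma_destroy_id(), so that it compiles without the
RDMA libraries; only the ordering and the NULL checks are the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the librdmacm/libibverbs types. */
struct fake_qp { int unused; };
struct fake_cm_id { struct fake_qp *qp; };

/* Hypothetical, cut-down context: just the two fields discussed above. */
typedef struct {
    struct fake_cm_id *cm_id;
    struct fake_qp *qp;
} SketchRDMAContext;

/* Call counters so the behaviour of the sketch can be observed. */
static int destroy_qp_calls;
static int destroy_id_calls;

/*
 * Stub standing in for rdma_destroy_qp(): like the real call it
 * dereferences the id, so passing NULL would crash.
 */
static void stub_destroy_qp(struct fake_cm_id *id)
{
    id->qp = NULL;
    destroy_qp_calls++;
}

/* Stub standing in for rdma_destroy_id(). */
static void stub_destroy_id(struct fake_cm_id *id)
{
    (void)id;
    destroy_id_calls++;
}

static void sketch_cleanup(SketchRDMAContext *rdma)
{
    assert(rdma != NULL);

    /* Only tear down the qp through the id if the id actually exists. */
    if (rdma->cm_id && rdma->qp) {
        stub_destroy_qp(rdma->cm_id);
    }
    /*
     * Assumption: if cm_id is already NULL, whatever cleared it is
     * responsible for the qp, so here we only drop the stale pointer.
     */
    rdma->qp = NULL;

    if (rdma->cm_id) {
        stub_destroy_id(rdma->cm_id);
        rdma->cm_id = NULL;
    }
}
```

With a context whose cm_id is NULL but whose qp pointer is still set
(the case from the report), sketch_cleanup() skips both destroy calls
and simply clears the pointers instead of dereferencing NULL.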

And regarding softiwarp - I recommend making sure that the standard RDMA
helper utilities from OFED are working cleanly first, like 'ucmatose'
and rdma_read/write and so forth, between the two machines you're
trying to use.

I've successfully migrated over softiwarp before - but only after making
sure the utilities were working...

- Michael
