public inbox for linux-rdma@vger.kernel.org
From: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
To: "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: State of ipoib cm mode
Date: Wed, 27 Jul 2016 15:05:52 +0300
Message-ID: <5798A3A0.8070301@kyup.com>
In-Reply-To: <20160727115450.GA9717-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[Resending with the linux-rdma list cc'ed + some additional information]

On 07/27/2016 02:54 PM, Michael S. Tsirkin wrote:
> On Wed, Jul 27, 2016 at 01:41:53PM +0300, Nikolay Borisov wrote:
>> Hello,
>>
>> I've been running some production servers with ipoib cm but have
>> observed various hangs, e.g. :
>>
>> http://www.spinics.net/lists/linux-rdma/msg34577.html
>> http://www.spinics.net/lists/linux-rdma/msg37011.html
>> http://thread.gmane.org/gmane.linux.drivers.rdma/38899
>>
>> Other people have also confirmed that there is a latent bug, which is
>> very hard to debug (e.g. here:
>> http://www.spinics.net/lists/linux-rdma/msg37022.html). Essentially
>>
>> As the person who originally wrote the code and considering that git
>> blame indicates most of it hasn't been touched does that mean it's
>> considered stable? Also do you happen to have a hunch as to what might
>> be causing such stalls?
>>
>> Regards,
>> Nikolay
> 
> Please repost copying a mailing list.
> I have a general policy against responding to off-list mail.

Ok.

In addition, here is the state of a node which has been hung for about
2 days now, with no InfiniBand multicast connectivity. This is similar to
the issue described in the first mailing list post referenced above, but
this time I managed to capture the contents of the ipoib_cm_rx and
ib_cm_id structs (as well as any other structs referenced from them):


struct ipoib_cm_rx {
  id = 0xffff8802128fa600,
  qp = 0xffff880100e94000,
  rx_ring = 0x0,
  list = {
    next = 0xffff88055f02bdd8,
    prev = 0xffff88055f02bdd8
  },
  dev = 0xffff880661f68000,
  jiffies = 4367003834,
  state = IPOIB_CM_RX_FLUSH,
  recv_count = 0
}

struct ib_cm_id {
  cm_handler = 0xffffffffa01e7b60 <ipoib_cm_rx_handler>,
  context = 0xffff880660f11780,
  device = 0xffff8800378e4000,
  service_id = 216172782113783824,
  service_mask = 18446744073709551615,
  state = IB_CM_IDLE,
  lap_state = IB_CM_LAP_UNINIT,
  local_id = 1741978561,
  remote_id = 3782023797,
  remote_cm_qpn = 1
}
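
For anyone reading the dump above: the service_id is printed as a host
integer on a little-endian machine, while the field is stored big-endian.
Below is a minimal sketch of decoding it, assuming the IPoIB CM service ID
is built as IPOIB_CM_IETF_ID | qp_num with IPOIB_CM_IETF_ID equal to
0x1000000000000000 (an assumption, please check it against your tree).
The names swap64, IETF_ID_ASSUMED and service_id_low_bits are made up for
this example.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value (portable, no builtins). */
static uint64_t swap64(uint64_t v)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (v & 0xff);
        v >>= 8;
    }
    return out;
}

/* ASSUMPTION: value of IPOIB_CM_IETF_ID; verify against your kernel. */
#define IETF_ID_ASSUMED 0x1000000000000000ULL

/* Bits left over after byte-swapping the dumped value and clearing the
 * assumed IETF prefix. */
static uint64_t service_id_low_bits(uint64_t raw_le_dump)
{
    return swap64(raw_le_dump) & ~IETF_ID_ASSUMED;
}
```

With the value from the dump, swap64(216172782113783824ULL) gives
0x1000000000000003, so the low bits are 3. If the assumption holds, that
would be the QP number the listener registered with. The service_mask of
18446744073709551615 is simply all 64 bits set.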

And the backtrace looks like this:

PID: 28224  TASK: ffff88064bdb5280  CPU: 5   COMMAND: "kworker/u24:2"
 #0 [ffff88055f02bc28] __schedule at ffffffff8160fc6a
 #1 [ffff88055f02bc70] schedule at ffffffff816103dc
 #2 [ffff88055f02bc88] schedule_timeout at ffffffff81613642
 #3 [ffff88055f02bd08] wait_for_completion at ffffffff816118df
 #4 [ffff88055f02bd68] cm_destroy_id at ffffffffa01d3759 [ib_cm]
 #5 [ffff88055f02bdc0] ib_destroy_cm_id at ffffffffa01d3a10 [ib_cm]
 #6 [ffff88055f02bdd0] ipoib_cm_free_rx_reap_list at ffffffffa01e7675 [ib_ipoib]
 #7 [ffff88055f02be18] ipoib_cm_rx_reap at ffffffffa01e7705 [ib_ipoib]
 #8 [ffff88055f02be28] process_one_work at ffffffff8106bdf9
 #9 [ffff88055f02be68] worker_thread at ffffffff8106c4a9
#10 [ffff88055f02bed0] kthread at ffffffff8107161f
#11 [ffff88055f02bf50] ret_from_fork at ffffffff816149ff

ffffffffa01d3759 is wait_for_completion(&cm_id_priv->comp);
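
To make the failure mode concrete, here is a minimal single-threaded
userspace sketch of the pattern frame #4 is blocked in. It assumes that
the destroy path drops its own reference and then waits on a completion
which is signalled only when the reference count reaches zero, so a
single reference that is never dropped keeps the worker asleep forever.
The names completion_model, cm_id_model, cm_deref_model and
destroy_would_return are hypothetical and are not kernel identifiers.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_model {
    bool done;
};

/* Hypothetical stand-in for the private cm_id: only the two fields
 * involved in the hang are modelled. */
struct cm_id_model {
    int refcount;
    struct completion_model comp;
};

/* Drop one reference; signal the completion only on the last one. */
static void cm_deref_model(struct cm_id_model *id)
{
    if (--id->refcount == 0)
        id->comp.done = true;
}

/* Returns true if a destroy call would return, false if it would
 * sleep forever waiting for the completion. */
static bool destroy_would_return(struct cm_id_model *id)
{
    cm_deref_model(id);      /* drop the caller's own reference */
    return id->comp.done;    /* the wait returns only if this is set */
}
```

Starting from refcount = 1 the destroy returns; starting from refcount = 2
(one reference held elsewhere and never released) it does not, which
matches the state in the backtrace. So the refcount value inside the
private cm_id would be worth capturing from the hung node.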

Can you advise what other information would be helpful for debugging this?

Regards,
Nikolay

