linux-rdma.vger.kernel.org archive mirror
From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	SiteGround Operations
	<operations-/eCPMmvKun9pLGFMi4vTTA@public.gmane.org>
Subject: Re: [IPoIB] Missing join mcast events causing full machine lockup
Date: Tue, 02 Aug 2016 16:29:30 -0400	[thread overview]
Message-ID: <1470169770.18081.44.camel@redhat.com> (raw)
In-Reply-To: <CAJFSNy6USnLqcBiPEOcFOG8MrGq8gXwvakG48jHHi_-YgVaQ3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
> > > 
> > > Hello,
> > > 
> > > At the risk of sounding like a broken record, I came across
> > > another case where ipoib can cause the machine to go haywire due to
> > > missed join requests. This is on a 4.4.14 kernel. Here is what I
> > > believe happens:
> > 
> > [ snip long traces ]
> > 
> > > 
> > > This makes me wonder if using timeouts is actually better than
> > > blindly relying on completing the join.
> > 
> > Blindly relying on the join completions is not what we do.  We are very
> > careful to make sure we always have the right locking so that we never
> > leave a join request in the BUSY state without running the completion
> > at some time.  If you are seeing us do that, then it means we have a
> > bug in our locking or state processing.  The answer then is to find
> > that bug and not to paper over it with a timeout.  Can you find some
> > way to reproduce this with a 4.7 kernel?
> 
> Unfortunately my environment is constrained to the 4.4 kernel. I will,
> however, try and check if I can get a couple of IB-enabled nodes on 4.7
> and see if something shows up. And while I don't have a 100% reproducer
> for it, I see those symptoms rather regularly on production nodes. I'm
> able and happy to extract any runtime state that might be useful in
> debugging this, i.e., I can obtain crashdumps and reverse the state of
> the ipoib stacks. I've seen this issue on 3.12 and on 4.4. Some of my
> previous emails also show this manifesting in hangs in cm_destroy_id
> as well. So clearly there is a problem there, but it proves very
> elusive.

Can you give any clues as to what's causing it?  Do you have link flap?
SM bounces?  Lots of multicast joins/leaves?

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


Thread overview: 7+ messages
2016-07-21  7:31 [IPoIB] Missing join mcast events causing full machine lockup Nikolay Borisov
     [not found] ` <57907A37.3000902-6AxghH7DbtA@public.gmane.org>
2016-08-02 19:21   ` Doug Ledford
     [not found]     ` <1470165672.18081.37.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-08-02 20:18       ` Nikolay Borisov
     [not found]         ` <CAJFSNy6USnLqcBiPEOcFOG8MrGq8gXwvakG48jHHi_-YgVaQ3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-02 20:29           ` Doug Ledford [this message]
     [not found]             ` <1470169770.18081.44.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-08-03  8:18               ` Nikolay Borisov
     [not found]                 ` <57A1A8F2.8040709-6AxghH7DbtA@public.gmane.org>
2016-08-04  0:17                   ` Marian Marinov
2016-08-17 11:26               ` Nikolay Borisov
