From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	SiteGround Operations
	<operations-/eCPMmvKun9pLGFMi4vTTA@public.gmane.org>
Subject: Re: [IPoIB] Missing join mcast events causing full machine lockup
Date: Tue, 02 Aug 2016 16:29:30 -0400
Message-ID: <1470169770.18081.44.camel@redhat.com>
In-Reply-To: <CAJFSNy6USnLqcBiPEOcFOG8MrGq8gXwvakG48jHHi_-YgVaQ3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> wrote:
> > 
> > On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
> > > 
> > > Hello,
> > > 
> > > At the risk of sounding like a broken record, I came across
> > > another case where ipoib can cause the machine to go haywire due
> > > to missed join requests. This is on a 4.4.14 kernel. Here is what
> > > I believe happens:
> > 
> > [ snip long traces ]
> > 
> > > 
> > > This makes me wonder if using timeouts is actually better than
> > > blindly relying on completing the join.
> > 
> > Blindly relying on the join completions is not what we do.  We are
> > very careful to make sure we always have the right locking so that
> > we never leave a join request in the BUSY state without running the
> > completion at some time.  If you are seeing us do that, then it
> > means we have a bug in our locking or state processing.  The answer
> > then is to find that bug and not to paper over it with a timeout.
> > Can you find some way to reproduce this with a 4.7 kernel?
> 
> Unfortunately, my environment is constrained to the 4.4 kernel. I
> will, however, try to check whether I can get a couple of IB-enabled
> nodes on 4.7 and see if something shows up. While I don't have a 100%
> reproducer for it, I see those symptoms rather regularly on
> production nodes. I'm able and happy to extract any runtime state
> that might be useful in debugging this, i.e., I can obtain crashdumps
> and reverse the state of the ipoib stacks. I've seen this issue on
> 3.12 and on 4.4. Some of my previous emails also show this
> manifesting in hangs in cm_destroy_id. So clearly there is a problem
> there, but it is proving very elusive.

Can you give any clues as to what's causing it?  Do you have link flap?
SM bounces?  Lots of multicast joins/leaves?

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD
