From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
SiteGround Operations
<operations-/eCPMmvKun9pLGFMi4vTTA@public.gmane.org>
Subject: Re: [IPoIB] Missing join mcast events causing full machine lockup
Date: Tue, 02 Aug 2016 16:29:30 -0400
Message-ID: <1470169770.18081.44.camel@redhat.com>
In-Reply-To: <CAJFSNy6USnLqcBiPEOcFOG8MrGq8gXwvakG48jHHi_-YgVaQ3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> wrote:
> >
> > On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
> > >
> > > Hello,
> > >
> > > At the risk of sounding like a broken record, I came across
> > > another case where ipoib can cause the machine to go haywire due
> > > to missed join requests. This is on a 4.4.14 kernel. Here is what
> > > I believe happens:
> >
> > [ snip long traces ]
> >
> > >
> > > This makes me wonder if using timeouts is actually better than
> > > blindly relying on completing the join.
> >
> > Blindly relying on the join completions is not what we do. We are
> > very careful to make sure we always have the right locking so that
> > we never leave a join request in the BUSY state without running the
> > completion at some time. If you are seeing us do that, then it means
> > we have a bug in our locking or state processing. The answer then is
> > to find that bug and not to paper over it with a timeout. Can you
> > find some way to reproduce this with a 4.7 kernel?
>
> Unfortunately my environment is constrained to a 4.4 kernel. I will,
> however, try and check if I can get a couple of IB-enabled nodes on
> 4.7 and see if something shows up. And while I don't have a 100%
> reproducer for it, I see those symptoms rather regularly on
> production nodes. I'm able and happy to extract any runtime state
> that might be useful in debugging this, i.e., I can obtain crashdumps
> and reverse the state of the ipoib stacks. I've seen this issue on
> 3.12 and on 4.4. Some of my previous emails also show this
> manifesting in hangs in cm_destroy_id as well. So clearly there is a
> problem there, but it proves very elusive.
Can you give any clues as to what's causing it? Do you have link flap?
SM bounces? Lots of multicast joins/leaves?
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG KeyID: 0E572FDD