From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Amir Vadai <amirv-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Eyal Perry <eyalpe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH V3 FIX for-3.19] IB/ipoib: Fix sendonly traffic and multicast traffic
Date: Mon, 26 Jan 2015 17:00:05 -0500
Message-ID: <1422309605.2854.62.camel@redhat.com>
In-Reply-To: <CAJ3xEMg3vYGbGuT+Z-XQMv5YuPws33XHQP_Wcz8gvpBbCg3TSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, 2015-01-26 at 22:57 +0200, Or Gerlitz wrote:
> On Mon, Jan 26, 2015 at 9:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, 2015-01-26 at 15:16 +0200, Or Gerlitz wrote:
> >> On Mon, Jan 26, 2015 at 3:00 PM, Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage" both
> >> > IPv6 traffic and for the most cases all IPv4 multicast traffic aren't
> >> > working.
> >>
> >>
> >> Hi Doug + Roland
> >>
> >> Erez was very patiently reviewing and testing all the six (V0...V5)
> >> patch series you sent to fix the 3.19-rc1 regression.
> >
> > Yes he has.
> 
> 
> >>  Can you also give this patch a try?
> 
> > I can test it.  But I need to know how it's supposed to be applied.
> 
> just apply it on latest upstream and run whatever tests you have, simple.

I used the same base kernel that I used for my patchset.

> > It might fix the regression, it might also reintroduce a race on
> > ifup/ifdown.  I'll test and see.
> 
> Let's see it in action @ your env

It passed the initial test, the IPv6-after-a-failed-join issue that my
own patchset has only just started passing.

However, I didn't get more than 5 minutes into testing before I was able
to livelock the system.  In this case, from machine A running my
patchset, I did

ping6 -I mlx4_ib0 -i .25 <machine B address>

On machine B running Erez's patch, I did:

rmmod ib_ipoib; modprobe ib_ipoib mcast_debug_level=1; sleep 2; \
    ping6 -i .25 -c 10 -I mlx4_ib0 <machine A address>

And on rdma-master, the machine where opensm runs, I ran this a few times:

systemctl restart opensm

The livelock is in the mcast flushing code.  On the machine that
livelocked, here's the dmesg tail:

[  423.189514] mlx4_ib0.8002: multicast join failed for ff12:401b:8002:0000:0000:0000:ffff:ffff, status -110
[  423.189541] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:0000:0001
[  423.189545] mlx4_ib0.8002: deleting multicast group ff12:601b:8002:0000:0000:0000:0000:0001
[  423.189547] mlx4_ib0.8002: deleting multicast group ff12:601b:8002:0000:0000:0001:ff7b:e1b1
[  423.189549] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:0000:00fb
[  423.189551] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:ffff:ffff
[  423.204570] mlx4_ib0.8002: stopping multicast thread
[  423.204573] mlx4_ib0.8002: flushing multicast list
[  423.213567] mlx4_ib0: stopping multicast thread
[  423.213571] mlx4_ib0: flushing multicast list

The rmmod operation is stuck in ib_sa_unregister_client (one of the
specific problems my patchset fixes, BTW).

On another machine I started another one of my tests:

On machine A:

ping6 -I mlx4_ib0 -i .25 <machine C address>

On rdma-master:

while true; do sleep 4; systemctl restart opensm; done

On machine C:

passes=0; while true; do ifdown qib_ib0; ifup qib_ib0; echo "Passes $passes..."; let passes++; done

In this test Erez's patch made it through about 5 down/up cycles before
the machine oopsed.
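
For anyone repeating the machine C loop, a bounded variant that stops on
the first failed ifdown/ifup can be easier to leave unattended.  This is
only a minimal sketch, assuming a POSIX shell; flap_iface and the
IFDOWN/IFUP override variables are hypothetical names introduced here and
are not part of the test as I ran it:

```shell
#!/bin/sh
# Sketch only: bounded variant of the machine C ifdown/ifup loop.
# flap_iface, IFDOWN and IFUP are hypothetical names, not part of any tool.
# IFDOWN/IFUP default to the real commands; override them for a dry run.
flap_iface() {
    iface=$1      # interface to bounce, e.g. qib_ib0
    cycles=$2     # number of down/up cycles to attempt
    passes=0
    while [ "$passes" -lt "$cycles" ]; do
        # Stop at the first command that reports failure.
        "${IFDOWN:-ifdown}" "$iface" || return 1
        "${IFUP:-ifup}" "$iface" || return 1
        passes=$((passes + 1))
        # Unlike the one-liner, this counts completed cycles.
        echo "Passes $passes..."
    done
}

# Example: flap_iface qib_ib0 100
```

It only catches a command returning non-zero; a livelock or an oops still
has to be spotted from dmesg or the console.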

Do I need to keep going?  I was able to crash two different machines on
two different brands of hardware within only a few test cycles.  My
patchset, while large and intrusive, now survives all of this with
flying colors, and now that I've replicated Erez's specific multicast
join failure, I've taken care of that corner case too (and will be
adding that to my long term QE setup so it doesn't regress in the
future).


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


