From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
Date: Sun, 22 Feb 2015 16:56:16 -0500
Message-ID: <1424642176.4847.2.camel@redhat.com>
In-Reply-To: <CAJ3xEMgj=ATKLt0MA67c3WefCrG1hZ59eSrhpD-u_dxLJe2kfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>


On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote:
> On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > This is the re-ordered, squashed version of my 22 patch set that I
> > posted on Feb 11.  There are a few minor differences between that
> > set and this one.
> 
> Hi Doug,
> 
> I took a quick look at your git repo @
> git://github.com/dledford/linux.git and it seems not to contain this
> series; can you please push it there and tell me what branch to pull?

It's there now, branch for-3.20-squashed.

> Or.
> 
> > They are:
> > 1) Rename __ipoib_mcast_continue_join_thread to
> >    __ipoib_mcast_schedule_join_thread
> > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
> >    avoid us accidentally trying to queue the single work struct instance
> >    twice (which doesn't work)
> > 3) Slightly alter the layout of __ipoib_mcast_schedule_join_thread.
> >    The logic is the same modulo #2, but indentation is reduced and
> >    readability increased
> > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
> > 5) Add a couple of missing spinlock calls so that we always call the
> >    schedule helper with the spinlock held
> > 6) Make sure that we only clear the BUSY flag once we have done all the
> >    other things we are going to do to the mcast entry, and if possible,
> >    only call complete after we have released the spinlock
> > 7) Fix the usage of time_before_eq when we should have just used
> >    time_before in ipoib_mcast_join_task
> > 8) Create/destroy priv->wq at a slightly different point in
> >    ipoib_transport_dev_init/ipoib_transport_dev_cleanup
> >
> > This entire patchset was intended to address the issue of ipoib
> > interfaces being brought up/down in a tight loop, which will hardlock
> > a standard v3.19 kernel.  It succeeds at resolving that problem.  In
> > order to be sure this patchset does not introduce other problems,
> > and in order to ensure that this rework of the patches into a new
> > set does not break bisectability, this entire patchset has been
> > extensively tested, starting with the first patch and going through
> > the last.
> >
> > I used a 12 machine group plus the subnet manager to test these
> > patches.
> >
> > 1 machine ran ifconfig up/ifconfig down tests in a tight loop
> > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
> >   between insmod and rmmod
> > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
> >   second pause between insmod and rmmod
> > 9 machines kept their interfaces up and ran iperf servers; 6 also
> >   ran ping6 instances to the addresses of all 12 machines, and 3 ran
> >   iperf clients that sent data to all 9 iperf servers in an infinite loop
> > 1 subnet manager machine did not otherwise participate, but during
> >   testing was set to restart opensm once every 30 seconds to force
> >   net re-register events on all 12 machines in the group
> >
> > In addition to the configuration of various machines above to test
> > data transfers, the IPoIB infrastructure itself contained several
> > elements designed to test specific multicast capabilities.
> >
> > The primary P_Key, the one with the ping6 instances running on it,
> > intentionally had some well known multicast groups left undefined in
> > order to cause failed sendonly multicast joins on the same device
> > that needed to work with IPv6 pings as well as IPv4 multicast.
> >
> > One of the alternate P_Key interfaces was defined with a minimum
> > rate of 56GBit/s, so all machines without 56GBit/s capability
> > were never able to join the broadcast group on that P_Key.
> > This was done to make sure that when the broadcast group is not
> > joined, no other multicast joins, sendonly or otherwise, are ever
> > sent.  It also was done to make sure that failed attempts to join
> > the broadcast group honored the backoff delays properly.
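
The backoff behavior exercised above can be sketched as follows, assuming a doubling backoff with a cap and a per-entry retry time. MODEL_HZ, MODEL_MAX_BACKOFF_SECONDS, struct model_mcast, join_failed() and still_delayed() are hypothetical. The comparison macros are userspace copies of the wrap-safe time_after()/time_before() family from <linux/jiffies.h>, and still_delayed() shows the time_before() versus time_before_eq() distinction from item 7 at the boundary tick.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace copies of the wrap-safe jiffies comparisons; the kernel's
 * <linux/jiffies.h> versions also type-check their arguments.
 */
#define time_after(a, b)     ((long)((b) - (a)) < 0)
#define time_before(a, b)    time_after(b, a)
#define time_after_eq(a, b)  ((long)((a) - (b)) >= 0)
#define time_before_eq(a, b) time_after_eq(b, a)

#define MODEL_HZ 1000UL               /* hypothetical tick rate */
#define MODEL_MAX_BACKOFF_SECONDS 16  /* assumed cap on the backoff */

struct model_mcast {
	unsigned int backoff;         /* seconds; starts at 1 */
	unsigned long delay_until;    /* tick at which a retry is allowed */
};

/* After a failed join: double the backoff, clamp it, set the retry time. */
static void join_failed(struct model_mcast *m, unsigned long now)
{
	assert(m->backoff > 0);       /* doubling zero would never back off */
	m->backoff *= 2;
	if (m->backoff > MODEL_MAX_BACKOFF_SECONDS)
		m->backoff = MODEL_MAX_BACKOFF_SECONDS;
	m->delay_until = now + m->backoff * MODEL_HZ;
}

/* Still delayed only while now is strictly before delay_until. */
static bool still_delayed(const struct model_mcast *m, unsigned long now)
{
	return time_before(now, m->delay_until);
}
```

With time_before(), a retry becomes eligible on exactly the tick stored in delay_until, whereas time_before_eq() would keep it deferred for one more tick. Both forms stay correct when the tick counter wraps, which is why the signed-difference macros are used instead of a plain `<` comparison.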
> >
> > Note: both machines that were doing the insmod/rmmod loops were
> > changed to not have any P_Key interfaces defined other than the
> > default P_Key interface.  It is known that repeated insmod/rmmod
> > of the ib_ipoib module is fragile and easily breaks in the
> > presence of child interfaces.  It was not my intent to address
> > that particular problem with this patch set, so to avoid false
> > issues, child interfaces were removed from the mix on these
> > machines.
> >
> > A wide array of hardware was also tested with this 12 machine group,
> > covering mthca, mlx4, mlx5, and qib hardware.
> >
> > Patches 1 through 6 were tested without the ifconfig/rmmod/opensm
> > loops, as those particular problems were not expected to be addressed
> > until patch 7.  Patches 7 through 9 were tested with all tests.
> >
> > The final, complete patch set was left running with the various
> > tests until it had completed 257 opensm restarts, 12052
> > ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops,
> > and 1971 .1 second insmod/rmmod loops.  The only observed problem
> > was that the fast insmod/rmmod loop eventually locked up the
> > network stack on the machine.  It was stuck on an rtnl_lock deadlock,
> > but not one related to the multicast code (and therefore outside
> > the scope of these patches to address).  There are several bits of
> > additional locking to be fixed in the overall ipoib code in relation
> > to insmod/rmmod races and this patch set does not attempt to address
> > those.  It merely attempts not to introduce any new issues while
> > resolving the mcast locking issues related to bringing the interface
> > up and down.  I feel confident that it does that.
> >
> > Doug Ledford (9):
> >   IB/ipoib: factor out ah flushing
> >   IB/ipoib: change init sequence ordering
> >   IB/ipoib: Consolidate rtnl_lock tasks in workqueue
> >   IB/ipoib: Make the carrier_on_task race aware
> >   IB/ipoib: Use dedicated workqueues per interface
> >   IB/ipoib: No longer use flush as a parameter
> >   IB/ipoib: fix MCAST_FLAG_BUSY usage
> >   IB/ipoib: deserialize multicast joins
> >   IB/ipoib: drop mcast_mutex usage
> >
> >  drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
> >  drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
> >  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c      |  60 +--
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 500 +++++++++++++------------
> >  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  31 +-
> >  6 files changed, 389 insertions(+), 309 deletions(-)
> >
> > --
> > 2.1.0
> >


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


