[PATCH 00/22] IB/ipoib: Fixups for multicast issues

public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/22] IB/ipoib: Fixups for multicast issues
@ 2015-02-12  1:43 Doug Ledford
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

This patchset revives the 8 patches that were reverted from 3.19,
the 11 patches that fixed the problems with the first 8, the single
patch that was related to reaping of ah's and failure to dealloc
resources on shutdown, and then adds two new patches that would
have been enhancements and not bugfixes and hence weren't appropriate
to post in the 3.19 tussle.

Testing of this patchset is currently underway, but it has done
well so far.  IPv4 multicast, IPv6 multicast, connected mode,
datagram mode, rmmod/insmod while active, restart opensm while
active, ifconfig up/ifconfig down in a tight while loop have
all passed.

There are two outstanding issues that I think stilll need addressed
(while performing all the other testing I ran across these issues,
and I think they existed prior to my patchset, but I haven't booted
up a clean kernel to verify it yet...I'll do that tomorrow and if
things are not as I expect, I'll report back here):

1) In connected mode, the initial ip6 ping to any host takes almost
exactly 1 second to complete.  The debug messages show this delay
very clearly:

[19059.689967] qib_ib0: joining MGID ff12:601b:ffff:0000:0000:0001:ff31:7791
[19059.689970] qib_ib0: successfully started all multicast joins
[19059.690313] qib_ib0: sendonly join completion for ff12:601b:ffff:0000:0000:0001:ff31:7791 (status 0)
[19059.690314] qib_ib0: Created ah ffff88080cc0ef60
[19059.690315] qib_ib0: MGID ff12:601b:ffff:0000:0000:0001:ff31:7791 AV ffff88080cc0ef60, LID 0xc035, SL 0 <- Final debug message from creating our AH and when we should have requeued our sends
[19060.694190] qib_ib0: REQ arrived <- almost exactly 1 second later, we
finally start setting up our connection

In datagram mode, this does not happen and initial startup of ping6
is mostly immediate.

2) In connected mode, restartng opensm repeatedly can cause some of
the machines to start failing to find other machines when trying to
use ping6.  However, they don't loose connectivity to all machines,
only specific machines.  A rmmod/insmod cycle solves the problem.
So does a full ifdown/ifup cycle.  Given enough idle time, the
problem goes away.  I suspect that neighbor flushing when in
connected mode is not reliable/sufficient when opensm events come
in.  Again, I think this exists in the upstream kernel and I'll
test more on that tomorrow.

Doug Ledford (22):
  IB/ipoib: Consolidate rtnl_lock tasks in workqueue
  IB/ipoib: Make the carrier_on_task race aware
  IB/ipoib: fix MCAST_FLAG_BUSY usage
  IB/ipoib: fix mcast_dev_flush/mcast_restart_task race
  IB/ipoib: change init sequence ordering
  IB/ipoib: Use dedicated workqueues per interface
  IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue
  IB/ipoib: No longer use flush as a parameter
  IB/ipoib: fix IPOIB_MCAST_RUN flag usage
  IB/ipoib: Add a helper to restart the multicast task
  IB/ipoib: make delayed tasks not hold up everything
  IB/ipoib: Handle -ENETRESET properly in our callback
  IB/ipoib: don't restart our thread on ENETRESET
  IB/ipoib: remove unneeded locks
  IB/ipoib: fix race between mcast_dev_flush and mcast_join
  IB/ipoib: fix ipoib_mcast_restart_task
  IB/ipoib: flush the ipoib_workqueue on unregister
  IB/ipoib: cleanup a couple debug messages
  IB/ipoib: make sure we reap all our ah on shutdown
  IB/ipoib: don't queue a work struct up twice
  IB/ipoib: deserialize multicast joins
  IB/ipoib: drop mcast_mutex usage

 drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  51 ++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 479 ++++++++++++-------------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  24 +-
 6 files changed, 356 insertions(+), 305 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/22] IB/ipoib: Consolidate rtnl_lock tasks in workqueue
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 02/22] IB/ipoib: Make the carrier_on_task race aware Doug Ledford
                     ` (21 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Setting the MTU can safely be moved to the carrier_on_task, which keeps
us from needing to take the rtnl lock in the join_finish section.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ffb83b5f7e8..eee66d13e5b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
-
-		if (!ipoib_cm_admin_enabled(dev)) {
-			rtnl_lock();
-			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-			rtnl_unlock();
-		}
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	}
 
 	rtnl_lock();
+	if (!ipoib_cm_admin_enabled(priv->dev))
+		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
 	rtnl_unlock();
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 02/22] IB/ipoib: Make the carrier_on_task race aware
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 01/22] IB/ipoib: Consolidate rtnl_lock tasks in workqueue Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 03/22] IB/ipoib: fix MCAST_FLAG_BUSY usage Doug Ledford
                     ` (20 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface.  Unfortunately, that's not
the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too.  But this thread is
deadlocked on the rtnl lock.  The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_ADMIN_UP has been cleared, in which case we don't need to do
anything anyway and we just return.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index eee66d13e5b..9862c76a83f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 						   carrier_on_task);
 	struct ib_port_attr attr;
 
-	/*
-	 * Take rtnl_lock to avoid racing with ipoib_stop() and
-	 * turning the carrier back on while a device is being
-	 * removed.
-	 */
 	if (ib_query_port(priv->ca, priv->port, &attr) ||
 	    attr.state != IB_PORT_ACTIVE) {
 		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
 		return;
 	}
 
-	rtnl_lock();
+	/*
+	 * Take rtnl_lock to avoid racing with ipoib_stop() and
+	 * turning the carrier back on while a device is being
+	 * removed.  However, ipoib_stop() will attempt to flush
+	 * the workqueue while holding the rtnl lock, so loop
+	 * on trylock until either we get the lock or we see
+	 * FLAG_ADMIN_UP go away as that signals that we are bailing
+	 * and can safely ignore the carrier on work
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	if (!ipoib_cm_admin_enabled(priv->dev))
 		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 03/22] IB/ipoib: fix MCAST_FLAG_BUSY usage
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 01/22] IB/ipoib: Consolidate rtnl_lock tasks in workqueue Doug Ledford
  2015-02-12  1:43   ` [PATCH 02/22] IB/ipoib: Make the carrier_on_task race aware Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 04/22] IB/ipoib: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
                     ` (19 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used.  We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it.  This made it less
than totally effective, and certainly made its use confusing.  And in
the flush function we would use the presence of this flag to signal
that we should wait on the completion struct, but we never cleared
this flag, ever.  This is further muddied by the fact that we overload
the MCAST_FLAG_BUSY flag to mean two different things: we have a join
in flight, and we have succeeded in getting an ib_sa_join_multicast.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

 1) Remove the MCAST_JOIN_STARTED flag entirely
 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
 3) Test on mcast->mc directly to see if we have completed
    ib_sa_join_multicast (using IS_ERR_OR_NULL)
 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
    the mcast->done completion struct
 5) Make sure that before calling complete(&mcast->done), we always clear
    the MCAST_FLAG_BUSY bit
 6) Take the mcast_mutex before we call ib_sa_multicast_join and also
    take the mutex in our join callback.  This forces
    ib_sa_multicast_join to return and set mcast->mc before we process
    the callback.  This way, our callback can safely clear mcast->mc
    if there is an error on the join and we will do the right thing as
    a result in mcast_dev_flush.
 7) Because we need the mutex to synchronize mcast->mc, we can no
    longer call mcast_sendonly_join directly from mcast_send and
    instead must add sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                             Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
                                     ifconfig ib0 down
				     wait_for_completion
    callback call completes
                                     wait_for_completion in
				     mcast_dev_flush completes
				       mcast->mc is PTR_ERR_OR_NULL
				       so we skip ib_sa_leave_multicast
    return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1                   Thread 2         Thread 3
ib_sa_multicast_join
                                            ifconfig ib0 down
					    priv->broadcast = NULL
                           join_complete
			                    wait_for_completion
			   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
			   complete
			   return -EAGAIN (making mcast->mc invalid)
			   		    call ib_sa_multicast_leave
					    on invalid mcast->mc, hang
					    forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback.  This allows us to clear mcast->mc if there is an
error and the join is going to fail.  We do this before we complete
the mcast.  In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  10 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 ++++++++++++++++---------
 2 files changed, 101 insertions(+), 57 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index d7562beb542..f4c1b20b23b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
 	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
 	IPOIB_MCAST_FLAG_SENDONLY = 1,
-	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
+	/*
+	 * For IPOIB_MCAST_FLAG_BUSY
+	 * When set, in flight join and mcast->mc is unreliable
+	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
+	 *   haven't started yet
+	 * When clear and mcast->mc is valid pointer, join was successful
+	 */
+	IPOIB_MCAST_FLAG_BUSY	  = 2,
 	IPOIB_MCAST_FLAG_ATTACHED = 3,
-	IPOIB_MCAST_JOIN_STARTED  = 4,
 
 	MAX_SEND_CQE		  = 16,
 	IPOIB_CM_COPYBREAK	  = 256,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 9862c76a83f..a52c9f3f7e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
 
+	/*
+	 * We have to take the mutex to force mcast_sendonly_join to
+	 * return from ib_sa_multicast_join and set mcast->mc to a
+	 * valid value.  Otherwise we were racing with ourselves in
+	 * that we might fail here, but get a valid return from
+	 * ib_sa_multicast_join after we had cleared mcast->mc here,
+	 * resulting in mis-matched joins and leaves and a deadlock
+	 */
+	mutex_lock(&mcast_mutex);
+
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET)
-		return 0;
+		goto out;
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (status) {
 		if (mcast->logcount++ < 20)
-			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for %pI6, status %d\n",
+			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
+					"join failed for %pI6, status %d\n",
 					mcast->mcmember.mgid.raw, status);
 
 		/* Flush out any queued packets */
@@ -290,11 +301,15 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
-
-		/* Clear the busy flag so we try again */
-		status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
-					    &mcast->flags);
 	}
+out:
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	if (status)
+		mcast->mc = NULL;
+	complete(&mcast->done);
+	if (status == -ENETRESET)
+		status = 0;
+	mutex_unlock(&mcast_mutex);
 	return status;
 }
 
@@ -312,12 +327,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	int ret = 0;
 
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-		ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n");
+		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
+				"multicast joins\n");
 		return -ENODEV;
 	}
 
-	if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
+	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
+		ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
+				"sendonly join\n");
 		return -EBUSY;
 	}
 
@@ -325,6 +342,9 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
+	mutex_lock(&mcast_mutex);
+	init_completion(&mcast->done);
+	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
 					 priv->port, &rec,
 					 IB_SA_MCMEMBER_REC_MGID	|
@@ -337,12 +357,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	if (IS_ERR(mcast->mc)) {
 		ret = PTR_ERR(mcast->mc);
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n",
-			   ret);
+		complete(&mcast->done);
+		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
+			   "failed (ret = %d)\n", ret);
 	} else {
-		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting join\n",
-				mcast->mcmember.mgid.raw);
+		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
+				"sendonly join\n", mcast->mcmember.mgid.raw);
 	}
+	mutex_unlock(&mcast_mutex);
 
 	return ret;
 }
@@ -390,22 +412,28 @@ static int ipoib_mcast_join_complete(int status,
 	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
 			mcast->mcmember.mgid.raw, status);
 
+	/*
+	 * We have to take the mutex to force mcast_join to
+	 * return from ib_sa_multicast_join and set mcast->mc to a
+	 * valid value.  Otherwise we were racing with ourselves in
+	 * that we might fail here, but get a valid return from
+	 * ib_sa_multicast_join after we had cleared mcast->mc here,
+	 * resulting in mis-matched joins and leaves and a deadlock
+	 */
+	mutex_lock(&mcast_mutex);
+
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET) {
-		status = 0;
+	if (status == -ENETRESET)
 		goto out;
-	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (!status) {
 		mcast->backoff = 1;
-		mutex_lock(&mcast_mutex);
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
 					   &priv->mcast_task, 0);
-		mutex_unlock(&mcast_mutex);
 
 		/*
 		 * Defer carrier on work to ipoib_workqueue to avoid a
@@ -413,37 +441,35 @@ static int ipoib_mcast_join_complete(int status,
 		 */
 		if (mcast == priv->broadcast)
 			queue_work(ipoib_workqueue, &priv->carrier_on_task);
-
-		status = 0;
-		goto out;
-	}
-
-	if (mcast->logcount++ < 20) {
-		if (status == -ETIMEDOUT || status == -EAGAIN) {
-			ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
-					mcast->mcmember.mgid.raw, status);
-		} else {
-			ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
-				   mcast->mcmember.mgid.raw, status);
+	} else {
+		if (mcast->logcount++ < 20) {
+			if (status == -ETIMEDOUT || status == -EAGAIN) {
+				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
+						mcast->mcmember.mgid.raw, status);
+			} else {
+				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
+					   mcast->mcmember.mgid.raw, status);
+			}
 		}
-	}
-
-	mcast->backoff *= 2;
-	if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-		mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-	/* Clear the busy flag so we try again */
-	status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 
-	mutex_lock(&mcast_mutex);
+		mcast->backoff *= 2;
+		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+	}
+out:
 	spin_lock_irq(&priv->lock);
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	if (status)
+		mcast->mc = NULL;
+	complete(&mcast->done);
+	if (status == -ENETRESET)
+		status = 0;
+	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
 				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
-out:
-	complete(&mcast->done);
+
 	return status;
 }
 
@@ -492,10 +518,9 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
 	}
 
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	mutex_lock(&mcast_mutex);
 	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags);
-
+	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
@@ -509,13 +534,12 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
-		mutex_lock(&mcast_mutex);
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
 					   &priv->mcast_task,
 					   mcast->backoff * HZ);
-		mutex_unlock(&mcast_mutex);
 	}
+	mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -568,7 +592,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
+		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
 			ipoib_mcast_join(dev, priv->broadcast, 0);
 		return;
 	}
@@ -576,23 +601,33 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	while (1) {
 		struct ipoib_mcast *mcast = NULL;
 
+		/*
+		 * Need the mutex so our flags are consistent, need the
+		 * priv->lock so we don't race with list removals in either
+		 * mcast_dev_flush or mcast_restart_task
+		 */
+		mutex_lock(&mcast_mutex);
 		spin_lock_irq(&priv->lock);
 		list_for_each_entry(mcast, &priv->multicast_list, list) {
-			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
-			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
-			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			if (IS_ERR_OR_NULL(mcast->mc) &&
+			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 				/* Found the next unjoined group */
 				break;
 			}
 		}
 		spin_unlock_irq(&priv->lock);
+		mutex_unlock(&mcast_mutex);
 
 		if (&mcast->list == &priv->multicast_list) {
 			/* All done */
 			break;
 		}
 
-		ipoib_mcast_join(dev, mcast, 1);
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_mcast_sendonly_join(mcast);
+		else
+			ipoib_mcast_join(dev, mcast, 1);
 		return;
 	}
 
@@ -638,6 +673,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	int ret = 0;
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+		ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n");
+
+	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
@@ -690,6 +728,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
+		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
 	}
 
 	if (!mcast->ah) {
@@ -703,8 +743,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			ipoib_dbg_mcast(priv, "no address vector, "
 					"but multicast join already started\n");
-		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-			ipoib_mcast_sendonly_join(mcast);
 
 		/*
 		 * If lookup completes between here and out:, don't
@@ -766,7 +804,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
 	/* seperate between the wait to the leave*/
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
-		if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags))
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 04/22] IB/ipoib: fix mcast_dev_flush/mcast_restart_task race
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 03/22] IB/ipoib: fix MCAST_FLAG_BUSY usage Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 05/22] IB/ipoib: change init sequence ordering Doug Ledford
                     ` (18 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Our mcast_dev_flush routine and our mcast_restart_task can race
against each other.  In particular, they both hold the priv->lock
while manipulating the rbtree and while removing mcast entries from
the multicast_list and while adding entries to the remove_list, but
they also both drop their locks prior to doing the actual removes.
The mcast_dev_flush routine is run entirely under the rtnl lock and so
has at least some locking.  The actual race condition is like this:

Thread 1                                Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
    call mcast_restart_task to add new entries
                                        ifconfig ib0 down
					  mcast_dev_flush
					    mcast_leave(mcast A)
    mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as member in
core/multicast.c is ref counted, we run into an unbalanced refcount
issue.  To avoid stomping on each others removes, take the rtnl lock
specifically when we are deleting the entries from the remove list.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a52c9f3f7e4..41325960e4e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -802,7 +802,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
-	/* seperate between the wait to the leave*/
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
@@ -923,14 +926,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	netif_addr_unlock(dev);
 	local_irq_restore(flags);
 
-	/* We have to cancel outside of the spinlock */
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+			wait_for_completion(&mcast->done);
+
+	/*
+	 * We have to cancel outside of the spinlock, but we have to
+	 * take the rtnl lock or else we race with the removal of
+	 * entries from the remove list in mcast_dev_flush as part
+	 * of ipoib_stop() which will call mcast_stop_thread with
+	 * flush == 1 while holding the rtnl lock, and the
+	 * flush_workqueue won't complete until this restart_mcast_task
+	 * completes.  So do like the carrier on task and attempt to
+	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
+	 * goes away, then just return and know that the remove list will
+	 * get flushed later by mcast_dev_flush.
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
-
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		ipoib_mcast_start_thread(dev);
+	ipoib_mcast_start_thread(dev);
+	rtnl_unlock();
 }
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 05/22] IB/ipoib: change init sequence ordering
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 04/22] IB/ipoib: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 06/22] IB/ipoib: Use dedicated workqueues per interface Doug Ledford
                     ` (17 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly
still on it.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3b6f2..2cf81ef5141 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (ipoib_neigh_hash_init(priv) < 0)
-		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
 	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
 		       ca->name, ipoib_recvq_size);
-		goto out_neigh_hash_cleanup;
+		goto out;
 	}
 
 	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_tx_ring_cleanup;
 
+	/*
+	 * Must be after ipoib_ib_dev_init so we can allocate a per
+	 * device wq there and use it here
+	 */
+	if (ipoib_neigh_hash_init(priv) < 0)
+		goto out_dev_uninit;
+
 	return 0;
 
+out_dev_uninit:
+	ipoib_ib_dev_cleanup();
+
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
 
 out_rx_ring_cleanup:
 	kfree(priv->rx_ring);
 
-out_neigh_hash_cleanup:
-	ipoib_neigh_hash_uninit(dev);
 out:
 	return -ENOMEM;
 }
@@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	}
 	unregister_netdevice_many(&head);
 
+	/*
+	 * Must be before ipoib_ib_dev_cleanup or we delete an in use
+	 * work queue
+	 */
+	ipoib_neigh_hash_uninit(dev);
+
 	ipoib_ib_dev_cleanup(dev);
 
 	kfree(priv->rx_ring);
@@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
-
-	ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 06/22] IB/ipoib: Use dedicated workqueues per interface
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 05/22] IB/ipoib: change init sequence ordering Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 07/22] IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
                     ` (16 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue, we can
have one device in the middle of a down and holding the rtnl lock and
another totally unrelated device needing to run mcast_restart_task,
which wants the rtnl lock and will loop trying to take it unless is
sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
interface will never see its own ADMIN_UP flag drop, the interface
going down will deadlock trying to flush the queue.  There are several
possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail.  This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
away, then maybe after a few tries on the rtnl lock we can queue our
own task back up as a delayed work and return and avoid dropping work
on the floor that way.  But I'm not 100% convinced that we won't cause
other problems.

Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface.  In addition, keep the global
workqueue, but now limit it to only flush tasks.  In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 18 +++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      | 19 ++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++--------------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     | 22 +++++++++++++++++++++-
 6 files changed, 58 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index f4c1b20b23b..45fd10a72ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -323,6 +323,7 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
+	struct workqueue_struct *wq;
 	struct delayed_work mcast_task;
 	struct work_struct carrier_on_task;
 	struct work_struct flush_light;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 933efcea0d0..56959adb6c7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	}
 
 	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
+	queue_delayed_work(priv->wq,
 			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	/* Add this entry to passive ids list head, but do not re-add it
 	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
@@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
-			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+			queue_work(priv->wq, &priv->cm.rx_reap_task);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else
 			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
@@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 				spin_lock_irqsave(&priv->lock, flags);
 				list_move(&p->list, &priv->cm.rx_reap_list);
 				spin_unlock_irqrestore(&priv->lock, flags);
-				queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+				queue_work(priv->wq, &priv->cm.rx_reap_task);
 			}
 			return;
 		}
@@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
@@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		spin_unlock_irqrestore(&priv->lock, flags);
@@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
 	tx->dev = dev;
 	list_add(&tx->list, &priv->cm.start_list);
 	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
-	queue_work(ipoib_workqueue, &priv->cm.start_task);
+	queue_work(priv->wq, &priv->cm.start_task);
 	return tx;
 }
 
@@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 		spin_lock_irqsave(&priv->lock, flags);
 		list_move(&tx->list, &priv->cm.reap_list);
-		queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		queue_work(priv->wq, &priv->cm.reap_task);
 		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
 			  tx->neigh->daddr + 4);
 		tx->neigh = NULL;
@@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 
 	skb_queue_tail(&priv->cm.skb_queue, skb);
 	if (e)
-		queue_work(ipoib_workqueue, &priv->cm.skb_task);
+		queue_work(priv->wq, &priv->cm.skb_task);
 }
 
 static void ipoib_cm_rx_reap(struct work_struct *work)
@@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work)
 	}
 
 	if (!list_empty(&priv->cm.passive_ids))
-		queue_delayed_work(ipoib_workqueue,
+		queue_delayed_work(priv->wq,
 				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	spin_unlock_irq(&priv->lock);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..bfd17d41b5f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work)
 	__ipoib_reap_ah(dev);
 
 	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+		queue_delayed_work(priv->wq, &priv->ah_reap_task,
 				   round_jiffies_relative(HZ));
 }
 
@@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 	}
 
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+	queue_delayed_work(priv->wq, &priv->ah_reap_task,
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
@@ -881,7 +881,7 @@ timeout:
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 2cf81ef5141..42e5c278f48 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
 		return;
 	}
 
-	queue_work(ipoib_workqueue, &priv->restart_task);
+	queue_work(priv->wq, &priv->restart_task);
 }
 
 static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
@@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work)
 	__ipoib_reap_neigh(priv);
 
 	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+		queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 				   arp_tbl.gc_interval);
 }
 
@@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
 
 	/* start garbage collection */
 	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+	queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 			   arp_tbl.gc_interval);
 
 	return 0;
@@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	return 0;
 
 out_dev_uninit:
-	ipoib_ib_dev_cleanup();
+	ipoib_ib_dev_cleanup(dev);
 
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
@@ -1646,7 +1646,7 @@ register_failed:
 	/* Stop GC if started before flush */
 	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 	cancel_delayed_work(&priv->neigh_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	flush_workqueue(priv->wq);
 
 event_failed:
 	ipoib_dev_cleanup(priv->dev);
@@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device)
 		/* Stop GC */
 		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 		cancel_delayed_work(&priv->neigh_reap_task);
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 		unregister_netdev(priv->dev);
 		free_netdev(priv->dev);
@@ -1758,8 +1758,13 @@ static int __init ipoib_init_module(void)
 	 * unregister_netdev() and linkwatch_event take the rtnl lock,
 	 * so flush_scheduled_work() can deadlock during device
 	 * removal.
+	 *
+	 * In addition, bringing one device up and another down at the
+	 * same time can deadlock a single workqueue, so we have this
+	 * global fallback workqueue, but we also attempt to open a
+	 * per device workqueue each time we bring an interface up
 	 */
-	ipoib_workqueue = create_singlethread_workqueue("ipoib");
+	ipoib_workqueue = create_singlethread_workqueue("ipoib_flush");
 	if (!ipoib_workqueue) {
 		ret = -ENOMEM;
 		goto err_fs;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 41325960e4e..845f910eb21 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	 * the workqueue while holding the rtnl lock, so loop
 	 * on trylock until either we get the lock or we see
 	 * FLAG_ADMIN_UP go away as that signals that we are bailing
-	 * and can safely ignore the carrier on work
+	 * and can safely ignore the carrier on work.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
@@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status,
 	if (!status) {
 		mcast->backoff = 1;
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 
 		/*
-		 * Defer carrier on work to ipoib_workqueue to avoid a
+		 * Defer carrier on work to priv->wq to avoid a
 		 * deadlock on rtnl_lock here.
 		 */
 		if (mcast == priv->broadcast)
-			queue_work(ipoib_workqueue, &priv->carrier_on_task);
+			queue_work(priv->wq, &priv->carrier_on_task);
 	} else {
 		if (mcast->logcount++ < 20) {
 			if (status == -ETIMEDOUT || status == -EAGAIN) {
@@ -465,7 +464,7 @@ out:
 	if (status == -ENETRESET)
 		status = 0;
 	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
+		queue_delayed_work(priv->wq, &priv->mcast_task,
 				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
@@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task,
+			queue_delayed_work(priv->wq, &priv->mcast_task,
 					   mcast->backoff * HZ);
 	}
 	mutex_unlock(&mcast_mutex);
@@ -576,8 +574,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(ipoib_workqueue,
-						   &priv->mcast_task, HZ);
+				queue_delayed_work(priv->wq, &priv->mcast_task,
+						   HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -644,7 +642,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 
 	mutex_lock(&mcast_mutex);
 	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -662,7 +660,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	mutex_unlock(&mcast_mutex);
 
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	return 0;
 }
@@ -729,7 +727,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
 		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
 	if (!mcast->ah) {
@@ -944,7 +942,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * completes.  So do like the carrier on task and attempt to
 	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
 	 * goes away, then just return and know that the remove list will
-	 * get flushed later by mcast_dev_flush.
+	 * get flushed later by mcast_stop_thread.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index c56d5d44c53..b72a753eb41 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	int ret, size;
 	int i;
 
+	/*
+	 * the various IPoIB tasks assume they will never race against
+	 * themselves, so always use a single thread workqueue
+	 */
+	priv->wq = create_singlethread_workqueue("ipoib_wq");
+	if (!priv->wq) {
+		printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
+		return -ENODEV;
+	}
+
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
 		printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name);
-		return -ENODEV;
+		goto out_free_wq;
 	}
 
 	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
@@ -242,6 +252,10 @@ out_free_mr:
 
 out_free_pd:
 	ib_dealloc_pd(priv->pd);
+
+out_free_wq:
+	destroy_workqueue(priv->wq);
+	priv->wq = NULL;
 	return -ENODEV;
 }
 
@@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 
 	if (ib_dealloc_pd(priv->pd))
 		ipoib_warn(priv, "ib_dealloc_pd failed\n");
+
+	if (priv->wq) {
+		flush_workqueue(priv->wq);
+		destroy_workqueue(priv->wq);
+		priv->wq = NULL;
+	}
 }
 
 void ipoib_event(struct ib_event_handler *handler,
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 07/22] IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 06/22] IB/ipoib: Use dedicated workqueues per interface Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 08/22] IB/ipoib: No longer use flush as a parameter Doug Ledford
                     ` (15 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

We used to pass a flush variable to mcast_stop_thread to indicate if
we should flush the workqueue or not.  This was due to some code
trying to flush a workqueue that it was currently running on which is
a no-no.  Now that we have per-device work queues, and now that
ipoib_mcast_restart_task has taken the fact that it is queued on a
single thread workqueue with all of the ipoib_mcast_join_task's and
therefore has no need to stop the join task while it runs, we can do
away with the flush parameter and unilaterally flush always.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  4 ++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++------------
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 45fd10a72ec..28dc927c0e8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -499,7 +499,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush);
+int ipoib_mcast_stop_thread(struct net_device *dev);
 
 void ipoib_mcast_dev_down(struct net_device *dev);
 void ipoib_mcast_dev_flush(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index bfd17d41b5f..66096787119 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
 	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 	netif_carrier_off(dev);
 
-	ipoib_mcast_stop_thread(dev, flush);
+	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
 	ipoib_flush_paths(dev);
@@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	 */
 	ipoib_flush_paths(dev);
 
-	ipoib_mcast_stop_thread(dev, 1);
+	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
 	ipoib_transport_dev_cleanup(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 845f910eb21..bc50dd0d0e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -648,7 +648,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	return 0;
 }
 
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
+int ipoib_mcast_stop_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -659,8 +659,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	cancel_delayed_work(&priv->mcast_task);
 	mutex_unlock(&mcast_mutex);
 
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	return 0;
 }
@@ -838,8 +837,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
-	ipoib_mcast_stop_thread(dev, 0);
-
 	local_irq_save(flags);
 	netif_addr_lock(dev);
 	spin_lock(&priv->lock);
@@ -936,13 +933,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * We have to cancel outside of the spinlock, but we have to
 	 * take the rtnl lock or else we race with the removal of
 	 * entries from the remove list in mcast_dev_flush as part
-	 * of ipoib_stop() which will call mcast_stop_thread with
-	 * flush == 1 while holding the rtnl lock, and the
-	 * flush_workqueue won't complete until this restart_mcast_task
-	 * completes.  So do like the carrier on task and attempt to
-	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
-	 * goes away, then just return and know that the remove list will
-	 * get flushed later by mcast_stop_thread.
+	 * of ipoib_stop().  We detect the drop of the ADMIN_UP flag
+	 * to signal that we have hit this particular race, and we
+	 * return since we know we don't need to do anything else
+	 * anyway.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
@@ -954,6 +948,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
+	/*
+	 * Restart our join task if needed
+	 */
 	ipoib_mcast_start_thread(dev);
 	rtnl_unlock();
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/22] IB/ipoib: No longer use flush as a parameter
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 07/22] IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage Doug Ledford
                     ` (14 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Various places in the IPoIB code had a deadlock related to flushing
the ipoib workqueue.  Now that we have per device workqueues and a
specific flush workqueue, there is no longer a deadlock issue with
flushing the device specific workqueues and we can do so unilaterally.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   | 19 +++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  8 ++++----
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 28dc927c0e8..8ba80a6d3a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -484,10 +484,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
 void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush);
+int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
-int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev, int flush);
+int ipoib_ib_dev_down(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66096787119..fe65abb5150 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -664,7 +664,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
 	drain_tx_cq((struct net_device *)ctx);
 }
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush)
+int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -706,7 +706,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 dev_stop:
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
 		napi_enable(&priv->napi);
-	ipoib_ib_dev_stop(dev, flush);
+	ipoib_ib_dev_stop(dev);
 	return -1;
 }
 
@@ -738,7 +738,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
 	return ipoib_mcast_start_thread(dev);
 }
 
-int ipoib_ib_dev_down(struct net_device *dev, int flush)
+int ipoib_ib_dev_down(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -807,7 +807,7 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+int ipoib_ib_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -880,8 +880,7 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
@@ -918,7 +917,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		    (unsigned long) dev);
 
 	if (dev->flags & IFF_UP) {
-		if (ipoib_ib_dev_open(dev, 1)) {
+		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
 			return -ENODEV;
 		}
@@ -1040,12 +1039,12 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
-		ipoib_ib_dev_down(dev, 0);
+		ipoib_ib_dev_down(dev);
 
 	if (level == IPOIB_FLUSH_HEAVY) {
 		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-			ipoib_ib_dev_stop(dev, 0);
-		if (ipoib_ib_dev_open(dev, 0) != 0)
+			ipoib_ib_dev_stop(dev);
+		if (ipoib_ib_dev_open(dev) != 0)
 			return;
 		if (netif_queue_stopped(dev))
 			netif_start_queue(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 42e5c278f48..6bad17d4d58 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_ib_dev_open(dev, 1)) {
+	if (ipoib_ib_dev_open(dev)) {
 		if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
 			return 0;
 		goto err_disable;
@@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev)
 	return 0;
 
 err_stop:
-	ipoib_ib_dev_stop(dev, 1);
+	ipoib_ib_dev_stop(dev);
 
 err_disable:
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
@@ -157,8 +157,8 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
-	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev, 0);
+	ipoib_ib_dev_down(dev);
+	ipoib_ib_dev_stop(dev);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 08/22] IB/ipoib: No longer use flush as a parameter Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
       [not found]     ` <73f38862d167a9849482464519c04b7c1f0a8b7c.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 10/22] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
                     ` (13 subsequent siblings)
  22 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
it is used to mean "our device is administratively allowed to send
multicast joins/leaves/packets" and in other places it means "our
multicast join task thread is currently running and will process your
request if you put it on the queue".  However, this latter meaning is in
fact flawed as there is a race condition between the join task testing
the mcast list and finding it empty of remaining work, dropping the
mcast mutex and also the priv->lock spinlock, and clearing the
IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
the flag in the former fashion, and when all tasks complete and the task
thread clears the RUN flag, all of those other locations will fail to
ever again queue any work.  This results in the interface coming up fine
initially, but having problems adding new multicast groups after the
first round of groups have all been added and the RUN flag is cleared by
the join task thread when it thinks it is done.  To resolve this issue,
convert all locations in the code to treat the RUN flag as an indicator
that the multicast portion of this interface is in fact administratively
up and joins/leaves/sends can be performed.  There is no harm (other
than a slight performance penalty) to never clearing this flag and using
it in this fashion as it simply means that a few places that used to
micro-optimize how often this task was queued on a work queue will now
queue the task a few extra times.  We can address that suboptimal
behavior in future patches.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bc50dd0d0e4..91b8fe118ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	set_bit(IPOIB_MCAST_RUN, &priv->flags);
+	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
@@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task if needed
 	 */
-	ipoib_mcast_start_thread(dev);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/22] IB/ipoib: Add a helper to restart the multicast task
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 11/22] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
                     ` (12 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Add a simple helper to do the right thing when restarting
the multicast thread.  Call that helper from all the places
that need to restart the thread.  Add two places that didn't
restart the thread, but should have.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 46 ++++++++++++++++----------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 91b8fe118ec..cb1e495bd74 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -66,6 +66,13 @@ struct ipoib_mcast_iter {
 	unsigned int       send_only;
 };
 
+static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       int delay)
+{
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+}
+
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
@@ -270,6 +277,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/*
 	 * We have to take the mutex to force mcast_sendonly_join to
@@ -309,6 +317,7 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
+	__ipoib_mcast_continue_join_thread(priv, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
@@ -360,6 +369,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		complete(&mcast->done);
 		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
 			   "failed (ret = %d)\n", ret);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -431,8 +441,7 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -454,6 +463,13 @@ static int ipoib_mcast_join_complete(int status,
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+		/*
+		 * XXX - This is wrong.  *Our* join failed, but because the
+		 * join thread does the joins in a serial fashion, if there
+		 * are any joins behind ours waiting to complete, they should
+		 * not be subjected to our backoff delay.
+		 */
+		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
 	}
 out:
 	spin_lock_irq(&priv->lock);
@@ -463,9 +479,6 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
-	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task,
-				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
@@ -532,10 +545,13 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
+		/*
+		 * XXX - This is wrong.  *Our* join failed, but because the
+		 * join thread does the joins in a serial fashion, if there
+		 * are any joins behind ours waiting to complete, they should
+		 * not be subjected to our backoff delay.
+		 */
+		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
 	}
 	mutex_unlock(&mcast_mutex);
 }
@@ -573,9 +589,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
-			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(priv->wq, &priv->mcast_task,
-						   HZ);
+			__ipoib_mcast_continue_join_thread(priv, HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -723,8 +737,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 	}
 
 	if (!mcast->ah) {
@@ -947,10 +960,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		ipoib_mcast_free(mcast);
 	}
 	/*
-	 * Restart our join task if needed
+	 * Restart our join task thread if needed
 	 */
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	__ipoib_mcast_continue_join_thread(priv, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 11/22] IB/ipoib: make delayed tasks not hold up everything
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 10/22] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 12/22] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
                     ` (11 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Currently, the multicast join thread only processes one task at a
time.  It does this by having itself be scheduled as a delayed work
item, when it runs, it finds one, and only one, piece of work to
process, it then kicks that off via either the normal join process
or the sendonly join process, and then it immediately exits.  Both
of those process chains are responsible for restarting the task
when they have completed their specific action.  This makes the entire
join process serial with only one join supposedly ever outstanding at
a time.

However, if we fail a join, and we need to initiate a backoff delay,
that delay holds up the entire join process for all joins, not just
the failed join.  So modify the design such that we can have joins
in delay, and the multicast thread will ignore them until their time
comes around, and then we can process the rest of the queue without
waiting for the delayed items.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 86 +++++++++++++++++---------
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 8ba80a6d3a4..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -154,6 +154,7 @@ struct ipoib_mcast {
 
 	unsigned long created;
 	unsigned long backoff;
+	unsigned long delay_until;
 
 	unsigned long flags;
 	unsigned char logcount;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index cb1e495bd74..957e7d2e80c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -67,10 +67,34 @@ struct ipoib_mcast_iter {
 };
 
 static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       struct ipoib_mcast *mcast,
 					       int delay)
 {
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
+		/*
+		 * Mark this mcast for its delay and set a timer to kick the
+		 * thread when the delay completes
+		 */
+		if (mcast && delay) {
+			mcast->backoff *= 2;
+			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+			mcast->delay_until = jiffies + (mcast->backoff * HZ);
+			queue_delayed_work(priv->wq, &priv->mcast_task,
+					   mcast->backoff * HZ);
+		} else if (delay) {
+			/* Special case of retrying after a failure to
+			 * allocate the broadcast multicast group, wait
+			 * 1 second and try again
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
+		}
+		/*
+		 * But also rekick the thread immediately for any other
+		 * tasks in queue behind this one
+		 */
 		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+	}
 }
 
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
@@ -110,6 +134,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev,
 
 	mcast->dev = dev;
 	mcast->created = jiffies;
+	mcast->delay_until = jiffies;
 	mcast->backoff = 1;
 
 	INIT_LIST_HEAD(&mcast->list);
@@ -309,6 +334,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+	} else {
+		/* Join completed, so reset any backoff parameters */
+		mcast->backoff = 1;
+		mcast->delay_until = jiffies;
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -317,7 +346,7 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
-	__ipoib_mcast_continue_join_thread(priv, 0);
+	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
@@ -369,7 +398,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		complete(&mcast->done);
 		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
 			   "failed (ret = %d)\n", ret);
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -441,7 +470,8 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -460,16 +490,8 @@ static int ipoib_mcast_join_complete(int status,
 			}
 		}
 
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-		/*
-		 * XXX - This is wrong.  *Our* join failed, but because the
-		 * join thread does the joins in a serial fashion, if there
-		 * are any joins behind ours waiting to complete, they should
-		 * not be subjected to our backoff delay.
-		 */
-		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
 	spin_lock_irq(&priv->lock);
@@ -541,17 +563,8 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		complete(&mcast->done);
 		ret = PTR_ERR(mcast->mc);
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
-
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-		/*
-		 * XXX - This is wrong.  *Our* join failed, but because the
-		 * join thread does the joins in a serial fashion, if there
-		 * are any joins behind ours waiting to complete, they should
-		 * not be subjected to our backoff delay.
-		 */
-		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 	mutex_unlock(&mcast_mutex);
 }
@@ -589,7 +602,13 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
-			__ipoib_mcast_continue_join_thread(priv, HZ);
+			/*
+			 * Restart us after a 1 second delay to retry
+			 * creating our broadcast group and attaching to
+			 * it.  Until this succeeds, this ipoib dev is
+			 * completely stalled (multicast wise).
+			 */
+			__ipoib_mcast_continue_join_thread(priv, NULL, 1);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -623,7 +642,10 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		list_for_each_entry(mcast, &priv->multicast_list, list) {
 			if (IS_ERR_OR_NULL(mcast->mc) &&
 			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED,
+				      &mcast->flags) &&
+			    (mcast->backoff == 1 ||
+			     time_after_eq(jiffies, mcast->delay_until))) {
 				/* Found the next unjoined group */
 				break;
 			}
@@ -632,7 +654,11 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		mutex_unlock(&mcast_mutex);
 
 		if (&mcast->list == &priv->multicast_list) {
-			/* All done */
+			/* All done, unless we have delayed work from
+			 * backoff retransmissions, but we will get
+			 * restarted when the time is right, so we are
+			 * done for now
+			 */
 			break;
 		}
 
@@ -737,7 +763,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 
 	if (!mcast->ah) {
@@ -962,7 +988,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task thread if needed
 	 */
-	__ipoib_mcast_continue_join_thread(priv, 0);
+	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 12/22] IB/ipoib: Handle -ENETRESET properly in our callback
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 11/22] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 13/22] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
                     ` (10 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

When the core layer calls our callback with ENETRESET as the error, we
clear the status to 0 before returning indicating that we are going to
handle the error ourselves.  This causes the core layer to not free the
mcast->mc structure, but we are releasing our reference to it by
clearing mcast->mc.  So, preserve our reference to the multicast
structure so when we next run ipoib_mcast_dev_flush, it will be able
to properly release the mcast->mc entry at the right time in the right
way.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 957e7d2e80c..50d2de2270a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -315,8 +315,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -344,8 +346,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
@@ -462,8 +462,10 @@ static int ipoib_mcast_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -499,8 +501,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 13/22] IB/ipoib: don't restart our thread on ENETRESET
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 12/22] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 14/22] IB/ipoib: remove unneeded locks Doug Ledford
                     ` (9 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Make the send only join completion handler behave like the normal join
completion handler in that we should restart the multicast join thread
whenever our join completes with either an error or success, but don't
restart the thread if we got a net reset event, let ipoib_event queue
up a device flush, which will then call ipoib_mcast_dev_flush to remove
all of the current mcast entries, and ipoib_mcase_restart_task to check
out current mcast entry list against the netdev list of mcast entries
we are supposed to have and requeue the ones we need to get back.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 50d2de2270a..a03b6a04203 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -336,17 +336,18 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	} else {
 		/* Join completed, so reset any backoff parameters */
 		mcast->backoff = 1;
 		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 14/22] IB/ipoib: remove unneeded locks
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 13/22] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
       [not found]     ` <3cd3c664adb2877317c8f684ee344749b2915e45.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 15/22] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
                     ` (8 subsequent siblings)
  22 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

During the various work I've done on this, some extra locking has crept
in when things weren't working right.  This is one of those spots.
Remove the unneeded spinlocks.  The mutex is enough to protect against
what we need to protect against.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a03b6a04203..424c602d5a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -497,12 +497,10 @@ static int ipoib_mcast_join_complete(int status,
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
-	spin_lock_irq(&priv->lock);
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
 	return status;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 15/22] IB/ipoib: fix race between mcast_dev_flush and mcast_join
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 14/22] IB/ipoib: remove unneeded locks Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 16/22] IB/ipoib: fix ipoib_mcast_restart_task Doug Ledford
                     ` (7 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

There is a classic race between mcast_dev_flush and mcast_join_thread
in that the join thread is going to loop through all of the mcast
entries in the device mcast list, while dev_flush wants to remove
those entries from the list so that they can eventually be freed when
they are no longer busy.

As always, the list traversal and the list removal must be done under
the same lock to avoid races, so modify the join_task thread code to
hold onto the spinlock for the entirety of its run up until it has
saved the mcast entry that it's going to work on into a private pointer
and set the flags on that entry to mark it as busy.  Only then is it
safe to drop the lock and let mcast_dev_flush remove entries from the
list.

I also reworked the flow of the join_task thread to be more clear that
it is making a single pass at work to do and then exiting as soon as it
gets that single task queued off.

I then reworked both the regular and send only join routines to know
that they will be called with the BUSY flag already set on the mcast
entry they are passed, so any error condition means they need to
clear the flag and complete the task and they should not check the
busy status as it will now always be true.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 112 ++++++++++++-------------
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 424c602d5a0..972fa15a225 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -368,22 +368,16 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
 		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
 				"multicast joins\n");
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		complete(&mcast->done);
 		return -ENODEV;
 	}
 
-	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
-				"sendonly join\n");
-		return -EBUSY;
-	}
-
 	rec.mgid     = mcast->mcmember.mgid;
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
 	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
 					 priv->port, &rec,
 					 IB_SA_MCMEMBER_REC_MGID	|
@@ -552,8 +546,6 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 	}
 
 	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
@@ -574,6 +566,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		container_of(work, struct ipoib_dev_priv, mcast_task.work);
 	struct net_device *dev = priv->dev;
 	struct ib_port_attr port_attr;
+	struct ipoib_mcast *mcast = NULL;
+	int create = 1;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -591,16 +585,25 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
+	/*
+	 * We have to hold the mutex to keep from racing with the join
+	 * completion threads on setting flags on mcasts, and we have
+	 * to hold the priv->lock because dev_flush will remove entries
+	 * out from underneath us, so at a minimum we need the lock
+	 * through the time that we do the for_each loop of the mcast
+	 * list or else dev_flush can make us oops.
+	 */
+	mutex_lock(&mcast_mutex);
+	spin_lock_irq(&priv->lock);
+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		goto out;
+
 	if (!priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-
-		broadcast = ipoib_mcast_alloc(dev, 1);
+		broadcast = ipoib_mcast_alloc(dev, 0);
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
-			mutex_lock(&mcast_mutex);
 			/*
 			 * Restart us after a 1 second delay to retry
 			 * creating our broadcast group and attaching to
@@ -608,67 +611,64 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			 * completely stalled (multicast wise).
 			 */
 			__ipoib_mcast_continue_join_thread(priv, NULL, 1);
-			mutex_unlock(&mcast_mutex);
-			return;
+			goto out;
 		}
 
-		spin_lock_irq(&priv->lock);
 		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
 		priv->broadcast = broadcast;
 
 		__ipoib_mcast_add(dev, priv->broadcast);
-		spin_unlock_irq(&priv->lock);
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
-		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
-			ipoib_mcast_join(dev, priv->broadcast, 0);
-		return;
-	}
-
-	while (1) {
-		struct ipoib_mcast *mcast = NULL;
-
-		/*
-		 * Need the mutex so our flags are consistent, need the
-		 * priv->lock so we don't race with list removals in either
-		 * mcast_dev_flush or mcast_restart_task
-		 */
-		mutex_lock(&mcast_mutex);
-		spin_lock_irq(&priv->lock);
-		list_for_each_entry(mcast, &priv->multicast_list, list) {
-			if (IS_ERR_OR_NULL(mcast->mc) &&
-			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED,
-				      &mcast->flags) &&
-			    (mcast->backoff == 1 ||
-			     time_after_eq(jiffies, mcast->delay_until))) {
-				/* Found the next unjoined group */
-				break;
-			}
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
+			mcast = priv->broadcast;
+			create = 0;
 		}
-		spin_unlock_irq(&priv->lock);
-		mutex_unlock(&mcast_mutex);
+		goto out;
+	}
 
-		if (&mcast->list == &priv->multicast_list) {
-			/* All done, unless we have delayed work from
-			 * backoff retransmissions, but we will get
-			 * restarted when the time is right, so we are
-			 * done for now
-			 */
+	/* We'll never get here until the broadcast group is both allocated
+	 * and attached
+	 */
+	list_for_each_entry(mcast, &priv->multicast_list, list) {
+		if (IS_ERR_OR_NULL(mcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) &&
+		    (mcast->backoff == 1 ||
+		     time_after_eq(jiffies, mcast->delay_until))) {
+			/* Found the next unjoined group */
 			break;
 		}
+	}
 
+	if (&mcast->list == &priv->multicast_list) {
+		/* All done, unless we have delayed work from
+		 * backoff retransmissions, but we will get
+		 * restarted when the time is right, so we are
+		 * done for now
+		 */
+		mcast = NULL;
+		ipoib_dbg_mcast(priv, "successfully joined all "
+				"multicast groups\n");
+	}
+
+out:
+	if (mcast) {
+		init_completion(&mcast->done);
+		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	}
+	spin_unlock_irq(&priv->lock);
+	mutex_unlock(&mcast_mutex);
+	if (mcast) {
 		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
 			ipoib_mcast_sendonly_join(mcast);
 		else
-			ipoib_mcast_join(dev, mcast, 1);
-		return;
+			ipoib_mcast_join(dev, mcast, create);
 	}
-
-	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
+	return;
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 16/22] IB/ipoib: fix ipoib_mcast_restart_task
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 15/22] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 17/22] IB/ipoib: flush the ipoib_workqueue on unregister Doug Ledford
                     ` (6 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

There are two intertwined problems here.

First, if we are downing the interface, we will get called as part of
the process.  When that happens, we don't really want to do anything
here as the next thing to happen is ipoib_mcast_dev_flush will get
called.  We just want to let it do all the work since we don't need to
do any address matches and such.  So, we should be testing for whether
or not the interface is up at the beginning of this function and
returning if it isn't.

Second, the attempt to grab the rtnl_lock to prevent racing against
ipoib_mcast_dev_flush was mistaken.  The removed mcast groups are on a
stack local list head and so once we've removed them from the mcast
rbtree, ipoib_mcast_dev_flush can no longer find them and race us on
removing them.  However, because we attempt to take the lock, and then
check the status of the interface, if our interface has gone down, we
simply return.  This is absolutely the wrong thing to do as it means we
just leaked all of the mcast entries on our remove list.

The fix is to check for the interface being up at the very beginning of
the function, and then to always remove our entries on the remove list
no matter what as they will get leaked otherwise, and skip taking the
rtnl_lock since it isn't needed anyway.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 972fa15a225..10e05437c83 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -871,6 +871,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	unsigned long flags;
 	struct ib_sa_mcmember_rec rec;

+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		return;
+
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");

 	local_irq_save(flags);
@@ -965,21 +968,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);

-	/*
-	 * We have to cancel outside of the spinlock, but we have to
-	 * take the rtnl lock or else we race with the removal of
-	 * entries from the remove list in mcast_dev_flush as part
-	 * of ipoib_stop().  We detect the drop of the ADMIN_UP flag
-	 * to signal that we have hit this particular race, and we
-	 * return since we know we don't need to do anything else
-	 * anyway.
-	 */
-	while (!rtnl_trylock()) {
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-		else
-			msleep(20);
-	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
@@ -988,7 +976,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * Restart our join task thread if needed
 	 */
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
-	rtnl_unlock();
 }

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 17/22] IB/ipoib: flush the ipoib_workqueue on unregister
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 16/22] IB/ipoib: fix ipoib_mcast_restart_task Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 18/22] IB/ipoib: cleanup a couple debug messages Doug Ledford
                     ` (5 subsequent siblings)
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

When we unregister our event handler as a prelude to removing our
device, we need to go ahead and flush the ipoib_workqueue to make sure
there are no delayed flushes for the device we are getting ready to
remove.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6bad17d4d58..8810514cf40 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1643,6 +1643,7 @@ sysfs_failed:
 
 register_failed:
 	ib_unregister_event_handler(&priv->event_handler);
+	flush_workqueue(ipoib_workqueue);
 	/* Stop GC if started before flush */
 	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 	cancel_delayed_work(&priv->neigh_reap_task);
@@ -1709,6 +1710,7 @@ static void ipoib_remove_one(struct ib_device *device)
 
 	list_for_each_entry_safe(priv, tmp, dev_list, list) {
 		ib_unregister_event_handler(&priv->event_handler);
+		flush_workqueue(ipoib_workqueue);
 
 		rtnl_lock();
 		dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 18/22] IB/ipoib: cleanup a couple debug messages
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 17/22] IB/ipoib: flush the ipoib_workqueue on unregister Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
       [not found]     ` <7aeffef3862526da5a472c15f94564897a4d7537.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown Doug Ledford
                     ` (4 subsequent siblings)
  22 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

The one I removed in particular can prevent progress in the driver from
happening because it isn't rate limited.  In my testing, I have seen the
flood of messages from that debug message delay multicast joins for in
excess of 30 seconds.

The added debug message was just so I would know when we were leaving a
multicast group that we didn't fully join.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 10e05437c83..ad2bcd105e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -711,6 +711,8 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 
 	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
+	else
+		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -721,7 +723,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 				      be16_to_cpu(mcast->mcmember.mlid));
 		if (ret)
 			ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret);
-	}
+	} else if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+		ipoib_dbg(priv, "leaving with no mcmember but not a "
+			  "SENDONLY join\n");
 
 	return 0;
 }
@@ -772,11 +776,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
-
-		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
-			ipoib_dbg_mcast(priv, "no address vector, "
-					"but multicast join already started\n");
-
 		/*
 		 * If lookup completes between here and out:, don't
 		 * want to send packet twice.
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (17 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 18/22] IB/ipoib: cleanup a couple debug messages Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
       [not found]     ` <14ee2e7737123c7d37465ac52be5c026240c9985.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 20/22] IB/ipoib: don't queue a work struct up twice Doug Ledford
                     ` (3 subsequent siblings)
  22 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

The introduction of garbage collection for neighbors in the ipoib stack
caused a leak of resources.  In particular, an ah can have a refcount
from a mcast struct, from a neigh struct, and from a path struct.  When
we put an ah ref, the ah gets moved to a dead list to be reaped later if
the refcount goes to 0.

During ipoib_ib_dev_stop we attempt to reap these ah entries, but since
that function does not force the neigh garbage collection to be run, it
can leak any entries held by a neighbor.  And even though the code
attempts to warn us that we will leak entries, that warning is broken
because the entries we are leaking will not yet be on the dead_ahs list
because they still have valid references held elsewhere.  It is during
shutdown, when we finally kill the neighbor garbage collection, that our
final ahs will get put on the dead_ahs list, but after the point at
which we run the neighbor garbage collection, no attempt is made to reap
the newly added ahs, and because during ipoib_ib_dev_stop we killed the
ah_reap task, no timed attempt will be made to clear them either.

Instead, create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

This is done to prevent resource leaks on shutdown, which manifest with
this message on ib_ipoib module remove:

<ibdev>: ib_dealloc_pd failed

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    | 44 ++++++++++++++++++------------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  4 ++-
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index fe65abb5150..e144d07d53c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,6 +659,23 @@ void ipoib_reap_ah(struct work_struct *work)
 				   round_jiffies_relative(HZ));
 }
 
+static void ipoib_flush_ah(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	cancel_delayed_work(&priv->ah_reap_task);
+	flush_workqueue(priv->wq);
+	ipoib_reap_ah(&priv->ah_reap_task.work);
+}
+
+static void ipoib_stop_ah(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	set_bit(IPOIB_STOP_REAPER, &priv->flags);
+	ipoib_flush_ah(dev);
+}
+
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
 	drain_tx_cq((struct net_device *)ctx);
@@ -877,23 +894,7 @@ timeout:
 	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-	/* Wait for all AHs to be reaped */
-	set_bit(IPOIB_STOP_REAPER, &priv->flags);
-	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(priv->wq);
-
-	begin = jiffies;
-
-	while (!list_empty(&priv->dead_ahs)) {
-		__ipoib_reap_ah(dev);
-
-		if (time_after(jiffies, begin + HZ)) {
-			ipoib_warn(priv, "timing out; will leak address handles\n");
-			break;
-		}
-
-		msleep(1);
-	}
+	ipoib_flush_ah(dev);
 
 	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -1036,6 +1037,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	if (level == IPOIB_FLUSH_LIGHT) {
 		ipoib_mark_paths_invalid(dev);
 		ipoib_mcast_dev_flush(dev);
+		ipoib_flush_ah(dev);
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
@@ -1099,6 +1101,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
+	/*
+	 * All of our ah references aren't free until after
+	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+	 * the neighbor garbage collection is stopped and reaped.
+	 * That should all be done now, so make a final ah flush.
+	 */
+	ipoib_stop_ah(dev);
+
 	ipoib_transport_dev_cleanup(dev);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index b72a753eb41..66df78382a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -263,6 +263,9 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	if (priv->wq)
+		flush_workqueue(priv->wq);
+
 	if (priv->qp) {
 		if (ib_destroy_qp(priv->qp))
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
@@ -286,7 +289,6 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 		ipoib_warn(priv, "ib_dealloc_pd failed\n");
 
 	if (priv->wq) {
-		flush_workqueue(priv->wq);
 		destroy_workqueue(priv->wq);
 		priv->wq = NULL;
 	}
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 20/22] IB/ipoib: don't queue a work struct up twice
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (18 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
       [not found]     ` <30a5bd6461381448c52af0d7408dbc14da9ac4d0.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-12  1:43   ` [PATCH 21/22] IB/ipoib: deserialize multicast joins Doug Ledford
                     ` (2 subsequent siblings)
  22 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

In b2d408f78b3 (IB/ipoib: make delayed tasks not hold up everything) I
added the ability to have some joins processed around delayed joins.
Because the join task processes one join at a time, we needed to queue
the task to run again immediately and then again at the appropriate
delay time.  I didn't think about the fact that our work struct can only
be on the workqueue once at a given time and instead tried to queue the
same struct up twice.  Instead, kick the queue again immediately on a
delayed restart attempt, when the task sees that it has nothing to do
but delayed work, it will pick the shortest work item and queue a delay
equal to its expiration.  Finally, in case we have delayed work, in the
send_mcast routine we need to cancel any delayed work before kicking the
thread immediately.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 49 ++++++++++++++++----------
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ad2bcd105e4..68b47bed33a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -71,29 +71,28 @@ static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
 					       int delay)
 {
 	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
-		/*
-		 * Mark this mcast for its delay and set a timer to kick the
-		 * thread when the delay completes
-		 */
 		if (mcast && delay) {
 			mcast->backoff *= 2;
 			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 			mcast->delay_until = jiffies + (mcast->backoff * HZ);
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
+			/*
+			 * Mark this mcast for its delay, but restart the
+			 * task immediately.  It will make sure to clear
+			 * out all entries without delays, and then
+			 * schedule itself for run again when the delay
+			 * expires
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 		} else if (delay) {
-			/* Special case of retrying after a failure to
+			/*
+			 * Special case of retrying after a failure to
 			 * allocate the broadcast multicast group, wait
 			 * 1 second and try again
 			 */
 			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
-		}
-		/*
-		 * But also rekick the thread immediately for any other
-		 * tasks in queue behind this one
-		 */
-		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+		} else
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 }
 
@@ -568,6 +567,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	struct ib_port_attr port_attr;
 	struct ipoib_mcast *mcast = NULL;
 	int create = 1;
+	unsigned long delay_until = 0;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -636,11 +636,14 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	list_for_each_entry(mcast, &priv->multicast_list, list) {
 		if (IS_ERR_OR_NULL(mcast->mc) &&
 		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) &&
-		    (mcast->backoff == 1 ||
-		     time_after_eq(jiffies, mcast->delay_until))) {
-			/* Found the next unjoined group */
-			break;
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			if (mcast->backoff == 1 ||
+			    time_after_eq(jiffies, mcast->delay_until))
+				/* Found the next unjoined group */
+				break;
+			else if (!delay_until ||
+				 time_before(mcast->delay_until, delay_until))
+				delay_until = mcast->delay_until;
 		}
 	}
 
@@ -653,6 +656,9 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		mcast = NULL;
 		ipoib_dbg_mcast(priv, "successfully joined all "
 				"multicast groups\n");
+		if (delay_until)
+			queue_delayed_work(priv->wq, &priv->mcast_task,
+					   delay_until - jiffies);
 	}
 
 out:
@@ -766,6 +772,13 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
+		/*
+		 * Make sure that if we are currently waiting for a backoff
+		 * delay to expire, that we cancel that, run immediately,
+		 * and then requeue our task to fire when the backoff
+		 * timer is over.
+		 */
+		cancel_delayed_work(&priv->mcast_task);
 		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 21/22] IB/ipoib: deserialize multicast joins
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (19 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 20/22] IB/ipoib: don't queue a work struct up twice Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  1:43   ` [PATCH 22/22] IB/ipoib: drop mcast_mutex usage Doug Ledford
  2015-02-12  4:46   ` [PATCH 00/22] IB/ipoib: Fixups for multicast issues Or Gerlitz
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Allow the ipoib layer to attempt to join all outstanding multicast
groups at once.  The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempt to join different groups
in parallel.  Take advantage of that.

In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join.  There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling as
the join task will fire off everything.  Unless we are the broadcast
group, and then we need to.  Or in case we need to initiate a retry.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and
quit trying.  We leave the mcast group in the list so that if we get a
new packet, all that we have to do is queue up the packet and restart
the join task and it will automatically try to join twice and then
either send or flush the queue again.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 264 +++++++++----------------
 1 file changed, 88 insertions(+), 176 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 68b47bed33a..355c320dc4d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -295,113 +295,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 	return 0;
 }
 
-static int
-ipoib_mcast_sendonly_join_complete(int status,
-				   struct ib_sa_multicast *multicast)
-{
-	struct ipoib_mcast *mcast = multicast->context;
-	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-	/*
-	 * We have to take the mutex to force mcast_sendonly_join to
-	 * return from ib_sa_multicast_join and set mcast->mc to a
-	 * valid value.  Otherwise we were racing with ourselves in
-	 * that we might fail here, but get a valid return from
-	 * ib_sa_multicast_join after we had cleared mcast->mc here,
-	 * resulting in mis-matched joins and leaves and a deadlock
-	 */
-	mutex_lock(&mcast_mutex);
-
-	/* We trap for port events ourselves. */
-	if (status == -ENETRESET) {
-		status = 0;
-		goto out;
-	}
-
-	if (!status)
-		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
-
-	if (status) {
-		if (mcast->logcount++ < 20)
-			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
-					"join failed for %pI6, status %d\n",
-					mcast->mcmember.mgid.raw, status);
-
-		/* Flush out any queued packets */
-		netif_tx_lock_bh(dev);
-		while (!skb_queue_empty(&mcast->pkt_queue)) {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
-		}
-		netif_tx_unlock_bh(dev);
-		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
-	} else {
-		/* Join completed, so reset any backoff parameters */
-		mcast->backoff = 1;
-		mcast->delay_until = jiffies;
-		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
-	}
-out:
-	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-	if (status)
-		mcast->mc = NULL;
-	complete(&mcast->done);
-	mutex_unlock(&mcast_mutex);
-	return status;
-}
-
-static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
-{
-	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_sa_mcmember_rec rec = {
-#if 0				/* Some SMs don't support send-only yet */
-		.join_state = 4
-#else
-		.join_state = 1
-#endif
-	};
-	int ret = 0;
-
-	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
-				"multicast joins\n");
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		complete(&mcast->done);
-		return -ENODEV;
-	}
-
-	rec.mgid     = mcast->mcmember.mgid;
-	rec.port_gid = priv->local_gid;
-	rec.pkey     = cpu_to_be16(priv->pkey);
-
-	mutex_lock(&mcast_mutex);
-	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
-					 priv->port, &rec,
-					 IB_SA_MCMEMBER_REC_MGID	|
-					 IB_SA_MCMEMBER_REC_PORT_GID	|
-					 IB_SA_MCMEMBER_REC_PKEY	|
-					 IB_SA_MCMEMBER_REC_JOIN_STATE,
-					 GFP_ATOMIC,
-					 ipoib_mcast_sendonly_join_complete,
-					 mcast);
-	if (IS_ERR(mcast->mc)) {
-		ret = PTR_ERR(mcast->mc);
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		complete(&mcast->done);
-		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
-			   "failed (ret = %d)\n", ret);
-		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
-	} else {
-		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
-				"sendonly join\n", mcast->mcmember.mgid.raw);
-	}
-	mutex_unlock(&mcast_mutex);
-
-	return ret;
-}
-
 void ipoib_mcast_carrier_on_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
@@ -442,7 +335,9 @@ static int ipoib_mcast_join_complete(int status,
 	struct net_device *dev = mcast->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
+	ipoib_dbg_mcast(priv, "%sjoin completion for %pI6 (status %d)\n",
+			test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ?
+			"sendonly " : "",
 			mcast->mcmember.mgid.raw, status);
 
 	/*
@@ -467,27 +362,52 @@ static int ipoib_mcast_join_complete(int status,
 	if (!status) {
 		mcast->backoff = 1;
 		mcast->delay_until = jiffies;
-		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
-		 * deadlock on rtnl_lock here.
+		 * deadlock on rtnl_lock here.  Requeue our multicast
+		 * work too, which will end up happening right after
+		 * our carrier on task work and will allow us to
+		 * send out all of the non-broadcast joins
 		 */
-		if (mcast == priv->broadcast)
+		if (mcast == priv->broadcast) {
 			queue_work(priv->wq, &priv->carrier_on_task);
+			__ipoib_mcast_continue_join_thread(priv, NULL, 0);
+		}
 	} else {
 		if (mcast->logcount++ < 20) {
 			if (status == -ETIMEDOUT || status == -EAGAIN) {
-				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
+				ipoib_dbg_mcast(priv, "%smulticast join failed for %pI6, status %d\n",
+						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
 						mcast->mcmember.mgid.raw, status);
 			} else {
-				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
+				ipoib_warn(priv, "%smulticast join failed for %pI6, status %d\n",
+						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
 					   mcast->mcmember.mgid.raw, status);
 			}
 		}
 
-		/* Requeue this join task with a backoff delay */
-		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
+		    mcast->backoff >= 2) {
+			/*
+			 * We only retry sendonly joins once before we drop
+			 * the packet and quit trying to deal with the
+			 * group.  However, we leave the group in the
+			 * mcast list as an unjoined group.  If we want to
+			 * try joining again, we simply queue up a packet
+			 * and restart the join thread.  The empty queue
+			 * is why the join thread ignores this group.
+			 */
+			mcast->backoff = 1;
+			netif_tx_lock_bh(dev);
+			while (!skb_queue_empty(&mcast->pkt_queue)) {
+				++dev->stats.tx_dropped;
+				dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+			}
+			netif_tx_unlock_bh(dev);
+		} else
+			/* Requeue this join task with a backoff delay */
+			__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -626,6 +546,11 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
 			mcast = priv->broadcast;
 			create = 0;
+			if (mcast->backoff != 1 &&
+			    time_before_eq(jiffies, mcast->delay_until)) {
+				delay_until = mcast->delay_until;
+				mcast = NULL;
+			}
 		}
 		goto out;
 	}
@@ -636,44 +561,45 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	list_for_each_entry(mcast, &priv->multicast_list, list) {
 		if (IS_ERR_OR_NULL(mcast->mc) &&
 		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+		    (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ||
+		     !skb_queue_empty(&mcast->pkt_queue))) {
 			if (mcast->backoff == 1 ||
-			    time_after_eq(jiffies, mcast->delay_until))
+			    time_after_eq(jiffies, mcast->delay_until)) {
 				/* Found the next unjoined group */
-				break;
-			else if (!delay_until ||
+				init_completion(&mcast->done);
+				set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+				if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+					create = 0;
+				else
+					create = 1;
+				spin_unlock_irq(&priv->lock);
+				mutex_unlock(&mcast_mutex);
+				ipoib_mcast_join(dev, mcast, create);
+				mutex_lock(&mcast_mutex);
+				spin_lock_irq(&priv->lock);
+			} else if (!delay_until ||
 				 time_before(mcast->delay_until, delay_until))
 				delay_until = mcast->delay_until;
 		}
 	}
 
-	if (&mcast->list == &priv->multicast_list) {
-		/* All done, unless we have delayed work from
-		 * backoff retransmissions, but we will get
-		 * restarted when the time is right, so we are
-		 * done for now
-		 */
-		mcast = NULL;
-		ipoib_dbg_mcast(priv, "successfully joined all "
-				"multicast groups\n");
-		if (delay_until)
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   delay_until - jiffies);
-	}
+	mcast = NULL;
+	ipoib_dbg_mcast(priv, "successfully started all multicast joins\n");
 
 out:
+	if (delay_until) {
+		cancel_delayed_work(&priv->mcast_task);
+		queue_delayed_work(priv->wq, &priv->mcast_task,
+				   delay_until - jiffies);
+	}
 	if (mcast) {
 		init_completion(&mcast->done);
 		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	}
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
-	if (mcast) {
-		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-			ipoib_mcast_sendonly_join(mcast);
-		else
-			ipoib_mcast_join(dev, mcast, create);
-	}
+	if (mcast)
+		ipoib_mcast_join(dev, mcast, create);
 	return;
 }
 
@@ -717,8 +643,6 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 
 	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
-	else
-		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -754,50 +678,38 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 	}
 
 	mcast = __ipoib_mcast_find(dev, mgid);
-	if (!mcast) {
-		/* Let's create a new send only group now */
-		ipoib_dbg_mcast(priv, "setting up send only multicast group for %pI6\n",
-				mgid);
-
-		mcast = ipoib_mcast_alloc(dev, 0);
+	if (!mcast || !mcast->ah) {
 		if (!mcast) {
-			ipoib_warn(priv, "unable to allocate memory for "
-				   "multicast structure\n");
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-			goto out;
-		}
-
-		set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
-		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
-		__ipoib_mcast_add(dev, mcast);
-		list_add_tail(&mcast->list, &priv->multicast_list);
-		/*
-		 * Make sure that if we are currently waiting for a backoff
-		 * delay to expire, that we cancel that, run immediately,
-		 * and then requeue our task to fire when the backoff
-		 * timer is over.
-		 */
-		cancel_delayed_work(&priv->mcast_task);
-		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
-	}
+			/* Let's create a new send only group now */
+			ipoib_dbg_mcast(priv, "setting up send only multicast group for %pI6\n",
+					mgid);
+
+			mcast = ipoib_mcast_alloc(dev, 0);
+			if (!mcast) {
+				ipoib_warn(priv, "unable to allocate memory "
+					   "for multicast structure\n");
+				++dev->stats.tx_dropped;
+				dev_kfree_skb_any(skb);
+				goto unlock;
+			}
 
-	if (!mcast->ah) {
+			set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
+			memcpy(mcast->mcmember.mgid.raw, mgid,
+			       sizeof (union ib_gid));
+			__ipoib_mcast_add(dev, mcast);
+			list_add_tail(&mcast->list, &priv->multicast_list);
+		}
 		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
 			skb_queue_tail(&mcast->pkt_queue, skb);
 		else {
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
-		/*
-		 * If lookup completes between here and out:, don't
-		 * want to send packet twice.
-		 */
-		mcast = NULL;
-	}
-
-out:
-	if (mcast && mcast->ah) {
+		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
+			cancel_delayed_work(&priv->mcast_task);
+			__ipoib_mcast_continue_join_thread(priv, NULL, 0);
+		}
+	} else {
 		struct ipoib_neigh *neigh;
 
 		spin_unlock_irqrestore(&priv->lock, flags);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 22/22] IB/ipoib: drop mcast_mutex usage
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (20 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 21/22] IB/ipoib: deserialize multicast joins Doug Ledford
@ 2015-02-12  1:43   ` Doug Ledford
  2015-02-12  4:46   ` [PATCH 00/22] IB/ipoib: Fixups for multicast issues Or Gerlitz
  22 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  1:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

We needed the mcast_mutex when we had to protect two different workqueue
threads against stomping on each others work.  By no longer using
mcast->mc to directly store the return value of ib_sa_join_multicast, we
can switch all of the mcast flag/completion locking to being only the
priv->lock spinlock.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 55 +++++++++-----------------
 1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 355c320dc4d..e1fab519f96 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level,
 		 "Enable multicast debug tracing if > 0");
 #endif
 
-static DEFINE_MUTEX(mcast_mutex);
-
 struct ipoib_mcast_iter {
 	struct net_device *dev;
 	union ib_gid       mgid;
@@ -340,16 +338,6 @@ static int ipoib_mcast_join_complete(int status,
 			"sendonly " : "",
 			mcast->mcmember.mgid.raw, status);
 
-	/*
-	 * We have to take the mutex to force mcast_join to
-	 * return from ib_sa_multicast_join and set mcast->mc to a
-	 * valid value.  Otherwise we were racing with ourselves in
-	 * that we might fail here, but get a valid return from
-	 * ib_sa_multicast_join after we had cleared mcast->mc here,
-	 * resulting in mis-matched joins and leaves and a deadlock
-	 */
-	mutex_lock(&mcast_mutex);
-
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET) {
 		status = 0;
@@ -410,11 +398,14 @@ static int ipoib_mcast_join_complete(int status,
 			__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
+	spin_lock_irq(&priv->lock);
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
+	else
+		mcast->mc = multicast;
 	complete(&mcast->done);
-	mutex_unlock(&mcast_mutex);
+	spin_unlock_irq(&priv->lock);
 
 	return status;
 }
@@ -423,6 +414,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 			     int create)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sa_multicast *multicast;
 	struct ib_sa_mcmember_rec rec = {
 		.join_state = 1
 	};
@@ -464,19 +456,20 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
 	}
 
-	mutex_lock(&mcast_mutex);
-	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
+	multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
-	if (IS_ERR(mcast->mc)) {
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		complete(&mcast->done);
-		ret = PTR_ERR(mcast->mc);
+	if (IS_ERR(multicast)) {
+		ret = PTR_ERR(multicast);
+		mcast->mc = NULL;
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
+		spin_lock_irq(&priv->lock);
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 		/* Requeue this join task with a backoff delay */
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
+		complete(&mcast->done);
+		spin_unlock_irq(&priv->lock);
 	}
-	mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -505,15 +498,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
-	/*
-	 * We have to hold the mutex to keep from racing with the join
-	 * completion threads on setting flags on mcasts, and we have
-	 * to hold the priv->lock because dev_flush will remove entries
-	 * out from underneath us, so at a minimum we need the lock
-	 * through the time that we do the for_each loop of the mcast
-	 * list or else dev_flush can make us oops.
-	 */
-	mutex_lock(&mcast_mutex);
 	spin_lock_irq(&priv->lock);
 	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 		goto out;
@@ -573,9 +557,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 				else
 					create = 1;
 				spin_unlock_irq(&priv->lock);
-				mutex_unlock(&mcast_mutex);
 				ipoib_mcast_join(dev, mcast, create);
-				mutex_lock(&mcast_mutex);
 				spin_lock_irq(&priv->lock);
 			} else if (!delay_until ||
 				 time_before(mcast->delay_until, delay_until))
@@ -597,7 +579,6 @@ out:
 		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	}
 	spin_unlock_irq(&priv->lock);
-	mutex_unlock(&mcast_mutex);
 	if (mcast)
 		ipoib_mcast_join(dev, mcast, create);
 	return;
@@ -606,13 +587,14 @@ out:
 int ipoib_mcast_start_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long flags;
 
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
-	mutex_lock(&mcast_mutex);
+	spin_lock_irqsave(&priv->lock, flags);
 	set_bit(IPOIB_MCAST_RUN, &priv->flags);
 	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
-	mutex_unlock(&mcast_mutex);
+	spin_unlock_irqrestore(&priv->lock, flags);
 
 	return 0;
 }
@@ -620,13 +602,14 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 int ipoib_mcast_stop_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long flags;
 
 	ipoib_dbg_mcast(priv, "stopping multicast thread\n");
 
-	mutex_lock(&mcast_mutex);
+	spin_lock_irqsave(&priv->lock, flags);
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 	cancel_delayed_work(&priv->mcast_task);
-	mutex_unlock(&mcast_mutex);
+	spin_unlock_irqrestore(&priv->lock, flags);
 
 	flush_workqueue(priv->wq);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
       [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (21 preceding siblings ...)
  2015-02-12  1:43   ` [PATCH 22/22] IB/ipoib: drop mcast_mutex usage Doug Ledford
@ 2015-02-12  4:46   ` Or Gerlitz
       [not found]     ` <54DC303C.2020207-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  22 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2015-02-12  4:46 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit

On 2/11/2015 8:43 PM, Doug Ledford wrote:
> This patchset revives the 8 patches that were reverted from 3.19,
> the 11 patches that fixed the problems with the first 8, the single
> patch that was related to reaping of ah's and failure to dealloc
> resources on shutdown, and then adds two new patches that would
> have been enhancements and not bugfixes and hence weren't appropriate
> to post in the 3.19 tussle.
Doug, as asked few times... please provide public git tree where this 
series can be pulled from. So no squashes for patches which actually 
represent a fix to a downstream patch within the series? any specific 
reason for that?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
       [not found]     ` <54DC303C.2020207-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-02-12  4:55       ` Doug Ledford
       [not found]         ` <1423716906.3424.74.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12  4:55 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Wed, 2015-02-11 at 23:46 -0500, Or Gerlitz wrote:
> On 2/11/2015 8:43 PM, Doug Ledford wrote:
> > This patchset revives the 8 patches that were reverted from 3.19,
> > the 11 patches that fixed the problems with the first 8, the single
> > patch that was related to reaping of ah's and failure to dealloc
> > resources on shutdown, and then adds two new patches that would
> > have been enhancements and not bugfixes and hence weren't appropriate
> > to post in the 3.19 tussle.
> Doug, as asked few times... please provide public git tree where this 
> series can be pulled from.

git://github.com/dledford/linux.git for-3.20

>  So no squashes for patches which actually 
> represent a fix to a downstream patch within the series? any specific 
> reason for that?

I tried, but the most appropriate squashes were things like patch 18 and
19 squashed into patch 3, and some other similar jumping around of
patches.  The number of conflicts that created in both the original
squash and the follow on patches was perverse.  So, since the original
20 patches as I submitted previously are what has gone through our QA, I
left them as they were.  If I fixed them up and squashed them, and if
some overlooked fixup failure snuck in there, it would invalidate all of
that testing.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
       [not found]         ` <1423716906.3424.74.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-12  5:25           ` Or Gerlitz
       [not found]             ` <54DC3934.3010401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2015-02-12  5:25 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 2/11/2015 11:55 PM, Doug Ledford wrote:
> I tried, but the most appropriate squashes were things like patch 18 and
> 19 squashed into patch 3, and some other similar jumping around of
> patches.  The number of conflicts that created in both the original
> squash and the follow on patches was perverse.  So, since the original
> 20 patches as I submitted previously are what has gone through our QA, I
> left them as they were.  If I fixed them up and squashed them, and if
> some overlooked fixup failure snuck in there, it would invalidate all of
> that testing.

I see. Does this sequence bisect-able? namely the driver builds and 
functions after applying patch by patch? note this is hard requirement 
when modifying existing upstream drivers.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
       [not found]             ` <54DC3934.3010401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-02-12 14:02               ` Doug Ledford
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-12 14:02 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

On Thu, 2015-02-12 at 00:25 -0500, Or Gerlitz wrote:
> On 2/11/2015 11:55 PM, Doug Ledford wrote:
> > I tried, but the most appropriate squashes were things like patch 18 and
> > 19 squashed into patch 3, and some other similar jumping around of
> > patches.  The number of conflicts that created in both the original
> > squash and the follow on patches was perverse.  So, since the original
> > 20 patches as I submitted previously are what has gone through our QA, I
> > left them as they were.  If I fixed them up and squashed them, and if
> > some overlooked fixup failure snuck in there, it would invalidate all of
> > that testing.
> 
> I see. Does this sequence bisect-able? namely the driver builds and 
> functions after applying patch by patch? note this is hard requirement 
> when modifying existing upstream drivers.

Yes.  There might be some oddities in the behavior of the running IPoIB
driver specifically when testing the mcast abilities in the middle of
this patch set, but they will all build and function at a basic level.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 20/22] IB/ipoib: don't queue a work struct up twice
       [not found]     ` <30a5bd6461381448c52af0d7408dbc14da9ac4d0.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-12 18:33       ` Or Gerlitz
       [not found]         ` <54DCF20C.3040704-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2015-02-12 18:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 02/12/2015 03:43 AM, Doug Ledford wrote:
> In b2d408f78b3 (IB/ipoib: make delayed tasks not hold up everything) I
> added the ability to have some joins processed around delayed joins.
> Because the join task processes one join at a time, we needed to queue
> the task to run again immediately and then again at the appropriate
> delay time.  I didn't think about the fact that our work struct can only
> be on the workqueue once at a given time and instead tried to queue the
> same struct up twice.  Instead, kick the queue again immediately on a
> delayed restart attempt, when the task sees that it has nothing to do
> but delayed work, it will pick the shortest work item and queue a delay
> equal to its expiration.  Finally, in case we have delayed work, in the
> send_mcast routine we need to cancel any delayed work before kicking the
> thread immediately.
This is strictly a fix to a downstream patch of the series, lets squash 
it there somehow, I see no point to submit
commit that has a bug and later a fix in the same series.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown
       [not found]     ` <14ee2e7737123c7d37465ac52be5c026240c9985.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-12 18:35       ` Or Gerlitz
  0 siblings, 0 replies; 36+ messages in thread
From: Or Gerlitz @ 2015-02-12 18:35 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit


On 02/12/2015 03:43 AM, Doug Ledford wrote:
> The introduction of garbage collection for neighbors in the ipoib stack
> caused a leak of resources.  In particular, an ah can have a refcount
> from a mcast struct, from a neigh struct, and from a path struct.  When
> we put an ah ref, the ah gets moved to a dead list to be reaped later if
> the refcount goes to 0.
see my reply on this patch in it's singleton version which you sent 
couple of weeks ago
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 20/22] IB/ipoib: don't queue a work struct up twice
       [not found]         ` <54DCF20C.3040704-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-02-12 19:47           ` Doug Ledford
       [not found]             ` <1423770436.3387.4.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Doug Ledford @ 2015-02-12 19:47 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 2359 bytes --]

On Thu, 2015-02-12 at 20:33 +0200, Or Gerlitz wrote:
> On 02/12/2015 03:43 AM, Doug Ledford wrote:
> > In b2d408f78b3 (IB/ipoib: make delayed tasks not hold up everything) I
> > added the ability to have some joins processed around delayed joins.
> > Because the join task processes one join at a time, we needed to queue
> > the task to run again immediately and then again at the appropriate
> > delay time.  I didn't think about the fact that our work struct can only
> > be on the workqueue once at a given time and instead tried to queue the
> > same struct up twice.  Instead, kick the queue again immediately on a
> > delayed restart attempt, when the task sees that it has nothing to do
> > but delayed work, it will pick the shortest work item and queue a delay
> > equal to its expiration.  Finally, in case we have delayed work, in the
> > send_mcast routine we need to cancel any delayed work before kicking the
> > thread immediately.
> This is strictly a fix to a downstream patch of the series, lets squash 
> it there somehow, I see no point to submit
> commit that has a bug and later a fix in the same series.

A number of them *are* strictly fixes.  That's a given.  But both you
and Erez emailed me and wanted this patch set sooner rather than later
because Erez in particular had some other things he wanted to send
through and wanted them to be based upon this patchset so they would
apply cleanly.  When I went to squash these things, it became clear very
quickly that I would end up rewriting significant portions of this work,
and that I would have to retest all but the first few patches one patch
at a time.  And I'm actually OK doing that, but it wasn't possible to
meet both requirements at the same time.  So, I'll keep a new branch and
this branch, and I'll reorder things (like the singleton fix you brought
up in another email being first) and squash things, and when I get done,
I'll do a diff between the two branches to make sure that there are no
logical differences.  That should avoid invalidating all of the
testing/QE work that has already been done on this patchset.  In the
meantime, Erez can work off of these patches knowing the end result will
be the same either way.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 20/22] IB/ipoib: don't queue a work struct up twice
       [not found]             ` <1423770436.3387.4.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-12 21:35               ` Or Gerlitz
  0 siblings, 0 replies; 36+ messages in thread
From: Or Gerlitz @ 2015-02-12 21:35 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 02/12/2015 09:47 PM, Doug Ledford wrote:
> So, I'll keep a new branch and this branch, and I'll reorder things (like the singleton fix you brought
> up in another email being first) and squash things, and when I get done,
> I'll do a diff between the two branches to make sure that there are no
> logical differences.  That should avoid invalidating all of the
> testing/QE work that has already been done on this patchset.

yes, lets do that, for example, patches 1-21 touch 56 times the 
mcast_mutex and patch #22 removes that mutex all together, this is 
nightmare for future bisection and maintenance. Please post the 
revised/shorter series once you have it to the list and specify the 
branch too.

> In the meantime, Erez can work off of these patches knowing the end result will be the same either way.
We are almost one week into the 3.20 merge window and our WW (Sun-Thu) 
is pretty much done. Erez will try to give some post-WW feedback but 
next week is on a well-pre-planned OOO -- I'm really not sure what 
should we do here. If you have the squashed patches by the end of this 
week, and our regression system runs them early next week maybe we can 
ack them.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage
       [not found]     ` <73f38862d167a9849482464519c04b7c1f0a8b7c.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-13  8:50       ` Erez Shitrit
       [not found]         ` <54DDBAB8.10002-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Erez Shitrit @ 2015-02-13  8:50 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit

On 2/12/2015 3:43 AM, Doug Ledford wrote:
> The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
> it is used to mean "our device is administratively allowed to send
> multicast joins/leaves/packets" and in other places it means "our
> multicast join task thread is currently running and will process your
> request if you put it on the queue".  However, this latter meaning is in
> fact flawed as there is a race condition between the join task testing
> the mcast list and finding it empty of remaining work, dropping the
> mcast mutex and also the priv->lock spinlock, and clearing the
> IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
> the flag in the former fashion, and when all tasks complete and the task
> thread clears the RUN flag, all of those other locations will fail to
> ever again queue any work.  This results in the interface coming up fine
> initially, but having problems adding new multicast groups after the
> first round of groups have all been added and the RUN flag is cleared by
> the join task thread when it thinks it is done.  To resolve this issue,
> convert all locations in the code to treat the RUN flag as an indicator
> that the multicast portion of this interface is in fact administratively
> up and joins/leaves/sends can be performed.  There is no harm (other
> than a slight performance penalty) to never clearing this flag and using
> it in this fashion as it simply means that a few places that used to
> micro-optimize how often this task was queued on a work queue will now
> queue the task a few extra times.  We can address that suboptimal
> behavior in future patches.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
>   1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index bc50dd0d0e4..91b8fe118ec 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
>   	}
>   
>   	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
> -
> -	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
>   }
>   
>   int ipoib_mcast_start_thread(struct net_device *dev)
> @@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
>   	ipoib_dbg_mcast(priv, "starting multicast thread\n");
>   
>   	mutex_lock(&mcast_mutex);
> -	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> +	set_bit(IPOIB_MCAST_RUN, &priv->flags);

Hi Doug,
Can you use IPOIB_FLAG_OPER_UP instead of IPOIB_MCAST_RUN, in all these 
places and get rid of that RUN bit?

> +	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>   	mutex_unlock(&mcast_mutex);
>   
>   	return 0;
> @@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>   		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
>   		__ipoib_mcast_add(dev, mcast);
>   		list_add_tail(&mcast->list, &priv->multicast_list);
> -		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> +		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
>   			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>   	}
>   
> @@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>   	/*
>   	 * Restart our join task if needed
>   	 */
> -	ipoib_mcast_start_thread(dev);
> +	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> +		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>   	rtnl_unlock();
>   }
>   

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage
       [not found]         ` <54DDBAB8.10002-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-02-13 13:30           ` Doug Ledford
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-13 13:30 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 3233 bytes --]

On Fri, 2015-02-13 at 10:50 +0200, Erez Shitrit wrote:
> On 2/12/2015 3:43 AM, Doug Ledford wrote:
> > The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
> > it is used to mean "our device is administratively allowed to send
> > multicast joins/leaves/packets" and in other places it means "our
> > multicast join task thread is currently running and will process your
> > request if you put it on the queue".  However, this latter meaning is in
> > fact flawed as there is a race condition between the join task testing
> > the mcast list and finding it empty of remaining work, dropping the
> > mcast mutex and also the priv->lock spinlock, and clearing the
> > IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
> > the flag in the former fashion, and when all tasks complete and the task
> > thread clears the RUN flag, all of those other locations will fail to
> > ever again queue any work.  This results in the interface coming up fine
> > initially, but having problems adding new multicast groups after the
> > first round of groups have all been added and the RUN flag is cleared by
> > the join task thread when it thinks it is done.  To resolve this issue,
> > convert all locations in the code to treat the RUN flag as an indicator
> > that the multicast portion of this interface is in fact administratively
> > up and joins/leaves/sends can be performed.  There is no harm (other
> > than a slight performance penalty) to never clearing this flag and using
> > it in this fashion as it simply means that a few places that used to
> > micro-optimize how often this task was queued on a work queue will now
> > queue the task a few extra times.  We can address that suboptimal
> > behavior in future patches.
> >
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
> >   1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index bc50dd0d0e4..91b8fe118ec 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
> >   	}
> >   
> >   	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
> > -
> > -	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
> >   }
> >   
> >   int ipoib_mcast_start_thread(struct net_device *dev)
> > @@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
> >   	ipoib_dbg_mcast(priv, "starting multicast thread\n");
> >   
> >   	mutex_lock(&mcast_mutex);
> > -	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> > -		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> > +	set_bit(IPOIB_MCAST_RUN, &priv->flags);
> 
> Hi Doug,
> Can you use IPOIB_FLAG_OPER_UP instead of IPOIB_MCAST_RUN, in all these 
> places and get rid of that RUN bit?

Yes, I think that should be easily doable.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 18/22] IB/ipoib: cleanup a couple debug messages
       [not found]     ` <7aeffef3862526da5a472c15f94564897a4d7537.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-13 16:40       ` Or Gerlitz
       [not found]         ` <CAJ3xEMi2yvaL6QGPFXDcJavnnw4Lxk64bdkZnm--FE8NiKQmbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2015-02-13 16:40 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Roland Dreier,
	Erez Shitrit

On Thu, Feb 12, 2015 at 3:43 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> The one I removed in particular can prevent progress in the driver from
> happening because it isn't rate limited.

So isn't the right thing to do would be to rate-limit that?! note you
need not solve the whole world (ipoib...)
problems for 3.20-rc1, this can be 3.20-rc fix


In my testing, I have seen the
> flood of messages from that debug message delay multicast joins for in
> excess of 30 seconds.
>
> The added debug message was just so I would know when we were leaving a
> multicast group that we didn't fully join.

so squash them to downstream patches of the series?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 14/22] IB/ipoib: remove unneeded locks
       [not found]     ` <3cd3c664adb2877317c8f684ee344749b2915e45.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-13 16:59       ` Or Gerlitz
  0 siblings, 0 replies; 36+ messages in thread
From: Or Gerlitz @ 2015-02-13 16:59 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit

On 02/12/2015 03:43 AM, Doug Ledford wrote:
> During the various work I've done on this, some extra locking has crept
> in when things weren't working right.  This is one of those spots.
> Remove the unneeded spinlocks.

we need a more concrete/precise text here explaining why these locking 
spots can be removed. What you wrote is too high level and can not 
really be used in future debugging/bisection and maintenance, sorry.

> The mutex is enough to protect against what we need to protect against.

but this is the very same mutex you are removing in a downstream patch 
of the series, isn't that?!

Or.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 18/22] IB/ipoib: cleanup a couple debug messages
       [not found]         ` <CAJ3xEMi2yvaL6QGPFXDcJavnnw4Lxk64bdkZnm--FE8NiKQmbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-02-13 17:21           ` Doug Ledford
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Ledford @ 2015-02-13 17:21 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Roland Dreier,
	Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1200 bytes --]

On Fri, 2015-02-13 at 18:40 +0200, Or Gerlitz wrote:
> On Thu, Feb 12, 2015 at 3:43 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > The one I removed in particular can prevent progress in the driver from
> > happening because it isn't rate limited.
> 
> So isn't the right thing to do would be to rate-limit that?! note you
> need not solve the whole world (ipoib...)
> problems for 3.20-rc1, this can be 3.20-rc fix

I'll let you know when I get to that patch.  I'm still squashing.

> 
> In my testing, I have seen the
> > flood of messages from that debug message delay multicast joins for in
> > excess of 30 seconds.
> >
> > The added debug message was just so I would know when we were leaving a
> > multicast group that we didn't fully join.
> 
> so squash them to downstream patches of the series?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2015-02-13 17:21 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-12  1:43 [PATCH 00/22] IB/ipoib: Fixups for multicast issues Doug Ledford
     [not found] ` <cover.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-12  1:43   ` [PATCH 01/22] IB/ipoib: Consolidate rtnl_lock tasks in workqueue Doug Ledford
2015-02-12  1:43   ` [PATCH 02/22] IB/ipoib: Make the carrier_on_task race aware Doug Ledford
2015-02-12  1:43   ` [PATCH 03/22] IB/ipoib: fix MCAST_FLAG_BUSY usage Doug Ledford
2015-02-12  1:43   ` [PATCH 04/22] IB/ipoib: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
2015-02-12  1:43   ` [PATCH 05/22] IB/ipoib: change init sequence ordering Doug Ledford
2015-02-12  1:43   ` [PATCH 06/22] IB/ipoib: Use dedicated workqueues per interface Doug Ledford
2015-02-12  1:43   ` [PATCH 07/22] IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
2015-02-12  1:43   ` [PATCH 08/22] IB/ipoib: No longer use flush as a parameter Doug Ledford
2015-02-12  1:43   ` [PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage Doug Ledford
     [not found]     ` <73f38862d167a9849482464519c04b7c1f0a8b7c.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-13  8:50       ` Erez Shitrit
     [not found]         ` <54DDBAB8.10002-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-02-13 13:30           ` Doug Ledford
2015-02-12  1:43   ` [PATCH 10/22] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
2015-02-12  1:43   ` [PATCH 11/22] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
2015-02-12  1:43   ` [PATCH 12/22] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
2015-02-12  1:43   ` [PATCH 13/22] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
2015-02-12  1:43   ` [PATCH 14/22] IB/ipoib: remove unneeded locks Doug Ledford
     [not found]     ` <3cd3c664adb2877317c8f684ee344749b2915e45.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-13 16:59       ` Or Gerlitz
2015-02-12  1:43   ` [PATCH 15/22] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
2015-02-12  1:43   ` [PATCH 16/22] IB/ipoib: fix ipoib_mcast_restart_task Doug Ledford
2015-02-12  1:43   ` [PATCH 17/22] IB/ipoib: flush the ipoib_workqueue on unregister Doug Ledford
2015-02-12  1:43   ` [PATCH 18/22] IB/ipoib: cleanup a couple debug messages Doug Ledford
     [not found]     ` <7aeffef3862526da5a472c15f94564897a4d7537.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-13 16:40       ` Or Gerlitz
     [not found]         ` <CAJ3xEMi2yvaL6QGPFXDcJavnnw4Lxk64bdkZnm--FE8NiKQmbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-02-13 17:21           ` Doug Ledford
2015-02-12  1:43   ` [PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown Doug Ledford
     [not found]     ` <14ee2e7737123c7d37465ac52be5c026240c9985.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-12 18:35       ` Or Gerlitz
2015-02-12  1:43   ` [PATCH 20/22] IB/ipoib: don't queue a work struct up twice Doug Ledford
     [not found]     ` <30a5bd6461381448c52af0d7408dbc14da9ac4d0.1423703861.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-12 18:33       ` Or Gerlitz
     [not found]         ` <54DCF20C.3040704-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-02-12 19:47           ` Doug Ledford
     [not found]             ` <1423770436.3387.4.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-12 21:35               ` Or Gerlitz
2015-02-12  1:43   ` [PATCH 21/22] IB/ipoib: deserialize multicast joins Doug Ledford
2015-02-12  1:43   ` [PATCH 22/22] IB/ipoib: drop mcast_mutex usage Doug Ledford
2015-02-12  4:46   ` [PATCH 00/22] IB/ipoib: Fixups for multicast issues Or Gerlitz
     [not found]     ` <54DC303C.2020207-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-02-12  4:55       ` Doug Ledford
     [not found]         ` <1423716906.3424.74.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-12  5:25           ` Or Gerlitz
     [not found]             ` <54DC3934.3010401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-02-12 14:02               ` Doug Ledford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox