[PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions

public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
@ 2015-01-22 14:31 Doug Ledford
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

My 8 patch set taken into 3.19 caused some regressions.  This patch
set resolves those issues.

These patches are to resolve issues created by my previous patch set.
While that set worked fine in my testing, there were problems with
multicast joins after the initial set of joins had completed.  Since my
testing relied upon the normal set of multicast joins that happen
when the interface is first brought up, I missed those problems.

Symptoms vary from failure to send packets due to a failed join, to
loss of connectivity after a subnet manager restart, to failure
to properly release multicast groups on shutdown resulting in hangs
when the mlx4 driver attempts to unload itself via its reboot
notifier handler.

This set of patches has passed a number of tests above and beyond my
original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
multicast tests.  I also added both subnet manager restarts and
manual shutdown/restart of individual ports at the switch in order to
ensure that the ENETRESET path was properly tested.  I included
testing, then a subnet manager restart, then a quiescent period for
caches to expire, then restarting testing to make sure that arp and
neighbor discovery work after the subnet manager restart.

All in all, I have not been able to trip the multicast joins up any
longer.

Additionally, the original impetus for my first 8 patch set was that
it was simply too easy to break the IPoIB subsystem with this simple
loop:

while true; do
    ifconfig ib0 up
    ifconfig ib0 down
done

Just to be safe, I made sure this problem did not resurface.

v5: fix an oversight in mcast_restart_task that leaked mcast joins
    fix a failure to flush the ipoib_workqueue on deregister that
    meant we could end up running our code after our device had been
    removed, resulting in an oops
    remove a debug message that could be trigger so fast that the
    kernel printk mechanism would starve out the mcast join task thread
    resulting in what looked like a mcast failure that was really just
    delayed action

Doug Ledford (10):
  IB/ipoib: fix IPOIB_MCAST_RUN flag usage
  IB/ipoib: Add a helper to restart the multicast task
  IB/ipoib: make delayed tasks not hold up everything
  IB/ipoib: Handle -ENETRESET properly in our callback
  IB/ipoib: don't restart our thread on ENETRESET
  IB/ipoib: remove unneeded locks
  IB/ipoib: fix race between mcast_dev_flush and mcast_join
  IB/ipoib: fix ipoib_mcast_restart_task
  IB/ipoib: flush the ipoib_workqueue on unregister
  IB/ipoib: cleanup a couple debug messages

 drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
 3 files changed, 131 insertions(+), 106 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 01/10] IB/ipoib: fix IPOIB_MCAST_RUN flag usage
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 02/10] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
it is used to mean "our device is administratively allowed to send
multicast joins/leaves/packets" and in other places it means "our
multicast join task thread is currently running and will process your
request if you put it on the queue".  However, this latter meaning is in
fact flawed as there is a race condition between the join task testing
the mcast list and finding it empty of remaining work, dropping the
mcast mutex and also the priv->lock spinlock, and clearing the
IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
the flag in the former fashion, and when all tasks complete and the task
thread clears the RUN flag, all of those other locations will fail to
ever again queue any work.  This results in the interface coming up fine
initially, but having problems adding new multicast groups after the
first round of groups have all been added and the RUN flag is cleared by
the join task thread when it thinks it is done.  To resolve this issue,
convert all locations in the code to treat the RUN flag as an indicator
that the multicast portion of this interface is in fact administratively
up and joins/leaves/sends can be performed.  There is no harm (other
than a slight performance penalty) to never clearing this flag and using
it in this fashion as it simply means that a few places that used to
micro-optimize how often this task was queued on a work queue will now
queue the task a few extra times.  We can address that suboptimal
behavior in future patches.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bc50dd0d0e4..91b8fe118ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}

 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }

 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");

 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	set_bit(IPOIB_MCAST_RUN, &priv->flags);
+	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);

 	return 0;
@@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}

@@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task if needed
 	 */
-	ipoib_mcast_start_thread(dev);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	rtnl_unlock();
 }

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 02/10] IB/ipoib: Add a helper to restart the multicast task
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 01/10] IB/ipoib: fix IPOIB_MCAST_RUN flag usage Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 03/10] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

Add a simple helper to do the right thing when restarting
the multicast thread.  Call that helper from all the places
that need to restart the thread.  Add two places that didn't
restart the thread, but should have.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 46 ++++++++++++++++----------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 91b8fe118ec..cb1e495bd74 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -66,6 +66,13 @@ struct ipoib_mcast_iter {
 	unsigned int       send_only;
 };
 
+static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       int delay)
+{
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+}
+
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
@@ -270,6 +277,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/*
 	 * We have to take the mutex to force mcast_sendonly_join to
@@ -309,6 +317,7 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
+	__ipoib_mcast_continue_join_thread(priv, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
@@ -360,6 +369,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		complete(&mcast->done);
 		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
 			   "failed (ret = %d)\n", ret);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -431,8 +441,7 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -454,6 +463,13 @@ static int ipoib_mcast_join_complete(int status,
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+		/*
+		 * XXX - This is wrong.  *Our* join failed, but because the
+		 * join thread does the joins in a serial fashion, if there
+		 * are any joins behind ours waiting to complete, they should
+		 * not be subjected to our backoff delay.
+		 */
+		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
 	}
 out:
 	spin_lock_irq(&priv->lock);
@@ -463,9 +479,6 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
-	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task,
-				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
@@ -532,10 +545,13 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
+		/*
+		 * XXX - This is wrong.  *Our* join failed, but because the
+		 * join thread does the joins in a serial fashion, if there
+		 * are any joins behind ours waiting to complete, they should
+		 * not be subjected to our backoff delay.
+		 */
+		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
 	}
 	mutex_unlock(&mcast_mutex);
 }
@@ -573,9 +589,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
-			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(priv->wq, &priv->mcast_task,
-						   HZ);
+			__ipoib_mcast_continue_join_thread(priv, HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -723,8 +737,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 	}
 
 	if (!mcast->ah) {
@@ -947,10 +960,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		ipoib_mcast_free(mcast);
 	}
 	/*
-	 * Restart our join task if needed
+	 * Restart our join task thread if needed
 	 */
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	__ipoib_mcast_continue_join_thread(priv, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 03/10] IB/ipoib: make delayed tasks not hold up everything
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 01/10] IB/ipoib: fix IPOIB_MCAST_RUN flag usage Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 02/10] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 04/10] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
                     ` (9 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

Currently, the multicast join thread only processes one task at a
time.  It does this by having itself be scheduled as a delayed work
item, when it runs, it finds one, and only one, piece of work to
process, it then kicks that off via either the normal join process
or the sendonly join process, and then it immediately exits.  Both
of those process chains are responsible for restarting the task
when they have completed their specific action.  This makes the entire
join process serial with only one join supposedly ever outstanding at
a time.

However, if we fail a join, and we need to initiate a backoff delay,
that delay holds up the entire join process for all joins, not just
the failed join.  So modify the design such that we can have joins
in delay, and the multicast thread will ignore them until their time
comes around, and then we can process the rest of the queue without
waiting for the delayed items.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 86 +++++++++++++++++---------
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 8ba80a6d3a4..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -154,6 +154,7 @@ struct ipoib_mcast {
 
 	unsigned long created;
 	unsigned long backoff;
+	unsigned long delay_until;
 
 	unsigned long flags;
 	unsigned char logcount;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index cb1e495bd74..957e7d2e80c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -67,10 +67,34 @@ struct ipoib_mcast_iter {
 };
 
 static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       struct ipoib_mcast *mcast,
 					       int delay)
 {
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
+		/*
+		 * Mark this mcast for its delay and set a timer to kick the
+		 * thread when the delay completes
+		 */
+		if (mcast && delay) {
+			mcast->backoff *= 2;
+			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+			mcast->delay_until = jiffies + (mcast->backoff * HZ);
+			queue_delayed_work(priv->wq, &priv->mcast_task,
+					   mcast->backoff * HZ);
+		} else if (delay) {
+			/* Special case of retrying after a failure to
+			 * allocate the broadcast multicast group, wait
+			 * 1 second and try again
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
+		}
+		/*
+		 * But also rekick the thread immediately for any other
+		 * tasks in queue behind this one
+		 */
 		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+	}
 }
 
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
@@ -110,6 +134,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev,
 
 	mcast->dev = dev;
 	mcast->created = jiffies;
+	mcast->delay_until = jiffies;
 	mcast->backoff = 1;
 
 	INIT_LIST_HEAD(&mcast->list);
@@ -309,6 +334,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+	} else {
+		/* Join completed, so reset any backoff parameters */
+		mcast->backoff = 1;
+		mcast->delay_until = jiffies;
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -317,7 +346,7 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
-	__ipoib_mcast_continue_join_thread(priv, 0);
+	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
@@ -369,7 +398,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		complete(&mcast->done);
 		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
 			   "failed (ret = %d)\n", ret);
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -441,7 +470,8 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -460,16 +490,8 @@ static int ipoib_mcast_join_complete(int status,
 			}
 		}
 
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-		/*
-		 * XXX - This is wrong.  *Our* join failed, but because the
-		 * join thread does the joins in a serial fashion, if there
-		 * are any joins behind ours waiting to complete, they should
-		 * not be subjected to our backoff delay.
-		 */
-		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
 	spin_lock_irq(&priv->lock);
@@ -541,17 +563,8 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		complete(&mcast->done);
 		ret = PTR_ERR(mcast->mc);
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
-
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-		/*
-		 * XXX - This is wrong.  *Our* join failed, but because the
-		 * join thread does the joins in a serial fashion, if there
-		 * are any joins behind ours waiting to complete, they should
-		 * not be subjected to our backoff delay.
-		 */
-		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 	mutex_unlock(&mcast_mutex);
 }
@@ -589,7 +602,13 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
-			__ipoib_mcast_continue_join_thread(priv, HZ);
+			/*
+			 * Restart us after a 1 second delay to retry
+			 * creating our broadcast group and attaching to
+			 * it.  Until this succeeds, this ipoib dev is
+			 * completely stalled (multicast wise).
+			 */
+			__ipoib_mcast_continue_join_thread(priv, NULL, 1);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -623,7 +642,10 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		list_for_each_entry(mcast, &priv->multicast_list, list) {
 			if (IS_ERR_OR_NULL(mcast->mc) &&
 			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED,
+				      &mcast->flags) &&
+			    (mcast->backoff == 1 ||
+			     time_after_eq(jiffies, mcast->delay_until))) {
 				/* Found the next unjoined group */
 				break;
 			}
@@ -632,7 +654,11 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		mutex_unlock(&mcast_mutex);
 
 		if (&mcast->list == &priv->multicast_list) {
-			/* All done */
+			/* All done, unless we have delayed work from
+			 * backoff retransmissions, but we will get
+			 * restarted when the time is right, so we are
+			 * done for now
+			 */
 			break;
 		}
 
@@ -737,7 +763,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 
 	if (!mcast->ah) {
@@ -962,7 +988,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task thread if needed
 	 */
-	__ipoib_mcast_continue_join_thread(priv, 0);
+	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 04/10] IB/ipoib: Handle -ENETRESET properly in our callback
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 03/10] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 05/10] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

When the core layer calls our callback with ENETRESET as the error, we
clear the status to 0 before returning indicating that we are going to
handle the error ourselves.  This causes the core layer to not free the
mcast->mc structure, but we are releasing our reference to it by
clearing mcast->mc.  So, preserve our reference to the multicast
structure so when we next run ipoib_mcast_dev_flush, it will be able
to properly release the mcast->mc entry at the right time in the right
way.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 957e7d2e80c..50d2de2270a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -315,8 +315,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -344,8 +346,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
@@ -462,8 +462,10 @@ static int ipoib_mcast_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -499,8 +501,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 05/10] IB/ipoib: don't restart our thread on ENETRESET
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 04/10] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 06/10] IB/ipoib: remove unneeded locks Doug Ledford
                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

Make the send only join completion handler behave like the normal join
completion handler in that we should restart the multicast join thread
whenever our join completes with either an error or success, but don't
restart the thread if we got a net reset event, let ipoib_event queue
up a device flush, which will then call ipoib_mcast_dev_flush to remove
all of the current mcast entries, and ipoib_mcase_restart_task to check
out current mcast entry list against the netdev list of mcast entries
we are supposed to have and requeue the ones we need to get back.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 50d2de2270a..a03b6a04203 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -336,17 +336,18 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	} else {
 		/* Join completed, so reset any backoff parameters */
 		mcast->backoff = 1;
 		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 06/10] IB/ipoib: remove unneeded locks
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 05/10] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 07/10] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
                     ` (6 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

During the various work I've done on this, some extra locking has crept
in when things weren't working right.  This is one of those spots.
Remove the unneeded spinlocks.  The mutex is enough to protect against
what we need to protect against.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a03b6a04203..424c602d5a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -497,12 +497,10 @@ static int ipoib_mcast_join_complete(int status,
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
-	spin_lock_irq(&priv->lock);
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
 	return status;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 07/10] IB/ipoib: fix race between mcast_dev_flush and mcast_join
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 06/10] IB/ipoib: remove unneeded locks Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 08/10] IB/ipoib: fix ipoib_mcast_restart_task Doug Ledford
                     ` (5 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

There is a classic race between mcast_dev_flush and mcast_join_thread
in that the join thread is going to loop through all of the mcast
entries in the device mcast list, while dev_flush wants to remove
those entries from the list so that they can eventually be freed when
they are no longer busy.

As always, the list traversal and the list removal must be done under
the same lock to avoid races, so modify the join_task thread code to
hold onto the spinlock for the entirety of its run up until it has
saved the mcast entry that it's going to work on into a private pointer
and set the flags on that entry to mark it as busy.  Only then is it
safe to drop the lock and let mcast_dev_flush remove entries from the
list.

I also reworked the flow of the join_task thread to be more clear that
it is making a single pass at work to do and then exiting as soon as it
gets that single task queued off.

I then reworked both the regular and send only join routines to know
that they will be called with the BUSY flag already set on the mcast
entry they are passed, so any error condition means they need to
clear the flag and complete the task and they should not check the
busy status as it will now always be true.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 112 ++++++++++++-------------
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 424c602d5a0..972fa15a225 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -368,22 +368,16 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
 		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
 				"multicast joins\n");
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		complete(&mcast->done);
 		return -ENODEV;
 	}
 
-	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
-				"sendonly join\n");
-		return -EBUSY;
-	}
-
 	rec.mgid     = mcast->mcmember.mgid;
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
 	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
 					 priv->port, &rec,
 					 IB_SA_MCMEMBER_REC_MGID	|
@@ -552,8 +546,6 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 	}
 
 	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
@@ -574,6 +566,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		container_of(work, struct ipoib_dev_priv, mcast_task.work);
 	struct net_device *dev = priv->dev;
 	struct ib_port_attr port_attr;
+	struct ipoib_mcast *mcast = NULL;
+	int create = 1;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -591,16 +585,25 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
+	/*
+	 * We have to hold the mutex to keep from racing with the join
+	 * completion threads on setting flags on mcasts, and we have
+	 * to hold the priv->lock because dev_flush will remove entries
+	 * out from underneath us, so at a minimum we need the lock
+	 * through the time that we do the for_each loop of the mcast
+	 * list or else dev_flush can make us oops.
+	 */
+	mutex_lock(&mcast_mutex);
+	spin_lock_irq(&priv->lock);
+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		goto out;
+
 	if (!priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-
-		broadcast = ipoib_mcast_alloc(dev, 1);
+		broadcast = ipoib_mcast_alloc(dev, 0);
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
-			mutex_lock(&mcast_mutex);
 			/*
 			 * Restart us after a 1 second delay to retry
 			 * creating our broadcast group and attaching to
@@ -608,67 +611,64 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			 * completely stalled (multicast wise).
 			 */
 			__ipoib_mcast_continue_join_thread(priv, NULL, 1);
-			mutex_unlock(&mcast_mutex);
-			return;
+			goto out;
 		}
 
-		spin_lock_irq(&priv->lock);
 		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
 		priv->broadcast = broadcast;
 
 		__ipoib_mcast_add(dev, priv->broadcast);
-		spin_unlock_irq(&priv->lock);
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
-		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
-			ipoib_mcast_join(dev, priv->broadcast, 0);
-		return;
-	}
-
-	while (1) {
-		struct ipoib_mcast *mcast = NULL;
-
-		/*
-		 * Need the mutex so our flags are consistent, need the
-		 * priv->lock so we don't race with list removals in either
-		 * mcast_dev_flush or mcast_restart_task
-		 */
-		mutex_lock(&mcast_mutex);
-		spin_lock_irq(&priv->lock);
-		list_for_each_entry(mcast, &priv->multicast_list, list) {
-			if (IS_ERR_OR_NULL(mcast->mc) &&
-			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED,
-				      &mcast->flags) &&
-			    (mcast->backoff == 1 ||
-			     time_after_eq(jiffies, mcast->delay_until))) {
-				/* Found the next unjoined group */
-				break;
-			}
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
+			mcast = priv->broadcast;
+			create = 0;
 		}
-		spin_unlock_irq(&priv->lock);
-		mutex_unlock(&mcast_mutex);
+		goto out;
+	}
 
-		if (&mcast->list == &priv->multicast_list) {
-			/* All done, unless we have delayed work from
-			 * backoff retransmissions, but we will get
-			 * restarted when the time is right, so we are
-			 * done for now
-			 */
+	/* We'll never get here until the broadcast group is both allocated
+	 * and attached
+	 */
+	list_for_each_entry(mcast, &priv->multicast_list, list) {
+		if (IS_ERR_OR_NULL(mcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) &&
+		    (mcast->backoff == 1 ||
+		     time_after_eq(jiffies, mcast->delay_until))) {
+			/* Found the next unjoined group */
 			break;
 		}
+	}
 
+	if (&mcast->list == &priv->multicast_list) {
+		/* All done, unless we have delayed work from
+		 * backoff retransmissions, but we will get
+		 * restarted when the time is right, so we are
+		 * done for now
+		 */
+		mcast = NULL;
+		ipoib_dbg_mcast(priv, "successfully joined all "
+				"multicast groups\n");
+	}
+
+out:
+	if (mcast) {
+		init_completion(&mcast->done);
+		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	}
+	spin_unlock_irq(&priv->lock);
+	mutex_unlock(&mcast_mutex);
+	if (mcast) {
 		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
 			ipoib_mcast_sendonly_join(mcast);
 		else
-			ipoib_mcast_join(dev, mcast, 1);
-		return;
+			ipoib_mcast_join(dev, mcast, create);
 	}
-
-	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
+	return;
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 08/10] IB/ipoib: fix ipoib_mcast_restart_task
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 07/10] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 09/10] IB/ipoib: flush the ipoib_workqueue on unregister Doug Ledford
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

There are two intertwined problems here.

First, if we are downing the interface, we will get called as part of
the process.  When that happens, we don't really want to do anything
here as the next thing to happen is ipoib_mcast_dev_flush will get
called.  We just want to let it do all the work since we don't need to
do any address matches and such.  So, we should be testing for whether
or not the interface is up at the beginning of this function and
returning if it isn't.

Second, the attempt to grab the rtnl_lock to prevent racing against
ipoib_mcast_dev_flush was mistaken.  The removed mcast groups are on a
stack local list head and so once we've removed them from the mcast
rbtree, ipoib_mcast_dev_flush can no longer find them and race us on
removing them.  However, because we attempt to take the lock, and then
check the status of the interface, if our interface has gone down, we
simply return.  This is absolutely the wrong thing to do as it means we
just leaked all of the mcast entries on our remove list.

The fix is to check for the interface being up at the very beginning of
the function, and then to always remove our entries on the remove list
no matter what as they will get leaked otherwise, and skip taking the
rtnl_lock since it isn't needed anyway.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 972fa15a225..10e05437c83 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -871,6 +871,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	unsigned long flags;
 	struct ib_sa_mcmember_rec rec;

+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		return;
+
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");

 	local_irq_save(flags);
@@ -965,21 +968,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);

-	/*
-	 * We have to cancel outside of the spinlock, but we have to
-	 * take the rtnl lock or else we race with the removal of
-	 * entries from the remove list in mcast_dev_flush as part
-	 * of ipoib_stop().  We detect the drop of the ADMIN_UP flag
-	 * to signal that we have hit this particular race, and we
-	 * return since we know we don't need to do anything else
-	 * anyway.
-	 */
-	while (!rtnl_trylock()) {
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-		else
-			msleep(20);
-	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
@@ -988,7 +976,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * Restart our join task thread if needed
 	 */
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
-	rtnl_unlock();
 }

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 09/10] IB/ipoib: flush the ipoib_workqueue on unregister
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 08/10] IB/ipoib: fix ipoib_mcast_restart_task Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 10/10] IB/ipoib: cleanup a couple debug messages Doug Ledford
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

When we unregister our event handler as a prelude to removing our
device, we need to go ahead and flush the ipoib_workqueue to make sure
there are no delayed flushes for the device we are getting ready to
remove.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6bad17d4d58..8810514cf40 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1643,6 +1643,7 @@ sysfs_failed:
 
 register_failed:
 	ib_unregister_event_handler(&priv->event_handler);
+	flush_workqueue(ipoib_workqueue);
 	/* Stop GC if started before flush */
 	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 	cancel_delayed_work(&priv->neigh_reap_task);
@@ -1709,6 +1710,7 @@ static void ipoib_remove_one(struct ib_device *device)
 
 	list_for_each_entry_safe(priv, tmp, dev_list, list) {
 		ib_unregister_event_handler(&priv->event_handler);
+		flush_workqueue(ipoib_workqueue);
 
 		rtnl_lock();
 		dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 v5 10/10] IB/ipoib: cleanup a couple debug messages
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 09/10] IB/ipoib: flush the ipoib_workqueue on unregister Doug Ledford
@ 2015-01-22 14:31   ` Doug Ledford
  2015-01-23  7:01   ` [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions Or Gerlitz
                     ` (2 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-22 14:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

The one I removed in particular can prevent progress in the driver from
happening because it isn't rate limited.  In my testing, I have seen the
flood of messages from that debug message delay multicast joins for in
excess of 30 seconds.

The added debug message was just so I would know when we were leaving a
multicast group that we didn't fully join.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 10e05437c83..ad2bcd105e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -711,6 +711,8 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 
 	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
+	else
+		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -721,7 +723,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 				      be16_to_cpu(mcast->mcmember.mlid));
 		if (ret)
 			ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret);
-	}
+	} else if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+		ipoib_dbg(priv, "leaving with no mcmember but not a "
+			  "SENDONLY join\n");
 
 	return 0;
 }
@@ -772,11 +776,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
-
-		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
-			ipoib_dbg_mcast(priv, "no address vector, "
-					"but multicast join already started\n");
-
 		/*
 		 * If lookup completes between here and out:, don't
 		 * want to send packet twice.
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 10/10] IB/ipoib: cleanup a couple debug messages Doug Ledford
@ 2015-01-23  7:01   ` Or Gerlitz
       [not found]     ` <CAJ3xEMi7mowr_qFMUXtM5m8p974qF39nPf-Qh-NOYK_jUzswSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-23 16:52   ` Doug Ledford
  2015-01-26 19:34   ` [PATCH FIX For-3.19 11/10] IB/ipoib: don't queue a work struct up twice Doug Ledford
  12 siblings, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-01-23  7:01 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Roland Dreier,
	Amir Vadai, Eyal Perry, Erez Shitrit, Alex Estrin

On Thu, Jan 22, 2015 at 4:31 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> My 8 patch set taken into 3.19 caused some regressions.  This patch
> set resolves those issues.

Doug, Roland - eight rc1 patches followed ten rc6 fixes - sounds a bit
too much to me. If we (and Linus) like or not, probably the right
thing to do is revert the rc1 patches and come up with a set of
hopefully < 18 patches for 3.20 - using squashes and such, and there's
not much time for that too, BTW.

Taking into account Doug's listing of the remaining problems in the
original set which are not addressed by the patch Erez suggested, this
isn't an option either.

Alex, you acked all the series, haven't you noticed the IPv6 and IPv4
multicast breakage or it worked for you?

Or.

> Doug Ledford (10):
>   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
>   IB/ipoib: Add a helper to restart the multicast task
>   IB/ipoib: make delayed tasks not hold up everything
>   IB/ipoib: Handle -ENETRESET properly in our callback
>   IB/ipoib: don't restart our thread on ENETRESET
>   IB/ipoib: remove unneeded locks
>   IB/ipoib: fix race between mcast_dev_flush and mcast_join
>   IB/ipoib: fix ipoib_mcast_restart_task
>   IB/ipoib: flush the ipoib_workqueue on unregister
>   IB/ipoib: cleanup a couple debug messages
>
>  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
>  3 files changed, 131 insertions(+), 106 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]     ` <CAJ3xEMi7mowr_qFMUXtM5m8p974qF39nPf-Qh-NOYK_jUzswSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-23  7:45       ` Doug Ledford
       [not found]         ` <1421999125.3352.265.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-23 12:54       ` Estrin, Alex
  1 sibling, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-01-23  7:45 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Roland Dreier,
	Amir Vadai, Eyal Perry, Erez Shitrit, Alex Estrin

[-- Attachment #1: Type: text/plain, Size: 4242 bytes --]

On Fri, 2015-01-23 at 09:01 +0200, Or Gerlitz wrote:
> On Thu, Jan 22, 2015 at 4:31 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > My 8 patch set taken into 3.19 caused some regressions.  This patch
> > set resolves those issues.
> 
> Doug, Roland - eight rc1 patches followed ten rc6 fixes - sounds a bit
> too much to me.

We can do as we wish upstream, but these are going live in our product
ASAP.  The original 8 rc1 patches took so long to get upstream that they
had already been live in 6.5-z and 6.6 and were slated for 7.1.  We had
a few customers report problems, but not as many as you might think.
The new patches are going to go out and fix those problems, period.  I
can't wait for 3.20 to fix these issues.  C'est la vie.

FWIW, I expect that these patches (plus the one patch I didn't put in
this same thread) to be live on production machines probably within 10
days (or less).  I'll let you know how that goes.

>  If we (and Linus) like or not, probably the right
> thing to do is revert the rc1 patches and come up with a set of
> hopefully < 18 patches for 3.20 - using squashes and such, and there's
> not much time for that too, BTW.

I'm not going to squash them.  I used to write big monolithic patches,
back in the day when I was a newbie kernel engineer.  Trying to review
or understand the nuances of those patches was nigh on impossible.
Here, each patch was unique in the issue it addressed and done that way
particularly to make it easier to follow the logic later when someone
might review this code and these changes.  A big monolithic patch that
hides these revisions hides the nuances that made getting this right
difficult.  I'm not going to go back to those days.  Git is just fine
with these patches being separated.

> Taking into account Doug's listing of the remaining problems in the
> original set which are not addressed by the patch Erez suggested, this
> isn't an option either.

I agree with you on this point.  But, I would also point out, that if
you are looking for justification to take this current patch set, that's
the real justification.  It's all about understanding the problem, and
dealing with it properly.  My patch set does that.  Patch count and line
count are secondary to understanding when it comes to determining if a
change is acceptable in late rc stage IMO.

> Alex, you acked all the series, haven't you noticed the IPv6 and IPv4
> multicast breakage or it worked for you?

Surprisingly enough, not many people use multicast.  And the whole
reason I didn't see the problem in the first place is that the initial
multicast setup that's necessary to make the links come up works OK,
it's only new joins after that which are effected.  We haven't had any
reports of multicast not working except from you.  Your report was
valid, just no one else cared (I'm sure someone would have eventually,
but out of the people using 6.5-z and 6.6 kernels, no one reported
multicast issues, just other issues).

> Or.
> 
> > Doug Ledford (10):
> >   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
> >   IB/ipoib: Add a helper to restart the multicast task
> >   IB/ipoib: make delayed tasks not hold up everything
> >   IB/ipoib: Handle -ENETRESET properly in our callback
> >   IB/ipoib: don't restart our thread on ENETRESET
> >   IB/ipoib: remove unneeded locks
> >   IB/ipoib: fix race between mcast_dev_flush and mcast_join
> >   IB/ipoib: fix ipoib_mcast_restart_task
> >   IB/ipoib: flush the ipoib_workqueue on unregister
> >   IB/ipoib: cleanup a couple debug messages
> >
> >  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
> >  3 files changed, 131 insertions(+), 106 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]     ` <CAJ3xEMi7mowr_qFMUXtM5m8p974qF39nPf-Qh-NOYK_jUzswSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-23  7:45       ` Doug Ledford
@ 2015-01-23 12:54       ` Estrin, Alex
  1 sibling, 0 replies; 26+ messages in thread
From: Estrin, Alex @ 2015-01-23 12:54 UTC (permalink / raw)
  To: Or Gerlitz, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Roland Dreier,
	Amir Vadai, Eyal Perry, Erez Shitrit



> -----Original Message-----
> From: Or Gerlitz [mailto:gerlitz.or@gmail.com]
> Sent: Friday, January 23, 2015 2:02 AM
> To: Doug Ledford
> Cc: linux-rdma@vger.kernel.org; Roland Dreier; Amir Vadai; Eyal Perry; Erez
> Shitrit; Estrin, Alex
> Subject: Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
> 
> On Thu, Jan 22, 2015 at 4:31 PM, Doug Ledford <dledford@redhat.com> wrote:
> > My 8 patch set taken into 3.19 caused some regressions.  This patch
> > set resolves those issues.
> 
> Doug, Roland - eight rc1 patches followed ten rc6 fixes - sounds a bit
> too much to me. If we (and Linus) like or not, probably the right
> thing to do is revert the rc1 patches and come up with a set of
> hopefully < 18 patches for 3.20 - using squashes and such, and there's
> not much time for that too, BTW.
> 
> Taking into account Doug's listing of the remaining problems in the
> original set which are not addressed by the patch Erez suggested, this
> isn't an option either.
> 
> Alex, you acked all the series, haven't you noticed the IPv6 and IPv4
> multicast breakage or it worked for you?
> 

I didn't run elaborate multicast tests on the patch-set so I didn't hit it.
Was concentrated on initialization flow. That was a mistake.

> Or.
> 
> > Doug Ledford (10):
> >   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
> >   IB/ipoib: Add a helper to restart the multicast task
> >   IB/ipoib: make delayed tasks not hold up everything
> >   IB/ipoib: Handle -ENETRESET properly in our callback
> >   IB/ipoib: don't restart our thread on ENETRESET
> >   IB/ipoib: remove unneeded locks
> >   IB/ipoib: fix race between mcast_dev_flush and mcast_join
> >   IB/ipoib: fix ipoib_mcast_restart_task
> >   IB/ipoib: flush the ipoib_workqueue on unregister
> >   IB/ipoib: cleanup a couple debug messages
> >
> >  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
> >  3 files changed, 131 insertions(+), 106 deletions(-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2015-01-23  7:01   ` [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions Or Gerlitz
@ 2015-01-23 16:52   ` Doug Ledford
       [not found]     ` <1422031938.3352.286.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-26 19:34   ` [PATCH FIX For-3.19 11/10] IB/ipoib: don't queue a work struct up twice Doug Ledford
  12 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-01-23 16:52 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, Amir Vadai, Eyal Perry, Or Gerlitz,
	Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
> My 8 patch set taken into 3.19 caused some regressions.  This patch
> set resolves those issues.
> 
> These patches are to resolve issues created by my previous patch set.
> While that set worked fine in my testing, there were problems with
> multicast joins after the initial set of joins had completed.  Since my
> testing relied upon the normal set of multicast joins that happen
> when the interface is first brought up, I missed those problems.
> 
> Symptoms vary from failure to send packets due to a failed join, to
> loss of connectivity after a subnet manager restart, to failure
> to properly release multicast groups on shutdown resulting in hangs
> when the mlx4 driver attempts to unload itself via its reboot
> notifier handler.
> 
> This set of patches has passed a number of tests above and beyond my
> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> multicast tests.  I also added both subnet manager restarts and
> manual shutdown/restart of individual ports at the switch in order to
> ensure that the ENETRESET path was properly tested.  I included
> testing, then a subnet manager restart, then a quiescent period for
> caches to expire, then restarting testing to make sure that arp and
> neighbor discovery work after the subnet manager restart.
> 
> All in all, I have not been able to trip the multicast joins up any
> longer.
> 
> Additionally, the original impetus for my first 8 patch set was that
> it was simply too easy to break the IPoIB subsystem with this simple
> loop:
> 
> while true; do
>     ifconfig ib0 up
>     ifconfig ib0 down
> done
> 
> Just to be safe, I made sure this problem did not resurface.
> 
> v5: fix an oversight in mcast_restart_task that leaked mcast joins
>     fix a failure to flush the ipoib_workqueue on deregister that
>     meant we could end up running our code after our device had been
>     removed, resulting in an oops
>     remove a debug message that could be trigger so fast that the
>     kernel printk mechanism would starve out the mcast join task thread
>     resulting in what looked like a mcast failure that was really just
>     delayed action
> 
> 
> Doug Ledford (10):
>   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
>   IB/ipoib: Add a helper to restart the multicast task
>   IB/ipoib: make delayed tasks not hold up everything
>   IB/ipoib: Handle -ENETRESET properly in our callback
>   IB/ipoib: don't restart our thread on ENETRESET
>   IB/ipoib: remove unneeded locks
>   IB/ipoib: fix race between mcast_dev_flush and mcast_join
>   IB/ipoib: fix ipoib_mcast_restart_task
>   IB/ipoib: flush the ipoib_workqueue on unregister
>   IB/ipoib: cleanup a couple debug messages
> 
>  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
>  3 files changed, 131 insertions(+), 106 deletions(-)
> 

FWIW, a couple different customers have tried a test kernel I built
internally with my patches and I've had multiple reports that all
previously observed issues have been resolved.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]         ` <1421999125.3352.265.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-24  4:58           ` Roland Dreier
  0 siblings, 0 replies; 26+ messages in thread
From: Roland Dreier @ 2015-01-24  4:58 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Amir Vadai, Eyal Perry, Erez Shitrit, Alex Estrin

>> Doug, Roland - eight rc1 patches followed ten rc6 fixes - sounds a bit
>> too much to me.
>
> We can do as we wish upstream, but these are going live in our product
> ASAP.  The original 8 rc1 patches took so long to get upstream that they
> had already been live in 6.5-z and 6.6 and were slated for 7.1.  We had
> a few customers report problems, but not as many as you might think.
> The new patches are going to go out and fix those problems, period.  I
> can't wait for 3.20 to fix these issues.  C'est la vie.
>
> FWIW, I expect that these patches (plus the one patch I didn't put in
> this same thread) to be live on production machines probably within 10
> days (or less).  I'll let you know how that goes.

Given the level of testing you've done, and the fact that the changes
are going to be in a Red Hat release, I'm inclined to try to get all
this into 3.19.  It may be more than Linus would take, and it's more
than I would like to take so late in the cycle, but really it seems
safer than reverting back to the status quo ante.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]     ` <1422031938.3352.286.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-25 12:54       ` Erez Shitrit
       [not found]         ` <54C4E793.2010103-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Erez Shitrit @ 2015-01-25 12:54 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, Amir Vadai, Eyal Perry, Or Gerlitz,
	Erez Shitrit

On 1/23/2015 6:52 PM, Doug Ledford wrote:
> On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
>> My 8 patch set taken into 3.19 caused some regressions.  This patch
>> set resolves those issues.
>>
>> These patches are to resolve issues created by my previous patch set.
>> While that set worked fine in my testing, there were problems with
>> multicast joins after the initial set of joins had completed.  Since my
>> testing relied upon the normal set of multicast joins that happen
>> when the interface is first brought up, I missed those problems.
>>
>> Symptoms vary from failure to send packets due to a failed join, to
>> loss of connectivity after a subnet manager restart, to failure
>> to properly release multicast groups on shutdown resulting in hangs
>> when the mlx4 driver attempts to unload itself via its reboot
>> notifier handler.
>>
>> This set of patches has passed a number of tests above and beyond my
>> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
>> multicast tests.  I also added both subnet manager restarts and
>> manual shutdown/restart of individual ports at the switch in order to
>> ensure that the ENETRESET path was properly tested.  I included
>> testing, then a subnet manager restart, then a quiescent period for
>> caches to expire, then restarting testing to make sure that arp and
>> neighbor discovery work after the subnet manager restart.
>>
>> All in all, I have not been able to trip the multicast joins up any
>> longer.
>>
>> Additionally, the original impetus for my first 8 patch set was that
>> it was simply too easy to break the IPoIB subsystem with this simple
>> loop:
>>
>> while true; do
>>      ifconfig ib0 up
>>      ifconfig ib0 down
>> done
>>
>> Just to be safe, I made sure this problem did not resurface.
>>
>> v5: fix an oversight in mcast_restart_task that leaked mcast joins
>>      fix a failure to flush the ipoib_workqueue on deregister that
>>      meant we could end up running our code after our device had been
>>      removed, resulting in an oops
>>      remove a debug message that could be trigger so fast that the
>>      kernel printk mechanism would starve out the mcast join task thread
>>      resulting in what looked like a mcast failure that was really just
>>      delayed action
>>
>>
>> Doug Ledford (10):
>>    IB/ipoib: fix IPOIB_MCAST_RUN flag usage
>>    IB/ipoib: Add a helper to restart the multicast task
>>    IB/ipoib: make delayed tasks not hold up everything
>>    IB/ipoib: Handle -ENETRESET properly in our callback
>>    IB/ipoib: don't restart our thread on ENETRESET
>>    IB/ipoib: remove unneeded locks
>>    IB/ipoib: fix race between mcast_dev_flush and mcast_join
>>    IB/ipoib: fix ipoib_mcast_restart_task
>>    IB/ipoib: flush the ipoib_workqueue on unregister
>>    IB/ipoib: cleanup a couple debug messages
>>
>>   drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>>   drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
>>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
>>   3 files changed, 131 insertions(+), 106 deletions(-)
>>
> FWIW, a couple different customers have tried a test kernel I built
> internally with my patches and I've had multiple reports that all
> previously observed issues have been resolved.
>

Hi Doug,
still see an issue with the last version, and as a result no sendonly or 
IPv6 is working.
The scenario is some how simple to reproduce, if there is a sendonly 
multicast group that failed to join (sm refuses, perhaps the group was 
closed, etc.) that causes all the other mcg's to be blocked behind it 
forever.
for example, there is bad mcg ff12:601b:ffff:0000:0000:0000:0000:0016 
that the sm refuses to join, and after some time the user tries to send 
packets to ip address 225.5.5.5 (mcg: 
ff12:401b:ffff:0000:0000:0000:0105:0505 )
the log will show something like:
[1561627.426080] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561633.726768] ib0: setting up send only multicast group for 
ff12:401b:ffff:0000:0000:0000:0105:0505
[1561643.498990] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561675.645424] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561691.718464] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561707.791609] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561723.864839] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561739.937981] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[1561756.010895] ib0: no multicast record for 
ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
....
....
for ever or till the sm will decide in the future to let 
ff12:601b:ffff:0000:0000:0000:0000:0016 join, till than no sendonly traffic.

The main cause is the concept that was broken for the send-only join, 
when you treat the sendonly like a regular mcg and add it to the mc list 
and to the mc_task etc.
(not consider other issues like packet drop, and bw of the sendonly traffic)
IMHO, sendonly is part of the TX flow as it was till now in the ipoib 
driver and should stay in that way.

I went over your comments to my patch, will try to response/ cover them 
ASAP.

Thanks, Erez

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]         ` <54C4E793.2010103-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-01-25 22:21           ` Doug Ledford
       [not found]             ` <1422224477.3352.373.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-01-25 22:21 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 11001 bytes --]

On Sun, 2015-01-25 at 14:54 +0200, Erez Shitrit wrote:
> On 1/23/2015 6:52 PM, Doug Ledford wrote:
> > On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
> >> My 8 patch set taken into 3.19 caused some regressions.  This patch
> >> set resolves those issues.
> >>
> >> These patches are to resolve issues created by my previous patch set.
> >> While that set worked fine in my testing, there were problems with
> >> multicast joins after the initial set of joins had completed.  Since my
> >> testing relied upon the normal set of multicast joins that happen
> >> when the interface is first brought up, I missed those problems.
> >>
> >> Symptoms vary from failure to send packets due to a failed join, to
> >> loss of connectivity after a subnet manager restart, to failure
> >> to properly release multicast groups on shutdown resulting in hangs
> >> when the mlx4 driver attempts to unload itself via its reboot
> >> notifier handler.
> >>
> >> This set of patches has passed a number of tests above and beyond my
> >> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> >> multicast tests.  I also added both subnet manager restarts and
> >> manual shutdown/restart of individual ports at the switch in order to
> >> ensure that the ENETRESET path was properly tested.  I included
> >> testing, then a subnet manager restart, then a quiescent period for
> >> caches to expire, then restarting testing to make sure that arp and
> >> neighbor discovery work after the subnet manager restart.
> >>
> >> All in all, I have not been able to trip the multicast joins up any
> >> longer.
> >>
> >> Additionally, the original impetus for my first 8 patch set was that
> >> it was simply too easy to break the IPoIB subsystem with this simple
> >> loop:
> >>
> >> while true; do
> >>      ifconfig ib0 up
> >>      ifconfig ib0 down
> >> done
> >>
> >> Just to be safe, I made sure this problem did not resurface.
> >>
> >> v5: fix an oversight in mcast_restart_task that leaked mcast joins
> >>      fix a failure to flush the ipoib_workqueue on deregister that
> >>      meant we could end up running our code after our device had been
> >>      removed, resulting in an oops
> >>      remove a debug message that could be trigger so fast that the
> >>      kernel printk mechanism would starve out the mcast join task thread
> >>      resulting in what looked like a mcast failure that was really just
> >>      delayed action
> >>
> >>
> >> Doug Ledford (10):
> >>    IB/ipoib: fix IPOIB_MCAST_RUN flag usage
> >>    IB/ipoib: Add a helper to restart the multicast task
> >>    IB/ipoib: make delayed tasks not hold up everything
> >>    IB/ipoib: Handle -ENETRESET properly in our callback
> >>    IB/ipoib: don't restart our thread on ENETRESET
> >>    IB/ipoib: remove unneeded locks
> >>    IB/ipoib: fix race between mcast_dev_flush and mcast_join
> >>    IB/ipoib: fix ipoib_mcast_restart_task
> >>    IB/ipoib: flush the ipoib_workqueue on unregister
> >>    IB/ipoib: cleanup a couple debug messages
> >>
> >>   drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
> >>   drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
> >>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
> >>   3 files changed, 131 insertions(+), 106 deletions(-)
> >>
> > FWIW, a couple different customers have tried a test kernel I built
> > internally with my patches and I've had multiple reports that all
> > previously observed issues have been resolved.
> >
> 
> Hi Doug,
> still see an issue with the last version, and as a result no sendonly or 
> IPv6 is working.

I disagree with your assessment.  See below.

> The scenario is some how simple to reproduce, if there is a sendonly 
> multicast group that failed to join (sm refuses, perhaps the group was 
> closed, etc.)

This is not the case.  "failed to join" implies that the join completed
with an error.  According to the logs below, something impossible is
happening.

>  that causes all the other mcg's to be blocked behind it 
> forever.

Your right.  Because the join tasks are serialized and we only queue up
one join at a time (something I would like to address, but not in the
3.19 timeframe), if one join simply never returns, as the logs below
indicate, then it blocks up the whole system.

> for example, there is bad mcg ff12:601b:ffff:0000:0000:0000:0000:0016 
> that the sm refuses to join, and after some time the user tries to send 
> packets to ip address 225.5.5.5 (mcg: 
> ff12:401b:ffff:0000:0000:0000:0105:0505 )
> the log will show something like:
> [1561627.426080] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

This is printed out by sendonly_join

> [1561633.726768] ib0: setting up send only multicast group for 
> ff12:401b:ffff:0000:0000:0000:0105:0505

This is part of mcast_send and indicates we have created the new group
and added it to the mcast rbtree, but not that we've started the join.

> [1561643.498990] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

Our mcast task got ran again, and we are restarting the same group for
some reason.  Theoretically that means we should have cleared the busy
flag on this group somehow, which requires that sendonly_join_complete
must have been called.  However, I can't find a single code path in
sendonly_join_complete that could possibly clear out the busy flag on
this group and not result in a printout message of some sort.  Are you
trimming the list of messages here?  The reason I say this is that these
are the possibilities:

sendonly_join_complete with status == ENETRESET, we silently drop out
but we would immediately get an ib_event with netreset and process a
flush, which would print out copious messages
sendonly_join_complete with status == 0, we would call mcast_join_finish
and the only silent return is if this is our broadcast group and
priv->broadcast is not set, in which case we don't have link up and the
silent failures and the failure to join any other address are normal and
expected since the ipoib layer considers things dead without the
broadcast group joined, every other return path will print out
something, and in particular the attempt to get an ah on the group will
always print out either success or failure
sendonly_join_complete with status anything else results in a debug
message about the join failing.

As far as I know, your setup has the IPv4 broadcast attached and the
link is up.  The group you are actually showing as bad is the IPv6
MLDv2-capable Routers group and failure to join it should not be
blocking anything else.  So I'm back to what I said before: unless you
are trimming messages, this debug output makes no sense whatsoever and
should be impossible to generate.  Are you sure you have a clean tree
with no possible merge oddities that are throwing your testing off?

> [1561675.645424] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561691.718464] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561707.791609] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561723.864839] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561739.937981] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561756.010895] ib0: no multicast record for 
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

These are all being queued with successively longer backoffs, exactly as
they should be.

In __ipoib_mcast_continue_join_thread, when we get called with delay !=
0, we end up doing two things.  One, we change the mcast->backoff from 1
to whatever it should be now (minimum of 2, maximum of 16), and we set
delay_until to jiffies + (HZ * mcast->backoff).  This lets the mcast
task know to skip this group until such time as its backoff is expired.
From the messages above, that seems to be working. The other thing we do
is we don't queue up the mcast thread just once, we queue it up twice.
Once at the right time for the backoff, and again now (although there is
a bug here, the second instance queues it up with delay as the time
offset, that should be 0, that's a holdover from before, at least
fortunately it is only ever 1 or 0 so it's not horrible, but things
would go a little faster if I fix that so that the second instance is
always 0).  That way the instance that's queued up first will know to
skip the delayed group and continue on through the list, and that should
allow us to get these other groups that you say are blocked.

> ....
> for ever or till the sm will decide in the future to let 
> ff12:601b:ffff:0000:0000:0000:0000:0016 join, till than no sendonly traffic.
> 
> 
> The main cause is the concept that was broken for the send-only join, 
> when you treat the sendonly like a regular mcg and add it to the mc list 
> and to the mc_task etc.

I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
just like my current code does.  The only difference, and I do mean
*only*, is that it calls sendonly_join directly instead of via the
mcast_task.  The net effect of that is that sendonly joins are not
serialized in the old version and they are in the new.  I'm more than
happy to deserialize them, but I don't think that necessarily means we
have to call sendonly_join directly from mcast_send, it could also be
done by allowing the mcast_task to fire off more than one join at a
time.  Whether we use the mcast_task or call sendonly_join directly is
in-consequential, it's more an issue of whether or not we serialize the
joins.

> (not consider other issues like packet drop, and bw of the sendonly traffic)
> IMHO, sendonly is part of the TX flow as it was till now in the ipoib 
> driver and should stay in that way.

It seems like you're thinking of the mcast task thread as something that
only runs as part of mcast joins or leaves when the upper layer calls
set multicast list in our driver.  It used to be that way, but since I
reworked the task thread, that's no longer the case.  As such, I don't
think this distinction you are drawing is necessarily still correct.

> I went over your comments to my patch, will try to response/ cover them 
> ASAP.
> 
> Thanks, Erez
> 
> 
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]             ` <1422224477.3352.373.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-26 10:27               ` Erez Shitrit
       [not found]                 ` <54C616A8.3050804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Erez Shitrit @ 2015-01-26 10:27 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 12458 bytes --]

On 1/26/2015 12:21 AM, Doug Ledford wrote:
> On Sun, 2015-01-25 at 14:54 +0200, Erez Shitrit wrote:
>> On 1/23/2015 6:52 PM, Doug Ledford wrote:
>>> On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
>>>> My 8 patch set taken into 3.19 caused some regressions.  This patch
>>>> set resolves those issues.
>>>>
>>>> These patches are to resolve issues created by my previous patch set.
>>>> While that set worked fine in my testing, there were problems with
>>>> multicast joins after the initial set of joins had completed.  Since my
>>>> testing relied upon the normal set of multicast joins that happen
>>>> when the interface is first brought up, I missed those problems.
>>>>
>>>> Symptoms vary from failure to send packets due to a failed join, to
>>>> loss of connectivity after a subnet manager restart, to failure
>>>> to properly release multicast groups on shutdown resulting in hangs
>>>> when the mlx4 driver attempts to unload itself via its reboot
>>>> notifier handler.
>>>>
>>>> This set of patches has passed a number of tests above and beyond my
>>>> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
>>>> multicast tests.  I also added both subnet manager restarts and
>>>> manual shutdown/restart of individual ports at the switch in order to
>>>> ensure that the ENETRESET path was properly tested.  I included
>>>> testing, then a subnet manager restart, then a quiescent period for
>>>> caches to expire, then restarting testing to make sure that arp and
>>>> neighbor discovery work after the subnet manager restart.
>>>>
>>>> All in all, I have not been able to trip the multicast joins up any
>>>> longer.
>>>>
>>>> Additionally, the original impetus for my first 8 patch set was that
>>>> it was simply too easy to break the IPoIB subsystem with this simple
>>>> loop:
>>>>
>>>> while true; do
>>>>       ifconfig ib0 up
>>>>       ifconfig ib0 down
>>>> done
>>>>
>>>> Just to be safe, I made sure this problem did not resurface.
>>>>
>>>> v5: fix an oversight in mcast_restart_task that leaked mcast joins
>>>>       fix a failure to flush the ipoib_workqueue on deregister that
>>>>       meant we could end up running our code after our device had been
>>>>       removed, resulting in an oops
>>>>       remove a debug message that could be trigger so fast that the
>>>>       kernel printk mechanism would starve out the mcast join task thread
>>>>       resulting in what looked like a mcast failure that was really just
>>>>       delayed action
>>>>
>>>>
>>>> Doug Ledford (10):
>>>>     IB/ipoib: fix IPOIB_MCAST_RUN flag usage
>>>>     IB/ipoib: Add a helper to restart the multicast task
>>>>     IB/ipoib: make delayed tasks not hold up everything
>>>>     IB/ipoib: Handle -ENETRESET properly in our callback
>>>>     IB/ipoib: don't restart our thread on ENETRESET
>>>>     IB/ipoib: remove unneeded locks
>>>>     IB/ipoib: fix race between mcast_dev_flush and mcast_join
>>>>     IB/ipoib: fix ipoib_mcast_restart_task
>>>>     IB/ipoib: flush the ipoib_workqueue on unregister
>>>>     IB/ipoib: cleanup a couple debug messages
>>>>
>>>>    drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>>>>    drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
>>>>    drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
>>>>    3 files changed, 131 insertions(+), 106 deletions(-)
>>>>
>>> FWIW, a couple different customers have tried a test kernel I built
>>> internally with my patches and I've had multiple reports that all
>>> previously observed issues have been resolved.
>>>
>> Hi Doug,
>> still see an issue with the last version, and as a result no sendonly or
>> IPv6 is working.
> I disagree with your assessment.  See below.
>
>> The scenario is some how simple to reproduce, if there is a sendonly
>> multicast group that failed to join (sm refuses, perhaps the group was
>> closed, etc.)
> This is not the case.  "failed to join" implies that the join completed
> with an error.  According to the logs below, something impossible is
> happening.
>
>>   that causes all the other mcg's to be blocked behind it
>> forever.
> Your right.  Because the join tasks are serialized and we only queue up
> one join at a time (something I would like to address, but not in the
> 3.19 timeframe), if one join simply never returns, as the logs below
> indicate, then it blocks up the whole system.
>
>> for example, there is bad mcg ff12:601b:ffff:0000:0000:0000:0000:0016
>> that the sm refuses to join, and after some time the user tries to send
>> packets to ip address 225.5.5.5 (mcg:
>> ff12:401b:ffff:0000:0000:0000:0105:0505 )
>> the log will show something like:
>> [1561627.426080] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> This is printed out by sendonly_join
>
>> [1561633.726768] ib0: setting up send only multicast group for
>> ff12:401b:ffff:0000:0000:0000:0105:0505
> This is part of mcast_send and indicates we have created the new group
> and added it to the mcast rbtree, but not that we've started the join.
>
>> [1561643.498990] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> Our mcast task got ran again, and we are restarting the same group for
> some reason.  Theoretically that means we should have cleared the busy
> flag on this group somehow, which requires that sendonly_join_complete
> must have been called.  However, I can't find a single code path in
> sendonly_join_complete that could possibly clear out the busy flag on
> this group and not result in a printout message of some sort.  Are you
> trimming the list of messages here?  The reason I say this is that these
> are the possibilities:
>
> sendonly_join_complete with status == ENETRESET, we silently drop out
> but we would immediately get an ib_event with netreset and process a
> flush, which would print out copious messages
> sendonly_join_complete with status == 0, we would call mcast_join_finish
> and the only silent return is if this is our broadcast group and
> priv->broadcast is not set, in which case we don't have link up and the
> silent failures and the failure to join any other address are normal and
> expected since the ipoib layer considers things dead without the
> broadcast group joined, every other return path will print out
> something, and in particular the attempt to get an ah on the group will
> always print out either success or failure
> sendonly_join_complete with status anything else results in a debug
> message about the join failing.
>
> As far as I know, your setup has the IPv4 broadcast attached and the
> link is up.  The group you are actually showing as bad is the IPv6
> MLDv2-capable Routers group and failure to join it should not be
> blocking anything else.  So I'm back to what I said before: unless you
> are trimming messages, this debug output makes no sense whatsoever and
> should be impossible to generate.

New (and full) dmesg attached, (after modprobe ib_ipoib, with all debug 
flags set) it is all there.

the case is simple: ping6
In device A:
# ip addr show ib0
19: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP 
qlen 256
     link/infiniband 
a0:04:02:10:fe:80:00:00:00:00:00:00:00:02:c9:03:00:9f:3b:0a brd 
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
     inet 11.134.33.1/8 brd 11.255.255.255 scope global ib0
     inet6 fe80::202:c903:9f:3b0a/64 scope link
        valid_lft forever preferred_lft forever

In device B, where i have the latest version:
#ping6 fe80::202:c903:9f:3b0a -I ib0


>    Are you sure you have a clean tree
> with no possible merge oddities that are throwing your testing off?

double checked myself, new git, with the last 10 patches, recompiled all 
modules, reboot etc. the same issue.

#git log --oneline
362509a IB/ipoib: cleanup a couple debug messages
e8e1618 IB/ipoib: flush the ipoib_workqueue on unregister
867555c IB/ipoib: fix ipoib_mcast_restart_task
125a466 IB/ipoib: fix race between mcast_dev_flush and mcast_join
9b0908b IB/ipoib: remove unneeded locks
c2fb665 IB/ipoib: don't restart our thread on ENETRESET
2345213 IB/ipoib: Handle -ENETRESET properly in our callback
6ed68e8 IB/ipoib: make delayed tasks not hold up everything
bee92be5 IB/ipoib: Add a helper to restart the multicast task
2084aee IB/ipoib: fix IPOIB_MCAST_RUN flag usage
a7cfef2 Merge branches 'core', 'cxgb4', 'ipoib', 'iser', 'mlx4', 
'ocrdma', 'odp' and 'srp' into for-next
....
....



>> [1561675.645424] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
>> [1561691.718464] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
>> [1561707.791609] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
>> [1561723.864839] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
>> [1561739.937981] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
>> [1561756.010895] ib0: no multicast record for
>> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> These are all being queued with successively longer backoffs, exactly as
> they should be.
>
> In __ipoib_mcast_continue_join_thread, when we get called with delay !=
> 0, we end up doing two things.  One, we change the mcast->backoff from 1
> to whatever it should be now (minimum of 2, maximum of 16), and we set
> delay_until to jiffies + (HZ * mcast->backoff).  This lets the mcast
> task know to skip this group until such time as its backoff is expired.
>  From the messages above, that seems to be working. The other thing we do
> is we don't queue up the mcast thread just once, we queue it up twice.
> Once at the right time for the backoff, and again now (although there is
> a bug here, the second instance queues it up with delay as the time
> offset, that should be 0, that's a holdover from before, at least
> fortunately it is only ever 1 or 0 so it's not horrible, but things
> would go a little faster if I fix that so that the second instance is
> always 0).  That way the instance that's queued up first will know to
> skip the delayed group and continue on through the list, and that should
> allow us to get these other groups that you say are blocked.
>
>> ....
>> for ever or till the sm will decide in the future to let
>> ff12:601b:ffff:0000:0000:0000:0000:0016 join, till than no sendonly traffic.
>>
>>
>> The main cause is the concept that was broken for the send-only join,
>> when you treat the sendonly like a regular mcg and add it to the mc list
>> and to the mc_task etc.
> I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
> just like my current code does.  The only difference, and I do mean
> *only*, is that it calls sendonly_join directly instead of via the
> mcast_task.

Yes, and i already wrote that it is more than just "only", it changed 
the concept of the sendonly mc packet.

>    The net effect of that is that sendonly joins are not
> serialized in the old version and they are in the new.  I'm more than
> happy to deserialize them, but I don't think that necessarily means we
> have to call sendonly_join directly from mcast_send, it could also be
> done by allowing the mcast_task to fire off more than one join at a
> time.  Whether we use the mcast_task or call sendonly_join directly is
> in-consequential, it's more an issue of whether or not we serialize the
> joins.
>
>> (not consider other issues like packet drop, and bw of the sendonly traffic)
>> IMHO, sendonly is part of the TX flow as it was till now in the ipoib
>> driver and should stay in that way.
> It seems like you're thinking of the mcast task thread as something that
> only runs as part of mcast joins or leaves when the upper layer calls
> set multicast list in our driver.  It used to be that way, but since I
> reworked the task thread, that's no longer the case.  As such, I don't
> think this distinction you are drawing is necessarily still correct.
>
>> I went over your comments to my patch, will try to response/ cover them
>> ASAP.
>>
>> Thanks, Erez
>>
>>
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


[-- Attachment #2: dmesg.txt --]
[-- Type: text/plain, Size: 7845 bytes --]

[ 2425.913245] ib0: bringing up interface
[ 2425.918035] ib0: starting multicast thread
[ 2425.922260] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[ 2425.928542] ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff
[ 2425.935065] ib0: restarting multicast task
[ 2425.939228] ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0000:0000:0001
[ 2425.939335] ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0)
[ 2425.955538] ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001
[ 2425.964891] ib0: Created ah ffff8800ca0b24c0
[ 2425.969231] ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV ffff8800ca0b24c0, LID 0xc000, SL 0
[ 2425.978434] ib0: joining MGID ff12:601b:ffff:0000:0000:0000:0000:0001
[ 2425.985613] ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001
[ 2425.992595] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[ 2425.998967] ib0: restarting multicast task
[ 2426.001183] ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0016
[ 2426.012123] ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0001:ff43:3bf1
[ 2426.021059] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2426.330297] ib0: join completion for ff12:601b:ffff:0000:0000:0000:0000:0001 (status 0)
[ 2426.339054] ib0: Created ah ffff8800ca0b25c0
[ 2426.343389] ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0001 AV ffff8800ca0b25c0, LID 0xc016, SL 0
[ 2426.353901] ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0)
[ 2426.362984] ib0: Created ah ffff88041f104440
[ 2426.367316] ib0: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV ffff88041f104440, LID 0xc014, SL 0
[ 2426.376487] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2426.386032] ib0: joining MGID ff12:601b:ffff:0000:0000:0001:ff43:3bf1
[ 2426.393162] ib0: join completion for ff12:601b:ffff:0000:0000:0001:ff43:3bf1 (status 0)
[ 2426.402013] ib0: Created ah ffff88041f115340
[ 2426.406349] ib0: MGID ff12:601b:ffff:0000:0000:0001:ff43:3bf1 AV ffff88041f115340, LID 0xc018, SL 0
[ 2426.415567] ib0: successfully joined all multicast groups
[ 2426.421610] ib0: successfully joined all multicast groups
[ 2427.037903] ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0002
[ 2427.047677] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[ 2427.057362] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[ 2429.073774] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2429.083391] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2433.100207] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2433.109820] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2434.594637] ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0001:ff9f:3b0a
[ 2441.136671] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2441.146279] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2457.210044] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2457.219652] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2473.282869] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2473.292482] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2489.356138] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2489.365749] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2505.429325] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2505.438941] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2516.191339] ib0: neigh free for ffffff ff12:601b:ffff:0000:0000:0001:ff43:3bf1
[ 2521.502141] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2521.511744] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2537.575375] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2537.584980] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2553.648531] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2553.658137] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2569.721539] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2569.731143] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2585.794694] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2585.804307] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2601.867993] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2601.877600] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2617.940929] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2617.950538] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2634.014165] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2634.023776] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2650.087367] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2650.096972] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2666.160196] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2666.169811] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2682.233359] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2682.242969] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2698.306545] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2698.316151] ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[ 2714.379507] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2730.452688] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2746.525811] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2762.598925] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2778.672146] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2794.745171] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2810.818171] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2826.891541] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[ 2842.964546] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]                 ` <54C616A8.3050804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-01-26 12:51                   ` Doug Ledford
       [not found]                     ` <1422276712.2854.5.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-01-26 12:51 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1232 bytes --]

On Mon, 2015-01-26 at 12:27 +0200, Erez Shitrit wrote:

> New (and full) dmesg attached, (after modprobe ib_ipoib, with all debug 
> flags set) it is all there.

Thank you, I know what's going on here now.  Will correct shortly.

> >> The main cause is the concept that was broken for the send-only join,
> >> when you treat the sendonly like a regular mcg and add it to the mc list
> >> and to the mc_task etc.
> > I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
> > just like my current code does.  The only difference, and I do mean
> > *only*, is that it calls sendonly_join directly instead of via the
> > mcast_task.
> 
> Yes, and i already wrote that it is more than just "only", it changed 
> the concept of the sendonly mc packet.

Be more specific please.  What do you mean by "concept"?  And just so we
are clear, this all started because the existing multicast code was
super easy to break and was racy, so if the "concept" you are referring
to is what made the original code easy to break and racy, I'm not going
to care one whit that I changed that concept.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]                     ` <1422276712.2854.5.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-26 13:24                       ` Erez Shitrit
       [not found]                         ` <54C6400E.30607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Erez Shitrit @ 2015-01-26 13:24 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

On 1/26/2015 2:51 PM, Doug Ledford wrote:
> On Mon, 2015-01-26 at 12:27 +0200, Erez Shitrit wrote:
>
>> New (and full) dmesg attached, (after modprobe ib_ipoib, with all debug
>> flags set) it is all there.
> Thank you, I know what's going on here now.  Will correct shortly.

welcome -:)

>
>>>> The main cause is the concept that was broken for the send-only join,
>>>> when you treat the sendonly like a regular mcg and add it to the mc list
>>>> and to the mc_task etc.
>>> I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
>>> just like my current code does.  The only difference, and I do mean
>>> *only*, is that it calls sendonly_join directly instead of via the
>>> mcast_task.
>> Yes, and i already wrote that it is more than just "only", it changed
>> the concept of the sendonly mc packet.
> Be more specific please.  What do you mean by "concept"?  And just so we
> are clear, this all started because the existing multicast code was
> super easy to break and was racy, so if the "concept" you are referring
> to is what made the original code easy to break and racy, I'm not going
> to care one whit that I changed that concept.
>

I agree that you fixed many bugs in your patches to 3.18, where the mc 
flow was easy to break, no argue about that.
The only issue that i disagree is about the way now sendonly is handled 
(and i think that this is the reason for the regression we see now).
In general, IMHO, the sendonly join is part of the TX flow and not part 
of the  ipoib_set_mcast_list flow.
The original meaning of the ipoib_set_mcast_list task that restart the 
mc_task is to be used for the kernel in order to add one or more new 
mcg's macs to the driver/HW (ndo_set_rx_mode), the sendonly mc is not 
such object, its mac should not be part of the "mac" list of the driver 
(in IB wards, no qp_attach for it) and from the kernel point of view 
whenever it sends packet from sendonly mcg type no need to do the join, 
it's a regular send, the only reason we have the sendonly join is the IB 
enforcement for such mcg.
The reason the driver keeps the sendonly mcg in its mc_list is from 
others reasons, the first is to handle the case when the kernel decides 
to move a mcg from sendonly membership to full-member, one more other 
reason is to do the leave operation when needed and not for being 
handled as a full-member mcg.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]                         ` <54C6400E.30607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-01-26 13:37                           ` Doug Ledford
       [not found]                             ` <1422279465.2854.15.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-26 19:30                           ` Doug Ledford
  1 sibling, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-01-26 13:37 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 4163 bytes --]

On Mon, 2015-01-26 at 15:24 +0200, Erez Shitrit wrote:
> >>>> The main cause is the concept that was broken for the send-only join,
> >>>> when you treat the sendonly like a regular mcg and add it to the mc list
> >>>> and to the mc_task etc.
> >>> I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
> >>> just like my current code does.  The only difference, and I do mean
> >>> *only*, is that it calls sendonly_join directly instead of via the
> >>> mcast_task.
> >> Yes, and i already wrote that it is more than just "only", it changed
> >> the concept of the sendonly mc packet.
> > Be more specific please.  What do you mean by "concept"?  And just so we
> > are clear, this all started because the existing multicast code was
> > super easy to break and was racy, so if the "concept" you are referring
> > to is what made the original code easy to break and racy, I'm not going
> > to care one whit that I changed that concept.
> >
> 
> I agree that you fixed many bugs in your patches to 3.18, where the mc 
> flow was easy to break, no argue about that.
> The only issue that i disagree is about the way now sendonly is handled 
> (and i think that this is the reason for the regression we see now).

It's not.  The patch I sent you off list should be sufficient at this
point to finally put your issues to rest.

> In general, IMHO, the sendonly join is part of the TX flow and not part 
> of the  ipoib_set_mcast_list flow.

I disagree.  I'll get into that below.

> The original meaning of the ipoib_set_mcast_list task that restart the 
> mc_task is to be used for the kernel in order to add one or more new 
> mcg's macs to the driver/HW (ndo_set_rx_mode), the sendonly mc is not 
> such object, its mac should not be part of the "mac" list of the driver 
> (in IB wards, no qp_attach for it)

This part I agree with.

>  and from the kernel point of view 
> whenever it sends packet from sendonly mcg type no need to do the join, 
> it's a regular send, the only reason we have the sendonly join is the IB 
> enforcement for such mcg.

This is where you and I differ.  There is a requirement on us from the
IB spec that we have to join a sendonly MC group in order to do what the
kernel would think of as a normal send if we were on Ethernet.  We
aren't on Ethernet though, so acting like we are to the above layer is
fine, but in our own layer we should be coding to the realities of our
layer.  And that means treating a sendonly MC group like a regular MC
group.

> The reason the driver keeps the sendonly mcg in its mc_list is from 
> others reasons, the first is to handle the case when the kernel decides 
> to move a mcg from sendonly membership to full-member, one more other 
> reason is to do the leave operation when needed and not for being 
> handled as a full-member mcg.

It's been a long time since I first started working on this issue, so
some of the details are fuzzy in my memory.  I think my first 8 patch
patchset is now about 6 months old.  But, I think probably the most
important thing you left off this list, and the one that trumps all the
others, is that whether a sendonly MC group has a QP attached or not, we
still have to account for these groups, we have to track them, and on
shutdown we have to reliably leave them without oopsing.  Putting the
join of the sendonly groups directly in mcast_send opens us up to
problems with synchronizing our mcast list (either due to a dev flush, a
mcast restart task, or something else).  I can't say for certain that
the change to how we record the return value from ib_sa_join_multicast
in sendonly_join is enough to keep the sendonly join code from racing
with the flush code and causing problems.  For that reason, I am highly
reluctant to put the sendonly join back directly into the tx path
because that could have been part of the original problem in the first
place, it's just that my memory of back then is too fuzzy to say that
with 100% certainty either way.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]                             ` <1422279465.2854.15.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-26 14:07                               ` Erez Shitrit
       [not found]                                 ` <54C64A2A.5070306-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Erez Shitrit @ 2015-01-26 14:07 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

On 1/26/2015 3:37 PM, Doug Ledford wrote:
> On Mon, 2015-01-26 at 15:24 +0200, Erez Shitrit wrote:
>>>>>> The main cause is the concept that was broken for the send-only join,
>>>>>> when you treat the sendonly like a regular mcg and add it to the mc list
>>>>>> and to the mc_task etc.
>>>>> I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
>>>>> just like my current code does.  The only difference, and I do mean
>>>>> *only*, is that it calls sendonly_join directly instead of via the
>>>>> mcast_task.
>>>> Yes, and i already wrote that it is more than just "only", it changed
>>>> the concept of the sendonly mc packet.
>>> Be more specific please.  What do you mean by "concept"?  And just so we
>>> are clear, this all started because the existing multicast code was
>>> super easy to break and was racy, so if the "concept" you are referring
>>> to is what made the original code easy to break and racy, I'm not going
>>> to care one whit that I changed that concept.
>>>
>> I agree that you fixed many bugs in your patches to 3.18, where the mc
>> flow was easy to break, no argue about that.
>> The only issue that i disagree is about the way now sendonly is handled
>> (and i think that this is the reason for the regression we see now).
> It's not.  The patch I sent you off list should be sufficient at this
> point to finally put your issues to rest.
>
>> In general, IMHO, the sendonly join is part of the TX flow and not part
>> of the  ipoib_set_mcast_list flow.
> I disagree.  I'll get into that below.
>
>> The original meaning of the ipoib_set_mcast_list task that restart the
>> mc_task is to be used for the kernel in order to add one or more new
>> mcg's macs to the driver/HW (ndo_set_rx_mode), the sendonly mc is not
>> such object, its mac should not be part of the "mac" list of the driver
>> (in IB wards, no qp_attach for it)
> This part I agree with.
>
>>   and from the kernel point of view
>> whenever it sends packet from sendonly mcg type no need to do the join,
>> it's a regular send, the only reason we have the sendonly join is the IB
>> enforcement for such mcg.
> This is where you and I differ.  There is a requirement on us from the
> IB spec that we have to join a sendonly MC group in order to do what the
> kernel would think of as a normal send if we were on Ethernet.  We
> aren't on Ethernet though, so acting like we are to the above layer is
> fine, but in our own layer we should be coding to the realities of our
> layer.  And that means treating a sendonly MC group like a regular MC
> group.

Just to make it clear, the IB request to do sendonly join was done prior 
to your changes. (stopped after them.)
The IB spec doesn't care if you mix sendonly with full-member, the 
kernel does care, and by adding them to the mc_list you abuse the kernel 
ndo (by adding unnecessary joins to it, as i wrote in my previous mail)
I think you should not mix between RX operation (like the full-member 
join, kernel rx ndo) to TX operation (sendonly join)

>
>> The reason the driver keeps the sendonly mcg in its mc_list is from
>> others reasons, the first is to handle the case when the kernel decides
>> to move a mcg from sendonly membership to full-member, one more other
>> reason is to do the leave operation when needed and not for being
>> handled as a full-member mcg.
> It's been a long time since I first started working on this issue, so
> some of the details are fuzzy in my memory.  I think my first 8 patch
> patchset is now about 6 months old.  But, I think probably the most
> important thing you left off this list, and the one that trumps all the
> others, is that whether a sendonly MC group has a QP attached or not, we
> still have to account for these groups, we have to track them, and on
> shutdown we have to reliably leave them without oopsing.  Putting the
> join of the sendonly groups directly in mcast_send opens us up to
> problems with synchronizing our mcast list (either due to a dev flush, a
> mcast restart task, or something else).

This is the reason the sendonly mcg was added to the mc_list. and should 
stay there.

>    I can't say for certain that
> the change to how we record the return value from ib_sa_join_multicast
> in sendonly_join is enough to keep the sendonly join code from racing
> with the flush code and causing problems.  For that reason, I am highly
> reluctant to put the sendonly join back directly into the tx path
> because that could have been part of the original problem in the first
> place, it's just that my memory of back then is too fuzzy to say that
> with 100% certainty either way.

That fear from the unknown is not a reason to break the original logic 
of the sendonly.
if there is such a problem, we need to find it and to fix.


Thanks, Erez

>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]                                 ` <54C64A2A.5070306-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-01-26 18:45                                   ` Doug Ledford
  0 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-26 18:45 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 6957 bytes --]

On Mon, 2015-01-26 at 16:07 +0200, Erez Shitrit wrote:
> On 1/26/2015 3:37 PM, Doug Ledford wrote:
> > On Mon, 2015-01-26 at 15:24 +0200, Erez Shitrit wrote:
> >>>>>> The main cause is the concept that was broken for the send-only join,
> >>>>>> when you treat the sendonly like a regular mcg and add it to the mc list
> >>>>>> and to the mc_task etc.
> >>>>> I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
> >>>>> just like my current code does.  The only difference, and I do mean
> >>>>> *only*, is that it calls sendonly_join directly instead of via the
> >>>>> mcast_task.
> >>>> Yes, and i already wrote that it is more than just "only", it changed
> >>>> the concept of the sendonly mc packet.
> >>> Be more specific please.  What do you mean by "concept"?  And just so we
> >>> are clear, this all started because the existing multicast code was
> >>> super easy to break and was racy, so if the "concept" you are referring
> >>> to is what made the original code easy to break and racy, I'm not going
> >>> to care one whit that I changed that concept.
> >>>
> >> I agree that you fixed many bugs in your patches to 3.18, where the mc
> >> flow was easy to break, no argue about that.
> >> The only issue that i disagree is about the way now sendonly is handled
> >> (and i think that this is the reason for the regression we see now).
> > It's not.  The patch I sent you off list should be sufficient at this
> > point to finally put your issues to rest.
> >
> >> In general, IMHO, the sendonly join is part of the TX flow and not part
> >> of the  ipoib_set_mcast_list flow.
> > I disagree.  I'll get into that below.
> >
> >> The original meaning of the ipoib_set_mcast_list task that restart the
> >> mc_task is to be used for the kernel in order to add one or more new
> >> mcg's macs to the driver/HW (ndo_set_rx_mode), the sendonly mc is not
> >> such object, its mac should not be part of the "mac" list of the driver
> >> (in IB wards, no qp_attach for it)
> > This part I agree with.
> >
> >>   and from the kernel point of view
> >> whenever it sends packet from sendonly mcg type no need to do the join,
> >> it's a regular send, the only reason we have the sendonly join is the IB
> >> enforcement for such mcg.
> > This is where you and I differ.  There is a requirement on us from the
> > IB spec that we have to join a sendonly MC group in order to do what the
> > kernel would think of as a normal send if we were on Ethernet.  We
> > aren't on Ethernet though, so acting like we are to the above layer is
> > fine, but in our own layer we should be coding to the realities of our
> > layer.  And that means treating a sendonly MC group like a regular MC
> > group.
> 
> Just to make it clear, the IB request to do sendonly join was done prior 
> to your changes.

Yes...

>  (stopped after them.)

What do you mean here?  With my changes we still do sendonly joins.

> The IB spec doesn't care if you mix sendonly with full-member,

Correct.

>  the 
> kernel does care,

Where?  What part of the kernel cares?

>  and by adding them to the mc_list you abuse the kernel 
> ndo (by adding unnecessary joins to it, as i wrote in my previous mail)

How?  We never add a sendonly during a typical set_multicast call
invocation.  And in fact, we still do the exact same thing we used to do
when the kernel ndo is called, which is to call restart_task which will
scan our currently joined groups (both regular and sendonly, like it
always has), clear out any that don't have a matching address entry set
via the kernel ndo, and rejoin those that do have a matching entry set
via the ndo.

> I think you should not mix between RX operation (like the full-member 
> join, kernel rx ndo) to TX operation (sendonly join)

And I think having two distinctly different code paths when it doesn't
matter whether the join is sendonly or normal, they still must all be
accurately tracked and reaped on shutdown and various other events, is
wrong and is an invitation to failures for no good reason.

> >
> >> The reason the driver keeps the sendonly mcg in its mc_list is from
> >> others reasons, the first is to handle the case when the kernel decides
> >> to move a mcg from sendonly membership to full-member, one more other
> >> reason is to do the leave operation when needed and not for being
> >> handled as a full-member mcg.
> > It's been a long time since I first started working on this issue, so
> > some of the details are fuzzy in my memory.  I think my first 8 patch
> > patchset is now about 6 months old.  But, I think probably the most
> > important thing you left off this list, and the one that trumps all the
> > others, is that whether a sendonly MC group has a QP attached or not, we
> > still have to account for these groups, we have to track them, and on
> > shutdown we have to reliably leave them without oopsing.  Putting the
> > join of the sendonly groups directly in mcast_send opens us up to
> > problems with synchronizing our mcast list (either due to a dev flush, a
> > mcast restart task, or something else).
> 
> This is the reason the sendonly mcg was added to the mc_list. and should 
> stay there.
> 
> >    I can't say for certain that
> > the change to how we record the return value from ib_sa_join_multicast
> > in sendonly_join is enough to keep the sendonly join code from racing
> > with the flush code and causing problems.  For that reason, I am highly
> > reluctant to put the sendonly join back directly into the tx path
> > because that could have been part of the original problem in the first
> > place, it's just that my memory of back then is too fuzzy to say that
> > with 100% certainty either way.
> 
> That fear from the unknown is not a reason to break the original logic 
> of the sendonly.

No, the reason to break the original logic of the sendonly was that it
was a bad design that contributed to other problems.  The reason not to
consider putting that bad logic back in place right now, IMO, is that I
can't remember the details of the breakage better than I do at this
moment.

> if there is such a problem, we need to find it and to fix.

I disagree.  The original logic was an artificial and bad separation of
too more or less identical things in the code that helped contribute to
it being broken in the first place.  We are under no obligation to put
it back to that bad design.  And in the future, I'd like to bring the
sendonly and regular joins more in line with each other, quite possible
doing away with them having separate functions.  I'd also like to
completely deserialize them.  Both of these things will be easier to do
from the starting point we have now versus going back to the old setup.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions
       [not found]                         ` <54C6400E.30607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-01-26 13:37                           ` Doug Ledford
@ 2015-01-26 19:30                           ` Doug Ledford
  1 sibling, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-26 19:30 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 4110 bytes --]

On Mon, 2015-01-26 at 15:24 +0200, Erez Shitrit wrote:
> On 1/26/2015 2:51 PM, Doug Ledford wrote:
> > On Mon, 2015-01-26 at 12:27 +0200, Erez Shitrit wrote:
> >
> >> New (and full) dmesg attached, (after modprobe ib_ipoib, with all debug
> >> flags set) it is all there.
> > Thank you, I know what's going on here now.  Will correct shortly.
> 
> welcome -:)

I munged my opensm configuration so that I could forcibly replicate the
situation here (I intentionally took several well known multicast groups
and forbid their creation).

I was able to first replicate Eriz's problem.

Then I installed a new ib_ipoib module with my proposed fix for Erez's
problem and it worked exactly as expected.  It was a mistake in one of
my earlier patches (the third in the series).  When I added a delayed
queue of the task thread, I didn't have a separate work struct and
instead tried to queue the same work struct twice.  I reworked it so
that the work struct is only ever queued once and if the multicast task
gets to the end of its run and there are delayed entries waiting still,
it will queue itself to run again when the shortest delay has expired.
I'll send that through.

Here's the log of the attempt:

[root@rdma-master linus (firewall/for-rc)]$ dmesg | tail -10
[337072.429488] mlx4_ib0: successfully joined all multicast groups
[337073.856932] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[337073.869686] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[337073.882754] mlx4_ib0: successfully joined all multicast groups
[337088.480082] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[337088.492789] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[337088.505819] mlx4_ib0: successfully joined all multicast groups
[337089.897041] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[337089.909870] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[337089.922893] mlx4_ib0: successfully joined all multicast groups
[root@rdma-master linus (firewall/for-rc)]$ ping6 -I mlx4_ib0 fe80::211:7500:77:d3cc
PING fe80::211:7500:77:d3cc(fe80::211:7500:77:d3cc) from fe80::f652:1403:7b:cba1 mlx4_ib0: 56 data bytes
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=1 ttl=64 time=77.6 ms
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=2 ttl=64 time=0.159 ms
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=3 ttl=64 time=0.125 ms
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=4 ttl=64 time=0.128 ms
^C
--- fe80::211:7500:77:d3cc ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 0.125/19.503/77.600/33.542 ms
[root@rdma-master linus (firewall/for-rc)]$ dmesg | tail -10[337120.632427] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[337120.645166] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[337120.658292] mlx4_ib0: successfully joined all multicast groups
[337121.977733] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[337121.990478] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[337122.003589] mlx4_ib0: successfully joined all multicast groups
[337130.410559] mlx4_ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0001:ff77:d3cc
[337130.423203] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0001:ff77:d3cc, starting sendonly join
[337130.436327] mlx4_ib0: MGID ff12:601b:ffff:0000:0000:0001:ff77:d3cc AV ffff882027235f00, LID 0xc01e, SL 0
[337130.448970] mlx4_ib0: successfully joined all multicast groups
[root@rdma-master linus (firewall/for-rc)]$ 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH FIX For-3.19 11/10] IB/ipoib: don't queue a work struct up twice
       [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2015-01-23 16:52   ` Doug Ledford
@ 2015-01-26 19:34   ` Doug Ledford
  12 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-01-26 19:34 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford, Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

In b2d408f78b3 (IB/ipoib: make delayed tasks not hold up everything) I
added the ability to have some joins processed around delayed joins.
Because the join task processes one join at a time, we needed to queue
the task to run again immediately and then again at the appropriate
delay time.  I didn't think about the fact that our work struct can only
be on the workqueue once at a given time and instead tried to queue the
same struct up twice.  Instead, kick the queue again immediately on a
delayed restart attempt, when the task sees that it has nothing to do
but delayed work, it will pick the shortest work item and queue a delay
equal to its expiration.  Finally, in case we have delayed work, in the
send_mcast routine we need to cancel any delayed work before kicking the
thread immediately.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 49 ++++++++++++++++----------
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ad2bcd105e4..68b47bed33a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -71,29 +71,28 @@ static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
 					       int delay)
 {
 	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
-		/*
-		 * Mark this mcast for its delay and set a timer to kick the
-		 * thread when the delay completes
-		 */
 		if (mcast && delay) {
 			mcast->backoff *= 2;
 			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 			mcast->delay_until = jiffies + (mcast->backoff * HZ);
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
+			/*
+			 * Mark this mcast for its delay, but restart the
+			 * task immediately.  It will make sure to clear
+			 * out all entries without delays, and then
+			 * schedule itself for run again when the delay
+			 * expires
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 		} else if (delay) {
-			/* Special case of retrying after a failure to
+			/*
+			 * Special case of retrying after a failure to
 			 * allocate the broadcast multicast group, wait
 			 * 1 second and try again
 			 */
 			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
-		}
-		/*
-		 * But also rekick the thread immediately for any other
-		 * tasks in queue behind this one
-		 */
-		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+		} else
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 }
 
@@ -568,6 +567,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	struct ib_port_attr port_attr;
 	struct ipoib_mcast *mcast = NULL;
 	int create = 1;
+	unsigned long delay_until = 0;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -636,11 +636,14 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	list_for_each_entry(mcast, &priv->multicast_list, list) {
 		if (IS_ERR_OR_NULL(mcast->mc) &&
 		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) &&
-		    (mcast->backoff == 1 ||
-		     time_after_eq(jiffies, mcast->delay_until))) {
-			/* Found the next unjoined group */
-			break;
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			if (mcast->backoff == 1 ||
+			    time_after_eq(jiffies, mcast->delay_until))
+				/* Found the next unjoined group */
+				break;
+			else if (!delay_until ||
+				 time_before(mcast->delay_until, delay_until))
+				delay_until = mcast->delay_until;
 		}
 	}
 
@@ -653,6 +656,9 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		mcast = NULL;
 		ipoib_dbg_mcast(priv, "successfully joined all "
 				"multicast groups\n");
+		if (delay_until)
+			queue_delayed_work(priv->wq, &priv->mcast_task,
+					   delay_until - jiffies);
 	}
 
 out:
@@ -766,6 +772,13 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
+		/*
+		 * Make sure that if we are currently waiting for a backoff
+		 * delay to expire, that we cancel that, run immediately,
+		 * and then requeue our task to fire when the backoff
+		 * timer is over.
+		 */
+		cancel_delayed_work(&priv->mcast_task);
 		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-01-26 19:34 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-01-22 14:31 [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions Doug Ledford
     [not found] ` <cover.1421936879.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 01/10] IB/ipoib: fix IPOIB_MCAST_RUN flag usage Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 02/10] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 03/10] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 04/10] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 05/10] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 06/10] IB/ipoib: remove unneeded locks Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 07/10] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 08/10] IB/ipoib: fix ipoib_mcast_restart_task Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 09/10] IB/ipoib: flush the ipoib_workqueue on unregister Doug Ledford
2015-01-22 14:31   ` [PATCH FIX For-3.19 v5 10/10] IB/ipoib: cleanup a couple debug messages Doug Ledford
2015-01-23  7:01   ` [PATCH FIX For-3.19 v5 00/10] Fix ipoib regressions Or Gerlitz
     [not found]     ` <CAJ3xEMi7mowr_qFMUXtM5m8p974qF39nPf-Qh-NOYK_jUzswSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-23  7:45       ` Doug Ledford
     [not found]         ` <1421999125.3352.265.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-24  4:58           ` Roland Dreier
2015-01-23 12:54       ` Estrin, Alex
2015-01-23 16:52   ` Doug Ledford
     [not found]     ` <1422031938.3352.286.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-25 12:54       ` Erez Shitrit
     [not found]         ` <54C4E793.2010103-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-01-25 22:21           ` Doug Ledford
     [not found]             ` <1422224477.3352.373.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-26 10:27               ` Erez Shitrit
     [not found]                 ` <54C616A8.3050804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-01-26 12:51                   ` Doug Ledford
     [not found]                     ` <1422276712.2854.5.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-26 13:24                       ` Erez Shitrit
     [not found]                         ` <54C6400E.30607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-01-26 13:37                           ` Doug Ledford
     [not found]                             ` <1422279465.2854.15.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-26 14:07                               ` Erez Shitrit
     [not found]                                 ` <54C64A2A.5070306-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-01-26 18:45                                   ` Doug Ledford
2015-01-26 19:30                           ` Doug Ledford
2015-01-26 19:34   ` [PATCH FIX For-3.19 11/10] IB/ipoib: don't queue a work struct up twice Doug Ledford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox