From mboxrd@z Thu Jan 1 00:00:00 1970 From: Erez Shitrit Subject: Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow Date: Thu, 22 Jan 2015 13:39:51 +0200 Message-ID: <54C0E187.2010100@dev.mellanox.co.il> References: <1420643066-3599-1-git-send-email-ogerlitz@mellanox.com> <1421172470.43839.207.camel@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1421172470.43839.207.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Doug Ledford , Or Gerlitz Cc: Roland Dreier , linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai , Eyal Perry , Erez Shitrit List-Id: linux-rdma@vger.kernel.org On 1/13/2015 8:07 PM, Doug Ledford wrote: > On Wed, 2015-01-07 at 17:04 +0200, Or Gerlitz wrote: >> From: Erez Shitrit >> >> Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage" >> both IPv6 traffic and for the most cases all IPv4 multicast traffic >> aren't working. >> >> After this change there is no mechanism to handle the work that does the >> join process for the rest of the mcg's. For example, if in the list of >> all the mcg's there is a send-only request, after its processing, the >> code in ipoib_mcast_sendonly_join_complete() will not requeue the >> mcast task, but leaves the bit that signals this task is running, >> and hence the task will never run. >> >> Also, whenever the kernel sends multicast packet (w.o joining to this >> group), we don't call ipoib_send_only_join(), the code tries to start >> the mcast task but it failed because the bit IPOIB_MCAST_RUN is always >> set, As a result the multicast packet will never be sent. >> >> The fix handles all the join requests via the same logic, and call >> explicitly to sendonly join whenever there is a packet from sendonly type. >> >> Since ipoib_mcast_sendonly_join() is now called from the driver TX flow, >> we can't take mutex there. Locking isn't required there since the multicast >> join callback will be called only after the SA agent initialized the relevant >> multicast object. >> >> Fixes: 016d9fb25cd9 ('IPoIB: fix MCAST_FLAG_BUSY usage') >> Reported-by: Eyal Perry >> Signed-off-by: Erez Shitrit >> Signed-off-by: Or Gerlitz >> --- >> V0 --> V1 changes: Added credits (...) and furnished the change-log abit. >> >> drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 15 ++++++--------- >> 1 files changed, 6 insertions(+), 9 deletions(-) >> >> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >> index bc50dd0..0ea4b08 100644 >> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >> @@ -301,9 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status, >> dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); >> } >> netif_tx_unlock_bh(dev); >> + >> + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); >> } >> out: >> - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); >> if (status) >> mcast->mc = NULL; >> complete(&mcast->done); > This chunk is wrong. We are in our complete routine, which means > ib_sa_join_multicast is calling us for this mcast group, and we will > never see another return for this group. We must clear the BUSY flag no > matter what as the BUSY flag now indicates that our mcast join is still > outstanding in the lower layer ib_sa_ area, not that we have joined the > group. Please re-read my patches that re-worked the BUSY flag usage. > The BUSY flag was poorly named/used in the past, which is why a previous > patch introduced the JOINING or whatever flag it was called. My > patchset reworks the flag usage to be more sane. BUSY now means > *exactly* that: this mcast group is in the process of joining, aka it's > BUSY. It doesn't mean we've joined the group and there are no more > outstanding join requests. That's signified by mcast->mc != > IS_ERR_OR_NULL. agree, may mistake, the clear_bit(IPOIB_MCAST_FLAG_BUSY) should be in any case, i will move it 3 lines later, BTW, this will solve the "in-flight" messages you saw. > >> @@ -342,7 +343,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) >> rec.port_gid = priv->local_gid; >> rec.pkey = cpu_to_be16(priv->pkey); >> >> - mutex_lock(&mcast_mutex); >> init_completion(&mcast->done); >> set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); >> mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, >> @@ -364,7 +364,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) >> ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting " >> "sendonly join\n", mcast->mcmember.mgid.raw); >> } >> - mutex_unlock(&mcast_mutex); >> >> return ret; >> } > No! You can not, under any circumstances, remove this locking! One of > the things that frustrated me for a bit until I tracked it down was how > ib_sa_join_multicast returns errors to the ipoib layer. When you call > ib_sa_join_multicast, the return value is either a valid mcast->mc > pointer or IS_ERR(err). If it's a valid pointer, that does not mean we > have successfully joined, it means that we might join, but it isn't > until we have completed the callback that we know. The callback will > clear out mcast->mc if we encounter an error during the callback and > know that by returning an error from the callback, the lower layer is > going to delete the mcast->mc context out from underneath us. As it > turns out, we often get our callbacks called even before we get the > initial return from ib_sa_join_multicast. If we don't have this > locking, and we get any error in the callback, the callback will clear > mcast->mc to indicate that we have no valid group, then we will return > from ib_sa_join_multicast and set mcast->mc to an invalid group. To > prevent that, the callback grabs this mutex at the beginning of its > operation. We *must* grab the mutex here and hold it until we are done > with mcast->mc or else we can't know for sure if the mcast->mc that we > just set as the return code from ib_sa_join_multicast is the right > return value or if we just overwrote the callbacks setting of that > field. OK, i got your point now, you want to sync between the ipoib_mcas_sendonly_join to the sa callback (ipoib_mcast_sendonly_join_complete), but for that you don't need to use mutex (if you are using mutes you are not able to call ipoib_mcast_sendonly_join from the TX flow). the easy way is to add simple check in the function ipoib_mcast_sendonly_join instead of the IS_ERR(mcast->mc) use IS_ERR_OR_NULL(mcast->mc). IMHO, that solves the sync issue. >> @@ -622,10 +621,8 @@ void ipoib_mcast_join_task(struct work_struct *work) >> break; >> } >> >> - if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) >> - ipoib_mcast_sendonly_join(mcast); >> - else >> - ipoib_mcast_join(dev, mcast, 1); >> + ipoib_mcast_join(dev, mcast, 1); >> + >> return; >> } >> >> @@ -725,8 +722,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) >> memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid)); >> __ipoib_mcast_add(dev, mcast); >> list_add_tail(&mcast->list, &priv->multicast_list); >> - if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) >> - queue_delayed_work(priv->wq, &priv->mcast_task, 0); >> } >> >> if (!mcast->ah) { >> @@ -740,6 +735,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) >> if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) >> ipoib_dbg_mcast(priv, "no address vector, " >> "but multicast join already started\n"); >> + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) >> + ipoib_mcast_sendonly_join(mcast); >> >> /* >> * If lookup completes between here and out:, don't > None of this looks right to me either. But that's OK, I think I found > the bug while looking all of this over. I'm going to do some testing > and I'll report back here when done. You can't even do this without > removing the mutex_lock above that I pointed out has to stay, so this > really isn't right. I think, that after i will add the changes i mentioned, it will be fine. will send you the new patch, please take a look. > > Thanks, Erez -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html